Convolutional neural networks (CNNs) have achieved high performance in several visual recognition tasks, including object classification, detection, and segmentation [19, 17, 5]. These networks learn representations by performing convolutional operations on an input tensor along its spatial dimensions using a sliding window approach. Because of the limited receptive field of a convolutional kernel, convolutional layers are not able to utilize the full encoding capacity of a tensor and therefore, we need to stack multiple convolutional layers to encode more information. This increases the complexity of a CNN.
Figure 1: (c) compares the top-1 accuracy on the ImageNet dataset at two different settings. Dimension-wise convolutions maximize the use of information contained in the tensor, thus leading to more accurate solutions.
To improve the efficiency of standard convolutions, most recent attempts have focused on reducing redundant operations and parameters in the convolutions [30, 20]. Recently, depth-wise separable convolutions have been proposed to reduce CNN complexity in terms of floating-point operations (FLOPs) and network parameters. These convolutions factorize the convolutional operation into two steps: (1) a light-weight filter is applied to each spatial plane using depth-wise convolutions to learn spatial representations, and (2) a point-wise (1x1) convolution is then applied to fuse channels and learn combinations of spatial representations. Like standard convolutions, depth-wise convolutions are not able to utilize the full encoding capacity of a tensor; therefore, point-wise convolutions are used to encode more information in depth-wise separable convolutions. Since depth-wise convolutions are computationally very efficient, this leaves a significant computational load on the point-wise convolutions, which become a computational bottleneck. As an example, point-wise convolutions account for about 90% of total operations in ShuffleNetv2 and MobileNetv2.
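The point-wise bottleneck described above can be verified with a quick back-of-the-envelope count. The sketch below tallies multiply-accumulate operations for the two factorized steps; the layer sizes are illustrative, not taken from any particular network.

```python
# Rough FLOP count for a depth-wise separable convolution, illustrating
# why the point-wise (1x1) step dominates. Layer sizes are illustrative.

def separable_conv_flops(h, w, c_in, c_out, k=3):
    """Multiply-accumulate counts for the two factorized steps."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 convolution fusing channels
    return depthwise, pointwise

# A typical inner layer of an efficient network (hypothetical sizes).
dw, pw = separable_conv_flops(h=28, w=28, c_in=128, c_out=128)
share = pw / (dw + pw)
print(f"depth-wise: {dw:,}  point-wise: {pw:,}  point-wise share: {share:.1%}")
# point-wise accounts for over 90% of the operations at these sizes
```

As the channel count grows, the point-wise share only increases, which is consistent with the ~90% figure reported for ShuffleNetv2 and MobileNetv2.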
In this paper, we introduce DiCENet, Dimension-wise Convolutions for Efficient Networks. Our dimension-wise convolution (DimConv) convolves separate filters across each dimension of the input tensor, including the channel (or depth; we use "depth" and "channel" interchangeably in this paper) dimension. Unlike standard and depth-wise convolutions, DimConv leverages the spatial locality of the channels and is thus, in a sense made precise in Section 3, not invariant to channel-wise permutations of an input tensor. Moreover, learning representations across all dimensions leads to an encoding mechanism that increases the use of information contained in the tensor. In particular, the increased learning capacity due to DimConv allows us to further reduce the computational burden of expensive point-wise convolutions. We introduce an Efficient channel Fusion (EFuse) to fuse dimension-wise representations efficiently. With DimConv and EFuse, we are able to build a more efficient CNN while maintaining accuracy (see Figure 1). We highlight that the introduced module is novel and is not a part of any architecture search methods [68, 56, 3].
We have evaluated the performance of DiCENet on several visual recognition tasks, including object classification, detection, and semantic segmentation. Compared to existing methods, DiCENet shows a significant improvement across all tasks. When compared with existing efficient architecture designs, such as MobileNetv2 and ShuffleNetv2, on the ImageNet dataset, DiCENet (1) is 1-2% more accurate for extremely small models (7-15 MFLOPs), (2) delivers similar performance with 15-20% fewer FLOPs for small models (38-160 MFLOPs), and (3) outperforms existing efficient designs with fewer FLOPs for medium-size models (270-600 MFLOPs). When DiCENet is used as a base feature extractor for tasks such as semantic segmentation and object detection, it delivers competitive performance and outperforms some task-specific tailored networks. For semantic segmentation on the PASCAL VOC 2012 dataset, DiCENet is 4.3% more accurate and has fewer FLOPs than ESPNet, a recent efficient segmentation network. For object detection on the MS-COCO dataset, DiCENet is 4.5% more accurate and has fewer FLOPs than YOLOv2.
2 Related Work
CNN architecture designs:
Recent successes in visual recognition tasks, including object classification, detection, and segmentation, can be attributed to exploration of different CNN designs [29, 53, 19, 28, 55, 25]. To make these network designs more efficient, they have been extended with efficient and sparse forms of convolutions, such as depth-wise and group convolutions [22, 24, 64, 37, 51, 40]. In this paper, we introduce dimension-wise convolutions that generalize depth-wise convolutions to all dimensions of the input tensor.
Neural architecture search:
Network design has recently been automated using neural architecture search [68, 62, 45, 69, 56, 33]. These methods search over a huge network space (e.g., MNASNet searches over 8K different design choices) using a dictionary of pre-defined search space parameters, including different types of convolutional layers and kernel sizes, to identify a network structure, usually non-homogeneous, that satisfies optimization constraints, such as inference time. Recent search-based methods [56, 3, 60] use MobileNetv2 as a basic search block for automatic network design. Since the proposed unit delivers better performance than MobileNetv2 (see Section 4), we believe that neural architecture search with our proposed unit would enable finding a better network design.
Other alternatives for efficient CNNs:
These approaches include network quantization [54, 44, 61, 8, 67, 26, 1], compression [14, 15, 59, 30, 57, 20], and distillation [21, 49, 13, 36, 63]. Network quantization approximates convolution operations with fewer bits instead of 32-bit full-precision floating-point values. This improves inference speed and reduces the memory required to store network weights. Network compression improves the efficiency of a network by removing redundant weights and connections. Unlike quantization and compression, distillation improves the accuracy of (usually shallow) networks by supervising their training with large pre-trained networks.
Though these approaches have been shown to be very effective for improving either the efficiency or the accuracy of a network, they do not highlight the benefits of architectural design. For a fair comparison with existing design-based methods such as ShuffleNets [37, 64], MobileNets [22, 51], and ESPNetv2, we do not use these approaches.
3 DiCENet
In this section, we elaborate on the details of our model, DiCENet. We first discuss the building block of our model, the DiCENet unit, and then describe the DiCENet architecture.
3.1 DiCENet unit
The main building block of our model, the DiCENet unit, is based on a factorization principle and is primarily composed of two steps: (1) Dimension-wise Convolution (DimConv) convolves separate filters across each dimension of the input tensor to learn rich representations and (2) Efficient channel Fusion (EFuse) efficiently combines channels of the output tensor from the DimConv. Figure 2 shows an overview of the DiCENet unit. Dimension-wise convolutions encode information from all dimensions of the tensor. This increases the learning capacity of a convolutional layer and reduces the computational burden on other layers. Therefore, the proposed efficient channel fusion will work effectively with dimension-wise convolutions.
Dimension-wise convolution (DimConv):
The input to a convolutional layer is a three-dimensional tensor of width W, height H, and depth (or channels) C. A standard depth-wise convolutional layer applies C convolutional kernels of size n x n, with each kernel processing a single channel along the spatial dimensions (width and height), to produce an output of size W x H x C, where n denotes the kernel size. This type of convolution only encodes information within the spatial dimensions: for any reordering of the channels in the input tensor, there exists a reordering of the weight filters that produces the same output tensor. To also encode information along the channel dimension, we extend depth-wise convolutions to all dimensions of the input tensor and call this operation Dimension-wise Convolution (DimConv). As illustrated in Figure 2, DimConv has three branches. These branches apply depth-wise convolutional kernels along depth, width-wise convolutional kernels along width, and height-wise convolutional kernels along height to produce outputs that encode information from all dimensions of the input tensor. The outputs of these independent branches are concatenated along the channel dimension. To facilitate learning of inter-dimension representations, we group the spatial planes (or feature maps) in the concatenated tensor such that each group has one spatial plane from each dimension. We then combine these dimension-wise grouped representations with grouped point-wise convolutional kernels to produce a weighted average output.
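A minimal PyTorch sketch of DimConv is given below. It assumes a fixed input size so that the height- and width-wise branches can be built as depth-wise convolutions on a permuted tensor; the class name, the fixed-size assumption, and the grouped 1x1 fusion layer are our illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class DimConv(nn.Module):
    """Sketch of dimension-wise convolution (DimConv).

    Each branch applies a depth-wise-style convolution along one dimension
    (channels, height, or width) by permuting that dimension into the
    channel position. Assumes a fixed input size (c, h, w) for simplicity.
    """
    def __init__(self, c, h, w, k=3):
        super().__init__()
        p = k // 2
        self.conv_c = nn.Conv2d(c, c, k, padding=p, groups=c)  # along depth
        self.conv_h = nn.Conv2d(h, h, k, padding=p, groups=h)  # along height
        self.conv_w = nn.Conv2d(w, w, k, padding=p, groups=w)  # along width
        # grouped point-wise conv: each group mixes one plane per dimension
        self.fuse = nn.Conv2d(3 * c, c, kernel_size=1, groups=c)

    def forward(self, x):                                # x: (N, C, H, W)
        y_c = self.conv_c(x)
        y_h = self.conv_h(x.transpose(1, 2)).transpose(1, 2)
        y_w = self.conv_w(x.transpose(1, 3)).transpose(1, 3)
        # interleave so each channel group holds one plane from each branch
        y = torch.stack([y_c, y_h, y_w], dim=2)          # (N, C, 3, H, W)
        y = y.reshape(x.size(0), -1, x.size(2), x.size(3))  # (N, 3C, H, W)
        return self.fuse(y)

x = torch.randn(2, 16, 32, 32)
out = DimConv(c=16, h=32, w=32)(x)
print(out.shape)  # torch.Size([2, 16, 32, 32])
```

The stack-then-reshape step realizes the dimension-wise grouping described above: channel i of the output is a learned weighted average of the depth-, height-, and width-wise planes for channel i.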
Efficient channel fusion (EFuse):
Recent efficient CNN designs [22, 51, 64, 37] use point-wise convolutions for channel fusion. To produce an output tensor of size H x W x D from an input tensor of size H x W x C, a point-wise convolution performs H·W·C·D operations. This is computationally expensive and is the main computational bottleneck in recent efficient CNN designs. We propose Efficient channel Fusion (EFuse) as a substitute for channel fusion with point-wise convolutions. This allows us to skip the majority of point-wise convolutions in a CNN architecture, thus reducing the computational burden of point-wise convolutions by a large margin (Figure 4).
In EFuse, instead of fusing the channels across all elements in the spatial dimensions, we squeeze the spatial dimensions of the input tensor using a global average pooling operation to extract a global vector descriptor. We then fuse the channels within the global vector using a fully connected layer to produce a fusion vector. In parallel, we encode spatial representations by applying depth-wise convolutional kernels to the input tensor. Next, we propagate the fusion vector along the spatial dimensions by applying a sigmoid to it and then multiplying it with all elements of the depth-wise output along the spatial dimensions. We choose the sigmoid function to prevent gradient overflow during training. For typical values of H, W, and C, the EFuse layer requires significantly fewer operations than the standard point-wise convolution.
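The EFuse steps above can be sketched in a few lines of PyTorch. The class name and layer choices are our illustrative reading of the description (global pooling, a fully connected fusion, a parallel depth-wise branch, and a sigmoid gate), not the authors' released code.

```python
import torch
import torch.nn as nn

class EFuse(nn.Module):
    """Sketch of efficient channel fusion (EFuse).

    Channels are fused on a squeezed (globally average-pooled) vector with
    a fully connected layer, instead of with a point-wise convolution over
    every spatial location; spatial information is kept by a parallel
    depth-wise convolution.
    """
    def __init__(self, c, k=3):
        super().__init__()
        self.fc = nn.Linear(c, c)                               # channel fusion
        self.dw = nn.Conv2d(c, c, k, padding=k // 2, groups=c)  # spatial branch
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                 # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))            # global average pool -> (N, C)
        f = self.sigmoid(self.fc(s))      # fusion vector in (0, 1)
        y = self.dw(x)                    # depth-wise spatial encoding
        return y * f[:, :, None, None]    # broadcast along H and W

x = torch.randn(2, 16, 32, 32)
out = EFuse(16)(x)
print(out.shape)  # torch.Size([2, 16, 32, 32])
```

Note the cost structure: the fully connected fusion costs on the order of C·C operations once per image, versus H·W·C·C for a point-wise convolution, which is where the savings come from.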
Stacking DimConv and EFuse gives an efficient block structure, the DiCENet unit, that has fewer FLOPs across different input tensor sizes than other state-of-the-art efficient CNN units. Figure 0(b) compares FLOPs across different input tensor sizes for different CNN units. The DiCENet unit has fewer FLOPs than existing units, such as the ShuffleNetv2 and MobileNetv2 units, a major improvement over state-of-the-art efficient designs.
3.2 DiCENet architecture
The architecture design of DiCENet is motivated by ShuffleNetv2, a state-of-the-art efficient network that uses channel split and channel shuffle to learn representations efficiently. Figure 3 contrasts the building blocks of ShuffleNetv2 and DiCENet, while the overall architecture is shown in Table 1. Except for the first two layers, we replace all point-wise and depth-wise convolutions with the DiCENet unit. The first layer is a standard convolution. Each layer is followed by a PReLU non-linear activation, except for the last layer, which feeds into a softmax for classification. Following previous work [22, 51, 37, 64, 40], we scale the number of output channels by a width scaling factor to construct networks at different FLOPs. We initialize the weights of our network using the same method as in prior work.
Table 1: Overall DiCENet architecture (layer, output size, kernel size, stride, repeats, and output channels per network width scaling parameter). The number of output channels scales with the width scaling parameter; the final grouped FC layer has 512 to 2048 output channels, and the total network complexity ranges from 6.5 M to 553 M FLOPs across configurations.
Dynamic input scaling:
A standard practice in the computer vision community is to train a CNN on a large-scale classification dataset (e.g., the ImageNet) and then transfer the learned representations to other tasks, including object detection and semantic segmentation, via fine-tuning. In general, visual recognition tasks such as semantic segmentation and object detection involve images of higher spatial dimensions (e.g., for the PASCAL VOC segmentation dataset) than the ones on which the classification network is trained (e.g., for the ImageNet dataset). Since DiCENet has two branches corresponding to the width and height dimensions of the input, a natural question arises: "Can we use DiCENet with images whose spatial dimensions differ from those used for training on the classification dataset?" To make DiCENet invariant to the spatial dimensions of the input image, we dynamically scale (either up-sample or down-sample) the height or width dimension of the input tensor to the height or width of the input tensor used in the pretrained network. The resultant tensors are then scaled (either down-sampled or up-sampled) back to their original size before being fed to the weighted average function; this makes DiCENet invariant to the input image size. We note that this dynamic scaling also allows DiCENet to learn scale-invariant features, because each convolutional layer (depth-, width-, or height-wise) receives inputs with different dimensions. Figure 5 sketches the DiCENet block with dynamic input scaling.
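Dynamic input scaling can be sketched as a small wrapper around a branch: resize to the pretraining resolution, apply the branch, and resize back. The function name, `trained_size`, and the bilinear interpolation mode are our illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def with_dynamic_scaling(branch, x, trained_size):
    """Sketch of dynamic input scaling.

    Resize the input tensor to the spatial size the branch was trained
    with, apply the branch, and resize the result back to the original
    size. `branch` stands for a width- or height-wise convolution and
    `trained_size` for its pretraining resolution (both hypothetical).
    """
    orig = x.shape[2:]                              # original (H, W)
    x = F.interpolate(x, size=trained_size, mode="bilinear",
                      align_corners=False)          # down- or up-sample
    y = branch(x)
    return F.interpolate(y, size=tuple(orig), mode="bilinear",
                         align_corners=False)       # scale back

# e.g., a height-wise branch pretrained at 56x56, applied to a 64x80 input
branch = torch.nn.Conv2d(8, 8, 3, padding=1, groups=8)
out = with_dynamic_scaling(branch, torch.randn(1, 8, 64, 80),
                           trained_size=(56, 56))
print(out.shape)  # torch.Size([1, 8, 64, 80])
```

Because the output is rescaled to the original size before the weighted average, the rest of the block never sees the intermediate resolution.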
4 Experimental Results
In this section, we demonstrate the performance of DiCENet on three different visual recognition tasks: (1) object classification, (2) semantic segmentation, and (3) object detection. To showcase the strength of our network, we also compare it with state-of-the-art efficient networks, including task-specific networks such as YOLOv2  and ESPNet .
DiCENet uses ShuffleNetv2-style blocks (see Figure 3). For a fair comparison, we compare the performance of DiCENet with our implementation of ShuffleNetv2.
4.1 Object classification on the ImageNet
Implementation and dataset details:
Following common practice [37, 64, 22, 51, 40], we evaluate the performance of DiCENet on the ImageNet 1000-way classification dataset at different complexity levels, ranging from 6 MFLOPs to 500+ MFLOPs. The dataset consists of 1.28M training samples and 50K validation samples. We optimize our networks by minimizing a cross-entropy loss using SGD. We use the same learning rate and data augmentation policy as in prior work. We use PyTorch as our framework for training these networks because of its ability to handle dynamic graphs. We evaluate our networks' performance using single-crop top-1 accuracy on the validation set.
Tables 1(a) and 1(b) compare the performance of DiCENet with state-of-the-art efficient architectures at different complexity levels. For extremely small models (FLOP range: 6-20 M), DiCENet outperforms MobileNetv2 and ShuffleNetv2 by about 1.5%. Note that our model at 14 MFLOPs delivers similar performance to MobileNetv1 at 41 MFLOPs (48% top-1 accuracy). For small models (FLOP range: 30-160 M), DiCENet achieves the best accuracy with the same or fewer FLOPs. At 41 MFLOPs, DiCENet outperforms MobileNetv2 and ShuffleNetv2 by about 3% and 1.5%, respectively. For medium-size models (FLOP range: 270-600 M), DiCENet delivers the best performance. Notably, DiCENet delivers a top-1 accuracy of 75.1% with 551 MFLOPs, outperforming all previous efficient designs, such as ShuffleNets [37, 64] and MobileNets [22, 51], with fewer FLOPs. We highlight that at 551 MFLOPs, DiCENet is somewhat less accurate than ResNet-50 but has substantially fewer FLOPs.
Recently, neural architecture search (NAS)-based methods have gained attention. These methods rely on architecture innovations to deliver better performance. For example, MNASNet and FBNet have shown remarkable improvements using MobileNetv2 as a basic search block. For the sake of completeness, we include a comparison with such methods in Table 1(c). DiCENet delivers competitive performance to networks derived using NAS. In particular, DiCENet outperforms FBNet- and MNASNet-based small models (FLOP range: 70-100 M). We note that NAS-based methods are not directly comparable to DiCENet because they rely on existing CNN units, such as MobileNetv2, for network search. However, we believe that adding the DiCENet unit to the search dictionary of NAS-based methods would deliver better performance, because DiCENet learns rich representations from all dimensions, which should help it outperform MobileNetv2 across different complexity levels.
4.2 Multi-label classification on the MS-COCO
The goal of multi-label classification is to predict multiple labels per image. This is in contrast to the ImageNet classification task that aims to predict one label per image.
Implementation and dataset details:
Figure 6 summarizes the quantitative results for multi-label classification. For similar FLOPs, DiCENet is 3.5% and 2% more accurate than ShuffleNetv2 and ESPNetv2, respectively. DiCENet is 3-4% less accurate than heavyweight networks, including the Elastic versions, while requiring far fewer FLOPs.
4.3 Semantic segmentation
Implementation and dataset details:
For the task of efficient semantic segmentation, we use an encoder-decoder architecture, a widely studied efficient segmentation design [39, 40, 41, 48]. We initialize the encoder with the ImageNet pretrained model and then fine-tune it by minimizing a cross-entropy loss using SGD. We evaluate the performance on two segmentation datasets: (1) the Cityscapes and (2) the PASCAL VOC 2012.
The Cityscapes dataset has been widely used for benchmarking efficient segmentation networks because of its application in urban scene understanding, especially for self-driving cars [39, 40, 41, 48]. The dataset provides pixel-wise fine annotations (20 classes, including background) for 5K high-resolution images captured across 50 cities. We follow the same splits for training and validation as in [39, 40, 41, 48].
The PASCAL VOC 2012 dataset is a well-known segmentation dataset that provides annotations for 20 foreground object classes. It has 1.4K training images, 1.4K validation images, and 1.4K test images. Following standard convention [4, 66, 5], we also use additional annotated images for training our networks.
We measure the accuracy in terms of mean intersection over union (mIOU). For both datasets, we use official scripts for evaluation on the validation set while we use an online server for evaluation on the test set because the test set is private. We do not use multi-scale testing strategy for evaluation.
Figure 7 summarizes the quantitative results of DiCENet on both datasets, including comparisons with state-of-the-art methods. (Most efficient networks have not reported performance on the PASCAL VOC dataset, with the exception of ESPNet; therefore, we compare DiCENet on this dataset with other networks that are not necessarily efficient.) We can see that: (1) the accuracy of DiCENet drops off smoothly as the image resolution is decreased, and even at low resolutions DiCENet delivers good accuracy; for example, DiCENet has fewer FLOPs than SegNet while delivering the same accuracy on the PASCAL VOC dataset. (2) DiCENet delivers better accuracy than existing efficient networks, such as ENet, ESPNet, and RTSeg, on the Cityscapes dataset while requiring fewer FLOPs. DiCENet is the most efficient network while delivering performance competitive with state-of-the-art efficient networks, making it a strong candidate for resource-constrained devices.
4.4 Object detection
Implementation and dataset details:
For the task of efficient object detection, we use the single-shot object detector (SSD). We use DiCENet, pretrained on the ImageNet, as the base feature extractor in SSD instead of VGG. We fine-tune our network using SGD with a smooth L1 loss for object localization and a cross-entropy loss for object classification. We evaluate the performance on two widely studied object detection datasets: (1) the PASCAL VOC 2007 and (2) the MS-COCO dataset. Following the standard convention for the PASCAL VOC 2007 dataset, we use the union of the PASCAL VOC 2012 and PASCAL VOC 2007 trainval sets for training and evaluate the performance on the PASCAL VOC 2007 test set.
We measure the accuracy in terms of mean Average Precision (mAP). For the COCO dataset, we report mAP @ IoU of 0.50:0.95.
Figure 4 summarizes the quantitative results of DiCENet on the PASCAL VOC 2007 and the MS-COCO dataset. It also provides a comparison with existing methods, including MobileNets [22, 51], YOLOv2 , and SSD . We do not compare with other methods such as Fast RCNN  and Faster RCNN  because our focus is on efficient network designs. We can see that the accuracy of DiCENet improves with the image resolution on both datasets. DiCENet delivers competitive performance to existing methods on this task. Notably, DiCENet is 4.5% more accurate and requires fewer FLOPs than YOLOv2 on the MS-COCO dataset.
4.5 Ablation studies
In this section, we study the significance of the two main components of DiCENet, DimConv and EFuse, on the ImageNet 1000-way classification dataset.
Importance of DimConv:
Learning representations with depth-wise convolutions has been studied widely for efficient networks [51, 37, 40]. The new result reported in this paper is that learning representations along all dimensions of the input tensor better encodes the information contained in the tensor, leading to better accuracy (Table 3). Note that DimConv does not significantly increase the total number of operations (FLOPs) or parameters.
|Depth||Width||Height||# Params||FLOPs||Top-1 accuracy|
|✓||||2.29 M||84 M||64.2|
|✓||✓|||2.31 M||88 M||64.9|
|✓||✓||✓||2.32 M||91 M||66.4|
Importance of EFuse:
EFuse efficiently fuses the representations from different dimensions. To demonstrate the superior performance of EFuse, we replace it with (1) a standard point-wise convolution and (2) a squeeze-and-excitation (SE) unit, which squeezes the spatial dimensions to encode channel-wise representations. Table 4 summarizes our findings. The point-wise convolution delivers similar performance to EFuse, but less efficiently. On the other hand, the SE unit drops the performance of DiCENet significantly. The drop in accuracy is likely because the SE unit does not encode spatial information as EFuse does.
|Fusion method||# Params||FLOPs||Top-1 accuracy|
|Point-wise conv||2.30 M||126 M||66.6|
|SE unit||2.29 M||89 M||62.7|
|EFuse (Ours)||2.32 M||91 M||66.4|
In this paper, we introduce a novel convolutional unit, the DiCENet unit, that increases the learning capacity of a CNN by utilizing information from all dimensions of the tensor. Our experimental results suggest that the proposed unit has superior performance compared to existing state-of-the-art networks across several computer vision tasks and datasets. The introduced convolutional unit is generic and can be applied to higher-dimensional data, such as 3D (or volumetric) images.
In this work, we study a homogeneous network structure. In the future, we will study the performance of DiCENet with neural architecture search methods.
Acknowledgement: This research was supported by the NSF III (1703166), Allen Distinguished Investigator Award, Samsung GRO award, and gifts from Google, Amazon, and Bloomberg. We also thank PRIOR team at AI2 for their helpful comments.
-  R. Andri, L. Cavigelli, D. Rossi, and L. Benini. Yodann: An architecture for ultralow power binary-weight cnn acceleration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI, 2017.
-  H. Cai, L. Zhu, and S. Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2018.
-  L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
-  F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
-  M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
-  W. Ge, S. Yang, and Y. Yu. Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In CVPR, 2018.
-  R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
-  S. Gupta, J. Hoffman, and J. Malik. Cross modal distillation for supervision transfer. In CVPR, pages 2827–2836, 2016.
-  S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
-  S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
-  B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
-  G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
-  G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger. Condensenet: An efficient densenet using learned group convolutions. In CVPR, 2018.
-  G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
-  I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990.
-  C. Li and C. R. Shi. Constrained optimization based low-rank approximation of deep neural networks. In ECCV, 2018.
-  Y. Li, Y. Song, and J. Luo. Improving pairwise ranking for multi-label image classification. In CVPR, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
-  C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
-  D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015.
-  N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In ECCV, 2018.
-  S. Mehta, R. Koncel-Kedziorski, M. Rastegari, and H. Hajishirzi. Pyramidal recurrent unit for language modeling. In EMNLP, 2018.
-  S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi. Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In ECCV, 2018.
-  S. Mehta, M. Rastegari, L. Shapiro, and H. Hajishirzi. Espnetv2: A light-weight, power efficient, and general purpose convolutional neural network. In CVPR, 2019.
-  A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
-  R. P. Poudel, U. Bonde, S. Liwicki, and C. Zach. Contextnet: Exploring context and detail for semantic segmentation in real-time. In BMVC, 2018.
-  PyTorch. Tensors and Dynamic neural networks in Python with strong GPU acceleration. http://pytorch.org/. Accessed: 2018-11-15.
-  M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
-  E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin. Large-scale evolution of image classifiers. In ICML, pages 2902–2911, 2017.
-  J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo. Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 2018.
-  A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
-  M. Siam, M. Gamal, M. Abdel-Razek, S. Yogamani, and M. Jagersand. rtseg: Real-time semantic segmentation comparative study. In 2018 25th IEEE International Conference on Image Processing (ICIP).
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2014.
-  D. Soudry, I. Hubara, and R. Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In NIPS, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
-  M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.
-  A. Veit and S. Belongie. Convolutional networks with adaptive inference graphs. In ECCV, 2018.
-  H. Wang, A. Kembhavi, A. Farhadi, A. Yuille, and M. Rastegari. Elastic: Improving cnns with instance specific scaling policies. In CVPR, 2019.
-  W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
-  B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. arXiv preprint arXiv:1812.03443, 2018.
-  J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. In CVPR, 2016.
-  L. Xie and A. Yuille. Genetic cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1379–1388, 2017.
-  J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4133–4141, 2017.
-  X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
-  H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 405–420, 2018.
-  H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
-  S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
-  B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
-  B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.