DiCENet: Dimension-wise Convolutions for Efficient Networks

06/08/2019 · Sachin Mehta, et al.

In this paper, we propose a new CNN model, DiCENet, that is built using: (1) dimension-wise convolutions and (2) efficient channel fusion. The introduced blocks maximize the use of information in the input tensor by learning representations across all dimensions while simultaneously reducing the complexity of the network and achieving high accuracy. Our model shows significant improvements over state-of-the-art models across various visual recognition tasks, including image classification, object detection, and semantic segmentation. Our model delivers either the same or better performance than existing models with fewer FLOPs, including task-specific models. Notably, DiCENet delivers competitive performance to neural architecture search-based methods at fewer FLOPs (70-100 MFLOPs). On MS-COCO object detection, DiCENet is 4.5% more accurate and has fewer FLOPs than YOLOv2. On the PASCAL VOC 2012 semantic segmentation dataset, DiCENet is 4.3% more accurate and has 3.2 times fewer FLOPs than ESPNet, a recent efficient semantic segmentation network. Our source code is available at <https://github.com/sacmehta/EdgeNets>


1 Introduction

Convolutional neural networks (CNNs) have achieved high performance in several visual recognition tasks, including object classification, detection, and segmentation [19, 17, 5]. These networks learn representations by performing convolutional operations on an input tensor along its spatial dimensions using a sliding window approach. Because of the limited receptive field of a convolutional kernel, convolutional layers are not able to utilize the full encoding capacity of a tensor and therefore, we need to stack multiple convolutional layers to encode more information. This increases the complexity of a CNN.

(a) Different types of convolutions
(b) Tensor size vs. FLOPs
(c) Accuracy vs. FLOPs
Figure 1: DiCENet vs. state-of-the-art methods. (a) compares different types of convolutional kernels (highlighted in color) with our dimension-wise convolutions. (b) compares the complexity of different efficient convolutional units with DiCENet at different tensor sizes (depth × height × width). (c) compares the top-1 accuracy on the ImageNet dataset at two different settings. Dimension-wise convolutions maximize the use of information contained in the tensor, thus leading to more accurate solutions.

To improve the efficiency of standard convolutions, most recent attempts have focused on reducing redundant operations and parameters in convolutions [30, 20]. Recently, depth-wise separable convolutions [22] have been proposed to reduce CNN complexity in terms of floating-point operations (FLOPs) and network parameters. These convolutions factorize the convolutional operation into two steps: (1) a light-weight filter is applied to each spatial plane using depth-wise convolutions [6] to learn spatial representations, and (2) a point-wise (1 × 1) convolution is then applied to fuse channels and learn combinations of the spatial representations. Similar to standard convolutions, depth-wise convolutions are not able to utilize the full encoding capacity of a tensor; therefore, point-wise convolutions are used to encode more information in depth-wise separable convolutions. Since depth-wise convolutions are computationally very efficient, this leaves a significant computational load on point-wise convolutions, which become a computational bottleneck. As an example, point-wise convolutions account for about 90% of the total operations in ShuffleNetv2 [37] and MobileNetv2 [51].
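To make this factorization concrete, the snippet below is a minimal PyTorch sketch of a depth-wise separable convolution; the layer names and tensor shapes are illustrative rather than taken from any particular model. Counting multiply-adds for the two steps (roughly n²·H·W·C for the depth-wise step versus H·W·C·C_out for the point-wise step) makes the point-wise bottleneck explicit.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise 3x3 convolution followed by a point-wise (1x1) convolution."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 64, 56, 56)
y = DepthwiseSeparableConv(64, 128)(x)  # -> (1, 128, 56, 56)
# Multiply-adds: depth-wise ~ 9*56*56*64 = 1.8M; point-wise ~ 56*56*64*128 = 25.7M.
```

In this illustrative setting the 1 × 1 fusion already accounts for the vast majority of the operations, which is the imbalance the rest of the paper targets.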

In this paper, we introduce DiCENet, Dimension-wise Convolutions for Efficient Networks. Our dimension-wise convolution (DimConv) convolves separate filters across each dimension of the input tensor, including the channel (or depth; we use "depth" and "channel" interchangeably in this paper) dimension. Unlike standard and depth-wise convolutions, DimConv leverages the spatial locality of the different channels and is thus, in a sense made precise in Section 3, not invariant to channel-wise permutations of the input tensor. Moreover, learning representations across all dimensions leads to an encoding mechanism that increases the use of information contained in the tensor. In particular, the increased learning capacity due to DimConv allows us to further reduce the computational burden of expensive point-wise convolutions. We introduce an Efficient channel Fusion (EFuse) to fuse dimension-wise representations efficiently. With DimConv and EFuse, we are able to build a more efficient CNN while maintaining accuracy (see Figure 1). We highlight that the introduced module is novel and is not part of any architecture search method [68, 56, 3].

We have evaluated the performance of DiCENet on several visual recognition tasks, including object classification, detection, and semantic segmentation. Compared to existing methods, DiCENet shows a significant improvement across all tasks. When compared with existing efficient architecture designs, such as MobileNetv2 [51] and ShuffleNetv2 [37], on the ImageNet dataset [50], DiCENet (1) is 1-2% more accurate for extremely small models (7-15 MFLOPs), (2) delivers similar performance with 15-20% fewer FLOPs for small models (38-160 MFLOPs), and (3) outperforms existing efficient designs with fewer FLOPs for medium-size (270-600 MFLOPs) models. When DiCENet is used as a base feature extractor for tasks such as semantic segmentation and object detection, it delivers competitive performance and outperforms some task-specific tailored networks. For semantic segmentation on the PASCAL VOC 2012 dataset [10], DiCENet is 4.3% more accurate and has fewer FLOPs than ESPNet [39], a recent efficient segmentation network. DiCENet is 4.5% more accurate and has fewer FLOPs than YOLOv2 [46] on MS-COCO object detection [32].

2 Related Work

CNN architecture designs:

Recent successes in visual recognition tasks, including object classification, detection, and segmentation, can be attributed to exploration of different CNN designs [29, 53, 19, 28, 55, 25]. To make these network designs more efficient, they have been extended with efficient and sparse forms of convolutions, such as depth-wise and group convolutions [22, 24, 64, 37, 51, 40]. In this paper, we introduce dimension-wise convolutions that generalize depth-wise convolutions to all dimensions of the input tensor.

Neural architecture search:

Recently, neural architecture search methods, including reinforcement learning and genetic algorithms, have been proposed to automatically construct network architectures [68, 62, 45, 69, 56, 33]. These methods search over a huge network space (e.g. MNASNet [56] searches over 8K different design choices) using a dictionary of pre-defined search space parameters, including different types of convolutional layers and kernel sizes, to identify a network structure, usually non-homogeneous, that satisfies optimization constraints such as inference time. Recent search-based methods [56, 3, 60] use MobileNetv2 [51] as a basic search block for automatic network design. Since the proposed unit delivers better performance than MobileNetv2 (see Section 4), we believe that neural architecture search with our proposed unit would enable finding better network designs.

Other alternatives for efficient CNNs:

These approaches include network quantization [54, 44, 61, 8, 67, 26, 1], compression [14, 15, 59, 30, 57, 20], and distillation [21, 49, 13, 36, 63]. Network quantization-based approaches approximate convolution operations with fewer bits instead of 32-bit full-precision floating-point values. This improves inference speed and reduces the memory required for storing network weights. Network compression-based approaches improve the efficiency of a network by removing redundant weights and connections. Unlike network quantization and compression, distillation-based approaches improve the accuracy of (usually shallow) networks by supervising the training with large pre-trained networks.

Though these approaches have been shown to be very effective for improving either the efficiency or the accuracy of a network, they do not highlight the benefits of the underlying architectural design. For a fair comparison with existing design-based methods such as ShuffleNets [37, 64], MobileNets [22, 51], and ESPNetv2 [40], we do not use these approaches.

Figure 2: Overview of dimension-wise convolutions for efficient networks (DiCENet). See Section 3 for more details.

3 DiCENet

In this section, we elaborate on the details of our model, DiCENet. We first discuss the building block of our model, the DiCENet unit, and then describe the DiCENet architecture.

3.1 DiCENet unit

The main building block of our model, the DiCENet unit, is based on a factorization principle and is primarily composed of two steps: (1) Dimension-wise Convolution (DimConv) convolves separate filters across each dimension of the input tensor to learn rich representations and (2) Efficient channel Fusion (EFuse) efficiently combines channels of the output tensor from the DimConv. Figure 2 shows an overview of the DiCENet unit. Dimension-wise convolutions encode information from all dimensions of the tensor. This increases the learning capacity of a convolutional layer and reduces the computational burden on other layers. Therefore, the proposed efficient channel fusion will work effectively with dimension-wise convolutions.

Dimension-wise convolution (DimConv):

The input to a convolutional layer is a three-dimensional tensor defined by its width W, height H, and depth (or channels) D. A standard depth-wise convolutional layer applies D convolutional kernels of size n × n, each processing a single channel along the spatial dimensions (width and height), to produce an output of the same depth, where n denotes the kernel size. This type of convolution only encodes information within the spatial dimensions: for any reordering of the channels in the input tensor, there exists a reordering of the weight filters that produces the same output tensor. To also encode information along the channel dimension, we extend depth-wise convolutions to all dimensions of the input tensor and call this operation Dimension-wise Convolution (DimConv). As illustrated in Figure 2, DimConv has three branches. These branches apply D depth-wise convolutional kernels along depth, W width-wise convolutional kernels along width, and H height-wise convolutional kernels along height to produce outputs Y_D, Y_W, and Y_H that encode information from all dimensions of the input tensor. The outputs of these independent branches are concatenated along the channel dimension to produce an output with 3D channels. To facilitate learning of inter-dimension representations, we group the spatial planes (or feature maps) of this output such that each group contains one plane from each dimension. We then combine these dimension-wise grouped representations with point-wise convolutional kernels to produce a weighted-average output Y_Dim.
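As a concrete illustration, the following is a minimal PyTorch sketch of DimConv as we read the description above; the module and variable names (DimConv, conv_w, conv_h) and the exact interleaving used for grouping are our own, and the reference implementation in the linked repository may differ in details.

```python
import torch
import torch.nn as nn

class DimConv(nn.Module):
    """Dimension-wise convolution over a (N, D, H, W) tensor: depth-, width-,
    and height-wise branches, followed by grouped point-wise fusion."""
    def __init__(self, channels, height, width, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Depth-wise branch: one k x k kernel per channel, sliding over (H, W).
        self.conv_d = nn.Conv2d(channels, channels, kernel_size, padding=pad,
                                groups=channels, bias=False)
        # Width-wise branch: treat width as the "channel" axis, slide over (H, D).
        self.conv_w = nn.Conv2d(width, width, kernel_size, padding=pad,
                                groups=width, bias=False)
        # Height-wise branch: treat height as the "channel" axis, slide over (D, W).
        self.conv_h = nn.Conv2d(height, height, kernel_size, padding=pad,
                                groups=height, bias=False)
        # Grouped point-wise fusion: each group holds one plane from each branch,
        # so 3*D channels are reduced back to D channels (weighted averaging).
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1,
                              groups=channels, bias=False)

    def forward(self, x):                                   # x: (N, D, H, W)
        y_d = self.conv_d(x)                                # (N, D, H, W)
        y_w = self.conv_w(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        y_h = self.conv_h(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # Interleave so that each group of 3 channels has one plane per branch.
        y = torch.stack((y_d, y_w, y_h), dim=2)             # (N, D, 3, H, W)
        y = y.reshape(x.size(0), -1, x.size(2), x.size(3))  # (N, 3D, H, W)
        return self.fuse(y)                                 # (N, D, H, W)
```

Because the width- and height-wise branches use one kernel per column and per row, this sketch ties the layer to a fixed spatial size; the dynamic input scaling described in Section 3.2 addresses inputs at other resolutions.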

(a) ShuffleNetv2 unit
(b) ShuffleNetv2 unit with stride 2
(c) DiCENet unit
(d) DiCENet unit with stride 2
Figure 3: Building blocks of ShuffleNetv2 [37] and DiCENet. Differences are highlighted in color.
Figure 4: Convolution-wise distribution of FLOPs for different networks with similar accuracy. The sizes of the pie charts are scaled with respect to MobileNetv2's FLOPs. In DiCENet, efficient convolutions correspond to depth-, width-, and height-wise convolutions, while in the other networks they correspond to depth-wise convolutions.

Efficient channel fusion (EFuse):

Recent efficient CNN designs [22, 51, 64, 37] use point-wise convolutions for channel fusion. To fuse the channels of a W × H × D input tensor into an output tensor of the same size, a point-wise convolution performs W·H·D² operations. This is computationally expensive and is the main computational bottleneck in recent efficient CNN designs. We propose Efficient channel Fusion (EFuse) as a substitute for channel fusion with point-wise convolutions. This allows us to skip the majority of point-wise convolutions in a CNN architecture, thus reducing the computational burden of point-wise convolutions by a large margin (Figure 4).

In EFuse, instead of fusing the channels across all elements in the spatial dimensions, we first squeeze the spatial dimensions using a global average pooling operation to extract a D-dimensional global vector descriptor. We then fuse the channels within this global vector using a fully connected layer to produce a fusion vector. In parallel, we encode spatial representations by applying D depth-wise convolutional kernels of size n × n to the input. Next, we propagate the fusion vector along the spatial dimensions: we apply a sigmoid to the fusion vector and then multiply it with all elements of the depth-wise output along the spatial dimensions. We choose the sigmoid function to prevent gradient overflow during training. The complexity of the EFuse layer is therefore about n²·W·H·D + D² operations, compared with W·H·D² for a standard point-wise fusion; since the kernel size n is much smaller than the channel count D in practice, EFuse requires far fewer operations. For example, for a 56 × 56 × 128 tensor and n = 3, EFuse needs roughly 3.6M operations, whereas a point-wise convolution needs roughly 51M.
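A minimal sketch of EFuse under the description above is given below; the placement of the depth-wise convolution relative to the gating follows our reading of the text and Figure 2, and the names are illustrative.

```python
import torch
import torch.nn as nn

class EFuse(nn.Module):
    """Efficient channel fusion on a (N, D, H, W) tensor: a pooled, fully
    connected fusion vector gates a depth-wise spatial encoding."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # Depth-wise branch that keeps encoding spatial representations.
        self.spatial = nn.Conv2d(channels, channels, kernel_size,
                                 padding=kernel_size // 2,
                                 groups=channels, bias=False)
        # Channel fusion on the squeezed (globally average-pooled) descriptor.
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):                     # x: (N, D, H, W)
        g = x.mean(dim=(2, 3))                # global average pool -> (N, D)
        g = torch.sigmoid(self.fc(g))         # fusion vector, squashed to (0, 1)
        y = self.spatial(x)                   # depth-wise spatial encoding
        # Broadcast the fusion vector over the spatial dimensions.
        return y * g.unsqueeze(-1).unsqueeze(-1)
```

Replacing the per-pixel 1 × 1 fusion with a single fully connected layer on the pooled descriptor is what removes the W·H factor from the D² term in the complexity above.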

If we stack DimConv and EFuse, we obtain an efficient block structure, the DiCENet unit, that has fewer FLOPs than other state-of-the-art efficient CNN units across different input tensor sizes. Figure 1(b) compares FLOPs against input tensor size for different CNN units. We can see that the DiCENet unit has noticeably fewer FLOPs than existing units, such as the ShuffleNetv2 [37] and MobileNetv2 [51] units, a major improvement over state-of-the-art efficient designs.

3.2 DiCENet architecture

The architecture design of DiCENet is motivated by ShuffleNetv2 [37], a state-of-the-art efficient network that uses channel split and channel shuffle to learn representations efficiently. Figure 3 contrasts the building blocks of ShuffleNetv2 and DiCENet, while the overall architecture is shown in Table 1. Except for the first two layers, we replace all point-wise and depth-wise convolutions with the DiCENet unit. The first layer is a standard convolution with a stride of two, while the second layer is a max-pooling layer. All convolutional layers are followed by a batch normalization layer [27] and a PReLU non-linear activation [18], except for the last layer, which feeds into a softmax for classification. Following previous work [22, 51, 37, 64, 40], we scale the number of output channels by a width scaling factor to construct networks at different FLOPs. We initialize the weights of our network using the same method as in [18].
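The sketch below shows how the DiCENet unit might be wired into a ShuffleNetv2-style, stride-1 block, reusing the DimConv and EFuse sketches from Section 3.1; the wiring (channel split, transform of one half, concatenation, channel shuffle) follows our reading of Figure 3, and the names are ours rather than the reference implementation's.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    # Standard ShuffleNet-style channel shuffle.
    n, c, h, w = x.size()
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).reshape(n, c, h, w))

class DiCENetBlock(nn.Module):
    """Stride-1 DiCENet block in the ShuffleNetv2 style: channel split,
    DimConv + EFuse on one half, concatenation, channel shuffle. The spatial
    size must be fixed at construction time in this sketch."""
    def __init__(self, channels, height, width):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            DimConv(half, height, width),     # sketched in Section 3.1
            nn.BatchNorm2d(half), nn.PReLU(half),
            EFuse(half),                      # sketched in Section 3.1
            nn.BatchNorm2d(half), nn.PReLU(half),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)            # channel split
        out = torch.cat((x1, self.branch(x2)), dim=1)
        return channel_shuffle(out)
```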

Layer | Stride | Repeat | Output channels (five width-scaling settings)
Image | | | 3, 3, 3, 3, 3
Conv1 | 2 | 1 | 8, 16, 24, 24, 48
Max Pool | 2 | | 8, 16, 24, 24, 48
Stage2 | 2 | 1 | 16, 32, 116, 278, 384
Stage2 | 1 | 3 | 16, 32, 116, 278, 384
Stage3 | 2 | 1 | 32, 64, 232, 556, 768
Stage3 | 1 | 7 | 32, 64, 232, 556, 768
Stage4 | 2 | 1 | 64, 128, 464, 1112, 1536
Stage4 | 1 | 3 | 64, 128, 464, 1112, 1536
Global Pool | | | 512, 1024, 1024, 1280, 2048
Grouped FC [38] | 1 | 1 | 512, 1024, 1024, 1280, 2048
FC | | | 1000, 1000, 1000, 1000, 1000
FLOPs | | | 6.5 M, 12 M, 24-240 M, 298 M, 553 M
Table 1: Overall architecture of DiCENet at different network complexities for ImageNet classification. We use 4 groups in the grouped fully connected (FC) layer.

Dynamic input scaling:

A standard practice in the computer vision community is to train a CNN on a large-scale classification dataset (e.g. the ImageNet [50]) and then transfer the learned representations to other tasks, including object detection and semantic segmentation, via fine-tuning. In general, visual recognition tasks such as semantic segmentation and object detection use images of higher spatial resolution (e.g. the PASCAL VOC segmentation dataset [10]) than the resolution at which the classification network is trained (e.g. the ImageNet dataset). Since DiCENet has two branches corresponding to the width and height dimensions of the input, a natural question arises: "Can we use DiCENet with images whose spatial dimensions differ from the ones used for training on the classification dataset?" To make DiCENet invariant to the spatial dimensions of the input image, we dynamically scale (either up-sample or down-sample) the height or width dimension of the input tensor to the height or width of the input tensor used in the pretrained network. The resulting tensors are then scaled (either down-sampled or up-sampled) back to their original size before being fed to the weighted-average function; this makes DiCENet invariant to the input image size. We note that this dynamic scaling allows DiCENet to learn scale-invariant features because each convolutional layer (depth-, width-, or height-wise) receives inputs at different dimensions. Figure 5 sketches the DiCENet block with dynamic input scaling.
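A simplified sketch of dynamic input scaling is given below; for brevity it resizes around an entire dimension-wise layer, whereas the paper rescales only within the width- and height-wise branches and resizes back before the weighted-average fusion, but the idea is the same. The function and parameter names are illustrative.

```python
import torch.nn.functional as F

def dimconv_with_dynamic_scaling(dimconv, x, trained_size):
    """Resize the input to the spatial size used during pretraining, apply the
    dimension-wise layer, and resize the result back to the original size."""
    h, w = x.shape[2], x.shape[3]
    if (h, w) == trained_size:
        return dimconv(x)
    x_scaled = F.interpolate(x, size=trained_size, mode='bilinear',
                             align_corners=False)
    y = dimconv(x_scaled)
    return F.interpolate(y, size=(h, w), mode='bilinear', align_corners=False)
```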

Figure 5: Dynamic input scaling makes the DiCENet block invariant to the spatial dimensions of the input when tested at a resolution different from the one used for training.

4 Experimental Results

In this section, we demonstrate the performance of DiCENet on three different visual recognition tasks: (1) object classification, (2) semantic segmentation, and (3) object detection. To showcase the strength of our network, we also compare it with state-of-the-art efficient networks, including task-specific networks such as YOLOv2 [46] and ESPNet [39].

(a)
FLOPs range | MobileNetv2 [51] | ShuffleNetv2 [37] | ESPNetv2 [40] | DiCENet (ours)
6-10 M | | 39.05 (8 M)† | | 40.57 (6.5 M)
11-20 M | 45.5 (11 M) | | | 46.2 (14 M)
30-50 M | 61.0 (50 M) | 59.69 (41 M)†, 60.3 (41 M) | | 61.26 (41 M), 62.80 (46 M)
50-100 M | 63.9 (71 M), 66.4 (107 M) | | 66.1 (86 M) | 66.45 (70 M), 67.77 (98 M)
124-153 M | 68.7 (153 M) | 68.14 (142 M)†, 69.4 (146 M) | 67.9 (124 M) | 69.51 (139 M)
270-300 M | 71.8 (300 M) | 71.8 (292 M)†, 72.6 (299 M) | 72.1 (284 M) | 72.9 (298 M)
500+ M | 75.0 (582 M) | 74.9 (591 M) | 74.9 (602 M) | 75.1 (553 M)
(b)
Network | FLOPs | top-1
MNASNet [56] | 76 M | 62.4
MNASNet [56] | 103 M | 67.3
MNASNet [56] | 317 M | 74.0
FBNet [60] | 72 M | 65.3
FBNet [60] | 92 M | 67.0
FBNet [60] | 295 M | 74.1
ResNet-50 [19] | 3.8 B | 76.15
DiCENet (ours) | 70 M | 66.45
DiCENet (ours) | 98 M | 67.77
DiCENet (ours) | 298 M | 72.9
DiCENet (ours) | 553 M | 75.1
Table 2: Results on the ImageNet validation set. (a) compares DiCENet with other state-of-the-art efficient designs; each entry denotes top-1 accuracy with network FLOPs in brackets, and † denotes our implementation. (b) compares DiCENet with other networks, including those obtained through neural architecture search. We do not compare inference time because cuDNN-optimized height- and width-wise convolutions are not yet available. DiCENet uses ShuffleNetv2-style blocks (see Figure 3); for a fair comparison, we therefore compare against our own implementation of ShuffleNetv2.

4.1 Object classification on the ImageNet

Implementation and dataset details:

Following common practice [37, 64, 22, 51, 40], we evaluate the performance of DiCENet on the ImageNet 1000-way classification dataset [50] at different complexity levels, ranging from 6 MFLOPs to 500+ MFLOPs. The dataset consists of 1.28M training samples and 50K validation samples. We optimize our networks by minimizing a cross-entropy loss using SGD, with the same learning rate and data augmentation policy as in [40]. We use PyTorch [43] as our training framework because of its ability to handle dynamic graphs.

Evaluation metric:

We evaluate our networks using single-crop top-1 accuracy on the validation set.

Results:

Table 2(a) compares the performance of DiCENet with state-of-the-art efficient architectures at different complexity levels. For extremely small models (FLOP range: 6-20 M), DiCENet outperforms MobileNetv2 [51] and ShuffleNetv2 [37] by about 1.5%. Note that our model at 14 MFLOPs delivers performance similar to MobileNetv1 [22] at 41 MFLOPs (48% top-1 accuracy). For small models (FLOP range: 30-160 M), DiCENet achieves either the best or comparable accuracy with fewer FLOPs. At about 41 MFLOPs, DiCENet outperforms MobileNetv2 and ShuffleNetv2 by about 3% and 1.5%, respectively. For medium-size models (FLOP range: 270-600 M), DiCENet delivers the best performance. Notably, DiCENet delivers a top-1 accuracy of 75.1 with 553 MFLOPs, outperforming previous efficient designs such as ShuffleNets [37, 64] and MobileNets [22, 51] with fewer FLOPs. We would like to highlight that at 553 MFLOPs, DiCENet is about 1% less accurate than ResNet-50 [19] while having roughly 7 times fewer FLOPs.

Recently, neural architecture search (NAS)-based methods have gained attention. Recent NAS-based methods rely on architecture innovations to deliver better performance. For example, MNASNet [56] and FBNet [60] have shown remarkable improvements using MobileNetv2 as the basic search block. For the sake of completeness, we include a comparison with such methods in Table 2(b). DiCENet delivers competitive performance to networks derived using NAS. In particular, DiCENet outperforms FBNet- and MNASNet-based small models (FLOPs range: 70-100 MFLOPs). We note that NAS-based methods are not directly comparable to DiCENet because they rely on existing CNN units, such as MobileNetv2, for network search. However, we believe that adding the DiCENet unit to the search space of NAS-based methods would deliver better performance, because DiCENet learns rich representations from all dimensions, which should help it outperform MobileNetv2 across different complexity levels.

4.2 Multi-label classification on the MS-COCO

The goal of multi-label classification is to predict multiple labels per image. This is in contrast to the ImageNet classification task that aims to predict one label per image.

Implementation and dataset details:

For multi-object classification, we use the MS-COCO dataset [32], which has 2.9 labels (on average) per image. We use the same training and validation splits as in [58, 40]. We fine-tune our ImageNet-pretrained models using a binary cross-entropy loss.

Evaluation metrics:

Following a standard convention in multi-label classification tasks [31, 11], we evaluate performance using the micro and macro F1 scores, which measure overall and per-class performance, respectively.

Figure 6: Comparison of the overall and per-class F1 scores of DiCENet with state-of-the-art networks for multi-label object classification on the MS-COCO dataset. The results for ShuffleNetv2 are reported in [40], while the results for the other networks (DenseNet [25] and ResNeXt [62]) are reported in [58].

Results:

Figure 6 summarizes the quantitative results for multi-label classification. For similar FLOPs, DiCENet is 3.5% and 2% more accurate than ShuffleNetv2 [37] and ESPNetv2 [40], respectively. Also, DiCENet is 3-4% less accurate than heavyweight networks, including the Elastic variants [58], while having far fewer FLOPs.

4.3 Semantic segmentation

Implementation and dataset details:

For the task of efficient semantic segmentation, we use an encoder-decoder architecture, a widely studied design for efficient segmentation [39, 40, 41, 48]. We initialize the encoder with the ImageNet-pretrained model and then fine-tune it by minimizing a cross-entropy loss using SGD. We evaluate the performance on two segmentation datasets: (1) the Cityscapes [7] and (2) the PASCAL VOC 2012 [10].
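As an illustration of this setup, the sketch below attaches a minimal decoder (a 1x1 classifier followed by bilinear upsampling) to an ImageNet-pretrained DiCENet encoder; the decoder shown here is a simplified placeholder for the encoder-decoder design, not the decoder used in the paper, and the class and parameter names are our own.

```python
import torch.nn as nn
import torch.nn.functional as F

class DiCENetSeg(nn.Module):
    """ImageNet-pretrained encoder with a placeholder decoder:
    a 1x1 classifier followed by bilinear upsampling to the input size."""
    def __init__(self, encoder, enc_channels, num_classes):
        super().__init__()
        self.encoder = encoder                 # pretrained feature extractor
        self.classifier = nn.Conv2d(enc_channels, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2], x.shape[3]
        feats = self.encoder(x)                # coarse feature map
        logits = self.classifier(feats)        # per-pixel class scores
        return F.interpolate(logits, size=(h, w), mode='bilinear',
                             align_corners=False)

# Fine-tuning then minimizes a per-pixel cross-entropy loss with SGD, e.g.:
# criterion = nn.CrossEntropyLoss(ignore_index=255)
```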

The Cityscapes dataset has been widely used for benchmarking efficient segmentation networks because of its application to urban scene understanding, especially for self-driving cars [39, 40, 41, 48]. The dataset provides fine pixel-wise annotations (20 classes, including background) for 5K high-resolution (2048 × 1024) images captured across 50 cities. We follow the same training and validation splits as in [39, 40, 41, 48].

The PASCAL VOC 2012 dataset is a well-known segmentation dataset that provides annotations for 20 foreground object classes. It has 1.4K training images, 1.4K validation images, and 1.4K test images. Following standard convention [4, 66, 5], we also use additional images from [16] and [32] for training our networks.

Evaluation metrics:

We measure accuracy in terms of mean intersection over union (mIOU). For both datasets, we use the official scripts for evaluation on the validation set, and an online server for evaluation on the test set because the test set is private. We do not use a multi-scale testing strategy.

(a) Impact of image size on the Cityscapes
(b) Impact of image size on the PASCAL VOC 2012
Cityscapes:
Network | FLOPs | mIOU
SegNet [2] | 82 B | 57
ContextNet [42] | 33 B | 68.7
ICNet [65] | 31 B | 69.5
ERFNet [48] | 26 B | 68.0
MobileNetv2 [51] | 21 B | 70.7
RTSeg-MobileNet [52] | 13.8 B | 62.4
RTSeg-ShuffleNet [52] | 6.2 B | 58.3
ESPNet [39] | 4.5 B | 61.4
ENet [41] | 3.8 B | 58.3
DiCENet (ours) | 2.2 B | 63.4

PASCAL VOC 2012:
Network | FLOPs | mIOU
FCN-8s [35] | 181 B | 62.2
DeepLabv3 [5] | 81 B | 80.49
SegNet [2] | 31 B | 59.1
MobileNetv1 [22] | 14 B | 75.29
MobileNetv2 [51] | 5.8 B | 75.7
ESPNet [39] | 2.2 B | 63.01
DiCENet - val (ours) | 0.24 B | 59.01
DiCENet - val (ours) | 0.68 B | 66.50
DiCENet - test (ours) | 0.68 B | 67.31
(c) Comparison with state-of-the-art networks
Figure 7: Semantic segmentation results on the Cityscapes and the PASCAL VOC 2012 datasets. In (a, b), we compare the mIOU of DiCENet at different image sizes. In (c), we compare the performance of DiCENet with state-of-the-art segmentation networks. DiCENet delivers competitive performance on both datasets while being very efficient. For a fair comparison, we report FLOPs at the same image resolution which is used for computing the accuracy.
[51] uses additional data from the MS-COCO dataset.
http://host.robots.ox.ac.uk:8080/anonymous/XWF8QJ.html
http://host.robots.ox.ac.uk:8080/anonymous/T44DHQ.html

Results:

Figure 7 summarizes the quantitative results of DiCENet on both datasets, including a comparison with state-of-the-art methods (most efficient networks have not reported performance on the PASCAL VOC dataset, with the exception of ESPNet [39]; we therefore compare DiCENet on this dataset against networks that are not necessarily efficient). We can see that: (1) The accuracy of DiCENet drops off smoothly as the image resolution decreases, and even at low resolutions DiCENet delivers good accuracy; for example, at a low input image resolution, DiCENet has far fewer FLOPs than SegNet [2] while delivering the same accuracy on the PASCAL VOC dataset. (2) DiCENet delivers better accuracy than existing efficient networks such as ENet [41], ESPNet [39], and RTSeg [52] on the Cityscapes dataset while requiring fewer FLOPs. Note that DiCENet is the most efficient network while delivering performance competitive with state-of-the-art efficient networks, making it a strong candidate for resource-constrained devices.

4.4 Object detection

Implementation and dataset details:

For the task of efficient object detection, we use the single shot object detector (SSD) [34], with DiCENet pretrained on the ImageNet as the base feature extractor instead of VGG [53]. We fine-tune the network using SGD with a smooth L1 loss for object localization and a cross-entropy loss for object classification. We evaluate the performance on two widely studied object detection datasets: (1) the PASCAL VOC 2007 [9] and (2) the MS-COCO [32]. Following the standard convention for the PASCAL VOC 2007 dataset, we train on the union of the PASCAL VOC 2007 and PASCAL VOC 2012 [10] trainval sets and evaluate on the PASCAL VOC 2007 test set.
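The fine-tuning objective described above can be sketched as follows; hard negative mining and the exact normalization used by SSD [34] are omitted, and the function signature is illustrative.

```python
import torch
import torch.nn.functional as F

def detection_loss(loc_pred, loc_target, cls_pred, cls_target, pos_mask):
    """Smooth L1 loss for box regression on positive anchors plus cross-entropy
    for classification, normalized by the number of positive anchors."""
    num_pos = pos_mask.sum().clamp(min=1)
    loc_loss = F.smooth_l1_loss(loc_pred[pos_mask], loc_target[pos_mask],
                                reduction='sum')
    cls_loss = F.cross_entropy(cls_pred.view(-1, cls_pred.size(-1)),
                               cls_target.view(-1), reduction='sum')
    return (loc_loss + cls_loss) / num_pos
```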

Evaluation metrics:

We measure the accuracy in terms of mean Average Precision (mAP). For the COCO dataset, we report mAP @ IoU of 0.50:0.95.

(a) Impact of image size
Network | VOC07 FLOPs | VOC07 mAP | COCO FLOPs | COCO mAP
SSD-512 [34] | 90.2 B | 74.9 | 99.5 B | 26.8
SSD-300 [34] | 31.3 B | 72.4 | 35.2 B | 23.2
YOLOv2 [46] | 6.8 B | 69 | 17.5 B | 21.6
MobileNetv1-320 [22] | | | 1.3 B | 22.2
MobileNetv2-320 [51] | | | 0.8 B | 22.1
DiCENet-512 (ours) | 2.5 B | 68.4 | 3.2 B | 26.1
DiCENet-300 (ours) | 0.9 B | 65.2 | 1.2 B | 23.4
(b) Comparison with state-of-the-art networks
Figure 8: Object detection results on the PASCAL VOC 2007 and the MS-COCO datasets. (a) compares the mAP of DiCENet at different image sizes. Unlike the experiments in Figure 7, the number of parameters is not the same for DiCENet at different resolutions in this experiment, because we add one more convolutional layer to DiCENet-300 to construct DiCENet-512, similar to SSD [34]; therefore, we do not compare network parameters vs. image size. (b) compares the performance of DiCENet with state-of-the-art methods.

Results:

Figure 8 summarizes the quantitative results of DiCENet on the PASCAL VOC 2007 and the MS-COCO datasets, and provides a comparison with existing methods, including MobileNets [22, 51], YOLOv2 [46], and SSD [34]. We do not compare with methods such as Fast R-CNN [12] and Faster R-CNN [47] because our focus is on efficient network designs. We can see that the accuracy of DiCENet improves with the image resolution on both datasets and that DiCENet delivers performance competitive with existing methods. Notably, DiCENet is 4.5% more accurate and requires fewer FLOPs than YOLOv2 on the MS-COCO dataset.

5 Ablations

In this section, we study the significance of two main components of DiCENet, i.e. DimConv and EFuse, on the ImageNet 1000-way classification dataset.

Importance of DimConv:

Learning representations with depth-wise convolutions has been studied widely for efficient networks [51, 37, 40]. The new result reported in this paper is that learning representations along all dimensions of the input tensor better encodes the information contained in the tensor, thus leading to better accuracy (Table 3). Note that DimConv does not significantly increase the total number of operations (FLOPs) or parameters.

Channel | Width | Height | # Params | FLOPs | top-1
✓ | | | 2.29 M | 84 M | 64.2
✓ | ✓ | | 2.31 M | 88 M | 64.9
✓ | ✓ | ✓ | 2.32 M | 91 M | 66.4
Table 3: Learning representations from all dimensions of the input tensor improves performance.

Importance of EFuse:

EFuse efficiently fuses the representations from the different dimensions. To demonstrate its benefit, we replace it with (1) a standard point-wise convolution and (2) a squeeze-and-excitation (SE) unit [23], which squeezes the spatial dimensions to encode channel-wise representations. Table 4 summarizes our findings. The point-wise convolution delivers the same performance as EFuse, but less efficiently. On the other hand, the SE unit drops the performance of DiCENet significantly; the drop in accuracy is likely because the SE unit does not encode spatial information as EFuse does.

Operation | # Params | FLOPs | top-1
Point-wise conv | 2.30 M | 126 M | 66.6
SE unit [23] | 2.29 M | 89 M | 62.7
EFuse (ours) | 2.32 M | 91 M | 66.4
Table 4: Impact of different fusion methods.
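For reference, the SE unit used in this ablation re-scales channels from a pooled descriptor but, unlike EFuse, applies no depth-wise spatial convolution; a minimal sketch follows (the reduction ratio is an assumption).

```python
import torch.nn as nn

class SEUnit(nn.Module):
    """Squeeze-and-excitation: pool, excite through a small bottleneck MLP,
    and re-scale channels. There is no depth-wise spatial branch as in EFuse."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (N, D, H, W)
        s = self.fc(x.mean(dim=(2, 3)))        # squeeze + excite -> (N, D)
        return x * s.unsqueeze(-1).unsqueeze(-1)
```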

6 Conclusion

In this paper, we introduce a novel convolutional unit, the DiCENet unit, that increases the learning capacity of a CNN by utilizing information from all dimensions of the input tensor. Our experimental results suggest that the proposed unit delivers superior performance compared with existing state-of-the-art networks across several computer vision tasks and datasets. The introduced convolutional unit is generic and can be applied to higher-dimensional data, such as 3D (or volumetric) images.

In this work, we study a homogeneous network structure. In the future, we will study the performance of DiCENet with neural architecture search methods.

Acknowledgement: This research was supported by the NSF III (1703166), Allen Distinguished Investigator Award, Samsung GRO award, and gifts from Google, Amazon, and Bloomberg. We also thank PRIOR team at AI2 for their helpful comments.

References