Res2Net: A New Multi-scale Backbone Architecture

04/02/2019 · by Shang-Hua Gao, et al.

Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layer-wise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g., CIFAR-100 and ImageNet. Further ablation studies and experimental results on representative computer vision tasks, i.e., object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods. The source code and trained models will be made publicly available.


1 Introduction

Visual patterns occur at multiple scales in natural scenes, as shown in Fig. 1. First, objects may appear with different sizes in a single image, e.g., the sofa and cup are of different sizes. Second, the essential contextual information of an object may occupy a much larger area than the object itself. For instance, we need to rely on the big table as context to better tell whether the small black blob placed on it is a cup or a pen holder. Third, perceiving information from different scales is essential for understanding parts as well as objects in tasks such as fine-grained classification and semantic segmentation. Thus, it is of critical importance to design good features for multi-scale stimuli for visual cognition tasks, including image classification [22], object detection [33], attention prediction [35], target tracking [50], action recognition [36], semantic segmentation [3], and salient object detection [18].

Fig. 1: Multi-scale representations are essential for various vision tasks, such as perceiving boundaries, regions, and semantic categories of the target objects. Even for the simplest recognition tasks, perceiving information from very different scales is essential to understand parts, objects (e.g., sofa, table, and cup in this example), and their surrounding context (e.g., the ‘on the table’ context contributes to recognizing the black blob).

Unsurprisingly, multi-scale features have been widely used in both conventional feature design [1, 31] and deep learning [39, 44, 29, 18, 6, 52]. Obtaining multi-scale representations in vision tasks requires feature extractors to use a large range of receptive fields to describe objects/parts/context at different scales. Convolutional neural networks (CNNs) naturally learn coarse-to-fine multi-scale features through a stack of convolutional operators. Such inherent multi-scale feature extraction ability of CNNs leads to effective representations for solving numerous vision tasks. How to design a more efficient network architecture is the key to further improving the performance of CNNs.

In the past few years, several backbone networks, e.g., [22, 37, 39, 17, 20, 9, 43, 6, 47, 19], have made significant advances in numerous vision tasks with state-of-the-art performance. Earlier architectures such as AlexNet [22] and VGGNet [37] stack convolutional operators, making the data-driven learning of multi-scale features feasible. The efficiency of multi-scale ability was subsequently improved by using conv layers with different kernel sizes (e.g., InceptionNets [39, 40, 38]), residual modules (e.g., ResNet [17]), shortcut connections (e.g., DenseNet [20]), and hierarchical layer aggregation (e.g., DLA [47]). The advances in backbone CNN architectures have demonstrated a trend towards more effective and efficient multi-scale representations.

(a) Bottleneck block  (b) Res2Net module

Fig. 2: Comparison between the bottleneck block and the proposed Res2Net module (the scale dimension s = 4).

In this work, we propose a simple yet efficient multi-scale processing approach. Unlike most existing methods that enhance the layer-wise multi-scale representation strength of CNNs, we improve the multi-scale representation ability at a more granular level. To achieve this goal, we replace the 3×3 filters (convolutional operators and filters are used interchangeably) of n channels with a set of smaller filter groups, each with w channels (without loss of generality we use n = s × w). As shown in Fig. 2, these smaller filter groups are connected in a hierarchical residual-like style to increase the number of scales that the output features can represent. Specifically, we divide the input feature maps into several groups. A group of filters first extracts features from a group of input feature maps. Output features of the previous group are then sent to the next group of filters along with another group of input feature maps. This process repeats several times until all input feature maps are processed. Finally, feature maps from all groups are concatenated and sent to another group of 1×1 filters to fuse the information altogether. Along any possible path from input features to output features, the equivalent receptive field increases whenever the path passes a 3×3 filter, resulting in many equivalent feature scales due to combination effects.

The Res2Net strategy exposes a new dimension, namely scale (the number of feature groups in the Res2Net block), as an essential factor in addition to the existing dimensions of depth [37], width (the number of channels in a layer, as in [48]), and cardinality [43]. We show in Sec. 4.4 that increasing scale is more effective than increasing other dimensions.

Note that the proposed approach exploits the multi-scale potential at a more granular level, which is orthogonal to existing methods that utilize layer-wise operations. Thus, the proposed building block, namely the Res2Net module, can be easily plugged into many existing CNN architectures. Extensive experimental results show that the Res2Net module can further improve the performance of state-of-the-art CNNs, e.g., ResNet [17], ResNeXt [43], and DLA [47].

2 Related Work

2.1 Backbone Networks

Recent years have witnessed numerous backbone networks [22, 37, 39, 17, 20, 9, 43, 47] achieving state-of-the-art performance in various vision tasks with stronger multi-scale representations. As designed, CNNs are equipped with a basic multi-scale feature representation ability since the input information follows a coarse-to-fine fashion. The AlexNet [22] stacks filters sequentially and achieves significant performance gains over traditional methods for visual recognition. However, due to the limited network depth and the kernel size of its filters, the AlexNet has only a relatively small receptive field. The VGGNet [37] increases the network depth and uses filters with smaller kernel sizes. A deeper structure expands the receptive field, which is useful for extracting features at a larger scale, and it is easier to enlarge the receptive field by stacking more layers than by using large kernels. As such, the VGGNet provides a stronger multi-scale representation model than AlexNet, with fewer parameters. However, both AlexNet and VGGNet stack filters in a linear topology, so each layer has a relatively fixed receptive field and these networks handle well only a small range of object scales.

The GoogLeNet [39] utilizes parallel filters with different kernel sizes to enhance the multi-scale representation capability. However, due to the constraint of computational resources, the kernels in GoogLeNet cannot easily be further enriched. Thus, the multi-scale representation scheme of the GoogLeNet still cannot cover a large range of receptive fields. The Inception Nets [40, 38] stack more filters in each of the parallel paths of the GoogLeNet to further expand the receptive field. On the other hand, the ResNet [17] introduces short connections to neural networks, thereby alleviating the vanishing-gradient problem while obtaining much deeper network structures. During the feature extraction procedure, short connections allow different combinations of convolutional operators, resulting in a large number of equivalent feature scales. Similarly, densely connected layers in the DenseNet [20] enable the network to process objects in a very wide range of scales. DPN [6] combines the ResNet with the DenseNet to enable both the feature re-usage ability of ResNet and the feature exploration ability of DenseNet. The recently proposed DLA [47] method combines layers in a tree structure. The hierarchical tree structure enables the network to obtain an even stronger layer-wise multi-scale representation capability.

2.2 Multi-scale Representations for Vision Tasks

Multi-scale feature representations of CNNs are of great importance to a number of vision tasks including object detection [33], salient object detection [18], and semantic segmentation [3], boosting the model performance of those fields.

2.2.1 Object detection.

Effective CNN models need to locate objects of different scales in a scene. Earlier works such as the R-CNN [12] mainly rely on the backbone network, i.e., VGGNet [37], to extract features at multiple scales. He et al. propose SPP-Net [16], which utilizes spatial pyramid pooling after the backbone network to enhance the multi-scale ability. The Faster R-CNN method [33] further proposes region proposal networks to generate bounding boxes with various scales. Based on the Faster R-CNN, the FPN [25] approach introduces a feature pyramid to extract features at different scales from a single image. The SSD method [28] utilizes feature maps from different stages to process visual information at different scales.

2.2.2 Semantic segmentation.

Extracting essential contextual information of objects requires CNN models to process features at various scales for effective semantic segmentation. Long et al. [30] propose one of the earliest methods that enables multi-scale representations of the fully convolutional network (FCN) for the semantic segmentation task. In DeepLab, Chen et al. [3, 4] introduce a cascaded atrous convolution module to further expand the receptive field while preserving spatial resolution. More recently, global context information is aggregated from region-based features via the pyramid pooling scheme in the PSPNet [51].

2.2.3 Salient object detection.

Precisely locating the salient object regions in an image requires an understanding of both large-scale context information for the determination of object saliency, and small-scale features to localize object boundaries accurately. Early approaches [2] utilize handcrafted representations of global contrast [7] or multi-scale region features [41]. Li et al. [23] propose one of the earliest methods that enables multi-scale deep features for salient object detection. Later, multi-context deep learning [53] and multi-level convolutional features [49] were proposed to improve salient object detection. More recently, Hou et al. [18] introduce dense short connections among stages to provide rich multi-scale feature maps at each layer for salient object detection.

3 Res2Net

3.1 Res2Net Module

The bottleneck structure shown in Fig. 2(a) is a basic building block in many modern backbone CNN architectures, e.g., ResNet [17], ResNeXt [43], and DLA [47]. Instead of extracting features using a single group of 3×3 filters as in the bottleneck block, we seek alternative architectures with stronger multi-scale feature extraction ability, while maintaining a similar computational load. Specifically, we replace the group of 3×3 filters with smaller groups of filters, while connecting different filter groups in a hierarchical residual-like style. Since our proposed neural network module involves residual-like connections within a single residual block, we name it Res2Net.

Fig. 2 shows the differences between the bottleneck block and the proposed Res2Net module. After the 1×1 convolution, we evenly split the feature maps into s feature map subsets, denoted by x_i, where i ∈ {1, 2, ..., s}. Each feature subset x_i has the same spatial size as, but 1/s the number of channels of, the input feature map. Except for x_1, each x_i has a corresponding 3×3 convolution, denoted by K_i(). We denote by y_i the output of K_i(). The feature subset x_i is added to the output of K_{i−1}(), and then fed into K_i(). To reduce parameters while increasing s, we omit the 3×3 convolution for x_1. Thus, y_i can be written as:

y_i = x_i,                  i = 1;
y_i = K_i(x_i),             i = 2;
y_i = K_i(x_i + y_{i−1}),   2 < i ≤ s.     (1)

Notice that each 3×3 convolutional operator K_i() could potentially receive feature information from all feature splits {x_j, j ≤ i}. Each time a feature split x_j goes through a 3×3 convolutional operator, the output can have a larger receptive field than x_j. Due to the combinatorial explosion effect, the output of the Res2Net module contains a different number and different combinations of receptive field sizes/scales.

In the Res2Net module, splits are processed in a multi-scale fashion, which is conducive to the extraction of both global and local information. To better fuse information at different scales, we concatenate all splits and pass them through a 1×1 convolution. The split and concatenation strategy can enforce convolutions to process features more effectively. To reduce the number of parameters, we omit the 3×3 convolution for the first split, which can also be regarded as a form of feature reuse.

In this work, we use s as the control parameter of the scale dimension. A larger s typically corresponds to stronger multi-scale ability, with negligible computational/memory overhead introduced by concatenation.
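To make the above concrete, the following is a minimal PyTorch sketch of a Res2Net-style bottleneck implementing Equation (1). It is our own illustrative code, not the authors' released implementation: the class and argument names (e.g., Res2NetBottleneck, scale, width) are ours, and the stride/downsampling handling of the full models is omitted.

```python
import torch
import torch.nn as nn

class Res2NetBottleneck(nn.Module):
    """Illustrative Res2Net-style bottleneck: 1x1 conv -> split into `scale`
    groups of `width` channels -> hierarchical 3x3 convs (Equation (1)) ->
    concatenate -> 1x1 conv -> residual addition."""

    def __init__(self, in_planes, planes, scale=4, width=26):
        super().__init__()
        hidden = width * scale                       # n = s * w channels after the first 1x1 conv
        self.scale, self.width = scale, width
        self.conv1 = nn.Conv2d(in_planes, hidden, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(hidden)
        # One 3x3 conv per split, except the first split (feature reuse).
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
            for _ in range(scale - 1))
        self.bns = nn.ModuleList(nn.BatchNorm2d(width) for _ in range(scale - 1))
        self.conv3 = nn.Conv2d(hidden, planes, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = (nn.Identity() if in_planes == planes else
                         nn.Sequential(nn.Conv2d(in_planes, planes, 1, bias=False),
                                       nn.BatchNorm2d(planes)))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        xs = torch.split(out, self.width, dim=1)     # x_1, ..., x_s
        ys = [xs[0]]                                 # y_1 = x_1: no 3x3 conv for the first split
        for i in range(1, self.scale):
            inp = xs[i] if i == 1 else xs[i] + ys[-1]    # x_i + y_{i-1} for i > 2
            ys.append(self.relu(self.bns[i - 1](self.convs[i - 1](inp))))
        out = self.bn3(self.conv3(torch.cat(ys, dim=1)))
        return self.relu(out + self.shortcut(x))
```

For example, Res2NetBottleneck(256, 256, scale=4, width=26) corresponds to a 26w×4s setting in the notation of Sec. 4.2.3: along different paths, an input feature may pass through anywhere between zero and s − 1 = 3 of the 3×3 convolutions, which is what produces the range of equivalent receptive field sizes.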

3.2 Integration with Modern Modules


Fig. 3: The Res2Net module can be integrated with the cardinality dimension [43] (replacing the 3×3 convolutions with 3×3 group convolutions) and the SE block [19].

Numerous neural network modules have been proposed in recent years, including the cardinality dimension introduced by Xie et al. [43], as well as the squeeze-and-excitation (SE) block presented by Hu et al. [19]. The proposed Res2Net module introduces the scale dimension that is orthogonal to these improvements. As shown in Fig. 3, we can easily integrate the cardinality dimension [43] and the SE block [19] with the proposed Res2Net module.

3.2.1 Dimension cardinality.

The dimension cardinality indicates the number of groups within a filter [43]. This dimension changes filters from single-branch to multi-branch and improves the representation ability of a CNN model. In our design, we can replace the 3×3 convolution with a 3×3 group convolution, where c indicates the number of groups. Experimental comparisons between the scale dimension and cardinality are presented in Sec. 4.2 and Sec. 4.4.
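In terms of the bottleneck sketch above, folding in cardinality only changes how each per-split 3×3 convolution is constructed. The helper below is a hypothetical illustration (the function name and arguments are ours), assuming width is divisible by the cardinality c.

```python
import torch.nn as nn

def split_conv3x3(width: int, cardinality: int = 1) -> nn.Conv2d:
    # Cardinality c turns each single-branch 3x3 conv into a grouped
    # (multi-branch) conv; `width` must be divisible by `cardinality`.
    return nn.Conv2d(width, width, kernel_size=3, padding=1,
                     groups=cardinality, bias=False)
```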

3.2.2 SE block.

An SE block adaptively re-calibrates channel-wise feature responses by explicitly modeling interdependencies between channels [19]. Similar to [19], we add the SE block right before the residual connections of the Res2Net module. Our Res2Net module can benefit from the integration of the SE block, as we experimentally demonstrate in Sec. 4.2 and Sec. 4.3.
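For completeness, a minimal squeeze-and-excitation block in the spirit of [19] is sketched below; it is our own simplified version (names and the reduction ratio of 16 are assumptions). In the bottleneck sketch above, it would be applied to the output of the last 1×1 convolution, immediately before the residual addition.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel-wise re-calibration (squeeze-and-excitation), simplified sketch."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.fc = nn.Sequential(                       # excitation: two FC layers
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # re-weight channels
```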

3.3 Integrated Models

Since the proposed Res2Net module does not have specific requirements for the overall network structure, and the multi-scale representation ability of the Res2Net module is orthogonal to the layer-wise feature aggregation models of CNNs, we can easily integrate the proposed Res2Net module into state-of-the-art models, such as ResNet [17], ResNeXt [43], and DLA [47]. The corresponding models are referred to as Res2Net, Res2NeXt, and Res2Net-DLA, respectively.

The proposed scale dimension is orthogonal to the cardinality [43] dimension and width [17] dimension of prior work. Thus, after the scale is set, we adjust the value of cardinality and width to maintain the overall model complexity similar to its counterparts. We do not focus on reducing the model size in this work since it requires more meticulous designs such as depth-wise separable convolution [32], model pruning [13], and model compression [8].

For experiments on the ImageNet [34] dataset, due to our limited computational resources, we mainly use the ResNet-50 [17], ResNeXt-50 [43], and DLA-60 [47] as our baseline models. The complexity of the proposed models is approximately equal to that of the baseline models, whose number of parameters is around 25M and whose number of FLOPs for an image of 224×224 pixels is approximately 4.2G for 50-layer networks. For experiments on the CIFAR [21] dataset, we use the ResNeXt-29, 8c×64w [43] as our baseline model. Empirical evaluations and discussions of the proposed models with respect to model complexity are presented in Sec. 4.4.
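A quick way to check the "similar complexity" claim when adjusting width and cardinality for a given scale is to compare trainable parameter counts directly; the snippet below is a generic sketch (the model names in the comment are placeholders).

```python
import torch.nn as nn

def count_parameters_m(model: nn.Module) -> float:
    """Number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# e.g., after choosing the scale s, reduce the width w until
# count_parameters_m(res2net50) is close to count_parameters_m(resnet50).
```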

4 Experiments

4.1 Implementation Details

We implement the proposed models using the PyTorch framework. For fair comparisons, we use the PyTorch implementations of ResNet [17], ResNeXt [43], and DLA [47], and only replace the original bottleneck block with the proposed Res2Net module. Similar to prior work, on the ImageNet dataset [34], each image is a 224×224-pixel crop randomly taken from a resized image. We use the same data augmentation strategy as [17]. Similar to [17], we train the networks using SGD with weight decay 0.0001, momentum 0.9, and a mini-batch size of 256 on 4 Titan Xp GPUs. The learning rate is initially set to 0.1 and divided by 10 after every 30 epochs.

All models for ImageNet, including the baselines and the proposed models, are trained for 100 epochs with the same training and data augmentation strategy. For testing, we use the same image cropping method as [17]. On the CIFAR dataset, we use the implementation of ResNeXt-29 [43] with no other modification. For other tasks, we use the original implementations of the baselines and only replace the backbone model with the proposed Res2Net.
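As a reference point, the optimization schedule described above maps onto standard PyTorch components roughly as follows; this is our own sketch (the data pipeline, loss, and train_one_epoch helper are placeholders), assuming the learning rate is decayed every 30 epochs.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

def build_optimizer(model: nn.Module):
    # Hyper-parameters as stated in Sec. 4.1.
    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    scheduler = StepLR(optimizer, step_size=30, gamma=0.1)   # lr / 10 every 30 epochs
    return optimizer, scheduler

# Training-loop skeleton (data loading and loss computation omitted):
# optimizer, scheduler = build_optimizer(model)
# for epoch in range(100):
#     train_one_epoch(model, optimizer)   # placeholder
#     scheduler.step()
```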

top-1 err. (%) top-5 err. (%)
ResNet-50 [17] 23.85 7.13
Res2Net-50 22.01 6.15
InceptionV3 [40] 22.55 6.44
Res2Net-50-299 21.41 5.88
ResNeXt-50 [43] 22.61 6.50
Res2NeXt-50 21.76 6.09
DLA-60 [47] 23.32 6.60
Res2Net-DLA-60 21.53 5.80
DLA-X-60 [47] 22.19 6.13
Res2NeXt-DLA-60 21.55 5.86
SENet-50 [19] 23.24 6.69
SE-Res2Net-50 21.56 5.94
TABLE I: Top-1 and Top-5 test error on the ImageNet dataset.

4.2 ImageNet

We conduct experiments on the ImageNet dataset [34], which contains 1.28 million training images and 50k validation images for 1000 classes. Due to the limited computational resources, we construct the models with approximately 50 layers for performance evaluation against the state-of-the-art methods. More ablation studies are conducted on the CIFAR dataset.

4.2.1 Performance gain.

Table I shows the top-1 and top-5 test errors on the ImageNet dataset. For simplicity, all Res2Net models in Table I have the scale s = 4. The Res2Net-50 has an improvement of 1.84% in top-1 error over the ResNet-50. The Res2NeXt-50 achieves a 0.85% improvement in terms of top-1 error over the ResNeXt-50. Also, the Res2Net-DLA-60 outperforms the DLA-60 by 1.27% in terms of top-1 error. The Res2NeXt-DLA-60 outperforms the DLA-X-60 by 0.64% in terms of top-1 error. The SE-Res2Net-50 has an improvement of 1.68% over the SENet-50. Note that the ResNet [17], ResNeXt [43], SE-Net [19], and DLA [47] are state-of-the-art CNN models. Compared with these strong baselines, models integrated with the Res2Net module still show consistent performance gains.

We also compare our method against the InceptionV3 [40] model, which utilizes parallel filters with different kernel combinations. For fair comparison, we use the ResNet-50 [17] as the baseline model and train our model with an input image size of 299×299 pixels, as used by the InceptionV3 model. The proposed Res2Net-50-299 outperforms InceptionV3 by 1.14% in top-1 error. We conclude that the hierarchical residual-like connection of the Res2Net module is more effective than the parallel filters of InceptionV3 when processing multi-scale information. While the combination pattern of filters in InceptionV3 is carefully designed, the Res2Net module presents a simple but effective combination pattern.

4.2.2 Going deeper with Res2Net.

Deeper networks have been shown to have a stronger representation capability [17, 43] for vision tasks. To validate our model at greater depth, we compare the classification performance of the Res2Net and the ResNet, both with 101 layers. As shown in Table II, the Res2Net-101 achieves a significant performance gain over the ResNet-101 of 1.82% in terms of top-1 error. Note that the Res2Net-50 has a performance gain of 1.84% in terms of top-1 error over the ResNet-50. These results show that the proposed module with the additional scale dimension can be integrated with deeper models to achieve better performance. We also compare our method with the DenseNet [20]. Compared with the DenseNet-161, the best performing model of the officially provided DenseNet family, the Res2Net-101 has an improvement of 1.54% in terms of top-1 error, even though the DenseNet-161 requires nearly 100M more memory than the Res2Net-101 does for an image of 224×224 pixels.

top-1 err. (%) top-5 err. (%) Memory
DenseNet-161 [20] 22.35 6.20 268M
ResNet-101 [17] 22.63 6.44 162M
Res2Net-101 20.81 5.57 179M
TABLE II: Top-1 and Top-5 test error (%) of deeper networks on the ImageNet dataset.

4.2.3 Effectiveness of scale dimension.

Setting FLOPs Runtime top-1 err. top-5 err.
ResNet-50: 64w 4.2G 149ms 23.85 7.13
Res2Net-50 (preserved complexity): 48w×2s 4.2G 148ms 23.68 6.47
Res2Net-50 (preserved complexity): 26w×4s 4.2G 153ms 22.01 6.44
Res2Net-50 (preserved complexity): 14w×8s 4.2G 172ms 21.86 6.14
Res2Net-50 (increased complexity): 26w×4s 4.2G - 22.01 6.44
Res2Net-50 (increased complexity): 26w×6s 6.3G - 21.42 5.87
Res2Net-50 (increased complexity): 26w×8s 8.3G - 20.80 5.63
Res2Net-50-L: 18w×4s 2.9G 106ms 22.92 6.67
TABLE III: Top-1 and Top-5 test error (%) of Res2Net-50 with different scales on the ImageNet dataset. Parameter w is the width of the 3×3 filters, and s is the number of scales, as described in Equation (1).

To validate the proposed scale dimension, we experimentally analyze the effect of different scales. As shown in Table III, performance increases with the scale. With increasing scale, the Res2Net-50 with 14w×8s achieves a performance gain over the ResNet-50 of 1.99% in terms of top-1 error. Note that with preserved complexity, the width w of the filters decreases as the scale increases. We further evaluate the performance gain of increasing the scale with increased model complexity. The Res2Net-50 with 26w×8s achieves a significant performance gain over the ResNet-50 of 3.05% in terms of top-1 error. The lightweight Res2Net-50-L with 18w×4s also outperforms the ResNet-50 by 0.93% in terms of top-1 error with only 69% of the FLOPs.

4.3 CIFAR

We also conduct experiments on the CIFAR-100 dataset [21], which contains 50k training images and 10k testing images for 100 classes. The ResNeXt-29, 8c×64w [43] is used as the baseline model. We only replace the original block with our proposed Res2Net module while keeping other configurations unchanged. Table IV shows the top-1 test error and model size on the CIFAR-100 dataset. Experimental results show that our method surpasses the baseline and other methods with fewer parameters. Our proposed Res2NeXt-29, 6c×24w×6scale outperforms the baseline by 1.11%. Res2NeXt-29, 6c×24w×4scale even outperforms the ResNeXt-29, 16c×64w with only 35% of the parameters. We also achieve better performance with fewer parameters compared with the DenseNet-BC (k = 40). Note that the DenseNet is memory-consuming compared to our method. Compared with Res2NeXt-29, 6c×24w×4scale, the Res2NeXt-29, 8c×25w×4scale achieves a better result with more width and cardinality, indicating that the dimension scale is orthogonal to the dimensions width and cardinality. We also integrate the recently proposed SE block into our structure. With fewer parameters, our method still outperforms the ResNeXt-29, 8c×64w-SE baseline.

Params top-1 err.
Wide ResNet [48] 36.5M 20.50
ResNeXt-29, 8c×64w [43] (base) 34.4M 17.90
ResNeXt-29, 16c×64w [43] 68.1M 17.31
DenseNet-BC (k = 40) [20] 25.6M 17.18
Res2NeXt-29, 6c×24w×4scale 24.3M 16.98
Res2NeXt-29, 8c×25w×4scale 33.8M 16.93
Res2NeXt-29, 6c×24w×6scale 36.7M 16.79
ResNeXt-29, 8c×64w-SE [19] 35.1M 16.77
Res2NeXt-29, 6c×24w×4scale-SE 26.0M 16.68
Res2NeXt-29, 8c×25w×4scale-SE 34.0M 16.64
Res2NeXt-29, 6c×24w×6scale-SE 36.9M 16.56
TABLE IV: Top-1 test error (%) and model size on the CIFAR-100 dataset. Parameter c indicates the cardinality, and w is the width of the filters.

4.4 Scale Variation

Similar to Xie et al. [43], we evaluate the test performance of the baseline model when increasing different CNN dimensions, including scale (Equation (1)), cardinality [43], and depth [37]. While increasing the model capacity along one dimension, we fix all other dimensions. A series of networks is trained and evaluated under these changes. Since [43] has already shown that increasing cardinality is more effective than increasing width, we only compare the proposed scale dimension with cardinality and depth.

Fig. 4 shows the test precision on the CIFAR-100 dataset with regard to the model size. The depth, cardinality, and scale of the baseline model are 29, 6, and 1, respectively. Experimental results suggest that scale is an effective dimension for improving model performance, which is consistent with what we observed on the ImageNet dataset in Sec. 4.2. Moreover, increasing the scale is more effective than increasing the other dimensions, resulting in quicker performance gains. As described in Equation (1) and Fig. 2, for the case of scale s = 2, we only increase the model capacity by adding more parameters of 1×1 filters. Thus, the model performance for s = 2 is slightly worse than that of increasing the cardinality. For s ≥ 3, the combination effects of our hierarchical residual-like structure produce a rich set of equivalent scales, resulting in significant performance gains. However, the models with scales 5 and 6 show only limited further gains, presumably because the 32×32 images in the CIFAR dataset are too small to contain many scales.


Fig. 4: Test precision on the CIFAR-100 dataset with regard to the model size, by changing cardinality (ResNeXt-29), depth (ResNeXt), and scale (Res2Net-29).

(Figure panels: CAM results of ResNet-50 vs. Res2Net-50 for the classes Baseball, Penguin, Ice cream, Bulbul, Mountain dog, Ballpoint, and Mosque.)
Fig. 5: Visualization of class activation mapping [35], using ResNet-50 and Res2Net-50 as backbone networks.

4.5 Class Activation Mapping

To understand the multi-scale ability of the Res2Net, we visualize class activation maps (CAM) using Grad-CAM [35], which is commonly used to localize the discriminative regions for image classification. In the visualization examples shown in Fig. 5, stronger CAM areas are covered with lighter colors. Compared with the ResNet, the Res2Net-based CAM results have more concentrated activation maps on small objects such as ‘baseball’ and ‘penguin’. Both methods have similar activation maps on medium-sized objects such as ‘ice cream’. Due to its stronger multi-scale ability, the Res2Net has activation maps that tend to cover the whole object for large objects such as ‘bulbul’, ‘mountain dog’, ‘ballpoint’, and ‘mosque’, while the activation maps of the ResNet only cover parts of the objects. Such an ability to precisely localize CAM regions makes the Res2Net potentially valuable for object region mining in weakly supervised semantic segmentation tasks [42].
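Grad-CAM itself can be reproduced with a few lines of hooks; the following is a minimal PyTorch sketch (the target_layer argument and the res2net50.layer4 attribute in the usage comment are placeholders for whatever final convolutional stage the backbone exposes), not the exact visualization pipeline used for Fig. 5.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Minimal Grad-CAM [35] sketch: weight the target layer's activations by
    the spatially averaged gradients of the chosen class score."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

    logits = model(image)                       # image: (1, 3, H, W) tensor
    if class_idx is None:
        class_idx = int(logits.argmax(dim=1))
    model.zero_grad()
    logits[0, class_idx].backward()

    weights = grads['v'].mean(dim=(2, 3), keepdim=True)        # pooled gradients
    cam = F.relu((weights * acts['v']).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode='bilinear', align_corners=False)
    h1.remove(); h2.remove()
    return cam / (cam.max() + 1e-8)             # normalized heatmap in [0, 1]

# Usage sketch: heatmap = grad_cam(res2net50, img, res2net50.layer4)
```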

Dataset Backbone AP (%) AP@IoU=0.5 (%)
VOC07 ResNet-50 72.1 -
VOC07 Res2Net-50 74.4 -
COCO ResNet-50 31.1 51.4
COCO Res2Net-50 33.7 53.6
TABLE V: Object detection results on the PASCAL VOC07 and COCO datasets, measured using AP (%) and AP@IoU=0.5 (%). The Res2Net has similar complexity to its counterparts.
Object size Small Medium Large All
ResNet-50 AP (%) 13.5 35.4 46.2 31.1
Res2Net-50 AP (%) 14.0 38.3 51.1 33.7
Improvement AP (%) +0.5 +2.9 +4.9 +2.6
ResNet-50 AR (%) 21.8 48.6 61.6 42.8
Res2Net-50 AR (%) 23.2 51.1 65.3 45.0
Improvement AR (%) +1.4 +2.5 +3.7 +2.2
TABLE VI: Average Precision (AP) and Average Recall (AR) of object detection for objects of different sizes on the COCO dataset.

4.6 Object Detection

For the object detection task, we validate the Res2Net on the PASCAL VOC07 [11] and MS COCO [26] datasets, using Faster R-CNN [33] as the baseline method. We compare ResNet-50 and Res2Net-50 as backbone networks, and follow all other implementation details of [33] for fair comparison. Table V shows the object detection results. On the PASCAL VOC07 dataset, the Res2Net-50 based model outperforms its counterpart by 2.3% in average precision (AP). On the COCO dataset, the Res2Net-50 based model outperforms its counterpart by 2.6% in AP and 2.2% in AP@IoU=0.5.

We further report the AP and average recall (AR) scores for objects of different sizes, as shown in Table VI. Objects are divided into three categories based on their size, following [26]. The Res2Net based model improves over its counterpart by a large margin of 0.5%, 2.9%, and 4.9% in AP for small, medium, and large objects, respectively. The improvements in AR for small, medium, and large objects are 1.4%, 2.5%, and 3.7%, respectively. Due to its strong multi-scale ability, the Res2Net based models can cover a large range of receptive fields, boosting the performance on objects of different sizes.
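The experiments above use the original Faster R-CNN implementation of [33]; purely as an illustration of swapping the backbone, a present-day equivalent with torchvision's detection API could look like the sketch below (the res2net50 variable, the 2048 output channels, and the class count are assumptions; the anchor and RoI-pooling settings follow torchvision's documented single-feature-map recipe rather than [33]).

```python
import torch.nn as nn
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.anchor_utils import AnchorGenerator

def build_detector(body: nn.Module, num_classes: int) -> FasterRCNN:
    """Wrap a single-feature-map classification backbone into Faster R-CNN."""
    anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                       aspect_ratios=((0.5, 1.0, 2.0),))
    roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=["0"],
                                                    output_size=7, sampling_ratio=2)
    return FasterRCNN(body, num_classes=num_classes,
                      rpn_anchor_generator=anchor_generator, box_roi_pool=roi_pooler)

# Usage sketch with a hypothetical res2net50 classification model:
# body = nn.Sequential(*list(res2net50.children())[:-2])   # drop avgpool and fc
# body.out_channels = 2048                                  # FasterRCNN reads this attribute
# detector = build_detector(body, num_classes=21)           # VOC07: 20 classes + background
```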

4.7 Semantic Segmentation

Backbone Mean IoU (%), 50-layer Mean IoU (%), 101-layer
ResNet 77.0 78.5
Res2Net 77.9 79.3
TABLE VII: Performance of semantic segmentation on the PASCAL VOC12 val set, measured in mean IoU (%). The Res2Net has similar complexity to its counterparts.

(Figure panels: GT, ResNet, and Res2Net segmentation results.)

Fig. 6: Visualization of semantic segmentation results [5], using ResNet-101 and Res2Net-101 as backbone networks.

Semantic segmentation requires a strong multi-scale ability of CNNs to extract essential contextual information of objects. We thus evaluate the multi-scale ability of the Res2Net on the semantic segmentation task using the PASCAL VOC12 dataset [10]. We follow previous work in using the augmented PASCAL VOC12 dataset [14], which contains 10582 training images and 1449 val images. We use DeepLab v3+ [5] as our segmentation method. All implementation details remain the same as in DeepLab v3+ [5], except that the backbone network is replaced with the ResNet and our proposed Res2Net. The output stride used in both training and evaluation is 16. As shown in Table VII, the Res2Net-50 based method outperforms its counterpart by 0.9% in mean IoU, and the Res2Net-101 based method outperforms its counterpart by 0.8% in mean IoU. Visual comparisons of semantic segmentation results on challenging examples are illustrated in Fig. 6. The Res2Net based method tends to segment all parts of objects regardless of object size.

4.8 Instance Segmentation

Instance segmentation is the combination of object detection and semantic segmentation. It requires not only the correct detection of objects of various sizes in an image but also the precise segmentation of each object. As mentioned in Sec. 4.6 and Sec. 4.7, both object detection and semantic segmentation require a strong multi-scale ability of CNNs. Thus, multi-scale representations are quite beneficial to instance segmentation. We use Mask R-CNN [15] as the instance segmentation method, and replace the ResNet-50 backbone with our proposed Res2Net-50. The performance of instance segmentation on the MS COCO [26] dataset is shown in Table VIII. The Res2Net based method outperforms its counterpart by 1.7% in AP and 2.4% in AP@IoU=0.5. Performance gains on objects of different sizes are also observed: the improvements in AP for small, medium, and large objects are 0.9%, 1.9%, and 2.8%, respectively, and the Res2Net based method also has 1.7%, 1.0%, and 1.7% gains in terms of AR on small, medium, and large objects, respectively.

Object size Small Medium Large All AP@IoU=0.5
ResNet-50 AP (%) 14.8 36.0 50.9 33.9 55.2
Res2Net-50 AP (%) 15.7 37.9 53.7 35.6 57.6
Improvement AP (%) +0.9 +1.9 +2.8 +1.7 +2.4
ResNet-50 AR (%) 25.0 49.3 62.0 45.9 -
Res2Net-50 AR (%) 26.7 50.3 63.7 47.2 -
Improvement AR (%) +1.7 +1.0 +1.7 +1.3 -
TABLE VIII: Average Precision (AP) and Average Recall (AR) of instance segmentation for objects of different sizes on the COCO dataset.

4.9 Salient Object Detection

Dataset Backbone F-measure MAE
ECSSD ResNet-50 0.910 0.065
Res2Net-50 0.926 0.056
PASCAL-S ResNet-50 0.823 0.105
Res2Net-50 0.841 0.099
HKU-IS ResNet-50 0.894 0.058
Res2Net-50 0.905 0.050
DUT-OMRON ResNet-50 0.748 0.092
Res2Net-50 0.800 0.071
TABLE IX: Salient object detection results on different datasets, measured using F-measure and Mean Absolute Error (MAE). The Res2Net has similar complexity compared with its counterparts.

Pixel-level tasks such as salient object detection also require the strong multi-scale ability of CNNs to locate both the holistic objects and their region details. Here we use the latest method DSS [18] as our baseline. For fair comparison, we only replace the backbone with ResNet-50 and our proposed Res2Net-50, while keeping other configurations unchanged. Following [18], we train these two models using the MSRA-B dataset [27], and evaluate the results on the ECSSD [45], PASCAL-S [24], HKU-IS [23], and DUT-OMRON [46] datasets. The F-measure and Mean Absolute Error (MAE) are used for evaluation. As shown in Table IX, the Res2Net based model shows a consistent improvement over its counterpart on all datasets. On the DUT-OMRON dataset (containing 5168 images), the Res2Net based model has a 5.2% improvement on F-measure and a 2.1% improvement on MAE, compared with the ResNet based model. The Res2Net based approach achieves its greatest performance gain on the DUT-OMRON dataset, since this dataset contains the most significant object size variation among the four datasets. Some visual comparisons of salient object detection results on challenging examples are illustrated in Fig. 7.
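The evaluation metrics used above are straightforward to compute from predicted saliency maps; the snippet below is a generic sketch of a common protocol (per-image maximum F-measure with beta^2 = 0.3, and MAE), which may differ in details from the exact evaluation scripts of [18].

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a saliency map and the ground truth, both in [0, 1]."""
    return float(np.abs(pred - gt).mean())

def max_f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """Maximum F-measure over 256 thresholds; beta^2 = 0.3 weights precision higher,
    as is conventional in the salient object detection literature."""
    gt_bin = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, 256):
        pred_bin = pred >= t
        tp = np.logical_and(pred_bin, gt_bin).sum()
        precision = tp / (pred_bin.sum() + 1e-8)
        recall = tp / (gt_bin.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, float(f))
    return best
```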

Images GT ResNet-50 Res2Net-50
Fig. 7: Examples of salient object detection [18] results, using ResNet-50 and Res2Net-50 as backbone networks, respectively.

5 Conclusion and Future Work

We present a simple yet efficient block, namely Res2Net, to further explore the multi-scale ability of CNNs at a more granular level. The Res2Net exposes a new dimension, namely "scale", which is an essential and more effective factor in addition to the existing dimensions of depth, width, and cardinality. Our Res2Net module can be integrated with existing state-of-the-art methods with negligible effort. Image classification results on the CIFAR-100 and ImageNet benchmarks suggest that our new backbone network consistently performs favorably against its state-of-the-art competitors, including ResNet, ResNeXt, and DLA.

Although the superiority of the proposed backbone model has been demonstrated in the context of several representative computer vision tasks, including class activation mapping, object detection, and salient object detection, we believe multi-scale representations are essential for a much wider range of application areas. To encourage future work to leverage the strong multi-scale ability of the Res2Net, the source code will be made publicly available upon acceptance.

References

  • [1] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509–522, 2002.
  • [2] A. Borji, M.-M. Cheng, H. Jiang, and J. Li. Salient object detection: A benchmark. IEEE Transactions on Image Processing, 24(12):5706–5722, 2015.
  • [3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.
  • [4] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [5] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In The European Conference on Computer Vision (ECCV), September 2018.
  • [6] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. In Advances in Neural Information Processing Systems (NIPS), pages 4467–4475, 2017.
  • [7] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu. Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):569–582, 2015.
  • [8] Y. Cheng, D. Wang, P. Zhou, and T. Zhang. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017.
  • [9] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [10] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
  • [11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  • [12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
  • [13] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS), pages 1135–1143, 2015.
  • [14] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In IEEE International Conference on Computer Vision (ICCV). IEEE, 2011.
  • [15] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pages 2961–2969, 2017.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [18] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr. Deeply supervised salient object detection with short connections. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [19] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [20] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
  • [23] G. Li and Y. Yu. Visual saliency based on multiscale deep features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5455–5463, 2015.
  • [24] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille. The secrets of salient object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 280–287, 2014.
  • [25] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2017.
  • [26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014.
  • [27] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum. Learning to detect a salient object. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2):353–367, 2011.
  • [28] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision (ECCV), pages 21–37. Springer, 2016.
  • [29] Y. Liu, M.-M. Cheng, X. Hu, K. Wang, and X. Bai. Richer convolutional features for edge detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5872–5881. IEEE, 2017.
  • [30] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
  • [31] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  • [32] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In European Conference on Computer Vision (ECCV), September 2018.
  • [33] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015.
  • [34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [35] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017.
  • [36] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS), pages 568–576, 2014.
  • [37] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [38] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In The National Conference on Artificial Intelligence (AAAI), volume 4, page 12, 2017.
  • [39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
  • [40] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.
  • [41] J. Wang, H. Jiang, Z. Yuan, M.-M. Cheng, X. Hu, and N. Zheng. Salient object detection: A discriminative regional feature integration approach. International Journal of Computer Vision, 123(2):251–268, 2017.
  • [42] Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [43] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995. IEEE, 2017.
  • [44] S. Xie and Z. Tu. Holistically-nested edge detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1395–1403, 2015.
  • [45] Q. Yan, L. Xu, J. Shi, and J. Jia. Hierarchical saliency detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1155–1162, 2013.
  • [46] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang. Saliency detection via graph-based manifold ranking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3166–3173, 2013.
  • [47] F. Yu, D. Wang, E. Shelhamer, and T. Darrell. Deep layer aggregation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2403–2412, 2018.
  • [48] S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference (BMVC), 2016.
  • [49] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan. Amulet: Aggregating multi-level convolutional features for salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 202–211, 2017.
  • [50] T. Zhang, C. Xu, and M.-H. Yang. Multi-task correlation particle filter for robust object tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [51] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [52] K. Zhao, W. Shen, S. Gao, D. Li, and M.-M. Cheng. Hi-fi: Hierarchical feature integration for skeleton detection. In International Joint Conference on Artificial Intelligence (IJCAI), 2018.
  • [53] R. Zhao, W. Ouyang, H. Li, and X. Wang. Saliency detection by multi-context deep learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1265–1274, 2015.