See More Than Once -- Kernel-Sharing Atrous Convolution for Semantic Segmentation

08/26/2019 ∙ by Ye Huang, et al. ∙ University of Technology Sydney

The state-of-the-art semantic segmentation solutions usually leverage different receptive fields via multiple parallel branches to handle objects of different sizes. However, employing separate kernels for individual branches degrades the generalization and representation abilities of the network, and the number of parameters grows linearly with the number of branches. To tackle this problem, we propose a novel network structure, namely Kernel-Sharing Atrous Convolution (KSAC), where branches with different receptive fields share the same kernel, i.e., let a single kernel `see' the input feature maps more than once with different receptive fields, to facilitate communication among branches and perform `feature augmentation' inside the network. Experiments conducted on the benchmark VOC 2012 dataset show that the proposed sharing strategy not only boosts the network's generalization and representation abilities but also reduces the model complexity significantly. Specifically, when compared with DeepLabV3+ equipped with the MobileNetv2 backbone, 33% of the parameters are saved together with a 0.6% improvement in mIOU, and when Xception is used as the backbone, the mIOU is elevated from 83.34% to 85.96% with about 10M parameters saved. In addition, different from the widely used ASPP structure, our proposed KSAC is able to further improve the mIOU by taking benefit of wider context with larger atrous rates.


Introduction

Recent advances in computer vision have been largely fueled by deep learning techniques. As a classical computer vision application, semantic segmentation assigns pixels belonging to the same object class the same label. During segmentation, deep networks are required to handle both local detailed and global semantic information, so as to handle objects of arbitrary sizes. To achieve such robustness, numerous efforts have been made by the research community. For example, the Fully Convolutional Network (FCN) [13] and U-Net [14] combined low-resolution feature maps with high-resolution ones via concatenation or element-wise addition to extract both detailed and contextual features, while the PSPNet [24] utilized multiple pooling layers in parallel to extract richer information. Particularly, in the well-known DeepLab family [2, 3, 5, 12, 4], a more powerful and successful Atrous Spatial Pyramid Pooling (ASPP) structure was proposed to exploit different receptive fields via multiple parallel convolutional branches with different atrous rates, extracting features for both small and large objects. The ASPP structure improved the networks' generalizability significantly. Thanks to the superiority of this parallel concatenation strategy, ASPP has been widely used and further improved by many other works, such as CE-Net [7], DenseASPP net [21] and Pixel-Anchor net [10].

Figure 2: Illustration of our proposed KSAC structure. The single kernel is shared by three parallel branches with different atrous rates.

However, though ASPP and other similar parallel strategies have improved, to some extent, the robustness of their models to objects' scale variability, they still suffer from limitations. First, the lack of communication among branches compromises the generalizability of individual kernels, as shown in Fig. 1. Specifically, kernels in convolutional branches with small atrous rates or high-resolution feature maps are able to learn detailed information and handle small semantic classes well. However, for large semantic classes, these kernels are incapable of learning features that cover a broader range of context. In contrast, kernels in branches with large atrous rates or low-resolution feature maps are able to extract features with large receptive fields, but may miss much detailed information. Therefore, the generalizability of the kernels is limited. Moreover, the number of samples that effectively contribute to the training of each branch is reduced, since small (or large) objects are only effective for training branches with small (or large) atrous rates, so the representation ability of individual kernels is also affected. Secondly, it is obvious that by using parallel branches with separate kernels, the number of parameters grows linearly with the number of parallel branches.

To tackle the above-mentioned problems, in this work we propose a novel network structure, namely Kernel-Sharing Atrous Convolution (KSAC), as shown in Fig. 2, where multiple branches with different atrous rates share a single kernel effectively. With this sharing strategy, the shared kernel is able to scan the input feature maps more than once with both small and large receptive fields, and thus to see both the local detailed and the global contextual information extracted for objects of small or large sizes. In other words, the information learned with different atrous rates is also shared. Moreover, since objects of various sizes all contribute to the training of the shared kernel, the number of effective training samples increases, resulting in an improved representation ability of the shared kernel. At the same time, the number of parameters is significantly reduced by the sharing mechanism, and the implementation of the proposed KSAC is quite simple. According to our experimental results on the benchmark VOC 2012 dataset, when the MobileNetv2 and Xception backbones are used, the model sizes are reduced by 33% (4.5M vs 3.0M) and 17% (54.3M vs 44.8M), respectively; meanwhile, the mIOUs are improved by 0.6% (75.70% vs 76.30%) and 2.62% (83.34% vs 85.96%), respectively. Moreover, by exploring a wider range of context, the mIOU is further improved to 86.16%, which is 3.13% higher than that of DeepLab V3+. Finally, our model is implemented with the latest deep learning framework, i.e., TensorFlow 2.0, and the full code is publicly available on GitHub.

Related Works

Fully Convolutional Network

The Fully Convolutional Network (FCN) proposed in [13] was a watershed in the development of semantic segmentation techniques. It was the first publication to successfully apply deep neural networks to spatially dense prediction tasks. As is well known, the fully-connected layers in deep networks require fixed-size inputs, which conflicts with the arbitrary-size inputs of semantic segmentation tasks. FCN solves this problem by transforming the fully connected layers into convolutional layers, allowing networks to produce arbitrarily sized heatmaps. In addition, FCN uses skip connections to fuse global semantic information with local appearance information so that more accurate predictions can be produced. According to the results reported on the benchmark VOC 2012 dataset, FCN made a major breakthrough for the problem of semantic segmentation and outperformed the state of the art dramatically.
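To make this transformation concrete, the following minimal TensorFlow 2 sketch (our illustration, loosely modeled on the VGG-style head converted in [13]; the layer sizes are assumptions) shows a fully convolutional classification head that accepts inputs of any spatial size:

import tensorflow as tf

# A Dense(4096) head applied to 7x7x512 feature maps requires a fixed
# input size. The equivalent convolutional head below uses a 7x7
# convolution with 4096 filters instead, so it slides over feature maps
# of any size and produces a spatial heatmap of class scores.
conv_head = tf.keras.Sequential([
    tf.keras.layers.Conv2D(4096, kernel_size=7, activation='relu'),  # replaces the first Dense layer
    tf.keras.layers.Conv2D(4096, kernel_size=1, activation='relu'),  # replaces the second Dense layer
    tf.keras.layers.Conv2D(21, kernel_size=1),  # per-location scores for the 21 VOC classes
])

# Works on arbitrary spatial sizes: the output is a coarse heatmap.
x = tf.random.normal([1, 32, 40, 512])
print(conv_head(x).shape)  # (1, 26, 34, 21)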

Thus, since the introduction of FCN, virtually all deep networks designed for semantic segmentation have followed the fully convolutional fashion. An example is the most widely used medical image segmentation network, U-Net [14], where concatenation is used to combine low-level features with high-level features in the skip operation, instead of the element-wise addition used in FCN.

DeepLab Family

Models from the DeepLab family [2, 3, 5, 12, 4] have consistently topped the semantic segmentation entries on the VOC 2012 segmentation test leaderboard (http://host.robots.ox.ac.uk:8080/leaderboard). Their success benefits from the contributions made by their advanced network architectures as well as their huge training datasets.

In the first version of DeepLab (DeepLab V1) [3], atrous convolution (aka ‘dilated convolution’) was proposed to expand the network’s receptive fields without shrinking the feature maps’ resolutions, and this was achieved by inserting zeros into the kernels. Additionally, they also employed fully connected CRFs to obtain more accurate boundary predictions. ASPP was the key technique designed in the second version of DeepLab (DeepLab V2) [2], which exploited multiple parallel branches with different atrous rates to generate multi-scale feature maps to handle scale variability. This technique has been retained in all of the subsequent DeepLab versions due to its extraordinary performance. In particular, DeepLab V3 [4] augmented ASPP with image-level features by encoding global context to further boost the segmentation performance. Moreover, DeepLab V3+ [5] embedded ASPP into a more efficient encoder-decoder architecture with Xception as the backbone, and achieved the best performance in the semantic segmentation task. Besides, the authors of DeepLab also explored more efficient convolution operators like the depthwise separable convolution in MobileNet [15], and more effective network structures via the Neural Architecture Search (NAS) techniques in [12, 1].

However, though ASPP has achieved such remarkable performance, we find that it still has limitations in terms of generalization ability and model complexity, as explained earlier. Therefore, in this work, we propose the novel Kernel-Sharing Atrous Convolution (KSAC) to handle the scale variability problem more effectively. According to the experimental results on the benchmark VOC 2012 dataset, KSAC achieves much better performance than ASPP with far fewer parameters.

Other Semantic Segmentation Models

In addition to the aforementioned models, there are many other outstanding deep networks designed for semantic segmentation. For instance, the PSPNet proposed in [24] aggregated global context information via a pyramid pooling module in its pyramid scene parsing network. The DenseASPP [21] argued that the scale-axis of ASPP was not dense enough for the autonomous driving scenario, so its authors designed a more powerful DenseASPP structure, where a set of atrous convolutional layers were densely connected. Considering the importance of global contextual information, a Context Encoding Module was proposed in [22] to capture the semantic context of scenes and enhance the class-dependent feature maps. This method improved the segmentation results with only slight extra computation cost when compared with the FCN structure [13].

More recently, many advanced networks have been proposed for semantic segmentation and achieved promising performance. For instance, the Self-Supervised Model Adaptation (SSMA) fusion mechanism proposed in [17] leveraged complementary modalities to enable the network to learn semantically richer representations. HRNet [16] connected high-to-low resolution convolutions in parallel and repeatedly aggregated the up-sampled representations from all the parallel convolutions to maintain strong high-resolution representations throughout the whole process. To speed up computation and reduce memory consumption, a novel joint up-sampling module named Joint Pyramid Upsampling (JPU) was proposed in [18] for semantic segmentation. In [25], a fully dense neural network, i.e., FDNet, was proposed to take advantage of feature maps learned in the early stages and construct the spatial boundaries more accurately. Besides, its authors also designed a novel boundary-aware loss function to focus more attention on ‘hard examples’, i.e., pixels near the boundaries. As is well known, pixel-level labeling is time-consuming and exhausting work, and domain adaptation and few-shot learning are the key solutions to the data scarcity problem. From this perspective, a self-ensembling attention network and an attention-based multi-context guiding (A-MCG) network were proposed in [20] and [9], respectively.

Clearly, improving the representation capability and effectiveness of networks in handling objects of arbitrary sizes has been an intrinsic goal of recent semantic segmentation techniques. Existing attempts have explored various ways to take both global context and local appearance information into consideration. In this work, we propose an effective sharing strategy, i.e., KSAC. Our experimental results demonstrate the superiority of this idea in terms of improving the segmentation quality, reducing the network complexity and covering a wider range of context. Next, the technical details of our proposed KSAC, together with our motivations and justification, are presented.

Figure 3: The architecture of the network with the ASPP structure [2] (left) and our proposed KSAC structure (right). In ASPP, multiple kernels are used for branches with different atrous rates. In our proposed KSAC, there is only one single kernel, which is shared by atrous convolutional layers with different atrous rates.

Kernel-Sharing Atrous Convolution (KSAC)

As introduced above, thanks to the development of techniques including atrous convolution, depthwise separable convolution, ASPP and Xception, the DeepLab family [2, 3, 5, 12, 4] has achieved the highest performance for the task of semantic segmentation and contains the most significant and successful multi-branch structures. In our work, for a fair comparison with the well-known ASPP structure, we base our proposed KSAC on the DeepLab framework and replace the ASPP module with KSAC, as shown in Fig. 3. More details are presented below.

Atrous Spatial Pyramid Pooling (ASPP)

The receptive field of a filter represents the range of context that can be viewed when calculating features as input for the subsequent layers. A large receptive field enables the network to consider a wider range of context and more semantic information, which is vital for handling objects of large sizes. In contrast, a small receptive field is good for capturing local detailed information, which helps to generate more refined boundaries and more accurate predictions, especially for small objects. However, the receptive fields of traditional convolution operators are fixed (e.g., a 3×3 kernel has a fixed receptive field of 3×3). Atrous convolution allows us to expand the receptive fields of filters flexibly by setting various atrous rates for the traditional convolutional layer and inserting zeros into the filters accordingly.
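Concretely, a k×k kernel with atrous rate r covers an effective extent of k + (k − 1)(r − 1) pixels. The small helper below (our illustration) makes this easy to verify for the rates used later in this paper:

def effective_kernel_size(k: int, rate: int) -> int:
    """Effective spatial extent of a k x k kernel dilated with the given atrous rate."""
    return k + (k - 1) * (rate - 1)

for rate in (1, 6, 12, 18, 24):
    print(rate, effective_kernel_size(3, rate))
# rate 1 -> 3, rate 6 -> 13, rate 12 -> 25, rate 18 -> 37, rate 24 -> 49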

Furthermore, in the ASPP structure [2], to handle objects of arbitrary sizes, multiple atrous convolution layers with different atrous rates are used in parallel and their outputs are combined to integrate information extracted with various receptive fields. However, as analyzed above, this design harms the generalizability of the kernels in individual branches and also increases the computation burden. To address this issue, we propose a novel sharing mechanism (i.e., KSAC) to improve the semantic segmentation performance of existing models.

Figure 4: The detailed architecture of our proposed Kernel-Sharing Atrous Convolution with atrous rates (6, 12, 18).

1: Input: C_in: input channels, X: input feature maps, C_out: output channels, R: atrous rates
2: Output: Y: output feature maps
3: K ← Kernel(3, 3, C_in, C_out)    ▷ generate shared kernel
4: for r in R do
5:     Y_r ← Conv2D(X, K, rate = r)
6:     Y_r ← BatchNorm(Y_r)
7: end for
8: Y ← Sum(Y_r for r in R)
Algorithm 1 Kernel-Sharing Atrous Convolution
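To make Algorithm 1 concrete, below is a minimal TensorFlow 2 sketch of the shared-kernel atrous convolution (our illustration: the class name KSACConv is ours, and the element-wise sum fusion follows our reading of Algorithm 1 and Fig. 4; the authors' released code may differ in details).

import tensorflow as tf

class KSACConv(tf.keras.layers.Layer):
    """Shared-kernel atrous convolution: one 3x3 kernel, several atrous rates."""

    def __init__(self, filters, rates=(6, 12, 18), **kwargs):
        super().__init__(**kwargs)
        self.filters = filters
        self.rates = rates
        # One BatchNorm per branch, as in Algorithm 1.
        self.bns = [tf.keras.layers.BatchNormalization() for _ in rates]

    def build(self, input_shape):
        c_in = int(input_shape[-1])
        # The single shared 3x3 kernel used by every branch.
        self.kernel = self.add_weight(
            name='shared_kernel', shape=(3, 3, c_in, self.filters),
            initializer='he_normal', trainable=True)

    def call(self, x, training=False):
        outputs = []
        for rate, bn in zip(self.rates, self.bns):
            # The same kernel sees the input with a different receptive
            # field, selected by the dilation rate.
            y = tf.nn.conv2d(x, self.kernel, strides=1, padding='SAME',
                             dilations=rate)
            outputs.append(bn(y, training=training))
        # Fuse the branches (element-wise sum keeps the channel count fixed).
        return tf.add_n(outputs)

# Usage: five rates share the same kernel, so the parameter count is
# identical to the three-rate setting.
layer = KSACConv(256, rates=(1, 6, 12, 18, 24))
y = layer(tf.random.normal([1, 33, 33, 2048]))
print(y.shape)  # (1, 33, 33, 256)

Note that the rates argument is the only thing that changes between the (6, 12, 18) and (1, 6, 12, 18, 24) settings discussed later; the kernel, and hence the parameter count, stays identical.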

Atrous Convolution with Shared Kernel

As shown in Fig. 4, our proposed KSAC is composed of three components, i.e., a 1×1 convolutional layer, a global average pooling layer followed by a 1×1 convolutional layer to obtain the image-level features, and a pyramid atrous convolutional module with a shared 3×3 kernel and atrous rates (6, 12, 18). Note that batch normalization layers are used after each convolutional layer. More implementation details of the KSAC structure are presented in Algorithm 1.

As we can see, there is only one kernel in our KSAC, which is shared by multiple parallel branches at different atrous rates so that it can see the input feature maps multiple times with different receptive fields. In contrast, in ASPP each branch has its own kernel, so the total number of parameters grows linearly with the number of branches. Specifically, the model complexity is 3 × 3 × C_in × C_out + P for our KSAC, while it is N × 3 × 3 × C_in × C_out + P for ASPP, where C_in, C_out, N and P denote the input feature map channels, the output feature map channels, the number of branches, and the number of parameters in the other two convolutional layers, respectively. Apparently, a large number of parameters are saved in our KSAC. For instance, in the case shown in Fig. 4, compared with the ASPP structure, about 62% of the parameters are saved by sharing the kernel among the three parallel convolutional branches.
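As a rough sanity check of the 62% figure, consider the head configuration commonly used in DeepLab V3+ (C_in = 2048, C_out = 256; our assumption, counting only convolution kernels and ignoring biases and batch-norm parameters):

# Rough parameter count for the module in Fig. 4 (our arithmetic).
c_in, c_out, n_branches = 2048, 256, 3

shared_3x3 = 3 * 3 * c_in * c_out       # one shared 3x3 kernel
other = 2 * (1 * 1 * c_in * c_out)      # 1x1 conv + image-pooling conv

ksac = shared_3x3 + other               # the 3x3 kernel is counted once
aspp = n_branches * shared_3x3 + other  # one 3x3 kernel per branch

print(f"KSAC: {ksac/1e6:.2f}M, ASPP: {aspp/1e6:.2f}M, "
      f"saved: {1 - ksac/aspp:.0%}")
# KSAC: 5.77M, ASPP: 15.20M, saved: 62%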

As demonstrated in our experiments, our proposed sharing strategy not only helps reduce the number of parameters but also improves the segmentation performance. This improvement can be explained from two aspects. Firstly, the generalization ability of the shared kernel is enhanced by learning both local detailed features for small objects and global semantically rich features for large objects, which is realized by varying the atrous rates. Secondly, the number of effective training samples is increased by sharing information, which improves the representation ability of the shared kernel. As illustrated in Fig. 1, kernels with small atrous rates in ASPP cannot extract features comprehensive enough for large objects, while those with large atrous rates are ineffective at extracting local fine details for small objects. Therefore, kernels in individual branches can only be trained effectively by a subset of the objects in the training images. In contrast, in our proposed KSAC, all of the objects in the training images are contributive samples for the training of the shared kernel. Note that this kernel-sharing strategy essentially conducts ‘feature augmentation’ inside the network by sharing kernels among branches. Like data augmentation performed in the pre-processing stage, feature augmentation performed inside the network helps enhance the representation ability of the shared kernel.

Figure 5: Visualization of the feature maps extracted by KSAC and ASPP. The edges and contours extracted by our KSAC are much clearer than those extracted by ASPP. Several feature maps are presented for each rate in this figure, and the ones indicated by red bounding boxes are enlarged at the top of the figure. Readers are suggested to zoom in to see more details.

To better understand the enhanced generalization and representation abilities of our KSAC, we visualize the feature maps learned by its shared kernel, and compare these feature maps with those generated by ASPP’s separate kernels, as shown in Fig. 5. Whether for branches with small atrous rates (small receptive fields) or large atrous rates (large receptive fields), the feature maps produced by our KSAC are clearly more comprehensive, expressive and discriminative than those generated by ASPP. Specifically, as illustrated in Fig. 5, the edges (local detailed information) and contours (global semantics) detected by our KSAC are much clearer than those detected by ASPP.

Moreover, as pointed out in [4], the DeepLab model achieved its best performance under the (6, 12, 18) setting in ASPP. However, when an additional parallel branch with rate 24 was added, the performance actually dropped slightly, by 0.12%. That is to say, ASPP is not able to produce better performance by capturing a longer range of context. In contrast, according to our experimental results, the performance obtained with our proposed KSAC can be further improved under the setting (1, 6, 12, 18, 24). This demonstrates that, compared with ASPP, our proposed KSAC is more effective at capturing longer ranges of context with larger atrous rates and wider parallel atrous convolutional branches.

In addition, note that in this new setting, five branches share one single kernel, so the number of parameters remains the same as in the (6, 12, 18) setting.

Experimental Setting

To demonstrate the effectiveness of our proposed KSAC sharing mechanism, we evaluate its performance on the benchmark PASCAL VOC 2012 dataset and compare it with the state-of-the-art approaches. In this section, we describe the details of the datasets utilized, the model implementation and the training protocol.

Datasets and Data Augmentation

In this work, the benchmark SBD and COCO datasets are used for pre-training, and the PASCAL VOC 2012 dataset is used for fine-tuning and evaluation. Augmentation methods including random flipping, random scaling and random cropping are employed.

PASCAL VOC 2012

The PASCAL VOC 2012 dataset was created for multiple purposes, including detection, recognition and segmentation. A large number of images are provided in this dataset, but only about 4,500 of them are labelled with high quality for segmentation. In particular, the PASCAL VOC 2012 segmentation dataset consists of about 1,500 annotated training images, 1,500 annotated validation images and 1,500 unannotated test images.

Semantic Boundaries Dataset (SBD)

The SBD dataset [8] is a third-party extension of the PASCAL VOC 2012 dataset, composed of about 8,500 annotated training images and 2,800 annotated validation images. Among the released training images, more than 1,000 are taken directly from the official PASCAL VOC 2012 validation set. Therefore, in order to use the SBD dataset for training while accurately evaluating the performance of related models on the PASCAL VOC 2012 validation set, we remove these images from SBD and merge the remaining 8,000 training images with the 2,800 validation images to create the SBD ‘trainaug’ dataset.

Common Objects in Context Dataset (COCO)

COCO is a huge dataset created for multiple tasks. As mentioned in the literature, additional improvement can be obtained if the model is pre-trained on the COCO dataset. Therefore, following the practice of DeepLab V3 [4], we select about 60K training images from the COCO dataset, keeping only images that contain classes defined in PASCAL VOC 2012 with an annotation region greater than 1,000 pixels. Moreover, any classes not defined in PASCAL VOC 2012 are treated as background.
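A sketch of this selection rule is given below (our illustration: the pycocotools calls are standard, but the annotation path and the abbreviated class list are placeholders):

from pycocotools.coco import COCO

# Hypothetical annotation path; only the filtering logic matters here.
coco = COCO('annotations/instances_train.json')
voc_category_ids = set(coco.getCatIds(
    catNms=['person', 'bird', 'cat', 'dog']))  # ...abbreviated; all 20 VOC classes in practice

selected = []
for img_id in coco.getImgIds():
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id, iscrowd=False))
    # Keep the image if any VOC-class region is larger than 1,000 pixels;
    # at training time, all non-VOC categories are remapped to background.
    if any(a['category_id'] in voc_category_ids and a['area'] > 1000 for a in anns):
        selected.append(img_id)

print(len(selected), 'images selected')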

Data Augmentation

To fairly compare our proposed model with other existing works, we also apply some widely adopted data augmentation strategies during training, including horizontal flipping with 50% probability, random scaling of the images with a scaling factor between 0.5 and 2.0 at a step size of 0.25, and padding followed by random cropping of the scaled images to a fixed crop size.
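The following eager-mode TensorFlow 2 sketch illustrates this pipeline (ours; the crop size of 513×513 is an assumption based on common DeepLab practice, as the exact crop size is not stated above):

import tensorflow as tf

CROP = 513  # assumed crop size (a common DeepLab choice; not specified above)

def augment(image, label):
    """image: [H, W, 3] float32; label: [H, W, 1] int32 segmentation mask."""
    # Horizontal flip with 50% probability (image and label flipped together).
    if tf.random.uniform([]) < 0.5:
        image = tf.image.flip_left_right(image)
        label = tf.image.flip_left_right(label)

    # Random scaling factor from {0.5, 0.75, ..., 2.0} (step 0.25).
    scale = tf.random.shuffle(tf.range(0.5, 2.01, 0.25))[0]
    size = tf.cast(tf.cast(tf.shape(image)[:2], tf.float32) * scale, tf.int32)
    image = tf.image.resize(image, size)
    label = tf.image.resize(label, size, method='nearest')

    # Pad up to the crop size, then take an aligned random crop.
    pad = tf.maximum(CROP - tf.shape(image)[:2], 0)
    image = tf.pad(image, [[0, pad[0]], [0, pad[1]], [0, 0]])
    label = tf.pad(label, [[0, pad[0]], [0, pad[1]], [0, 0]], constant_values=255)
    both = tf.image.random_crop(
        tf.concat([image, tf.cast(label, tf.float32)], axis=-1), [CROP, CROP, 4])
    return both[..., :3], tf.cast(both[..., 3:], tf.int32)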

Implementation Details

Reimplementation of Encoders

In this work, we use the popular MobileNetV2 and Xception structures as our encoders, both of which are fully implemented with the latest TensorFlow 2.0 syntax. In addition, we load weights pre-trained on ImageNet for both encoders in our experiments. The new implementations of MobileNetV2 and Xception are also available in our code release.

Training Protocol

In our experiments, the batch size is set to 32 and 16 for the MobileNetV2-based and Xception-based models, respectively. According to [19], a minimum batch size of 16 is required for the Batch Normalization layers to perform well; with smaller batches, the error rate increases noticeably and the performance drops significantly. Our models are trained with two Titan RTX GPUs. In the first pre-training stage, the models are trained on the mixed COCO, SBD and VOC dataset for 300K iterations with a learning rate of 1e-3. The learning rate is then adjusted to 4e-4 and the models are trained on the mixed SBD and VOC dataset for another 40K iterations. Finally, the models are fine-tuned on the VOC training set with a learning rate of 2e-4.
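The resulting three-stage schedule can be summarized as a simple configuration (our sketch; the optimizer and any within-stage decay are not specified above):

# Three-stage training schedule as described above; the number of
# fine-tuning iterations is not stated, so it is left unset here.
STAGES = [
    dict(datasets=('COCO', 'SBD', 'VOC'), iterations=300_000, lr=1e-3),
    dict(datasets=('SBD', 'VOC'),         iterations=40_000,  lr=4e-4),
    dict(datasets=('VOC',),               iterations=None,    lr=2e-4),  # fine-tuning
]

for stage in STAGES:
    print(f"train on {stage['datasets']} @ lr={stage['lr']}")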

Evaluation Results

Improved mIOU

To demonstrate the effectiveness of our proposed KSAC, we first compare it with ASPP, the most successful multi-branch structure and a key component of the DeepLab family. The comparison results are shown in Table 1. Note that the combination of ASPP, Xception and the decoder is exactly the architecture of DeepLab V3+ [5]. In addition, the ASPP module and our KSAC module in Table 1 use the same atrous rate setting, i.e., (6, 12, 18), which is the standard setting of DeepLab V3+ [5].

As can be seen from Table 1, under the same configuration, replacing ASPP with the proposed KSAC improves the mIOU for both the Xception-based and the MobileNetV2-based models. In particular, when the Xception encoder is used, our proposed KSAC achieves the highest mIOU of 85.96%, which is 2.62% higher than DeepLab V3+ (83.34%). Moreover, according to [5], with the assistance of Google’s private JFT dataset, where millions of images are provided, the performance of DeepLab V3+ improves to 84.22%, which is still 1.74% lower than our proposed KSAC-based model trained without the assistance of JFT.

Parallel Structure Encoder Test Strategy Pre-train Dataset Performance
Our KSAC ASPP Xception MobileNetV2 Decoder MS Flip COCO JFT mIOU (%) Params (M)
82.20 54.3
83.34 -
83.03 -
84.22 -
83.92 44.8
85.96 -
75.32 -
75.70 4.5
76.30 3.0
Table 1: Experimental results obtained on the PASCAL VOC 2012 val set with different inference strategies when using ASPP and our proposed KSAC, with Xception or MobileNetV2 as the backbone. KSAC: Using our proposed Kernel-Sharing Atrous Convolution. ASPP: Using the standard ASPP structure proposed in [4]. Xception: Using Xception65 [5] as the backbone. MobileNetV2: Using MobileNetV2 [15] as the backbone. Decoder: Concatenating the OS = 4 feature maps from the backbone during the upsampling of the logits. MS: Employing multi-scale (MS) inputs, with a scaling rate of 0.5, 0.75, 1.0, 1.25, 1.5 or 1.75, during the evaluation. Flip: Adding left-right flipped inputs during the evaluation. COCO: Model is pre-trained on COCO. JFT: Model is pre-trained on JFT.

To further illustrate the superiority of our proposed KSAC, we also compare its performance with that of other state-of-the-art approaches, as listed in Table 2.

From the above comparison results, we can conclude that our proposed KSAC structure is more robust and effective than the ASPP structure: by seeing the input feature maps multiple times with different receptive fields, the network’s generalization and representation abilities are significantly improved.

Method mIOU (%)
Auto-DeepLab [12] 82.04
MSCI [11] 85.1
ExFuse [23] 85.8
SDN [6] 84.8
DeepLab V3+(ASPP) [5] 83.34
DeepLab V3 [4] 82.70
KSAC(Ours) 85.96
KSAC*(Ours) 86.16
Table 2: Comparison with other state-of-the-art approaches on the PASCAL VOC 2012 val set. The atrous rates are set to (6, 12, 18) for ASPP and KSAC, and to (1, 6, 12, 18, 24) for KSAC*. For a fair comparison, only the COCO, VOC and SBD datasets are used for the training of the listed models.

Reduced Network Model Size

Table 1 also compares the number of parameters in each resulting model. As can be seen, with our proposed sharing mechanism, the total number of parameters learned by our KSAC is significantly reduced. Concretely, when MobileNetV2 is used as the encoder, about 33.33% of the parameters are saved (3.0M vs 4.5M) thanks to the efficient sharing strategy. Note that, by replacing traditional convolutions with efficient depthwise separable convolutions, MobileNetV2 is already a light network structure specially designed for mobile devices; by combining MobileNetV2 with our proposed KSAC, the model becomes even lighter while the performance is further improved. In other words, KSAC can make models more effective and efficient for mobile and IoT devices. In addition, when Xception is used as the encoder, the absolute saving is much larger (about 10M parameters, 54.3M vs 44.8M) than that achieved with MobileNetV2. In other words, for models involving traditional convolutional operations, our proposed KSAC is able to save even more parameters.

Capability of Handling Wider Range of Context

As claimed in [4], the DeepLab V3 model achieved its best performance when three parallel branches with atrous rates (6, 12, 18) were used in the ASPP module, while an additional parallel branch with rate 24 resulted in a slight drop (0.12%) of the performance. In contrast, our proposed KSAC is able to exploit a wider range of context to further improve the segmentation performance. As shown in Table 3, when two atrous convolution branches with rates 1 and 24 are added to our KSAC structure, the mIOU is further improved from 85.96% to 86.16%. Notably, since the newly added branches share the same kernel with the original three branches, no additional parameters are introduced. Our best guess is that the performance degradation of ASPP is caused by insufficient training of the newly introduced parameters, whereas the shared kernel of our proposed KSAC is further trained and enhanced when it is shared by additional branches.

Atrous Rates          MS    Flip    mIOU (%)
(6, 12, 18)                         83.92
(6, 12, 18)           ✓     ✓       85.96
(1, 6, 12, 18, 24)                  84.01
(1, 6, 12, 18, 24)    ✓     ✓       86.16
Table 3: Experimental results of our proposed KSAC on the PASCAL VOC 2012 val set with different settings of atrous rates. MS: Employing multi-scale inputs during the evaluation. Flip: Adding left-right flipped inputs.

Conclusion

In this work, to handle the scale variability problem in semantic segmentation, we have proposed a novel and effective network structure, namely Kernel-Sharing Atrous Convolution (KSAC), where different branches share one single kernel with different atrous rates, i.e., a single kernel sees the input feature maps more than once with different receptive fields. Experiments conducted on the benchmark PASCAL VOC 2012 dataset have demonstrated the superiority of our proposed KSAC: it not only effectively improves the segmentation performance but also reduces the number of parameters significantly. Additionally, compared with the well-known ASPP structure, our KSAC can capture a wider range of context by adding additional parallel branches with larger atrous rates, without introducing extra parameters.

References