Interlaced Sparse Self-Attention for Semantic Segmentation

07/29/2019 · Lang Huang et al.

In this paper, we present a so-called interlaced sparse self-attention approach to improve the efficiency of the self-attention mechanism for semantic segmentation. The main idea is to factorize the dense affinity matrix as the product of two sparse affinity matrices. There are two successive attention modules, each estimating a sparse affinity matrix. The first attention module estimates the affinities within subsets of positions that have long spatial interval distances, and the second attention module estimates the affinities within subsets of positions that have short spatial interval distances. These two attention modules are designed so that each position is able to receive information from all the other positions. In contrast to the original self-attention module, our approach substantially decreases the computation and memory complexity, especially when processing high-resolution feature maps. We empirically verify the effectiveness of our approach on six challenging semantic segmentation benchmarks.


1 Experiments

We first compare our approach with the self-attention mechanism on three challenging semantic segmentation benchmarks: Cityscapes [cordts2016cityscapes], ADE20K [zhou2017scene], and LIP [liang2018look]. Then we conduct an ablation study on the object detection/instance segmentation benchmark COCO [lin2014microsoft] based on the Mask R-CNN baseline [he2017mask].

1.1 Experiments on Semantic Segmentation

We first describe the evaluated benchmarks and then report results on each of them. Finally, we present the qualitative improvements of our approach. We use mIoU (mean of class-wise intersection over union) and pixel accuracy as the evaluation metrics on all three semantic segmentation benchmarks.
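As a reference for how these two metrics are computed, here is a minimal sketch (our own illustration, not the evaluation code used in the paper) based on a class confusion matrix:

```python
import numpy as np

def confusion_matrix(pred, label, num_classes, ignore_index=255):
    """Accumulate a num_classes x num_classes confusion matrix (rows: ground truth)."""
    valid = label != ignore_index
    pred, label = pred[valid], label[valid]
    idx = label.astype(np.int64) * num_classes + pred.astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_pixel_acc(conf):
    """mIoU = mean over classes of TP / (TP + FP + FN); pixel acc. = trace / total."""
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(0) + conf.sum(1) - tp
    iou = tp / np.maximum(union, 1)
    return iou.mean(), tp.sum() / max(conf.sum(), 1)
```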

Cityscapes. The dataset contains 5,000 finely annotated images with 19 semantic classes used for evaluation. The images are of 2048x1024 resolution and captured from 50 different cities. The training, validation, and test sets consist of 2,975, 500, and 1,525 images respectively.

ADE20K. The dataset is very challenging and contains 20K densely annotated training images with 150 fine-grained semantic concepts. The training and validation sets consist of 20K and 2K images respectively.

LIP. The dataset is a large-scale dataset that focuses on semantic understanding of human bodies. It contains 50K images with 19 semantic human part labels and one background label for human parsing. The training, validation, and test sets consist of 30K, 10K, and 10K images respectively.

1.1.1 Implementation Details

Network. We use an ImageNet-pretrained ResNet-50/ResNet-101 as our backbone [long2015fully]. Following the common practice [chen2017rethinking], we remove the last two down-sampling operations in the ResNet-50/ResNet-101 and employ dilated convolutions in the last two stages, so that the output feature map is 1/8 the size of the input image.
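A sketch of this dilated-backbone setup using torchvision (the paper's own backbone implementation may differ in details such as the stem and weight initialization):

```python
import torch
from torchvision.models import resnet101

# Replace the stride-2 convolutions of the last two stages by dilated convolutions
# (dilation 2 and 4), so the output stride becomes 8 instead of 32.
# In practice the weights would be initialized from ImageNet pretraining.
backbone = resnet101(replace_stride_with_dilation=[False, True, True])

x = torch.randn(1, 3, 769, 769)      # a Cityscapes-style training crop
stem = torch.nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
feats = backbone.layer4(backbone.layer3(backbone.layer2(backbone.layer1(stem(x)))))
print(feats.shape)                   # torch.Size([1, 2048, 97, 97]), i.e. 1/8 resolution
```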

Training setting. For all three semantic segmentation benchmarks, we use the "poly" learning rate policy, where the learning rate is multiplied by (1 - iter/max_iter)^power. We use SGD with momentum and weight decay. Besides, we apply an auxiliary loss on the intermediate feature map of the ResNet, with the loss weight following PSPNet [zhao2017pyramid]. For data augmentation, we apply random horizontal flipping, random scaling, and random cropping over all the training images. In particular, we use synchronized batch normalization [Bulò_2018_CVPR] in all of our experiments. For Cityscapes, we choose the initial learning rate, batch size, and crop size following [zhao2017pyramid, chen2017rethinking]. For ADE20K, we choose the initial learning rate, batch size, and crop size following [zhao2017pyramid, OCNet]. For LIP, we choose the initial learning rate, batch size, and crop size following [liu2018devil]. We use the same GPUs for training in all of our experiments.
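For concreteness, a minimal sketch of the "poly" schedule and SGD setup described above (the numeric values here are common defaults used as placeholders, not necessarily the exact values chosen in the paper):

```python
import torch

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' policy: lr = base_lr * (1 - cur_iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

model = torch.nn.Conv2d(3, 19, 1)            # stand-in for the segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)

max_iter = 40000                             # placeholder schedule length
for it in range(max_iter):
    for group in optimizer.param_groups:
        group['lr'] = poly_lr(0.01, it, max_iter)
    # forward / backward / optimizer.step() would go here, with the total loss
    # combining the main segmentation loss and the weighted auxiliary loss.
```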

Training epochs for different datasets …

1.1.2 Cityscapes

Ablation study. We use the ResNet-based FCN as our baseline and first conduct a group of experiments to compare our approach with both the baseline and the non-local/self-attention based method. We report the results in Table 1. According to the results, both our approach and the non-local based method bring significant improvements over the baseline, which reveals that capturing long-range context is crucial for semantic segmentation. For example, our method achieves a % absolute improvement over the baseline on the validation set of Cityscapes.

Method Pixel Acc. (%) mIoU (%)
ResNet- Baseline
ResNet- + NL
ResNet- + IA
Table 1: Comparison to the baseline and the self-attention/non-local mechanism on the validation set of Cityscapes.

Comparison with the state-of-the-art. We report the results in Table 2 to compare with recent state-of-the-art methods on the test set of Cityscapes, where we train our models for more iterations and apply multi-scale and flip testing. Our method outperforms all previous methods that only use the fine-labeled data for training. For example, our approach achieves % mIoU and improves over the previous state-of-the-art method AAF [aaf2018] by %. Moreover, our approach achieves % mIoU when the validation set is also used for training, which outperforms the previous state-of-the-art methods by a large margin.

Method val Backbone mIoU (%)
PSPNet [zhao2017pyramid] ResNet-
PSANet [psanet] ResNet-
AAF [aaf2018] ResNet-
IANet ResNet-
RefineNet [lin2017refinenet] ResNet-
SAC [Zhang_2017_ICCV] ResNet-
DUC-HDC [wang2017understanding] ResNet-
DFN [Yu_2018_CVPR] ResNet-
DSSPN [Liang_2018_CVPR] ResNet-
DepthSeg [Kong_2018_CVPR] ResNet-
DenseASPP [Yang_2018_CVPR] DenseNet-
BiSeNet [yu2018bisenet] ResNet-
PSANet [psanet] ResNet-
IANet ResNet-
Table 2: Comparison to state-of-the-art methods on the test set of Cityscapes. We report results both with and without using the val fine data.

1.1.3 ADE20K

Ablation study. We compare interlaced attention with the baseline and the non-local method on the validation set of ADE20K in Table 3. Our interlaced attention improves the ResNet- baseline by % in mIoU and % in pixel accuracy, which is significant considering that ADE20K is very challenging.

Method Pixel Acc. (%) mIoU (%)
ResNet- Baseline
ResNet- + NL
ResNet- + IA
Table 3: Comparison to non-local [wang2018non] (NL) on the validation set of ADE20K.

Comparison with the state-of-the-art. In Table 4, we compare our method with state-of-the-art methods. For a fair comparison, we employ the stronger ResNet- backbone and multi-scale testing following the other methods. From the results, we can see that our method achieves the best performance among all methods. Concretely, IANet achieves % mIoU on the validation set of ADE20K, which improves over the recently proposed EncNet [Zhang_2018_CVPR] that uses the same backbone by %, and even outperforms PSPNet [zhao2017pyramid], which is based on the much stronger ResNet-.

Method Backbone mIoU (%)
RefineNet [lin2017refinenet] ResNet-
RefineNet [lin2017refinenet] ResNet-
PSPNet [zhao2017pyramid] ResNet-
PSPNet [zhao2017pyramid] ResNet-
PSPNet [zhao2017pyramid] ResNet-
SAC [Zhang_2017_ICCV] ResNet-
PSANet [psanet] ResNet-
UperNet [xiao2018unified] ResNet-
DSSPN [Liang_2018_CVPR] ResNet-
EncNet [Zhang_2018_CVPR] ResNet-
IANet ResNet-
Table 4: Comparison to state-of-the-art methods on the validation set of ADE20K.

1.1.4 LIP

Ablation study.

Comparison with the state-of-the-art. To verify the generalization ability of our method on semantic segmentation tasks, we further evaluate our IANet on the LIP dataset. LIP is a human parsing benchmark whose task is to identify which human part each pixel belongs to, and it is very different from the previous two datasets. According to Table 5, IANet achieves a new state-of-the-art performance of % mIoU and outperforms all other methods using the same backbone by a large margin. The improvements further validate the generalization ability of our interlaced attention method. Note that we only employ single-scale testing following CE2P [liu2018devil]; multi-scale testing could be further incorporated to improve performance.

Method Backbone mIoU (%)
Attention+SSL [Gong_2017_CVPR] ResNet-
JPPNet [liang2018look] ResNet-
SS-NAN [Zhang_2017_ICCV] ResNet-
MMAN [luo2018macro] ResNet-
MuLA [nie2018mutual] ResNet-
CE2P [liu2018devil] ResNet-
IANet ResNet-
Table 5: Comparison to state-of-the-art methods on the validation set of LIP.
Figure 1: Visualization of the predictions of FCN with and without our interlaced attention. The first two and last two rows present examples from the validation sets of Cityscapes and ADE20K, respectively. Best viewed in color.

1.2 Application on Mask-RCNN

COCO. The dataset is one of the most challenging datasets for object detection and instance segmentation; it contains images annotated with object bounding boxes and instance masks of 80 categories. The training, validation, and test sets contain 118K, 5K, and 20K images respectively. We report the Average Precision (AP) and AP_50 (AP at an IoU threshold of 50%) for both bounding boxes and masks.

1.2.1 Implementation Details

We use Mask R-CNN [he2017mask] as our baseline. Similar to [wang2018non], we insert one non-local or interlaced attention block before the last block of the res4 stage of the ResNet- FPN [lin2017fpn] backbone. All models are initialized with ImageNet-pretrained weights and built upon the open-source toolbox [massa2018mrcnn]. We train the models using SGD with a batch size of and the '' learning rate schedule. The training and inference strategies are kept the same as the default settings in [massa2018mrcnn].
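The insertion point can be sketched as follows with a torchvision ResNet-50 (the actual maskrcnn-benchmark [massa2018mrcnn] configuration differs; `AttentionBlock` is a placeholder for the non-local or interlaced attention module):

```python
import torch.nn as nn
from torchvision.models import resnet50

class AttentionBlock(nn.Module):
    """Placeholder for a non-local / interlaced attention block."""
    def __init__(self, channels):
        super().__init__()
        self.channels = channels
    def forward(self, x):
        return x          # identity stand-in; a real block would return x + attention(x)

backbone = resnet50()
res4_blocks = list(backbone.layer3.children())   # res4 stage (6 bottleneck blocks, 1024 channels)
# Insert the attention block right before the last residual block of res4.
res4_blocks.insert(len(res4_blocks) - 1, AttentionBlock(channels=1024))
backbone.layer3 = nn.Sequential(*res4_blocks)
```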

1.2.2 Results on COCO

We report the experimental results on the COCO dataset in Table 6. Adding a non-local or interlaced attention block consistently improves the competitive Mask R-CNN baseline by % on all metrics for both object detection and instance segmentation. Considering that the implementation in [massa2018mrcnn] is highly optimized, the improvements are non-trivial. Moreover, the performance of our method is comparable to the non-local block. The substantial improvements on both object detection and instance segmentation again verify that our method generalizes well to various computer vision tasks.

Method AP^box AP^box_50 AP^mask AP^mask_50
Mask-RCNN
Mask-RCNN + NL
Mask-RCNN + IA
Table 6: Comparison to non-local [wang2018non] (NL) on the validation set of COCO. All models are based on ResNet- FPN backbone.

1.3 Ablation Studies

Comparison with CGNL. We conduct this comparison on CUB-200-2011, a fine-grained image classification dataset. It contains 11,788 images of 200 bird species. The training and validation sets consist of 5,994 and 5,794 images respectively.

We use ResNet-50 as the backbone to conduct experiments on the image classification task. Following [wang2018non], we insert one non-local block or one interlaced attention block before the last residual block of the res4 stage.

We report the Top-1 and Top-5 classification accuracy on the validation set of CUB-200-2011 in Table 7. The proposed interlaced attention improves the ResNet-50 baseline by % in Top-1 accuracy and % in Top-5 accuracy. We also compare our method with non-local [wang2018non] (NL) and the recently proposed compact generalized non-local [yue2018cgnl] module (CGNL), which generalizes the non-local module and takes the correlations between the positions of any two channels into account. The results show that our method outperforms NL and CGNL on both Top-1 and Top-5 accuracy. The consistent improvements of our method over the other methods verify the ability of interlaced attention to enhance the representations of deep neural networks.

Method Top1 Acc. (%) Top5 Acc. (%)
ResNet-50 Baseline
ResNet-50 + NL
ResNet-50 + CGNL
ResNet-50 + IA
Table 7: Comparison to non-local [wang2018non] (NL) and compact generalized non-local [yue2018cgnl] (CGNL) on the validation set of CUB-200-2011.

Comparison to Downsampling and RCCA. One intuitive way to reduce the heavy cost of self-attention methods is to downsample the input feature map before performing self-attention. In Table 8, we compare our method with the non-local method with and without a downsampled feature map. We can see that the performance of NL drops substantially if we downsample the feature map before feeding it into the NL block. This observation suggests that downsampling is not a satisfactory solution to the heavy cost of the NL method.
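The downsampling variant referred to above corresponds roughly to the following sketch (our own illustration): the feature map is pooled before the attention computation and the output is upsampled back.

```python
import torch.nn.functional as F

def downsampled_self_attention(x, attention, factor=2):
    """Run a self-attention module on a pooled copy of the feature map, then
    upsample the result back to the original resolution; this cuts the quadratic
    affinity cost by factor**4 but discards spatial detail (cf. Table 8)."""
    h, w = x.shape[-2:]
    x_small = F.avg_pool2d(x, kernel_size=factor)
    y_small = attention(x_small)
    return F.interpolate(y_small, size=(h, w), mode='bilinear', align_corners=False)
```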

We also compare interlaced attention to the recently proposed recurrent criss-cross attention [huang2018ccnet] (RCCA), which recurrently aggregates the contextual information of pixels on the criss-cross path and also reduces the cost of the NL method. From the last two rows of Table 8, we can see that our method outperforms RCCA by a large margin. This comparison again indicates the effectiveness of interlaced attention.

Visualization. We visualize the predictions of the baseline FCN, our method, and the ground truth in Figure 1. The first two and last two rows in Figure 1 present examples from the validation sets of Cityscapes and ADE20K, respectively. It can be seen that our method tends to produce 'smoother' predictions than the FCN baseline.

We conjecture that this can be attributed to the property of interlaced attention that it aggregates contextual information from all other similar pixels and enhances the representation. We therefore also visualize the attention maps of our method, together with the COCO detection and mask results, in the corresponding figure.

Method Downsample Pixel Acc. (%) mIoU (%)
ResNet- - -
ResNet- + NL
ResNet- + NL -
ResNet- + RCCA -
ResNet- + IA -
Table 8: Comparison to non-local [wang2018non] (NL) w/ and w/o downsample and recurrent criss-cross attention [huang2018ccnet] (RCCA) on the validation set of Cityscapes.

Influence of the Partitions. The number of partitions is a key hyper-parameter of interlaced attention. Table 9 evaluates the influence of different numbers of partitions on the validation set of Cityscapes. According to the results, interlaced attention with different numbers of partitions consistently improves the baseline, and larger partition numbers achieve slightly better results than the others. As discussed in Section LABEL:ia_implementation, the complexity of interlaced attention is minimized when the number of partitions equals the square root of the number of positions (a small sketch of this cost argument follows Table 9). Since the spatial size of the feature maps that interlaced attention is applied to usually ranges in [30, 100] during training, we use a fixed partition number in all our experiments unless otherwise stated.

Method Pixel Acc. (%) mIoU (%)
ResNet-
ResNet- + IA ()
ResNet- + IA ()
ResNet- + IA ()
Table 9: Influence of the number of partitions of interlaced attention on the validation set of Cityscapes.
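For intuition, here is a small sketch of the cost argument (our summary, with N spatial positions and P partitions): the long-range step builds N/P affinity matrices of size PxP and the short-range step builds P affinity matrices of size (N/P)x(N/P), giving roughly N*P + N^2/P multiply-adds per channel, which is minimized at P ≈ sqrt(N).

```python
import math

def dense_attention_cost(n):
    """Affinity cost of standard self-attention over n positions (per channel)."""
    return n * n

def interlaced_attention_cost(n, p):
    """Long-range step: (n / p) groups of size p  ->  (n / p) * p**2 = n * p.
    Short-range step:   p groups of size (n / p)  ->  p * (n / p)**2 = n**2 / p."""
    return n * p + n ** 2 / p

n = 97 * 97                              # e.g. a 97 x 97 feature map at output stride 8
p = round(math.sqrt(n))                  # cost-minimizing number of partitions
print(dense_attention_cost(n))           # ~8.9e7
print(interlaced_attention_cost(n, p))   # ~1.8e6, roughly sqrt(n)/2 times cheaper
```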

Order of Long-range and Short-range Attention. In Section LABEL:sec:detail_ia, we show that our method can obtain information from every other pixel by performing cascaded long-range and short-range attention. We study the effect of the order of these two stages in Table 10; a sketch of this cascade is given after Table 10. We can see that performing long-range attention first and then short-range attention achieves better results on all metrics. One possible explanation is that performing short-range attention on the neighboring regions of the original feature does not introduce a large amount of long-range context, whereas performing short-range attention on a feature already updated by long-range attention has access to much richer contextual information.

Method Pixel Acc. (%) mIoU (%)
Short-Long Range Attention
Long-Short Range Attention
Table 10: Impact of the order of long-range and short-range attention on the validation set of Cityscapes.
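A minimal sketch of this cascade (our own illustration under simplifying assumptions: the spatial size is divisible by the partition numbers, and the inner attention omits the scaling and output projections of the full module):

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Plain (dense) self-attention over the spatial positions of a small feature map."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 2, 1)
        self.key   = nn.Conv2d(channels, channels // 2, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, hw, c/2)
        k = self.key(x).flatten(2)                     # (b, c/2, hw)
        v = self.value(x).flatten(2).transpose(1, 2)   # (b, hw, c)
        attn = torch.softmax(q @ k, dim=-1)            # (b, hw, hw) affinity within the group
        return (attn @ v).transpose(1, 2).reshape(b, c, h, w)

class InterlacedSparseSelfAttention(nn.Module):
    """Long-range attention over interlaced subsets, followed by short-range attention
    within local blocks, so every position can receive information from all others."""
    def __init__(self, channels, p_h=8, p_w=8):
        super().__init__()
        self.p_h, self.p_w = p_h, p_w
        self.long_range = SelfAttention2d(channels)
        self.short_range = SelfAttention2d(channels)

    def forward(self, x):
        b, c, h, w = x.shape
        p_h, p_w = self.p_h, self.p_w
        q_h, q_w = h // p_h, w // p_w                  # assumes h, w divisible by p_h, p_w

        # Long-range attention: each group takes one position from every local block
        # (positions spaced q_h / q_w apart), giving q_h*q_w groups of size p_h*p_w.
        t = x.reshape(b, c, p_h, q_h, p_w, q_w)
        t = t.permute(0, 3, 5, 1, 2, 4).reshape(b * q_h * q_w, c, p_h, p_w)
        t = self.long_range(t)
        t = t.reshape(b, q_h, q_w, c, p_h, p_w).permute(0, 3, 4, 1, 5, 2).reshape(b, c, h, w)

        # Short-range attention: each group is one contiguous local block,
        # giving p_h*p_w groups of size q_h*q_w.
        s = t.reshape(b, c, p_h, q_h, p_w, q_w)
        s = s.permute(0, 2, 4, 1, 3, 5).reshape(b * p_h * p_w, c, q_h, q_w)
        s = self.short_range(s)
        s = s.reshape(b, p_h, p_w, c, q_h, q_w).permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)
        return s
```

For example, with p_h = p_w = 8 on a 96x96 feature map, the largest affinity matrix ever built is 144x144 (short-range groups of 12x12 positions) rather than a single 9216x9216 one; swapping the two attention calls in forward yields the short-long ordering compared in Table 10.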

Going Deeper with Interlaced Attention. In Table 11, we investigate the performance gain from adding multiple interlaced attention blocks to the ResNet- backbone on the validation set of CUB.

Method Top1 Acc. (%) Top5 Acc. (%)
ResNet-
ResNet- + IA
ResNet- + IA
Table 11: Performance gain by adding more interlaced attention blocks on the validation set of CUB.

References