1 Experiments
We first compare our approach with the self-attention mechanism on three challenging semantic segmentation benchmarks: Cityscapes [cordts2016cityscapes], ADE20K [zhou2017scene], and LIP [liang2018look]. We then conduct an ablation study on the object detection/instance segmentation benchmark COCO [lin2014microsoft] based on the Mask R-CNN baseline [he2017mask].
1.1 Experiments on Semantic Segmentation
We first describe the evaluated benchmarks, then report results on each benchmark, and finally present the qualitative improvements of our approach. We use mIoU (mean of class-wise intersection over union) and pixel accuracy as evaluation metrics on all three semantic segmentation benchmarks.
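For reference, a minimal sketch of how these two metrics can be computed from a class confusion matrix; the function and variable names below are illustrative and not part of our released code:

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """Compute pixel accuracy and mIoU from a (num_classes x num_classes)
    confusion matrix, where conf[gt, pred] counts pixels with ground-truth
    class `gt` that were predicted as class `pred`."""
    tp = np.diag(conf).astype(np.float64)
    pixel_acc = tp.sum() / conf.sum()
    # Per-class IoU = TP / (TP + FP + FN); classes absent from both the
    # ground truth and the predictions are ignored via NaN.
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp
    iou = np.divide(tp, union, out=np.full_like(tp, np.nan), where=union > 0)
    return pixel_acc, float(np.nanmean(iou))
```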
Cityscapes. The dataset contains finely annotated images with semantic classes. The images are of resolution and were captured from 50 different cities. The training, validation, and test sets consist of , , and images, respectively.
ADE20K. This dataset is very challenging: it contains K densely annotated images with fine-grained semantic concepts. The training and validation sets consist of K and K images, respectively.
LIP. This is a large-scale dataset that focuses on the semantic understanding of human bodies. It contains K images with semantic human part labels and a background label for human parsing. The training, validation, and test sets consist of K, K, and K images, respectively.
1.1.1 Implementation Details
Network. We use an ImageNet-pretrained ResNet-/ResNet- as our backbone [long2015fully]. Following common practice [chen2017rethinking], we remove the last two down-sampling operations in the ResNet-/ResNet- and employ dilated convolutions in the last two stages, so that the output feature map is smaller than the input image.

Training settings. For all three semantic segmentation benchmarks, we use the "poly" learning rate policy, in which the initial learning rate is multiplied by $(1 - \frac{iter}{iter_{max}})^{power}$ with the power set to . We use a momentum of and a weight decay of . Besides, we apply an auxiliary loss on the intermediate feature map after stage- of the ResNet with a weight of , following PSPNet [zhao2017pyramid]. For data augmentation, we apply random horizontal flipping, random scaling (from to ), and random cropping to all training images. In particular, we use synchronized batch normalization [Bulò_2018_CVPR] in all of our experiments. For Cityscapes, we choose an initial learning rate of , a batch size of , and a crop size of [zhao2017pyramid, chen2017rethinking]. For ADE20K, we choose an initial learning rate of , a batch size of , and a crop size of , following [zhao2017pyramid, OCNet]. For LIP, we choose an initial learning rate of , a batch size of , and a crop size of , following [liu2018devil]. We train on P GPUs in all of our experiments. Training epochs for different datasets …
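As a minimal sketch of the "poly" policy described above (the decay rule is the standard one; the default power of 0.9 and the helper name `poly_lr` are assumptions here, not necessarily the exact values used per dataset):

```python
def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 0.9) -> float:
    """'Poly' learning rate policy: scale the base learning rate by
    (1 - cur_iter / max_iter) ** power at every training iteration.
    The default power of 0.9 is a common choice and only an assumption."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# Typical usage inside the training loop, e.g. with a torch.optim.SGD
# optimizer configured with the momentum and weight decay described above:
#   for group in optimizer.param_groups:
#       group["lr"] = poly_lr(base_lr, cur_iter, max_iter)
```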
1.1.2 Cityscapes
Ablation study. We use a ResNet-based FCN as our baseline and first conduct a group of experiments to compare our approach with both the baseline and the non-local/self-attention based method. We report the results in Table 1. According to the results, both our approach and the non-local based method bring significant improvements over the baseline, which shows that capturing long-range context is crucial for semantic segmentation. For example, our method achieves a % absolute improvement over the baseline on the Cityscapes validation set.
Method | Pixel Acc. (%) | mIoU (%) |
---|---|---|
ResNet- Baseline | | |
ResNet- + NL | | |
ResNet- + IA | | |
Comparison with the state-of-the-art. We report results in Table 2 to compare with recent state-of-the-art methods on the Cityscapes test set, where we train our models for more iterations and apply multi-scale and flip testing. Our method outperforms all previous methods that use only the finely labeled data for training. For example, our approach achieves % mIoU and improves on the previous state-of-the-art method AAF [aaf2018] by %. Moreover, our approach achieves % when the validation set is also used for training, which outperforms the previous state-of-the-art methods by a large margin.
Method | val | Backbone | mIoU (%) |
---|---|---|---|
PSPNet [zhao2017pyramid] | ✗ | ResNet- | |
PSANet [psanet] | ✗ | ResNet- | |
AAF [aaf2018] | ✗ | ResNet- | |
IANet | ✗ | ResNet- | |
RefineNet [lin2017refinenet] | ✓ | ResNet- | |
SAC [Zhang_2017_ICCV] | ✓ | ResNet- | |
DUC-HDC [wang2017understanding] | ✓ | ResNet- | |
DFN [Yu_2018_CVPR] | ✓ | ResNet- | |
DSSPN [Liang_2018_CVPR] | ✓ | ResNet- | |
DepthSeg [Kong_2018_CVPR] | ✓ | ResNet- | |
DenseASPP [Yang_2018_CVPR] | ✓ | DenseNet- | |
BiSeNet [yu2018bisenet] | ✓ | ResNet- | |
PSANet [psanet] | ✓ | ResNet- | |
IANet | ✓ | ResNet- |
1.1.3 ADE20K
Ablation study. We compare interlaced attention with the baseline and the non-local method on the ADE20K validation set in Table 3. Our interlaced attention improves the ResNet- baseline by % in mIoU and % in pixel accuracy, which is significant considering that ADE20K is very challenging.
Method | Pixel Acc. (%) | mIoU (%) |
---|---|---|
ResNet- Baseline | | |
ResNet- + NL | | |
ResNet- + IA | | |
Comparison with the state-of-the-art. In Table 4, we compare our method with the state-of-the-art. For a fair comparison, we employ the stronger ResNet- backbone and multi-scale testing, following other methods. From the results, we can see that our method achieves the best performance among all methods. Concretely, IANet achieves % mIoU on the ADE20K validation set, which improves on the recently proposed EncNet [Zhang_2018_CVPR] using the same backbone by % and even outperforms PSPNet [zhao2017pyramid], which is based on the much stronger ResNet-.
Method | Backbone | mIoU (%) |
---|---|---|
RefineNet [lin2017refinenet] | ResNet- | |
RefineNet [lin2017refinenet] | ResNet- | |
PSPNet [zhao2017pyramid] | ResNet- | |
PSPNet [zhao2017pyramid] | ResNet- | |
PSPNet [zhao2017pyramid] | ResNet- | |
SAC [Zhang_2017_ICCV] | ResNet- | |
PSANet [psanet] | ResNet- | |
UperNet [xiao2018unified] | ResNet- | |
DSSPN [Liang_2018_CVPR] | ResNet- | |
EncNet [Zhang_2018_CVPR] | ResNet- | |
IANet | ResNet- |
1.1.4 LIP
Comparison with the state-of-the-art. To verify the generalization ability of our method on semantic segmentation tasks, we further evaluate IANet on the LIP dataset. LIP is a human parsing benchmark whose task is to identify which human part each pixel belongs to, and it differs substantially from the previous two datasets. According to Table 5, IANet achieves a new state-of-the-art performance of % mIoU and outperforms all other methods using the same backbone by a large margin. The improvements further validate the generalization ability of our interlaced attention method. Note that we only employ single-scale testing, following CE2P [liu2018devil]; multi-scale testing could be further incorporated to improve performance.
Method | Backbone | mIoU (%) |
---|---|---|
Attention+SSL [Gong_2017_CVPR] | ResNet- | |
JPPNet [liang2018look] | ResNet- | |
SS-NAN [Zhang_2017_ICCV] | ResNet- | |
MMAN [luo2018macro] | ResNet- | |
MuLA [nie2018mutual] | ResNet- | |
CE2P [liu2018devil] | ResNet- | |
IANet | ResNet- |
1.2 Application to Mask R-CNN
COCO. This is one of the most challenging datasets for object detection and instance segmentation; it contains K images annotated with object bounding boxes and masks of 80 categories. The training, validation, and test sets contain K, K, and K images, respectively. We report the Average Precision (AP) and AP_50 (AP at an IoU threshold of %) for both bounding boxes and masks.
1.2.1 Implementation Details
We use Mask R-CNN [he2017mask] as our baseline for these experiments. Similar to [wang2018non], we insert a non-local or interlaced attention block before the last block of the res- stage of the ResNet- FPN [lin2017fpn] backbone. All models are initialized with ImageNet-pretrained weights and built upon the open-source toolbox [massa2018mrcnn]. We train the models using SGD with a batch size of and the learning rate schedule ''. The training and inference strategies follow the default settings of [massa2018mrcnn].
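A hedged sketch of how such a block can be spliced into a torchvision-style residual stage is shown below; the helper `insert_before_last_block` is hypothetical and only illustrates the placement, while the actual integration with the detection toolbox [massa2018mrcnn] differs in its details:

```python
import torch.nn as nn

def insert_before_last_block(stage: nn.Sequential, attn: nn.Module) -> nn.Sequential:
    """Place an attention module (non-local or interlaced attention)
    immediately before the last residual block of a ResNet stage,
    assuming the stage is an nn.Sequential of residual blocks."""
    blocks = list(stage.children())
    return nn.Sequential(*blocks[:-1], attn, blocks[-1])
```

The returned module simply replaces the corresponding stage of the backbone, so the FPN and the detection heads stay untouched.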
1.2.2 Results on COCO
We report the experimental results on the COCO dataset in Table 6. Adding a non-local or interlaced attention block consistently improves the competitive Mask R-CNN baseline by % on all metrics for both object detection and instance segmentation. Considering that the implementation in [massa2018mrcnn] is highly optimized, these improvements are non-trivial. Moreover, the performance of our method is comparable to that of the non-local block. The substantial improvements on both object detection and instance segmentation again verify that our method generalizes well to various computer vision tasks.
Method | AP^box | AP^box_50 | AP^mask | AP^mask_50 |
---|---|---|---|---|
Mask R-CNN | | | | |
Mask R-CNN + NL | | | | |
Mask R-CNN + IA | | | | |
1.3 Ablation Studies
Comparison with CGNL. CUB-- is a fine-grained image classification dataset that contains images of types of birds. The training and validation sets consist of and images, respectively.
We use ResNet-50 as the backbone for the image classification experiments. Following [wang2018non], we insert one non-local block or one interlaced attention block before the last residual block of the res- stage.
We report the Top-1 and Top-5 classification accuracy on the CUB-- validation set in Table 7. The proposed interlaced attention improves the ResNet-50 baseline by % in Top-1 accuracy and % in Top-5 accuracy. We also compare our method with the non-local module (NL) [wang2018non] and the recently proposed compact generalized non-local module (CGNL) [yue2018cgnl], which generalizes the non-local module by taking the correlations between the positions of any two channels into account. The results show that our method outperforms NL and CGNL on both Top-1 and Top-5 accuracy. The consistent improvements of our method over the other methods verify the ability of interlaced attention to enhance the representations of deep neural networks.
Method | Top-1 Acc. (%) | Top-5 Acc. (%) |
---|---|---|
ResNet- Baseline | | |
ResNet- + NL | | |
ResNet- + CGNL | | |
ResNet- + IA | | |
Comparison with Downsampling and RCCA. One intuitive way to reduce the heavy cost of self-attention methods is to downsample the input feature map before performing self-attention. In Table 8, we compare our method with the non-local method with and without a downsampled feature map. We can see that the performance of NL drops substantially if we downsample the feature map before feeding it into the NL block. This observation suggests that downsampling is not the solution to the heavy cost of the NL method; a sketch of this downsampled baseline is given below.
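For concreteness, one plausible reading of the downsampled NL baseline is sketched here; the pooling factor, the interpolation mode, and the helper name `downsampled_attention` are assumptions rather than the exact setting used in Table 8:

```python
import torch.nn.functional as F

def downsampled_attention(x, attention, factor=2):
    """Apply a self-attention module on a spatially downsampled feature map
    and upsample the result back to the original resolution, trading
    accuracy for a lower attention cost (factor=2 is an assumption)."""
    h, w = x.shape[-2:]
    y = attention(F.avg_pool2d(x, kernel_size=factor))
    return F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False)
```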
We also compare interlaced attention with the recently proposed recurrent criss-cross attention (RCCA) [huang2018ccnet], which recurrently aggregates the contextual information of surrounding pixels along criss-cross paths and also reduces the cost of the NL method. From the last two rows of Table 8, we can see that our method outperforms RCCA by a large margin. This comparison again indicates the effectiveness of interlaced attention.
Visualization. We visualize the predictions of the baseline FCN, our method, and the ground truth in Figure 1. The first and last two rows of Figure 1 present examples from the validation sets of Cityscapes and ADE20K, respectively. It can be seen that our method tends to produce 'smoother' predictions than the FCN baseline.
We conjecture that this may be attributed to the property of interlaced attention that it aggregates contextual information from all other similar pixels and enhances the representation. We therefore visualize the attention maps of our method, together with COCO detection/mask results, in Figure .
Method | Downsample | Pixel Acc. (%) | mIoU (%) |
---|---|---|---|
ResNet- | - | - | |
ResNet- + NL | |||
ResNet- + NL | - | ||
ResNet- + RCCA | - | ||
ResNet- + IA | - |
Influence of the Partitions. The number of partitions (i.e., ) is a key hyper-parameter of interlaced attention. Table 9 evaluates the influence of different numbers of partitions on the Cityscapes validation set. According to the results, interlaced attention with different numbers of partitions consistently improves over the baseline, and larger partitions (i.e., or ) achieve slightly better results than the others. As discussed in Section LABEL:ia_implementation, the complexity of interlaced attention is minimized when ; a rough cost count is sketched after Table 9. Since the spatial size of the feature maps that interlaced attention operates on usually ranges in [30, 100] during training, we use a partition of in all our experiments unless otherwise stated.
Method | Pixel Acc. (%) | mIoU (%) |
---|---|---|
ResNet- | | |
ResNet- + IA () | | |
ResNet- + IA () | | |
ResNet- + IA () | | |
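For intuition, a back-of-the-envelope cost count, under the assumption that the partition number $P$ splits each spatial dimension into $P$ blocks, so that the long-range stage attends within groups of $P^2$ positions and the short-range stage within blocks of $HW/P^2$ positions (a sketch of the argument only, not the exact derivation in Section LABEL:ia_implementation):

```latex
% Attention within a group of G positions costs O(G^2 C); with N = HW
% positions there are N/G groups, so one stage costs O(N G C) in total.
\mathcal{O}\!\left(NC\Big(P^2 + \tfrac{N}{P^2}\Big)\right),
\quad \text{minimized when } P^2 = \sqrt{N},\ \text{i.e. } P = (HW)^{1/4},
\quad \text{giving } \mathcal{O}(N^{3/2}C) \text{ vs. } \mathcal{O}(N^2 C) \text{ for full self-attention.}
```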
Order of Long-range and Short-range Attention. In Section LABEL:sec:detail_ia, we show that our method lets every pixel obtain information from every other pixel by performing cascaded long-range attention and short-range attention. We study the effect of the order of these two stages in Table 10. Performing long-range attention first and then short-range attention achieves better results on all metrics. One possible explanation is that performing short-range attention on the neighboring regions of the original features does not introduce a large amount of long-range context, whereas performing short-range attention on features already updated by long-range attention provides much richer contextual information. A sketch of this two-stage scheme is given after Table 10.
Method | Pixel Acc. (%) | mIoU (%) |
---|---|---|
Short-Long Range Attention | | |
Long-Short Range Attention | | |
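As a sketch of this two-stage scheme (a simplified reading that assumes a single partition number $P$ along each spatial dimension and spatial sizes divisible by $P$; the class name, the attention sub-modules, and the exact permutation below are illustrative, not the released implementation):

```python
import torch
import torch.nn as nn

class InterlacedAttentionSketch(nn.Module):
    """Long-range attention among positions that share the same offset inside
    their local block, followed by short-range attention inside each local
    (H/P x W/P) block. `attn_long` and `attn_short` can be any modules that
    map an (N, C, h, w) tensor to a tensor of the same shape."""

    def __init__(self, attn_long: nn.Module, attn_short: nn.Module, p: int = 8):
        super().__init__()
        self.attn_long, self.attn_short, self.p = attn_long, attn_short, p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        p, ph, pw = self.p, h // self.p, w // self.p  # assumes h, w divisible by p
        # Long-range stage: each group gathers one position from every block,
        # so the P x P positions inside a group are spread across the image.
        x = x.reshape(n, c, p, ph, p, pw).permute(0, 3, 5, 1, 2, 4)
        x = self.attn_long(x.reshape(n * ph * pw, c, p, p))
        x = x.reshape(n, ph, pw, c, p, p).permute(0, 3, 4, 1, 5, 2).reshape(n, c, h, w)
        # Short-range stage: attention inside each local (ph x pw) block.
        x = x.reshape(n, c, p, ph, p, pw).permute(0, 2, 4, 1, 3, 5)
        x = self.attn_short(x.reshape(n * p * p, c, ph, pw))
        return x.reshape(n, p, p, c, ph, pw).permute(0, 3, 1, 4, 2, 5).reshape(n, c, h, w)
```

With `nn.Identity()` plugged in for both stages, the permutations cancel and the module returns its input unchanged, which is a convenient sanity check of the reshaping logic.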
Going Deeper with Interlaced Attention. In Table 11, we investigate the performance gain obtained by adding multiple interlaced attention blocks to the ResNet- backbone on the CUB validation set.
Method | Top-1 Acc. (%) | Top-5 Acc. (%) |
---|---|---|
ResNet- | | |
ResNet- + IA | | |
ResNet- + IA | | |