1 Introduction

Semantic segmentation is a fundamental task in computer vision with important applications in self-driving cars, robotics, and beyond. Great advances have been achieved since the advent of deep neural networks, and many works have shown that effective integration of contextual information plays a central role in pushing forward segmentation performance
[2, 3, 4, 5, 6, 7, 8, 9]. Contextual information captures the relational connection between an object and a region, which facilitates the classification of the object. Since the outputs of the layers of a CNN backbone encode different scales and levels of contextual information and combine to form a feature pyramid, it is a natural choice to leverage this multi-scale feature pyramid to achieve high-quality yet efficient context fusion. Multi-scale feature aggregation is commonly done with summation or concatenation followed by convolutional layers under a pixel-to-pixel correspondence. However, this fully passes down the high-level context to the following hierarchy without considering their interrelation. For example, in Fig. 1 (a) and (b), not all context information in the predefined vicinity of the corresponding high-level feature B is beneficial to the classification of the low-level feature A (a portion of a rider). Ideally, feature A should discriminately aggregate features that contain the high-level context, emphasizing the semantically related and spatially close features from Rider and Bicycle while suppressing the others.
To this end, we propose a Relational Semantics Extractor (RSE) inspired by [10] that enables the low-level feature to extract the complementary relational context from adjacent high-level feature maps with a cross-scale pixel-to-region relation operation. The key insight is that the proposed local relation operation essentially learns the composability between objects in the adjacent feature maps. The spatial relation is established by adding a positional embedding [11]. On top of RSE, we present the Relational Semantics Propagator (RSP) to propagate the extracted relational context. To progressively propagate the high-level context in a top-down manner, we construct an RSP head by stacking several RSP modules as in Fig. 2. As illustrated in Fig. 1(c) and (d), our simple and efficient architecture allows each low-level feature to search and aggregate context information from a large region in the high-level feature map. The blue and red arrows indicate context extraction and context propagation respectively; together with the top-down progressive propagation and the large relation region, they essentially enable the low-level feature D to capture long-range dependencies from another low-level feature C. In summary, our contributions are:
- Propose a cross-scale pixel-to-region relation operation as an effective solution to multi-scale feature aggregation.
- Propose a Relational Semantics Extractor (RSE) and Relational Semantics Propagator (RSP) for context extraction and propagation respectively.
- Introduce a simple, light-weight yet effective RSP head for semantic segmentation which performs competitively on Cityscapes and COCO.

2 Related Work
Multi-scale feature aggregation. Following the earlier work [2], various successful approaches have been developed around multi-scale feature aggregation. Methods like [3] and [13, 14, 1, 4] extract multi-scale features with pyramid pooling and atrous spatial pyramid pooling respectively. On the other hand, Lin et al. [15] exploit the natural structure of deep networks for the construction of multi-scale semantics. A recently proposed network [12] upsamples the multi-scale feature pyramid to the same spatial dimension through lateral paths and fuses them by element-wise summation. Chen et al. [16] use image pyramids of different scales as input, then use a CNN trunk to fuse multi-scale information by weighted summation. These methods use either concatenation or element-wise summation during fusion, which propagates all high-level contextual information to the lower level. Besides, they conduct multi-scale feature aggregation based on a pixel-to-pixel correspondence and do not consider the interrelation between scales.
Attention. Attention-based methods have shown great potential in computer vision. Wang et al. [17] demonstrate that long-range dependencies are beneficial to classification. Parmar et al. [18] take one step further and show that, for image recognition, the convolutional kernel can be replaced by a form of self-attention operation. The compositional relationship between pixels in a local neighborhood is exploited in [10, 19] to meaningfully join elements together, highlighting that a meaningful fusion is determined by the similarity of two pixels' feature projections into a learned embedding space [11]. A recent work [20] applies a non-local operation to compare feature maps from two scale levels for feature enhancement. Our work extends the local relation operation to a cross-scale setting to learn multi-scale composability and achieve a meaningful aggregation of information from multiple scale levels. A coarse-to-fine approach [5] exploits the coarse prediction to obtain a class-center feature as context and then uses it to enhance the coarse prediction. Yuan et al. [6] aggregate context from object regions in an image through a coarse prediction and distribute it back to all spatial positions based on the relationships between the feature position and the context representation [21]. These approaches leverage the pixel-to-region relation to extract context but are restrained to a single scale level, whereas our method aggregates the related context in a cross-scale setting. Ding et al. [8] produce a context map for each pixel with a paired convolution and Gaussian kernel in a large predefined region, and then apply the mask to the weights of convolution operations to make them shape-variant. Due to the computation cost, the shape-variant context mask is restrained to a single low-resolution layer.

3 Approach
3.1 Relational Semantics Extractor
To address the inefficiency of the convolutional layer in modeling compositional relationships, [10] propose to explicitly exploit relations between different pixels and extract meaningful features with a relation operation. One key insight is that the proposed local relation operation essentially learns the composability between objects in the key map and the query map. The local relation operation obtains the key and value from the same region. In contrast, we propose a relational semantics extractor (RSE) that exploits this property of the relation operation to enable the low-level feature map to selectively extract complementary context from its adjacent high-level feature map with a pixel-to-region correspondence, as shown in Fig. 3. Formally, given the upsampled high-level feature map $F_h$ and the low-level feature map $F_l$, the operation of the relational semantics extractor is defined as:

$$F_r(p) = \omega\big(\phi_q(F_l(p)),\ \phi_k(\mathcal{N}_k(F_h, p))\big) \circledast \phi_v(\mathcal{N}_k(F_h, p)), \quad (1)$$

where $F_r$ is the output feature map, and $F_l(p)$ and $F_h(p)$ are the features of the pixel at location $p$ in $F_l$ and $F_h$ respectively. $\omega$ is the relation operator, which looks for composability between the input pixel and the defined adjacent region of $p$. $\mathcal{N}_k$ extracts the adjacent region of pixel $p$ in feature map $F_h$: if we define the kernel size of RSE as $k$, $\mathcal{N}_k$ extracts the $k \times k$ feature matrix centered at the same location as $p$, as visualized in Fig. 3(b). $\phi_q$, $\phi_k$ and $\phi_v$ denote linear transformations that project the features into the embedding space, and $\circledast$ is a pixel-wise dot product with broadcasting in the channel dimension if required. Following [10, 11], we denote $\phi_q(F_l(p))$, $\phi_k(\mathcal{N}_k(F_h, p))$ and $\phi_v(\mathcal{N}_k(F_h, p))$ as the query $q$, key $K$ and value $V$ respectively in Fig. 3(a). To reduce the computation overhead, we reduce the channel number of the key and query in $\phi_q$ and $\phi_k$ by a reduction factor $r$. The relation operator computes appearance composability and is defined as the dot product of the feature pairs:

$$\omega(q, K) = q^{\top} K, \quad (2)$$

where the output of $\omega$ is a $k \times k$ weight matrix. For simplicity, we omit the linear transformations in the formula. There are other forms of relation modeling, but their performance is similar [10, 19]; therefore we adopt the dot product by default for implementation efficiency.

Since the current formulation does not encode positional information and is thus permutation invariant, an additional positional embedding is required. We follow a positional embedding strategy similar to [18]. In our case, a normalized 2D relative position map $P$ goes through its own linear transformation before the embedding is included in the relation operation. The 2D relative position map is generated such that the first half of its channels holds the row offsets and the second half the column offsets, and the normalization projects the values into $[-1, 1]$. The relation operator with positional embedding can now be defined as:

$$\omega(q, K, P) = q^{\top} K + q^{\top} \phi_p(P), \quad (3)$$

where $\phi_p$ denotes the linear transformation of the relative position map $P$.
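To make the operation concrete, below is a minimal PyTorch sketch of an RSE-style module following Eqs. (1)-(3): the query comes from a pixel of the low-level map, the key and value from the k × k neighborhood of the upsampled high-level map, and a learned embedding of the normalized relative positions is added before aggregation. The class and variable names are ours, and the softmax normalization of the relation weights is an assumption rather than something stated in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RSE(nn.Module):
    """Sketch of a relational semantics extractor (names and softmax are assumptions)."""

    def __init__(self, channels, kernel_size=7, dilation=1, reduction=8):
        super().__init__()
        self.k, self.d = kernel_size, dilation
        mid = channels // reduction                       # reduced query/key channels
        self.phi_q = nn.Conv2d(channels, mid, 1)          # query transform on the low-level map
        self.phi_k = nn.Conv2d(channels, mid, 1)          # key transform on the high-level map
        self.phi_v = nn.Conv2d(channels, channels, 1)     # value transform on the high-level map
        self.phi_p = nn.Conv2d(2, mid, 1)                 # embedding of the 2D relative positions

    def forward(self, f_low, f_high_up):
        # f_low, f_high_up: (B, C, H, W); f_high_up is the already-upsampled high-level map.
        B, C, H, W = f_low.shape
        k, d = self.k, self.d
        pad = d * (k - 1) // 2

        q = self.phi_q(f_low)                                                 # (B, Cm, H, W)
        Cm = q.shape[1]
        key = F.unfold(self.phi_k(f_high_up), k, dilation=d, padding=pad)     # (B, Cm*k*k, H*W)
        val = F.unfold(self.phi_v(f_high_up), k, dilation=d, padding=pad)     # (B, C*k*k, H*W)
        key = key.view(B, Cm, k * k, H * W)
        val = val.view(B, C, k * k, H * W)

        # Normalized relative positions of the k*k neighbours in [-1, 1], embedded by phi_p.
        ys, xs = torch.meshgrid(torch.arange(k), torch.arange(k), indexing="ij")
        rel = (torch.stack([ys, xs], dim=0).float() - (k - 1) / 2) / ((k - 1) / 2)
        pos = self.phi_p(rel.unsqueeze(0).to(f_low.device)).view(1, Cm, k * k, 1)

        # Eq. (3): appearance term q.K plus positional term q.phi_p(P), for every output pixel.
        q = q.view(B, Cm, 1, H * W)
        w = (q * (key + pos)).sum(dim=1)                  # (B, k*k, H*W) relation weights
        w = F.softmax(w, dim=1)                           # assumed normalization of the weights

        out = (w.unsqueeze(1) * val).sum(dim=2)           # Eq. (1): weighted aggregation of values
        return out.view(B, C, H, W)                       # extracted relational semantics F_r
```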
3.2 Relational Semantics Propagation Head
With the RSE extracting semantic information complementary to the low-level feature map, the RSP propagates this information to the low-level feature map. With the element-wise addition shown in Fig. 3(b), we achieve scale fusion with only the selected semantics. Specifically, the aggregation process can be expressed as:

$$F_{out} = F_l + F_r, \quad (4)$$

where $F_r$ is the extracted relational semantics as in Eq. 1.
Compared to performing element-wise summation for multi-scale feature aggregation, the proposed RSP has two advantages. (A) During aggregation, summation only considers a pixel-wise correspondence, while in RSP, information from a larger semantic region is aggregated to each pixel location of the low-level feature map. (B) Instead of propagating all contextual information from the high-level semantic features, the RSE selectively extracts features that are useful with respect to the low-level features.
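As a follow-up to the RSE sketch above (reusing its imports and class), the RSP module itself reduces to Eq. (4): upsample the high-level map, extract the relational semantics, and add them to the low-level map. This is a sketch under the same naming assumptions.

```python
class RSP(nn.Module):
    """Sketch of one RSP module: F_out = F_l + RSE(F_l, upsample(F_h)), as in Eq. (4)."""

    def __init__(self, channels, kernel_size=7, dilation=1):
        super().__init__()
        self.rse = RSE(channels, kernel_size, dilation)

    def forward(self, f_low, f_high):
        # Upsample the high-level map to the low-level resolution (a factor of two in the FPN).
        f_high_up = F.interpolate(f_high, size=f_low.shape[-2:], mode="bilinear", align_corners=False)
        return f_low + self.rse(f_low, f_high_up)         # element-wise addition of Fig. 3(b)
```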
We construct the RSP head by stacking a number of RSP modules; the overall structure of the RSP head is shown in Fig. 2. Since the RSP is able to propagate the high-level information to the low-level feature maps, we follow [22, 23] and leverage a 6-level FPN structure [15]. Specifically, $P_2$-$P_5$ are generated by connecting a convolution to the feature maps of the different stages in ResNet [24], and $P_6$ and $P_7$ are obtained by applying a strided convolution to $P_5$ and $P_6$ respectively. For more details please refer to [23]. We denote the transformed feature map from $P_i$ as $Q_i$, and progressively aggregate the feature maps from the highest level to the lowest level. The high-level feature map is first upscaled by a factor of two before it is fed into the RSP. For clarity, we denote the basic version of the RSP head without $P_6$ and $P_7$ as RSP-2: the element-wise summations between levels $Q_5$, $Q_4$ and levels $Q_4$, $Q_3$ are replaced by the proposed RSP module. The full RSP head with 4 RSP modules is denoted as RSP-4. All fusions between two scale levels are replaced with the RSP module except for the one between $Q_3$ and $Q_2$, where we perform only a simple summation to avoid high computation. In our experiments, we also show that aggregating higher-scale features with RSP yields better results.
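The head construction described above can then be sketched as follows: the FPN levels P2-P7 are transformed to Q2-Q7, the four highest fusions (7→6, 6→5, 5→4, 4→3) use RSP modules, and the final Q3→Q2 fusion is a plain summation before the classifier. The 1×1 transform, channel width, and level indexing are our reading of the text and of [12, 23], not an exact reimplementation.

```python
class RSPHead(nn.Module):
    """Sketch of the RSP-4 head on top of a 6-level FPN (P2-P7)."""

    def __init__(self, in_channels=256, channels=128, num_classes=19):
        super().__init__()
        self.transform = nn.ModuleList(nn.Conv2d(in_channels, channels, 1) for _ in range(6))
        self.rsp = nn.ModuleList(RSP(channels) for _ in range(4))   # fusions 7->6, 6->5, 5->4, 4->3
        self.classifier = nn.Conv2d(channels, num_classes, 1)

    def forward(self, pyramid):
        # pyramid = [P2, P3, P4, P5, P6, P7], strides 4 to 128.
        q = [t(p) for t, p in zip(self.transform, pyramid)]          # Q2 .. Q7
        x = q[5]                                                     # start from the highest level Q7
        for i, rsp in zip(range(4, 0, -1), self.rsp):                # progressively fuse into Q6 .. Q3
            x = rsp(q[i], x)
        # Q3 -> Q2 is fused by a simple summation to avoid the extra computation.
        x = q[0] + F.interpolate(x, size=q[0].shape[-2:], mode="bilinear", align_corners=False)
        logits = F.interpolate(self.classifier(x), scale_factor=4, mode="bilinear", align_corners=False)
        return logits                                                # per-pixel class scores at input resolution
```

In this sketch the RSP-2 variant would simply drop P6 and P7 and keep two RSP modules for the 5→4 and 4→3 fusions.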
4 Experiments
4.1 Implementation Details
Baseline Network.
The baseline network adopts FPN as the backbone for multi-scale feature extraction. The baseline for RSP-2 uses $P_2$-$P_5$ with strides of 4, 8, 16, and 32 pixels with respect to the input image. Additional $P_6$ and $P_7$ are used in the baseline for RSP-4, with strides of 64 and 128 pixels. Our baseline networks aggregate the features with a pixel-to-pixel correspondence: starting from the highest level, $P_5$ (RSP-2) / $P_7$ (RSP-4), they gradually approach $P_2$ by upsampling the high-level feature map to match the spatial dimension of the following low-level feature map with bilinear upsampling and then applying element-wise summation. A final 1×1 convolution, 4× bilinear upsampling, and softmax are used to generate the per-pixel class labels at the original image resolution.

Cityscapes. The Cityscapes dataset [25] targets urban scene understanding, with 19 categories used for semantic segmentation evaluation. The dataset contains 5,000 high-resolution pixel-level finely annotated images and 20,000 coarsely annotated images. The finely annotated images are divided into 2,975/500/1,525 images for training, validation and testing.
COCO. The COCO dataset [26] is a challenging large-scale dataset for computer vision tasks. The panoptic segmentation task [27] uses all 2017 COCO images with 80 thing and 53 stuff classes annotated. As we integrate the proposed semantic segmentation head into Panoptic FPN, we evaluate our approach on the panoptic segmentation task. We use mIoU as the evaluation metric for semantic segmentation and also report PQ, Mask AP and Box AP.
Training details. On Cityscapes, we follow [12] and use SGD with 0.9 momentum and 32 images per mini-batch, cropped to a fixed 512×1024 size; the training schedule is 40K/15K/10K updates at learning rates of 0.01/0.001/0.0001 respectively; a linear learning rate warmup [28] over 1000 updates starting from a learning rate of 0.001 is applied; a weight decay of 0.0001 is used; horizontal flipping, color augmentation [29], and crop bootstrapping [30] are used during training; scale augmentation at training time rescales an input image by a factor from 0.5 to 2.0 with a 32-pixel step; BN layers are frozen; no test-time augmentation is used. The evaluation metric is mIoU (mean Intersection-over-Union). On COCO, we use the default Mask R-CNN 1× training setting [31] with scale jitter (shorter image side in [640, 800]).
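For concreteness, the Cityscapes schedule above maps onto standard PyTorch utilities roughly as follows; the placeholder model and the exact step boundaries inside the lambda are our interpretation of the 40K/15K/10K schedule with a 1000-update linear warmup, not code from the paper.

```python
import torch

model = torch.nn.Conv2d(3, 19, 1)   # placeholder; the real network is the ResNet-FPN backbone + RSP head

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0001)

def lr_lambda(step):
    if step < 1000:                  # linear warmup from 0.001 to 0.01 over 1000 updates
        return 0.1 + 0.9 * step / 1000
    if step < 40_000:                # 40K updates at lr 0.01
        return 1.0
    if step < 55_000:                # 15K updates at lr 0.001
        return 0.1
    return 0.01                      # final 10K updates at lr 0.0001

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```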
Loss function. For semantic segmentation, we use the per-pixel cross-entropy loss. For panoptic segmentation, we follow [12] and use the weighted sum of the instance segmentation loss and the semantic segmentation loss, $L = \lambda_i L_{ins} + \lambda_s L_{sem}$, where the semantic segmentation loss weight $\lambda_s$ and the instance segmentation loss weight $\lambda_i$ follow the setting in [12].
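A minimal sketch of this combined objective, assuming the instance branch returns a dict of Mask R-CNN losses; the weight values shown are illustrative defaults, not the ones used in the paper.

```python
import torch.nn.functional as F

def panoptic_loss(sem_logits, sem_target, instance_losses, lambda_s=0.5, lambda_i=1.0):
    # Per-pixel cross entropy for the semantic branch (ignoring label 255 is a common convention).
    l_sem = F.cross_entropy(sem_logits, sem_target, ignore_index=255)
    # instance_losses: dict with the Mask R-CNN classification, box and mask losses.
    l_ins = sum(instance_losses.values())
    return lambda_i * l_ins + lambda_s * l_sem
```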
4.2 Performance Comparisons.
Semantic segmentation. We compare RSP with existing semantic segmentation methods on the Cityscapes val set. Only the fine annotations are used for training, and the mIoU is evaluated without flip or multi-scale testing. We first compare with Semantic FPN [12] on Cityscapes val, as our RSP head is most similar to Semantic FPN. The results are shown in Table 1. 'D' in the model name indicates the use of a dilated kernel of size 3 with dilation 3; details are in Section 4.3. RSP-2 outperforms Semantic FPN with 15% fewer FLOPs. It is worth noting that RSP-4 with a ResNet-50-FPN backbone already achieves 77.5% mIoU, which is very close to the 77.7% mIoU of Semantic FPN with the heavier ResNet-101 backbone. Next, we compare RSP-4 with other top-performing methods; the results are shown in Table 2. Note that RSP-4 outperforms DeeplabV3 [1] by 0.7% with 75% fewer FLOPs when using the same backbone. Our approach, which is simple in design, performs on par with DeepLabV3+, which has undergone many design iterations. RSP-4 also achieves strong results compared to the state-of-the-art method OCR [6] with fewer FLOPs.
Panoptic segmentation. Next, we compare with the semantic segmentation branch in the panoptic segmentation task on the COCO val set by replacing the semantic segmentation branch with the RSP head. The results are shown in Table 3. RSP improves the semantic segmentation mIoU by a large margin, which also leads to an improvement in the panoptic segmentation metric PQ.
Model | Backbone | mIoU | FLOPs | # param.
---|---|---|---|---
Baseline | ResNet-50-FPN | 74.8 | 51.7G | 4.7M
RSP-2 | ResNet-50-FPN | 76.1 ± 0.2 | 53.4G | 5.1M
RSP-4 | ResNet-50-FPN | 77.5 ± 0.2 | 53.7G | 7.8M
Baseline | ResNet-101-FPN | 76.7 | 51.7G | 4.7M
RSP-2-D | ResNet-101-FPN | 77.9 ± 0.2 | 53.4G | 5.1M
RSP-4-D | ResNet-101-FPN | 78.5 ± 0.2 | 53.7G | 7.8M
Semantic FPN [12] | ResNet-50-FPN | 75.8 | 62.5G | 6.5M
Semantic FPN [12] | ResNet-101-FPN | 77.7 | 62.5G | 6.5M

Table 1: Cityscapes fine annotations are used for training. 'D' in the model name indicates the use of a dilated kernel of size 3 with dilation 3. The median and standard deviation of 5 random runs are reported and the best results are in bold. Note that RSP-4 with the ResNet-50-FPN backbone achieves performance very close to Semantic FPN [12] with the ResNet-101-FPN backbone. FLOPs (multiply-adds) and the number of parameters are calculated for the head only, i.e. the backbone is excluded.

Model | Backbone | mIoU | FLOPs | Memory
---|---|---|---|---
Semantic FPN[12] | ResNet-101-FPN | 77.7 | 0.5T | 0.8G |
DeeplabV3 [1] | ResNet-101-D8 | 77.8 | 1.9T | 1.9G |
PSANet101 [32] | ResNet-101-D8 | 77.9 | 2.0T | 2.0G |
SETR-PUP [33] | T-Large | 79.3 | 1.0T | 2.7G |
Mapillary [30] | WideResNet-38-D8 | 79.4 | 4.3T | 1.7G |
DeeplabV3+ [4] | X-71-D16 | 79.6 | 0.5T | 1.9G |
OCR [6] | HRNetV2 | 80.8 | 1.3T | 1.4G |
RSP-4-D | ResNet-101-FPN | 78.5 | 0.5T | 0.8G |
RSP-4-D | ResNeXt-101-FPN | 79.5 | 0.8T | 1.4G |
Model | mIoU | PQ | Mask AP | Box AP | FLOPs | # param |
---|---|---|---|---|---|---|
Panoptic FPN [12] | 41.3 | 39.4 | 34.6 | 37.5 | 62.5G | 6.5M |
RSP-2-D head | 41.9 | 40.1 | 34.6 | 37.5 | 53.1G | 5.1M |
RSP-4-D head | 42.7 | 40.2 | 34.5 | 37.5 | 53.4G | 7.8M |
Model | RSP | Sum | mIoU | FLOPs | # param.
---|---|---|---|---|---
BASELINE | - | (5→4, 4→3) | 74.8 | 51.7G | 4.7M
+ RSP | 5→4 | 4→3 | 75.2 (+0.4) | 52.0G | 4.9M
+ RSP | 4→3 | 5→4 | 75.6 (+0.8) | 53.1G | 4.9M
+ RSP | (5→4, 4→3) | - | 76.1 (+1.3) | 53.4G | 5.1M
+ SELF | - | - | 75.5 (+0.7) | 53.4G | 5.1M
+ CONTEXT | - | - | 75.2 (+0.4) | 51.7G | 4.7M
Model | RSP | Sum | mIoU | FLOPs | # param.
---|---|---|---|---|---
BASELINE | - | (5→4, 4→3) | 74.8 | 51.7G | 4.7M
+ Q6, Q7 | - | (7→6, 6→5, 5→4, 4→3) | 76.0 (+1.2) | 51.9G | 7.1M
+ RSP | 4→3 | (7→6, 6→5, 5→4) | 76.3 (+1.5) | 53.2G | 7.3M
+ RSP | 7→6 | (6→5, 5→4, 4→3) | 76.2 (+1.4) | 51.9G | 7.3M
+ RSP | (5→4, 4→3) | (7→6, 6→5) | 76.4 (+1.6) | 53.6G | 7.4M
+ RSP | (7→6, 6→5) | (5→4, 4→3) | 76.9 (+2.1) | 52.0G | 7.4M
+ RSP | (6→5, 5→4, 4→3) | 7→6 | 77.1 (+2.3) | 53.7G | 7.6M
+ RSP | (7→6, 6→5, 5→4) | 4→3 | 77.2 (+2.4) | 52.3G | 7.6M
+ RSP | (7→6, 6→5, 5→4, 4→3) | - | 77.5 (+2.7) | 53.7G | 7.8M
4.3 Ablation Study on Cityscapes
Ablation study of the RSP-2 head. We break down the improvements of RSP-2 over the baseline by adding RSP modules to the baseline one by one. The results are shown in Table 4. Adding RSP modules consistently improves the baseline: with 2 RSP modules (a 3% computation increment), the RSP head achieves a 1.3 mIoU improvement over the baseline. In the experiment + SELF, we replace all the cross-scale relation operations with the local relation operation [10], whose input is only the high-level feature map. Compared to + SELF, RSP achieves much better performance because the cross-scale setting of our relation operation enables the low-level feature to access context from a much larger region. In the experiment + CONTEXT, we propagate high-level semantic information by simply aggregating high-level features in a local receptive field with average pooling and adding the result to the low-level feature. This outperforms the baseline but not our RSP-2, which demonstrates the superiority of the proposed relation operation in extracting meaningful context information from the high-level feature map.
Ablation study of the RSP module. We analyze the effect of kernel sizes in the RSP module, as shown in Table 5. RSP-2 achieves the best performance when the kernel size is 7 and the dilation is 1. Meanwhile, RSP obtains a similar result for kernel size 3 and dilation 3, where the effective kernel size is also 7. Therefore, we adopt kernel size 7 and dilation 1 with the ResNet-50 backbone for better performance, and kernel size 3 and dilation 3 with the larger ResNet-101 backbone, since using the dilation reduces both the computation and the GPU memory. For the number of middle channels, we choose the dimension-reduction factor that gives the better performance.
Ablation study of the RSP-4 head. We study the effect of the number of RSP modules in the RSP-4 head; the results are in Table 6. We have three observations. 1) Increasing the depth improves the performance even when the additional higher-level features are aggregated by element-wise summation, which confirms that high-level semantics is beneficial to classification. 2) Increasing the number of RSP modules consistently improves the performance. 3) With the same number of RSP modules, starting to add RSP modules from the highest level generally gives better results than starting from the lowest level. This shows that the proposed RSP meets our design goal of propagating the complementary contextual information in a top-down manner.
4.4 Qualitative Evaluation.
Complementary information in the relation operation. As shown in [10], the features extracted by the query and key transformations complement each other. We demonstrate that this observation still holds for the cross-scale relation operation. As visualized in Fig. 4, the key-query feature pairs in both the 5→4 and the 4→3 RSP modules complement each other.

Qualitative results on Cityscapes. We provide qualitative comparisons between RSP-4 and the baseline network with ResNet-50-FPN in the upper part of Fig. 5(a). We use red boxes to mark the challenging regions. The baseline model misclassifies the sidewalk near the crowd as road in the first image, the portion of the rider far from the motorcycle as a person in the second image, and pixels at the boundary of a bus as a car in the last image. In contrast, the proposed RSP head classifies all those areas correctly. The rider case demonstrates that the RSP head enables long-range dependencies to be captured; the sidewalk and bus cases confirm that the RSP head allows the pixels at the boundary to select the helpful high-level context.
Visualization of the context propagation. We visualize and compare the feature maps from the same channel produced by RSP-4 and the baseline throughout the feature aggregation process on two images in Fig. 5(b). In the left image, the half of the rider that is not near the bicycle is misclassified as a person; in the right image, the pixels at the boundary of the car and the bus are misclassified. The feature maps that display the context propagation show that the baseline model fully passes down the high-level context, which includes wrong or incomplete context, whereas RSP-4 successfully rejects that context and aggregates the complementary and informative context. In the last row, we use white circles to highlight the features produced by RSP-4 and the baseline model that represent the rider and the car in the red box. RSP-4 produces complete and clear features which are much easier to discriminate.

5 Conclusion
In this work, we propose a relational multi-scale feature aggregation approach for semantic segmentation. The multi-scale feature aggregation is achieved through the proposed relational semantics propagator (RSP) head, where the high-level context is selectively propagated to the low-level feature maps with a pixel-to-region correspondence. We propose a cross-scale relation operation named the relational semantics extractor (RSE) to extract contextual information complementary to the low-level feature from the corresponding region of the adjacent high-level feature map. The cross-scale setting also enables the low-level features to capture long-range dependencies in a compute-efficient way. Extensive experiments show the effectiveness of the RSP module and the consistent improvement gained by adding multiple RSP modules.
Acknowledgment
We thank Feng Xue and Guirong Zhuo for their helpful discussion and generous support and the Institute of Intelligent Vehicles, School of Automotive Studies, Tongji University for providing the GPUs for experiments.
References
- [1] L. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
- [2] W. Liu, A. Rabinovich, and A. Berg, “ParseNet: Looking wider to see better,” arXiv preprint arXiv:1506.04579, 2015.
- [3] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
- [4] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 801–818.
- [5] F. Zhang, Y. Chen, Z. Li, Z. Hong, J. Liu, F. Ma, J. Han, and E. Ding, “ACFNet: Attentional class feature network for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6798–6807.
- [6] Y. Yuan, X. Chen, and J. Wang, “Object-contextual representations for semantic segmentation,” arXiv preprint arXiv:1909.11065, 2019.
- [7] A. Tao, K. Sapra, and B. Catanzaro, “Hierarchical multi-scale attention for semantic segmentation,” arXiv preprint arXiv:2005.10821, 2020.
- [8] H. Ding, X. Jiang, B. Shuai, A. Q. Liu, and G. Wang, “Semantic correlation promoted shape-variant context for segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8885–8894.
- [9] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “CCNet: Criss-cross attention for semantic segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
- [10] H. Hu, Z. Zhang, Z. Xie, and S. Lin, “Local relation networks for image recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3464–3473.
- [11] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al., “Relational inductive biases, deep learning, and graph networks,” arXiv preprint arXiv:1806.01261, 2018.
- [12] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic feature pyramid networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6399–6408.
- [13] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
- [14] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” arXiv preprint arXiv:1412.7062, 2014.
- [15] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
- [16] L. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to Scale: Scale-aware semantic image segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3640–3649.
- [17] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7794–7803.
- [18] N. Parmar, P. Ramachandran, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens, “Stand-alone self-attention in vision models,” in Advances in Neural Information Processing Systems, 2019, pp. 68–80.
- [19] H. Zhao, J. Jia, and V. Koltun, “Exploring self-attention for image recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 076–10 085.
- [20] D. Zhang, H. Zhang, J. Tang, M. Wang, X. Hua, and Q. Sun, “Feature pyramid transformer,” arXiv preprint arXiv:2007.09451, 2020.
- [21] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng, “A²-Nets: Double attention networks,” in Advances in neural information processing systems, 2018, pp. 352–361.
- [22] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully convolutional one-stage object detection,” in Proceedings of the IEEE international conference on computer vision, 2019, pp. 9627–9636.
- [23] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
- [24] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [25] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
- [26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
- [27] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 9404–9413.
- [28] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, Large Minibatch SGD: Training imagenet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
- [29] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
- [30] S. Rota Bulò, L. Porzi, and P. Kontschieder, “In-place activated batchnorm for memory-optimized training of DNNs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5639–5647.
- [31] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He, “Detectron,” https://github.com/facebookresearch/detectron, 2018.
- [32] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia, “PSANet: Point-wise spatial attention network for scene parsing,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 267–283.
- [33] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” arXiv preprint arXiv:2012.15840, 2020.