Hierarchical Multi-Scale Attention for Semantic Segmentation

05/21/2020 ∙ by Andrew Tao, et al.

Multi-scale inference is commonly used to improve the results of semantic segmentation. Multiple image scales are passed through a network and then the results are combined with averaging or max pooling. In this work, we present an attention-based approach to combining multi-scale predictions. We show that predictions at certain scales are better at resolving particular failure modes, and that the network learns to favor those scales for such cases in order to generate better predictions. Our attention mechanism is hierarchical, which enables it to be roughly 4x more memory efficient to train than other recent approaches. In addition to enabling faster training, this allows us to train with larger crop sizes, which leads to greater model accuracy. We demonstrate the results of our method on two datasets: Cityscapes and Mapillary Vistas. For Cityscapes, which has a large number of weakly labelled images, we also leverage auto-labelling to improve generalization. Using our approach we achieve new state-of-the-art results in both Mapillary (61.1 IOU val) and Cityscapes (85.1 IOU test).




1 Introduction

The task of semantic segmentation is to label all pixels within an image as belonging to one of N classes. There is a trade-off in this task: certain types of predictions are best handled at lower inference resolution, while others are better handled at higher inference resolution. Fine detail, such as the edges of objects or thin structures, is often better predicted with scaled-up image sizes. At the same time, the prediction of large structures, which requires more global context, is often done better at scaled-down image sizes, because the network’s receptive field can observe more of the necessary context. We refer to this latter issue as class confusion. Examples of both of these cases are presented in Figure 1.

Using multi-scale inference is a common practice to address this trade-off. Predictions are made at a range of scales, and the results are combined with averaging or max pooling. Using averaging to combine multiple scales generally improves results, but it suffers from the problem of combining the best predictions with poorer ones. For example, if for a given pixel the best prediction comes from the 2x scale and a much worse prediction comes from the 0.5x scale, then averaging will combine these predictions, resulting in sub-par output. Max pooling, on the other hand, selects only one of N scales to use for a given pixel, while the optimal answer may be a weighted combination across the different scales of predictions.
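The difference between these combination rules can be illustrated with a small sketch for a single pixel. The two-class scores and the weights below are hypothetical values chosen for illustration, and the function names are not taken from the paper.

```python
def average_pool(preds):
    """Element-wise mean over scales; preds is a list of per-class score lists."""
    n = len(preds)
    return [sum(p[c] for p in preds) / n for c in range(len(preds[0]))]


def max_pool(preds):
    """Element-wise maximum over scales."""
    return [max(p[c] for p in preds) for c in range(len(preds[0]))]


def weighted_sum(preds, weights):
    """Element-wise weighted combination; weights has one entry per scale."""
    return [sum(w * p[c] for w, p in zip(weights, preds))
            for c in range(len(preds[0]))]


def argmax(scores):
    """Index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda c: scores[c])


# Hypothetical scores for one pixel whose true class is 0.
# First entry: 0.5x scale (wrong and fairly confident).
# Second entry: 2.0x scale (right, slightly less confident).
preds = [[0.30, 0.70], [0.65, 0.35]]

avg_class = argmax(average_pool(preds))                    # class 1 (wrong)
max_class = argmax(max_pool(preds))                        # class 1 (wrong)
weighted_class = argmax(weighted_sum(preds, [0.2, 0.8]))   # class 0 (right)
```

In this toy case both averaging and max pooling pick the wrong class, while a per-pixel weighting that favors the 2.0x scale recovers the right one. Producing such weights per pixel is the role of a learned attention mask.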

To address this problem, we adopt an attention mechanism to predict how to combine multi-scale predictions together at a pixel level, similar to the method proposed by Chen et al. [chen2015attention]. We propose a hierarchical attention mechanism by which the network learns to predict a relative weighting between adjacent scales. Because of its hierarchical nature, our method only needs the training pipeline to be augmented with one extra scale, whereas other methods such as [chen2015attention] require each additional inference scale to be explicitly added during the training phase. For example, when the target inference scales for multi-scale evaluation are {0.5, 1.0 and 2.0}, other attention methods require the network to first be trained with all of those scales, resulting in 4.25x (0.5^2 + 2.0^2) extra training cost. Our method only requires adding an extra 0.5x scale during training, which only adds 0.25x (0.5^2) cost. Furthermore, our proposed hierarchical mechanism also provides the flexibility of choosing extra scales at inference time, as compared to previously proposed methods that are limited to using only the training scales during inference.

To achieve state-of-the-art results in Cityscapes, we also adopt an auto-labelling strategy of coarse images in order to increase the variance in the dataset, thereby improving generalization. Our strategy is motivated by multiple recent works, including [xie2019selftraining, arazo2019pseudo, lee2013pseudo]. As opposed to the typical soft-labelling strategy, we adopt hard labelling in order to manage label storage size, which helps to improve training throughput by lowering the disk IO cost.

1.1 Contributions

  • An efficient hierarchical multi-scale attention mechanism that helps with both class confusion and fine detail by allowing the network to learn how to best combine predictions from multiple inference scales

  • A hard-threshold based auto-labelling strategy which leverages unlabelled images and boosts IOU.

  • We achieve state-of-the-art results in Cityscapes (85.1 IOU) and Mapillary Vistas (61.1 IOU)

[Figure 1 panels, left to right: Input images | Prediction at 0.5x Scale | Prediction at 2.0x Scale]

Figure 1: Illustration of common failure modes for semantic segmentation as they relate to inference scale. In the first row, the thin posts are inconsistently segmented in the scaled-down (0.5x) image, but better predicted in the scaled-up (2.0x) image. In the second row, the large road / divider region is better segmented at lower resolution (0.5x).

2 Related Work

Multi-scale context methods.

State-of-the-art semantic segmentation networks use network trunks with low output stride. This allows the networks to resolve fine detail better, but it also has the effect of shrinking the receptive field. This reduction in the receptive field can cause networks to have difficulty with predicting large objects in a scene. Pyramid pooling can counteract the shrunken receptive field by assembling multi-scale context. PSPNet [zhao2017pyramid] uses a spatial pyramid pooling module, which assembles features at multiple scales from the final layer of the network trunk using a sequence of pooling and convolution operations. DeepLab [chen2018encoder] uses Atrous Spatial Pyramid Pooling (ASPP), which employs atrous convolutions with different levels of dilation, thus creating denser features than PSPNet. More recently, ZigZagNet [Lin_2019_CVPR] and ACNet [fu2019adaptive] leverage intermediate features, instead of just the features from the final layer of the network trunk, to create the multi-scale context.

Relational context methods. In practice, pyramid pooling techniques attend to fixed, square context regions because pooling and dilation are typically employed in a symmetric fashion. Furthermore, such techniques tend to be static and not learned. In contrast, relational context methods build context by attending to the relationship between pixels and are not bound to square regions. The learned nature of relational context methods allows context to be built based on image composition. Such techniques can build more appropriate context for non-square semantic regions, such as a long train or a tall thin lamp post. OCRNet [yuan2019objectcontextual], DANET [fu2018dual], CFNet [zhang2019co], OCNet [yuan2018ocnet] and other related work [A2Net, Zhang_2019_ICCV, chen2018graph, NIPS2018_7456, li2018beyond, NIPS2018_7886, li19, huang2018ccnet] use such relationships to build better context.

Multi-scale inference. Both relational and multi-scale context methods [chen2017rethinking, chen2018encoderdecoder, cheng2019panopticdeeplab, yuan2019objectcontextual] use multi-scale evaluation to achieve the best results. There are two common approaches to combining network predictions at multiple scales: average and max pooling, with average pooling being more common. However, average pooling involves equally weighting output from different scales, which may be sub-optimal. To address this issue, [chen2015attention, yang2018attention] use attention to combine multiple scales. Chen et al. [chen2015attention] train an attention head across all scales simultaneously using final features from a neural network. While Chen et al. use attention from a specific layer, Yang et al. [yang2018attention] use a combination of features from different network layers to build better contextual information. However, both of the aforementioned methods share the trait that the network and attention heads are trained with a fixed set of scales. Only those scales may be used at run-time; otherwise the network must be re-trained. We propose a hierarchical attention mechanism that is agnostic to the number of scales at inference time. Furthermore, we show that our proposed hierarchical attention mechanism not only improves performance over average pooling, but also allows us to diagnostically visualize the importance of different scales for classes and scenes. Finally, our method is orthogonal to other attention or pyramid pooling methods such as [chen2018encoderdecoder, sinha2019multiscale, lin2016refinenet, yuan2019objectcontextual, Huang_2019_ICCV, fu2018dual, li2018pyramid], as these methods use a single image scale and perform attention to better combine multi-level features for generating high-resolution predictions.

Auto-labelling. Most recent semantic segmentation work for Cityscapes in particular has utilized the ~20,000 coarsely labelled images as-is for training state-of-the-art models [yuan2018ocnet, semantic_cvpr19]. However, a significant amount of each coarse image is unlabelled due to the coarseness of the labels. To achieve state-of-the-art results on Cityscapes, we adopt an auto-labelling strategy, motivated by Xie et al. [xie2019selftraining], other semi-supervised self-training work in semantic segmentation [Lian_2019_Pyramid, Li_2019_bidirection, Luc2017futureSeg, Zou2018DAClassBalance, Zou_2019_CRST], and other approaches based on pseudo-labels such as [lee2013pseudo, iscen2019label, shi2018transductive, arazo2019pseudo]. We generate dense labels for the coarse images in Cityscapes. Our generated labels have very few unlabelled regions, and thus we are able to take advantage of the full content of the coarse images.

While most image classification auto-labelling work uses continuous or soft labels, we generate hard thresholded labels, for storage efficiency and training speed. With soft labels, a teacher network provides a continuous probability for each of N classes for each pixel of an image, whereas for hard labels a threshold is used to pick a single top class per pixel. Similar to [li2019decoupled, lee2013pseudo], we generate hard dense labels for the coarse Cityscapes images. Examples are shown in Figure 4. Unlike Xie et al. [xie2019selftraining], we do not perform iterative refinement of our labels. Rather, we perform a single iteration of full training of our teacher model with the provided coarse and fine labelled images. After this joint training, we perform auto-labelling of the coarse images, which are then substituted in our teacher training recipe to obtain state-of-the-art test results. Using our generated hard pseudo-labels in combination with our proposed hierarchical attention, we are able to obtain state-of-the-art results on Cityscapes.

3 Hierarchical multi-scale attention

[Figure 2 panels: Explicit Method (Training and Inference: Scale 3, Scale 2, Scale 1) | Our Hierarchical Method (Training: Scale 2, Scale 1; Inference: Scale 3, Scale 2, Scale 1)]

Figure 2: Network architecture. Left and right panels show explicit vs. hierarchical (ours) architectures, respectively. Left: the architecture from [chen2015attention], where the attention for each scale is learned explicitly. Right: our hierarchical attention architecture. Right top: an illustration of our training pipeline, whereby the network learns to predict attention between adjacent scale pairs. Right bottom: inference is performed in a chained/hierarchical manner in order to combine multiple scales of predictions. Lower scale attention determines the contribution of the next higher scale.

Our attention mechanism is conceptually very similar to that of [chen2015attention], where a dense mask is learned for each scale, and the multi-scale predictions are combined by pixel-wise multiplication of the masks with the predictions, followed by pixel-wise summation among the different scales to obtain the final result (see Figure 2). We refer to Chen’s method as explicit. With our hierarchical method, instead of learning all attention masks for each of a fixed set of scales, we learn a relative attention mask between adjacent scales. When training the network, we only train with adjacent scale pairs. As shown in Figure 2, given a set of image features from a single (lower) scale, we predict a dense, pixel-wise relative attention between the two image scales. In practice, to obtain the pair of scaled images, we take a single input image and scale it down by a factor of 2, such that we are left with a 1x scale input and a 0.5x scale input, although any scale-down ratio could be selected. It is important to note that the network input itself is a re-scaled version of the original training image, because we use image scale augmentation when we train. This allows the network to learn to predict relative attention for a range of image scales. When running inference, we can hierarchically apply the learned attention to combine N scales of predictions together, in a chain of computations as shown in Figure 2 and described by the equation below. We give precedence to lower scales and work our way up to higher scales, with the idea that lower scales have more global context and can choose where predictions need to be refined by higher scale predictions.

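More formally, during training a given input image is scaled by a factor $r$, where $r = 0.5$ denotes down-sampling by a factor of 2, $r = 2.0$ denotes up-sampling by a factor of 2, and $r = 1$ denotes no operation. For our training, we choose $r = 0.5$ and $r = 1.0$. The two images with $r = 1$ and $r = 0.5$ are then sent through the shared network trunk, which produces semantic logits $\mathcal{L}$ and also an attention mask $\alpha$ for each scale, which are used to combine the logits $\mathcal{L}$ between scales. Thus, for two-scale training and inference, with $\mathcal{U}$ being the bilinear upsampling operation, and $*$ and $+$ being pixel-wise multiplication and addition respectively, the equation can be formalized as:

$$\mathcal{L}_{(r=1)} = \mathcal{U}\left(\mathcal{L}_{(r=0.5)} * \alpha_{(r=0.5)}\right) + \left(\left(1 - \mathcal{U}\left(\alpha_{(r=0.5)}\right)\right) * \mathcal{L}_{(r=1)}\right)$$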
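The chained combination at inference time can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes that the logits and attention maps of every scale have already been resampled to a common resolution (so the bilinear upsampling step is omitted), that each map is a flat list of pixels for a single class channel, and that the names `fuse_pair` and `fuse_hierarchical` are hypothetical.

```python
def fuse_pair(lower_logits, lower_attn, higher_logits):
    """Blend two adjacent scales pixel by pixel.

    The attention predicted at the lower scale weights the lower-scale
    logits, and (1 - attention) weights the higher-scale logits.
    """
    return [a * lo + (1.0 - a) * hi
            for lo, a, hi in zip(lower_logits, lower_attn, higher_logits)]


def fuse_hierarchical(predictions):
    """Chain fuse_pair over any number of scales.

    predictions: list of (scale, logits, attn) tuples. The attention of
    the highest scale is never used, so it may be None.
    """
    # Start from the highest scale and fold in each lower scale in turn,
    # so that a lower scale decides how much of the result above it survives.
    ordered = sorted(predictions, key=lambda p: p[0], reverse=True)
    fused = ordered[0][1]
    for _scale, logits, attn in ordered[1:]:
        fused = fuse_pair(logits, attn, fused)
    return fused


# Hypothetical two-pixel example with three inference scales.
preds = [
    (0.5, [0.0, 1.0], [0.25, 1.0]),
    (1.0, [2.0, 5.0], [0.5, 0.0]),
    (2.0, [4.0, 9.0], None),
]
fused = fuse_hierarchical(preds)
```

Because each step only needs the attention of the lower scale in a pair, the same function accepts a different set of scales without any change, which mirrors the inference-time flexibility described in the text.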


There are two advantages to using our proposed strategy:

  • At inference time, we can now flexibly select scales. Adding new scales such as 0.25x or 2.0x to a model trained with 0.5x and 1.0x is possible because our proposed attention mechanism chains together in a hierarchical way. This differs from previously proposed methods, which are limited to using the same scales that were used during model training.

  • This hierarchical structure allows us to improve on the training efficiency as compared to the explicit method. With the explicit method, if using scales 0.5, 1.0, 2.0, the training cost is 0.5^2 + 1.0^2 + 2.0^2 = 5.25, relative to single-scale training. With our hierarchical method the training cost is only 0.5^2 + 1.0^2 = 1.25.
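The training-cost comparison can be reproduced with a few lines of arithmetic. The sketch assumes that the cost of a forward and backward pass grows with the number of pixels, so that a scale r costs about r^2 times the 1.0x pass; the function name is illustrative.

```python
def relative_training_cost(scales):
    """Training cost relative to a single 1.0x pass, assuming the cost of
    each scale is proportional to its pixel count (scale squared)."""
    return sum(s ** 2 for s in scales)


explicit_cost = relative_training_cost([0.5, 1.0, 2.0])  # all inference scales
hierarchical_cost = relative_training_cost([0.5, 1.0])   # one adjacent pair
ratio = explicit_cost / hierarchical_cost

print(explicit_cost, hierarchical_cost, round(ratio, 2))
```

Under this assumption the explicit method costs 5.25 units and the hierarchical method 1.25 units, a ratio of about 4.2, in line with the roughly 4x figure quoted in the abstract.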

3.1 Architecture

Backbone: For the ablation studies in this section, we use ResNet-50 [he2016deep] (configured with output stride of 8) as the trunk for our network. For state-of-the-art results, we use a larger, more powerful trunk, HRNet-OCR [yuan2019objectcontextual].

Semantic Head: Semantic predictions are performed by a dedicated fully convolutional head consisting of 3x3 conv → BN → ReLU → 3x3 conv → BN → ReLU → 1x1 conv. The final convolution outputs num_classes channels.

Attention Head: Attention predictions are done using a separate head that is structurally identical to the semantic head, except for the final convolutional output, which outputs a single channel. When using ResNet-50 as the trunk, the semantic and attention heads are fed with features from the final stage of ResNet-50. When using HRNet-OCR, the semantic and attention heads are fed with features out of the OCR block. With HRNet-OCR, there also exists an auxiliary semantic head, which takes its features directly from the HRNet trunk, before OCR. This head consists of 1x1 conv → BN → ReLU → 1x1 conv. After attention is applied to the semantic logits, the predictions are upsampled to the target image size with bilinear upsampling.

Method Eval scales () IOU FLOPS (relative) Minibatch training time (sec)
Single Scale x
AvgPool x
AvgPool x
Explicit x
Hierarchical (Ours) 51.6 x
Hierarchical (Ours) 52.2 x
Table 1: Comparison of our hierarchical multi-scale attention method vs. other approaches on Mapillary validation set. The network architecture is DeepLab V3+ with a ResNet-50 trunk. Eval scales: scales used for multi-scale evaluation. FLOPS: the relative amount of flops consumed by the network for training. Minibatch time: measured training minibatch time on an Nvidia Tesla V100 GPU.

3.2 Analysis

In order to evaluate the effectiveness of our multi-scale attention approach, we train networks with a DeepLab V3+ architecture and ResNet-50 trunk. In Table 1, we show that our hierarchical attention approach results in better accuracy (51.6) as compared to the baseline averaging approach (49.4) or the explicit approach (51.4). We also observe significantly better results with our approach when adding the 0.25x scale. Unlike the explicit method, our method does not require re-training the network when using the additional 0.25x scale. This flexibility at inference time is a key benefit of our method. We can train once but evaluate flexibly with a range of different scales.

Furthermore, we also observe that with the baseline averaging multi-scale method, simply adding the 0.25x scale is detrimental to accuracy, as it causes a reduction in IOU, whereas for our method, adding the extra 0.25x scale boosts accuracy by another 0.6 IOU (51.6 to 52.2 in Table 1). With the baseline averaging method, the 0.25x prediction is so coarse that, when it is averaged into the other scales, we observe a drop in IOU for classes such as lane marking, man-hole, phone-booth, street-light, traffic light and traffic sign (back and front), and bike racks, among others. The coarseness of the prediction hurts the edges and fine detail. However, with our proposed attention method, adding the 0.25x scale improves our result, since our network is able to apply the 0.25x prediction in the most appropriate way, staying away from using it around edges. Examples of this can be observed in Figure 3, where for the fine posts in the image on the left, very little of the posts are attended to by the 0.5x prediction, but a very strong attention signal is present in the 2.0x scale. Conversely, for the very large region on the right, the attention mechanism learns to rely mostly on the lower scale (0.5x) and very little on the erroneous 2.0x prediction.

3.2.1 Single vs. dual-scale features

While the architecture we settled upon feeds the attention head with features from only the lower of two adjacent image scales (see Figure 2), we also experimented with training the attention head with features from both adjacent scales. We did not observe a significant difference in accuracy, so we settled on a single set of features.

[Figure 3 panels: Input images, followed by semantic and attention predictions at each of three scales]

Figure 3: Semantic and attention predictions at every scale level for two different scenes. The scene on the left illustrates a fine detail problem while the scene on the right illustrates a large region segmentation problem. A white color for attention indicates a high value (close to 1.0). The attention values for a given pixel across all scales sum to 1.0. Left: The thin road-side posts are best resolved at 2x scale, and the attention successfully attends more to that scale than to other scales, as evidenced by the white color for the posts in the 2x attention image. Right: The large road/divider region is best predicted at 0.5x scale, and the attention successfully focuses most heavily on the 0.5x scale for that region.
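One way to see why the per-pixel attention values across scales sum to 1.0 is to expand the chained combination into an effective weight per scale. The sketch below assumes the chaining described in Section 3, where each lower scale keeps a fraction equal to its attention and passes the remainder up to the next higher scale; the function name and the example attention values are hypothetical.

```python
def effective_weights(attn_low_to_high):
    """Effective per-scale weights at one pixel.

    attn_low_to_high: relative attention values for every scale except
    the highest, ordered from the lowest scale upwards.
    Returns one weight per scale, lowest scale first.
    """
    weights = []
    remaining = 1.0  # share of the final prediction not yet assigned
    for a in attn_low_to_high:
        weights.append(remaining * a)
        remaining *= (1.0 - a)
    weights.append(remaining)  # the highest scale takes what is left
    return weights


# Hypothetical attention at one pixel: 0.25 at 0.5x and 0.5 at 1.0x.
weights = effective_weights([0.25, 0.5])  # weights for 0.5x, 1.0x, 2.0x
total = sum(weights)
```

For these values the three scales receive weights 0.25, 0.375 and 0.375. The total is 1.0 for any attention values in [0, 1], because each step only redistributes the remaining share.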

4 Auto Labelling on Cityscapes

Inspired by recent work on auto-labelling for image classification tasks [xie2019selftraining] and [tarvainen2017mean], we adopt an auto-labelling strategy for Cityscapes to boost the effective dataset size and label quality. In Cityscapes, there are 20,000 coarsely labelled images to go along with the 3,500 finely labelled images. The label quality of the coarse images is very modest, and they contain a large amount of unlabelled pixels (see Figure 4). By using our auto-labelling approach, we can improve the label quality, which in turn helps the model IOU.

Original image Original coarse label Auto-generated coarse label
Figure 4: Example of our auto-generated coarse image labels. Auto-generated coarse labels (right) provide finer detail of labelling than the original ground truth coarse labels (middle). This finer labelling improves the distribution of the labels since both small and large items are now represented, as opposed to primarily large items.

A common technique for auto-labelling in image classification is to use soft or continuous labels, whereby a teacher network provides a target (soft) probability for each of N classes for every pixel of every image. A challenge of this approach is disk space and training speed: it costs roughly 3.2TB in disk space to store the labels: 20000 images * 2048 w * 1024 h * 19 classes * 4B = 3.2TB. Even if we chose to store such labels, reading such a volume of labels during training would likely slow training considerably.
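The storage estimate above can be checked directly. The soft-label figures come from the text; the hard-label line is an added comparison that assumes one class index stored in a single byte per pixel, before any image compression.

```python
NUM_IMAGES = 20_000
WIDTH, HEIGHT = 2048, 1024
NUM_CLASSES = 19
BYTES_PER_FLOAT32 = 4

# Soft labels: one float32 probability per class, per pixel, per image.
soft_bytes = NUM_IMAGES * WIDTH * HEIGHT * NUM_CLASSES * BYTES_PER_FLOAT32

# Assumption: a hard label is one class index in a single byte per pixel.
hard_bytes = NUM_IMAGES * WIDTH * HEIGHT * 1

print(f"soft labels: {soft_bytes / 1e12:.2f} TB")
print(f"hard labels: {hard_bytes / 1e9:.1f} GB")
```

Under that assumption, hard labels need 76 times less space than soft labels (19 classes times 4 bytes), which is the storage motivation for the hard labelling strategy.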

Instead, we adopt a hard labelling strategy, whereby for a given pixel, we select the top class prediction of the teacher network. We threshold the label based on teacher network output probability. Teacher predictions that exceed the threshold become true labels, otherwise the pixel is labelled as ignore class. In practice we use a threshold of 0.9.
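A minimal sketch of this thresholding rule follows. The ignore value of 255 and the function names are assumptions made for illustration, and the teacher probabilities are hypothetical.

```python
IGNORE_LABEL = 255  # assumption: value used to mark pixels excluded from the loss


def hard_label(class_probs, threshold=0.9):
    """Return the top class if its probability exceeds the threshold,
    otherwise the ignore label."""
    best = max(range(len(class_probs)), key=lambda c: class_probs[c])
    return best if class_probs[best] > threshold else IGNORE_LABEL


def hard_label_map(prob_map, threshold=0.9):
    """Apply hard_label to every pixel of a rows x cols x classes nested list."""
    return [[hard_label(p, threshold) for p in row] for row in prob_map]


# Hypothetical teacher output for a 2 x 2 image with 3 classes.
teacher_probs = [
    [[0.95, 0.03, 0.02], [0.50, 0.45, 0.05]],
    [[0.05, 0.92, 0.03], [0.10, 0.10, 0.80]],
]
labels = hard_label_map(teacher_probs)
```

Only the two confident pixels receive a class label; the other two are set to the ignore value.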

5 Results

5.1 Implementation Protocol

In this section, we describe our implementation protocol in detail.

Training details

Our models are trained using PyTorch on Nvidia DGX servers containing 8 GPUs per node with mixed precision, distributed data parallel training and synchronous batch normalization. We use Stochastic Gradient Descent (SGD) for our optimizer, with a batch size of per GPU, momentum and weight decay in training. We apply the “polynomial” learning rate policy [liu2015parsenet]. We use RMI [zhao2019rmi] as the primary loss function under default settings, and we use cross-entropy for the auxiliary loss function. For Cityscapes, we use a poly exponent of , an initial learning rate of , and train for epochs across DGX nodes. For Mapillary, we use a poly exponent of , an initial learning rate of , and train for epochs across DGX nodes. As in [semantic_cvpr19], we use class uniform sampling in the data loader to equally sample from each class, which helps improve results when there is unequal data distribution.
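For reference, the "polynomial" learning rate policy is commonly written as base_lr * (1 - step / max_steps) ^ power. The sketch below uses that common form with placeholder values; the base learning rate, exponent and step count shown are illustrative only and are not the settings used in this work.

```python
def poly_lr(base_lr, step, max_steps, power):
    """Polynomial decay from base_lr at step 0 down to 0 at max_steps."""
    return base_lr * (1.0 - step / max_steps) ** power


# Placeholder values for illustration only.
BASE_LR = 0.1
POWER = 2.0
MAX_STEPS = 100

schedule = [poly_lr(BASE_LR, s, MAX_STEPS, POWER) for s in (0, 50, 100)]
```

A larger exponent makes the learning rate fall faster early in training, while an exponent of 1.0 gives a linear decay.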

Data augmentation: We employ Gaussian blur, color augmentation, random horizontal flip and random scaling (x - x) on the input images to augment the dataset during the training process. We use a crop size of x for Cityscapes and x for Mapillary.

MS Attention Auto-labeling IOU Gain
✓ - 85.4 0.5
- ✓ 86.0 1.1
✓ ✓ 86.3 1.4
Table 2: Ablation study on Cityscapes validation set. The baseline method uses HRNet-OCR as the architecture. MS Attention is our proposed multi-scale attention method. Auto-labeling indicates whether we are using automatically generated or ground truth coarse labels during training. A combination of both techniques yields the best results.

5.1.1 Results on Cityscapes

Cityscapes [Cordts2016Cityscapes] is a large dataset that labels 19 semantic classes across 5000 high resolution images. For Cityscapes, we use HRNet-OCR as the trunk along with our proposed multi-scale attention method. We use RMI as the loss for the main segmentation head, but for the auxiliary segmentation head we use cross entropy, because we found that using RMI loss led to reduced training accuracy deep into the training. Our best results are achieved by first pre-training on the larger Mapillary dataset, and then training on Cityscapes. For the Mapillary pre-training task, we do not train with attention. Our state-of-the-art recipe on Cityscapes was achieved using train + val images in addition to the auto-labelled coarse images. With 50% probability we sample from the train + val set; otherwise we sample from the auto-labelled pool of images. At inference time, we use three scales and image flipping.
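The 50/50 sampling between the two image pools can be sketched as follows. The pool contents, the function name and the seeded generator are hypothetical and serve only to illustrate the rule.

```python
import random


def sample_image(fine_pool, auto_labelled_pool, rng, p_fine=0.5):
    """Pick the fine (train + val) pool with probability p_fine,
    otherwise the auto-labelled pool, then draw one image uniformly."""
    pool = fine_pool if rng.random() < p_fine else auto_labelled_pool
    return rng.choice(pool)


# Hypothetical image identifiers standing in for the two pools.
fine = ["fine_%d" % i for i in range(4)]
auto = ["auto_%d" % i for i in range(16)]

rng = random.Random(0)  # seeded so the sketch is repeatable
draws = [sample_image(fine, auto, rng) for _ in range(10_000)]
fine_fraction = sum(d.startswith("fine_") for d in draws) / len(draws)
```

Sampling the pool first, rather than sampling uniformly over all images, keeps the finely labelled images at about half of each epoch even though the auto-labelled pool is much larger.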

We conduct ablation studies on the Cityscapes validation set, as shown in Table 2. Multi-scale attention yields a gain of 0.5% IOU over the baseline HRNet-OCR architecture with average pooling. Auto-labelling provides a boost of 1.1% IOU over the baseline. Combining both techniques results in a total gain of 1.4% IOU.

Finally, in Table 3 we show results of our method as compared to other top-performing methods on the Cityscapes test set. Our method achieves a score of 85.1, which is the best reported Cityscapes test score of all methods, beating the best previous score (84.5) by 0.6 IOU. In addition, our method has the top per-class scores in all but three classes. Some results are visualized in Figure 5.

Method road swalk build. wall fence pole tlight tsign veg. terrain sky person rider car truck bus train mcycle bicycle mIoU
VPLR [semantic_cvpr19] 98.8 87.8 94.2 64.1 65.0 72.4 79.0 82.8 94.2 74.0 96.1 88.2 75.4 96.5 78.8 94.0 91.6 73.7 79.0 83.5
HRNet-OCR ASPP [yuan2019objectcontextual] 98.8 88.3 94.3 66.9 66.7 73.3 80.2 83.0 94.2 74.1 96.0 88.5 75.8 96.5 78.5 91.8 90.1 73.4 79.3 83.7
Panoptic Deeplab [cheng2019panopticdeeplab] 98.8 88.1 94.5 68.1 68.1 74.5 80.5 83.5 94.2 74.4 96.1 89.2 77.1 96.5 78.9 91.8 89.1 76.4 79.3 84.2
iFLYTEK-CV 98.8 88.4 94.4 68.9 66.8 73.0 79.7 83.3 94.3 74.3 96.0 88.8 76.3 96.6 84.0 94.3 91.7 74.7 79.3 84.4
SegFix [yuan2020segfix] 98.8 88.3 94.3 67.9 67.8 73.5 80.6 83.9 94.3 74.4 96.0 89.2 75.8 96.8 83.6 94.1 91.2 74.0 80.0 84.5

Ours 99.0 89.2 94.9 71.6 69.1 75.8 82.0 85.2 94.5 75.0 96.3 90.0 79.4 96.9 79.8 94.0 85.8 77.4 81.4 85.1
Table 3: Comparison vs other methods on the Cityscapes test set. Best results in each class are represented in bold.

5.1.2 Results on Mapillary Vistas

Mapillary Vistas [MVD2017] is a large dataset containing high resolution images annotated into 66 object categories. For Mapillary, we used HRNet-OCR as the trunk along with our proposed multi-scale attention method. Because Mapillary images can have very high and varied resolutions, we resize the images such that the long edge is 2177, as was done in [cheng2019panopticdeeplab]. We initialize the HRNet part of the model with weights from HRNet trained on ImageNet classification. Because of the greater memory requirements for the 66 classes in Mapillary, we decreased the crop size to 1856 x 1024. In Table 4 we show results of our method on the Mapillary validation set. Our single-model based method achieves 61.1, which is 2.4 higher than the next closest method, Panoptic DeepLab [cheng2019panopticdeeplab], which uses an ensemble of models to achieve 58.7.
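The long-edge resizing mentioned above amounts to a single scale factor applied to both dimensions. The sketch below only computes the target size; rounding to the nearest integer and the function name are assumptions, since the text does not specify them.

```python
def resize_to_long_edge(width, height, long_edge=2177):
    """Return (new_width, new_height) such that the longer side equals
    long_edge and the aspect ratio is preserved (rounded to integers)."""
    scale = long_edge / max(width, height)
    return round(width * scale), round(height * scale)


# Hypothetical input resolutions.
landscape = resize_to_long_edge(4000, 3000)
portrait = resize_to_long_edge(3000, 4000)
```

Using one factor for both sides keeps objects undistorted, and it works the same way for landscape and portrait images.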

Method mIOU
Seamless [Porzi_2019_CVPR] 50.4
DeeperLab [yang2019deeperlab] 55.3
Panoptic DeepLab  [cheng2019panopticdeeplab] 56.8
Panoptic DeepLab ( Ensemble )  [cheng2019panopticdeeplab] 58.7
Ours 61.1
Table 4: Comparison of results on Mapillary validation set. Best results in each class are represented in bold.
Input images Ground truth Our network prediction
Figure 5: Qualitative Results. From left to right: input, ground truth, our method on Cityscapes.

6 Conclusion

In this work, we present a hierarchical multi-scale attention approach for semantic segmentation. Our approach yields an improvement in segmentation accuracy while also being memory and computationally efficient, both of which are practical concerns. Training efficiency limits how fast research can be done while GPU memory efficiency limits how large of a crop networks can be trained with, which can also limit network accuracy. We empirically show consistent improvement in Cityscapes and Mapillary using our proposed approach.

Acknowledgements: We’d like to thank Sanja Fidler, Kevin Shih, Tommi Koivisto and Timo Roman for helpful discussions.