
CAR: Class-aware Regularizations for Semantic Segmentation

by   Ye Huang, et al.

Recent segmentation methods, such as OCR and CPNet, utilize "class-level" information in addition to pixel features and have achieved notable success in boosting the accuracy of existing network modules. However, the extracted class-level information is simply concatenated to pixel features, without being explicitly exploited for better pixel representation learning. Moreover, these approaches learn soft class centers based on coarse mask predictions, which is prone to error accumulation. In this paper, aiming to use class-level information more effectively, we propose a universal Class-Aware Regularization (CAR) approach to optimize the intra-class variance and inter-class distance during feature learning, motivated by the fact that humans can recognize an object by itself no matter which other objects it appears with. Three novel loss functions are proposed. The first encourages more compact class representations within each class, the second directly maximizes the distance between different class centers, and the third further pushes the distance between inter-class centers and pixels. Furthermore, the class center in our approach is directly generated from ground truth instead of from the error-prone coarse prediction. Our method can be easily applied to most existing segmentation models during training, including OCR and CPNet, and can largely improve their accuracy at no additional inference overhead. Extensive experiments and ablation studies conducted on multiple benchmark datasets demonstrate that the proposed CAR can boost the accuracy of all baseline models by up to 2.23% mIOU. The code is available on GitHub.



1 Introduction

Semantic segmentation, which assigns a class label to each pixel in an image, is a fundamental task in computer vision. It has been widely used in many real-world scenarios that require the results of scene parsing for further processing, e.g., image editing, autopilot, etc. It also benefits many other computer vision tasks such as object detection and depth estimation.

After the early work FCN [15] which used fully convolutional networks to make the dense per-pixel segmentation task more efficient, many works [34, 2] have been proposed which have greatly advanced the segmentation accuracy on various benchmark datasets. Among these methods, many of them have focused on better fusing spatial domain context information to obtain more powerful feature representations (termed pixel features in this work) for the final per-pixel classification. For example, VGG [20] utilized large square context information by successfully training a very deep network, and DeepLab [2] and PSPNet [34] utilized multi-scale features with the ASPP and PPM modules.

Recently, methods based on dot-product self-attention (SA) [24, 6, 30, 33, 10, 35, 5, 19, 21] have become very popular since they can easily capture the long-range relationship between pixels. SA aggregates information dynamically (by different attention maps for different inputs) and selectively (using weighted averaging spatial features according to their similarity scores). Using multi-scale and self-attention techniques during spatial information aggregation has worked very well (e.g., 80% mIOU on Cityscapes [16] (single-scale w/o flipping)).

As complements to the above methods, many recent works have proposed various modules to utilize class-level contextual information. The class-level information is often represented by the class center/context prior which are the mean features of each class in the images. OCR [29] and ACFNet [31] extract “soft” class centers according to the predicted coarse segmentation mask by using the weighted sum. CPNet [28] proposed a context prior map/affinity map, which indicates if two spatial locations belong to the same class, and used this predicted context prior map for feature aggregation. However, they [29, 31, 28] simply concatenated these class-level features with the original pixel features for the final classification.

In this paper, we also focus on utilizing class level information. Instead of focusing on how to better extract class-level features like the existing methods [29, 31, 28], we use the simple, but accurate, average feature according to the GT mask, and focus on maximizing the inter-class distance during feature learning. This is because it mirrors how humans can robustly recognize an object by itself no matter what other objects it appears with.

Learning more separable features makes the features of a class less dependent upon other classes, resulting in improved generalization ability, especially when the training set contains only limited and biased class combinations (e.g., cows and grass, boats and beach). Fig. 1 illustrates an example of such a problem, where the classification of dog and sheep depends on the classification of the grass class, and both are mis-classified as cow. In comparison, networks trained with our proposed CAR successfully generalize to these unseen class combinations.

Figure 1: The concept of the proposed CAR. Our CAR optimizes existing models with three regularization targets: 1) reducing pixels’ intra-class distance, 2) reducing inter-class center-to-center dependency, and 3) reducing pixels’ inter-class dependency. As highlighted in this example (indicated with a red dot in the image), with our CAR, the grass class does not affect the classification of dog/sheep as much as before, and hence successfully avoids previous (w/o CAR) mis-classification.

To better achieve this goal, we propose CAR, a class-aware regularizations module that optimizes the class center (intra-class) and inter-class dependencies during feature learning. Three loss functions are devised: the first encourages more compact class representations within each class, and the other two directly maximize the distance between different classes. Specifically, an intra-class center-to-pixel loss (termed “intra-c2p”, Eq. (3)) is first devised to produce a more compact representation within a class by minimizing the distance between all pixels and their class center. In our work, a class center is calculated as the averaged feature of all pixels belonging to the same class according to the GT mask. More compact intra-class representations leave a relatively large margin between classes, thus contributing to more separable representations. Then, an inter-class center-to-center loss (“inter-c2c”, Eq. (6)) is devised to maximize the distance between any two different class centers. This inter-class center-to-center loss alone does not necessarily produce separable representations for every individual pixel. Therefore, a third inter-class center-to-pixel loss (“inter-c2p”, Eq. (13)) is proposed to enlarge the distance between every class center and all pixels that do not belong to the class.

In summary, our contributions in this work are:

  1. We propose a universal class-aware regularization module that can be integrated into various segmentation models to largely improve the accuracy.

  2. We devise three novel regularization terms to achieve more separable and less class-dependent feature representations by minimizing the intra-class variance and maximizing the inter-class distance.

  3. We calculate the class centers directly from ground truth during training, thus avoiding the error accumulation issue of the existing methods and introducing no computational overhead during inference.

  4. We provide image-level feature-similarity heatmaps to visualize that the inter-class features learned with our CAR are indeed less related to each other.

We conduct extensive experiments on many baselines and demonstrate that our CAR can improve all SOTA methods substantially, including CNN and Transformer based models. The complete code, implemented with NVIDIA’s deterministic technology for fair comparison, is available on GitHub.

2 Related Work

Self-Attention. Dot-product self-attention proposed in [24, 22] has been widely used in semantic segmentation [6, 30, 33, 35]. Specifically, self-attention determines the similarity between a pixel and every other pixel in the feature map by calculating their dot product, followed by softmax normalization. With this attention map, the feature representation of a given pixel is enhanced by aggregating features from the whole feature map, weighted by the aforementioned attention values, thus easily taking long-range relationships into consideration and yielding boosted performance. In self-attention, in order to achieve correct pixel classification, the representations of pixels belonging to the same class should be similar so that they gain greater weights in the final representation augmentation.
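To make the mechanism concrete, the following NumPy sketch computes the attention map and the aggregated features for a flattened feature map. It is a simplification under stated assumptions: the learned query, key and value projections used in practice are omitted, and the function name `self_attention` and the random toy input are illustrative, not taken from any cited implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(features):
    """features: (num_pixels, channels) spatially flattened feature map."""
    # Dot-product similarity between every pixel and every other pixel.
    scores = features @ features.T          # (num_pixels, num_pixels)
    attention = softmax(scores, axis=-1)    # each row sums to 1
    # Every pixel becomes a similarity-weighted average of all pixel features.
    return attention @ features, attention

# Toy input: 6 pixels with 4 channels each (hypothetical values).
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))
out, attn = self_attention(feats)
```

Because each output pixel is a weighted mixture of all pixels, pixels of the same class need similar features to receive large weights, which is the property the class-aware losses later encourage explicitly.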

Class Center. In 2019 [31, 29], the concept of class center was introduced to describe the overall representation of each class from the categorical context perspective. In these approaches, the center representation of each class was determined by calculating the dot product of the feature map and the coarse prediction (i.e., weighted average) from an auxiliary task branch, supervised by the ground truth [34]. After that, those intra-class centers are assigned to the corresponding pixels on feature map. Furthermore, in 2020 [28], a learnable kernel and one-hot ground truth were used to separate the intra-class center from inter-class center, and then concatenated with the original feature representation.

All of these works [29, 31, 28] have focused on extracting the intra (inter) class centers, but they then simply concatenated the resultant class centers with the original pixel representations to produce the final logits. We argue that the categorical context information can be utilized in a more effective way so as to reduce the inter-class dependency.

To this end, we propose a CAR approach, where the extracted class center is used to directly regularize the feature extraction process so as to boost the differentiability of the learned feature representations (see Fig. 1) and reduce their dependency on other classes. Fig. 2 contrasts the two different designs. More details of the proposed CAR are provided in Sect. 3.

(a) Design of OCR, ACFNet and CPNet
(b) Our CAR
Figure 2: The difference between the proposed CAR and previous methods that use class-level information. Previous models focus on extracting class center while using simple concatenation of the original pixel feature and the class/context feature for later classification. In contrast, our CAR uses direct supervision related to class center as regularization during training, resulting in small intra-class variance and low inter-class dependency. See Fig. 1 and Sec. 3 for details.

Inter-Class Reasoning. Recently, [4, 12] studied the class dependency as a dataset prior and demonstrated that inter-class reasoning could improve the classification performance. For example, a car usually does not appear in the sky, and therefore the classification of sky can help reduce the chance of mis-classifying an object in the sky as a car. However, due to the limited training data, such class-dependency prior may also contain bias, especially when the desired class relation rarely appears in the training set.

Fig. 1 shows such an example. In the training set, cow and grass are dependent on each other. However, as shown in this example, when there is a dog or sheep standing on the grass, the class dependency learned from the limited training data may result in errors and predict the target into a class that appears more often in the training data, i.e., cow in this case. In our CAR, we design inter-class and intra-class loss functions to reduce such inter-class dependency and achieve more robust segmentation results.

3 Methodology

Figure 3: The proposed CAR approach. CAR can be inserted into various segmentation models, right before the logit prediction module (A1-A4). CAR contains three regularization terms, including (C) intra-class center-to-pixel loss (Sec. 3.2.2), (D) inter-class center-to-center loss (Sec. 3.3.2), and (E) inter-class center-to-pixel loss (Sec. 3.3.3).

3.1 Extracting Class Centers from Ground Truth

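Denote a feature map and its corresponding resized one-hot encoded ground-truth mask as $\mathbf{X} \in \mathbb{R}^{H \times W \times C}$ and $\mathbf{Y} \in \mathbb{R}^{H \times W \times N_{class}}$, respectively, where $H$, $W$ and $C$ denote the image's height, width and number of channels, and $N_{class}$ denotes the number of classes. We first get the spatially flattened class mask $\mathbf{Y}_{flat} \in \mathbb{R}^{HW \times N_{class}}$ and flattened feature map $\mathbf{X}_{flat} \in \mathbb{R}^{HW \times C}$. Then, the class center (termed class center in [31] and object region representations in [29]), which is the average of all pixel features of a class, can be calculated by:

$$\boldsymbol{\mu}_{image} = \frac{\mathbf{Y}_{flat}^{T} \cdot \mathbf{X}_{flat}}{\mathbf{N}_{non\text{-}zero}} \in \mathbb{R}^{N_{class} \times C}, \qquad (1)$$

where $\mathbf{N}_{non\text{-}zero}$ denotes the number of non-zero values in the corresponding map of the ground-truth mask $\mathbf{Y}$. In our experiments, to alleviate the negative impact of noisy images, we calculate the class centers using all the training images in a batch, and denote them as $\boldsymbol{\mu}_{batch}$ (we use $\boldsymbol{\mu} = \boldsymbol{\mu}_{batch}$ and omit the subscript for clarity; see the appendix for details).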
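The class-center computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `class_centers`, the channel-last layout, and the choice to return a zero vector for classes absent from the batch are assumptions made for the example, and ignore labels are not handled here.

```python
import numpy as np

def class_centers(features, labels, num_classes):
    """
    features: (B, H, W, C) feature maps for a batch.
    labels:   (B, H, W) integer ground-truth class ids.
    Returns a (num_classes, C) array of per-class mean features.
    """
    channels = features.shape[-1]
    x_flat = features.reshape(-1, channels)             # (B*H*W, C)
    y_flat = np.eye(num_classes)[labels.reshape(-1)]    # (B*H*W, num_classes), one-hot
    sums = y_flat.T @ x_flat                            # per-class feature sums
    counts = y_flat.sum(axis=0)[:, None]                # pixels per class
    # Assumption: classes with no pixels keep a zero center instead of dividing by 0.
    return sums / np.maximum(counts, 1.0)

# Toy batch: one 2x2 image with 2 channels and 3 possible classes.
features = np.array([[[[1.0, 1.0], [3.0, 3.0]],
                      [[10.0, 0.0], [20.0, 0.0]]]])
labels = np.array([[[0, 0], [1, 1]]])
centers = class_centers(features, labels, num_classes=3)
```

Flattening over the batch axis as well as the spatial axes gives one center per class for the whole batch, in line with the batch-level centers used in the experiments.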

3.2 Reducing Intra-Class Feature Variance

3.2.1 Motivation.

More compact intra-class representation can lead to a relatively larger margin between classes, and therefore result in more separable features. In order to reduce the intra-class feature variance, existing works [24, 6, 35, 28, 10, 30] usually use self-attention to calculate the dot-product similarity in spatial space to encourage similar pixels to have a compact distance implicitly. For example, the self-attention in [24] implicitly pushed the feature representation of pixels belonging to the same class to be more similar to each other than those of pixels belonging to other classes. In our work, we devise a simple intra-class center-to-pixel loss to guide the training, which can achieve this goal very effectively and produce improved accuracy.

3.2.2 Intra-class Center-to-pixel Loss.

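We define a simple but effective intra-class center-to-pixel loss to suppress the intra-class feature variance by penalizing a large distance between a pixel feature and its class center. The intra-class center-to-pixel loss $\mathcal{L}_{intra\text{-}c2p}$ is defined by:

$$\mathcal{L}_{intra\text{-}c2p} = f_{mse}(\mathbf{D}_{intra\text{-}c2p}), \qquad (2)$$

where

$$\mathbf{D}_{intra\text{-}c2p} = (1 - \boldsymbol{\sigma})\,\lvert \mathbf{Y}_{flat} \cdot \boldsymbol{\mu} - \mathbf{X}_{flat} \rvert. \qquad (3)$$

Here $\mathbf{X}_{flat}$ and $\mathbf{Y}_{flat}$ are the flattened feature map and class mask, and $\boldsymbol{\mu}$ denotes the class centers from Sec. 3.1. In Eq. (3), $\boldsymbol{\sigma}$ is a spatial mask indicating pixels being ignored (i.e., ignore label), and $\mathbf{Y}_{flat} \cdot \boldsymbol{\mu}$ distributes the class centers to the corresponding positions in each image. Thus, our intra-class loss pushes the pixel representations toward their corresponding class centers, using the mean squared error (MSE) of the distance in Eq. (3).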
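A small NumPy sketch of this loss follows. It is written under assumptions: the name `intra_c2p_loss` is hypothetical, the ignore mask is a 0/1 vector with 1 marking ignored pixels, and the mean is taken over all entries (including ignored ones), which may differ from the normalization used in the released code.

```python
import numpy as np

def intra_c2p_loss(x_flat, y_flat, centers, ignore_mask):
    """
    x_flat:      (N, C) pixel features.
    y_flat:      (N, K) one-hot ground-truth labels.
    centers:     (K, C) class centers.
    ignore_mask: (N,) with 1.0 for pixels carrying the ignore label.
    """
    # Distribute each class center to the pixels that belong to that class.
    target = y_flat @ centers                   # (N, C)
    keep = (1.0 - ignore_mask)[:, None]         # zero out ignored pixels
    diff = keep * (target - x_flat)
    # Mean squared error between pixel features and their class centers.
    return float(np.mean(diff ** 2))

# Toy example: three pixels, two classes (hypothetical values).
x_flat = np.array([[1.0, 1.0], [3.0, 3.0], [10.0, 0.0]])
y_flat = np.eye(2)[[0, 0, 1]]
centers = np.array([[2.0, 2.0], [10.0, 0.0]])
loss_all = intra_c2p_loss(x_flat, y_flat, centers, np.zeros(3))
loss_ignore_first = intra_c2p_loss(x_flat, y_flat, centers, np.array([1.0, 0.0, 0.0]))
```

The loss is zero exactly when every non-ignored pixel feature coincides with its class center, which is the compactness property the regularizer targets.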

3.3 Maximizing Inter-class Separation

3.3.1 Motivation.

Humans can robustly recognize an object by itself regardless of which other objects it appears with. Conversely, if a classifier heavily relies on the information from other classes to determine the classification result, it will easily produce wrong classification results when a rather rare class combination appears during inference. Maximizing inter-class separation, or in other words, reducing the inter-class dependency, can therefore help the network generalize better, especially when the training set is small or biased. As shown in Fig. 1, the dog and sheep are mis-classified as cow because cow and grass appear together more often in the training set. To improve the robustness of the model, we propose to reduce this inter-class dependency. To this end, the following two loss functions are defined.

3.3.2 Inter-class Center-to-center Loss.

The first loss function is to maximize the distance between any two different class centers. Inspired by the center loss used in face recognition [25], we propose to reduce the similarity between class centers, which are the averaged features of each class calculated according to the GT mask. The inter-class relation is defined by the dot-product similarity [22] between any two classes as:


Moreover, since we only need to constrain the inter-class distance, only the non-diagonal elements are retained for the later loss calculation as:


We only penalize larger similarity values between any two different classes than a pre-defined threshold , i.e.,


Thus, the Inter-class Center-to-center Loss is defined by:


Here, a small margin is used in consideration of the feature space size and the mislabeled ground truth.
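The steps of this subsection (pairwise center similarity, keeping only off-diagonal entries, and penalizing values above a threshold) can be sketched as below. Several details are assumptions rather than facts from the text: scaling the dot product by the square root of the channel count, normalizing with a softmax over the class axis, dividing the threshold by the number of other classes, summing the penalties, and the default threshold value of 0.5. The function name `inter_c2c_loss` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inter_c2c_loss(centers, threshold=0.5):
    """centers: (K, C) class centers; returns a scalar penalty."""
    k, c = centers.shape
    # Assumed form: scaled dot-product similarity, softmax-normalized per class.
    sim = softmax(centers @ centers.T / np.sqrt(c), axis=-1)   # (K, K)
    # Only constrain similarity between *different* classes.
    off_diag = sim * (1.0 - np.eye(k))
    # Penalize only the part of the similarity above the margin.
    excess = np.maximum(off_diag - threshold / (k - 1), 0.0)
    return float(excess.sum())

separated = inter_c2c_loss(10.0 * np.eye(3))   # centers far apart
collapsed = inter_c2c_loss(np.ones((3, 3)))    # all centers identical
```

With well-separated centers nothing exceeds the margin and the penalty vanishes, while identical centers spread the similarity evenly across classes and are penalized.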

3.3.3 Inter-class Center-to-pixel Loss.

Maximizing only the distances between class centers does not necessarily result in separable representations for every individual pixel. We further maximize the distance between a class center and any pixel that does not belong to this class. More concretely, we first compute the center-to-pixel dot product as:


Ideally, with the previous intra-class center-to-pixel loss, the features of all pixels belonging to the same class should be equal to that of the class center. Therefore, we replace the intra-class dot product with its ideal value, namely using the class center for calculating the intra-class dot product as:


and the replacement effect is achieved by using masks as:


This updated dot product is then used to calculate similarity across class axis with a softmax as:


Similar to the calculation of in the previous subsection, we have


Thus, the Inter-class Center-to-pixel Loss is defined by:
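As a rough illustration of the steps described in this subsection, the NumPy sketch below computes the center-to-pixel dot product, swaps in the ideal intra-class value via the one-hot mask, applies a softmax over the class axis, and penalizes inter-class similarity above a threshold. It is one plausible reading rather than the authors' exact formula: the scaling by the square root of the channel count, the per-class division of the threshold, the averaging over pixels, the default threshold of 0.25, and the name `inter_c2p_loss` are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inter_c2p_loss(x_flat, y_flat, centers, threshold=0.25):
    """
    x_flat:  (N, C) pixel features.
    y_flat:  (N, K) one-hot ground-truth labels.
    centers: (K, C) class centers.
    """
    n, c = x_flat.shape
    k = centers.shape[0]
    # Center-to-pixel dot product for every (pixel, class) pair.
    scores = x_flat @ centers.T / np.sqrt(c)               # (N, K)
    # Ideal intra-class value: the dot product of a center with itself.
    ideal = np.sum(centers ** 2, axis=1) / np.sqrt(c)      # (K,)
    # Mask-based replacement: own-class entries take the ideal value.
    scores = y_flat * ideal[None, :] + (1.0 - y_flat) * scores
    sim = softmax(scores, axis=-1)                         # similarity across classes
    inter = sim * (1.0 - y_flat)                           # keep other-class entries only
    excess = np.maximum(inter - threshold / (k - 1), 0.0)
    return float(excess.sum() / n)

centers = np.array([[10.0, 0.0], [0.0, 10.0]])
label_class0 = np.array([[1.0, 0.0]])
# A class-0 pixel that sits on the class-1 center is penalized ...
confused = inter_c2p_loss(np.array([[0.0, 10.0]]), label_class0, centers)
# ... while a class-0 pixel on its own center is not.
clean = inter_c2p_loss(np.array([[10.0, 0.0]]), label_class0, centers)
```

Replacing the own-class score with its ideal value means the penalty depends only on how strongly a pixel responds to the centers of other classes, not on how well it already matches its own.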


3.4 Differences with OCR, ACFNet and CPNet

Methods that are most closely related to ours are OCR [29], ACFNet [31] and CPNet [28], which all focus on better utilizing class-level features and differ in how they extract the class centers and context features. However, they all use a simple concatenation to fuse the original pixel feature and the complementary context feature. For example, OCR and ACFNet first produce a coarse segmentation, which is supervised by the GT mask with a categorical cross-entropy loss, and then use this predicted coarse mask to generate the (soft) class centers by weighted summing all the pixel features. OCR then aggregates these class centers according to their similarity to the original pixel feature, termed “pixel-region relation”, resulting in a “contextual feature”. Slightly differently from OCR, ACFNet directly uses the probability (from the predicted coarse mask) to aggregate class centers, obtaining a similar context feature termed “attentional class feature” (some nonlinear transformation layers used to increase regression capability are omitted here). CPNet defines an affinity map, which is a binary map indicating if two spatial locations belong to the same class. Then, they use a sub-network to predict their ideal affinity map and use the soft version of the affinity map, termed “Context Prior Map”, for feature aggregation, obtaining a class feature (center) and a context feature. Note that CPNet concatenates the class feature, which is the updated pixel feature, and the context feature.

We also propose to utilize class-level contextual features. Instead of extracting and fusing pixel features with sub-networks, we propose three loss functions to directly regularize training and encourage the learned features to maintain certain desired properties. The approach is simple but more effective thanks to the direct supervision (validated in Tab. 3). Moreover, our class center estimate is more accurate because we use the GT mask. This strategy largely reduces the complexity of the network and introduces no computational overhead during inference. Furthermore, it is compatible with all existing methods, including OCR, ACFNet and CPNet, demonstrating great generalization capability.

4 Experiments

4.1 Implementation

Training Settings.

For both CAR and baselines, we apply the settings common to most works [32, 33, 10, 9, 35], including SyncBatchNorm, batch size = 16, weight decay (0.001), 0.01 initial LR, and poly learning-rate decay with SGD during training. In addition, for the CNN backbones (e.g., ResNet), we set output stride = 8 (see [3]). Training iteration is set to 30k iterations unless otherwise specified. For the thresholds in Eq. 6 and Eq. 13, we set and .
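For readers unfamiliar with the "poly" schedule mentioned above, a minimal sketch is given below using the stated initial learning rate (0.01) and training length (30k iterations). The exponent `power=0.9` is a commonly used value and an assumption here, since the text does not state it; the function name `poly_lr` is likewise illustrative.

```python
def poly_lr(step, total_steps=30_000, base_lr=0.01, power=0.9):
    """Polynomial learning-rate decay from base_lr at step 0 down to 0."""
    # Clamp so that steps past the schedule end do not produce negative bases.
    progress = min(step, total_steps) / total_steps
    return base_lr * (1.0 - progress) ** power

lr_start = poly_lr(0)
lr_mid = poly_lr(15_000)
lr_end = poly_lr(30_000)
```

The schedule decays smoothly and reaches zero exactly at the final iteration, so changing the number of training iterations also changes the learning rate seen at every intermediate step.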

Determinism and Reproducibility

Our implementations are based on the latest NVIDIA deterministic framework (2022), which means exactly the same results can always be reproduced with the same hardware (e.g., same GPU types) and same training settings (including random seed). To demonstrate the effectiveness of our CAR with fair comparisons, we reproduced all the baselines that we compare against, all conducted with exactly the same settings unless otherwise specified.

4.2 Experiments on Pascal Context

The Pascal Context [18] dataset is split into 4,998/5,105 images for the training/test sets. We use its 59 semantic classes following the common practice [29, 33]. Unless otherwise specified, both baselines and CAR are trained on the training set with 30k iterations. The ablation studies are presented in Sect. 4.2.1.

4.2.1 Ablation Studies on Pascal Context

Methods  A mIOU (%)
R1 ResNet-50 + Self-Attention [24] - - 48.32
R2 48.56
R3 + CAR 49.17
R4 49.79
R5 50.01
R6 49.62
R7 50.00
R8 50.50
S1 Swin-Tiny + UperNet [27] - - 49.62
S2 49.82
S3 + CAR 49.01
S4 50.63
S5 50.26
S6 49.62
S7 50.58
S8 50.78
Table 1: Ablation studies of adding CAR to different methods on Pascal Context dataset. All results are obtained with single scale test without flip. “A” means replacing the conv with conv (detailed in Sec. 4.2.1). CAR improves the performance of different types of backbones (CNN & Transformer) and head blocks (SA & Uper), showing that the proposed CAR generalizes well on different network architectures.
CAR on ResNet-50 + Self-Attention.

We firstly test the CAR with “ResNet-50 + Self-Attention” (w/o image-level block in [33]) to verify the effectiveness of the proposed loss functions, i.e., , , and .

As shown in Tab. 1, using directly improves 1.30 mIOU (48.32 vs 49.62); Introducing and further improves 0.38 mIOU and 0.50 mIOU; Finally, with all three loss functions, the proposed CAR improves 2.18 mIOU from the regular ResNet-50 + Self-attention (48.32 vs 50.50).

CAR on Swin-Tiny + Uper.

“Swin-Tiny + Uper” is a totally different architecture from “ResNet-50 + Self-Attention [24]”. Swin [13] is a recent Transformer-based backbone network. Uper [27] is based on the pyramid pooling modules (PPM) [34] and FPN [11], focusing on extracting multi-scale context information. Similarly, as shown in Tab. 1, after adding CAR, the performance of Swin-Tiny + Uper also increases by 1.16, which shows our CAR can generalize to different architectures well.

The Devil is In the Architecture’s Detail.

We find it important to replace the leading conv (in the original method) with conv (Fig. 3B). For example, and did not improve the performance in Swin-Tiny + Uper (Row S3 vs S1, and S5 vs S4 in Tab. 1). A possible reason is that the network is trained to maximize the separation between different classes. However, if the two pixels lie on different sides of the segmentation boundary, a conv will merge the pixel representations from different classes, making the proposed losses harder to optimize.

To keep simplicity and maximize generalization, we use the same network configurations for all the baseline methods. However, performance may be further improved with some minor dedicated modifications for each baseline when deploying our CAR. For example, decreasing the filter number to 256 for the last conv layer of ResNet-50 + Self-Attention + CAR results in a further improvement to 51.00 mIOU (from 50.50). Replacing the conv layer after PPM (inside Uper block, Fig. 3A3) from to in Swin-Tiny + UperNet boosts Swin (tiny & large) + UperNet + CAR by an extra 0.5-1.0 mIOU. We did not try to exhaustively search these variants since they did not generalize.

CAR using Moving Average.

We also implemented a moving average version of CAR which tracks the class center with moving average similar to BatchNorm. As shown in Tab. 2, we find this moving average version of CAR negatively impacts both ResNet-50 + Self-Attention and Swin-Tiny + Uper.

Methods                       CAR     CAR (Moving Average) with decay rate
                                      0.8             0.9             0.99
ResNet-50 + Self-Attention    50.50   49.80 (-0.70)   50.26 (-0.24)   49.96 (-0.54)
Swin-Tiny + UperNet           50.78   49.56 (-1.22)   50.03 (-0.75)   48.93 (-1.85)
Table 2: Ablation studies of adding moving average to CAR on Pascal Context (mIOU, %). The decay rate is the weight given to the old class center. Values in parentheses are differences from CAR without moving average.
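The moving-average variant can be pictured as an exponential moving average over per-batch class centers, as in the hedged sketch below. The function name `update_centers`, the `present` flag that skips classes absent from the current batch, and the toy values are assumptions for illustration; the text only states that the tracking is similar to BatchNorm and that the decay rate weights the old class center.

```python
import numpy as np

def update_centers(running, batch_centers, present, decay=0.9):
    """
    running:       (K, C) class centers tracked across iterations.
    batch_centers: (K, C) class centers computed from the current batch.
    present:       (K,) boolean, True for classes that occur in the batch.
    decay:         weight given to the old (running) class center.
    """
    updated = decay * running + (1.0 - decay) * batch_centers
    # Assumption: classes missing from the batch keep their old center.
    return np.where(present[:, None], updated, running)

running = np.zeros((2, 3))
batch_centers = np.ones((2, 3))
present = np.array([True, False])
new_running = update_centers(running, batch_centers, present, decay=0.9)
```

A larger decay makes the tracked center change more slowly, so it can lag behind the features as the network is updated during training.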
Methods Backbone mIOU (%) on Pascal Context mIOU (%) on COCO-Stuff10K
FCN [15] ResNet-50 [7] 47.72 34.10
FCN + CAR ResNet-50 [7] 48.40(+0.68) 34.91(+0.81)§
FCN [15] ResNet-101 [7] 50.93 35.93
FCN + CAR ResNet-101 [7] 51.39(+0.49) 36.88(+0.95)§
DeepLabV3 [3] ResNet-50 [7] 48.59 34.96
DeepLabV3 + CAR ResNet-50 [7] 49.53(+0.94) 35.13(+0.17)
DeepLabV3 [3] ResNet-101 [7] 51.69 36.92
DeepLabV3 + CAR ResNet-101 [7] 52.58(+0.89) 37.39(+0.47)
Self-Attention [24] ResNet-50 [7] 48.32 34.35
Self-Attention + CAR ResNet-50 [7] 50.50(+2.18) 36.58(+2.23)§
Self-Attention [24] ResNet-101 [7] 51.59 36.53
Self-Attention + CAR ResNet-101 [7] 52.49(+0.90) 38.15(+1.62)
CCNet [9] ResNet-50 [7] 49.15 35.10
CCNet + CAR ResNet-50 [7] 49.56(+0.41) 36.39(+1.29)
CCNet [9] ResNet-101 [7] 51.41 36.88
CCNet + CAR ResNet-101 [7] 51.97(+0.56) 37.56(+0.68)
DANet [6] ResNet-101 [7] 51.45 35.80
DANet + CAR ResNet-101 [7] 52.57(+1.12) 37.47(+1.67)
CPNet [28] ResNet-101 [7] 51.29 36.92
CPNet + CAR ResNet-101 [7] 51.98(+0.69) 37.12(+0.20)§
OCR [29] HRNet-W48 [23] 54.37 38.22
OCR + CAR HRNet-W48 [23] 54.99(+0.62) 39.53(+1.31)
UperNet [27] Swin-Tiny [13] 49.62 36.07
UperNet + CAR Swin-Tiny [13] 50.78(+1.16) 36.63(+0.56)§
UperNet [27] Swin-Large [13] 57.48 44.25
UperNet + CAR Swin-Large [13] 58.97(+1.49) 44.88(+0.63)
CAA [8] EfficientNet-B5 [17] 57.79 43.40
CAA + CAR EfficientNet-B5 [17] 58.96(+1.17) 43.93(+0.53)
Table 3: Ablation studies of adding CAR to different baselines on Pascal Context [18] and COCOStuff-10K [1]. We deterministically reproduced all the baselines with the same settings. All results are single-scale without flipping. CAR works very well with most existing methods. § means reducing the class-level threshold from 0.5 to 0.25; we found that some model variants are sensitive to this threshold when handling a large number of classes. Affinity loss [28] and auxiliary loss [34] are applied to CPNet and OCR, respectively, since they rely heavily on those losses.
CAR on Different Baselines.

After we have verified the effectiveness of each part of the proposed CAR, we then tested CAR on multiple well-known baselines. All of the baselines were reproduced under similar conditions (see Sect. 4.1). Experimental results shown in Tab. 3 demonstrate the generalizability of our CAR on different backbones and methods.

Visualization of Class Dependency Maps.

In Fig. 4, we present the class dependency maps calculated on the complete Pascal Context test set, where every pixel stores the dot-product similarities between every two class centers. The maps indicate the inter-class dependency obtained with the standard ResNet-50 + Self-Attention and Swin-Tiny + UperNet, and the effect of applying our CAR. A hotter color means that the class has higher dependency on the corresponding class, and vice versa. According to Fig. 4 a1-a2, we can easily observe that the inter-class dependency has been significantly reduced with CAR on ResNet50 + Self-Attention. Fig. 4 b1-b2 show a similar trend when tested with different backbones and head blocks. This partially explains the reason why baselines with CAR generalize better on rarely seen class combinations (Figs. 1 and 5). Interestingly, we find that the class-dependency issue is more serious in Swin-Tiny + Uper, but our CAR can still reduce its dependency level significantly.

Figure 4: Class dependency maps generated on Pascal Context test set. One may zoom in to see class names. A hotter color means that the class has higher dependency to the corresponding class, and vice versa. It is obvious that our CAR reduces the inter-class dependency, thus providing better generalizability (see Figs. 1 and 5).
Visualization of Pixel-relation Maps.

In Fig. 5, we visualize the pixel-to-pixel relation energy map, based on the dot-product similarity between a red-dot marked pixel and other pixels, as well as the predicted results for different methods, for comparison. Examples are from Pascal Context test set. As we can see, with CAR supervision, the existing models focus better on objects themselves rather than other objects. Therefore, this reduces the possibility of the classification errors because of the class-dependency bias.

(a) ResNet50 + Self-Attention
(b) Swin-Tiny + UperNet
Figure 5: Visualization of the feature similarity between a given pixel (marked with a red dot in the image) and all pixels, as well as the segmentation results on Pascal Context test set. A hotter color denotes larger similarity value. Apparently, our CAR reduces the inter-class dependency and exhibits better generalization ability, where energies are better restrained in the intra-class pixels.

4.3 Experiments on COCOStuff-10K

COCOStuff-10K dataset [1] is widely used for evaluating the robustness of semantic segmentation models [10, 29]. The COCOStuff-10k dataset is a very challenging dataset containing 171 labeled classes and 9000/1000 images for training/test. As shown in Tab. 3, all of the tested baselines gain performance boost ranging from 0.17% to 2.23% with our proposed CAR. This demonstrates the generalization ability of our CAR when handling a large number of classes.

5 Conclusions and Future Work

In this paper, we have aimed to make a better use of class level context information. We have proposed a universal class-aware regularizations (CAR) approach to regularize the training process and boost the differentiability of the learned pixel representations. To this end, we have proposed to minimize the intra-class feature variance and maximize the inter-class separation simultaneously. Experiments conducted on benchmark datasets with extensive ablation studies have validated the effectiveness of the proposed CAR approach, which has boosted the existing models’ performance by up to 2.18% mIOU on Pascal Context and 2.23% on COCOStuff-10k with no extra inference overhead.

Appendix A

A.1 Extra technical details

A.1.1 Deterministic

Control variables are very important for all scientific research. In computer vision, we always use the same backbones and the same datasets when verifying the difference between two methods.

Without “deterministic” technology, many operations in neural networks contain some randomness. Nowadays, with the latest deterministic technology and fixed seeds, experiments can be conducted in a fully controlled environment. This means that the performance difference between different settings (i.e., w/ and w/o CAR) is no longer affected by this randomness but faithfully reflects the effectiveness of different methods.

In Tab. 4, we report the performance of our proposed CAR (ResNet-50 + Self-Attention and Swin-Tiny + UperNet) trained with different random seeds. As shown, our CAR consistently improves the mIOU over its baseline across seeds, demonstrating the effectiveness of the proposed CAR.

Methods | Seed 0 (default) | Seed 1 | Seed 2
ResNet-50 + Self-Attention | 48.32 | 47.54 | 47.69
ResNet-50 + Self-Attention + CAR | 50.50 (+2.18) | 50.20 (+2.66) | 50.59 (+2.90)
Swin-Tiny + UperNet | 49.62 | 49.24 | 49.54
Swin-Tiny + UperNet + CAR | 50.78 (+1.16) | 50.57 (+1.33) | 50.75 (+1.21)
Table 4: Ablation studies of our proposed CAR using different random seeds on the Pascal Context dataset. All values are mIOU (%).
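The per-seed gains in Tab. 4 can be re-derived from the raw scores with a few lines of arithmetic. The sketch below copies the numbers from the table; the dictionary layout and the `gains` helper are illustrative assumptions.

```python
from statistics import mean

# mIOU (%) for seeds 0, 1, 2, copied from Tab. 4.
table4 = {
    "ResNet-50 + Self-Attention": {
        "baseline": [48.32, 47.54, 47.69],
        "car": [50.50, 50.20, 50.59],
    },
    "Swin-Tiny + UperNet": {
        "baseline": [49.62, 49.24, 49.54],
        "car": [50.78, 50.57, 50.75],
    },
}

def gains(row):
    """Per-seed improvement of CAR over its baseline, rounded to 2 decimals."""
    return [round(c - b, 2) for b, c in zip(row["baseline"], row["car"])]

for name, row in table4.items():
    g = gains(row)
    print(f"{name}: gains={g}, mean gain={mean(g):.2f}")
```

Averaged over the three seeds, the gain is about 2.58 for ResNet-50 + Self-Attention and about 1.23 for Swin-Tiny + UperNet.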

0.A.2 Extra experiments

0.A.2.1 Ablation studies on batch class center

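In our experiments, we calculated the class centers using all the training images in a batch to alleviate the negative impact of noisy images. Here, we investigate the impact of using the class center of each individual image for class-aware regularizations. As shown in Tab. 5, the batch class center performs better for both models: with the image class center, ResNet-50 + Self-Attention reaches 49.78% instead of 50.50%, and Swin-Tiny + UperNet drops slightly below its baseline (49.45% vs. 49.62%).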

Methods | Baseline | CAR (image class center) | CAR (batch class center)
ResNet-50 + Self-Attention | 48.32 | 49.78 | 50.50
Swin-Tiny + UperNet | 49.62 | 49.45 | 50.78
Table 5: Comparison of mIOUs (%) obtained when using batch class center vs image class center in CAR.
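The difference between the two settings compared in Tab. 5 can be sketched as follows: a batch class center averages the features of all pixels of a class across the whole batch, whereas an image class center averages within one image only. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, features are assumed to be shaped (B, H, W, C), and labels are assumed to be ground-truth class indices in [0, K) with no ignore label.

```python
import numpy as np

def masked_class_centers(features, labels, num_classes):
    """Average the features of all pixels carrying each ground-truth label.

    features: (N, C) floats, labels: (N,) ints in [0, num_classes).
    Returns (centers, present); absent classes get a zero row and present=False.
    """
    one_hot = np.eye(num_classes)[labels]      # (N, K)
    counts = one_hot.sum(axis=0)               # pixels per class, (K,)
    sums = one_hot.T @ features                # per-class feature sums, (K, C)
    present = counts > 0
    centers = np.zeros_like(sums)
    centers[present] = sums[present] / counts[present][:, None]
    return centers, present

def batch_class_centers(features, labels, num_classes):
    """Pool the pixels of every image in the batch before averaging."""
    c = features.shape[-1]
    return masked_class_centers(features.reshape(-1, c), labels.reshape(-1), num_classes)

def image_class_centers(features, labels, num_classes):
    """Compute one independent set of class centers per image."""
    c = features.shape[-1]
    return [masked_class_centers(f.reshape(-1, c), l.reshape(-1), num_classes)
            for f, l in zip(features, labels)]

# Toy batch (hypothetical data): 2 images of 1x2 pixels with 1-D features.
features = np.array([[[[1.0], [3.0]]], [[[5.0], [10.0]]]])
labels = np.array([[[0, 1]], [[0, 0]]])
batch_centers, batch_present = batch_class_centers(features, labels, num_classes=2)
per_image = image_class_centers(features, labels, num_classes=2)
print(batch_centers.ravel(), batch_present)
print([c.ravel() for c, _ in per_image])
```

In the toy batch, class 1 never appears in the second image, so that image has no center for it, while the batch-level center is still defined. Pooling over the batch also averages over more pixels per class, which is one plausible reading of why it is less sensitive to a single noisy image.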

0.A.2.2 Exceeding state-of-the-art (SOTA) in Pascal Context

The main motivation of our CAR is to utilize class-level information as regularizations during training to boost the performance of all existing methods. However, following the convention and also for readers who are interested, we compare with state-of-the-art methods in Tab. 6, regardless of whether their architectures are related to ours. Since Swin [13] is not compatible with dilation, we use JPU [26] as the substitution to obtain features with output stride = 8. UperNet contains an FPN [11] module that can obtain features with output stride = 4.
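For readers less familiar with the term, the output stride is the ratio between the input resolution and the resolution of the feature map. The small sketch below makes the arithmetic concrete; the 512 x 512 input size and the ceiling rounding for non-divisible sizes are assumptions chosen for illustration, not settings taken from the experiments.

```python
def feature_map_size(height, width, output_stride):
    """Return the (H, W) of a feature map, rounding up for non-divisible inputs."""
    return (-(-height // output_stride), -(-width // output_stride))

# Hypothetical 512 x 512 input at three common output strides.
for stride in (32, 8, 4):
    print(f"output stride {stride}: {feature_map_size(512, 512, stride)}")
```

A smaller output stride therefore means a denser feature map: stride 8 gives 64 x 64 features for this input, and stride 4 gives 128 x 128.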

Boosted by our CAR, the strong model ConvNeXt-Large [14] + CAA [8] achieved 62.70% mIOU under single-scale testing and 63.91% under multi-scale testing. We also found that increasing the training iterations from the default 30K to 40K when using the Adam optimizer can further increase the performance on the Pascal Context dataset. Thus, the SOTA single-model performance has now been boosted to 62.97% under single-scale testing and 64.12% under multi-scale testing. This outperforms the previous SOTA single model, i.e., EfficientNet-B7 + CAA, by 3.82 points under multi-scale testing (64.12% vs. 60.30%).

Methods | Backbone | Aux | Optimizer | Iterations | SS mIOU (%) | MF mIOU (%)
CAA§ | EfficientNet-B7-D8 |  | SGD | 30K | - | 60.30
UperNet | Swin-Large |  | SGD | 30K | 57.48 | 59.45
UperNet + CAR | Swin-Large |  | SGD | 30K | 58.97 | 60.76
CAA | Swin-Large + JPU |  | SGD | 30K | 58.31 | 59.75
CAA + CAR | Swin-Large + JPU |  | SGD | 30K | 59.84 | 61.46
CAA + CAR | Swin-Large + JPU |  | Adam | 30K | 60.68 | 62.21
CAA | ConvNeXt-Large + JPU |  | SGD | 30K | 60.48 | 61.80
CAA + CAR | ConvNeXt-Large + JPU |  | SGD | 30K | 61.40 | 62.69
CAA + CAR | ConvNeXt-Large + JPU |  | Adam | 30K | 62.65 | 63.77
CAA + CAR | ConvNeXt-Large + JPU |  | Adam | 30K | 62.70 | 63.91
CAA + CAR | ConvNeXt-Large + JPU |  | Adam | 40K | 62.97 | 64.12
Table 6: Experiments on boosting the SOTA single-model performance on Pascal Context by our CAR. See Sec. 0.A.2.2 for the details. §: We report previous SOTA score as reference. SS: Single scale without flipping. MF: Multi-scale with flipping. JPU is used to get features with output stride = 8. Aux: Apply auxiliary loss during training, see [34]. Iterations: training iterations.

0.A.2.3 Exceeding SOTA performance in COCOStuff-10K

Similar to Sec. 0.A.2.2, Tab. 7 shows that, boosted by our CAR, the strong model ConvNeXt-Large [14] + CAA achieved 49.03% mIOU under single-scale testing and 50.01% under multi-scale testing. This also outperforms the previous SOTA single model, i.e., EfficientNet-B7 + CAA, by 4.61 points under multi-scale testing (50.01% vs. 45.40%).

Methods | Backbone | Aux | Optimizer | SS mIOU (%) | MF mIOU (%)
CAA§ | EfficientNet-B7-D8 |  | SGD | - | 45.40
UperNet | Swin-Large |  | SGD | 44.25 | 46.10
UperNet + CAR | Swin-Large |  | SGD | 44.88 | 46.64
CAA | Swin-Large + JPU |  | SGD | 44.22 | 45.31
CAA + CAR | Swin-Large + JPU |  | SGD | 45.48 | 46.99
CAA | ConvNeXt-Large + JPU |  | SGD | 46.49 | 47.23
CAA + CAR | ConvNeXt-Large + JPU |  | SGD | 46.70 | 47.77
CAA + CAR | ConvNeXt-Large + JPU |  | Adam | 48.20 | 48.83
CAA + CAR | ConvNeXt-Large + JPU |  | Adam | 49.03 | 50.01
Table 7: Experiments on boosting the SOTA on COCOStuff-10K, leveraging the previous single-model SOTA boosted by our CAR. See Sec. 0.A.2.3 for details. §: We report the original SOTA scores. SS: Single scale without flipping. MF: Multi-scale with flipping. Aux: Apply auxiliary loss during training, see [34].

0.A.3 Extra Visualizations

0.A.3.1 Visualization of OCRNet in Pascal Context

Similar to the main paper, in Fig. 6, we visualize the pixel-to-pixel relation energy maps obtained with HRNetW48 [23] + OCR [29]. This figure shows that our CAR can further improve the robustness of class-center-based models by making better use of the class center. Interestingly, as shown in C12 of Fig. 6 and in Fig. 1 of the main paper (predicted by ResNet-50 + Self-Attention), cow/sheep/dog misclassification is a common issue in many semantic segmentation models, especially when, e.g., grass and cow frequently co-exist during training. Our CAR addresses this issue better thanks to its reduced inter-class dependency.

Figure 6: Visualization of the feature similarity between a given pixel (marked with a red dot in the image) and all other pixels, as well as the segmentation results of HRNetW48 [23] + OCR [29] on Pascal Context test set. A hotter color denotes a greater similarity value.

0.A.3.2 Visualization of DeepLab in Pascal Context

We also visualize the pixel-to-pixel relation energy map of ResNet-50 [7] + DeepLabV3 [3] in Fig. 7. These visualizations clearly show that the reduced inter-class dependency helps to correct the classification.

Figure 7: Visualization of the feature similarity between a given pixel (marked with a red dot in the image) and all other pixels, as well as the segmentation results of ResNet-50 [7] + DeepLabV3 [3] on Pascal Context test set. A hotter color denotes a greater similarity value.


This research depends on the NVIDIA determinism framework. We appreciate the support from @duncanriach and @reedwm at NVIDIA and the TensorFlow team.


  • [1] H. Caesar, J. Uijlings, and V. Ferrari (2018) COCO-Stuff: Thing and Stuff Classes in Context. In CVPR, Cited by: §4.3, Table 3.
  • [2] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI. Cited by: §1.
  • [3] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. External Links: 1706.05587 Cited by: Figure 7, §0.A.3.2, §4.1, Table 3.
  • [4] S. Choi, J. T. Kim, and J. Choo (2020) Cars can’t fly up in the sky: improving urban-scene segmentation via height-driven attention networks. In CVPR, Cited by: §2.
  • [5] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In ICLR, Cited by: §1.
  • [6] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In CVPR, Cited by: §1, §2, §3.2.1, Table 3.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: Figure 7, §0.A.3.2, Table 3.
  • [8] Y. Huang, D. Kang, W. Jia, X. He, and L. Liu (2022) Channelized axial attention - considering channel relation within spatial attention for semantic segmentation. In AAAI, Cited by: §0.A.2.2, Table 3.
  • [9] Z. Huang, X. Wang, Y. Wei, L. Huang, H. Shi, W. Liu, and T. S. Huang (2020) CCNet: criss-cross attention for semantic segmentation. IEEE TPAMI. Cited by: §4.1, Table 3.
  • [10] X. Li, Z. Zhong, J. Wu, Y. Yang, Z. Lin, and H. Liu (2019) Expectation-maximization attention networks for semantic segmentation. In ICCV, Cited by: §1, §3.2.1, §4.1, §4.3.
  • [11] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, Cited by: §0.A.2.2, §4.2.1.
  • [12] M. Liu, D. Schonfeld, and W. Tang (2021) Exploit visual dependency relations for semantic segmentation. In CVPR, Cited by: §2.
  • [13] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. In ICCV, Cited by: §0.A.2.2, §4.2.1, Table 3.
  • [14] Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, and S. Xie (2022) A convnet for the 2020s. In CVPR, Cited by: §0.A.2.2, §0.A.2.3.
  • [15] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §1, Table 3.
  • [16] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: §1.
  • [17] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In ICML, Cited by: Table 3.
  • [18] R. Mottaghi, X. Chen, X. Liu, N. Cho, S. Lee, S. Fidler, R. Urtasun, and A. Yuille (2014) The role of context for object detection and semantic segmentation in the wild. In CVPR, Cited by: §4.2, Table 3.
  • [19] R. Ranftl, A. Bochkovskiy, and V. Koltun (2021) Vision transformers for dense prediction. In ICCV, Cited by: §1.
  • [20] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §1.
  • [21] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. S. Torr, and L. Zhang (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, Cited by: §1.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §2, §3.3.2.
  • [23] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao (2020) Deep high-resolution representation learning for visual recognition. IEEE TPAMI. Cited by: Figure 6, Table 3.
  • [24] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, Cited by: §1, §2, §3.2.1, §4.2.1, Table 1, Table 3.
  • [25] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In ECCV, Cited by: §3.3.2.
  • [26] H. Wu, J. Zhang, K. Huang, K. Liang, and Y. Yu (2019) FastFCN: rethinking dilated convolution in the backbone for semantic segmentation. Cited by: §0.A.2.2.
  • [27] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun (2018) Unified perceptual parsing for scene understanding. In ECCV, Cited by: §4.2.1, Table 1, Table 3.
  • [28] C. Yu, J. Wang, C. Gao, G. Yu, C. Shen, and N. Sang (2020) Context prior for scene segmentation. In CVPR, Cited by: §1, §1, §2, §2, §3.2.1, §3.4, Table 3.
  • [29] Y. Yuan, X. Chen, and J. Wang (2020) Object-contextual representations for semantic segmentation. In ECCV, Cited by: Figure 6, §0.A.3.1, §1, §1, §2, §2, §3.4, §4.2, §4.3, Table 3, footnote 2.
  • [30] Y. Yuan, L. Huang, J. Guo, C. Zhang, X. Chen, and J. Wang (2021) OCNet: object context network for scene parsing. IJCV. Cited by: §1, §2, §3.2.1.
  • [31] F. Zhang, Y. Chen, Z. Li, Z. Hong, J. Liu, F. Ma, J. Han, and E. Ding (2019) ACFNet: attentional class feature network for semantic segmentation. In ICCV, Cited by: §1, §1, §2, §2, §3.4, footnote 2.
  • [32] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal (2018) Context encoding for semantic segmentation. In CVPR, Cited by: §4.1.
  • [33] H. Zhang, H. Zhan, C. Wang, and J. Xie (2019) Semantic correlation promoted shape-variant context for segmentation. In CVPR, Cited by: §1, §2, §4.1, §4.2.1, §4.2.
  • [34] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR, Cited by: Table 6, Table 7, §1, §2, §4.2.1, Table 3.
  • [35] Z. Zhu, M. Xu, S. Bai, T. Huang, and X. Bai (2019) Asymmetric non-local neural networks for semantic segmentation. In ICCV, Cited by: §1, §2, §3.2.1, §4.1.