As a fundamental computer vision task, semantic segmentation aims to produce pixel-level classification of a given image. The advent of Convolutional Neural Networks (CNNs) and their subsequent advancements have significantly improved the performance of semantic segmentation models [long2015fully, chen2017deeplab, zhou2016learning, chen2017rethinking]. However, the success of these models relies heavily on large training datasets with pixel-level annotations [everingham2010pascal, cordts2016cityscapes], which are expensive and time-consuming to obtain. To relieve the burden of pixel-wise labeling, weakly supervised semantic segmentation (WSSS) models have been widely studied. They are usually built upon weak annotations, including image-level labels [wang2020self, fan2020learning, wei2018revisiting, ahn2018learning, huang2018weakly, araslanov2020single], point supervision [bearman2016s], scribble annotations [vernaza2017learning, lin2016scribblesup] and bounding boxes [song2019box, dai2015boxsup]. Among these weak annotations, image-level labels provide the lowest level of supervision, thus presenting the most challenging task.
Most image-level label based WSSS models rely on a Class Activation Map (CAM) [zhou2016learning] to provide the initial localisation of the object. However, CAMs are incomplete, focusing on the most discriminative area and often revealing only a small part of the object, as shown in Fig. 1 (b), which makes them insufficient to serve as pseudo labels for the downstream semantic segmentation task.
Recent studies [ahn2018learning, wang2020self, chang2020weakly] have resorted to a three-stage training scheme to further boost the performance of image-level supervision based WSSS models. Firstly, a better attention map is obtained to cover a larger part of the object; e.g., [wang2020self, chang2020weakly] achieve this by incorporating additional constraints in the learning of a classification network. Secondly, the attention map is expanded through pixel correlations to increase the coverage rate. This is mainly achieved by learning an affinity matrix through AffinityNet [ahn2018learning] to perform a random walk on the obtained class activation maps. Lastly, a semantic segmentation model, e.g. DeepLab [chen2017deeplab], is trained using the expanded attention maps as pseudo labels in a fully supervised manner.
The performance of the three-stage training scheme highly depends on the quality of the initial class activation map. We argue that although the discriminative area is beneficial for object classification, the lost structure information in the class activation map hinders its performance in subsequent processing, e.g. semantic segmentation. Further, due to different degrees of discrimination, regions inside the same object have different class activation scores. In this paper, we propose a smoothing branch that leverages semantic boundary information to impose structural constraints on the training of WSSS models. It forces the resulting class activation maps to pay attention to object structures. Furthermore, we introduce a semantic boundary-guided smoothness loss that achieves a more consistent attention map within the same object area while ignoring the pixels belonging to other classes as shown in Fig. 1. With the proposed framework, we can achieve better CAMs with structure information, leading to relatively sharp semantic predictions as shown in Fig. 1.
Our main contributions can be summarised as:
We propose a smoothing branch that leverages semantic boundaries for weakly supervised semantic segmentation to obtain structure-preserving attention maps.
We present a semantic boundary-guided smoothness loss function that enforces more consistent semantic prediction within the same object area.
Our proposed method achieves state-of-the-art results on the PASCAL VOC 2012 dataset.
II Related Work
II-A Weakly supervised semantic segmentation
Weakly supervised semantic segmentation models aim to perform the semantic segmentation task with only coarse annotations. Successful examples have explored the potential of bounding box [song2019box, dai2015boxsup], scribble [vernaza2017learning, lin2016scribblesup], point [bearman2016s] and image-level [wang2020self, fan2020learning, wei2018revisiting, ahn2018learning, huang2018weakly, araslanov2020single] annotations. Among these, image-level labels require the minimum level of manual annotation. Further, they present the most challenging task, providing the least supervision for training. Our work employs image-level labels as weak supervision for the semantic segmentation task.
II-B Class Activation Maps
Recent weakly supervised semantic segmentation works have been based on using CAMs [zhou2016learning] to provide the initial localisation cues for different classes. This requires a modification to the structure of CNNs trained for a classification task with image-level labels: the penultimate global average pooling (GAP) layer is removed, and the classification layer is directly applied to the high-dimensional features to produce pixel-level score maps for each class. Grad-CAM [selvaraju2017grad] aims to mitigate the limitations of CAM, namely that it is only applicable to fully convolutional networks and that its performance is slightly reduced by the change of network structure. It associates the importance of feature maps at the penultimate layer with the class gradients and computes the score maps as a weighted sum of those feature maps. However, these class activation maps reveal only the most discriminative parts of the objects as they are trained primarily for classification. The output pseudo segmentation maps are incomplete and require further refinement.
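The CAM construction described above can be sketched in a few lines of numpy; the shapes and names here are illustrative, not the paper's implementation:

```python
import numpy as np

def compute_cam(features, fc_weights, class_idx):
    """Compute a Class Activation Map in the style of Zhou et al. (2016).

    features:   (C, H, W) feature maps from the last conv layer (GAP removed)
    fc_weights: (K, C) classification-layer weights
    class_idx:  target class k
    """
    # Apply the classification layer directly to the spatial features:
    # each pixel's score is the weighted sum of its C channel activations.
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0)      # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()     # normalise to [0, 1]
    return cam

# toy example with random features and weights
feats = np.random.rand(8, 4, 4)
w = np.random.rand(3, 8)
cam = compute_cam(feats, w, class_idx=1)
```

The map is typically upsampled to the image size afterwards; thresholding it yields the discriminative regions discussed in the text.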
Some research approaches force the network to learn a more complete representation by erasing the most discriminative parts of the object. Wei et al. [wei2017object] propose to "erase" the obtained high-confidence area and retrain the network for the classification task with partially erased inputs, in order to force the network to identify more activation areas associated with each class. However, as a result of the gradual removal of the high-confidence area, the model eventually includes incorrect pixels surrounding the target object for classification. To mitigate this problem, SeeNet [hou2018self] employs two additional decoder branches to confine the erasure and attention within the object. DCSP [chaudhry2017discovering] applies the erasure iteratively and leverages an off-the-shelf saliency detector to confine the process, so that the attention within the object can be accumulated.
Recent works have focused on improving the quality of class activation maps directly. FickleNet [lee2019ficklenet] forces the model to learn a more complete CAM by randomly dropping connections of units in the convolution kernel. However, it is computationally expensive, involving generating 200 localisation maps for each input image and accumulating them to produce the revised result. OAA [jiang2019integral] observes a shift and transformation of CAMs during the training of the classification task; it accumulates CAMs from different stages of training to produce a final activation map that covers a larger area. Both SSENet [wang2019self] and SEAM [wang2020self] adopt scale-equivariant attention to achieve self-supervision. [chang2020weakly] proposes to perform fine-grained classification within each class and accumulates the sub-class activation maps with the class activation maps to enlarge the activation area. [fan2020learning] learns a directional vector within each class to differentiate foreground and background pixels. However, these methods focus only on increasing the overlap ratio between the recovered and ground-truth segmentation maps while ignoring the structures of the objects, resulting in incorrect segmentation results around object boundaries.
II-C Structure-aware semantic segmentation
The mismatch between the boundaries of CAMs and class objects is a key factor that hinders the recovery of the full segmentation mask in weakly supervised semantic segmentation. The idea of expanding the localisation cues, such as CAMs, to recover more complete representations of holistic objects was first proposed in [kolesnikov2016seed]. The expansion can benefit from more accurate initial localisation [li2018tell]. [huang2018weakly] incorporates the seeded region growing (SDRG) algorithm into the network to expand the initial CAMs; the expansion is confined by the boundary that separates the foreground and background pixels. AffinityNet [ahn2018learning] learns an affinity matrix on the initial CAMs and performs a random walk to diffuse confident labels to similar pixels. SSDD [shimoda2019self] optimises the segmentation results of semantic segmentation models by learning from the dense Conditional Random Fields (dCRF) and Random Walk (RW) revised results. These works exploit pixel correlations to alleviate the mismatch between the segmentation map boundary and the class boundary, while some other works [lee2019ficklenet, huang2018weakly, hou2018self, chaudhry2017discovering] directly adopt off-the-shelf saliency detection networks to localise the boundary between foreground and background pixels, which implicitly uses extra ground truth. Our work derives the semantic boundary from the semantic segmentation results of an existing weakly supervised semantic segmentation model, so that it does not implicitly or explicitly utilise any additional supervision. It utilises the semantic boundary to incorporate structure information into the feature maps and make the activation scores within the object more consistent, leading to improved segmentation results.
III Our Method
In this section, we introduce the proposed structure-aware semantic segmentation model with image-level supervision as shown in Fig. 2.
III-A Smoothing Branch (SB)
We adopt an existing WSSS model [wang2020self] as our baseline model. Our proposed smoothing branch is implemented upon the baseline model by connecting to its encoder and CAM output layer without modification to its network structure. The convolutional layers producing feature maps of the same resolution are grouped into different stages of the network, and we denote the feature map produced by the last convolutional layer of each stage as $F_i$, $i \in \{1, \dots, n\}$, where $n$ depends on the backbone used by the base model. The input image and image-level labels are denoted as $I$ and $y$, respectively. The output CAMs from the base model are denoted as $M$. The number of classes in the dataset is defined as $C$.
Semantic Boundary Detection Module (SBDM): The Semantic Boundary Detection Module encourages the feature maps of the baseline model to encode richer structural information. We use feature maps from different levels of the backbone network to produce semantic boundary predictions of size $K \times H \times W$, corresponding to the 20 classes in PASCAL VOC 2012 [everingham2010pascal] plus the background, where $H \times W$ is the spatial size of the feature map. We define the upsampling operation for $F_i$ as $U_i(F_i)$, where $U_i$ applies a convolution followed by another convolution that reduces the channel size to $K$, followed by a bilinear interpolation operation to achieve a feature map of the image size. We concatenate the features $U_1(F_1), \dots, U_n(F_n)$ and feed them to another two convolutional layers to obtain the initial semantic boundary $B_{init}$ of $K$ channels.
To inform the model of the image structure, we use handcrafted feature-based boundaries as per [takikawa2019gated, zhang2020weakly]. The computed Canny edges are expanded to $K$ channels before being interleaved with the initial semantic boundary through concatenation. They are then fed to another depthwise convolutional layer to obtain our final semantic boundary $B$. The whole process to obtain our final semantic boundary is shown in Eq. 1:

$$B = f_{dw}\big(\kappa(B_{init}, E)\big), \tag{1}$$

where $B$ is the semantic boundary prediction, $\kappa$ is the interleaving concatenation operation, $f_{dw}$ is the depthwise convolutional layer, $B_{init}$ is the initial semantic boundary produced by the convolutional layers above, and $E$ represents the output of applying the Canny edge detector on the input image. The SBDM is trained with a cross-entropy loss with the preprocessed semantic boundary as supervision, which is derived from the final semantic segmentation results of our baseline [wang2020self].
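A minimal numpy sketch of the fusion step above may help make the channel bookkeeping concrete. This is purely illustrative: the kernel values stand in for learned weights, and the final reduction from $2K$ back to $K$ channels is an assumption about how the depthwise output is regrouped, not the paper's exact layer configuration:

```python
import numpy as np

def depthwise_conv(x, kernels):
    """Naive per-channel 3x3 convolution with 'same' padding.
    x: (K, H, W), kernels: (K, 3, 3)."""
    K, H, W = x.shape
    pad = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for k in range(K):
        for i in range(H):
            for j in range(W):
                out[k, i, j] = np.sum(pad[k, i:i + 3, j:j + 3] * kernels[k])
    return out

def fuse_boundary(b_init, canny):
    """Fuse the initial semantic boundary with Canny edges (sketch).
    b_init: (K, H, W) initial semantic boundary; canny: (H, W) edge map."""
    K, H, W = b_init.shape
    canny_k = np.repeat(canny[None], K, axis=0)       # expand edges to K channels
    stacked = np.empty((2 * K, H, W))
    stacked[0::2], stacked[1::2] = b_init, canny_k    # interleave channel-wise
    kernels = np.random.rand(2 * K, 3, 3) * 0.1       # stand-in for learned weights
    fused = depthwise_conv(stacked, kernels)
    # Collapse each (boundary, edge) channel pair back to one output channel.
    return fused.reshape(K, 2, H, W).sum(axis=1)
```

In the actual module these operations would be differentiable layers trained with the cross-entropy boundary loss; the sketch only shows the data flow of Eq. 1.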
Semantic Boundary Guided Smoothness Loss: We encourage the pixels inside the object to attain consistently high activation scores, in sharp contrast to the confidence scores of pixels falling outside the boundary. Inspired by [wang2018occlusion, godard2017unsupervised], we develop a semantic boundary guided smoothness loss to impose such a constraint. The original smoothness loss is designed to smooth the intensity while preserving structure information across the whole image. However, this is detrimental to the semantic segmentation task, where structure information other than the class boundary needs to be suppressed. This becomes even worse for WSSS with only image labels. The smoothness loss function interprets edges as gradients of class activation scores; when unwanted edges are present within the semantic boundary, it tends to suppress the class activation scores on either side of the gradient line, resulting in deteriorated results.
To mitigate this problem, we employ the semantic boundary from our proposed branch as guidance for the smoothness loss. It is thus able to diffuse the high scores of the most discriminative area to all pixels enclosed by the semantic boundary to achieve smoothness. We define our semantic boundary guided smoothness loss with both first-order ($\partial$) and second-order ($\partial^2$) derivatives as:

$$\mathcal{L}_{s}^{1} = \sum_{k=1}^{K} y_k \sum_{p} \sum_{d \in \{x, y\}} \psi\big(\partial_d M_k(p)\big)\, e^{-\gamma B_k(p)},$$

$$\mathcal{L}_{s}^{2} = \sum_{k=1}^{K} y_k \sum_{p} \sum_{d \in \{x, y\}} \psi\big(\partial^2_d M_k(p)\big)\, e^{-\gamma B_k(p)},$$

where $y_k$ is the image label used to filter irrelevant classes, $K$ is the total number of classes plus one for the background, $M_k(p)$ is the classification score at pixel $p$ for class $k$, $B_k(p)$ is the semantic boundary response at pixel $p$ for class $k$, and $\gamma$ is a constant set to 10. The function $\psi(x) = \sqrt{x^2 + \epsilon^2}$ avoids calculating the square root of zero, following the settings in [wang2018occlusion, godard2017unsupervised], and $\partial_d$ denotes the partial derivative in the $x$ or $y$ direction. This semantic boundary guided smoothness loss enforces score consistency by penalising the gradients in class activation maps, with a boundary-aware term ($e^{-\gamma B_k(p)}$) to maintain the score contrast along the semantic boundary.
Then our semantic boundary guided smoothness loss function is defined as the weighted sum of $\mathcal{L}_{s}^{1}$ and $\mathcal{L}_{s}^{2}$:

$$\mathcal{L}_{sm} = \mathcal{L}_{s}^{1} + \lambda\, \mathcal{L}_{s}^{2},$$

where $\lambda$ is a weight balancing the contributions of the first-order and second-order derivative based smoothness losses; we set it empirically, following the settings in [wang2018occlusion, godard2017unsupervised].
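The loss above can be sketched directly in numpy. Everything here follows the definitions in the text except the value of `lam`, which is a placeholder rather than the paper's setting:

```python
import numpy as np

def boundary_guided_smoothness(M, B, y, gamma=10.0, lam=0.5, eps=1e-3):
    """Semantic boundary guided smoothness loss (illustrative sketch).

    M: (K, H, W) class activation scores
    B: (K, H, W) semantic boundary response
    y: (K,) image-level labels filtering irrelevant classes
    """
    psi = lambda x: np.sqrt(x ** 2 + eps ** 2)   # avoids sqrt(0)
    weight = np.exp(-gamma * B)                  # low weight on the boundary
    loss1 = loss2 = 0.0
    for d in (1, 2):                             # spatial axes: rows, cols
        g1 = np.diff(M, n=1, axis=d)             # first-order derivative
        g2 = np.diff(M, n=2, axis=d)             # second-order derivative
        w1 = np.take(weight, range(g1.shape[d]), axis=d)
        w2 = np.take(weight, range(g2.shape[d]), axis=d)
        loss1 += (y[:, None, None] * psi(g1) * w1).sum()
        loss2 += (y[:, None, None] * psi(g2) * w2).sum()
    return loss1 + lam * loss2
```

Because the boundary weight $e^{-\gamma B}$ vanishes where the boundary response is high, score gradients are only penalised inside object regions, which is exactly the diffusion behaviour described above.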
Objective Function: The overall loss function is composed of the baseline model loss functions $\mathcal{L}_{base}$, the cross-entropy loss $\mathcal{L}_{sb}$ for the training of the SBDM, and the semantic boundary guided smoothness loss $\mathcal{L}_{sm}$. It is defined as:

$$\mathcal{L} = \mathcal{L}_{base} + \alpha\, \mathcal{L}_{sb} + \beta\, \mathcal{L}_{sm},$$

where $\alpha$ and $\beta$ are the factors given to the semantic boundary loss and the smoothness loss. The factor of the loss functions associated with the base model is kept as 1, and we set $\alpha$ and $\beta$ empirically. $\alpha$ is kept small so that the semantic boundary loss does not dominate. The choice of $\beta$ is discussed in detail in Sec. IV-C.
IV Experimental Results
Dataset: We follow the standard procedure of WSSS to evaluate our proposed smoothing branch on the PASCAL VOC 2012 segmentation dataset [everingham2010pascal], which contains 20 foreground classes and 1 background class. The official split contains 1,464 images in the training set, 1,449 images in the validation set and 1,456 images in the testing set. Following the common protocol for semantic segmentation, we use the augmented training set, which includes 10,582 images, provided by [hariharan2011semantic]. The mean Intersection-over-Union (mIoU) metric is adopted to evaluate the segmentation results. We follow the previous WSSS works [wang2020self, fan2020learning, ahn2018learning] to report results on training and validation sets for stage-1 and stage-2 models, and report the results on both validation and test sets for the final semantic segmentation model.
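For reference, mIoU as used throughout this section can be computed from a confusion matrix as follows (a minimal sketch; the toy arrays are illustrative, not dataset results):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union over classes present in pred or gt."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        conf[g, p] += 1                          # rows: ground truth, cols: prediction
    inter = np.diag(conf)                        # per-class intersection
    union = conf.sum(0) + conf.sum(1) - inter    # per-class union
    valid = union > 0                            # ignore classes absent from both
    return (inter[valid] / union[valid]).mean()

pred = np.array([[0, 0], [1, 1]])
gt   = np.array([[0, 0], [1, 0]])
# class 0: inter 2, union 3; class 1: inter 1, union 2 -> mIoU = (2/3 + 1/2) / 2
```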
Training Details: We adopt SEAM [wang2020self] as our baseline model, which has a ResNet38 backbone. The semantic boundary is derived from its final semantic segmentation results so that we do not use any additional ground truth either implicitly or explicitly. We follow the settings of SEAM by randomly rescaling the images along the longest side and then cropping them to a fixed size. The learning rate follows the poly policy, $lr = lr_{0} \cdot (1 - \frac{iter}{max\_iter})^{\rho}$, with the initial learning rate $lr_0$ and decay power $\rho$ set as in [wang2020self]. We initialise our model with the released weights from our baseline [wang2020self]. The network is trained on 4 NVIDIA RTX 2080Ti GPUs. As our work focuses primarily on improving the quality of the initial class activation maps, we use models with publicly available PyTorch code for the remaining stages of the pipeline.
IV-B Comparison with the Baseline
Quantitative Analysis of CAM: We compare our proposed model to the baseline model and show the corresponding performance in Table I. Note that, unless otherwise stated, the results are evaluated on the PASCAL VOC 2012 training and validation sets. Table I shows that our model ("Ours", which indicates "CAM+SEAM+1EP" with our proposed SB) outperforms the baseline ("CAM+SEAM") on both training and validation sets. The improvement on the validation set is higher than that on the training set, which demonstrates that our method generalises well to unseen samples. We further verify that our performance improvement is not merely a result of additional training. We train the baseline model with one extra epoch and show its performance as "CAM+SEAM+1EP". Although one extra epoch of training benefits the baseline model, the gap between "CAM+SEAM+1EP" and "Ours" further illustrates the effectiveness of the proposed solution. The improvements are achieved without incurring additional computational cost during inference, where the smoothing branch is not involved.
|Method|Train (mIoU)|Val (mIoU)|
|CAM + SEAM|55.41|52.54|
|CAM + SEAM + 1 EP|55.69|53.06|
Qualitative Analysis of CAM: To verify the effectiveness of our proposed module and loss function, we compare the class activation maps of our method with those of the baseline in Fig. 3. As the baseline model does not include structural constraints, although its class activation maps are capable of localising the objects, they are prone to the over-segmentation problem. As can be seen in the boat image, the baseline wrongly associates some water and shore areas with the boat class. Our method is able to alleviate this problem by enforcing a class score gradient at the semantic boundary so that class scores at pixels outside the object area are kept low. Another advantage of our method is that the class scores within the object area are more consistent. Our method produces larger high-score areas covering the majority of the object, while those of our baseline model are sparser.
Quantitative Analysis of the Performance after Random Walk: We employ AffinityNet [ahn2018learning] to learn an affinity matrix in order to perform a random walk on our produced class activation maps. As AffinityNet only utilises the pixels with high class activation scores while disregarding the low-score pixels, it is important that the class activation maps attain consistently high scores within the object area and reduce the over-segmentation on irrelevant pixels. Tab. II demonstrates the effectiveness of our method on the downstream random walk processing, which indicates a further 3% performance improvement on the validation set.
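The random-walk refinement can be understood as repeated multiplication of the per-pixel scores by a row-normalised affinity (transition) matrix. The sketch below illustrates this propagation in numpy; it is a simplified stand-in for AffinityNet, whose affinities are learned rather than given:

```python
import numpy as np

def random_walk_refine(cam, affinity, n_iters=4):
    """Propagate CAM scores over pixels via a random walk (sketch).

    cam:      (K, N) class scores over N flattened pixels
    affinity: (N, N) non-negative pixel-affinity matrix
    """
    T = affinity / affinity.sum(axis=1, keepdims=True)   # transition matrix
    out = cam.copy()
    for _ in range(n_iters):
        out = out @ T.T      # each pixel averages scores from similar pixels
    return out
```

With an identity affinity the scores are unchanged, while a uniform affinity diffuses high scores across all connected pixels, which is why consistent high-score seeds inside the object matter so much for this stage.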
Class-wise performance comparison between the pseudo ground truth of our method and our baseline on the PASCAL VOC 2012 validation set using the mIoU evaluation metric.
We further compare class-wise performance after the random walk in Tab. III. It shows that our method achieves consistently higher performance except on the cow, diningtable and dog classes. There are two main causes: (1) The initial localisation of the diningtable class focuses primarily on the tableware rather than on the diningtable itself; our smoothness loss cannot spread the high scores to cover the table object, as shown in the first row of Fig. 4. (2) The cow and dog classes sometimes have a single source of high-score area in the initial localisation, with attention placed solely on the eye and the nose for the cow and dog classes, respectively. Our smoothness loss tends to find a trivial solution by creating a score contrast around these small high-score areas, leading to under-segmentation in this case, as illustrated in the second and third rows of Fig. 4.
Accuracy vs Image Occupancy Percentage (IOP)
We have noticed that the mismatch between the class-wise IOP for ImageNet and PASCAL VOC 2012 results in reduced segmentation accuracy for classes that are too large or too small. The IOPs of the 20 classes of the PASCAL VOC 2012 dataset are between 5% and 35%, as shown in Fig. 5, while those in ImageNet are around 25%. As a result, the group of classes with around 20% IOP has the highest mIoU accuracy. Among this group, the Diningtable and Sofa classes have lower mIoU because they are generally associated with heavy occlusion and potentially inaccurate initial localisation, as manifested in Fig. 4. Besides, furniture-related images account for only 1.4% of the total images in the ImageNet dataset, a much smaller proportion than those of other classes. The Cat, Car and Bus classes have around 30% IOP; they usually suffer from the under-segmentation problem, resulting in lower mIoU accuracy. The classes with IOP smaller than 11% have the lowest segmentation accuracy. The mismatch in IOP between the two datasets leads to over-segmentation, which is detrimental to small class objects under the mIoU evaluation metric. The bird class achieves accurate segmentation results despite its small IOP. This can be attributed to the fact that ImageNet has abundant images of different kinds of birds; in total, bird images account for around 5.7% of the total images in ImageNet.
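For clarity, the IOP of a class in a single image is simply the percentage of pixels its mask occupies; the helper and toy mask below are illustrative, not the paper's measurement code:

```python
import numpy as np

def image_occupancy_percentage(mask, class_id):
    """Percentage of image pixels belonging to `class_id` in a label mask."""
    return 100.0 * (mask == class_id).sum() / mask.size

# toy 10x10 mask with a 10-pixel object of class 3 -> IOP of 10%
mask = np.zeros((10, 10), dtype=int)
mask[:2, :5] = 3
```

The dataset-level figures quoted above would be obtained by averaging this quantity over all images containing the class.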
IV-C Ablation Study
The improvements in performance can be attributed to the structural constraints imposed by our proposed module and loss function. Table IV shows an ablation study of our SBDM and the semantic boundary guided smoothness loss function. Through the SBDM, the feature maps from multiple levels of the encoder become structure-aware, which indirectly enhances the quality of the final score map by passing structural information. The major improvements come from the semantic boundary-guided smoothness loss function. It creates a sharp contrast of class activation scores around the semantic boundary while enforcing more consistent high scores within the area enclosed by the semantic boundary.
Semantic Boundary Detection Module: Our SBDM uses both feature maps and Canny edges to predict the semantic boundary. SDRG [huang2018weakly] has also used boundary information to impose structural constraints; however, it only uses the boundary to separate foreground and background pixels. We demonstrate that making feature maps structure-aware is beneficial to predicting class activation maps. Tab. V shows that the SBDM requires both high-level and low-level features to determine the boundary semantically and spatially. The Canny edge is shown to provide extra spatial cues for the bottom-up estimation of the semantic boundary, leading to improved performance.
Semantic Boundary-Guided Smoothness Loss: The semantic boundary is essential for the smoothness loss to be usable in WSSS with only image labels. As unwanted noisy edges are not present within the semantic boundary, it can enlarge the high-score area while alleviating over-segmentation. Tab. VI demonstrates that the factor assigned to the smoothness loss has a large impact on its performance. When the factor is too low, the smoothness loss fails to spread the high scores to surrounding pixels. On the contrary, when the factor is too high, it finds a trivial solution by reducing all scores to minimal values, resulting in highly deteriorated results.
IV-D Comparison with WSSS state-of-the-art
Tab. VII compares the results of previous WSSS methods using only image-level labels with ours, and Tab. IX provides a more detailed comparison of class-wise performance. Our method improves upon our baseline model on both validation and test sets. Furthermore, when no additional supervision is used either implicitly or explicitly, our method achieves state-of-the-art results on the PASCAL VOC 2012 test set. As WSSS with image-level labels does not explicitly predict the background class, many methods separate foreground pixels from background ones by leveraging an external saliency detector, which implicitly uses saliency ground truth, including object segmentation boundary information, as additional supervision. Tab. VIII shows previous WSSS methods equipped with an external saliency detector. It can be seen that our method still outperforms most of them, except the ICD method [fan2020learning], without implicitly using the saliency ground truth. We also visualise some qualitative results of our semantic segmentation model in Fig. 6, which demonstrates its ability to segment objects belonging to different classes.
V Conclusion

In this paper, we propose a Smoothing Branch (SB) to impose structural constraints in the training of WSSS models with image-level labels. The SB leverages the derived semantic boundary to make the feature maps structure-aware, which subsequently improves the generated class activation maps. The semantic boundary also guides the smoothness loss function to make the scores more consistent within the enclosed object area and reduce the scores at outside pixels. Comprehensive experiments are conducted to study the components of our proposed method. The generated class activation maps are thus capable of better preserving the object structure. The semantic segmentation network trained with our segmentation maps as pseudo ground truth achieves state-of-the-art performance on the PASCAL VOC 2012 dataset, demonstrating the superior performance of our method.