SegSort: Segmentation by Discriminative Sorting of Segments

10/15/2019 ∙ by Jyh-Jing Hwang, et al.

Almost all existing deep learning approaches for semantic segmentation tackle this task as a pixel-wise classification problem. Yet humans understand a scene not in terms of pixels, but by decomposing it into perceptual groups and structures that are the basic building blocks of recognition. This motivates us to propose an end-to-end pixel-wise metric learning approach that mimics this process. In our approach, the optimal visual representation determines the right segmentation within individual images and associates segments with the same semantic classes across images. The core visual learning problem is therefore to maximize the similarity within segments and minimize the similarity between segments. Given a model trained this way, inference is performed consistently by extracting pixel-wise embeddings and clustering, with the semantic label of each segment determined by the majority vote of its nearest neighbors from an annotated set. As a result, we present SegSort as a first attempt at using deep learning for unsupervised semantic segmentation, achieving 76% of the performance of its supervised counterpart. When supervision is available, SegSort shows consistent improvements over conventional approaches based on pixel-wise softmax training. Additionally, our approach produces more precise boundaries and more consistent region predictions. The proposed SegSort further produces interpretable results, as each label choice can be easily understood from the retrieved nearest segments.


1 Introduction

Figure 1: Top: Our proposed approach partitions an image in the embedding space into aligned segments (framed in red) and assigns the majority labels from the retrieved segments (framed in green or pink). Bottom: Our approach presents the first deep learning based unsupervised semantic segmentation (right). In the supervised setting (middle), our approach produces more consistent region predictions and more precise boundaries than its parametric counterpart (left).

Semantic segmentation is usually approached by extending image-wise classification [41, 38] to pixel-wise classification, deployed in a fully convolutional fashion [47]. In contrast, we study the semantic segmentation task in terms of perceiving an image in groups of pixels and associating objects from a large set of images. In particular, we take the perceptual organization view [66, 6] that pixels group by visual similarity and objects form by visual familiarity; consequently, a representation is developed to best relate pixels and segments to each other in the visual world. Our method, so motivated, not only achieves better supervised semantic segmentation but also presents the first attempt at using deep learning for unsupervised semantic segmentation.

We formulate this intuition as an end-to-end metric learning problem. Each pixel in an image is mapped via a CNN to a point in a visual embedding space, where nearby points indicate pixels belonging to the same segment. From all the segments collected across images, clusters in the embedding space form semantic concepts. In other words, we sort segments with respect to their visual and semantic attributes. The optimal visual representation delivers the right segmentation within individual images and associates segments with the same semantic classes across images, yielding a non-parametric model whose complexity scales with the number of segments (exemplars).

We derive our method from maximum likelihood estimation of a single equation, resulting in a two-stage Expectation-Maximization (EM) framework. The first stage performs spherical (von Mises-Fisher) K-Means clustering [4] for image segmentation. The second stage adapts the E-step into a pixel-to-segment loss to optimize the metric learning CNN.

As a result, we present SegSort (Segment Sorting) as a first attempt to apply deep learning to semantic segmentation from the unsupervised perspective. Specifically, we create pseudo segmentation masks aligned with visual cues using a contour detector [2, 32, 73] and train the pixel-wise embedding network to separate all the segments. The unsupervised SegSort achieves 76% of the performance of its supervised counterpart. We further show that various visual groups are automatically discovered in our framework.

When supervision is available (i.e., supervised semantic segmentation), we segment each image with spherical K-Means clustering and train the network following the same optimization, but incorporating the Neighborhood Components Analysis criterion [22, 71] for semantic labels.

To summarize our major contributions:

  1. We present the first end-to-end trained non-parametric approach for supervised semantic segmentation, with performance exceeding its parametric counterparts that are trained with pixel-wise softmax loss.

  2. We propose the first unsupervised deep learning approach for semantic segmentation, which achieves 76% of the performance of its supervised counterpart.

  3. Our segmentation results can be easily understood from the retrieved nearest segments and are readily interpretable.

  4. Our approach produces more precise boundaries and more consistent region segmentations compared with parametric pixel-wise prediction approaches.

  5. We demonstrate the effectiveness of our method on two challenging datasets, PASCAL VOC 2012 [16] and Cityscapes [14].

2 Related Works

Segmentation and Clustering. Segmentation involves extracting representations from local patches and clustering them based on different criteria, e.g., fitting mixture models [74, 5], mode-finding [13, 4], or graph partitioning [18, 62, 49, 64, 78]. The mode-finding algorithms, e.g., mean shift [13] or K-Means [26, 4], are most related to our work. Traditionally, pixels are encoded in a joint spatial-range domain by a single vector that concatenates their spatial coordinates and visual features; applying mean shift or K-Means filtering then converges to a mode for each pixel. Spectral graph theory [12], and in particular the Normalized Cut criterion [62], provides a way to further integrate global image information for better segmentation. More recently, superpixel approaches [1] have emerged as a popular pre-processing step that reduces computation or refines semantic segmentation predictions [20]. However, the challenge of perceptual organization is to process information from different levels together to form a consensus segmentation. Our proposed approach therefore aims to integrate image segmentation and clustering into end-to-end embedding learning for semantic segmentation.

Semantic Segmentation. Current state-of-the-art semantic segmentation models are based on Fully Convolutional Networks [41, 61, 47], tackling the problem via pixel-wise classification. Given limited local context, correctly classifying a single pixel may be ambiguous, so it is common to resort to multi-scale context information [28, 63, 36, 39, 23, 76, 51, 34, 31]. Typical approaches include image pyramids [17, 55, 15, 43, 10, 8] and encoder-decoder structures [3, 56, 42, 19, 54, 77, 79, 11]. Notably, to better capture multi-scale context, PSPNet [80] performs spatial pyramid pooling [24, 40, 46] at several grid scales, while DeepLab [8, 9, 75] applies the ASPP module (Atrous Spatial Pyramid Pooling), consisting of several parallel atrous convolutions [30, 21, 61, 53] with different rates. In this work, we apply our proposed training algorithm to PSPNet and DeepLabv3+ and show consistent improvements.

Before the advent of deep learning, non-parametric methods for semantic segmentation were explored. In the unsupervised setting, [57] proposes data-driven boundary and image grouping, formulated with an MRF to enhance semantic boundaries; [67] extracts superpixels before nearest neighbor search; [45] uses dense SIFT to find dense deformation fields between images to segment and recognize a query image. With supervision, [50] learns semantic object exemplars for detection and segmentation.

It is worth noting that Kong and Fowlkes [37] also integrate vMF mean-shift clustering into a semantic segmentation pipeline. However, their clustering with a contrastive loss is used to regularize features, and the whole system still relies on a softmax loss to produce the final segmentation.

Our work also bears similarity to Scene Collaging [33], which presents a nonparametric scene grammar for parsing images into segments whose object labels are retrieved from a large dataset of example images.

Figure 2: The overall training diagram of our proposed framework, Segment Sorting (SegSort), with vMF clustering [4]. Given a batch of images (leftmost), we compute pixel-wise embeddings (middle left) from a metric learning segmentation network. We then segment each image with vMF clustering (middle right), dubbed pixel sorting. We train the network via maximum likelihood estimation derived from a mixture of vMF distributions, dubbed segment sorting. In between, we also illustrate how pixel-wise features on a hypersphere are processed for pixel and segment sorting. A segment (rightmost) is framed with its corresponding vMF clustering color if it appears in the displayed images; unframed segments from different images are associated in the embedding space. Inference follows the same procedure but uses k-nearest neighbor search to associate segments in the training set.

Metric Learning. Metric learning approaches [35, 22] have achieved remarkable performance on different vision tasks, such as image retrieval [70, 72, 71] and face recognition [65, 69, 60]. Such tasks usually involve open-world recognition, since classes during testing might be disjoint from those in the training set. Metric learning minimizes intra-class variations and maximizes inter-class variations with pairwise losses, e.g., the contrastive loss [7] and the triplet loss [29]. Recently, Wu et al. [72] proposed a non-parametric softmax formulation for training feature embeddings to separate every image for unsupervised image recognition and retrieval. The non-parametric softmax is further incorporated with Neighborhood Components Analysis [22] to improve generalization for supervised image recognition [71]. An important technical point in metric learning is normalization [68, 60], so that features lie on a hypersphere, which is why the vMF distribution is of particular interest.

3 Method

Our end-to-end learning framework consists of three sequential components: 1) A CNN, e.g., DeepLab [11], FCN [47], or PSPNet [80], that generates pixel-wise embeddings from an image. 2) A clustering method that partitions the pixel-wise embeddings into a fine segmentation, dubbed pixel sorting. 3) A metric learning formulation for separating and grouping the segments into semantic clusters, dubbed segment sorting.

We start with the assumption that the pixel-wise normalized embeddings from the CNN within a segment follow a von Mises-Fisher (vMF) distribution. We thus formulate pixel sorting as spherical K-Means clustering and segment sorting as the corresponding maximum likelihood estimation. During inference, segment sorting is replaced with k-nearest neighbor search, and we assign to each query segment the majority label of its retrieved segments.

We now give a high-level mathematical explanation of the entire optimization process. Let $V = \{\mathbf{v}_i\}$ be the set of pixel embeddings, where $\mathbf{v}_i$ is produced by a CNN with parameters $\theta$ at pixel $i$. Let $z = \{z_i\}$ be the image segmentation with $K$ segments, where $z_i = k$ indicates that pixel $i$ belongs to segment $k$. Let $\mu = \{\boldsymbol{\mu}_k\}$ be the set of parameters that capture the representative feature of each segment through a predefined distribution (a mixture of vMF distributions here). Our main optimization objective can be summarized as

$\max_{\theta,\, \mu,\, z} \; \log p(V \mid z, \mu; \theta).$   (1)

In pixel sorting, we use a standard EM framework to find the optimal $z$ and $\mu$, with $\theta$ fixed. In segment sorting, we adapt the previous E-step into a loss computed over a set of images to optimize $\theta$, with $z$ and $\mu$ fixed. Performing pixel sorting and segment sorting can thus be viewed as a two-stage EM framework.

This section is organized as follows. We first describe pixel sorting in Sec. 3.1, which includes a brief review of spherical K-Means clustering and the creation of aligned segments. We then derive two forms of the segment sorting loss in Sec. 3.2. Finally, we describe the inference procedure in Sec. 3.3. The overall training diagram is illustrated in Fig. 2 and the summarized algorithm can be found in the supplementary.

3.1 Pixel Sorting

We briefly review the vMF distribution and its corresponding spherical K-Means clustering algorithm [4], which we use to segment an image during pixel sorting.

We assume the pixel-wise $d$-dimensional embeddings $\mathbf{v}_i$ (the CNN's last-layer features after $\ell_2$ normalization) within a segment follow a vMF distribution. vMF distributions are of particular interest because they are among the simplest distributions with properties analogous to those of the multivariate Gaussian for directional data. The probability density function is given by

$p(\mathbf{v}; \boldsymbol{\mu}, \kappa) = C_d(\kappa) \exp(\kappa\, \boldsymbol{\mu}^\top \mathbf{v}),$   (2)

where $C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2} I_{d/2-1}(\kappa)}$ is the normalizing constant and $I_{d/2-1}$ denotes the modified Bessel function of the first kind at order $d/2-1$. $\boldsymbol{\mu}$ is the mean direction and $\kappa$ is the concentration parameter; larger $\kappa$ indicates stronger concentration about $\boldsymbol{\mu}$. In our particular case, we assume a constant $\kappa$ for all vMF distributions to circumvent the expensive calculation of $C_d(\kappa)$.
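For concreteness, the following is a minimal SciPy sketch (not the authors' released code) of evaluating the vMF log-density in Eqn. 2. It also illustrates why keeping $\kappa$ constant is convenient: the normalizer involves a modified Bessel function that is costly to recompute, and with a shared $\kappa$ and hard assignments it cancels out of the posterior anyway.

import numpy as np
from scipy.special import ive  # exponentially scaled Bessel function: I_v(x) * exp(-|x|)


def vmf_log_normalizer(kappa: float, d: int) -> float:
    """log C_d(kappa) = (d/2 - 1) log kappa - (d/2) log(2 pi) - log I_{d/2-1}(kappa)."""
    order = d / 2.0 - 1.0
    # log I_order(kappa) = log(ive(order, kappa)) + kappa  (scaled form avoids overflow)
    log_bessel = np.log(ive(order, kappa)) + kappa
    return order * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - log_bessel


def vmf_log_pdf(v: np.ndarray, mu: np.ndarray, kappa: float) -> np.ndarray:
    """Log-density of unit vectors v of shape (N, d) under vMF(mu, kappa)."""
    return vmf_log_normalizer(kappa, v.shape[1]) + kappa * (v @ mu)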

The embeddings of an image with $K$ segments can thus be considered as drawn from a mixture of vMF distributions with a uniform prior, i.e.,

$p(\mathbf{v}_i; \mu, \kappa) = \sum_{k=1}^{K} \frac{1}{K}\, p(\mathbf{v}_i; \boldsymbol{\mu}_k, \kappa),$   (3)

where $\mu = \{\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K\}$. Let $z_i$ be the hidden variable indicating that pixel embedding $\mathbf{v}_i$ belongs to a particular segment $k$, i.e., $z_i = k$. Let $V = \{\mathbf{v}_i\}$ be the set of pixel embeddings and $z = \{z_i\}$ the set of corresponding hidden variables. The log-likelihood of the observed data is thus given by

$\log p(V; \mu, \kappa) = \sum_{i} \log \sum_{k=1}^{K} \frac{1}{K}\, p(\mathbf{v}_i; \boldsymbol{\mu}_k, \kappa).$   (4)

Since $z$ is unknown, the EM framework is used to estimate this otherwise intractable maximum likelihood, resulting in the spherical K-Means algorithm under the assumption $\kappa \to \infty$. This assumption holds if all the embeddings within a region are the same (homogeneous), which will be our training objective described in Sec. 3.2.

The E-step that maximizes the likelihood of Eqn. 4 assigns $z_i$ with the posterior probability [52]:

$p(z_i = k \mid \mathbf{v}_i; \mu, \kappa) = \frac{\exp(\kappa\, \boldsymbol{\mu}_k^\top \mathbf{v}_i)}{\sum_{k'=1}^{K} \exp(\kappa\, \boldsymbol{\mu}_{k'}^\top \mathbf{v}_i)}.$   (5)

In the setting of K-Means, we use hard assignments to update $z_i$, i.e., $z_i = \arg\max_k p(z_i = k \mid \mathbf{v}_i; \mu, \kappa)$. We further denote the set of pixels within segment $k$ as $R_k$; hence, after the hard assignment, the posterior is $1$ if $i \in R_k$ and $0$ otherwise.

The M-step that maximizes the expectation of Eqn. 4 can be derived [4] as

$\boldsymbol{\mu}_k = \frac{\sum_{i \in R_k} \mathbf{v}_i}{\left\| \sum_{i \in R_k} \mathbf{v}_i \right\|},$   (6)

which is the mean direction of the pixel embeddings within segment $k$. Spherical K-Means clustering thus alternates updates of $z$ (E-step) and $\mu$ (M-step).

One problem with K-Means clustering is the dynamic number of EM steps, which causes uncertain memory consumption during training. In practice, however, we find that a small fixed number of EM steps already produces good segmentations.

If we use only the embedding features for K-Means clustering, the resulting clusters are often disconnected and scattered. As our goal is to spatially segment an image, we concatenate pixel coordinates with the embeddings so that the K-Means clustering is guided spatially.
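As a concrete illustration, below is a minimal NumPy sketch (not the authors' implementation) of this pixel sorting step: spherical K-Means with a fixed number of EM iterations on embeddings concatenated with scaled pixel coordinates. The coordinate weight, the grid initialization, and the default cluster count are illustrative assumptions; the cluster count is rounded up to a square grid here.

import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def pixel_sorting(embeddings, num_clusters=36, num_iters=10, coord_weight=0.1):
    """Cluster per-pixel embeddings of one image into segments.

    embeddings: (H, W, D) array of unnormalized CNN features.
    Returns: (H, W) integer map of segment ids.
    """
    h, w, d = embeddings.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys / h, xs / w], axis=-1) * coord_weight    # spatial guidance
    feats = np.concatenate([embeddings, coords], axis=-1).reshape(-1, d + 2)
    feats = l2_normalize(feats)

    # Initialize assignments by uniformly partitioning the image into a grid.
    grid = int(np.ceil(np.sqrt(num_clusters)))
    labels = ((ys * grid // h) * grid + (xs * grid // w)).reshape(-1)

    for _ in range(num_iters):
        # M-step (Eqn. 6): mean direction of each cluster, re-normalized to the sphere.
        mu = np.zeros((grid * grid, d + 2))
        np.add.at(mu, labels, feats)
        mu = l2_normalize(mu)
        # E-step (Eqn. 5 with hard assignment): nearest prototype by cosine similarity.
        labels = np.argmax(feats @ mu.T, axis=1)
    return labels.reshape(h, w)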

Creating Aligned Segments. Segments aligned with visual cues are critical for producing coherent boundaries. However, segments produced by K-Means clustering do not always conform to the ground truth boundaries. If one segment contains different semantic labels, it clearly contradicts our assumption of homogeneous embeddings within a segment. Therefore, in the supervised setting we partition each segment by the ground truth mask so that every resulting segment contains exactly one semantic label, as illustrated in Fig. 3.

It is easy to see that the segments after partitioning are always aligned with semantic boundaries. Furthermore, the partitioning creates small false-positive and false-negative segments that naturally serve as hard negative examples during loss calculation.
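A small sketch of one way to realize this partitioning follows; the id encoding is an illustrative choice, not necessarily the authors'. Each K-Means segment is intersected with the ground-truth mask so that every resulting segment carries a single label.

import numpy as np


def align_segments(cluster_ids, gt_labels, num_classes=21):
    """cluster_ids, gt_labels: (H, W) integer maps. Returns an (H, W) map of aligned segment ids."""
    combined = cluster_ids * num_classes + gt_labels        # unique per (cluster, label) pair
    _, aligned = np.unique(combined, return_inverse=True)   # compact, contiguous ids
    return aligned.reshape(cluster_ids.shape)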

3.2 Segment Sorting

Figure 3: During supervised training, we partition the proposed segments (left) by the ground truth mask (middle). The resulting segments (right) are thus aligned with the ground truth mask, and each aligned segment is labeled according to it. Note that the purple and yellow segments become, respectively, a false positive and a false negative that help regularize predicted boundaries.

Following our assumption of homogeneous embeddings per segment, training enforces this criterion by optimizing the CNN parameters for better feature extraction.

We first define a prototype as the most representative embedding feature of a segment. Since the embeddings in a segment follow a vMF distribution, the mean direction vector $\boldsymbol{\mu}_k$ in Eqn. 6 naturally serves as the prototype.

In Sec. 3.1, we considered the posterior probability of a pixel embedding belonging to a segment with fixed CNN parameters in Eqn. 5. We now revisit it with free CNN parameters $\theta$ and a constant hyperparameter $\kappa$:

$p(z_i = k \mid \mathbf{v}_i; \theta) = \frac{\exp(\kappa\, \boldsymbol{\mu}_k^\top \mathbf{v}_i)}{\sum_{k'=1}^{K} \exp(\kappa\, \boldsymbol{\mu}_{k'}^\top \mathbf{v}_i)}.$   (7)

As both the embedding and the prototype are of unit length, the dot product $\boldsymbol{\mu}_k^\top \mathbf{v}_i$ becomes the cosine similarity. The numerator is the exponentiated cosine similarity between a pixel embedding $\mathbf{v}_i$ and a particular segment prototype $\boldsymbol{\mu}_k$; the denominator sums the exponentiated cosine similarities w.r.t. all the segment prototypes. The value of $p(z_i = k \mid \mathbf{v}_i; \theta)$ therefore indicates how close the pixel embedding is to segment $k$ relative to all the other segments.

The training objective is thus to maximize the posterior probability of a pixel embedding belonging to its corresponding segment obtained from the K-Means clustering. In other words, we want to minimize the following negative log-likelihood, or the vMF loss:

$L_{\mathrm{vMF}}(\mathbf{v}_i) = -\log p(z_i = k_i \mid \mathbf{v}_i; \theta),$   (8)

where $k_i$ denotes the segment that pixel $i$ belongs to. The total loss is the average over all pixels. Minimizing $L_{\mathrm{vMF}}$ has two effects: the numerator encourages each pixel embedding to be close to its own segment prototype, while the denominator encourages each embedding to be far away from all the other segment prototypes.

Note that this vMF loss does not require any ground truth semantic labels, so we can use it to train the CNN in an unsupervised setting. As the loss pushes every segment as far away from the others as possible, visually similar segments are forced to stay closer together on the hypersphere.
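The following is a minimal PyTorch-style sketch of this unsupervised vMF loss (Eqn. 8). It is not the authors' code; the kappa value and tensor shapes are illustrative assumptions.

import torch
import torch.nn.functional as F


def vmf_loss(embeddings, segment_ids, kappa=10.0):
    """Unsupervised segment-sorting loss (Eqn. 8).

    embeddings:  (N, D) unnormalized pixel features from the CNN.
    segment_ids: (N,)   long tensor of segment indices in [0, K).
    """
    v = F.normalize(embeddings, dim=1)                        # unit-length embeddings
    num_segments = int(segment_ids.max().item()) + 1
    one_hot = F.one_hot(segment_ids, num_segments).float()    # (N, K)
    # Segment prototypes: re-normalized mean direction of each segment (Eqn. 6).
    prototypes = F.normalize(one_hot.t() @ v, dim=1)          # (K, D)
    logits = kappa * (v @ prototypes.t())                     # (N, K) scaled cosine similarities
    # Cross-entropy against the pixel's own segment id = negative log posterior (Eqn. 8).
    return F.cross_entropy(logits, segment_ids)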

To make use of ground truth semantic information, we consider the soft neighborhood assignments of Neighborhood Components Analysis [22]. The idea is to encourage the probability of one example selecting neighbors (excluding itself) of the same category. In our case, we want to encourage the probability of a pixel embedding selecting any other segment of the same category as its neighbor. We define such probability as follows, adapted from Eqn. 7 by excluding the pixel's own segment $k_i$:

$p_{ik} = \frac{\exp(\kappa\, \boldsymbol{\mu}_k^\top \mathbf{v}_i)}{\sum_{k' \neq k_i} \exp(\kappa\, \boldsymbol{\mu}_{k'}^\top \mathbf{v}_i)}, \quad k \neq k_i.$   (9)

We denote the set of segments in the same category as pixel $i$ (excluding its own segment) as $C_i^{+}$. Our final loss function is therefore the negative log total probability of pixel $i$ selecting a neighbor prototype in the same category:

$L_{\mathrm{vMF\text{-}N}}(\mathbf{v}_i) = -\log \sum_{k \in C_i^{+}} p_{ik}.$   (10)

The total loss is the average over all pixels. Minimizing this loss maximizes the expected number of pixels correctly classified by associating the right neighbor prototypes. The ground truth labels are thus used only to find the set of same-class segments w.r.t. pixel $i$ within a mini-batch (and the memory banks). If there is no other segment of the same category, we fall back to the previous vMF loss. Since both the vMF and vMF-N losses serve the same purpose of grouping and separating segments by optimizing the CNN feature extraction, we dub them segment sorting losses.
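A minimal PyTorch-style sketch of the vMF-N loss (Eqns. 9 and 10) follows. It assumes all prototypes come from the current batch (memory banks omitted); the kappa value, shapes, and fallback handling are illustrative assumptions rather than the authors' exact implementation.

import torch
import torch.nn.functional as F


def vmf_nca_loss(embeddings, segment_ids, segment_labels, kappa=10.0, eps=1e-12):
    """Supervised segment-sorting loss (Eqns. 9 and 10).

    embeddings:     (N, D) pixel features.
    segment_ids:    (N,)   long tensor of segment indices in [0, K).
    segment_labels: (K,)   semantic label of each segment in the batch.
    """
    v = F.normalize(embeddings, dim=1)
    num_segments = segment_labels.numel()
    one_hot = F.one_hot(segment_ids, num_segments).float()
    prototypes = F.normalize(one_hot.t() @ v, dim=1)                   # (K, D)

    logits = kappa * (v @ prototypes.t())                              # (N, K)
    own = one_hot.bool()                                               # a pixel's own segment
    logits = logits.masked_fill(own, float('-inf'))                    # exclude it (Eqn. 9)
    log_p = F.log_softmax(logits, dim=1)

    same_class = segment_labels[segment_ids].unsqueeze(1) == segment_labels.unsqueeze(0)
    neighbors = same_class & ~own                                      # same label, not itself
    # Eqn. 10: negative log of the probability mass on same-class neighbor prototypes.
    p_pos = (log_p.exp() * neighbors.float()).sum(dim=1)
    loss = -torch.log(p_pos + eps)
    # Pixels with no same-class neighbor would fall back to the vMF loss (omitted here).
    has_neighbor = neighbors.any(dim=1)
    return loss[has_neighbor].mean()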

Understandably, an essential component of the segment sorting losses is the existence of semantic neighbor segments (in the numerator) and the abundance of alien segments (in the denominator); the more examples presented at once, the better the optimization. We thus leverage two strategies: 1) We calculate the loss w.r.t. all the segments in the batch, as opposed to the traditional image-wise loss. 2) We use additional memory banks that cache the segment prototypes from the most recent batches. These two strategies help the fragmented segments (produced by the segment partitioning in Fig. 3) connect to similar segments in different images, or even in different batches.

3.3 Inference via K-Nearest Neighbor Retrieval

After training, we compute and store all the segment prototypes in the training set. Each prototype is computed from the pixels carrying the majority label within its segment, ignoring the remaining noisy pixels.

During inference, we again conduct the K-Means clustering and then perform k-nearest neighbor search for each segment to retrieve the labels from segments in the training set. The ablation study on inference runtime and memory can be found in the supplementary.
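A brief sketch of this retrieval step is shown below; the variable names are illustrative, and cosine similarity on unit-normalized prototypes is assumed.

import numpy as np


def knn_segment_labels(query_protos, train_protos, train_labels, k=11):
    """Label each query segment by majority vote over its k nearest training prototypes.

    query_protos: (Q, D) and train_protos: (M, D), both unit-normalized;
    train_labels: (M,) semantic label of each training prototype.
    """
    sims = query_protos @ train_protos.T                   # (Q, M) cosine similarities
    knn_idx = np.argsort(-sims, axis=1)[:, :k]             # indices of the k nearest prototypes
    knn_labels = train_labels[knn_idx]                     # (Q, k)
    return np.array([np.bincount(row).argmax() for row in knn_labels])


def labels_to_pixels(segment_ids, voted_labels):
    """Broadcast each query segment's voted label back to its pixels (H, W)."""
    return voted_labels[segment_ids]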

Our overall framework is non-parametric. We use vMF clustering to organize pixel embeddings into segment exemplars, whose number is proportional to the number of images in the training set. The exemplar embeddings are trained with a nearest neighbor criterion so that inference can be performed consistently, resulting in a non-parametric model.

 

Base / Backbone / Method mIoU f-measure
DeepLabv3+ / MNV2 / Softmax 72.51 50.90
DeepLabv3+ / MNV2 / SegSort 74.94 58.83
PSPNet / ResNet-101 / Softmax 80.12 59.64
PSPNet / ResNet-101 / ASM [31] 81.43 62.35
PSPNet / ResNet-101 / SegSort 81.77 63.71
DeepLabv3+ / MNV2 / Softmax 73.25 -
DeepLabv3+ / MNV2 / SegSort 74.88 -
PSPNet / ResNet-101 / Softmax 80.63 -
PSPNet / ResNet-101 / SegSort 82.41 -

 

Table 1: Quantitative results on Pascal VOC 2012. The first five rows report results on the validation set while the last four rows report results on the test set. Networks trained with SegSort consistently outperform their parametric counterpart (Softmax) by 1.63 to 2.43% in mIoU and by 4.07 to 7.97% in boundary f-measure.

4 Experiments

In this section, we demonstrate the efficacy of our Segment Sorting (SegSort) through experiments and visual analyses. We first describe the experimental setup in Section 4.1. Then we summarize all the quantitative and qualitative results of fully supervised semantic segmentation in Section 4.2. Lastly, we present results of the proposed approach for unsupervised semantic segmentation in Section 4.3. Additional experiments including ablation studies, t-SNE embedding visualization, and qualitative results on Cityscapes can be found in the supplementary.

4.1 Experimental Setup

Datasets. We mainly use two datasets in the experiments, i.e., PASCAL VOC 2012 [16] and Cityscapes [14].

The PASCAL VOC 2012 [16] segmentation dataset contains 20 object categories and one background class. The original dataset contains 1,464 (train) / 1,449 (val) / 1,456 (test) images. Following the procedure of [47, 8, 80], we augment the training data with the annotations of [25], resulting in 10,582 (train_aug) images.

Cityscapes [14] is a dataset for semantic urban street scene understanding. 5,000 high-quality, pixel-level, finely annotated images are divided into training, validation, and testing sets with 2,975 / 500 / 1,525 images, respectively. It defines 19 evaluation categories spanning groups such as flat, human, vehicle, construction, object, nature, etc.

Segmentation Architectures. We use DeepLabv3+ [11] and PSPNet [80] as the segmentation architectures, powered by MobileNetV2 [58] and ResNet-101 [27], respectively, both pre-trained on ImageNet [38].

We closely follow the training procedures of the base architectures when training the baseline model with the standard pixel-wise softmax cross-entropy loss. The performance of the final model might be slightly worse than what is reported in the original papers, mainly for two reasons: 1) We do not pre-train on any other segmentation dataset, such as MS COCO [44]. 2) We do not adopt any additional training tricks, such as balanced sampling or fine-tuning specific categories.

Hyper-parameters of SegSort. For all the experiments, we use the following hyper-parameters for training SegSort: the dimension of the embeddings, the number of K-Means clusters (set separately for VOC and Cityscapes), the number of K-Means EM steps (also set per dataset), and the concentration constant $\kappa$. During inference, we use the same K-Means hyper-parameters for segmentation and a fixed number of nearest neighbors for predicting categories.

We use different learning rates and numbers of iterations for supervised training with SegSort. For VOC 2012, we train the network for a number of iterations on the train_aug set and then for additional iterations on the train set, each with its own initial learning rate. For Cityscapes, we train the network for the same 90k iterations as the softmax baseline.

Training on VOC 2012 requires more iterations than the baseline procedure, mainly because most images contain only a few categories while the network can only compare segments within batches (plus the cached memory banks). We find that enlarging the batch size or the memory banks could reduce the required training iterations. In comparison, images from Cityscapes contain ample categories, so the training iterations remain the same.

 

Method road swalk build. wall fence pole tlight tsign veg. terrain sky person rider car truck bus train mbike bike mIoU
Softmax 97.96 83.89 92.22 57.24 59.31 58.89 68.39 77.07 92.18 63.71 94.42 81.80 63.11 94.85 73.54 84.82 67.42 69.34 77.42 76.72
SegSort 98.18 84.86 92.75 55.63 61.57 63.72 71.66 80.01 92.62 64.64 94.65 82.32 62.75 95.08 77.27 87.07 78.89 63.63 77.51 78.15

 

Table 2: Per-class results on the Cityscapes validation set. The network trained with SegSort consistently outperforms Softmax.
Figure 4: Visual comparison on PASCAL VOC 2012 validation set. We show the visual examples with DeepLabv3+ (upper 2 rows) and PSPNet (lower 2 rows). We observe prominent improvements on thin structures, such as human leg and chair legs. Also, more consistent region predictions can be observed when context is critical, such as wheels in motorcycles and big trunk of buses.
Figure 5: Two examples, correct and incorrect predictions, for segment retrieval for supervised semantic segmentation on VOC 2012 validation set. Query segments (leftmost) are framed by the same color in clustering. (Top) The query segments of rider, horse, and horse outlines can retrieve corresponding semantically relevant segments in the training set. (Bottom) For the failure case, it can be inferred from the retrieved segments that the number tag on the front of bikes is confused by the other number tags or front lights on motorbikes, resulting in false predictions.

4.2 Fully Supervised Semantic Segmentation

VOC 2012: We summarize the quantitative results of fully supervised semantic segmentation on Pascal VOC 2012 [16] in Table 1, evaluated with mIoU and the boundary measure following [2, 34] on both the validation and test sets.

We conclude that networks trained with SegSort consistently outperform their parametric counterpart (Softmax) by 1.63 to 2.43% in mIoU and by 4.07 to 7.97% in mean boundary f-measure. (Per-class results can be found in the supplementary.) We notice that SegSort with DeepLabv3+ / MNV2 captures fine structures better, such as 'bike' and 'mbike', while SegSort with PSPNet / ResNet-101 is stronger at detecting small objects, such as 'boat' and 'plant'.

We present the visual comparison in Fig. 4. We observe prominent improvements on thin structures, such as human legs and chair legs. Also, more consistent region predictions can be found when context is critical, such as wheels in motorcycles and big trunk of buses.

One of the most important features of SegSort is the self-explanatory predictions via nearest neighbor segment retrieval. We therefore demonstrate two examples, correct and incorrect predictions, in Fig. 5. As can be seen, the query segments (on the leftmost) of rider, horse, and horse outlines can retrieve corresponding semantically relevant segments in the training set. For the incorrect example, it can be inferred from the retrieved segments that the number tag on the front of bikes was confused by the other number tags on motorbikes, resulting in false predictions.

Cityscapes: We summarize the quantitative results of fully supervised semantic segmentation on Cityscapes [14] in Table 2, evaluated on the validation set. Due to limited space, visual results are included in the supplementary.

The network trained with SegSort outperforms Softmax consistently. Large objects, e.g., 'bus' and 'truck', are improved thanks to more consistent region predictions, while small objects, e.g., 'pole' and 'tlight', are better captured.

Figure 6: Training data for unsupervised semantic segmentation. We produce fine segmentations (right), HED-owt-ucm, from the contours (middle) detected by HED [73], followed by the procedure in gPb-owt-ucm [49].

 

unsup. on train_aug sup. on train_aug sup. on train mIoU f-measure
49.50 40.86
55.86 44.78
71.86 55.70
72.47 56.52
73.35 55.69

 

Table 3: Quantitative results related to unsupervised semantic segmentation on the Pascal VOC 2012 validation set. Our unsupervised trained network (2nd row) outperforms the baseline (1st row) of directly clustering pretrained features over the HED-owt-ucm [73] segmentation and achieves 76% of the performance of its supervised counterpart (5th row). Also, the network fine-tuned from unsupervised pre-trained embeddings (4th row) outperforms the one without (3rd row) in both mIoU and boundary f-measure.
Figure 7: Segment retrieval results for unsupervised semantic segmentation on the VOC 2012 validation set. Query segments (leftmost) are framed with the same colors as in the clustering. The embeddings learned by unsupervised SegSort attend to visual rather than semantic similarities compared to the supervised setting. Hairs, faces, blue shirts, and wheels are retrieved successfully. The last query segment fails because the texture around the knee is more similar to animal skins.

4.3 Unsupervised Semantic Segmentation

We train the model using our framework without any ground truth labels at any level, pixel-wise or image-wise.

To adapt our approach to unsupervised semantic segmentation, all we need is a good criterion for segmenting an image along visual boundaries, which serves as a pseudo ground truth mask. An array of methods meets this requirement, e.g., SLIC [1] for superpixels or gPb-owt-ucm [2] for hierarchical segmentation. We choose the HED contour detector [73] pretrained on the BSDS500 dataset [2] and follow the procedure of gPb-owt-ucm [2] to produce the hierarchical segmentation, denoted HED-owt-ucm (Fig. 6).

We train the PSPNet / ResNet-101 network on the same augmented VOC 2012 training set as in the supervised setting, with the same initial learning rate but fewer iterations. The hyper-parameters remain unchanged.

Note that the contour detector only provides visual boundaries without any concept of semantic segments, yet through our feature learning with segment sorting, our method discovers segments of common features – semantic segmentation without names.

For the sake of performance evaluation, we assume there is a separate annotated image set available during inference. For each segment under query, we assign a label by the majority vote of its nearest neighbors from that annotated set.

Table 3 shows that our unsupervised trained network outperforms the baseline of directly clustering pretrained features over the HED-owt-ucm [73] segmentation and achieves 76% of the performance of its supervised counterpart. We also showcase one possible way to make use of the unsupervised learned embeddings: the network fine-tuned from unsupervised pre-trained embeddings outperforms the one trained without them. Fig. 7 shows that the embeddings learned by unsupervised SegSort attend to visual rather than semantic similarities compared to the supervised setting, because the fine segmentation formed by contour detectors partitions the image into visually consistent segments. Hairs, faces, blue shirts, and wheels are retrieved successfully. The last query segment fails because the texture around the knee is more similar to animal skins.

Figure 8: We perform FINCH [59], a nearest-neighbor-based hierarchical agglomerative clustering, on the foreground segment prototypes to discover visual groups. The top two rows show random samples from two clusters at the finest level. The bottom table displays clusters at a coarser level. We show four representative segments per cluster.

Automatic Discovery of Visual Groups. We noticed in the retrieval results that CNNs trained this way can discover visual groups. We therefore ask whether such visual structures actually form distinct clusters (or fine-grained categories).

We extract all foreground segments in the training set and perform FINCH [59], a nearest-neighbor-based hierarchical agglomerative clustering algorithm. FINCH merges two points if one is the nearest neighbor of the other (treating the link as undirected), and the procedure can be applied recursively. Starting from the segment prototypes, each FINCH iteration produces a successively coarser set of clusters. We visualize some segment groups at the finest level and at a coarser level in Fig. 8.
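A compact sketch of one FINCH round as described above is given below (an illustration, not the reference implementation). Repeating the same round on the per-cluster mean directions yields the coarser levels.

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components


def finch_round(features):
    """One round of first-neighbor clustering on (N, D) unit-normalized prototypes."""
    sims = features @ features.T
    np.fill_diagonal(sims, -np.inf)                     # exclude self-matches
    first_nn = sims.argmax(axis=1)                      # each point's nearest neighbor
    n = features.shape[0]
    adj = coo_matrix((np.ones(n), (np.arange(n), first_nn)), shape=(n, n))
    adj = adj + adj.T                                   # undirected first-neighbor graph
    _, labels = connected_components(adj, directed=False)
    return labels                                       # (N,) cluster ids for this round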

A bigger picture of how the segments relate to each other from t-SNE [48] can be found in the supplementary.

5 Conclusion

We proposed an end-to-end pixel-wise metric learning approach motivated by perceptual organization. We integrated the two essential components, pixel-level and segment-level sorting, in a unified framework derived from von Mises-Fisher clustering. We demonstrated that the proposed approach consistently improves over conventional pixel-wise prediction approaches for supervised semantic segmentation. We also presented the first attempt at deep-learning-based unsupervised semantic segmentation. Intriguingly, the predictions produced by our approach, correct or not, can be inherently explained by the retrieved nearest segments.

Acknowledgements. This research was supported, in part, by Berkeley Deep Drive, NSF (IIS-1651389), DARPA.

References

  • [1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk, et al. (2012) SLIC superpixels compared to state-of-the-art superpixel methods. PAMI. Cited by: §2, §4.3.
  • [2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik (2011) Contour detection and hierarchical image segmentation. PAMI. Cited by: §1, §4.2, §4.3, §6.4.
  • [3] V. Badrinarayanan, A. Kendall, and R. Cipolla (2015) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561. Cited by: §2.
  • [4] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra (2005) Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research. Cited by: §1, Figure 2, §2, §3.1, §3.1.
  • [5] S. Belongie, C. Carson, H. Greenspan, and J. Malik (1998) Color-and texture-based image segmentation using em and its application to content-based image retrieval. In ICCV, Cited by: §2.
  • [6] I. Biederman (1987) Recognition-by-components: a theory of human image understanding.. Psychological review 94 (2), pp. 115. Cited by: §1.
  • [7] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a "siamese" time delay neural network. In NIPS, Cited by: §2.
  • [8] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2016) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915. Cited by: §2, §4.1.
  • [9] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587. Cited by: §2.
  • [10] L. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille (2016) Attention to scale: scale-aware semantic image segmentation. In CVPR, Cited by: §2.
  • [11] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, Cited by: §2, §3, §4.1.
  • [12] F. R. Chung and F. C. Graham (1997) Spectral graph theory. American Mathematical Soc.. Cited by: §2.
  • [13] D. Comaniciu and P. Meer (2002) Mean shift: a robust approach toward feature space analysis. PAMI. Cited by: §2.
  • [14] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: item 5, §4.1, §4.1, §4.2.
  • [15] D. Eigen and R. Fergus (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, Cited by: §2.
  • [16] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Note: http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html Cited by: item 5, §4.1, §4.1, §4.2.
  • [17] C. Farabet, C. Couprie, L. Najman, and Y. LeCun (2013) Learning hierarchical features for scene labeling. PAMI. Cited by: §2.
  • [18] P. F. Felzenszwalb and D. P. Huttenlocher (2004) Efficient graph-based image segmentation. IJCV. Cited by: §2.
  • [19] J. Fu, J. Liu, Y. Wang, and H. Lu (2017) Stacked deconvolutional network for semantic segmentation. arXiv:1708.04943. Cited by: §2.
  • [20] R. Gadde, V. Jampani, M. Kiefel, D. Kappler, and P. V. Gehler (2016) Superpixel convolutional networks using bilateral inceptions. In ECCV, Cited by: §2.
  • [21] A. Giusti, D. Ciresan, J. Masci, L. M. Gambardella, and J. Schmidhuber (2013) Fast image scanning with deep max-pooling convolutional neural networks. In ICIP, Cited by: §2.
  • [22] J. Goldberger, G. E. Hinton, S. T. Roweis, and R. R. Salakhutdinov (2005) Neighbourhood components analysis. In NIPS, Cited by: §1, §2, §3.2.
  • [23] S. Gould, R. Fulton, and D. Koller (2009) Decomposing a scene into geometric and semantically consistent regions. In ICCV, Cited by: §2.
  • [24] K. Grauman and T. Darrell (2005) The pyramid match kernel: discriminative classification with sets of image features. In ICCV, Cited by: §2.
  • [25] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik (2011) Semantic contours from inverse detectors. In ICCV, Cited by: §4.1.
  • [26] J. A. Hartigan and M. A. Wong (1979) Algorithm as 136: a k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics). Cited by: §2.
  • [27] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.1.
  • [28] X. He, R. S. Zemel, and M. Carreira-Perpindn (2004) Multiscale conditional random fields for image labeling. In CVPR, Cited by: §2.
  • [29] E. Hoffer and N. Ailon (2015) Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, Cited by: §2.
  • [30] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian (1989) A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space, Cited by: §2.
  • [31] J. Hwang, T. Ke, J. Shi, and S. X. Yu (2019) Adversarial structure matching for structured prediction tasks. Cited by: §2, Table 1.
  • [32] J. Hwang and T. Liu (2015) Pixel-wise deep learning for contour detection. arXiv preprint arXiv:1504.01989. Cited by: §1.
  • [33] P. Isola and C. Liu (2013) Scene collaging: analysis and synthesis of natural images with semantic layers. In ICCV, Cited by: §2.
  • [34] T. Ke, J. Hwang, Z. Liu, and S. X. Yu (2018) Adaptive affinity fields for semantic segmentation. In ECCV, Cited by: §2, §4.2, §6.4.
  • [35] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof (2012) Large scale metric learning from equivalence constraints. In CVPR, Cited by: §2.
  • [36] P. Kohli, P. H. Torr, et al. (2009) Robust higher order potentials for enforcing label consistency. IJCV 82 (3), pp. 302–324. Cited by: §2.
  • [37] S. Kong and C. Fowlkes (2018) Recurrent pixel embedding for instance grouping. In CVPR, Cited by: §2.
  • [38] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NIPS, Cited by: §1, §4.1.
  • [39] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr (2009) Associative hierarchical crfs for object class image segmentation. In ICCV, Cited by: §2.
  • [40] S. Lazebnik, C. Schmid, and J. Ponce (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In CVPR, Cited by: §2.
  • [41] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel (1989) Backpropagation applied to handwritten zip code recognition. Neural Computation. Cited by: §1, §2.
  • [42] G. Lin, A. Milan, C. Shen, and I. Reid (2017) RefineNet: multi-path refinement networks with identity mappings for high-resolution semantic segmentation. In CVPR, Cited by: §2.
  • [43] G. Lin, C. Shen, A. van den Hengel, and I. Reid (2016) Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, Cited by: §2.
  • [44] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, Cited by: §4.1.
  • [45] C. Liu, J. Yuen, and A. Torralba (2011) Nonparametric scene parsing via label transfer. PAMI. Cited by: §2.
  • [46] W. Liu, A. Rabinovich, and A. C. Berg (2015) ParseNet: looking wider to see better. arXiv:1506.04579. Cited by: §2.
  • [47] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §1, §2, §3, §4.1.
  • [48] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of Machine Learning Research. Cited by: §4.3, §6.2.
  • [49] J. Malik, S. Belongie, T. Leung, and J. Shi (2001) Contour and texture analysis for image segmentation. IJCV. Cited by: §2, Figure 6.
  • [50] T. Malisiewicz and A. A. Efros (2008) Recognition by association via learning per-exemplar distances. In CVPR, pp. 1–8. Cited by: §2.
  • [51] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich (2015) Feedforward semantic segmentation with zoom-out features. In CVPR, Cited by: §2.
  • [52] R. M. Neal and G. E. Hinton (1998) A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, Cited by: §3.1.
  • [53] G. Papandreou, I. Kokkinos, and P. Savalle (2015) Modeling local and global deformations in deep learning: epitomic convolution, multiple instance learning, and sliding window detection. In CVPR, Cited by: §2.
  • [54] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun (2017) Large kernel matters–improve semantic segmentation by global convolutional network. In CVPR, Cited by: §2.
  • [55] P. Pinheiro and R. Collobert (2014) Recurrent convolutional neural networks for scene labeling. In ICML, Cited by: §2.
  • [56] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, Cited by: §2.
  • [57] B. Russell, A. Efros, J. Sivic, B. Freeman, and A. Zisserman (2009) Segmenting scenes by matching image composites. In NIPS, Cited by: §2.
  • [58] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, Cited by: §4.1.
  • [59] M. S. Sarfraz, V. Sharma, and R. Stiefelhagen (2019) Efficient parameter-free clustering using first neighbor relations. In CVPR, Cited by: Figure 8, §4.3.
  • [60] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In CVPR, Cited by: §2.
  • [61] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun (2014) Overfeat: integrated recognition, localization and detection using convolutional networks. In ICLR, Cited by: §2.
  • [62] J. Shi and J. Malik (2000) Normalized cuts and image segmentation. PAMI. Cited by: §2.
  • [63] J. Shotton, J. Winn, C. Rother, and A. Criminisi (2009) Textonboost for image understanding: multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV. Cited by: §2.
  • [64] S. X. Yu and J. Shi (2003) Multiclass spectral clustering. In ICCV, Cited by: §2.
  • [65] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2014) Deepface: closing the gap to human-level performance in face verification. In CVPR, Cited by: §2.
  • [66] J. M. Tenenbaum and A. Witkin (1983) On the role of structure in vision. Human and machine vision, pp. 481–543. Cited by: §1.
  • [67] J. Tighe and S. Lazebnik (2010) Superparsing: scalable nonparametric image parsing with superpixels. In ECCV, Cited by: §2.
  • [68] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille (2017) Normface: l 2 hypersphere embedding for face verification. In Proceedings of the 2017 ACM on Multimedia Conference, Cited by: §2.
  • [69] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu (2018) CosFace: large margin cosine loss for deep face recognition. In CVPR, Cited by: §2.
  • [70] C. Wu, R. Manmatha, A. J. Smola, and P. Krähenbühl (2017) Sampling matters in deep embedding learning. In ICCV, Cited by: §2.
  • [71] Z. Wu, A. A. Efros, and S. X. Yu (2018) Improving generalization via scalable neighborhood component analysis. In ECCV, Cited by: §1, §2.
  • [72] Z. Wu, Y. Xiong, S. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance-level discrimination. In CVPR, Cited by: §2.
  • [73] S. Xie and Z. Tu (2015) Holistically-nested edge detection. In ICCV, Cited by: §1, Figure 6, §4.3, §4.3, Table 3.
  • [74] A. Y. Yang, J. Wright, Y. Ma, and S. S. Sastry (2008) Unsupervised segmentation of natural images via lossy data compression. Computer Vision and Image Understanding. Cited by: §2.
  • [75] T. Yang, M. D. Collins, Y. Zhu, J. Hwang, T. Liu, X. Zhang, V. Sze, G. Papandreou, and L. Chen (2019) Deeperlab: single-shot image parser. arXiv preprint arXiv:1902.05093. Cited by: §2.
  • [76] J. Yao, S. Fidler, and R. Urtasun (2012) Describing the scene as a whole: joint object detection, scene classification and semantic segmentation. In CVPR, Cited by: §2.
  • [77] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) Learning a discriminative feature network for semantic segmentation. In CVPR, Cited by: §2.
  • [78] S. X. Yu and J. Shi (2004) Segmentation given partial grouping constraints. PAMI. Cited by: §2.
  • [79] Z. Zhang, X. Zhang, C. Peng, D. Cheng, and J. Sun (2018) ExFuse: enhancing feature fusion for semantic segmentation. In ECCV, Cited by: §2.
  • [80] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR, Cited by: §2, §3, §4.1, §4.1.

6 Supplementary

6.1 Algorithm

We summarize the overall algorithm, trained with supervision, in Algorithm 1. The unsupervised algorithm instead uses the oversegmentation from the contour detector for pixel sorting and Equation 8 for loss calculation, which encourages the pixel embeddings in each segment to form an isolated cluster.

for number of training iterations do
       Sample a minibatch of images and their segmentation masks.
       Extract the deep pixel embeddings $\mathbf{v}_i$ for each image. Compute $\hat{\mathbf{v}}_i$ by concatenating $\mathbf{v}_i$ with coordinate features.
       Initialize $z$ (segment IDs) by uniformly partitioning each image into $K$ regions. /* K-Means clustering. */
       for number of K-Means iterations do
             /* The M step. */
             for each region $R_k$ do
                   $\boldsymbol{\mu}_k \leftarrow \sum_{i \in R_k} \hat{\mathbf{v}}_i \,/\, \| \sum_{i \in R_k} \hat{\mathbf{v}}_i \|$
             end for
             /* The E step. */
             $z_i \leftarrow \arg\max_k \boldsymbol{\mu}_k^\top \hat{\mathbf{v}}_i$ for every pixel $i$.
       end for
       Partition a segment if it contains multiple labels. /* Compute prototypes. */
       for each region $R_k$ do
             $\boldsymbol{\mu}_k \leftarrow \sum_{i \in R_k} \mathbf{v}_i \,/\, \| \sum_{i \in R_k} \mathbf{v}_i \|$
       end for
       /* Calculate vMF-N loss. */
       Calculate the loss using Equation 10 and back-propagate the errors.
end for
Algorithm 1 Supervised SegSort algorithm.

6.2 t-SNE Embedding Visualization

We visualize the prototype embeddings of the training set using t-SNE [48]. We display the results from supervised and unsupervised SegSort in Figures 9 and 10, respectively. This is done by randomly sampling prototypes from the training set, reducing their dimension to 2 with t-SNE [48], and placing the corresponding image patches on the 2D canvas wherever possible.
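A short sketch of this layout procedure follows; the sample size, t-SNE settings, and normalization to the unit square are illustrative assumptions.

import numpy as np
from sklearn.manifold import TSNE


def tsne_layout(prototypes, num_samples=2000, seed=0):
    """Sample prototypes and project them to normalized 2-D coordinates for the canvas."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(prototypes), size=min(num_samples, len(prototypes)), replace=False)
    xy = TSNE(n_components=2, init='pca', random_state=seed).fit_transform(prototypes[idx])
    xy = (xy - xy.min(axis=0)) / (xy.max(axis=0) - xy.min(axis=0))  # map into the unit square
    return idx, xy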

In the visualization of supervised SegSort, we observe that most background patches form a large cluster in the center with some small visual sub-clusters. Each stretching arm represents one foreground class, with gradual appearance changes from boundaries to object centers. For example, cars and trucks lie on the rightmost islands while horses, cows, and sheep lie on the leftmost.

In the visualization of unsupervised SegSort, we observe that clusters are formed more by visual similarity; the cues for clustering are usually color and texture. For example, wheels are clustered on the rightmost island while animals are at the top, and grass and sky are placed at the bottom.

Figure 9: t-SNE visualization of prototype embeddings from supervised SegSort, framed with category color. Best viewed with zoom-in.
Figure 10: t-SNE visualization of prototype embeddings from unsupervised SegSort, framed with category color. Best viewed with zoom-in.

6.3 Study on Inference Latency and Memory

We analyze SegSort's inference latency and memory compared to the Softmax baseline, as they are of practical concern. We conclude that the runtime overhead (7–22%) is manageable and the memory overhead (1.5%) is negligible.

We conduct experiments with various numbers of K-Means iterations and nearest neighbors to learn how they influence the inference performance, summarized in Table 4. All experiments (PSPNet inference at a single scale) are done using the same GTX 1080 Ti GPU. The overall GPU memory usage overhead is only 1.5% as no extra parameters are introduced, and the runtime overhead ranges from 7% to 22%. We also notice that most of the runtime overhead is due to K-Means rather than kNN (with 36K prototypes), both of which are computed on the GPU. With 4 K-Means iterations and 11-NN, our method (about 11% runtime overhead) already improves mIoU by more than 1.5%. We believe this latency/accuracy trade-off is reasonable, particularly given benefits such as interpretability.

 

Method k-means % time kNN % time Overall time memory mIoU
Softmax - - 170.34 4504 76.96
2 iter 1-NN 11.43 4.78 182.19 4569 77.18
4 iter 1-NN 13.97 4.60 188.61 4573 78.05
4 iter 11-NN 13.72 4.67 189.09 4574 78.50
10 iter 11-NN 20.04 4.38 207.51 4577 78.69

 

Table 4: Ablation study on runtime (ms) and GPU memory (MiB).

6.4 Boundary Evaluation

We explain how we conduct boundary evaluation on semantic segmentation following [2, 34]. We first compute semantic boundaries per category for the semantic predictions and the ground truth. We then match boundary pixels between predictions and ground truth within a maximum distance set to a small fraction of the image diagonal length. The per-category results are summarized by precision, recall, and f-measure in Table 5 and Table 6 on the VOC and Cityscapes datasets.
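A simplified sketch of this matching procedure is given below. Predicted boundary pixels count as correct if a ground-truth boundary pixel of the same class lies within the distance tolerance; the default tolerance value and the boundary extraction are illustrative assumptions, not the paper's exact protocol.

import numpy as np
from scipy.ndimage import distance_transform_edt


def class_boundary(mask):
    """Binary boundary map of a binary mask via 4-neighborhood label changes."""
    b = np.zeros_like(mask, dtype=bool)
    b[:-1, :] |= mask[:-1, :] != mask[1:, :]
    b[:, :-1] |= mask[:, :-1] != mask[:, 1:]
    return b


def boundary_pr(pred, gt, class_id, tol_frac=0.0075):
    """Precision / recall of one class's boundaries within a pixel tolerance."""
    h, w = pred.shape
    tol = tol_frac * np.hypot(h, w)        # tolerance as a fraction of the image diagonal
    pb, gb = class_boundary(pred == class_id), class_boundary(gt == class_id)
    if pb.sum() == 0 or gb.sum() == 0:
        return 0.0, 0.0
    # Distance from every pixel to the nearest boundary pixel of the other map.
    dist_to_gt = distance_transform_edt(~gb)
    dist_to_pred = distance_transform_edt(~pb)
    precision = (dist_to_gt[pb] <= tol).mean()
    recall = (dist_to_pred[gb] <= tol).mean()
    return precision, recall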

 

Base / Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mean
Deeplabv3 / Softmax 76.88 69.30 70.41 52.87 44.73 61.37 62.44 65.08 38.14 68.08 20.58 54.64 63.41 60.84 62.48 49.75 68.12 30.40 51.05 44.53 56.02
Deeplabv3 / SegSort 81.70 72.53 75.13 62.07 69.34 72.42 66.67 76.10 45.23 70.91 40.72 69.55 65.09 72.75 74.24 53.70 80.07 43.70 70.86 61.98 66.61
PSPNet / Softmax 91.10 75.72 90.72 68.00 68.19 82.00 73.49 81.58 54.96 85.38 40.57 76.17 84.00 77.07 76.55 65.17 87.65 44.63 73.53 61.95 72.97
PSPNet / SegSort 86.46 73.81 84.86 68.65 74.18 81.51 74.48 81.43 57.92 83.24 52.58 75.71 81.29 74.64 79.17 63.95 86.80 43.57 68.01 63.45 73.05
Deeplabv3 / Softmax 64.03 41.80 60.19 36.93 49.53 57.85 52.31 60.25 26.10 54.54 21.65 59.90 55.82 47.89 49.37 21.90 46.70 38.14 53.02 41.34 47.52
Deeplabv3 / SegSort 70.25 47.19 61.98 42.03 58.34 61.81 54.95 64.93 33.87 54.85 30.43 63.99 59.40 54.41 56.76 26.61 48.35 46.85 54.70 57.59 53.20
PSPNet / Softmax 68.07 37.66 63.66 38.90 59.55 62.05 55.28 64.92 29.48 55.87 28.16 65.18 58.09 49.35 53.69 22.75 52.09 41.14 55.74 51.78 51.19
PSPNet / SegSort 70.97 38.66 70.07 46.51 67.99 65.11 61.48 68.46 40.73 61.71 39.15 69.96 63.22 54.04 59.59 30.64 53.69 52.46 58.04 61.14 57.29
Deeplabv3 / Softmax 69.87 52.14 64.90 43.48 47.01 59.56 56.93 62.57 30.99 60.56 21.10 57.15 59.37 53.60 55.16 30.41 55.41 33.83 52.01 42.88 50.90
Deeplabv3 / SegSort 75.55 57.18 67.92 50.12 63.37 66.70 60.25 70.07 38.74 61.86 34.83 66.65 62.11 62.26 64.33 35.59 60.29 45.22 61.74 59.71 58.83
PSPNet / Softmax 77.92 50.30 74.82 49.49 63.58 70.64 63.10 72.30 38.38 67.55 33.24 70.25 68.68 60.17 63.11 33.73 65.34 42.82 63.41 56.41 59.64
PSPNet / SegSort 77.95 50.75 76.76 55.45 70.95 72.39 67.36 74.38 47.83 70.88 44.88 72.73 71.13 62.69 68.00 41.43 66.34 47.61 62.63 62.27 63.71

 

Table 5: Per-class boundary evaluation on Pascal VOC 2012 validation set. From top to bottom: precision, recall, and f-measure, separated by double lines.

 

Method road swalk build. wall fence pole tlight tsign veg. terrain sky person rider car truck bus train mbike bike mean
Softmax 85.68 77.34 85.60 55.33 49.44 90.34 82.71 77.77 92.95 66.23 95.59 87.50 89.12 92.73 54.55 76.99 72.29 67.87 85.35 78.18
SegSort 89.14 78.81 88.79 55.18 55.45 92.53 87.70 82.32 94.25 70.51 96.07 90.73 87.89 95.61 55.19 77.90 64.79 62.47 86.00 79.54
Softmax 45.81 77.13 58.38 47.94 53.65 58.65 64.86 68.67 65.04 58.90 58.66 73.31 63.08 81.21 57.69 73.35 56.95 54.96 69.71 62.52
SegSort 46.08 79.05 60.75 49.84 57.68 63.37 74.57 75.18 66.28 61.17 60.36 76.53 71.78 83.21 64.58 78.67 65.28 61.61 73.48 66.82
Softmax 59.70 77.24 69.41 51.37 51.46 71.13 72.71 72.94 76.53 62.35 72.71 79.78 73.87 86.59 56.08 75.13 63.71 60.73 76.74 68.96
SegSort 60.76 78.93 72.14 52.37 56.54 75.22 80.60 78.59 77.83 65.51 74.14 83.03 79.02 88.98 59.52 78.29 65.04 62.04 79.25 71.99

 

Table 6: Per-class boundary evaluation on Cityscapes validation set with PSPNet architecture. From top to bottom: precision, recall, and f-measure, separated by double lines.

6.5 Ablation Study

We conduct ablation experiments to understand how different components affect the performance of supervised SegSort. The hyper-parameters of our main experiments were chosen based on what we learned from these ablation studies.

Number of clusters: We study how the number of clusters affects the semantic segmentation performance (Figure 11). We train and test the DeepLabv3+ / MNV2 network with various numbers of clusters in the vMF clustering. The highest performance is achieved with a number of clusters slightly larger than the number of categories in the dataset.

Figure 11: We show how the number of clusters affects the segmentation performance. The highest performance is achieved with a number of clusters slightly larger than the number of categories in the dataset.

Dimension of embeddings: We study how the dimension of the embeddings affects the semantic segmentation performance (Figure 12). We train and test the DeepLabv3+ / MNV2 network with various embedding dimensions. We conclude that as long as the embedding dimension is sufficient, e.g., larger than 8, the performance does not change drastically.

Figure 12: We study how the dimension of the embeddings affects the segmentation performance. We conclude that as long as the embedding dimension is sufficient, e.g., larger than 8, the performance does not change drastically.

Number of nearest neighbors: We study how the number of nearest neighbors during inference affects the segmentation performance (Figure 13). We train the PSPNet / ResNet-101 network as described in the main paper and test it with a range of (odd) numbers of nearest neighbors. We conclude that the segmentation performance is robust to the number of nearest neighbors, as the mIoU spans only a small range.

Figure 13: We study how the number of nearest neighbors during inference affects the segmentation performance. We conclude that the segmentation performance is robust to the number of nearest neighbors, as the mIoU spans only a small range.

6.6 Visualization and Test Results on Cityscapes

We present the visual comparison in Figure 14. We observe that large objects, such as 'bus' and 'truck', are improved thanks to more consistent region predictions, while small objects, such as 'pole' and 'tlight', are also better captured.

We also include the per-category segmentation performance on the Cityscapes test set in Table 7. We observe similar performance trends as on the validation set and conclude that the network trained with SegSort outperforms Softmax consistently.

 

Method road swalk build. wall fence pole tlight tsign veg. terrain sky person rider car truck bus train mbike bike mIoU
Softmax 98.33 84.21 92.14 49.67 55.81 57.62 69.01 74.17 92.70 70.86 95.08 84.21 66.58 95.28 73.52 80.59 70.54 65.54 73.73 76.30
SegSort 98.56 85.84 92.85 52.18 58.44 63.62 73.29 77.93 93.41 72.37 95.10 84.97 67.98 95.41 67.43 80.80 60.69 68.68 74.34 77.05

 

Table 7: Per-class results on the Cityscapes test set. The network trained with SegSort outperforms Softmax consistently.

7 Per-category results on VOC

We present the per-category results in Table 8 for interested readers. We notice that SegSort with DeepLabv3+ / MNV2 captures fine structures better, such as 'bike' and 'mbike', while SegSort with PSPNet / ResNet-101 is stronger at detecting small objects, such as 'boat' and 'plant'.

 

Base / Backbone / Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mIoU
DeepLabv3+ / MNV2 / Softmax 85.02 55.18 80.92 65.87 70.60 89.55 83.39 88.27 35.04 80.30 48.24 79.20 82.13 81.16 81.21 52.59 75.24 47.20 80.20 67.92 72.51
DeepLabv3+ / MNV2 / SegSort 84.80 58.54 81.08 68.92 79.15 89.75 85.24 89.64 34.88 74.60 58.62 84.34 79.07 84.94 85.92 54.65 76.76 50.74 82.95 74.57 74.94
PSPNet / ResNet-101 / Softmax 92.56 66.70 91.10 76.52 80.88 94.43 88.49 93.14 38.87 89.33 62.77 86.44 89.72 88.36 87.48 56.95 91.77 46.23 88.59 77.14 80.12
PSPNet / ResNet-101 / SegSort 92.23 52.68 91.29 80.33 83.92 95.13 90.33 95.44 44.68 90.84 67.37 91.29 91.09 89.66 88.98 67.54 88.06 53.04 87.79 79.97 81.77
DeepLabv3+ / MNV2 / Softmax 85.89 59.20 79.09 61.24 66.47 87.87 85.17 88.80 28.27 78.98 60.67 80.35 83.72 83.90 83.52 59.87 83.43 50.22 74.07 63.91 73.25
DeepLabv3+ / MNV2 / SegSort 79.49 66.32 75.38 66.17 70.71 91.51 84.82 85.54 38.69 74.91 68.99 78.17 80.49 85.08 85.63 60.92 86.47 57.96 73.26 67.39 74.88
PSPNet / ResNet-101 / Softmax 94.01 68.08 88.80 64.87 75.87 95.60 89.59 93.15 37.96 88.20 72.58 89.96 93.30 87.52 86.65 61.90 87.05 60.81 87.13 74.65 80.63
PSPNet / ResNet-101 / SegSort 96.00 67.17 93.37 74.52 77.77 95.07 89.39 93.91 41.31 87.85 73.66 90.15 91.06 85.63 87.86 71.81 90.28 65.99 86.75 75.53 82.41

 

Table 8: Per-class results on Pascal VOC 2012. The first 4 rows with gray colored background are on validation set while the last 4 rows are on testing set. Networks trained with SegSort consistently outperform their parametric counterpart (Softmax) by 1.63 to 2.43%.

7.1 DeepLabv3+ / ResNet-101 Results on VOC

We also train our supervised SegSort using DeepLabv3+ with the ResNet-101 backbone, with exactly the same hyper-parameters as the MNV2 backbone. We include the per-category segmentation performance in Table 9. Even though the hyper-parameters might not be optimal, we observe consistent performance improvements over the baseline Softmax method.

 

Base / Backbone / Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mIoU
Deeplabv3 / ResNet-101 / Softmax 90.93 56.70 89.46 73.35 82.13 95.03 87.30 91.88 37.79 83.56 56.32 88.31 83.32 86.11 86.61 58.17 87.65 52.87 88.43 74.19 78.24
Deeplabv3 / ResNet-101 / SegSort 88.78 51.17 88.12 70.45 83.89 95.12 88.74 94.34 43.12 86.24 59.07 88.86 88.11 86.92 87.58 56.91 85.46 55.32 89.01 73.77 78.85

 

Table 9: Per-class results on Pascal VOC 2012 validation set, using Deeplabv3 with Resnet-101 backbone.
Figure 14: Visual comparison on Cityscapes validation set. We observe large objects, such as ‘bus’ and ‘truck’, are improved thanks to more consistent region predictions while small objects, such as ‘pole’ and ‘tlight’, are also better captured.