SegSort: Segmentation by Discriminative Sorting of Segments
Almost all existing deep learning approaches for semantic segmentation tackle this task as a pixel-wise classification problem. Yet humans understand a scene not in terms of pixels, but by decomposing it into perceptual groups and structures that are the basic building blocks of recognition. This motivates us to propose an end-to-end pixel-wise metric learning approach that mimics this process. In our approach, the optimal visual representation determines the right segmentation within individual images and associates segments with the same semantic classes across images. The core visual learning problem is therefore to maximize the similarity within segments and minimize the similarity between segments. Given a model trained this way, inference is performed consistently by extracting pixel-wise embeddings and clustering, with the semantic label determined by the majority vote of its nearest neighbors from an annotated set. As a result, we present SegSort as a first attempt at using deep learning for unsupervised semantic segmentation, achieving 76% of the performance of its supervised counterpart. When supervision is available, SegSort shows consistent improvements over conventional approaches based on pixel-wise softmax training. Additionally, our approach produces more precise boundaries and consistent region predictions. The proposed SegSort further produces an interpretable result, as each choice of label can be easily understood from the retrieved nearest segments.
Semantic segmentation is usually approached by extending image-wise classification [41, 38] to pixel-wise classification, deployed in a fully convolutional fashion. In contrast, we study the semantic segmentation task in terms of perceiving an image in groups of pixels and associating objects across a large set of images. In particular, we take the perceptual organization view [66, 6] that pixels group by visual similarity and objects form by visual familiarity; consequently, a representation is developed to best relate pixels and segments to each other in the visual world. Thus motivated, our method not only achieves better supervised semantic segmentation but also presents the first attempt at using deep learning for unsupervised semantic segmentation.
We formulate this intuition as an end-to-end metric learning problem. Each pixel in an image is mapped via a CNN to a point in some visual embedding space, and nearby points in that space indicate pixels belonging to the same segments. From all the segments collected across images, clusters in the embedding space form semantic concepts. In other words, we sort segments
with respect to their visual and semantic attributes. The optimal visual representation delivers the right segmentation within individual images and associates segments with the same semantic classes across images, yielding a non-parametric model, as its complexity scales with the number of segments (exemplars).
We derive our method based on maximum likelihood estimation of a single equation, resulting in a two-stage Expectation-Maximization (EM) framework. The first stage performs a spherical (von Mises-Fisher) K-Means clustering for image segmentation. The second stage adapts the E-step for a pixel-to-segment loss to optimize the metric learning CNN.
As a result, we present SegSort (Segment Sorting) as a first attempt to apply deep learning to semantic segmentation from the unsupervised perspective. Specifically, we create pseudo segmentation masks aligned with visual cues using a contour detector [2, 32, 73] and train the pixel-wise embedding network to separate all the segments. The unsupervised SegSort achieves 76% of the performance of its supervised counterpart. We further show that various visual groups are automatically discovered in our framework.
When supervision is available (i.e., supervised semantic segmentation), we segment each image with spherical K-Means clustering and train the network following the same optimization, but incorporating the Neighborhood Components Analysis criterion [22, 71] for semantic labels.
To summarize our major contributions:
We present the first end-to-end trained non-parametric approach for supervised semantic segmentation, with performance exceeding its parametric counterparts that are trained with pixel-wise softmax loss.
We propose the first unsupervised deep learning approach for semantic segmentation, which achieves 76% of the performance of its supervised counterpart.
Our segmentation results can be easily understood from the retrieved nearest segments and are thus readily interpretable.
Our approach produces more precise boundaries and more consistent region segmentations compared with parametric pixel-wise prediction approaches.
Segmentation and Clustering. Segmentation involves extracting representations from local patches and clustering them based on different criteria, e.g., fitting mixture models [74, 5], mode-finding [13, 4], or graph partitioning [18, 62, 49, 64, 78]. The mode-finding algorithms, e.g., mean shift or K-Means [26, 4], are most closely related to ours. Traditionally, pixels are encoded in a joint spatial-range domain by a single vector with their spatial coordinates and visual features concatenated; applying mean shift or K-Means filtering can thus converge for each pixel. Spectral graph theory, and in particular the Normalized Cut criterion, provides a way to further integrate global image information for better segmentation. More recently, superpixel approaches have emerged as a popular pre-processing step that helps reduce computation, or can be used to refine semantic segmentation predictions. However, the challenge of perceptual organization is to process information from different levels together to form a consensus segmentation. Hence, our proposed approach aims to integrate image segmentation and clustering into end-to-end embedding learning for semantic segmentation.
Semantic Segmentation. Most deep learning approaches tackle the problem via pixel-wise classification. Given limited local context, it may be ambiguous to correctly classify a single pixel, and thus it is common to resort to multi-scale context information [28, 63, 36, 39, 23, 76, 51, 34, 31]. Typical approaches include image pyramids [17, 55, 15, 43, 10, 8] and encoder-decoder structures [3, 56, 42, 19, 54, 77, 79, 11]. Notably, to better capture multi-scale context, PSPNet performs spatial pyramid pooling [24, 40, 46] at several grid scales, while DeepLab [8, 9, 75] applies the ASPP module (Atrous Spatial Pyramid Pooling), consisting of several parallel atrous convolutions [30, 21, 61, 53] with different rates. In this work, we experiment with applying our proposed training algorithm to PSPNet and DeepLabv3+, and show consistent improvements.
Before the rise of deep learning, non-parametric methods for semantic segmentation were explored. In the unsupervised setting, one line of work proposes data-driven boundary and image grouping, formulated with an MRF to enhance semantic boundaries; another extracts superpixels before nearest neighbor search; a third computes dense SIFT correspondences to find dense deformation fields between images to segment and recognize a query image. With supervision, semantic object exemplars have been learned for detection and segmentation.
It is worth noting Kong and Fowlkes  also integrate vMF mean-shift clustering into the semantic segmentation pipeline. However, the clustering with contrastive loss is used for regularizing features and the whole system still relies on softmax loss to produce the final segmentation.
Our work also bears a similarity to Scene Collaging, which presents a nonparametric scene grammar for parsing images into segments whose object labels are retrieved from a large dataset of example images.
Metric Learning. Metric learning has been widely applied to open-ended recognition tasks such as face recognition [65, 69, 60]. Such tasks usually involve open-world recognition, since classes during testing might be disjoint from those in the training set. Metric learning minimizes intra-class variations and maximizes inter-class variations with pairwise losses, contrastive loss, and triplet loss. Recently, Wu et al. propose a non-parametric softmax formulation for training feature embeddings to separate every image for unsupervised image recognition and retrieval. The non-parametric softmax is further incorporated with Neighborhood Components Analysis to improve generalization for supervised image recognition. An important technical point in metric learning is normalization [68, 60], so that features lie on a hypersphere, which is why the vMF distribution is of particular interest.
Our end-to-end learning framework consists of three sequential components: 1) A CNN, e.g., DeepLab, FCN, or PSPNet, that generates pixel-wise embeddings from an image. 2) A clustering method that partitions the pixel-wise embeddings into a fine segmentation, dubbed pixel sorting. 3) A metric learning formulation for separating and grouping the segments into semantic clusters, dubbed segment sorting.
We start with an assumption that the pixel-wise normalized embeddings from the CNN within a segment follow a von Mises-Fisher (vMF) distribution. We thus formulate the pixel sorting with spherical K-Means clustering and the segment sorting with corresponding maximum likelihood estimation. During inference, the segment sorting is replaced with k-nearest neighbor search. We then apply to each query segment the majority label of retrieved segments.
We now give a high-level mathematical explanation of the entire optimization process. Let $\{v_i\}$ be the set of pixel embeddings, where $v_i$ is produced by a CNN (with parameters $\theta$) centered at pixel $i$. Let $z$ be the image segmentation with $K$ segments, where $z_i = k$ indicates that pixel $i$ belongs to segment $k$. Let $\mu$ be the set of parameters that capture the representative feature of each segment through a predefined distribution (a mixture of vMF distributions here). Our main optimization objective can be summarized as maximizing the data likelihood jointly over all three:
$$\max_{\theta,\, z,\, \mu}\; p(\{v_i\} \mid \theta, z, \mu).$$
In pixel sorting, we use a standard EM framework to find the optimal $z$ and $\mu$, with $\theta$ fixed. In segment sorting, we adapt the previous E-step for loss calculation through a set of images to optimize $\theta$, with $z$ and $\mu$ fixed. Performing pixel sorting and segment sorting can thus be viewed as a two-stage EM framework.
This section is organized as follows. We first describe the pixel sorting in Sec. 3.1, which includes a brief review of spherical K-Means clustering and creation of aligned segments. We then derive two forms of the segment sorting loss for segment sorting in Sec. 3.2. Finally, we describe the inference procedure in Sec. 3.3. The overall training diagram is illustrated in Fig. 2 and the summarized algorithm can be found in the supplementary.
We briefly review the vMF distribution and its corresponding spherical K-Means clustering algorithm , which is used to segment an image as pixel sorting.
We assume the pixel-wise $d$-dimensional embeddings $v$ (the CNN's last-layer features after $\ell_2$ normalization) within a segment follow a vMF distribution. vMF distributions are of particular interest as the vMF is one of the simplest distributions with properties analogous to those of the multivariate Gaussian for directional data. Its probability density function is given by
$$p(v \mid \mu, \kappa) = C_d(\kappa)\, e^{\kappa \mu^\top v}, \qquad C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)},$$
where $C_d(\kappa)$ is the normalizing constant and $I_\nu$ denotes the modified Bessel function of the first kind and order $\nu$. $\mu$ is the mean direction (with $\|\mu\| = 1$) and $\kappa \ge 0$ is the concentration parameter; larger $\kappa$ indicates stronger concentration about $\mu$. In our particular case, we assume a constant $\kappa$ for all vMF distributions to circumvent the expensive calculation of $C_d(\kappa)$.
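As a sanity check on the density above, the normalizer can be evaluated numerically. The sketch below (pure Python; the truncated Bessel series and its 50-term cutoff are our own choices, not from the paper) computes the exact vMF log-density for unit vectors:

```python
import math

def log_vmf_density(v, mu, kappa):
    """Exact vMF log-density for unit vectors v, mu (as lists of floats).
    Computes C_d(kappa) = kappa^{d/2-1} / ((2*pi)^{d/2} I_{d/2-1}(kappa)),
    approximating the modified Bessel function I_nu with a truncated series:
    I_nu(x) = sum_m (x/2)^{2m+nu} / (m! * Gamma(m+nu+1))."""
    d = len(v)
    nu = d / 2 - 1
    i_nu = sum((kappa / 2) ** (2 * m + nu) / (math.factorial(m) * math.gamma(m + nu + 1))
               for m in range(50))
    log_c = nu * math.log(kappa) - (d / 2) * math.log(2 * math.pi) - math.log(i_nu)
    return log_c + kappa * sum(a * b for a, b in zip(mu, v))
```

For $d = 3$ the normalizer has the closed form $\kappa / (4\pi \sinh \kappa)$, which the series reproduces to high precision.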
The embeddings $\{v_i\}$ of an image with $K$ segments can thus be considered as drawn from a mixture of $K$ vMF distributions with a uniform prior, or
$$p(v_i) = \sum_{k=1}^{K} \frac{1}{K}\, p(v_i \mid \mu_k, \kappa),$$
where $\mu_k$ is the mean direction of segment $k$. Let $z_i \in \{1, \dots, K\}$ be the hidden variable indicating that pixel embedding $v_i$ belongs to a particular segment, i.e., $z_i = k$. Let $V = \{v_i\}$ be the set of pixel embeddings and $Z = \{z_i\}$ the set of corresponding hidden variables. The log-likelihood of the observed data is thus given by
$$\log p(V) = \sum_i \log \sum_{k=1}^{K} \frac{1}{K}\, p(v_i \mid \mu_k, \kappa).$$
Since $Z$ is unknown, the EM framework is used to estimate this otherwise intractable maximum likelihood, resulting in the spherical K-Means algorithm under the assumption $\kappa \to \infty$. This assumption holds if all the embeddings within a region are the same (homogeneous), which will be our training objective described in Sec. 3.2.
The E-step that maximizes the likelihood assigns each pixel to a segment with posterior probability
$$p(z_i = k \mid v_i) = \frac{e^{\kappa \mu_k^\top v_i}}{\sum_{j=1}^{K} e^{\kappa \mu_j^\top v_i}}.$$
In the setting of K-Means, we use hard assignments to update $z_i$, i.e., $z_i = \arg\max_k \mu_k^\top v_i$. We further denote the set of pixels within segment $k$ as $S_k = \{i : z_i = k\}$. The M-step then updates each prototype as
$$\mu_k = \frac{\sum_{i \in S_k} v_i}{\big\|\sum_{i \in S_k} v_i\big\|},$$
which is the mean direction of pixel embeddings within segment $k$. The spherical K-Means clustering is thus done through alternating updates of $z$ (E-step) and $\mu$ (M-step).
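The alternating E- and M-steps can be sketched in a few lines of NumPy. This is a simplified illustration rather than the paper's implementation: the random initialization, the way pixel coordinates are concatenated (discussed later in this section), and all names here are our own assumptions.

```python
import numpy as np

def spherical_kmeans(v, K, n_iter=10, coord=None, seed=0):
    """Spherical K-Means with a fixed number of EM steps (pixel sorting sketch).
    v: (N, d) array of L2-normalized pixel embeddings.
    coord: optional (N, 2) pixel coordinates, concatenated so clustering
    is spatially guided (the exact weighting is our assumption).
    Returns hard assignments z (N,) and unit prototypes mu (K, d')."""
    if coord is not None:
        v = np.concatenate([v, coord], axis=1)
        v = v / np.linalg.norm(v, axis=1, keepdims=True)  # renormalize to unit length
    rng = np.random.default_rng(seed)
    mu = v[rng.choice(len(v), K, replace=False)]          # init prototypes from data
    for _ in range(n_iter):
        z = (v @ mu.T).argmax(axis=1)                     # E-step: hard assign by cosine sim
        for k in range(K):                                # M-step: mean direction per segment
            members = v[z == k]
            if len(members):
                s = members.sum(axis=0)
                mu[k] = s / (np.linalg.norm(s) + 1e-12)
    return z, mu
```

Because the number of EM steps is fixed (as the text below notes), memory consumption during training stays predictable.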
One problem with K-Means clustering is the dynamic number of EM steps, which would cause uncertain memory consumption during training. However, we find in practice that a small fixed number of EM steps can already produce good segmentations.
If we only use embedding features for K-Means clustering, each resulting cluster is often disconnected and scattered. As our goal is to spatially segment an image, we concatenate pixel coordinates with the embeddings so that the K-Means clustering is spatially guided.
Creating Aligned Segments. Segments that are aligned with different visual cues are critical for producing coherent boundaries. However, segments produced by K-Means clustering do not always conform to the ground truth boundaries. If one segment contains different semantic labels, it clearly contradicts our assumption of homogeneous embeddings within a segment. Therefore, we partition a segment given the ground truth mask in the supervised setting so that each segment contains exactly a single semantic label as illustrated in Fig. 3.
It is easy to see that the segments after partition are always aligned with semantic boundaries. Furthermore, this partition creates small segments of false positives and false negatives which can naturally serve as hard negative examples during loss calculation.
Following our assumption of homogeneous embeddings per segment, the training is therefore to enforce this criterion, which is done by optimizing the CNN parameters for better feature extraction.
We first define a prototype as the most representative embedding feature of a segment. Since the embeddings in a segment follow a vMF distribution, the mean direction $\mu_k$ in Eqn. 6 can naturally be used as the prototype. The posterior probability of a pixel embedding $v_i$ belonging to segment $k$ is then computed from the prototypes and a constant hyperparameter $\kappa$:
$$p(z_i = k \mid v_i) = \frac{e^{\kappa \mu_k^\top v_i}}{\sum_{j=1}^{K} e^{\kappa \mu_j^\top v_i}}.$$
As both the embedding and the prototype are of unit length, the dot product $\mu_k^\top v_i$ becomes the cosine similarity. The numerator contains the exponentiated cosine similarity between pixel embedding $v_i$ and a particular segment prototype $\mu_k$; the denominator sums the exponentiated cosine similarities w.r.t. all the segment prototypes. The value of $p(z_i = k \mid v_i)$ thus indicates how close pixel embedding $v_i$ is to segment $k$ compared to all the other segments.
The training objective is thus to maximize the posterior probability of a pixel embedding belonging to its corresponding segment obtained from the K-Means clustering. In other words, we want to minimize the following negative log-likelihood, or the vMF loss:
$$L_{\mathrm{vMF}}(v_i) = -\log p(z_i = k \mid v_i),$$
where $k$ is the segment containing pixel $i$. The total loss is the average over all pixels. Minimizing $L_{\mathrm{vMF}}$ has two effects: the numerator encourages each pixel embedding to be close to its own segment prototype, while the denominator encourages each embedding to be far away from all other segment prototypes.
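A minimal batched sketch of the vMF loss in NumPy (in training this would be an autodiff operation over CNN outputs; the default value of kappa here is an arbitrary placeholder, not the paper's setting):

```python
import numpy as np

def vmf_loss(v, mu, z, kappa=10.0):
    """Pixel-to-segment vMF loss sketch.
    v: (N, d) unit pixel embeddings; mu: (K, d) unit segment prototypes;
    z: (N,) segment index per pixel; kappa: fixed concentration (placeholder).
    Returns the mean negative log-posterior over pixels."""
    logits = kappa * (v @ mu.T)                    # (N, K) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(v)), z].mean()
```

The loss is small when each embedding aligns with its own prototype and far from the others, matching the two effects described above.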
Note that this vMF loss does not require any ground truth semantic labels. We can therefore use this loss to train the CNN in an unsupervised setting. As the loss pushes every segment as far away as possible, visually similar segments are forced to stay closer on the hypersphere.
To make use of ground truth semantic information, we consider soft neighborhood assignments as in Neighborhood Components Analysis. The idea of soft neighborhood assignments is to encourage the probability of one example selecting its neighbors (excluding itself) of the same category. In our case, we want to encourage the probability of a pixel embedding $v_i$ selecting any other segment of the same category as its neighbor. Let $C_i$ denote the set of segments with the same semantic label as pixel $i$'s segment, excluding the segment containing $i$ itself. Adapted from Eqn. 7, the probability of pixel $i$ selecting segment $k$ as its neighbor is
$$\hat{p}_{ik} = \frac{e^{\kappa \mu_k^\top v_i}}{\sum_{j \ne z_i} e^{\kappa \mu_j^\top v_i}}, \qquad k \ne z_i.$$
Our final loss function is therefore the negative log of the total probability of pixel $i$ selecting a neighbor prototype in the same category:
$$L_{\mathrm{vMF\text{-}N}}(v_i) = -\log \sum_{k \in C_i} \hat{p}_{ik}.$$
The total loss is the average over all pixels. Minimizing this loss maximizes the expected number of pixels correctly classified by associating the right neighbor prototypes. The ground truth labels are thus used for finding the set of same-class segments w.r.t. each pixel within a mini-batch (and memory banks). If there is no other segment of the same category, we fall back to the previous vMF loss. Since both the vMF and vMF-N losses serve the same purpose of grouping and separating segments by optimizing the CNN feature extraction, we dub them segment sorting losses.
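The NCA-style loss with its vMF fallback can be sketched as follows. This is our loose reading of the loss rather than the released implementation; the per-pixel loop is for clarity only, and kappa is again a placeholder:

```python
import numpy as np

def vmf_nca_loss(v, mu, z, labels, kappa=10.0):
    """Segment-sorting loss with soft neighborhood assignments (sketch).
    v: (N, d) unit pixel embeddings; mu: (K, d) unit prototypes;
    z: (N,) segment index per pixel; labels: (K,) semantic label per prototype.
    Each pixel's own prototype is excluded (NCA), and the probability mass on
    same-category prototypes is maximized; pixels with no same-class neighbor
    fall back to the plain vMF loss."""
    losses = []
    logits = kappa * (v @ mu.T)
    for i in range(len(v)):
        l = logits[i].copy()
        l[z[i]] = -np.inf                          # exclude the pixel's own segment
        same = (labels == labels[z[i]])
        same[z[i]] = False                         # same-category neighbors only
        p = np.exp(l - l[np.isfinite(l)].max())
        if same.any():
            losses.append(-np.log(p[same].sum() / p.sum()))
        else:                                      # no same-class neighbor: vMF fallback
            p2 = np.exp(logits[i] - logits[i].max())
            losses.append(-np.log(p2[z[i]] / p2.sum()))
    return float(np.mean(losses))
```

A pixel whose same-class neighbor prototype is nearby incurs a much smaller loss than one whose neighbors all belong to other categories.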
Understandably, an essential component of the segment sorting loss is the existence of semantic neighbor segments (in the numerator) and the abundance of alien segments (in the denominator). That is, the more examples presented at once, the better the optimization. We thus leverage two strategies: 1) We calculate the loss w.r.t. all the segments in the batch, as opposed to a traditional image-wise loss function. 2) We use additional memory banks that cache the segment prototypes from a fixed number of previous batches. These two strategies help the fragmented segments (produced by segment partition in Fig. 3) connect to other similar segments between different images, or even between different batches.
After training, we calculate and save all the segment prototypes in the training set. We calculate the prototypes using pixels with majority labels within the segments, ignoring other unresolved noisy pixels.
During inference, we again conduct the K-Means clustering and then perform k-nearest neighbor search for each segment to retrieve the labels from segments in the training set. The ablation study on inference runtime and memory can be found in the supplementary.
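The retrieval step can be sketched as cosine-similarity k-nearest-neighbor search with majority voting over the saved training prototypes (function and variable names here are ours):

```python
import numpy as np
from collections import Counter

def knn_segment_labels(query_protos, bank_protos, bank_labels, k=5):
    """Label each query segment prototype by the majority vote of its
    k nearest training prototypes. All prototypes are unit-length, so
    the dot product ranks neighbors by cosine similarity.
    query_protos: (Q, d); bank_protos: (M, d); bank_labels: (M,)."""
    sims = query_protos @ bank_protos.T              # (Q, M) cosine similarities
    nn = np.argsort(-sims, axis=1)[:, :k]            # top-k neighbor indices per query
    return [Counter(bank_labels[idx].tolist()).most_common(1)[0][0] for idx in nn]
```

In the unsupervised setting, the same routine applies with the bank built from a separate annotated set, as described in Sec. 4.3.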
Our overall framework is non-parametric. We use vMF clustering to organize pixel embeddings into segment exemplars, whose number is proportional to the number of images in the training set. The embeddings of exemplars are trained with a nearest neighbor criterion such that inference can be done consistently, resulting in a non-parametric model.
VOC 2012 validation set:

| Base / Backbone / Method | mIoU | f-measure |
| --- | --- | --- |
| DeepLabv3+ / MNV2 / Softmax | 72.51 | 50.90 |
| DeepLabv3+ / MNV2 / SegSort | 74.94 | 58.83 |
| PSPNet / ResNet-101 / Softmax | 80.12 | 59.64 |
| PSPNet / ResNet-101 / ASM | 81.43 | 62.35 |
| PSPNet / ResNet-101 / SegSort | 81.77 | 63.71 |

VOC 2012 test set:

| Base / Backbone / Method | mIoU |
| --- | --- |
| DeepLabv3+ / MNV2 / Softmax | 73.25 |
| DeepLabv3+ / MNV2 / SegSort | 74.88 |
| PSPNet / ResNet-101 / Softmax | 80.63 |
| PSPNet / ResNet-101 / SegSort | 82.41 |
In this section, we demonstrate the efficacy of our Segment Sorting (SegSort) through experiments and visual analyses. We first describe the experimental setup in Section 4.1. Then we summarize all the quantitative and qualitative results of fully supervised semantic segmentation in Section 4.2. Lastly, we present results of the proposed approach for unsupervised semantic segmentation in Section 4.3. Additional experiments including ablation studies, t-SNE embedding visualization, and qualitative results on Cityscapes can be found in the supplementary.
PASCAL VOC 2012 segmentation dataset contains 20 object categories and one background class. The original dataset contains 1,464 (train) / 1,449 (val) / 1,456 (test) images. Following the procedure of [47, 8, 80], we augment the training data with extra annotations, resulting in 10,582 (train_aug) images.
Cityscapes is a dataset for semantic urban street scene understanding. 5,000 high-quality, pixel-level, finely annotated images are divided into training, validation, and testing sets with 2,975 / 500 / 1,525 images, respectively. It defines 19 evaluation categories, grouped into flat, human, vehicle, construction, object, nature, etc.
We build on two base architectures, DeepLabv3+ with a MobileNetV2 backbone and PSPNet with a ResNet-101 backbone, respectively, both of which are pre-trained on ImageNet.
We follow closely the training procedures of the base architectures when training the baseline model with the standard pixel-wise softmax cross-entropy loss. The performance of the final model might be slightly worse than what is reported in the original papers, mainly for two reasons: 1) We do not pre-train on any other segmentation dataset, such as the MS COCO dataset. 2) We do not adopt any additional training tricks, such as balanced sampling or fine-tuning specific categories.
Hyper-parameters of SegSort. For all the experiments, we use a fixed set of hyper-parameters for training SegSort: the dimension of the embeddings, the number of K-Means clusters (set separately for VOC and Cityscapes), the number of EM steps in K-Means (likewise set per dataset), and the concentration constant $\kappa$. During inference, we use the same hyper-parameters for K-Means segmentation and a fixed number of nearest neighbors for predicting categories.
We use different learning rates and iteration counts for supervised training with SegSort. For VOC 2012, we first train the network on the train_aug set and then fine-tune with a lower initial learning rate on the train set. For Cityscapes, we train the network for the same 90k iterations as the softmax baseline.
Training on VOC 2012 requires more iterations than the baseline procedure, mainly because most images contain only a few categories while the network can only compare segments within a batch (plus the cached batches). We find that enlarging the batch size or increasing the memory banks may reduce the required training iterations. By comparison, images from Cityscapes contain ample categories, so the training iterations remain the same.
VOC 2012: We summarize the quantitative results of fully supervised semantic segmentation on Pascal VOC 2012  in Table 1, evaluated using mIoU and boundary evaluation following [2, 34] on both validation and testing set.
We conclude that networks trained with SegSort consistently outperform their parametric counterparts (Softmax), by 1.65 to 2.43 in mIoU and by 4.07 to 7.93 in mean boundary f-measure. (Per-class results can be found in the supplementary.) We notice that SegSort with DeepLabv3+ / MNV2 better captures fine structures, such as in 'bike' and 'mbike', while with PSPNet / ResNet-101 it improves more on detecting small objects, such as 'boat' and 'plant'.
We present the visual comparison in Fig. 4. We observe prominent improvements on thin structures, such as human legs and chair legs. Also, more consistent region predictions can be found where context is critical, such as the wheels of motorcycles and the large bodies of buses.
One of the most important features of SegSort is the self-explanatory predictions via nearest neighbor segment retrieval. We therefore demonstrate two examples, correct and incorrect predictions, in Fig. 5. As can be seen, the query segments (on the leftmost) of rider, horse, and horse outlines can retrieve corresponding semantically relevant segments in the training set. For the incorrect example, it can be inferred from the retrieved segments that the number tag on the front of bikes was confused by the other number tags on motorbikes, resulting in false predictions.
Cityscapes: We summarize the quantitative results of fully supervised semantic segmentation on Cityscapes  in Table 2, evaluated on the validation set. Due to limited space, visual results are included in the supplementary.
The network trained with SegSort outperforms Softmax consistently. Large objects, e.g., 'bus' and 'truck', are improved thanks to more consistent region predictions, while small objects, e.g., 'pole' and 'tlight', are better captured.
|unsup. on train_aug||sup. on train_aug||sup. on train||mIoU||f-measure|
We train the model using our framework without any ground truth labels at any level, pixel-wise or image-wise.
To adapt our approach for unsupervised semantic segmentation, what we need is a good criterion for segmenting an image along visual boundaries, which serves as a pseudo ground truth mask. There is an array of methods that meet this requirement, e.g., SLIC for superpixels or gPb-owt-ucm for hierarchical segmentation. We choose the HED contour detector pretrained on the BSDS500 dataset, and follow the procedure in gPb-owt-ucm to produce a hierarchical segmentation, dubbed HED-owt-ucm (Fig. 6).
We train the PSPNet / ResNet-101 network on the same augmented VOC 2012 training set as in the supervised setting, with the same initial learning rate but for fewer iterations. The hyper-parameters remain unchanged.
Note that the contour detector only provides visual boundaries without any concept of semantic segments, yet through our feature learning with segment sorting, our method discovers segments of common features – semantic segmentation without names.
For the sake of performance evaluation, we assume there is a separate annotated image set available during inference. For each segment under query, we assign a label by the majority vote of its nearest neighbors from that annotated set.
The unsupervised SegSort achieves 76% of the performance of its supervised counterpart. We also showcase one possible way to make use of the unsupervised learned embedding: the network fine-tuned from unsupervised pre-trained embeddings outperforms the one trained without such pre-training. Fig. 7 shows that the embeddings learned by unsupervised SegSort attend more to visual than to semantic similarities compared to the supervised setting, because the fine segmentation produced by the contour detector partitions the image into visually consistent segments. Hairs, faces, blue shirts, and wheels are retrieved successfully. The last query segment fails because the texture around the knee is more similar to animal skins.
Automatic Discovery of Visual Groups. We notice in the retrieval results that CNNs trained this way can discover visual groups, which raises the question of whether such visual structures actually form distinct clusters (or fine-grained categories).
We extract all foreground segments in the training set and perform FINCH, a nearest-neighbor-based hierarchical agglomerative clustering algorithm. FINCH merges two points if one is the nearest neighbor of the other (with undirected links), and this procedure can be performed recursively. We start with the segment prototypes and apply FINCH repeatedly, producing progressively coarser sets of clusters after each iteration. We visualize some segment groups at the finest level and at a coarser level of clusters in Fig. 8.
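One round of the FINCH first-neighbor rule can be sketched as follows (a simplified illustration under our reading, not the authors' code): link every prototype to its nearest neighbor, treat links as undirected, and merge connected components.

```python
import numpy as np

def finch_round(protos):
    """One round of the FINCH first-neighbor merge rule.
    protos: (N, d) unit prototypes. Each point is linked to its 1-NN under
    cosine similarity; undirected links are merged via union-find, and the
    resulting connected components become clusters. Returns a cluster id
    per point."""
    sims = protos @ protos.T
    np.fill_diagonal(sims, -np.inf)               # a point is not its own neighbor
    nn = sims.argmax(axis=1)                      # 1-NN of each point
    n = len(protos)
    parent = list(range(n))
    def find(x):                                  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in enumerate(nn):                    # union each point with its 1-NN
        parent[find(i)] = find(int(j))
    roots = [find(i) for i in range(n)]
    ids = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]
```

Recursing on the per-cluster mean directions would yield the progressively coarser partitions described above.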
A bigger picture of how the segments relate to each other from t-SNE  can be found in the supplementary.
We proposed an end-to-end pixel-wise metric learning approach that is motivated by perceptual organization.
We integrated the two essential components, pixel-level and segment-level sorting, in a unified framework, derived from von Mises-Fisher clustering.
We demonstrated the proposed approach consistently improves over the conventional pixel-wise prediction approaches for supervised semantic segmentation.
We also presented the first attempt for unsupervised semantic segmentation.
Intriguingly, the predictions produced by our approach, correct or not, can be inherently explained by the retrieved nearest segments.
Acknowledgements. This research was supported, in part, by Berkeley Deep Drive, NSF (IIS-1651389), DARPA.
We visualize the prototype embeddings in the training set using t-SNE. We display the results from supervised and unsupervised SegSort in Figures 9 and 10, respectively. This is done by randomly sampling prototypes from the training set, reducing their dimension to two using t-SNE, and placing the corresponding patches on the 2D canvas wherever possible.
For the visualization of supervised SegSort, we observe that most background patches form a large cluster in the center with some small visual clusters. Each stretching arm represents one foreground class, with gradual appearance changes from boundaries to object centers. For example, cars and trucks are on the rightmost islands, while horses, cows, and sheep are on the leftmost.
For the visualization of unsupervised SegSort, we observe that clusters are formed more by visual similarities; the cues for clustering are usually color and texture. For example, wheels are clustered on the rightmost island, while animals are on top. Grass and sky are placed at the bottom.
We analyze SegSort's inference latency and memory usage relative to the Softmax baseline, as they are of practical concern. We conclude that the runtime overhead is manageable and the memory overhead (1.5%) is negligible.
We conduct experiments with various k-means iterations and numbers of nearest neighbors to learn how they influence inference performance, summarized in Table 4. All experiments (PSPNet inference at a single scale) are done on the same GTX 1080 Ti GPU. The overall GPU memory overhead is only 1.5%, as no extra parameters are introduced. We also notice that most of the runtime overhead comes from k-means rather than kNN (with 36K prototypes), both of which are computed on the GPU. With 4 k-means iterations and 11-NN, our method already yields a clear mIoU improvement. We believe this latency/accuracy trade-off is reasonable, particularly given benefits such as interpretability.
| Method | k-means % time | kNN % time | Overall time | Memory | mIoU |
| --- | --- | --- | --- | --- | --- |
| 2 iter, 1-NN | 11.43 | 4.78 | 182.19 | 4569 | 77.18 |
| 4 iter, 1-NN | 13.97 | 4.60 | 188.61 | 4573 | 78.05 |
| 4 iter, 11-NN | 13.72 | 4.67 | 189.09 | 4574 | 78.50 |
| 10 iter, 11-NN | 20.04 | 4.38 | 207.51 | 4577 | 78.69 |
We explain how we conduct boundary evaluation on semantic segmentation, following [2, 34]. We first compute semantic boundaries per category for the predictions and the ground truth. We then match boundary pixels between predictions and ground truth within a maximum distance proportional to the image diagonal length. The per-category results are summarized by precision, recall, and f-measure in Table 5 and Table 6 on the VOC and Cityscapes datasets.
**Boundary precision (VOC):**

| Base / Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Deeplabv3 / Softmax | 76.88 | 69.30 | 70.41 | 52.87 | 44.73 | 61.37 | 62.44 | 65.08 | 38.14 | 68.08 | 20.58 | 54.64 | 63.41 | 60.84 | 62.48 | 49.75 | 68.12 | 30.40 | 51.05 | 44.53 | 56.02 |
| Deeplabv3 / SegSort | 81.70 | 72.53 | 75.13 | 62.07 | 69.34 | 72.42 | 66.67 | 76.10 | 45.23 | 70.91 | 40.72 | 69.55 | 65.09 | 72.75 | 74.24 | 53.70 | 80.07 | 43.70 | 70.86 | 61.98 | 66.61 |
| PSPNet / Softmax | 91.10 | 75.72 | 90.72 | 68.00 | 68.19 | 82.00 | 73.49 | 81.58 | 54.96 | 85.38 | 40.57 | 76.17 | 84.00 | 77.07 | 76.55 | 65.17 | 87.65 | 44.63 | 73.53 | 61.95 | 72.97 |
| PSPNet / SegSort | 86.46 | 73.81 | 84.86 | 68.65 | 74.18 | 81.51 | 74.48 | 81.43 | 57.92 | 83.24 | 52.58 | 75.71 | 81.29 | 74.64 | 79.17 | 63.95 | 86.80 | 43.57 | 68.01 | 63.45 | 73.05 |

**Boundary recall (VOC):**

| Base / Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Deeplabv3 / Softmax | 64.03 | 41.80 | 60.19 | 36.93 | 49.53 | 57.85 | 52.31 | 60.25 | 26.10 | 54.54 | 21.65 | 59.90 | 55.82 | 47.89 | 49.37 | 21.90 | 46.70 | 38.14 | 53.02 | 41.34 | 47.52 |
| Deeplabv3 / SegSort | 70.25 | 47.19 | 61.98 | 42.03 | 58.34 | 61.81 | 54.95 | 64.93 | 33.87 | 54.85 | 30.43 | 63.99 | 59.40 | 54.41 | 56.76 | 26.61 | 48.35 | 46.85 | 54.70 | 57.59 | 53.20 |
| PSPNet / Softmax | 68.07 | 37.66 | 63.66 | 38.90 | 59.55 | 62.05 | 55.28 | 64.92 | 29.48 | 55.87 | 28.16 | 65.18 | 58.09 | 49.35 | 53.69 | 22.75 | 52.09 | 41.14 | 55.74 | 51.78 | 51.19 |
| PSPNet / SegSort | 70.97 | 38.66 | 70.07 | 46.51 | 67.99 | 65.11 | 61.48 | 68.46 | 40.73 | 61.71 | 39.15 | 69.96 | 63.22 | 54.04 | 59.59 | 30.64 | 53.69 | 52.46 | 58.04 | 61.14 | 57.29 |

**Boundary f-measure (VOC):**

| Base / Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Deeplabv3 / Softmax | 69.87 | 52.14 | 64.90 | 43.48 | 47.01 | 59.56 | 56.93 | 62.57 | 30.99 | 60.56 | 21.10 | 57.15 | 59.37 | 53.60 | 55.16 | 30.41 | 55.41 | 33.83 | 52.01 | 42.88 | 50.90 |
| Deeplabv3 / SegSort | 75.55 | 57.18 | 67.92 | 50.12 | 63.37 | 66.70 | 60.25 | 70.07 | 38.74 | 61.86 | 34.83 | 66.65 | 62.11 | 62.26 | 64.33 | 35.59 | 60.29 | 45.22 | 61.74 | 59.71 | 58.83 |
| PSPNet / Softmax | 77.92 | 50.30 | 74.82 | 49.49 | 63.58 | 70.64 | 63.10 | 72.30 | 38.38 | 67.55 | 33.24 | 70.25 | 68.68 | 60.17 | 63.11 | 33.73 | 65.34 | 42.82 | 63.41 | 56.41 | 59.64 |
| PSPNet / SegSort | 77.95 | 50.75 | 76.76 | 55.45 | 70.95 | 72.39 | 67.36 | 74.38 | 47.83 | 70.88 | 44.88 | 72.73 | 71.13 | 62.69 | 68.00 | 41.43 | 66.34 | 47.61 | 62.63 | 62.27 | 63.71 |
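The boundary matching described above can be sketched as follows. This brute-force version uses simple nearest-distance thresholding for clarity (benchmark implementations typically use distance transforms or bipartite matching for efficiency and exactness), and all names here are illustrative:

```python
import numpy as np

def boundary_prf(pred_boundary, gt_boundary, max_dist):
    """Precision/recall/f-measure between two binary boundary maps.

    A predicted boundary pixel counts as matched if some ground-truth
    boundary pixel lies within `max_dist` pixels (in the paper, a
    fraction of the image diagonal), and symmetrically for recall.
    """
    pred_pts = np.argwhere(pred_boundary)
    gt_pts = np.argwhere(gt_boundary)
    if len(pred_pts) == 0 or len(gt_pts) == 0:
        return 0.0, 0.0, 0.0
    # Pairwise Euclidean distances between boundary pixel coordinates.
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    precision = float((d.min(axis=1) <= max_dist).mean())
    recall = float((d.min(axis=0) <= max_dist).mean())
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

Note that the f-measure rows in the table above are the harmonic mean of the corresponding precision and recall rows, which is exactly what this function computes per category.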
We conduct an ablation study to understand how different components affect the performance of supervised SegSort. The hyper-parameters of our main experiments are chosen based on these ablation studies.
Number of clusters: We study how the number of clusters in the vMF clustering affects the semantic segmentation performance (Figure 11). We train and test the DeepLabv3+ / MNV2 network with varying numbers of clusters. The highest performance is achieved when the number of clusters is slightly larger than the number of categories in the dataset.
Dimension of embeddings: We study how the dimension of embeddings affects the semantic segmentation performance (Figure 12). We train and test the DeepLabv3+ / MNV2 network with varying embedding dimensions. We conclude that as long as the embedding dimension is sufficient, i.e., larger than 8, the performance does not change drastically.
Number of nearest neighbors: We study how the number of nearest neighbors used during inference affects the segmentation performance (Figure 13). We train the PSPNet / ResNet-101 network as described in the main paper and test it with varying (odd) numbers of nearest neighbors. We conclude that the segmentation performance is robust to the number of nearest neighbors, as the mIoU varies only marginally.
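The nearest-neighbor inference studied above amounts to a cosine-similarity kNN with majority vote over an annotated prototype bank. Below is a minimal NumPy sketch with hypothetical names, not the paper's implementation:

```python
import numpy as np

def knn_majority_vote(query_protos, bank_protos, bank_labels, k=11):
    """Label each query segment by majority vote of its k nearest
    neighbors in an annotated prototype bank.

    query_protos: (Q, D) L2-normalized segment embeddings from a test image.
    bank_protos:  (N, D) L2-normalized prototypes from the annotated set.
    bank_labels:  (N,) integer semantic class of each bank prototype.
    """
    # Cosine similarity reduces to a dot product for unit vectors.
    sims = query_protos @ bank_protos.T              # (Q, N)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]        # k most similar per query
    nn_labels = bank_labels[nn_idx]                  # (Q, k)
    # Majority vote over the retrieved neighbors' classes.
    votes = [np.bincount(row).argmax() for row in nn_labels]
    return np.asarray(votes)
```

An odd k avoids simple two-way ties, and because the vote aggregates several retrieved segments, moderate changes in k rarely flip the majority; this is consistent with the robustness observed in Figure 13.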
We present the visual comparison in Figure 14. We observe that large objects, such as ‘bus’ and ‘truck’, are improved thanks to more consistent region predictions, while small objects, such as ‘pole’ and ‘tlight’, are also better captured.
We also include the per-category segmentation performance on the Cityscapes test set in Table 7. We observe performance trends similar to those on the validation set and conclude that the network trained with SegSort consistently outperforms Softmax.
We present the per-category results in Table 8 for interested readers. We notice that SegSort with DeepLabv3+ / MNV2 better captures fine structures, such as ‘bike’ and ‘mbike’, while SegSort with PSPNet / ResNet-101 is stronger at detecting small objects, such as ‘boat’ and ‘plant’.
| Base / Backbone / Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepLabv3+ / MNV2 / Softmax | 85.02 | 55.18 | 80.92 | 65.87 | 70.60 | 89.55 | 83.39 | 88.27 | 35.04 | 80.30 | 48.24 | 79.20 | 82.13 | 81.16 | 81.21 | 52.59 | 75.24 | 47.20 | 80.20 | 67.92 | 72.51 |
| DeepLabv3+ / MNV2 / SegSort | 84.80 | 58.54 | 81.08 | 68.92 | 79.15 | 89.75 | 85.24 | 89.64 | 34.88 | 74.60 | 58.62 | 84.34 | 79.07 | 84.94 | 85.92 | 54.65 | 76.76 | 50.74 | 82.95 | 74.57 | 74.94 |
| PSPNet / ResNet-101 / Softmax | 92.56 | 66.70 | 91.10 | 76.52 | 80.88 | 94.43 | 88.49 | 93.14 | 38.87 | 89.33 | 62.77 | 86.44 | 89.72 | 88.36 | 87.48 | 56.95 | 91.77 | 46.23 | 88.59 | 77.14 | 80.12 |
| PSPNet / ResNet-101 / SegSort | 92.23 | 52.68 | 91.29 | 80.33 | 83.92 | 95.13 | 90.33 | 95.44 | 44.68 | 90.84 | 67.37 | 91.29 | 91.09 | 89.66 | 88.98 | 67.54 | 88.06 | 53.04 | 87.79 | 79.97 | 81.77 |
| DeepLabv3+ / MNV2 / Softmax | 85.89 | 59.20 | 79.09 | 61.24 | 66.47 | 87.87 | 85.17 | 88.80 | 28.27 | 78.98 | 60.67 | 80.35 | 83.72 | 83.90 | 83.52 | 59.87 | 83.43 | 50.22 | 74.07 | 63.91 | 73.25 |
| DeepLabv3+ / MNV2 / SegSort | 79.49 | 66.32 | 75.38 | 66.17 | 70.71 | 91.51 | 84.82 | 85.54 | 38.69 | 74.91 | 68.99 | 78.17 | 80.49 | 85.08 | 85.63 | 60.92 | 86.47 | 57.96 | 73.26 | 67.39 | 74.88 |
| PSPNet / ResNet-101 / Softmax | 94.01 | 68.08 | 88.80 | 64.87 | 75.87 | 95.60 | 89.59 | 93.15 | 37.96 | 88.20 | 72.58 | 89.96 | 93.30 | 87.52 | 86.65 | 61.90 | 87.05 | 60.81 | 87.13 | 74.65 | 80.63 |
| PSPNet / ResNet-101 / SegSort | 96.00 | 67.17 | 93.37 | 74.52 | 77.77 | 95.07 | 89.39 | 93.91 | 41.31 | 87.85 | 73.66 | 90.15 | 91.06 | 85.63 | 87.86 | 71.81 | 90.28 | 65.99 | 86.75 | 75.53 | 82.41 |
We also train our supervised SegSort using DeepLabv3+ with a ResNet-101 backbone, using exactly the same hyper-parameters as for the MNV2 backbone. We include the per-category segmentation performance in Table 9. Even though these hyper-parameters might not be optimal, we observe consistent performance improvements over the baseline Softmax method.
| Base / Backbone / Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Deeplabv3 / ResNet-101 / Softmax | 90.93 | 56.70 | 89.46 | 73.35 | 82.13 | 95.03 | 87.30 | 91.88 | 37.79 | 83.56 | 56.32 | 88.31 | 83.32 | 86.11 | 86.61 | 58.17 | 87.65 | 52.87 | 88.43 | 74.19 | 78.24 |
| Deeplabv3 / ResNet-101 / SegSort | 88.78 | 51.17 | 88.12 | 70.45 | 83.89 | 95.12 | 88.74 | 94.34 | 43.12 | 86.24 | 59.07 | 88.86 | 88.11 | 86.92 | 87.58 | 56.91 | 85.46 | 55.32 | 89.01 | 73.77 | 78.85 |