Many deep learning based applications in computer vision operate on a grid of pixels and use convolutions trained end-to-end. However, popular algorithms have successfully leveraged image segmentation to group pixels into superpixels, reducing the input dimensionality while preserving the semantic content needed to address the task at hand [fulkerson2009class]. Superpixels are efficient image priors that tend to transfer across tasks and reduce the data needed to train models, which can be very beneficial for domain adaptation and weakly supervised settings, e.g. weakly supervised image segmentation [kwak2017weakly]. Graph-based convolutional networks [graphbased] also allow applications of deep learning beyond grid-like inputs. Some works [schuurmans2018efficient] have explored the inclusion of superpixels in deep learning pipelines.
The hand-crafted design of superpixel algorithms limits our ability to tune image segmentations to specific image domains, such as infrared, medical, or spatio-temporal data. Given the focus on efficiency, superpixels have often been designed to operate on color features only; image segmentations could however incorporate higher-level image representations. We consider extensions to a standard superpixel algorithm incorporating higher-level unsupervised or supervised image features. We also study paths to fine-tune a superpixel segmentation algorithm to a specific modality. There has been little research on trainable superpixels. In parallel to our work, Wei-Chih Tu et al. [liulearning] have developed a trainable variant of graph-based superpixel algorithms using trainable superpixel affinities. Our approach is based on clustering algorithms, which tend to be faster and more suited for real-time applications due to their iterative nature [spix_eval].
II SLIC algorithm
Several comparisons indicate that the Simple Linear Iterative Clustering (SLIC) [slic] image segmentation algorithm offers both good speed and performance [spix_eval][neubert2012superpixel]. It uses a clustering approach similar to $k$-means, and usually operates on images in the CIELAB color space. After initialization of the cluster centers along a grid, a two-step iterative process refines clusters until convergence. First, the pixels are assigned to the closest cluster center in a joint 5-dimensional space of color ($l$, $a$ and $b$) and spatial ($x$ and $y$) components. The weighted distance includes a compactness parameter to balance between colors and space. Second, the cluster centers are updated based on the pixel assignments. Finally, after convergence, a simple connected components algorithm enforces connectedness of the image segments.
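The two-step iteration described above can be illustrated with a minimal sketch. It assumes the distance of the original SLIC formulation, in which the spatial distance is divided by the grid step and scaled by the compactness before being combined with the color distance. The function names and the tuple layout `(l, a, b, x, y)` are illustrative choices, and for brevity every center is considered for every pixel, whereas SLIC restricts the search to a window around each center.

```python
import math

def slic_distance(pixel, center, step, compactness):
    """Distance between a pixel and a cluster center, both given as (l, a, b, x, y)."""
    d_color = math.sqrt(sum((p - c) ** 2 for p, c in zip(pixel[:3], center[:3])))
    d_space = math.sqrt(sum((p - c) ** 2 for p, c in zip(pixel[3:], center[3:])))
    # The compactness trades color similarity against spatial proximity.
    return math.sqrt(d_color ** 2 + (d_space / step) ** 2 * compactness ** 2)

def slic_iteration(pixels, centers, step, compactness):
    """One assignment step followed by one center update."""
    # Step 1: assign every pixel to its closest center.
    labels = [
        min(range(len(centers)),
            key=lambda k: slic_distance(p, centers[k], step, compactness))
        for p in pixels
    ]
    # Step 2: move every center to the mean of its assigned pixels.
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(pixels, labels) if lab == k]
        if not members:
            new_centers.append(old)  # keep empty clusters where they are
        else:
            new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
    return labels, new_centers

# Toy example: a row of two dark and two bright pixels.
pixels = [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (100, 0, 0, 2, 0), (100, 0, 0, 3, 0)]
centers = [(0, 0, 0, 0.5, 0), (100, 0, 0, 2.5, 0)]
labels, centers = slic_iteration(pixels, centers, step=2, compactness=10)
```

The connected components post-processing step is omitted from this sketch.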
III Augmenting SLIC with deep representations
III-A Deep representations
We experiment with SLIC beyond the original color and spatial features. Deep representations capturing textures, gradients and edges in the image can be extracted from convolutional neural networks. Their structure is similar to multi-channel images, often having a lower resolution than the original image. Each channel represents an image feature. These features can be unsupervised, as in the case of scattering features [scattering] (Fig. 1), or trained for a particular vision task. Segmentation networks such as ENet [paszke2016enet] have convolutional layers behaving like feature extractors. As we aim to integrate superpixels in a deep architecture, the features can be provided at no extra computational cost. Unsupervised scattering networks are similar to convolutional neural networks whose filters are fixed as wavelets. We use scattering networks with a receptive field of for our experiments on images, generating feature maps of size per image channel.
III-B Inclusion in SLIC
For a particular pixel, we have image features . To incorporate the image features into SLIC, we augment the number of image channels. The scattering features are upscaled and concatenated with the input image (Fig. 2). The final image of size can be used in the SLIC algorithm, where the clustering space now becomes a larger space. The SLIC color distance is extended and individual feature maps are weighted with coefficients . The distance in the color space between pixel and cluster is then defined as
Our first experiments investigate the impact of scattering features by manually tuning the inclusion of scattering features on the lightness component only. We define binary weights based on the visual appearance of the features. Layers originating from strong edge detectors are left out since they are of no use in clustering. We also varied the relative importance of the extra features compared to the color components and selected the best-scoring approach (out of 10 different ones) for evaluation (Section VII-B).
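As an illustration of how weighted feature maps can enter the color distance, the sketch below assumes that the extension adds a weighted sum of squared feature differences to the squared color distance. This form is an assumption suggested by the later description of the 1-layer network as a weighted addition of squared pixel-cluster differences; the function name and argument layout are hypothetical.

```python
import math

def extended_color_distance(color_i, color_k, feats_i, feats_k, weights):
    """Color distance between pixel i and cluster k, extended with weighted features.

    Assumed form: sqrt(sum of squared color differences
                       + sum_j weights[j] * (feats_i[j] - feats_k[j]) ** 2).
    """
    d2 = sum((a - b) ** 2 for a, b in zip(color_i, color_k))
    d2 += sum(w * (a - b) ** 2 for w, a, b in zip(weights, feats_i, feats_k))
    return math.sqrt(d2)

# Binary weights switch individual feature maps on or off:
# the second feature (e.g. a strong edge detector) is ignored here.
d = extended_color_distance(
    color_i=(50.0, 0.0, 0.0), color_k=(47.0, 0.0, 0.0),
    feats_i=(4.0, 100.0), feats_k=(0.0, 0.0),
    weights=(1, 0),
)
```

With all weights set to zero the function reduces to the plain color distance, so the manual tuning amounts to choosing which feature maps are allowed to influence the clustering and how strongly.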
IV Trainable superpixel algorithm
Manual selection and weighting of features in the distance measure is a tedious process, requiring visual examination of features and an exhaustive search for optimal weights. In addition, the distance measure (1) might not have enough flexibility to integrate those features properly. We investigate a trainable superpixel algorithm incorporating a neural network that can tune superpixels to a certain image set.
IV-A Clustering as a classification problem
The SLIC superpixel algorithm uses a top-down approach: the algorithm iterates over all cluster centers and calculates a distance measure to all pixels in the neighborhood around the cluster center. An equivalent bottom-up approach would be to iterate over all pixels and calculate a distance measure between the pixel and all the clusters in the region around the pixel. The pixel is then assigned to the cluster being the closest in the clustering space of SLIC with components.
This is in fact a classification problem: assign each pixel to one of the clusters in the spatial neighborhood. While SLIC solves this classification problem using a distance measure, we prefer to avoid training a regression because distances that improve superpixel performance are hard to define. We propose to use a neural classifier for the assignment task: it considers a fixed number of the spatially closest clusters in the neighborhood and assigns the pixel to one of those depending on their features.
IV-B Bottom-up trainable superpixel algorithm
The algorithm (Algorithm 1) works in a similar way to SLIC. Clusters are first initialized on a grid. Then, clusters are formed using a two-step iterative procedure: the first step assigns each pixel to one of the spatially closest clusters, using classification based on the features of these clusters and the pixel of interest. The number of candidate clusters is a parameter: a higher value means more flexibility at the cost of more computation. Afterwards, the features and position of the newly formed clusters are calculated by averaging the features and positions of the pixels assigned to those clusters. This iterative procedure is repeated for a fixed number of iterations. Finally, a connected components algorithm is used to transform clusters into proper superpixels.
A sequential implementation as described here would be slow: the large number of individual network evaluations limits the performance. We implemented a version that generates large batches and evaluates these on a GPU. The algorithm can also easily be parallelized because every pixel is processed independently.
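The bottom-up procedure can be sketched sequentially as follows. This is not the batched GPU implementation: it is a minimal illustration in which the neural classifier is replaced by a hypothetical stand-in (`nearest_feature_classifier`) that simply picks the candidate with the most similar features. The dictionary layout of pixels and clusters and all function names are assumptions made for the example.

```python
def k_nearest_clusters(px, py, centers, k):
    """Indices of the k cluster centers spatially closest to (px, py)."""
    order = sorted(
        range(len(centers)),
        key=lambda c: (centers[c]["x"] - px) ** 2 + (centers[c]["y"] - py) ** 2,
    )
    return order[:k]

def assign_pixels(pixels, centers, k, classify):
    """Step 1: classify every pixel into one of its k spatially closest clusters."""
    labels = []
    for p in pixels:
        candidates = k_nearest_clusters(p["x"], p["y"], centers, k)
        choice = classify(p, [centers[c] for c in candidates])  # index into candidates
        labels.append(candidates[choice])
    return labels

def update_centers(pixels, labels, centers):
    """Step 2: average the positions and features of the pixels in each cluster."""
    updated = []
    for c, old in enumerate(centers):
        members = [p for p, lab in zip(pixels, labels) if lab == c]
        if not members:
            updated.append(old)
            continue
        n = len(members)
        updated.append({
            "x": sum(m["x"] for m in members) / n,
            "y": sum(m["y"] for m in members) / n,
            "feat": [sum(v) / n for v in zip(*(m["feat"] for m in members))],
        })
    return updated

def nearest_feature_classifier(pixel, candidates):
    """Stand-in for the trained network: choose the most similar candidate."""
    costs = [
        sum((a - b) ** 2 for a, b in zip(pixel["feat"], c["feat"]))
        for c in candidates
    ]
    return costs.index(min(costs))

def trainable_superpixels(pixels, centers, k, classify, iterations):
    """Fixed number of assignment/update iterations."""
    labels = []
    for _ in range(iterations):
        labels = assign_pixels(pixels, centers, k, classify)
        centers = update_centers(pixels, labels, centers)
    return labels, centers
```

Because each pixel is classified independently, the loop in `assign_pixels` is the part that can be turned into one large batch of network evaluations.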
IV-C Neural network architecture
The input vector for the classification of a single pixel consists of several parts:
pixel features, for example the pixel color and other features extracted using deep representations.
spatial distances to the closest clusters. In order to have a single neural network for multiple superpixel sizes and compactness parameters, the distance is normalized: , with the pixel distance between pixel and cluster center .
feature differences between the input pixels and cluster centers.
The network outputs a vector of size , where each element denotes the probability of the pixel belonging to the cluster with the corresponding index. We aim for a small network and look at the problem as a typical classification problem. A fully connected network would not exploit the similarity between different parts of the input vector. An efficient architecture is made up of three parts: normalization, dimensionality reduction and classification (Fig. 3). The Dimensionality Reducer for Pixels (DRP) module transforms the pixel features to a smaller space, while the Dimensionality Reducer for Clusters (DRC) is applied to the pixel-cluster differences. Weights are shared between similarly-named modules to reduce the number of trainable parameters. The final fully connected network (FC) does the actual classification.
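The three parts of the input vector listed above can be assembled as in the following sketch. The normalization of the spatial distance is an assumption: the distance is divided by the grid step so that the same network can serve several superpixel sizes, which may differ from the exact normalization used in the experiments. The function name and the `(position, features)` layout of the clusters are hypothetical.

```python
import math

def build_input_vector(pixel_xy, pixel_feats, clusters, step):
    """Input vector for classifying one pixel.

    clusters: list of ((x, y), features) for the spatially closest clusters.
    step: grid step used to normalize spatial distances (assumed normalization).
    """
    vec = list(pixel_feats)                       # part 1: pixel features
    for (cx, cy), _ in clusters:                  # part 2: normalized distances
        dist = math.hypot(pixel_xy[0] - cx, pixel_xy[1] - cy)
        vec.append(dist / step)
    for _, cluster_feats in clusters:             # part 3: feature differences
        vec.extend(pf - cf for pf, cf in zip(pixel_feats, cluster_feats))
    return vec

vec = build_input_vector(
    pixel_xy=(0, 0),
    pixel_feats=[1.0, 2.0],
    clusters=[((3, 4), [0.0, 0.0]), ((0, 10), [1.0, 1.0])],
    step=5,
)
```

For F features and K candidate clusters the vector has F + K + K·F entries. The repeated per-cluster blocks in the third part are what make weight sharing between the DRC modules possible.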
V Generating training labels
The classifier requires training labels, indicating which cluster the pixel should be assigned to according to ground truth. Since no database with superpixel annotations exists, we derive a label set from semantic segmentation databases such as Cityscapes [cityscapes] and BSDS [bsds500] (Fig. 4).
V-A SLIC-based labels
We use the SLIC distance measure as a starting point to produce labels. Replicating SLIC requires calculating the SLIC distance measure to the closest clusters considered by the classifier and picking the closest cluster according to this measure. The pixel label is then set to this cluster. However, replicating SLIC alone would not force the classifier to include the features extracted from deep representations in its decision process. To improve superpixels beyond SLIC, we use ground truth annotations for semantic segmentation to correct wrong labels, i.e. labels where the pixel would be assigned to a cluster in a different ground truth segment. SLIC makes these mistakes when regions have approximately the same colors, but the classifier can use deep representations to discriminate between the two regions. When generating a label for a pixel, we only consider assignment to clusters lying mainly in the same ground truth segment as the pixel being classified.
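The label correction can be sketched as follows, assuming that the SLIC distances to the candidate clusters and the ground truth segment in which each candidate mainly lies have already been computed. The function name is hypothetical, and the fallback to plain SLIC when no candidate shares the pixel's segment is an assumption made for the example.

```python
def slic_label_with_gt_correction(distances, cluster_segments, pixel_segment):
    """Index of the candidate cluster used as training label for one pixel.

    distances: SLIC distance from the pixel to each candidate cluster.
    cluster_segments: ground truth segment id in which each candidate mainly lies.
    pixel_segment: ground truth segment id of the pixel.
    """
    same_segment = [
        i for i, seg in enumerate(cluster_segments) if seg == pixel_segment
    ]
    # Assumed fallback: plain SLIC label when no candidate shares the segment.
    pool = same_segment if same_segment else range(len(distances))
    return min(pool, key=lambda i: distances[i])

# SLIC alone would pick candidate 0, which lies in another segment;
# the corrected label is the closest candidate within segment 5.
label = slic_label_with_gt_correction(
    distances=[1.0, 2.0, 3.0], cluster_segments=[7, 5, 5], pixel_segment=5
)
```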
Ground truth segments are typically much larger than superpixels, so the number of pixels corrected by the ground truth segmentation is small. The classifier thus primarily replicates SLIC and ignores the corrected labels. A multi-label loss could take into account that multiple clusters are good candidates, but we could not achieve satisfactory results using this approach. We solve the problem by using principles of hard-example mining: the set of labels is carefully chosen to improve the training process.
Hard-example mining on SLIC mistakes
We try to train the classifier by only retaining labels that were corrected by the ground truth annotations. Our experiments indicate that this is too strict and degrades superpixel performance.
Hard-example mining at segmentation edges
A less strict method would be to only consider pixels near ground truth edges. Labels in the middle of the ground truth segments have a lot of ambiguity: we cannot be sure whether the assigned cluster is really in the same part of the object. Labels at the edges have more discriminative power. We call these unambiguous labels. Our implementation does not exactly select pixels near the edge; it is easier to count the number of distinct ground truth segments covered by the closest clusters. Thus, we restrict the training set to pixels that have candidate clusters in at least a chosen number of different ground truth segments.
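The selection of unambiguous labels by counting ground truth segments can be sketched as below. The sample layout (a dictionary with a `cluster_segments` entry) and the function names are hypothetical; the threshold is the chosen minimum number of distinct segments.

```python
def is_unambiguous(cluster_segments, min_segments):
    """True if the candidate clusters cover at least min_segments distinct segments."""
    return len(set(cluster_segments)) >= min_segments

def mine_hard_examples(samples, min_segments):
    """Keep only training samples whose candidate clusters span enough segments."""
    return [
        s for s in samples
        if is_unambiguous(s["cluster_segments"], min_segments)
    ]

samples = [
    {"pixel": 0, "cluster_segments": [3, 3, 3, 3]},  # interior pixel: ambiguous
    {"pixel": 1, "cluster_segments": [3, 3, 4, 4]},  # near one ground truth edge
    {"pixel": 2, "cluster_segments": [3, 4, 5, 5]},  # near a junction of segments
]
training_set = mine_hard_examples(samples, min_segments=2)
```

A threshold of 1 keeps every sample, while larger thresholds keep only pixels close to edges or junctions of ground truth segments.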
V-B Weakly supervised labeling
Using the SLIC distance measure to generate pixel labels offers a good starting point but might also restrict the adaptability of the classification network. One could label a pixel to a random cluster in the same segment. This obviously generates very noisy superpixels. Picking the closest cluster in the same segment has the opposite problem: the spatial component is emphasized too much. Again, we leverage the principles of hard-example mining to build a better training set. We limit the training set to pixels having candidate clusters in at least segments, with an optimal to be determined experimentally (Fig. 5). Interestingly, our experiments indicate that a higher value for produces more compact clusters (Fig. 6). The reduced amount of ambiguity increases the importance of the spatial component: the network learns that two pixels next to each other might have very different features, while having very similar spatial distances to the spatially closest clusters.
V-C BSDS ground truth edges
More refined semantic segmentations provide more accurate labels. We considered several semantic segmentation datasets: PASCAL VOC [pascal-voc-2012], Cityscapes [cityscapes] and BSDS500 [bsds500]. Cityscapes and BSDS both have high-quality ground truth annotations, but BSDS provides several of them per image. Typically, object borders in natural images are not clearly delineated and multiple independent ground truth segmentations help to handle these cases. We combine the 5 individual ground truth annotations in a single ground truth edge map (Fig. 7). This also defines a new distance measure: more edges between a pixel and a cluster indicate a greater distance and a lower likelihood of being assigned to that cluster.
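One way to build such a combined edge map and to derive an edge-based distance from it is sketched below. Both the boundary definition (a pixel whose right or lower neighbor has a different label) and the distance (sum of edge votes along the straight line between pixel and cluster center) are assumptions made for illustration; all function names are hypothetical.

```python
def boundary_map(seg):
    """1 where a pixel's right or lower neighbor has a different label, else 0."""
    h, w = len(seg), len(seg[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            right = x + 1 < w and seg[y][x] != seg[y][x + 1]
            down = y + 1 < h and seg[y][x] != seg[y + 1][x]
            edges[y][x] = 1 if (right or down) else 0
    return edges

def combined_edge_map(segmentations):
    """Per-pixel number of annotations that mark a boundary at that pixel."""
    maps = [boundary_map(s) for s in segmentations]
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[y][x] for m in maps) for x in range(w)] for y in range(h)]

def edge_distance(edge_map, start, end):
    """Sum of edge votes on the straight line from start to end, both (x, y)."""
    (x0, y0), (x1, y1) = start, end
    n = max(abs(x1 - x0), abs(y1 - y0))
    if n == 0:
        return edge_map[y0][x0]
    total = 0
    for t in range(n + 1):
        x = round(x0 + (x1 - x0) * t / n)
        y = round(y0 + (y1 - y0) * t / n)
        total += edge_map[y][x]
    return total

# Two annotators disagree slightly on where the object border lies.
annotations = [[[0, 0, 1, 1]], [[0, 1, 1, 1]]]
edges = combined_edge_map(annotations)
```

Edges marked by several annotators receive more votes and therefore contribute more to the distance than edges marked by a single annotator.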
VI Training a distance measure
The proposed network interprets the classification task as a typical deep learning problem. We were not able to replicate the SLIC distance measure exactly, although superpixel output was similar. We note that the SLIC distance measure could be perfectly replicated by squaring each element of the input vector and removing the batchnorm layer: the elements of the input vector then become the individual terms of the SLIC distance measure. By making the different parts of the network independent, the trained modules can be seen as distance functions (Fig. 8). The network then learns a regression by training a classification. We verified that the network can almost perfectly replicate the SLIC distance measure (Table II). When using a single linear layer, the network in fact learns the weights of Equation 1. These weights can then be integrated in the top-down approach of SLIC, resulting in a very efficient trainable superpixel algorithm running on CPU.
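The link between classification and regression can be illustrated with a single linear layer acting on squared pixel-cluster differences. In this sketch, the cluster probabilities are obtained with a softmax over negated distances, which is an assumption about how the distance modules feed the classification output; the function names are hypothetical. Under this assumption the most probable cluster is always the one with the smallest learned distance.

```python
import math

def linear_distance(squared_diffs, weights):
    """Single linear layer: weighted sum of squared pixel-cluster differences."""
    return sum(w * d for w, d in zip(weights, squared_diffs))

def cluster_probabilities(squared_diffs_per_cluster, weights):
    """Softmax over negated distances (assumed output layer)."""
    dists = [linear_distance(d, weights) for d in squared_diffs_per_cluster]
    smallest = min(dists)  # subtract the minimum for numerical stability
    exps = [math.exp(-(d - smallest)) for d in dists]
    total = sum(exps)
    return [e / total for e in exps]

# Two candidate clusters, two input terms (e.g. a color and a spatial term).
weights = [1.0, 0.5]
probs = cluster_probabilities([[1.0, 0.0], [4.0, 4.0]], weights)
best = probs.index(max(probs))
```

Training such a layer with a classification loss adjusts `weights`, which play the role of the coefficients in the distance measure and can afterwards be reused directly in a top-down SLIC implementation.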
VII Evaluation and results
Superpixel performance is evaluated on 500 BSDS500 [bsds500] color images. Superpixels are evaluated with size , compactness (determined optimal for the standard SLIC) and 5 clustering iterations. We use several metrics common in superpixel evaluation. Boundary recall (Rec) represents the adherence to ground truth boundaries (higher is better). Mean distance to edge (MDE) [benesova2014fast] measures the average distance between the ground truth border and the closest superpixel edge (lower is better). Superpixel leakage into different ground truth segments is quantified by the undersegmentation error (UE) (lower is better). Multiple variants exist; we use the definition of Neubert and Protzel [neubert2012superpixel]. The regularity and compactness of superpixels are measured by the compactness (CO) metric [compactness]. More regular superpixels are generally preferred. For a fair comparison, the compactness parameters of different methods are chosen so their resulting output compactness is similar. We define an additional intersection-over-union (IoU) metric similar to the one often used in segmentation benchmarks. This metric measures the maximum achievable performance when using superpixels in a segmentation pipeline.
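As an example of these metrics, boundary recall can be sketched as the fraction of ground truth boundary pixels that have a superpixel boundary pixel within a small tolerance. The tolerance of one pixel, the square neighborhood and the value returned for an empty ground truth are assumptions of this sketch, and the function name is hypothetical.

```python
def boundary_recall(gt_edges, sp_edges, tolerance=1):
    """Fraction of ground truth boundary pixels matched by a superpixel boundary.

    gt_edges, sp_edges: sets of (x, y) boundary pixel coordinates.
    tolerance: half-width of the square search window (assumed value).
    """
    if not gt_edges:
        return 1.0  # assumed convention when there is nothing to recall
    offsets = range(-tolerance, tolerance + 1)
    hits = 0
    for x, y in gt_edges:
        if any((x + dx, y + dy) in sp_edges for dx in offsets for dy in offsets):
            hits += 1
    return hits / len(gt_edges)

recall = boundary_recall(
    gt_edges={(0, 0), (5, 5), (10, 10)},
    sp_edges={(0, 1), (5, 5)},
)
```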
VII-B Extended distance measure with manual tuning
As a first experiment, we evaluate the inclusion of scattering features in the extended distance measure for SLIC (Section III). The scattering transformation is applied on the lightness channel of the image (converted to the CIELAB color space) and we manually select the most important representations. We refer to this method as ‘Manual tuning’ and Table I shows that all metrics are improved. Mainly the mean distance to edge and undersegmentation metrics are impacted: the low-resolution features do not help at a pixel-scale level, but avoid superpixel leakage. The difference is larger at lower compactness values (Fig. 10). When each method is evaluated at its own optimal compactness, the improvement in MDE is 9.4%, compared to the 4.3% improvement for . The approach with scattering features benefits from the increased flexibility, while SLIC performance decreases. Superpixels incorporating deep representations also consistently perform better (Fig. 9): most images are slightly improved. In addition, we experimented with greyscale images and the effect of scattering features is even stronger.
VII-C Trainable superpixels
Trainable superpixels should be able to improve superpixel quality without having to manually tune the distance measure weights. Quality assessment of the trainable superpixels is a three-stage process: a label set is generated, a classifier is trained on these labels and the superpixel algorithm using the trained classifier is evaluated. We selected the most promising label methods for evaluation on BSDS500 images and tested scattering and ENet features. The 243 scattering features have a receptive field of and spatial dimensions of . The ENet features are extracted from the first convolutional layer, designed to be a feature extractor and consisting of filters having a receptive field of . They have a better spatial resolution of size , but there are only 16 features.
We selected a simple network with an architecture as in Fig. 3, where the dimensionality reducers DRP/DRC are 2-layer networks (hidden layers of 100 and 15 neurons), followed by the classification network FC. This approach is called ‘Deep learning classification network’. We also test regression architectures as in Fig. 8 with a single linear layer for the distance measure module: this is in fact just a weighted addition of the squared pixel-cluster differences. This approach is called ‘1-layer network’. In addition, we evaluate a network where the single layer is expanded to 3 layers (‘3-layer’).
We trained on several label methods and experimented with different variations of hard-example mining for both SLIC-based labels and weakly supervised labels. Our experiments found that more engineered methods performed better. The best SLIC-based method corrects labeling mistakes with the segmentation ground truth and applies hard-example mining, with parameter , in order to remove ambiguous labels. The best weakly supervised label method also removes ambiguous labels, but with parameter , retaining clusters lying in at least 6 different ground truth segments.
VII-D Validation loss
As different labeling methods employ different loss functions, we cannot directly compare the values of these loss functions on the validation set. For a single label method, a comparison between network architectures and features is possible and serves as an indication for resulting superpixel quality. Table II shows that scattering features and ENet features achieve similar validation losses in most cases. Unsurprisingly, the 3-layer regression network performs better than the 1-layer one, and it also performs slightly better than the classification-based network that used batch normalization and dimensionality reduction modules.
VII-E Superpixel quality
The superpixel quality for each of these methods is compared in Table I. The 3-layer regression-based network, having the lowest validation loss, also achieves the best metric scores. Superpixel quality is improved over standard SLIC and also over the manually tuned method of Section III. Comparing methods visually (Fig. 12) shows that the manually tuned method tends to concentrate superpixels around object borders. This effect is not seen in the trained superpixels. During evaluation of manually tuned superpixels, we already observed that the extra features mainly influence the mean distance to edge and undersegmentation metrics, and the same effect can be seen here. The weakly supervised method has surprisingly similar scores to SLIC, and during our tests we noticed that the variation in compactness was much lower (Fig. 11).
Superpixels are image priors that tend to transfer across tasks. This work elaborates on a trainable approach for superpixels incorporating deep image representations. We introduce several new ideas not yet addressed in research: we include deep representations in a superpixel algorithm, build a set of superpixel training labels from segmentation annotations and devise a trainable superpixel algorithm. We demonstrate that a simple inclusion of deep representations by extending the SLIC distance measure improves superpixel quality in a consistent way. The trainable approach can surpass the scores of the simple inclusion, but requires appropriate training labels. The performance increase could be limited by the dataset and features used in our experiments. We used natural images, which have a high variability in features. We believe larger performance increases can be achieved by targeting a specific modality, such as medical imaging. More specialized features can be incorporated, possibly having a less restricted receptive field than the scattering features. We hope that our analysis paves the way to the inclusion of trainable superpixels in deep learning pipelines.