Generating Superpixels for High-resolution Images with Decoupled Patch Calibration

08/19/2021 · Yaxiong Wang et al. · Xi'an Jiaotong University, Zhejiang University

Superpixel segmentation has recently seen important progress benefiting from advances in differentiable deep learning. However, very high-resolution superpixel segmentation remains challenging due to expensive memory and computation costs, which current advanced superpixel networks fail to handle. In this paper, we devise the Patch Calibration Network (PCNet), which aims to efficiently and accurately perform high-resolution superpixel segmentation. PCNet follows the principle of producing high-resolution output from low-resolution input, saving GPU memory and relieving computation cost. To recall the fine details destroyed by the down-sampling operation, we propose a novel Decoupled Patch Calibration (DPC) branch that collaboratively augments the main superpixel generation branch. In particular, DPC takes a local patch from the high-resolution image and dynamically generates a binary mask that forces the network to focus on region boundaries. By sharing parameters between the DPC and main branches, the fine-detailed knowledge learned from high-resolution patches is transferred to help calibrate the destroyed information. To the best of our knowledge, this is the first attempt at deep-learning-based superpixel generation for high-resolution images. To facilitate this research, we build evaluation benchmarks from two public datasets and one newly constructed dataset, covering a wide range of diversities from fine-grained human parts to cityscapes. Extensive experiments demonstrate that our PCNet not only performs favorably against the state of the art in quantitative results but also raises the resolution upper bound from 3K to 5K on 1080Ti GPUs.




I Introduction

Superpixel segmentation targets assigning pixels with similar color or other low-level properties to the same group, which can be viewed as a clustering procedure on the image; the formed pixel clusters are known as superpixels. Benefiting from the tremendous progress of deep convolutional neural networks, many approaches have been proposed to harness deep models to facilitate superpixel segmentation and have achieved promising results [35, 39, 14]. The common practice first evenly splits the image into grids and utilizes a convolutional network to predict a 9-dimensional vector for each pixel, indicating the probabilities of the pixel being assigned to its 9 surrounding grid cells [39, 14]. Nevertheless, existing methods often fail to process ultra high-resolution images due to the memory limitation of GPUs.
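The 9-way grid-association scheme described above can be sketched as follows. This is a minimal NumPy decoding step, assuming a (9, H, W) probability map; the function and variable names are ours, not the papers' exact API:

```python
import numpy as np

def assign_superpixels(assoc, cell=16):
    """Map each pixel to one of its 9 surrounding grid cells.

    assoc: (9, H, W) array of (softmax) probabilities, where channel k
    corresponds to the grid offset (dy, dx) = (k // 3 - 1, k % 3 - 1);
    `cell` is the initial grid spacing. Illustrative layout only.
    """
    _, H, W = assoc.shape
    gw = W // cell                       # number of grid columns
    ys, xs = np.mgrid[0:H, 0:W]
    gy, gx = ys // cell, xs // cell      # home grid cell of each pixel
    k = assoc.argmax(axis=0)             # most likely of the 9 neighbors
    dy, dx = k // 3 - 1, k % 3 - 1
    ny = np.clip(gy + dy, 0, H // cell - 1)
    nx = np.clip(gx + dx, 0, gw - 1)
    return ny * gw + nx                  # superpixel index per pixel
```

With a peaked center channel, every pixel simply keeps its home grid cell, which recovers the initial regular grid.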

Fig. 1: Time, memory, and performance comparison between our PCNet and the state-of-the-art SCN [39]; the subscripts 't' and 'm' indicate time and memory cost, respectively. Our method can efficiently generate superpixels for high-resolution (5K) images while keeping competitive boundary precision. Results are evaluated on a single NVIDIA 1080Ti GPU.
Fig. 2: Comparison of different training architectures. SCN [39] outputs a prediction at the same resolution as the input. The naïve solution of directly predicting high-resolution output from a low-resolution image results in poor performance. By introducing the calibration branch, our PCNet can process higher-resolution images while achieving satisfactory performance.

The ultra high-resolution problem is a longstanding challenge in computer vision, especially for pixel-wise tasks like segmentation [8, 43, 22, 37], optical flow [5, 34, 18, 25], and depth estimation [13, 4, 12, 40]. For superpixel segmentation, although some deep-learning-based methods have been proposed, the ultra high-resolution scenario has not been well explored. For instance, the state-of-the-art method SCN [39] can generate superpixels for low-resolution images; however, on higher-resolution images its inference speed becomes much slower. Worse, SCN cannot work at all when the resolution of the fed image exceeds 3.5K on an NVIDIA 1080Ti GPU, as shown in Fig. 1. To enlarge the tolerable resolution of SCN, a straightforward solution is to predict the high-resolution output from a low-resolution image. Specifically, by introducing additional upsampling layers in the decoder stage, we can enforce the network to learn the association map from a lower-resolution input, as shown in Fig. 2 (b). With such a design, the network can successfully process larger images.

However, too much information sacrificed in the very low-resolution input obstructs the network from perceiving textural contexts, especially fine boundary details. As a result, performance deteriorates remarkably. To perceive more boundary details, we propose to calibrate the boundary-pixel assignment by directly cropping a patch from the original high-resolution image as an additional input, leading to the Patch Calibration Network (PCNet). The core idea of PCNet is illustrated in Fig. 2 (c): we crop a local patch from the source high-resolution image so that all details are preserved, and feed it into the Decoupled Patch Calibration (DPC) branch to calibrate the boundary details of the coarse global prediction. By sharing weights with the main branch, the knowledge learned by DPC is transferred so that the main superpixel branch accurately perceives more boundary pixels.

Different from the global input, which attempts to perceive the overall boundary layout by classifying pixels into their object categories, our cropped patches only target helping the network accurately assign the pixels around boundaries, paying no attention to the enclosed objects. To enforce this, we design a dynamic guiding training mechanism. Instead of greedily identifying the categories of all pixels in the patch, we design a dynamic mask that encourages the network to focus only on the main boundaries of the current patch, achieved by degrading the multi-class semantic label to a dynamic binary mask as guidance. With dynamic guiding training, in each iteration the network only needs to discriminate foreground from background to highlight the main boundary, rather than identify multiple categories for all pixels. Such a strategy not only eases network optimization but also enables the network to accurately identify more boundary pixels.

As Fig. 1 shows, our PCNet can process 5K-resolution images on a single NVIDIA 1080Ti GPU. Compared to the state-of-the-art SCN, PCNet surpasses it by a small margin even though it receives a 4-times-smaller-resolution input. Quantitative and qualitative results on Mapillary Vistas [27], BIG [8], and our newly created Face-Human dataset demonstrate that the proposed method effectively processes high-resolution images and achieves more outstanding performance. In summary, we make the following contributions in this work:

Fig. 3: Illustration of the proposed PCNet. In each iteration, the low-resolution global and local inputs are fed forward through the network to predict 4-times-resolution associations, which are supervised by the ground-truth semantic labels and our dynamic guiding mask, respectively. The super-resolution branch serves as an auxiliary module to help recover more texture details during training and is discarded at inference. 'conv-#' indicates a convolution with stride #, 'sub-pixel' stands for the sub-pixel convolution operation with scale 4 [31], and 'SR' and 'SP' denote the super-resolution and superpixel heads, respectively.
  • We contribute the first framework for high-resolution superpixel segmentation. The proposed PCNet can effectively process 5K images and achieves satisfactory performance on three benchmarks.

  • A novel patch calibration training paradigm is proposed. With this architecture, the global image and the local patch compensate each other and work together to train a more robust model.

  • We design a dynamic guidance training method in which a binary mask is dynamically generated to guide the network to focus on the boundaries of semantic regions, which efficiently benefits boundary identification.

In the following, we first review the works related to this paper in Section II and elaborate the details of our PCNet in Section III. Section IV presents the experimental results and comparisons with state-of-the-art methods. Finally, conclusions are given in Section V.

II Related Work

Although high-resolution superpixel segmentation has not been well studied, superpixel segmentation for general images has a long line of research and has made important progress in recent years. Besides, high-resolution segmentation is a neighboring task of high-resolution superpixel segmentation, since superpixels can be viewed as an over-segmentation of the image. In the following two subsections, we present existing works on superpixel segmentation and high-resolution segmentation, respectively.

II-A Superpixel Segmentation

Superpixel segmentation can be viewed as a clustering procedure on the image; the key of this problem is to estimate the assignment of each pixel to its potential clustering centers [24, 1, 26, 35, 39, 2, 36, 28, 38, 30, 17, 23]. Traditional methods model the assignment using clustering theory [19, 1, 26, 2] or graph techniques [11, 24]. In general, clustering-based methods use a clustering strategy to compute the connectivity between an anchor pixel and its neighbors, which is straightforward for superpixel segmentation. The well-known SLIC [1] adapts k-means for superpixel segmentation and is a classic algorithm in the superpixel community. Liu et al. [26] extend SLIC to compute content-sensitive superpixels; the designed model generates small superpixels in content-dense regions and large superpixels in content-sparse regions. Li et al. [19] explicitly exploit the connection between the optimization objectives of weighted k-means and normalized cuts by introducing an elaborately designed high-dimensional space. Graph-based approaches instead formulate superpixel segmentation as a graph-partitioning problem, performing segmentation by estimating the connectivity strength between pixels. Felzenszwalb et al. [11] utilize a graph-based representation of the image and define a predicate measuring the evidence for a boundary between two regions, based on which they design the efficient superpixel algorithm FH. In [24], Liu et al. propose a novel objective function for superpixel segmentation; the segmentation is then given by the graph topology that maximizes the objective function under a matroid constraint. Inspired by the success of deep neural networks on many vision tasks, researchers have recently attempted to harness deep convolutional networks to facilitate superpixel segmentation. Tu et al. [35] propose to improve superpixel segmentation with learned pixel-wise deep features. Jampani et al. [14] propose a soft clustering mechanism and incorporate it into a deep architecture, developing the first end-to-end deep solution for the superpixel task. In [39], the authors further simplify the framework of [14] and contribute a faster superpixel segmentation algorithm.
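As a concrete illustration of the clustering-based family, the joint color-spatial distance that SLIC [1] adds to k-means can be sketched as follows; this is a minimal NumPy sketch of the published formula, with argument names of our own choosing:

```python
import numpy as np

def slic_distance(pixel_lab, pixel_xy, center_lab, center_xy, S, m=10.0):
    """SLIC-style joint color-spatial distance between a pixel and a
    cluster center: D = sqrt(d_lab^2 + (m * d_xy / S)^2), where S is
    the grid interval and m trades color accuracy for compactness."""
    d_lab = np.linalg.norm(pixel_lab - center_lab)  # CIELAB color distance
    d_xy = np.linalg.norm(pixel_xy - center_xy)     # spatial distance
    return np.hypot(d_lab, (m / S) * d_xy)
```

Each pixel is then assigned to the center minimizing this distance within a local 2S×2S search window, which is what keeps the algorithm linear in the number of pixels.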

II-B High-resolution Segmentation

High-resolution superpixel segmentation is rarely explored, while its neighboring task, segmentation of high-resolution images, has been studied by many works [8, 6, 42]. Cheng et al. [8] fuse multi-scale information and iteratively refine a given coarse segmentation map to produce a high-resolution prediction. Chen et al. [6] utilize two branches to capture global and local context, which are fused to predict the final segmentation; the proposed framework is memory-efficient. In [29], Sarker et al. propose a prior- and sub-strip-based mechanism to address very high-resolution image segmentation. Instead of giving a high-resolution prediction from low resolution, Wang et al. design a parallel and hierarchical architecture, named HRNet [37], to maintain high-resolution information through the whole process. Lin et al. [21] maintain long-range residual connections to exploit all the information available along the down-sampling process, enabling the network to capture high-level semantic features and fine-grained textural features simultaneously. In [41], Zhao et al. utilize a cascade paradigm to incorporate multi-resolution branches for high-resolution image segmentation, introducing a cascade feature fusion unit for high-quality segmentation. Zheng et al. [7] employ a global-local joint learning strategy to perform efficient segmentation of ultra high-resolution remote sensing images. In [44], Zhou et al. focus on medical images and propose a novel high-resolution multi-scale encoder-decoder network, introducing multi-scale dense connections to exploit comprehensive semantic information; to address high-resolution segmentation, they integrate a high-resolution pathway to collect high-resolution semantic information for boundary localization. Liao et al. [20] further explore the segmentation of 3D high-resolution magnetic resonance angiography (MRA) images of the human cerebral vasculature and develop two minimal-path approaches. Li et al. propose an edge-embedded marker-based watershed algorithm for high-spatial-resolution remote sensing image segmentation [16]: a confidence-embedded method detects edge information in a two-stage model, where the first stage extracts a marker image and the second integrates the edge information into the labeling process, assigning edge pixels the lowest priority so they are labeled last; thus edge pixels become candidates for boundary pixels and more precise object boundaries can be acquired.

III Methodology

Fig. 3 shows a schematic illustration of our proposed PCNet. In the training stage, two types of inputs, i.e., the global image and the cropped local patch, are fed into the main branch and the decoupled patch calibration branch, respectively. The inputs are down-sampled 4 times in the contracting path and upsampled 6 times in the expansive path; consequently, both branches output association maps at 4 times the input resolution, which are supervised by the semantic label and our designed dynamic guiding mask. A super-resolution branch serves as an auxiliary module to help the network restore more details from the low-resolution inputs and is discarded during inference. We share weights between the two branches to propagate the learned knowledge.

III-A Patch Calibration Architecture

As shown in Fig. 3, for memory and computation efficiency, our PCNet predicts high-resolution association maps from low-resolution inputs using an encoder-decoder paradigm. Instead of using symmetric layers in the encoder and decoder, we additionally introduce a sub-pixel layer [31] with scale factor 4 before the prediction layer to help produce an association map at 4 times the input resolution. During training, the main branch is responsible for distilling the global boundary layout from the global input, obtained by simply resizing the original high-resolution image, while the decoupled patch calibration (DPC) branch focuses on calibrating the global results by capturing finer boundaries from the cropped local patch. By sharing the learned weights of the DPC branch, the main branch simultaneously perceives the panoramic boundary layout and the fine boundary details, efficiently preventing performance from deteriorating.
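The sub-pixel layer [31] used before the prediction layer is the standard channel-to-space rearrangement; a NumPy sketch (not PCNet's exact implementation) is:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) feature map into (C, H*r, W*r),
    i.e. the sub-pixel convolution upsampling of Shi et al. [31]:
    each group of r*r channels becomes an r-by-r spatial block."""
    c, h, w = x.shape
    assert c % (r * r) == 0
    out_c = c // (r * r)
    x = x.reshape(out_c, r, r, h, w)   # split channels into r-by-r offsets
    x = x.transpose(0, 3, 1, 4, 2)     # interleave: (C, H, r, W, r)
    return x.reshape(out_c, h * r, w * r)
```

Because the preceding convolution produces the extra r*r channels at low resolution, the network pays the cost of high-resolution prediction only at the very last layer, which is exactly why it saves memory for the 4-times-larger output.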

The outputs of the two branches are supervised in a similar manner. Given the association prediction $q$ and the corresponding label $l$ of the global input, superpixel training is performed by first computing the center of each superpixel $s$ from its surrounding pixels:

$$c_s = \frac{\sum_{p:\, s \in \mathcal{N}_p} q_s(p)\, l(p)}{\sum_{p:\, s \in \mathcal{N}_p} q_s(p)}, \tag{1}$$

and then reconstructing the property of each pixel $p$ from its superpixel neighbors:

$$l'(p) = \sum_{s \in \mathcal{N}_p} q_s(p)\, c_s, \tag{2}$$

where $\mathcal{N}_p$ is the set of adjacent superpixels of pixel $p$, and $q_s(p)$ indicates the probability of pixel $p$ being assigned to superpixel $s$. The network is thus optimized to minimize the distance between the ground-truth label and the reconstructed one. Following Yang et al. [39], the 2-dimensional spatial coordinate $x(p)$ is also considered; thus the full superpixel loss reads

$$\mathcal{L}_{sp} = \sum_p \mathrm{CE}\big(l(p), l'(p)\big) + \frac{m}{S}\,\big\lVert x(p) - x'(p) \big\rVert_2, \tag{3}$$

where $\mathrm{CE}$ stands for the cross-entropy loss and $x'(p)$ is the coordinate vector reconstructed from $x(p)$ according to Eqs. 1-2.
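The center/reconstruction supervision can be sketched as a simplified dense computation; this sketch ignores the 9-neighbor sparsity that PCNet/SCN exploit, and its names are ours:

```python
import numpy as np

def superpixel_recon_loss(q, labels, eps=1e-8):
    """Simplified dense sketch of the soft-clustering superpixel loss.

    q: (P, S) pixel-to-superpixel association, rows summing to 1;
    labels: (P, C) one-hot semantic labels.  In practice q is sparse
    (only the 9 surrounding grid cells are non-zero); a dense matrix
    is used here purely for clarity.
    """
    # soft superpixel centers from their member pixels (cf. Eq. 1)
    centers = (q.T @ labels) / (q.sum(axis=0)[:, None] + eps)
    # reconstruct each pixel's label from its superpixels (cf. Eq. 2)
    recon = q @ centers
    # cross entropy between ground truth and reconstruction (cf. Eq. 3)
    return -(labels * np.log(recon + eps)).sum(axis=1).mean()
```

With a hard, correct assignment the reconstruction equals the label map and the loss vanishes; soft or wrong assignments blur the reconstruction across region boundaries and are penalized.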

To restore more details from the low-resolution input, a super-resolution branch is further introduced to reconstruct the high-resolution version of the input, optimized by a masked reconstruction loss whose details are elaborated in subsection III-C. The full loss for the global branch is the superpixel loss plus the weighted super-resolution loss: $\mathcal{L}_{global} = \mathcal{L}_{sp} + \lambda_1 \mathcal{L}_{sr}$.

Fig. 4: Performance comparison of PCNet trained with the multi-class label versus the dynamic mask.

III-B Decoupled Patch Calibration Branch

Our PCNet produces high-resolution outputs from low-resolution inputs; such a paradigm allows us to train a network that tolerates higher-resolution images. However, the lost textural details and blurred boundary contexts in the down-sampled input make it difficult for the network to identify boundary pixels well. To remedy this weakness, we design the decoupled patch calibration (DPC) branch. Specifically, a local patch is first cropped from the original high-resolution data, anchored on a boundary pixel, and forwarded through the DPC branch to capture finer boundary details. Finally, we endow the main branch with this fine boundary perception by sharing the learned weights of DPC.

Given the output of DPC, the ground-truth semantic label is a straightforward choice of supervision, which attempts to benefit superpixel segmentation by classifying each pixel into its object category. This strategy succeeds on global images but does not work well on the local input, since the cropped patch usually covers only parts of objects in our practice, as shown in Fig. 3. The lack of global context makes the multi-class classification much more difficult. In fact, unlike semantic segmentation, which needs global context to identify complete objects, superpixel segmentation mainly concerns whether boundaries are accurately identified [39, 14, 33]. In other words, the superpixel network only needs to discriminate adjacent regions for boundary perception; it is unnecessary to identify the object categories of all pixels. Motivated by this consideration, we propose a dynamic guiding mask to supervise the local association prediction. Formally, we first sample the semantic label corresponding to the local patch and find the class with the longest boundary. The dynamic guiding mask is then defined as follows:

$$M(p) = \begin{cases} 1, & \text{if } l(p) \text{ belongs to the class with the longest boundary},\\ 0, & \text{otherwise}. \end{cases} \tag{4}$$

Require: the training set, network PCNet with its parameters, an optimizer, and the hyperparameters
Ensure: the optimized network parameters
while not converged do
   for each training sample do
      # prepare inputs; the local patch is anchored on a boundary pixel
      build the global sample, the local patch, and their labels
      generate the dynamic guiding mask from Eq. 4
      down-sample the global and local samples to form the network inputs
      # forward
      global outputs = PCNet(global input)
      local outputs = PCNet(local input)
      # loss computation
      compute the superpixel losses of both branches from Eq. 3
      compute the super-resolution losses of both branches from Eq. 5
      compute the local discrimination loss from Eqs. 6-8
      accumulate the above losses into the full loss
      # backward and update parameters
      full_loss.backward()
      optimizer.step()
   end for
end while
Algorithm 1: Training pseudocode of PCNet (PyTorch-like).

With our dynamic guiding mask, multi-class object recognition is degraded to salient-region detection, so the network only needs to discriminate the selected class from all others, which eases optimization. If the foreground category is well perceived, most boundaries of the current patch can be identified thanks to the choice of the longest-boundary class. The ignored boundaries can be highlighted in other iterations, since local patches are randomly cropped and the guiding mask is dynamically generated. In our practice, the dynamic guiding mask contributes much larger performance gains than the multi-class label, as shown in Fig. 4. It is worth noting that the dynamic guiding strategy is not suitable for the main branch: the boundaries of the global input are rich enough that too many of them would be ignored when applying Eq. 4. More importantly, the randomness is much smaller for the global input, so the longest-boundary class would likely be the same across iterations; as a result, the ignored boundaries could not be well captured during training.
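Our reading of the mask construction in Eq. 4 can be sketched as follows; the boundary-length estimate and helper names are assumptions of this sketch, not the paper's exact implementation:

```python
import numpy as np

def dynamic_guiding_mask(patch_label):
    """Binarize a semantic label patch against the class whose boundary
    inside the patch is longest (sketch of Eq. 4)."""
    # boundary pixels: label differs from the right or bottom neighbor
    diff = np.zeros_like(patch_label, dtype=bool)
    diff[:, :-1] |= patch_label[:, :-1] != patch_label[:, 1:]
    diff[:-1, :] |= patch_label[:-1, :] != patch_label[1:, :]
    classes, counts = np.unique(patch_label[diff], return_counts=True)
    if classes.size == 0:                # uniform patch: no boundary at all
        return np.zeros_like(patch_label, dtype=np.uint8)
    c_star = classes[counts.argmax()]    # class with the longest boundary
    return (patch_label == c_star).astype(np.uint8)
```

Because the patch is randomly re-cropped every iteration, the dominant class, and hence the binary target, changes from step to step, which is what lets the ignored boundaries be covered over the course of training.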

At the training stage, the dynamic binary mask replaces the semantic label in Eq. 3 to supervise the local output. Super-resolution is also employed as an auxiliary branch in DPC to help restore more textures. Thus, the full loss for the DPC branch combines the mask-supervised superpixel loss with the reconstruction loss between the local patch reconstruction and the high-resolution version of the patch.

III-C Training

The network is trained with a combination of the superpixel loss, the super-resolution loss, and our proposed local discrimination loss. The super-resolution loss is a masked reconstruction loss:

$$\mathcal{L}_{sr} = \big\lVert M_b \odot \big(I^{hr} - \hat{I}^{hr}\big) \big\rVert, \tag{5}$$

where $\hat{I}^{hr}$ is the reconstruction of the high-resolution image $I^{hr}$ and $M_b$ is a binary mask indicating the boundary pixels. To obtain $M_b$, we first extract the boundaries from the ground-truth semantic label and dilate them to include more boundary contexts.
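A minimal sketch of this masked reconstruction loss, assuming an L1 penalty and a square dilation kernel (both are our assumptions, since the exact choices were lost in extraction):

```python
import numpy as np

def masked_sr_loss(pred, target, label, k=5):
    """Masked reconstruction loss: boundaries extracted from the
    semantic label are dilated with a k-by-k kernel, and the
    reconstruction error is only counted inside the dilated mask."""
    edge = np.zeros_like(label, dtype=bool)
    edge[:, :-1] |= label[:, :-1] != label[:, 1:]
    edge[:-1, :] |= label[:-1, :] != label[1:, :]
    # naive k-by-k dilation via shifted ORs
    m = np.zeros_like(edge)
    r = k // 2
    H, W = edge.shape
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            m[max(0, dy):H + min(0, dy), max(0, dx):W + min(0, dx)] |= \
                edge[max(0, -dy):H + min(0, -dy), max(0, -dx):W + min(0, -dx)]
    return np.abs(m * (pred - target)).mean()
```

Pixels far from any semantic boundary contribute nothing, so the SR branch spends its capacity restoring exactly the boundary contexts that the superpixel head needs.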

Besides the superpixel loss and the super-resolution loss, we further design a local discrimination (LD) loss to highlight boundary pixels at the hidden-feature level. Specifically, let $E$ be the pixel-wise embedding map produced by the sub-pixel layer. Since the ground-truth label is available during training, we can sample a small local patch surrounding a boundary pixel from $E$. For simplicity, we only sample patches covering exactly two semantic regions, i.e., each patch yields a group of features from two categories. Intuitively, we want features of the same category to be closer, while embeddings from different classes should be far away from each other. To this end, we evenly partition the features of each category into two groups, summarize each group by its average representation and a compactness measure over its features, and minimize the intra-class dispersion while maximizing the inter-class dispersion. The local discrimination loss is then formulated by aggregating these per-patch terms over all sampled patches. With this loss, boundary pixels are distinguished beforehand at the hidden-feature level, which eases the subsequent superpixel module's identification of semantic boundaries. In our practice, applying the LD loss to the local embedding did not contribute further performance gains; therefore, the LD loss only acts on the global embedding.
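The intra/inter dispersion idea for one sampled patch can be sketched as follows; the ratio formulation and mean-squared dispersion are our assumptions, since the exact form of Eqs. 6-8 was lost in extraction:

```python
import numpy as np

def local_discrimination_loss(feats_a, feats_b, eps=1e-8):
    """LD-style loss for one boundary patch: feats_a / feats_b are
    (N, D) embeddings of the two classes around a boundary pixel.
    Compact classes and well-separated class means give a low loss."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    # intra-class dispersion: spread of each class around its own mean
    intra = ((feats_a - mu_a) ** 2).sum(axis=1).mean() + \
            ((feats_b - mu_b) ** 2).sum(axis=1).mean()
    # inter-class dispersion: squared distance between the class means
    inter = ((mu_a - mu_b) ** 2).sum()
    return intra / (inter + eps)
```

Minimizing the ratio simultaneously pulls same-class embeddings together (small numerator) and pushes the two class means apart (large denominator), which is the stated goal of the LD loss.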

Fig. 5: Two exemplar samples, (a) images and (b) annotations, from the collected Face-Human dataset; the top and bottom rows respectively exhibit a human sample and a face sample. The key identity regions are masked for privacy.
Fig. 6: Performance comparison on three high-resolution benchmarks (Face-Human, Mapillary-Vistas, and BIG). The top row shows the BR-BP curves of all models trained on Face-Human and evaluated on Face-Human, Mapillary-Vistas, and BIG from left to right; the bottom row analogously exhibits the comparison for all models trained on the Mapillary-Vistas dataset.
Fig. 7: Performance comparison on the popular BSDS500 and NYUv2 datasets.
Fig. 8: Qualitative results of four superpixel methods, shown alongside the inputs and GT labels: SLIC [1], SNIC [2], SCN [39], and our PCNet. The top to bottom rows exhibit results on the Face-Human, Mapillary-Vistas, and BIG datasets, respectively.
Fig. 9: Visual results of the division-and-conquer strategy with (a) 2×2 division and (b) 4×4 division. The dashed lines indicate the boundaries between different image patches.

Overall, our full training loss is the weighted sum of the global-branch loss, the DPC-branch loss, and the local discrimination loss, with trade-off hyperparameters balancing the terms. The training procedure of our PCNet is summarized in Algorithm 1. At inference, the super-resolution branch is discarded.

IV Experiments

We conduct extensive experiments on five datasets, including three high-resolution datasets and two common benchmarks with regular-size samples. We systematically compare with state-of-the-art and classic methods to evaluate the performance of our PCNet.

IV-A Datasets

To thoroughly validate the efficiency of our proposed method, we collect a very high-resolution dataset named Face-Human. The Face-Human dataset comprises 250 face and 250 human images in total, with image sizes spanning a wide range. All samples are carefully given pixel-wise annotations by experts: the face images are manually labeled with 21 classes, while the human samples are assigned 24 labels. Fig. 5 gives two examples of the images and the corresponding label format. Our collected Face-Human dataset covers a large resolution interval and is challenging enough to evaluate very high-resolution superpixel and image segmentation. In our experiments, 300 images are used for training, 80 for validation, and 120 for testing.

Furthermore, we also randomly sample a subset of Mapillary Vistas [27] containing 600 high-resolution samples to validate the superiority of our method on a public benchmark. Of these, 400 images are used for training, and 50 and 150 images serve as validation and test data, respectively. Besides the above two datasets, we also employ the BIG [8] dataset, which contains 150 high-resolution images annotated under the same guidelines as PASCAL VOC 2012 [10]. Since the BIG dataset is very small, we use it for testing only, to evaluate the generality of the trained models.

Besides the high-resolution datasets, we also conduct experiments on the widely used superpixel benchmarks BSDS500 [3] and NYUv2 [32] to further validate the effectiveness of our method. BSDS500 comprises 200 training, 100 validation, and 200 test images, each annotated with multiple semantic labels by different experts. For a fair comparison, we follow previous works [39, 14, 35] and treat each annotation as an individual sample, obtaining 1,087 training, 546 validation, and 1,063 test samples. NYUv2 is an indoor scene understanding dataset containing 1,449 images with object instance labels. To evaluate superpixel methods, Stutz et al. [33] removed the unlabelled regions near the boundary and collected a subset of 400 test images of size 608×448 for superpixel evaluation.

Fig. 10: Performance comparison of our PCNet and SCN [39] with the divide-and-conquer strategy.
Fig. 11: Ablation study and patch-size discussion for the local discrimination loss on the Face-Human dataset, where BS indicates the baseline method, SR stands for the super-resolution branch, and PCNet-P# means PCNet equipped with the LD loss of patch size #.

IV-B Implementation Details

During training, the network takes down-sampled inputs and predicts the association maps and reconstructions at 4 times the input resolution. We first resize the original high-resolution image and randomly crop it to form the global sample, while the local patch is obtained by directly cropping from the original image. Both the global and local samples are down-sampled 4 times to serve as the global and local inputs, respectively. The encoder comprises 5 blocks; each block except the first down-samples the resolution by a factor of 2 using a convolution with stride 2. The decoder first restores the resolution with deconvolution operations, and a sub-pixel convolution follows to output a feature map, which is fed into the super-resolution and superpixel prediction heads to reconstruct the image and give the association prediction. Note that the super-resolution branch is discarded during inference. Our local discrimination loss operates on the pixel embedding in the main branch to highlight boundary pixels, with the patch size set to 5. The two trade-off hyperparameters are set to 0.1 and 0.5, respectively. The networks are trained using the Adam optimizer [15] for 4K iterations with batch size 8, starting from an initial learning rate of 5e-5 that is divided by 10 every 2K iterations. At inference, a test image is first down-sampled 4 times and fed into the network to produce the association at the original resolution.
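The input pipeline for the DPC branch can be sketched as follows; the patch size, the nearest-neighbor decimation standing in for proper resizing, and all names are assumptions of this sketch, since the paper's exact sizes were lost in extraction:

```python
import numpy as np

def crop_local_patch(img_hr, label_hr, anchor, size=512, down=4):
    """Crop a full-resolution patch anchored on a boundary pixel and
    down-sample it `down` times to form the DPC input; the label is
    kept at full resolution to build the dynamic guiding mask."""
    H, W = label_hr.shape
    y = int(np.clip(anchor[0] - size // 2, 0, H - size))
    x = int(np.clip(anchor[1] - size // 2, 0, W - size))
    patch = img_hr[y:y + size, x:x + size]
    patch_label = label_hr[y:y + size, x:x + size]
    # nearest-neighbor decimation as a stand-in for bilinear resizing
    return patch[::down, ::down], patch_label
```

Anchoring on a boundary pixel guarantees every DPC patch actually contains a boundary to supervise, while the full-resolution label preserves the fine details the down-sampled global input has lost.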

Several methods are considered for comparison, including the classic methods SLIC [1], LSC [19], ERS [24], SEEDS [9], and SNIC [2], and the state-of-the-art deep method SCN [39]. We use the OpenCV implementations of SLIC, LSC, and SEEDS, and the authors' official implementations for the other methods. As for the excellent method SSN [14], since it can only process 1K-resolution images in our practice, we do not include it in the high-resolution comparison.

IV-C Comparison with the State of the Art

Accurately identifying boundaries is crucial for superpixel segmentation; therefore, we use boundary recall (BR) and boundary precision (BP) to evaluate performance [33].
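The two metrics can be sketched as follows; the tolerance window is standard in the benchmark of Stutz et al. [33], though the exact tolerance value here is an assumption:

```python
import numpy as np

def boundary_map(seg):
    """Pixels whose label differs from the right or bottom neighbor."""
    b = np.zeros(seg.shape, dtype=bool)
    b[:, :-1] |= seg[:, :-1] != seg[:, 1:]
    b[:-1, :] |= seg[:-1, :] != seg[1:, :]
    return b

def br_bp(sp_label, gt_label, tol=1):
    """Boundary recall: fraction of GT boundary pixels within `tol`
    pixels of a superpixel boundary.  Boundary precision: the reverse."""
    sp_b, gt_b = boundary_map(sp_label), boundary_map(gt_label)

    def dilate(b, r):  # tolerance window via shifted ORs
        out = b.copy()
        H, W = b.shape
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out[max(0, dy):H + min(0, dy), max(0, dx):W + min(0, dx)] |= \
                    b[max(0, -dy):H + min(0, -dy), max(0, -dx):W + min(0, -dx)]
        return out

    br = (gt_b & dilate(sp_b, tol)).sum() / max(gt_b.sum(), 1)
    bp = (sp_b & dilate(gt_b, tol)).sum() / max(sp_b.sum(), 1)
    return br, bp
```

High recall alone is cheap to obtain by over-producing boundaries, which is why the BR-BP curves in Fig. 6 report both sides at varying superpixel counts.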

To thoroughly evaluate the models, we train the networks on the Mapillary-Vistas or Face-Human dataset and test on all three benchmarks to assess the ability and generality of the models, especially the deep models SCN [39] and our proposed PCNet. Fig. 6 exhibits the BR-BP curves on the three benchmarks: the top row shows the comparison of all models trained on Face-Human and evaluated on the three test sets, while the bottom row analogously exhibits the comparison for models trained on Mapillary-Vistas. Benefiting from differentiable convolutional neural networks, the deep methods SCN and PCNet perform much better than the traditional methods SLIC [1], SNIC [2], SEEDS [9], and ERS [24]. Compared with the state-of-the-art SCN, our PCNet performs slightly better on Face-Human and comparably on Mapillary-Vistas. Regarding generality, our method performs better on the BIG dataset, as seen in Fig. 6 I(c) and II(c); when generalizing to Face-Human or Mapillary-Vistas, PCNet is comparable with SCN.

Besides the high-resolution datasets, we also conduct experiments on the two widely used regular-size datasets BSDS500 [3] and NYUv2 [32]. Following Yang et al. [39] and Jampani et al. [14], we train the model on BSDS500 and test on BSDS500 and NYUv2 to evaluate performance and generality. The results are reported in Fig. 7: the performance of our PCNet on BSDS500 is still comparable with SCN, while it performs worse than SSN. When generalized to NYUv2, our model achieves slightly better performance than the SOTA methods SCN and SSN, which validates the effectiveness of our PCNet. Moreover, our PCNet also achieves more outstanding inference efficiency due to the lower-resolution input.

Fig. 6 suggests that our PCNet achieves slightly better or comparable performance to SCN using only a down-sampled version of the test image. The lower-resolution input not only allows the network to process very high-resolution 5K images but also significantly accelerates inference due to fewer floating-point operations during the forward pass, as shown in Fig. 1. Fig. 8 visualizes the superpixel results on the Face-Human, Mapillary-Vistas, and BIG datasets from top to bottom, intuitively showing the superiority of our method.

IV-D Ablation Study

In this subsection, we first discuss the divide-and-conquer strategy to further clarify the motivation of our framework; then, the contribution of each component in PCNet is validated by a series of experiments.

IV-D1 Discussion of the divide-and-conquer strategy

Since superpixel generation is an over-segmentation, superpixels for high-resolution images could in principle be produced with a divide-and-conquer strategy. For example, we can divide the input image into four non-overlapping sub-images, run superpixel generation on each sub-image to obtain a superpixel label map, and then combine the four maps by re-indexing each map to obtain the superpixels for the complete image. Intuitively, this strategy is a straightforward choice for high-resolution superpixel segmentation. Indeed, at the beginning of our practice, we tried this divide-and-conquer solution using the SOTA SCN model, but the results were not satisfactory. The first weakness is that an obvious seam appears between patches, since the patch boundary blocks superpixel merging: if a compact region is split during patch division, the pixels falling in different patches can never form one superpixel, as shown in Fig. 9. Besides, the quantitative performance of divide-and-conquer is also unsatisfactory. As shown in Fig. 10, although the boundary recall is acceptable, the accuracy is too low. Therefore, we abandon this naïve strategy and propose our PCNet to generate superpixels in an end-to-end fashion, which also achieves much better performance.
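The re-indexing step described above can be sketched as follows; the per-patch segmenter here is a trivial grid labeler standing in for an actual model such as SCN. Because each quadrant receives a disjoint label range, no superpixel can ever span a patch boundary, which is exactly the seam artifact discussed above:

```python
import numpy as np

def trivial_superpixels(img, cell=4):
    """Stand-in segmenter: a regular grid of cell x cell superpixels.
    In practice this would be a superpixel model run on the patch."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    n_cols = -(-w // cell)  # ceil division
    return (ys // cell) * n_cols + (xs // cell)

def divide_and_conquer(img):
    """Split into four quadrants, segment each, then re-index so the
    four label maps occupy disjoint label ranges before stitching."""
    h, w = img.shape[:2]
    hh, hw = h // 2, w // 2
    out = np.empty((h, w), dtype=np.int64)
    offset = 0
    for ys, xs in [(slice(0, hh), slice(0, hw)), (slice(0, hh), slice(hw, w)),
                   (slice(hh, h), slice(0, hw)), (slice(hh, h), slice(hw, w))]:
        labels = trivial_superpixels(img[ys, xs])
        out[ys, xs] = labels + offset
        offset += labels.max() + 1  # disjoint ranges: no cross-patch merging
    return out

seg = divide_and_conquer(np.zeros((16, 16, 3)))
# Labels on either side of the vertical patch boundary never agree,
# even though the underlying image is perfectly uniform.
assert not np.any(seg[:, 7] == seg[:, 8])
```

The offset-based re-indexing is what makes stitching trivial, but it is also what hard-codes the seam: merging across patches would require an extra post-hoc merging pass, which the end-to-end PCNet avoids entirely.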

IV-D2 Validating the contribution of each component

To validate the contribution of each module in PCNet, we conduct an ablation study on the Face-Human dataset, covering the DPC branch, the super-resolution branch, and the local discrimination (LD) loss.

The results are reported in Fig. 11. The baseline (BS) method is the naïve strategy shown in Fig. 2(b), which only employs the global image and uses the loss function in Eq. 3 to train the network. As shown in Fig. 11, the baseline performs very poorly, since the very low-resolution input sacrifices too many textures, especially the boundary contexts. With our DPC branch, more detailed boundaries can be perceived, and the performance improves accordingly. The super-resolution branch further enables the network to restore more details from the low-resolution input; as Fig. 11 shows, the performance steps further when this branch is equipped. Fig. 4 also suggests that our dynamic guiding training outperforms the naïve choice, i.e., the ground-truth multi-class label, which validates the effectiveness of the dynamic guiding mechanism. When the local discrimination loss is further applied, we achieve our best results.

Besides validating each component of our method, we also discuss how the patch size in the LD loss affects performance. Our standard PCNet samples 5×5 patches in the LD loss; here we vary the patch size from 3 to 11 to study the difference. The results are reported in the right figure of Fig. 11, where PCNet-P# indicates PCNet with the corresponding patch size in the LD loss. From Fig. 11, the LD loss with the smaller patch size of 3 contributes fewer performance gains. When the patch size increases to 5, the performance improves. However, further enlarging the patch size to 7 or 11 does not bring additional gains but performs a little worse than PCNet-P5. We conjecture that a too-large patch introduces more pixels that are not close enough to the boundary, so the LD loss fails to focus on the boundary contexts. Therefore, we set the patch size to 5 in our PCNet.
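The conjecture about large patches can be made concrete with a small calculation. In a k×k patch centered on a pixel of a straight boundary (an idealized geometry we assume for illustration), the fraction of patch pixels lying within one pixel of the boundary shrinks as k grows:

```python
def near_boundary_fraction(k, tol=1):
    """Fraction of pixels in a k x k patch, centered on a pixel of a
    straight vertical boundary, whose horizontal distance to the
    boundary column is at most `tol` pixels."""
    r = k // 2
    near_cols = sum(1 for dx in range(-r, r + 1) if abs(dx) <= tol)
    return near_cols / k  # every row of the patch contributes identically

# The near-boundary fraction decays monotonically with patch size:
# k=3 -> 1.0, k=5 -> 0.6, k=7 -> ~0.43, k=11 -> ~0.27.
fractions = [near_boundary_fraction(k) for k in (3, 5, 7, 11)]
assert all(a > b for a, b in zip(fractions, fractions[1:]))
```

Under this idealization, a 5×5 patch keeps a clear majority of its pixels near the boundary, while 7×7 and 11×11 patches are dominated by interior pixels, consistent with the trend observed for PCNet-P7 and PCNet-P11.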

V Conclusion

This work proposes the first framework for ultra high-resolution superpixel segmentation. For memory and computation efficiency, we employ a low-to-high prediction strategy that produces a high-resolution association map from a low-resolution input; consequently, our PCNet can tolerate higher-resolution images and infer at a faster speed. To compensate for the boundary details lost in the down-sampled input, we design a decoupled patch calibration (DPC) branch to calibrate the boundary pixel assignment of the global output, and further propose a dynamic guiding mask to enforce the DPC branch to focus on perceiving boundaries. To accurately identify more boundary pixels, we propose a local discrimination loss to highlight the pixel embeddings around boundaries. Extensive experiments on two public benchmarks and our collected Face-Human dataset show that PCNet efficiently processes very high-resolution 5K images while maintaining performance comparable with the state-of-the-art SCN.


  • [1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurélien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell.(TPAMI), 34(11):2274–2282, 2012.
  • [2] Radhakrishna Achanta and Sabine Süsstrunk. Superpixels and polygons using simple non-iterative clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4895–4904, June 2017.
  • [3] Pablo Arbelaez, Michael Maire, Charless C. Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 33(5):898–916, 2011.
  • [4] Abhishek Badki, Alejandro J. Troccoli, Kihwan Kim, Jan Kautz, Pradeep Sen, and Orazio Gallo. Bi3d: Stereo depth estimation via binary classifications. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1597–1605. IEEE, June 2020.
  • [5] Aviram Bar-Haim and Lior Wolf. Scopeflow: Dynamic scene scoping for optical flow. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7995–8004. IEEE, June 2020.
  • [6] Wuyang Chen, Ziyu Jiang, Zhangyang Wang, Kexin Cui, and Xiaoning Qian. Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8924–8933, June 2019.
  • [7] Wuyang Chen, Ziyu Jiang, Zhangyang Wang, Kexin Cui, and Xiaoning Qian. Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8924–8933, 2019.
  • [8] Ho Kei Cheng, Jihoon Chung, Yu-Wing Tai, and Chi-Keung Tang. Cascadepsp: Toward class-agnostic and very high-resolution segmentation via global and local refinement. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, (CVPR), pages 8887–8896. IEEE, June 2020.
  • [9] Michael Van den Bergh, Xavier Boix, Gemma Roig, and Luc Van Gool. SEEDS: superpixels extracted via energy-driven sampling. Int. J. Comput. Vis. (IJCV), 111(3):298–314, 2015.
  • [10] Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 111(1):98–136, 2015.
  • [11] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient graph-based image segmentation. Int. J. Comput. Vis.(IJCV), 59(2):167–181, 2004.
  • [12] Rahul Garg, Neal Wadhwa, Sameer Ansari, and Jonathan T. Barron. Learning single camera depth estimation using dual-pixels. In IEEE International Conference on Computer Vision (ICCV), pages 7627–7636. IEEE, October 2019.
  • [13] Pasquale Iervolino, Raffaella Guida, Antonio Iodice, and Daniele Riccio. Flooding water depth estimation with high-resolution SAR. IEEE Trans. Geosci. Remote. Sens., 53(5):2295–2307, 2015.
  • [14] Varun Jampani, Deqing Sun, Ming-Yu Liu, Ming-Hsuan Yang, and Jan Kautz. Superpixel sampling networks. In European Conference on Computer Vision (ECCV), pages 363–380, Sep. 2018.
  • [15] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), May 2015.
  • [16] Deren Li, Guifeng Zhang, Zhaocong Wu, and Lina Yi. An edge embedded marker-based watershed algorithm for high spatial resolution remote sensing image segmentation. IEEE Trans. Image Process., 19(10):2781–2787, 2010.
  • [17] Hua Li, Sam Kwong, Chuanbo Chen, Yuheng Jia, and Runmin Cong. Superpixel segmentation based on square-wise asymmetric partition and structural approximation. IEEE Trans. Multim., 21(10):2625–2637, 2019.
  • [18] Ruoteng Li, Robby T. Tan, Loong Fah Cheong, Angelica I. Avilés-Rivero, Qingnan Fan, and Carola Schönlieb. Rainflow: Optical flow under rain streaks and rain veiling effect. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 7303–7312. IEEE, October 2019.
  • [19] Zhengqin Li and Jiansheng Chen. Superpixel segmentation using linear spectral clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1356–1363, 2015.
  • [20] Wei Liao, Karl Rohr, Chang-Ki Kang, Zang-Hee Cho, and Stefan Wörz. Automatic 3d segmentation and quantification of lenticulostriate arteries from high-resolution 7 tesla MRA images. IEEE Trans. Image Process., 25(1):400–413, 2016.
  • [21] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian D. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), pages 5168–5177, June 2017.
  • [22] Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian Curless, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. Real-time high-resolution background matting. CoRR, abs/2012.07810, 2020.
  • [23] Jiaying Liu, Wenhan Yang, Xiaoyan Sun, and Wenjun Zeng. Photo stylistic brush: Robust style transfer via superpixel-based bipartite graph. IEEE Trans. Multim., 20(7):1724–1737, 2018.
  • [24] Ming-Yu Liu, Oncel Tuzel, Srikumar Ramalingam, and Rama Chellappa. Entropy rate superpixel segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2097–2104, June 2011.
  • [25] Pengpeng Liu, Irwin King, Michael R. Lyu, and Jia Xu. Flow2stereo: Effective self-supervised learning of optical flow and stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6647–6656. IEEE, June 2020.
  • [26] Yong-Jin Liu, Cheng-Chi Yu, Minjing Yu, and Ying He. Manifold SLIC: A fast method to compute content-sensitive superpixels. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 651–659, 2016.
  • [27] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In IEEE International Conference on Computer Vision, ICCV, pages 5000–5009, October 2017.
  • [28] Xiao Pan, Yuanfeng Zhou, Zhonggui Chen, and Caiming Zhang. Texture relative superpixel generation with adaptive parameters. IEEE Trans. Multim., 21(8):1997–2011, 2019.
  • [29] Mausoom Sarkar, Milan Aggarwal, Arneh Jain, Hiresh Gupta, and Balaji Krishnamurthy. Document structure extraction using prior based high resolution hierarchical semantic segmentation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, European Conference on Computer Vision (ECCV), pages 649–666, August 2020.
  • [30] Cheng Shi and Chi-Man Pun. Multiscale superpixel-based hyperspectral image classification using recurrent neural networks with stacked autoencoders. IEEE Trans. Multim., 22(2):487–501, 2020.
  • [31] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), pages 1874–1883, June 2016.
  • [32] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision (ECCV), pages 746–760, Oct. 2012.
  • [33] David Stutz, Alexander Hermans, and Bastian Leibe. Superpixels: An evaluation of the state-of-the-art. Comput. Vis. Image Underst., 166:1–27, 2018.
  • [34] Zachary Teed and Jia Deng. RAFT: recurrent all-pairs field transforms for optical flow. In European Conference on Computer Vision (ECCV), pages 402–419, August 2020.
  • [35] Wei-Chih Tu, Ming-Yu Liu, Varun Jampani, Deqing Sun, Shao-Yi Chien, Ming-Hsuan Yang, and Jan Kautz. Learning superpixels with segmentation-aware affinity loss. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 568–576, June 2018.
  • [36] Hui Wang, Jianbing Shen, Junbo Yin, Xingping Dong, Hanqiu Sun, and Ling Shao. Adaptive nonlocal random walks for image superpixel segmentation. IEEE Trans. Circuits Syst. Video Technol., 30(3):822–834, 2020.
  • [37] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), March 2020.
  • [38] Yufeng Wang, Wenrui Ding, Baochang Zhang, Hongguang Li, and Shuo Liu. Superpixel labeling priors and MRF for aerial video segmentation. IEEE Trans. Circuits Syst. Video Technol., 30(8):2590–2603, 2020.
  • [39] Fengting Yang, Qian Sun, Hailin Jin, and Zihan Zhou. Superpixel segmentation with fully convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 13961–13970, June 2020.
  • [40] Haokui Zhang, Ying Li, Yuanzhouhan Cao, Yu Liu, Chunhua Shen, and Youliang Yan. Exploiting temporal consistency for real-time video depth estimation. In IEEE International Conference on Computer Vision (ICCV), pages 1725–1734. IEEE, October 2019.
  • [41] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. Icnet for real-time semantic segmentation on high-resolution images. In European Conference on Computer Vision (ECCV), pages 418–434, September 2018.
  • [42] Zhuo Zheng, Yanfei Zhong, Junjue Wang, and Ailong Ma. Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, (CVPR), pages 4095–4104, June 2020.
  • [43] Peng Zhou, Brian L. Price, Scott Cohen, Gregg Wilensky, and Larry S. Davis. Deepstrip: High-resolution boundary refinement. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10555–10564. IEEE, June 2020.
  • [44] Sihang Zhou, Dong Nie, Ehsan Adeli, Jianping Yin, Jun Lian, and Dinggang Shen. High-resolution encoder-decoder networks for low-contrast medical image segmentation. IEEE Trans. Image Process., 29:461–475, 2020.