## 1 Introduction

In many computer vision applications, such as image classification and segmentation, there is an important need for fully automated results.

To this end, a commonly employed strategy is to take inspiration from other data, in a supervised way when ground truth annotations are available. In this context, non-local methods have provided accurate results on various applications. In these methods, image regions are considered independently, generally using a square patch defined for each pixel, capturing a local pattern buades2005. For exemplar-based classification or segmentation, matching algorithms are usually used to search for similar patterns in the available data, and then to transfer the associated information, at the pixel or image scale after a global decision process. This search for similar patterns is generally performed for each image patch, and
fast matching algorithms have been developed,
*e.g.*, PatchMatch barnes2009,
TreeCANN olonetsky2012
or FLANN muja2014scalable,
to efficiently exploit a large number of example images in a reduced computational time.
To find similar content with such matching algorithms,
it is common to extract features from dense image patches or local interest points.
These features are usually designed to be robust to transformations such as scaling, viewpoint or illumination changes.
Standard descriptors include features based on gradient information such as SIFT lowe2004 or HoG dalal2005; lsvm-pami, or binary patterns, *e.g.*, BRIEF calonder2010brief.

In the last few years, convolutional neural networks have also been able to extract particularly relevant features, and have shown promising results in many applications related to image processing lecun2015deep. However, although some recent architectures such as U-Net may learn from relatively small datasets ronneberger2015u, these methods rely on costly supervised learning strategies, and often require very large annotated datasets to be trained efficiently. Besides, the results of these models are generally difficult to interpret, and can be very sensitive to small perturbations of their inputs moosavidezfooli2016universal. In many cases, such as medical imaging, these drawbacks can limit the potential of automated processing pipelines. Therefore, there is still an important need for methods that can perform without learning steps, as well as with limited training data and computing power.

In this context of fast image search and matching requirements, many works have first focused on hierarchical approaches using prior image over-segmentation
into regular grids, *e.g.*, lazebnik2006beyond.
To go further, other methods proposed to group similar pixels into connected components of homogeneous colors called superpixels,
drastically reducing the number of elements to process
while preserving contours and spatial structure stutz2018.
Therefore, a process applied at such over-segmentation scale can be close to the optimal pixel-wise result.
Several works have used superpixels in non-local frameworks, *e.g.*, gould2008; tighe2010, or in unsupervised learning-based superpixel matching approaches using random forests conze2017hierarchical; kanavati2017supervoxel. Nevertheless, the geometrical irregularity of such decompositions giraud2017_gef (*i.e.*, in terms of shape, adjacency or contour smoothness) can become an issue, since neighborhood information is crucial to compute accurate matches in terms of context.

Other approaches have attempted to use superpixel neighboring information, *e.g.*, pei2014; sawhney2014.
Among them, the SuperPatchMatch (SPM) framework giraud2017_spm partially addresses this issue
with a superpixel neighborhood structure called superpatch and a dedicated metric to compare two structures having different geometry and number of elements.
However, SPM remains sub-optimal
in terms of computational complexity and matching accuracy.
The framework only enables the search for matches at the same superpatch scale, and it only computes features within each superpixel region, poorly capturing contour information.
Several works indeed highlighted the need for accurate superpixel-wise features neubert2015benchmarking; zhang2018consistent; tilquin2018robust,
while most image descriptors are locally computed on a regular square neighborhood.

### Contributions

In this paper, we address the limitations of previous non-local methods that only focus on intra-region information within superpixels or superpixel neighborhoods giraud2017_spm, by introducing a novel dual superpixel neighborhood descriptor called dual superpatch (DSP), containing two independent descriptor sets (see Sec. 3).

First, intra-superpixel features capture color or texture information within cropped superpixel regions in order to avoid influence of pixel contours or inaccurate superpixel borders (see Sec. 3.1). Then, to capture structure information, for instance in terms of contour orientations, we extract a relatively regular grid of specific descriptors at superpixel interfaces (see Sec. 3.2). To efficiently compare such irregular dual descriptors, having different geometry and number of elements, we also propose new distances and optimizations, significantly reducing the computational complexity.

Combined with our dual superpatch, the SuperPatchMatch (SPM) search algorithm giraud2017_spm,
then denoted DSPM,
performs more relevant superpixel matching. To go further, we also extend DSPM to the search of matches at multiple scales (see Sec. 4),
and propose a framework to perform automatic labeling using exemplar-based images with ground truth labels.
In our framework, the comparison of DSP at different scales can be easily performed since we consider reduced spatial information, *i.e.*, sets of barycenter positions.
This way, we are able to match similar objects at different sizes in heterogeneous datasets.

Finally, to show the robustness of our framework, especially compared to giraud2017_spm, we consider several matching and exemplar-based labeling experiments on a standard face dataset huang2007db (see Sec. 5).

## 2 The SuperPatchMatch Framework

In this section, we first recall the SuperPatchMatch (SPM) framework initially introduced by giraud2017_spm, which constitutes the basis of our approach.

### 2.1 The SuperPatch Structure

To generalize standard patch-based frameworks to irregular image decompositions, giraud2017_spm proposed the superpatch structure. As for square patches of pixels defined around each pixel, a superpatch $\mathbf{A_i}$, associated with a superpixel $A_i$, contains the adjacent neighbors of a superpixel with respect to a fixed radius $R$. The proximity is simply computed according to the superpixel spatial barycenters such that:

$$\mathbf{A_i} = \{ A_{i'} \ \text{such that} \ \|c_i - c_{i'}\|_2 \leq R \}, \tag{1}$$

where $c_i$ and $c_{i'}$ respectively denote the spatial barycenters of superpixels $A_i$ and $A_{i'}$. This way, the superpatch structure only includes the most significant neighboring superpixels, using reduced spatial information. In Fig. 1, we show a superpatch example, defined for a superpixel $A_i$, containing its adjacent superpixels to provide a superpixel neighborhood descriptor.
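The neighborhood rule of Eq. (1) can be sketched in a few lines. This is only an illustration: it assumes barycenters are available as `(x, y)` tuples, and the function name `build_superpatch` and the sample coordinates are hypothetical, not taken from the SPM implementation.

```python
import math

def build_superpatch(barycenters, center, radius):
    """Return the indices of the superpixels whose barycenter lies within
    `radius` (Euclidean distance) of the barycenter of superpixel `center`."""
    cx, cy = barycenters[center]
    return [
        k for k, (x, y) in enumerate(barycenters)
        if math.hypot(x - cx, y - cy) <= radius
    ]

# Toy decomposition: five superpixel barycenters (hypothetical values).
barycenters = [(10, 10), (30, 12), (12, 35), (80, 80), (45, 40)]
print(build_superpatch(barycenters, center=0, radius=30))  # [0, 1, 2]
```

The central superpixel always belongs to its own superpatch, and only barycenter positions are needed, which is the reduced spatial information mentioned above.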

### 2.2 SuperPatch Comparison Distance

A comparison distance is also proposed in giraud2017_spm to measure the similarity between two superpatches. The main difficulty in designing such a distance is that the two structures are very likely to have different geometries and numbers of elements. Hence, there is no one-to-one association between superpixels, contrary to pixels within regular patches. To preserve the ability to compare patterns, the spatial layout must be taken into account, and giraud2017_spm proposed to simply consider the proximity of superpixel barycenters after registration on the central superpixels. In the following, we consider two superpatches and , for instance in two images and . A weight measures the relative displacement between the registered barycenters and of superpixels and , with respect to central superpixels and , and is a scaling parameter set to , with and the respective number of pixels and superpixels in the image. This way, a superpixel is only compared to the closest ones in .

The distance between two superpatches and is finally defined as:

(2) |

where also weights the influence of according to its spatial distance to such that , and is the distance between the superpixel features and . Note that any distance and feature can be considered in Eq. (2).

The comparison process between two superpatches having different numbers of elements and geometries is illustrated in Fig. 2. The weight in Eq. (2) weights the feature distance between a superpixel in and a superpixel in after registration on their central barycenters. In this figure, the weights corresponding to the bottom superpixel of are represented within each superpixel of .
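To make the one-to-many comparison concrete, here is a minimal sketch of a weighted quadratic superpatch distance in the spirit of Eq. (2). The Gaussian form of the weights, the parameter names `sigma_reg` and `sigma_ctr`, and the Euclidean feature distance are assumptions made for this illustration; the text only states that one weight depends on the displacement between registered barycenters and another on the distance to the central superpixel.

```python
import math

def spm_distance(patch_a, patch_b, sigma_reg, sigma_ctr):
    """Weighted one-to-many superpatch distance (illustrative sketch).

    Each patch is a list of (offset, feature) pairs, where `offset` is the
    barycenter position relative to the central superpixel (registration)
    and `feature` is a list of floats.
    Gaussian weights are an assumption made for this sketch.
    """
    num = den = 0.0
    for (xa, ya), fa in patch_a:
        # Influence of the superpixel according to its distance to the center.
        w_a = math.exp(-(xa ** 2 + ya ** 2) / sigma_ctr ** 2)
        for (xb, yb), fb in patch_b:
            w_b = math.exp(-(xb ** 2 + yb ** 2) / sigma_ctr ** 2)
            # Closer registered barycenters get a larger weight.
            w_reg = math.exp(-((xa - xb) ** 2 + (ya - yb) ** 2) / sigma_reg ** 2)
            w = w_reg * w_a * w_b
            num += w * math.dist(fa, fb)  # any feature distance could be used
            den += w
    return num / den if den > 0 else float("inf")
```

The double loop makes the quadratic cost explicit: every superpixel of one superpatch is visited for every superpixel of the other.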

### 2.3 SuperPatchMatch Correspondence Algorithm

Non-local methods quickly highlighted the need for fast patch-based matching algorithms to perform the search of correspondences within large areas, *e.g.*, a library of example images.
A significant breakthrough was obtained with PatchMatch (PM) barnes2009,
a fast partly random matching algorithm, providing for each patch of an image , a match in an image .
PM has very interesting properties such as no requirements for learning or preprocessing steps,
and its complexity only depends on the size of the image to process ,
making it possible to search for matches in a large set of example images.

The PM algorithm starts from random associations and iteratively refines them with a sequential processing of all image patches. The refinement process is mainly based on the fast propagation of good matches from spatially adjacent neighbors. Large regions are indeed very likely to correspond between images. According to the scan order, which is reversed at each iteration, the shifted correspondences of two spatially adjacent patches are considered as new candidates. Also, random patches are tested near the current best match in . Note that the PM process being partly random, several processes carried out in parallel can potentially provide different matches.
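The three ingredients described above (random initialization, propagation from the adjacent neighbor with a scan order reversed at each iteration, and random search around the current best match) can be sketched on a deliberately simplified case. The example below works on 1-D sequences of scalar features rather than 2-D image patches, and the function name `patchmatch_1d` and the halving search window are choices made for this illustration, not the reference PM implementation.

```python
import random

def patchmatch_1d(feats_a, feats_b, iterations=5, seed=0):
    """Approximate nearest-neighbor field from A to B on 1-D sequences."""
    rng = random.Random(seed)
    n, m = len(feats_a), len(feats_b)

    def cost(i, j):
        return abs(feats_a[i] - feats_b[j])

    nnf = [rng.randrange(m) for _ in range(n)]   # random initialization
    best = [cost(i, nnf[i]) for i in range(n)]

    def try_candidate(i, j):
        # Keep candidate j for element i only if it improves the match.
        if 0 <= j < m and cost(i, j) < best[i]:
            nnf[i], best[i] = j, cost(i, j)

    for it in range(iterations):
        step = 1 if it % 2 == 0 else -1          # scan order is reversed
        order = range(n) if step == 1 else range(n - 1, -1, -1)
        for i in order:
            prev = i - step
            if 0 <= prev < n:
                # Propagation: shifted match of the already visited neighbor.
                try_candidate(i, nnf[prev] + step)
            radius = m
            while radius >= 1:
                # Random search in a shrinking window around the best match.
                try_candidate(i, nnf[i] + rng.randint(-radius, radius))
                radius //= 2
    return nnf, best
```

Because candidates are only accepted when they lower the cost, the per-element matching cost never increases across iterations, and different seeds may return different matches, as noted above.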

The SuperPatchMatch (SPM) method generalizes PM to superpatches giraud2017_spm, to provide a fast matching algorithm for superpixels. The method mainly requires adapting the propagation of matches based on adjacent neighbors, since there is no longer a consistent geometry between adjacent superpixels, contrary to the standard pixel grid case. The propagation step of SPM is illustrated in Fig. 3 in the case of two images and . The adjacent neighbors are considered to lead to new matches while respecting the relative orientation between superpixels in and , to favor the matching of larger regions.

### Limitations

Nevertheless, the default SPM distance
in Eq. (2)
has a quadratic complexity, *i.e.*, each superpixel of a superpatch is compared to all the ones of the other superpatch, and may result in significant computational time. Besides, this framework only considers intra-superpixel descriptors.
Although they may be sufficient to capture information within regions,
the superpatch does not explicitly focus on capturing contours or gradient information.
This may be an issue since such information generally lies at the border of a superpixel,
and can be shared between two regions, reducing the relevance of the descriptor.
Finally, the SPM framework does not consider multi-scale information that would make it possible to capture objects of different sizes.

We address all these issues in the following sections with the proposed multi-scale superpatch matching framework that uses new dual superpixel descriptors.

## 3 Dual Superpixel Descriptors

In this section, we propose an approach to relevantly extract superpixel neighborhood descriptors.
We introduce a dual descriptor to efficiently capture around a superpixel,
both feature region content and
contour information, potentially lying at their borders.
Features are computed within superpixel regions and additional contour features are extracted at multiple interfaces between adjacent superpixels.
The proposed dual descriptor of superpixel neighborhood is called Dual SuperPatch (DSP), is denoted
for a superpixel ,
and is represented in Fig. 4
on the same decomposition example used in Fig. 1.
In the following, we present the extraction approach for
region (R), *i.e.*, intra-superpixel,
and interfaces (I) descriptors,
and we propose a general framework to compare different DSP.

### 3.1 Superpixel Region Descriptors

The superpatch formulation of giraud2017_spm considers features computed on each whole superpixel region contained in the neighborhood structure. However, superpixels tend to capture homogeneous regions, so pixels at thin contours can be arbitrarily associated with superpixels, leading to altered descriptors. A reduced block area or a spatial weighting from the superpixel barycenter is not suitable either, since these approaches do not guarantee that relevant information is extracted when superpixels have very irregular shapes neubert2015benchmarking.

In this work, we propose to consider the superpixel region information with an offset of pixels from its borders. This way, we take into account almost the whole region, while being robust to inaccurate superpixel borders or contour information, which is handled by dedicated interface descriptors within our DSP (see Sec. 3.2). In Fig. 4, the considered regions for superpixels are represented in blue. For each region , feature and spatial information are considered, so a dual superpatch contains a set of tuples .
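As a rough illustration of the border offset, the sketch below keeps only the pixels of a superpixel whose square window contains no other label. It assumes the decomposition is given as a label map (a list of rows of integer labels); the function name `cropped_region`, the square window, and the choice to ignore pixels outside the image are assumptions of this example.

```python
def cropped_region(labels, target, eps):
    """Return the pixels (row, col) of superpixel `target` whose
    (2 * eps + 1)-wide square window contains only the label `target`.
    Positions outside the image are ignored (assumed convention)."""
    h, w = len(labels), len(labels[0])
    kept = []
    for r in range(h):
        for c in range(w):
            if labels[r][c] != target:
                continue
            window = (
                labels[rr][cc]
                for rr in range(max(0, r - eps), min(h, r + eps + 1))
                for cc in range(max(0, c - eps), min(w, c + eps + 1))
            )
            if all(v == target for v in window):
                kept.append((r, c))
    return kept

# Two superpixels separated by a vertical border (toy label map).
labels = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
print(len(cropped_region(labels, target=0, eps=1)))  # 8
```

With `eps=0` the whole region is returned; increasing `eps` progressively discards the pixels closest to the border, where contour information or segmentation errors are most likely.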

To demonstrate the issue of considering the whole superpixel region to extract features, we consider two decompositions of images containing regions with different oriented textures (see Fig. 5). The superpatch radius is set to to only consider intra-region information, where HoG descriptors dalal2005 are computed. For each superpixel of the left image, we exhaustively compute its closest match in terms of superpixel content in the right image. If the textures are similar, we consider the match accurate (1), otherwise inaccurate (0). We report in Tab. 1 the average matching accuracy over all superpixels of the left image, according to different values and several levels of Gaussian noise applied to both images after decomposition. We perform this evaluation using the decompositions obtained with a texture-aware method giraud2019_nnsc, and also the ground truth ones, perfectly capturing texture changes. This experiment highlights the need to restrict the area used to extract superpixel information, since superpixel decompositions may not be perfectly accurate. Moreover, even on perfectly fitting decompositions, the inaccurate gradient information lying on superpixel borders is captured with and may impact the results with a noise variance above 50, while we do not take into account texture in other superpixels with .

Table 1: average matching accuracy on superpixel decompositions and ground truth decompositions, for noise variances 0, 50, 100 and 125.

Fig. 6: (a) Matching using only color information at intra-superpixel regions (). (b) Matching using only contour features at superpixel interfaces (), *i.e.*, approximately capturing the first adjacency ring.

Fast comparison distance. The comparison between two sets of superpixel region descriptors can be performed in a more computationally efficient manner than with Eq. (2). We propose to only select one superpixel to compare for each superpixel . To do so, the superpixel barycenter is first registered by the displacement between central superpixels and , and we denote this new position by which is computed such that . In Fig. 2, these correspond to the red superpixel barycenters. Then, we project the registered barycenters on the decomposition of the image from where is extracted. In Fig. 2, the black superpixel containing a red dot would be selected to compare to superpixel of region . This corresponding superpixel containing in the compared image, is denoted , and its associated intra-region is denoted . This way, we significantly reduce the distance complexity, while potentially increasing the comparison accuracy (see Sec. 5). The comparison between two region descriptors and is defined using barycenter projections such that:

(3) |

Note that barycenters falling outside the image limits are projected to select the closest superpixel on the image boundary.

A similar projected distance was suggested in giraud2017_spm, in a non-symmetric formulation. In our dual superpatch comparison model, we consider a symmetric projected distance on intra-region descriptors defined as:

(4) |
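The projection-based comparison of Eqs. (3) and (4) can be sketched as follows. Several choices here are assumptions made for illustration: barycenters are `(row, col)` tuples, the per-superpixel feature distances are simply averaged, out-of-image positions are handled by clamping the coordinates to the image boundary, and the symmetric distance is taken as the mean of the two directions. All function and parameter names are hypothetical.

```python
import math

def projected_distance(patch_a, center_a, center_b, labels_b, feats_b):
    """One-directional projected distance from a superpatch of A to image B.

    patch_a:  list of (barycenter, feature) for the superpixels of the patch
    center_a, center_b: barycenters (row, col) of the two central superpixels
    labels_b: label map of image B (list of rows of superpixel indices)
    feats_b:  feature of every superpixel of B, indexed by label
    """
    h, w = len(labels_b), len(labels_b[0])
    total = 0.0
    for (r, c), feat in patch_a:
        # Register the barycenter by the displacement between the centers.
        pr = r - center_a[0] + center_b[0]
        pc = c - center_a[1] + center_b[1]
        # Clamp positions that fall outside the image onto its boundary.
        pr = min(max(int(round(pr)), 0), h - 1)
        pc = min(max(int(round(pc)), 0), w - 1)
        # Only the superpixel of B containing the projection is compared.
        total += math.dist(feat, feats_b[labels_b[pr][pc]])
    return total / len(patch_a)

def symmetric_projected_distance(patch_a, center_a, labels_a, feats_a,
                                 patch_b, center_b, labels_b, feats_b):
    """Mean of the two one-directional projected distances (assumed form)."""
    d_ab = projected_distance(patch_a, center_a, center_b, labels_b, feats_b)
    d_ba = projected_distance(patch_b, center_b, center_a, labels_a, feats_a)
    return 0.5 * (d_ab + d_ba)
```

Each superpixel triggers a single label-map lookup instead of a loop over the other superpatch, so the cost is linear in the number of superpixels of the superpatch.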

### 3.2 Superpixel Interface Descriptors

To efficiently capture image contour information,
we propose to also consider specific descriptors at superpixel interfaces.
These can be easily extracted with a low complexity by considering the presence of at least three superpixels in a pixels neighborhood.
To avoid over-detection, a larger area can be neglected after selection of an interface point.
This way, we directly obtain a relatively regular grid of potential interest points in terms of contours,
without introducing further scaling or thresholding parameters.
In Fig. 4, these interface regions denoted are represented as green squares.
On these regions, specific contour descriptors can be computed,
*e.g.*, HoG
dalal2005.
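The interface detection described above can be sketched on a label map. The window half-size `half_window`, the spacing `min_spacing`, the use of a Chebyshev distance for the neglected area, and the greedy selection in scan order are all assumptions of this example, as is the function name `detect_interfaces`.

```python
def detect_interfaces(labels, half_window=1, min_spacing=3):
    """Return (row, col) points whose square neighborhood contains at least
    three distinct superpixel labels, keeping selected points at least
    `min_spacing` pixels apart (Chebyshev distance), in scan order."""
    h, w = len(labels), len(labels[0])
    points = []
    for r in range(h):
        for c in range(w):
            seen = {
                labels[rr][cc]
                for rr in range(max(0, r - half_window), min(h, r + half_window + 1))
                for cc in range(max(0, c - half_window), min(w, c + half_window + 1))
            }
            if len(seen) < 3:
                continue
            # Skip candidates lying in the area of an already selected point.
            if all(max(abs(r - pr), abs(c - pc)) >= min_spacing
                   for pr, pc in points):
                points.append((r, c))
    return points

# Three superpixels meeting around rows 2-3 and columns 2-3 (toy label map).
labels = [[0, 0, 0, 1, 1, 1]] * 3 + [[2, 2, 2, 2, 2, 2]] * 3
print(detect_interfaces(labels))  # [(2, 2)]
```

A contour descriptor such as HoG would then be computed on a local window centered on each returned point.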

Acceleration of quadratic distance. Since interface regions do not provide a dense decomposition of the image domain, the distance Eq. (3) using projections cannot be used to quickly compare two sets of interface descriptors. A quadratic one-to-many distance such as Eq. (2) could be used, but at a significant computational cost. To address this issue, we propose a one-to-one association for each interface descriptor . Each is only compared to the spatially closest one in the other dual superpatch. This way, the framework only requires exhaustive spatial distances between interface barycenters. The distance is computed as for Eq. (3), where , the selected interface descriptor in for is defined as . Finally, as for Eq. (4), the distance is also computed from to to obtain a symmetric distance.
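The closest-interface association can be sketched as below. The example assumes each interface is stored as an `(offset, feature)` pair, with the offset already expressed relative to the central superpixel barycenter, averages the feature distances, and symmetrizes by taking the mean of both directions; these choices and the function names are illustrative only.

```python
import math

def nearest_interface_distance(ifaces_a, ifaces_b):
    """One-directional distance: each interface descriptor of A is compared
    only to the spatially closest interface of B.

    Each interface is (offset, feature), with `offset` the position relative
    to the central superpixel barycenter.
    """
    if not ifaces_a or not ifaces_b:
        return float("inf")  # convention chosen for this sketch
    total = 0.0
    for pos_a, feat_a in ifaces_a:
        # Exhaustive spatial search, but a single feature comparison.
        _, feat_b = min(ifaces_b, key=lambda item: math.dist(pos_a, item[0]))
        total += math.dist(feat_a, feat_b)
    return total / len(ifaces_a)

def symmetric_interface_distance(ifaces_a, ifaces_b):
    """Mean of the two one-directional distances (assumed form)."""
    return 0.5 * (nearest_interface_distance(ifaces_a, ifaces_b)
                  + nearest_interface_distance(ifaces_b, ifaces_a))
```

Only cheap 2-D position distances are computed exhaustively, while the potentially high-dimensional feature distance is evaluated once per interface.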

### 3.3 General Dual SuperPatch Comparison Framework

Our dual superpatch (DSP) , for a superpixel , is described by a set of intra-superpixel region and superpixel interface descriptors such that . Note that and can have a different number of elements. To relevantly measure the similarity of two DSP and having different geometries and numbers of elements, we propose the following general DSP comparison distance:

(5) |

with the fast distance on descriptors, using barycenter projections Eq. (4) for intra-region , and selection of the closest descriptor for interfaces , and a setting parameter. Note that any feature can be considered in or . Therefore, can be tuned intuitively when using the same or normalized descriptors for both and .
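One simple way to picture the role of the trade-off parameter is a convex combination of the two distances. The form below, the name `alpha`, and the convention that `alpha = 1` keeps only the intra-region term are assumptions of this sketch and may differ from the exact expression of Eq. (5).

```python
def dsp_distance(d_regions, d_interfaces, alpha=0.5):
    """Blend the intra-region and interface distances.

    alpha = 1 uses only regions, alpha = 0 only interfaces
    (assumed convention for this sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * d_regions + (1.0 - alpha) * d_interfaces
```

Such a blend is only meaningful when both distances live on comparable ranges, which is why using the same or normalized descriptors on both sides makes the parameter easier to tune.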

In Fig. 6, we show matching results obtained with our generalized model using average RGB color as intra-region features (Fig. 6(a), ) and HoG descriptors as interface features (Fig. 6(b), ). Hence, we highlight the general aspect of our framework, which makes it possible to focus either on intra-region (Fig. 6(a)) or interface information (Fig. 6(b)). In Sec. 5, we further demonstrate the performance obtained using these complementary descriptors.

## 4 Multi-Scale Dual SuperPatchMatch

In this section, we propose to extend the SPM framework with our dual descriptor (DSPM), to perform the search of DSP at multiple scales. We first show how to compare two DSP of different sizes, then we propose a multi-scale fusion strategy.

### 4.1 Dual SuperPatch Rescale

In Sec. 3, we showed how to compare two DSP extracted with the same radius size . Nevertheless, the proposed distance Eq. (5) can easily adapt to DSP of different sizes, since the spatial information is only measured by barycenter positions denoted with . We consider two DSP and , with different DSP extraction radii and in Eq. (1). To compare them, all spatial information contained in can be adjusted according to the ratio between the radii, such that:

(6) |

This way,
similar DSP can be searched at various scales, *e.g.*, in example images.
Note that features and remain unchanged by this scaling transformation.
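The rescaling step can be sketched as a multiplication of barycenter positions by the ratio of the two radii. The example assumes positions are stored as offsets relative to the central superpixel barycenter, so that scaling does not shift the structure; the function name `rescale_offsets` and the direction of the ratio are choices of this illustration.

```python
def rescale_offsets(offsets, radius_src, radius_dst):
    """Scale the barycenter offsets (relative to the central superpixel) of a
    DSP extracted with `radius_src` so that it becomes comparable to a DSP
    extracted with `radius_dst`. Features are left untouched."""
    if radius_src <= 0 or radius_dst <= 0:
        raise ValueError("radii must be positive")
    ratio = radius_dst / radius_src
    return [(x * ratio, y * ratio) for x, y in offsets]

# A DSP extracted with radius 100, rescaled to be compared at radius 50.
print(rescale_offsets([(10, -20), (0, 5)], radius_src=100, radius_dst=50))
# [(5.0, -10.0), (0.0, 2.5)]
```

Since only a short list of positions is modified, comparing DSP across scales adds almost no cost to the matching.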

### 4.2 Multi-Scale Exemplar-based Framework

Most non-local methods perform a search for similar content in a heterogeneous dataset, with no prior information on the targeted object size. In giraud2017_spm, no multi-scale strategy is proposed since it considers an exemplar-based labeling experiment on linearly registered images huang2007db.

Here, we introduce a generalized multi-scale exemplar-based framework that searches for similar DSP of different radii, in order to capture objects of different sizes. The proposed DSP structure indeed enables a simple automatic rescale, presented in Sec. 4.1. Hence, multiple DSP sizes can be considered in a set of example images . A set is considered for setting the DSP radius in .

In giraud2017_spm, a supervised labeling framework based on the non-local means algorithm buades2005 was introduced to merge the information of multiple superpatch matches computed in a library of example images for an image to process. A label map is computed for a superpixel , for all different labels , such that:

(7) |

where is the set of matches for computed at scale and having a ground truth label , and is the normalization factor , with a weight depending on the DSP similarity giraud2017_spm. The final label of a superpixel is computed as .
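The fusion of Eq. (7) amounts to a weighted vote over all matches found at all scales, followed by an argmax. In the sketch below, the exponential weight, its `bandwidth` parameter, and the data layout `(scale, distance, label)` are assumptions made for illustration; the text only states that the weight depends on the DSP similarity.

```python
import math
from collections import defaultdict

def fuse_labels(matches, bandwidth=1.0):
    """Fuse the labels of the matches found for one superpixel.

    matches: iterable of (scale, distance, label), gathered over all scales
             and example images.
    Returns (best_label, normalized label map).
    """
    votes = defaultdict(float)
    for _scale, distance, label in matches:
        # Exponential weight (assumed): smaller distance, larger vote.
        votes[label] += math.exp(-distance / bandwidth)
    total = sum(votes.values())
    if total == 0.0:
        raise ValueError("no usable match")
    label_map = {label: v / total for label, v in votes.items()}
    return max(label_map, key=label_map.get), label_map

matches = [(50, 0.1, "hair"), (100, 0.2, "hair"),
           (50, 0.05, "face"), (25, 3.0, "background")]
best, label_map = fuse_labels(matches)
print(best)  # hair
```

Matches from different scales contribute to the same label map, so a label supported at several radii can win over a single slightly better match at one radius.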

In Fig. 7, we represent matching results obtained without (gray lines) and with (green lines) our multi-scale strategy. We can see that the best match between searches at scales makes it possible to catch the larger flower with similar colors, instead of the one at the same scale.

## 5 Experimental Validations

In this section, we present several quantitative experiments to demonstrate the interest of the proposed DSPM framework. We first validate the behavior of our model on standard images with respect to the method parameters. Then, we propose larger scale segmentation experiments on a standard face dataset.

### 5.1 Parameter Settings

The proposed method was implemented with MATLAB using C-MEX code on a standard Linux computer with cores at GHz and GB of RAM. The number of DSPM iterations is set to 5, and we use a -norm as distance between features as in giraud2017_spm. To avoid over-detection, interfaces are detected at least every pixels. Default parameters are set such that , the border offset for region features , , the trade-off parameter between intra-region and interface distances in Eq. (5), and superpatch radius . The considered descriptors in and are reported according to the application.

### 5.2 Influence of Parameters

To demonstrate the interest of each contribution, we consider a matching experiment on standard images Baboon, Barbara, House, Lena and Peppers, each decomposed with two superpixel methods: SLIC achanta2012 and SNIC achanta2017snic. For each superpixel in a given decomposition, we compute the closest DSP match in the other one. A relevant descriptor should indeed be robust to variations in the segmentations. In terms of features, we compute normalized cumulative RGB histograms with bins per channel on intra-regions , and HoG dalal2005 on a local pixels window for interface region descriptors .


We evaluate the matching accuracy by the average distance between the superpixel barycenters and those of their matches in the other decomposition.
In Fig. 8(a) and Fig. 8(b),
we respectively report the average distance with respect to the radius parameter
and with respect to the parameter for .
On the one hand, Fig. 8(a) shows that the accuracy logically increases with the superpatch radius,
and with each contribution, *i.e.*, symmetrical projected distance (Eq. (4)), offset to border (), and interface descriptors.
On the other hand, Fig. 8(b) illustrates the interest of using interface descriptors in Eq. (5) in conjunction with cropped regions for , and that
a balanced trade-off parameter provides the best matching accuracy.
In Fig. 9, we also show an example of matching result for DSPM with default parameters compared to SPM.

Fig. 9: (a) SLIC, (b) SNIC, (c) SPM, (d) DSPM.

Radius | 50 | 25 | 33 | 75 | 100 | argmax | argmax |
---|---|---|---|---|---|---|---|
w/o (6) | | | | | | | |
w/ (6) | | | | | | | |

### 5.3 Segmentation and Labeling Experiments

Validation framework. We evaluate the proposed DSPM approach on the same face labeling experiment as giraud2017_spm. The considered Labeled Faces in the Wild (LFW) dataset huang2007db contains 1500 training and 927 testing images of pixels, linearly registered with huang2007, and already decomposed into approximately 250 superpixels. LFW contains decompositions and labeling ground truths, so comparisons with state-of-the-art methods do not depend on the superpixel decompositions. Note that to fairly compare with giraud2017_spm, we use the same HoG implementation on a regular grid lsvm-pami and compute 50 DSP matches with 50 independent DSPM processes for each superpixel, merged in Eq. (7).

Radius | 25 | 33 | 50 | 75 | 100 | Argmax | Average |
---|---|---|---|---|---|---|---|
features.c | | | | | | | |
w/o (6) | | | | | | | |
w/ (6) | | | | | | | |
imgradient | | | | | | | |
w/ (6) | 94.42 | 94.33 | | | | | |
w/o rescale | 93.21 | 00.00 | | | | | |

Multi-scale validation. In this experiment, the goal is to validate the interest of our proposed multi-scale matching strategy introduced in Sec. 4. To this end, we have manually applied random scaling transformations to the registered training images. Each image and its corresponding decomposition has been either downsampled or upsampled randomly by a factor of or (with no interpolation). As a result, faces depicted in the images can appear up to twice as big or small compared to their initial scales.

Hence, the dataset contains face patterns at different scales that would not be efficiently captured using the same DSP radius in and . We apply DSPM with = and . In Tab. 2, we report the labeling accuracy for each radius and for the multi-scale label fusion proposed in Sec. 4.2.
From these results, when , we can observe that the performance for smaller or larger radii is always better after applying the rescale strategy, *i.e.*, with Eq. (6).
The multi-scale fusion averaging the result over the different radius sizes performs better than the default and than most single scales.
Moreover, by considering only larger scales, *i.e.*, , thus more precise DSP comparisons, we obtain the highest labeling accuracy, demonstrating the ability of the framework to merge information from multi-scale matching.
In future works, we plan to apply this approach on non-registered datasets where objects might naturally appear at different scales.

Comparison to state-of-the-art methods. In Fig. 10, we first compare the influence of the number of approximate nearest neighbors (ANN) selected for each superpixel and then merged in the label fusion process Eq. (7), for the proposed DSPM method and SPM giraud2017_spm using Eq. (3). For all ANN numbers, DSPM provides the best results, and already reaches of accuracy with only ANN. Note that for both methods, a plateau is reached around ANN.

Method | Superpixel accuracy | Pixel accuracy |
---|---|---|
Spatial CRF kae2013 | not reported | |
CRBM kae2013 | not reported | |
GLOC kae2013 | not reported | |
DCNN liu2015 | not reported | |
SPM giraud2017_spm w/ (2) | | |
SPM giraud2017_spm w/ (3) | | |
DSPM | | |

In Tab. 3, we also compare the performance of the proposed DSPM method with the results of state-of-the-art ones, mostly based on supervised (deep) learning approaches.

In kae2013, several approaches are used to label the LFW dataset such as a spatial conditional random field (CRF) and a conditional restricted Boltzmann machine (CRBM). The GLOC (GLObal and LOCal) method kae2013 is also proposed to jointly use both CRF and CRBM approaches to introduce global shape priors in the training process. Finally, in liu2015, a deep convolutional neural network (DCNN) is proposed and dedicated to the face labeling application. Note that the results in Tab. 3 are the ones reported by the authors. For SPM, the results correspond to the initial framework using costly quadratic comparisons with Eq. (2), and the results reported by the authors using non-symmetric projected distances with Eq. (3). DSPM reports the best labeling accuracy among the compared methods at both superpixel and pixel-wise levels. Labeling examples compared to SPM are also represented in Fig. 11. The proposed DSP makes it possible to relevantly capture the context of a superpixel neighborhood in terms of texture and structure. Moreover, without further optimizations on non fully multi-threaded code, DSPM performs in less than s per subject, against s for SPM using Eq. (2). Note that compared state-of-the-art approaches may provide faster computational times but at the expense of previous costly learning-based processes.

Our method is particularly interesting due to its simplicity of use, parameter tuning,
and interpretability compared to learning-based approaches, while providing more accurate results than SPM.
Besides, any feature can be directly used in the method, even more advanced descriptors, *e.g.*, zhang2018consistent; tilquin2018robust, possibly based on previously trained deep learning architectures.

Fig. 11 columns: Image, Superpixels, Ground truth, SPM, DSPM.

## 6 Conclusion

In this work, we addressed some important limitations of existing superpixel matching frameworks, in terms of robustness and computational complexity. We introduced the dual superpatch, a new superpixel neighborhood descriptor containing both intra-region and interface information, which are respectively robust to the inaccuracy of superpixel borders and able to capture contour structures. We also proposed optimized distances and a multi-scale framework to search for similar dual superpatches in an image dataset. Our validations showed an accuracy improvement with our method on matching and exemplar-based labeling applications. The relevance of the proposed dual approach should benefit all superpixel-based non-local approaches, and future works will focus on applying the method to heterogeneous computer vision and medical datasets, and tackling other applications such as superpixel-based image editing.
