DASC: Robust Dense Descriptor for Multi-modal and Multi-spectral Correspondence Estimation

Establishing dense correspondences between multiple images is a fundamental task in many applications. However, finding a reliable correspondence in multi-modal or multi-spectral images still remains unsolved due to their challenging photometric and geometric variations. In this paper, we propose a novel dense descriptor, called dense adaptive self-correlation (DASC), to estimate multi-modal and multi-spectral dense correspondences. Based on the observation that self-similarity existing within images is robust to imaging modality variations, we define the descriptor as a series of adaptive self-correlation similarity measures between patches sampled by a randomized receptive field pooling, in which the sampling pattern is obtained using discriminative learning. The computational redundancy of dense descriptors is dramatically reduced by applying fast edge-aware filtering. Furthermore, in order to address geometric variations including scale and rotation, we propose a geometry-invariant DASC (GI-DASC) descriptor that effectively leverages the DASC through a superpixel-based representation. For a quantitative evaluation of the GI-DASC, we build a novel multi-modal benchmark under varying photometric and geometric conditions. Experimental results demonstrate the outstanding performance of the DASC and GI-DASC in many cases of multi-modal and multi-spectral dense correspondences.





1 Introduction

Recently, many computer vision and computational photography problems have been reformulated to overcome their inherent limitations by leveraging multi-modal and multi-spectral images. Typical examples of other imaging modalities include near-infrared (NIR) images [1, 2] and dark flash images [3]. More broadly, flash and no-flash images [4], blurred images [5, 6], and images taken under different radiometric conditions [7] can also be considered as multi-modal [8].

Establishing dense visual correspondences for multi-modal and multi-spectral images is a key enabler for realizing such tasks. In general, the performance of correspondence algorithms relies primarily on two components: the appearance descriptor and the optimization scheme. Traditional dense correspondence methods for estimating depth [9] or optical flow [10, 11] fields, in which input images are acquired under similar imaging conditions, have been dramatically advanced in recent studies. To define a matching fidelity term, they typically assume that multiple images share a similar visual pattern, e.g., color, gradient, and structural similarity. However, when it comes to multi-spectral and multi-modal images, such properties do not hold as shown in Fig. 1, and thus conventional descriptors or similarity measures often fail to capture reliable matching evidence.

Fig. 1: Some challenging multi-modal and multi-spectral images such as (from top to bottom) RGB-NIR images, flash-noflash images, two images with different exposures, and blur-sharp images. Panels: (a) Image 1, (b) Image 2, (c) DAISY [12], (d) DASC. The images in the third and fourth columns are the results obtained by warping the images in the second column to the images in the first column with dense correspondence maps estimated by using DAISY [12] and our DASC descriptor, respectively.

This leads to a poor matching quality as shown in Fig. 2. Furthermore, substantial geometric variations, which often appear in images captured under wide-baseline conditions, make the matching task even more challenging. Although employing a powerful optimization technique could help estimate a reliable solution with a spatial context [13, 14, 15], an optimizer itself cannot address an inherent limitation without suitable matching descriptors [16].

Our method starts from an observation that a local internal layout of self-similarities is less sensitive to photometric distortions, even when an intensity distribution of an anatomical structure is not maintained across different imaging modalities [17]. That is, the local self-similarity (LSS) descriptor would be beneficial to overcoming inherent limitations of existing descriptors in establishing correspondences between multi-modal or multi-spectral images. Several approaches based on the LSS have been presented for multi-modal and multi-spectral image registration [18, 19], but they do not scale well to estimating dense correspondences for multi-modal and multi-spectral images, and their matching performance is still poor.

In this paper, we propose a novel local descriptor, called dense adaptive self-correlation (DASC), designed for establishing dense multi-modal and multi-spectral correspondences. It is defined with a series of patch-wise similarities within a local support window. The similarity is computed with an adaptive self-correlation measure, which encodes an intrinsic structure while providing robustness against modality variations. To further improve the matching quality and runtime efficiency, we propose a randomized receptive field pooling strategy using sampling patterns that select two patches within the local support window. Linear discriminative learning is employed to obtain an optimal sampling pattern. The computational redundancy that arises when computing densely sampled descriptors over an entire image is dramatically reduced by applying fast edge-aware filtering [20].

Furthermore, in order to address geometric variation problems such as scale and rotation, we propose the geometry-invariant DASC (GI-DASC) descriptor that leverages the efficiency and effectiveness of the DASC through a superpixel-based representation. Specifically, we infer an initial geometric field with the corresponding scale and rotation of reliable sparse key-points obtained using weighted maximal self-dissimilarity (WMSD), and then propagate the initial geometric field on a superpixel graph. After transforming the sampling patterns according to the geometric fields on each superpixel, the DASC is efficiently computed with the transformed sampling patterns on each superpixel-extended subimage. Compared to conventional geometry-invariant methods for dense correspondence [21, 22], which have focused on employing powerful optimization schemes, the GI-DASC provides geometric and photometric robustness in the descriptor itself.

Experimental results show that the DASC outperforms conventional area-based and feature-based approaches on various benchmarks including modality variations: (1) the Middlebury stereo benchmark containing illumination and exposure variations [23], (2) a multi-modal and multi-spectral dataset including RGB-NIR images [8, 1], different exposure images [8, 7], flash-noflash images [7], and blurry images [5, 6], and (3) the MPI optical flow benchmark containing specular reflections, motion blur, and defocus blur [10]. We also show that the GI-DASC outperforms existing geometry-invariant methods on a novel multi-modal benchmark.

1.1 Contribution

Fig. 2: Examples of matching cost comparison. Multi-spectral RGB and NIR images have locally non-linear deformation as depicted in A, B, and C. Matching costs computed with different descriptors along the scan-lines of A, B, and C are plotted in (c)-(e). Unlike conventional descriptors, the proposed DASC descriptor yields a reliable global minimum. Panels: (a) RGB image, (b) NIR image, (c) Matching cost in A, (d) Matching cost in B, (e) Matching cost in C.

The contributions of this paper can be summarized as follows. First, to the best of our knowledge, our approach is the first attempt to design an efficient, dense descriptor for matching multi-modal and multi-spectral images, even under varying geometric conditions. Second, unlike a center-biased dense max pooling, we propose a randomized receptive field pooling with sampling patterns optimized via discriminative learning, making the descriptor more robust to matching outliers incurred by different imaging modalities. Third, we propose an efficient computational scheme that significantly improves the runtime efficiency of the proposed dense descriptor. Fourth, we propose a geometry-invariant dense descriptor that provides geometric robustness in the descriptor itself.

This manuscript extends its preliminary version [24]. It newly adds (1) a scale and rotation invariant extension of the DASC, called GI-DASC; (2) a new multi-modal benchmark with a ground truth annotation, captured under varying photometric and geometric conditions; and (3) an intensive comparative study with existing geometry invariant methods using various datasets. The source code of our work (including DASC and GI-DASC) and the new multi-modal benchmark are available at our project webpage [25].

2 Related Work

2.1 Feature Descriptors

As a pioneering work, the scale invariant feature transform (SIFT) was first introduced by Lowe [26] to estimate robust sparse feature correspondence under geometric and photometric variations. Based on the intensity comparison, fast binary descriptors, such as binary robust independent elementary features (BRIEF) [27] and fast retina keypoint (FREAK) [28], have been proposed. Unlike these sparse descriptors, Tola et al. developed a dense descriptor, called DAISY [12], which re-designs conventional sparse descriptors, i.e., SIFT, to efficiently compute densely sampled descriptors over an entire image. Although these conventional gradient-based and intensity comparison-based descriptors show satisfactory performance for small photometric deformation, they cannot properly describe multi-modal and multi-spectral images that often exhibit severe non-linear deformation.

To estimate correspondences in multi-modal and multi-spectral images, some variants of the SIFT have been developed [29], but these gradient-based descriptors have an inherent limitation similar to the SIFT, especially when the image gradient varies across different modality images. Schechtman and Irani introduced the LSS descriptor [17] for the purpose of template matching, and achieved impressive results in object detection and retrieval. Torabi et al. employed the LSS as a multi-spectral similarity metric to register human regions of interest (ROIs) [19]. The LSS has also been applied to the registration of multi-spectral remote sensing images [30]. For multi-modal medical image registration, Heinrich et al. proposed a modality independent neighborhood descriptor (MIND) [18] inspired by the LSS. However, none of these approaches scale well to dense matching tasks for multi-modal and multi-spectral images due to their low discriminative power and high computational complexity.

Recently, several approaches started to employ deep convolutional neural networks (CNNs) [31] for estimating correspondences. For designing explicit, discriminative feature descriptors, intermediate activations from CNN architectures are extracted [32, 33, 34, 35], and they have been shown to be effective for patch-level tasks. However, even though CNN-based descriptors encode a discriminative structure with a deep architecture, they have inherent limitations in multi-modal images, since they use shared convolutional kernels across images, which leads to inconsistent responses similar to conventional descriptors [36, 35]. Furthermore, they are unable to provide dense descriptors in the image due to a prohibitively high computational complexity.

2.2 Area-based Similarity Measures

As surveyed in [37], the mutual information (MI), leveraging the entropy of the joint probability distribution function (PDF), has been popularly applied to a registration of multi-modal medical images. However, the MI is sensitive to local radiometric variation since it formulates the intensity variation in a global manner using the joint entropy computed over an entire image. In [38], this issue can be alleviated to some extent by leveraging a locally adaptive weight obtained from SIFT matching, called MI+SIFT in this paper, but its performance is still limited against the multi-modal variation [39]. Although cross-correlation based methods such as an adaptive normalized cross-correlation (ANCC) [40] show satisfactory results for locally linear variations, they show a limitation under severe modality variations. Irani et al. employed the cross-correlation on the Laplacian energy map for measuring multi-sensor image similarity [41], but it also shows a limitation for general image matching tasks. A robust selective normalized cross-correlation (RSNCC) [8] was proposed for the dense alignment between multi-modal images, but its performance is still unsatisfactory due to an inherent limitation of intensity based similarity measure.

Fig. 3: Demonstration of the LSS [17] and the DASC descriptor. Within the support window, solid and dotted line boxes depict the source and target patches, respectively. Unlike the center-biased dense max pooling on each bin in the LSS descriptor, the DASC descriptor incorporates a randomized receptive field pooling using sampling patterns on a log-polar point set, optimized by discriminative learning. Panels: (a) LSS descriptor [17], (b) DASC descriptor.

2.3 Geometry-Invariant Dense Correspondences

Based on the SIFT flow (SF) [13] optimization, many methods have been proposed to alleviate geometric variation problems, including deformable spatial pyramid (DSP) [14], scale-less SIFT flow (SLS) [42], scale-space SIFT flow (SSF) [43], and generalized DSP (GDSP) [22]. However, they suffer from a huge computational complexity arising from the dramatically large search space in geometry-invariant dense correspondence. A generalized PatchMatch (GPM) [44] was proposed for efficient matching leveraging a randomized search scheme. The DAISY Filter Flow (DFF) [21], which exploits the DAISY descriptor [12] with the PatchMatch Filter (PMF) [45], was proposed to provide geometric invariance. However, their weak spatial smoothness often induces mismatched results. The scale invariant descriptor (SID) [46] was proposed to encode geometric robustness in the descriptor itself, but it is not tailored to multi-modal matching. A segmentation-aware approach [47] was proposed to provide geometric robustness for descriptors, e.g., SIFT [26] or SID [46], but it may have a negative effect on the discriminative power of the descriptor.

3 Background

Let us define an image as $f_i : \mathcal{I} \rightarrow \mathbb{R}$ or $\mathbb{R}^3$ for pixel $i = [x_i, y_i]$, where $\mathcal{I} \subset \mathbb{N}^2$ is a discrete image domain. Given the image $f_i$, a dense descriptor $\mathcal{D}_i : \mathcal{I} \rightarrow \mathbb{R}^L$ is defined on a local support window $\mathcal{R}_i$ centered at pixel $i$ with a feature dimension $L$. Conventionally, descriptors were computed based on the assumption that there is a common underlying visual pattern which is shared by two images. However, as shown in Fig. 2, multi-spectral images such as an RGB-NIR pair have a nonlinear photometric deformation even within a small window, e.g., gradient reversal and intensity order variation. More seriously, there are outliers including structure divergence caused by shadows or highlights. In these cases, conventional descriptors using an image gradient (SIFT [26]) or an intensity comparison (BRIEF [27]) cannot capture coherent matching evidence, resulting in erroneous local minima in estimating dense correspondences.

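Unlike these conventional descriptors, the LSS descriptor measures a correlation between two patches $\mathcal{F}_i$ and $\mathcal{F}_j$ centered at two pixels $i$ and $j$ within a local support window $\mathcal{R}_i$ [17]. As shown in Fig. 3(a), it discretizes the correlation surface on a log-polar grid, generates a set of bins, and then stores a maximum correlation value within each bin. Formally, $\mathcal{D}^{\mathrm{LSS}}_i = \bigcup_l d^{\mathrm{LSS}}_{i,l}$ for $l = 1, \dots, L^{\mathrm{LSS}}$ is an $L^{\mathrm{LSS}} \times 1$ feature vector, and $d^{\mathrm{LSS}}_{i,l}$ can be computed as follows:

$$d^{\mathrm{LSS}}_{i,l} = \max_{j \in \mathrm{bin}_i(l)} \mathcal{C}(i, j), \tag{1}$$

where $\mathrm{bin}_i(l) = \{ j \mid j \in \mathcal{R}_i,\ \rho_{r-1} < |i - j| \le \rho_r,\ \theta_{a-1} < \angle(i - j) \le \theta_a \}$ with a log radius $\rho_r$ for $r \in \{1, \dots, N_\rho\}$ and a quantized angle $\theta_a$ for $a \in \{1, \dots, N_\theta\}$ with $\rho_0 = 0$ and $\theta_0 = 0$. In that case, $L^{\mathrm{LSS}} = N_\rho \times N_\theta$. The correlation surface $\mathcal{C}(i, j)$ is typically computed using a simple similarity metric such as the sum of squared difference (SSD) with a normalization factor $\sigma_s$:

$$\mathcal{C}(i, j) = \exp\left(-\mathrm{SSD}(\mathcal{F}_i, \mathcal{F}_j) / \sigma_s\right). \tag{2}$$
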
This LSS descriptor has been shown to be robust in cross-domain object detection [17], but it provides unsatisfactory results in densely matching multi-modal images as shown in Fig. 2. This is because the max pooling performed in each bin loses matching details, leading to a poor discriminative power. Furthermore, the center-biased correlation measure cannot handle severe outliers effectively, which frequently exist in multi-modal and multi-spectral images. In terms of computational complexity, no efficient scheme exists for computing the LSS as a dense matching descriptor.
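The LSS construction described above can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows, a correlation of the form exp(-SSD / sigma_s), and hypothetical parameter names (`radii`, `n_theta`, `half`, `sigma_s`); it is not the reference implementation of [17].

```python
import math

def patch_ssd(img, ci, cj, half):
    """Sum of squared differences between two square patches centered at ci and cj."""
    (yi, xi), (yj, xj) = ci, cj
    total = 0.0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            diff = img[yi + dy][xi + dx] - img[yj + dy][xj + dx]
            total += diff * diff
    return total

def lss_descriptor(img, center, radii, n_theta, half=1, sigma_s=1.0):
    """Max-pool exp(-SSD / sigma_s) over log-polar bins around `center`.

    radii: increasing outer radii of the radial bins (hypothetical parameter).
    The caller must make sure the window plus patch margin fits in the image.
    Returns len(radii) * n_theta values in [0, 1]; empty bins stay at 0.
    """
    cy, cx = center
    r_max = int(math.ceil(radii[-1]))
    desc = [0.0] * (len(radii) * n_theta)
    for dy in range(-r_max, r_max + 1):
        for dx in range(-r_max, r_max + 1):
            dist = math.hypot(dy, dx)
            if dist == 0 or dist > radii[-1]:
                continue
            r_bin = next(k for k, r in enumerate(radii) if dist <= r)
            angle = math.atan2(dy, dx) % (2 * math.pi)
            a_bin = min(int(angle / (2 * math.pi) * n_theta), n_theta - 1)
            corr = math.exp(-patch_ssd(img, center, (cy + dy, cx + dx), half) / sigma_s)
            idx = r_bin * n_theta + a_bin
            desc[idx] = max(desc[idx], corr)  # center-biased max pooling per bin
    return desc
```

Every entry compares the center patch with one bin of neighbors, which is why a degraded center patch affects the whole descriptor.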

4 The DASC Descriptor

4.1 Randomized Receptive Field Pooling

Instead of using the center-biased max pooling of the LSS descriptor in Fig. 3(a), our DASC descriptor incorporates a randomized receptive field pooling with sampling patterns in such a way that pairs of patches are randomly selected within a local support window. It is motivated by three observations: 1) In multi-spectral and multi-modal images, there frequently exist non-informative regions which are locally degraded, e.g., shadows or outliers. 2) Center-biased pooling is very sensitive to a degradation of the center patch, and cannot deal with a homogeneous or salient center pixel which does not contain self-similarities [17]. 3) From the relationship between the Census transform [48] and the BRIEF [27] descriptor, it is shown that randomness enables a descriptor to encode structural information more robustly.

Our approach encodes a similarity between patch-wise receptive fields sampled from log-polar circular point set as shown in Fig. 3(b). It is defined as where the number of points is defined as , and has a higher density of points near a center pixel, similar to DAISY descriptor [12]. Given points in , there exist candidate sampling patterns, leading to a dramatically high-dimension descriptor. However, many of the sampling pattern pairs might not be useful in describing a local support window. Therefore, we employ a randomized approach to extract sampling patterns from pattern candidates. Our descriptor for is encoded with a set of patch similarity between two patches based on sampling patterns that are selected from :


where and are selected sampling patterns at pixel . Note that the sampling patterns are fixed for all pixels in an image. Namely, all pixels share the same set of offset vectors for , enabling a fast computation of dense descriptors, which will be detailed in Sec. 4.3. Although the DASC descriptor uses only sparse patch-wise pairs in a local support window, many of patches are overlapped when computing patch similarities between the sparse pairs, allowing the descriptor to consider the majority of pixels in the support window and reflect original image attributes effectively.
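As a rough illustration of the pooling described above, the sketch below builds a log-polar point set and draws random patch pairs from it. The ring count, angle count, maximum radius, and the use of unordered pairs as candidates are assumptions made for this example, and the random draw stands in for the learned selection of Sec. 4.1.1.

```python
import math
import random

def log_polar_points(n_rho=4, n_theta=8, max_radius=15.0):
    """Center point plus n_rho rings of n_theta integer offsets (dy, dx)."""
    points = [(0, 0)]
    for r in range(1, n_rho + 1):
        radius = max_radius ** (r / n_rho)  # log-spaced radii: denser near the center
        for a in range(n_theta):
            angle = 2 * math.pi * a / n_theta
            points.append((round(radius * math.sin(angle)),
                           round(radius * math.cos(angle))))
    return points

def random_sampling_patterns(points, num_pairs, seed=0):
    """Draw num_pairs distinct (source offset, target offset) pairs.

    The same offsets are meant to be reused at every pixel of the image.
    """
    candidates = [(points[a], points[b])
                  for a in range(len(points))
                  for b in range(a + 1, len(points))]
    return random.Random(seed).sample(candidates, num_pairs)
```

Because the pairs are offsets rather than absolute positions, one pattern set serves all pixels, which is what makes the dense computation of Sec. 4.3 possible.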

Fig. 4: Visualization of patch-wise receptive fields of the DASC descriptor learned from the training set built with the Middlebury benchmark [23], multi-modal benchmark [8], and the MPI SINTEL benchmark [10]. Similar to [49], we stacked all patch-wise receptive fields learned from each training image, and normalized them with the maximal value. Panels: (a) Middlebury [23], (b) Multi-modal [8], (c) MPI SINTEL [10].

4.1.1 Sampling pattern learning

Finding an optimal sampling pattern is a critical issue in the DASC descriptor. With the assumption that there is no single hand-crafted feature that always provides robustness in all circumstances [49], we employ discriminative learning to obtain optimal sampling patterns within a local support window. Given the candidate sampling patterns, our goal is to select the best sampling patterns, i.e., those that capture an important spatial layout.

Our approach exploits support vector machines (SVMs) with a linear kernel

[50]. For learning, we build a dataset , where are support window pairs in multi-modal or multi-spectral images, and is the number of training samples. is a binary label that becomes 1 if two patches are matched, or 0 otherwise. The training data set was built with images captured under varying illumination conditions and/or with imaging devices [23, 10, 8]. In experiments, .

First, the feature that describes two support window pairs and is defined


where is a Gaussian parameter, and is the DASC descriptor. The decision function

to classify training dataset

into matching and non-matching can be represented as


where the weight indicates an amount of contribution of each candidate sampling pattern, and is a bias. Learning can be formulated as minimizing


where the hinge loss function

and represents a regularization parameter. We use LIBSVM [50] to minimize this objective function. The encodes the importance of corresponding sampling pattern towards the final decision [51]. Therefore, we rank top sampling patterns based on value, and use them in our descriptor, which is denoted as .
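The selection step can be sketched as follows. This is only an illustration under stated assumptions: a plain full-batch subgradient descent stands in for LIBSVM, labels are mapped to +1/-1 instead of 1/0, larger weights are taken to mean more important patterns, and all function and parameter names are hypothetical.

```python
def train_linear_svm(features, labels, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by subgradient descent.

    features: one row per training pair, one column per candidate pattern.
    labels: +1 for matching pairs, -1 for non-matching pairs.
    """
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wk for wk in w]
        grad_b = 0.0
        for x, y in zip(features, labels):
            margin = y * (sum(wk * xk for wk, xk in zip(w, x)) + b)
            if margin < 1:  # margin violated: hinge subgradient is -y * x
                for k in range(dim):
                    grad_w[k] -= y * x[k] / n
                grad_b -= y / n
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def select_top_patterns(w, num_keep):
    """Indices of the num_keep candidate patterns with the largest weights."""
    return sorted(range(len(w)), key=lambda k: w[k], reverse=True)[:num_keep]
```

A candidate whose feature separates matching from non-matching pairs receives a large weight, while an uninformative candidate stays near or below zero and is dropped by the ranking.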

Fig. 4 visualizes the learned patch-wise receptive fields of the DASC. They look similar to a Gaussian weighting, which has been shown to be effective for the structural encoding of descriptors in many studies [49, 52]. The learned receptive fields adapt to the training set.

Fig. 5: Visualization of support window pairs on multi-spectral RGB and NIR images denoted as 'A' in Fig. 2 having gradient orientation variations, and descriptors for these window pairs. Conventional descriptors such as DAISY [12], BRIEF [27], and LSS [17] vary across modalities. Unlike those methods, our DASC descriptor remains nearly unchanged under modality variations. Panels: (a) Window 1, (b) Window 2, (c) Gradient orientation, (d) DAISY [12] descriptor, (e) BRIEF [27] descriptor, (f) LSS [17] descriptor, (g) DASC descriptor.

4.2 Adaptive Self-Correlation Measure

With estimated sampling patterns , the DASC descriptor measures a patch similarity using an adaptive self-correlation (ASC) measure in order to robustly encode a local internal layout of self-similarities. For the sake of simplicity, we omit in the correlation metric from here on, as it is repeatedly computed for all . For , the adaptive self-correlation between two patches and centered at pixels and is computed as follows:


where and and weighted averages on and are defined as and .

The weight represents how similar two pixels and are, and is normalized, i.e., . It can be defined with any kind of edge-aware weights [53, 20, 54]. This weighted sum better handles outliers and local variations in patches compared to other patch-wise similarity metrics. It is worth noting that the adaptive self-correlation used here is conceptually similar to the ANCC [40], but our descriptor employs the correlation metric for measuring self-similarity within a single image which is used for matching two or more images later, while the ANCC is used to directly measure inter-similarity between different images.

Finally, our patch-wise similarity between and is computed with a truncated exponential function, which has been widely used in the robust estimator [55]:


where the two parameters are the bandwidth of the Gaussian kernel and the truncation threshold. Here, the absolute value of the adaptive self-correlation is used to mitigate the effect of intensity reversals. The resulting similarities are normalized to a unit norm at every pixel.
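The measure of this section can be sketched for a single patch pair as follows. The sketch makes several assumptions: patches are flattened lists, one normalized weight vector is shared by both patches, the truncated exponential takes the form max(exp(-(1 - |psi|) / sigma_c), tau), and the default values of `sigma_c` and `tau` are placeholders rather than the values used in the experiments.

```python
import math

def adaptive_self_correlation(patch_s, patch_t, weights):
    """Weighted normalized correlation between two equally sized patches.

    weights must be non-negative and sum to 1 (e.g. edge-aware weights).
    """
    mean_s = sum(w * a for w, a in zip(weights, patch_s))
    mean_t = sum(w * b for w, b in zip(weights, patch_t))
    cov = sum(w * (a - mean_s) * (b - mean_t)
              for w, a, b in zip(weights, patch_s, patch_t))
    var_s = sum(w * (a - mean_s) ** 2 for w, a in zip(weights, patch_s))
    var_t = sum(w * (b - mean_t) ** 2 for w, b in zip(weights, patch_t))
    denom = math.sqrt(var_s * var_t)
    return cov / denom if denom > 0 else 0.0

def truncated_similarity(psi, sigma_c=0.5, tau=0.03):
    """Map a correlation in [-1, 1] to a similarity, truncated from below at tau."""
    return max(math.exp(-(1.0 - abs(psi)) / sigma_c), tau)
```

Taking the absolute value means that a patch and its intensity-reversed counterpart receive the maximal similarity, which is the behavior wanted for modalities such as RGB and NIR where gradients can flip.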

Fig. 5 visualizes the outputs of various descriptors. The conventional descriptors are sensitive to modality variations, whereas the DASC is robust against them.

4.3 Efficient Computation for Dense Descriptor

Fig. 6: Efficient computation framework of the DASC descriptor. In order to reduce a computational load in computing the adaptive self-correlation, it re-arranges the sampling pattern and employs fast EAF scheme. The DASC descriptor is then computed with re-indexing.

For densely constructing our descriptor on an entire image, we should compute for all patch pairs belonging to for each pixel . Thus, a straightforward computation can be extremely time-consuming. In this section, we present an efficient method for computing the DASC descriptor. To compute all weighted sums in (7) for efficiently, we employ a constant-time edge-aware filter (EAF), e.g., the guided filter (GF) [20]. However, the symmetric weight varies for each , and thus computing the numerator in (7) is still very time-consuming.

To alleviate these limitations, we simplify (7) by considering only the weight from the source patch so that a fast computation of (7) using fast edge-aware filter is feasible. It should be noted that such an asymmetric weight approximation also has been used in cost aggregation for stereo matching [56]. We also found that in our descriptor, a performance gap between using the asymmetric weight and the symmetric weight is negligible, which will be shown in Sec. 6.2.5. For efficient description, we also re-arrange the sampling pattern to referenced-biased pairs . (7) is then approximated as follows:


where . Furthermore, which means weighted average of with a guidance image . It is worth noting that the robustness of can be still applied to since their difference is just weight factors.


Algorithm 1: Dense Adaptive Self-Correlation (DASC)


Input : image , candidate sampling patterns , training patch pairs dataset .
Output : the DASC descriptor volume .
Offline Procedure
Compute using (4) for possible candidate sampling patterns on training support window pairs .
Learn a weight by optimizing (6).
Select the maximal sampling patterns in terms of , denoted as .
Online Procedure
Compute for all pixel .
Compute .
for do
Re-arrange as .
Compute .
Compute .
Compute .
Estimate and using (9) and (8).
Compute the DASC descriptor by re-indexing sampling patterns such that .
end for


We then decompose numerator and denominator in (9) after some arithmetic derivations such that


where , , and . While the and can be computed on image domain once, , , and should be computed on each offset. However, the weight is fixed for all offsets, thus it can be shared in all offsets. All these components can be efficiently computed using a constant-time edge-aware filter (EAF) [20]. Finally, the dense descriptor is computed with re-indexing as though the robust function in (8). Fig. 6 describes our efficient method for computing the DASC descriptor. Algorithm 1 summarizes the efficient computation of the DASC descriptor.
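The idea behind this decomposition can be sketched in one dimension. The example replaces the edge-aware filter with a plain box filter (uniform weights), which is an assumption made only to keep the code short; the structure is the same: filter the signal, its shifted copy, and their products once per offset, then combine the filtered maps per pixel.

```python
import math

def box_filter(signal, radius):
    """Sliding-window mean via prefix sums; cost per sample is independent of radius."""
    n = len(signal)
    prefix = [0.0]
    for v in signal:
        prefix.append(prefix[-1] + v)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return out

def shift(signal, offset):
    """signal[i + offset] with edge replication."""
    n = len(signal)
    return [signal[min(max(i + offset, 0), n - 1)] for i in range(n)]

def dense_correlation(signal, offset, radius):
    """Correlation between the patch at i and the patch at i + offset, for every i."""
    g = shift(signal, offset)
    mean_f = box_filter(signal, radius)
    mean_g = box_filter(g, radius)
    mean_ff = box_filter([a * a for a in signal], radius)
    mean_gg = box_filter([b * b for b in g], radius)
    mean_fg = box_filter([a * b for a, b in zip(signal, g)], radius)
    out = []
    for i in range(len(signal)):
        cov = mean_fg[i] - mean_f[i] * mean_g[i]
        var = (mean_ff[i] - mean_f[i] ** 2) * (mean_gg[i] - mean_g[i] ** 2)
        out.append(cov / math.sqrt(var) if var > 1e-12 else 0.0)
    return out
```

The maps of the unshifted signal do not depend on the offset and could be computed once and reused across offsets; they are recomputed here only for brevity.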

4.3.1 Comparison of symmetric and asymmetric version of adaptive self-correlation measure

This section analyzes the performance of the DASC descriptor when using the symmetric weight of in (7) and with the asymmetric weight of in (9). The symmetric weight case in the DASC can also be computed similar to Sec. 4.3. After re-arranging the sampling pattern as , the (7) can be then decomposed as similar in (10)


where , , and . The denominator can be easily computed on overall image once. However, compared to the asymmetric measure in (9), in , , and varies for each . Furthermore, it should be computed with a range distance using 6-D vector (or 2-D vector), when an input is a color image (or an intensity image). It significantly increases a computational burden needed for employing constant-time EAFs [20, 57]. A performance gap between using the symmetric measure and the asymmetric measure in the DASC descriptor is negligible, which will be shown in Sec. 6.2.5.

Fig. 7: Efficient computation framework of the geometry-invariant DASC (GI-DASC) descriptor. To leverage the efficient computation scheme of the DASC, we employ a superpixel-based description with inferred geometric fields on each superpixel using the WMSD detection.

4.4 Computational Complexity Analysis

The computational complexity of the DASC descriptor on the brute-force implementation becomes , where , , and represent an image size, a patch size, and a descriptor dimension, respectively. With our efficient computation model, our approach removes the complexity dependency on the patch size , i.e., due to fast constant-time EAF. Furthermore, since there exist repeated offsets, the complexity is further reduced as for .

5 Geometry-Invariant DASC Descriptor

Similar to the DAISY [12], the DASC descriptor is not appropriate to deal with geometric variations. In this section, we propose the geometry-invariant DASC descriptor, called GI-DASC, that addresses severe geometric variations as well as image modality variations. A key idea is to geometrically transform sampling patterns used to measure the patch similarity according to scale and rotation fields when computing the DASC descriptor. To estimate the scale and rotation fields, we first infer initial geometric fields only for sparse points. These initial fields are then fitted and propagated through a superpixel graph. Finally, the GI-DASC descriptor is efficiently computed with geometrically transformed sampling patterns in a manner similar to computing the DASC descriptor, except the fact that the descriptor computation is done for each superpixel independently.

Fig. 8: Demonstration of sampling patterns for the WMSD detector and the index set of the smallest self-dissimilarity values. It enables us to extract reliable feature points with corresponding geometric fields (scale and rotation). Panels: (a) Sampling patterns, (b) Index set.

We adopt the superpixel-based geometric field inference for three reasons. First, the geometric field can be estimated reliably only at distinctive pixels. Second, geometric fields tend to vary smoothly, except at object boundaries. Third, the transformed sampling patterns should be fixed within each superpixel so that the computational scheme based on the fast EAF [20] can be used for efficiently obtaining the GI-DASC for each superpixel. Fig. 7 gives an overview of the GI-DASC.

5.1 Initial Sparse Geometric Field Inference

Conventional feature detectors, e.g., SIFT [26], are very sensitive to multi-modal and multi-spectral deformation. In order to extract sparse features with distinctive geometric information, we employ maximal self-dissimilarity (MSD) because of its robustness to modality deformation [58]. We propose weighted MSD (WMSD), which improves the performance of the MSD in terms of both complexity and robustness by employing a weighted similarity measure and an efficient computation scheme similar to the DASC.

Similar to used in the DASC, the log-polar circular point set is defined for feature detector. The sampling pattern is then defined in such a way that the source patch is always located at center pixel and the target patches are located at other neighboring points as shown in Fig. 8(a). In order to consider the scale deformation, we build the Gaussian image pyramid for , where is the -th Gaussian kernel with a sigma and is the number of pyramids. After re-arranging the sampling pattern as , The self-dissimilarity measure for is computed using weighted sum of squared difference (SSD) with a guidance image such that


where , , and . Similar to the DASC, (12) can be computed efficiently using constant time EAF [20, 57].

For every pixel, we extract the index set of the smallest self-dissimilarity values, i.e., the nearest neighbors of the center patch in Fig. 8(b). It should be noted that the size of this index set trades off distinctiveness against computational efficiency [58]. We then compute the feature response map by summing the self-dissimilarities over the index set such that


For the feature response maps , local maxima are obtained by non-maximum suppression, which compares to its neighbors on the current scale and neighbors on the and scales. As in SIFT [26], a feature point is detected only if has an extreme value compared to all of these neighbors, and its scale is defined with , where is a sparse discrete image domain.

A canonical orientation is further associated with by constructing a histogram with angles for weighted by as


where is the Kronecker delta function. Then, we simply choose the direction corresponding to the highest bin in the histogram, i.e., . The WMSD detector is summarized in Algorithm 2.
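The orientation selection described above can be sketched as a weighted angular histogram in which each angle votes for exactly one bin (the role of the Kronecker delta) and the highest bin is returned. This is only an illustrative sketch: the function name `dominant_orientation`, the default of 36 bins, and the choice to return the bin center are assumptions and are not taken from the paper.

```python
import numpy as np

def dominant_orientation(angles, weights, num_bins=36):
    """Return the center angle (radians) of the highest weighted histogram bin.

    `num_bins` is a hypothetical default; the paper's bin count is not assumed here.
    """
    angles = np.mod(np.asarray(angles, dtype=float), 2.0 * np.pi)
    weights = np.asarray(weights, dtype=float)
    bin_width = 2.0 * np.pi / num_bins
    # Each angle contributes its weight to exactly one bin (Kronecker-delta voting).
    bin_index = np.minimum((angles / bin_width).astype(int), num_bins - 1)
    histogram = np.bincount(bin_index, weights=weights, minlength=num_bins)
    best_bin = int(np.argmax(histogram))
    return (best_bin + 0.5) * bin_width
```

In this sketch the angles would be those of the target points in the sampling pattern and the weights their dissimilarity values, which is our reading of the text above.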


Algorithm 2: Weighted Maximal Self-Dissimilarity (WMSD)


Input : image , feature detection sampling patterns .
Output : feature points with scale , rotation .
for do
    Compute with the Gaussian kernel .
    Compute for all pixel .
    for do
        Compute for .
        Compute .
        Estimate .
    end for
    Extract the index set among for all .
    Build response map as .
end for
Detect feature points from with scale factor .
Compute the orientation for from .
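The response-map part of Algorithm 2 can be illustrated at a single scale with the following simplified sketch. It is not the WMSD itself: it uses a plain (unweighted) box-filtered SSD in place of the guidance-weighted EAF, it omits the Gaussian pyramid and the non-maximum suppression, and `np.roll` wraps around at image borders. The names `box_sum`, `msd_response`, `patch_radius`, and `num_nearest` are hypothetical.

```python
import numpy as np

def box_sum(image, radius):
    """Sum over a (2r+1)x(2r+1) window around every pixel (edge-padded)."""
    size = 2 * radius + 1
    padded = np.pad(image, radius, mode="edge")
    # Integral image with a leading row and column of zeros.
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    return (integral[size:, size:] - integral[:-size, size:]
            - integral[size:, :-size] + integral[:-size, :-size])

def msd_response(image, offsets, patch_radius=2, num_nearest=4):
    """Sum of the `num_nearest` smallest patch SSDs between the center patch
    and the patches at the given (dy, dx) offsets, for every pixel."""
    image = np.asarray(image, dtype=float)
    dissimilarities = []
    for dy, dx in offsets:
        # shifted[y, x] == image[y + dy, x + dx] (with wrap-around at borders).
        shifted = np.roll(image, shift=(-dy, -dx), axis=(0, 1))
        dissimilarities.append(box_sum((image - shifted) ** 2, patch_radius))
    stacked = np.stack(dissimilarities, axis=0)
    # Keep the smallest dissimilarities per pixel (nearest neighbors) and sum them.
    nearest = np.partition(stacked, num_nearest - 1, axis=0)[:num_nearest]
    return nearest.sum(axis=0)
```

A pixel receives a high response only if even its most similar neighboring patches differ from the center patch, which matches the self-dissimilarity idea described above. `num_nearest` must not exceed the number of offsets.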


5.2 Superpixel Graph-Based Propagation

In order to infer dense geometric fields from sparse geometric fields ( and for ), we decompose the image into superpixels , where is the number of superpixels. The geometric fields and are fitted on each superpixel as the average of the sparse geometric fields and for . Note that this fitting is performed only when exists, i.e., when the superpixel includes at least one sparse feature point. Finally, and are constructed for all superpixels.
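The per-superpixel fitting above amounts to averaging the sparse values that fall inside each superpixel and recording which superpixels received any value. A minimal sketch, assuming an integer label map and feature points given as (row, column) pairs; the function name and argument layout are hypothetical, and a plain arithmetic mean is used as stated in the text (angles close to the wrap-around point would need extra care).

```python
import numpy as np

def fit_sparse_field(labels, points, values, num_superpixels):
    """Average sparse values per superpixel.

    Returns (mean_field, has_constraint); superpixels without any feature
    point keep a placeholder value of 0 and are flagged as unconstrained.
    """
    rows, cols = np.asarray(points).T
    ids = np.asarray(labels)[rows, cols]
    counts = np.bincount(ids, minlength=num_superpixels)
    sums = np.bincount(ids, weights=np.asarray(values, dtype=float),
                       minlength=num_superpixels)
    has_constraint = counts > 0
    mean_field = np.zeros(num_superpixels)
    mean_field[has_constraint] = sums[has_constraint] / counts[has_constraint]
    return mean_field, has_constraint
```

The same function would be called once for the scale values and once for the rotation values.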

Similar to [59], our approach then formulates an inference of dense geometric fields and as a constrained optimization problem where surface-fitted sparse geometric fields and are interpreted as soft constraints. For the sake of simplicity, we omit and since they can be computed using the same method. The energy function of our superpixel-based propagation is defined as follows:


where is a regularization parameter. Here, the first term encodes the dissimilarity between the final geometric fields and the initial sparse geometric fields . is an index function, which is 1 for a valid (constraint) superpixel and 0 otherwise. The second term imposes the constraint that two adjacent superpixels and should have similar geometric fields according to the superpixel feature affinity , which is described in the following section.

(a) Image 1
(b) Image 2
(c) Superpixel 1
(d) Superpixel 2
Fig. 9: Examples of a superpixel graph-based propagation. With each superpixel graph in (c), (d) for input images in (a), (b), sparse geometric fields (scale , rotation ) in (e)-(h) are propagated into dense geometric fields (scale , rotation ) in (i)-(l).

5.2.1 Superpixel feature affinity

Our approach employs a superpixel feature composed of an appearance feature and a spatial feature. First, the appearance feature is defined as the average and standard deviation of the intensities of pixels within a superpixel. In experiments, we used the RGB, Lab, and YCbCr spaces for a color image, thus . For an NIR image, the appearance feature is defined on the 1-channel intensity domain such that . Note that directly constructing an affinity matrix from intensity values may lead to inaccurate results due to intensity variations. However, the effect of such variations is greatly reduced, since the appearance feature is defined in an aggregated form within a superpixel and the affinity value is measured within the same image domain. Second, the spatial feature is defined as the spatial centroid coordinate of a superpixel. Based on these superpixel features, the superpixel feature affinity between two adjacent superpixels and is computed as


where and denote coefficients for controlling the spatial coherence of neighboring superpixels.
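As an illustration of the features described in this subsection, the sketch below computes the per-superpixel mean and standard deviation of each channel together with the spatial centroid, and then evaluates an affinity between two superpixels. The exponential form of the affinity and the bandwidths `sigma_a` and `sigma_p` are assumptions made for this sketch; only the ingredients (appearance feature, spatial feature, two coefficients) come from the text. It also assumes every label in `range(num_superpixels)` occurs in the label map.

```python
import numpy as np

def superpixel_features(image, labels, num_superpixels):
    """Per-superpixel appearance (channel means and stds) and spatial centroid."""
    image = np.atleast_3d(np.asarray(image, dtype=float))  # H x W x C
    labels = np.asarray(labels)
    rows, cols = np.indices(labels.shape)
    appearance = np.zeros((num_superpixels, 2 * image.shape[2]))
    centroid = np.zeros((num_superpixels, 2))
    for m in range(num_superpixels):
        mask = labels == m
        pixels = image[mask]  # N x C
        appearance[m] = np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])
        centroid[m] = rows[mask].mean(), cols[mask].mean()
    return appearance, centroid

def affinity(appearance, centroid, m, n, sigma_a=0.1, sigma_p=50.0):
    """Assumed Gaussian-style affinity; sigma_a and sigma_p are hypothetical."""
    da = np.sum((appearance[m] - appearance[n]) ** 2)
    dp = np.sum((centroid[m] - centroid[n]) ** 2)
    return float(np.exp(-da / sigma_a - dp / sigma_p))
```

In practice the affinity would be evaluated only for pairs of adjacent superpixels, so the resulting affinity matrix is sparse.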

5.2.2 Solver

The minimum of the energy function (15) can be obtained with the following linear system


where , where , and .

This linear system with a Laplacian matrix can be easily solved with conventional linear solvers [60]. Fig. 9 shows examples of our superpixel graph-based propagation.
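A minimal sketch of such a solve is given below, assuming the energy is a sum of squared data terms on the constrained superpixels plus affinity-weighted squared differences over adjacent pairs. Under that assumption, setting the gradient to zero gives (H + lam * L) s = H s0, where H is the diagonal matrix of the index function, L is the graph Laplacian built from the affinities, and s0 holds the fitted sparse values. The function name, argument layout, and dense-matrix solve are choices made for this sketch only; a sparse solver would be preferable for many superpixels.

```python
import numpy as np

def propagate_field(initial, valid, edges, weights, lam=1.0):
    """Solve (H + lam * L) s = H s0 for the dense per-superpixel field s.

    initial: fitted sparse values s0 (arbitrary where valid is False)
    valid:   boolean flags of constrained superpixels (the index function)
    edges:   (m, n) index pairs of adjacent superpixels
    weights: affinity of each edge
    """
    num = len(initial)
    W = np.zeros((num, num))
    for (m, n), w in zip(edges, weights):
        W[m, n] = W[n, m] = w
    laplacian = np.diag(W.sum(axis=1)) - W
    H = np.diag(np.asarray(valid, dtype=float))
    rhs = H @ np.asarray(initial, dtype=float)
    return np.linalg.solve(H + lam * laplacian, rhs)
```

The system is non-singular as long as every connected component of the superpixel graph contains at least one constrained superpixel. For a chain of three superpixels with constraints 1 and 3 at the two ends, unit affinities, and `lam=1`, the unconstrained middle superpixel receives the value 2.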

5.3 Efficient Dense Descriptor on Superpixels

The sampling patterns are transformed with corresponding geometric fields and as shown in Fig. 10. Specifically, for the -th superpixel , the sampling pattern is transformed from with a scale factor and a rotation factor ,


where the scale matrix and the rotation matrix is defined with rotation . In a similar way, is also estimated from . Finally, is estimated. Furthermore, the patch size is enlarged as .
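The transformation of the sampling offsets by a scale and a rotation can be sketched as follows. The isotropic scale matrix, the counter-clockwise rotation convention, and the order in which the two matrices are applied are assumptions of this sketch (for an isotropic scale the order does not change the result); the function name is hypothetical.

```python
import numpy as np

def transform_pattern(offsets, scale, rotation):
    """Apply an isotropic scale and a 2-D rotation (radians) to sampling offsets.

    offsets: array of shape (N, 2), one 2-D offset per row.
    """
    offsets = np.asarray(offsets, dtype=float)
    scale_matrix = scale * np.eye(2)
    c, s = np.cos(rotation), np.sin(rotation)
    rotation_matrix = np.array([[c, -s], [s, c]])
    # Row-vector form of (S R) p for every offset p.
    return offsets @ (scale_matrix @ rotation_matrix).T
```

Because the scale and rotation are constant within a superpixel, the transformed offsets need to be computed only once per superpixel.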


Algorithm 3: Geometry-Invariant DASC (GI-DASC)


Input : image , feature detection sampling patterns , sampling patterns .
Output : the GI-DASC descriptor volume .
Extract feature points with scale and rotation using Algorithm 2.
Decompose the image into superpixels .
Compute a surface fitting for geometric field and on superpixels .
Compute a Laplacian matrix with confidences and weights .
Compute dense geometric fields and .
for do
    Transform the sampling pattern into .
    Compute the GI-DASC descriptor for and using Algorithm 1.
end for


(a) Superpixel extended subimage
(b) Sampling pattern
Fig. 10: Sampling pattern transformation in the GI-DASC descriptor. The sampling pattern is transformed as with and on superpixel , which is applied equally for all . This provides geometric robustness on each superpixel.

The -th superpixel extended subimage in Fig. 10(a) is smoothed by a Gaussian filter with the sigma , similar to the scale-space theory used in SIFT [26]. Then, our GI-DASC descriptor for () is encoded with a set of patch similarities between two patches from a transformed sampling pattern on each superpixel such that


for . Finally, the dense GI-DASC descriptor is efficiently computed for all the superpixels . Algorithm 3 summarizes how to compute the GI-DASC descriptor.

(a) Support window size
(b) Descriptor dimension
(c) Patch size
(d) Log-polar circular point
Fig. 11: Average bad-pixel error rate on the Middlebury benchmark [23] of the DASC+LRP descriptor with WTA optimization for varying support window size , descriptor dimension , patch size , and log-polar circular point (). In each experiment, all other parameters are fixed to the initial values in Sec. 6.1.

6 Experimental Results and Discussions

6.1 Experimental Environments

In experiments, the DASC descriptor was implemented with the same parameter settings for all datasets: where is the support window size, and for candidate sampling patterns. We set the smoothness parameter in the GF [20]. For the GI-DASC, the following parameters were used for all datasets: . The number of superpixels is set to about . We implemented the DASC and GI-DASC descriptors in C++ on an Intel Core i- CPU at GHz.

The DASC descriptor was compared with other state-of-the-art descriptors, e.g., SIFT [26], DAISY [12], BRIEF [27], and LSS [17], and with other area-based approaches, e.g., ANCC [40], MI+SIFT [38] (for a fair evaluation, we compared only the similarity measure in [38] without further techniques), and RSNCC [8]. We also compared the DASC using a randomized pooling (DASC+RP) with the DASC using a learned randomized pooling (DASC+LRP). Furthermore, state-of-the-art geometry-robust methods such as SID [46], SegSID [46], SegSF [47], GPM [44], DSP [14], and SSF [43] were compared to the GI-DASC descriptor. For learning the DASC, we built training sets from the benchmark databases used in each experiment, and these training sets were excluded from the experiments.

(a) Illumination variation
(b) Exposure variation
Fig. 12: Average bad-pixel error rate for original LSS [17], LSS without max-pooling, LSS with ASC, LSS using randomized-pooling with fixed center pixel, and the DASC descriptor on Middlebury benchmark [23].
(a) Illumination variation
(b) Exposure variation
Fig. 13: Average bad-pixel error rate for the DASC descriptor with different EAFs, including Box, Gaussian, Bilateral [61], FastBilateral [53], Domain Transform [54], FastGF [62], and GF [20], on the Middlebury benchmark [23].

6.2 Parameter and Component Analysis

6.2.1 Parameter sensitivity analysis

(a) Graffiti
(b) Trees
(c) Bikes
(d) Leuven
Fig. 14: Evaluation of the WMSD detection compared to conventional feature detectors, such as SIFT [26], MSER [63], FAST [64], and MSD [58]. The WMSD provides reliable feature detection performance, and thus reliable hypotheses for the initial sparse geometric fields.
Fig. 15: Evaluation of the WMSD detection compared to conventional rotation estimations. Compared to conventional gradient-based rotation estimation (SIFT [26] and SURF [65]) or intensity-based rotation estimation (BRISK [66] and ORB [67]), our WMSD-based rotation estimation (with the DASC descriptor) shows the best performance.

Fig. 11 analyzes the performance of the DASC descriptor for varying associated parameters, including the support window size , descriptor dimension , patch size , and the number of log-polar circular points . To evaluate the performance quantitatively, we measured the average bad-pixel error rate on the Middlebury benchmark [23]. As the support window size increases, the matching quality improves, but the accuracy gain saturates around . Using a larger descriptor dimension yields better performance since the descriptor encodes more information. Considering the trade-off between efficiency and robustness, is set in experiments. When the patch size increases, the matching quality degrades, since similarity values measured with large patches may lose locally discriminative details. The number of log-polar circular points does not affect the performance much, since optimal patterns can be sampled even from a small .
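For reference, a bad-pixel error rate of the kind commonly reported on stereo benchmarks can be computed as the fraction of valid pixels whose disparity error exceeds a threshold. The sketch below is a generic illustration; the threshold value and the handling of invalid pixels are assumptions and may differ from the exact protocol used in these experiments.

```python
import numpy as np

def bad_pixel_rate(estimated, ground_truth, threshold=1.0, valid_mask=None):
    """Fraction of valid pixels with absolute disparity error above `threshold`."""
    error = np.abs(np.asarray(estimated, dtype=float)
                   - np.asarray(ground_truth, dtype=float))
    if valid_mask is None:
        valid_mask = np.ones(error.shape, dtype=bool)
    return float(np.mean(error[valid_mask] > threshold))
```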

6.2.2 Component-wise performance gain analysis

The DASC is originally motivated by the LSS concept from [17]. The DASC consists of three key ingredients: adaptive self-correlation (ASC), randomized pooling (RP), and learning sampling pattern. In this context, we analyzed an accuracy gain of the DASC over the LSS on the Middlebury benchmark as shown in Fig. 12. Note that all experiments were done using LSS without max pooling, ‘LSS(wo/max)’. The original LSS method [17] uses the SSD for measuring the patch similarity. We replaced the patch similarity of the LSS method with the ASC, named ‘LSS(ASC)’, and then measured its matching accuracy. As expected, the ASC improves the performance compared to the SSD used in the original LSS. We also evaluated the LSS using a randomized pooling with fixed center pixel, ‘LSS(ASC+RPF)’, and the LSS using a learne