1 Introduction
Recently, many computer vision and computational photography problems have been reformulated to overcome their inherent limitations by leveraging multimodal and multispectral images. Typical examples of other imaging modalities include near-infrared (NIR) images [1, 2] and dark flash images [3]. More broadly, flash and no-flash images [4], blurred images [5, 6], and images taken under different radiometric conditions [7] can also be considered multimodal [8].
Establishing dense visual correspondences for multimodal and multispectral images is a key enabler for realizing such tasks. In general, the performance of correspondence algorithms relies primarily on two components: the appearance descriptor and the optimization scheme. Traditional dense correspondence methods for estimating depth [9] or optical flow [10, 11] fields, in which input images are acquired under similar imaging conditions, have advanced dramatically in recent studies. To define a matching fidelity term, they typically assume that multiple images share a similar visual pattern, e.g., color, gradient, and structural similarity. However, for multispectral and multimodal images, such properties do not hold, as shown in Fig. 1, and thus conventional descriptors or similarity measures often fail to capture reliable matching evidence. This leads to poor matching quality, as shown in Fig. 2. Furthermore, substantial geometric variations, which often appear in images captured under wide-baseline conditions, make the matching task even more challenging. Although employing a powerful optimization technique could help estimate a reliable solution with a spatial context [13, 14, 15], an optimizer itself cannot address an inherent limitation without suitable matching descriptors [16].
Our method starts from the observation that a local internal layout of self-similarities is less sensitive to photometric distortions, even when the intensity distribution of an anatomical structure is not maintained across different imaging modalities [17]. That is, the local self-similarity (LSS) descriptor can help overcome the inherent limitations of existing descriptors in establishing correspondences between multimodal or multispectral images. Several approaches based on the LSS have been presented for multimodal and multispectral image registration [18, 19], but they do not scale well to estimating dense correspondences for such images, and their matching performance is still poor.
In this paper, we propose a novel local descriptor, called dense adaptive self-correlation (DASC), designed for establishing dense multimodal and multispectral correspondences. It is defined with a series of patch-wise similarities within a local support window. The similarity is computed with an adaptive self-correlation measure, which encodes an intrinsic structure while providing robustness against modality variations. To further improve the matching quality and runtime efficiency, we propose a randomized receptive field pooling strategy using sampling patterns that select two patches within the local support window. Linear discriminative learning is employed to obtain an optimal sampling pattern. The computational redundancy that arises when computing densely sampled descriptors over an entire image is dramatically reduced by applying fast edge-aware filtering [20].
Furthermore, in order to address geometric variation problems such as scale and rotation, we propose the geometry-invariant DASC (GI-DASC) descriptor, which leverages the efficiency and effectiveness of the DASC through a superpixel-based representation. Specifically, we infer an initial geometric field from the scale and rotation of reliable sparse keypoints obtained using weighted maximal self-dissimilarity (WMSD), and then propagate the initial geometric field on a superpixel graph. After transforming the sampling patterns according to the geometric fields on each superpixel, the DASC is efficiently computed with the transformed sampling patterns on each superpixel-extended subimage. Compared to conventional geometry-invariant methods for dense correspondence [21, 22], which have focused on employing powerful optimization schemes, the GI-DASC provides geometric and photometric robustness in the descriptor itself.
Experimental results show that the DASC outperforms conventional area-based and feature-based approaches on various benchmarks including modality variations: (1) the Middlebury stereo benchmark containing illumination and exposure variations [23], (2) a multimodal and multispectral dataset including RGB-NIR images [8, 1], different exposures [8, 7], flash/no-flash images [7], and blurry images [5, 6], and (3) the MPI optical flow benchmark containing specular reflections, motion blur, and defocus blur [10]. We also show that the GI-DASC outperforms existing geometry-invariant methods on a novel multimodal benchmark.
1.1 Contribution
The contributions of this paper can be summarized as follows. First, to the best of our knowledge, our approach is the first attempt to design an efficient, dense descriptor for matching multimodal and multispectral images, even under varying geometric conditions. Second, unlike center-biased dense max pooling, we propose a randomized receptive field pooling with sampling patterns optimized via discriminative learning, making the descriptor more robust to matching outliers incurred by different imaging modalities. Third, we propose an efficient computational scheme that significantly improves the runtime efficiency of the proposed dense descriptor. Fourth, we propose a geometry-invariant dense descriptor, which provides geometric robustness in the descriptor itself.
This manuscript extends its preliminary version [24]. It newly adds (1) a scale- and rotation-invariant extension of the DASC, called GI-DASC; (2) a new multimodal benchmark with ground-truth annotation, captured under varying photometric and geometric conditions; and (3) an extensive comparative study with existing geometry-invariant methods using various datasets. The source code of our work (including DASC and GI-DASC) and the new multimodal benchmark are available at our project webpage [25].
2 Related Work
2.1 Feature Descriptors
As a pioneering work, the scale-invariant feature transform (SIFT) was introduced by Lowe [26] to estimate robust sparse feature correspondences under geometric and photometric variations. Based on intensity comparisons, fast binary descriptors, such as binary robust independent elementary features (BRIEF) [27] and fast retina keypoint (FREAK) [28], have been proposed. Unlike these sparse descriptors, Tola et al. developed a dense descriptor, called DAISY [12], which redesigns conventional sparse descriptors, i.e., SIFT, to efficiently compute densely sampled descriptors over an entire image. Although these conventional gradient-based and intensity-comparison-based descriptors show satisfactory performance for small photometric deformations, they cannot properly describe multimodal and multispectral images, which often exhibit severe nonlinear deformations.
To estimate correspondences in multimodal and multispectral images, some variants of the SIFT have been developed [29], but these gradient-based descriptors have an inherent limitation similar to the SIFT, especially when the image gradient varies across images of different modalities. Shechtman and Irani introduced the LSS descriptor [17] for the purpose of template matching, and achieved impressive results in object detection and retrieval. Torabi et al. employed the LSS as a multispectral similarity metric to register human regions of interest (ROIs) [19]. The LSS has also been applied to the registration of multispectral remote sensing images [30]. For multimodal medical image registration, Heinrich et al. proposed a modality independent neighborhood descriptor (MIND) [18] inspired by the LSS. However, none of these approaches scale well to dense matching tasks for multimodal and multispectral images due to their low discriminative power and high computational complexity.
Recently, several approaches have started to employ deep convolutional neural networks (CNNs) [31] for estimating correspondences. To design explicit, discriminative feature descriptors, intermediate activations from CNN architectures are extracted [32, 33, 34, 35], and they have been shown to be effective for patch-level tasks. However, even though CNN-based descriptors encode a discriminative structure with a deep architecture, they have inherent limitations in multimodal images, since they use shared convolutional kernels across images, which leads to inconsistent responses, similar to conventional descriptors [36, 35]. Furthermore, they are unable to provide dense descriptors over the image due to a prohibitively high computational complexity.
2.2 Area-based Similarity Measures
As surveyed in [37], mutual information (MI), which leverages the entropy of the joint probability distribution function (PDF), has been widely applied to the registration of multimodal medical images. However, the MI is sensitive to local radiometric variations since it formulates the intensity variation in a global manner using the joint entropy computed over an entire image. In [38], this issue is alleviated to some extent by leveraging a locally adaptive weight obtained from SIFT matching, called MI+SIFT in this paper, but its performance is still limited against multimodal variations [39]. Although cross-correlation-based methods such as the adaptive normalized cross-correlation (ANCC) [40] show satisfactory results for locally linear variations, they show limitations under severe modality variations. Irani et al. employed the cross-correlation on the Laplacian energy map for measuring multi-sensor image similarity [41], but it also shows limitations for general image matching tasks. A robust selective normalized cross-correlation (RSNCC) [8] was proposed for the dense alignment between multimodal images, but its performance is still unsatisfactory due to an inherent limitation of intensity-based similarity measures.
2.3 Geometry-Invariant Dense Correspondences
Based on the SIFT flow (SF) [13] optimization, many methods have been proposed to alleviate geometric variation problems, including deformable spatial pyramid (DSP) [14], scale-less SIFT flow (SLS) [42], scale-space SIFT flow (SSF) [43], and generalized DSP (GDSP) [22]. However, they suffer from a critical limitation: the huge computational complexity that stems from the dramatically large search space in geometry-invariant dense correspondence. A generalized PatchMatch (GPM) [44] was proposed for efficient matching leveraging a randomized search scheme. The DAISY Filter Flow (DFF) [21], which exploits the DAISY descriptor [12] with the PatchMatch Filter (PMF) [45], was proposed to provide geometric invariance. However, their weak spatial smoothness often induces mismatched results. The scale-invariant descriptor (SID) [46] was proposed to encode geometric robustness in the descriptor itself, but it is not tailored to multimodal matching. A segmentation-aware approach [47] was proposed to provide geometric robustness for descriptors, e.g., SIFT [26] or SID [46], but it may have a negative effect on the discriminative power of the descriptor.
3 Background
Let us define an image as for pixel , where is a discrete image domain. Given the image , a dense descriptor is defined on a local support window centered at pixel with a feature dimension . Conventionally, descriptors were computed based on the assumption that there is a common underlying visual pattern shared by the two images. However, as shown in Fig. 2, multispectral images such as an RGB-NIR pair exhibit nonlinear photometric deformations even within a small window, e.g., gradient reversals and intensity order variations. More seriously, there are outliers, including structure divergence caused by shadows or highlights. In these cases, conventional descriptors using an image gradient (SIFT [26]) or an intensity comparison (BRIEF [27]) cannot capture coherent matching evidence, resulting in erroneous local minima when estimating dense correspondences.
Unlike these conventional descriptors, the LSS descriptor measures a correlation between two patches and centered at two pixels and within a local support window [17]. As shown in Fig. 3(a), it discretizes the correlation surface on a log-polar grid, generates a set of bins, and then stores the maximum correlation value within each bin. Formally, for is a feature vector, and can be computed as follows:
(1)
where with a log radius for and a quantized angle for with and . In that case, . The correlation surface is typically computed using a simple similarity metric such as the sum of squared differences (SSD) with a normalization factor :
(2) 
This LSS descriptor has been shown to be robust in cross-domain object detection [17], but it provides unsatisfactory results in densely matching multimodal images, as shown in Fig. 2. This is because the max pooling performed in each loses matching details, leading to poor discriminative power. Furthermore, the center-biased correlation measure cannot handle severe outliers effectively, which frequently exist in multimodal and multispectral images. In terms of computational complexity, there is no efficient computational scheme designed for dense matching descriptors.
4 The DASC Descriptor
4.1 Randomized Receptive Field Pooling
Instead of using the center-biased max pooling of the LSS descriptor in Fig. 3(a), our DASC descriptor incorporates a randomized receptive field pooling with sampling patterns, in which pairs of patches are randomly selected within a local support window. It is motivated by three observations: 1) In multispectral and multimodal images, there frequently exist non-informative regions that are locally degraded, e.g., shadows or outliers. 2) Center-biased pooling is very sensitive to the degradation of a center patch, and cannot deal with a homogeneous or salient center pixel that does not contain self-similarities [17]. 3) The relationship between the Census transform [48] and the BRIEF [27] descriptor shows that randomness enables a descriptor to encode structural information more robustly.
Our approach encodes a similarity between patch-wise receptive fields sampled from a log-polar circular point set as shown in Fig. 3(b). It is defined as where the number of points is defined as , and has a higher density of points near the center pixel, similar to the DAISY descriptor [12]. Given points in , there exist candidate sampling patterns, leading to a dramatically high-dimensional descriptor. However, many of the sampling pattern pairs might not be useful in describing a local support window. Therefore, we employ a randomized approach to extract sampling patterns from pattern candidates. Our descriptor for is encoded with a set of patch similarities between two patches based on sampling patterns that are selected from :
(3) 
where and are selected sampling patterns at pixel . Note that the sampling patterns are fixed for all pixels in an image. Namely, all pixels share the same set of offset vectors for , enabling a fast computation of dense descriptors, which will be detailed in Sec. 4.3. Although the DASC descriptor uses only sparse patch-wise pairs in a local support window, many of the patches overlap when computing patch similarities between the sparse pairs, allowing the descriptor to consider the majority of pixels in the support window and reflect the original image attributes effectively.
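To make the pooling step concrete, the following sketch builds a log-polar circular point set and draws random patch-pair sampling patterns from it. The grid layout (4 radii, 8 angles, maximum radius 15), the pattern count of 128, and the function names are illustrative assumptions, not the settings used in this paper.

```python
import math
import random

def log_polar_points(n_radii=4, n_angles=8, max_radius=15.0):
    """Return 2D offsets on a log-polar grid, plus the center (0, 0)."""
    points = [(0.0, 0.0)]
    for r_idx in range(1, n_radii + 1):
        # Radii grow geometrically, so points are denser near the center.
        radius = max_radius ** (r_idx / n_radii)
        for a_idx in range(n_angles):
            theta = 2.0 * math.pi * a_idx / n_angles
            points.append((radius * math.cos(theta), radius * math.sin(theta)))
    return points

def sample_patterns(points, n_patterns, seed=0):
    """Randomly pick n_patterns distinct unordered pairs of point indices."""
    candidates = [(a, b) for a in range(len(points))
                  for b in range(a + 1, len(points))]
    rng = random.Random(seed)
    return rng.sample(candidates, n_patterns)

points = log_polar_points()                       # 1 + 4 * 8 = 33 points
patterns = sample_patterns(points, n_patterns=128)  # drawn from 33 * 32 / 2 = 528 pairs
```

Because the resulting offset pairs do not depend on the pixel position, they can be generated once and reused for every pixel of the image.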
4.1.1 Sampling pattern learning
Finding an optimal sampling pattern is a critical issue in the DASC descriptor. Based on the assumption that no single handcrafted feature provides robustness under all circumstances [49], we employ discriminative learning to obtain optimal sampling patterns within a local support window. Given candidate sampling patterns , our goal is to select the best sampling patterns that capture an important spatial layout.
Our approach exploits support vector machines (SVMs) with a linear kernel [50]. For learning, we build a dataset , where are support window pairs in multimodal or multispectral images, and is the number of training samples. is a binary label that becomes 1 if two patches are matched, or 0 otherwise. The training data set was built with images captured under varying illumination conditions and/or imaging devices [23, 10, 8]. In experiments, .
First, the feature that describes two support window pairs and is defined as
(4)
where is a Gaussian parameter, and is the DASC descriptor. The decision function to classify the training dataset into matching and non-matching can be represented as
(5)
where the weight indicates the amount of contribution of each candidate sampling pattern, and is a bias. Learning can be formulated as minimizing
(6)
where the hinge loss function and represents a regularization parameter. We use LIBSVM [50] to minimize this objective function. The encodes the importance of the corresponding sampling pattern towards the final decision [51]. Therefore, we rank the top sampling patterns based on the value, and use them in our descriptor, which is denoted as .
Fig. 4 visualizes the learned patch-wise receptive fields of the DASC. They look similar to a Gaussian weighting, which has proven effective for the structural encoding of descriptors in the literature [49, 52]. The optimal receptive fields are learned according to the training set.
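The selection procedure can be sketched on toy data as follows. This is only an illustration under stated assumptions: plain batch subgradient descent stands in for LIBSVM, labels are mapped to +1/-1, features are mean-centered before training, the Gaussian form of `pair_feature` is an assumed reading of (4), and patterns are ranked by weight value. All function names and hyperparameters are hypothetical.

```python
import math

def pair_feature(desc_a, desc_b, sigma=0.5):
    """Per-pattern agreement of two descriptors (assumed Gaussian form)."""
    return [math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))
            for a, b in zip(desc_a, desc_b)]

def train_linear_svm(features, labels, reg=0.01, lr=0.05, epochs=500):
    """Batch subgradient descent on the L2-regularized hinge loss."""
    n, dim = len(features), len(features[0])
    # Center each feature dimension so the bias need not absorb its mean.
    means = [sum(x[d] for x in features) / n for d in range(dim)]
    centered = [[x[d] - means[d] for d in range(dim)] for x in features]
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [reg * wd for wd in w], 0.0
        for x, y in zip(centered, labels):
            margin = y * (sum(wd * xd for wd, xd in zip(w, x)) + b)
            if margin < 1.0:  # only margin violators contribute
                for d in range(dim):
                    grad_w[d] -= y * x[d] / n
                grad_b -= y / n
        w = [wd - lr * gd for wd, gd in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def select_top_patterns(weights, k):
    """Indices of the k candidate patterns with the largest weights."""
    return sorted(range(len(weights)), key=lambda d: weights[d], reverse=True)[:k]

# Toy data: candidate patterns 0 and 2 separate matches from non-matches,
# while patterns 1 and 3 take the same values in both classes.
noise = (0.2, 0.5, 0.8)
features, labels = [], []
for n1 in noise:
    for n3 in noise:
        features.append([0.9, n1, 0.8, n3]); labels.append(+1)  # matching pair
        features.append([0.1, n1, 0.2, n3]); labels.append(-1)  # non-matching pair
weights, bias = train_linear_svm(features, labels)
top = select_top_patterns(weights, k=2)
```

On this toy set the two informative candidates receive the largest weights, which is the behavior the ranking step relies on.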
4.2 Adaptive SelfCorrelation Measure
With the estimated sampling patterns , the DASC descriptor measures a patch similarity using an adaptive self-correlation (ASC) measure in order to robustly encode a local internal layout of self-similarities. For the sake of simplicity, we omit in the correlation metric from here on, as it is repeatedly computed for all . For , the adaptive self-correlation between two patches and centered at pixels and is computed as follows:
(7) 
where and and weighted averages on and are defined as and .
The weight represents how similar two pixels and are, and is normalized, i.e., . It can be defined with any kind of edge-aware weights [53, 20, 54]. This weighted sum handles outliers and local variations in patches better than other patch-wise similarity metrics. It is worth noting that the adaptive self-correlation used here is conceptually similar to the ANCC [40], but our descriptor employs the correlation metric for measuring self-similarity within a single image, which is later used for matching two or more images, while the ANCC is used to directly measure inter-similarity between different images.
Finally, our patch-wise similarity between and is computed with a truncated exponential function, which has been widely used in robust estimators [55]:
(8) 
where is a bandwidth of the Gaussian kernel and is a truncation parameter. Here, the absolute value of is used to mitigate the effect of intensity reversals. The correlation for is normalized to unit norm for all .
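A minimal sketch of the two steps just described, a weighted normalized correlation followed by a truncated exponential of its absolute value, is given below. The exact forms of (7) and (8) are not reproduced here: the single per-position weight list, the functional form `max(exp(-(1 - |corr|) / sigma), tau)`, and the default values of `sigma` and `tau` are assumptions made for illustration.

```python
import math

def weighted_correlation(patch_s, patch_t, weights):
    """Weighted normalized correlation between two equally sized patches."""
    total = sum(weights)
    w = [x / total for x in weights]  # normalize so the weights sum to 1
    mean_s = sum(wi * a for wi, a in zip(w, patch_s))
    mean_t = sum(wi * b for wi, b in zip(w, patch_t))
    cov = sum(wi * (a - mean_s) * (b - mean_t)
              for wi, a, b in zip(w, patch_s, patch_t))
    var_s = sum(wi * (a - mean_s) ** 2 for wi, a in zip(w, patch_s))
    var_t = sum(wi * (b - mean_t) ** 2 for wi, b in zip(w, patch_t))
    denom = math.sqrt(var_s * var_t)
    return cov / denom if denom > 0 else 0.0

def robust_similarity(corr, sigma=0.5, tau=0.03):
    """Truncated exponential of the absolute correlation (assumed form)."""
    return max(math.exp(-(1.0 - abs(corr)) / sigma), tau)

# An intensity-reversed patch: correlation is -1, similarity is still high.
source = [1.0, 2.0, 3.0, 4.0]
target = [40.0, 30.0, 20.0, 10.0]
corr = weighted_correlation(source, target, [0.1, 0.4, 0.4, 0.1])
similarity = robust_similarity(corr)
```

Taking the absolute value means that a patch pair related by an intensity reversal scores as high as a pair related by a positive linear mapping, which is the property the text attributes to (8).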
Fig. 5 shows examples visualizing the results of various descriptors. The conventional descriptors are sensitive to modality variations, whereas the DASC is robust against multimodal variations.
4.3 Efficient Computation for Dense Descriptor
To densely construct our descriptor over an entire image, we should compute for all patch pairs belonging to for each pixel . Thus, a straightforward computation can be extremely time-consuming. In this section, we present an efficient method for computing the DASC descriptor. To compute all weighted sums in (7) for efficiently, we employ a constant-time edge-aware filter (EAF), e.g., the guided filter (GF) [20]. However, the symmetric weight varies for each , and thus computing the numerator in (7) is still very time-consuming.
To alleviate this limitation, we simplify (7) by considering only the weight from the source patch so that a fast computation of (7) using a fast edge-aware filter is feasible. It should be noted that such an asymmetric weight approximation has also been used in cost aggregation for stereo matching [56]. We also found that, in our descriptor, the performance gap between using the asymmetric weight and the symmetric weight is negligible, which will be shown in Sec. 6.2.5. For an efficient description, we also rearrange the sampling pattern into reference-biased pairs . Equation (7) is then approximated as follows:
(9) 
where . Furthermore, which means the weighted average of with a guidance image . It is worth noting that the robustness of still applies to , since they differ only in their weight factors.


Algorithm 1: Dense Adaptive Self-Correlation (DASC)
Input: image , candidate sampling patterns , training patch pairs dataset .
Output: the DASC descriptor volume .
Offline procedure
  Compute using (4) for possible candidate sampling patterns on training support window pairs .
  Learn a weight by optimizing (6).
  Select the maximal sampling patterns in terms of , denoted as .
Online procedure
  Compute for all pixels .
  Compute .
  for do
    Rearrange as .
    Compute .
    Compute .
    Compute .
    Estimate and using (9) and (8).
    Compute the DASC descriptor by re-indexing sampling patterns such that .
  end for

We then decompose numerator and denominator in (9) after some arithmetic derivations such that
(10) 
where , , and . While the and can be computed once over the image domain, , , and should be computed for each offset. However, the weight is fixed for all offsets, and thus it can be shared across all offsets. All these components can be efficiently computed using a constant-time edge-aware filter (EAF) [20]. Finally, the dense descriptor is computed with re-indexing as through the robust function in (8). Fig. 6 illustrates our efficient method for computing the DASC descriptor, and Algorithm 1 summarizes it.
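The decomposition rests on a standard identity: with normalized weights shared by both patches, a weighted covariance equals the weighted average of the product minus the product of the weighted averages. The sketch below checks this numerically for a single patch pair. The function names are hypothetical, and each weighted average stands in for one filtering pass in the dense setting.

```python
import math

def wmean(values, weights):
    """Weighted average; the weights are assumed to sum to 1."""
    return sum(w * v for w, v in zip(weights, values))

def correlation_direct(f, g, weights):
    """Weighted correlation computed from explicitly centered values."""
    mf, mg = wmean(f, weights), wmean(g, weights)
    num = wmean([(a - mf) * (b - mg) for a, b in zip(f, g)], weights)
    var_f = wmean([(a - mf) ** 2 for a in f], weights)
    var_g = wmean([(b - mg) ** 2 for b in g], weights)
    return num / math.sqrt(var_f * var_g)

def correlation_from_moments(f, g, weights):
    """Same quantity from five weighted averages (one filter pass each)."""
    mf, mg = wmean(f, weights), wmean(g, weights)
    mff = wmean([a * a for a in f], weights)
    mgg = wmean([b * b for b in g], weights)
    mfg = wmean([a * b for a, b in zip(f, g)], weights)
    return (mfg - mf * mg) / math.sqrt((mff - mf * mf) * (mgg - mg * mg))

f = [0.2, 0.5, 0.9, 0.4]
g = [0.7, 0.1, 0.3, 0.8]
weights = [0.1, 0.2, 0.3, 0.4]
direct = correlation_direct(f, g, weights)
fast = correlation_from_moments(f, g, weights)
```

In a dense implementation, the moments that involve only the source patch can be filtered once per image, while those that involve the target patch must be filtered once per offset, which is where the savings described above come from.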
4.3.1 Comparison of symmetric and asymmetric versions of the adaptive self-correlation measure
This section analyzes the performance of the DASC descriptor when using the symmetric weight of in (7) and the asymmetric weight of in (9). The symmetric-weight case in the DASC can also be computed in a manner similar to Sec. 4.3. After rearranging the sampling pattern as , (7) can then be decomposed similarly to (10) as
(11) 
where , , and . The denominator can be easily computed once over the whole image. However, compared to the asymmetric measure in (9), in , , and varies for each . Furthermore, it should be computed with a range distance using a 6-D vector (or a 2-D vector) when the input is a color image (or an intensity image). This significantly increases the computational burden of employing constant-time EAFs [20, 57]. The performance gap between using the symmetric measure and the asymmetric measure in the DASC descriptor is negligible, which will be shown in Sec. 6.2.5.
4.4 Computational Complexity Analysis
The computational complexity of a brute-force implementation of the DASC descriptor is , where , , and represent the image size, the patch size, and the descriptor dimension, respectively. With our efficient computation model, our approach removes the dependency of the complexity on the patch size , i.e., , thanks to the fast constant-time EAF. Furthermore, since there exist repeated offsets, the complexity is further reduced to for .
5 Geometry-Invariant DASC Descriptor
Similar to DAISY [12], the DASC descriptor is not designed to deal with geometric variations. In this section, we propose the geometry-invariant DASC descriptor, called GI-DASC, which addresses severe geometric variations as well as image modality variations. The key idea is to geometrically transform the sampling patterns used to measure the patch similarity according to scale and rotation fields when computing the DASC descriptor. To estimate the scale and rotation fields, we first infer initial geometric fields only for sparse points. These initial fields are then fitted and propagated through a superpixel graph. Finally, the GI-DASC descriptor is efficiently computed with the geometrically transformed sampling patterns in a manner similar to computing the DASC descriptor, except that the descriptor computation is done for each superpixel independently.
We adopt the superpixel-based geometric field inference for three reasons. First, the geometric field can be estimated reliably only at distinctive pixels. Second, the geometric fields tend to vary smoothly, except at object boundaries. Third, the transformed sampling patterns should be fixed within each superpixel so that the computational scheme based on the fast EAF [20] can be used to efficiently obtain the GI-DASC for each superpixel. Fig. 7 shows an overview of the GI-DASC.
5.1 Initial Sparse Geometric Field Inference
Conventional feature detectors, e.g., SIFT [26], are very sensitive to multimodal and multispectral deformations. In order to extract sparse features with distinctive geometric information, we employ maximal self-dissimilarity (MSD) thanks to its robustness to modality deformation [58]. We propose the weighted MSD (WMSD), which improves the performance of the MSD in terms of both complexity and robustness by employing a weighted similarity measure and an efficient computation scheme similar to that of the DASC.
Similar to used in the DASC, the log-polar circular point set is defined for the feature detector. The sampling pattern is then defined in such a way that the source patch is always located at the center pixel and the target patches are located at the other neighboring points, as shown in Fig. 8(a). In order to consider the scale deformation, we build the Gaussian image pyramid for , where is the th Gaussian kernel with a sigma and is the number of pyramid levels. After rearranging the sampling pattern as , the self-dissimilarity measure for is computed using the weighted sum of squared differences (SSD) with a guidance image such that
(12) 
where , , and . Similar to the DASC, (12) can be computed efficiently using a constant-time EAF [20, 57].
We extract the index set for the smallest values for all , i.e., the nearest neighbors of the center patch in Fig. 8(b). It should be noted that the parameter trades off distinctiveness against computational efficiency [58]. We then compute the feature response map by summing for such that
(13) 
For the feature response maps , the local maxima are obtained by non-maximal suppression, which compares to its neighbors on the current scale and neighbors on the and scales. Similar to SIFT [26], a feature point is detected only if has an extreme value compared to all of these neighbors, and its scale is defined with , where is a sparse discrete image domain.
A canonical orientation is further associated to by constructing a histogram with angles for weighted by as
(14) 
where is the Kronecker delta function. Then, we simply choose the direction corresponding to the highest bin in the histogram, i.e., . The WMSD detector is summarized in Algorithm 2.
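The orientation assignment can be sketched as a weighted angle histogram over the neighbor offsets, followed by picking the highest bin. The bin count of 8, the use of the bin center as the returned angle, and the function name are assumptions for illustration, since the quantization used in (14) is not reproduced here.

```python
import math

def dominant_orientation(offsets, responses, n_bins=8):
    """Weighted angle histogram over neighbor offsets; returns (bin, angle)."""
    bin_width = 2.0 * math.pi / n_bins
    hist = [0.0] * n_bins
    for (dx, dy), weight in zip(offsets, responses):
        angle = math.atan2(dy, dx) % (2.0 * math.pi)  # map to [0, 2*pi)
        index = min(int(angle / bin_width), n_bins - 1)
        hist[index] += weight  # vote weighted by the dissimilarity value
    best = max(range(n_bins), key=lambda k: hist[k])
    return best, (best + 0.5) * bin_width  # bin-center angle in radians

# Four neighbors at roughly 27, 117, 207, and 297 degrees; the second dominates.
offsets = [(2, 1), (-1, 2), (-2, -1), (1, -2)]
responses = [0.1, 0.7, 0.1, 0.1]
best_bin, theta = dominant_orientation(offsets, responses)
```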



Algorithm 2: Weighted Maximal Self-Dissimilarity (WMSD)
Input: image , feature detection sampling patterns .
Output: feature points with scale , rotation .
for do
  Compute with the Gaussian kernel .
  Compute for all pixels .
  for do
    Compute for .
    Compute .
    Estimate .
  end for
  Extract the index set among for all .
  Build response map as .
end for
Detect feature points from with scale factor .
Compute the orientation for from .

5.2 Superpixel Graph-Based Propagation
In order to infer dense geometric fields from sparse geometric fields ( and for ), we decompose the image into superpixels , where is the number of superpixels. The geometric fields and are fitted on each superpixel as the average of the sparse geometric fields and for . Note that this fitting operation is performed only when exists, i.e., when the superpixel includes at least one sparse feature point. Finally, the and are constructed for all superpixels.
Similar to [59], our approach then formulates the inference of dense geometric fields and as a constrained optimization problem in which the surface-fitted sparse geometric fields and are interpreted as soft constraints. For the sake of simplicity, we omit and since they can be computed using the same method. The energy function of our superpixel-based propagation is defined as follows:
(15) 
where is a regularization parameter. Here, the first term encodes the dissimilarity between the final geometric fields and the initial sparse geometric fields . is an indicator function, which is 1 for a valid (constrained) superpixel, and 0 otherwise. The second term imposes the constraint that two adjacent superpixels and should have similar geometric fields according to the superpixel feature affinity , which will be described in the following section.
5.2.1 Superpixel feature affinity
Our approach employs a superpixel feature composed of an appearance feature and a spatial feature. First, the appearance feature is defined as the average and standard deviation of the intensities of pixels within a superpixel. In experiments, we used the RGB, Lab, and YCbCr spaces for a color image, thus . For an NIR image, the appearance feature is defined on a 1-channel intensity domain such that . Note that directly constructing an affinity matrix with intensity values may lead to inaccurate results due to intensity variations. However, the effect of such variations can be greatly reduced, since the appearance feature is defined in an aggregated form within a superpixel and the affinity value is measured within the same image domain. Second, the spatial feature is defined as the spatial centroid coordinate of a superpixel. Based on these superpixel features, a superpixel feature affinity between two adjacent superpixels and is computed as
(16)
where and denote coefficients for controlling the spatial coherence of neighboring superpixels.
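As an illustration of the two superpixel features and their affinity, the sketch below handles the single-channel case. The exponential form of the affinity, the values of the coefficients `sigma_app` and `sigma_spa`, and the function names are assumptions, since (16) is not reproduced here.

```python
import math

def superpixel_feature(pixels):
    """pixels: list of (x, y, intensity). Returns (appearance, centroid)."""
    n = len(pixels)
    mean_i = sum(p[2] for p in pixels) / n
    std_i = math.sqrt(sum((p[2] - mean_i) ** 2 for p in pixels) / n)
    centroid = (sum(p[0] for p in pixels) / n, sum(p[1] for p in pixels) / n)
    return (mean_i, std_i), centroid

def affinity(feat_a, feat_b, sigma_app=0.1, sigma_spa=50.0):
    """Affinity that decays with appearance and spatial distance (assumed form)."""
    (app_a, cen_a), (app_b, cen_b) = feat_a, feat_b
    d_app = sum((u - v) ** 2 for u, v in zip(app_a, app_b))
    d_spa = sum((u - v) ** 2 for u, v in zip(cen_a, cen_b))
    return math.exp(-d_app / sigma_app - d_spa / sigma_spa)

# Three tiny superpixels: the second differs from the first only in position,
# the third differs in both position and intensity.
sp1 = superpixel_feature([(0, 0, 0.5), (1, 0, 0.5)])
sp2 = superpixel_feature([(2, 0, 0.5), (3, 0, 0.5)])
sp3 = superpixel_feature([(2, 0, 0.9), (3, 0, 0.9)])
same_look = affinity(sp1, sp2)
different_look = affinity(sp1, sp3)
```

Because the appearance feature aggregates intensities over a superpixel, the affinity is less sensitive to per-pixel intensity noise than a pixel-level comparison would be.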
5.2.2 Solver
The minimum of the energy function (15) can be obtained with the following linear system
(17) 
where , where , and .
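A small numerical sketch of the propagation is shown below. It assumes the energy is a sum of confidence-weighted data terms and affinity-weighted pairwise terms, so that the minimizer satisfies a linear system of the form (H + lam * L) x = H * x_init with a graph Laplacian L built from symmetric affinities. Gauss-Seidel iteration is used here only for brevity and is not necessarily the solver used in this work; all names are hypothetical.

```python
def propagate(initial, valid, neighbors, lam=1.0, iters=200):
    """Gauss-Seidel solve of (H + lam * L) x = H * initial.

    initial[i]  : fitted value for superpixel i (ignored where valid[i] == 0)
    valid[i]    : 1 if superpixel i holds a sparse constraint, else 0
    neighbors[i]: list of (j, w_ij) pairs with symmetric affinities
    """
    x = [v if h else 0.0 for v, h in zip(initial, valid)]
    for _ in range(iters):
        for i in range(len(x)):
            w_sum = sum(w for _, w in neighbors[i])
            num = valid[i] * initial[i] + lam * sum(w * x[j] for j, w in neighbors[i])
            den = valid[i] + lam * w_sum
            if den > 0:
                x[i] = num / den  # weighted mix of constraint and neighbors
    return x

# Chain of three superpixels: the two ends are constrained, the middle is not.
initial = [1.0, 0.0, 3.0]
valid = [1, 0, 1]
neighbors = [[(1, 1.0)], [(0, 1.0), (2, 1.0)], [(1, 1.0)]]
scale_field = propagate(initial, valid, neighbors)
```

The unconstrained middle superpixel receives a value interpolated from its neighbors, while the constrained ends are pulled toward each other by the smoothness term rather than fixed exactly, which is what treating the sparse fields as soft constraints implies.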
5.3 Efficient Dense Descriptor on Superpixels
The sampling patterns are transformed with corresponding geometric fields and as shown in Fig. 10. Specifically, for the th superpixel , the sampling pattern is transformed from with a scale factor and a rotation factor ,
(18) 
where the scale matrix and the rotation matrix is defined with rotation . In a similar way, is also estimated from . Finally, is estimated. Furthermore, the patch size is enlarged as .
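The pattern transformation can be illustrated by applying a rotation and a uniform scale to each offset vector. The order of operations (rotation first, then scaling, which commutes for a uniform scale) and the function name are assumptions for this sketch.

```python
import math

def transform_pattern(offsets, scale, rotation):
    """Rotate each 2D offset by `rotation` (radians), then scale it uniformly."""
    c, s = math.cos(rotation), math.sin(rotation)
    return [(scale * (c * dx - s * dy), scale * (s * dx + c * dy))
            for dx, dy in offsets]

# Two offsets transformed for a superpixel with scale 2 and a 90-degree rotation.
offsets = [(1.0, 0.0), (0.0, 2.0)]
transformed = transform_pattern(offsets, scale=2.0, rotation=math.pi / 2)
```

Since the scale and rotation are constant within a superpixel, the transformed offsets are shared by all of its pixels, so the offset-based filtering scheme of Sec. 4.3 still applies.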


Algorithm 3: Geometry-Invariant DASC (GI-DASC)
Input: image , feature detection sampling patterns , sampling patterns .
Output: the GI-DASC descriptor volume .
Extract feature points with scale and rotation using Algorithm 2.
Decompose the image into superpixels .
Compute a surface fitting for geometric field and on superpixels .
Compute a Laplacian matrix with confidences and weights .
Compute dense geometric fields and .
for do
  Transform the sampling pattern into .
  Compute the GI-DASC descriptor for and using Algorithm 1.
end for

The th superpixel-extended subimage in Fig. 10(a) is smoothed by a Gaussian filter with the sigma similar to the scale-space theory used in SIFT [26]. Then, our GI-DASC descriptor for () is encoded with a set of patch similarities between two patches from a transformed sampling pattern on each superpixel such that
(19) 
for . Finally, the dense GI-DASC descriptor is efficiently computed for all the superpixels . Algorithm 3 summarizes how to compute the GI-DASC descriptor.
6 Experimental Results and Discussions
6.1 Experimental Environments
In experiments, the DASC descriptor was implemented with the same parameter settings for all datasets: where is the support window size, and for candidate sampling patterns. We set the smoothness parameter in the GF [20]. For the GI-DASC, the following parameters were used for all datasets: . The number of superpixels is set to about . We implemented the DASC and GI-DASC descriptors in C++ on Intel Core i CPU at GHz.
The DASC descriptor was compared with other state-of-the-art descriptors, e.g., SIFT [26], DAISY [12], BRIEF [27], and LSS [17], and other area-based approaches, e.g., ANCC [40], MI+SIFT [38] (for a fair evaluation, we compared only the similarity measure in [38] without further techniques), and RSNCC [8]. We also compared the DASC using a randomized pooling (DASC+RP) with the DASC using a learned randomized pooling (DASC+LRP). Furthermore, state-of-the-art geometry-robust methods such as SID [46], SegSID [46], SegSF [47], GPM [44], DSP [14], and SSF [43] were also compared to the GI-DASC descriptor. For learning the DASC, we built training sets from the benchmark databases used in each experiment, and these training sets were excluded from the experiments.
6.2 Parameter and Component Analysis
6.2.1 Parameter sensitivity analysis
Fig. 11 analyzes the performance of the DASC descriptor while varying the associated parameters, including the support window size , descriptor dimension , patch size , and the number of log-polar circular points . To evaluate the quantitative performance, we measured the average bad-pixel error rate on the Middlebury benchmark [23]. As the support window size increases, the matching quality improves, but the accuracy gain saturates around . Using a larger descriptor dimension yields better performance since the descriptor encodes more information. Considering the trade-off between efficiency and robustness, is set in experiments. When the patch size increases, the matching quality degrades, since a series of similarity values measured with large patches may lose locally discriminative details. The number of log-polar circular points does not affect the performance much, since optimal patterns can be sampled even from a small .
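For reference, a bad-pixel error rate of the kind used in this analysis can be computed as below. The threshold of 1 pixel, the optional validity mask, and the function name are assumptions, since the exact evaluation settings are not stated in this section.

```python
def bad_pixel_rate(estimated, ground_truth, threshold=1.0, valid=None):
    """Fraction of valid pixels whose absolute error exceeds `threshold`."""
    bad = total = 0
    for r, (est_row, gt_row) in enumerate(zip(estimated, ground_truth)):
        for c, (est, gt) in enumerate(zip(est_row, gt_row)):
            if valid is not None and not valid[r][c]:
                continue  # e.g. occluded or unlabeled pixels
            total += 1
            bad += abs(est - gt) > threshold
    return bad / total if total else 0.0

# Toy 2x2 disparity maps with absolute errors 0, 2, 1, and 5.
estimated = [[10, 12], [20, 25]]
ground_truth = [[10, 10], [21, 20]]
rate = bad_pixel_rate(estimated, ground_truth)
```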
6.2.2 Component-wise performance gain analysis
The DASC is originally motivated by the LSS concept from [17]. The DASC consists of three key ingredients: adaptive self-correlation (ASC), randomized pooling (RP), and sampling pattern learning. In this context, we analyzed the accuracy gain of the DASC over the LSS on the Middlebury benchmark, as shown in Fig. 12. Note that all experiments were done using the LSS without max pooling, ‘LSS(wo/max)’. The original LSS method [17] uses the SSD for measuring the patch similarity. We replaced the patch similarity of the LSS method with the ASC, named ‘LSS(ASC)’, and then measured its matching accuracy. As expected, the ASC improves the performance compared to the SSD used in the original LSS. We also evaluated the LSS using a randomized pooling with fixed center pixel, ‘LSS(ASC+RPF)’, and the LSS using a learne