Deep Self-Convolutional Activations Descriptor for Dense Cross-Modal Correspondence

03/21/2016 ∙ by Seungryong Kim, et al. ∙ 0

We present a novel descriptor, called deep self-convolutional activations (DeSCA), designed for establishing dense correspondences between images taken under different imaging modalities, such as different spectral ranges or lighting conditions. Motivated by descriptors based on local self-similarity (LSS), we formulate a novel descriptor by leveraging LSS in a deep architecture, leading to better discriminative power and greater robustness to non-rigid image deformations than state-of-the-art cross-modality descriptors. The DeSCA first computes self-convolutions over a local support window for randomly sampled patches, and then builds self-convolution activations by performing an average pooling through a hierarchical formulation within a deep convolutional architecture. Finally, the feature responses on the self-convolution activations are encoded through a spatial pyramid pooling in a circular configuration. In contrast to existing convolutional neural networks (CNNs) based descriptors, the DeSCA is training-free (i.e., randomly sampled patches are utilized as the convolution kernels), is robust to cross-modal imaging, and can be densely computed in an efficient manner that significantly reduces computational redundancy. The state-of-the-art performance of DeSCA on challenging cases of cross-modal image pairs is demonstrated through extensive experiments.



There are no comments yet.


page 2

page 6

page 11

page 12

page 13

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In many computer vision and computational photography applications, images captured under different imaging modalities are used to supplement the data provided in color images. Typical examples of other imaging modalities include near-infrared

[1, 2, 3] and dark flash [4] photography. More broadly, photos taken under different imaging conditions, such as different exposure settings [5], blur levels [6, 7], and illumination [8], can also be considered as cross-modal [9, 10].

Establishing dense correspondences between cross-modal image pairs is essential for combining their disparate information. Although powerful global optimizers may help to improve the accuracy of correspondence estimation to some extent

[11, 12], they face inherent limitations without help of suitable matching descriptors [13]. The most popular local descriptor is scale invariant feature transform (SIFT) [14], which provides relatively good matching performance when there are small photometric variations. However, conventional descriptors such as SIFT often fail to capture reliable matching evidences in cross-modal image pairs due to their different visual properties [9, 10].

Recently, convolutional neural networks (CNNs) based features [15, 16, 17, 18, 19] have emerged as a robust alternative with high discriminative power. However, CNN-based descriptors cannot satisfactorily deal with severe cross-modality appearance differences, since they use shared convolutional kernels across images which lead to inconsistent responses similar to conventional descriptors [19, 20]. Furthermore, they do not scale well for dense correspondence estimation due to their high computational complexity. Though recent works [21] propose an efficient method that extracts dense outputs through the deep CNNs, they do not extract dense CNN features for all pixels individually. More seriously, their methods were usually designed to perform a specific task only, e.g., semantic segmentation, not to provide a general purpose descriptor like ours.

(a) cost function in A
(b) cost function in B
(c) cost function in C
Figure 1: Examples of matching cost profiles, computed with different descriptors along the scan lines of A, B, and C for image pairs under severe non-rigid deformations and illumination changes. Unlike other descriptors, DeSCA yields reliable global minimum.

To address the problem of cross-modal appearance changes, feature descriptors have been proposed based on local self-similarity (LSS) [22], which is motivated by the notion that the geometric layout of local internal self-similarities is relatively insensitive to imaging properties. The state-of-the-art descriptor for cross-modal dense correspondence, called dense adaptive self-correlation (DASC) [10], makes use of LSS and has demonstrated high accuracy and speed on cross-modal image pairs. However, DASC suffers from two significant shortcomings. One is its limited discriminative power due to a limited set of patch sampling patterns used for modeling internal self-similarities. In fact, the matching performance of DASC may fall well short of CNN-based descriptors on images that share the same modality. The other major shortcoming is that the DASC descriptor does not provide the flexibility to deal with non-rigid deformations, which leads to lower robustness in matching.

In this paper, we introduce a novel descriptor, called deep self-convolutional activations (DeSCA), that overcomes the shortcomings of DASC while providing dense cross-modal correspondences. This work is motivated by the observation that local self-similarity can be formulated in a deep convolutional architecture to enhance discriminative power and gain robustness to non-rigid deformations. Unlike the DASC descriptor that selects patch pairs within a support window and calculates the self-similarity between them, we compute self-convolutional activations that more comprehensively encode the intrinsic structure by calculating the self-similarity between randomly selected patches and all of the patches within the support window. These self-convolutional responses are aggregated through spatial pyramid pooling in a circular configuration, which yields a representation less sensitive to non-rigid image deformations than the fixed patch selection strategy used in DASC. To further enhance the discriminative power and robustness, we build hierarchical self-convolutional layers resembling a deep architecture used in CNN, together with nonlinear and normalization layers. For efficient computation of DeSCA over densely sampled pixels, we calculate the self-convolutional activations through fast edge-aware filtering.

DeSCA resembles a CNN in its deep, multi-layer, and convolutional structure. In contrast to existing CNN-based descriptors, DeSCA requires no training data for learning convolutional kernels, since the convolutions are defined as the local self-similarity between pairs of image patches, which yields its robustness to cross-modal imaging. Fig. 1 illustrates the robustness of DeSCA for image pairs across non-rigid deformations and illumination changes. In the experimental results, we show that DeSCA outperforms existing area-based and feature-based descriptors on various benchmarks.

2 Related Work

2.0.1 Feature Descriptors

Conventional gradient-based descriptors, such as SIFT [14] and DAISY [23], as well as intensity comparison-based binary descriptors, such as BRIEF [24]

, have shown limited performance in dense correspondence estimation between cross-modal image pairs. Besides these handcrafted features, several attempts have been made using machine learning algorithms to derive features from large-scale datasets

[15, 25]. A few of these methods use deep convolutional neural networks (CNNs) [26], which have revolutionized image-level classification, to learn discriminative descriptors for local patches. For designing explicit feature descriptors based on a CNN architecture, immediate activations are extracted as the descriptor [15, 16, 17, 18, 19], and have been shown to be effective for this patch-level task. However, even though CNN-based descriptors encode a discriminative structure with a deep architecture, they have inherent limitations in cross-modal image correspondence because they are derived from convolutional layers using shared patches or volumes [19, 20]. Furthermore, they cannot in practice provide dense descriptors in the image domain due to their prohibitively high computational complexity.

To estimate cross-modal correspondences, variants of the SIFT descriptor have been developed [27], but these gradient-based descriptors maintain an inherent limitation similar to SIFT in dealing with image gradients that vary differently between modalities. For illumination invariant correspondences, Wang et al. proposed the local intensity order pattern (LIOP) descriptor [28], but severe radiometric variations may often alter the relative order of pixel intensities. Simo-Serra et al. proposed the deformation and light invariant (DaLI) descriptor [29] to provide high resilience to non-rigid image transformations and illumination changes, but it cannot provide dense descriptors in the image domain due to its high computational time.

Schechtman and Irani introduced the LSS descriptor [22] for the purpose of template matching, and achieved impressive results in object detection and retrieval. By employing LSS, many approaches have tried to solve for cross-modal correspondences [30, 31, 32]. However, none of these approaches scale well to dense matching in cross-modal images due to low discriminative power and high complexity. Inspired by LSS, Kim et al. recently proposed the DASC descriptor to estimate cross-modal dense correspondences [10]. Though it can provide satisfactory performance, it is not able to handle non-rigid deformations and has limited discriminative power due to its fixed patch pooling scheme.

2.0.2 Area-Based Similarity Measures

A popular measure for registration of cross-modal medical images is mutual information (MI) [33]

, based on the entropy of the joint probability distribution function, but it provides reliable performance only for variations undergoing a global transformation

[34]. Although cross-correlation based methods such as adaptive normalized cross-correlation (ANCC) [35] produce satisfactory results for locally linear variations, they are less effective against more substantial modality variations. Robust selective normalized cross-correlation (RSNCC) [9] was proposed for dense alignment between cross-modal images, but as an intensity based measure it can still be sensitive to cross-modal variations. Recently, DeepMatching [36] was proposed to compute dense correspondences by employing a hierarchical pooling scheme like CNN, but it is not designed to handle cross-modal matching.

Figure 2: Illustration of (a) LSS [22]

using center-biased dense max pooling, (b) DASC

[10] using patch-wise receptive field pooling, and (c) our DeSCA. Boxes, formed by solid and dotted lines, depict source and target patches. DeSCA incorporates a circular spatial pyramid pooling on hierarchical self-convolutional activations.

3 Background

Let us define an image as for pixel , where is a discrete image domain. Given the image , a dense descriptor with a feature dimension of is defined on a local support window of size .

Unlike conventional descriptors, relying on common visual properties across images such as color and gradient, LSS-based descriptors provide robustness to different imaging modalities since internal self-similarities are preserved across cross-modal image pairs [22, 10]. As shown in Fig. 2(a), the LSS discretizes the correlation surface on a log-polar grid, generates a set of bins, and then stores the maximum correlation value of each bin. Formally, it generates an

feature vector

for , with computed as


where log-polar bins are defined as with a log radius for and a quantized angle for with and . is a correlation surface between a patch and of size , computed using sum of square differences. Each pair of and is associated with a unique index . Though LSS provides robustness to modality variations, its significant computation does not scale well for estimating dense correspondences in cross-modal images.

Inspired by the LSS [22], the DASC [10] encodes the similarity between patch-wise receptive fields sampled from a log-polar circular point set as shown in Fig. 2(b). It is defined such that , which has a higher density of points near a center pixel, similar to DAISY [23]. The DASC is encoded with a set of similarities between patch pairs of sampling patterns selected from such that for :


where and are the selected sampling pattern from at pixel . The patch-wise similarity is computed with an exponential function with a bandwidth of , which has been widely used for robust estimation [37]. is computed using an adaptive self-correlation measure. While the DASC descriptor has shown satisfactory results for cross-modal dense correspondence [10], its randomized receptive field pooling has limited descriptive power and does not accommodate non-rigid deformations.

4 The DeSCA Descriptor

4.1 Motivation and Overview

Inspired by DASC [10], our DeSCA descriptor also measures an adaptive self-correlation between two patches. We, however, adopt a different strategy for selecting patch pairs, and build self-convolutional activations that more comprehensively encode self-similar structure to improve the discriminative power and the robustness to non-rigid image deformation (Sec. 4.2). Motivated by the deep architecture of CNN-based descriptors [19], we further build hierarchical self-convolution activations to enhance the robustness of the DeSCA descriptor (Sec. 4.4). Densely sampled descriptors are efficiently computed over an entire image using a method based on fast edge-aware filtering (Sec. 4.3). Fig. 2(c) illustrates the DeSCA descriptor, which incorporates a circular spatial pyramid pooling on hierarchical self-convolutional activations.

Figure 3: Computation of single self-convolutional activation (SiSCA). (a) A local support window of size with random samples. (b) For each random patch, a self-convolutional surface is computed using an adaptive self-correlation measure. (c) A self-convolutional activation is then obtained through circular spatial pyramid pooling (C-SPP). (d) The activation from C-SPP is concatenated as 1-D feature vector.
Figure 4: Examples of the circular spatial pyramidal bins . The total number of bins is , where represents the pyramid level.

4.2 SiSCA: Single Self-Convolutional Activation

To simultaneously leverage the benefits of self-similarity in DASC [10] and the deep convolutional archiecture of CNNs while overcoming the limitations of each method, our approach builds self-convolutional activations. Unlike DASC [10], the feature response is obtained through circular spatial pyramid pooling. We start by describing a single-layer version of DeSCA, which we denote as SiSCA.

4.2.1 Self-Convolutions

To build a self-convolutional activation, we randomly select points from a log-polar circular point set defined within a local support window . We convolve a patch centered at the -th point with all patches , which is defined for and as Fig. 3(b). Similar to DASC [10], the similarity between patch pairs is measured using an adaptive self-correlation, which is known to be effective in addressing cross-modality. With omitted for simplicity, is computed as follows:


for and . and represent weighted averages of and . Similar to DASC [10], the weight represents how similar two pixels and are, and is normalized, i.e., . It may be defined using any form of edge-aware weighting [38, 39].

4.2.2 Circular Spatial Pyramid Pooling

To encode the feature responses on the self-convolutional surface, we propose a circular spatial pyramid pooling (C-SPP) scheme, which pools the responses within each hierarchical spatial bin, similar to a spatial pyramid pooling (SPP) [20, 40, 41] but in a circular configuration. Note that many existing descriptors also adopt a circular pooling scheme thanks to its robustness based on a higher pixel density near a central pixel [22, 23, 24]. We further encodes more structure information with a C-SPP.

Figure 5: Efficient computation of self-convolutional activations on the image. (a) An image with a doubled support window and random samples. (b) 1-D vectorial self-convolutional surface. (c) Self-convolutional activations. (d) Activations after C-SPP. With an efficient edge-aware filtering and activation reformulation, self-convolutonal activations are computed efficiently in a dense manner.

The circular pyramidal bins are defined from log-polar circular bins , where indexes all pyramidal level and all bins in each level as in Fig. 4. The circular pyramidal bin at the top of pyramid, i.e., , first encompasses all of bins . At the second level, i.e., , it is defined by dividing into quadrants. For further lower pyramid levels, i.e., , the circular pyramidal bins are defined differently according to whether

is odd or even. For an odd

, the bins are defined by dividing bins in upper level into two parts along the radius. For an even , they are defined by dividing bins in upper level into two parts with respect to the angle. The set of all circular pyramidal bins is denoted such that for , where the number of circular spatial pyramid bins is defined as .

As illustrated in Fig. 3(c), the feature responses are finally max-pooled on the circular pyramidal bins of each self-convolutional surface , yielding a feature response


This pooling is repeated for all , yielding accumulated activations where indexes for all and .

Interestingly, LSS [22] also uses the max pooling strategy to mitigate the effects of non-rigid image deformation. However, max pooling in the 2-D self-correlation surface of LSS [22] loses fine-scale matching details as reported in [10]. By contrast, DeSCA employs circular spatial pyramid pooling in the 3-D self-correlation surface that provides a more discriminative representation of self-similarities, thus maintaining fine-scale matching details as well as providing robustness to non-rigid image deformations.

4.2.3 Non-linear Gating and Nomalization

The final feature responses are passed through a non-linear and normalization layer to mitigate the effects of outliers. With accumulated activations

, the single self-convolution activiation (SiSCA) descriptor is computed for through a non-linear gating layer:


where is a Gaussian kernel bandwidth. The size of features obtained from the SiSCA becomes . Finally, for each pixel is normalized with an L-2 norm for all .

Figure 6: Visualization of SiSCA and DeSCA descriptor. Our architecture consists of a hierarchical self-convolutional layer, circular spatial pyramid pooling layer, non-linear gating layer, and normalization layer.

4.3 Efficient Computation for Dense Description

The most time-consuming part of DeSCA is in constructing self-convolutional surfaces for and , where computations of (3) are needed for each pixel . Straightforward computation of a weighted summation using in (3) would require considerable processing with a computational complexity of , where represents the image size (height and width ). To expedite processing, we utilize fast edge-aware filtering [38, 39] and propose a pre-computation scheme for convolutional surfaces.

Similar to DASC [10], we compute efficiently by first rearranging the sampling patterns into reference-biased pairs . can then be expressed as


where , , and . can be efficiently computed using any form of fast edge-aware filter [38, 39] with the complexity of . is then simply obtained from by re-indexing sampling patterns.

Though we remove the computational dependency on patch size , computations of (6) are still needed to obtain the self-convolutional activations, where many sampling pairs are repeated. To avoid such redundancy, we first compute self-convolutional activation for with a doubled local support window of size . A doubled local support window is used because (6) is computed with patch and the minimum support window size for to cover all samples within is as shown in Fig. 5(b). After the self-convolutional activation for is computed once over the image domain, can be extracted through an index mapping process, where the indexes for are estimated from .


Algorithm 1: Deep Self-Convolutional Activations (DeSCA) Descriptor


Input : image , random samples .
Output : DeSCA descriptor .
Compute for a doubled support window by using (6).
Estimate from according to the index mapping process.
for do hierarchical aggregation using average pooling
Determine a circular pyramidal point .
Compute by using an average pooling for on .
end for
for do hierarchical spatial aggregation using C-SPP
Determine a circular pyramidal bin .
Compute and by using C-SPP on each from and , respectively.
end for
Build hierarchical self-convolutional activations from and .
Compute a nonlinear response (5), followed by L-2 normalization.
Build a DeSCA descriptor .


4.4 DeSCA: Deep Self-Convolutional Activations

So far, we have discussed how to build the self-convolutional activation on a single level. In this section, we extend this idea by encoding self-similar structures at multiple levels in a manner similar to a deep architecture widely adopted in the CNNs [26]. DeSCA is defined similarly to SiSCA, except that an average pooling is executed before C-SPP (see Fig. 6). With self-convolutional activations, we perform the average pooling on circular pyramidal point sets.

In comparison to the self-convolutions just from a single patch, the spatial aggregation of self-convolutional responses is clearly more robust, and it requires only marginal computational overhead over SiSCA. The strength of such a hierarchical aggregation has also been shown in [36]. Compared to using only last CNN layer activations, we use all intermediate activations from hierarchical average pooling, which yields better cross-modal matching quality.

Figure 7: Component analysis of DeSCA on the Middlebury benchmark [42] for varying parameter values, such as (a) support window size , (b) number of log-polar circular point , (c) number of random samples , and (d) level of circular spatial pyramid . In each experiment, all other parameters are fixed to the initial values.

To build the hierarchical self-convolutional volume using an average pooling, we first define the circular pyramidal point sets from log-polar circular point sets , where associates all pyramidal level and all points in each level . In the average pooling, the circular pyramidal bins used in C-SPP is re-used such that , thus . Deep self-convolutional activations are defined by aggregating for all patches determined on each such that


which is defined for all , and is the number of patches within . The hierarchical activations are sequentially aggregated using average pooling from bottom to top of circular pyramidal point set . After computing hierarchical self-convolutional aggregations, similar to SiSCA, the DeSCA employs C-SPP, non-linear, and normalization layer presented in Sec. 4.2. Hierarchical self-convolutional activation is computed using the C-SPP such that


Accumulated self-convolutional activations are built from in (4) and in (8) such that . Our DeSCA descriptor is then passed through a non-linear layer. is built for with . Finally, for each pixel is normalized with an L-2 norm for all .

5 Experimental Results and Discussion

5.1 Experimental Settings

In our experiments, the DeSCA descriptor was implemented with the following fixed parameter settings for all datasets: , and . We chose the guided filter (GF) for edge-aware filtering in (6), with a smoothness parameter of . We implemented the DeSCA descriptor in C++ on an Intel Core i7-3770 CPU at 3.40 GHz. We will make our code publicly available. The DeSCA descriptor was compared to other state-of-the-art descriptors (SIFT [14], DAISY [23], BRIEF [24], LIOP [28], DaLI [29], LSS [22], and DASC [10]), as well as area-based approaches (ANCC [35] and RSNCC [9]). Furthermore, to evaluate the performance gain with a deep architecture, we compared SiSCA and DeSCA.

(a) image 1
(b) image 2
(c) ANCC
(d) SIFT
(e) LSS
(f) DASC
(g) DeSCA
Figure 8: Comparison of disparity estimations for Moebius and Dolls image pairs across illumination combination ‘1/3’ and exposure combination ‘0/2’, respectively. Compared to other methods, DeSCA estimates more accurate and edge-preserved disparity maps.
Figure 9: Average bad-pixel error rate on the Middlebury benchmark [42] with illumination and exposure variations. Optimization was done by GC in (a), (b), and by WTA in (c), (d). DeSCA descriptor shows the best performance with the lowest error rate.

5.2 Parameter Evaluation

The matching performance of DeSCA is exhibited in Fig. 7 for varying parameter values, including support window size , number of log-polar circular points , number of random samples , and levels of the circular spatial pyramid . Note that . Especially, Fig. 7(c), (d) prove the effectiveness of self-convolutional activations and deep architectures of DeSCA. For a quantitative analysis, we measured the average bad-pixel error rate on the Middlebury benchmark [42]. With a larger support window , the matching quality improves rapidly until about . influences the performance of circular pooling, which is found to plateau at . Using a larger number of random samples yields better performance since the descriptor encodes more information. The level of circular spatial pyramid also affects the amount of encoding in DeSCA. Based on these experiments, we set and in consideration of efficiency and robustness.

(a) image 1
(b) image 2
(d) LSS
(e) DASC
(f) SiSCA
(g) DeSCA
Figure 10: Dense correspondence evaluations for (from top to bottom) RGB-NIR, flash-noflash, different exposures, and blurred-sharp images. Compared to others, DeSCA estimates more reliable dense correspondences for challenging cross-modal pairs.


Methods WTA optimization SF optimization [11]
RGB-NIR flash-noflash diff. expo. blur-sharp RGB-NIR flash-noflash diff. expo. blur-sharp
ANCC [35] 23.21 20.42 25.19 26.14 18.45 14.14 11.96 19.24
RSNCC [9] 27.51 25.12 18.21 27.91 13.41 15.87 9.15 18.21
SIFT [14] 24.11 18.72 19.42 27.18 18.51 11.06 14.87 20.78
DAISY [23] 27.61 26.30 20.72 27.41 20.42 10.84 12.71 22.91
BRIEF [24] 29.14 18.29 17.13 26.43 17.54 9.21 9.54 19.72
LSS [22] 27.82 19.18 18.21 26.14 16.14 11.88 9.11 18.51
LIOP [28] 24.42 16.42 14.22 20.42 15.32 11.42 10.22 17.12
DASC [10] 14.51 13.24 10.32 16.42 13.42 7.11 7.21 11.21
SiSCA 10.12 10.12 8.22 14.22 9.12 6.18 5.22 9.12
DeSCA 8.12 8.22 6.72 13.28 7.62 5.12 4.72 8.01


Table 1: Comparison of quantitative evaluation on cross-modal benchmark.

5.3 Middlebury Stereo Benchmark

We evaluated DeSCA on the Middlebury stereo benchmark [42], which contains illumination and exposure variations. In the experiments, the illumination (exposure) combination ‘1/3’ indicates that two images were captured under the and illumination (exposure) conditions. For a quantitative evaluation, we measured the bad-pixel error rate in non-occluded areas of disparity maps [42].

Fig. 8 shows the disparity maps estimated under severe illumination and exposure variations with winner-takes-all (WTA) optimization. Fig. 9 displays the average bad-pixel error rates of disparity maps obtained under illumination or exposure variations, with graph-cut (GC) [43] and WTA optimization. Area-based approaches (ANCC [35] and RSNCC [9]) are sensitive to severe radiometric variations, especially when local variations occur frequently. Feature descriptor-based methods (SIFT [14], DAISY [23], BRIEF [24], LSS [22], and DASC [10]) perform better than the area-based approaches, but they also provide limited performance. Our DeSCA descriptor achieves the best results both quantitatively and qualitatively. Compared to SiSCA descriptor, the performance of DeSCA descriptor is highly improved, where the performance benefits of the deep architecture are apparent.

(a) image 1
(b) image 2
(e) LSS
(f) DaLI
(g) DeSCA
Figure 11: Dense correspondence comparisons for images with different illumination conditions and non-rigid image deformations [29]. Compared to other approaches, DeSCA provides more accurate dense correspondence estimates with reduced artifacts.


Methods def. illum. def./ illum. aver.
SIFT [14] 45.15 40.81 47.51 44.49
DAISY [23] 43.98 42.72 43.42 43.37
BRIEF [24] 41.51 37.14 41.35 40
LSS [22] 40.81 39.54 40.11 40.12
LIOP [28] 28.72 31.72 30.21 30.22
DaLI [29] 27.12 27.31 27.99 27.47
DASC [10] 26.21 24.83 27.51 26.18
SiSCA 23.42 22.21 24.17 23.27
DeSCA 20.14 20.72 21.87 20.91


Table 2: Average error rates on the DaLI benchmark.

5.4 Cross-modal and Cross-spectral Benchmark

We evaluated DeSCA on a cross-modal and cross-spectral benchmark [10] containing various kinds of image pairs, namely RGB-NIR, different exposures, flash-noflash, and blurred-sharp. Optimization for all descriptors and similarity measures was done using WTA and SIFT flow (SF) with hierarchical dual-layer belief propagation [11], for which the code is publicly available. Sparse ground truths for those images are used for error measurement as done in [10].

Fig. 10 provides a qualitative comparison of the DeSCA descriptor to other state-of-the-art approaches. As already described in the literature [9], gradient-based approaches such as SIFT [14] and DAISY [23] have shown limited performance for RGB-NIR pairs where gradient reversals and inversions frequently appear. BRIEF [24] cannot deal with noisy regions and modality-based appearance differences since it is formulated on pixel differences only. Unlike these approaches, LSS [22] and DASC [10] consider local self-similarities, but LSS is lacking in discriminative power for dense matching. DASC also exhibits limited performance. Compared to those methods, the DeSCA displays better correspondence estimation. We also performed a quantitative evaluation with results listed in Table 1, which also clearly demonstrates the effectiveness of DeSCA.

5.5 DaLI Benchmark

We also evaluated DeSCA on a recent, publicly available dataset featuring challenging non-rigid deformations and very severe illumination changes [29]. Fig. 11 presents dense correspondence estimates for this benchmark [29]. A quantitative evaluation is given in Table 2 using ground truth feature points sparsely extracted for each image, although DeSCA is designed to estimate dense correspondences. As expected, conventional gradient-based and intensity comparison-based feature descriptors, including SIFT [14], DAISY [23], and BRIEF [24], do not provide reliable correspondence performance. LSS [22] and DASC [10] exhibit relatively high performance for illumination changes, but are limited on non-rigid deformations. LIOP [28] provides robustness to radiometric variations, but is sensitive to non-rigid deformations. Although DaLI [29] provides robust correspondences, it requires considerable computation for dense matching. DeSCA offers greater discriminative power as well as more robustness to non-rigid deformations in comparison to the state-of-the-art cross-modality descriptors.




Table 3: Computation speed of DeSCA and other state-of-the-art local and global descriptors. The brute-force and efficient implementations of DeSCA are denoted by * and †, respectively.

5.6 Computational Speed

In Table 3, we compared the computational speed of DeSCA to state-of-the-art local descriptor, namely DaLI [29], and dense descriptors, namely DAISY [23], LSS [22], and DASC [10]. Even though DeSCA needs more computational time compared to some previous dense descriptors, it provides significantly improved matching performance as described previously.

6 Conclusion

The deep self-convolutional activations (DeSCA) descriptor was proposed for establishing dense correspondences between images taken under different imaging modalities. Its high performance in comparison to state-of-the-art cross-modality descriptors can be attributed to its greater robustness to non-rigid deformations because of its effective pooling scheme, and more importantly its heightened discriminative power from a more comprehensive representation of self-similar structure and its formulation in a deep architecture. DeSCA was validated on an extensive set of experiments that cover a broad range of cross-modal differences. In future work, thanks to the robustness to non-rigid deformations and high discriminative power, DeSCA can potentially benefit object detection and semantic segmentation.


  • [1] Brown, M., Susstrunk, S.: Multispectral sift for scene category recognition. In: CVPR (2011)
  • [2] Yan, Q., Shen, X., Xu, L., Zhuo, S.: Cross-field joint image restoration via scale map. In: ICCV (2013)
  • [3] Hwang, S., Park, J., Kim, N., Choi, Y., Kweon, I.: Multispectral pedestrian detection: Benchmark dataset and baseline. In: CVPR (2015)
  • [4] Krishnan, D., Fergus, R.: Dark flash photography. In: SIGGRAPH (2009)
  • [5] Sen, P., Kalantari, N.K., Yaesoubi, M., Darabi, S., Goldman, D.B., Shechtman, E.: Robust patch-based hdr reconstruction of dynamic scenes. In: SIGGRAPH (2012)
  • [6] HaCohen, Y., Shechtman, E., Lishchinski, E.: Deblurring by example using dense correspondence. In: ICCV (2013)
  • [7] Lee, H., Lee, K.: Dense 3d reconstruction from severely blurred images using a single moving camera. In: CVPR (2013)
  • [8] Petschnigg, G., Agrawals, M., Hoppe, H.: Digital photography with flash and no-flash iimage pairs. In: SIGGRAPH (2004)
  • [9] Shen, X., Xu, L., Zhang, Q., Jia, J.: Multi-modal and multi-spectral registration for natural images. In: ECCV (2014)
  • [10] Kim, S., Min, D., Ham, B., Ryu, S., Do, M.N., Sohn, K.: Dasc: Dense adaptive self-correlation descriptor for multi-modal and multi-spectral correspondence. In: CVPR (2015)
  • [11] Liu, C., Yuen, J., Torralba, A.: Sift flow: Dense correspondence across scenes and its applications. IEEE Trans. PAMI 33(5) (2011) 815–830
  • [12] Kim, J., Liu, C., Sha, F., Grauman, K.: Deformable spatial pyramid matching for fast dense correspondences. In: CVPR (2013)
  • [13] Pinggera, P., Breckon, T., Bischof, H.: On cross-spectral stereo matching using dense gradient features. In: BMVC (2012)
  • [14] Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV 60(2) (2004) 91–110
  • [15] Simonyan, K., Vedaldi, A., Zisserman, A.: Learning local feature descriptors using convex optimisation. IEEE Trans. PAMI 36(8) (2014) 1573–1585
  • [16] Gong, Y., Wang, L., Guo, R., Lazebnik, S.: Multi-scale orderless pooling of deep convolutional acitivation features. In: ECCV (2014)
  • [17] Fischer, P., Dosovitskiy, A., Brox, T.: Descriptor matching with convolutional neural networks: A comparison to sift. arXiv:1405.5769 (2014)
  • [18] Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A deep convolutional activation feature for generic visual recognition. In: ICML (2014)
  • [19] Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., Moreno-Noguer, F.: Discriminative learning of deep convolutional feature point descriptors. In: ICCV (2015)
  • [20] Dong, J., Soatto, S.: Domain-size pooling in local descriptors: Dsp-sift. In: CVPR (2015)
  • [21] Long, J., Shelhamer, E., Darrell, T.: Fully conovlutional networks for semantic segmentation. In: CVPR (2015)
  • [22] Schechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: CVPR (2007)
  • [23] Tola, E., Lepetit, V., Fua, P.: Daisy: An efficient dense descriptor applied to wide-baseline stereo. IEEE Trans. PAMI 32(5) (2010) 815–830
  • [24] Calonder, M.: Brief : Computing a local binary descriptor very fast. IEEE Trans. PAMI 34(7) (2011) 1281–1298
  • [25] Trzcinski, T., Christoudias, M., Lepetit, V.: Learning image descriptor with boosting. IEEE Trans. PAMI 37(3) (2015) 597–610
  • [26] Alex, K., Ilya, S., Geoffrey, E.H.: Imagenet classification with deep convolutional neural networks. In: NIPS (2012)
  • [27] Saleem, S., Sablatnig, R.: A robust sift descriptor for multispectral images. IEEE SPL 21(4) (2014) 400–403
  • [28] Wang, Z., Fan, B., Wu, F.: Local intensity order pattern for feature description. In: ICCV (2011)
  • [29] Simo-Serra, E., Torras, C., Moreno-Noguer, F.: Dali: Deformation and light invariant descriptor. IJCV 115(2) (2015) 136–154
  • [30] Heinrich, P., Jenkinson, M., Bhushan, M., Matin, T., Gleeson, V., Brady, S., Schnabel, A.: Mind: Modality indepdent neighbourhood descriptor for multi-modal deformable registration. MIA 16(3) (2012) 1423–1435
  • [31] Torabi, A., Bilodeau, G.: Local self-similarity-based registration of human rois in pairs of stereo thermal-visible videos. PR 46(2) (2013) 578–589
  • [32] Ye, Y., Shan, J.: A local descriptor based registration method for multispectral remote sensing images with non-linear intensity differences. JPRS 90(7) (2014) 83–95
  • [33] Pluim, J., Maintz, J., Viergever, M.: Mutual information based registration of medical images: A survey. IEEE Trans. MI 22(8) (2003) 986–1004
  • [34] Heo, Y., Lee, K., Lee, S.: Joint depth map and color consistency estimation for stereo images with different illuminations and cameras. IEEE Trans. PAMI 35(5) (2013) 1094–1106
  • [35] Heo, Y., Lee, K., Lee, S.: Robust stereo matching using adaptive normalized cross-correlation. IEEE Trans. PAMI 33(4) (2011) 807–822
  • [36] Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: Deepflow: Large displacement optical flow with deep matching. In: ICCV (2013)
  • [37] Black, M.J., Sapiro, G., Marimont, D.H., Heeger, D.: Robust anisotropic diffusion. IEEE Trans. IP 7(3) (1998) 421–432
  • [38] Gastal, E., Oliveira, M.: Domain transform for edge-aware image and video processing. In: SIGGRAPH (2011)
  • [39] He, K., Sun, J., Tang, X.: Guided image filtering. IEEE Trans. PAMI 35(6) (2013) 1397–1409
  • [40] Seidenari, L., Serra, G., Bagdanov, A.D., Bimbo, A.D.: Local pyramidal descriptors for image recognition. IEEE Trans. PAMI 36(5) (2014) 1033–1040
  • [41] He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. PAMI 37(9) (2015) 1904–1916
  • [42] online.:
  • [43] Boykov, Y., Yeksler, O., Zabih, R.: Fast approximation enermgy minimization via graph cuts. IEEE Trans. PAMI 23(11) (2001) 1222–1239