1 Introduction
In many computer vision and computational photography applications, images captured under different imaging modalities are used to supplement the data provided in color images. Typical examples of other imaging modalities include near-infrared
[1, 2, 3] and dark flash [4] photography. More broadly, photos taken under different imaging conditions, such as different exposure settings [5], blur levels [6, 7], and illumination [8], can also be considered cross-modal [9, 10]. Establishing dense correspondences between cross-modal image pairs is essential for combining their disparate information. Although powerful global optimizers may help to improve the accuracy of correspondence estimation to some extent
[11, 12], they face inherent limitations without the help of suitable matching descriptors [13]. The most popular local descriptor is the scale-invariant feature transform (SIFT) [14], which provides relatively good matching performance when there are small photometric variations. However, conventional descriptors such as SIFT often fail to capture reliable matching evidence in cross-modal image pairs due to their different visual properties [9, 10]. Recently, convolutional neural network (CNN) based features [15, 16, 17, 18, 19] have emerged as a robust alternative with high discriminative power. However, CNN-based descriptors cannot satisfactorily deal with severe cross-modal appearance differences, since they use shared convolutional kernels across images, which leads to inconsistent responses similar to conventional descriptors [19, 20]. Furthermore, they do not scale well for dense correspondence estimation due to their high computational complexity. Though a recent work [21] proposes an efficient method that extracts dense outputs through deep CNNs, it does not extract dense CNN features for all pixels individually. More seriously, such methods are usually designed to perform a specific task only, e.g., semantic segmentation, rather than to provide a general-purpose descriptor like ours.
To address the problem of cross-modal appearance changes, feature descriptors have been proposed based on local self-similarity (LSS) [22], which is motivated by the notion that the geometric layout of local internal self-similarities is relatively insensitive to imaging properties. The state-of-the-art descriptor for cross-modal dense correspondence, called dense adaptive self-correlation (DASC) [10], makes use of LSS and has demonstrated high accuracy and speed on cross-modal image pairs. However, DASC suffers from two significant shortcomings. One is its limited discriminative power, due to the limited set of patch sampling patterns used for modeling internal self-similarities. In fact, the matching performance of DASC may fall well short of CNN-based descriptors on images that share the same modality. The other major shortcoming is that the DASC descriptor does not provide the flexibility to deal with non-rigid deformations, which lowers its robustness in matching.
In this paper, we introduce a novel descriptor, called deep self-convolutional activations (DeSCA), that overcomes the shortcomings of DASC while providing dense cross-modal correspondences. This work is motivated by the observation that local self-similarity can be formulated in a deep convolutional architecture to enhance discriminative power and gain robustness to non-rigid deformations. Unlike the DASC descriptor, which selects patch pairs within a support window and calculates the self-similarity between them, we compute self-convolutional activations that more comprehensively encode the intrinsic structure by calculating the self-similarity between randomly selected patches and all of the patches within the support window. These self-convolutional responses are aggregated through spatial pyramid pooling in a circular configuration, which yields a representation less sensitive to non-rigid image deformations than the fixed patch selection strategy used in DASC. To further enhance the discriminative power and robustness, we build hierarchical self-convolutional layers resembling the deep architecture used in CNNs, together with nonlinear and normalization layers. For efficient computation of DeSCA over densely sampled pixels, we calculate the self-convolutional activations through fast edge-aware filtering.
DeSCA resembles a CNN in its deep, multi-layer, convolutional structure. In contrast to existing CNN-based descriptors, DeSCA requires no training data for learning convolutional kernels, since the convolutions are defined as the local self-similarity between pairs of image patches, which yields robustness to cross-modal imaging. Fig. 1 illustrates the robustness of DeSCA for image pairs across non-rigid deformations and illumination changes. In the experimental results, we show that DeSCA outperforms existing area-based and feature-based descriptors on various benchmarks.
2 Related Work
2.0.1 Feature Descriptors
Conventional gradient-based descriptors, such as SIFT [14] and DAISY [23], as well as intensity comparison-based binary descriptors, such as BRIEF [24], have shown limited performance in dense correspondence estimation between cross-modal image pairs. Besides these handcrafted features, several attempts have been made to derive features from large-scale datasets using machine learning algorithms
[15, 25]. A few of these methods use deep convolutional neural networks (CNNs) [26], which have revolutionized image-level classification, to learn discriminative descriptors for local patches. For designing explicit feature descriptors based on a CNN architecture, intermediate activations are extracted as the descriptor [15, 16, 17, 18, 19], and have been shown to be effective for this patch-level task. However, even though CNN-based descriptors encode a discriminative structure with a deep architecture, they have inherent limitations in cross-modal image correspondence because they are derived from convolutional layers using shared patches or volumes [19, 20]. Furthermore, they cannot in practice provide dense descriptors in the image domain due to their prohibitively high computational complexity. To estimate cross-modal correspondences, variants of the SIFT descriptor have been developed [27], but these gradient-based descriptors retain an inherent limitation similar to SIFT in dealing with image gradients that vary differently between modalities. For illumination-invariant correspondences, Wang et al. proposed the local intensity order pattern (LIOP) descriptor [28], but severe radiometric variations may often alter the relative order of pixel intensities. Simo-Serra et al. proposed the deformation and light invariant (DaLI) descriptor [29] to provide high resilience to non-rigid image transformations and illumination changes, but it cannot provide dense descriptors in the image domain due to its high computational cost.
Shechtman and Irani introduced the LSS descriptor [22] for the purpose of template matching, and achieved impressive results in object detection and retrieval. By employing LSS, many approaches have tried to solve for cross-modal correspondences [30, 31, 32]. However, none of these approaches scale well to dense matching in cross-modal images due to low discriminative power and high complexity. Inspired by LSS, Kim et al. recently proposed the DASC descriptor to estimate cross-modal dense correspondences [10]. Though it provides satisfactory performance, it is not able to handle non-rigid deformations and has limited discriminative power due to its fixed patch pooling scheme.
2.0.2 Area-Based Similarity Measures
A popular measure for registration of cross-modal medical images is mutual information (MI) [33], based on the entropy of the joint probability distribution function, but it provides reliable performance only for variations undergoing a global transformation
[34]. Although cross-correlation based methods such as adaptive normalized cross-correlation (ANCC) [35] produce satisfactory results for locally linear variations, they are less effective against more substantial modality variations. Robust selective normalized cross-correlation (RSNCC) [9] was proposed for dense alignment between cross-modal images, but as an intensity-based measure it can still be sensitive to cross-modal variations. Recently, DeepMatching [36] was proposed to compute dense correspondences by employing a hierarchical pooling scheme like CNNs, but it is not designed to handle cross-modal matching.

[Fig. 2: (a) LSS using center-biased dense max pooling, (b) DASC [10] using patch-wise receptive field pooling, and (c) our DeSCA. Boxes, formed by solid and dotted lines, depict source and target patches. DeSCA incorporates a circular spatial pyramid pooling on hierarchical self-convolutional activations.]

3 Background
Let us define an image as $f_i : \mathcal{I} \to \mathbb{R}$ for pixel $i$, where $\mathcal{I} \subset \mathbb{N}^{2}$ is a discrete image domain. Given the image $f_i$, a dense descriptor $\mathcal{D}_i \in \mathbb{R}^{L}$ with a feature dimension of $L$ is defined on a local support window $\mathcal{R}_i$ of size $M_{\mathcal{R}}$.
Unlike conventional descriptors, which rely on common visual properties across images such as color and gradient, LSS-based descriptors provide robustness to different imaging modalities since internal self-similarities are preserved across cross-modal image pairs [22, 10]. As shown in Fig. 2(a), the LSS discretizes the correlation surface on a log-polar grid, generates a set of bins, and then stores the maximum correlation value of each bin. Formally, it generates an $N^{\mathrm{LSS}}$-dimensional feature vector $\mathcal{D}^{\mathrm{LSS}}_i = \bigcup_l d_{i,l}$ for $l \in \{1, \dots, N^{\mathrm{LSS}}\}$, with $d_{i,l}$ computed as

$$d_{i,l} = \max_{j \in \mathcal{B}_i(\rho_k, \theta_m)} \exp\left(-\mathcal{S}_i(j)/\sigma_c\right), \qquad (1)$$

where log-polar bins are defined as $\mathcal{B}_i(\rho_k, \theta_m)$ with a log radius $\rho_k$ for $k \in \{1, \dots, N_{\rho}\}$ and a quantized angle $\theta_m$ for $m \in \{1, \dots, N_{\theta}\}$. $\mathcal{S}_i(j)$ is a correlation surface between a patch $\mathcal{F}_i$ and $\mathcal{F}_j$ of size $M_{\mathcal{F}}$, computed using the sum of squared differences. Each pair of $k$ and $m$ is associated with a unique index $l$. Though LSS provides robustness to modality variations, its significant computation does not scale well for estimating dense correspondences in cross-modal images.
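As a concrete illustration, the LSS pipeline just described (SSD correlation surface, log-polar binning, per-bin max) can be sketched in a few lines of NumPy. The helper below is hypothetical and single-pixel for clarity; the patch size, window size, bin counts, and bandwidth are illustrative defaults, not the values used in [22].

```python
import numpy as np

def lss_descriptor(img, p, patch=3, win=9, n_rho=3, n_theta=8, sigma=0.25):
    """Illustrative single-pixel LSS sketch (hypothetical helper, not the
    authors' code): build an SSD correlation surface around pixel p, bin it
    on a log-polar grid, and keep the maximum similarity per bin."""
    r, h = win // 2, patch // 2
    y, x = p
    center = img[y - h:y + h + 1, x - h:x + h + 1]
    surf = np.zeros((win, win))
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            cand = img[y + dy - h:y + dy + h + 1, x + dx - h:x + dx + h + 1]
            surf[dy + r, dx + r] = np.sum((center - cand) ** 2)  # SSD
    sim = np.exp(-surf / sigma)  # correlation surface in (0, 1]
    desc = np.zeros(n_rho * n_theta)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue  # skip the center patch itself
            rho = min(int(np.log1p(np.hypot(dy, dx)) / np.log1p(r) * n_rho),
                      n_rho - 1)
            theta = int((np.arctan2(dy, dx) + np.pi) / (2 * np.pi) * n_theta) % n_theta
            b = rho * n_theta + theta
            desc[b] = max(desc[b], sim[dy + r, dx + r])  # max per log-polar bin
    return desc
```

Keeping only the per-bin maximum is what makes the representation tolerant to small geometric shifts, at the cost of the fine-scale detail discussed in Sec. 4.2.2.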
Inspired by LSS [22], DASC [10] encodes the similarity between patch-wise receptive fields sampled from a log-polar circular point set, as shown in Fig. 2(b). The point set $\mathcal{P}_i$ has a higher density of points near the center pixel, similar to DAISY [23]. The DASC is encoded with a set of similarities between patch pairs of sampling patterns selected from $\mathcal{P}_i$, such that $\mathcal{D}^{\mathrm{DASC}}_i = \bigcup_l d_{i,l}$ for $l \in \{1, \dots, N^{\mathrm{DASC}}\}$:

$$d_{i,l} = \exp\left(-\left(1 - \mathcal{C}(\mathcal{F}_{s_l}, \mathcal{F}_{t_l})\right)/\sigma_c\right), \qquad (2)$$

where $s_l$ and $t_l$ are the selected sampling patterns from $\mathcal{P}_i$ at pixel $i$. The patch-wise similarity is computed with an exponential function with a bandwidth of $\sigma_c$, which has been widely used for robust estimation [37]. $\mathcal{C}(\mathcal{F}_{s_l}, \mathcal{F}_{t_l})$ is computed using an adaptive self-correlation measure. While the DASC descriptor has shown satisfactory results for cross-modal dense correspondence [10], its randomized receptive field pooling has limited descriptive power and does not accommodate non-rigid deformations.
4 The DeSCA Descriptor
4.1 Motivation and Overview
Inspired by DASC [10], our DeSCA descriptor also measures an adaptive self-correlation between two patches. We, however, adopt a different strategy for selecting patch pairs, and build self-convolutional activations that more comprehensively encode self-similar structure to improve the discriminative power and the robustness to non-rigid image deformation (Sec. 4.2). Motivated by the deep architecture of CNN-based descriptors [19], we further build hierarchical self-convolutional activations to enhance the robustness of the DeSCA descriptor (Sec. 4.4). Densely sampled descriptors are efficiently computed over an entire image using a method based on fast edge-aware filtering (Sec. 4.3). Fig. 2(c) illustrates the DeSCA descriptor, which incorporates circular spatial pyramid pooling on hierarchical self-convolutional activations.
4.2 SiSCA: Single Self-Convolutional Activation
To simultaneously leverage the benefits of the self-similarity in DASC [10] and the deep convolutional architecture of CNNs, while overcoming the limitations of each, our approach builds self-convolutional activations. Unlike DASC [10], the feature response is obtained through circular spatial pyramid pooling. We start by describing a single-layer version of DeSCA, which we denote as SiSCA.
4.2.1 SelfConvolutions
To build a self-convolutional activation, we randomly select $K$ points from a log-polar circular point set $\mathcal{P}_i$ defined within a local support window $\mathcal{R}_i$. We convolve the patch $\mathcal{F}_{s_k}$ centered at the $k$-th point with all patches $\mathcal{F}_j$, which is defined for $j \in \mathcal{R}_i$ and $k \in \{1, \dots, K\}$ as in Fig. 3(b). Similar to DASC [10], the similarity between patch pairs is measured using an adaptive self-correlation, which is known to be effective in addressing cross-modality. With $i$ omitted for simplicity, $\mathcal{C}(\mathcal{F}_j, \mathcal{F}_{s_k})$ is computed as follows:

$$\mathcal{C}(\mathcal{F}_j, \mathcal{F}_{s_k}) = \frac{\sum_{j', s'} \omega_{j,j'}\,(f_{j'} - \mathcal{G}_j)(f_{s'} - \mathcal{G}_{s_k})}{\sqrt{\sum_{j'} \omega_{j,j'}\,(f_{j'} - \mathcal{G}_j)^2}\,\sqrt{\sum_{s'} \omega_{s_k,s'}\,(f_{s'} - \mathcal{G}_{s_k})^2}}, \qquad (3)$$

for corresponding pixels $j' \in \mathcal{F}_j$ and $s' \in \mathcal{F}_{s_k}$. $\mathcal{G}_j$ and $\mathcal{G}_{s_k}$ represent weighted averages of $f_{j'}$ and $f_{s'}$. Similar to DASC [10], the weight $\omega_{j,j'}$ represents how similar two pixels $j$ and $j'$ are, and is normalized, i.e., $\sum_{j'} \omega_{j,j'} = 1$. It may be defined using any form of edge-aware weighting [38, 39].
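The adaptive self-correlation in (3) is, in essence, a weighted normalized cross-correlation between two patches. A minimal sketch, assuming an arbitrary weight map in place of the edge-aware guided-filter weights of [38, 39]:

```python
import numpy as np

def adaptive_self_correlation(f_s, f_t, w):
    """Weighted normalized cross-correlation between two patches, a sketch of
    the adaptive self-correlation in (3). `w` is a weight map of the same
    shape as the patches, normalized to sum to one; the edge-aware weights
    used in the paper are replaced here by an arbitrary kernel."""
    w = w / w.sum()                        # enforce sum-to-one normalization
    mu_s = np.sum(w * f_s)                 # weighted averages
    mu_t = np.sum(w * f_t)
    cov = np.sum(w * (f_s - mu_s) * (f_t - mu_t))
    var_s = np.sum(w * (f_s - mu_s) ** 2)
    var_t = np.sum(w * (f_t - mu_t) ** 2)
    return cov / np.sqrt(var_s * var_t + 1e-12)
```

Because the measure subtracts a weighted mean and divides by weighted standard deviations, it is invariant to local gain and bias changes, e.g., a patch compared against `2 * patch + 3` still correlates to 1, which is the property exploited for cross-modal matching.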
4.2.2 Circular Spatial Pyramid Pooling
To encode the feature responses on the self-convolutional surface, we propose a circular spatial pyramid pooling (CSPP) scheme, which pools the responses within each hierarchical spatial bin, similar to spatial pyramid pooling (SPP) [20, 40, 41] but in a circular configuration. Note that many existing descriptors also adopt a circular pooling scheme thanks to its robustness, derived from the higher pixel density near the central pixel [22, 23, 24]. CSPP further encodes additional structural information.
The circular pyramidal bins $\mathcal{B}_u$ are defined from log-polar circular bins, where $u$ indexes all pyramidal levels $w$ and all bins in each level, as in Fig. 4. The circular pyramidal bin at the top of the pyramid, i.e., $w = 1$, encompasses all of the bins. At the second level, i.e., $w = 2$, it is defined by dividing the circle into quadrants. For lower pyramid levels, i.e., $w > 2$, the circular pyramidal bins are defined differently according to whether $w$ is odd or even. For an odd $w$, the bins are defined by dividing the bins of the upper level into two parts along the radius. For an even $w$, they are defined by dividing the bins of the upper level into two parts with respect to the angle. The set of all circular pyramidal bins is denoted $\mathcal{B} = \bigcup_u \mathcal{B}_u$ for $u \in \{1, \dots, N_{\mathcal{B}}\}$, where $N_{\mathcal{B}}$ is the number of circular spatial pyramid bins. As illustrated in Fig. 3(c), the feature responses are finally max-pooled on the circular pyramidal bins of each self-convolutional surface $\mathcal{C}(\mathcal{F}_j, \mathcal{F}_{s_k})$, yielding a feature response

$$a_{(u,k)} = \max_{j \in \mathcal{B}_u} \mathcal{C}(\mathcal{F}_j, \mathcal{F}_{s_k}). \qquad (4)$$

This pooling is repeated for all $\mathcal{B}_u$, yielding accumulated activations $\mathcal{A} = \bigcup_l a_l$, where $l$ indexes all $u \in \{1, \dots, N_{\mathcal{B}}\}$ and $k \in \{1, \dots, K\}$.
Interestingly, LSS [22] also uses a max pooling strategy to mitigate the effects of non-rigid image deformation. However, max pooling on the 2-D self-correlation surface of LSS [22] loses fine-scale matching details, as reported in [10]. By contrast, DeSCA employs circular spatial pyramid pooling on the 3-D self-correlation surface, which provides a more discriminative representation of self-similarities, thus maintaining fine-scale matching details as well as providing robustness to non-rigid image deformations.
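The pooling step itself reduces to taking a maximum over index sets. A toy sketch, where the pyramidal bin layout is an assumption (two levels: all points, then four angular quadrants) rather than the paper's exact odd/even radius-angle splitting:

```python
import numpy as np

def cspp_max_pool(resp, bins):
    """Circular spatial pyramid max pooling sketch: `resp` holds one response
    per log-polar point, `bins` is a list of index arrays, one per pyramidal
    bin. Returns one pooled value per bin."""
    return np.array([resp[idx].max() for idx in bins])

def make_pyramid_bins(n_points):
    """Toy two-level pyramid (an assumption, simplifying the paper's scheme):
    level 1 is all points, level 2 splits them into four angular quadrants."""
    pts = np.arange(n_points)
    return [pts] + list(np.array_split(pts, 4))
```

With 8 points, the pyramid yields 5 pooled values: one global maximum plus one per quadrant, so coarse and fine spatial structure are both retained.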
4.2.3 Nonlinear Gating and Normalization
The final feature responses are passed through nonlinear and normalization layers to mitigate the effects of outliers. With accumulated activations $\mathcal{A} = \bigcup_l a_l$, the single self-convolutional activation (SiSCA) descriptor $\mathcal{D}^{\mathrm{SiSCA}}_i = \bigcup_l d_{i,l}$ is computed for $l \in \{1, \dots, N_{\mathcal{B}} K\}$ through a nonlinear gating layer:

$$d_{i,l} = \exp\left(-(1 - a_{i,l})/\sigma_g\right), \qquad (5)$$

where $\sigma_g$ is a Gaussian kernel bandwidth. The size of the feature obtained from SiSCA thus becomes $N_{\mathcal{B}} \times K$. Finally, $d_{i,l}$ for each pixel $i$ is normalized with an L2 norm over all $l$.
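The gating and normalization can be sketched as follows, assuming pooled similarities in [-1, 1] and an illustrative bandwidth:

```python
import numpy as np

def gate_and_normalize(a, sigma=0.5):
    """Nonlinear gating and L2 normalization sketch for (5): map pooled
    similarities `a` through exp(-(1 - a) / sigma), which compresses low
    (outlier-prone) correlations toward zero, then scale the whole vector
    to unit L2 norm."""
    d = np.exp(-(1.0 - a) / sigma)
    return d / np.linalg.norm(d)
```

The exponential keeps the ordering of the responses while strongly attenuating weak or negative correlations, so single outlier bins contribute little to the normalized descriptor.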
4.3 Efficient Computation for Dense Description
The most time-consuming part of DeSCA is constructing the self-convolutional surfaces $\mathcal{C}(\mathcal{F}_j, \mathcal{F}_{s_k})$ for $j \in \mathcal{R}_i$ and $k \in \{1, \dots, K\}$, where many computations of (3) are needed for each pixel $i$. Straightforward computation of the weighted summation using $\omega$ in (3) would require considerable processing, with a complexity proportional to the image size, the support window size, the patch size, and the number of random samples. To expedite processing, we utilize fast edge-aware filtering [38, 39] and propose a precomputation scheme for the convolutional surfaces.
Similar to DASC [10], we compute $\mathcal{C}(\mathcal{F}_j, \mathcal{F}_{s_k})$ efficiently by first rearranging the sampling patterns into reference-biased pairs $(j, j+s)$. $\mathcal{C}(\mathcal{F}_j, \mathcal{F}_{j+s})$ can then be expressed as

$$\mathcal{C}(\mathcal{F}_j, \mathcal{F}_{j+s}) = \frac{\mathcal{G}_j(f \cdot f_s) - \mathcal{G}_j(f)\,\mathcal{G}_j(f_s)}{\sqrt{\left(\mathcal{G}_j(f^2) - \mathcal{G}_j(f)^2\right)\left(\mathcal{G}_j(f_s^2) - \mathcal{G}_j(f_s)^2\right)}}, \qquad (6)$$

where $f_s$ denotes the image shifted by $s$ and $\mathcal{G}_j(\cdot)$ denotes edge-aware weighted averaging around $j$. Each term can be efficiently computed using any form of fast edge-aware filter [38, 39], with a per-pixel cost independent of the patch size. $\mathcal{C}(\mathcal{F}_j, \mathcal{F}_{s_k})$ is then simply obtained from $\mathcal{C}(\mathcal{F}_j, \mathcal{F}_{j+s})$ by re-indexing the sampling patterns.
Though we remove the computational dependency on the patch size, many computations of (6) are still needed to obtain the self-convolutional activations, and many sampling pairs are repeated. To avoid such redundancy, we first compute the self-convolutional activation with a doubled local support window. A doubled local support window is used because (6) is computed with the shifted patch, and the minimum support window size for $j$ to cover all samples within $\mathcal{R}_i$ is twice the original, as shown in Fig. 5(b). After the self-convolutional activation is computed once over the image domain, $\mathcal{C}(\mathcal{F}_j, \mathcal{F}_{s_k})$ can be extracted through an index mapping process, where the indexes for the sampling pairs are estimated from the reference-biased pairs.
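The shifted-correlation formulation of (6) can be sketched over a whole image at once. For brevity this assumes a plain box filter (via integral images) instead of the guided filter, so the weights are uniform rather than edge-aware; the per-shift cost is nevertheless constant per pixel, as in the text:

```python
import numpy as np

def shifted_correlation_map(img, shift, radius):
    """Sketch of (6): compute the correlation between the patch at every
    pixel i and the patch at i + shift, for all i simultaneously, by
    filtering elementwise products (box filter stands in for the
    guided filter here, an assumption made for brevity)."""
    def box(x, r):
        # O(1)-per-pixel box mean via a summed-area table
        pad = np.pad(x, r, mode='edge')
        s = np.pad(pad.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
        n = 2 * r + 1
        return (s[n:, n:] - s[:-n, n:] - s[n:, :-n] + s[:-n, :-n]) / n ** 2
    g = np.roll(img, (-shift[0], -shift[1]), axis=(0, 1))  # shifted image f_s
    mu_f, mu_g = box(img, radius), box(g, radius)
    cov = box(img * g, radius) - mu_f * mu_g
    var_f = box(img * img, radius) - mu_f ** 2
    var_g = box(g * g, radius) - mu_g ** 2
    return cov / np.sqrt(np.maximum(var_f * var_g, 1e-12))
```

For a zero shift the map is identically one wherever the local variance is nonzero, which is a quick sanity check that the mean/variance terms are consistent.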


Algorithm 1: Deep Self-Convolutional Activations (DeSCA) Descriptor


Input: image $f$, random samples $\{s_k\}$ for $k \in \{1, \dots, K\}$.
Output: DeSCA descriptor $\mathcal{D}^{\mathrm{DeSCA}}$.
Compute $\mathcal{C}(\mathcal{F}_j, \mathcal{F}_{j+s})$ for a doubled support window by using (6).
Estimate $\mathcal{C}(\mathcal{F}_j, \mathcal{F}_{s_k})$ from $\mathcal{C}(\mathcal{F}_j, \mathcal{F}_{j+s})$ according to the index mapping process.
for each pyramidal level do (hierarchical aggregation using average pooling)
  Determine a circular pyramidal point set $\mathcal{P}_{(u,k)}$.
  Compute $\mathcal{A}_j(u,k)$ by average pooling of $\mathcal{C}(\mathcal{F}_j, \mathcal{F}_{s_k})$ on $\mathcal{P}_{(u,k)}$.
end for
for each pyramidal level do (hierarchical spatial aggregation using CSPP)
  Determine a circular pyramidal bin $\mathcal{B}_u$.
  Compute max-pooled responses by applying CSPP on each surface from $\mathcal{C}(\mathcal{F}_j, \mathcal{F}_{s_k})$ and $\mathcal{A}_j(u,k)$, respectively.
end for
Build hierarchical self-convolutional activations from the pooled responses.
Compute a nonlinear response with (5), followed by L2 normalization.
Build the DeSCA descriptor $\mathcal{D}^{\mathrm{DeSCA}}$.

4.4 DeSCA: Deep Self-Convolutional Activations
So far, we have discussed how to build the self-convolutional activation on a single level. In this section, we extend this idea by encoding self-similar structures at multiple levels, in a manner similar to the deep architectures widely adopted in CNNs [26]. DeSCA is defined similarly to SiSCA, except that an average pooling is executed before CSPP (see Fig. 6). With the self-convolutional activations, we perform average pooling on circular pyramidal point sets.
In comparison to self-convolutions from just a single patch, the spatial aggregation of self-convolutional responses is clearly more robust, and it requires only marginal computational overhead over SiSCA. The strength of such hierarchical aggregation has also been shown in [36]. Compared to using only the last CNN layer's activations, we use all intermediate activations from the hierarchical average pooling, which yields better cross-modal matching quality.
To build the hierarchical self-convolutional volume using average pooling, we first define circular pyramidal point sets $\mathcal{P}_{(u,k)}$ from the log-polar circular point set $\mathcal{P}$, where $u$ associates all pyramidal levels and all points in each level. In the average pooling, the circular pyramidal bins used in CSPP are reused, such that $\mathcal{P}_{(u,k)} = \mathcal{P} \cap \mathcal{B}_u$. Deep self-convolutional activations are defined by aggregating $\mathcal{C}(\mathcal{F}_j, \mathcal{F}_s)$ for all patches determined on each $\mathcal{P}_{(u,k)}$, such that

$$\mathcal{A}_j(u,k) = \frac{1}{N_{(u,k)}} \sum_{s \in \mathcal{P}_{(u,k)}} \mathcal{C}(\mathcal{F}_j, \mathcal{F}_s), \qquad (7)$$

which is defined for all $j \in \mathcal{R}_i$, where $N_{(u,k)}$ is the number of patches within $\mathcal{P}_{(u,k)}$. The hierarchical activations are sequentially aggregated using average pooling from the bottom to the top of the circular pyramidal point sets. After computing the hierarchical self-convolutional aggregations, similar to SiSCA, DeSCA employs the CSPP, nonlinear, and normalization layers presented in Sec. 4.2. The hierarchical self-convolutional activation is computed using CSPP such that

$$a_{(u',u,k)} = \max_{j \in \mathcal{B}_{u'}} \mathcal{A}_j(u,k). \qquad (8)$$
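The average pooling of (7) is again a reduction over index sets, this time with a mean instead of a max. A minimal sketch, with the point-set membership passed in explicitly (the paper reuses the CSPP bin layout for these sets):

```python
import numpy as np

def hierarchical_average_pooling(acts, point_sets):
    """Sketch of (7): average the self-convolutional activations over each
    circular pyramidal point set before CSPP. `acts` has one row per
    log-polar point (each row a response vector); `point_sets` lists the
    member point indices of each pyramidal set."""
    return np.stack([acts[idx].mean(axis=0) for idx in point_sets])
```

Averaging before the max pooling smooths out per-point noise, which is why the aggregated activations are more robust than the single-patch responses while adding only one pass over the point sets.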
5 Experimental Results and Discussion
5.1 Experimental Settings
In our experiments, the DeSCA descriptor was implemented with fixed parameter settings for all datasets. We chose the guided filter (GF) for edge-aware filtering in (6), with a fixed smoothness parameter. We implemented the DeSCA descriptor in C++ on an Intel Core i7-3770 CPU at 3.40 GHz. We will make our code publicly available. The DeSCA descriptor was compared to other state-of-the-art descriptors (SIFT [14], DAISY [23], BRIEF [24], LIOP [28], DaLI [29], LSS [22], and DASC [10]), as well as to area-based approaches (ANCC [35] and RSNCC [9]). Furthermore, to evaluate the performance gain from the deep architecture, we compared SiSCA and DeSCA.
5.2 Parameter Evaluation
The matching performance of DeSCA is exhibited in Fig. 7 for varying parameter values, including the support window size, the number of log-polar circular points, the number of random samples, and the number of levels of the circular spatial pyramid. In particular, Fig. 7(c) and (d) demonstrate the effectiveness of the self-convolutional activations and the deep architecture of DeSCA. For a quantitative analysis, we measured the average bad-pixel error rate on the Middlebury benchmark [42]. With a larger support window, the matching quality improves rapidly up to a point. The number of circular points influences the performance of circular pooling, which is found to plateau. Using a larger number of random samples yields better performance, since the descriptor encodes more information, and the number of circular spatial pyramid levels likewise affects the amount of encoding in DeSCA. Based on these experiments, we fixed the parameters in consideration of efficiency and robustness.


Methods  WTA optimization  SF optimization [11]  
RGB-NIR  flash/no-flash  diff. expo.  blur-sharp  RGB-NIR  flash/no-flash  diff. expo.  blur-sharp  
ANCC [35]  23.21  20.42  25.19  26.14  18.45  14.14  11.96  19.24 
RSNCC [9]  27.51  25.12  18.21  27.91  13.41  15.87  9.15  18.21 
SIFT [14]  24.11  18.72  19.42  27.18  18.51  11.06  14.87  20.78 
DAISY [23]  27.61  26.30  20.72  27.41  20.42  10.84  12.71  22.91 
BRIEF [24]  29.14  18.29  17.13  26.43  17.54  9.21  9.54  19.72 
LSS [22]  27.82  19.18  18.21  26.14  16.14  11.88  9.11  18.51 
LIOP [28]  24.42  16.42  14.22  20.42  15.32  11.42  10.22  17.12 
DASC [10]  14.51  13.24  10.32  16.42  13.42  7.11  7.21  11.21 
SiSCA  10.12  10.12  8.22  14.22  9.12  6.18  5.22  9.12 
DeSCA  8.12  8.22  6.72  13.28  7.62  5.12  4.72  8.01 

5.3 Middlebury Stereo Benchmark
We evaluated DeSCA on the Middlebury stereo benchmark [42], which contains illumination and exposure variations. In the experiments, the illumination (exposure) combination '1/3' indicates that the two images were captured under the 1st and 3rd illumination (exposure) conditions. For a quantitative evaluation, we measured the bad-pixel error rate in non-occluded areas of the disparity maps [42].
Fig. 8 shows the disparity maps estimated under severe illumination and exposure variations with winner-takes-all (WTA) optimization. Fig. 9 displays the average bad-pixel error rates of disparity maps obtained under illumination or exposure variations, with graph-cut (GC) [43] and WTA optimization. Area-based approaches (ANCC [35] and RSNCC [9]) are sensitive to severe radiometric variations, especially when local variations occur frequently. Feature descriptor-based methods (SIFT [14], DAISY [23], BRIEF [24], LSS [22], and DASC [10]) perform better than the area-based approaches, but they also provide limited performance. Our DeSCA descriptor achieves the best results both quantitatively and qualitatively. Compared to the SiSCA descriptor, the performance of the DeSCA descriptor is greatly improved, making the benefits of the deep architecture apparent.


Methods  def.  illum.  def./illum.  avg. 
SIFT [14]  45.15  40.81  47.51  44.49 
DAISY [23]  43.98  42.72  43.42  43.37 
BRIEF [24]  41.51  37.14  41.35  40.00 
LSS [22]  40.81  39.54  40.11  40.12 
LIOP [28]  28.72  31.72  30.21  30.22 
DaLI [29]  27.12  27.31  27.99  27.47 
DASC [10]  26.21  24.83  27.51  26.18 
SiSCA  23.42  22.21  24.17  23.27 
DeSCA  20.14  20.72  21.87  20.91 

5.4 Cross-modal and Cross-spectral Benchmark
We evaluated DeSCA on a cross-modal and cross-spectral benchmark [10] containing various kinds of image pairs, namely RGB-NIR, different exposures, flash/no-flash, and blurred-sharp. Optimization for all descriptors and similarity measures was done using WTA and SIFT flow (SF) with hierarchical dual-layer belief propagation [11], for which the code is publicly available. Sparse ground truths for these images are used for error measurement, as done in [10].
Fig. 10 provides a qualitative comparison of the DeSCA descriptor to other state-of-the-art approaches. As already described in the literature [9], gradient-based approaches such as SIFT [14] and DAISY [23] show limited performance for RGB-NIR pairs, where gradient reversals and inversions frequently appear. BRIEF [24] cannot deal with noisy regions and modality-based appearance differences, since it is formulated on pixel differences only. Unlike these approaches, LSS [22] and DASC [10] consider local self-similarities, but LSS lacks the discriminative power needed for dense matching, and DASC also exhibits limited performance. Compared to those methods, DeSCA yields better correspondence estimates. We also performed a quantitative evaluation, with results listed in Table 1, which clearly demonstrates the effectiveness of DeSCA.
5.5 DaLI Benchmark
We also evaluated DeSCA on a recent, publicly available dataset featuring challenging non-rigid deformations and very severe illumination changes [29]. Fig. 11 presents dense correspondence estimates for this benchmark [29]. A quantitative evaluation is given in Table 2 using ground-truth feature points sparsely extracted for each image, although DeSCA is designed to estimate dense correspondences. As expected, conventional gradient-based and intensity comparison-based feature descriptors, including SIFT [14], DAISY [23], and BRIEF [24], do not provide reliable correspondences. LSS [22] and DASC [10] exhibit relatively high performance under illumination changes, but are limited under non-rigid deformations. LIOP [28] provides robustness to radiometric variations, but is sensitive to non-rigid deformations. Although DaLI [29] provides robust correspondences, it requires considerable computation for dense matching. DeSCA offers greater discriminative power as well as more robustness to non-rigid deformations in comparison to these state-of-the-art cross-modality descriptors.



image size  SIFT  DAISY  LSS  DaLI  DASC  DeSCA*  DeSCA† 

5.6 Computational Speed
In Table 3, we compare the computational speed of DeSCA to a state-of-the-art local descriptor, DaLI [29], and to dense descriptors, namely DAISY [23], LSS [22], and DASC [10]. Even though DeSCA requires more computation time than some previous dense descriptors, it provides significantly improved matching performance, as described previously.
6 Conclusion
The deep self-convolutional activations (DeSCA) descriptor was proposed for establishing dense correspondences between images taken under different imaging modalities. Its high performance in comparison to state-of-the-art cross-modality descriptors can be attributed to its greater robustness to non-rigid deformations, thanks to its effective pooling scheme, and more importantly to its heightened discriminative power, which comes from a more comprehensive representation of self-similar structure and its formulation in a deep architecture. DeSCA was validated on an extensive set of experiments covering a broad range of cross-modal differences. Thanks to this robustness to non-rigid deformations and high discriminative power, DeSCA can potentially benefit other applications, such as object detection and semantic segmentation, in future work.
References
 [1] Brown, M., Susstrunk, S.: Multispectral sift for scene category recognition. In: CVPR (2011)
 [2] Yan, Q., Shen, X., Xu, L., Zhuo, S.: Crossfield joint image restoration via scale map. In: ICCV (2013)
 [3] Hwang, S., Park, J., Kim, N., Choi, Y., Kweon, I.: Multispectral pedestrian detection: Benchmark dataset and baseline. In: CVPR (2015)
 [4] Krishnan, D., Fergus, R.: Dark flash photography. In: SIGGRAPH (2009)
 [5] Sen, P., Kalantari, N.K., Yaesoubi, M., Darabi, S., Goldman, D.B., Shechtman, E.: Robust patchbased hdr reconstruction of dynamic scenes. In: SIGGRAPH (2012)
 [6] HaCohen, Y., Shechtman, E., Lischinski, D.: Deblurring by example using dense correspondence. In: ICCV (2013)
 [7] Lee, H., Lee, K.: Dense 3d reconstruction from severely blurred images using a single moving camera. In: CVPR (2013)
 [8] Petschnigg, G., Agrawala, M., Hoppe, H.: Digital photography with flash and no-flash image pairs. In: SIGGRAPH (2004)
 [9] Shen, X., Xu, L., Zhang, Q., Jia, J.: Multimodal and multispectral registration for natural images. In: ECCV (2014)
 [10] Kim, S., Min, D., Ham, B., Ryu, S., Do, M.N., Sohn, K.: Dasc: Dense adaptive selfcorrelation descriptor for multimodal and multispectral correspondence. In: CVPR (2015)
 [11] Liu, C., Yuen, J., Torralba, A.: Sift flow: Dense correspondence across scenes and its applications. IEEE Trans. PAMI 33(5) (2011) 815–830
 [12] Kim, J., Liu, C., Sha, F., Grauman, K.: Deformable spatial pyramid matching for fast dense correspondences. In: CVPR (2013)
 [13] Pinggera, P., Breckon, T., Bischof, H.: On crossspectral stereo matching using dense gradient features. In: BMVC (2012)
 [14] Lowe, D.: Distinctive image features from scaleinvariant keypoints. IJCV 60(2) (2004) 91–110
 [15] Simonyan, K., Vedaldi, A., Zisserman, A.: Learning local feature descriptors using convex optimisation. IEEE Trans. PAMI 36(8) (2014) 1573–1585
 [16] Gong, Y., Wang, L., Guo, R., Lazebnik, S.: Multi-scale orderless pooling of deep convolutional activation features. In: ECCV (2014)
 [17] Fischer, P., Dosovitskiy, A., Brox, T.: Descriptor matching with convolutional neural networks: A comparison to sift. arXiv:1405.5769 (2014)
 [18] Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A deep convolutional activation feature for generic visual recognition. In: ICML (2014)
 [19] SimoSerra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., MorenoNoguer, F.: Discriminative learning of deep convolutional feature point descriptors. In: ICCV (2015)
 [20] Dong, J., Soatto, S.: Domainsize pooling in local descriptors: Dspsift. In: CVPR (2015)
 [21] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
 [22] Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: CVPR (2007)
 [23] Tola, E., Lepetit, V., Fua, P.: Daisy: An efficient dense descriptor applied to widebaseline stereo. IEEE Trans. PAMI 32(5) (2010) 815–830
 [24] Calonder, M.: Brief: Computing a local binary descriptor very fast. IEEE Trans. PAMI 34(7) (2011) 1281–1298
 [25] Trzcinski, T., Christoudias, M., Lepetit, V.: Learning image descriptors with boosting. IEEE Trans. PAMI 37(3) (2015) 597–610
 [26] Alex, K., Ilya, S., Geoffrey, E.H.: Imagenet classification with deep convolutional neural networks. In: NIPS (2012)
 [27] Saleem, S., Sablatnig, R.: A robust sift descriptor for multispectral images. IEEE SPL 21(4) (2014) 400–403
 [28] Wang, Z., Fan, B., Wu, F.: Local intensity order pattern for feature description. In: ICCV (2011)
 [29] SimoSerra, E., Torras, C., MorenoNoguer, F.: Dali: Deformation and light invariant descriptor. IJCV 115(2) (2015) 136–154
 [30] Heinrich, P., Jenkinson, M., Bhushan, M., Matin, T., Gleeson, V., Brady, S., Schnabel, A.: Mind: Modality independent neighbourhood descriptor for multi-modal deformable registration. MIA 16(3) (2012) 1423–1435
 [31] Torabi, A., Bilodeau, G.: Local selfsimilaritybased registration of human rois in pairs of stereo thermalvisible videos. PR 46(2) (2013) 578–589
 [32] Ye, Y., Shan, J.: A local descriptor based registration method for multispectral remote sensing images with nonlinear intensity differences. JPRS 90(7) (2014) 83–95
 [33] Pluim, J., Maintz, J., Viergever, M.: Mutual information based registration of medical images: A survey. IEEE Trans. MI 22(8) (2003) 986–1004
 [34] Heo, Y., Lee, K., Lee, S.: Joint depth map and color consistency estimation for stereo images with different illuminations and cameras. IEEE Trans. PAMI 35(5) (2013) 1094–1106
 [35] Heo, Y., Lee, K., Lee, S.: Robust stereo matching using adaptive normalized crosscorrelation. IEEE Trans. PAMI 33(4) (2011) 807–822
 [36] Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: Deepflow: Large displacement optical flow with deep matching. In: ICCV (2013)
 [37] Black, M.J., Sapiro, G., Marimont, D.H., Heeger, D.: Robust anisotropic diffusion. IEEE Trans. IP 7(3) (1998) 421–432
 [38] Gastal, E., Oliveira, M.: Domain transform for edgeaware image and video processing. In: SIGGRAPH (2011)
 [39] He, K., Sun, J., Tang, X.: Guided image filtering. IEEE Trans. PAMI 35(6) (2013) 1397–1409
 [40] Seidenari, L., Serra, G., Bagdanov, A.D., Bimbo, A.D.: Local pyramidal descriptors for image recognition. IEEE Trans. PAMI 36(5) (2014) 1033–1040
 [41] He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. PAMI 37(9) (2015) 1904–1916
 [42] Middlebury stereo benchmark: http://vision.middlebury.edu/stereo/.
 [43] Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. PAMI 23(11) (2001) 1222–1239