1 Introduction
This paper addresses the problem of multispectral registration, aimed specifically at the visible (VIS) and near-infrared (NIR) channels. Different spectra capture different aspects of a scene, so their registration is challenging and cannot be solved by state-of-the-art approaches for geometric alignment [24]. See Figure 1 for an example: the VIS channel captures the color of the scene, while the NIR channel captures more details of the far objects. These two images differ by their nature. In this work we introduce a method for registering such images based on deep learning.
Our approach is based on metric learning of cross-spectral image patches. In the first stage we detect feature points with the Harris corner detector [9]. In the second stage, we match them to derive the global transformation between the input images. Since the patches around these points differ by their nature, SIFT [14] matching will not produce correct results. Therefore, we match them with a deep-learning approach as follows. Our network is trained on the CIFAR-10 [12] dataset and is geared to classify patches of the RGB visible channel into 10 classes. By removing its last softmax layer we obtain an informative descriptor for each such RGB patch. We then train this trimmed net from scratch on NIR patches, such that cross-spectral patches of the same object are trained to produce the same descriptor. This yields two nets with the same architecture and different weights, the first producing the VIS descriptor and the second the NIR descriptor. These two networks induce a metric between cross-spectral patches, namely the Euclidean distance between the two descriptors. As we show experimentally, this metric is an accurate basis for classifying multispectral patches as same or different, and therefore it is also a basis for our feature-based registration.

The paper is organized as follows. In Section 2 we cover previous work on the topic of multimodal registration. In Section 3 we introduce our learning scheme of a deep descriptor invariant to different wavelengths, on top of a network trained on the CIFAR-10 dataset. Then, in Section 4, we explain how to use this descriptor to perform multispectral registration. In Section 5 we evaluate the accuracy of our registration algorithm compared to other existing approaches.
2 Previous Work
Image registration is an important task in computer vision with a large body of related work. Image registration methods [3, 24] are the basis for many applications such as image fusion and 3D object recovery. Early methods rely on basic approaches such as solving translation by correlation [18]. [15] solves registration based on the gradients of the image. [19] solves global registration using an FFT-based method. More advanced methods use keypoints [9, 14] and invariant descriptors to find the geometric alignment [21].

Multimodal image registration has also been addressed by several works over the last decades [6, 20]. In [16] registration was carried out based on maximization of mutual information. [4, 10, 11, 13] propose to utilize contours and gradients for registration. When solving only displacement, cross-spectral registration can be carried out by measuring correlation on the Sobel image [8] or the Canny image [5]. A group of works focused on the specific task of visible to near-infrared (NIR) registration [4, 11]. [1] solves registration with FAST features [22] and descriptors designed for non-linear intensity variations. [2]
was the first to measure cross-spectral similarity with a Convolutional Neural Network (CNN). Their approach indeed manages to classify pairs of multispectral patches as same or different, but unfortunately it does not induce a metric. Therefore, if two patch pairs are found similar by their network, it cannot be determined which pair is more similar. This knowledge is crucial for finding the best match for a feature point. Our approach instead utilizes CNNs to measure the distance between cross-spectral patches, producing a continuous score of how similar they are. This measure forms a solid basis for multispectral registration, as described in Sections 3 and 4.

3 Multi-Spectral Descriptor Learning
We propose to align a pair of cross-spectral images by a feature-based approach. In order to find the global transformation between the feature points, we need a mechanism to match them. In this section we introduce an approach for matching feature points from different spectral channels by their deep descriptors.
Given a VIS channel patch p and a NIR patch q, we propose to learn a metric that measures the similarity distance between them. The descriptor of p is computed from the trimmed network trained on CIFAR-10 [12]. Figure 1 summarizes this network architecture. Denoting this network by N_VIS, the descriptor of the visible channel is N_VIS(p). Now, we would like to learn a network N_NIR for the NIR channel, with the same architecture as in Figure 1 but with different weights. The NIR descriptor is N_NIR(q). We seek weights of N_NIR such that the descriptor is invariant to different wavelengths. It means that the distance of corresponding patches p and q,

d(p, q) = || N_VIS(p) - N_NIR(q) ||_2,    (1)

would be significantly smaller than the distance of non-corresponding patches.
We learn the weights of the NIR network as follows. We use the dataset of [4], which contains over 900 aligned images from the VIS and NIR channels. For every image we apply the Harris corner detector [9] and extract around 1000 patch centers. By that process we store over 100,000 corresponding pairs of cross-spectral patches, each such pair being a training example. The input to the NIR network is the NIR patch, while the label is the visible descriptor of the corresponding VIS patch. By that approach we teach the network to output the visible descriptor for a NIR input. This network is responsible for maintaining the invariance of our distance metric to spectral channels. Figure 2 demonstrates the convergence of our training process and the trained network architecture. It can be seen that the L2 distance decreases over the epochs, and that the validation curve is close to the training curve, indicating that there is no overfitting.
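The regression-to-descriptor objective above can be sketched in a few lines under strong simplifications: here a linear map stands in for the NIR network (the paper uses a CNN), and the fixed VIS descriptors act as regression targets for an L2 loss minimized by gradient descent. All names, shapes, and the learning rate are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the training scheme of this section: fixed "VIS
# descriptors" serve as labels, and a model is fit with an L2 loss to
# reproduce them from the "NIR patches". Dimensions are hypothetical.
n_pairs, patch_dim, desc_dim = 200, 48, 16

nir_patches = rng.normal(size=(n_pairs, patch_dim))
# Hypothetical ground-truth mapping, used only to synthesize targets.
true_map = rng.normal(size=(desc_dim, patch_dim))
vis_descriptors = nir_patches @ true_map.T  # fixed labels, like N_VIS(p)

W = np.zeros((desc_dim, patch_dim))  # weights of the toy "NIR network"

def l2_loss(W):
    diff = nir_patches @ W.T - vis_descriptors
    return float(np.mean(np.sum(diff ** 2, axis=1)))

loss_start = l2_loss(W)
for _ in range(500):
    diff = nir_patches @ W.T - vis_descriptors   # (n_pairs, desc_dim)
    grad = 2.0 * diff.T @ nir_patches / n_pairs  # gradient of the L2 loss
    W -= 0.01 * grad
loss_end = l2_loss(W)
```

As in the figure described above, the L2 distance between predicted and target descriptors shrinks steadily over the iterations.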
An alternative to computing the spectral distance in Equation (1) would be to use a 2-channel network as described in [2]. This 2-channel network receives the two patches as input and outputs 1 if they are the same and -1 if they are different. It is trained on pairs of positive and negative multispectral patches. Unfortunately, this architecture cannot be used for registration because it does not form a metric. If we found two matches for a patch, it is necessary to know which one is more similar, and this cannot be deduced from a 2-channel network that acts as a binary classifier. In contrast, our metric indicates a distance between the patches, and therefore supplies the algorithm not only a boolean indicating whether the patches are the same, but also a similarity score. An additional problem with the 2-channel architecture is its runtime. Such a network requires running a forward pass for each pair of corners, an expensive procedure that can take minutes. In our architecture, a forward pass is computed for each corner separately, and only L2 distances are computed for every pair.
In Section 5 we show that our metric can be used to classify pairs of cross-spectral patches as same or different with high accuracy. In Section 4 we explain how to use this metric to compute the multispectral registration.
Layer  Type         Output Dim  Kernel  Stride  Pad
1      convolution  32          5x5     1       2
2      max-pooling  32          3x3     2       0
3      ReLU         32          -       1       0
4      convolution  32          5x5     1       2
5      ReLU         32          -       1       0
6      avg-pooling  32          3x3     2       0
7      convolution  64          5x5     1       2
8      ReLU         64          -       1       0
9      avg-pooling  64          3x3     2       0
10     convolution  64          4x4     1       0
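Assuming a 32x32 CIFAR-10 input and ceil-mode pooling (an assumption on our part; the paper does not state the rounding rule, but Caffe/cuda-convnet-style CIFAR-10 nets use it), the spatial sizes implied by the table can be traced in a few lines of Python, ending at a 1x1 map of 64 channels, i.e., a 64-dimensional descriptor:

```python
import math

# (type, kernel, stride, pad) per table row; ReLU layers keep the size.
LAYERS = [
    ("conv", 5, 1, 2), ("pool", 3, 2, 0),
    ("conv", 5, 1, 2), ("pool", 3, 2, 0),
    ("conv", 5, 1, 2), ("pool", 3, 2, 0),
    ("conv", 4, 1, 0),
]

def output_size(size, kind, k, s, p):
    if kind == "conv":  # floor rounding for convolutions
        return (size + 2 * p - k) // s + 1
    # ceil-mode pooling (assumed, Caffe/cuda-convnet style)
    return math.ceil((size + 2 * p - k) / s) + 1

def trace(size=32):
    """Spatial size after each conv/pool layer of the table."""
    sizes = [size]
    for kind, k, s, p in LAYERS:
        size = output_size(size, kind, k, s, p)
        sizes.append(size)
    return sizes
```

Running `trace()` gives 32 → 32 → 16 → 16 → 8 → 8 → 4 → 1, so the final 4x4 convolution collapses the map to a single 64-dimensional vector.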
4 Multi-Spectral Image Registration
We use the metric on cross-spectral patches described in Section 3 to form a deep feature-based registration. Our approach consists of three stages: corner detection by Harris [9], corner matching by our deep descriptor, and finally computation of the global geometric transformation with a Random Sample Consensus (RANSAC) [7] mechanism for outlier rejection.
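As a sketch of the third stage, the hypothetical snippet below estimates a pure translation with RANSAC from putative matches, many of which are outliers: a single match determines a candidate translation, and the candidate with the largest consensus set wins. Function names, the iteration count, and the tolerance are illustrative, not the paper's values.

```python
import numpy as np

def ransac_translation(src, dst, iters=100, tol=2.0, seed=0):
    """Estimate a 2D translation from putative matches src[i] -> dst[i],
    a large fraction of which may be outliers."""
    rng = np.random.default_rng(seed)
    best_t, best_inliers = None, -1
    for _ in range(iters):
        i = rng.integers(len(src))
        t = dst[i] - src[i]                     # candidate translation
        residuals = np.linalg.norm(dst - (src + t), axis=1)
        inliers = int(np.sum(residuals < tol))  # consensus set size
        if inliers > best_inliers:
            best_t, best_inliers = t, inliers
    return best_t, best_inliers

# Synthetic check: 16 matches translated by (5, -3), plus 4 outliers.
rng = np.random.default_rng(1)
src = rng.uniform(0, 100, size=(20, 2))
dst = src + np.array([5.0, -3.0])
dst[16:] = rng.uniform(0, 100, size=(4, 2))     # corrupt four matches
t, inliers = ransac_translation(src, dst)
```

For richer models (rigid, affine), each iteration would sample as many matches as the model's degrees of freedom require and fit them by least squares, as described below.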
Corner Detection. Denote the VIS channel image by I_V and the corresponding NIR image by I_N. We use the method described in [9] to extract the corresponding groups of corners C_V and C_N. Each such corner is a local maximum in the Harris score image:

R = λ1 λ2 - k (λ1 + λ2)^2,    (2)

where λ1, λ2 are the eigenvalues of the matrix of derivatives at each pixel:

M = Σ_{x,y} w(x, y) [ Ix^2, Ix Iy ; Ix Iy, Iy^2 ],    (3)

and Ix, Iy are the horizontal and vertical derivatives of the input image, respectively. Since this corner detection method is based on local gradients, it is relatively invariant across spectral channels, and therefore the groups of corners C_V and C_N have a large overlap. This characteristic of the feature extraction is necessary for the success of our whole scheme.
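Equations (2) and (3) can be sketched directly in NumPy; this is a minimal illustration with a 3x3 box window instead of a Gaussian and without non-maximum suppression, so it is not the paper's implementation:

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris score R = det(M) - k * trace(M)^2 per pixel, with M built
    from image derivatives summed over a 3x3 window."""
    iy, ix = np.gradient(img.astype(float))  # vertical, horizontal derivatives
    ixx, iyy, ixy = ix * ix, iy * iy, ix * iy

    def box3(a):  # sum over a 3x3 window via padded shifts
        p = np.pad(a, 1)
        return sum(p[r:r + a.shape[0], c:c + a.shape[1]]
                   for r in range(3) for c in range(3))

    sxx, syy, sxy = box3(ixx), box3(iyy), box3(ixy)
    det = sxx * syy - sxy * sxy      # lambda1 * lambda2
    trace = sxx + syy                # lambda1 + lambda2
    return det - k * trace ** 2

# A white square on black: the square's corners score high, while straight
# edges have one dominant gradient direction and score at or below zero.
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
R = harris_response(img)
```

On this test image, R is positive at the square's corner (5, 5) and negative along the straight edge, e.g. at (5, 10), matching the behaviour Equation (2) is designed for.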
Feature Matching. We want to match the corners detected in the VIS image to those detected in the NIR image, forming the group of all matches. For every VIS corner we find the best match as follows. First, we compute its descriptor with the VIS network and the descriptors of all NIR corners with the NIR network. The complexity of all these forward passes is O(n + m), where n and m are the numbers of corners in the two images, which is practical and feasible for real-time registration. If we used a 2-channel network for computing similarities, the complexity would be O(n * m) forward passes and the runtime would be prohibitive. A match is a pair of corners, one from each image, such that the NIR corner is the nearest neighbour of the VIS corner according to our deep metric, i.e., their descriptors are the closest among all possible matches.
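The matching step reduces to a nearest-neighbour search in descriptor space. A minimal sketch with placeholder descriptors (in the real pipeline they come from the two networks):

```python
import numpy as np

def match_descriptors(vis_desc, nir_desc):
    """For each VIS descriptor, return the index of the nearest NIR
    descriptor under the L2 metric, plus the distance itself. The
    distance doubles as a continuous similarity score."""
    # Pairwise L2 distances via broadcasting: shape (n_vis, n_nir).
    diff = vis_desc[:, None, :] - nir_desc[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    nearest = dist.argmin(axis=1)
    return nearest, dist[np.arange(len(vis_desc)), nearest]

# Toy check: the NIR descriptors are a shuffled, slightly perturbed copy
# of the VIS descriptors, so matching should recover the shuffle.
rng = np.random.default_rng(0)
vis = rng.normal(size=(6, 64))
perm = rng.permutation(6)
nir = vis[perm] + 0.01 * rng.normal(size=(6, 64))
nearest, dists = match_descriptors(vis, nir)
```

Note that only the descriptor computation involves the networks; the pairwise distances are cheap vector operations, which is the source of the O(n + m) versus O(n * m) gap discussed above.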
Transformation Computation. Now, we use the group of all matches M to form the final global transformation T between the two images. Typically, most of the matches in M are outliers. Therefore, we compute the transformation with the RANSAC [7] outlier-rejection method. The transformation for every sample of matches from M is computed by least squares. We look for the largest subgroup of M that agrees with the same transformation T; this group contains the inlier matches. We can restrict T to be affine, rigid, or translation only. As the degrees of freedom of T decrease, the accuracy of our registration increases. The transformation found by RANSAC is the final result of our algorithm, and the score of success is the ratio of inliers to the total number of matches. A score above a fixed threshold indicates a successful run of our method. In Section 5 we evaluate the accuracy of our approach.

5 Experiments
We trained and tested our method on cross-spectral images from the dataset of [4]. This dataset contains over 900 aligned images from the VIS and NIR channels. In Figure 6 we show example pairs of images from this dataset. Our code is implemented in Matlab using the MatConvNet library [23]. The runtime of our registration is around 10 seconds per pair of images and can be further reduced by utilizing the GPU and parallel computing. The training time for our network is one hour on a TitanX GPU. We trained the network with a learning rate of 0.005 and a weight decay of 0.0004. To evaluate our registration accuracy we manually simulated transformations on the dataset of aligned cross-spectral images and tried to recover them automatically with our approach. For each simulation run we recorded the error, which is the Euclidean distance in a specific parameter, for example the distance between a simulated translation and the one found by our code. We compared our method to several different approaches for multispectral registration. The first approach is to use edge descriptors and match them by binary correlation; this is still feature-based and can solve any type of transformation. Additional approaches solve only translation, among them correlation of Canny [5] images, correlation of Sobel [8] images, and maximization of mutual information. We also compare to the feature-based approach of the LGHD descriptor [1].
In Figure 3 we show our results for classifying pairs of patches as same or different. The positive set consists of the pairs of patches around corresponding corners in the dataset, while the negative examples are produced by random matching. Our binary classifier reaches its best accuracy when selecting the correct threshold on the L2 distance between the descriptors. The F-measure [17], the harmonic mean of precision and recall, is 0.75, and it is achieved with a threshold similar to the one maximizing the accuracy.
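For reference, the F-measure quoted above is the harmonic mean of precision and recall; the small helper below (illustrative, not from the paper's code) makes the relationship explicit:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# With precision == recall the F-measure equals that common value,
# so an F-measure of 0.75 is consistent with, e.g., P = R = 0.75.
f = f_measure(0.75, 0.75)
```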
Table 2 compares the different methods when solving translation only. It can be noticed that our deep method achieves the lowest error, which is very close to 0 pixels. In Figure 4 we plot this error across a sample of different scenes. It can be seen that our error is the lowest across almost all the scenes.
Figure 5 shows the error of our deep method across different scalings of the simulated transformation. When solving scaling we achieve an error of around one pixel at all scaling levels. In the scaling parameter itself we obtain a negligible error, below 0.002. Overall, our multispectral registration is accurate and solves complex transformations.
6 Conclusions
We introduced a novel method for multispectral registration that utilizes an invariant deep descriptor of cross-spectral patches. To that end, we trained a network to extract such a descriptor for NIR patches. This network, together with the trimmed network pretrained on CIFAR-10 for RGB patches, forms a metric between multispectral patches. Our experiments demonstrate that our metric-learning scheme is useful for classifying pairs of patches as same or different. Moreover, it forms a basis for an accurate multispectral registration. In future work we plan to build and train a fully end-to-end network that will carry out all the stages of our feature-based registration, including corner detection and feature matching. In addition, we plan to train a generative adversarial network to create a VIS image out of a NIR image.
Figure 3, right: precision-recall graph of the classifier.

Algorithm           VIS-NIR
Our method          0.03
Edge-Descriptor     0.08
Canny               0.07
Sobel               0.07
Mutual Information  0.11
LGHD                0.21
References
[1] C. Aguilera, A. D. Sappa, and R. Toledo. LGHD: A feature descriptor for matching across non-linear intensity variations. In Image Processing (ICIP), 2015 IEEE International Conference on, page 5. IEEE, Sep 2015.

[2] C. A. Aguilera, F. J. Aguilera, A. D. Sappa, and R. Toledo. Learning cross-spectral similarity measures with deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1–9, 2016.
[3] L. G. Brown. A survey of image registration techniques. ACM computing surveys (CSUR), 24(4):325–376, 1992.
 [4] M. Brown and S. Süsstrunk. Multispectral SIFT for scene category recognition. In Computer Vision and Pattern Recognition (CVPR11), pages 177–184, Colorado Springs, June 2011.
 [5] J. Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986.
 [6] C. Chen, Y. Li, W. Liu, and J. Huang. Sirf: simultaneous satellite image registration and fusion in a unified framework. IEEE Transactions on Image Processing, 24(11):4213–4224, 2015.
 [7] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
 [8] W. Gao, X. Zhang, L. Yang, and H. Liu. An improved sobel edge detection. In Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on, volume 5, pages 67–71. IEEE, 2010.
 [9] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey vision conference, volume 15, pages 10–5244. Manchester, UK, 1988.
 [10] M. Irani and P. Anandan. Robust multisensor image alignment. In Computer Vision, 1998. Sixth International Conference on, pages 959–966. IEEE, 1998.
 [11] Y. Keller and A. Averbuch. Multisensor image registration via implicit similarity. IEEE transactions on pattern analysis and machine intelligence, 28(5):794–801, 2006.
 [12] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[13] H. Li, B. Manjunath, and S. K. Mitra. A contour-based approach to multisensor image registration. IEEE transactions on image processing, 4(3):320–334, 1995.
[14] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.
 [15] B. D. Lucas, T. Kanade, et al. An iterative image registration technique with an application to stereo vision. 1981.
 [16] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens. Multimodality image registration by maximization of mutual information. IEEE transactions on medical imaging, 16(2):187–198, 1997.
[17] D. M. Powers. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.
 [18] W. K. Pratt. Correlation techniques of image registration. IEEE transactions on Aerospace and Electronic Systems, (3):353–358, 1974.
[19] B. S. Reddy and B. N. Chatterji. An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE transactions on image processing, 5(8):1266–1271, 1996.
 [20] X. Shen, L. Xu, Q. Zhang, and J. Jia. Multimodal and multispectral registration for natural images. In European Conference on Computer Vision, pages 309–324. Springer, 2014.
 [21] M. Subramanyam et al. Automatic feature based image registration using sift algorithm. In Computing Communication & Networking Technologies (ICCCNT), 2012 Third International Conference on, pages 1–5. IEEE, 2012.
[22] G. Takacs, V. Chandrasekhar, S. Tsai, D. Chen, R. Grzeszczuk, and B. Girod. Unified real-time tracking and recognition with rotation-invariant fast features. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 934–941. IEEE, 2010.
[23] A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for Matlab. In ACM International Conference on Multimedia, 2015.
 [24] B. Zitova and J. Flusser. Image registration methods: a survey. Image and vision computing, 21(11):977–1000, 2003.