Finding correspondences between image regions (patches) is a key factor in many computer vision applications. For example, structure-from-motion, multi-view reconstruction, image retrieval and object recognition require accurate computation of local image similarity. Due to the importance of these problems, various descriptors have been proposed for patch matching with the aim of improving accuracy and robustness. Many of the most widely used approaches, like the SIFT or DAISY descriptors, are based on hand-crafted features and have a limited ability to cope with negative factors (occlusions, variation in viewpoint, etc.) which make the search for similar patches more difficult. Recently, various methods based on supervised machine learning have been successfully applied to learning patch descriptors [3, 4, 5, 6]. These methods significantly outperform hand-crafted approaches and inspire our research.
During recent years, neural networks have achieved great success in object classification and other computer vision problems. Specifically, methods based on Convolutional Neural Networks (CNN) have shown significant improvements over previous state-of-the-art recognition and object detection approaches. Influenced by these works, we aim to create a CNN-based discriminative descriptor for the patch matching task. In contrast to [8, 9], where the representations of two patches are compared using a set of fully connected layers, we utilize the Euclidean distance as the metric of similarity. The same metric is used in one of the most popular and widely applicable descriptors, SIFT. Therefore, our approach can be considered a direct alternative to SIFT, and similar techniques can be used for fast matching and indexing of descriptors as with SIFT. We utilize labeled patch pairs to learn the descriptor so that the Euclidean distance (L2 norm) between patches in the feature space is small for similar patches and large otherwise. This is analogous to the face-verification problem, where a Siamese structure has been utilized to predict whether the persons depicted in an input image pair are the same or not.
For training and evaluation of the proposed descriptor we utilize the Multi-view Stereo Correspondence (MSC) dataset, which is illustrated in Fig. 1 and consists of more than 1.5M grayscale patches. The dataset consists of pairs of matching and non-matching patches extracted from images of the Statue of Liberty, Notredame and Half Dome (Yosemite) by using the Difference of Gaussians (DoG) interest point detector and matched by utilizing the respective 3D multi-view reconstructions computed from the images. In detail, corresponding interest points were found by mapping between images using dense stereo depth maps computed by a multi-view stereo algorithm based on initial point cloud reconstructions. Pairs of patches corresponding to the same 3D point are defined to be matching (i.e. positive or similar pairs in our terminology) if they also originate from DoG interest points detected with sufficiently similar scale and orientation. Pairs of patches sampled from different 3D points are non-matching (i.e. negative or dissimilar). In summary, as illustrated in Fig. 1, the matching pairs represent the same 3D structure with roughly correct geometric alignment so that their appearances are similar, whereas the negative pairs typically have different texture and dissimilar appearance.
In this work, we conduct multiple experiments with preprocessing of raw patches and demonstrate that histogram equalization as well as batch normalization significantly improve the accuracy of the proposed descriptor.
We also explore different types of descriptor architectures, evaluating their performance on the MSC dataset. Our experimental evaluation shows that the proposed model outperforms recent state-of-the-art CNN-based approaches. In addition, we investigate the use of spatial transformer networks in the patch matching problem.
The paper is organized as follows. Section 2 presents related work focusing on the patch matching problem. Section 3 describes the proposed method of finding corresponding patches and discusses the architecture of the descriptor, the objective function and the details of data preprocessing. Section 4 presents the experimental pipeline and the performance on the MSC dataset. At the end of the paper we summarize our results and point out some directions for future work.
2 Related work
Local image descriptors have been widely used for finding similar and dissimilar regions in images. Nowadays, the trend has shifted from hand-crafted and carefully designed methods (SIFT or DAISY) to a new generation of learned descriptors, including unsupervised and supervised techniques like boosting, convex optimization and Linear Discriminant Analysis (LDA) [3, 14].
In our approach, however, we propose a descriptor based on deep convolutional neural networks (CNN) with batch normalization units accelerating learning and convergence. CNN-based representations were first utilized for finding matching image patches in early works on learned local descriptors. More recently, Žbontar and LeCun proposed a method for comparing image patches in order to extract stereo depth information. Their method is based on convolutional networks minimizing a hinge loss function and showed the best performance on the KITTI stereo evaluation dataset. However, as their approach operates on very small patches, it restricts the area of applicability.
In addition, one recent related work utilizes a Siamese network architecture for the challenging problem of matching street-level and aerial images. In contrast to our work, it concentrates on matching entire images in a specific application, i.e. ground-to-aerial geolocalization. Their approach is therefore not directly applicable in tasks where local features are currently used, and it does not allow replacing or comparing with SIFT. Moreover, the length of their proposed descriptor is significantly larger than that of SIFT and of our representation.
Recent approaches [8, 9, 20] propose CNN descriptors trained with a two-branch (Siamese) architecture which significantly exceed the accuracy of hand-crafted descriptors. However, in contrast to SIFT, in [8, 9] the feature representations of input patches are compared by a set of fully connected layers (a match network) that learns a complex comparison metric. Nevertheless, Zagoruyko et al. and Simo-Serra et al. also conducted experiments in which the match network was replaced with the Euclidean distance between the outputs of the two branches and, hence, theirs are the closest works to ours. The implementation of the latter is not yet publicly available. Thus, in order to compare performance, we reproduced its network architecture and evaluated it using the standard protocol. The results show that our network architecture outperforms those of [8, 20]. A more detailed comparison is presented in Sec. 3.2.
3 Neural Descriptor
Our goal is to construct a system that efficiently distinguishes matching (similar) and non-matching (dissimilar) patches. To do this, we propose a method based on a deep convolutional neural network. As shown in Fig. 2, the model consists of two identical branches that share the same set of weights and parameters. The two patches of an input pair are fed into the branches and propagated through the model separately. The main objective of the proposed network is to map the raw patches to a low-dimensional feature space so that the distance between a pair is small if the patches are similar and large otherwise. The same distance measure (L2 distance) is usually applied also for matching hand-crafted descriptors.
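The matching procedure can be sketched as follows. Here `describe` is only a hypothetical stand-in for the learned branch (the actual branch is the CNN described in Sec. 3.2); the point is that both patches pass through the same shared function and are compared with the L2 distance:

```python
import numpy as np

def describe(patch):
    # Placeholder for the shared CNN branch: both patches of a pair pass
    # through the SAME function (shared weights), which ends with an
    # L2 normalization so descriptors lie on the unit sphere.
    v = patch.reshape(-1).astype(np.float64)
    v = v - v.mean()
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def match_distance(p1, p2):
    # Euclidean (L2) distance between branch outputs: small for matching
    # patches, large for non-matching ones -- the same metric used by SIFT.
    return np.linalg.norm(describe(p1) - describe(p2))
```

Because the same metric is used, any indexing structure built for SIFT descriptors (e.g. kd-trees for approximate nearest-neighbour search) applies unchanged.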
The following section describes the proposed loss function and how it can be used in our approach.
3.1 Loss Function and Data Preprocessing
To optimize the proposed network, we have to use a loss function which is capable of distinguishing similar (positive) and dissimilar (negative) pairs. More precisely, we train the weights of the network by using a loss function which encourages similar examples to be close, and dissimilar ones to have a Euclidean distance larger than or equal to a margin from each other. In contrast to [8, 20], which utilize the hinge embedding loss, we use the margin-based contrastive loss defined as follows:
L(x_1, x_2, y) = y · D(x_1, x_2)^2 + (1 − y) · max(0, m − D(x_1, x_2))^2,   (1)

where y is a binary label which selects whether the input pair consisting of patches x_1 and x_2 is positive (y = 1) or negative (y = 0), m > 0 is the margin for negative pairs and D(x_1, x_2) is the Euclidean distance between the feature vectors f(x_1) and f(x_2) of the input patches x_1 and x_2.
Dissimilar pairs contribute to the loss function only if their distance is smaller than the margin m. The idea of learning is schematically illustrated in Fig. 3. The loss function encourages matching patches (elements with the same color and shape) to be close in the feature space while pushing non-matching pairs apart. Obviously, negative pairs with a distance larger than the margin do not contribute to the loss (the second part of (1)). Thus, setting the margin to too small a value would lead to optimizing the objective function only over the set of positive pairs and, as a result, would hamper learning.
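A minimal NumPy sketch of this loss, following the contrastive-loss formulation of Hadsell et al. (the exact scaling constant in the paper's equation may differ); the margin heuristic described in the text is included:

```python
import numpy as np

def contrastive_loss(d, y, margin):
    # d: Euclidean distances between descriptor pairs, shape [n]
    # y: 1 for matching (positive) pairs, 0 for non-matching (negative)
    d = np.asarray(d, dtype=float)
    y = np.asarray(y, dtype=float)
    pos = y * d ** 2                                    # pull positives together
    neg = (1.0 - y) * np.maximum(0.0, margin - d) ** 2  # push negatives apart
    return float(np.mean(pos + neg))

def initial_margin(initial_distances):
    # Heuristic from the text: set the margin to twice the average
    # distance between training-pair features before any learning.
    return 2.0 * float(np.mean(initial_distances))
```

Note that a negative pair already beyond the margin yields exactly zero loss and hence zero gradient, which is why too small a margin leaves only the positive term active.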
To demonstrate what has been learned by the proposed descriptor, we illustrate the histograms of pairwise Euclidean distances of the test set patch pairs both before and after training in Fig. 4. The blue and brown bars represent pairwise distances of positive and negative pairs, respectively. It can clearly be seen that training the descriptor on patch pairs effectively pushes non-matching pairs away and pulls matching pairs together. In the very beginning, the distributions of positive and negative pairs are grouped at the intersection of the blue (penalty for similar pairs in (1)) and the red (penalty for dissimilar pairs) curves in Fig. 3(a). We experimentally verified that for efficient training the margin value should be set to twice the average Euclidean distance between the features of training patch pairs before learning.
Data Preprocessing and Augmentation.
Data preprocessing plays an important role in machine learning algorithms. However, in practice it is hard to say in advance which preprocessing technique is helpful for achieving the best performance. Here we calculate the mean and standard deviation of pixel intensities over the whole training dataset and use them to normalize the intensity value of every pixel in the input grayscale patch. In addition, while analysing raw patches in the MSC dataset, we noticed that there are many pairs whose patches have significantly different contrast. To adjust patch intensities we apply histogram equalization before normalization. Histogram equalization is a technique that improves the contrast of images and has been found to be powerful in image enhancement. The equalized histogram of a discrete gray-level image represents the frequency of occurrence of all gray levels in the image and distributes the pixel intensities well over the full intensity range. Finally, to prevent overfitting we followed the same augmentation approach as prior work and augmented the training data by applying affine transformations: rotating both patches of a pair by 90, 180 and 270 degrees and flipping them horizontally and vertically.
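The preprocessing and augmentation steps above can be sketched in NumPy as follows (a minimal version assuming 8-bit grayscale patches; in practice a library routine such as OpenCV's `equalizeHist` would be used):

```python
import numpy as np

def equalize_histogram(patch):
    # Map gray levels through the normalized cumulative histogram so that
    # intensities spread over the full 0..255 range.
    patch = np.asarray(patch, dtype=np.uint8)
    hist = np.bincount(patch.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    denom = patch.size - cdf_min
    if denom == 0:                      # constant patch: nothing to equalize
        return patch
    lut = np.clip(np.round((cdf - cdf_min) * 255.0 / denom), 0, 255)
    return lut.astype(np.uint8)[patch]

def normalize(patch, mean, std):
    # mean/std are computed once over the whole training set.
    return (patch.astype(np.float32) - mean) / std

def augment(p1, p2):
    # Both patches of a pair receive the same transformation:
    # rotations by 90/180/270 degrees plus horizontal and vertical flips.
    pairs = [(np.rot90(p1, k), np.rot90(p2, k)) for k in (1, 2, 3)]
    pairs.append((np.fliplr(p1), np.fliplr(p2)))
    pairs.append((np.flipud(p1), np.flipud(p2)))
    return pairs
```

Each original pair thus yields five extra pairs, which is consistent with growing 0.5M training pairs to roughly 4M.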
3.2 Network Architecture and Learning
The proposed network architecture for one branch of the Siamese network of Fig. 2 has the following modules: convBlock[32,3,1,1]-convBlock[64,3,1,1]-pool-convBlock[64,3,1,1]-convBlock[64,3,1,1]-pool-convBlock[128,3,1,1]-convBlock[128,3,1,1]-pool-convBlock[128,3,1,1]-L2norm. In this shorthand notation, convBlock[N,w,s,p] consists of a convolution layer with N filters of size w×w applied with stride s and padding p, a ReLU non-linearity and batch normalisation; pool[k] is a max-pooling layer of size k×k applied with stride k. This architecture, dubbed cnn7, was selected based on several experiments with different network structures having a varying number of layers and involving also fully connected layers. We observed that convolutional networks without fully connected layers seemed to perform better than networks with fully connected layers, and cnn7 had the best performance among the networks we experimented with.
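As a sanity check on the architecture string, the spatial size of the feature maps can be traced with the standard convolution output-size formula o = (i + 2p − k)/s + 1. The sketch below assumes 64×64 input patches and 2×2 max-pooling with stride 2; both are assumptions, since the exact pooling parameters are not stated:

```python
def conv_out(size, k, s, p):
    # Output spatial size of a convolution/pooling layer:
    # o = (i + 2p - k) / s + 1
    return (size + 2 * p - k) // s + 1

def cnn7_spatial_size(size=64):
    # convBlock[N,3,1,1] keeps the spatial size (k=3, s=1, p=1);
    # each assumed 2x2/stride-2 pool halves it.
    layers = ["conv", "conv", "pool", "conv", "conv", "pool",
              "conv", "conv", "pool", "conv"]
    for layer in layers:
        if layer == "conv":
            size = conv_out(size, k=3, s=1, p=1)
        else:
            size = conv_out(size, k=2, s=2, p=0)
    return size
```

Under these assumptions a 64×64 patch is reduced to an 8×8 map of 128 channels before the final L2 normalization.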
In our case, the benefit of applying batch normalization and histogram equalization was verified experimentally, as shown in Fig. 5 and described in Section 4. We also analyzed the previously proposed network structure titled cnn3 by re-implementing its architecture and utilizing the contrastive loss objective function. As shown in Fig. 5, our network architecture clearly outperforms cnn3 even without histogram equalization of the input patches (blue and red curves, respectively). Moreover, applying histogram equalization further improves the accuracy of the proposed method.
In contrast to the cnn3 model and the two models siam-L2 and pseudo-siam-L2 proposed in [8], we decomposed convolutional layers with a big kernel size into several filters with smaller (3×3) kernels, separated by ReLU activations. Following prior observations, this increases the nonlinearity of the whole network and makes the decision function more discriminative. Moreover, our model has only half the number of parameters compared to those models.
We minimize the contrastive loss function (1) over the training set using Stochastic Gradient Descent (SGD) with standard back-propagation and ADADELTA. We train our descriptor in two stages. In the first stage, the training data has 500,000 patch pairs and it took about 1 day to finish 100,000 iterations of training, which is equal to 40 epochs over the training set. The weights are initialised randomly and the model is trained from scratch. In the second stage, we augmented the number of training samples up to 4M pairs by also using rotated and mirrored versions of the original patches, and then resumed training for another 20 epochs starting from the pre-trained descriptor of the first stage. The learning rate (0.01), weight decay (0.001) and mini-batch size (100) remain constant during training. (Source code and the model will be made available upon publication.)
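The ADADELTA update rule (Zeiler, 2012) scales each step by running averages of squared gradients and squared updates. A minimal single-parameter sketch of the rule (the actual training uses a framework implementation; the base learning rate quoted above is framework-specific):

```python
import numpy as np

def adadelta_step(w, grad, state, rho=0.95, eps=1e-6):
    # state = (E[g^2], E[dx^2]): exponential moving averages of the
    # squared gradient and the squared parameter update.
    eg2, edx2 = state
    eg2 = rho * eg2 + (1.0 - rho) * grad ** 2
    dx = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad
    edx2 = rho * edx2 + (1.0 - rho) * dx ** 2
    return w + dx, (eg2, edx2)
```

For example, minimizing f(w) = w^2 (gradient 2w) from w = 5 moves w toward zero without a hand-tuned per-step size.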
4 Experiments

In this section, we present experimental results evaluating the proposed descriptor on the MSC dataset. In order to compare results with previous work, we use exactly the same standard datasets for training and testing as used by e.g. [3, 8]. That is, for each of the three subsets of the MSC dataset (Liberty, Notredame, Yosemite) we use a test set of 100,000 pairs of patches originally provided with the dataset. For training we utilize 500,000 pairs of patches from each subset (also provided with the dataset). If we augment the training data by including rotated and mirrored versions of the original training patches, as described in Section 3, we get 4 million pairs from the original 0.5 million. We train three models by using training patches from the three different subsets, and evaluate each of the three models with the test pairs of the two remaining subsets. In total we get six cases, which are presented in Table 1.
Table 1: False positive rate (%) at 95% recall for the six train/test combinations of the MSC dataset; the last two columns are the mean over all six combinations and the mean over the four combinations trained on Yosemite and Notredame (mean(1)).

|Method||Dim||Error rates (%) for the six train/test combinations||mean||mean(1)|
|nSIFT + L2 (no training)||128d||29.84||22.53||29.84||27.29||22.53||27.29||26.55||27.38|
|Brown et al. w/PCA||29d||18.27||11.98||16.85||13.55||-||-||-||15.16|
|Ours, 500k training pairs||128d||14.88||9.47||16.57||19.50||9.01||17.21||14.44||15.11|
|Ours, 4M training pairs||128d||15.48||8.88||11.84||17.78||8.40||15.07||12.91||13.50|
We follow the standard protocol and calculate ROC curves by thresholding the distance between feature pairs, and determine the false positive rate at 95% recall. The numbers are shown in Table 1. As in previous work, we also report the mean across all six combinations of training and test data. Like the original work, we also provide a second metric, mean(1), which is the mean across the four cases obtained by training models only on Yosemite and Notredame.
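The metric can be computed directly from descriptor distances and ground-truth labels; a sketch of the protocol as described above:

```python
import numpy as np

def fpr_at_recall(distances, labels, recall=0.95):
    # distances: descriptor distances per pair; labels: 1 = matching,
    # 0 = non-matching. A pair is classified positive when its distance
    # falls below a threshold chosen to retain `recall` of true positives.
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    pos = np.sort(distances[labels == 1])
    k = int(np.ceil(recall * len(pos))) - 1   # smallest threshold keeping
    thr = pos[k]                              # at least `recall` positives
    neg = distances[labels == 0]
    return float(np.mean(neg <= thr))
```

Lower values are better: the metric measures how many non-matching pairs slip under a threshold loose enough to accept 95% of the matching pairs.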
The results in Table 1 confirm that the proposed model has better performance than previous work with the same number of training pairs. For instance, on Notredame-Liberty the siam-L2 model outperforms the hand-crafted descriptors nSIFT+L2 and nSIFT squared diff. by 16.6% and 13.3%, respectively, in absolute error. Our method with the same amount of training data further improves accuracy by 1.4% in absolute error rate. Moreover, the length of our descriptor is significantly shorter. The benefit of applying histogram equalization is presented in the last two rows of Table 1. The proposed model achieves 12.21% and 13.07% average error with and without augmentation of the training data, respectively. In general, histogram equalization improves the performance of the proposed descriptor by 9.21% in relative terms for the average error and by 6.93% for mean(1).
Examples of misclassified pairs are shown in Fig. 6. Specifically, we notice that some patches in the false negative and false positive examples are so similar that even a human could make a mistake in interpretation. In fact, it seems that the top-ranking false positives (i.e. the pairs of negative patches whose descriptors are closest to each other) probably originate from repeating texture patterns of the scene (i.e. similar texture appears in different 3D locations of the scene). Obviously, our descriptor, or any other similar descriptor, cannot tell the difference here as it does not have access to the multi-view information which was used to generate the ground truth labels. More interestingly, the top-ranking false negatives (i.e. the pairs of positive patches with descriptors furthest away from each other) seem to originate from patches with a perceived dissimilarity caused by inaccurate geometric alignment (due to non-planarity of the scene surface or due to inaccuracies in the orientation assignment or localization of the interest point). Thus, augmentation of training data and/or hard positive mining could bring further improvement and robustness to the aforementioned factors in the future. Nevertheless, Fig. 6 confirms the good behaviour of the proposed descriptor as the failure cases are intuitively understandable and hard to avoid in general without trade-offs.
4.1 Spatial Transformer Networks
Our visualisation in Fig. 6 shows that the image patches in many of the false negative pairs have a slightly differing alignment. That is, the patches represent corresponding scene surfaces but the scales, orientations and locations assigned by the interest point detector do not match precisely. Thus, based on the visualisation and interpretation of our results in Fig. 6, we decided to further investigate whether our descriptor could be made more robust to spatial misalignment by applying spatial transformer (ST) networks. Specifically, the spatial transformer is a differentiable module performing explicit spatial transformations of input feature maps, and it can easily be placed in any part of a neural network. However, so far ST networks have mainly been used in image classification problems and, to the best of our knowledge, they have not previously been used for learning image similarity metrics with a contrastive loss function.
Fig. 8 schematically illustrates how we augment our cnn7 model (introduced in Section 3.2) by incorporating ST modules right after the preprocessing layer. As we put the ST module as the first layer of the network, it directly transforms the preprocessed input patches. The number of parameters can vary and depends on the type of transformation used. Inspired by the examples of Fig. 6, we aim to compensate for errors caused by rotation, translation and scaling. Therefore, the number of parameters estimated by the localisation network equals 4 (one for rotation, one for scaling and two for translation). The architecture of the localisation network is as follows: convBlock[32,5,1,2]-pool-convBlock[64,5,1,2]-pool-convBlock[128,5,1,2]-fc-fc, where fc[n] denotes a fully-connected layer with n outputs. The complete model with the ST layer is denoted cnn7stn.
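Under this parameterization, the four predicted numbers define the 2×3 affine matrix from which the spatial transformer builds its sampling grid. A sketch of that mapping (the ordering of the parameters is an assumption here):

```python
import numpy as np

def affine_from_params(angle, scale, tx, ty):
    # One rotation angle, one isotropic scale and two translations give a
    # restricted 2x3 affine matrix (a similarity transform) -- the grid
    # generator of the ST module consumes exactly such a matrix.
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[scale * c, -scale * s, tx],
                     [scale * s,  scale * c, ty]])
```

Restricting the ST module to these four degrees of freedom matches the failure modes seen in Fig. 6 while avoiding the full six-parameter affine family.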
We train both cnn7 and cnn7stn from random initialization using the histogram-equalized pairs from the augmented Liberty training set (4M pairs). However, this time we did not use weight decay, and both models were trained for a smaller number of epochs than used for the results of Table 1 (due to limited available training time). The models were evaluated with the Notredame test set (100k pairs) and the results are shown in Fig. 7. We can see that cnn7stn gives better performance than cnn7.
In order to further visualize and analyse the difference, Fig. 9 shows examples of pairs for which the two models give different classification results at 0.95 recall. Fig. 10 shows the output of the ST layer for the same patches. We can see that in most cases the ST layer transforms both patches of a pair quite similarly, but in some cases (indicated with blue color) the ST layer seems to improve the alignment, which is probably the explanation for the better performance of cnn7stn. Hence, it seems that the ST layer has learnt the desired behaviour to some extent. Still, there is probably room for further improvement since many misaligned pairs remain quite differently aligned after the ST layer (cf. Fig. 10).
Fig. 9: Visualisation of some Notredame test pairs which are classified differently by cnn7 and cnn7stn when the recall is set to 0.95 for both. As can be seen from Fig. 7, cnn7stn has higher precision. The total number of test pairs is 100k.
5 Conclusion

In this paper, we use a Siamese architecture to train a deep convolutional network for extracting descriptors from image patches. In training we utilized matching and non-matching pairs of image patches from the MSC dataset. Several conclusions can be drawn from our experiments. First, we propose a descriptor with good performance, notably outperforming previous CNN-based L2-norm descriptors on several datasets. We also show that utilizing histogram equalization for adjusting patch contrast improves the accuracy of the proposed model. In addition, we ran preliminary experiments appending spatial transformer layers to our CNN architecture and observed an improvement in the resulting descriptor. A potential future performance enhancement could be to investigate optimal structures for the localisation network of the ST layers, which could make the descriptor even more robust to geometric transformations.
-  Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60 (2004) 91–110
-  Tola, E., Lepetit, V., Fua, P.: A fast local descriptor for dense matching. In: Proceedings of Computer Vision and Pattern Recognition. (2008)
-  Hua, G., Brown, M., Winder, S.: Discriminant learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (2010)
-  Trzcinski, T., Christoudias, C.M., Lepetit, V., Fua, P.: Learning image descriptors with the boosting-trick. In: NIPS. (2012) 278–286
-  Trzcinski, T., Christoudias, M., Fua, P., Lepetit, V.: Boosting binary keypoint descriptors. In: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition. CVPR ’13, IEEE Computer Society (2013) 2874–2881
-  Simonyan, K., Vedaldi, A., Zisserman, A.: Learning local feature descriptors using convex optimisation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2014)
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C., Bottou, L., Weinberger, K., eds.: Advances in Neural Information Processing Systems 25. (2012) 1097–1105
-  Zagoruyko, S., Komodakis, N.: Learning to compare image patches via convolutional neural networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015)
-  Han, X., Leung, T., Jia, Y., Sukthankar, R., Berg, A.C.: Matchnet: Unifying feature and metric learning for patch-based matching. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015)
-  Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. Computer Vision and Pattern Recognition 1 (2005) 539–546
-  Goesele, M., Snavely, N., Curless, B., Hoppe, H., Seitz, S.M.: Multi-view stereo for community photo collections. In: Proceedings of the 11th International Conference on Computer Vision (ICCV 2007), IEEE (2007) 265–270
-  Snavely, N., Seitz, S.M., Szeliski, R.: Modeling the world from internet photo collections. Int. J. Comput. Vision 80 (2008) 189–210
-  Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Advances in Neural Information Processing Systems 28. (2015) 2017–2025
-  Strecha, C., Bronstein, A., Bronstein, M., Fua, P.: Ldahash: Improved matching with smaller descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 34 (2012) 66–78
-  Jahrer, M., Grabner, M., Bischof, H.: Learned local descriptors for recognition and matching. Computer Vision Winter Workshop (2008)
-  Osendorfer, C., Bayer, J., Urban, S., van der Smagt, P.: Convolutional neural networks learn compact local image descriptors. In: Neural Information Processing - 20th International Conference, ICONIP. (2013) 624–630
-  Zbontar, J., LeCun, Y.: Stereo matching by training a convolutional neural network to compare image patches. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015)
-  Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR) (2013)
-  Lin, T.Y., Cui, Y., Belongie, S., Hays, J.: Learning deep representations for ground-to-aerial geolocalization. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015)
-  Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., Moreno-Noguer, F.: Discriminative learning of deep convolutional feature point descriptors. International Conference on Computer Vision (2015)
-  Mobahi, H., Collobert, R., Weston, J.: Deep learning from temporal coherence in video. In: Proceedings of the 26th Annual International Conference on Machine Learning. ICML ’09, ACM (2009) 737–744
-  Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. Computer Vision and Pattern Recognition 2 (2006) 1735–1742
-  Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR abs/1502.03167 (2015)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
-  LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1 (1989) 541–551
-  Zeiler, M.D.: ADADELTA: an adaptive learning rate method. CoRR abs/1212.5701 (2012)