1 Introduction
Feature-based matching between images has become a standard approach in the computer vision literature in the last decade, in many respects due to the introduction of stable and invariant feature detection and description algorithms such as SIFT [23] and similar methods [27, 2, 38]. The usual assumption guiding the design of feature descriptors is invariance across viewpoints, which should guarantee that the same feature appearing in two different views has the same descriptor. Since perspective transformations are approximately locally affine, it is common to construct affine-invariant descriptors [21]. While affine invariance is a good model in many cases, it is not sufficiently accurate in cases of wide baseline (very different viewpoints) or in the even more complicated setting of optical imperfections such as lens distortions, blur, etc. In particular, in omnidirectional vision systems the distortion is introduced intentionally (e.g., using a parabolic mirror [25]) to allow a wider field of view. Designing invariant descriptors for such cases is challenging, as the invariance is complicated and cannot be easily modeled.
An alternative to ‘invariance-by-construction’ approaches, which rely on a simplified invariance model, is to learn the descriptor invariance from examples. The recent work of Strecha et al. [35] showed very convincingly that such approaches can significantly improve the performance of existing descriptors.
In this paper, we consider the learning of invariant descriptors for omnidirectional image matching. We construct a training set of similar and dissimilar descriptor pairs including strong optical distortions, and use a neural network to learn a mapping from the descriptor space to the Hamming space that preserves similarity on the training set. Experimental results show that our approach outperforms not only straightforward descriptors, but also other similarity-preserving hashing methods. The latter observation is explained by the suboptimality of existing approaches, which solve a simplified optimization problem.
The main contribution of this paper is twofold. First, we formulate a new similarity-sensitive hashing algorithm. Second, we use this approach to learn smaller invariant descriptors suitable for feature matching in omnidirectional images. The rest of the paper is organized as follows. In Section 2, we review related work. Section 3 is dedicated to metric learning and similarity-preserving hashing methods. In Section 4, we describe our NNhash approach. Section 5 contains experimental results. Finally, Section 6 discusses potential future work and concludes the paper.
2 Background
Although feature-based correspondence problems have been investigated in depth for standard perspective cameras, omnidirectional image matching still remains an open problem, largely because of the complicated geometry introduced by lenses and curved mirrors. Broadly speaking, the existing approaches either try to reduce the problem to the simpler perspective setting, or design special descriptors suitable for omnidirectional images.
Svoboda et al. [36] proposed to use adaptive windows around interest points to generate normalized patches, under the assumption that the displacement of the omnidirectional system is smaller than the depth of the surrounding scene. Nayar [28] showed that, given the mirror parameters, it is possible to generate a perspective version of the omnidirectional image, and Mauthner et al. [24] used this approach to generate a perspective representation of each interest point region. This unwarping procedure removes the nonlinear distortions and enables the use of algorithms designed for perspective cameras. Micusik and Pajdla [26] checked candidate correspondences between two views using the RANSAC algorithm and the epipolar constraint [13]. Construction of scale space by means of diffusion on manifolds was used in [3, 16, 11] for the construction of local descriptors. Puig et al. [29] integrated the sphere camera model with the partial differential equations on manifolds framework.
Another possible solution is to consider different kinds of features that exploit particular invariances of omnidirectional systems, for example, extracting one-dimensional features [5] or vertical lines [32], and to define descriptors suitable for omnidirectional images.
More recently, it was shown in [35] that one can approach the design of invariant descriptors from the perspective of metric learning, constructing a distance between the descriptor vectors from a training set of similar and dissimilar pairs [1, 42]. In particular, similarity-preserving hashing methods [14, 34, 43, 22, 30] were found especially attractive for descriptor learning, as they significantly reduce descriptor storage and comparison complexity. These methods have also been applied to image search [17, 39, 19, 18, 20, 41], video copy detection [7], and shape retrieval [6]. In [31], binary codes were produced using a restricted Boltzmann machine, and in [43] using spectral hashing in an unsupervised setting. The authors showed that the learnt binary vectors capture the similarities of the data. With such approaches, however, it is impossible to explicitly provide information about data similarities. Since in our problem it is easy to produce labeled data, supervised metric learning is advantageous.

3 Similarity-preserving hashing
Given a set of keypoint descriptors, represented as $n$-dimensional vectors in $\mathbb{R}^n$, the problem of metric learning is to find their representation in some metric space $(\mathbb{Z}, d_{\mathbb{Z}})$ by means of a map of the form $\xi: \mathbb{R}^n \to \mathbb{Z}$. The metric $d_{\mathbb{Z}}(\xi(x), \xi(x'))$ parametrizes the similarity between the feature descriptors, which may be difficult to compute in the original representation. Typically, $d_{\mathbb{Z}}$ is fixed and $\xi$ is the map we are trying to find in such a way that, given a set $\mathcal{P}$ of pairs of descriptors from corresponding points in different images (positives) and a set $\mathcal{N}$ of pairs of descriptors from different points (negatives), we have, with high probability, $d_{\mathbb{Z}}(\xi(x), \xi(x'))$ small for all $(x, x') \in \mathcal{P}$ and large for all $(x, x') \in \mathcal{N}$.
In the particular setting of this problem where $\mathbb{Z} = \{-1, +1\}^m$ is the $m$-dimensional space of binary strings and $d_{\mathbb{Z}}$ is the Hamming metric, the problem is referred to as similarity-preserving hashing. Here, we limit our attention to affine embeddings of the form

$\xi(x) = \mathrm{sign}(Px + t), \qquad (1)$

where $P$ is an $m \times n$ matrix and $t$ is an $m \times 1$ vector. Our goal is to find $P$ and $t$ that minimize one of the following cost functions:
$L_1 = \alpha\,\mathbb{E}\{\xi(x)^\top\xi(x') \mid \mathcal{N}\} - \mathbb{E}\{\xi(x)^\top\xi(x') \mid \mathcal{P}\}$ or $L_2 = \mathbb{E}\{d_{\mathbb{Z}}(\xi(x),\xi(x')) \mid \mathcal{P}\} - \alpha\,\mathbb{E}\{d_{\mathbb{Z}}(\xi(x),\xi(x')) \mid \mathcal{N}\}$, where the conditional expectations are taken over the positive and negative pairs, respectively. Both cost functions try to map positives as close as possible to each other (expressed as large correlation or small distance) and negatives as far as possible from each other (small correlation or large distance), in order to ensure low false positive (FPR) and false negative (FNR) rates. The parameter $\alpha \geq 0$ determines the tradeoff between the FPR and the FNR. In practice, the expectations are approximated as means on some sufficiently large training set.
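To make the objective concrete, the empirical version of the distance-based cost can be sketched in a few lines of NumPy. This is an illustrative helper under our own naming and toy data, not the implementation used in the experiments:

```python
import numpy as np

def hash_codes(X, P, t):
    # Affine similarity-preserving hash of eq. (1): sign(Px + t), codes in {-1, +1}.
    return np.where(X @ P.T + t >= 0, 1.0, -1.0)

def empirical_L2(P, t, positives, negatives, alpha=1.0):
    # Empirical distance-based cost: expectations replaced by means over the
    # positive and negative training pairs; Hamming distance counts sign disagreements.
    mean_dist = lambda pairs: np.mean(
        [np.sum(hash_codes(x, P, t) != hash_codes(y, P, t)) for x, y in pairs])
    return mean_dist(positives) - alpha * mean_dist(negatives)

# Toy check: positives are small perturbations, negatives are unrelated vectors.
rng = np.random.default_rng(0)
P = rng.standard_normal((16, 8))
t = rng.standard_normal(16)
base = rng.standard_normal((10, 8))
positives = [(b, b + 1e-3 * rng.standard_normal(8)) for b in base]
negatives = [(rng.standard_normal(8), rng.standard_normal(8)) for _ in range(10)]
cost = empirical_L2(P, t, positives, negatives)
```

On such toy data the cost is negative, since positive pairs collide in nearly all bits while unrelated negatives disagree in roughly half of them.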
The problem is a nonlinear, nonconvex optimization problem without an obvious simple solution. It is commonly approached by the following two-stage relaxation: first, approximate the map by removing the sign nonlinearity and the offset vector, minimizing

$\hat{L}_2(P) = \mathbb{E}\{\|Px - Px'\|^2 \mid \mathcal{P}\} - \alpha\,\mathbb{E}\{\|Px - Px'\|^2 \mid \mathcal{N}\}$

w.r.t. $P$ (introducing some regularization, e.g., $PP^\top = I$, in order to avoid the trivial solution $P = 0$). Second, fix $P$ and solve w.r.t. $t$. To further simplify the problem, it is also common to assume separability, thus solving independently for each dimension of the hash.
3.1 Similarity-sensitive hashing (SSH)
In [34], the above strategy was used for the approximate minimization of the cost $L_1$. The computation of the optimal parameters $P$ and $t$ was posed as a boosted binary classification problem, where $\xi(x)^\top\xi(x')$ acts as a strong binary classifier and each dimension of the linear projection, $\mathrm{sign}(p_i x + t_i)$, is considered a weak classifier (here, $p_i$ denotes the $i$th row of $P$). This way, AdaBoost can be used to find a greedy approximation of the minimizer of $L_1$ by progressively constructing $P$ and $t$. At the $i$th iteration, the $i$th row of the matrix $P$ and the $i$th element of the vector $t$ are found by minimizing a weighted version of $L_1$. Since the problem is nonlinear, such an optimization is challenging; in [34], random projection directions were used. A better method for projection selection, similar to linear discriminant analysis (LDA), was proposed in [7, 9]. Weights of false positive and false negative pairs are increased, and weights of true positive and true negative pairs are decreased, using the standard AdaBoost reweighting scheme [12].

3.2 Covariance difference hashing (diffhash)
In [35], it was observed that the relaxed minimization of $L_2$ can be written as

$\hat{L}_2(P) = \mathrm{tr}\{P\,(\Sigma_{\mathcal{P}} - \alpha\,\Sigma_{\mathcal{N}})\,P^\top\}, \qquad (2)$

where $\Sigma_{\mathcal{P}}$ and $\Sigma_{\mathcal{N}}$ are the covariance matrices of the differences of the positive and negative pairs of vectors. Requiring an orthonormal projection matrix $P$, the problem has a closed-form solution consisting of the $m$ smallest eigenvectors of $\Sigma_{\mathcal{P}} - \alpha\,\Sigma_{\mathcal{N}}$, and is thus also a separable problem. Once the projection is found in this way, the threshold vector $t$ minimizing the sum of the false positive and false negative rates is selected. This second stage also turns out to be separable in each dimension. In [8], a more generic kernelized version of diffhash (kdiffhash) was presented.

3.3 LDAHash
A similar method was derived in [35] by transforming the coordinates as $\tilde{x} = \Sigma_{\mathcal{N}}^{-1/2} x$, which allows writing $\hat{L}_2$ as

$\hat{L}_2(\tilde{P}) = \mathrm{tr}\{\tilde{P}\,\Sigma_{\mathcal{N}}^{-1/2}\Sigma_{\mathcal{P}}\Sigma_{\mathcal{N}}^{-1/2}\,\tilde{P}^\top\} + \mathrm{const}. \qquad (3)$

This approach resembles linear discriminant analysis (LDA), hence the name LDAHash. Requiring an orthonormal projection matrix $\tilde{P}$, the problem has a separable closed-form solution consisting of the $m$ smallest eigenvectors of $\Sigma_{\mathcal{N}}^{-1/2}\Sigma_{\mathcal{P}}\Sigma_{\mathcal{N}}^{-1/2}$.
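Both closed-form solutions above reduce to a single eigendecomposition. The following NumPy sketch (illustrative only; the function names, the regularizer `eps`, and the toy data are our assumptions) shows the diffhash and LDAHash projections side by side:

```python
import numpy as np

def diffhash_projection(pos_diff, neg_diff, m, alpha=1.0):
    # diffhash: rows of P are the m smallest eigenvectors of Sigma_P - alpha * Sigma_N,
    # where Sigma_* are covariances of descriptor differences (eq. (2)).
    S_p = np.cov(pos_diff, rowvar=False)
    S_n = np.cov(neg_diff, rowvar=False)
    _, vecs = np.linalg.eigh(S_p - alpha * S_n)   # eigenvalues in ascending order
    return vecs[:, :m].T

def ldahash_projection(pos_diff, neg_diff, m, eps=1e-8):
    # LDAHash: whiten by Sigma_N^{-1/2}, then take the m smallest eigenvectors
    # of the whitened positive covariance (eq. (3)).
    S_p = np.cov(pos_diff, rowvar=False)
    S_n = np.cov(neg_diff, rowvar=False)
    w_vals, w_vecs = np.linalg.eigh(S_n)
    W = w_vecs @ np.diag(1.0 / np.sqrt(w_vals + eps)) @ w_vecs.T  # Sigma_N^{-1/2}
    _, vecs = np.linalg.eigh(W @ S_p @ W)
    return vecs[:, :m].T @ W   # back to the original coordinates

# Toy data: positive differences have low variance along the first coordinates.
rng = np.random.default_rng(1)
scales = np.linspace(0.1, 1.0, 8)
pos_diff = rng.standard_normal((500, 8)) * scales
neg_diff = rng.standard_normal((500, 8))
P_diff = diffhash_projection(pos_diff, neg_diff, m=4)
P_lda = ldahash_projection(pos_diff, neg_diff, m=4)
```

The diffhash rows are orthonormal by construction, and the selected directions make projected positive differences much smaller than projected negative ones, which is exactly what the relaxed cost rewards.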
4 Neural network hashing (NNhash)
The problem with the existing and most successful similarity-preserving hashing approaches, such as LDAHash or diffhash, is that they do not solve the optimization problem itself, but rather its relaxation. As a result, the parameters found by these methods in the aforementioned two-stage separable scheme are suboptimal, i.e., the achieved cost can be far from the true minimum. Our experience shows that in some cases the suboptimality is dramatic (at least an order of magnitude).
A way of solving the ‘true’ optimization problem is to formulate it in the neural network (NN) framework and exploit the numerous optimization techniques and heuristics developed in this field. Since we have a way of cheaply producing labeled data, we adopt the siamese network architecture [33, 15], which, contrary to conventional models, receives two input patterns and minimizes a loss function similar to the cost $L_2$,

$L = \sum_{(x,x') \in \mathcal{P}} \|\xi(x) - \xi(x')\|^2 + \sum_{(x,x') \in \mathcal{N}} \max\{0,\, c - \|\xi(x) - \xi(x')\|\}^2, \qquad (4)$

where the constant $c$ represents the margin between dissimilar pairs. The margin acts as a regularizer that keeps the system from minimizing the loss by simply pulling the two vectors as far apart as possible. The embedding $\xi$ is then learned to map positive pairs as close as possible and negative pairs to at least distance $c$.
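The contrastive loss of eq. (4) is compact enough to write out directly. The following is a minimal sketch; the helper name and the margin value are ours:

```python
import numpy as np

def siamese_loss(y1, y2, is_positive, margin=1.0):
    # Contrastive loss of eq. (4): positives are pulled together (squared distance),
    # negatives are penalized only while they are closer than the margin.
    d = np.linalg.norm(np.asarray(y1, dtype=float) - np.asarray(y2, dtype=float))
    if is_positive:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

Note that a negative pair already separated by more than the margin contributes zero loss, so the network has no incentive to push dissimilar pairs arbitrarily far apart.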
Network architectures of this type can be traced back to the work of Schmidhuber and Prelinger [33] on problems of predictable classification. In [15], siamese networks were used to learn an invariant mapping of tiny images directly from their pixel representation, whereas in [37] a similar approach is used to learn a model that is highly effective at matching people in similar poses and exhibits invariance to identity, clothing, background, lighting, shift, and scale. An advantage of such an architecture is that one can create arbitrarily complex embeddings by simply stacking more layers in the network. In all our experiments, in order to make a fair comparison to the other hashing methods, we adopt a simple single-layer architecture, wherein $\xi(x) = \mathrm{sign}(Px + t)$. Network training attempts to find the parameters $P$ and $t$ that minimize $L$ (which is a regularized version of $L_2$). Since we solve the nonlinear problem without introducing any simplification or relaxation, the results are expected to be better compared to the hashing methods described in Section 3. In the following, we refer to our method as NNhash.
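A single-layer embedding of this kind can be sketched as follows, using a smooth saturating activation in place of the hard sign so that the map remains differentiable during training. This is an illustration under our own assumptions, not the training code:

```python
import numpy as np

def nnhash_embed(X, P, t, beta=1.0):
    # Single-layer siamese embedding: a smooth surrogate of sign(Px + t).
    # As the steepness beta grows, the outputs saturate towards {-1, +1}.
    return np.tanh(beta * (X @ P.T + t))

rng = np.random.default_rng(2)
P = rng.standard_normal((32, 8))   # in practice, could be initialized from
t = rng.standard_normal(32)        # a closed-form hashing solution
X = rng.standard_normal((5, 8))
soft = nnhash_embed(X, P, t, beta=1.0)
hard = nnhash_embed(X, P, t, beta=50.0)
```

With a large steepness the outputs are effectively binary while preserving the signs of the underlying linear responses, so the learned map stays consistent with eq. (1).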
Since a binary output is required, we adopt the hyperbolic tangent $\tanh(\beta\,\cdot)$ as the nonlinear activation function for our siamese network, which enforces nearly binary vectors when either the magnitude of the argument or the steepness $\beta$ of the function is increased. Since the problem is highly nonconvex, it is liable to local convergence, and thus there is no theoretical guarantee of finding the global minimum. However, by initializing $P$ and $t$ with the solution obtained by one of the standard hashing methods, we have a good initial point that can be improved by the network optimization.

5 Results
5.1 Data
In our experiments, we used the Rawseeds dataset [4, 10]. The dataset contains video sequences acquired by a robot equipped with an omnidirectional camera system based on a parabolic mirror, moving in indoor and outdoor scenes. The images undergo significant distortion, since different parts of the scene move from the central part of the mirror to the boundaries.
We used the toolbox of Vedaldi [40] to compute SIFT features in each frame of the video. Since the robot movement is slow, the change between two adjacent frames in the dataset is minimal, and SIFT features can be matched reliably. Tracking features over multiple frames, we constructed the positive set as the transitive closure of these adjacent feature descriptor pairs. This way, the positive set also included descriptors distant in time and, as a result of the robot motion, located at different regions in the image and thus subject to strong distortions. As negatives, we used features not belonging to the same track.
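The construction of the training pairs from feature tracks can be sketched as follows. This is a simplified illustration with hypothetical helper names; real tracks would hold descriptor vectors rather than string ids:

```python
import itertools

def pairs_from_tracks(tracks):
    # Transitive closure of adjacent-frame matches: every two descriptors on the
    # same track become a positive pair, including pairs distant in time.
    positives = []
    for track in tracks:            # each track: descriptors over consecutive frames
        positives.extend(itertools.combinations(track, 2))
    return positives

def negative_pairs(tracks):
    # Descriptors not belonging to the same track form negative pairs.
    negatives = []
    for a, b in itertools.combinations(range(len(tracks)), 2):
        negatives.extend((x, y) for x in tracks[a] for y in tracks[b])
    return negatives

tracks = [["a1", "a2", "a3"], ["b1", "b2"]]
pos = pairs_from_tracks(tracks)
neg = negative_pairs(tracks)
```

The transitive closure is what makes the positive set include descriptor pairs that are far apart in time, and hence strongly distorted relative to each other.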
In addition to the Rawseeds dataset, we created synthetic omnidirectional datasets using panorama images that were warped to simulate the effect of a parabolic mirror. The warping was intentionally different from that of the Rawseeds dataset. By moving the panorama image, we created synthetic motion with known pixel-wise groundtruth correspondence (Figure 5). The positive and negative sets for the synthetic data were constructed as described above.
5.2 Methods
We compared the SSH [34] and diffhash [35] methods with our NNhash method. For NNhash training, we used scaled conjugate gradient over the whole batch of descriptors, which we normalized to a fixed range. We used the same margin in all cases. The steepness factor of the activation was kept low for the 32-bit hash, while for the 64-bit hash we gradually increased it so as to obtain a smooth binarization. We reached convergence in all cases.

5.3 Performance degradation in time
For this experiment, we constructed the training set using descriptors extracted from consecutive frames of the outdoor sequence (similar results were obtained when using indoor or synthetic data for training). We considered descriptors that could be tracked over multiple consecutive frames and selected as positives pairs of descriptors belonging to these tracks. To avoid bias, we selected pairs of descriptors in such a way that the time difference between their frames was uniformly distributed. Training was performed on positive and negative sets to produce hashes of length 32 and 64 bits.

Testing was performed on a different portion of the same sequence, using frames at a smaller temporal distance (Figure 2, left) and at a larger one (Figure 2, right). A few phenomena can be observed in Figure 2, which shows the ROC curves of straightforward SIFT matching using the Euclidean distance and of matching the learned binary descriptors using the Hamming distance. First, we can see that even with very compact descriptors (as small as 32 bits, compared to the 1024 bits required to represent SIFT) we match or outperform SIFT. These results are consistent with the study in [35]. Second, we observe that NNhash significantly outperforms the other hashing methods for the same number of bits. This is a clear indication that the SSH and diffhash methods find a suboptimal solution by solving a relaxed problem, while NNhash attempts to solve the full nonlinear nonconvex optimization problem.
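The Hamming-distance matching and the ROC quantities reported here can be sketched as follows; this is an illustrative re-implementation of the standard definitions, not the evaluation code used for the figures:

```python
import numpy as np

def hamming(c1, c2):
    # Hamming distance between {-1, +1} codes: number of disagreeing bits.
    return np.sum(np.asarray(c1) != np.asarray(c2), axis=-1)

def error_rates(pos_dist, neg_dist, threshold):
    # Pairs at distance <= threshold are declared matches.
    fpr = np.mean(neg_dist <= threshold)   # negatives wrongly accepted
    fnr = np.mean(pos_dist > threshold)    # positives wrongly rejected
    return fpr, fnr

def equal_error_rate(pos_dist, neg_dist):
    # Sweep all observed thresholds; the EER is where FPR and FNR (roughly) cross.
    thresholds = np.unique(np.concatenate([pos_dist, neg_dist]))
    rates = np.array([error_rates(pos_dist, neg_dist, th) for th in thresholds])
    i = np.argmin(np.abs(rates[:, 0] - rates[:, 1]))
    return rates[i].mean()

pos_dist = np.array([0, 1, 1, 2])      # toy Hamming distances of matching pairs
neg_dist = np.array([14, 15, 17, 18])  # toy distances of non-matching pairs
```

On this perfectly separable toy data any threshold between 2 and 14 gives zero FPR and FNR, hence an EER of zero.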
Comparing Figure 2 (left and right) and Tables 1–2, we can observe how the matching performance degrades as the time between the frames increases. Because of the significant distortions caused by the parabolic mirror, objects moving around the scene appear differently in different frames. This phenomenon is especially noticeable when the distance between the frames is large. SIFT shows significant degradation, while NNhash, trained on a dataset including positive pairs at large temporal distances, degrades only slightly (even a 32-bit NNhash performs better than SIFT). This is a clear indication that we are able to learn the feature invariance.
Finally, Figure 4 shows a visual example of feature matching using different methods. NNhash produces matches most similar to the groundtruth (shown in green).
Table 1: Matching performance at the smaller frame distance.

Method    Bits   EER     FPR@1%   FPR@0.1%
SIFT      1024   1.91%   3.08%    13.87%
NNhash    32     1.66%   3.77%    23.81%
NNhash    64     1.31%   1.92%    9.48%
DiffHash  32     4.41%   9.36%    29.95%
DiffHash  64     2.57%   5.17%    18.30%
SSH       32     4.02%   15.64%   36.41%
SSH       64     2.22%   4.90%    16.74%
Table 2: Matching performance at the larger frame distance.

Method    Bits   EER     FPR@1%   FPR@0.1%
SIFT      1024   3.31%   7.47%    27.94%
NNhash    32     2.70%   6.98%    24.98%
NNhash    64     2.38%   4.54%    14.22%
DiffHash  32     5.17%   12.55%   37.49%
DiffHash  64     3.69%   8.75%    27.34%
SSH       32     5.52%   24.10%   47.29%
SSH       64     3.46%   9.48%    27.66%
5.4 Generalization
To test generalization, we performed transfer-learning experiments from outdoor data to indoor data and from synthetic data to real data.
Figure 3 (left) shows the performance of descriptors trained on outdoor data and tested on indoor data. We can see that, even though the data used for training is very different from that used for testing (see Figure 1 and Figure 4 for a visual comparison), we achieve better performance than SIFT with just 64 bits. Figure 3 (right) shows the performance of descriptors trained on synthetic data and tested on indoor data. All learning methods perform better than SIFT. The discrepancy between NNhash and the other algorithms is less pronounced than in the real case.
6 Discussion, Conclusions, and Future Work
We presented a new approach for feature matching in omnidirectional images, based on similarity-sensitive hashing and inspired by the recent work [35]. We learn a mapping from the descriptor space to the space of binary vectors that preserves the similarity of descriptors on a training set. By carefully constructing the training set, we account for descriptor variability, e.g., due to optical distortions. The resulting descriptors are compact and are compared using the Hamming metric, offering a significant computational advantage over traditional metrics such as the Euclidean distance. Though tested with SIFT descriptors, our approach is generic and can be applied to any feature descriptor.
We compared several existing similarity-preserving hashing methods, as well as our NNhash method based on a neural network. Experimental results show that NNhash outperforms the other approaches. An explanation for this behavior is the fact that today’s state-of-the-art similarity-preserving hashing algorithms, like SSH or LDAHash, solve a simplified optimization problem whose solution does not necessarily coincide with the solution of the “true” nonlinear nonconvex problem. We showed that using a neural network we can solve the “true” problem and obtain better performance.
Finally, our discussion in this paper was limited to simple embeddings of the form $\xi(x) = \mathrm{sign}(Px + t)$, which in some cases are too simple. The neural network framework seems to us a very natural way to consider more generic embeddings using multilayer network architectures.
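As an illustration of what such a generalization could look like, affine layers with saturating nonlinearities can simply be stacked. This is purely a sketch; no multilayer architecture is evaluated in this paper:

```python
import numpy as np

def multilayer_embed(x, layers, beta=1.0):
    # Stack of (P, t) layers with tanh nonlinearities; the last layer plays the
    # role of the binarizing output, as in the single-layer case.
    y = np.asarray(x, dtype=float)
    for P, t in layers:
        y = np.tanh(beta * (y @ P.T + t))
    return y

# Two hypothetical layers: 8 -> 16 -> 32, yielding a 32-bit (soft) code.
rng = np.random.default_rng(3)
layers = [(rng.standard_normal((16, 8)), rng.standard_normal(16)),
          (rng.standard_normal((32, 16)), rng.standard_normal(32))]
code = multilayer_embed(rng.standard_normal(8), layers, beta=5.0)
```

The siamese training scheme and the loss of eq. (4) apply unchanged; only the embedding function becomes deeper.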
Acknowledgement
M. B. is partially supported by the Swiss High Performance and High Productivity Computing (HP2C) grant. J. M. is supported by ArcelorMittal Maizières Research SA. D. M. is partially supported by the EU project FP7-ICT-IP-231722 (IM-CLeVeR).
References
 [1] V. Athitsos, J. Alon, S. Sclaroff, and G. Kollios. BoostMap: a method for efficient approximate similarity ranking. In Proc. CVPR, 2004.
 [2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. Computer Vision and Image Understanding, 110(3):346–359, 2008.
 [3] I. Bogdanova, X. Bresson, J. P. Thiran, and P. Vandergheynst. Scale space analysis and active contours for omnidirectional images. Trans. Image Processing, 16(7):1888–1901, 2007.
 [4] A. Bonarini, W. Burgard, G. Fontana, M. Matteucci, D. G. Sorrenti, and J. D. Tardos. Rawseeds: Robotics advancement through web-publishing of sensorial and elaborated extensive data sets. In Proc. IROS Workshop on Benchmarks in Robotics Research, 2006.
 [5] A. Briggs, Y. Li, D. Scharstein, and M. Wilder. Robot navigation using 1d panoramic images. In Proc. ICRA, 2006.
 [6] A. Bronstein, M. Bronstein, M. Ovsjanikov, and L. Guibas. Shape Google: geometric words and expressions for invariant shape retrieval. ACM TOG, 2010.
 [7] A. M. Bronstein, M. M. Bronstein, and R. Kimmel. Video genome. Technical Report arXiv:1003.5320v1, 2010.
 [8] M. M. Bronstein. Kernel diffhash. Technical Report arXiv:1111.0466v1, 2011.
 [9] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In Proc. CVPR, 2010.
 [10] S. Ceriani, G. Fontana, A. Giusti, D. Marzorati, M. Matteucci, D. Migliore, D. Rizzi, D. G. Sorrenti, and P. Taddei. Rawseeds ground truth collection systems for indoor selflocalization and mapping. Autonomous Robots, 27(4):353–371, 2009.
 [11] J. Cruz, I. Bogdanova, B. Paquier, M. Bierlaire, and J. P. Thiran. Scale invariant feature transform on the sphere: Theory and applications. Technical report, 2009.

 [12] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37, 1995.
 [13] C. Geyer and H. Stewenius. A nine-point algorithm for estimating para-catadioptric fundamental matrices. In Proc. CVPR, 2007.
 [14] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. International Conference on Very Large Databases, 1999.
 [15] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Proc. CVPR, 2006.
 [16] P. I. Hansen, P. Corke, and W. Boles. Wide-angle visual feature matching for outdoor localization. Int. J. Robotics Research, 29(2/3):267–297, February 2010.
 [17] P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. In CVPR, 2008.
 [18] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In Proc. ECCV, 2008.
 [19] H. Jégou, M. Douze, and C. Schmid. Packing bag-of-features. In Proc. ICCV, 2009.
 [20] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. Trans. PAMI, 2010.
 [21] R. Kimmel, C. Zhang, A. M. Bronstein, and M. M. Bronstein. Are MSER features really interesting? Trans. PAMI, 32(11):2316–2320, 2011.
 [22] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In Proc. NIPS, pages 1042–1050, 2009.
 [23] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
 [24] T. Mauthner, F. Fraundorfer, and H. Bischof. Region matching for omnidirectional images using virtual camera planes. Technology, 2006.
 [25] C. Mei and P. Rives. Single view point omnidirectional camera calibration from planar grids. In Proc. ICRA, 2007.
 [26] B. Micusik and T. Pajdla. Structure from motion with wide circular field of view cameras. Trans. PAMI, 28(7):1135–1149, 2006.
 [27] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 65(1/2):43–72, 2005.
 [28] S. K. Nayar. Catadioptric Omnidirectional Camera. In Proc. CVPR, 1997.
 [29] L. Puig and J. J. Guerrero. Scale space for central catadioptric systems. towards a generic camera feature extractor. In Proc. ICCV, 2011.
 [30] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In Proc. NIPS, 2009.
 [31] R. Salakhutdinov and G. Hinton. Semantic hashing. In SIGIR Workshop on Information Retrieval and applications of Graphical Models, 2007.
 [32] D. Scaramuzza, R. Siegwart, and A. Martinelli. A robust descriptor for tracking vertical lines in omnidirectional images and its use in mobile robotics. Int. J. Robotics Research, 28(2):149–171, 2009.
 [33] J. Schmidhuber and D. Prelinger. Discovering predictable classifications. Neural Computation, 5(4):625–635, 1993.
 [34] G. Shakhnarovich. Learning Task-Specific Similarity. PhD thesis, MIT, 2005.
 [35] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: improved matching with smaller descriptors. Trans. PAMI, 2011.

 [36] T. Svoboda and T. Pajdla. Matching in catadioptric images with appropriate windows, and outliers removal. In Proc. CAIP, 2001.
 [37] G. W. Taylor, I. Spiro, C. Bregler, and R. Fergus. Learning invariance through imitation. In Proc. CVPR, 2011.
 [38] E. Tola, V. Lepetit, and P. Fua. Daisy: an Efficient Dense Descriptor Applied to Wide Baseline Stereo. Trans. PAMI, 32(5):815–830, 2010.

 [39] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large dataset for nonparametric object and scene recognition. Trans. PAMI, 30(11):1958–1970, 2008.
 [40] A. Vedaldi. An open implementation of the SIFT detector and descriptor. Technical Report 070012, UCLA CSD, 2007.

 [41] J. Wang, S. Kumar, and S. F. Chang. Semi-supervised hashing for scalable image retrieval. In Proc. CVPR, 2010.
 [42] J. Wang, S. Kumar, and S. F. Chang. Sequential projection learning for hashing with compact codes. In Proc. ICML, 2010.
 [43] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Proc. NIPS, 2009.