Convolutional Neural Networks learn compact local image descriptors

04/30/2013 ∙ by Christian Osendorfer, et al.

A standard deep convolutional neural network paired with a suitable loss function learns compact local image descriptors that perform comparably to state-of-the-art approaches.




1 General Learning Architecture

Recently, several machine-learning-based approaches [1, 9, 8] have shown impressive results for finding compact low-level image representations. These representations are considered good when corresponding image patches are described by representations that are close by.

DrLim [3] is a framework for energy-based models that learns representations using only such correspondence relationships. We utilize DrLim to train a convolutional neural network for learning low-dimensional mappings for low-level image patches.

The main idea behind DrLim is to map similar (i.e. corresponding) image patches to nearby points on the output manifold and dissimilar image patches to distant points. It is defined over pairs of image patches, $(x_1, x_2)$. The $i$-th pair $(x_1^i, x_2^i)$ is associated with a label $y^i$, with $y^i = 1$ if $x_1^i$ and $x_2^i$ are deemed similar and $y^i = 0$ otherwise. We denote by $d_\theta(x_1, x_2)$ the parameterized distance function between the representations of $x_1$ and $x_2$ that we want to learn. Based on $d_\theta$ we define DrLim's loss function $\ell(\theta)$:

$$\ell(\theta) = \sum_i \left[ y^i \, \ell_{pull}\big(d_\theta(x_1^i, x_2^i)\big) + (1 - y^i) \, \ell_{push}\big(d_\theta(x_1^i, x_2^i)\big) \right]$$

We denote with $\ell_{pull}$ the partial loss function for similar pairs (it pulls similar pairs together) and with $\ell_{push}$ the partial loss function for dissimilar pairs (it pushes dissimilar pairs apart). $\ell_{push}$ is defined as in [3]:

$$\ell_{push}(d) = c_{push} \, \big[\max(0, \, m_{push} - d)\big]^2$$

$m_{push}$ is the push margin: dissimilar pairs are not pushed farther apart if they are already at a distance greater than the push margin. $c_{push}$ is a scaling factor.

For $\ell_{pull}$ we use a loss similar to the hinge loss, which differs from the loss function proposed in the original DrLim formulation:

$$\ell_{pull}(d) = c_{pull} \, \max(0, \, d - m_{pull})$$

$c_{pull}$ is a scaling factor and $m_{pull}$ is a pull margin: similar pairs are pulled together only if they are at a distance above $m_{pull}$.

$d_\theta(x_1, x_2)$ is defined as the Euclidean distance between the learned representations of $x_1$ and $x_2$:

$$d_\theta(x_1, x_2) = \lVert f_\theta(x_1) - f_\theta(x_2) \rVert_2$$

$f_\theta$ denotes the mapping from the (high-dimensional) input space to the low-dimensional space. In this paper, $f_\theta$ is a convolutional neural network [5]. The network comprises a convolutional layer with 6 feature maps, a subsampling layer, a second convolutional layer with 21 feature maps, a second subsampling layer, a third convolutional layer with 55 feature maps and a fully connected layer with 32 units.
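As a minimal sketch of the loss described above, the following NumPy function sums a hinge-like pull term over similar pairs and a squared-margin push term over dissimilar pairs. The parameter names (`m_pull`, `m_push`, `c_pull`, `c_push`), the default scaling factors and the toy values are assumptions made for illustration, not the settings used in the experiments, and the linear form of the pull term is only one plausible reading of "similar to hinge loss".

```python
import numpy as np


def drlim_loss(emb1, emb2, similar, m_pull, m_push, c_pull=1.0, c_push=1.0):
    """Sum of pull and push losses over a batch of embedded patch pairs.

    emb1, emb2 : arrays of shape (n_pairs, dim), the learned representations.
    similar    : boolean array of shape (n_pairs,), True for corresponding pairs.
    """
    # Euclidean distance between the two representations of each pair.
    d = np.linalg.norm(emb1 - emb2, axis=1)
    # Hinge-like pull term (assumed form): active only beyond the pull margin.
    pull = c_pull * np.maximum(0.0, d - m_pull)
    # Squared-margin push term: active only inside the push margin.
    push = c_push * np.maximum(0.0, m_push - d) ** 2
    # Similar pairs contribute the pull term, dissimilar pairs the push term.
    return float(np.sum(np.where(similar, pull, push)))


# Toy example with hypothetical 2-d embeddings and made-up margins.
a = np.zeros((3, 2))
b = np.array([[3.0, 4.0], [0.3, 0.4], [0.6, 0.8]])  # distances 5.0, 0.5, 1.0
similar = np.array([True, True, False])
loss = drlim_loss(a, b, similar, m_pull=1.0, m_push=2.0)
print(loss)
```

In the toy example only the first similar pair (distance 5.0) lies beyond the pull margin, and the single dissimilar pair (distance 1.0) lies inside the push margin, so both margins are exercised.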

2 Experiments

We evaluate our proposed model on the dataset from [1]. The dataset is based on more than 1.5 million image patches (64×64 pixels) of three different scenes: the Statue of Liberty (about 450,000 patches), Notre Dame (about 450,000 patches) and Yosemite's Half Dome (about 650,000 patches). We denote these scenes with LY, ND and HD respectively. There are 250,000 corresponding image patch pairs and 250,000 non-corresponding image patch pairs available for every scene. We train on one scene and evaluate the learned embedding function on the other two scenes. Evaluation is done on the same test sets (50,000 matching and non-matching pairs) that are also used by other approaches.
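The train/test protocol can be spelled out in a few lines of Python. This is only an illustrative sketch: the variable names are invented here, and the pair counts are the ones quoted in the text.

```python
SCENES = ("LY", "ND", "HD")
# Corresponding plus non-corresponding pairs available per scene.
PAIRS_PER_SCENE = 250_000 + 250_000

# Single-scene training: each model is tested on the two scenes it has not seen.
single_scene = {train: [s for s in SCENES if s != train] for train in SCENES}

for train, tests in single_scene.items():
    print(f"train on {train} ({PAIRS_PER_SCENE} pairs) -> test on {', '.join(tests)}")
```

Each of the six resulting (training scene, test scene) combinations corresponds to one error rate reported per learned descriptor.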

Table 1 shows that convolutional networks (last entry) perform comparably to other state-of-the-art approaches. The appeal of a simple parametric model like a convolutional neural network is that it does not require any complex parameter tuning or pipeline optimization and that it can be integrated into larger systems that can then be trained in an end-to-end fashion [4].

The architecture is trained with standard gradient descent. Training stops when a local minimum of the DrLim objective is reached. Notably, the hyperparameters (the push and pull margins and the two scaling factors) used in our evaluation are not scene dependent.

                                       Test set
Method            Tr. set      LY       ND       HD
SIFT              –            31.7     22.8     25.6
L-BGM             LY           –        14.1     19.6
(64d)             ND           18.0     –        15.8
                  HD           21.0     13.7     –
Brown et al.      ND           16.8     –        13.5
(29d)             HD           18.2     11.9     –
Simonyan et al.   ND           14.5     –        12.5
(29d)             HD           17.4     9.6      –
CNN               LY           –        11.2     18.5
(32d)             ND           16.4     –        16.2
                  HD           18.9     10.7     –

Table 1: Error rates, i.e. the percentage of incorrect matches when 95% of the true matches are found. Every subtable, indicated by an entry in the Method column, denotes a descriptor algorithm. The entry in parentheses below every method denotes the size of the descriptor (e.g. 32d denotes a 32-dimensional descriptor). The 128-dimensional SIFT descriptor [7] does not require learning (denoted by – in the column Tr. set, i.e. training set). The numbers in the columns labeled LY, ND and HD are the error rates of a method on the respective test set for this scene; a – marks the scene a model was trained on, which is not used to evaluate it. [1, 8] do not have results when trained on the LY scene, so no LY row is listed for them. L-BGM is presented in [9]. The error rates for convolutional neural networks (CNN) are means over 10 runs (the standard deviations are not shown).
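The error measure in Table 1 can be computed from descriptor distances roughly as follows. This is a sketch under one assumed reading of the caption: "incorrect matches" is taken to be the share of non-matching pairs whose distance falls at or below the threshold that accepts 95% of the matching pairs. The function name and the toy distances are hypothetical.

```python
import numpy as np


def error_at_95_recall(distances, is_match, recall=0.95):
    """Percent of non-matching pairs accepted at the distance threshold that
    recovers `recall` of the matching pairs (smaller distance = more similar)."""
    distances = np.asarray(distances, dtype=float)
    is_match = np.asarray(is_match, dtype=bool)
    match_d = np.sort(distances[is_match])
    # Smallest threshold that accepts at least `recall` of the true matches.
    k = int(np.ceil(recall * match_d.size))
    threshold = match_d[k - 1]
    # Non-matching pairs at or below the threshold count as incorrect matches.
    false_pos = np.count_nonzero(distances[~is_match] <= threshold)
    return 100.0 * false_pos / np.count_nonzero(~is_match)


# Toy data: 20 matching pairs and 10 non-matching pairs with made-up distances.
match_d = np.arange(1, 21, dtype=float)
nonmatch_d = np.array([5, 15, 18.5, 19, 25, 30, 40, 50, 60, 70], dtype=float)
distances = np.concatenate([match_d, nonmatch_d])
is_match = np.concatenate([np.ones(20, dtype=bool), np.zeros(10, dtype=bool)])
err = error_at_95_recall(distances, is_match)
print(err)
```

In the toy data the threshold lands on the 19th smallest matching distance, and four of the ten non-matching pairs fall at or below it.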

3 More data

Convolutional neural networks benefit from abundant data [2, 6]. Utilizing data from two scenes improves error rates noticeably: we get 15.1% on LY with combined training on ND and HD (1M patch pairs in total). Similarly, we get 8.5% on ND and 14.3% on HD when training on the respective other two scenes.


  • Brown et al. [2010] M. Brown, G. Hua, and S. Winder. Discriminative learning of local image descriptors. IEEE PAMI, 2010.
  • Ciresan et al. [2012] Dan Ciresan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In Proc. CVPR, 2012.
  • Hadsell et al. [2006] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Proc. CVPR, 2006.
  • Hadsell [2008] R.T. Hadsell. Learning long-range vision for an offroad robot. PhD thesis, New York University, 2008.
  • Jarrett et al. [2009] K. Jarrett, K. Kavukcuoglu, M.A. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Proc. ICCV, 2009.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
  • Lowe [2004] D.G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  • Simonyan et al. [2012] K. Simonyan, A. Vedaldi, and A. Zisserman. Descriptor learning using convex optimisation. In Computer Vision–ECCV 2012, 2012.
  • Trzcinski et al. [2012] T. Trzcinski, M. Christoudias, V. Lepetit, and P. Fua. Learning image descriptors with the boosting-trick. In Proc. NIPS, 2012.