1 General Learning Architecture
Recently, several machine-learning-based approaches [1, 9, 8] have shown impressive results for finding compact low-level image representations. These representations are considered good when corresponding image patches are described by representations that lie close to each other. DrLim [3] is a framework for energy-based models that learns a representation using only such correspondence relationships. We utilize DrLim to train a convolutional neural network for learning low-dimensional mappings of low-level image patches.
The main idea behind DrLim is to map similar (i.e. corresponding) image patches to nearby points on the output manifold and dissimilar image patches to distant points. It is defined over pairs of image patches, $(x_1^i, x_2^i)$. The $i$-th pair is associated with a label $y^i$, with $y^i = 0$ if $x_1^i$ and $x_2^i$ are deemed similar and $y^i = 1$ otherwise. We denote by $D_W(x_1^i, x_2^i)$ the parameterized distance function between the representations of $x_1^i$ and $x_2^i$ that we want to learn. Based on $D_W$ we define DrLim's loss function $L$:

$$L(W) = \sum_i \left[ (1 - y^i)\, L_S\!\left(D_W(x_1^i, x_2^i)\right) + y^i\, L_D\!\left(D_W(x_1^i, x_2^i)\right) \right]$$

We denote by $L_S$ the partial loss function for similar pairs (it pulls similar pairs together) and by $L_D$ the partial loss function for dissimilar pairs (it pushes dissimilar pairs apart). $L_D$ is defined as in [3]:

$$L_D(D_W) = \lambda_D \left( \max(0,\, m_D - D_W) \right)^2$$

$m_D$ is the push margin: dissimilar pairs are not pushed farther apart if they already are at a distance greater than the push margin. $\lambda_D$ is a scaling factor.

For $L_S$ we use a loss similar to the hinge loss, in contrast to the loss function proposed in the original DrLim formulation:

$$L_S(D_W) = \lambda_S \max(0,\, D_W - m_S)$$

$\lambda_S$ is a scaling factor and $m_S$ is a pull margin: similar pairs are pulled together only if they are at a distance above $m_S$.
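The full objective over a batch of pairs can be sketched in a few lines of NumPy. This is a minimal sketch, assuming a squared push loss with a push margin and a hinge-style pull loss with a pull margin as described above; the margin and scale values used here are illustrative placeholders, not the values used in the paper.

```python
import numpy as np

def drlim_loss(d, y, m_pull=0.1, m_push=1.0, lam_pull=1.0, lam_push=1.0):
    """DrLim objective over a batch of embedding distances d.

    y = 0 marks a similar (corresponding) pair, y = 1 a dissimilar one.
    m_pull / m_push are the pull / push margins, lam_* the scaling
    factors (the default values here are placeholders).
    """
    pull = lam_pull * np.maximum(0.0, d - m_pull)        # hinge-style loss for similar pairs
    push = lam_push * np.maximum(0.0, m_push - d) ** 2   # squared push loss as in [3]
    return float(np.sum((1 - y) * pull + y * push))

# A similar pair already within the pull margin and a dissimilar pair
# already beyond the push margin contribute no loss:
d = np.array([0.05, 1.5])
y = np.array([0, 1])
print(drlim_loss(d, y))  # 0.0
```

Note that, unlike the original DrLim pull term, the hinge-style pull loss goes exactly to zero once a similar pair is within the pull margin, so such pairs stop being contracted.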
$D_W$ is defined as the Euclidean distance between the learned representations of $x_1^i$ and $x_2^i$:

$$D_W(x_1^i, x_2^i) = \left\| G_W(x_1^i) - G_W(x_2^i) \right\|_2$$

$G_W$ denotes the mapping from the (high-dimensional) input space to the low-dimensional space. In this paper, $G_W$ is a convolutional neural network [5]. The layers of the convolutional network comprise a convolutional layer with 6 feature maps, a subsampling layer, a second convolutional layer with 21 feature maps, a subsampling layer, a third convolutional layer with 55 feature maps, and a fully connected layer with 32 units.
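The distance computation itself is straightforward. In the sketch below a random linear projection stands in for the learned mapping $G_W$ (the paper uses the convolutional network above), and the patch and output dimensions are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the learned mapping G_W: a random linear projection to a
# 32-dimensional space (the paper uses a convolutional network instead).
W = rng.standard_normal((32, 64))

def g_w(x):
    return W @ x.ravel()  # flatten the patch and project

def d_w(x1, x2):
    # Euclidean distance between the learned representations
    return float(np.linalg.norm(g_w(x1) - g_w(x2)))

x = rng.random((8, 8))  # an arbitrary 8x8 "patch" for illustration
print(d_w(x, x))  # 0.0
```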
2 Experiments
We evaluate our proposed model on the dataset from [1]. The dataset is based on more than 1.5 million image patches of three different scenes: the Statue of Liberty (about 450,000 patches), Notre Dame (about 450,000 patches) and Yosemite's Half Dome (about 650,000 patches). We denote these scenes by LY, ND and HD, respectively. There are 250,000 corresponding and 250,000 non-corresponding image patch pairs available for every scene. We train on one scene and evaluate the learned embedding function on the other two scenes. Evaluation is done on the same test sets (50,000 matching and non-matching pairs) also used by the other approaches.
Table 1 shows that convolutional networks (last entry) perform comparably to other state-of-the-art approaches. The appeal of a simple parametric model like a convolutional neural network is that it does not require any complex parameter tuning or pipeline optimization, and that it can be integrated into larger systems that can then be trained in an end-to-end fashion [4].
The architecture is trained with standard gradient descent. Training stops when a local minimum of the DrLim objective is reached. Notably, the hyperparameters (the pull and push margins and their scaling factors) used in our evaluation are not scene dependent.
                                 Test set
Method                Tr. set    LY      ND      HD
SIFT                     –      31.7    22.8    25.6
LBGM (64d)               LY       –     14.1    19.6
                         ND     18.0      –     15.8
                         HD     21.0    13.7      –
Brown et al. (29d)       LY       –
                         ND     16.8      –     13.5
                         HD     18.2    11.9      –
Simonyan et al. (29d)    LY       –
                         ND     14.5      –     12.5
                         HD     17.4     9.6      –
CNN (32d)                LY       –     11.2    18.5
                         ND     16.4      –     16.2
                         HD     18.9    10.7      –

Table 1: Error rates on the test sets. The mean error rates for convolutional neural networks (CNN) are given with a standard deviation over 10 runs.
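The gradient-descent training described above can be sketched as a plain descent pass over a batch of pairs. This is a toy sketch: a linear embedding stands in for the convolutional network (so the per-pair gradients can be written by hand), and the margins, scaling factors, and learning rate are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
m_pull, m_push, lam_pull, lam_push = 0.1, 1.0, 1.0, 1.0  # placeholder values
W = 0.1 * rng.standard_normal((4, 16))  # toy linear embedding instead of the CNN

def pair_grad(x1, x2, y, W):
    """Gradient of the DrLim pair loss w.r.t. W for a linear embedding W @ x."""
    delta = x1 - x2
    e = W @ delta
    d = np.linalg.norm(e)
    if d == 0.0:
        return np.zeros_like(W)
    outer = np.outer(e / d, delta)     # derivative of the distance w.r.t. W
    if y == 0:                         # similar pair: hinge-style pull loss
        return lam_pull * outer if d > m_pull else np.zeros_like(W)
    # dissimilar pair: squared push loss, active only inside the push margin
    return -2.0 * lam_push * max(0.0, m_push - d) * outer

# one gradient-descent pass over a toy batch of labeled pairs
pairs = [(rng.random(16), rng.random(16), y) for y in (0, 1, 0, 1)]
lr = 0.05
for x1, x2, y in pairs:
    W -= lr * pair_grad(x1, x2, y, W)
```

In the real setting the gradient of the convolutional network $G_W$ would be obtained by backpropagation rather than by hand.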
3 More data
References
[1] M. Brown, G. Hua, and S. Winder. Discriminative learning of local image descriptors. IEEE PAMI, 2010.
[2] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proc. CVPR, 2012.
[3] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Proc. CVPR, 2006.
[4] R. T. Hadsell. Learning long-range vision for an off-road robot. PhD thesis, New York University, 2008.
[5] K. Jarrett, K. Kavukcuoglu, M. A. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Proc. ICCV, 2009.
[6] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
[7] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[8] K. Simonyan, A. Vedaldi, and A. Zisserman. Descriptor learning using convex optimisation. In Computer Vision – ECCV 2012, 2012.
[9] T. Trzcinski, M. Christoudias, V. Lepetit, and P. Fua. Learning image descriptors with the boosting-trick. In Proc. NIPS, 2012.