Unsupervised Learning Framework of Interest Point Via Properties Optimization

07/26/2019 ∙ by Pei Yan, et al. ∙ Huazhong University of Science & Technology

This paper presents an entirely unsupervised interest point training framework that jointly learns a detector and a descriptor, taking an image as input and outputting a probability and a description for every image point. The objective of the training framework is formulated as the joint probability distribution of the properties of the extracted points. The essential properties are selected as sparsity, repeatability and discriminability, each formulated as a probability. To maximize the objective efficiently, a latent variable is introduced to represent whether a point satisfies the required properties, so the original maximization can be optimized with the Expectation Maximization (EM) algorithm. Considering the high computational cost of EM on large-scale image sets, we implement the optimization with an efficient Mini-Batch approximation of EM (MBEM). In the experiments both detector and descriptor are instantiated with a fully convolutional network, named the Property Network (PN). The experiments demonstrate that PN outperforms state-of-the-art methods on a number of image matching benchmarks without retraining, and reveal that the proposed training framework has high flexibility to adapt to diverse types of scenes.




1 Introduction

Interest points form a sparse set of image points containing representative information [3, 18, 22]. Owing to their computational efficiency and representational stability, interest points are widely used in matching tasks such as simultaneous localization and mapping, image registration and stereo reconstruction. Both computational efficiency and representational stability rely on the locations and descriptions of interest points, which are produced by the detector and the descriptor, respectively.

All interest point related methods (including detectors and descriptors) can be classified as hand-crafted or learning based. Hand-crafted methods define explicit criteria to extract the points [10, 16, 20] and calculate the descriptions, while learning based methods are more flexible, training the detector and descriptor according to some objectives [23, 24, 9]. Both families strive to endow interest points with important properties that benefit diverse applications.

Unsupervised learning is especially suitable for interest points because it avoids the tedious labeling of ground-truth points [21, 19]. Following this trend, this paper presents an entirely unsupervised training framework that jointly learns a detector and a descriptor, taking an image as input and outputting a probability and a description for every image point. Here the probability reflects how likely the point is to become an interest point, and the description is a feature vector representing the unique discriminability of an interest point. In essence, the learning model aims to achieve several desired properties of interest points. The joint probability distribution of the extracted point properties is used to formulate the objective of our training framework, which is maximized to learn the detector and descriptor by achieving all desired properties.

The rest of this paper is organized as follows. In Section 2 we discuss related work. In Section 3 we formulate our unsupervised learning framework of interest points. Section 4 introduces a latent variable to convert the objective function and explains the Expectation Maximization (EM) optimization algorithm. Section 5 instantiates the detector and descriptor with a fully convolutional network and presents experiments. Finally, in Section 6 we summarize this paper and list possible future work.

2 Related Work

The key idea of hand-crafted interest point methods is to define explicit criteria to detect and describe interest points. Harris [10], SIFT [16], SURF [4] and KAZE [2] extract points whose gray values change abruptly in two-dimensional space or scale space. FAST [20] detects points whose local gray distribution conforms to typical corner patterns. Gradient or gradient-like histograms are widely used in float-valued descriptions such as SIFT [16], SURF [4], and KAZE [2]. Binary descriptors such as BRIEF [5], BRISK [13] and FREAK [1] select pixel pairs in the neighborhood of an interest point to calculate binary features. Furthermore, scale and orientation estimations [16, 4, 2] are normally integrated into the above descriptors to improve matching performance under scale and rotation transformations.

TaSK [23] and TILDE [24] focus on the points extracted by existing detectors (i.e., Forstner [24] and SIFT [16]) and train detectors to improve the repeatability of these points under different illuminations. Quad-network [21] approximates the repeatability objective with a ranking objective and does not rely on any existing detector. LIFT [26] first selects image patches around SIFT points [16], then trains a descriptor and an orientation estimator and fine-tunes the detector. With known key point locations, SuperPoint [9] uses a synthetic geometric shapes dataset to pre-train a weak detector, which is further trained in an active learning manner. The descriptor of SuperPoint is learned with a separate discriminability objective. In LF-Net [19] the repeatability objective is approximated by calculating the detector loss between a pair of images, and its discriminability objective affects both the descriptions and locations of the extracted points, similar to LIFT [26]. SIPS [7] fixes the descriptor to an existing model such as SIFT [16] and trains the detector to fit it. IMIP [6] jointly learns the detector and descriptor to extract a fixed number of points whose descriptions are constrained to one-hot feature vectors.

There are three essential differences between our training framework and existing methods. First, whereas existing methods improve some properties implicitly, we directly maximize the probability that interest points satisfy the desired properties. Second, whether a point satisfies the required properties can be modeled as a binary latent variable, so our training framework can be optimized with the Expectation Maximization algorithm. Third, our framework is flexible enough to generalize to any specific application by instantiating different models.

3 Unsupervised Learning Framework of Interest Point

3.1 Problem Formulation

Interest points form a sparse set of image points containing representative and discriminative information. In order to integrate both types of properties, a general framework learns a detector and a descriptor that extract interest points and their descriptions, respectively. First we introduce the general process of extracting interest points from a given scene.

For the sake of describing the problem clearly, we give some notations. An actual world scene used to acquire images is denoted as . All possible viewpoint or illumination conditions for taking images are abstracted as the transformation set . Here represents a specific condition, and represents all possible conditions. The image acquired from under condition is expressed as . Each point in and its corresponding mapped point in are both denoted as . Suppose the entire scene point set is . Here represents the number of scene points.

In general, the detector and descriptor take an image as input and output interest points and their descriptions, respectively. The detector is defined as a function outputting a probability for every point in image ,


where are all the parameters of the detector model, and reflects how likely becomes an interest point. In practice a probability threshold is introduced to obtain a deterministic interest point set. The interest point set of is


The remaining points are named the background point set.
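The thresholding step above can be sketched as follows; this is a minimal illustration of turning a per-pixel probability map into a deterministic interest point set, with the threshold value chosen arbitrarily rather than taken from the paper.

```python
import numpy as np

def interest_point_set(prob_map, threshold=0.5):
    """Return the (row, col) coordinates whose detector probability
    exceeds the threshold; all remaining pixels form the background
    point set. `prob_map` is an H x W array of per-pixel probabilities."""
    coords = np.argwhere(prob_map > threshold)
    return [tuple(c) for c in coords]

prob = np.array([[0.1, 0.9],
                 [0.6, 0.2]])
points = interest_point_set(prob, threshold=0.5)  # [(0, 1), (1, 0)]
```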

The descriptor is defined as a function outputting a description vector for every point in image , i.e.,


where are all the parameters of the descriptor model, and can be used to calculate the similarity between this point and other interest points, which is very important for determining the discriminability of an interest point. We always ensure unit length with length normalization. The description set of image is denoted as
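The length normalization mentioned above can be sketched as follows; under this assumption the inner product of two descriptions directly gives a similarity in [-1, 1]. The epsilon guard is an implementation detail added here, not from the paper.

```python
import numpy as np

def l2_normalize(desc, eps=1e-8):
    """Length-normalize description vectors (one per row) so that
    inner products between them equal cosine similarities."""
    return desc / (np.linalg.norm(desc, axis=-1, keepdims=True) + eps)

d = l2_normalize(np.array([[3.0, 4.0],
                           [0.0, 2.0]]))
sim = d @ d.T  # pairwise similarity matrix; diagonal entries are ~1
```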


The problem of learning based interest points is equivalent to learning the parameters of the detector and descriptor. Because there is no unique supervision label for an interest point and its description, it is preferable to jointly train the detector and descriptor in an unsupervised way that focuses on the essential properties of interest points. Therefore, we propose an unsupervised training framework to optimize the learning model of interest points.

The overview of the framework is shown in Figure 1. With images acquired from the same scene, this framework jointly trains the detector and descriptor. The training samples can be transformed from a scene image by simulating different illumination and viewpoint changes. Images are fed into the detector and descriptor, which output a probability and a description for every image point. Then, the joint probability of interest point properties is computed from these two outputs. Finally, a proposed expectation maximization algorithm optimizes the joint probability to find the best model parameters, meaning that the outputs of the detector and descriptor have achieved all desired properties.

Figure 1: Overview of unsupervised training framework via optimizing the properties of interest point. It consists of three parts: (1) training images transformed from a scene by simulating different imaging conditions, (2) models of detector and descriptor which produce interest point probability and description for each pixel, (3) Joint probability maximization algorithm that optimizes the properties of interest point.

3.2 Unsupervised Properties Optimization

The probability that an interest point of satisfies the th property can be formulated as . Here and is the number of all desired properties. Suppose all properties are independent and the properties in different images are also independent. The objective maximizing the probability that interest points satisfy the desired properties is


In this paper we make , and sparsity, repeatability and discriminability are selected as the essential properties. The probabilities that an interest point satisfies these properties are denoted as the sparsity probability, repeatability probability and discriminability probability, whose formulations are introduced in the subsequent subsections.

3.3 Sparsity Probability

Sparsity is an essential property requiring that a limited number of interest points scatter sparsely over the image. By definition, sparsity relies only on the interest points rather than their descriptions, so its probability can be represented as .

As a specific implementation, we define as the probability that the interest point set of image is locally sparse. Local sparsity indicates that there is no other interest point in the neighborhood of interest point except itself. represents the neighborhood of point whose radius is . Here is a small integer (e.g., ), and doesn't contain itself. Define as the function returning the number of elements in a given set. Then the local sparsity probability of is defined as


Supposing the local sparsity probabilities of different interest points are independent, the local sparsity probability of the entire interest point set is


Besides making interest points locally sparse, sparsity is also a global property controlling the total number of interest points. Depending on the number of interest points , represents the probability that the number of interest points is reasonable. We define it as


Combining Equations 7 and 8, the sparsity probability is expressed as
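The two conditions above can be sketched as a toy hard (0/1) sparsity check; the count bounds `n_min`/`n_max` and the Chebyshev neighborhood are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sparsity_probability(points, radius=4, n_min=1, n_max=1000):
    """Toy sketch of the sparsity probability: 1.0 when every
    interest point has an empty radius-`radius` neighborhood
    (local sparsity) AND the total count lies in [n_min, n_max]
    (global sparsity), else 0.0. `points` is a list of (row, col)."""
    pts = np.asarray(points, dtype=float)
    for i, p in enumerate(pts):
        d = np.abs(pts - p).max(axis=1)  # Chebyshev distance to all points
        d[i] = np.inf                    # a neighborhood excludes the point itself
        if (d <= radius).any():          # some other point is too close
            return 0.0
    return 1.0 if n_min <= len(pts) <= n_max else 0.0
```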


3.4 Repeatability Probability

Repeatability indicates how likely a point can be extracted repeatedly under different viewpoints and illuminations. By this definition, repeatability is determined by multiple images acquired from the same scene rather than a single image, so it is defined on to simplify notations. Denote the repeatability probability of point in as , which represents the probability that is extracted in , i.e.


where is the probability output by the detector for in image , as defined in Equation 1. Then is the probability that the point belongs to the background point set. In the remainder of the paper we omit the limit of to simplify notations. Supposing the repeatability of each point is independent of the others, the repeatability probability of satisfies
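Under the independence assumption above, the probability that the same scene point is extracted in every image of the scene is simply the product of its per-image detection probabilities. The sketch below shows only that product; whether consistently undetected points also count toward repeatability depends on the paper's exact formulation.

```python
import numpy as np

def repeatability_probability(probs):
    """Probability that a scene point is extracted in every image,
    assuming per-image detections are independent. `probs` holds
    the detector outputs for the same point across K images."""
    return float(np.prod(np.asarray(probs, dtype=float)))

repeatability_probability([0.9, 0.8, 0.95])  # 0.684
```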


3.5 Discriminability Probability

Discriminability denotes how likely an interest point in one image is more similar to the same point than to the other interest points in another image. The similarity between points and is normally defined as the inner product of their description vectors and , which is formulated as


because . If , denotes the similarity between the same point in two images, which is termed a positive pair. Otherwise, when , represents the similarity between different points in two images, which is termed a negative pair.

Denote the indicator function as , which returns 1 if and only if the logical operation is true. Define as the function returning the maximum of a set, with for the empty set. Then the discriminability probability of an interest point is


where must be an interest point rather than an arbitrary point, because only interest points are considered in the matching process. In practice Equation 13 is too sharp to represent the gap between the current and optimal discriminability, so we approximate it with


where is the difference between the similarity of the positive pair and that of the negative pairs, and is the maximum of . is a factor controlling the sensitivity of the discriminability probability with respect to . The formulation of is


The formulation of is inspired by the descriptor loss in [7]. Here and are named the positive margin and negative margin, which can be seen as the target similarities of positive pairs and negative pairs. is a weight balancing positive and negative pairs. Accordingly, the maximum of is .

Supposing the discriminability of each point is independent, the discriminability probability of is


Note that in Equation 16 we do not consider discriminability between interest points and background points, or among background points themselves.
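The positive-versus-hardest-negative comparison described above can be sketched as follows. The gap computation follows the text (positive-pair similarity minus the most similar negative), while the smooth mapping into (0, 1] is an illustrative surrogate: the exact soft form, `delta_max` and the sensitivity factor `alpha` are assumptions here, not the paper's equation.

```python
import numpy as np

def discriminability_gap(desc_a, desc_b, idx):
    """Similarity gap for point `idx`: positive-pair similarity
    (same index in both images) minus the hardest-negative
    similarity (most similar other point in image b). Rows of
    desc_a/desc_b are assumed length-normalized descriptions."""
    sims = desc_a[idx] @ desc_b.T            # similarities to all points in image b
    pos = sims[idx]
    neg = np.delete(sims, idx).max() if len(sims) > 1 else -1.0
    return float(pos - neg)

def discriminability_probability(delta, delta_max=2.0, alpha=1.0):
    """Illustrative smooth surrogate mapping the gap into (0, 1];
    it reaches 1 when the gap attains its maximum delta_max."""
    return float(np.exp(alpha * (delta - delta_max)))
```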

3.6 Objective of Properties Optimization

During training, we generally have a scene set . Denote the interest point set of image as , where the description of each point is . Thus, the description set of image is . The notations , , are also extended as , , to represent repeatability and discriminability for point in scene . Then the objective of properties optimization is


which is named the properties objective. Note that the description set is ignored in and because the sparsity and repeatability probabilities do not concern the descriptions of interest points.

4 Optimization and Implementation

4.1 Problem Conversion with Latent Variable

It is hard to optimize the properties objective directly with conventional gradient-based algorithms. In Equation 2 the interest point set is determined by the probability threshold , where the derivative of the logical operation does not exist. Therefore, we introduce a latent variable to solve this problem.

A binary latent variable is formulated for every point in scene , and if and only if point in any image satisfies all desired properties. is the vector whose th component is , and is the matrix whose item in the th row and th column is . Point is defined as a satisfied point if , and the satisfied point set of is .

In the original objective 17, and are optimized to make the interest point sets and their descriptions achieve the desired properties, meaning that the optimal solution of must be the point set . So replacing with in Equation 17 does not change its optimal solution. We first convert the property probabilities to more compact formulations with . The number of satisfied points can be calculated as . Redefine the local sparsity probability for as


For a vector we define the result of to still be a vector, whose th component is .

Redefine the sparsity probability of as


The conversion of repeatability probability is straightforward.


Also we can redefine discriminability probability as


Note that is the same for any because all images of share the same , so we replace it with in this case. In fact, all property probabilities for different images of are exactly equal by sharing . Then the properties objective 17 can be converted to the latent properties objective


According to Equation 19 the sparsity probability is either 1 or 0. To maximize the latent properties objective, must be ensured to be 1, so objective 23 is equivalent to


With a logarithm transformation it is equivalent to


where the log-likelihood term is


In objective 25, is named the log-likelihood function.

Figure 2: Visual matching results of state-of-the-art algorithms and our PN-i-64 model. First col: our PN-i-64 model; second col: SuperPoint; third col: SIFT; fourth col: LF-Net. All points extracted by each method are shown as blue dots, whereas correct matches are shown as green lines and red dots. M-score indicates Matching Score (higher is better), and Homo-error means the error of the estimated homography (lower is better). Whereas SuperPoint and SIFT achieve the best performance in some specific scenes, the PN-i-64 model gives reliable results under both illumination and viewpoint changes.

4.2 Expectation Maximization of latent properties objective

Objective 25 can be optimized with the Expectation Maximization (EM) algorithm. Suppose the optimization contains iterations. Let and be the initial parameters, and let the parameters of the detector and descriptor be and after iterations.


In the E-step of iteration , we need to obtain the expectation of the log-likelihood function with respect to , which is denoted as . To achieve this we first estimate the probability distribution . Because different are independent, we discuss the distribution of directly. The probability of is formulated as


Denote the set of all possible as


Then the probability distribution of is


where is the normalization factor. Denote ; then the expectation of is . So the expectation of is





In the M-step, and are obtained by maximizing the expectation of . Normally the number of parameters is very large, so the Gradient Ascent (GA) algorithm is selected to perform the M-step in this paper. The gradient is computed as:


The similarity is computed with an inner product, which is differentiable. So if the detector and descriptor are differentiable, their parameters and can be optimized with GA. Theoretically, moving from and to and may require multiple GA updates to achieve convergence. In practice, both Equation 29 and 44 lead to very high computational complexity, so we introduce a Mini-Batch approximation as an efficient implementation. All details of this approximation are outlined in Supplementary Section 1.
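The alternation described above can be sketched as a single mini-batch EM step: an E-step estimating a posterior weight for each point's binary latent variable, followed by one gradient-ascent update on the expected log-likelihood. The callables `point_prob` and `grad_log_lik` and their interfaces are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def em_minibatch_step(params, batch, point_prob, grad_log_lik, lr=1e-3):
    """One mini-batch EM iteration (sketch).
    point_prob(params, x): posterior probability that point x
        satisfies all desired properties under the current model.
    grad_log_lik(params, x, q): gradient of the expected
        log-likelihood for x, weighted by its posterior q."""
    # E-step: posterior of the binary latent variable for each point
    q = np.array([point_prob(params, x) for x in batch])
    # M-step: a single gradient-ascent update on the expectation
    grad = sum(grad_log_lik(params, x, qi) for x, qi in zip(batch, q))
    return params + lr * grad
```

In a full EM pass the M-step would run gradient ascent to convergence before re-estimating the posterior; the mini-batch variant trades that exactness for per-batch updates, as the paper's MBEM does.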

Figure 3: Visual matching results of SuperPoint and different Property Networks. First col: PN-i-64; second col: PN-v-64; third col: PN-128; fourth col: SuperPoint. All notations are the same as in Figure 2. By focusing on sharp illumination changes, PN-i-64 shows significant superiority for scenes with illumination changes. Analogously, PN-v-64 achieves the most reliable performance under a large range of viewpoint changes. PN-128 can manage diverse changes by learning with both sharp illumination changes and a large range of viewpoint changes. With a description vector length of 128, PN-128 achieves performance similar to the state-of-the-art SuperPoint, whose description vector length is 256.

5 Experimental Result

5.1 Experiment Setup

Our framework is flexible enough to be integrated with different models. In this paper we implement the detector and descriptor as two Fully Convolutional Networks [15, 27] with Batch Normalization [11], and the parameters of their encoders are shared. We name this implementation the Property Network (PN). The architecture of PN is presented in Supplementary Section 2.

In this paper we select MS-COCO 2014 [14] as the training dataset, which comprises more than 80 thousand images. We treat each single image as a scene, which is transformed into training images with different illuminations and viewpoints through simulated transformations. The details of the simulated transformations are outlined in Supplementary Section 3.

To adapt to different kinds of illumination and viewpoint changes, we train three PN models corresponding to three transform simulation conditions. The PN-i-64 model is trained under sharp illumination change and medium viewpoint change. PN-v-64 is trained under large-range viewpoint change and medium illumination change. PN-128 is trained under sharp illumination change and large-range viewpoint change. The architectures of the three networks are identical except for the length of their description vectors, which are 64, 64 and 128 for PN-i-64, PN-v-64 and PN-128, respectively.

The configurations of all hyperparameters of PN are as follows. During training we resize all images to and stack different images into mini-batches (but the size of a testing image can be arbitrary). In every iteration two scenes are each randomly transformed ten times, so there are in fact twenty images in a mini-batch (i.e., and ). Throughout training we fix the number range , , and the non-maximum suppression radius to pixels. For discriminability, the negative pair weight is , the positive margin , the negative margin and the discriminability weight . We always stop training after two epochs. The FCN model is implemented with and solved by the Adam optimizer [12] with default parameters ( and ).

5.2 Performance Comparison

Three datasets, HPatches [3], Webcam [24] and Oxford [17], are used to evaluate performance. HPatches is divided into an illumination-changed subset and a viewpoint-changed subset, denoted HP-i and HP-v respectively. Webcam contains sharp illumination changes but no viewpoint change. In this paper no method is trained on Webcam, so both the training and testing sets of Webcam are available for evaluation, denoted W-train and W-test. Oxford comprises both illumination and viewpoint changes, but we do not split it because the number of images in Oxford is small.

The two metrics used, Matching Score and Homography Estimation, are identical to those used in [9] (see more details in Supplementary Section 4); higher is better for both. Briefly, Matching Score measures the ratio of recovered ground-truth correspondences over the number of points extracted by the detector in the shared viewpoint region. Homography Estimation measures the ability of an algorithm to estimate the homography relating a pair of images, by comparing it to the ground-truth homography. Similar to [9], we fix the correct distance to pixels, and use the RANSAC method implemented by the OpenCV toolbox to estimate homographies. All evaluations are performed on images of size . To be fair we use the recommended hyperparameters for every method, except for the maximum number of extracted interest points: we keep no more than 1000 interest points per image.
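A simplified version of the Matching Score computation can be sketched as follows: match each point by nearest-neighbor descriptor similarity, count a match as correct when it lands within the correct-distance tolerance of the ground-truth correspondence, and divide by the number of extracted points. This toy version omits the shared-viewpoint-region check used in [9], and the 3-pixel tolerance is an assumption.

```python
import numpy as np

def matching_score(desc_a, desc_b, pts_b, gt_b, eps=3.0):
    """Toy Matching Score. desc_a/desc_b: length-normalized
    descriptions (one row per point). pts_b: (row, col) locations
    of points in image B. gt_b: ground-truth location in B of each
    point of A. A match is correct when the nearest-neighbor match
    lies within `eps` pixels of the ground truth."""
    nn = (desc_a @ desc_b.T).argmax(axis=1)          # nearest neighbor in B
    err = np.linalg.norm(pts_b[nn] - gt_b, axis=1)   # pixel error of each match
    return float((err <= eps).mean())
```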

The performances of different methods are shown in Table 1 and Table 2.

In summary, our Property Networks outperform the other methods on the viewpoint-changed subset of HPatches and on Oxford, and achieve competitive performance elsewhere according to Matching Score. For Homography Estimation, our methods outperform the others on HPatches and Webcam and achieve competitive performance on Oxford. Among the state-of-the-art algorithms, SIFT performs poorly on illumination-changed datasets and SuperPoint has no superiority on viewpoint-changed datasets, whereas our Property Networks give much more stable results on all the above datasets. Furthermore, it is interesting that SuperPoint has a slight advantage on Webcam according to Matching Score but a large gap compared to our PN-i-64 according to Homography Estimation. This is because Matching Score is a basic metric that only reflects the number of points likely to help subsequent tasks. However, the spatial distribution of matched points is as important as their number for final tasks such as homography estimation and camera pose estimation, which cannot be revealed by Matching Score. Sparsity is an essential property making the spatial distribution of interest points suitable for these tasks. By improving sparsity jointly with the other properties, PN-i-64 achieves much higher Homography Estimation performance on illumination-changed datasets.

Figure 2 demonstrates some visual details of this comparison. Whereas SuperPoint and SIFT achieve the best performance in some specific scenes, the PN-i-64 model gives reliable results under both illumination and viewpoint changes. Furthermore, the visual results of PN-i-64 and SuperPoint explain the difference between Matching Score and Homography Estimation. More results can be found in Supplementary Section 5.

Method HP-i HP-v W-train W-test Oxford
SIFT 0.295 0.314 0.130 0.137 0.357
SURF 0.300 0.281 0.128 0.138 0.395
ORB 0.335 0.306 0.141 0.158 0.421
KAZE 0.362 0.276 0.167 0.182 0.381
LIFT 0.336 0.291 0.172 0.187 0.325
LF-Net 0.299 0.273 0.159 0.175 0.326
Super 0.527 0.458 0.317 0.330 0.434
PN-i-64 0.522 0.472 0.302 0.316 0.445
PN-v-64 0.466 0.503 0.232 0.249 0.521
PN-128 0.469 0.464 0.248 0.259 0.472
Table 1: Matching Score of Different Methods

5.3 Performance Analysis of the Three PN Models

It is an ultimate goal to learn an interest point model that generalizes to all application scenes. From the viewpoint of actual applications, however, it is practical to learn a model that copes with specific conditions. The results of our three models confirm this opinion, as intuitively demonstrated in Figure 3.

Though PN-128 has the highest potential with the largest description vector length, in our training it cannot converge well when facing both sharp illumination changes and large-range viewpoint changes, which leaves it with no superiority for either illumination or viewpoint changes. From another perspective, if a given application contains both sharp illumination changes and large-range viewpoint changes, PN-128 should be a reasonable tradeoff given its limited learning ability. Note that PN-128 achieves similar or better results compared with SuperPoint, whose description vector length is 256.

By focusing on sharp illumination changes and medium viewpoint changes, PN-i-64 achieves much better performance on illumination-changed datasets, which demonstrates the flexibility of our training framework: all that is needed is to feed the model corresponding images for different application scenes, without adjusting the architecture or objective. PN-v-64 outperforms PN-128 under viewpoint changes, but the superiority is relatively small. One reason is that conventional convolutional neural networks such as Fully Convolutional Networks can only achieve limited rotation invariance with convolution and pooling operations [8, 25]. Explicitly estimating the orientation of interest points is one reliable way to improve rotation invariance [16, 26].

Method HP-i HP-v W-train W-test Oxford
SIFT 0.842 0.557 0.526 0.563 0.650
SURF 0.781 0.444 0.396 0.42 0.600
ORB 0.682 0.334 0.304 0.329 0.533
KAZE 0.809 0.398 0.449 0.473 0.492
LIFT 0.875 0.462 0.600 0.628 0.558
LF-Net 0.841 0.435 0.517 0.573 0.567
Super 0.938 0.525 0.713 0.730 0.592
PN-i-64 0.950 0.555 0.802 0.811 0.608
PN-v-64 0.882 0.572 0.564 0.602 0.642
PN-128 0.937 0.554 0.715 0.730 0.617
Table 2: Homography Estimation of Different Methods

6 Conclusion

This paper proposes an entirely unsupervised training framework that maximizes the sparsity, repeatability and discriminability probabilities of interest points and their descriptions. With the Expectation Maximization algorithm and a mini-batch approximation, this framework can be optimized efficiently. As an implementation based on a Fully Convolutional Network, the Property Network outperforms state-of-the-art algorithms on a number of image matching datasets, which demonstrates the effectiveness and flexibility of our training framework. Future work will investigate better-designed architectures of the detector and descriptor to realize more of the framework's potential. Furthermore, our framework can be integrated with more properties to improve performance in diverse applications, and supervised information can also be formulated as properties that bring semantics to interest points.


  • [1] Alexandre Alahi, Raphael Ortiz, and Pierre Vandergheynst. Freak: Fast retina keypoint. In IEEE Conference on Computer Vision & Pattern Recognition, pages 510–517, 2012.
  • [2] Pablo Fernández Alcantarilla, Adrien Bartoli, and Andrew J Davison. Kaze features. In European Conference on Computer Vision, pages 214–227. Springer, 2012.
  • [3] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5173–5182, 2017.
  • [4] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In European conference on computer vision, pages 404–417. Springer, 2006.
  • [5] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. Brief: Binary robust independent elementary features. In European conference on computer vision, pages 778–792. Springer, 2010.
  • [6] Titus Cieslewski, Michael Bloesch, and Davide Scaramuzza. Matching features without descriptors: Implicitly matched interest points (imips). arXiv preprint arXiv:1811.10681, 2018.
  • [7] Titus Cieslewski and Davide Scaramuzza. Sips: unsupervised succinct interest points. arXiv preprint arXiv:1805.01358, 2018.
  • [8] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pages 2990–2999, 2016.
  • [9] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018.
  • [10] Christopher G Harris, Mike Stephens, et al. A combined corner and edge detector. In Alvey vision conference, volume 15, pages 10–5244. Citeseer, 1988.
  • [11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [13] Stefan Leutenegger, Margarita Chli, and Roland Siegwart. Brisk: Binary robust invariant scalable keypoints. In 2011 IEEE International Conference on Computer Vision (ICCV), pages 2548–2555. IEEE, 2011.
  • [14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [15] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [16] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
  • [17] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE transactions on pattern analysis and machine intelligence, 27(10):1615–1630, 2005.
  • [18] Krystian Mikolajczyk, Tinne Tuytelaars, Cordelia Schmid, Andrew Zisserman, Jiri Matas, Frederik Schaffalitzky, Timor Kadir, and Luc Van Gool. A comparison of affine region detectors. International journal of computer vision, 65(1-2):43–72, 2005.
  • [19] Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. Lf-net: learning local features from images. In Advances in Neural Information Processing Systems, pages 6237–6247, 2018.
  • [20] Edward Rosten and Tom Drummond. Machine learning for high-speed corner detection. In European conference on computer vision, pages 430–443. Springer, 2006.
  • [21] Nikolay Savinov, Akihito Seki, Lubor Ladicky, Torsten Sattler, and Marc Pollefeys. Quad-networks: unsupervised learning to rank for interest point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1822–1830, 2017.
  • [22] Cordelia Schmid, Roger Mohr, and Christian Bauckhage. Evaluation of interest point detectors. International Journal of computer vision, 37(2):151–172, 2000.
  • [23] Christoph Strecha, Albrecht Lindner, Karim Ali, and Pascal Fua. Training for task specific keypoint detection. In Joint Pattern Recognition Symposium, pages 151–160. Springer, 2009.
  • [24] Yannick Verdie, Kwang Yi, Pascal Fua, and Vincent Lepetit. Tilde: a temporally invariant learned detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5279–5288, 2015.
  • [25] Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5028–5037, 2017.
  • [26] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. In European Conference on Computer Vision, pages 467–483. Springer, 2016.
  • [27] Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. Deconvolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010.

Supplementary Material

In this supplementary material we give more details of our implementation and experimental results. Section 1 describes how we implement the Expectation Maximization (EM) process with an efficient strategy called Mini-Batch approximation of EM (MBEM). Section 2 presents the architecture of our Property Network. Section 3 outlines the details of our simulation of illumination and viewpoint changes. Section 4 introduces the performance metrics used in the experiments, and Section 5 presents additional experimental results.

1 Mini-Batch Approximation of Expectation Maximization

1.1 Problems of Original Expectation Maximization

Before discussing the difficulties of optimizing our objective with the original Expectation Maximization algorithm (EM), we first review the objective of our training framework, which is formulated as


where the two parameter sets are those of the detector and the descriptor respectively, and the latent variables are vectors, one per scene. Each component of such a vector is a binary random variable representing whether the corresponding point satisfies the desired properties; a point is defined as a satisfied point if its component equals one. The latent variables need to be optimized because we do not know in advance which points can become satisfied points. For convenience of optimization we also define the log-likelihood function


Because each scene's term is independent in objective (34), we drop the scene superscript in the remainder of this supplement for brevity. Without ambiguity we always use the same symbol for the log-likelihood function whether the superscript is dropped or not.

The properties considered in objective (34) are sparsity, repeatability and discriminability. The joint repeatability and discriminability probability is


where the two factors are the probabilities that a point satisfies repeatability and discriminability respectively. Their formulations can be found in the main text. In this section we only need to know that the repeatability probability does not rely on the latent variables, whereas the discriminability probability depends on the whole latent vector.

The constraint in objective (34) is named the sparsity constraint: the number of satisfied points must lie within a reasonable range of interest point counts for a single scene. The local sparsity constraint is defined as


where the neighborhood of a point is the set of points within a small integer radius (e.g., a few pixels) of it, excluding the point itself.
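As an illustration, the local sparsity constraint above can be checked directly on a binary mask of satisfied points. The sketch below assumes the mask is a 2-D NumPy array and the radius is measured as a square (Chebyshev) window; both are illustrative choices, not the paper's implementation.

```python
import numpy as np

def satisfies_local_sparsity(y, r=2):
    """Return True iff no satisfied point (y == 1) has another satisfied
    point inside its (2r+1) x (2r+1) neighborhood (the point itself is
    excluded, so a window sum of 1 is allowed)."""
    h, w = y.shape
    for i, j in zip(*np.nonzero(y)):
        i0, i1 = max(0, i - r), min(h, i + r + 1)
        j0, j1 = max(0, j - r), min(w, j + r + 1)
        if y[i0:i1, j0:j1].sum() > 1:  # another satisfied point nearby
            return False
    return True
```

For example, two satisfied points at opposite corners of a 5x5 mask pass the check with r = 2, while two adjacent satisfied points fail it.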

Theoretically, objective (34) can be optimized with EM. Normally EM consists of iterations, each of which conducts an Expectation step (E-step) and a Maximization step (M-step). Starting from initial detector and descriptor parameters, each E-step obtains the posterior distribution of the latent variables under the current parameters and computes the expectation of the log-likelihood with respect to it; each M-step then updates the parameters by maximizing this expectation.
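To make the E-step/M-step alternation concrete, here is a self-contained toy EM run on a two-component Bernoulli mixture. The model is purely illustrative and unrelated to the paper's objective, but the loop structure (posterior responsibilities in the E-step, parameter re-estimation in the M-step) is the same.

```python
import numpy as np

# Hidden data: each trial uses one of two coins with unknown head rates.
rng = np.random.default_rng(0)
true_p = np.array([0.2, 0.8])
z = rng.integers(0, 2, size=500)         # latent component per trial
x = rng.binomial(n=20, p=true_p[z])      # observed heads out of 20 flips

p = np.array([0.4, 0.6])                 # initial parameter guess
for _ in range(50):
    # E-step: posterior responsibility of each component for each trial
    like = np.stack([pk ** x * (1 - pk) ** (20 - x) for pk in p])
    resp = like / like.sum(axis=0)
    # M-step: re-estimate each component's head probability
    p = (resp * x).sum(axis=1) / (resp * 20).sum(axis=1)

print(np.sort(p))  # close to the true rates 0.2 and 0.8
```

In the paper's setting the latent variables are the per-point satisfaction indicators rather than mixture assignments, and the M-step is a gradient-based network update rather than a closed form, which is what makes the exact EM above impractical at scale.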

We first formulate the posterior distribution theoretically. Without the sparsity constraint, the sample space of the latent vector can be represented as {0,1}^N and contains 2^N elements, where N is the number of all points in the given scene; this is because the vector has length N and each of its components is a binary variable. With the sparsity constraint, however, the sample space becomes narrower. We denote this constrained sample space as the set of vectors that can be formulated as


The sparsity constraint in objective (34) can be ignored if we ensure the latent vector stays within this constrained sample space, in which case the distribution can be reformulated as


Note we obtain (40) by substituting (36) and (37) into (34).

Because this distribution is a marginal of the joint distribution, we obtain


where the denominator is a normalization factor. By combining (35), (36) and (37), the expectation of the log-likelihood with respect to this distribution is


Note that we drop the scene superscripts and consider only a single scene. Here the discriminability probability depends on the whole latent vector, so we cannot straightforwardly simplify the expectation in (42).

The expectation needs to be maximized in the M-step, which can be achieved with the Gradient Ascent algorithm (GA). According to the definitions in the main text, each property probability depends on the corresponding network parameters. The expressions of the partial derivatives are


The partial derivative in (43) can be computed directly, but the one in (44) is more complex.

The above E-step and M-step lead to high computational complexity in three respects. 1) It is inefficient to run GA to convergence on the entire image set in each M-step (note that (43) and (44) are defined for one scene, but in fact we need to compute them for all scenes). 2) Directly computing the posterior with (41) requires traversing the entire sample space, which is prohibitively expensive. 3) Even with a known posterior, computing (44) still requires traversing the entire sample space.

In the following subsections, we solve the above three problems with approximations of the original EM. The resulting approximate algorithm is named Mini-Batch approximation of Expectation Maximization (MBEM).

1.2 Considering Mini-Batch in Each Iteration

Because it is too slow to run GA to convergence on the entire image set in each M-step, in each iteration we focus only on a small subset of the training set, normally called a mini-batch.

In more detail, we first fix the mini-batch size (a constant throughout our training). In each iteration we construct a mini-batch and conduct the E-step and M-step only on it. Furthermore, each M-step updates the parameters only once rather than iterating until convergence, because achieving convergence on a single mini-batch is unnecessary and increases the risk of overfitting.
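The batching schedule can be sketched as follows; the generator below is a generic mini-batch sampler, and its name, the reshuffle-per-epoch policy, and the fixed seed are our illustrative choices rather than the paper's implementation.

```python
import numpy as np

def minibatch_indices(n_scenes, batch_size, n_iters, seed=0):
    """Yield the scene indices used in each MBEM iteration.

    Scenes are visited in a random order; when fewer than batch_size
    scenes remain in the current pass, the order is reshuffled.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_scenes)
    pos = 0
    for _ in range(n_iters):
        if pos + batch_size > n_scenes:   # start a fresh shuffled pass
            order = rng.permutation(n_scenes)
            pos = 0
        yield order[pos:pos + batch_size]
        pos += batch_size
```

Each yielded index set would drive one E-step and a single gradient-ascent parameter update, matching the one-update-per-batch rule described above.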

1.3 Efficient Approximation for Distribution of Latent Variable

Though the posterior distribution can be computed with (41), this computation requires traversing the entire sample space, which leads to very high computational complexity. There are two obstacles to simplifying (41).

  1. The definition of the sample space is entangled with the sparsity constraint. It is easy to check whether a specific sample satisfies the sparsity constraint, but it is difficult to enumerate all satisfying samples directly.

  2. To compute the posterior distribution, we must consider all points, because the discriminability probability depends on the whole latent vector.

Since it is hard to solve (41) precisely, this subsection introduces an efficient strategy to approximate it, comprising three steps. First, we approximate the sample space with a reduced one that can be obtained efficiently. Second, the discriminability probability is approximated with a constant for each given point. Third, we simplify the resulting approximate formulation of the posterior.

1.3.1 Efficient Approximation of Sample Space

We first approximate the sample space with a reduced one to simplify the sparsity constraint. A reasonable reduced sample space should have two characteristics. First, all of its elements must satisfy the sparsity constraint, which means it can only be a subset of the original sample space, i.e., we can only remove elements from the original space to obtain it. Second, replacing the original space with the reduced one must not change the posterior distribution significantly, which means the retained elements should contribute more to the posterior than the removed ones.

The first characteristic can be achieved by fixing the components of some points to zero, so the remaining question is which points may keep a nonzero component. According to (41), the larger a point's property probabilities are, the greater its contribution to the posterior. Since the detector probability can be obtained directly while the discriminability probability depends on the latent vector, and since we have assumed in the main text that the two are independent, we select a point as a candidate exactly when its detector probability is a local maximum.


where the max operator returns the maximum of a given set. We define a reference vector whose i-th component is one exactly when point i is such a local maximum. Since the local maxima deduced from the current model require that the satisfied points of any sample also be local maxima, the sample space can be further reduced as


Note that any sample in the reduced space automatically satisfies the local sparsity constraint, because two strict local maxima cannot lie within the same neighborhood; this significantly benefits computational efficiency by avoiding an explicit check of the constraint.
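The construction of the reference vector from local maxima of the current detector probabilities can be sketched as a strict non-maximum-suppression pass. This is a plain-NumPy sketch; the square window shape and the use of strict inequality (so plateaus produce no maxima) are illustrative assumptions.

```python
import numpy as np

def local_max_mask(prob, r=2):
    """Binary mask: 1 iff prob[i, j] is strictly greater than every other
    value in its (2r+1) x (2r+1) window (out-of-image cells count as -inf)."""
    prob = np.asarray(prob, dtype=float)
    h, w = prob.shape
    pad = np.pad(prob, r, mode="constant", constant_values=-np.inf)
    mask = np.ones_like(prob, dtype=bool)
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            if di == 0 and dj == 0:
                continue
            shifted = pad[r + di:r + di + h, r + dj:r + dj + w]
            mask &= prob > shifted       # must beat every neighbor strictly
    return mask.astype(int)
```

The strict inequality is what guarantees the property used above: if two candidate points lay in the same window, each would have to be strictly greater than the other, a contradiction, so at most one survives per neighborhood.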

1.3.2 Efficient Approximation of the Discriminability Probability

The second problem is that the discriminability probability depends on the whole latent vector, which forces us to consider all points when solving for the posterior. To reduce the computational complexity, we approximate it with a constant per point. We first make an assumption called

Discriminability Consistency:

Given two feasible samples where one is contained in the other, discriminability consistency assumes that if one point's discriminability is larger than another's under the smaller sample, then the same ordering holds under the larger sample.

In other words, if the discriminability of one point is larger than that of another within a satisfied point set, inserting additional interest points into the set does not change their relative order. By construction, the reference vector deduced from the local maxima of the current model is a superset of every feasible sample in the reduced space.

If we view any feasible sample as the smaller set and the reference vector as the larger one, then discriminability consistency is satisfied. Since the objective of optimization is to maintain the best discriminability of each interest point, it is reasonable to evaluate discriminability under the reference vector rather than under each individual sample. In this way every feasible sample is replaced by the reference vector, and the discriminability computation needs to be performed only once. With this replacement we obtain


where the function computing the discriminability of a given point is formulated in the main text.

1.3.3 Simplifying the Approximate Formulation

First we give the formulation that combines the above two approximations. Because the approximated discriminability is a constant for a given point, we denote it with a shorthand for simplicity. Then (41) can be approximated with


Here the denominator is still the normalization factor, whose formulation we do not repeat, and the sample space in (41) is changed to