TILDE
Repository for "TILDE: A Temporally Invariant Learned DEtector", CVPR2015
We introduce a learning-based approach to detect repeatable keypoints under drastic imaging changes of weather and lighting conditions to which state-of-the-art keypoint detectors are surprisingly sensitive. We first identify good keypoint candidates in multiple training images taken from the same viewpoint. We then train a regressor to predict a score map whose maxima are those points so that they can be found by simple non-maximum suppression. As there are no standard datasets to test the influence of these kinds of changes, we created our own, which we will make publicly available. We will show that our method significantly outperforms the state-of-the-art methods in such challenging conditions, while still achieving state-of-the-art performance on the untrained standard Oxford dataset.
Keypoint detection and matching is an essential tool to address many Computer Vision problems such as image retrieval, object tracking, and image registration. Since the introduction of the Moravec, Förstner, and Harris corner detectors [27, 11, 15] in the 1980s, many others have been proposed [41, 10, 31]. Some exhibit excellent repeatability when the scale and viewpoint change or the images are blurred [26]. However, their reliability degrades significantly when the images are acquired outdoors at different times of day and in different weather or seasons, as shown in Fig. 1. This is a severe handicap when attempting to match images taken in fair and foul weather, in the morning and evening, in winter and summer, even with illumination-invariant descriptors [13, 39, 14, 43].
In this paper, we propose an approach to learn a keypoint detector that extracts keypoints which are stable under such challenging conditions and allow matching in situations as difficult as the one depicted by Fig. 1. To this end, we first introduce a simple but effective method to identify potentially stable points in training images. We then use them to train a regressor that produces a score map whose values are local maxima at these locations. By running it on new images, we can extract keypoints with simple non-maximum suppression. Our approach is inspired by a recently proposed algorithm [34] that relies on regression to extract centerlines from images of linear structures. Using this idea for our purposes has required us to develop a new kind of regressor that is robust to complex appearance variation so that it can efficiently and reliably process the input images.
As in the successful application of Machine Learning to descriptors [5, 40] and edge detection [8], learning methods have also been used before in the context of keypoint detection [30, 37] to reduce the number of operations required when finding the same keypoints as handcrafted methods. However, in spite of an extensive literature search, we have only found one method [38] that attempts to improve the repeatability of keypoints by learning. This method focuses on learning a classifier to filter out initially detected keypoints but achieved limited improvement. This may be because their method was based on pure classification and also because it is non-trivial to find good keypoints to be learned by a classifier in the first place.
Probably as a consequence, there is currently no standard benchmark dataset designed to test the robustness of keypoint detectors to these kinds of temporal changes. We therefore created our own from images from the Archive of Many Outdoor Scenes (AMOS) [18] and our own panoramic images to validate our approach. We will use our dataset in addition to the standard Oxford [26] and EF [44] datasets to demonstrate that our approach significantly outperforms state-of-the-art methods in terms of repeatability. In the hope of spurring further research on this important topic, we will make it publicly available along with our code.
In summary, our contribution is threefold:
We introduce a “Temporally Invariant Learned DEtector” (TILDE), a new regression-based approach to extracting feature points that are repeatable under drastic illumination changes caused by changes in weather, season, and time of day.
We propose an effective method to generate the required training set of “good keypoints to learn.”
We created a new benchmark dataset for evaluation of feature point detectors on outdoor images captured at different times and seasons.
In the remainder of this paper, we first discuss related work, give an overview of our approach, and then detail our regression-based approach. We finally present the comparison of our approach to state-of-the-art keypoint detectors.
An extraordinarily large amount of work has been dedicated to developing ever more effective feature point detectors. Even though the methods that appeared in the 1980s [27, 11, 15] are still in wide use, many new ones have been developed since. [10] proposed the SFOP detector to use junctions as well as blobs, based on a general spiral model. [17] and the WADE detector of [33] use symmetries to obtain reliable keypoints. With SIFER and D-SIFER, [25, 24] used Cosine Modulated Gaussian filters and 10th-order Gaussian derivative filters for more robust detection of keypoints. Edge Foci [44] and [12] use edge information for robustness against illumination changes. Overall, these methods have consistently improved the performance of keypoint detectors on the standard dataset [26], but still suffer a severe performance drop when applied to outdoor scenes with temporal differences.
One of the major drawbacks of handcrafted methods is that they cannot be easily adapted to the context, and consequently lack flexibility. For instance, SFOP [10] works well when calibrating cameras and WADE [33] shows good results when applied to objects with symmetries. However, their advantages do not easily carry over to the problem we tackle here, such as finding similar outdoor scenes [19].
Although work on keypoint detectors has mainly focused on handcrafted methods, some learning-based methods have already been proposed [30, 38, 16, 28]. With FAST, [30] introduced Machine Learning techniques to learn a fast corner detector. However, learning in their case was only aimed at speeding up the keypoint extraction process. Repeatability is also considered in the extended version FAST-ER [31], but it did not play a significant role. [38] trained the WaldBoost classifier [36] to learn keypoints with high repeatability on a pre-aligned training set, and then filtered an initial set of keypoints according to the score of the classifier. Their method, called TaSK, is probably the most related to our method in the sense that they use pre-aligned images to build the training set. However, the performance of their method is limited by the initial keypoint detector used.
Recently, [16] proposed to learn a classifier which detects matchable keypoints for Structure-from-Motion (SfM) applications. They collect matchable keypoints by observing which keypoints are retained throughout the SfM pipeline and learn these keypoints. Although their method shows significant speed-up, they remain limited by the quality of the initial keypoint detector. [28] learns convolutional filters through random sampling and looking for the filter that gives the smallest pose estimation error when applied to stereo visual odometry. Unfortunately, their method is restricted to linear filters, which are limited in terms of flexibility, and it is not clear how their method can be applied to tasks other than stereo visual odometry.
We propose a generic scheme for learning keypoint detectors, and a novel, efficient regressor designed for this task. We will compare it to state-of-the-art handcrafted methods as well as TaSK, as it is the closest method from the literature, on several datasets.
In this section, we first outline our regression-based approach briefly and then explain how we build the required training set. We will formalize our algorithm and describe the regressor in more detail in the following section.
Let us first assume that we have a set of training images of the same scene captured from the same point of view but at different seasons and different times of the day, such as the set of Fig. 2(a). Let us further assume that we have identified in these images a set of locations that we think can be found consistently over the different imaging conditions. We propose a practical way of doing this in Section 3.2 below. Let us call positive samples the image patches centered at these locations in each training image. The patches far away from these locations are negative samples.
To learn to find these locations in a new input image, we propose to train a regressor to return a value for each patch of a given size of the input image. These values should have a peaked shape similar to the one shown in Fig. 2(b) on the positive samples, and we also encourage the regressor to produce a score that is as small as possible for the negative samples. As shown in Fig. 2(c), we can then extract keypoints by looking for local maxima of the values returned by the regressor, and discard the image locations with low values by simple thresholding. Moreover, our regressor is also trained to return similar values for the same locations over the stack of images. This way, the regressor returns consistent values even when the illumination conditions vary.
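The extraction step described above can be illustrated with a minimal sketch. It assumes the regressor output is already available as a 2D score map; the function name `extract_keypoints`, the `radius` parameter, and the toy values are hypothetical and are not taken from the authors' code.

```python
def extract_keypoints(score_map, threshold, radius=1):
    """Return (row, col, score) for strict local maxima with score >= threshold."""
    rows, cols = len(score_map), len(score_map[0])
    keypoints = []
    for r in range(rows):
        for c in range(cols):
            s = score_map[r][c]
            if s < threshold:
                continue  # low responses are discarded by simple thresholding
            is_max = True
            # Compare against every neighbour inside the (2*radius+1)^2 window.
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    if (rr, cc) != (r, c) and score_map[rr][cc] >= s:
                        is_max = False
            if is_max:
                keypoints.append((r, c, s))
    # Strongest responses first.
    return sorted(keypoints, key=lambda k: -k[2])


# Toy score map standing in for the regressor output (values are made up).
score_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.2, 0.6],
    [0.0, 0.1, 0.1, 0.2],
]
keypoints = extract_keypoints(score_map, threshold=0.5)
```

In this sketch ties are rejected, so a flat plateau yields no keypoint; a real implementation would likely use a vectorized maximum filter instead of nested loops.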
As shown in Fig. 3, to create our dataset of positive and negative samples, we first collected series of images from outdoor webcams captured at different times of day and different seasons. We identified several suitable webcams from the AMOS dataset [18]—webcams that remained fixed over long periods of time, protected from the rain, etc. We also used panoramic images captured by a camera located on the top of a building.
To collect a training set of positive samples, we first detect keypoints independently in each image of this dataset. We use SIFT [23], but other detectors could be considered as well. We then iterate over the detected keypoints, starting with the keypoints with the smallest scale. If a keypoint is detected at about the same location in most of the images from the same webcam, its location is likely to be a good candidate to learn.
In practice we consider that two keypoints are at about the same location if their distance is smaller than the scale estimated by SIFT and we keep the best 100 repeated locations. The set of positive samples is then made of the patches from all the images, including the ones where the keypoint was not detected, and centered on the average location of the detections.
This simple strategy offers several advantages: we keep only the most repeatable keypoints for training, discarding the ones that were detected only infrequently. We also introduce as positive samples the patches where a highly repeatable keypoint was missed. This way, we can focus on the keypoints that can be detected reliably under different conditions, and correct the mistakes of the original detector.
To create the set of negative samples, we simply extract patches at locations that are far away from the keypoints used to create the set of positive samples.
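The selection of training locations described above can be sketched as follows. This is only an illustration under stated assumptions: detections are given as `(x, y, scale)` tuples per image, "most of the images" is read as more than half (`min_fraction`), and `min_distance` for negatives is a hypothetical parameter. None of these names come from the authors' code.

```python
import math


def select_repeatable_locations(detections, min_fraction=0.5, max_locations=100):
    """detections: one list per image of (x, y, scale) keypoints from the same webcam."""
    n_images = len(detections)
    # Visit all detected keypoints from the smallest scale to the largest.
    candidates = sorted((kp for image in detections for kp in image),
                        key=lambda kp: kp[2])
    selected = []
    for x, y, scale in candidates:
        # Skip candidates that fall on an already selected location.
        if any(math.hypot(x - sx, y - sy) < scale for sx, sy, _ in selected):
            continue
        # In each image, look for the nearest detection closer than `scale`.
        hits = []
        for image in detections:
            near = [(math.hypot(x - u, y - v), u, v) for u, v, _ in image
                    if math.hypot(x - u, y - v) < scale]
            if near:
                _, u, v = min(near)
                hits.append((u, v))
        if len(hits) > min_fraction * n_images:
            # The positive patches are centred on the average detected location.
            mean_x = sum(u for u, _ in hits) / len(hits)
            mean_y = sum(v for _, v in hits) / len(hits)
            selected.append((mean_x, mean_y, len(hits)))
    # Keep the locations repeated in the largest number of images.
    selected.sort(key=lambda loc: -loc[2])
    return selected[:max_locations]


def is_negative_location(x, y, locations, min_distance):
    """A location qualifies as negative if it is far from every positive location."""
    return all(math.hypot(x - lx, y - ly) > min_distance for lx, ly, _ in locations)


# Three toy "images" of the same scene (coordinates and scales are made up).
detections = [
    [(10.0, 10.0, 2.0), (40.0, 5.0, 3.0)],
    [(10.5, 9.5, 2.0)],
    [(9.5, 10.5, 2.5), (80.0, 80.0, 2.0)],
]
locations = select_repeatable_locations(detections)
```

Here only the point near (10, 10) is found in most images, so it is the single positive location; patches would then be cut around it in every image, including any image where the detector missed it.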
In this section, we first introduce the form of our regressor, which is made to be applied to every patch from an image efficiently, then we describe the different terms of the proposed objective function to train for detecting keypoints reliably, and finally we explain how we optimize the parameters of our regressor to minimize this objective function.
Our regressor is a piecewise linear function expressed using Generalized Hinging Hyperplanes (GHH) [4, 42]:
$$F(\mathbf{x}; \boldsymbol{\omega}) = \sum_{n=1}^{N} \delta_n \max_{m=1}^{M} \mathbf{w}_{nm}^\top \mathbf{x}, \tag{1}$$
where $\mathbf{x}$ is a vector made of image features extracted from an image patch, $\boldsymbol{\omega}$ is the vector of parameters of the regressor and can be decomposed into $\boldsymbol{\omega} = [\mathbf{w}_{11}^\top, \ldots, \mathbf{w}_{NM}^\top, \delta_1, \ldots, \delta_N]^\top$. The $\mathbf{w}_{nm}$ vectors can be seen as linear filters. The parameters $\delta_n$ are constrained to be either $-1$ or $+1$. $N$ and $M$ are meta-parameters which control the complexity of the GHH. As image features we use the three components of the LUV color space and the image gradients (horizontal and vertical gradients and the gradient magnitude) computed at each pixel of the patches.
[42] showed that any continuous piecewise-linear function can be expressed in the form of Eq. (1). It is well suited to our keypoint detector learning problem, since applying the regressor to each location of the image involves only simple image convolutions and pixel-wise maximum operators, while regression trees require random access to the image and the nodes, and CNNs involve higher-order convolutions for most of the layers. Moreover, we will show that this formulation also facilitates the integration of different constraints, including constraints between the responses for neighbor locations, which are useful to improve the performance of the keypoint extraction.
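As a rough illustration of how such a regressor is evaluated on one patch, the sketch below assumes the usual GHH form: a signed sum, over groups, of the maximum linear response within each group. The function name `ghh_response`, the data layout, and the toy numbers are hypothetical.

```python
def ghh_response(x, filters, signs):
    """Evaluate a GHH-style piecewise linear function on a feature vector x.

    filters[n][m] is one weight vector (a linear filter);
    signs[n] is the sign in {-1, +1} attached to group n.
    """
    total = 0.0
    for sign, group in zip(signs, filters):
        # Maximum linear response inside the group, then signed accumulation.
        best = max(sum(w_k * x_k for w_k, x_k in zip(w, x)) for w in group)
        total += sign * best
    return total


x = [1.0, -2.0]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],    # group 0: responses 1.0 and -2.0, max is 1.0
    [[0.5, 0.5], [-1.0, 0.0]],   # group 1: responses -0.5 and -1.0, max is -0.5
]
signs = [+1, -1]
score = ghh_response(x, filters, signs)  # 1.0 - (-0.5) = 1.5
```

Applied densely, each inner product becomes an image convolution and each `max` a pixel-wise maximum, which is why this form suits sliding-window evaluation.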
Instead of simply aiming to predict the score computed from the distance to the closest keypoint in a way similar to what was done in [34], we argue that it is also important to distinguish the image locations that are close to keypoints from those that are far away. The values returned by the regressor for image locations close to keypoints should have a local maximum at the keypoint locations, while the actual values for the locations far from the keypoints are irrelevant as long as they are small enough to discard them by simple thresholding. We therefore first introduce a classification-like term that enforces the separation between these two different types of image locations. We also rely on a term that enforces the response to have a local maximum at the keypoint locations, and a term that regularizes the responses of the regressor over time. To summarize, the objective function we minimize over the parameters of our regressor can be written as the sum of three terms:
$$\operatorname*{minimize}_{\boldsymbol{\omega}} \;\; \mathcal{L}_c(\boldsymbol{\omega}) + \mathcal{L}_s(\boldsymbol{\omega}) + \mathcal{L}_t(\boldsymbol{\omega}), \tag{2}$$
where $\mathcal{L}_c$ is the classification-like term, $\mathcal{L}_s$ the shape term, and $\mathcal{L}_t$ the temporal regularization term.
In this subsection we describe in detail the three terms of the objective function introduced in Eq. (2). The individual influences of each term are evaluated empirically and discussed in Section 5.4.
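As explained above, this term is useful to clearly separate the image locations that are close to keypoints from the ones that are far away. It relies on a max-margin loss, as in traditional SVM [7]. In particular, we define it as:
$$\mathcal{L}_c(\boldsymbol{\omega}) = \gamma_c \|\boldsymbol{\omega}\|_2^2 + \frac{1}{K} \sum_{i=1}^{K} \max\bigl(0,\, 1 - y_i F(\mathbf{x}_i; \boldsymbol{\omega})\bigr)^2, \tag{3}$$
where $\gamma_c$ is a meta-parameter, $y_i \in \{-1, +1\}$ is the label for the sample $\mathbf{x}_i$, and $K$ is the number of training data.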
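A max-margin term of this kind can be sketched numerically as below. The squared hinge and the L2 regularizer weighted by `gamma_c` are assumptions made for illustration (a squared hinge keeps the loss differentiable, which suits Newton-type solvers); the function and argument names are hypothetical.

```python
def classification_loss(scores, labels, params, gamma_c):
    """Regularized squared-hinge loss over regressor outputs.

    scores: regressor outputs, one per training sample
    labels: +1 for patches near a keypoint, -1 for patches far away
    params: flat list of regressor parameters (used only for regularization)
    """
    regularizer = gamma_c * sum(p * p for p in params)
    hinge = sum(max(0.0, 1.0 - y * f) ** 2 for f, y in zip(scores, labels))
    return regularizer + hinge / len(scores)


scores = [2.0, 0.5, -0.5]   # toy regressor outputs
labels = [+1, +1, -1]
params = [1.0, -1.0]
loss = classification_loss(scores, labels, params, gamma_c=0.1)
```

The first sample lies beyond the margin and contributes nothing, while the other two are on the correct side but inside the margin and are each penalized by 0.25 before averaging.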
To have local maxima at the keypoint locations, we enforce the response of the regressor to have a specific shape at these locations. For each positive sample $i$, we force the response shape by defining a loss term related to the desired response shape $h$, similar to the one used in [34] and shown in Fig. 2(b):
$$h(x, y) = e^{\alpha \left(1 - \frac{\sqrt{x^2 + y^2}}{\beta}\right)} - 1, \tag{4}$$
where $x$, $y$ are pixel coordinates with respect to the center of the patch, and $\alpha$, $\beta$ meta-parameters influencing the sharpness of the shape.
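To give a feel for such a peaked target, the sketch below evaluates an exponential profile of the kind used for distance-based regression targets. The exact functional form, the names `alpha` and `beta`, and the values chosen are assumptions for illustration only.

```python
import math


def desired_response(x, y, alpha, beta):
    """Peaked target: maximal at the patch centre, zero at distance beta."""
    return math.exp(alpha * (1.0 - math.hypot(x, y) / beta)) - 1.0


# Sample the target on a 5x5 grid centred on the keypoint (toy parameters).
alpha, beta = 1.0, 2.0
shape = [[desired_response(x, y, alpha, beta) for x in range(-2, 3)]
         for y in range(-2, 3)]
peak = shape[2][2]      # value at the centre: e^alpha - 1
at_beta = shape[2][4]   # value at distance beta from the centre: 0
```

Under this assumed form, a larger `alpha` makes the peak taller and sharper, while `beta` sets the radius at which the target falls to zero.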
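However, we want to enforce only the general shape and not the scale of the responses, so as not to interfere with the classification-like term $\mathcal{L}_c$. We therefore introduce an additional term defined as:
$$\mathcal{L}_s(\boldsymbol{\omega}) = \frac{\gamma_s}{K_p} \sum_{i \,\mid\, y_i = +1} \sum_{n} \left\| \mathbf{w}_{n\eta_i(n)} * \mathbf{x}_i - \left(\mathbf{w}_{n\eta_i(n)}^\top \mathbf{x}_i\right) h \right\|_2^2, \tag{5}$$
where $*$ denotes the convolution product, $K_p$ is the number of positive samples; $\gamma_s$ is a meta-parameter for weighting the term that will be estimated by cross-validation. $\eta_i(n) = \arg\max_m \mathbf{w}_{nm}^\top \mathbf{x}_i$ is used to enforce the shape constraints only on the filters that contribute to the regressor response of the $\max$ operator.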
It turns out that it is more convenient to perform the optimization of this term in the Fourier domain. If we denote the 2D Fourier transforms of the selected filter $\mathbf{w}_{n\eta_i(n)}$, the sample $\mathbf{x}_i$, and the desired shape $h$ as $\mathbf{W}_{n\eta_i(n)}$, $\mathbf{X}_i$, and $\mathbf{H}$, respectively, then by applying Parseval’s theorem and the Convolution theorem, Eq. (5) becomes (see the Appendix in the supplemental material for the derivation)
(6) 
where
(7) 
This way of enforcing the shape of the responses is a generalization of the approach of [29] to any type of shape. In practice, we approximate with the mean over all positive training samples for efficient learning. We also use Parseval’s theorem and the feature mapping proposed in Ashraf et al.’s work [2] for easy calculation (see the Appendix in the supplemental material).
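To enforce the repeatability of the regressor over time, we force the regressor to have similar responses at the same locations over the stack of training images. This is simply done by adding a term defined as:
$$\mathcal{L}_t(\boldsymbol{\omega}) = \frac{\gamma_t}{K} \sum_{i=1}^{K} \sum_{j \in \mathcal{N}_i} \bigl(F(\mathbf{x}_i; \boldsymbol{\omega}) - F(\mathbf{x}_j; \boldsymbol{\omega})\bigr)^2, \tag{8}$$
where $\mathcal{N}_i$ is the set of samples at the same image locations as $\mathbf{x}_i$ but from the other training images of the stack, and $K$ is the number of training samples. $\gamma_t$ is again a meta-parameter to weight this term.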
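The idea of penalizing response differences across the image stack can be sketched as follows. The averaging over samples, the weight name `gamma_t`, and the data layout (`responses[loc][t]`) are assumptions for illustration, not the authors' implementation.

```python
def temporal_loss(responses, gamma_t):
    """Penalize differing regressor outputs at the same location across images.

    responses[loc][t] is the regressor output at image location `loc`
    in the t-th training image of the stack.
    """
    total, n_samples = 0.0, 0
    for per_location in responses:
        for i, f_i in enumerate(per_location):
            n_samples += 1
            # Compare with the same location in every other image of the stack.
            for j, f_j in enumerate(per_location):
                if i != j:
                    total += (f_i - f_j) ** 2
    return gamma_t * total / n_samples


responses = [
    [1.0, 1.0, 1.0],   # perfectly stable location: no penalty
    [0.0, 1.0, 2.0],   # response drifts over time: penalized
]
loss = temporal_loss(responses, gamma_t=0.5)  # 0.5 * 12 / 6 = 1.0
```

Only the second location contributes here, since its response changes between images while the first stays constant.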








After dimension reduction using Principal Component Analysis (PCA) applied to the training samples to decrease the number of parameters to optimize, we solve Eq. (2) through a greedy procedure similar to gradient boosting. We start with an empty set of hyperplanes $\mathbf{w}_{nm}$ and we iteratively add new hyperplanes that minimize the objective function until we reach the desired number (fixed by the meta-parameters $N$ and $M$ in our experiments). To estimate the hyperplane to add, we apply a trust region Newton method [22], as in the widely used LibLinear library [9].
After initialization, we randomly go through the hyperplanes one by one and update them with the same Newton optimization method. Fig. 4(a) shows the filters learned by our method on the StLouis sequence. We perform a simple cross-validation using grid search in log-scale to estimate the meta-parameters $\gamma_c$, $\gamma_s$, and $\gamma_t$ on a validation set.
To further speed up our regressor, we approximate the learned linear filters with linear combinations of separable filters using the method proposed in [35]. Convolutions with separable filters are significantly faster than convolutions with non-separable ones, and the approximation is typically very good. Fig. 4(b) shows an example of such approximated filters.
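The speed advantage of separable filters comes from replacing one 2D pass by two 1D passes. The sketch below only demonstrates that equivalence for a single rank-1 filter; it does not implement the approximation method of [35], and all names and values are hypothetical.

```python
def correlate2d_valid(image, kernel):
    """Plain 2D correlation over the 'valid' region (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[r + a][c + b]
                 for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def correlate_separable(image, col, row):
    """Same result for kernel = outer(col, row), using two 1D passes."""
    horizontal = correlate2d_valid(image, [row])                 # 1 x k pass
    return correlate2d_valid(horizontal, [[v] for v in col])     # k x 1 pass


col, row = [1.0, 2.0, 1.0], [-1.0, 0.0, 1.0]
kernel = [[cv * rv for rv in row] for cv in col]   # rank-1 (separable) 3x3 filter
image = [[float((3 * r + 5 * c) % 7) for c in range(6)] for r in range(5)]

full = correlate2d_valid(image, kernel)        # k*k multiplications per pixel
fast = correlate_separable(image, col, row)    # about 2*k multiplications per pixel
```

For a k-by-k filter the per-pixel cost drops from k² to roughly 2k multiplications, which is where the speed-up from using a small bank of separable filters comes from.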
In this section we first describe our experimental setup and present both quantitative and qualitative results on our Webcam dataset and the more standard Oxford dataset.
We compare our approach to TaSK, SIFT, SFOP, WADE, FAST-9, SIFER, SURF, LCF, MSER, and EdgeFoci (see the supplementary material for implementation details). In the following, our full method will be denoted TILDE-P. TILDE-P24 denotes the same method, after approximation of the piecewise linear regressor using 24 separable filters.
To evaluate our regressor itself, we also compared it against two other regressors. The first regressor, denoted TILDE-GB, is based on boosted regression trees and is an adaptation of the one used in [34] for centerline detection to keypoint detection, with the same parameters used for implementation as in the original work. The second regressor we tried, denoted TILDE-CNN, is a Convolutional Neural Network, with an architecture similar to the LeNet-5 network [20] but with an additional convolution layer and a max-pooling layer. The first, third, and fifth layers are convolutional layers; the first layer has a resolution of and filters of size , the third layer has 10 feature maps of size and filters of size , and the fifth layer 50 feature maps of size , and filters of size . The second, fourth, and sixth layers are max-pooling layers of size. The seventh layer is a layer of 500 neurons fully connected to the previous layer, which is followed by the eighth layer which is a fully-connected layer with a sigmoid activation function, followed by the final output layer. For the output layer we use the regression cost function.
We thoroughly evaluated the performance of our approach using the same repeatability measure as [31], on our Webcam dataset, and the Oxford and EF datasets. The repeatability is defined as the number of keypoints consistently detected across two aligned images. As in [31] we consider keypoints that are less than 5 pixels apart when projected to the same image as repeated. However, the repeatability measure has two caveats: First, a keypoint close to several projections can be counted several times. Moreover, with a large enough number of keypoints, even simple random sampling can achieve high repeatability as the density of the keypoints becomes high.
We therefore make this measure more representative of the performance with two modifications: First, we allow a keypoint to be associated only with its nearest neighbor; in other words, a keypoint cannot be used more than once when evaluating repeatability. Second, we restrict the number of keypoints to a small given number, so that picking the keypoints at random locations would result in a repeatability score of only 2%, reported as Repeatability (2%) in the experiments.
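A one-to-one variant of the repeatability score can be sketched as below. The greedy closest-pair matching and the normalization by the smaller keypoint count are assumptions made for illustration; the function name and the toy coordinates are hypothetical.

```python
import math


def repeatability(points_a, points_b, max_distance=5.0):
    """Fraction of keypoints repeated under one-to-one matching.

    points_b are assumed to be already projected into the frame of points_a.
    """
    if not points_a or not points_b:
        return 0.0
    # All candidate pairs, closest first.
    pairs = sorted(
        (math.hypot(ax - bx, ay - by), i, j)
        for i, (ax, ay) in enumerate(points_a)
        for j, (bx, by) in enumerate(points_b)
    )
    used_a, used_b = set(), set()
    for dist, i, j in pairs:
        if dist >= max_distance:
            break  # remaining pairs are even farther apart
        if i not in used_a and j not in used_b:
            used_a.add(i)   # each keypoint can be used at most once
            used_b.add(j)
    return len(used_a) / min(len(points_a), len(points_b))


points_a = [(10.0, 10.0), (50.0, 50.0), (90.0, 20.0)]
points_b = [(12.0, 11.0), (11.0, 10.0), (52.0, 49.0)]
score = repeatability(points_a, points_b)
```

In the toy data two keypoints of `points_b` fall within 5 pixels of the same keypoint of `points_a`, but only the closer one is counted, so the score is 2/3 rather than 1.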
We also include results using the standard repeatability score, 1000 keypoints per image, and a fixed scale of 10 for our methods, which we refer to as Oxford Stand. and EF Stand., for comparison with previous papers, such as [26, 44]. Table 1 shows a summary of the quantitative results.
Method     | Webcam (2%) | Oxford Stand. | Oxford (2%) | EF Stand. | EF (2%)
TILDE-GB   | 33.3        | 54.5          | 32.8        | 43.1      | 16.2
TILDE-CNN  | 36.8        | 51.8          | 49.3        | 43.2      | 27.6
TILDE-P24  | 40.7        | 58.7          | 59.1        | 46.3      | 33.0
TILDE-P    | 48.3        | 58.1          | 55.9        | 45.1      | 31.6
FAST-9     | 26.4        | 53.8          | 47.9        | 39.0      | 28.0
SFOP       | 22.9        | 51.3          | 39.3        | 42.2      | 21.2
SIFER      | 25.7        | 45.1          | 40.1        | 27.4      | 17.6
SIFT       | 20.7        | 46.5          | 43.6        | 32.2      | 23.0
SURF       | 29.9        | 56.9          | 57.6        | 43.6      | 28.7
TaSK       | 14.5        | 25.7          | 15.7        | 22.8      | 10.0
WADE       | 27.5        | 44.3          | 51.0        | 25.6      | 28.6
MSER       | 22.3        | 51.5          | 35.9        | 38.9      | 23.9
LCF        | 30.9        | 55.0          | 40.1        | 41.6      | 23.1
EdgeFoci   | 30.0        | 54.9          | 47.5        | 46.2      | 31.0
Fig. 5 gives the repeatability scores for our Webcam dataset. Fig. 5 (top) shows the results of our method when trained on each sequence and tested on the same sequence, with the set of images divided into disjoint train, validation, and test sets. Fig. 5 (bottom) shows the results when we apply our detector trained on one sequence to all other unseen sequences from the Webcam dataset. We significantly outperform state-of-the-art methods when using a detector trained specifically for each sequence. Moreover, while the gap is reduced when we test on unseen sequences, we still outperform all compared methods by a significant margin, showing the generalization capability of our method.
In Fig. 8 we also evaluate our method on the Oxford and EF datasets. The Oxford dataset is simpler in the sense that it does not exhibit the drastic changes of the Webcam dataset, but it is a reference for the evaluation of keypoint detectors. The EF dataset, on the other hand, exhibits drastic illumination changes and is very challenging. It is therefore interesting to evaluate our approach on these datasets.
Instead of learning a new keypoint detector on these datasets, we apply the detector learned using the Chamonix sequence from the Webcam dataset. Our method still achieves state-of-the-art performance. We even significantly outperform state-of-the-art methods in the case of the Bikes, Trees, Leuven and Rushmore images, which are outdoor scenes. Note that we also obtain good results for Boat, which has large scale changes, although we currently do not consider scale in learning and detecting. The repeatability scores shown here are lower than what was reported in previous works [26, 31] as we consider a smaller number of keypoints. As mentioned before, considering a large number of keypoints artificially improves the repeatability score.
We also give in Fig. 9 some qualitative results on the task of matching challenging pairs of images captured on different days under different weather conditions. Our matching pipeline is as follows: we first extract keypoints in both images using the different methods we want to compare, compute the keypoint descriptors, and compute the homography between the two images using RANSAC. Since the goal of this comparison is to evaluate keypoints, not descriptors, we use the SIFT descriptor for all methods. Note that we also tried using other descriptors [3, 32, 6, 1, 21] but due to the drastic difference between the matched images, only SIFT descriptors with ground truth orientation and scale worked. We compare our method with the SIFT [23], SURF [3], and FAST-9 [31] detectors, using the same number of keypoints (300) for all methods. Our method makes it possible to retrieve the correct transformations between the images even under such drastic changes of the scene appearance.
Fig. 6 gives the results of the evaluation of the influence of each loss term of Eq. (2) by evaluating the performance of our detector without each term. We consider three reduced variants of our method: one using only the classification loss, one using both the classification loss and the temporal regularization, and one using the classification loss and the shape regularization. We achieve the best performance when all three terms are used together. Note that the shape regularization enhances the repeatability on Oxford and EF, two completely unseen datasets, whereas the temporal regularization helps when we test on images which are similar to the training set.
Fig. 7 gives the computation time of SIFT and each variant of our method. The computation time of TILDE-P24 is not very far from that of SIFT. Note that our method is highly parallelizable, whereas our current implementation does not benefit from any parallelization. We therefore believe that our method can be significantly sped up with a better implementation.
We have introduced a learning scheme to detect keypoints reliably under drastic changes of weather and lighting conditions. We proposed an effective method for generating the training set to learn regressors. We learned three regressors, among which the piecewise linear regressor showed the best results. We evaluated our regressors on our new outdoor keypoint benchmark dataset. Our regressors significantly outperform the current state-of-the-art on our new benchmark dataset and also achieve state-of-the-art performance on the Oxford and EF datasets, demonstrating their generalization capability.
An interesting future research direction is to extend our method to scale space. For example, the strategy applied in [21] to FAST can be directly applied to our method.
This work was supported by the EU FP7 project MAGELLAN under the grant number ICTFP7611526 and in part by the EU project EDUSAFE.
Conference on Computer Vision and Pattern Recognition, 2012.
[2] Reinterpreting the Application of Gabor Filters as a Manipulation of the Margin in Linear Support Vector Machines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(7):1335–1341, 2010.
[22] Trust Region Newton Method for Logistic Regression. Journal of Machine Learning Research, 9:627–650, 2008.