[AI Robotics Summer School 2016] From local features to deep learning
We show how to train a Convolutional Neural Network to assign a canonical orientation to feature points given an image patch centered on the feature point. Our method improves feature point matching upon the state of the art and can be used in conjunction with any existing rotation sensitive descriptors. To avoid the tedious and almost impossible task of finding a target orientation to learn, we propose to use Siamese networks which implicitly find the optimal orientations during training. We also propose a new type of activation function for Neural Networks that generalizes the popular ReLU, maxout, and PReLU activation functions. This novel activation performs better for our task. We validate the effectiveness of our method extensively with four existing datasets, including two non-planar datasets, as well as our own dataset. We show that we outperform the state of the art without the need of retraining for each dataset.
Feature points are an essential and ubiquitous tool in computer vision, and extensive research has been conducted on both detectors [3, 6, 22, 25, 27, 32, 48] and descriptors [2, 6, 22, 25, 32, 38, 43, 47], including using statistical approaches [33, 46]. However, the assignment of a canonical orientation, which is an important common step, has received almost no individual attention, probably since the dominant orientation of SIFT is considered to give good results.
Figure 1 (caption, partially recovered): [...] region rotated back with our learned orientations. Estimation errors are denoted by a green arc. Bottom left: MVS results with SIFT orientations. Bottom right: MVS results with our orientations. As shown, due to viewpoint changes on non-planar surfaces, SIFT orientations are not stable. On the contrary, orientations provided by our method are stable, which leads to better reconstructions. 46272 vertices were obtained using SIFT orientations, and 84087 vertices with our orientations. Edge Foci feature points were used in conjunction with Daisy descriptors for both methods. Figures are best viewed in color.
However, this is not necessarily true. In complex scenes, feature points lie on non-planar surfaces and their appearance can be drastically altered by viewpoint and illumination changes. This can easily produce errors in orientation estimates as shown in Fig. 1. In addition, rotation invariant descriptors [7, 14, 42] are not a definitive solution either as these descriptors discard rotation sensitive information which can be useful when ideal orientations are given. Thus, as we will show in our experiments, higher matching performances can be achieved with rotation sensitive descriptors and better orientation assignments.
In this paper, we show how to remedy this problem by training a regressor to estimate better orientations for matching, and to boost the performance of existing rotation sensitive descriptors. We train a Convolutional Neural Network to predict an orientation, given a patch around a feature point. To avoid the difficult task of finding the canonical orientation to learn, we treat the orientation to learn as an implicit variable, by training a Siamese network [9, 12] similar to descriptor learning methods [33, 46]. Also, to allow our method to work in conjunction with any existing rotation sensitive descriptors such as SIFT, SURF, and the learning-based VGG, we consider the descriptor component as a black box when learning.
To evaluate the performance of descriptors with orientations from the proposed method, we use datasets with both planar or far away objects [26, 40, 48] and 3D objects [1, 36]. In addition, we created our own dataset as well, to further enrich the dataset with complex camera movements, such as in-plane rotations and viewpoint changes. We demonstrate that the proposed method gives significant improvement over the state-of-the-art for all the datasets, without the need of re-training for each dataset.
In the remainder of this paper, we first discuss related work, introduce our learning framework, detail our method as well as the proposed activation function. We then present our experimental results demonstrating the effectiveness of our orientation assignment compared to the state-of-the-art. We also investigate the influence of the proposed activation and of the datasets, and we conclude with several application results.
As shown in the survey of , the importance of a good orientation estimation has been overlooked, and it has been treated as a minor step that either the feature point detector or the descriptor has to perform. The widely used solution for assigning an orientation to a feature point is the dominant orientation of SIFT. However, as pointed out by , dominant orientation-based methods do not work well for arbitrary positions, although orientation assignment has a critical impact on descriptor performance. Nevertheless, here we provide a brief review of existing methods related to orientation assignment and our method.
In SIFT, histograms of gradient orientations are used to determine the dominant orientation. It remains the most popular method and has also been extended to 3D. SURF uses Haar-wavelet responses of sample points to extract the dominant orientation. MOPs simply uses the gradient at the center of a patch after some smoothing for robustness to noise. ORB uses image moments to compute the center of mass as well as the main orientation. HIP considers intensity differences over a circle centered around the feature point to estimate the orientation. Although this is rather fast, it is also very sensitive to noise.
In summary, despite the variation, the main idea of these methods remains the same: finding a reliable dominant orientation in their respective ways. Thus, when computation time constraints are not too drastic, using the SIFT orientation remains the first solution to try.
As existing orientation assignment methods are not always robust enough to guarantee good matching performances, interest has been drawn to descriptors which are inherently rotation invariant [14, 42]. MROGH uses local intensity order pooling with rotation invariant gradients, and LIOP constructs the descriptor in a similar way but with a different strategy for aggregating the gradient information. BRISK and FREAK claim rotation invariance as well, but they still depend on the orientation estimation which is included in the descriptor extraction process.
Besides descriptors that are rotation invariant by construction,  uses concentric rings for generating orientation histogram bins with spin images, and a specific distance function for rotation invariant matching. sGLOH  also proposes to use a rotation invariant distance function which computes distances for all possible rotation combinations and takes the minimum. The authors further extend their method by proposing a general method for histogram-based feature descriptors taking into account the main orientation of the scene .
Although these methods may be better than the original SIFT descriptor , SIFT descriptor combined with our learning-based orientation estimation outperforms them, as we will show in the experiments. This is probably due to the fact that rotation sensitive information is discarded when computing these descriptors. Furthermore,  is only applicable when the entire scene is the object of interest, and is impractical as the main orientation is obtained by computing all the possible matching pairs of features to keep the configuration with best matches.
Learning-based methods have been already used in the context of feature point matching, but only for problems other than orientation assignment of general feature points. For example,  learns to predict the pose of patches, but uses one regressor per patch, which is not a viable solution for general feature points. [33, 46] use Siamese networks—as we do—to directly compare image patches , or to learn to compute descriptors . VGG  as well as [13, 39] also learn descriptors, through convex optimization, boosting, and greedy optimization, respectively.
One caveat in these learning-based descriptors is that they still rely on the orientation estimation of local feature detectors, which are traditionally handcrafted. Moreover, they typically use the Brown dataset for learning, with patches extracted using ground truth orientations from Structure from Motion (SfM) techniques. This ground truth orientation assignment is not something one can expect to have in practical use, and may lead to performance degradation when tested on other data with inaccurate orientation assignments. These methods will also benefit from better orientation assignments at test time, as we will show in our experiments with VGG.
In this section we first introduce our learning strategy, then formalize it. We also describe our activation function based on GHH.
As illustrated in Fig. 2, orientation assignment plays a critical role in the descriptor matching performances. However, a major problem we face in our approach is that it is not clear which orientation should be learned. For example, one can try to learn to predict the dominant orientation of SIFT, or maybe the median of the dominant orientations for the same feature point extracted from multiple images. However, there is no guarantee that the orientations retrieved from such an approach are the ideal canonical orientations we want to learn. Our early experiments, based on such heuristics to decide which orientation should be learned, remained unfruitful.
Since it is hard to define a canonical orientation to learn, we instead take into account that it is actually the descriptor distances of the feature points that are important, not the orientation values themselves. We formulate the problem by learning to assign orientations which minimize the descriptor distances of pairs of local features corresponding to the same physical point. In this way, we do not have to decide which orientations should be learned. We let the learning optimization find which orientations are both reliably predictable and improve the matching performance. We formalize this approach in the next subsection.
We use a Siamese network, as in descriptor learning methods [33, 46], but the loss function and its computation are different since we learn to estimate the orientation and not the descriptor itself. In fact, we treat the descriptor as a black box so that various rotation variant descriptors can be used. However, this is not necessarily a restriction, and our method can be easily adapted to include learning of the descriptors as well.
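Our training data is made of pairs of image patches centered on feature points in two images but corresponding to the same physical 3D points. We minimize a loss function $\sum_i \mathcal{L}(\mathbf{p}_i)$ over the parameters $\mathbf{W}$ of a CNN, with

$$\mathcal{L}(\mathbf{p}_i) = \left\| g\big(\mathbf{p}_i^1, f_{\mathbf{W}}(\mathbf{p}_i^1)\big) - g\big(\mathbf{p}_i^2, f_{\mathbf{W}}(\mathbf{p}_i^2)\big) \right\|_2^2, \tag{1}$$

where $\mathbf{p}_i = \{\mathbf{p}_i^1, \mathbf{p}_i^2\}$, the pairs $\mathbf{p}_i$ are pairs of image patches from the training set, $f_{\mathbf{W}}(\mathbf{p}_i^*)$ denotes the orientation computed for image patch $\mathbf{p}_i^*$ using a CNN with parameters $\mathbf{W}$, and $g(\mathbf{p}_i^*, \theta_i^*)$ is the descriptor for patch $\mathbf{p}_i^*$ and orientation $\theta_i^*$. As discussed in the previous subsection, there is no target orientation in the loss function of Eq. (1): the predicted orientations will be optimized implicitly during training.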
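The absence of a target orientation in Eq. (1) can be illustrated with a minimal sketch. It assumes a toy rotation-sensitive descriptor (the hypothetical `toy_descriptor`, which is not SIFT or any descriptor used in the paper), a squared Euclidean distance, and hand-picked angles standing in for the CNN outputs. The loss vanishes whenever the two predicted orientations compensate for the relative rotation between the two views, whatever their absolute values.

```python
import math

def toy_descriptor(patch, theta):
    """Hypothetical rotation-sensitive descriptor: samples the patch (modelled
    here as a function of angle) at 8 directions offset by theta (radians)."""
    return [patch(theta + 2.0 * math.pi * k / 8) for k in range(8)]

def pair_loss(patch1, patch2, theta1, theta2, descriptor=toy_descriptor):
    """Squared Euclidean distance between the descriptors of the two patches."""
    d1 = descriptor(patch1, theta1)
    d2 = descriptor(patch2, theta2)
    return sum((a - b) ** 2 for a, b in zip(d1, d2))

# Two views of the same point: the second is the first rotated by 0.7 rad.
view1 = lambda a: math.cos(a) + 0.5 * math.sin(2.0 * a)
view2 = lambda a: view1(a - 0.7)

aligned = pair_loss(view1, view2, 0.3, 1.0)     # orientations differ by 0.7 rad
misaligned = pair_loss(view1, view2, 0.3, 0.3)  # relative rotation ignored
print(aligned, misaligned)
```

Any pair of orientations with the right offset, such as (0.3, 1.0) or (1.3, 2.0), gives the same near-zero loss, which is why the optimization is free to choose the orientations that are easiest to predict reliably.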
Learning angles requires special care. Directly predicting an angle with a CNN did not work well in our early experiments, probably because the periodicity of the orientation in Eq. (1) generates many local minima. An alternative way would be to learn to provide histogram-like outputs, which then can be used with $\arg\max$ to give angular outputs, in a way reminiscent of SIFT. However, this approach also did not work well, as the estimated orientations have to be discretized and the network becomes too large when we want fine resolutions.
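To alleviate the problem of periodicity, similarly to how manifolds are embedded in [30, 31], we train a CNN to predict two values, which can be seen as a scaled cosine and sine, and compute an angle by taking:

$$f_{\mathbf{W}}(\mathbf{p}_i^*) = \operatorname{arctan2}\big(\hat{f}^{(1)}_{\mathbf{W}}(\mathbf{p}_i^*), \hat{f}^{(2)}_{\mathbf{W}}(\mathbf{p}_i^*)\big), \tag{2}$$

where $\hat{f}^{(1)}_{\mathbf{W}}(\mathbf{p}_i^*)$ and $\hat{f}^{(2)}_{\mathbf{W}}(\mathbf{p}_i^*)$ are the two values returned by the CNN for patch $\mathbf{p}_i^*$, and $\operatorname{arctan2}(y, x)$ is the four-quadrant inverse tangent function (we follow the standard implementation for the C language for this function).

This function is not defined at the origin, which turned out to be a problem only on rare occasions at the first iteration, after random initialization of the CNN parameters $\mathbf{W}$. To prevent this, we use the following approximation for its gradient:

$$\frac{\partial}{\partial x} \operatorname{arctan2}(y, x) \approx \frac{-y}{x^2 + y^2 + \epsilon}, \qquad \frac{\partial}{\partial y} \operatorname{arctan2}(y, x) \approx \frac{x}{x^2 + y^2 + \epsilon}, \tag{3}$$

where $\epsilon$ is a very small value.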
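The mapping from two network outputs to an angle, together with a gradient that stays finite at the origin, can be sketched as follows. The argument order (first value treated as the scaled sine, as in C's `atan2(y, x)`) and the value of `EPS` are assumptions made for illustration; the text only states that the constant is very small.

```python
import math

EPS = 1e-8  # assumed value; the text only says "a very small value"

def orientation(y, x):
    """Angle in [-pi, pi] from the two outputs (scaled sine, scaled cosine)."""
    return math.atan2(y, x)

def orientation_grad(y, x, eps=EPS):
    """Partial derivatives of atan2(y, x), returned as (d/dy, d/dx).
    Adding eps to the denominator keeps both finite at the origin."""
    denom = x * x + y * y + eps
    return x / denom, -y / denom

# The angle depends only on the direction of (x, y), not on its scale.
print(orientation(1.0, 1.0), orientation(3.0, 3.0))
# At the origin the stabilized gradient is zero instead of undefined.
print(orientation_grad(0.0, 0.0))
```

Because the angle is invariant to the common scale of the two outputs, the network does not need to produce a unit vector, which is consistent with describing the outputs as a scaled cosine and sine.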
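The derivatives of the loss function for a given training pair $\mathbf{p}_i$ can be computed using the chain rule:

$$\frac{\partial \mathcal{L}(\mathbf{p}_i)}{\partial \mathbf{W}} = \frac{\partial \mathcal{L}}{\partial g_i^1} \frac{\partial g_i^1}{\partial \theta_i^1} \frac{\partial \theta_i^1}{\partial \mathbf{W}} + \frac{\partial \mathcal{L}}{\partial g_i^2} \frac{\partial g_i^2}{\partial \theta_i^2} \frac{\partial \theta_i^2}{\partial \mathbf{W}}, \tag{4}$$

with $g_i^* = g(\mathbf{p}_i^*, \theta_i^*)$ and $\theta_i^* = f_{\mathbf{W}}(\mathbf{p}_i^*)$.

Jacobians $\partial \mathcal{L} / \partial g_i^*$ and $\partial \theta_i^* / \partial \mathbf{W}$ are straightforward to compute. $\partial g_i^* / \partial \theta_i^*$ is not as easy, since $g_i^*$ is the descriptor for patch $\mathbf{p}_i^*$ after rotation by an amount given by $\theta_i^*$. For example, in case of SIFT, the descriptor extraction process involves building histograms, which cannot be expressed as a differentiable function. Moreover, depending on the descriptor, the pooling region for extracting the descriptor changes when a different orientation is provided.

We therefore use a numerical approximation of the gradients: when we form the training data we also compute the descriptors for many possible orientations, every 5 degrees in our current implementation. We can then efficiently approximate $\partial g_i^* / \partial \theta_i^*$ by numerical differentiation. Note that in case of descriptors that can be expressed in analytic form, for example learning-based descriptors, we can also easily compute $\partial g_i^* / \partial \theta_i^*$ analytically, instead of using numerical approximations.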
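A minimal sketch of this numerical differentiation is given below. It assumes descriptors tabulated every 5 degrees, as stated in the text, and a central difference between the two neighbouring samples; the exact difference scheme used by the authors is not specified, and the names `build_table` and `descriptor_jacobian` are hypothetical.

```python
import math

STEP_DEG = 5  # sampling interval stated in the text

def build_table(descriptor, patch, step_deg=STEP_DEG):
    """Precompute the black-box descriptor at 0, 5, ..., 355 degrees."""
    return [descriptor(patch, math.radians(a)) for a in range(0, 360, step_deg)]

def descriptor_jacobian(table, theta, step_deg=STEP_DEG):
    """Central-difference estimate of d(descriptor)/d(theta) at theta (radians),
    using the nearest tabulated sample and wrapping around at 360 degrees."""
    n = len(table)
    step = math.radians(step_deg)
    i = int(round(theta / step)) % n
    nxt, prv = table[(i + 1) % n], table[(i - 1) % n]
    return [(a - b) / (2.0 * step) for a, b in zip(nxt, prv)]

# Toy descriptor with a known derivative: d/dtheta [cos, sin] = [-sin, cos].
table = build_table(lambda patch, t: [math.cos(t), math.sin(t)], None)
print(descriptor_jacobian(table, math.pi / 2))  # close to [-1, 0]
```

The table is built once per training patch, so each gradient evaluation during training reduces to a lookup and a subtraction, independent of how expensive the descriptor itself is.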
To implement the CNN, we use three convolution layers with the ReLU activation function, each followed by a max-pooling layer, followed by two fully connected layers with GHH activation. We detail the GHH activation below. We also use dropout regularization for better generalization. Implementation details are provided in Section 4.1.
To achieve state-of-the-art results with CNNs, we propose to use a new activation function in our network layers that works better for our problem than standard ones. This activation function is a generalization of the popular ReLU, maxout, and the recent PReLU activation functions based on Generalized Hinging Hyperplanes (GHH), which is a general form for continuous piece-wise linear functions. As the GHH activation function is more general, it has fewer restrictions in shape, and allows for more flexibility in what a single layer can learn. This activation function plays one of the key roles in our method for obtaining good orientations, as we will show in Section 4.3.
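Mathematically, for a given layer output $\mathbf{y} = [y_{1,1}, y_{1,2}, \ldots, y_{S,M}]$ before activation, we consider the following activation function:

$$o(\mathbf{y}) = \sum_{s \in \{1, \ldots, S\}} \delta_s \max_{m \in \{1, \ldots, M\}} y_{s,m}, \tag{5}$$

where $\delta_s = 1$ if $s$ is odd and $\delta_s = -1$ otherwise, and $S$ and $M$ are meta-parameters controlling the number of planar segments and thus the complexity of the function. When $S = 1$, Eq. (5) reduces to maxout activation, and when additionally $M = 2$ with $y_{1,1} = 0$, the equation reduces to the ReLU activation function. Finally, when $S = 2$, $M = 2$, and $y_{1,1} = y_{2,1} = 0$ with $y_{2,2} = -\alpha\, y_{1,2}$, where $\alpha$ is a scalar variable, Eq. (5) is equivalent to the PReLU activation function.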
Therefore, instead of having to choose a non-linear activation function, we also learn it under the constraint that it is piece-wise linear.
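The special cases of Eq. (5) can be checked with a small sketch. It assumes the usual GHH form, an alternating-sign sum of the maxima over S groups of M pre-activation values, and takes the pre-activations as plain Python lists; in a real layer each value would be a learned linear response. The helper names are hypothetical.

```python
def ghh(groups):
    """GHH activation: groups is a list of S lists, each with M pre-activation
    values. Group maxima are combined with alternating signs +, -, +, ...
    (index 0 here corresponds to s = 1, an odd index, in Eq. (5))."""
    return sum((1 if s % 2 == 0 else -1) * max(group)
               for s, group in enumerate(groups))

def relu(v):
    return ghh([[0.0, v]])                     # S = 1, M = 2, one value fixed to 0

def maxout(values):
    return ghh([list(values)])                 # S = 1

def prelu(v, alpha):
    return ghh([[0.0, v], [0.0, -alpha * v]])  # S = 2, M = 2

print(relu(-2.0), relu(3.0))                # 0.0 3.0
print(maxout([1.0, 5.0, 2.0]))              # 5.0
print(prelu(-2.0, 0.25), prelu(3.0, 0.25))  # -0.5 3.0
```

With larger S and M and all values learned, the same expression can represent piece-wise linear shapes that none of the three special cases can, which is the extra flexibility the text refers to.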
In this section, we first introduce the datasets used for evaluation and the setup for training our regressor. We then demonstrate the effectiveness of our method by comparing the descriptor performances using the original and our learned orientations. We show that the best matching performance can be achieved with our learned orientations, outperforming the state of the art. We also demonstrate the performance gain obtained by using the GHH activation compared to other activation functions, and investigate the influence of datasets on the descriptor performances. We finally show a Multi-View Stereo (MVS) application. Datasets and source code are available at http://cvlab.epfl.ch/.
Fig. 3 shows example images from the datasets we use for evaluation and training. Note that our collection of data is not only composed of planar objects but also of 3D objects with self occlusions. We also have various imaging changes including changes in the camera pose. We use the Oxford dataset  for training, and the Edge Foci (EF) dataset , the Webcam dataset , the Strecha dataset , the DTU dataset , and our own Viewpoints dataset for testing. Details on the datasets are as follows:
Oxford dataset : 8 sequences with 48 images in total. The dataset contains various imaging changes including viewpoint, rotation, blur, illumination, scale, JPEG compression changes. We thus use this dataset for training.
EF dataset : 5 sequences with 38 images in total. The dataset exhibits drastic lighting changes as well as daytime changes and viewpoint changes.
Webcam dataset : 6 sequences with 120 images in total. The dataset exhibits seasonal changes as well as daytime changes of scenes taken from far away.
Strecha dataset : composed of two sequences, fountain-P11 (11 images) and Herz-Jesu-P8 (8 images). The scene is non-planar and 3D. The dataset exhibits large viewpoint changes with self occlusions.
DTU dataset : 60 sequences with 600 images in total. This dataset also has multiple lighting settings for selected viewpoints, but we consider here only one lighting setting as we are mostly interested in the changes that occur on non-planar scenes undergoing camera movements. We also sample the viewpoints from the original dataset in regular intervals to make the dataset a manageable size.
Viewpoints dataset: 5 sequences with 30 images in total. We created our own dataset to further enrich the dataset. The dataset exhibits large viewpoint changes and in-plane rotations up to 45 degrees from the reference image, which is when commercial cameras compensate the image orientation as landscape or portrait.
We use a patch size of as input to the CNN. For the convolution layers, the first convolution layer uses a filter size of and output channels, the second convolution layer a filter size of and output channels, and the third convolution layer a filter size of and output channels. All max-pooling layers perform max pooling. The size of the output of the first fully connected layer is 100, with the second fully connected layer having two outputs with the mapping into orientations as described on Section 3.2.
For optimization, we use the ADAM method with default parameters and an exponentially decaying learning rate. We run epochs with batch size of . The learning rate decay is set to halve the learning rate every ten epochs.
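The decay schedule can be sketched as below. The base rate of 0.001 is an assumption (the default step size in the original ADAM paper, since the text says default parameters are used), and whether the decay is applied continuously or in steps of ten epochs is not stated; this sketch uses the continuous form.

```python
def learning_rate(epoch, base_lr=1e-3, half_life=10):
    """Exponential decay that halves the learning rate every half_life epochs."""
    return base_lr * 0.5 ** (epoch / half_life)

for epoch in (0, 10, 20):
    print(epoch, learning_rate(epoch))
```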
Our method is also computationally efficient as we are only estimating the orientations. On an Intel Xeon E5-2680 2.5GHz Processor, our current implementation in Python with Theano takes 0.47 milliseconds per feature point to compute orientations without any multi-threading. When used with SIFT descriptors, it overall takes 1.39 milliseconds per feature point to obtain the final descriptor. Note that the C++ implementation of the MROGH descriptor, which is the best performing rotation invariant descriptor in our experiments, takes 1.94 milliseconds per feature point.
To demonstrate the effectiveness of our method, we compare the descriptor matching performances with our orientation estimation against other state-of-the-art descriptors (details on the implementations of these methods are provided as an appendix in the supplementary material). We use the standard precision-recall measure of  with nearest neighbor matching, and with a maximum of 1000 feature points per image. In case of the DTU and Strecha datasets, the scenes are non-planar and we rely on the 3D models and camera projection matrices to map a point from one viewpoint to another. Such mapping is used instead of the homography, followed by the overlap test in . Results are summarized with the mean Average Precision (mAP) as in , where mAP is effectively the Area Under Curve of the precision-recall graph.
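The relation between mAP and the area under the precision-recall curve can be sketched as follows. This is a simplified illustration rather than the evaluation protocol itself: it assumes each candidate match comes with a descriptor distance and a ground-truth flag, and it takes recall relative to the correct matches among the candidates, whereas the standard protocol counts ground-truth correspondences. The function names are hypothetical.

```python
def average_precision(distances, is_correct):
    """Area under the precision-recall curve for matches ranked by distance
    (smaller distance means a more confident match)."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    total_correct = sum(1 for c in is_correct if c)
    if total_correct == 0:
        return 0.0
    hits, area = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if is_correct[i]:
            hits += 1
            area += hits / rank  # precision at each step in recall
    return area / total_correct

def mean_average_precision(image_pairs):
    """Mean of the per-image-pair average precisions."""
    aps = [average_precision(d, c) for d, c in image_pairs]
    return sum(aps) / len(aps)

ap = average_precision([0.1, 0.2, 0.3, 0.4], [True, False, True, False])
print(ap)
```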
We compare against both descriptors that require orientation estimations (ORB , BRISK , FREAK , SURF , SIFT , KAZE , BiCE , Daisy , and the learning-based VGG ), as well as rotation invariant descriptors (LIOP , MROGH , and sGLOH ). Note that descriptors are generally designed for a specific detector (for example they typically have different range of scale of operation) and for fair comparisons, we do not interchange the detector and descriptors. We use the feature point detectors presented when the descriptors were introduced. In case of the VGG descriptor, we use the descriptor pre-learned with the liberty dataset , as other sequences are partially included in our test set. We employ the Edge Foci (EF)  detector for VGG as it showed better performance than the Difference of Gaussians (DoG) detector, which was used in the original work of VGG. We will denote this method as EF-VGG.
We also use the EF detector with the Daisy descriptor and the SIFT descriptor, as this particular detector was designed with these two descriptors in mind. We will refer to them as EF-Daisy and EF-SIFT, respectively.
To demonstrate the effectiveness of our method, we evaluate the descriptor matching performances with and without our orientation assignment. We first show the performance gain we obtain for SIFT, SURF, Daisy, and VGG descriptors and then compare our performance against other state-of-the-art methods. Note that for each descriptor, we only train our method once using the Oxford dataset and test on all the other datasets.
To demonstrate the performance gain we obtain by using our orientation assignments, we learned orientations for SIFT, SURF, Daisy, and VGG descriptors. We denote descriptors computed using our learned orientation assignments with a + and a at the end; we use + when orientations are learned with respective descriptors, and when learned with SIFT descriptors. We also compare against using multiple dominant orientations. Note that using multiple orientations effectively amounts to creating duplicate feature points, which resulted in 34% increase in descriptor extraction time and 79% increase in matching time within our evaluation framework.
As shown in Fig. 4, we gain a consistent boost in descriptor matching performance with our orientation estimation. This includes the learning-based VGG descriptor, showing that learning-based methods also can benefit from a better orientation assignment. We also obtain a larger gain on average compared to using multiple orientations. The best performance is achieved with EF-VGG.
Interestingly, for EF-Daisy and EF-VGG, learning with the SIFT descriptor gave larger boost in performances than learning with the respective descriptors. We suspect that this is due to the characteristics of the two descriptors being less sensitive to orientations than SIFT descriptors, resulting in the Jacobians with respect to orientations to vanish.
Based on the comparison results in Fig. 4, in the remainder of the results section we will report the performance of the best performing handcrafted descriptor with our orientations, EF-Daisy, and the best performing learning-based descriptor with our orientations, EF-VGG. In Section 4.3, as the best performance was achieved by learning with the SIFT descriptor (EF-VGG), we will use EF-SIFT+ to evaluate the influence of different activation functions.
As shown in Fig. 5, both EF-Daisy and EF-VGG outperform all compared methods, with EF-VGG outperforming all others by a large margin. Specifically, EF-VGG performs 27.4% better in terms of mAP compared to EF-VGG, which is the best performing competitor. Note that without our orientation estimation, although the best among the competitors, the gap is small. Also, as pointed out in descriptor performance surveys [1, 19, 28, 29], SIFT or EF-SIFT generally give comparable results to the state-of-the-art.
As average results can be influenced by certain sequences being too easy or hard, we also investigate the average rank of each method on the entire dataset similarly to . In Table 1, we show the rank of each method on the datasets according to the average ranks of their mAP on each sequence. We also show the average mAP with all datasets for each method. Again, best results are obtained with our methods, EF-Daisy and EF-VGG.
To evaluate the influence of the proposed GHH activation function, we compared the matching performance of EF-SIFT+ with different activation functions. All parameters were set to be identical except for the activation type and the number of outputs in the fully connected layers. Specifically, we used 1600 hidden nodes for ReLU, Tanh, and PReLU , 400 for maxout  with four outputs inside the max. Note that PReLU has slightly more parameters than other activations, as an additional parameter is introduced for each output of the layer.
As shown in Fig. 6, we have a consistent gain in performance when using the proposed GHH activation function instead of ReLU, Tanh, maxout, and PReLU. This shows that indeed using the GHH activation, which is a generalization of several common activation functions, is suitable for learning orientations.
We observed in the existing datasets a general tendency for the images to be carefully taken with an upright posture. As a result, it is possible to assign a ground truth orientation to them by using a constant orientation. We will denote this upright assignment of orientations with the suffix “Up”, and compare their assignments with our orientation assignments. We group the datasets into “upright” and “non-upright” ones, depending on whether the systematic assignment to an upright orientation performs better than using the original orientation assignments, and compare the performance of EF-Daisy, EF-VGG, EF-Daisy, EF-VGG, EF-Daisy-Up, and EF-VGG-Up.
Fig. 7 shows the results of these experiments. As expected, in case of upright datasets, using a systematic upright orientation performs the best, which can be seen as an upper bound for the descriptor performances. The performances however degrade when tested on non-upright datasets. In contrast, our methods EF-Daisy and EF-VGG perform comparably to the upper bounds for upright datasets and are significantly better on non-upright datasets, achieving state-of-the-art results. Note that EF-VGG performs similarly to EF-VGG-Up, showing that inaccurate orientation assignments are not helpful.
We also apply our orientation estimations for a MVS application [44, 45]. Fig. 8 shows MVS results using EF-Daisy, EF-VGG and EF-VGG. Due to better matching performances, our method EF-VGG gives best MVS results, followed by EF-VGG. Specifically, for the fountain sequence of the Strecha dataset, we obtain vertices with EF-Daisy, with EF-VGG, and with EF-VGG. For Scene 55 of the DTU dataset, we obtain , , and vertices for EF-Daisy, EF-VGG, and EF-VGG, respectively.
We have introduced a learning scheme using a Convolutional Neural Network for the estimation of a canonical orientation for feature points, which improves the performance of existing descriptors. We proposed to train a Siamese network to predict an orientation, which avoided the need to explicitly define a “good” orientation to learn. We also proposed a new GHH activation function, which generalizes existing piece-wise linear activation functions and performs better for our task. We evaluated the effectiveness of our learned orientations by comparing the descriptor performances with and without our orientation assignment. Descriptors using our orientations gained a consistent performance increase and outperformed state-of-the-art descriptors on all datasets. We finally investigated the influence of the GHH activation function, showing its effectiveness.
Although we were able to enhance the performance of the learning-based VGG descriptor as well, an interesting future research direction is to fully integrate our method with learning-based descriptors, such as the recent descriptor presented in . In that case, we can have a fully differentiable Siamese network which learns both the orientation assignment and the descriptor at the same time.
This work was supported in part by the EU FP7 project MAGELLAN under the grant number ICT-FP7-611526 and in part by the EU project EDUSAFE.
Full Orientation Invariance and Improved Feature Selectivity of 3D SIFT with Application to Medical Image Analysis. In CVPR, 2008.
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.
In this appendix, we provide details on the implementations used in the experiments.
To keep the maximum number of feature points to 1000, we sort the detected feature points according to their respective response scores and keep the best 1000. Details for the implementations of the compared methods are as follows:
BRISK : Provided by the authors – http://www.asl.ethz.ch/people/lestefan/personal/BRISK
We used a threshold of 20, with default values for other parameters.
sGLOH : Provided by the authors – http://www.math.unipa.it/fbellavia/htm/research.html
Default parameters were used.
EF  and BiCE : Provided by the authors – http://research.microsoft.com/en-us/um/people/larryz/edgefoci/edge_foci.htm
Default parameters were used.
VGG : Provided by the authors –
Patches were extracted with the VLFeat library, with a relativeExtent of , which is the same as what SIFT uses. We use the pre-learned model learned with the liberty dataset from , as the other two datasets are partially included in our test set.
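The cap of 1000 feature points per image described at the start of this appendix can be sketched as follows. The keypoint representation (a dictionary with a `response` field) and the function name `keep_strongest` are hypothetical; each detector exposes its response score in its own way.

```python
import heapq

def keep_strongest(keypoints, max_points=1000):
    """Return at most max_points keypoints, strongest response first."""
    return heapq.nlargest(max_points, keypoints, key=lambda kp: kp["response"])

detected = [
    {"x": 10, "y": 12, "response": 0.20},
    {"x": 40, "y": 7, "response": 0.90},
    {"x": 25, "y": 31, "response": 0.55},
]
kept = keep_strongest(detected, max_points=2)
print([kp["response"] for kp in kept])  # [0.9, 0.55]
```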