Learning Local Feature Descriptor with Motion Attribute for Vision-based Localization

08/03/2019 ∙ by Yafei Song, et al. ∙ Beihang University Peking University 1

In recent years, camera-based localization has been widely used in robotic applications, and most proposed algorithms rely on local features extracted from recorded images. For better performance, the features used for open-loop localization must be globally static in the short term, and those used for re-localization or loop closure detection must be static in the long term. The motion attribute of a local feature point can therefore be exploited to improve localization performance; for example, feature points extracted from moving persons or vehicles can be excluded because they are unstable. In this paper, we design a fully convolutional network (FCN), named MD-Net, to perform motion attribute estimation and feature description simultaneously. MD-Net has a shared backbone network that extracts features from the input image and two network branches, one for each sub-task. With MD-Net, we obtain the motion attribute at little additional computational cost. Experimental results demonstrate that the proposed method can learn distinctive local feature descriptors together with the motion attribute using a single FCN, and that it outperforms competing methods by a wide margin. We also show that the proposed algorithm can be integrated into a vision-based localization algorithm to improve estimation accuracy significantly.




I Introduction

Image local features are important for many tasks and applications, e.g., image-based localization [Song2016TMM, li2014high, li2013high, concha2015dpptam], structure from motion [Huang_2018_ICRA, Zhu_2018_CVPR] and simultaneous localization and mapping [Mur_2017_orb_slam2, schneider2018maplab, li2013optimization]. Researchers have designed a variety of powerful local feature detectors and descriptors, e.g., SIFT [SIFT_2004_ijcv], ORB [ORB_2011_iccv] and FREAK [FREAK_2012_cvpr]. Following the recent success of deep neural networks on various tasks [Ren_2017_faster_rcnn, Shelhamer_2017_FCN], several papers have attempted to learn robust image local features automatically [TILDE_2015_cvpr, LIFT_2016_eccv, Tian_2017_cvpr, Mishchuk_2017_nips, superpoint_2018_cvpr].

Fig. 1: Moving points are ubiquitous in images, including short-term moving points as in (a) and long-term moving points as in (b). All of these points degrade the performance of a localization system.

For a vision-based localization system, the motion attribute of each feature point can be exploited to improve performance; for example, as shown in Fig. 1(a), points on moving people should be excluded. Existing systems usually do this by rejecting point matches that violate epipolar geometry constraints [MVG_2004_book]. These methods successfully remove points on fast-moving objects but have difficulty rejecting points on slow-moving objects. In addition, this strategy fails on long-term moving points, e.g., points on parked vehicles, as shown in Fig. 1(b). Such points are static over a short time but move elsewhere within minutes or days. This inevitably degrades localization operations such as loop closure detection or re-localization [schneider2018maplab, lynen2015get, zhang2019large]. Beyond that, some points lie in unstable areas, such as the sky. In this paper, we focus on selecting long-term static points for localization systems by estimating the motion attribute of each point.

Some previous methods have also exploited the motion attribute for localization or mapping. Kaneko et al. [mask_slam_2018_cvpr] first segmented the input image and then used only the points whose semantic label implies a long-term static motion attribute. However, semantic segmentation is a difficult task that dramatically increases the computational cost. In this paper, we instead estimate the motion attribute of each feature point directly, which is easier than semantic segmentation but sufficient for vision-based localization. Naseer et al. [Naseer_2017_icra] exploited geometrically robust regions to determine the position of the input image, which is a different task from localization.

Motion attribute estimation is generally considered computationally expensive for a robotic platform, so efficiency is of significant importance in algorithm design. Inspired by recent progress on local feature learning [LIFT_2016_eccv, Tian_2017_cvpr, Mishchuk_2017_nips, superpoint_2018_cvpr], we use a single network to perform descriptor calculation and motion attribute estimation simultaneously, since the two tasks can be formulated as similar computations. Many existing methods, e.g., [LIFT_2016_eccv, Tian_2017_cvpr, Mishchuk_2017_nips], focus on extracting distinctive features from an image patch. However, it is difficult to estimate the motion attribute from a local patch alone, since context is essential. Moreover, patch-based algorithms are usually less efficient than whole-image algorithms because computation is repeated across patches, e.g., [rcnn_2014_cvpr] versus [Ren_2017_faster_rcnn].

With the above analysis in mind, we design a fully convolutional network, named MD-Net, which takes the whole image as input and performs motion attribute estimation and feature description simultaneously. MD-Net has a shared backbone network and two network branches. The heavy backbone network takes the whole image as input and outputs shared features to both branches. The motion attribute estimation branch then classifies each point as unstable, moving or static. Finally, the feature description branch extracts a distinctive descriptor for each point.

The methods most similar to ours are those of DeTone et al., i.e., [superpoint_2018_cvpr] and [superpoint_v2_2018_arxiv]. In [superpoint_2018_cvpr], DeTone et al. designed a single network based on an FCN [Shelhamer_2017_FCN] to perform local feature localization and description simultaneously. However, the localized points are not robust enough because of the poor localization ability of standard convolutional neural networks (CNNs) [Szegedy_2013_nips]. We therefore resort to hand-crafted detectors in this paper. Moreover, the strategies for learning descriptors also differ. In [superpoint_v2_2018_arxiv], the model also estimates whether a point is steady. However, that work can only handle short-term moving points as in Fig. 1(a), not long-term moving points as in Fig. 1(b). We also show experimentally that, when integrated into vision-based localization, the proposed algorithm achieves significantly better accuracy than [superpoint_v2_2018_arxiv].

The contributions of this paper mainly lie in four aspects:

  1. We design an FCN to estimate the motion attribute of each local feature point to distinguish static points from moving or unstable points.

  2. We further enhance the FCN to calculate the descriptor of each point simultaneously, which is more efficient than patch-based methods.

  3. We integrate the proposed local feature processing pipeline into a vision-based localization method to seek accuracy gain.

  4. Experimental results demonstrate that both the proposed local feature method and integrated localization algorithm outperform competing state-of-the-art algorithms by a wide margin.

II Related Work

A local feature algorithm usually can be divided into two steps: feature localization and feature descriptor calculation. In this section, we briefly introduce some well-known local features, including hand-crafted and learning-based, from these two aspects.

Hand-crafted local features. Over the last few decades, researchers have designed various algorithms to localize robust points and to calculate distinctive descriptors for them. Early methods were devoted to detecting and localizing corners in the image, e.g., the Harris corner detector [Harris_1988_AVC] and FAST [FAST_2006_eccv]. Besides corner detection, researchers also introduced scale-space theory and detected extrema as feature points, e.g., Laplacian of Gaussian (LoG) [Lindeberg_1998_ijcv], difference of Gaussians (DoG) [SIFT_2004_ijcv] and MSER [MSER_2002_bmvc]. Researchers have also designed various local feature descriptors, e.g., SIFT [SIFT_2004_ijcv], SURF [SURF_2006_eccv], and HOG [Zhu_2006_cvpr]. As robotics tasks usually need real-time algorithms, fast binary descriptors have also been proposed, e.g., BRIEF [brief_2010_eccv], FREAK [FREAK_2012_cvpr] and ORB [ORB_2011_iccv]. These local feature detectors and descriptors have been applied successfully in many tasks. However, with the success of deep learning on various tasks, researchers have attempted to learn more robust local features from data automatically.

Learning-based local features. Learning-based methods can be divided into three classes. The first class learns only robust detectors [TILDE_2015_cvpr, Savinov2017, Zhang2017]. One challenge is how to generate ground truth. To this end, Verdie et al. [TILDE_2015_cvpr] applied the DoG algorithm to all images of the same scene and selected the points that are detected in most of them. This strategy enables the learned detector to outperform the baseline detector. Savinov et al. [Savinov2017] formulated and solved the problem in an unsupervised manner. Zhang et al. [Zhang2017] focused on learning a discriminative and transformation-covariant detector. None of these detectors can remove points on moving objects.

The second class of methods aims at learning robust descriptors. To this end, several large-scale datasets [Brown_2007_ijcv, HPatches_2017_cvpr, PS_Dataset_2018_arxiv] have been collected for training and evaluation. Tian et al. [Tian_2017_cvpr] designed a convolutional neural network named L2-Net to extract descriptors from a local patch. Mishchuk et al. [Mishchuk_2017_nips] further improved the performance by a large margin through hard negative mining. Luo et al. [Luo2018] exploited geometric constraints to learn more robust descriptors. However, these methods calculate a descriptor for each image patch separately, which may be less efficient than FCN-based algorithms.

The third class of methods learns the detector and the descriptor together. Yi et al. [LIFT_2016_eccv] and Ono et al. [Ono2018] divided the whole process into several steps, as in previous algorithms, and performed each step using a neural network. These methods take full advantage of prior knowledge of the problem but are less efficient than FCN-based methods. DeTone et al. [superpoint_2018_cvpr] designed a single FCN-based network to perform local feature localization and description simultaneously. However, these methods are not designed for localization and cannot remove moving points. To address this, DeTone et al. [superpoint_v2_2018_arxiv] further estimated whether a point is steady and selected only the static points. However, that work can only handle short-term moving points, not long-term moving points.

Fig. 2: The training process for our MD-Net. We feed a mini-batch of training samples into the backbone network. Subsequently, the extracted features are fed into two branches simultaneously. For the motion attribute, we use the cross-entropy loss (3), and for the descriptor, we use the mean square error loss (4). Note that Conv denotes a convolutional layer, ConvB denotes that it is followed by a batch normalization layer and a ReLU layer, K denotes kernel size, P denotes padding size, S denotes stride size, and F denotes the number of output feature maps.

III Learning Descriptor with Motion Attribute

In this section, we first give an overview of the whole method and then introduce each module in detail. As shown in Fig. 2, we design an FCN named MD-Net to estimate the motion attribute and calculate the descriptor of each point in the input grayscale image. MD-Net consists of a heavy backbone network that extracts shared features from the input image, a light branch that estimates the motion attribute, and another light branch that calculates descriptors. For a fair comparison, the backbone has the same structure as the backbone in [superpoint_2018_cvpr], which consists of convolutional layers and max-pooling layers. Each convolutional layer is followed by a batch normalization layer [Ioffe_Szegedy_2015_batchnorm] and a ReLU activation layer [Nair_Hinton_2010_relu]. The hyper-parameters of each layer can be found in Fig. 2(b). The backbone outputs a group of feature maps to the two branches. This structure is computationally efficient thanks to its fully convolutional design and its down-sampling pooling layers.

III-A Motion Attribute Estimation

A local feature detector usually localizes a large number of points in an image. However, not all of these points are suitable for localization or mapping. In this paper, we divide the motion attribute into three classes: unstable, moving and static. Unstable points lie in image regions that change quickly, such as clouds in the sky. Moving points lie on moving objects, whether short-term or long-term. Static points remain steady for a long time. A localization system can usually discard short-term fast-moving points automatically, as these points do not satisfy the epipolar constraint. Very slowly moving points are more challenging. Moreover, long-term moving points cannot be removed in this way, which degrades performance even with a self-improving method [superpoint_v2_2018_arxiv]. When localizing an input image, all unstable and moving points degrade performance. To alleviate this, we propose to assign a point's motion attribute according to its semantic label. As previous work has constructed many semantic segmentation datasets, e.g., Cityscapes [Cordts2016Cityscapes], we only need to perform a label transformation.

To estimate the motion attribute, we add a network branch after the backbone network, as shown in Fig. 2, which consists of two convolutional layers, one batch normalization layer, one ReLU layer and one softmax layer. This branch transforms the input features of each point into the probability of each motion attribute. The hyper-parameters of this branch can be found in Fig. 2(c). To train this branch, we adopt the cross-entropy loss

$$L_{ce} = -\sum_{i} \sum_{c} y_{i,c} \log p_{i,c}, \tag{1}$$

where $y_{i,c} = 1$ only if the motion attribute of point $i$ is $c$, otherwise $y_{i,c} = 0$, and $p_{i,c}$ is the predicted probability.

The cross-entropy loss supervises a network effectively when the sample sizes of all classes are uniformly distributed. Otherwise, the model is biased toward the classes with more samples [Li_2018_arxiv]. This imbalance is ubiquitous in training data. We therefore re-weight the loss of each class according to its size, using a weight $w_c$ computed from $n_c$, the sample size of class $c$. The loss function (1) is then transformed into its re-weighted version

$$L_{m} = -\sum_{i} \sum_{c} w_c \, y_{i,c} \log p_{i,c}. \tag{3}$$

With the re-weighting strategy, each class contributes equally to the loss function, which avoids the influence of the unbalanced class distribution.
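The re-weighting idea can be illustrated with a minimal sketch. It assumes inverse-frequency weights normalised to sum to one, which is one common choice that makes every class contribute equally; the exact weight formula used for MD-Net may differ. The function names and the toy class counts are hypothetical.

```python
import math

def class_weights(counts):
    """Inverse-frequency weights (an assumed form), normalised to sum to 1."""
    inverse = [1.0 / n for n in counts]
    total = sum(inverse)
    return [v / total for v in inverse]

def weighted_cross_entropy(probs, labels, weights):
    """Sum of -w_c * log p_c over points.

    probs:  one list of class probabilities per point
    labels: ground-truth class index per point
    """
    loss = 0.0
    for p, c in zip(probs, labels):
        loss -= weights[c] * math.log(p[c])
    return loss

# Toy sample counts for (unstable, moving, static); illustrative only.
counts = [27, 358, 615]
weights = class_weights(counts)
```

With these weights the product `weights[c] * counts[c]` is the same for every class, so a rare class such as unstable is not drowned out by the dominant static class.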

III-B Descriptor Calculation

To learn distinctive descriptors from data automatically, previous work has constructed several elaborate datasets, e.g., [Brown_2007_ijcv, HPatches_2017_cvpr, PS_Dataset_2018_arxiv]. As these datasets contain only image patches, most methods are also patch-based [Tian_2017_cvpr, Mishchuk_2017_nips, Luo2018]. For deep learning algorithms, however, image-based models are usually more efficient than patch-based models. For a fixed network structure, the computational cost grows linearly with the size of the input. A patch-based model processes one patch per detected point, so its cost matches that of a whole-image model only when the total area of the patches equals the area of the image, i.e., when the detector returns only a few points. In practice, an image often contains many more points than that. In other words, a patch-based model usually needs several times the computation of the corresponding whole-image model.
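The cost argument can be made concrete with a short calculation. The numbers are illustrative assumptions, not values from the paper: a 32×32 patch, a 640×480 image, and a cost that is strictly proportional to the number of input pixels.

```python
def break_even_points(image_h, image_w, patch=32):
    """Number of patches whose total area equals the image area."""
    return (image_h * image_w) / (patch * patch)

def cost_ratio(num_points, image_h, image_w, patch=32):
    """Patch-based cost divided by whole-image cost, assuming linear cost."""
    return num_points * patch * patch / (image_h * image_w)

# Hypothetical setting: 640x480 image, 32x32 patches.
points_at_parity = break_even_points(480, 640)   # 300.0
ratio_for_1000 = cost_ratio(1000, 480, 640)      # about 3.33
```

Under these assumptions, a detector that returns 1000 points makes the patch-based model roughly 3.3 times as expensive as a single pass over the whole image.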

In this paper, we design an FCN to calculate feature descriptors from the whole input image, as in Fig. 2(a)(b)(d). However, designing an effective training strategy is a challenge. DeTone et al. [superpoint_2018_cvpr] directly used all points in the image, but this strategy introduces many noisy samples. To take full advantage of existing datasets, we resort to a teacher-student framework, which is often used for feature embedding [Song_2017_iccv]. Without loss of generality, we transform the HardNet model [Mishchuk_2017_nips] into an image-based model and take it as the teacher, which supervises descriptor learning. The detailed structure of the modified HardNet is shown in Fig. 2(e). Compared with the original version, we only remove the last reshape layer and add padding to the last convolutional layer. This network surgery has little influence on the effectiveness of the model.

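To train the descriptor branch, which is the student model, we minimize the mean square error between its outputs and those of HardNet. The loss function can be defined as

$$L_{d} = \frac{1}{D} \sum_{i} \sum_{j=1}^{D} \left( f_{i,j} - \hat{f}_{i,j} \right)^2, \tag{4}$$

where $D$ is the dimension of one output descriptor, $f_{i,j}$ is the $j$-th dimension feature at point $i$ output by the descriptor branch, and $\hat{f}_{i,j}$ is the counterpart output by HardNet.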
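As a minimal sketch of this teacher-student supervision, the function below computes a mean squared error between student and teacher descriptors. It averages over both points and descriptor dimensions, which is an assumption about the normalisation; the function name and the toy inputs are hypothetical.

```python
def descriptor_mse(student, teacher):
    """Mean squared error between two sets of descriptors.

    student, teacher: lists of equal-length descriptor vectors, one per point.
    The average runs over all points and all dimensions (assumed normalisation).
    """
    total = 0.0
    count = 0
    for s_vec, t_vec in zip(student, teacher):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count

# One point with a 2-D descriptor: ((1 - 0)^2 + (0 - 0)^2) / 2 = 0.5
toy_loss = descriptor_mse([[1.0, 0.0]], [[0.0, 0.0]])
```

The teacher output is treated as a fixed target, so only the student branch and the shared backbone would receive gradients from this term.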

III-C Multi-task Learning

As our model completes two tasks simultaneously, training it is a typical multi-task learning problem. Under a deep learning framework, this is easy to do. We simply combine the two loss functions as

$$L = \lambda_m L_m + \lambda_d L_d, \tag{5}$$

where $L_m$ is the re-weighted motion attribute loss (3), $L_d$ is the descriptor loss (4), and $\lambda_m$ and $\lambda_d$ are the parameters that adjust the weight of each loss. In our experiments, we set these weights empirically. With the multi-task learning strategy, we can optimize the backbone network and the two task-specific branches simultaneously. One problem in multi-task learning is that the tasks may conflict with each other, which leads to unsatisfactory performance on some of them. We did not observe this phenomenon, which indicates that it is reasonable to integrate the two tasks into one model to save computational cost.

The complete training process can be summarized as follows. As shown in Fig. 2, we first draw a mini-batch of training samples from the training set and feed it into our model. The model then outputs the motion attribute probabilities and descriptors. For the motion attribute, the ground truth is generated from semantic segmentation annotations. For the descriptor, we take the output of HardNet as the supervision signal. The model is optimized by minimizing the loss function (5) with the Adam optimization algorithm. Starting from the initial learning rate, we gradually reduce the learning rate after a given epoch, following a schedule governed by the total number of epochs and a factor that controls the rate of decay. To avoid over-fitting, we also apply weight decay in all experiments.

Fig. 3: The process to integrate our model with a local feature detector.

IV Association with Localization System

We integrate our model with a localization system to verify its effectiveness. As illustrated in Fig. 3, we combine our model with a local feature detector. The detected feature points are filtered according to their motion attributes. Only static points are retained for the following steps, and the other points are discarded. Subsequently, the descriptor of each retained point is obtained from the outputs of the descriptor branch. Note that, as our model down-samples the resolution, we up-sample the results using bilinear interpolation.
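The filtering and up-sampling steps can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the class index `STATIC = 2`, the down-sampling factor `stride=8`, and all function names are hypothetical, and the motion attribute is read from the nearest low-resolution cell for simplicity instead of being interpolated.

```python
STATIC = 2  # hypothetical class index: 0 unstable, 1 moving, 2 static

def bilinear(grid, x, y):
    """Sample a 2-D grid (list of rows) at continuous (x, y), clamped to the border."""
    h, w = len(grid), len(grid[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * grid[y0][x0] + ax * grid[y0][x1]
    bottom = (1 - ax) * grid[y1][x0] + ax * grid[y1][x1]
    return (1 - ay) * top + ay * bottom

def keep_static(keypoints, attr_map, stride=8):
    """Keep the keypoints whose low-resolution cell is labelled static."""
    kept = []
    for x, y in keypoints:
        cx = min(int(x / stride), len(attr_map[0]) - 1)
        cy = min(int(y / stride), len(attr_map) - 1)
        if attr_map[cy][cx] == STATIC:
            kept.append((x, y))
    return kept

def sample_descriptor(desc_maps, x, y, stride=8):
    """Interpolate one descriptor; desc_maps holds one low-res grid per channel."""
    return [bilinear(channel, x / stride, y / stride) for channel in desc_maps]
```

For example, with a one-row attribute map `[[2, 1]]`, the keypoint `(4, 4)` falls in a static cell and is kept, while `(12, 4)` falls in a moving cell and is dropped.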

The localization algorithm used to test the proposed local feature pipeline is a sliding-window based visual-inertial SLAM method. Visual-inertial SLAM has been widely used recently; it exploits the complementary properties of cameras and inertial measurement units (IMUs) to greatly enhance localization performance [li2013high, li2013optimization, qin2017vins, zhang2019large, leutenegger2015keyframe]. The exact implementation follows that of [zhang2019large], with the following changes. Firstly and most importantly, the proposed local feature pipeline is used instead of the FREAK features in [zhang2019large]. Secondly, we do not use odometry measurements during our tests. As a result, the cost function does not contain the odometry terms. For the same reason, the poses are expressed with respect to the IMU frame, and the IMU-to-odometry extrinsic parameters are ignored. Lastly, the pose integration and keyframe selection policy are based purely on IMU integration.

Fig. 4: Motion attribute estimation results. Top two rows are results on Cityscapes dataset. Bottom two rows are results on the data collected in Alibaba campus, which are not in the training set.

V Experiments and Comparisons

In this section, we first introduce the experimental details for training MD-Net, and then evaluate in turn our motion attribute estimation results, our descriptors, and localization with our model.

Motion attribute   Semantic class
unstable           sky, vegetation, terrain
moving             human, vehicle, static, dynamic, traffic light
static             ground, flat, building, wall, fence, guard rail, bridge, tunnel, pole, pole group, traffic sign
TABLE I: The correspondence between motion attribute and semantic class in Cityscapes dataset.
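The label transformation in Tab. I amounts to a lookup table. A minimal sketch is given below; the dictionary and function names are hypothetical, and the strings are copied from Tab. I rather than from the Cityscapes label IDs, which would need a further mapping in practice.

```python
# Motion attribute -> semantic names, copied from Tab. I.
MOTION_ATTRIBUTE = {
    "unstable": ["sky", "vegetation", "terrain"],
    "moving": ["human", "vehicle", "static", "dynamic", "traffic light"],
    "static": ["ground", "flat", "building", "wall", "fence", "guard rail",
               "bridge", "tunnel", "pole", "pole group", "traffic sign"],
}

# Inverted lookup: semantic name -> motion attribute.
SEMANTIC_TO_MOTION = {
    name: attribute
    for attribute, names in MOTION_ATTRIBUTE.items()
    for name in names
}

def to_motion_attribute(semantic_name):
    """Return the motion attribute assigned to a semantic name in Tab. I."""
    return SEMANTIC_TO_MOTION[semantic_name]
```

Note that the semantic class named "static" in Tab. I is assigned the moving attribute, so the semantic name and the motion attribute of the same spelling should not be confused.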

V-A Motion Attribute Estimation Results

To train our MD-Net, we use the Cityscapes dataset [Cordts2016Cityscapes], which provides finely annotated semantic segmentation ground truth. This dataset is collected from various urban street scenes and consists of a training set and a validation set. Before using the annotations, we transform the original semantic classes into motion attributes as in Tab. I. The final model is obtained by training on the training set. We implement our training process using PyTorch [PyTorchNIPS2017] on a PC with an NVIDIA 1080 Ti GPU.

As shown in the third row of Tab. II, our model predicts the motion attribute accurately, with a mean IoU of 76.2. To verify the effectiveness of the class re-weighting strategy, we perform an ablation experiment that removes it; the results are shown in the second row. The proportion of each class is presented in the fourth row. The strategy markedly improves the class with the smallest size, raising the IoU of the unstable class from 62.9 to 68.3, and increases the mean IoU from 74.3 to 76.2. As shown in the top two rows of Fig. 4, our model also achieves good visual results. Moreover, to verify the generalization of our model, we applied it to data collected on our work campus. The results, shown in the bottom two rows of Fig. 4, indicate that our model generalizes well to other data.

Method                 unstable   moving   static   mean
Our w/o re-weighting   62.9       74.6     85.3     74.3
Our                    68.3       74.5     85.7     76.2
Proportion             2.7%       35.8%    61.5%    -
TABLE II: IoU of Motion attribute on the Cityscapes validation dataset.
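For reference, the IoU metric reported in Tab. II can be computed as in the sketch below, which also checks that the mean column is the plain average of the three per-class values. The function names and the toy label lists are hypothetical.

```python
def per_class_iou(pred, gt, num_classes=3):
    """Intersection over union for each class, from flat label lists."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        ious.append(inter / union if union else float("nan"))
    return ious

def mean(values):
    return sum(values) / len(values)

# Toy example with four pixels and classes 0, 1, 2.
toy_ious = per_class_iou([0, 1, 2, 2], [0, 1, 1, 2])

# Per-class IoU values copied from Tab. II.
mean_with_reweighting = round(mean([68.3, 74.5, 85.7]), 1)
mean_without_reweighting = round(mean([62.9, 74.6, 85.3]), 1)
```

The two rounded means reproduce the 76.2 and 74.3 entries of the table.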
Fig. 5: The performance of 8 different descriptors on HPatches.

V-B Performance of the Descriptor

To evaluate our descriptors, we use the HPatches benchmark to perform the patch verification, matching and retrieval tasks on the FULL split. Details of the tasks and evaluation protocols can be found in [HPatches_2017_cvpr]. The quantitative results are shown in Fig. 5. For a fair comparison, we directly use the pre-computed results of SIFT [SIFT_2004_ijcv], RootSIFT, BRIEF [brief_2010_eccv], ORB [ORB_2011_iccv], and DDESC [ddesc_2015_iccv] released by HPatches. The results of SuperPoint [superpoint_2018_cvpr] and HardNet [Mishchuk_2017_nips] are calculated using the models released by their authors. Our model successfully mimics HardNet: it achieves results comparable to the previous FCN-based method SuperPoint on the verification and matching tasks, and outperforms it by a large margin on the retrieval task. The results also indicate that HardNet, a patch-based method, still performs better than FCN-based methods; closing this gap is left as future work for FCN-based methods.

V-C Performance in Vision based Localization

In this section, we present the results of vision-based localization using the proposed features from MD-Net. In our experiments, the sensor suite consists of a MYNT camera and a BOSCH BMI160 IMU. The camera captured images at 10Hz, and the IMU measurements were provided at 200Hz. The temporal, intrinsic, and extrinsic parameters of those sensors were calibrated offline using the method in [li2014high]. During the data collection process, the sensor suite was mounted on ground vehicles (robots or cars).

The first experiment evaluates the inlier feature ratio with a two-view RANSAC algorithm in three different testing environments: a normal scene, a scene with pedestrians, and a scene with a couple of slowly moving vehicles. Tab. III shows the RANSAC inlier ratio for different local feature algorithms, computed by averaging over all image pairs used in localization. The same detector is used for all compared algorithms so that the comparison focuses on the descriptor. The results in Tab. III show that the proposed algorithm reaches the best RANSAC inlier ratio, meaning that features on moving and unstable objects are removed in advance. The proposed algorithm also outperforms SuperPoint in all cases.

Detector   Descriptor   Scene1   Scene2   Scene3   Mean
FAST       FREAK        87.6     87.1     84.1     86.2
FAST       SuperPoint   88.8     88.4     90.2     89.1
FAST       Our          91.2     93.3     92.2     92.2
TABLE III: Inlier ratio (%) on the data from 3 different scenes.
Fig. 6: Localization trajectory and drift.
Detector     Descriptor          Drift (x, y) [m]   Drift [m]
FAST         FREAK               (-1.9, -0.4)       1.94
FAST         SuperPoint          (-5.4, -1.8)       5.69
SuperPoint   SuperPoint          (4.4, 1.8)         4.75
FAST         Our w/o mot. att.   (0.8, 1.1)         1.36
FAST         Our                 (-1.2, 0.3)        1.23
TABLE IV: Localization drift.
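The scalar drift in Tab. IV is the Euclidean norm of the (x, y) drift, which the short check below recomputes from the table entries. The dictionary keys and the function name are hypothetical labels introduced for this sketch.

```python
import math

def final_drift(dx, dy):
    """Euclidean distance between the start and end positions of a closed loop."""
    return math.hypot(dx, dy)

# (x, y) drift in metres, copied from Tab. IV.
drift_xy = {
    "FAST + FREAK": (-1.9, -0.4),
    "FAST + SuperPoint": (-5.4, -1.8),
    "SuperPoint + SuperPoint": (4.4, 1.8),
    "FAST + Our w/o mot. att.": (0.8, 1.1),
    "FAST + Our": (-1.2, 0.3),
}

drift_norm = {name: final_drift(dx, dy) for name, (dx, dy) in drift_xy.items()}
```

The recomputed norms agree with the table to two decimals, except for the last row, where the rounded components give about 1.24 instead of 1.23; the difference is within the rounding of the listed (x, y) values.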

The second and third experiments show the performance of the proposed visual localization algorithm using our local feature method in two different environments. In the second test, we collected a dataset around a local restaurant with a ground robot, whose trajectory started and stopped at exactly the same location. This allows us to compute the 'final drift' as an error metric. Tab. IV and Fig. 6 show the final drifts obtained with different feature detectors and descriptors under the same localization method. Two observations can be made. Firstly, explicitly modeling the motion attribute reduces the localization error compared with the variant without the attribute. This is consistent with our claim that the motion attribute filters out more 'bad' features. Secondly, our proposed method achieves the lowest drift of all methods, including the SuperPoint family and the classic FAST/FREAK combination. These results show that the proposed method is the best one for vision-based localization, at least among the experiments we conducted.

Detector     Descriptor          On series1   On series2
FAST         FREAK               2.87         27.06
FAST         SuperPoint          10.02        Failed
SuperPoint   SuperPoint          7.96         Failed
FAST         Our w/o mot. att.   7.82         20.86
FAST         Our                 1.10         7.25
Series length                    279.1        880.8
TABLE V: The root-mean-squared error [m] between the estimated and the ground-truth trajectories.

In the last experiment, we collected two series of urban street view datasets by mounting our sensors, together with an RTK-GPS receiver, on top of a car. Representative images captured during data collection are shown in Fig. 7. The readings from RTK-GPS are taken as the ground-truth pose. For quantitative evaluation, the computed trajectory is aligned to the ground truth using Umeyama's method [umeyama1991least]. The root-mean-square error of each method can be found in Tab. V. Our method obtains the lowest error among the compared methods. We also note that the dataset is extremely challenging, with some images full of or dominated by moving vehicles (see Fig. 7). Nevertheless, by filtering out improper features by motion attribute, the proposed method still achieves high-precision localization.
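The align-then-measure evaluation can be sketched in a simplified form. The code below is not Umeyama's full method: it is a planar, rotation-plus-translation least-squares alignment without scale, chosen because it has a closed form that needs no linear algebra library. The function names and the toy trajectories are hypothetical.

```python
import math

def align_2d(est, gt):
    """Rigidly align est to gt (2-D, rotation + translation, no scale)."""
    n = len(est)
    ex = sum(p[0] for p in est) / n
    ey = sum(p[1] for p in est) / n
    gx = sum(p[0] for p in gt) / n
    gy = sum(p[1] for p in gt) / n
    s_dot = 0.0
    s_cross = 0.0
    for (x, y), (u, v) in zip(est, gt):
        x, y, u, v = x - ex, y - ey, u - gx, v - gy
        s_dot += x * u + y * v      # cosine term of the objective
        s_cross += x * v - y * u    # sine term of the objective
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * (x - ex) - s * (y - ey) + gx,
             s * (x - ex) + c * (y - ey) + gy) for x, y in est]

def rmse(a, b):
    """Root-mean-square position error between two trajectories."""
    squared = sum((x - u) ** 2 + (y - v) ** 2 for (x, y), (u, v) in zip(a, b))
    return math.sqrt(squared / len(a))

# Toy check: est is gt rotated by 90 degrees and shifted by (5, 5).
gt_traj = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
est_traj = [(5.0, 5.0), (5.0, 6.0), (4.0, 6.0)]
aligned_error = rmse(align_2d(est_traj, gt_traj), gt_traj)
```

Because the toy estimate differs from the ground truth only by a rigid motion, the error after alignment is zero up to floating-point precision, whereas the unaligned error is several metres.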

Fig. 7: Representative images captured during the data collection of the urban street view datasets.

VI Discussion and Conclusion

In this paper, we design a fully convolutional network, MD-Net, to perform motion attribute estimation and feature description simultaneously. MD-Net has three modules: the backbone network, which extracts shared features from the input image; the motion attribute branch, which estimates the motion attribute; and the description branch, which extracts distinctive descriptors. We further integrate the proposed method into a visual-inertial localization system to perform high-precision pose estimation. Experimental results demonstrate that the proposed method improves performance and outperforms previous similar algorithms, especially in complicated dynamic environments with multiple moving objects. The limitation of this work is that the proposed method still relies on a third-party feature detector; in the future, the detector will be integrated into our FCN model to enhance performance further.


We would like to thank Dongsheng Hong for his invaluable help. This work was supported by grants from the National Key R&D Program of China (2017YFB1002400), the National Natural Science Foundation of China (61672072, U1611461 and 61825101), Beijing Nova Program (Z181100006218063), and China Postdoctoral Science Foundation (2018M641110).