Method Towards CVPR 2021 Image Matching Challenge

08/10/2021 ∙ by Xiaopeng Bi, et al. ∙ Megvii Technology Limited

This report describes the Megvii-3D team's approach towards the CVPR 2021 Image Matching Workshop.


1 Pipeline

Figure 1: Method Pipeline

2 Feature

2.1 Detector & Descriptor

In our method, we implemented three kinds of features: pure DISK[6] features, pure SuperPoint[2] features, and a mix of SuperPoint and DISK features.

For SuperPoint, we additionally applied test-time homographic adaptation with 100 iterations to detect keypoints, and in some cases, we added a trained convolutional autoencoder to halve the feature dimension.
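As a rough illustration (not the exact implementation we used), test-time homographic adaptation can be sketched as follows; detect_heatmap is a hypothetical stand-in for the SuperPoint detector head, and the random homographies are sampled by jittering the image corners:

```python
import cv2
import numpy as np

def homographic_adaptation(img, detect_heatmap, n_iters=100, max_jitter=0.2):
    """Average the detector heatmap over random homographies of the image."""
    h, w = img.shape[:2]
    acc = detect_heatmap(img).astype(np.float64)        # identity warp
    counts = np.ones((h, w), np.float64)
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    for _ in range(n_iters):
        # sample a mild random homography by jittering the image corners
        jitter = np.random.uniform(-max_jitter, max_jitter, (4, 2)).astype(np.float32)
        warped_corners = corners + jitter * np.float32([w, h])
        H = cv2.getPerspectiveTransform(corners, warped_corners)
        warped = cv2.warpPerspective(img, H, (w, h))
        heat = detect_heatmap(warped)
        # warp the heatmap (and a validity mask) back to the original frame
        H_inv = np.linalg.inv(H)
        acc += cv2.warpPerspective(heat, H_inv, (w, h))
        counts += cv2.warpPerspective(np.ones((h, w), np.float32), H_inv, (w, h))
    return acc / np.maximum(counts, 1e-6)
```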

The Non-Maximum Suppression (NMS) window size and keypoint threshold were estimated theoretically from the image size, so as to achieve a roughly uniform keypoint distribution across the image. Those values were then fine-tuned around their initial estimates to obtain the best performance on the validation set for each feature and each dataset.
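The report does not give the exact estimate; one plausible way to derive an initial NMS radius from the image size and a keypoint budget is sketched below (the formula is an illustrative assumption, with the final value tuned on the validation set as described above):

```python
import math

def initial_nms_radius(height, width, target_keypoints):
    """Assume roughly one keypoint per NMS cell and solve (h*w) / (2r)^2 ≈ target for r."""
    return max(1, round(math.sqrt(height * width / target_keypoints) / 2))

# e.g. a 1600x1200 image with a budget of 5000 keypoints gives an initial radius of about 10
```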

2.2 Pyramid Extraction

We observed large scale variation between images in some sets, such as the lizard in PragueParks. This troubled our matcher, since the same feature under different scales may look very different. To address this, we extracted the features at three scales and then concatenated those associated with the same keypoint.
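A minimal sketch of such multi-scale extraction, assuming a generic extract(image) function that returns keypoint coordinates and descriptors; the scale factors and nearest-neighbour tolerance are illustrative, not the exact values used:

```python
import cv2
import numpy as np

def extract_scale_pyramid(img, extract, scales=(0.5, 2.0), tol=2.0):
    """Extract features at several scales and concatenate the descriptors of keypoints
    that land on (roughly) the same location in the original image."""
    ref_kpts, ref_desc = extract(img)                 # reference keypoints at the original scale
    descs = [ref_desc]
    for s in scales:
        resized = cv2.resize(img, None, fx=s, fy=s)
        kpts, desc = extract(resized)
        kpts = kpts / s                               # map back to original image coordinates
        # for each reference keypoint, take the descriptor of its nearest keypoint at this scale
        d2 = ((ref_kpts[:, None, :] - kpts[None, :, :]) ** 2).sum(-1)
        nearest = d2.argmin(1)
        matched = desc[nearest].copy()
        matched[d2.min(1) > tol ** 2] = 0.0           # no counterpart found at this scale
        descs.append(matched)
    return ref_kpts, np.concatenate(descs, axis=1)
```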

Besides scale, orientation can also be a problem. Similar to the approach taken for the scale problem, we extracted the features under seven orientations, obtained by affine transformations (i.e. rotations of 90 degrees to the left and to the right) and perspective transformations (i.e. tilting horizontally left/right or vertically up/down by 45 degrees), and then concatenated them.
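The seven views can be generated as warps of the original image; a sketch of what such warps might look like is given below (the exact warp parameters are not stated in the report, so the values here are illustrative):

```python
import cv2
import numpy as np

def orientation_warps(img):
    """Return the original image plus six warped views: two in-plane rotations
    and four perspective 'look from the side / above / below' views."""
    h, w = img.shape[:2]
    views = [("identity", img, np.eye(3))]
    # in-plane rotations of 90 degrees left and right
    for name, code in [("rot90_ccw", cv2.ROTATE_90_COUNTERCLOCKWISE),
                       ("rot90_cw", cv2.ROTATE_90_CLOCKWISE)]:
        views.append((name, cv2.rotate(img, code), None))   # keypoint remap handled separately
    # perspective views: shrink one image edge to simulate a sideways/overhead viewpoint change
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    shifts = {
        "persp_left":  np.float32([[0, h * 0.25], [0, 0], [0, 0], [0, -h * 0.25]]),
        "persp_right": np.float32([[0, 0], [0, h * 0.25], [0, -h * 0.25], [0, 0]]),
        "persp_up":    np.float32([[w * 0.25, 0], [-w * 0.25, 0], [0, 0], [0, 0]]),
        "persp_down":  np.float32([[0, 0], [0, 0], [-w * 0.25, 0], [w * 0.25, 0]]),
    }
    for name, delta in shifts.items():
        H = cv2.getPerspectiveTransform(corners, corners + delta)
        views.append((name, cv2.warpPerspective(img, H, (w, h)), H))
    return views
```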

To save computational resources and make the pipeline more efficient, we only applied scale and/or orientation pyramids under certain conditions, which are elaborated in the Appendix.

2.3 Pre-Process & Post-Process

Masking For the Phototourism and GoogleUrban datasets, some dynamic objects occur frequently, which introduces unreliable and unrepeatable keypoints into our solution. To overcome this, we segmented the scene and masked out several classes, including Person, Sky, Car, Bus, and Bicycle. The class Person was segmented by Mask-RCNN[3] trained on COCO; all other classes were removed by PSPNet[7] trained on ADE20K.

While those segmentation networks worked well under most circumstances, they performed poorly at distinguishing sculptures from humans; thus we additionally trained a binary classifier to tell pedestrians apart from sculptures.

Moreover, in order to preserve the details of buildings when masking was enabled, we eroded the masked area with a 5x5 kernel for the Sky class and a 3x3 kernel for the other classes.
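A minimal sketch of this masking post-process, assuming per-class binary masks (1 where the class was detected) produced by the segmentation networks:

```python
import cv2
import numpy as np

def build_removal_mask(class_masks):
    """class_masks: dict mapping class name -> uint8 mask (1 where the class was detected).
    Each masked area is eroded so that building details near the boundary are preserved."""
    h, w = next(iter(class_masks.values())).shape
    removal = np.zeros((h, w), np.uint8)
    for name, mask in class_masks.items():
        k = 5 if name == "Sky" else 3
        removal |= cv2.erode(mask, np.ones((k, k), np.uint8))
    return removal

def filter_keypoints(kpts, desc, removal_mask):
    """Drop keypoints (and their descriptors) that fall inside the eroded removal mask."""
    xs = np.clip(kpts[:, 0].astype(int), 0, removal_mask.shape[1] - 1)
    ys = np.clip(kpts[:, 1].astype(int), 0, removal_mask.shape[0] - 1)
    keep = removal_mask[ys, xs] == 0
    return kpts[keep], desc[keep]
```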

Refinement We applied an arg-softmax function for keypoint refinement with a radius of 2.
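A sketch of arg-softmax refinement over a dense detector score map; with a radius of 2 the refinement window is 5x5 (the temperature parameter is an illustrative assumption):

```python
import numpy as np

def argsoftmax_refine(kpts, score_map, radius=2, temperature=1.0):
    """Refine integer keypoint locations to sub-pixel accuracy by taking the
    softmax-weighted centroid of the detector scores in a (2*radius+1)^2 window."""
    h, w = score_map.shape
    refined = []
    for x, y in kpts.astype(int):
        y0, y1 = max(y - radius, 0), min(y + radius + 1, h)
        x0, x1 = max(x - radius, 0), min(x + radius + 1, w)
        patch = score_map[y0:y1, x0:x1]
        weights = np.exp((patch - patch.max()) / temperature)
        weights /= weights.sum()
        ys, xs = np.mgrid[y0:y1, x0:x1]
        refined.append([(weights * xs).sum(), (weights * ys).sum()])
    return np.array(refined, dtype=np.float32)
```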

3 Matching

We trained SuperGlue[5] together with its official feature extractor SuperPoint[2] in an end-to-end manner on the MegaDepth dataset, with the IMW2021 competition test set removed. More specifically, we split the original SuperPoint[2] into two networks: the first was fixed with the official weights to extract keypoints from the image, while the other was fine-tuned to provide descriptors. However, we found this adjustment improved the model performance only slightly, since SuperGlue[5] can already match the given points well.
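In PyTorch terms, this split amounts to freezing the keypoint branch while leaving the descriptor branch trainable; a minimal sketch, assuming a SuperPoint-like module with a descriptor_head submodule (the attribute name is hypothetical):

```python
import torch

def split_superpoint_for_finetuning(superpoint: torch.nn.Module):
    """Freeze the keypoint branch (kept at the official weights) and leave the
    descriptor branch trainable for end-to-end training together with SuperGlue."""
    for p in superpoint.parameters():
        p.requires_grad = False
    for p in superpoint.descriptor_head.parameters():   # hypothetical submodule name
        p.requires_grad = True
    # pass these (plus the SuperGlue parameters) to the optimizer
    return [p for p in superpoint.parameters() if p.requires_grad]
```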

Furthermore, we took advantage of the recent DISK[6] feature, which is built upon a simple CNN backbone, instead of SuperPoint[2]. For the SuperGlue[5] matcher compatible with DISK[6], we only trained the matcher part, while directly using the official weights of DISK[6] without any fine-tuning. We leveraged DISK[6] to extract keypoints for each image in both the training and the testing phases. In this way, as shown in Table 1, the AUC could be improved significantly. The evaluation was conducted following the methodology in the SuperGlue[5] paper.

We trained both an indoor and an outdoor version of the SuperGlue matcher compatible with either SuperPoint or DISK, but we did not find the indoor one to improve performance; thus we stuck with the outdoor weights for this competition.

Method | Exact AUC @5° | @10° | @20°
SuperPoint + SuperGlue (Official) | 38.72 | 59.13 | 75.81
SuperPoint + SuperGlue (Our trained, outdoor) | 38.88 | 59.27 | 75.71
DISK + SuperGlue (Outdoor) | 42.72 | 62.54 | 97.68
Table 1: SuperGlue Weights Evaluation

Guided Pyramid Matching Only when the number of matches found by SuperGlue was less than 100 did we apply SuperGlue matching to the pyramid extraction results, i.e. multiple scales and/or multiple orientations. We would then either combine the matches from the different scales (ALL) or trust the level with the largest number of matches (MAX).
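A sketch of this fallback logic, assuming match(feats_a, feats_b) wraps a single SuperGlue pass and the pyramid features are precomputed per scale/orientation:

```python
def guided_pyramid_match(match, feats_a, feats_b, pyramid_feats_a, pyramid_feats_b,
                         mode="ALL", min_matches=100):
    """Run plain SuperGlue first; only fall back to pyramid matching when too few
    matches are found. mode="ALL" merges the matches from every level, while
    mode="MAX" keeps only the level with the most matches."""
    base = match(feats_a, feats_b)
    if len(base) >= min_matches:
        return base
    per_level = [match(fa, fb) for fa, fb in zip(pyramid_feats_a, pyramid_feats_b)]
    if mode == "MAX":
        return max(per_level, key=len)
    merged = []
    for m in per_level:
        merged.extend(m)
    return merged
```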

4 Outlier Rejection & Adaptive F/H

Based on our experiments, DegenSAC[4] outperformed other outlier rejection methods in most cases, so we applied DegenSAC to estimate the fundamental matrix and, under certain circumstances, the homography simultaneously.
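A minimal usage sketch with the pydegensac package (an assumption on our part; the positional arguments below are pixel threshold, confidence, and maximum iterations, which should be checked against the installed version):

```python
import numpy as np
import pydegensac  # assumption: pip install pydegensac

def degensac_filter(pts1, pts2, px_th=1.1, conf=0.999, max_iters=100_000):
    """Estimate the fundamental matrix with DEGENSAC and keep only the inlier matches.
    The 1.1 px threshold and 100k iterations mirror the Phototourism stereo settings in Table 2."""
    F, mask = pydegensac.findFundamentalMatrix(
        np.ascontiguousarray(pts1, np.float64),
        np.ascontiguousarray(pts2, np.float64),
        px_th, conf, max_iters)
    if F is None:  # estimation can fail when there are too few consistent matches
        return None, np.zeros(len(pts1), dtype=bool)
    return F, np.asarray(mask, dtype=bool)
```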

Inspired by ORB-SLAM[1], different transformation matrices should be selected for different scenes and then decomposed to obtain the pose. Although this process is fixed in the official backend, considering different transformation matrices still helps us filter matching outliers. In the GoogleUrban dataset, for example, many correct matches lie on the flat buildings along the street, while many wrong matches lie on the ground or on the isolation belt of the road. Because of the weak constraint of the F matrix, some wrong matches can still pass the filter (their distance to the epipolar line is less than the threshold), whereas if the H matrix is selected, only the correct matches on the street-side flat buildings are retained. Even though the pose is later decomposed from the F matrix in the official pipeline, the accuracy can still be improved, since some wrong matches are removed by the adaptive F/H strategy.

For the implementation, we follow ORB-SLAM[1]: we calculate the F matrix and the H matrix respectively, as well as their scores (SF and SH), which are computed from the correspondences whose symmetric transfer error is less than the threshold. After obtaining SF and SH, we compute RH = SH / (SH + SF) and select the H matrix if RH is greater than 0.45; otherwise we select the F matrix. This is called the adaptive F/H policy. In the PragueParks dataset, there are not many such cases, so the performance improvement from adaptive F/H was not obvious. In the Phototourism dataset, most of the non-planar matches are correct, so applying adaptive F/H removed correct non-planar matches and thus worsened the accuracy.
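A sketch of the adaptive F/H selection, using OpenCV estimators in place of the DegenSAC/ORB-SLAM routines and simple inlier counts as the scores SF and SH (ORB-SLAM uses a slightly different robust score):

```python
import cv2
import numpy as np

def adaptive_fh(pts1, pts2, th=1.1, rh_thresh=0.45):
    """Estimate both F and H, score each by the correspondences whose symmetric
    (epipolar / transfer) error is below the threshold, and keep the homography
    inliers when RH = SH / (SH + SF) exceeds rh_thresh."""
    F, f_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, th, 0.999)
    H, h_mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, th)
    if F is None or H is None:                       # fall back when either model fails
        if F is not None:
            return "F", F, f_mask.ravel().astype(bool)
        if H is not None:
            return "H", H, h_mask.ravel().astype(bool)
        return None, None, np.zeros(len(pts1), dtype=bool)

    ones = np.ones((len(pts1), 1))
    x1, x2 = np.hstack([pts1, ones]), np.hstack([pts2, ones])

    # symmetric epipolar distance for F
    l2, l1 = x1 @ F.T, x2 @ F                        # epipolar lines in image 2 / image 1
    num = np.abs(np.sum(x2 * l2, axis=1))            # |x2^T F x1| per correspondence
    d_f = 0.5 * (num / np.linalg.norm(l2[:, :2], axis=1)
                 + num / np.linalg.norm(l1[:, :2], axis=1))

    # symmetric transfer error for H
    p12, p21 = x1 @ H.T, x2 @ np.linalg.inv(H).T
    d_h = 0.5 * (np.linalg.norm(p12[:, :2] / p12[:, 2:] - pts2, axis=1)
                 + np.linalg.norm(p21[:, :2] / p21[:, 2:] - pts1, axis=1))

    sf, sh = np.count_nonzero(d_f < th), np.count_nonzero(d_h < th)
    rh = sh / max(sh + sf, 1)
    if rh > rh_thresh:                               # (near-)planar scene: trust the homography
        return "H", H, h_mask.ravel().astype(bool)
    return "F", F, f_mask.ravel().astype(bool)
```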

5 Conclusion

This report has presented the Megvii-3D team's strategies towards the CVPR 2021 IMW competition for both the unlimited-keypoints category and the restricted one.

Limitations of our method

From our analysis of the corner cases, we realized that a simple feature-matcher-filter solution cannot cope with all conditions without multi-scale and/or multi-orientation matching augmentation. We observed image pairs with a scale difference of more than a factor of 3, more than 45 degrees of rotation, or a large perspective transformation. A single-level solution cannot work them out by any means.

While those cases are rare, they are still possible in some real-life applications; for example, when you are using an AR map guide and make a sharp turn-around, or suddenly fall down, the visual localizer is likely to fail. However, we argue that such extreme cases could be compensated for by other types of data or sensors, such as an IMU, GPS, or a QR-code positioning system. On the other hand, the performance of current feature matchers might already be saturated, since the convolutional neural network block has its own limitations, including a limited receptive field, the difficulty of breaking the fixed spatial relationships between pixels, and an inability to model shape deformations (though we have DCNs now).

While many excellent researchers will continue to improve the accuracy, robustness, and generalization of feature extraction and matching, it is doubtful whether stereo matching still plays the dominant role in optimizing the performance of the whole visual localization system.

References

  • [1] C. Campos, R. Elvira, J. J. Gómez, J. M. M. Montiel, and J. D. Tardós (2020) ORB-SLAM3: an accurate open-source library for visual, visual-inertial and multi-map SLAM. arXiv preprint arXiv:2007.11898.
  • [2] D. DeTone, T. Malisiewicz, and A. Rabinovich (2017) SuperPoint: self-supervised interest point detection and description. CoRR abs/1712.07629.
  • [3] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017) Mask R-CNN. CoRR abs/1703.06870.
  • [4] D. Mishkin, J. Matas, and M. Perdoch (2015) MODS: fast and robust method for two-view matching. Computer Vision and Image Understanding.
  • [5] P. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich (2020) SuperGlue: learning feature matching with graph neural networks. In CVPR.
  • [6] M. Tyszkiewicz, P. Fua, and E. Trulls (2020) DISK: learning local features with policy gradient. In Advances in Neural Information Processing Systems 33.
  • [7] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR.

Appendix: Details about each Submission

Submission | Dataset | Image Size | Detector & Descriptor | Keypoint Threshold | NMS | Detector Weights | Auto Encoder | Pyramid | #Keypoints | #Keypoints after Mask | Mask | SuperGlue Threshold | SuperGlue Weights | Guided Pyramid | Sinkhorn Iterations | DEGENSAC Threshold (stereo/mv) | DEGENSAC #Iterations | Adapt F/H
mssscalev2 | Phototourism | 1600 | SuperPoint | 0.0005 | 4 | official | × | scale (ALL) | 5000 | 2048 | ✓ | 0.2 | official | scale (ALL) | 150 | 1.1/1.1 | 100k | ×
mssscalev2 | PragueParks | 2048 | SuperPoint | 0.0005 | 4 | official | × | scale (ALL) | 5000 | 2048 | × | 0.2 | official | scale (ALL) | 150 | 2.5/2.5 | 100k | ×
mssscalev2 | GoogleUrban | 1600 | SuperPoint | 0.0005 | 4 | official | × | scale (ALL) | 5000 | 2048 | ✓ | 0.2 | official | scale (ALL) | 150 | 1.1/1.2 | 100k | ×
mss_scale_adapt_f_8k | Phototourism | 2000 | SuperPoint | 0.0005 | 4 | official | × | scale (ALL) | 8000 | 3586 | ✓ | 0.2 | official | scale (ALL) | 150 | 0.9/0.9 | 1000k | ×
mss_scale_adapt_f_8k | PragueParks | 2048 | SuperPoint | 0.0005 | 4 | official | × | scale (ALL) | 8000 | 3586 | × | 0.2 | official | scale (ALL) | 150 | 2.2/2.7 | 1000k | ×
mss_scale_adapt_f_8k | GoogleUrban | 1600 | SuperPoint | 0.0005 | 4 | retrain | × | scale (ALL) | 8000 | 3586 | ✓ | 0.2 | retrain | scale (ALL) | 150 | 0.8/0.8 | 1000k | only mv
mss_orien | Phototourism | 1600 | SuperPoint | 0.0005 | 4 | official | × | orien (ALL) | 5000 | 2048 | ✓ | 0.2 | official | orien (ALL) | 150 | 1.1/1.1 | 100k | ×
mss_orien | PragueParks | 2048 | SuperPoint | 0.0005 | 4 | official | × | orien (ALL) | 5000 | 2048 | × | 0.2 | official | orien (ALL) | 150 | 2.5/2.5 | 100k | ×
mss_orien | GoogleUrban | 1600 | SuperPoint | 0.0005 | 4 | official | × | orien (ALL) | 5000 | 2048 | ✓ | 0.2 | official | orien (ALL) | 150 | 1.1/1.2 | 100k | ×
mss_degensac | Phototourism | 1600 | SuperPoint | 0.0005 | 4 | official | × | × | 5000 | 2048 | ✓ | 0.2 | official | × | 150 | 1.1/1.1 | 100k | ×
mss_degensac | PragueParks | 2048 | SuperPoint | 0.0005 | 4 | official | × | × | 5000 | 2048 | × | 0.2 | official | × | 150 | 2.5/2.5 | 100k | ×
mss_degensac | GoogleUrban | 1600 | SuperPoint | 0.0005 | 4 | official | × | × | 5000 | 2048 | ✓ | 0.2 | official | × | 150 | 1.2/1.2 | 100k | ×
disk_scale_8k | Phototourism | 1600 | disk | × | 3 | official | × | scale (ALL) | 10000 | 8000 | ✓ | 0.7 | retrain | scale (ALL) | 100 | 1.1/1.1 | 500k | ×
disk_scale_8k | PragueParks | 1600 | disk | × | 7 | official | × | scale (ALL) | 10000 | 8000 | × | 0.7 | retrain | scale (ALL) | 100 | 1.5/2.5 | 500k | ×
disk_scale_8k | GoogleUrban | 1600 | disk | × | 4 | official | × | scale (ALL) | 10000 | 8000 | ✓ | 0.7 | retrain | scale (ALL) | 100 | 1.1/1.5 | 500k | ×
sp_scale_adapt_f_orien | Phototourism | 1600 | SuperPoint | 0.0005 | 4 | retrain (only mv) | × | orien (ALL), only stereo | 5000 | 2048 | ✓ | 0.2 | retrain (only mv) | orien (ALL), only stereo | 150 | 1.1/1.1 | 100k | ×
sp_scale_adapt_f_orien | PragueParks | 2048 | SuperPoint | 0.0005 | 4 | official | × | orien (ALL), only stereo; scale (ALL), only mv | 5000 | 2048 | × | 0.2 | official | orien (ALL), only stereo; scale (ALL), only mv | 150 | 2.5/2.5 | 100k | ×
sp_scale_adapt_f_orien | GoogleUrban | 1600 | SuperPoint | 0.0005 | 4 | official | × | scale (ALL), only mv | 5000 | 2048 | ✓ | 0.2 | official | scale (ALL), only mv | 150 | 1.1/1.2 | 100k | only mv
mss_scale_8k | Phototourism | 2000 | SuperPoint | 0.0005 | 4 | official | × | scale (ALL) | 8000 | 3586 | ✓ | 0.2 | official | scale (ALL) | 150 | 0.9/0.9 | 1000k | ×
mss_scale_8k | PragueParks | 2048 | SuperPoint | 0.0005 | 4 | official | × | scale (ALL) | 8000 | 3586 | × | 0.2 | official | scale (ALL) | 150 | 2.2/2.7 | 1000k | ×
mss_scale_8k | GoogleUrban | 1600 | SuperPoint | 0.0005 | 4 | retrain | × | scale (ALL) | 8000 | 3586 | ✓ | 0.2 | retrain | scale (ALL) | 150 | 0.8/0.8 | 1000k | ×
sp_disk_scale_8k | Phototourism | 1600/1600 | SuperPoint/disk | 0.0005/× | 4/3 | official/official | × | scale (ALL)/scale (ALL) | 5000/8000 | 2048/5900 | ✓ | 0.2/0.7 | official/retrain | scale (ALL) | 150/100 | 1.1/1.1 | 500k | ×
sp_disk_scale_8k | PragueParks | 2048/1600 | SuperPoint/disk | 0.0005/× | 4/8 | official/official | × | scale (ALL)/scale (ALL) | 5000/8000 | 2048/5900 | × | 0.2/0.7 | official/retrain | scale (ALL) | 150/100 | 2.5/2.5 | 500k | ×
sp_disk_scale_8k | GoogleUrban | 1600/1600 | SuperPoint/disk | 0.0005/× | 4/4 | official/official | × | scale (ALL)/scale (ALL) | 5000/8000 | 2048/5900 | ✓ | 0.2/0.7 | official/retrain | scale (ALL) | 150/100 | 1.1/1.5 | 500k | ×
Table 2: Submission details. The Keypoint Threshold, NMS, Detector Weights, Auto Encoder, Pyramid, #Keypoints, and Mask columns describe the keypoint stage; the SuperGlue columns describe the matching stage; the DEGENSAC columns describe the RANSAC stage. The Pyramid and Guided Pyramid columns refer to pyramid feature extraction and guided pyramid matching, respectively. scale (ALL) denotes that we combined and used all of the matches found across all scales; orien (ALL) denotes that we combined and used all of the matches found across all orientations.