Visual self-localization from consecutive video frames of a freely moving camera has a long history 
and is one of the key challenges in 3D computer vision. SLAM methods have been applied in robotics and UAVs, and are a crucial element in augmented reality pipelines as well as medical applications. Besides well-known methods based on direct image alignment [19, 20, 27, 46], different sparse feature-based methods are also well studied [31, 45, 30, 16].
Direct methods incorporate the image information directly from pixel intensities, which can be error-prone under illumination changes, moving objects, or shutter effects. However, dense image alignment can help with dense reconstructions of the scene. Feature-based methods rely on distinctive feature points extracted from the image input, which can account for illumination changes while reducing the computational complexity. Due to their sparseness, they are more suitable for SLAM methods with loop closures and bundle adjustment; however, the resulting reconstructions are not dense.
The first step in feature based visual odometry and SLAM systems is to detect and to match keypoints between consecutive frames.
Quality and robustness of this step are vital for camera pose estimation and all subsequent computations in the pipeline.
Errors in pose estimation are usually treated in a second stage by pose optimization with local and global bundle adjustment or graph-based optimization schemes [32, 56].
Motivation. Natura non facit saltus (Latin for "nature does not make jumps"). This principle of natural philosophy was a crucial element in the formulation of infinitesimal calculus and classical mechanics. Consequently, we assume smooth motion of an object in space, which also holds for its projection onto the camera image. Knowledge of the motion at time $t$ thus helps to approximate the projected location in the next frame.
Given a video sequence, extracted feature points around descriptive parts of the image (e.g. some object in the scene) are grouped into local feature groups by our novel clustering algorithm. The spatial 2D displacement of corresponding groups is then propagated from previous frames by a motion proxy to constrain the search space for new feature matches. Since close features likely belong to the same scene structure, their motion is similar, and inter-frame matches between corresponding groups can reinforce each other. This is justified by statistical measures based on a binomial distribution, detailed in Sec. 3.2. It follows that, for a certain number of features in a group, a minimum number of matches between the groups is needed to confirm a true positive match (cf. Fig. 2).
Contributions and Outline. DynaMiTe combines two complementary elements, spatially coherent motion and temporally smooth inter-frame displacements (analogous to its eponym), in its joint formulation for feature matching between consecutive image frames. To this end, DynaMiTe contributes:
A dynamic local motion model encapsulating the differentiable spatial motion prior with temporal coherency constraints through frame-to-frame information passing.
A statistical quality criterion to determine noise-free feature correspondences.
An efficient clustering scheme through a light data structure to form groups of close-by feature points.
To the best of our knowledge, DynaMiTe is the first method that uses a generic data structure to form clusters of feature points and combines spatial and temporal constraints for feature matching, formulated in a unified probabilistic model. We motivate our method by analyzing the shortcomings of similar approaches in Sec. 3. We then give an overview of the general procedure of DynaMiTe, introduce our proposed dynamic local motion model (Sec. 3.1), extend the probabilistic model of reinforcing support matches between groups of features (Sec. 3.2, 3.3), and deduce robust statistics from it (Sec. 3.4). In the experiments we show matching quality, robustness and repeatability on different datasets for DynaMiTe as well as runtime performance, outperforming SOTA even in challenging scenes.
2 Related Work
Feature-based visual odometry methods have been shown to meet tight real-time constraints while computing accurate camera poses and sparse 3D maps of the scene, even for long sequences. Accurate feature matching has an immediate effect on the subsequent tasks of pose estimation and map generation [47, 28, 33, 49, 65, 22]. Pose interpolation [8, 54] and filtering techniques can be utilized to circumvent the real-time constraint for camera pose estimation from video sequences to some extent. To improve feature matching capabilities, multiple feature detectors and descriptors have been developed, also specifically targeting real-time applications. One major area of research focuses on the development of robust descriptors which are less variant and more distinctive, thus enabling better matching performance. Different descriptors [4, 43] and learning-based pipelines [1, 58, 42, 59] enable a variety of vision applications [11, 55, 61, 67]. Some scholars design descriptor and detector together [67, 48, 17, 53, 18], additionally learn the matching task, or include semantic information. Chli and Davison propose to actively search for features by propagating information from the previous frame. Targeting specifically wide baselines, Yu et al. proposed an efficient end-to-end pipeline for learning to find correspondences.
Differentiating between true correspondences and mismatches remains a primary difficulty. Methods like the ratio test improve feature matching quality by comparing the best and second-best potential feature match. Cross check is an alternative to the ratio test, where the nearest neighbor matches are checked for mutual consistency. Statistical approaches such as RANSAC and its modifications [12, 41] are effective at removing outliers but may increase runtime due to their iterative execution, especially for large inputs. FLANN finds approximate nearest neighbors in large datasets and can improve computation times.
By grouping joint motion pairs, different methods have been proposed to distinguish between true and false matches [35, 34]. Despite showing compelling results, their elaborate formulations result in complex and costly constraints. Other methods assume similar motion smoothness by matching patches between images [2, 26], or learn to match patches. Sparse and dense optical flow algorithms also assume that neighboring points in 3D move coherently.
Bian et al. (GMS) were the first to formulate the idea of motion smoothness in space within a probabilistic model, utilizing a predefined fixed pixel grid. Without GPU acceleration, their method is limited by its initial brute-force matching to find potential candidates, many of which are discarded as mismatches afterwards. Ma et al. transfer the idea of close-by feature point matching directly to the Euclidean distance within consecutive frames. This approximation does not hold in general and fails in practice for forward/backward translations, where the depth-dependent projection scales non-uniformly. Other works also employ locality information to filter match outliers, and Zheng et al. compute cluster centers from fixed grid patches to compare between frames. Wrong matches in the grid cells, however, shift the cluster center, and the method requires initial brute-force matching.
We also focus on improving matching quality by using close-by features for reinforcement, but propose a different clustering scheme, together with an improved probabilistic model for noise-free, robust feature matches in video sequences. Unlike patch-matching approaches, we match single features, such that features around a common landmark support each other.
3 Method
Problem Statement. Recent feature matching approaches [5, 70] for wide-baseline scenarios have introduced a simple probabilistic model to distinguish correct matches from mismatches, in which additional matches of proximate features reinforce each other. They are limited by analyzing those matches on regular grids, or require expensive clustering algorithms. In the former scenario, high quantities of uniformly distributed feature points across the entire image are matched, and supporting matches within a regular grid are analyzed. The high number of feature points and the uniform sampling lead to many keypoint pairs with poor descriptor quality, resulting in noisy matches and heavy computation. The proposed clustering algorithms are very restrictive and show large variation depending on their input, caused by unstable feature point detection between frames. The tight real-time constraint is problematic in both cases, as extracting and matching around $10^5$ keypoints is only possible with GPU acceleration. Expensive clustering algorithms aggravate the issue.
DynaMiTe. We take inspiration from supporting neighbouring matches and extend the approach with our dynamic local clustering method to form groups of close-by feature points around descriptive landmarks.
After feature computation, our proposed method encapsulates the spatial group displacement by the cluster representative. The spatial cluster information is passed throughout the sequence in the temporal domain as a motion proxy, resulting in a dynamic local motion model. Assuming mainly static scenes and smooth camera motion, only features of clusters within a certain search space around the cluster center in the previous frame need to be considered for matching. Hence, the group motion is used as a prior to restrict the search space for potential matches, which are finally evaluated with our improved and robust probabilistic model. Algorithm 1 gives a general overview of our proposed pipeline, which is schematically detailed in Fig. 3.
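The restricted search space described above can be sketched in a few lines. The function names, the constant-velocity motion proxy, and the toy values below are illustrative assumptions, not the exact implementation:

```python
import numpy as np

def predict_center(prev_center, prev_displacement):
    # Constant-velocity motion proxy: the cluster's previous inter-frame
    # displacement predicts its center location in the next frame.
    return prev_center + prev_displacement

def candidate_features(features, predicted_center, radius):
    # Restrict the matching search space to features within `radius` pixels
    # of the predicted cluster center; only these are considered for matching.
    d = np.linalg.norm(features - predicted_center, axis=1)
    return np.nonzero(d <= radius)[0]

# Toy example: a cluster moved 5 px to the right between the last two frames.
center, disp = np.array([100.0, 50.0]), np.array([5.0, 0.0])
feats = np.array([[104.0, 51.0], [200.0, 80.0], [107.0, 49.0]])
idx = candidate_features(feats, predict_center(center, disp), radius=10.0)
```

Only the first and third features fall inside the predicted search region; the distant feature is never considered, which is where the computational savings come from.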
We first detail the foundation for establishing a statistical measure for feature matching between patches with a matching score, and improve this base model with bi-directional matching to filter low-confidence matches. Additionally, we extend the model with a locally adaptive clustering approach and adapt the underlying statistics of the probabilistic model accordingly. We show that the final probabilistic measure for feature matching depends only on the number of neighbouring feature points within a group and the number of supporting matches.
3.1 Dynamic Local Motion Model
We propose a fast and dynamic feature clustering approach by exploiting the tendency of many feature detection operators to form clusters around descriptive landmarks in the scene. For this, a Union-Find Disjoint Sets (UFDS) structure is utilized for efficient grouping of close-by feature points. The data structure models groups in DynaMiTe as a collection of disjoint sets.
UFDS is essentially a forest of multi-way trees, where each tree represents a disjoint subset of elements. A forest of trees over $n$ items can be implemented as an array $p$ of size $n$, where $p[i]$ records the index of the parent of item $i$. If $p[i] = i$, then item $i$ is the root of its tree and also the representative item of the subset that contains item $i$ (cf. Fig. 4).
This allows us to determine which set an item belongs to, check whether two items belong to the same set, and merge two disjoint sets into one in nearly constant amortized time ($O(\alpha(n))$, with $\alpha$ the inverse Ackermann function). In our 2-dimensional implementation, items are feature points and sets are groups. The efficiency of these operations is crucial, as identifying the group of a keypoint is a frequent operation and the correctness of every match is examined via the correlation between two groups.
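A minimal UFDS with path compression and union by rank, following the parent-array description above (a generic sketch, not the paper's 2D implementation):

```python
class UnionFind:
    """Union-Find Disjoint Sets over items 0..n-1: parent[i] stores the
    parent index of item i; a root satisfies parent[i] == i."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, i):
        # Path compression (halving): point i closer to its root as we walk up.
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return
        if self.rank[ri] < self.rank[rj]:
            ri, rj = rj, ri
        self.parent[rj] = ri              # attach smaller tree under larger
        if self.rank[ri] == self.rank[rj]:
            self.rank[ri] += 1

uf = UnionFind(5)
uf.union(0, 1); uf.union(1, 2)
same = uf.find(0) == uf.find(2)           # items 0 and 2 now share a root
```

With both optimizations, `find` and `union` run in the near-constant amortized time cited above, which is what makes frequent group lookups during matching cheap.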
Our 2D UFDS data structure considers the maximum size of a cluster in pixels as well as the minimum and maximum number of features per group. This is justified by the probabilistic model derived hereafter (cf. Sec. 3.2). The analytic matching probabilities give an interval as a quality criterion for our group sizes, which we also use in all our experiments. Cluster centers are initialized at random over the set of all extracted feature points. Algorithm details can be found in the suppl. material.
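Applied to keypoints, the grouping could look as follows. The greedy pairwise merging, the parameter names (`max_dist`, `min_size`, `max_size`), and the toy values are assumptions of this sketch rather than the exact algorithm from the suppl. material:

```python
import numpy as np

def cluster_keypoints(pts, max_dist, min_size, max_size):
    """Sketch: group close-by 2D keypoints with a union-find parent array,
    enforcing a maximum pixel distance and min/max features per group."""
    n = len(pts)
    parent = list(range(n))
    size = [1] * n

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            ri, rj = find(i), find(j)
            if ri == rj:
                continue
            # merge only if the points are close and the group stays small
            if (np.linalg.norm(pts[i] - pts[j]) <= max_dist
                    and size[ri] + size[rj] <= max_size):
                parent[rj] = ri
                size[ri] += size[rj]

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # discard groups with too few features (isolated keypoints)
    return [g for g in groups.values() if len(g) >= min_size]

pts = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0],
                [50.0, 50.0], [51.0, 50.0], [100.0, 100.0]])
groups = cluster_keypoints(pts, max_dist=3.0, min_size=2, max_size=10)
```

Here the first three keypoints and the pair near (50, 50) form two groups, while the isolated point at (100, 100) is dropped by the minimum-size criterion.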
3.2 Probabilistic Model
Similar to prior work, we assume that features within a close vicinity match with high probability to the same area in another image from a different viewpoint; hence, matches of close-by features can reinforce each other. After feature points have been grouped by our proposed clustering algorithm, we analyze all enclosed features per pair of intersecting groups between video frames. The rate of feature matches between patches, compared to the number of enclosed keypoints, gives a measure of certainty for the match. We derive a probabilistic model by examining the matching events between correlated and uncorrelated image patches and deduce a binomial distribution which depends only on the number of enclosed keypoints in the patch. More specifically, we can define a threshold for a true positive match as the minimum number of supporting feature matches between two groups relative to their enclosed keypoints.
Figure 1 illustrates the possible matching events (see Table 1 for notation). In the case of observing corresponding patches $A$ and $B$ (green case $T$) in two images $I_a$ and $I_b$, a feature $f$ (green star) in patch $A$ has its nearest neighbor (NN) in descriptor space in patch $B$. This feature can either be correctly matched, or mismatched to some other feature in $B$ while its true NN still lies in $B$. We observe that those mismatches still contribute as "noisy" support matches between the patches. Similar observations can be made for the false case $F$, in which we analyze uncorrelated patches (e.g. patches $A$ and $B$ are not identical regions in the scene), where the feature is mismatched to its NN in $B$. By analyzing the matching events, it becomes apparent that there is a high probability of finding multiple matches between correlated patches which support each other.
Mathematical Justification. Let $f$ be one of the $n$ features in patch $A$, and let $f_t$ denote the event that $f$ correctly matches some feature out of the $m$ features in patch $B$, with probability $P(f_t) = p$. In case feature $f$ matches wrongly (i.e. event $\bar{f}_t$), its NN can be any of the other $N-1$ features among the $N$ features extracted in image $I_b$. Thus, we can write
$$P(\bar{f}_t) = 1 - p, \qquad P(f^{nn} = g \mid \bar{f}_t) = \frac{1}{N-1} \quad \text{for any } g \neq f_t.$$
For correlated patches (case $T$), we denote the probability that a feature in $A$ has its NN in $B$ by $p_t$. Examining the possible cases as depicted in Fig. 1, this consists of a correct match, or a mismatch whose NN still lies in patch $B$. Therefore we can write:
$$p_t = P(f^{nn} \in B \mid T) = p + (1-p)\,\frac{m-1}{N-1}. \tag{3}$$
We assume that each group can be treated equally and that groups have similar numbers of features ($m \approx n$, with $n \ll N$), so that Eq. (3) simplifies to
$$p_t \approx p + (1-p)\,\frac{n}{N}. \tag{4}$$
Analogously, for uncorrelated patches $A$ and $B$ (case $F$) we can derive:
$$p_f = P(f^{nn} \in B \mid F) = (1-p)\,\frac{n}{N}. \tag{5}$$
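As a numerical sanity check of the separation argument, the two probabilities can be evaluated under a simple model: a correct match occurs with probability $p$, and a chance hit lands in a patch in proportion to its share $n/N$ of the candidate features. These functional forms are assumptions of this sketch:

```python
def p_true(p, n, N):
    # NN falls in the corresponding patch: correct match, or chance hit
    return p + (1 - p) * n / N

def p_false(p, n, N):
    # NN falls in an uncorrelated patch: chance hit only
    return (1 - p) * n / N

# e.g. p = 0.5, groups of 10 features among 1000 candidates
pt, pf = p_true(0.5, 10, 1000), p_false(0.5, 10, 1000)
# the gap pt - pf equals p, independent of the fraction n/N
```

The gap between the two cases is exactly the single-feature matching probability $p$, which is what the support-match statistics later amplify over a whole group.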
3.3 False Positive Reduction
Assuming some feature matches correctly or incorrectly with the same chance, i.e. $p = 0.5$, and with $n \ll N$, we get a wide separation between $p_t$ and $p_f$ (see Eqs. (4) and (5)). However, this is partly due to including noisy false positive matches, which is not desirable (compare the noise for GMS in Fig. 1). To reduce noisy false positive matches (e.g. a mismatch whose NN still lies in the corresponding patch, cf. Fig. 1), we introduce a consistency check via bidirectional matching (compare Fig. 5). However, bidirectional matching influences the terms in Eq. (4). Details on the derivation below are given in the suppl. material.
True Matches. Given correct patch associations (case $T$), cross check helps to reduce noisy matches. Let, similar to Eq. (3), the probability of a feature in $A$ having its NN in $B$ under cross check be $p_t^{cc}$; then it holds:
$$p_t^{cc} = p + \left((1-p)\,\frac{n}{N}\right)^{2}. \tag{7}$$
False Matches. In analogy, for uncorrelated patches it holds:
$$p_f^{cc} = \left((1-p)\,\frac{n}{N}\right)^{2}. \tag{8}$$
3.4 Robust Statistics
Naive bidirectional matching between all features is expensive, especially when a large quantity (around $10^5$) of uniformly distributed features is extracted from the image, as in prior work. Additionally, the fraction $n/N$ becomes small, as a few features in a patch are compared against all features of the entire image, which would reduce the separation between $p_t$ and $p_f$.
As our proposed model embeds spatial and temporal information and serves as a motion proxy for the displacement of the encapsulated feature points, the potential feature matches are restricted to the intersecting clusters within a certain search space. Thus, not only is the computational bottleneck reduced, but $N$ also decreases significantly. With the assumption of small inter-frame motion, the numbers of features in $A$ and $B$ are similar ($m \approx n$), and the fraction in Eq. 7 and Eq. 8 approaches $1$, yielding again a wide separation between $p_t^{cc}$ and $p_f^{cc}$. Additionally, we suppose $p$ to be larger than $0.5$, which increases the separation further.
Matching Quality Criterion. The matching of an individual feature is generally independent of other features. Thus, similar to GMS, we can use the derivations from above to formulate a binomial distribution which describes the probability of finding $S$ additional support matches between correlated or uncorrelated groups for some feature match. Our matching quality criterion depends on the number $n$ of feature points in a patch:
$$S \sim \begin{cases} B(n,\, p_t^{cc}) & \text{if the groups are correlated,} \\ B(n,\, p_f^{cc}) & \text{if the groups are uncorrelated.} \end{cases}$$
From a statistical viewpoint, this allows us to formulate a reliable criterion to decide whether or not two groups are correlated and therefore enclose true matches. The objective is to identify a wide separation between the true and false cases. Such a division is given if one event is at least $\alpha$ standard deviations apart from the mean (cf. Fig. 6). This reduces the probabilities to a simple threshold $\tau$.
For a given number of features in a group, we can compute $\tau$ and compare it with the number of other supporting matches between the patches. If the number of supporting matches is higher than $\tau$, the patches are correlated and the feature matches between them are identified as correct.
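The decision rule can be sketched as a threshold derived from the false-case binomial distribution. The values of `alpha` and `p_false` below are illustrative assumptions, not the paper's tuned parameters:

```python
from math import sqrt

def is_correlated(n_features, n_support, alpha=6.0, p_false=0.1):
    """Sketch: declare two groups correlated if the observed support matches
    exceed the false-case mean by alpha standard deviations of a
    Binomial(n_features, p_false) distribution."""
    mean = n_features * p_false
    std = sqrt(n_features * p_false * (1 - p_false))
    tau = mean + alpha * std
    return n_support > tau

# groups of 20 features: 15 supporting matches pass, 3 do not
ok = is_correlated(n_features=20, n_support=15)
bad = is_correlated(n_features=20, n_support=3)
```

Because both the mean and the standard deviation of a binomial grow with the group size, the threshold adapts automatically to how many features a group encloses.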
4 Experimental Evaluation
We quantitatively compare our method against a number of classical matching approaches, GMS, SIFT, SURF, ORB, BD, BF, GAIM, and USC, as well as the learning-based methods DM and LIFT. We compare on different datasets with small (TUM) and large (Kitti) baselines, as well as scenes with little texture (Cabinet).
Evaluation aspects are matching accuracy, robustness, and runtime. To quantify matching accuracy, we evaluate the accuracy of pose estimation from matched features, follow the evaluation protocol of Bian et al., and use their results for comparison on the TUM split. The pose success ratio is reported as a measure of correctly recovered poses under a certain error threshold. The pose is recovered via the essential matrix estimated from feature matches with a RANSAC scheme. The improved results over the SOTA confirm that our proposed spatial-temporal probabilistic model is beneficial in a wide range of textured scenes and different baselines. We observe less convincing results in scenes with limited texture ("Cabinet", see Fig. 1 as an example) and attribute this to the limited ability to form feature groups in such scenes. We justify this assumption by analyzing the average inlier ratio of feature matches in the RANSAC scheme during pose estimation. Matching repeatability is analyzed as the reprojection error of feature matches in static scenes. Additional qualitative results are provided, as well as an ablation study in which parts of the method are disabled to examine the limitations of our approach.
All experiments are conducted on an Intel Core i7 CPU. We use the publicly available ORB implementation of OpenCV. For more details on the maximum number of extracted feature points and the parametrization of UFDS, please refer to the suppl. material.
Matching Accuracy. Evidently, DynaMiTe outperforms other methods in textured scenarios, as the full potential of our joint formulation of spatial and temporal constraints can unfold (Fig. 7 [Left]).
Runtime. We have tested runtime performance on Kitti and TUM. Our method outperforms SIFT and optical flow (OF) as baselines, and even GMS with GPU acceleration (compare GMS-GPU in Tab. 2).
Runtime vs. Accuracy. For better comparison we evaluate accuracy against runtime (cf. Fig. 7 [Right]). DynaMiTe consistently outperforms other methods in terms of runtime vs. success ratio.
Low-Texture Scene. For the low-texture scene "Cabinet", tracking a large number of group associations throughout the entire sequence is challenging. Only a small number of groups with enough feature points cluster around well-defined landmarks. DynaMiTe still performs on par with other methods, which also struggle in this scenario, and ranks top in terms of runtime vs. accuracy (compare Fig. 8).
Inlier ratio. To analyze the inferior results on "Cabinet", we report the inlier ratio of RANSAC during camera pose estimation in Tab. 3. The results reflect our findings: both pose success and inlier ratio for the textured scenes are superior with DynaMiTe, whereas the Cabinet scene with little structure remains challenging.
Matching repeatability. In Tab. 4, the average match reprojection errors for different static scenes from the TILDE webcam dataset are summarized. For a perfect match, the norm would be $0$, as the scene and the camera remain static throughout the video. This metric can be interpreted as a measure of matching repeatability and of the accuracy of the matching scheme, as high errors indicate wrong and noisy matches and the inability to robustly handle repetitive patterns. DynaMiTe considerably outperforms SIFT as a baseline and reports superior results compared to GMS.
As an additional measure to the evaluation in Tab. 4, we analyze the error relative to the number of extracted feature points for GMS and our method. We calculate the average error normalized by the number of features for each sequence; the resulting averages underline the favourable efficiency and accuracy of our approach compared to GMS.
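The repeatability metric described above reduces to a mean L2 distance between matched pixel locations; a minimal sketch (function name and toy values are illustrative):

```python
import numpy as np

def static_scene_reprojection_error(pts_a, pts_b):
    """In a fully static scene with a static camera, matched keypoints should
    land on identical pixel coordinates, so the mean L2 distance between
    matched locations measures matching repeatability (0 = perfect)."""
    return float(np.mean(np.linalg.norm(pts_a - pts_b, axis=1)))

# one perfect match and one match that drifted by 5 px
err = static_scene_reprojection_error(
    np.array([[10.0, 10.0], [30.0, 40.0]]),
    np.array([[10.0, 10.0], [33.0, 44.0]]))
```

Normalizing this error by the number of extracted features, as done for Tab. 4, then compares methods independently of how many keypoints they extract.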
Qualitative Robustness Evaluation. We present additional qualitative results on matching robustness in different scenes. Our method filters out noisy, meaningless matches on the texture-less background. Furthermore, our proposed cluster grouping and spatial-temporal formulation robustly tracks reliable features around landmarks with high image information (e.g. edges and corners of the cabinet, see Fig. 1 and 9 [Left]). DynaMiTe can also handle repetitive patterns in the Kitti sequence, such as the windows on the white building, due to its local clustering algorithm, whereas regular grids such as in GMS fail (see Fig. 9 [Right]).
Ablation Study. The experiments above show applicability to small and wide baseline scenarios (TUM/Kitti). Here, we specifically force the algorithm to only keep matches between groups which have been matched throughout the sequence of consecutive frames, and not to establish new group associations between frames. Due to large inter-frame forward motion, only a few groups in the center of the image are reliably visible throughout all frames. While our assumptions hold for consecutive frames, tracking groups across the complete sequence, from the first frame to the last, is problematic, as our constraints are violated in this particular setting. Fig. 10 illustrates the limitations of our proposed method in this specific case. DynaMiTe can still be applied in scenarios with very large baselines, however at the cost of a relaxed constraint for inter-frame motion by increasing the search space of the temporal motion prior.
The reported results clearly show the fundamental trade-off between the ability to correctly match feature points and compliance with the runtime constraint for different matching methods. DynaMiTe reduces this limitation with its joint formulation, as it efficiently passes information throughout the sequence, encapsulated in the joint spatial-temporal model. This enables very robust feature matching, as well as reduced noise in low-textured scenes, and high framerates without GPU acceleration. High-confidence, noise-free feature matches are beneficial for camera pose estimation, which is the focus of our method. The same holds true for reconstruction purposes, one of the various possible application scenarios for DynaMiTe. Generally, our proposed pipeline utilizes solely the information of the feature descriptor and its pixel location in the image, while being agnostic to the underlying descriptor itself. Our model achieves robust feature matching even in difficult scenarios and under arbitrary inter-frame motion such as scaling and in-plane rotations, as we rely neither on regular grids nor on restrictive clustering methods.
References
- Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC, Vol. 1, pp. 3.
- (2009) PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (ToG) 28 (3), pp. 24.
- (2013) Metaphysics: A Critical Translation with Kant's Elucidations, Selected Notes, and Related Materials. A&C Black.
- (2008) Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding 110 (3), pp. 346–359.
- (2017) GMS: Grid-Based Motion Statistics for Fast, Ultra-Robust Feature Correspondence. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2828–2837.
- (2000) The OpenCV Library. Dr. Dobb's Journal of Software Tools.
- (2018) Camera Pose Filtering with Local Regression Geodesics on the Riemannian Manifold of Dual Quaternions. In IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 2436–2445.
- (2016) Quaternionic Upsampling: Hyperspherical Techniques for 6 DoF Pose Tracking. In Fourth International Conference on 3D Vision (3DV), pp. 629–638.
- (2018) Markerless Inside-Out Tracking for 3D Ultrasound Compounding. In Simulation, Image Processing, and Ultrasound Systems for Assisted Diagnosis and Navigation, Cham, pp. 56–64.
- (2008) Active matching. In ECCV, pp. 72–85.
- (2016) Universal correspondence network. In Advances in Neural Information Processing Systems, pp. 2414–2422.
- (2005) Matching with PROSAC — Progressive Sample Consensus. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 220–226.
- (2014) An analysis of errors in graph-based keypoint matching and proposed solutions. In European Conference on Computer Vision, pp. 138–153.
- (2017) Direct methods for 3D reconstruction and visual SLAM. In Fifteenth IAPR International Conference on Machine Vision Applications (MVA), pp. 34–38.
- (2018) From handcrafted to deep local invariant features. arXiv preprint arXiv:1807.10254.
- (2007) MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis & Machine Intelligence, pp. 1052–1067.
- SuperPoint: Self-Supervised Interest Point Detection and Description. In CVPR Deep Learning for Visual SLAM Workshop.
- (2019) D2-Net: A trainable CNN for joint detection and description of local features. arXiv preprint arXiv:1905.03561.
- (2017) Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (3), pp. 611–625.
- (2014) LSD-SLAM: Large-Scale Direct Monocular SLAM. In European Conference on Computer Vision (ECCV), Cham, pp. 834–849.
- (2003) Two-Frame Motion Estimation Based on Polynomial Expansion. In Image Analysis, pp. 363–370.
- (2017) Quaternion based camera pose estimation from matched feature points. arXiv preprint arXiv:1704.02672.
- (1981) Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395.
- (1964) An Improved Equivalence Algorithm. Communications of the ACM 7 (5), pp. 301–303.
- (2013) Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237.
- (2011) Non-rigid dense correspondence with applications for image enhancement. In ACM Transactions on Graphics (TOG).
- (1991) Direct multi-resolution estimation of ego-motion and structure from motion. In Proceedings of the IEEE Workshop on Visual Motion, pp. 156–162.
- (2011) A Direct Least-Squares (DLS) method for PnP. In International Conference on Computer Vision, pp. 383–390.
- (1981) Determining optical flow. Artificial Intelligence 17 (1-3), pp. 185–203.
- (2006) Full-3D Edge Tracking with a Particle Filter. In BMVC, pp. 1119–1128.
- (2007) Parallel Tracking and Mapping for Small AR Workspaces. In 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR), pp. 225–234.
- (2011) g2o: A general framework for graph optimization. In IEEE International Conference on Robotics and Automation (ICRA), pp. 3607–3613.
- (2009) EPnP: An Accurate O(n) Solution to the PnP Problem. International Journal of Computer Vision 81 (2), pp. 155–166.
- (2014) Bilateral Functions for Global Motion Modeling. In European Conference on Computer Vision, pp. 341–356.
- (2018) CODE: Coherence Based Decision Boundaries for Feature Correspondence. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (1), pp. 34–47.
- (2014) Feature matching with bounded distortion. ACM Transactions on Graphics (TOG) 33 (3), pp. 26.
- (2004) Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60 (2), pp. 91–110.
- (1981) An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI), San Francisco, CA, USA, pp. 674–679.
- (2017) Locality Preserving Matching. International Journal of Computer Vision, pp. 1–20.
- (1988) Incremental estimation of dense depth maps from image sequences. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 366–374.
- (1992) Analysis of the least median of squares estimator for computer vision applications. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 621–623.
- (2017) Working hard to know your neighbor's margins: Local descriptor learning loss. In Advances in Neural Information Processing Systems, pp. 4826–4837.
- ASIFT: A New Framework for Fully Affine Invariant Image Comparison. SIAM Journal on Imaging Sciences 2 (2), pp. 438–469.
- (2009) Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP International Conference on Computer Vision Theory and Applications, pp. 331–340.
- ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262.
- (2011) KinectFusion: Real-time dense surface mapping and tracking. In 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 127–136.
- (2004) An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (6), pp. 756–770.
- (2018) LF-Net: Learning local features from images. In Advances in Neural Information Processing Systems, pp. 6234–6244.
- (2013) Exhaustive Linearization for Robust Camera Pose and Focal Length Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (10), pp. 2387–2400.
-  (2013) USAC: a universal framework for random sample consensus.. IEEE Trans. Pattern Anal. Mach. Intell. 35 (8), pp. 2022–2038. Cited by: §4.
-  (2011-11) ORB: An efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE international conference on, pp. 2564–2571. External Links: Cited by: Figure 1, §2, §4.
-  (2019) SuperGlue: learning feature matching with graph neural networks. External Links: Cited by: §2.
-  (2019) RF-net: an end-to-end image matching network based on receptive field. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8132–8140. Cited by: §2.
-  (1985) Animating rotation with quaternion curves. In Proceedings of the 12th annual conference on Computer graphics and interactive techniques - SIGGRAPH ’85, pp. 245–254. External Links: Cited by: §2.
-  (2015) Discriminative Learning of Deep Convolutional Feature Point Descriptors. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 118–126. External Links: Cited by: §2.
-  (2011) Double window optimisation for constant time visual SLAM. In Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2352–2359. External Links: Cited by: §1.
-  (2012) A benchmark for the evaluation of RGB-D SLAM systems. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 573–580. Cited by: Figure 13, Figure 1, item 4, Figure 7, Figure 8, Figure 9, §4, §4, §4.
-  (2017) L2-net: deep learning of discriminative patch descriptor in euclidean space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 661–669. Cited by: §2.
-  (2019) SOSNet: second order similarity regularization for local descriptor learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11016–11025. Cited by: §2.
-  (2017) Deep semantic feature matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2015-06) TILDE: A Temporally Invariant Learned DEtector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5279–5288. External Links: Cited by: item 4, §2, Table 4, §4.
-  (2017) From monocular slam to autonomous drone exploration. In 2017 European Conference on Mobile Robots (ECMR), pp. 1–8. Cited by: §1.
-  (2018) Gaussian field consensus: A robust nonparametric matching method for outlier rejection. Pattern Recognition 74, pp. 305–316. External Links: Cited by: §2.
-  (2013) DeepFlow: Large displacement optical flow with deep matching. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1385–1392. Cited by: §4.
-  (2003) Complete solution classification for the perspective-three-point problem. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (8), pp. 930–943. External Links: Cited by: §2.
-  (2015-06) MatchNet: unifying feature and metric learning for patch-based matching. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 3279–3286. External Links: Cited by: §2.
-  (2016) LIFT: Learned Invariant Feature Transform. European Conference on Computer Vision, pp. 467–483. External Links: Cited by: §2, §4.
-  (2018) Learning to Find Good Correspondences. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §2.
-  (1988) The Motion Coherence Theory. In 1988 Second International Conference on Computer Vision, pp. 344–353. External Links: Cited by: §2.
-  (2018) UGC: Real-time, Ultra-robust Feature Correspondence via Unilateral Grid-based Clustering. IEEE Access. External Links: Cited by: §2, §3.
Appendix 0.A Details on 3.3 False Positive Reduction
Here, we detail the derivation of Eq. (6) from the main paper for clarity.
True Matches. Let the probability of a feature having its nearest neighbor in the corresponding group under cross check be given; then it holds:
False Matches. Analogously, for uncorrelated patches:
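The exact form of Eq. (6) is given in the main paper. As an illustration only, the following sketch shows how such a binomial model yields the minimum number of inter-group matches needed to confirm a group correspondence; the false-match probability `p_false` and significance level `alpha` used below are hypothetical placeholders, not values from the paper.

```python
from math import comb

def binom_tail(n, k, p):
    """Upper tail P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def min_matches(n, p_false, alpha=0.01):
    """Smallest k such that a spurious group pair, where each of the n features
    matches by chance with probability p_false, reaches k matches with
    probability below alpha -- i.e. observing >= k matches confirms the group."""
    for k in range(1, n + 1):
        if binom_tail(n, k, p_false) < alpha:
            return k
    return n

# For a group of 20 features with a 5% chance false-match rate,
# at least 5 matches are required for confirmation at alpha = 0.01:
min_matches(20, 0.05)  # -> 5
```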
Appendix 0.B Runtime Analysis
Our method achieves real-time performance on CPU for the full pipeline, from feature extraction through matching to applying our spatial and temporal constraints, without any GPU acceleration. Fig. 11 clearly shows the runtime advantage of our method over GMS: GMS is slower by a factor of 4 with GPU acceleration, and even by a factor of 15 on CPU only. The matching step, which accounts for the majority of the overall runtime of GMS and other methods, has been decreased significantly; the bottleneck of our proposed method is now solely the feature extraction itself.
Appendix 0.C Parameter Discussion
0.C.1 Feature Points
We limit the maximum number of extracted feature points in the image. From the perspective of camera pose estimation alone, a small number of feature matches is sufficient. However, our proposed method assumes that enough feature points are detected to form local feature groups around well-defined structures in the image:
The FAST threshold of ORB is set to a value that ensures a high number of detected feature points without compromising the quality of the feature descriptors:
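As a minimal illustration (not the paper's implementation), capping the number of keypoints typically means keeping those with the strongest detector responses, analogous to OpenCV's "retain best" behavior; the `max_features` value and the tuple layout below are assumptions for the sketch.

```python
def cap_keypoints(keypoints, max_features):
    """Keep at most max_features keypoints, preferring strong detector responses.

    keypoints: list of (x, y, response) tuples, response being the
    detector's corner score (e.g. the FAST score in ORB).
    """
    return sorted(keypoints, key=lambda kp: kp[2], reverse=True)[:max_features]

# Keep the two strongest of three candidate keypoints:
kps = [(10, 12, 0.9), (40, 7, 0.3), (22, 30, 0.7)]
cap_keypoints(kps, 2)  # -> [(10, 12, 0.9), (22, 30, 0.7)]
```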
0.C.2 Local Motion Model
The parameters of our local motion model are justified by our proposed probabilistic model. The chosen parameters were used throughout our evaluation and have thus proven applicable to different image content and scenarios.
We define a maximum size for a local group in pixels:
For every feature, the algorithm finds its neighbors within a 30 × 30 pixel region centered on the feature. Combined with the size limit on groups, this allows more nearby features to be grouped together while keeping each group's spatial extent in an appropriate range. This parameter may need to be adjusted for high-resolution images.
Derived from our probabilistic model, a minimum number of features per group is required to apply the statistical criteria. A maximum number of feature points per group should also be enforced: it prevents groups from expanding excessively, and for very large numbers of features the quality criterion saturates.
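A simplified, hypothetical sketch of such window-based grouping follows; the paper's actual clustering algorithm differs in detail, and `min_size` and `max_size` here are placeholder values, not the parameters used in our evaluation.

```python
def group_features(points, window=30, min_size=3, max_size=50):
    """Greedily group 2D feature points: a point joins an existing group if it
    lies within a window x window region around the group's seed point.
    Groups smaller than min_size are discarded; groups stop growing at max_size.
    """
    groups = []  # each group: {"seed": (x, y), "members": [...]}
    half = window / 2
    for (x, y) in points:
        placed = False
        for g in groups:
            sx, sy = g["seed"]
            if (len(g["members"]) < max_size
                    and abs(x - sx) <= half and abs(y - sy) <= half):
                g["members"].append((x, y))
                placed = True
                break
        if not placed:
            groups.append({"seed": (x, y), "members": [(x, y)]})
    return [g["members"] for g in groups if len(g["members"]) >= min_size]

# Four nearby points form one valid group; the two isolated points
# form a group below min_size and are discarded:
pts = [(0, 0), (5, 5), (10, 2), (3, 8), (100, 100), (105, 102)]
group_features(pts)  # -> [[(0, 0), (5, 5), (10, 2), (3, 8)]]
```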
Appendix 0.D Qualitative Results
Appendix 0.E Algorithms
For a better understanding of our proposed method together with the source code, we provide an overview of the pipeline as pseudo-code. An overview of the overall pipeline can be found in Algorithm LABEL:DynaMiTe_Algo. The grouping algorithm for finding dynamic local feature groups is summarized in Algorithm LABEL:Grouping_Algo.