
DynaMiTe: A Dynamic Local Motion Model with Temporal Constraints for Robust Real-Time Feature Matching

Feature based visual odometry and SLAM methods require accurate and fast correspondence matching between consecutive image frames for precise camera pose estimation in real-time. Current feature matching pipelines either rely solely on the descriptive capabilities of the feature extractor or need computationally complex optimization schemes. We present the lightweight pipeline DynaMiTe, which is agnostic to the descriptor input and leverages spatial-temporal cues with efficient statistical measures. The theoretical backbone of the method lies within a probabilistic formulation of feature matching and the respective study of physically motivated constraints. A dynamically adaptable local motion model encapsulates groups of features in an efficient data structure. Temporal constraints transfer information of the local motion model across time, thus additionally reducing the search space complexity for matching. DynaMiTe achieves superior results both in terms of matching accuracy and camera pose estimation with high frame rates, outperforming state-of-the-art matching methods while being computationally more efficient.





1 Introduction

Visual self-localization from consecutive video frames of a freely moving camera has a long history [40] and is one of the key challenges in 3D computer vision. SLAM methods have been applied in robotics and UAVs [62] and are a crucial element in augmented reality pipelines [19] as well as medical applications [9]. Besides well-known methods based on direct image alignment [19, 20, 27, 46], different sparse feature based methods are also well studied [31, 45, 30, 16].

Figure 1: Comparison of feature matching for consecutive image frames on a challenging low-textured object of the TUM RGB-D dataset [57]. Features of the first frame are shown in yellow, features of the second frame in green, and matches as blue lines. SIFT [37] is texture-sensitive. ORB [51] (2000 extractions) is efficient but unstable. GMS [5] produces noisy, wrong matches in uniform regions, while our method runs the fastest with minimal incorrect matches.

Direct methods incorporate the image information directly from pixel intensities, which can be error-prone due to illumination changes, moving objects or shutter effects [20]. However, a dense image alignment can help with dense reconstructions of the scene [14]. Feature based methods rely on distinctive feature points extracted from the image input, which can account for illumination changes while reducing the computational complexity. Due to their sparseness, they are more suitable for SLAM methods with loop closures and bundle adjustment; however reconstructions are not dense [45].

The first step in feature based visual odometry and SLAM systems is to detect and match keypoints between consecutive frames. The quality and robustness of this step are vital for camera pose estimation and all subsequent computations in the pipeline. Errors in pose estimation are usually treated in a second stage by pose optimization with local and global bundle adjustment or graph based optimization schemes [32, 56].

Motivation. Natura non facit saltus (Latin for "nature does not make jumps"). This principle of natural philosophy was a crucial element in the formulation of infinitesimal calculus and classical mechanics [3]. Consequently, we assume smooth motion of an object in space, which also holds for its projection onto a camera image. Knowledge of the motion at time t thus helps to approximate the projected location in the next frame.

Given a video sequence, extracted feature points around descriptive parts of the image (e.g. some object in the scene) are grouped into local feature groups by our novel clustering algorithm. The spatial 2D displacement of corresponding groups is then propagated from previous frames by a motion proxy to constrain the search space for new feature matches. Since close features likely belong to the same scene structure, their motion is similar and inter-frame matches between corresponding groups can reinforce each other. This is justified by statistical measures based on a binomial distribution detailed in Sec. 3.2. It follows that for a certain number of features in a group, a minimum number of matches between the groups is needed to confirm a true positive match (cf. Fig. 2).

Figure 2: Matching and reinforcement. Support matches (blue) between groups of feature points reinforce each other. Areas with little structure or blurry parts can lead to noisy false matches (orange). The proposed locally adaptive clustering algorithm encapsulates proximate feature points. All clusters within a defined search space (dashed line) are potential matching candidates.

Contributions and Outline. DynaMiTe combines two complementary elements, spatially coherent motion and temporally smooth inter-frame displacements - analogous to its eponym - in a joint formulation for feature matching between consecutive image frames. To this end, DynaMiTe contributes:

  1. A dynamic local motion model encapsulating the differentiable spatial motion prior with temporal coherency constraints through frame-to-frame information passing.

  2. A statistical quality criterion to determine noise-free feature correspondences.

  3. An efficient clustering scheme through a light data structure to form groups of close-by feature points.

  4. An efficient and robust feature matching pipeline for camera pose estimation in image sequences that significantly improves the state-of-the-art evaluated on the three datasets KITTI [25], TUM RGB-D [57], and TILDE webcam [61].

To the best of our knowledge, DynaMiTe is the first method that uses a generic data structure to form clusters of feature points and combines spatial and temporal constraints for feature matching, formulated in a unified probabilistic model. We motivate our method by analyzing the shortcomings of similar approaches in Sec. 3. We then give an overview of the general procedure of DynaMiTe, introduce our proposed dynamic local motion model (Sec. 3.1), extend the probabilistic model of reinforcing support matches between groups of features (Secs. 3.2 and 3.3), and deduce robust statistics from it (Sec. 3.4). In the experiments we evaluate matching quality, robustness, repeatability and runtime performance of DynaMiTe on different datasets, where it outperforms the state of the art even in challenging scenes.

2 Related Work

Feature based visual odometry methods have been shown to meet the tight real-time constraints for computing accurate camera poses and sparse 3D maps of the scene [31], even for long sequences [45]. Accurate feature matching has an immediate effect on the subsequent tasks of pose estimation and map generation [47, 28, 33, 49, 65, 22]. Pose interpolation [8, 54] and filtering [7] techniques can be utilized to circumvent the real-time constraint for camera pose estimation from video sequences to some extent. To improve feature matching capabilities, multiple feature detectors and descriptors have been developed [37], also specifically targeting real-time applications [51]. One major area of research focuses on the development of robust descriptors which are less variant and more distinctive, thus enabling better matching performance [15]. Different descriptors [4, 43] and learning based pipelines [1, 58, 42, 59] enable a variety of vision applications [11, 55, 61, 67]. Some authors design descriptor and detector together [67, 48, 17, 53, 18], additionally learn the matching task [52], or also include semantic information [60]. Chli and Davison [10] propose to actively search for features by propagating information from the previous frame. Specifically targeting wide baselines, Yu et al. [68] proposed an efficient end-to-end pipeline for learning to find correspondences.

Differentiating between true correspondences and mismatches remains the primary difficulty. Methods like the ratio test [37] improve feature matching quality by comparing the best and second best potential feature match. Cross check is an alternative to the ratio test, where the nearest neighbor matches are checked for consistency. Statistical approaches such as RANSAC [23] and its modifications [12, 41] are effective at removing outliers but may increase runtime due to their iterative execution, especially for large inputs. FLANN [44] finds approximate nearest neighbors in large datasets and can improve computation times.

By grouping joint motion pairs [69] different methods have been proposed in order to distinguish between true and false matches [35, 34]. Despite showing compelling results, their elaborate formulations result in complex and costly constraints. Other methods assume similar motion smoothness by matching patches between images [2, 26], or learn to match patches [66]. Sparse [38] and dense [21] optical flow algorithms [29] also assume neighboring points in 3D to move coherently.

Bian et al. [5] (GMS) were the first to formulate the idea of motion smoothness in space within a probabilistic model utilizing a predefined fixed pixel grid. Without GPU acceleration, their method is limited by its initial brute force matching to find potential candidates, many of which are discarded as mismatches afterwards. Ma et al. [39] transfer the idea of close-by feature point matching directly to the Euclidean distance within consecutive frames. This approximation does not hold in general and fails in practice for forward/backward translations, where the depth dependent projection scales non-uniformly. The authors of [63] also employ locality information to filter match outliers, and Zheng et al. [70] compute cluster centers from fixed grid patches to compare between frames. Wrong matches in the grid cells, however, shift the cluster center, and the method requires initial brute force matching.

We also focus on improving matching quality by using close-by features for reinforcement, but propose a different clustering scheme, together with an improved probabilistic model for noise-free robust feature matches in video sequences. Unlike matching patches, we match single features where features around some landmark support each other.

3 Methodology

Problem Statement. Recent feature matching approaches [5, 70] for wide-baseline scenarios have introduced a simple probabilistic model to distinguish between correct matches and mismatches, where additional matches of proximate features reinforce each other. They are limited by analyzing those matches on regular grids or require expensive clustering algorithms. In the former scenario [5], high quantities of uniformly distributed feature points across the entire image are matched, and supporting matches within a regular grid are analyzed. A high number of feature points and uniform sampling lead to many keypoint pairs with poor descriptor quality, resulting in noisy matches and heavy computation. Proposed clustering algorithms as in [70] are very restrictive and show large variation based on their input, caused by unstable feature point detection between frames. The tight real-time constraint is problematic in both cases, as extracting and matching around 1E5 keypoints [5] is only possible with GPU acceleration. Expensive clustering algorithms [70] aggravate the issue.

DynaMiTe. We take inspiration from supporting neighbouring matches [5] and extend the approach with our dynamic local clustering method to form groups of close-by feature points around descriptive landmarks.

Figure 3: [Left] Schematic illustration of the DynaMiTe pipeline: Temporal information is passed through the image sequence for each group (upper row). The boxes (light grey) illustrate the enlarged search space around groups between consecutive time steps. A feature match is considered true if enough other matches between the groups support it (green). Groups may disappear (crossed-out group) and new ones emerge (orange group). [Right] Matched features and groups as overlay on the source image.

After feature computation, our proposed method encapsulates the spatial group displacement by the cluster representative. The spatial cluster information is passed throughout the sequence in the temporal domain as motion proxy, resulting in a dynamic local motion model. Assuming mainly static scenes and smooth camera motion, only features of clusters within a certain search space around the cluster center in the previous frame need to be considered for matching. Hence, the group motion is used as prior to restrict the search space for potential matches, which are finally evaluated with our improved and robust probabilistic model. Algorithm 1 gives a general overview of our proposed pipeline, which is schematically detailed in Fig. 3.

forall frames in video do
    Extract feature points;
    Establish feature groups (FG);
    Calculate intersection of old and new FGs;
    Match intersected FGs;
    Compute match score and retrieve inliers;
    Establish new FGs;
    Pass FG information to next frame;
Algorithm 1 DynaMiTe Pipeline for Feature Matching
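The search-space restriction used in the intersection step can be illustrated with a minimal sketch. It is not the authors' implementation: the circular search region, the `radius` parameter and the function names are assumptions made for illustration, and the example coordinates are made up. Each group from the previous frame predicts its new center from its last 2D displacement, and only new groups inside the search region around that prediction are kept as matching candidates.

```python
from math import hypot

def predict_center(center, displacement):
    """Shift a group's previous center by its last observed 2D displacement."""
    return (center[0] + displacement[0], center[1] + displacement[1])

def candidate_groups(prev_groups, new_centers, radius):
    """For each previous group, list indices of new groups inside its search space.

    prev_groups: list of (center, displacement) tuples from the previous frame.
    new_centers: list of (x, y) group centers in the current frame.
    radius: search radius in pixels (hypothetical parameter, assumed circular).
    """
    candidates = []
    for center, displacement in prev_groups:
        px, py = predict_center(center, displacement)
        inside = [j for j, (x, y) in enumerate(new_centers)
                  if hypot(x - px, y - py) <= radius]
        candidates.append(inside)
    return candidates

# Two groups tracked so far, three groups detected in the new frame.
prev_groups = [((100.0, 50.0), (4.0, 0.0)), ((300.0, 200.0), (0.0, -3.0))]
new_centers = [(105.0, 51.0), (298.0, 196.0), (500.0, 400.0)]
result = candidate_groups(prev_groups, new_centers, radius=10.0)
print(result)  # [[0], [1]]
```

The third new group lies in nobody's search space; in the pipeline above it would correspond to a newly emerging group rather than a matching candidate.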

We detail the foundation on how to establish a statistical measure for feature matching between patches with a matching score, and improve the base model with bi-directional matching to filter low confidence matches. Additionally, we extend the model with a locally adaptive clustering approach and adapt the underlying statistics for the probabilistic model. We show that the final probabilistic measure for feature matching only depends on the number of neighbouring feature points within a group and the number of supporting matches.

3.1 Dynamic Local Motion Model

We propose a fast and dynamic feature clustering approach that exploits the tendency of many feature detection operators to form clusters around descriptive landmarks in the scene. For this, Union-Find Disjoint Sets (UFDS) [24] is utilized for efficient grouping of close-by feature points. The data structure models groups in DynaMiTe as a collection of disjoint sets.

UFDS is essentially a forest of multi-way trees, where each tree represents a disjoint subset of elements. A forest of trees can be implemented as an array p of size N for N items, where p[i] records the index of the parent of item i. If p[i] = i, then item i is the root of its tree and also the representative item of the subset that contains item i (cf. Fig. 4).

This allows us to determine which set an item belongs to, check whether two items belong to the same set, and merge two disjoint sets into one in nearly constant amortized time. In our two-dimensional implementation, items are feature points and sets are groups. The efficiency of these operations is crucial, as identifying the group of a keypoint is a frequent operation and the correctness of every match is examined through the correlation between two groups.
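A minimal sketch of such a structure is given below, assuming path compression and union by size, which are the usual way to obtain the near-constant amortized cost. The grouping function is a simplification for illustration only: it merges points by a pairwise pixel distance in a quadratic loop, and `max_dist`, `min_size` and `max_size` are hypothetical parameter names, not the parametrization used in the paper.

```python
class UFDS:
    """Union-Find Disjoint Sets with path compression and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))  # parent[i] == i marks a root
        self.size = [1] * n           # number of items below each root

    def find(self, i):
        # First pass: walk up to the root of the tree containing i.
        root = i
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every visited node directly at the root.
        while self.parent[i] != root:
            self.parent[i], i = root, self.parent[i]
        return root

    def union(self, a, b, max_size=None):
        """Merge the sets of a and b; refuse if the result exceeds max_size."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if max_size is not None and self.size[ra] + self.size[rb] > max_size:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra           # attach the smaller tree to the larger
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def group_features(points, max_dist, min_size, max_size):
    """Group 2D points closer than max_dist; keep groups within the size bounds."""
    sets = UFDS(len(points))
    for i, (xi, yi) in enumerate(points):
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= max_dist ** 2:
                sets.union(i, j, max_size)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(sets.find(i), []).append(i)
    return [g for g in groups.values() if len(g) >= min_size]


points = [(10, 10), (12, 11), (11, 13), (200, 200), (202, 201), (400, 50)]
groups = group_features(points, max_dist=5, min_size=2, max_size=10)
print(groups)  # [[0, 1, 2], [3, 4]]
```

The isolated point at (400, 50) forms a singleton set and is dropped by the minimum group size, which mirrors the idea that only sufficiently populated groups can provide support matches.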

Our 2D UFDS data structure considers the maximum size of a cluster in pixels as well as the minimum and maximum number of features per group. This is justified by the probabilistic model derived hereafter (cf. Sec. 3.2). The analytic matching probabilities yield an interval of admissible group sizes, which serves as a quality criterion and which we also use in all our experiments. Cluster centers are initialized at random over the set of all extracted feature points. Algorithm details can be found in the suppl. material.

Figure 4: Overview of UFDS for efficient clustering of feature points.

3.2 Probabilistic Model

Similar to [5], we assume that features within a close vicinity will match with high probability to the same area in another image taken from a different viewpoint, so that matches of close-by features can reinforce each other. After feature points have been grouped with our proposed clustering algorithm, we analyze all enclosed features per intersecting groups between video frames. The rate of feature matches between patches compared to the number of enclosed keypoints gives a measure of certainty for the match. We can derive a probabilistic model by examining the matching events between correlated and uncorrelated image patches and deduce a binomial distribution which only depends on the number of enclosed keypoints in the patch. More specifically, we can define a threshold for a true positive match as the minimum number of supporting feature matches between two groups relative to their enclosed keypoints.

Description
  A feature in patch A matches correctly
  A feature in patch A matches incorrectly
  Patches A and B view the identical location
  Patches A and B view a different location
  NN of the feature is in patch B
  NN of the feature is NOT in patch B
  Probability that a feature in A matches correctly AND its NN is in B
  Probability that a feature in A matches wrongly AND its NN is in B
  Probability that a feature in A matches correctly GIVEN
Table 1: Overview of used notation. NN = Nearest Neighbor in feature space

Figure: Illustration of possible events during feature matching. See text for description and Tab. 1 for notation.

The event illustration shows the possible matching events (see Table 1 for notation). When observing corresponding patches (green case) in two images, we can observe a feature (green star) in patch A that has its nearest neighbor (NN) in descriptor space in patch B. This feature can either be correctly matched, or mismatched with some other feature in B while its true NN still lies in B. We observe that those mismatches still contribute as "noisy" support matches between the patches. Similar observations can also be made for the false case, in which we analyze uncorrelated patches (i.e. patches A and B do not show identical regions of the scene), where the feature is mismatched to its NN in B. By analyzing the matching events, it becomes apparent that there is a high probability of finding multiple matches between correlated patches which support each other.

Mathematical Justification. Let be one of features in , which we denote to correctly match to some feature out of features in as with . In case that feature matches wrongly (i.e. ), its NN can be any of the other features in B. Thus, we can write


For correlated patches (case ), we denote the probability that a feature in has its NN in by . Examining the possible cases for as depicted in Fig. 1, this consists of a correct match , or a mismatch while the NN is still in patch . Therefore we can write:


With the assumption of independence for single feature matches, we are independent of . Using Bayes' rule, the notation from Table 1 and with Eq. 2, we get:


We assume that each group can be treated equally and that groups have similar numbers of features . Analogously for uncorrelated patches and (case ) we can derive:
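As a hedged sketch, the relations described in this section can be written in the notation of the analogous derivation in GMS [5]. All symbols are assumptions and may differ from those used in this paper: $f_a$ is a feature in patch $A$, $f_a^t$ and $f_a^f$ denote a correct and a wrong match, $f_a^b$ the event that its NN lies in patch $B$, $T^{ab}$ and $F^{ab}$ the correlated and uncorrelated case, $t$ the probability of a correct match, $m$ the number of features in $B$, $M$ the number of candidate features in the second image, and $\beta$ a factor accounting for violations of the uniformity assumption.

```latex
\[
\begin{aligned}
p(f_a^b \mid f_a^f) &= \beta \frac{m}{M}, \\
p_t = p(f_a^b \mid T^{ab})
  &= p(f_a^t \mid T^{ab}) + p(f_a^f \mid T^{ab})\, p(f_a^b \mid f_a^f)
   = t + (1 - t)\, \beta \frac{m}{M}, \\
p_f = p(f_a^b \mid F^{ab})
  &= p(f_a^f \mid F^{ab})\, p(f_a^b \mid f_a^f)
   = (1 - t)\, \beta \frac{m}{M}.
\end{aligned}
\]
```

In this form, the gap between $p_t$ and $p_f$ is exactly $t$, so the separation of the two cases does not vanish as long as correct matches occur with non-zero probability.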


3.3 False Positive Reduction

Assuming some feature matches correctly or incorrectly with the same chances, i.e. , and with , we get a wide separation between and (see Eqs. (4) and (5)). However, this is partly due to including noisy false positive matches, which is not desirable (compare noise for GMS in Fig. 1). To reduce noisy false positive matches (e.g. event () in Fig. 1), we introduce a consistency check via bidirectional matching (compare Fig. 5). However, bidirectional matching has an influence on the terms in Eq. 4. Details on the derivation below are given in the suppl. material.

Figure 5: Cross check matching for improved robustness and reduced noise. For normal matching ( to or to ), false matches between patches would still contribute to the matching probability as false positives (cf. hatched areas). Cross check consistency results in the union . Area size does not depict probability.
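The consistency check itself is the standard mutual nearest neighbor test, sketched below on toy descriptors. The function names and the squared Euclidean distance are illustrative choices; for binary descriptors such as ORB one would typically use the Hamming distance instead, for example via the `crossCheck` option of OpenCV's `BFMatcher`.

```python
def nearest_neighbors(desc_a, desc_b):
    """Index of the closest descriptor in desc_b for each descriptor in desc_a."""
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return [min(range(len(desc_b)), key=lambda j: sq_dist(d, desc_b[j]))
            for d in desc_a]

def cross_check(desc_a, desc_b):
    """Keep (i, j) only if j is the NN of i and i is the NN of j."""
    a_to_b = nearest_neighbors(desc_a, desc_b)
    b_to_a = nearest_neighbors(desc_b, desc_a)
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]

# Toy 2D descriptors: features 1 and 2 in A compete for feature 1 in B.
desc_a = [(0.0, 0.0), (1.0, 1.0), (0.9, 1.2)]
desc_b = [(0.1, 0.0), (1.0, 1.1)]
matches = cross_check(desc_a, desc_b)
print(matches)  # [(0, 0), (1, 1)]
```

Feature 2 of the first set also has its nearest neighbor in the second set, but the reverse lookup prefers feature 1, so the one-directional match (2, 1) is discarded as inconsistent.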

True Matches. Given correct patch associations , cross check helps to reduce noisy matches. Let, similar to Eq. (3), the probability of a feature in having its NN in under cross check be , then it holds:


As before (see Eqs. (3) and (4)), with and being the equivalent for and and by substitution after the binomial expansion, we can reduce this to:


False Matches. In analogy for uncorrelated patches it holds:


3.4 Robust Statistics

Naive bidirectional matching between all features is expensive, especially for the extraction of a large quantity (around 1E5) of uniformly distributed features in the image as in [5]. Additionally, the fraction of becomes small, as a few features in a patch are compared against all features of the entire image, which would reduce the separation between and .

As our proposed model embeds spatial and temporal information and can serve as a motion proxy of the displacement of encapsulated feature points, the potential feature matches are restricted to the intersecting clusters within a certain search space. Thus, not only the computational bottleneck is reduced, but also decreases significantly. With the assumption of small inter-frame motion, the number of features in and are similar () and the fraction in Eq. 7 and Eq. 8 reaches , yielding again a wide separation between and . Additionally, we suppose to be larger than which further widens the separation.

Matching Quality Criterion. Matching of an individual feature is generally independent of other features. Thus, we can use the derivations from above, similar to [5], to formulate a binomial distribution which describes the probability of finding additional support matches between correlated or uncorrelated groups for some feature match . Our matching quality criterion depends on the number of feature points in a patch:


From a statistical viewpoint, this allows us to formulate a reliable criterion to decide whether or not two groups are correlated and therefore enclose true matches. The objective is to identify a wide separation between true and false cases. Such a division is given, if one event is at least standard deviations apart from the mean (cf. Fig. 6). This reduces the probabilities to a simple threshold .

Figure 6: Qualitative illustration of the matching quality criterion together with the support threshold . True and False cases have a wide separation dependent on the number of feature points in the cluster.

As is small (see Eqs. (8) and  (11)) and is mainly dependent on the number of features (for in Eq. (11), the becomes very small), we can write the support threshold as:


For a given number of features in a group, we can compute the support threshold and compare it with the number of other supporting matches between the patches. If the number of supporting matches is higher than the threshold, the patches are considered correlated and the feature matches between them are identified as correct.
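The decision rule can be sketched numerically under stated assumptions. The binomial mean and standard deviation are standard results; the threshold form `alpha * sqrt(n)` is borrowed from GMS [5] and is only an assumed stand-in for the threshold of this paper, and `alpha` as well as the example probabilities are illustrative values, not reported ones.

```python
from math import sqrt

def binomial_stats(n, p):
    """Mean and standard deviation of a binomial distribution B(n, p)."""
    return n * p, sqrt(n * p * (1.0 - p))

def separation(n, p_t, p_f):
    """Distance between the true and false means in units of summed std. devs."""
    mean_t, std_t = binomial_stats(n, p_t)
    mean_f, std_f = binomial_stats(n, p_f)
    return (mean_t - mean_f) / (std_t + std_f)

def support_threshold(n, alpha):
    """GMS-style threshold alpha * sqrt(n); an assumed form, not the paper's."""
    return alpha * sqrt(n)

def is_supported(num_support, n, alpha):
    """Accept a group pair if it has more support matches than the threshold."""
    return num_support > support_threshold(n, alpha)

# With n = 16 features per group and alpha = 1.5 the threshold is 6 matches.
print(support_threshold(16, 1.5))   # 6.0
print(is_supported(8, 16, 1.5))     # True
print(is_supported(5, 16, 1.5))     # False
```

Because the means grow linearly with the number of features while the standard deviations grow only with its square root, `separation` increases with group size, which is why a minimum number of features per group is needed for a reliable decision.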

4 Experimental Evaluation

We quantitatively compare our method against a number of proposed classical matching approaches GMS [5], SIFT [37], SURF [4], ORB [51], BD [36], BF [34], GAIM [13], USC [50] as well as learning based methods DM [64] and LIFT [67]. We compare on different datasets with small (TUM [57]) and large (Kitti [25]) baselines as well as scenes with little texture (Cabinet [57]).

Evaluation aspects are based on matching accuracy, robustness and runtime. To quantify matching accuracy, we evaluate the accuracy of pose estimation from matched features and follow the evaluation protocol of Bian et al. [5] and use their results for comparison on the TUM split. Pose success ratio is reported as a measure of correctly recovered poses under a certain error threshold. The pose is recovered by the estimated essential matrix from feature matches with a RANSAC scheme. The improved results over the SOTA confirm that our proposed spatial-temporal probabilistic model is beneficial in a wide range of textured scenes and different baselines. We observe less convincing results in scenes with limited texture (”Cabinet”, see Fig. 1 as example) and explain this as limited ability to form feature groups for such scenes. We justify our assumption by analyzing the average inlier ratio of feature matches of the RANSAC scheme during pose estimation. Matching repeatability is analyzed as reprojection error of feature matches in static scenes. Additional qualitative results are provided as well as an ablation study by disabling parts of the method, thus examining the limitations of our approach.
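The pose success ratio can be sketched as the fraction of frame pairs whose pose error stays below a threshold. The error values below are made up for illustration, and the choice of a single scalar error per frame pair (for example an angular error in degrees) is an assumption rather than the exact protocol of [5].

```python
def pose_success_ratio(errors, threshold):
    """Fraction of frame pairs whose pose error is at or below the threshold."""
    if not errors:
        return 0.0
    return sum(e <= threshold for e in errors) / len(errors)

# Hypothetical per-frame pose errors (e.g. in degrees).
errors_deg = [0.5, 1.2, 3.0, 7.5, 12.0]
curve = [(t, pose_success_ratio(errors_deg, t)) for t in (1.0, 5.0, 10.0)]
print(curve)  # [(1.0, 0.2), (5.0, 0.6), (10.0, 0.8)]
```

Sweeping the threshold in this way yields a success-ratio curve of the kind plotted against the pose error threshold.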

All experiments are conducted on an Intel Core i7 CPU. We use the publicly available ORB implementation of OpenCV [6]. For more details on the maximum number of extracted feature points and parametrization of UFDS please refer to the suppl. material.

Matching Accuracy. Evidently, DynaMiTe outperforms other methods in textured scenarios [57], as the full potential of our joint formulation of spatial and temporal constraints can unfold (Fig. 7 [Left]).

Figure 7: Results on TUM Split [57] with varying scene structure. [Left] Matching Accuracy as pose success ratio against pose error threshold. [Right] Runtime vs. Accuracy as pose success ratio in relation to computation time (log time scale).

Runtime. We have tested runtime performance on Kitti [25] and TUM [57]. Our method outperforms SIFT and optical flow (OF) [38] as baselines and even GMS [5] with GPU acceleration (compare GMS-GPU [5] in Tab. 2).

        OF [38]   SIFT [37]   GMS [5]   GMS-GPU   Ours
Kitti   14        18          3         12        44
TUM     48*       22          4         14        63
Table 2: Runtime in frames per second (fps). For OF (*) we report the fastest observed fps, as it varies extensively depending on the scene structure and camera displacement.

Runtime vs. Accuracy. For better comparison we evaluate accuracy against runtime (cf. Fig. 7 [Right]). DynaMiTe consistently outperforms other methods in terms of runtime vs. success ratio.

Low-Texture scene. For the low-texture scene ”Cabinet”, tracking a large number of group associations throughout the entire sequence is challenging. Only a small number of groups with enough feature points cluster around well defined landmarks. DynaMiTe still performs on par with other methods, which also have difficulties in this scenario and ranks top in terms of runtime vs. accuracy (compare Fig. 8).

Figure 8: Results on low-texture scene ”Cabinet” [57], analogous to Fig. 7.

Inlier ratio. To analyze the inferior results on ”Cabinet”, we add the inliers of RANSAC during camera pose estimation in Tab. 3. The results reflect our findings, as both pose success and inlier ratio for the textured scenes are superior with DynaMiTe, whereas the Cabinet scene with little structure is challenging.

            OF [38]   SIFT [37]   SIFT*   GMS [5]   Ours
TUM Split   0.58      0.16        0.54    0.18      0.32
Cabinet     0.50      0.20        0.61    0.24      0.22
Kitti       0.37      0.11        0.64    0.85      0.87
Table 3: Avg. inlier ratio of RANSAC scheme for pose recovery relative to matches. SIFT* includes additional filtering of matches with ratio test.

Matching repeatability. Tab. 4 summarizes the average match reprojection error for different static scenes from the TILDE webcam dataset [61]. For a perfect match, the norm would be 0, as the scene and the camera remain static throughout the video. This metric can be interpreted as a measure of matching repeatability and of the accuracy of the matching scheme, as high errors indicate wrong and noisy matches and the inability to robustly handle repetitive patterns. DynaMiTe considerably outperforms SIFT as baseline and achieves lower errors than GMS on all scenes except Frankfurt.

            Chamonix   Courbevoie   Frankfurt   Mexico   Panorama   St. Louis
SIFT [37]   196.03     184.70       298.17      175.24   592.85     215.27
GMS [5]     3.48       4.34         7.21        9.47     142.10     8.33
Ours        1.92       2.47         9.45        6.75     2.80       3.97
Table 4: Feature matching repeatability test on TILDE dataset [61] as average reprojection error in pixels.
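For a static scene and a static camera, this error reduces to the pixel distance between the two keypoints of each match. A minimal sketch of that computation follows; the function name and the example keypoints are made up for illustration.

```python
from math import hypot

def mean_static_displacement(keypoints_a, keypoints_b, matches):
    """Average pixel distance between matched keypoints of two frames."""
    if not matches:
        return 0.0
    total = sum(hypot(keypoints_a[i][0] - keypoints_b[j][0],
                      keypoints_a[i][1] - keypoints_b[j][1])
                for i, j in matches)
    return total / len(matches)

# One perfect match and one match that is off by (3, 4) pixels.
kp_a = [(10.0, 10.0), (50.0, 80.0)]
kp_b = [(10.0, 10.0), (53.0, 84.0)]
err = mean_static_displacement(kp_a, kp_b, [(0, 0), (1, 1)])
print(err)  # 2.5
```

A single gross mismatch across the image adds hundreds of pixels to the sum, which is why noisy matching schemes show up so clearly in this metric.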

As an additional measure to the evaluation in Tab. 4, the error relative to the number of extracted feature points for GMS and our method is analyzed. We calculate the average error normalized per features for each sequence and report the average of those as for GMS and for DynaMiTe, which underlines the favourable efficiency and accuracy of our approach.

Qualitative Robustness Evaluation. We present additional qualitative results on matching robustness in different scenes. Our method filters out noisy, not meaningful matches of the texture-less background. Furthermore, our proposed cluster grouping and spatial-temporal formulation robustly tracks reliable features around landmarks with high image information (e.g. edges and corners of the cabinet, see Fig. 1 and 9 [Left]). DynaMiTe can also handle repetitive patterns in the Kitti sequence, such as the windows on the white building, due to its local clustering algorithm, whereas regular grids such as in GMS fail (see Fig. 9 [Right]).

Figure 9: Qualitative robustness comparison on TUM [57] [Left] and Kitti [25] [Right] dataset. Ours (top) filters noisy and wrong matches in textureless regions and around repetitive patterns.

Ablation Study. The experiments above show applicability on small and wide baseline scenarios (TUM/Kitti). Here, we specifically force the algorithm to only keep matches between groups which have been matched throughout the sequence of consecutive frames and not to establish new group associations between frames. Due to large inter-frame forward motion, only a few groups in the center of the image are reliably visible throughout all frames. While our assumptions hold true for consecutive frames, tracking the complete sequence from frame at time step to is problematic as our constraints are violated in this particular setting. Fig. 10 illustrates the limitations of our proposed method in this specific case. DynaMiTe can still be applied in scenarios with very large baselines, however at the cost of a relaxed constraint for inter-frame motion by increasing the search space for the temporal motion prior.

Figure 10: Consecutive frame matching (top) in comparison to limited matching capabilities through multiple frames (bottom) in seq. of [25].

5 Discussion

The reported results clearly show the fundamental trade-off between the ability to correctly match feature points and compliance with the runtime constraint for different matching methods. DynaMiTe reduces this limitation with its joint formulation, as it efficiently passes information throughout the sequence, encapsulated in the joint spatial-temporal model. This enables very robust feature matching, as well as reduced noise in low-textured scenes, and high frame rates without GPU acceleration. High-confidence noise-free feature matches are beneficial for camera pose estimation, which is what our method focuses on. The same holds true for reconstruction purposes, one of the various possible application scenarios for DynaMiTe. Generally, our proposed pipeline utilizes solely the information of the feature descriptor and its pixel location in the image, while being agnostic to the underlying descriptor itself. Our model achieves robust feature matching even in difficult scenarios and under arbitrary inter-frame motion such as scaling and in-plane rotations, as we rely neither on regular grids nor on restrictive clustering methods.


  • [1] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk (2016) Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC, Vol. 1, pp. 3. Cited by: §2.
  • [2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman (2009) PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (ToG) 28 (3), pp. 24. External Links: Link, ISBN 9781605587264, Document, ISSN 07300301 Cited by: §2.
  • [3] A. Baumgarten (2013) Metaphysics: A Critical Translation with Kant’s Elucidations, Selected Notes, and Related Materials. A&C Black. Cited by: §1.
  • [4] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool (2008) Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding 110 (3), pp. 346–359. External Links: Link, Document, ISSN 1077-3142 Cited by: §2, §4.
  • [5] J. Bian, W. Lin, Y. Matsushita, S. Yeung, T. Nguyen, and M. Cheng (2017-07) GMS: Grid-Based Motion Statistics for Fast, Ultra-Robust Feature Correspondence. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2828–2837. External Links: Link, Document Cited by: Figure 1, §2, §3.2, §3.4, §3.4, §3, §3, Table 2, Table 3, Table 4, §4, §4, §4.
  • [6] G. Bradski (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools. Cited by: §4.
  • [7] B. Busam, T. Birdal, and N. Navab (2018) Camera Pose Filtering with Local Regression Geodesics on the Riemannian Manifold of Dual Quaternions. In Proceedings - 2017 IEEE International Conference on Computer Vision Workshops, ICCVW 2017, Vol. 2018-Janua, pp. 2436–2445. External Links: Link, ISBN 9781538610343, Document Cited by: §2.
  • [8] B. Busam, M. Esposito, B. Frisch, and N. Navab (2016-10) Quaternionic Upsampling: Hyperspherical Techniques for 6 DoF Pose Tracking. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 629–638. External Links: Link, ISBN 978-1-5090-5407-7, Document Cited by: §2.
  • [9] B. Busam, P. Ruhkamp, S. Virga, B. Lentes, J. Rackerseder, N. Navab, and C. Hennersperger (2018) Markerless Inside-Out Tracking for 3D Ultrasound Compounding. In Simulation, Image Processing, and Ultrasound Systems for Assisted Diagnosis and Navigation, Cham, pp. 56–64. External Links: ISBN 978-3-030-01045-4 Cited by: §1.
  • [10] M. Chli and A. Davison (2008-10) Active matching. In ECCV, pp. 72–85. External Links: Document Cited by: §2.
  • [11] C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker (2016) Universal correspondence network. In Advances in Neural Information Processing Systems, pp. 2414–2422. Cited by: §2.
  • [12] O. Chum and J. Matas (2005) Matching with PROSAC — Progressive Sample Consensus. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, Vol. 1, pp. 220–226. External Links: Link, ISBN 0-7695-2372-2, Document, ISSN 1063-6919 Cited by: §2.
  • [13] T. Collins, P. Mesejo, and A. Bartoli (2014) An analysis of errors in graph-based keypoint matching and proposed solutions. In European Conference on Computer Vision, pp. 138–153. Cited by: §4.
  • [14] D. Cremers (2017) Direct methods for 3d reconstruction and visual slam. In 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA), pp. 34–38. Cited by: §1.
  • [15] G. Csurka and M. Humenberger (2018-07) From handcrafted to deep local invariant features. arXiv preprint arXiv:1807.10254. External Links: Link, Document, ISSN 15206904 Cited by: §2.
  • [16] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse (2007) MonoSLAM: real-time single camera slam. IEEE Transactions on Pattern Analysis & Machine Intelligence, pp. 1052–1067. Cited by: §1.
  • [17] D. DeTone, T. Malisiewicz, and A. Rabinovich (2017-12) SuperPoint: Self-Supervised Interest Point Detection and Description. CVPR Deep Learning for Visual SLAM Workshop. External Links: Link Cited by: §2.
  • [18] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler (2019) D2-net: a trainable cnn for joint detection and description of local features. arXiv preprint arXiv:1905.03561. Cited by: §2.
  • [19] J. Engel, V. Koltun, and D. Cremers (2017) Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (3), pp. 611–625. Cited by: §1.
  • [20] J. Engel, T. Schöps, and D. Cremers (2014) LSD-SLAM: Large-Scale Direct Monocular SLAM. In European Conference on Computer Vision (ECCV), Cham, pp. 834–849. External Links: ISBN 978-3-319-10604-5, Document, ISSN 16113349 Cited by: §1, §1.
  • [21] G. Farnebäck (2003) Two-Frame Motion Estimation Based on Polynomial Expansion. In Image Analysis, pp. 363–370. External Links: Link, Document Cited by: §2.
  • [22] K. Fathian, J. P. Ramirez-Paredes, E. A. Doucette, J. W. Curtis, and N. R. Gans (2017) Quaternion based camera pose estimation from matched feature points. arXiv preprint arXiv:1704.02672. Cited by: §2.
  • [23] M. A. Fischler and R. C. Bolles (1981-06) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395. External Links: Link, ISBN 0934613338, Document, ISSN 00010782 Cited by: §2.
  • [24] B. A. Galler and M. J. Fisher (1964) An Improved Equivalence Algorithm. Communications of the ACM 7 (5), pp. 301–303. External Links: Link Cited by: §3.1.
  • [25] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237. Cited by: Figure 12, Figure 13, item 4, Figure 10, Figure 9, §4, §4.
  • [26] Y. HaCohen, E. Shechtman, D. B. Goldman, and D. Lischinski (2011) Non-rigid dense correspondence with applications for image enhancement. In ACM transactions on graphics (TOG), External Links: Link, ISBN 9781450309431, Document, ISSN 07300301 Cited by: §2.
  • [27] K.J. Hanna (1991) Direct multi-resolution estimation of ego-motion and structure from motion. In Visual Motion, 1991., Proceedings of the IEEE Workshop on, pp. 156–162. External Links: Link, ISBN 0-8186-2153-2, Document Cited by: §1.
  • [28] J. A. Hesch and S. I. Roumeliotis (2011) A Direct Least-Squares (DLS) method for PnP. In 2011 International Conference on Computer Vision, pp. 383–390. External Links: Link, ISBN 978-1-4577-1102-2, Document Cited by: §2.
  • [29] B. K.P. Horn and B. G. Schunck (1981) Determining optical flow. Artificial Intelligence 17 (1-3), pp. 185–203. External Links: Link, ISBN 0867204524, Document, ISSN 00043702 Cited by: §2.
  • [30] G. Klein and D. W. Murray (2006) Full-3D Edge Tracking with a Particle Filter. In BMVC, pp. 1119–1128. Cited by: §1.
  • [31] G. Klein and D. Murray (2007) Parallel Tracking and Mapping for Small AR Workspaces. In Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pp. 225–234. Cited by: §1, §2.
  • [32] R. Kummerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard (2011) G2o: A general framework for graph optimization. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pp. 3607–3613. External Links: Link, ISBN 978-1-61284-386-5, Document Cited by: §1.
  • [33] V. Lepetit, F. Moreno-Noguer, and P. Fua (2009) EPnP: An Accurate O(n) Solution to the PnP Problem. International Journal of Computer Vision 81 (2), pp. 155–166. External Links: Link, Document, ISSN 0920-5691 Cited by: §2.
  • [34] W. D. Lin, M. Cheng, J. Lu, H. Yang, M. N. Do, and P. Torr (2014) Bilateral Functions for Global Motion Modeling. In European Conference on Computer Vision, pp. 341–356. External Links: Link, Document Cited by: §2, §4.
  • [35] W. Lin, F. Wang, M. Cheng, S. Yeung, P. H.S. Torr, M. N. Do, and J. Lu (2018) CODE: Coherence Based Decision Boundaries for Feature Correspondence. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (1), pp. 34–47. External Links: Link, Document, ISSN 0162-8828 Cited by: §2.
  • [36] Y. Lipman, S. Yagev, R. Poranne, D. W. Jacobs, and R. Basri (2014) Feature matching with bounded distortion. ACM Transactions on Graphics (TOG) 33 (3), pp. 26. Cited by: §4.
  • [37] D. G. Lowe (2004) Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60 (2), pp. 91–110. Cited by: Figure 1, §2, §2, Table 2, Table 3, Table 4, §4.
  • [38] B. D. Lucas and T. Kanade (1981) An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’81, San Francisco, CA, USA, pp. 674–679. External Links: Link Cited by: §2, Table 2, Table 3, §4.
  • [39] J. Ma, J. Zhao, J. Jiang, H. Zhou, and X. Guo (2017) Locality Preserving Matching. International Journal of Computer Vision, pp. 1–20. External Links: Link, Document, ISSN 0920-5691 Cited by: §2.
  • [40] L. Matthies, R. Szeliski, and T. Kanade (1988) Incremental estimation of dense depth maps from image sequences. In Computer Vision and Pattern Recognition, 1988. Proceedings CVPR’88., Computer Society Conference on, pp. 366–374. Cited by: §1.
  • [41] D. Mintz, P. Meer, and A. Rosenfeld (1992) Analysis of the least median of squares estimator for computer vision applications. In Computer Vision and Pattern Recognition, 1992. Proceedings CVPR’92., 1992 IEEE Computer Society Conference on, pp. 621–623. External Links: Link, ISBN 0-8186-2855-3, Document Cited by: §2.
  • [42] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas (2017) Working hard to know your neighbor’s margins: local descriptor learning loss. In Advances in Neural Information Processing Systems, pp. 4826–4837. Cited by: §2.
  • [43] J. Morel and G. Yu (2009) ASIFT: A New Framework for Fully Affine Invariant Image Comparison. SIAM Journal on Imaging Sciences 2 (2), pp. 438–469. External Links: Link, Document, ISSN 1936-4954 Cited by: §2.
  • [44] M. Muja and D. G. Lowe (2009) Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP International Conference on Computer Vision Theory and Applications, pp. 331–340. External Links: Link Cited by: §2.
  • [45] R. Mur-Artal and J. D. Tardos (2017-10) ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262. External Links: Link, Document, ISSN 1552-3098 Cited by: §1, §1, §2.
  • [46] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon (2011) KinectFusion: Real-time dense surface mapping and tracking. In Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on, pp. 127–136. Cited by: §1.
  • [47] D. Nister (2004-06) An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (6), pp. 756–770. External Links: Link, Document, ISSN 0162-8828 Cited by: §2.
  • [48] Y. Ono, E. Trulls, P. Fua, and K. M. Yi (2018) LF-net: learning local features from images. In Advances in Neural Information Processing Systems, pp. 6234–6244. Cited by: §2.
  • [49] A. Penate-Sanchez, J. Andrade-Cetto, and F. Moreno-Noguer (2013) Exhaustive Linearization for Robust Camera Pose and Focal Length Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (10), pp. 2387–2400. External Links: Link, Document, ISSN 0162-8828 Cited by: §2.
  • [50] R. Raguram, O. Chum, M. Pollefeys, J. Matas, and J. Frahm (2013) USAC: a universal framework for random sample consensus. IEEE Trans. Pattern Anal. Mach. Intell. 35 (8), pp. 2022–2038. Cited by: §4.
  • [51] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski (2011-11) ORB: An efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE international conference on, pp. 2564–2571. External Links: Link, ISBN 978-1-4577-1102-2, Document, ISSN 1550-5499 Cited by: Figure 1, §2, §4.
  • [52] P. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich (2019) SuperGlue: learning feature matching with graph neural networks. External Links: 1911.11763 Cited by: §2.
  • [53] X. Shen, C. Wang, X. Li, Z. Yu, J. Li, C. Wen, M. Cheng, and Z. He (2019) RF-net: an end-to-end image matching network based on receptive field. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8132–8140. Cited by: §2.
  • [54] K. Shoemake (1985) Animating rotation with quaternion curves. In Proceedings of the 12th annual conference on Computer graphics and interactive techniques - SIGGRAPH ’85, pp. 245–254. External Links: Link, ISBN 0897911660, Document Cited by: §2.
  • [55] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer (2015) Discriminative Learning of Deep Convolutional Feature Point Descriptors. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 118–126. External Links: Link, ISBN 978-1-4673-8391-2, Document Cited by: §2.
  • [56] H. Strasdat, A. J. Davison, J.M.M. Montiel, and K. Konolige (2011) Double window optimisation for constant time visual SLAM. In Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2352–2359. External Links: Link, ISBN 978-1-4577-1102-2, Document, ISSN 1550-5499 Cited by: §1.
  • [57] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers (2012) A benchmark for the evaluation of RGB-D SLAM systems. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 573–580. Cited by: Figure 13, Figure 1, item 4, Figure 7, Figure 8, Figure 9, §4, §4, §4.
  • [58] Y. Tian, B. Fan, and F. Wu (2017) L2-net: deep learning of discriminative patch descriptor in euclidean space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 661–669. Cited by: §2.
  • [59] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas (2019) SOSNet: second order similarity regularization for local descriptor learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11016–11025. Cited by: §2.
  • [60] N. Ufer and B. Ommer (2017) Deep semantic feature matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [61] Y. Verdie, Kwang Moo Yi, P. Fua, and V. Lepetit (2015-06) TILDE: A Temporally Invariant Learned DEtector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5279–5288. External Links: Link, ISBN 978-1-4673-6964-0, Document, ISSN 10636919 Cited by: item 4, §2, Table 4, §4.
  • [62] L. von Stumberg, V. Usenko, J. Engel, J. Stückler, and D. Cremers (2017) From monocular slam to autonomous drone exploration. In 2017 European Conference on Mobile Robots (ECMR), pp. 1–8. Cited by: §1.
  • [63] G. Wang, Y. Chen, and X. Zheng (2018) Gaussian field consensus: A robust nonparametric matching method for outlier rejection. Pattern Recognition 74, pp. 305–316. External Links: Link, Document, ISSN 00313203 Cited by: §2.
  • [64] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid (2013) DeepFlow: Large displacement optical flow with deep matching. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1385–1392. Cited by: §4.
  • [65] Xiao-Shan Gao, Xiao-Rong Hou, Jianliang Tang, and Hang-Fei Cheng (2003) Complete solution classification for the perspective-three-point problem. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (8), pp. 930–943. External Links: Link, Document, ISSN 0162-8828 Cited by: §2.
  • [66] Xufeng Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg (2015-06) MatchNet: unifying feature and metric learning for patch-based matching. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3279–3286. External Links: Document, ISSN 1063-6919 Cited by: §2.
  • [67] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua (2016) LIFT: Learned Invariant Feature Transform. European Conference on Computer Vision, pp. 467–483. External Links: Link Cited by: §2, §4.
  • [68] K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua (2018) Learning to Find Good Correspondences. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [69] A. L. Yuille and N. M. Grzywacz (1988) The Motion Coherence Theory. In 1988 Second International Conference on Computer Vision, pp. 344–353. External Links: Link, ISBN 0-8186-0883-8, Document Cited by: §2.
  • [70] Z. Zheng, Y. Ma, H. Zheng, J. Ju, and M. Lin (2018) UGC: Real-time, Ultra-robust Feature Correspondence via Unilateral Grid-based Clustering. IEEE Access. External Links: Link, Document, ISSN 2169-3536 Cited by: §2, §3.

Appendix 0.A Details on 3.3 False Positive Reduction

Here, we detail the derivation of Eq. (6) from the main paper for clarity.

True Matches. Let the probability of a feature in having its Nearest Neighbor in under cross check be , then it holds:

False Matches. In analogy for uncorrelated patches:

Appendix 0.B Runtime Analysis

Our method achieves real-time performance on the CPU for the full pipeline, from feature extraction and matching to applying our spatial and temporal constraints, without any GPU acceleration. In Fig. 11 the runtime advantage of our method over GMS is clearly visible: GMS is slower by a factor of 4 with GPU acceleration, and by a factor of 15 when run on the CPU only. The matching step, which accounts for the majority of the overall time consumption of GMS and other methods, is significantly reduced. The remaining bottleneck of our proposed method is the feature extraction itself.

Figure 11: Relative runtime comparison. Primary axis shows the relative time consumption per frame for each step of the feature matching pipeline in percentage (numbers are also depicted in the respective bar). Secondary axis (log scale) shows the overall time consumption relative to our proposed method.

Appendix 0.C Parameter Discussion

0.C.1 Feature Points


We limit the maximum number of extracted feature points in the image. Purely from the perspective of camera pose estimation, a small number of feature matches is sufficient. Our proposed method, however, needs a certain number of detected feature points in order to form local feature groups from feature clusters around well-defined structures in the image:



The FAST threshold of ORB is set to to ensure a high number of detected feature points while not compromising the feature descriptor quality:


0.C.2 Local Motion Model

The parameters of our local motion model are justified by our proposed probabilistic model. The chosen parameters were used throughout our evaluation and have thus been shown to be applicable to different image content and scenarios.

Group Area.

We define a maximum size for a local group in pixels:


For every feature, the algorithm finds its neighbors within a 30-pixel-by-30-pixel region centered on the feature. Combined with a size limit on groups, this allows nearby features to be merged into one group while keeping the group's extent in an appropriate range. This parameter may be adjusted for high-resolution images.
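
The neighborhood rule above can be sketched with a disjoint-set (union-find) structure, in the spirit of the equivalence algorithm [24] cited in §3.1. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `group_features`, the brute-force pairwise loop, and the default `max_size=50` are hypothetical, while the 30-pixel window comes from the text.

```python
from typing import List, Tuple


def group_features(points: List[Tuple[float, float]],
                   window: float = 30.0,
                   max_size: int = 50) -> List[int]:
    """Label each point with the root index of its group.

    Two points are linked when one lies inside the window x window square
    centered on the other. max_size is a hypothetical cap on the number of
    features per group; a merge that would exceed it is skipped.
    """
    parent = list(range(len(points)))
    size = [1] * len(points)

    def find(i: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    half = window / 2.0
    for i, (xi, yi) in enumerate(points):
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            if abs(xi - xj) > half or abs(yi - yj) > half:
                continue  # j is outside the square centered on i
            ri, rj = find(i), find(j)
            if ri == rj or size[ri] + size[rj] > max_size:
                continue  # already grouped, or the merge would be too large
            if size[ri] < size[rj]:
                ri, rj = rj, ri  # union by size: attach smaller to larger
            parent[rj] = ri
            size[ri] += size[rj]
    return [find(i) for i in range(len(points))]


# Two clusters of keypoint locations, far apart from each other.
pts = [(0, 0), (10, 5), (20, 10), (200, 200), (205, 210)]
labels = group_features(pts)
```

The pairwise loop is quadratic in the number of features and is used here only for brevity; a spatial index or grid lookup would be the natural choice for a real-time setting.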

Group Size.

Our probabilistic model requires a minimum number of features per group before the statistical criteria can be applied. A maximum number of feature points per group should also be set: it prevents a group from expanding excessively, and for very large numbers of features the quality criterion saturates.
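
As a small illustration of the lower bound, the sketch below keeps only groups with enough members for a statistical test. The function name `usable_groups` and the default `min_size=3` are hypothetical placeholders; the actual bounds are those given by the paper's parameters, which are not reproduced here.

```python
from collections import Counter
from typing import Dict, List


def usable_groups(labels: List[int], min_size: int = 3) -> Dict[int, int]:
    """Map each sufficiently large group label to its feature count.

    labels[i] is the group label of feature i. Groups with fewer than
    min_size features (a hypothetical threshold) are discarded.
    """
    counts = Counter(labels)
    return {group: n for group, n in counts.items() if n >= min_size}


# Group 0 has three features and is kept; group 3 has two and is dropped.
kept = usable_groups([0, 0, 0, 3, 3])
```

An upper bound would typically be enforced while groups are being formed rather than afterwards, so that oversized groups never arise in the first place.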


Appendix 0.D Qualitative Results

Figures 12, 13, 14 and 15 illustrate a few more qualitative results on different datasets. See the figure captions for more details.

Figure 12: Direct comparison of SIFT, GMS, and DynaMiTe (Ours) together with the corresponding runtime on a driving scene from [25].
Figure 13: Examples of feature point matches for the TUM-RGBD [57] and Kitti [25] datasets. Note the challenging scenes with blur (top right) and large rotations (bottom).
Figure 14: Ours (top) reliably tracks only stable feature points as opposed to GMS.
Figure 15: Generally, ours (left) has reduced noise and fewer false positive matches in textureless areas compared to GMS.

Appendix 0.E Algorithms

For a better understanding of our proposed method together with the source code, we provide an overview of the pipeline as pseudo-code. The overall pipeline is summarized in the DynaMiTe algorithm listing. The grouping algorithm for finding dynamic local feature groups is summarized in the grouping algorithm listing.