1 Introduction
Image feature matching (a.k.a. correspondence selection) is a cornerstone of many computer vision and robotic tasks, such as optical flow [baker2011database], structure-from-motion [snavely2008modeling], stereo matching [hirschmuller2008stereo], simultaneous localization and mapping [benhimane2004real], and image stitching [brown2007automatic]. The main purpose of image feature matching is to discover the corresponding relationship between the feature points of two images, which serves as the foundation for higher-level analysis. Despite being a trivial task for human vision, image feature matching is challenging for machines, especially in the presence of large variations in illumination and viewpoint. It is often nontrivial to pursue robust features invariant to illumination and viewpoint.
The problem is further complicated when the scene is dynamic instead of static. The majority of existing approaches have been developed for static scenes only, i.e., the corresponding relationship between two images is characterized by a global transformation (e.g., an affine or perspective transformation, as shown in Fig. 1a). Such single-consistency feature matching is not appropriate for dynamic scenes, in which there are multiple separated local transformations associated with several moving objects (e.g., Fig. 1b). Note that due to the existence of multiple consistencies, conventional wisdom for improving the robustness of single-consistency matching, such as RANSAC [fischler1981random] and USAC (a modified version of RANSAC) [Raguram2013USAC], easily fails.
In this project, we approach the problem of multi-consistency image feature matching by formulating it as a generalized clustering problem. The key insight behind our approach is that multiple consistencies between two images are determined by a collection of homographies corresponding to either planar surfaces in the background or independently moving objects in the foreground of dynamic scenes. In addition to correspondence establishment, determining the total number of consistencies (homographies) is a new issue that has not been addressed in the open literature. Given error-free correspondences between two images as input features, one could solve the problem by clustering in the feature space in a fashion similar to k-means (note that the number of clusters is often specified by the user). In view of practical limitations (i.e., local correspondences are error-prone), we have to develop robust clustering solutions insensitive to possible outliers in the local correspondences.
The motivation behind our approach is twofold. On one hand, game-theoretic matching (GTM) [albarelli2010game] has been developed as a powerful technique for establishing single-consistency correspondence even in the presence of elastic deformation [Rodola2014Elastic]; however, to the best of our knowledge, it has not been extended to multiple consistencies. The conventional GTM framework is inappropriate for multi-consistency feature matching because the inliers associated with different moving objects are incompatible, which violates the fundamental assumption of global geometric compatibility between feature correspondences. To overcome this limitation, we propose a novel payoff function that considers both geometric and descriptive compatibility. With the newly defined payoff function, we can play multiple local games simultaneously by following the classical evolutionary stable strategy (ESS) algorithm [weibull1997evolutionary].
On the other hand, we propose an iterative consistency clustering procedure to group compatible correspondences and estimate the unknown number of clusters based on the results of the local noncooperative games. Since the compatibility between two tentative matches determined by local games is measured by the newly defined payoff function, it is natural to use this compatibility-based metric to group correspondences with high compatibility and to infer the local transformation generated from each correspondence cluster. Conceptually similar to k-means, we alternate between compatibility-based clustering and local transformation estimation. During the iterations, image feature pairs falsely eliminated in local games can be recovered by clustering. More specifically, by calculating the consistency with the estimated local transformations, we can recover inliers because they should be consistent with at least one estimated local transformation. The iteration terminates whenever no new transformation can be found (i.e., the estimated number of clusters reaches its maximum).

The other contribution of this work concerns performance evaluation for multi-consistency feature matching. In conventional single-consistency matching, three metrics, i.e., precision (P), recall (R), and F-measure (F), have been used [lin2014bilateral; bian2017gms; ma2019locality]; but they are sensitive to the unbalanced saliency of different underlying consistencies and are therefore inappropriate for evaluating the performance of multi-consistency feature matching. Note that the distribution of keypoint-level correspondences is often sparse and nonuniform; therefore, the number of included correspondences is likely to vary significantly from cluster to cluster. To address this issue, we propose three new metrics for multi-consistency evaluation, i.e., weighted-precision (WP), weighted-recall (WR), and weighted-F-measure (WF). The key idea is to adaptively weight each correspondence based on the underlying consistency, with the aim of amplifying the effect of less salient consistencies.
The implementation details of the benchmark will be introduced in Sec. 5. (The code and dataset will be available at: https://github.com/sailorz/icGTM. This paper is an extended version of the conference paper [zhao2018scalable].)
In a nutshell, the contributions of this paper are summarized as follows:

A formulation of the multi-consistency image feature matching problem and a theoretical analysis of its relationship to single-consistency matching and generalized clustering.

A novel payoff function robust to common disturbances that guides both the playing of local games and the clustering of global consistencies, taking both geometric and descriptive compatibility into consideration.

An iterative clustering with Game-Theoretic Matching (icGTM) framework for multi-consistency image feature matching, which significantly outperforms other competing methods on both multi-consistency and single-consistency datasets.
2 Related work
Parametric algorithms. A popular strategy for correspondence selection is based on the classical RANSAC [Raguram2013USAC]. In RANSAC, one alternately samples a subset of correspondences to generate a hypothesized parametric model and verifies the confidence of the generated model by some geometric metrics (e.g., reprojection errors and epipolar distances). These metrics can also be used to select consistent correspondences under the constraint of the finally estimated model. However, the hypothesis generated by random sampling is sensitive to the inlier ratio of the initial correspondence set. The confidence of hypothesis testing tends to decline rapidly if the majority of the initial correspondences are incorrect. Although some efforts, e.g., PROSAC [chum2005matching], LO-RANSAC [chum2003locally], and USAC [Raguram2013USAC], have been proposed to improve the robustness to low initial inlier ratios, parametric methods still have fundamental limitations in the scenarios of nonrigid feature matching and multi-consistency feature matching.

For nonrigid feature matching, the underlying transformation between two images is too complex to be accurately represented by a global transformation, e.g., a homography matrix or an essential matrix. For multi-consistency feature matching, some works fit multiple parametric models by generating multiple sampled hypotheses [magri2015robust; chin2010accelerated; wong2011dynamic]. These approaches generally suppose that correspondences associated with the same structure share a common hypothesis. For example, Chin et al. proposed a guided-sampling scheme (Multi-GS) [chin2010accelerated], in which a series of hypotheses is generated in advance and correspondences on the same structure are expected to have similar preference lists ordered according to their compatibility with the hypotheses. However, image feature matching and image segmentation are intertwined in the multi-consistency setting (i.e., a chicken-and-egg problem).
Nonparametric algorithms. An alternative approach to correspondence selection is via nonparametric models [albarelli2010game; Ma2014Robust; bian2017gms; ma2019locality]. For instance, in game-theoretic matching (GTM) [albarelli2010game], inliers are assumed to be compatible with each other, which results in a larger payoff in a noncooperative game. The vector field consensus (VFC) [Ma2014Robust] approach is based on the assumption that the noise around inliers and outliers follows a Gaussian distribution and a uniform distribution, respectively. It follows that a maximum a posteriori (MAP) estimate of a mixture model, with latent variables determining the inliers, can be obtained by the EM algorithm. Grid-based motion statistics (GMS) [bian2017gms] is based on the observation that matching quality is positively correlated with the number of correspondences in small grid regions under the assumption of motion smoothness. Most recently, a locality preserving matching (LPM) [ma2019locality] method was developed based on the observation that the spatial neighborhood relationship between the two keypoints of a correct match should be well preserved. Although these nonparametric approaches can be applied to multi-consistency feature matching, their performance is prone to decline under certain specific challenges, e.g., large-scale rotation, translation, or/and zoom. How to discover and recognize the potential consistent relationships among the selected correspondences remains an open research problem for nonparametric methods.

Learning-based algorithms.
Recently, deep learning has found many successful applications in image processing and computer vision, such as image classification, image segmentation, and object detection and recognition. Naturally, it is desirable to pursue a learning-based approach toward correspondence selection. Some attempts [yi2018learning; zhao2019nm] have been made along this direction, which translate the correspondence selection problem into a per-match binary classification problem (i.e., inlier vs. outlier). However, these supervised learning-based approaches require enormous amounts of annotated training data; acquiring such annotations for the multi-consistency feature matching task is often impractical because workers have to manually label thousands of correspondences in an image pair.

3 Method
Fig. 2 gives an overview of our icGTM framework, which consists of four steps. The Initialization step generates initial correspondences that are prone to a large number of mismatches; the Block Matching step (Sec. 3.2) divides the images into non-overlapping blocks and searches for matched block pairs; the Local Games step (Sec. 3.3) carries out a series of noncooperative games over all block pairs simultaneously and identifies plausible candidates; finally, Iterative Clustering (Sec. 3.4) clusters the correspondences that survived the local games and recovers the incorrectly discarded inliers in an iterative manner.
3.1 Problem Formulation
Given a pair of images, detected keypoints, and local patch descriptors, an initial correspondence set can be generated by some ad-hoc strategy such as brute-force matching between the two descriptor sets. In dynamic scenes, this initial set tends to contain multiple error-prone consistencies representing different moving objects in the foreground. Due to inevitable errors in keypoint detection and the intrinsic ambiguity of feature descriptions, the initial set also contains many nuisances, i.e., outliers. The goal of multi-consistency matching is to reject the outliers while identifying the correct subsets. It is then straightforward to estimate a set of local transformations based on the multi-consistency matching result. We will elaborate the details of the three steps in the following subsections.
3.2 Block Matching
Directly applying RANSAC [fischler1981random] or game-theoretic matching (GTM) [albarelli2010game] is ill-suited for multi-consistency feature matching due to the lack of a single global transformation. Based on this observation, it is natural to work with local regions instead of the image as a whole. One possible solution is to leverage an off-the-shelf image segmentation algorithm, but we note that segmentation is overkill for multi-consistency matching. As a compromise, we propose a simple yet effective gridding method that divides each image into non-overlapping blocks. We then search for corresponding block pairs between the two images, with the expectation that each pair of matched blocks contains only a single consistency.
In the presence of a large-scale rigid transformation between the two images (e.g., viewpoint changes and rotation), it is nontrivial to exactly search for the block in one image corresponding to each block in the other. Drawing inspiration from [bian2017gms], which suggests that the confidence of a correspondence is locally correlated with the number of matches, we propose to address this issue in a statistical manner. That is, the similarity of each block pair is quantified by the number of correspondences located within the pair. Formally, we have
(1) 
where is the similarity of blocks and correspondence is defined by
(2) 
with . For each block in the first image, the corresponding block in the second image is given by the most similar one, i.e., the one with the largest number of correspondences:
(3) 
Meanwhile, if the number of correspondences contained in a matched block pair is smaller than a predefined threshold, the pair is discarded because it is likely caused by interference from background or clutter. For example, the grouping tends to be ambiguous in blocks that include the edge or corner of an object, as shown in the top-right image of Fig. 2. Matched blocks found in these regions are prone to elimination due to their small number of correspondences.
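The block-matching statistics of Eqs. (1)-(3) reduce to counting correspondences per block pair, taking an argmax per block, and discarding weakly supported pairs. A minimal NumPy sketch (the function and parameter names `match_blocks`, `g`, and `min_count` are our own illustrative choices, not the paper's):

```python
import numpy as np

def match_blocks(kps1, kps2, matches, shape1, shape2, g=8, min_count=4):
    """Grid both images into g x g blocks, count how many correspondences
    fall into each block pair, and keep the best-supported pair per block.
    `matches` is an (N, 2) array of keypoint indices (into kps1, kps2)."""
    def block_id(pts, shape):
        # map (x, y) pixel coordinates to a flat block index in [0, g*g)
        h, w = shape[:2]
        bx = np.minimum((pts[:, 0] * g / w).astype(int), g - 1)
        by = np.minimum((pts[:, 1] * g / h).astype(int), g - 1)
        return by * g + bx

    b1 = block_id(kps1[matches[:, 0]], shape1)
    b2 = block_id(kps2[matches[:, 1]], shape2)

    # S[a, b] = number of correspondences located within block pair (a, b)
    S = np.zeros((g * g, g * g), dtype=int)
    np.add.at(S, (b1, b2), 1)

    pairs = {}
    for a in range(g * g):
        b = int(np.argmax(S[a]))
        if S[a, b] >= min_count:   # discard weakly supported block pairs
            pairs[a] = b
    return pairs, S
```

Each surviving block pair is then expected to carry a single consistency and can be handed to its own local game.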
3.3 Play Local Games
The block matching step supplies multiple consistencies assigned to different regions. However, those tentative matching results have not been optimized. Inspired by the success of GTM for single-consistency matching, it is intuitively desirable to optimize the matching results by playing several local games simultaneously in these local regions. As in GTM for single-consistency matching [albarelli2010game], at the core of each local game is the payoff function, which represents the compatibility of two correspondences. Players who achieve higher payoffs become more popular as the game evolves, which suggests that the correspondences they select are inliers. However, the payoff function employed in traditional GTM only considers the geometric compatibility between two correspondences, whose reliability becomes questionable when a large transformation is present. To overcome this limitation, we propose a more robust payoff function that considers both geometric and descriptive compatibility.
More specifically, each player chooses a correspondence from the candidate set, where each correspondence consists of a pair of matched keypoints together with the corresponding pair of local descriptors. Every two players then receive a payoff positively correlated with the compatibility of their choices. The payoff function is defined by
(4) 
where the overall payoff is the combination of two terms that respectively indicate the compatibility of geometric structures and of local descriptions, which we elaborate next.
Geometric compatibility. Inspired by the recent work [ma2019locality], geometric structures in the neighborhood of inliers tend to be homogeneous, as shown in Fig. 3(a) (two quadrangles from different viewpoints), which results in consistent local transformations. By contrast, the variation of geometric structures around outliers can be large and irregular, as shown in Fig. 3(b) (a quadrangle and a triangle), leading to inconsistent local transformations. Based on the above observations, we suggest using the Euclidean distance between the keypoint positions projected by the pair of local transformations around two correspondences as a measure of geometric compatibility. In other words, from the perspective of local geometric variations, we define the geometric compatibility as
(5) 
where represents the norm, is the projected position of (an exemplar keypoint) through local transformation and calculated by ( in the same way)
(6) 
where the affine information around the keypoint is used, and a scale coefficient controls the sensitivity. To obtain the affine information, one can use an off-the-shelf keypoint detection method (e.g., the Hessian-affine detector [Mikolajczyk2004Hessian]).
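Since the symbols of Eqs. (5)-(6) are not reproduced above, the geometric term can only be sketched under assumptions: below, each correspondence is a tuple `(p, q, A, t)` of matched keypoints plus a local affine map `x -> A @ x + t` around `p`, the cross-projection is averaged symmetrically, and `sigma` is an illustrative scale coefficient.

```python
import numpy as np

def geometric_payoff(ci, cj, sigma=10.0):
    """Cross-projection consistency of two correspondences.
    Inliers share similar local transformations, so projecting the OTHER
    correspondence's keypoint through one's own local affine map should
    land close to its true match; the payoff decays exponentially with
    that Euclidean distance (scale sigma is an assumed parameter)."""
    p_i, q_i, A_i, t_i = ci
    p_j, q_j, A_j, t_j = cj
    d = 0.5 * (np.linalg.norm(A_i @ p_j + t_i - q_j)
               + np.linalg.norm(A_j @ p_i + t_j - q_i))
    return float(np.exp(-d / sigma))
```

Two correspondences governed by the same motion yield a payoff near 1; an outlier pulls the cross-projection far from its match and the payoff collapses toward 0.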
Descriptive compatibility. Consider a salient keypoint in the real world; the projections of this keypoint onto two imaging planes, i.e., two local descriptive features, should be similar. A straightforward approach to measuring the similarity of descriptive features (e.g., SIFT descriptors [lowe2004distinctive]) is to calculate their Euclidean distance. However, it is well known that the Euclidean norm is not robust to outliers and is easily confused by nuisances such as nonuniform illumination, motion blur, and viewpoint variations. A more robust strategy is to use the relative (instead of absolute) difference between descriptive features. For example, the so-called divisive normalization [lyu2008nonlinear] strategy shows improved robustness over the conventional norm.
Here we adopt the ratio test as an alternative metric, whose robustness has been demonstrated in [lowe2004distinctive]. The ratio-based descriptive compatibility is defined by
(7) 
where the two distances are to the closest and second-closest descriptor vectors in the other image. A credible correspondence is expected to achieve significant distinctiveness between the closest match and the second-closest match, resulting in a smaller ratio. To measure the compatibility of two correspondences from the perspective of local feature embedding, we expect both correspondences to exhibit prominent distinctiveness if they are consistent. Therefore, we define the descriptive compatibility payoff term by
(8) 
where is a scale coefficient.
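A plausible rendering of the ratio-based term: `lowe_ratio` implements the standard nearest/second-nearest distance ratio, while `descriptive_payoff` is one hedged guess at how Eq. (8) could combine the ratios of two correspondences (the exact functional form and the scale `sigma_d` are assumptions, not the paper's definition).

```python
import numpy as np

def lowe_ratio(desc, desc_bank):
    """Lowe's ratio test: distance to the nearest descriptor in the other
    image divided by the distance to the second nearest. Small ratios
    mean distinctive, credible matches."""
    d = np.linalg.norm(desc_bank - desc, axis=1)
    nn1, nn2 = np.partition(d, 1)[:2]   # two smallest distances
    return nn1 / max(nn2, 1e-12)

def descriptive_payoff(r_i, r_j, sigma_d=0.5):
    """Two correspondences are descriptively compatible when BOTH of
    their ratios are small, so the payoff decays with the summed ratios."""
    return float(np.exp(-(r_i + r_j) / sigma_d))
```

A pair of distinctive correspondences (both ratios near 0) receives a payoff near 1, while a pair of ambiguous ones is suppressed.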
Evolutionary Stable Strategy. With the above-designed payoff functions, the popularity of all players can be iteratively updated by the evolutionary stable strategy (ESS) algorithm [weibull1997evolutionary] as
(9) 
where is the popularity vector of all players and is the payoff matrix generated by
(10) 
where . As the game goes on, the popularity of players who acquire larger payoffs from the other players becomes significantly higher, which indicates that their selections are compatible with the majority of the other correspondences. Quantitatively, a correspondence is determined to be a correct match if its popularity is higher than an adaptive threshold calculated by Otsu's algorithm [otsu1979threshold].
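Eq. (9) is the discrete replicator dynamics of evolutionary game theory; a compact sketch follows (the uniform initialization, iteration cap, and convergence test are our choices):

```python
import numpy as np

def evolve_ess(Pi, iters=200, tol=1e-8):
    """Discrete replicator dynamics on a non-negative payoff matrix Pi.
    Players whose choices are compatible with many others see their
    popularity x_i grow; the stationary x approximates an ESS."""
    n = Pi.shape[0]
    x = np.full(n, 1.0 / n)          # uniform initial popularity
    for _ in range(iters):
        g = Pi @ x                   # expected payoff of each pure strategy
        x_new = x * g / (x @ g)      # replicator update (assumes x @ g > 0)
        if np.linalg.norm(x_new - x, 1) < tol:
            return x_new
        x = x_new
    return x
```

In the paper, the stationary popularity vector is then thresholded by Otsu's algorithm to decide which players' correspondences are kept as correct matches.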
3.4 Iterative Clustering
Local noncooperative games produce candidate correspondences for multi-consistency matching. However, they are not optimized and need to be further refined and clustered for the following reasons. First, grid-based block matching is an efficient yet approximate process. As illustrated in Fig. 4, the red blocks contain corners or edges of the object; fewer correspondences are located in these areas, so they are prone to elimination by block matching after thresholding. Second, GTM is known to suffer from a high miss-detection or false-rejection rate (i.e., poor recall) because many correct matches are falsely rejected by local games [Rodola2014Elastic]. As shown in Fig. 2, the correspondences remaining after the local games are overwhelmingly consistent but sparse. Third, local games cannot resolve the ambiguity underlying multiple consistencies. In other words, the collection of correspondences still needs to be classified and assigned to different consistency classes.
To address these issues, we propose an iterative clustering process that simultaneously recovers falsely rejected inliers and classifies different consistencies. Intuitively, our approach can be interpreted as generalized clustering in the space of matched correspondences. Specifically, our iterative clustering method consists of four steps. First, we recompute the payoff matrix for all selected candidates as
(11) 
where the matrix is computed over the set of candidates. Second, we find the currently most consistent pair of correspondences as an anchor, which corresponds to the maximum element of the payoff matrix. Third, we search for and cluster the other correspondences consistent with the anchor by comparing the corresponding elements with a threshold defined as
(12) 
Fourth, the clustered correspondences are removed by setting the elements in the corresponding rows and columns to zero. Steps 2-4 are iterated until the size of the clustered subset falls below a predefined threshold (4 in our experiments).
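The four steps can be sketched as a greedy loop over the candidates' payoff matrix. The threshold handling below is simplified to a fixed `thresh` (the paper derives it adaptively via Eq. (12)), and all names are illustrative:

```python
import numpy as np

def iterative_clustering(Pi, thresh=0.5, min_size=4):
    """Greedy consistency clustering on the candidates' payoff matrix Pi.
    Repeatedly pick the most compatible pair as an anchor, gather the
    correspondences whose payoff to either anchor member exceeds `thresh`,
    then zero out the clustered rows/columns. Stop when the newest cluster
    would be smaller than `min_size`."""
    Pi = Pi.copy()
    clusters = []
    while True:
        i, j = np.unravel_index(np.argmax(Pi), Pi.shape)
        if Pi[i, j] <= 0:            # no compatibility left anywhere
            break
        members = set(np.where((Pi[i] > thresh) | (Pi[j] > thresh))[0])
        members |= {i, j}
        if len(members) < min_size:
            break
        clusters.append(sorted(members))
        idx = list(members)          # remove the cluster from further play
        Pi[idx, :] = 0.0
        Pi[:, idx] = 0.0
    return clusters
```

Each returned cluster then feeds the per-cluster homography estimation described next.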
To quantitatively evaluate the consistencies represented by parametric transformations, we perform RANSAC (other parametric methods can be used as well) within each cluster and simultaneously calculate a set of parametric transformations (homography matrices in our approach), one per cluster. Finally, this set is employed to check the initial correspondences and recover the falsely eliminated inliers by computing reprojection errors as
(13) 
where the reprojection error is evaluated against every estimated transformation. A correspondence is determined to be an inlier if the minimum of these errors is lower than a predefined threshold.
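The recovery step of Eq. (13) amounts to reprojecting every initial correspondence through each cluster homography and keeping those that land within a pixel threshold of their match. A hedged sketch (the threshold `eps` and the `-1` labeling convention are ours):

```python
import numpy as np

def recover_inliers(points1, points2, homographies, eps=3.0):
    """Re-check initial correspondences against the cluster homographies:
    a pair (p, q) is recovered as an inlier of consistency k if H_k
    reprojects p to within `eps` pixels of q; -1 marks rejected pairs."""
    p_h = np.hstack([points1, np.ones((len(points1), 1))])   # homogeneous
    labels = np.full(len(points1), -1)
    best = np.full(len(points1), np.inf)
    for k, H in enumerate(homographies):
        proj = p_h @ H.T
        proj = proj[:, :2] / proj[:, 2:3]                    # dehomogenize
        err = np.linalg.norm(proj - points2, axis=1)
        better = err < best
        best[better] = err[better]
        labels[better & (err < eps)] = k                     # (re)assign
    return labels
```

This is how inliers falsely eliminated by block matching or local games can be pulled back in: they remain consistent with at least one estimated transformation.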
4 Performance Evaluation
For single-consistency feature matching, precision (P), recall (R), and F-measure (F) are commonly used for performance measurement [lin2014bilateral; bian2017gms; ma2019locality]. However, these metrics are not appropriate in the case of multi-consistency feature matching. As demonstrated in Fig. 6, due to differences in geometric structures and image textures, the spatial distribution of correspondences tends to vary across different regions. Therefore, there is a large gap between the underlying consistencies associated with different moving objects, especially from the perspective of saliency. For example, less salient consistencies that contain fewer correspondences are often prone to elimination (e.g., the consistencies highlighted in red in Fig. 6), but their impact on the performance measured by P, R, and F will be insignificant because the eliminated inliers only make up a small portion of the correspondences.
To remedy the above deficiency, we propose three new metrics, i.e., weighted-precision (WP), weighted-recall (WR), and weighted-F-measure (WF), that are more appropriate for evaluating the performance of multi-consistency matching. The key new insight is to introduce weighting while distinguishing correspondences among different consistencies. More specifically, each match is weighted according to the number of correspondences within the consistency it belongs to. That is, the weight is negatively correlated with the number of inliers consistent with the associated model (homography H), i.e.,
(14) 
where is the weight of an inlier belonging to the th consistency , is the number of inliers consistent with , and is the total number of inliers. For outliers, the weight is calculated by
(15) 
Note that we choose the maximum operation for the purpose of magnifying the penalty of outliers on performance metrics. Therefore, WP, WR, and WF are respectively defined by
(16) 
(17) 
(18) 
where the normalizer is the sum of all weights (over both inliers and outliers), and TP, FP, and FN respectively denote the weighted True Positive, False Positive, and False Negative results generated by the evaluated method.
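To make the metrics concrete, here is one possible implementation. Because the exact weights of Eqs. (14)-(15) are not reproduced above, the inlier weight `n_total / n_k` and the max-weight rule for outliers are illustrative stand-ins that preserve the stated intent (rarer consistencies weigh more; outliers receive the maximum weight):

```python
import numpy as np

def weighted_prf(labels_true, labels_pred):
    """Weighted precision/recall/F for multi-consistency matching.
    labels_true[i]: ground-truth consistency of correspondence i (-1 for
    outliers); labels_pred[i]: 1 if the method kept i, else 0."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred, dtype=bool)
    ks, counts = np.unique(labels_true[labels_true >= 0], return_counts=True)
    n_total = counts.sum()
    w_k = {k: n_total / c for k, c in zip(ks, counts)}   # rarer -> heavier
    w = np.where(labels_true >= 0,
                 [w_k.get(t, 0.0) for t in labels_true],
                 max(w_k.values()))                      # outlier penalty
    tp = w[(labels_true >= 0) & labels_pred].sum()
    fp = w[(labels_true < 0) & labels_pred].sum()
    fn = w[(labels_true >= 0) & ~labels_pred].sum()
    wp = tp / (tp + fp) if tp + fp else 0.0
    wr = tp / (tp + fn) if tp + fn else 0.0
    wf = 2 * wp * wr / (wp + wr) if wp + wr else 0.0
    return wp, wr, wf
```

On a toy example with one dominant (4 inliers) and one rare (1 inlier) consistency, dropping the rare inlier and keeping one outlier yields WP = WR = WF = 0.5, even though the unweighted P and R are both 0.8, illustrating how the weighting amplifies less salient consistencies.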
5 Experiments
In this section, we elaborate on our benchmark (Sec. 5.1), including the datasets (Sec. 5.1.1) and the experimental setup (Sec. 5.1.2); present quantitative (Sec. 5.2) and qualitative (Sec. 5.3) results; and conduct further analysis (Sec. 5.4), including an analysis of the payoff function (Sec. 5.4.1) and an ablation study (Sec. 5.4.2), to better illustrate the mechanism behind icGTM.
5.1 Benchmark
5.1.1 Datasets
In the field of correspondence selection, most existing datasets consider only the single-consistency case and do not cover dynamic scenes, which contain multiple consistencies due to the presence of moving objects or a moving camera. To fill this gap in multi-consistency evaluation, we have set up a dataset consisting of three dynamic scenes with varying challenges, i.e., translation, rotation, clutter, and occlusion. The ground truth in our dataset is a subset of manually labelled correspondences (inliers). To make our benchmark more comprehensive, we also include a classical public dataset, AdelaideRMF [wong2011dynamic], in which each image pair includes multiple consistencies among different structures. Meanwhile, we employ the VGG dataset [mikolajczyk2005comparison] to evaluate the generalization of icGTM to single-consistency feature matching. Examples and characteristics of the datasets in our benchmark are shown in Fig. 5 and Table 1.
| Dataset | Challenges | Ground truth | # Image pairs |
| --- | --- | --- | --- |
| Scene1 | Translation and zoom | Manually labeled inliers | 15 |
| Scene2 | Translation and rotation | Manually labeled inliers | 15 |
| Scene3 | Translation, rotation, clutter, and occlusion | Manually labeled inliers | 15 |
| AdelaideRMF [wong2011dynamic] | Multiple structures and viewpoint change | Manually labeled inliers | 38 |
| VGG [mikolajczyk2005comparison] | Zoom, rotation, blur, light change, viewpoint change, and JPEG compression | Homography matrix | 40 |
5.1.2 Experimental setup
We have evaluated icGTM along with seven competing methods: RANSAC [fischler1981random], GTM [albarelli2012imposing], Multi-GS [chin2010accelerated], USAC [Raguram2013USAC], VFC [Ma2014Robust], GMS [bian2017gms], and LPM [ma2019locality]. The evaluation metrics are weighted-precision (WP), weighted-recall (WR), weighted-F-measure (WF), and efficiency (T) on our dataset; F-measure (F) is also reported on our dataset to verify the advantage of our weighted metrics. For single consistency, we use precision (P), recall (R), F-measure (F), and efficiency (T) on VGG. Each image is divided into blocks, and the initial correspondence set is generated by brute-force matching [lowe2004distinctive] with the combination of the Hessian-affine detector [Mikolajczyk2004Hessian] and the SIFT descriptor [lowe2004distinctive]. Notably, since only the SIFT detector is provided in AdelaideRMF and the affine information in Eq. 6 is thus unavailable, only qualitative results are shown on this dataset.

5.2 Quantitative Results
5.2.1 Single consistency
| Case (challenge) | Metric | RANSAC | GTM | MultiGS | USAC | VFC | GMS | LPM | icGTM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Case1 (zoom + rotation) | Precision (%) | 79.79 | 43.22 | 13.64 | 80.57 | 67.19 | 63.61 | 42.5 | 78.50 |
| | Recall (%) | 91.68 | 79.6 | 7.39 | 97.49 | 86.11 | 11.45 | 83.54 | 99.29 |
| | F-measure (%) | 84.86 | 53.69 | 9.2 | 86.32 | 74.27 | 18.46 | 54.75 | 85.38 |
| | Time (s) | 0.03 | 0.43 | 2.30 | 0.02 | 0.005 | 0.001 | 0.01 | 0.03 |
| Case2 (blur) | Precision (%) | 37.91 | 67.00 | 15.46 | 38.90 | 29.44 | 41.47 | 27.73 | 57.63 |
| | Recall (%) | 39.16 | 56.74 | 8.62 | 40.00 | 51.41 | 50.45 | 54.75 | 73.35 |
| | F-measure (%) | 38.32 | 61.12 | 10.00 | 39.44 | 35.27 | 45.54 | 35.86 | 64.09 |
| | Time (s) | 0.06 | 12.05 | 4.70 | 0.75 | 0.02 | 0.001 | 0.02 | 0.25 |
| Case3 (zoom + rotation) | Precision (%) | 72.40 | 44.92 | 19.60 | 57.43 | 49.38 | 58.57 | 44.59 | 71.92 |
| | Recall (%) | 77.50 | 52.16 | 14.79 | 57.06 | 99.22 | 57.21 | 76.43 | 99.90 |
| | F-measure (%) | 73.52 | 44.83 | 14.00 | 57.23 | 61.91 | 56.35 | 55.10 | 82.40 |
| | Time (s) | 0.05 | 54.29 | 7.70 | 0.52 | 0.04 | 0.002 | 0.05 | 0.92 |
| Case4 (viewpoint change) | Precision (%) | 64.08 | 50.94 | 25.86 | 52.75 | 57.86 | 57.05 | 45.08 | 69.30 |
| | Recall (%) | 57.74 | 68.55 | 27.76 | 53.50 | 97.08 | 75.52 | 83.87 | 84.94 |
| | F-measure (%) | 56.67 | 55.69 | 25.00 | 53.12 | 71.23 | 64.55 | 56.56 | 74.68 |
| | Time (s) | 0.04 | 43.50 | 6.90 | 0.53 | 0.04 | 0.002 | 0.04 | 0.84 |
| Case5 (light change) | Precision (%) | 88.75 | 68.90 | 31.65 | 96.26 | 71.99 | 64.89 | 57.65 | 84.32 |
| | Recall (%) | 80.94 | 80.35 | 26.88 | 99.91 | 100 | 87.95 | 84.46 | 100 |
| | F-measure (%) | 81.25 | 73.94 | 26.00 | 98.05 | 82.49 | 74.37 | 67.90 | 91.38 |
| | Time (s) | 0.04 | 53.08 | 8.10 | 0.10 | 0.05 | 0.001 | 0.05 | 2.29 |
| Case6 (blur) | Precision (%) | 49.85 | 33.45 | 5.62 | 35.88 | 31.18 | 57.10 | 26.72 | 58.79 |
| | Recall (%) | 21.19 | 39.10 | 2.21 | 39.65 | 40.00 | 47.00 | 66.81 | 87.31 |
| | F-measure (%) | 23.74 | 34.29 | 2.40 | 37.65 | 35.02 | 50.80 | 35.82 | 68.95 |
| | Time (s) | 0.08 | 50.70 | 7.40 | 0.74 | 0.04 | 0.002 | 0.04 | 0.42 |
| Case7 (JPEG compression) | Precision (%) | 93.04 | 80.66 | 49.15 | 97.57 | 89.48 | 79.87 | 75.87 | 91.45 |
| | Recall (%) | 98.48 | 94.42 | 58.01 | 100 | 100 | 96.70 | 93.38 | 100 |
| | F-measure (%) | 95.63 | 86.88 | 53.00 | 98.77 | 94.26 | 87.25 | 83.43 | 95.46 |
| | Time (s) | 0.009 | 50.70 | 7.90 | 0.06 | 0.04 | 0.002 | 0.05 | 2.86 |
| Case8 (viewpoint change) | Precision (%) | 74.59 | 72.05 | 25.43 | 76.71 | 72.40 | 80.86 | 62.67 | 84.60 |
| | Recall (%) | 72.49 | 73.42 | 8.60 | 77.66 | 75.74 | 76.08 | 69.64 | 99.28 |
| | F-measure (%) | 48.68 | 39.98 | 26.24 | 63.23 | 44.86 | 74.04 | 77.80 | 90.77 |
| | Time (s) | 0.04 | 50.74 | 7.30 | 0.27 | 0.05 | 0.002 | 0.05 | 0.60 |
| Average | Precision (%) | 70.05 | 23.31 | 57.64 | 67.01 | 58.61 | 62.96 | 47.85 | 74.57 |
| | Recall (%) | 67.40 | 18.87 | 68.88 | 70.78 | 81.67 | 62.42 | 78.14 | 93.01 |
| | F-measure (%) | 65.91 | 18.54 | 60.48 | 68.53 | 66.27 | 59.17 | 57.38 | 81.64 |
| | Time (s) | 0.05 | 6.54 | 39.44 | 0.37 | 0.04 | 0.002 | 0.04 | 1.02 |
Although the focus of this work is multi-consistency matching, the proposed icGTM can easily be generalized to single-consistency matching. From a performance evaluation perspective, it is worth including the comparison for static scenes, i.e., single-consistency feature matching, as a starting point. As shown in Table 2, icGTM achieves superior average performance, outperforming all other competing algorithms by a large margin. Moreover, icGTM achieves promising results in all eight cases, with challenges varying from geometric structure diversity to image quality variations, which confirms the robustness of icGTM. Meanwhile, it should be noted that the performance of icGTM in Case2 and Case6, which contain blurred images, is significantly lower than in the other cases. This is because the descriptive compatibility term in Eq. 4 is confused by the severe degradation of image quality when blur occurs.
5.2.2 Multiple consistencies
| Scene | Metric | RANSAC | GTM | MultiGS | USAC | VFC | GMS | LPM | icGTM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scene1 | WP (%) | 77.36 | 41.1 | 21.99 | 62.93 | 29.97 | 43.69 | 44.89 | 79.67 |
| | WR (%) | 48.15 | 62.00 | 53.03 | 72.84 | 87.22 | 75.24 | 90.62 | 92.11 |
| | WF (%) | 57.93 | 48.40 | 29.18 | 66.46 | 43.98 | 54.76 | 59.63 | 85.17 |
| | F-measure (%) | 65.77 | 62.40 | 38.07 | 78.17 | 56.99 | 64.19 | 69.59 | 90.50 |
| | Time (s) | 0.03 | 5.23 | 3.59 | 0.13 | 0.01 | 0.001 | 0.02 | 0.53 |
| Scene2 | WP (%) | 83.13 | 53.41 | 46.05 | 72.56 | 50.83 | 69.49 | 66.64 | 81.25 |
| | WR (%) | 42.26 | 28.61 | 38.26 | 54.40 | 74.00 | 73.27 | 97.20 | 89.60 |
| | WF (%) | 55.13 | 36.41 | 38.70 | 61.31 | 58.64 | 71.01 | 78.82 | 85.08 |
| | F-measure (%) | 59.70 | 40.75 | 40.94 | 66.65 | 62.26 | 73.94 | 81.64 | 87.23 |
| | Time (s) | 0.08 | 12.28 | 4.79 | 0.13 | 0.02 | 0.001 | 0.03 | 0.73 |
| Scene3 | WP (%) | 83.02 | 43.70 | 32.71 | 68.62 | 39.01 | 65.37 | 57.72 | 77.36 |
| | WR (%) | 31.69 | 29.38 | 22.97 | 48.07 | 55.36 | 76.70 | 97.21 | 75.64 |
| | WF (%) | 44.20 | 34.76 | 24.49 | 55.56 | 39.81 | 70.02 | 72.22 | 75.62 |
| | F-measure (%) | 48.68 | 39.98 | 26.73 | 62.85 | 44.86 | 74.04 | 77.80 | 79.23 |
| | Time (s) | 0.07 | 13.55 | 5.27 | 1.06 | 0.02 | 0.001 | 0.07 | 0.70 |
| Average | WP (%) | 81.17 | 46.07 | 33.58 | 68.04 | 39.94 | 59.52 | 56.42 | 79.43 |
| | WR (%) | 40.7 | 40.00 | 38.09 | 58.44 | 72.19 | 75.07 | 95.01 | 85.78 |
| | WF (%) | 52.42 | 39.86 | 30.79 | 61.11 | 47.48 | 65.26 | 70.22 | 81.96 |
| | F-measure (%) | 58.05 | 47.71 | 35.04 | 69.13 | 54.70 | 70.72 | 76.34 | 85.65 |
| | Time (s) | 0.07 | 13.55 | 5.27 | 1.06 | 0.02 | 0.001 | 0.07 | 0.65 |
In Fig. 7, we plot and compare the curves of three performance metrics (WP, WR, and WF) for the three dynamic scenes. Different competing methods are represented by distinct colors; it can be observed that the yellow curve, which represents icGTM, is the best performing in most cases. Although RANSAC (blue) demonstrates strong WP performance, its WR and WF performance falls dramatically behind. This observation confirms that RANSAC is only good at discovering a single consistency, which is not appropriate for multi-consistency matching. LPM (black) performs well in terms of WR, but its WP performance is disappointing; this is because LPM is not a selective method and imposes a relatively loose constraint. By contrast, icGTM strikes an improved trade-off between WP and WR, achieving the best overall WF performance. The merits of local games and iterative clustering jointly contribute to this excellent performance.
The superiority of icGTM over the other competing methods can also be verified from the quantitative results shown in Table 3. We compare the individual performance in each scene as well as the average performance on the entire dataset. As Table 3 demonstrates, icGTM performs the best in terms of WF and F, outperforming the other approaches by a large margin. Moreover, the computational efficiency of icGTM is improved by at least an order of magnitude compared with traditional GTM, because the large global payoff matrix of GTM is divided into several small matrices that are processed simultaneously by local games. We also note that the newly developed metrics (WF as an example) appear more reasonable than the traditional ones (e.g., F). Taking RANSAC as an example, its WF scores are remarkably lower than the corresponding F scores because RANSAC tends to miss less salient consistencies; this deficiency is better reflected by the degradation of WF performance than of F performance.
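The source of the speed-up can be sketched directly: running replicator dynamics on per-group blocks of the payoff matrix costs the sum of squared block sizes per iteration instead of n² for the full matrix. This is a minimal sketch assuming a standard discrete replicator update; the paper's local games additionally use affine information when forming the groups, which is abstracted away here, and both function names are hypothetical:

```python
import numpy as np

def replicator(A, iters=100):
    """Discrete replicator dynamics on a nonnegative payoff matrix A (n x n).
    Returns the evolved population share of each candidate match;
    high-payoff (mutually compatible) matches survive."""
    x = np.full(A.shape[0], 1.0 / A.shape[0])  # uniform initialization
    for _ in range(iters):
        Ax = A @ x
        x = x * Ax / (x @ Ax)                  # fitness-proportional update
    return x

def local_games(A, groups, iters=100):
    """Run replicator dynamics independently on each block of A indexed by
    `groups` (a partition of the candidate matches). Per-iteration cost
    drops from O(n^2) for one global game to the sum of squared block
    sizes, and the blocks can be processed in parallel."""
    x = np.zeros(A.shape[0])
    for g in groups:
        idx = np.array(g)
        x[idx] = replicator(A[np.ix_(idx, idx)], iters)
    return x
```

For k equally sized groups, the per-iteration cost falls by roughly a factor of k, which is consistent with the order-of-magnitude speed-up over global GTM reported in Table 3.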
5.3 Qualitative Results
We also include visual comparisons between icGTM and the other competing methods on exemplar scenes from our own dataset, VGG, and AdelaideRMF, as shown in Fig. 8 and Fig. 9. In Fig. 8, icGTM discovers most of the underlying consistencies on our dataset, highlighted by different colors (outliers, i.e., incorrect matches, are shown in black). Other methods such as RANSAC miss many correct matches and cannot recognize multiple consistencies in dynamic scenes. For the VGG dataset, we select a few challenging cases in which many other approaches are ineffective (e.g., dominated by black lines). By contrast, icGTM still achieves superior discrimination between inliers and outliers in the presence of large viewpoint changes and blur. Note that the last row of Fig. 8 contains large-scale zoom and rotation (the most challenging case); icGTM works noticeably better than the others but still suffers from many errors.
In Fig. 9, we use a different visualization methodology to compare the feature matching methods. Differently colored dots indicate the keypoint positions of the selected correspondences. Note that only icGTM produces multi-consistency matching results, highlighted by different colors. In this experiment, the SIFT detector provided with the AdelaideRMF dataset is replaced by the Hessian-affine detector in order to obtain the affine information required by icGTM. It is easy to see that only icGTM is capable of discovering the rich underlying consistencies characterized by either multiple planes in static scenes (e.g., building surfaces in the two leftmost and two rightmost columns) or multiple moving objects in dynamic scenes (e.g., toys and books in the middle three columns).
5.4 Analysis
5.4.1 Payoff function
There are several alternative choices for the payoff function in Eq. (4). To compare their differences, we evaluate the objective performance of icGTM using four different payoff functions on our dataset and on VGG, respectively. The comparison results are shown in Table 4. DES (descriptor) denotes the Euclidean distance between the matched descriptor vectors, which is defined as

(19)   DES = ||d_i − d'_i||_2,

where d_i and d'_i denote the descriptors of a matched feature pair. DIS (distance) represents the first term of Eq. (4), R_T (ratio test) corresponds to the second term of Eq. (4), and R_T + DIS denotes the sum of the two terms. We make the following observations from the reported comparison results.
First, when compared with DES, R_T achieves better performance in both multi-consistency and single-consistency scenarios, which confirms that the ratio test is more effective and robust than the raw Euclidean distance. Meanwhile, DIS outperforms R_T on our own dataset (dynamic scenes) but is worse than R_T on the VGG dataset (static scenes). One possible explanation is that descriptive compatibility is more effective than geometric compatibility for the less challenging static scenes, which are free of nonuniform illumination and motion blur.
Second, R_T + DIS achieves the best performance on Scene1, Scene2, and Scene3, outperforming both R_T and DIS. This result verifies that combining the two payoff functions takes advantage of both terms, yielding improved robustness for multi-consistency feature matching. However, we note that R_T alone achieves the best results on the VGG dataset, even surpassing R_T + DIS. This shows that the fusion strategy has not been optimized for all scenarios; there is still room to improve the payoff function design (e.g., one might consider product-based instead of sum-based fusion).
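To make the four variants of Table 4 concrete, the sketch below gives one plausible form for each payoff term. The exact expressions of Eq. (4) are not reproduced here; the exponential form of DES, the 1 − ratio form of R_T, and the distance-ratio form of DIS are all our illustrative assumptions, as are the function names:

```python
import numpy as np

def payoff_des(d1, d2):
    """DES: descriptor-only payoff, a decreasing function of the Euclidean
    distance between matched descriptors (exponential form assumed)."""
    return np.exp(-np.linalg.norm(np.asarray(d1, float) - np.asarray(d2, float)))

def payoff_rt(best_dist, second_dist):
    """R_T: ratio-test payoff; high when the best descriptor distance is
    much smaller than the second-best (Lowe-style ratio, 1 - ratio assumed)."""
    return 1.0 - best_dist / second_dist

def payoff_dis(p1, p2, q1, q2):
    """DIS: geometric payoff rewarding a pair of correspondences (p1->q1,
    p2->q2) that preserve relative distance across the two images."""
    d_src = np.linalg.norm(np.asarray(p1, float) - np.asarray(p2, float))
    d_dst = np.linalg.norm(np.asarray(q1, float) - np.asarray(q2, float))
    return min(d_src, d_dst) / max(d_src, d_dst)

def payoff_sum(rt, dis):
    """R_T + DIS: the sum-based fusion compared in Table 4."""
    return rt + dis
```

A product-based fusion, mentioned above as a possible improvement, would simply multiply the two terms instead of adding them.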
                 DES    R_T    DIS    R_T + DIS
Scene1  WP (%)   80.00  81.71  81.67  79.67
        WR (%)   53.05  59.07  88.63  92.11
        WF (%)   59.44  66.17  84.65  85.17
Scene2  WP (%)   83.69  83.56  82.39  81.25
        WR (%)   45.76  45.39  73.77  89.60
        WF (%)   57.69  57.17  76.78  85.08
Scene3  WP (%)   81.69  81.14  79.28  77.36
        WR (%)   41.63  42.66  66.04  75.64
        WF (%)   53.56  54.15  70.28  75.62
VGG mikolajczyk2005comparison
        P (%)    80.45  82.28  72.39  74.57
        R (%)    91.82  94.50  87.05  93.01
        F (%)    85.00  87.09  77.97  81.64
5.4.2 Ablation study
                     WP (%)  WR (%)  WF (%)
Scene1  Without-EN   76.20   28.56   40.73
        EN           79.67   92.11   85.17
Scene2  Without-EN   78.70   14.30   24.13
        EN           81.25   89.60   85.08
Scene3  Without-EN   65.02   15.16   24.26
        EN           77.36   75.64   75.62

                     P (%)   R (%)   F (%)
VGG mikolajczyk2005comparison
        Without-EN   66.84   26.20   34.23
        EN           74.57   93.01   81.64
Last but not least, we report ablation study results to further illustrate how the proposed icGTM method works. In particular, we want to shed some light on the relationship between playing local games (Sec. III-C) and iterative clustering (Sec. III-D). As shown in Table 5, the implementation with iterative clustering surpasses the one without it by a large margin in all cases. The performance gap is especially remarkable in terms of WR, which implies that a significant number of falsely rejected inliers are recovered by the proposed iterative clustering process. Moreover, icGTM achieves dramatically better precision than conventional GTM because a double-check procedure verifies the soundness of the initial correspondences.
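A simplified reading of this recovery step can be sketched as follows: fit a local transformation to each discovered cluster and re-admit any rejected correspondence whose residual under that transformation is small. We use a least-squares affine fit for brevity (the paper works with homographies), and all names below are hypothetical:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform mapping src -> dst (N >= 3 points).
    Returns a 3x2 matrix M such that [x, y, 1] @ M approximates (x', y')."""
    A = np.hstack([src, np.ones((len(src), 1))])  # N x 3 homogeneous coords
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return M

def recover_rejected(clusters, rejected, thresh=3.0):
    """Sketch of iterative recovery: for each cluster (list of (p, q) point
    correspondences), fit a local transform and re-admit any rejected
    correspondence whose residual is below `thresh` pixels. The paper's
    clustering loop is more involved; this only illustrates why falsely
    rejected inliers can be reclaimed once a consistency is known."""
    for c in clusters:
        src = np.array([m[0] for m in c], float)
        dst = np.array([m[1] for m in c], float)
        M = fit_affine(src, dst)
        still = []
        for p, q in rejected:
            pred = np.append(np.asarray(p, float), 1.0) @ M
            if np.linalg.norm(pred - np.asarray(q, float)) < thresh:
                c.append((p, q))        # residual is small: recovered inlier
            else:
                still.append((p, q))
        rejected = still
    return clusters, rejected
```

Iterating this test over all clusters is what lifts WR so sharply in Table 5: a match rejected by the initial games can still be claimed by whichever consistency explains it.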
6 Applications
6.1 Dynamic Image Mosaicing
The problem of image mosaicing (a.k.a. image stitching) has been extensively studied for static scenes, where the alignment of two images is determined by a global transformation (homography matrix). However, traditional image mosaicing techniques easily fail when applied to dynamic scenes, as illustrated in Fig. 10 (a). Misalignment or misregistration is inevitable when a single global transformation is insufficient to characterize the geometric relationship between the input pair. The mosaicing result suffers from unnatural "ghosting" artifacts (highlighted by dotted boxes).
We propose to generalize the traditional image mosaicing problem to dynamic scenes. Such dynamic image mosaicing zhi2011toward can support multi-frame image super-resolution and video mosaicing. Based on the developed multi-consistency matching method, one can simply project multiple distinct objects in the source image by different local transformations; accordingly, the mosaicing or stitching of each object is guided by the corresponding local transformation, as shown in Fig. 10 (b).
6.2 Dynamic Object Tracking
The other niche application of multi-consistency feature matching is dynamic object tracking in video. Although object tracking has been widely studied, robust tracking of multiple objects remains a long-open problem bernardin2008evaluating; the accuracy of current state-of-the-art tracking algorithms is still below 60% kristan2017visual. In challenging scenarios such as video captured from unmanned aerial vehicles (UAVs), the task of multiple object tracking faces several adverse factors, e.g., viewpoint changes, scale variations, and camera rotations.
This work provides a new set of tools for tackling the problem of dynamic object tracking. As demonstrated in Fig. 11, icGTM is capable of separately clustering the selected correspondences despite large viewpoint changes. Robust feature correspondences, highlighted in different colors, provide plausible bounding-box proposals that can be used as initial hypotheses by dedicated object tracking algorithms. Due to space limitations, we will report more quantitative experimental results (e.g., on VOT2017) in the future.
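As an illustration of how such proposals could be derived, the sketch below turns each cluster of matched keypoints into an axis-aligned bounding box; this is our own minimal assumption about the interface (the function name is hypothetical), not the paper's implementation:

```python
def cluster_bboxes(clusters):
    """Turn each cluster of matched keypoints into an axis-aligned
    bounding-box proposal (x_min, y_min, x_max, y_max) in the target image.
    A simple way to seed a tracker from multi-consistency matches:
    one box per discovered consistency / moving object."""
    boxes = []
    for pts in clusters:              # pts: list of (x, y) keypoint positions
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

In practice a margin would be added around each box, since keypoints rarely cover the full extent of an object.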
7 Conclusion
In this paper, we presented an iterative clustering with Game-Theoretic Matching (icGTM) method focusing on selecting correct matches in the context of multiple coherent correspondences. The method is robust to common nuisances and significantly outperforms other state-of-the-art approaches on both multi-consistency and single-consistency feature matching tasks. In addition, to fill the gap in multi-consistency evaluation, we proposed a benchmark comprising a dataset set up in three scenes and three new metrics that are more reasonable for multi-consistency measurement. The code and benchmark will be available at: https://github.com/sailorz/icGTM.
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China under Grant 61876211 and by the 111 Project on Computational Intelligence and Intelligent Control under Grant B18024.