1 Introduction
Template matching is a basic component in a variety of computer vision applications. In this paper, we address the problem of template matching in unconstrained scenarios. That is, a rigid/nonrigid object moves in 3D space with a variant/invariant background, and the object may undergo rigid/nonrigid deformations and partial occlusions, as demonstrated in Figure 1.
As the most crucial technique in template matching tasks, the similarity measure has been studied for decades, yielding various methods, from the classic sum of absolute differences (SAD) and sum of squared differences (SSD) to the recent best buddies similarity (BBS) [Oron et al.(2018)Oron, Dekel, Xue, Freeman, and Avidan] and deformable diversity similarity (DDIS) [Talmi et al.(2017)Talmi, Mechrez, and ZelnikManor]. However, several aspects still need to be improved: (1) Most real applications prefer matching results shown as bounding boxes of variable size that fit the object region, rather than boxes of a fixed size. Adding geometric parameters, however, expands the set of candidates to evaluate, which requires a similarity measure that remains distinctive under scale changes. (2) Template matching is usually dense: all pixels/features within the template and a candidate window are taken into account when measuring similarity, even when some parts are undesirable (e.g., occlusions, appearance changes caused by deformation). This requires a similarity measure that can deal with noise and outliers. (3) Because the template may be deformed, a good similarity measure is expected to be independent of spatial correlation (e.g., when the object within a candidate window is strongly rotated compared to the template, the spatial correlation between the template and the candidate in raster scan order can become untrustworthy). In this paper, scalable diversity similarity (SDS) is proposed to address the above problems. SDS can be applied with a multi-scale sliding window and is not limited by any specific parametric deformation model.
Both BBS and DDIS focus on problem (2) above by exploiting the properties of the nearest neighbor (NN). Each NN is defined by a pair of patches, one from the template and one from the target. In BBS, a match is defined if and only if each patch in a pair is the NN of the other, and the number of such matches determines the BBS score. DDIS improves on BBS by introducing the relevant diversity of patch subsets between the target and template, which increases robustness against occlusions and deformation. Although these methods can deal with deformation within a window to some extent, they remain limited with respect to problems (1) and (3). We extend DDIS to propose SDS, based on relevant diversity statistics.
SDS has the following two advantages concerning problems (1) and (3). The first is that SDS allows a similarity measure between two point sets of different sizes, and the magnitude of the score is scale-robust. Usually, the magnitude of the DDIS or BBS score grows with the scale of the point set, which favors larger candidate windows when selecting the final result. To alleviate this bias, SDS introduces bidirectional relevant diversity and penalizes changes of scale, which makes a multi-scale sliding window feasible and lets the score converge to the correct scale. This property of SDS is statistically justified in Sec. 2.4.
The second advantage of SDS is its robustness to intense rotation. Both BBS and DDIS involve a spatial distance term in the NN search or in the final similarity calculation, which imposes the constraint that the NN of a point must be spatially close. This constraint is a strong prior that can indeed reduce the number of outliers, but at the same time it decreases robustness against intense rotation. In this paper, instead of Cartesian coordinates, we use the polar radius in polar coordinates to calculate spatial distance, which relaxes the constraint imposed by the prior. In addition, rank information of appearance within a local circle is used together with local appearance in the NN search, which helps to find more confident NNs and yields a significant improvement in cases of intense rotation. This property of SDS is also statistically justified in Sec. 2.4.
To summarize, the main contributions of this paper are as follows: (a) SDS introduces bidirectional relevant diversity and penalizes changes of scale to deal with scaling. (b) The rank of local appearance and the polar radius are exploited to make SDS robust against intense rotation. (c) We collect a new comprehensive dataset of 498 template-target pairs in unconstrained environments for testing matching performance, with 166 image pairs each for scaling, rotation, and scaling+rotation.
1.1 Related work
Template matching is a classic research topic, mainly used for object localization. The mechanism is straightforward: a large number of candidate windows are sampled in the target image, and a similarity measure is then computed between each candidate window and the template. The similarity score plays a core role in measuring confidence and distinguishing the true target from the other candidates. The most widely used off-the-shelf measures are pixel-wise methods such as the sum of squared differences (SSD), the sum of absolute differences (SAD) and normalized cross-correlation (NCC), owing to their simplicity and efficiency. To deal with geometric changes of the target, many works extend the candidate sampling step with planar parametric transformations, such as translation [Elboher and Werman(2013), Chen et al.(2003)Chen, Chen, and Chen, Pele and Werman(2008)], similarity transformation [Kim and De Araújo(2007)], affine transformation [Korman et al.(2013)Korman, Reichman, Tsur, and Avidan, Zhang and Akashi(2015)] and projective transformation [Zhang and Akashi(2016)]. However, these methods usually fail under complex deformations, because a pixel-wise similarity measure relies on correct correspondences between the pixels of the template and the candidate, which are highly limited by the planar geometric models.
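The classic sliding-window mechanism described above can be sketched as follows. This is a minimal illustration, not the implementation used in any of the cited works: the function names (`ssd`, `sad`, `ncc`, `match_template`) are hypothetical, images are assumed to be 2D grayscale NumPy arrays, and only translation at a fixed window size is searched.

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences (lower is better)."""
    return float(np.sum((a - b) ** 2))

def sad(a, b):
    """Sum of absolute differences (lower is better)."""
    return float(np.sum(np.abs(a - b)))

def ncc(a, b):
    """Zero-mean normalized cross-correlation (higher is better)."""
    a0, b0 = a - a.mean(), b - b.mean()
    denom = np.sqrt(np.sum(a0 ** 2) * np.sum(b0 ** 2))
    return float(np.sum(a0 * b0) / denom) if denom > 0 else 0.0

def match_template(target, template, score=ssd, maximize=False):
    """Exhaustively slide the template over the target; return ((x, y), best score)."""
    th, tw = template.shape
    H, W = target.shape
    best_xy, best = None, None
    for y in range(H - th + 1):
        for x in range(W - tw + 1):
            s = score(target[y:y + th, x:x + tw], template)
            if best is None or (s > best if maximize else s < best):
                best_xy, best = (x, y), s
    return best_xy, best

# Toy usage: the template is cropped from the target at x=12, y=5.
rng = np.random.default_rng(0)
target = rng.random((20, 30))
template = target[5:11, 12:20].copy()
print(match_template(target, template, ssd))
print(match_template(target, template, ncc, maximize=True))
```

Because every pixel is compared with the pixel at the same raster position, these scores degrade as soon as the object rotates or deforms inside the window, which is the limitation discussed above.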
In unconstrained environments, a key to designing a robust similarity measure that copes with nonrigid transformations and other noise is to use global information instead of pixel-wise local information. Histogram matching (HM) [Hafner et al.(1995)Hafner, Sawhney, Equitz, Flickner, and Niblack, Comaniciu et al.(2000)Comaniciu, Ramesh, and Meer, Pérez et al.(2002)Pérez, Hue, Vermaak, and Gangnet], which mainly measures the similarity between two color histograms, is not restricted by geometric transformation. However, it is usually not a good choice when background clutter and occlusions appear within the windows. Earth mover’s distance (EMD) [Rubner et al.(2000)Rubner, Tomasi, and Guibas] is proposed to measure the similarity between two probability distributions. Furthermore, a more robust approach [Oron et al.(2015)Oron, BarHillel, Levi, and Avidan] is proposed that uses a spatial-appearance representation to measure the EMD. The tone mapping similarity measure [HelOr et al.(2014)HelOr, HelOr, and David], in which the tone mapping is approximated by a piecewise constant/linear function, is proposed for handling noise. Asymmetric correlation [Elboher and Werman(2013)] is proposed to deal with both noise and illumination changes. Other measures focus on improving robustness against noise, such as the M-estimator [Chen et al.(2003)Chen, Chen, and Chen, Sibiryakov(2011)] and Hamming-based distance [Shin et al.(2007)Shin, Park, and Lee, Pele and Werman(2008)]. We refer interested readers to a comprehensive survey [Ouyang et al.(2012)Ouyang, Tombari, Mattoccia, Di Stefano, and Cham].
A notable family of similarity measures in recent years explores global statistical properties over the two point sets. Bidirectional similarity (BDS) [Simakov et al.(2008)Simakov, Caspi, Shechtman, and Irani] considers two point sets similar if all points of one set are contained in the other, and vice versa. Best-buddies similarity (BBS) [Dekel et al.(2015)Dekel, Oron, Rubinstein, Avidan, and Freeman, Oron et al.(2018)Oron, Dekel, Xue, Freeman, and Avidan] counts the two-sided NNs as a similarity statistic. Deformable diversity similarity (DDIS) [Talmi et al.(2017)Talmi, Mechrez, and ZelnikManor] measures the diversity of feature matches between the two sets and is reported to outperform BBS by revealing the “deformation” of the NN field. Despite the robustness of BBS and DDIS against transformations within the search windows, scaling and rotation of the whole search window have not been considered. In this paper, we propose a scaling- and rotation-independent similarity measure, which leads to a significant improvement and allows multi-scale template matching in unconstrained environments.
2 Methodology
Given a template cropped from a reference image and a target image related by unknown geometric and photometric transformations, our purpose is to design a similarity measure that can distinctively localize the region in the target image containing the same object as the template by finding its maximum value. Each candidate region in the target image is defined by a rectangular window, and the candidate windows are generated in a multi-scale sliding-window fashion. Taking the template image and a candidate window from the target image as inputs, a real-valued SDS score can be calculated, where the and represent non-overlapping patches from the template and a candidate window, respectively. and can also be treated as points when and are treated as point sets. , and .
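As an illustration of multi-scale candidate generation, the sketch below enumerates rectangular windows over a range of scales. It is a simplified assumption-laden example: the function name `multiscale_windows`, the `stride` parameter and the use of a single isotropic scale per window are all hypothetical choices made here for brevity, and the scale range 0.5 to 2 with step 0.1 is taken from the experimental setting reported later in the paper.

```python
def multiscale_windows(img_w, img_h, tpl_w, tpl_h, scales, stride=4):
    """Yield candidate windows (x, y, w, h, scale) that fit inside the image."""
    for s in scales:
        w, h = int(round(tpl_w * s)), int(round(tpl_h * s))
        if w < 1 or h < 1 or w > img_w or h > img_h:
            continue  # skip scales whose window does not fit
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h, s)

# Scales 0.5, 0.6, ..., 2.0 (16 values).
scales = [round(0.5 + 0.1 * k, 1) for k in range(16)]
wins = list(multiscale_windows(64, 48, 16, 12, scales))
print(len(scales), len(wins))
```

The number of candidates grows by roughly the number of scales compared with a fixed-size search, which is why the similarity score itself has to stay comparable across window sizes.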
The nearest neighbor has been shown to be a strong feature for designing similarity measures in prior research. To better explain the difference, we first recall BBS [Dekel et al.(2015)Dekel, Oron, Rubinstein, Avidan, and Freeman], which counts the number of bidirectional NN matches between the two point sets:
(1) 
where is a function that returns the NN of with respect to , and is a distance function. denotes the size of a set, and is a normalization factor.
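The counting of bidirectional NN matches can be sketched as below. This is a hedged illustration rather than the authors' code: the names `nn_index` and `bbs` are hypothetical, points are assumed to be rows of a NumPy feature array, a plain squared Euclidean distance stands in for the distance of Eq. 2, and the normalization by the smaller set size follows the convention of Dekel et al.

```python
import numpy as np

def nn_index(A, B):
    """For each row of A, the index of its nearest row in B (squared Euclidean)."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def bbs(P, Q):
    """Fraction of mutual nearest-neighbor pairs, normalized by min(|P|, |Q|)."""
    nn_pq = nn_index(P, Q)  # NN of each p in Q
    nn_qp = nn_index(Q, P)  # NN of each q in P
    mutual = sum(1 for i, j in enumerate(nn_pq) if nn_qp[j] == i)
    return mutual / min(len(P), len(Q))

P = np.array([[0.0], [1.0]])
Q = np.array([[0.4], [50.0]])
print(bbs(P, Q))  # only (p0, q0) is a mutual pair
```

In this toy case both points of P choose q0 as their NN, but q0 chooses only p0 in return, so a single pair survives the mutual check.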
We are now ready to introduce our method in a bottomup fashion: from NN search to bidirectional diversity, and finally the SDS similarity.
2.1 Rank of Local Appearance for Rotation Robust NN search
The distance function in Eq. 1 is defined by
(2) 
where denotes pixel appearance (e.g., RGB) and denotes pixel location within the patch normalized to the range [0, 1]. In the NN search stage, under the assumption that intense deformation such as rotation does not occur within the patch, the spatial term can improve the confidence of an NN by confirming the consistency of appearance and position. We propose
(3) 
to incorporate instead of , which denotes the rank with respect to the appearance of pixels within a circle. The origin of the circle is , with a support radius of . Specifically,
(4) 
where is an indicator function that maps true and false to 1 and 0. An equation of the same form is applied to . Unlike pixel location, the appearance rank defined by Eq. 4 is invariant to rotation, and it can also be considered structural information (e.g., the shape of the distribution of pixel values) extracted from a local region. Because rotation does not destroy this structure, the rank is invariant to rotation. Furthermore, the Euclidean distance between ranks emphasizes the influence of local extremes, which also helps to preserve local features.
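The rotation invariance of an appearance rank can be checked numerically. The sketch below is built on assumptions, since Eq. 4 is not legible in this copy: the rank of a pixel is taken to be the number of pixels inside a circle of the given support radius whose intensity is strictly smaller than that of the center pixel, the image is a single-channel array, and the function name `appearance_rank` is hypothetical.

```python
import numpy as np

def appearance_rank(img, radius):
    """Per-pixel count of in-circle neighbors with a strictly smaller value."""
    H, W = img.shape
    rank = np.zeros((H, W), dtype=int)
    r = int(np.floor(radius))
    # Integer offsets inside the circle, excluding the center itself.
    offsets = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)
               if (dy or dx) and dy * dy + dx * dx <= radius * radius]
    for dy, dx in offsets:
        # Region where both the pixel and its shifted neighbor are inside the image.
        y0, y1 = max(0, -dy), min(H, H - dy)
        x0, x1 = max(0, -dx), min(W, W - dx)
        center = img[y0:y1, x0:x1]
        neighbor = img[y0 + dy:y1 + dy, x0 + dx:x1 + dx]
        rank[y0:y1, x0:x1] += (neighbor < center).astype(int)
    return rank

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(9, 9))
same = np.array_equal(appearance_rank(np.rot90(img), 2),
                      np.rot90(appearance_rank(img, 2)))
print(same)
```

Rotating the image by 90 degrees only permutes the neighbors inside each circle, so the rank map of the rotated image equals the rotated rank map. For arbitrary angles the equality would hold only approximately because of pixel resampling.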
2.2 Bidirectional Diversity for Discriminative Similarity Measure
We first extend the diversity similarity (DIS) defined in [Talmi et al.(2017)Talmi, Mechrez, and ZelnikManor] to a bidirectional way. The DIS is defined as
(5) 
which counts the types of points in that have NN in with the same pixel type (i.e., defined as diversity in direction ). The authors claim that this one-directional diversity provides a good approximation to BBS with less computation. However, the number of candidates increases explosively when multi-scale candidate windows are allowed, so a more discriminative similarity measure is needed. We exploit the diversity calculated in both directions (i.e., and ). Specifically, we first define the following function, which indicates the number of points whose NNs are equal to in direction ,
(6) 
where NN() here is calculated with distance defined in Eq. 3. To understand the equation, we analyze its relationship with diversity from two situations. For : (1) When , the value is inversely proportional to the diversity contribution. That is, large value of indicates that many points in have the same NN of , which will lower the diversity defined in Eq. 5. (2) When , it indicates that a is not a NN of any , which also hinders the increase of diversity. An ideal situation is that for each , . For , the situations become more complex. (1) when , similarly it means low contribution to the diversity. (2) Due to the scaling between and , one point can be the NN of multiple points, when , it contributes to the diversity. (3) When , it will lower the diversity.
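The two statistics discussed here, the per-point NN count and the resulting diversity, can be illustrated with a short sketch. It makes several assumptions: the names `nn_counts` and `diversity` are hypothetical, a plain squared Euclidean distance replaces the distance of Eq. 3, the direction is expressed simply as "from `src` to `dst`", and any normalization constant of Eq. 5 is omitted.

```python
import numpy as np

def nn_index(src, dst):
    """For each row of src, the index of its nearest row in dst."""
    d = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def nn_counts(src, dst):
    """counts[j] = number of points in src whose NN in dst is dst[j]."""
    return np.bincount(nn_index(src, dst), minlength=len(dst))

def diversity(src, dst):
    """Number of distinct points in dst that are the NN of at least one src point."""
    return int(np.count_nonzero(nn_counts(src, dst)))

src = np.array([[0.0], [0.1], [5.0]])
dst = np.array([[0.0], [5.0], [9.0]])
print(nn_counts(src, dst), diversity(src, dst))
```

Here two source points share the same NN and one destination point is never selected, so the diversity is lower than the ideal case in which every destination point is the NN of exactly one source point.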
We propose to simultaneously introduce this statistic in direction . However, this is not straightforward in template matching, because the candidate window usually belongs to a target image , where . That is, when finding NNs in the case of , is fixed and the preprocessing (e.g., sorting for brute-force search, building a k-d tree, etc.) only needs to be conducted once. In the case of , such preprocessing for the NN search has to be conducted for each , which is time-consuming. To tackle this problem, we assume that has a high probability of being included in the set of approximate NNs (ANNs) with respect to , which is denoted by . Formally, we define the following function, which counts the number of points (i.e., patches in the image) whose ANNs include in direction ,
(7) 
2.3 Scalable Diversity Similarity
With the bidirectional diversity and defined, we define SDS to quantify the similarity between the template and a candidate with a given target image and scaling as follows, where can be calculated from and ,
(8) 
where parameter is a normalization factor inversely proportional to the increase of (e.g., ). As analyzed in Sec. 2.2, only points in that satisfy , and points in that satisfy , can possibly contribute to an increase in diversity. returns the radius of a pixel in polar coordinates, with the pole set at the corresponding geometric center of and . The denominator of Eq. 8 penalizes spatial inconsistency in polar coordinates, to further increase robustness against in-plane rotation. Term is a normalization term for the number of NNs with respect to scaling. Following the analysis in Sec. 2.2, in our implementation, is defined as , which increases when more satisfy . In conclusion, SDS can be viewed as a combination of three terms: (1) the numerator term, which evaluates the bidirectional diversity, (2) the denominator term, which evaluates the spatial consistency, and (3) the term that normalizes the number of NNs with respect to .
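The reason a polar radius helps against in-plane rotation can be seen in a small numerical check. The sketch below is only an illustration of that geometric fact, not part of the SDS formula: the helper names `polar_radius` and `rotate` are hypothetical, and the pole is assumed to be the geometric center of the window.

```python
import numpy as np

def polar_radius(points, center):
    """Distance of each (x, y) point from the pole `center`."""
    return np.hypot(points[:, 0] - center[0], points[:, 1] - center[1])

def rotate(points, center, angle_rad):
    """Rotate (x, y) points about `center` by `angle_rad`."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    R = np.array([[c, -s], [s, c]])
    return (points - center) @ R.T + center

center = np.array([8.0, 8.0])
pts = np.array([[8.0, 2.0], [12.0, 8.0], [3.0, 11.0]])
rot = rotate(pts, center, np.deg2rad(70.0))

print(np.allclose(polar_radius(pts, center), polar_radius(rot, center)))
print(np.allclose(pts, rot))
```

The Cartesian positions change under rotation while the radii do not, so a spatial term based on the radius does not penalize a correctly matched point merely because the object has rotated about the window center.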
2.4 Statistical Analysis
Analysis of scaling robustness. To verify the effectiveness of SDS in measuring the similarity between scale-variant point sets, we first provide a 1D statistical analysis following [Dekel et al.(2015)Dekel, Oron, Rubinstein, Avidan, and Freeman, Talmi et al.(2017)Talmi, Mechrez, and ZelnikManor]. The expectations of similarity between two point sets drawn from two 1D Gaussian models are calculated for comparison, where point sets play the role of the template/candidate window and points play the role of patches. Our goal is to show that the expectation of SDS is maximal when the two Gaussian models are the same and decreases fastest as the models separate. We further analyze the expectations for point sets of different sizes to show the scaling robustness of SDS. As suggested by [Talmi et al.(2017)Talmi, Mechrez, and ZelnikManor], Monte Carlo integration is used to approximate the expectation. Figure 2 (a) to Figure 2 (d) illustrate the approximated expectation maps when the two point sets have the same size (). The expectation of SDS drops faster than that of SSD, BBS, or DDIS as the parameters of the second Gaussian ( and ) move away from the parameters of the first Gaussian ( and ). Figure 2 (d) to Figure 2 (f) compare the expectation maps when the two point sets have different sizes (), which provides strong evidence that SDS is highly robust against scaling, as the expectation maps remain almost the same despite the scaling change.
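The Monte Carlo approximation of an expected similarity between two 1D Gaussian point sets can be sketched as follows. This is a rough illustration under stated assumptions: a 1D BBS score is used as a stand-in similarity because the SDS formula (Eq. 8) is not reproduced in this copy, the first Gaussian is fixed to zero mean and unit standard deviation, and the names `bbs_1d` and `expected_similarity` as well as the sample sizes and trial count are hypothetical.

```python
import numpy as np

def bbs_1d(p, q):
    """Mutual nearest-neighbor ratio between two 1D point sets."""
    d = np.abs(p[:, None] - q[None, :])
    nn_pq = d.argmin(axis=1)  # NN of each p in q
    nn_qp = d.argmin(axis=0)  # NN of each q in p
    mutual = np.sum(nn_qp[nn_pq] == np.arange(len(p)))
    return mutual / min(len(p), len(q))

def expected_similarity(sim, mu2, sigma2, n=50, m=50, trials=200, seed=0):
    """Monte Carlo estimate of E[sim(P, Q)], P ~ N(0, 1), Q ~ N(mu2, sigma2^2)."""
    rng = np.random.default_rng(seed)
    vals = [sim(rng.normal(0.0, 1.0, n), rng.normal(mu2, sigma2, m))
            for _ in range(trials)]
    return float(np.mean(vals))

e_same = expected_similarity(bbs_1d, 0.0, 1.0)
e_far = expected_similarity(bbs_1d, 5.0, 1.0)
print(e_same, e_far)
```

Evaluating `expected_similarity` on a grid of `(mu2, sigma2)` values gives an expectation map of the kind shown in Figure 2, and changing `m` relative to `n` gives the different-size setting.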
To further show that the scale of target with respect to can be estimated by maximizing SDS, we provide a statistical result in Figure 3. Similar to Figure 2, is drawn from and is generated for expectation approximation. The difference is that we further prepare , which involves background points, to simulate the template matching task. Here, , and is composed of background points drawn from , with . In this demonstration, and are set to 100 and 200, respectively. and varies from 0.5 to 2 with a step of 0.1. can be treated as a candidate window in the template matching task and is sampled from by preferentially sampling points in (i.e., nearest neighbor interpolation). For example, when , 150 points need to be sampled to form , with 100 points from and 50 points from . The estimated is expected to approximate the ground truth scale well. This statistical analysis demonstrates the robustness of SDS against scaling and its ability to estimate the proper scale of the target.
Analysis of rotation robustness. To show the robustness against rotation, we analyze the expectation of similarity between two sets and drawn from 2D Gaussian models, as shown in Figure 4. We fix the parameters except and to examine the effect of the rotation angle together with the shape of the Gaussian. In the case of BBS, as we can observe from Figure 4 (c), when is extremely small, the points drawn are likely to form a line, which is sensitive to rotation because lines overlap little after rotation. This is also the case when , as the expectation decreases gradually with the increase of . In addition, an isotropic Gaussian should be unaffected by rotation, which is confirmed by Figure 4 (c): when , the expectation is well preserved with respect to the rotation. On the other hand, SDS is invariant to rotation regardless of the shape of the distribution, as shown in Figure 4 (d).
3 Experiment Results
We conduct comprehensive experiments with both qualitative and quantitative tests to validate the superiority of SDS over the state-of-the-art methods BBS [Dekel et al.(2015)Dekel, Oron, Rubinstein, Avidan, and Freeman, Oron et al.(2018)Oron, Dekel, Xue, Freeman, and Avidan] and DDIS [Talmi et al.(2017)Talmi, Mechrez, and ZelnikManor], as well as several conventional methods. We follow the same procedure as suggested in [Dekel et al.(2015)Dekel, Oron, Rubinstein, Avidan, and Freeman, Talmi et al.(2017)Talmi, Mechrez, and ZelnikManor] for a fair comparison. Note that since SDS can be employed with multi-scale windows, we also evaluate SDS with a fixed scale, which is referred to as NSDS. In addition, as with SDS, we also apply DDIS to the multi-scale candidate windows for comparison, denoted as SDDIS.
Multiple datasets are used for comparison. We collected 42 original videos in different unconstrained environments and extracted frames to create a benchmark for evaluating template matching under overall rotation and scaling of the object. Ground truths are scale-variable and annotated manually, image by image. This benchmark also includes other challenges such as complex deformations, occlusion, and background clutter. For detailed evaluation, the benchmark is subdivided into three datasets: (1) a rotation dataset, (2) a scaling dataset and (3) a rotation-scaling dataset, each of which includes 166 reference-target image pairs. It is noteworthy that each dataset also includes other photometric and geometric transformations, as the videos are taken in unconstrained environments. As for the evaluation criterion, following previous works [Dekel et al.(2015)Dekel, Oron, Rubinstein, Avidan, and Freeman, Talmi et al.(2017)Talmi, Mechrez, and ZelnikManor], we employ the success ratio based on the overlap rate between the ground truth and the matching result to measure accuracy, which is defined as: . Here, the operator counts the number of pixels within a window.
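The evaluation criterion can be sketched as follows. The overlap formula is not legible in this copy, so the example assumes the common intersection-over-union definition on pixel counts used in the cited works; the function names, the `(x, y, w, h)` box layout, the use of `>=` at the threshold and the trapezoidal AUC over 101 thresholds are all hypothetical choices made for illustration.

```python
def overlap_rate(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_curve(overlaps, thresholds):
    """Fraction of image pairs whose overlap reaches each threshold."""
    return [sum(o >= t for o in overlaps) / len(overlaps) for t in thresholds]

def auc(overlaps, steps=101):
    """Area under the success curve for thresholds in [0, 1] (trapezoidal rule)."""
    ths = [i / (steps - 1) for i in range(steps)]
    curve = success_curve(overlaps, ths)
    return sum((curve[i] + curve[i + 1]) / 2 for i in range(steps - 1)) / (steps - 1)

gt = (0, 0, 10, 10)
print(overlap_rate(gt, (5, 0, 10, 10)))
print(success_curve([0.2, 0.6, 0.9], [0.5]))
```

With scale-variable ground truth, a fixed-size result box cannot reach an overlap of 1 when the object changes size, which is why multi-scale output matters for this metric.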
We compare our proposed methods (SDS and NSDS) to DDIS and its multi-scale implementation SDDIS, BBS, HM, HOG, SAD, and SSD. The scaling factors with respect to both the and axes range from 0.5 to 2, with a step of 0.1. The patch size for SDS, DDIS and BBS is fixed to . We report the results in Figure 5. SDS/NSDS outperforms the other compared methods with respect to the area-under-curve (AUC) score. The NGT curve shows the upper limit of performance when the success rate is calculated with fixed scales. Matching examples are shown in Figure 6. The 1st and 2nd rows show that SDS is robust against overall rotation. The 3rd and 4th rows demonstrate that SDS can deal well with scaling. The likelihood maps show that SDS/NSDS is more distinctive and yields better-localized modes compared with the other methods.
4 Conclusion
We proposed a novel multi-scale template matching method for unconstrained environments, which is robust against overall scaling and intense rotation while taking advantage of a global-statistic-based similarity measure to deal with complex deformations, occlusions, etc. Extended bidirectional diversity combined with rank-based nearest neighbor search forms a scale-robust similarity measure, and the use of polar coordinates further improves robustness against rotation. The experimental results show that SDS can clearly outperform other competitive methods. On the other hand, SDS may fail when the template is too small to produce a statistically meaningful score. Future work is to add a rotation parameter to the candidate windows to achieve rotation-specific matching results.
References
[Chen et al.(2003)Chen, Chen, and Chen] Jiun-Hung Chen, Chu-Song Chen, and Yong-Sheng Chen. Fast algorithm for robust template matching with M-estimators. IEEE Transactions on Signal Processing, 51(1):230–243, 2003.
[Comaniciu et al.(2000)Comaniciu, Ramesh, and Meer] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Real-time tracking of non-rigid objects using mean shift. In Computer Vision and Pattern Recognition (CVPR), pages 142–149. IEEE, 2000.
[Dekel et al.(2015)Dekel, Oron, Rubinstein, Avidan, and Freeman] Tali Dekel, Shaul Oron, Michael Rubinstein, Shai Avidan, and William T Freeman. Best-buddies similarity for robust template matching. In Computer Vision and Pattern Recognition (CVPR), pages 2021–2029, 2015.
[Elboher and Werman(2013)] Elhanan Elboher and Michael Werman. Asymmetric correlation: a noise robust similarity measure for template matching. IEEE Transactions on Image Processing (TIP), 22(8):3062–3073, 2013.
[Hafner et al.(1995)Hafner, Sawhney, Equitz, Flickner, and Niblack] James Hafner, Harpreet S. Sawhney, William Equitz, Myron Flickner, and Wayne Niblack. Efficient color histogram indexing for quadratic form distance functions. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 17(7):729–736, 1995.
[HelOr et al.(2014)HelOr, HelOr, and David] Yacov Hel-Or, Hagit Hel-Or, and Eyal David. Matching by tone mapping: Photometric invariant template matching. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(2):317–330, 2014.
[Kim and De Araújo(2007)] Hae Yong Kim and Sidnei Alves De Araújo. Grayscale template-matching invariant to rotation, scale, translation, brightness and contrast. In Pacific-Rim Symposium on Image and Video Technology (PSIVT), pages 100–113. Springer, 2007.
[Korman et al.(2013)Korman, Reichman, Tsur, and Avidan] Simon Korman, Daniel Reichman, Gilad Tsur, and Shai Avidan. Fast-match: Fast affine template matching. In Computer Vision and Pattern Recognition (CVPR), pages 2331–2338, 2013.
[Oron et al.(2015)Oron, BarHillel, Levi, and Avidan] Shaul Oron, Aharon Bar-Hillel, Dan Levi, and Shai Avidan. Locally orderless tracking. International Journal of Computer Vision (IJCV), 111(2):213–228, 2015.
[Oron et al.(2018)Oron, Dekel, Xue, Freeman, and Avidan] Shaul Oron, Tali Dekel, Tianfan Xue, William T Freeman, and Shai Avidan. Best-buddies similarity: robust template matching using mutual nearest neighbors. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(8):1799–1813, 2018.
[Ouyang et al.(2012)Ouyang, Tombari, Mattoccia, Di Stefano, and Cham] Wanli Ouyang, Federico Tombari, Stefano Mattoccia, Luigi Di Stefano, and Wai-Kuen Cham. Performance evaluation of full search equivalent pattern matching algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34(1):127–143, 2012.
[Pele and Werman(2008)] Ofir Pele and Michael Werman. Robust real-time pattern matching using Bayesian sequential hypothesis testing. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 30(8):1427–1443, 2008.
[Pérez et al.(2002)Pérez, Hue, Vermaak, and Gangnet] Patrick Pérez, Carine Hue, Jaco Vermaak, and Michel Gangnet. Color-based probabilistic tracking. In European Conference on Computer Vision (ECCV), pages 661–675. Springer, 2002.
[Rubner et al.(2000)Rubner, Tomasi, and Guibas] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision (IJCV), 40(2):99–121, 2000.
[Shin et al.(2007)Shin, Park, and Lee] Bong Gun Shin, So-Youn Park, and Ju Jang Lee. Fast and robust template matching algorithm in noisy image. In Control, Automation and Systems (ICCAS), pages 6–9. IEEE, 2007.
[Sibiryakov(2011)] Alexander Sibiryakov. Fast and high-performance template matching method. In Computer Vision and Pattern Recognition (CVPR), pages 1417–1424. IEEE, 2011.
[Simakov et al.(2008)Simakov, Caspi, Shechtman, and Irani] Denis Simakov, Yaron Caspi, Eli Shechtman, and Michal Irani. Summarizing visual data using bidirectional similarity. In Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.
[Talmi et al.(2017)Talmi, Mechrez, and ZelnikManor] Itamar Talmi, Roey Mechrez, and Lihi Zelnik-Manor. Template matching with deformable diversity similarity. In Computer Vision and Pattern Recognition (CVPR), pages 1311–1319. IEEE, 2017.
[Zhang and Akashi(2015)] Chao Zhang and Takuya Akashi. Fast affine template matching over Galois field. In British Machine Vision Conference (BMVC), pages 121.1–121.11. BMVA Press, September 2015.
[Zhang and Akashi(2016)] Chao Zhang and Takuya Akashi. Robust projective template matching. IEICE Transactions on Information and Systems, 99(9):2341–2350, 2016.