Template Matching with Deformable Diversity Similarity
We propose a novel measure for template matching named Deformable Diversity Similarity, based on the diversity of feature matches between a target image window and the template. We rely on both local appearance and geometric information, which jointly lead to a powerful approach for matching. Our key contribution is a similarity measure that is robust to complex deformations, significant background clutter, and occlusions. Empirical evaluation on the most up-to-date benchmark shows that our method outperforms the current state-of-the-art in detection accuracy while improving computational complexity.
Template Matching is a key component in many computer vision applications such as object detection, tracking, surveillance, medical imaging and image stitching. Our interest is in Template Matching “in the wild”, i.e., when no prior information is available on the target image. An example application is to identify the same object in different cameras of a surveillance system. Another use case is in video tracking, where Template Matching is used to detect drifts and relocate the object after losing it. This is a challenging task when the transformation between the template and the target in the image is complex, non-rigid, or contains occlusions, as illustrated in Figure 1.
Traditional template matching approaches, such as Sum-of-Squared-Distances or Normalized Cross-Correlation, do not handle these complex cases well. This is largely because they penalize all pixels of the template, which results in false detections when occlusions or large deformations occur. To overcome this limitation the Best-Buddies-Similarity (BBS) measure was proposed in [4, 18]. BBS is based on properties of the Nearest-Neighbor (NN) matches between features of the target and features of the template. It relies only on a subset of the points in the template, thus hoping to latch on to the relevant features that correspond between the template and the target. This makes BBS more robust than previous methods.
In this paper we adopt the feature-based, parameter-free approach of BBS and propose a novel similarity measure for template matching named DDIS: Deformable Diversity Similarity. DDIS is based on two properties of the Nearest Neighbor field of matches between points of a target window and the template. The first is that the diversity of NN matches forms a strong cue for template matching. This idea is supported by earlier observations in which patch diversity was used to match objects for texture synthesis. We propose formulas for measuring the NN field diversity and further provide theoretical analysis as well as empirical evaluations that show the strength of these measures.
The second key idea behind DDIS is to explicitly consider the deformation implied by the NN field. As was shown by the seminal work on Deformable Part Models, allowing deformations while accounting for them in the matching measure is highly advantageous for object detection. DDIS incorporates similar ideas for template matching, leading to a significant improvement in template detection accuracy in comparison to the state-of-the-art.
A benefit of DDIS with respect to BBS [4, 18] is reduced computational complexity. Both measures rely on NN matches; however, BBS is formulated in a way that requires heavier computations. DDIS is more efficient while providing statistical properties similar to BBS.
To summarize, in this paper we introduce DDIS, a measure for template matching in the wild that relies on two observations: (i) The diversity of NN matches between template points and target points is indicative of the similarity between them. (ii) The deformation implied by the NN field should be explicitly accounted for. DDIS is robust and parameter free. It operates in unconstrained environments and shows improved accuracy compared to previous methods on a challenging real data-set. Our code is available at https://github.com/roimehrez/DDIS
The similarity measure between the template and a sub-window of the target image is the core part of template matching. The commonly used methods are pixel-wise, e.g., Sum of Squared Differences (SSD), Sum of Absolute Differences (SAD) and Normalized Cross-Correlation (NCC), all of which assume only translation between the template and target. They can be combined with tone mapping to handle illumination changes or with asymmetric correlation to handle noise. To increase robustness to noise, pixel-wise measures such as M-estimators [2, 23] or Hamming-based distances [22, 20] have been proposed.
Parametric transformations have been handled by approximating the global optimum of the parametric model, and non-rigid transformations have been addressed via parametric estimation of the distortion. All of these methods work very well when their underlying geometric assumptions hold; however, they fail in the presence of complex deformations, occlusions and clutter.
A second group of methods considers a global probabilistic property of the template. For example, in [3, 21] color Histogram Matching is used (for tracking). This does not restrict the geometric transformation; however, in many cases the color histogram is not a good representation, e.g., in the presence of background clutter and occlusions. Other methods combine geometric cues with appearance cues. For example, a probabilistic solution has been suggested in which geometric and color cues are used to represent the image in the location-intensity space. Oron et al. extend this idea by measuring one-to-one distances. These methods all make various assumptions that do not hold in complex scenarios.
A more robust approach that can handle complex cases has recently been suggested in [4, 18]. Their approach, named the Best-Buddies-Similarity (BBS), is based on the Bi-Directional Similarity (BDS) concept. They compute the similarity between a template and a target window by considering matches between their patches. The matches are computed in both directions, providing robustness to outliers. A similar idea replaces the max operator of the Hausdorff distance with a sum. The BBS of [4, 18] led to a significant improvement in template matching accuracy over prior methods. In this paper we propose a different measure that shares with BBS its robustness properties while yielding even better detection results.
To measure similarity between a target window and a template we first find for every target patch its Nearest Neighbor (NN), in terms of appearance, in the template. Our key idea is that the similarity between the target and the template is captured by two properties of the implied NN field. First, as shown in Figure 1(d), when the target and template correspond, most target patches have a unique NN match in the template. This implies that the NN field is highly diverse, pointing to many different patches in the template. Conversely, as shown in Figure 1(e), for arbitrary targets most patches do NOT have a good match, and the NNs converge to a small number of template points that happen to be somewhat similar to the target patches. Second, we note that arbitrary matches typically imply a large deformation, indicated by long arrows in Figure 1(e).
Next, we propose two ways for quantifying the amount of diversity and deformation of the NN field. The first is more intuitive and allows elegant statistical analysis. The second is slightly more sophisticated and more robust.
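Let points $p_i, q_j \in \mathbb{R}^d$ represent patches of the template and target, respectively. Our goal is to measure the similarity between two sets of points, the template points $P = \{p_i\}_{i=1}^{|P|}$ and the target points $Q = \{q_j\}_{j=1}^{|Q|}$. We require finding the NN in $P$ for every point $q_j \in Q$, s.t., $\mathrm{NN}(q_j, P) = \arg\min_{p_i \in P} d(q_j, p_i)$ for some given distance function $d(q, p)$. The first property our measures are based on is the diversity of points $p_i \in P$ that were found as NNs.
An intuitive way to measure diversity is to count the number of unique NNs. We define the Diversity Similarity (DIS) as:
$$\mathrm{DIS} = c \cdot \left| \{\, p_i \in P : \exists\, q_j \in Q,\ \mathrm{NN}(q_j, P) = p_i \,\} \right| \tag{1}$$
where $c = 1/\min\{|P|, |Q|\}$ is a normalization factor and $|\cdot|$ denotes group size.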
To provide further intuition as to why DIS captures the similarity between two sets of points we provide an illustration in 2D in Figure 5. Figure 4(a) demonstrates that when the distributions of points in $P$ and $Q$ are similar, most of the points in $Q$ have a unique NN, implying a high DIS value. Conversely, when $P$ and $Q$ are distributed differently, as illustrated in Figure 4(b), DIS is low. This is because in areas where $Q$ is sparse while $P$ is dense most of the points in $P$ are not the NN of any $q_j$. In addition, in areas where $Q$ is dense while $P$ is sparse most of the points in $Q$ share the same NNs. In both cases, since the number of points is finite, the overall contribution to DIS is low.
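The counting behind DIS can be illustrated with a minimal sketch. It assumes brute-force NN search with squared Euclidean distance and the normalization $1/\min\{|P|, |Q|\}$; the function names and the toy point sets are hypothetical and chosen only for illustration.

```python
def nearest_neighbor(q, P):
    """Index of the point in P closest to q (squared Euclidean distance)."""
    return min(range(len(P)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(q, P[i])))


def dis(Q, P):
    """Normalized count of distinct template points in P that are the NN
    of at least one target point in Q."""
    unique_nns = {nearest_neighbor(q, P) for q in Q}
    return len(unique_nns) / min(len(P), len(Q))


# Toy 2D example (illustrative values only).
P = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]          # template points
Q_similar = [(0.1, 0.0), (0.9, 0.1), (0.1, 0.9), (1.0, 1.1)]  # spread like P
Q_clumped = [(0.1, 0.0), (0.0, 0.1), (0.2, 0.1), (0.1, 0.2)]  # all near one point of P

print(dis(Q_similar, P))  # every target point hits a different template point
print(dis(Q_clumped, P))  # all target points share a single NN
```

When the target points are spread like the template points the score reaches 1, while a clumped target set collapses onto one template point and scores 1/4.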
While DIS captures diversity well, it does not explicitly consider the deformation field. Accounting for the amount of deformation is important: non-rigid transformations should be allowed, but they should also be restricted so as to give preference to plausible deformations of real objects.
In order to integrate a penalty on large deformations we make two modifications to the way we measure diversity. First, to obtain an explicit representation of the deformation field we distinguish between the appearance and the position of each patch and treat them separately. Second, we propose a different way to measure diversity, that enables considering the deformation amount.
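Let $q_j^a$ denote the appearance and $q_j^l$ the location of patch $q_j$ (and similarly for $p_i$). We find the appearance based NN for every point $q_j$, s.t. $\mathrm{NN}^a(q_j, P) = \arg\min_{p_i \in P} d(q_j^a, p_i^a)$ for some given distance $d(q^a, p^a)$. The location distance between a point $q_j$ and its NN $p_i = \mathrm{NN}^a(q_j, P)$ is denoted by $r_j = d(q_j^l, p_i^l)$. To quantify the amount of diversity as a function of the NN field we define $\kappa(p_i)$ as the number of patches $q \in Q$ whose $\mathrm{NN}^a$ is $p_i$:
$$\kappa(p_i) = \left| \{\, q \in Q : \mathrm{NN}^a(q, P) = p_i \,\} \right| \tag{2}$$
Finally, we define the Deformable Diversity Similarity (DDIS) by aiming for high diversity and small deformation:
$$\mathrm{DDIS} = c \sum_{q_j \in Q} \frac{1}{r_j + 1} \cdot \exp\!\left(1 - \kappa\!\left(\mathrm{NN}^a(q_j, P)\right)\right) \tag{3}$$
where $c = 1/\min\{|P|, |Q|\}$ is a normalization factor.
This definition can be viewed as a sum of contributions over the points $q_j$. When a point $q_j$ has a unique NN, then $\kappa(\mathrm{NN}^a(q_j, P)) = 1$ and the exponent reaches its maximum value of $1$. Conversely, when the NN of $q_j$ is shared by many other points, then $\kappa(\mathrm{NN}^a(q_j, P))$ is large, the exponent value is low and the overall contribution of $q_j$ to the similarity is low. In addition, the contribution of every point is inversely weighted by the length $r_j$ of its implied deformation vector.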
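A minimal sketch of how such a score could be computed is given below. It assumes each point is an (appearance, location) pair, Euclidean distance for both parts, brute-force NN search, a per-point weight of $\exp(1 - \kappa)/(r + 1)$ and the normalization $1/\min\{|P|, |Q|\}$. The function name and the toy data are hypothetical.

```python
import math


def ddis(Q, P):
    """Q, P: lists of (appearance, location) pairs, each part a tuple of floats."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    # Appearance-only NN in the template for every target point.
    nn = [min(range(len(P)), key=lambda i: dist(qa, P[i][0])) for qa, _ in Q]

    # kappa[i]: how many target points chose template point i as their NN.
    kappa = [0] * len(P)
    for i in nn:
        kappa[i] += 1

    score = 0.0
    for (_, ql), i in zip(Q, nn):
        r = dist(ql, P[i][1])                      # length of the deformation vector
        score += math.exp(1 - kappa[i]) / (r + 1)  # diversity term damped by deformation
    return score / min(len(P), len(Q))


# Toy example: scalar appearance, 2D location (illustrative values only).
P = [((0.0,), (0, 0)), ((1.0,), (1, 0)), ((2.0,), (2, 0))]
Q_reversed = [((0.0,), (2, 0)), ((1.0,), (1, 0)), ((2.0,), (0, 0))]  # same appearance, moved
Q_shared = [((0.0,), (0, 0)), ((0.1,), (0, 0)), ((0.2,), (0, 0))]    # all map to one NN

print(ddis(P, P))           # unique NNs, no deformation
print(ddis(Q_reversed, P))  # unique NNs, but two points are displaced
print(ddis(Q_shared, P))    # a single shared NN
```

The three cases separate the two effects: displacement alone lowers the score through the $1/(r + 1)$ factor, while shared NNs lower it through the exponent.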
DDIS possesses several properties that make it attractive: (1) it relies mostly on a subset of matches, i.e., points that have distinct NNs. Points that share NNs will have less influence on the score. (2) DDIS does not require any prior knowledge on the data or its underlying deformation. (3) DDIS analyses the NN field, rather than using the actual distance values. These properties allow DDIS to overcome challenges such as background clutter, occlusions, and non-rigid deformations.
DIS and DDIS capture diversity in two different ways. DIS simply counts unique matches in $P$, while DDIS measures exponentially the distinctiveness of each NN match of patches in $Q$. Nonetheless, we next show that DIS and DDIS are highly related.
We start by ignoring the deformations by setting $r_j = 0$ in (3) and simplifying (without loss of generality) by assuming $|P| = |Q| = N$. We denote by $\rho$ the fraction of points $p_i \in P$ that are NNs of at least one point in $Q$. Both DIS and DDIS reach their maximum value of $1$ when $\rho = 1$, i.e., when every $p_i$ is the NN of exactly one $q_j$. When $\rho = 1/N$, i.e., all $q_j$ share a single NN, both scores reach their minimum value, $\mathrm{DIS} = 1/N$ and $\mathrm{DDIS} = \exp(1 - N)$.
Further intuition can be derived from the case of uniform distribution of NN matches, i.e., when only $\rho N$ points in $P$ are NNs of some $q_j$, and for all of them $\kappa(p_i) = 1/\rho$. In this case $\mathrm{DIS} = \rho$ and $\mathrm{DDIS} = \exp(1 - 1/\rho)$. Both measures share extrema points between which they drop monotonically as a function of $\rho$, with DDIS decreasing faster due to its exponential nature. This is illustrated in Figure 3.
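The uniform case can be worked through explicitly. The short derivation below assumes the normalization $c = 1/N$, zero deformation ($r_j = 0$), and that exactly $\rho N$ template points are each selected by $1/\rho$ target points; the numerical values at the end are illustrative only.

```latex
\begin{align*}
\mathrm{DIS}  &= \frac{1}{N}\,\bigl|\{p_i : \kappa(p_i) \ge 1\}\bigr|
               = \frac{1}{N}\,\rho N = \rho, \\
\mathrm{DDIS} &= \frac{1}{N}\sum_{q_j \in Q} \exp\!\Bigl(1 - \tfrac{1}{\rho}\Bigr)
               = \exp\!\Bigl(1 - \tfrac{1}{\rho}\Bigr). \\[4pt]
\rho = 1:\quad            &\mathrm{DIS} = 1,    &&\mathrm{DDIS} = e^{0} = 1 \\
\rho = \tfrac{1}{2}:\quad &\mathrm{DIS} = 0.5,  &&\mathrm{DDIS} = e^{-1} \approx 0.368 \\
\rho = \tfrac{1}{4}:\quad &\mathrm{DIS} = 0.25, &&\mathrm{DDIS} = e^{-3} \approx 0.050
\end{align*}
```

Halving $\rho$ halves DIS but reduces DDIS by far more, which is the faster decay noted above.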
To further cement our assertions that diversity captures the similarity between two distributions, we provide statistical analysis, similar to that presented in [4, 18]. Our goal is to show that the expectation of DIS and DDIS is maximal when the points in both sets are drawn from the same distribution, and drops sharply as the distance between the two distributions increases. We do that via a simple 1D mathematical model, in which an image window is modeled as a set of points drawn from a general distribution.
Appendix A presents derivations of $E[\mathrm{DIS}]$ (the expected value of DIS) when the points are drawn from two given distributions. The expression for $E[\mathrm{DIS}]$ does not have a closed form solution, but it can be solved numerically for selected underlying distributions. Therefore, we adopt the same setup as the BBS analysis, where the two underlying distributions are assumed to be Gaussian, which is often used as a simple statistical model of image patches. We then use Monte-Carlo integration to approximate the expectation for discrete choices of the parameters $\mu$ and $\sigma$. For BBS and SSD we adopt the derivations of the BBS analysis, where $E[\mathrm{BBS}]$ was also approximated via Monte-Carlo integration and $E[\mathrm{SSD}]$ is normalized.
Figure 4 presents the resulting approximated expected values. It can be seen that DIS is likely to be maximized when the distributions are the same, and falls rapidly when the distributions differ from each other. In addition, it is evident that DIS and BBS present highly similar behaviors. Finally, as in the BBS analysis, one can show that this also holds for the multi-dimensional case.
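The qualitative trend can also be checked with a direct simulation, which is a different technique from the Monte-Carlo integration of the analytic expression used above. The sketch assumes template points drawn from $N(0, 1)$ and target points from $N(\mu, \sigma)$; the set size, number of trials and seed are arbitrary illustrative choices.

```python
import random


def dis_1d(Q, P):
    """DIS for scalar points: fraction of template points that are someone's NN."""
    unique = {min(range(len(P)), key=lambda i: abs(q - P[i])) for q in Q}
    return len(unique) / min(len(P), len(Q))


def expected_dis(mu, sigma, n=50, trials=200, seed=0):
    """Average DIS over random draws of template ~ N(0, 1) and target ~ N(mu, sigma)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        P = [rng.gauss(0.0, 1.0) for _ in range(n)]
        Q = [rng.gauss(mu, sigma) for _ in range(n)]
        total += dis_1d(Q, P)
    return total / trials


same = expected_dis(0.0, 1.0)     # identical distributions
shifted = expected_dis(5.0, 1.0)  # target shifted far from the template
print(same, shifted)
```

With identical distributions a large fraction of template points is selected, whereas a strongly shifted target maps almost entirely onto the few extreme template points.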
For DDIS we cannot derive nice expressions for its expectation $E[\mathrm{DDIS}]$. Instead, we use simulations to approximate it. The simulation also needs to consider locations in order to quantify the amount of deformation, $r_j$, in (3). When $r_j = 0$ the expectation is similar to that of BBS and DIS. For $r_j \neq 0$ we simulate two cases: (i) Small deformation: We sort the points in each set based on their appearance coordinate, and take as position their index in the sorted list. When the distributions are different the diversity is very low anyhow. But when the distributions are similar, the sorting results in points and their NN having a similar index, which corresponds to small deformation. (ii) Large deformation: We sort the points of one set in descending order and the other set in ascending order, again taking as position their index in the sorted list. When the distributions are similar, the sorting results in points and their NN having a different index, which corresponds to large deformation. Figure 4 shows that for small deformation $E[\mathrm{DDIS}]$ drops sharply as the distributions become more different. For large deformations it is always low, as desired, since even when the appearances are similar, if the geometric deformation is large the overall similarity between the point sets is low.
Our measures bear resemblance to BBS: all rely on NN matches between two sets of points. There are, however, two key differences: (i) the way in which similarity between the two sets is measured, and (ii) the penalty on the amount of spatial deformation. We next analyze the implications of these differences.
The key idea behind the bi-directional similarity approaches of [4, 18, 24] is that robust matching requires bi-directional feature correspondences. Our unilateral measures contradict this claim. In fact, we observe that diversity provides a good approximation to BBS. The analysis we present is for DIS, since it is simpler than DDIS and does not incorporate deformation, which makes the comparison to BBS fairer and more direct.
Recall that BBS counts the number of bi-directional NN matches between the target and template. A pair of points $q_j \in Q$ and $p_i \in P$ is considered a best-buddies-pair (BBP) if $p_i$ is the NN of $q_j$, and $q_j$ is the NN of $p_i$. BBS counts the number of BBPs as a measure of similarity between $P$ and $Q$. Clearly, the bilateral nature of BBS is wasteful in terms of computations, compared to the unilateral DDIS and DIS.
DIS and BBS are defined differently; however, since the number of patches in both target and template is finite, DIS provides a good approximation to BBS. As illustrated in Figure 4(a), when the distributions of points in $P$ and $Q$ are similar, many of the NN relations are bi-directional. This implies that the values of BBS and DIS are very similar. In the extreme case when the template and target are identical, every point has a unique NN and they form a BBP. In this case DIS = BBS exactly.
DIS and BBS behave similarly also when the distributions are different, as illustrated in Figure 4(b). In areas where $P$ is sparse and $Q$ is dense we get multiple points $q_j$ that share the same NN $p_i$. At most one of them forms a BBP, and their joint contribution to both DIS and BBS is $1$ (before normalization). Since the number of points in $P$ and $Q$ is finite, this implies that there are other areas where $P$ is dense and $Q$ is sparse. In these areas there are many points in $P$ that are not the NN of any $q_j$, and have zero contribution to both DIS and BBS.
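The relation between the two counts can be tried on small scalar examples. The sketch below compares unnormalized counts only; it assumes brute-force NN search with absolute difference as the distance, and the helper names and data are hypothetical.

```python
def nn_index(x, S):
    """Index of the element of S closest to x."""
    return min(range(len(S)), key=lambda i: abs(x - S[i]))


def dis_count(Q, P):
    """Number of distinct template points that are the NN of some target point."""
    return len({nn_index(q, P) for q in Q})


def bbp_count(Q, P):
    """Number of best-buddies pairs: q's NN in P is p, and p's NN in Q is q."""
    return sum(1 for j, q in enumerate(Q)
               if nn_index(P[nn_index(q, P)], Q) == j)


P = [0.0, 1.0, 2.0, 3.0]
Q_dense = [0.1, 0.2, 0.3, 2.9]   # three target points crowd one template point

print(dis_count(P, P), bbp_count(P, P))              # identical sets
print(dis_count(Q_dense, P), bbp_count(Q_dense, P))  # crowded region counted once
```

In the crowded example the three points near 0 contribute a single unit to either count, and the template points at 1.0 and 2.0 contribute nothing to either.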
The need to penalize large deformations was noted in [4, 18]. This was done implicitly by adding the coordinates to the feature vectors when searching for NNs. The distance between a pair of points is taken as a weighted linear combination of their appearance and position difference. This is different from DDIS that considers only appearance for NN matching and explicitly penalizes the deformation in the obtained NN field. Our approach has two benefits: (i) improved runtime, and (ii) higher detection accuracy.
Using only appearance for NN matching significantly reduces runtime since while every image patch is shared by many sub-windows, its coordinates are different in each of them. This implies that the NN field needs to be computed for each image sub-window separately. Conversely, working in appearance space allows us to perform a single NN search per image patch. In Section 6 we analyze the benefits in terms of computational complexity.
Separating between appearance and position also leads to more accurate template localization. Overlapping target windows with very similar appearance could lead to very similar similarity scores. DDIS chooses the window implying less deformations. Our experiments indicated that this is important and improves the localization accuracy.
To utilize DDIS for template matching in images, we follow the traditional raster scan approach. Our algorithm gets as input a target image and a template. Its output is a frame placing the template within the target image. Each template-sized sub-window of the target image is compared to the template. We extract feature vectors from the template and from the sub-window, as described below, yielding the sets $P$ and $Q$ respectively. We use the Euclidean distance to compare the appearance features of template and target points. The deformation length is the Euclidean distance between the coordinates of a target point and of its NN in the template. Our implementation consists of the following phases:
We experimented with two forms of appearance features: color and deep features. As color features we set the appearance vectors of the template and target points to the vectorized RGB pixel values of overlapping patches. To obtain deep features we used the popular VGG-Deep-Net. More specifically, we take feature maps from a subset of its layers (akin to a suggestion made for object tracking). We forsake the higher layers since we found their low spatial resolution to be inaccurate. The feature maps were normalized to zero mean and unit standard deviation, and then upscaled, via bi-linear interpolation, to reach the original image size.
1. NN search: We find for each feature vector in , its approximate NN in the template . We use TreeCANN  with PCA dimensionality reduction to 9 dimensions, kd-tree approximation parameter , dense search (), and window parameter , .
2. Similarity map calculation: For each target image pixel (ignoring boundary pixels) we compute the similarity between its surrounding sub-window and the template . For each , we first compute as defined in (2). Since subsequent windows have many overlaps, the computation of needs only update the removed and added features with respect to the previous sub-window. We then calculate DDIS as defined in (3).
3. Target localization: Finally, the template location is that with maximum score. Before taking the maximum, we smooth the similarity map with a uniform kernel of size , to remove spurious isolated peaks.
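The three phases above can be sketched on a deliberately simplified problem. The sketch assumes a 1D signal instead of an image, one scalar feature per sample, brute-force NN search in place of TreeCANN, and no smoothing of the similarity map. All names and values are hypothetical.

```python
import math


def ddis_map_1d(S, T):
    """DDIS score for every length-len(T) window of the 1D signal S."""
    w = len(T)
    # Phase 1: one appearance-only NN search per target sample, shared by all windows.
    nn = [min(range(w), key=lambda i: abs(s - T[i])) for s in S]

    kappa = [0] * w
    for j in range(w):                      # kappa for the first window
        kappa[nn[j]] += 1

    scores = []
    for start in range(len(S) - w + 1):
        if start > 0:                       # Phase 2: incremental kappa update
            kappa[nn[start - 1]] -= 1       # sample that left the window
            kappa[nn[start + w - 1]] += 1   # sample that entered the window
        total = 0.0
        for k in range(w):
            i = nn[start + k]
            r = abs(k - i)                  # offset in window vs. offset in template
            total += math.exp(1 - kappa[i]) / (r + 1)
        scores.append(total / w)
    return scores


def locate(scores):
    """Phase 3: index of the best-scoring window."""
    return max(range(len(scores)), key=scores.__getitem__)


T = [5.0, 1.0, 4.0]                          # template
S = [0.0, 0.2, 5.1, 0.9, 4.2, 0.1, 0.3]      # target containing a noisy copy at index 2
scores = ddis_map_1d(S, T)
print(locate(scores), scores)
```

Because matching uses appearance only, the NN list is computed once and reused by every window; only the counts and the deformation lengths change as the window slides.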
The sets $P$ and $Q$ consist of features from all locations in the template and in the sub-window, respectively. This implies $|P| = |Q|$. The number of possible sub-windows is roughly the number of pixels in the target image. (In practice, we exclude patches that are not fully inside the template or the sub-window, but these are negligible for our complexity analysis.) Recall that $d$ denotes the feature vector length. For color features $d$ equals the size of the patch, while for deep features it is determined by the feature map dimension. Next, we analyze the complexity of steps (1-3).
1. NN search: TreeCANN consists of two stages. In the first, the dimension of all template points is reduced from to in and a k-d tree is built in . The second stage performs the queries. Each query consists of dimensionality reduction , a search in the k-d tree (on average), and a propagation phase which leverages spatial coherency . The overall complexity for finding the Approximate NN for all the features in the target image is on average. The memory consumption is .
2. Similarity map calculation: Assuming for simplicity that , the update of takes operations, for any except for the first one. Next, DDIS is calculated with operations. Given that the overall number of sub-windows is this step’s complexity is . The memory consumption for this stage is which is the size of a table holding .
3. Target localization: Averaging the similarity map is done efficiently using an integral image in . To find the maxima location another swipe over the image is needed, which takes .
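Smoothing with a uniform kernel via an integral image can be sketched as follows. The kernel size used here (3) and the clipping of windows at the image border are assumptions made for illustration, as are the function names and the toy map.

```python
def box_smooth(score_map, k):
    """Mean filter with a k x k uniform kernel (k odd), using an integral image.
    Windows are clipped at the borders."""
    h, w = len(score_map), len(score_map[0])
    # integral[y][x] holds the sum of score_map over rows < y and columns < x.
    integral = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += score_map[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - half), min(h, y + half + 1)
        for x in range(w):
            x0, x1 = max(0, x - half), min(w, x + half + 1)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            out[y][x] = total / ((y1 - y0) * (x1 - x0))
    return out


def argmax_2d(m):
    """(row, column) of the largest entry."""
    cells = ((y, x) for y in range(len(m)) for x in range(len(m[0])))
    return max(cells, key=lambda p: m[p[0]][p[1]])


# Toy map: a broad response around (2, 2) and an isolated spike at (0, 4).
raw = [[0.0] * 5 for _ in range(5)]
for y in range(1, 4):
    for x in range(1, 4):
        raw[y][x] = 0.8
raw[0][4] = 1.0

print(argmax_2d(raw))                  # the isolated spike wins before smoothing
print(argmax_2d(box_smooth(raw, 3)))   # the broad response wins after smoothing
```

Each output value costs four table look-ups regardless of the kernel size, which is why the smoothing pass is linear in the number of pixels.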
Putting it all together, we get that the overall complexity of Template Matching with DDIS is where we omitted since and are expected to be of the same order for small and for large .
One of the benefits of DDIS with respect to BBS is that it requires only unilateral matches. The benefit in terms of complexity can now be made clear. According to , the BBS complexity using deep features is and for color features it is on average. The latter case uses heavy caching which consumes memory (assuming ).
Our experimental setup follows a benchmark created by sampling frames from video sequences annotated with bounding-boxes for object tracking. The videos present a variety of challenges: complex deformations, luminance changes, scale differences, out-of-plane rotations, occlusion and more. The benchmark consists of three data-sets, each generated by sampling pairs of frames with a different constant frame (time) difference, producing increasingly challenging data-sets and, overall, a challenging benchmark for template matching.
For each pair of frames, one is used to define the template as the annotated ground-truth box, while the second is used as a target image. As commonly done in object tracking, the overlap between the detection result and the ground-truth annotation of the target is taken as a measure of accuracy:
$$\mathrm{Accuracy} = \frac{|R_{est} \cap R_{truth}|}{|R_{est} \cup R_{truth}|}$$
where $|\cdot|$ counts the number of pixels in a region and $R_{truth}$ and $R_{est}$ are the ground truth and estimated rectangles, locating the template in the target image.
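The overlap measure is the usual intersection-over-union of two rectangles. A minimal sketch, assuming axis-aligned rectangles given as (x, y, width, height) in pixels (a representation chosen here for illustration):

```python
def overlap(a, b):
    """Intersection-over-union of two axis-aligned rectangles (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))   # width of the intersection
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))   # height of the intersection
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


truth = (0, 0, 10, 10)
print(overlap(truth, truth))            # perfect detection
print(overlap(truth, (5, 0, 10, 10)))   # half-shifted detection
print(overlap(truth, (20, 20, 5, 5)))   # disjoint detection
```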
Quantitative Evaluation: We compare DDIS and DIS to BBS, BDS, SSD, SAD and NCC with both color and deep features. For BBS and DIS we use the exact same setup as in , that is, non-overlapping patches represented in space. In Figure 6 we plot for each data-set and method a success rate curve. It can be seen from the Area-Under-Curve (AUC) scores in Table 5(d) that DDIS is significantly more successful than all previous methods. Furthermore, DDIS with our simplistic color features outperforms all other methods with either color or deep features. When using Deep features, DDIS improves over BBS with margins of for the three data-sets. When using color features the margins are .
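A success-rate curve and its AUC can be computed from per-pair overlap values along the following lines. This is a sketch under stated assumptions: success at a threshold is taken as the fraction of pairs whose overlap strictly exceeds it, thresholds are sampled uniformly in [0, 1], and the area is obtained with the trapezoidal rule. The exact conventions of the benchmark may differ, and the overlap values are made up.

```python
def success_curve(overlaps, num_thresholds=101):
    """Fraction of image pairs whose overlap exceeds each threshold in [0, 1]."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return thresholds, rates


def auc(thresholds, rates):
    """Trapezoidal area under the success-rate curve."""
    return sum((thresholds[i + 1] - thresholds[i]) * (rates[i] + rates[i + 1]) / 2
               for i in range(len(rates) - 1))


perfect = auc(*success_curve([1.0, 1.0, 1.0, 1.0]))
mixed = auc(*success_curve([0.2, 0.9]))
failed = auc(*success_curve([0.0, 0.0]))
print(perfect, mixed, failed)
```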
Qualitative Evaluation: Figure 8 displays several detection results on challenging examples, taken from the web, that include occlusions, significant deformations, background clutter and blur. We compare DDIS and DIS to BBS – the current state-of-the-art. It is evident from the detection likelihood maps that DIS and BBS share a similar behavior, supporting our suggestion that unilateral matching suffices to capture similarity. DDIS, on the other hand, accounts also for deformations, hence, it presents cleaner maps, with fewer distractors.
Runtime: Our implementation is in MATLAB/c++ and all experiments were performed on a 32GB RAM, Intel i7 quad-core machine. The average(std) runtime for an image-pair in the benchmark, using color features is , depending on the template size. For comparison, the average(std) time for BBS is orders of magnitude longer: . The max and min runtimes of DDIS are and , respectively, and for BBS are and , respectively. Detailed results for are presented in Figure 7. This matches our complexity analysis that showed that DDIS is less affected by the template size, while BBS dependence on is polynomial.
We introduced a new approach for template matching in the wild, based on properties of the NN field of matches between target and template features. Our method offers an improvement not only in detection accuracy, but also in computational complexity. A drawback of our algorithm is that it does not deal with significant scale change of the object. This could possibly be addressed by computing the likelihood maps over multiple scales. A future research direction is to consider more than the first NN for each patch. This could be beneficial for handling repetitive textures.
An important observation that our analysis makes is that one does not necessarily need bi-directional matches to compute similarity. This raises questions regarding the celebrated bi-directional-similarity approach, which provided excellent results but was heavy to compute.
This research was supported by the Israel Science Foundation Grant 1089/16 and by the Ollendorf foundation.
In this appendix we develop mathematical expressions for the expectation of DIS in . We start by rewriting DIS in a form convenient for our derivations:
where is an indicator function and indicates whether is chosen as a NN match at least once.
We proceed with the expectation:
where the last step is since samples and are drawn independently, so all indexes behave alike and we can choose some arbitrary index . Continuing with the expectation of the indicator function, we have:
where and are the CDF’s of Q and P, respectively. are defined by:
Given a known set of samples P, the probability that the NN match for a sampledis NOT is:
where we split into two ranges where the indicator is not zero. Since consists of independently sampled points, the probability that is not a NN match for any when is sampled and is known, is:
Finally, since all of the points are sampled independently, we have:
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 142–149, 2000.
Performance evaluation of full search equivalent pattern matching algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1):127–143, 2012.