Best-Buddies Similarity - Robust Template Matching using Mutual Nearest Neighbors

09/06/2016 ∙ by Shaul Oron, et al. ∙ Google, MIT, Tel Aviv University

We propose a novel method for template matching in unconstrained environments. Its essence is the Best-Buddies Similarity (BBS), a useful, robust, and parameter-free similarity measure between two sets of points. BBS is based on counting the number of Best-Buddies Pairs (BBPs): pairs of points in the source and target sets in which each point is the nearest neighbor of the other. BBS has several key features that make it robust against complex geometric deformations and high levels of outliers, such as those arising from background clutter and occlusions. We study these properties, provide a statistical analysis that justifies them, and demonstrate the consistent success of BBS on a challenging real-world dataset while using different types of features.


1 Introduction

Finding a template patch in a target image is a core component in a variety of computer vision applications such as object detection, tracking, image stitching and 3D reconstruction. In many real-world scenarios, the template—a bounding box containing a region of interest in the source image—undergoes complex deformations in the target image: the background can change and the object may undergo nonrigid deformations and partial occlusions.

Template matching methods have been used with great success over the years but they still suffer from a number of drawbacks. Typically, all pixels (or features) within the template and a candidate window in the target image are taken into account when measuring their similarity. This is undesirable in some cases, for example, when the background behind the object of interest changes between the template and the target image (see Fig. 1). In such cases, the dissimilarities between pixels from different backgrounds may be arbitrary, and accounting for them may lead to false detections of the template (see Fig. 1(b)).

In addition, many template matching methods assume a specific parametric deformation model between the template and the target image (e.g., rigid, affine transformation, etc.). This limits the type of scenes that can be handled, and may require estimating a large number of parameters when complex deformations are considered.

In order to address these challenges, we introduce a novel similarity measure termed Best-Buddies Similarity (BBS), and show that it can be applied successfully to template matching in the wild. In order to compute the BBS, we first represent both the template patch and each candidate query patch as point sets in $\mathbb{R}^d$. Then, instead of searching for a parametric deformation between template and candidate, we directly measure the similarity between these point sets. We analyze key features of BBS, and perform extensive evaluation of its performance compared to a number of commonly used alternatives on challenging datasets.

Figure 1: Best-Buddies Similarity (BBS) for Template Matching: (a) The template, marked in green, contains an object of interest against a background. (b) The object in the target image undergoes complex deformation (background clutter and large geometric deformation); the detection results using different similarity measures are marked on the image (see legend); our result is marked in blue. (c) The Best-Buddies Pairs (BBPs) between the template and the detected region are mostly found on the object of interest and not on the background; each BBP is connected by a line and marked in a unique color.

BBS measures the similarity between two sets of points in $\mathbb{R}^d$. A key feature of this measure is that it relies only on a subset (usually small) of pairs of points – the Best-Buddies Pairs (BBPs). A pair of points is considered a BBP if the points are mutual nearest neighbors, i.e., each point is the nearest neighbor of the other in the corresponding point set. BBS is then taken to be the fraction of BBPs out of all the points in the set.

Albeit simple, this measure turns out to have important and nontrivial properties. Because BBS counts only the pairs of points that are best buddies, it is robust to significant amounts of outliers. Another, less obvious property is that the BBS between two point sets is maximal when the points are drawn from the same distribution, and drops sharply as the distance between the distributions increases. In other words, if two points are BBP, they were likely drawn from the same distribution. We provide a statistical formulation of this observation, and analyze it numerically in the 1D case for point sets drawn from distinct Gaussian distributions (often used as a simplified model for natural images).

Modeling image data as distributions, i.e., using histograms, has been successfully applied to many computer vision tasks due to its simple yet effective non-parametric representation. A prominent distance measure between histograms is the Chi-Square ($\chi^2$) distance, in which the contribution of each bin to the similarity score is proportional to the overall probability stored in that bin.

In this work we show that, for sufficiently large sets, BBS converges to the $\chi^2$ distance between distributions. However, unlike $\chi^2$, computing BBS is done directly on the raw data, without the need to construct histograms. This is advantageous as it alleviates the need to choose the histogram bin size. Another benefit is the ability to work with high-dimensional representations, such as Deep features, for which constructing histograms is not tractable.

More generally, we show a link between BBS and a well known statistical measure. This provides additional insight into the statistical properties of mutual nearest neighbors, and also sheds light on the ability of BBS to reliably match features coming from the same distribution, in the presence of outliers.

We apply the BBS measure to template matching by representing both the template and each of the candidate image regions as point sets in a joint location-appearance space. To this end, we use normalized coordinates for location and experiment with both color as well as Deep features for appearance (although, BBS is not restricted to these specific choices). BBS is used to measure the similarity between the two sets of points in these spaces. The aforementioned properties of BBS now readily apply to template matching. That is, pixels on the object of interest in both the template and the candidate patch can be thought of as originating from the same underlying distribution. These pixels in the template are likely to find best buddies in the candidate patch, and hence would be considered as inliers. In contrast, pixels that come from different distributions, e.g., pixels from different backgrounds, are less likely to find best buddies, and hence would be considered outliers (see Fig. 1(c)). Given this important property, BBS bypasses the need to explicitly model the underlying object appearance and deformation.

To summarize, the main contributions of this paper are: (a) introducing BBS – a useful, robust, parameter-free measure for template matching in unconstrained environments, (b) analysis providing theoretical justification of its key features and linking BBS with the Chi-Square distance, and (c) extensive evaluation on challenging real data, using different feature representations, and comparing BBS to a number of commonly used template matching methods. A preliminary version of this paper appeared in CVPR 2015 [1].

Figure 2: Best-Buddies Pairs (BBPs) between 2D Gaussian Signals: First row: the signal P consists of “foreground” points drawn from a normal distribution $N(\mu_1, \Sigma_1)$, marked in blue, and “background” points drawn from $N(\mu_2, \Sigma_2)$, marked in red. Similarly, the points in the second signal Q are drawn from the same foreground distribution $N(\mu_1, \Sigma_1)$, and from a different background distribution $N(\mu_3, \Sigma_3)$. The color of the points is for illustration only, i.e., BBS does not know which point belongs to which distribution. Second row: only the BBPs between the two signals, which are mostly found between foreground points.

2 Related Work

Template matching algorithms depend heavily on the similarity measure used to match the template and a candidate window in the target image. Various similarity measures have been used for this purpose. The most popular are the Sum of Squared Differences (SSD), Sum of Absolute Differences (SAD) and Normalized Cross-Correlation (NCC), mostly due to their computational efficiency [2]. Different variants of these measures have been proposed to deal with illumination changes and noise [3, 4].

Another family of measures is composed of robust error functions such as M-estimators [5, 6] or Hamming-based distances [7, 8], which are less affected by additive noise and 'salt and pepper' outliers than cross-correlation-based methods. However, all the methods mentioned so far assume a strict rigid geometric deformation (only translation) between the template and the target image, as they penalize pixel-wise differences at corresponding positions in the template and the query region.

A number of methods extended template matching to deal with parametric transformations (e.g., [9, 10]). Recently, Korman et al. [11] introduced a template matching algorithm under 2D affine transformation that guarantees an approximation to the globally optimal solution. Likewise, Tian and Narasimhan [12] find a globally optimal estimation of nonrigid image distortions. However, these methods assume a one-to-one mapping between the template and the query region for the underlying transformation. Thus, they are prone to errors in the presence of many outliers, such as those caused by occlusions and background clutter. Furthermore, these methods assume a parametric model for the distortion geometry, which is not required in the case of BBS.

Measuring the similarity between color histograms, known as Histogram Matching (HM), offers a non-parametric technique for dealing with deformations and is commonly used in visual tracking [13, 14]. Yet, HM completely disregards geometry, which is a powerful cue. Furthermore, all pixels are treated evenly. Other tracking methods have been proposed to deal with cluttered environments and partial occlusions [15, 16]. But unlike tracking, we are interested in detection in a single image, which lacks the redundant temporal information available in videos.

Olson [17] formulated template matching in terms of maximum likelihood estimation, where an image is represented in a 3D location-intensity space. Taking this approach one step further, Oron et al. [18] use a joint location-color space and reduce template matching to measuring the EMD [19] between two point sets. Unlike EMD, BBS does not require one-to-one matching; it therefore does not have to account for all the data when matching, which makes it more robust to outliers.

The BBS is a bi-directional measure. The importance of such two-sided agreement has been demonstrated by the Bidirectional Similarity (BDS) in [20] for visual summarization. Specifically, the BDS was used as a similarity measure between two images, where an image is represented by a set of patches. The BDS sums over the distances between each patch in one image and its nearest neighbor in the other image, and vice versa.

In the context of image matching, another widely used measure is the Hausdorff distance [21]. To deal with occlusions or degradations, Huttenlocher et al. [21] proposed a fractional Hausdorff distance in which the $k$-th farthest point is taken instead of the farthest one. Yet, this measure depends strongly on the parameter $k$, which needs to be tuned. Alternatively, Dubuisson and Jain [22] replace the max operator with a sum, which is similar to the way BDS is defined.

In contrast, the BBS is based on a count of the BBPs, and makes only implicit use of their actual distances. Moreover, the BDS does not distinguish between inliers and outliers. These properties make the BBS a more robust and reliable measure, as demonstrated by our experiments.

We show a connection between BBS and the Chi-Square ($\chi^2$) distance, used as a distance measure between distributions (or histograms). The $\chi^2$ distance comes from the $\chi^2$ test-statistic [23], where it is used to test the fit between a distribution and observed frequencies. $\chi^2$ was successfully applied to a wide range of computer vision tasks, such as texture and shape classification [24, 25], local descriptor matching [26], and boundary detection [27], to name a few.

It is worth mentioning that the term Best Buddies was used by Pomeranz et al. [28] in the context of solving jigsaw puzzles. Specifically, they used a metric similar to ours in order to determine whether a pair of pieces is compatible with each other.

The power of mutual nearest neighbors was previously leveraged for tasks such as image matching [29], classification of images [30] and natural language data [31], clustering  [32] and more. In this work we demonstrate its use for template matching while providing some new statistical analysis.

Figure 3: BBS template matching results. Three toy examples are shown: (A) cluttered background, (B) occlusions, (C) nonrigid deformation. The template (first column) is detected in the target image (second column) using BBS; the results are marked in blue. The likelihood maps (third column) show well-localized, distinct modes. The BBPs are shown in the last column. See text for more details.

3 Best-Buddies Similarity

Our goal is to match a template to a given image, in the presence of high levels of outliers (i.e., background clutter, occlusions) and nonrigid deformation of the object of interest. We follow the traditional sliding window approach and compute the Best-Buddies Similarity (BBS) between the template and every window (of the size of the template) in the image. In the following, we give a general definition of BBS and demonstrate its key features via simple intuitive toy examples. We then statistically analyze these features in Sec. 4.

General Definition:  BBS measures the similarity between two sets of points $P = \{p_i\}_{i=1}^{N}$ and $Q = \{q_j\}_{j=1}^{M}$, where $p_i, q_j \in \mathbb{R}^d$. The BBS is the fraction of Best-Buddies Pairs (BBPs) between the two sets. Specifically, a pair of points $\{p_i \in P, q_j \in Q\}$ is a BBP if $p_i$ is the nearest neighbor of $q_j$ in the set $P$, and vice versa. Formally,

$bb(p_i, q_j, P, Q) = \begin{cases} 1 & \mathrm{NN}(p_i, Q) = q_j \;\wedge\; \mathrm{NN}(q_j, P) = p_i \\ 0 & \text{otherwise,} \end{cases}$   (1)

where $\mathrm{NN}(p_i, Q) = \operatorname{argmin}_{q \in Q} d(p_i, q)$, and $d(\cdot, \cdot)$ is some distance measure. The BBS between the point sets $P$ and $Q$ is given by:

$BBS(P, Q) = \frac{1}{\min\{N, M\}} \cdot \sum_{i=1}^{N}\sum_{j=1}^{M} bb(p_i, q_j, P, Q)$.   (2)

The key properties of the BBS are: 1) it relies only on a (usually small) subset of matches, i.e., pairs of points that are BBPs, whereas the rest are considered outliers; 2) BBS finds the bi-directional inliers in the data without any prior knowledge of the data or its underlying deformation; 3) BBS uses rank, i.e., it counts the number of BBPs, rather than using the actual distance values.
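To make the definition concrete, below is a minimal NumPy sketch of Eqs. 1 and 2 (the brute-force distance computation and the function name are our choices; an actual implementation would use the caching scheme of Sec. 5.2):

```python
import numpy as np

def bbs(P, Q):
    """Best-Buddies Similarity between point sets P (N x d) and Q (M x d).

    A pair (p_i, q_j) is a Best-Buddies Pair (BBP) if p_i is the nearest
    neighbor of q_j in P, and q_j is the nearest neighbor of p_i in Q.
    BBS is the number of BBPs divided by min(N, M) (Eq. 2).
    """
    # Pairwise squared Euclidean distances: D[i, j] = ||p_i - q_j||^2.
    D = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
    nn_of_p = D.argmin(axis=1)  # index in Q of each p_i's nearest neighbor
    nn_of_q = D.argmin(axis=0)  # index in P of each q_j's nearest neighbor
    i = np.arange(P.shape[0])
    n_bbp = np.count_nonzero(nn_of_q[nn_of_p[i]] == i)  # mutual NN count (Eq. 1)
    return n_bbp / min(P.shape[0], Q.shape[0])
```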

To understand why these properties are useful, let us consider a simple 2D case of two point sets $P$ and $Q$. The set $P$ consists of 2D points drawn from two different normal distributions, $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$. Similarly, the points in $Q$ are drawn from the same distribution $N(\mu_1, \Sigma_1)$ and from a different distribution $N(\mu_3, \Sigma_3)$ (see first row in Fig. 2). The distribution $N(\mu_1, \Sigma_1)$ can be treated as a foreground model, whereas $N(\mu_2, \Sigma_2)$ and $N(\mu_3, \Sigma_3)$ are two different background models. As can be seen in Fig. 2, the BBPs are mostly found between the foreground points in $P$ and $Q$: when the foreground and background points are well separated, nearly all of the BBPs are foreground points, and even when there is significant overlap between foreground and background, the large majority of the BBPs are still foreground points.

This example demonstrates the robustness of BBS to high levels of outliers in the data. BBS captures the foreground points and does not force the background points to match. In doing so, BBS sidesteps the need to model the background/foreground parametrically or to have prior knowledge of their underlying distributions. It also shows that a pair of points $\{p, q\}$ is more likely to be a BBP if $p$ and $q$ are drawn from the same distribution. We formally prove this general argument for the 1D case in Sec. 4. With these observations in hand, we continue with the use of BBS for template matching.
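The following sketch reproduces the spirit of this experiment; the mixture parameters below are our illustrative placeholders, not the values used for Fig. 2:

```python
import numpy as np

rng = np.random.default_rng(0)
n_fg, n_bg = 100, 100

# Shared foreground model and two distinct background models (illustrative values).
P = np.vstack([rng.normal([0, 0], 1.0, (n_fg, 2)),    # foreground points of P
               rng.normal([6, 6], 1.0, (n_bg, 2))])   # background of P
Q = np.vstack([rng.normal([0, 0], 1.0, (n_fg, 2)),    # same foreground in Q
               rng.normal([-6, 6], 1.0, (n_bg, 2))])  # different background in Q

D = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
nn_of_p, nn_of_q = D.argmin(axis=1), D.argmin(axis=0)
i = np.arange(len(P))
is_bbp = nn_of_q[nn_of_p[i]] == i  # which points of P belong to a BBP

# Most BBPs should fall on the foreground (the first n_fg points of P).
print("fraction of BBPs on the foreground:", is_bbp[:n_fg].sum() / is_bbp.sum())
```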

3.1 BBS for Template Matching

To apply BBS to template matching, one needs to convert each image patch into a point set in $\mathbb{R}^d$. Following [18], we use a joint spatial-appearance space, which was shown to be useful for template matching. BBS, as formulated in Eq. 2, can be computed for any arbitrary feature space and for any distance measure between point pairs. In this paper we focus on two specific appearance representations: (i) using color features, and (ii) using Deep features taken from a pretrained neural net. Using such Deep features is motivated by the recent success in applying features taken from deep neural nets to different applications [33, 34]. A detailed description of each of these feature spaces is given in Section 5.1.

Following the intuition presented in the 2D Gaussian example (see Fig. 2), the use of BBS for template matching allows us to overcome several significant challenges such as background clutter, occlusions, and nonrigid deformation of the object. This is demonstrated in three synthetic examples shown in Fig. 3. The templates in (A) and (B) include the object of interest against a cluttered background and under occlusions, respectively. In both cases the templates are successfully matched to the image despite the high level of outliers. As can be seen, the BBPs are found only on the object of interest, and the BBS likelihood maps have a distinct mode around the true location of the template. In the third example (C), the template is taken to be a bounding box around the fourth duck in the original image, which is removed from the searched image using inpainting techniques. In this case, BBS matches the template to the fifth duck, which can be seen as a nonrigidly deformed version of the template. Note that BBS does not aim to solve the pixel correspondence. In fact, the BBPs are not necessarily semantically correct (see third row in Fig. 3), but are rather pairs of points that likely originated from the same distribution. This property, which we next formally analyze, helps us deal with complex visual and geometric deformations in the presence of outliers.

Figure 4: The expectation of BBS in the 1D Gaussian case: Two point sets, P and Q, are generated by sampling points from $N(\mu_P, \sigma_P)$ and $N(\mu_Q, \sigma_Q)$, respectively. (a) The approximated expectation of BBS(P,Q) as a function of $\mu_Q$ (x-axis) and $\sigma_Q$ (y-axis). (b)-(c) The expectation of SSD(P,Q) and SAD(P,Q), respectively. (d) The expectation of BBS as a function of $\mu_Q$, plotted for different values of $\sigma_Q$.

4 Analysis

So far, we have empirically demonstrated that the BBS is robust to outliers and results in well-localized modes. In what follows, we give a statistical analysis that justifies these properties, and explains why using the count of BBPs is a good similarity measure. Additionally, we show that for sufficiently large sets, BBS converges to the well-known Chi-Square ($\chi^2$) distance. This connection with $\chi^2$ provides additional insight into the way BBS handles outliers.

4.1 Expected value of BBS

We begin with a simple mathematical model in 1D, in which an “image” patch is modeled as a set of points drawn from a general distribution. Using this model, we derive the expectation of BBS between two sets of points, drawn from two given distributions $f_P$ and $f_Q$, respectively. We then analyze numerically the case in which $f_P$ and $f_Q$ are two different normal distributions. Finally, we relate these results to the multi-dimensional case. We show that the BBS distinctively captures points that are drawn from similar distributions. That is, we prove that the likelihood of a pair of points being a BBP, and hence the expectation of the BBS, is maximal when the points in both sets are drawn from the same distribution, and drops sharply as the distance between the two normal distributions increases.

One-dimensional Case:  Following Eq. 2, the expectation of BBS(P,Q) over all possible samples of P and Q is given by:

$E[BBS(P,Q)] = \frac{1}{\min\{N, M\}} \cdot \sum_{i=1}^{N}\sum_{j=1}^{M} E\big[bb(p_i, q_j, P, Q)\big]$,   (3)

where $bb(\cdot)$ is defined in Eq. 1. We continue with computing the expectation of a pair of points to be a BBP, over all possible samples of P and Q, denoted by $E_{BBP}$. That is,

$E_{BBP} = \int\!\cdots\!\int bb(p_i, q_j, P, Q) \prod_{k=1}^{N} f_P(p_k)\, dp_k \prod_{l=1}^{M} f_Q(q_l)\, dq_l$.   (4)

This is a multivariate integral over all points in P and Q. However, assuming each point is independent of the others this integral can be simplified as follows.

Claim: 

$E_{BBP} = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} f_P(p)\, f_Q(q)\, \big[F_P(q^-) + 1 - F_P(q^+)\big]^{N-1} \big[F_Q(p^-) + 1 - F_Q(p^+)\big]^{M-1}\, dp\, dq$,   (5)

where $F_P$ and $F_Q$ denote the CDFs of P and Q, respectively; that is, $F_P(x) = \Pr(p \le x)$. Here $p^- = p - d(p, q)$, $p^+ = p + d(p, q)$, and $q^-$, $q^+$ are similarly defined.

Proof:  Due to the independence between the points, the integral in Eq.4 can be decoupled as follows:

$E_{BBP} = \int\!\!\int f_P(p_i)\, f_Q(q_j) \left[\int\!\cdots\!\int bb(p_i, q_j, P, Q) \prod_{k \ne i} f_P(p_k)\, dp_k \prod_{l \ne j} f_Q(q_l)\, dq_l \right] dp_i\, dq_j$.   (6)

With abuse of notation, we use $p = p_i$ and $q = q_j$. Let us consider the function $bb(p, q, P, Q)$ for a given realization of P and Q. By definition, this indicator function equals 1 when $p$ and $q$ are nearest neighbors of each other, and zero otherwise. This can be expressed in terms of the distance between the points as follows:

$bb(p, q, P, Q) = \prod_{k \ne i} \mathbb{1}\big[d(p_k, q) > d(p, q)\big] \cdot \prod_{l \ne j} \mathbb{1}\big[d(q_l, p) > d(p, q)\big]$,   (7)

where $\mathbb{1}[\cdot]$ is an indicator function. It follows that for a given value of $p$ and $q$, the contribution of each of the remaining points to the integral in Eq. 6 can be decoupled. Specifically, we define:

$C_{p_k} = \int \mathbb{1}\big[d(p_k, q) > d(p, q)\big]\, f_P(p_k)\, dp_k$.   (8)

Assuming $d(x, y) = |x - y|$, the latter can be written as:

$C_{p_k} = \int_{-\infty}^{q^-} f_P(x)\, dx + \int_{q^+}^{\infty} f_P(x)\, dx$,   (9)

where $q^- = q - d(p, q)$ and $q^+ = q + d(p, q)$. Since the points are sampled i.i.d., this term is identical for all $k \ne i$, and it can easily be expressed in terms of $F_P$, the CDF of P:

$C_p = F_P(q^-) + 1 - F_P(q^+)$.   (10)

The same derivation holds for computing $C_q$, the contribution of each $q_l$ to the integral in Eq. 6, given $p$ and $q$. That is,

$C_q = F_Q(p^-) + 1 - F_Q(p^+)$,   (11)

where $p^-$ and $p^+$ are similarly defined and $F_Q$ is the CDF of Q. Note that $C_p$ and $C_q$ depend only on $p$ and $q$ and on the underlying distributions. Therefore, Eq. 6 results in:

$E_{BBP} = \int\!\!\int f_P(p)\, f_Q(q)\; C_p^{\,N-1}\; C_q^{\,M-1}\; dp\, dq$.   (12)

Substituting the expressions for $C_p$ and $C_q$ into Eq. 12 results in Eq. 5, which completes the proof.

In general, the integral in Eq. 5 does not have a closed-form solution, but it can be solved numerically for selected underlying distributions. To this end, we proceed with Gaussian distributions, which are often used as simple statistical models of image patches. We use Monte-Carlo integration to approximate $E_{BBP}$ for discrete choices of the parameters $\mu_Q$ and $\sigma_Q$ of $f_Q$ in the range of [0, 10], while fixing the distribution $f_P = N(\mu_P, \sigma_P)$; the number of points in each set was also fixed. The resulting approximation of $E_{BBP}$ as a function of these parameters is shown in Fig. 4, on the left. As can be seen, $E_{BBP}$ is highest at $(\mu_Q, \sigma_Q) = (\mu_P, \sigma_P)$, i.e., when the points are drawn from the same distribution, and drops rapidly as the underlying distribution of Q deviates from that of P.
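A Monte-Carlo sketch of this approximation (the set size, trial count, and fixed parameters of P below are our choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def approx_expected_bbs(mu_q, sigma_q, n=50, trials=500):
    """Monte-Carlo estimate of E[BBS(P, Q)] for P ~ N(0, 1), Q ~ N(mu_q, sigma_q)."""
    total = 0.0
    for _ in range(trials):
        P = rng.normal(0.0, 1.0, n)
        Q = rng.normal(mu_q, sigma_q, n)
        D = np.abs(P[:, None] - Q[None, :])  # 1D pairwise distances
        nn_of_p, nn_of_q = D.argmin(axis=1), D.argmin(axis=0)
        i = np.arange(n)
        total += np.mean(nn_of_q[nn_of_p[i]] == i)  # fraction of mutual NNs
    return total / trials

# E[BBS] peaks when Q shares P's distribution and decays as mu_q moves away.
for mu_q in [0.0, 1.0, 2.0, 4.0]:
    print(mu_q, approx_expected_bbs(mu_q, 1.0))
```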

Note that $E_{BBP}$ does not depend on $i$ and $j$ (because of the integration; see Eq. 5). Hence, the expected value of the BBS between the sets (Eq. 3) is given by:

$E[BBS(P,Q)] = \frac{N \cdot M}{\min\{N, M\}} \cdot E_{BBP} = c \cdot E_{BBP}$,   (13)

where $c$ is constant.

We can compare the BBS to the expectation of SSD and SAD. The expectation of the SSD has a closed-form solution given by:

$E[\mathrm{SSD}(P,Q)] \propto \mathbb{E}\big[(p-q)^2\big] = (\mu_P - \mu_Q)^2 + \sigma_P^2 + \sigma_Q^2$,   (14)

up to a factor equal to the number of point pairs.

Replacing $(p-q)^2$ with $|p-q|$ results in the expression for the SAD. In this case, the expected value reduces to the expectation of the Half-Normal distribution and is given by:

$E[\mathrm{SAD}(P,Q)] \propto \mathbb{E}\big[|p-q|\big] = \sigma\sqrt{2/\pi}\; e^{-\mu^2/(2\sigma^2)} + \mu\big(1 - 2\Phi(-\mu/\sigma)\big)$,   (15)

where $\mu = \mu_P - \mu_Q$, $\sigma^2 = \sigma_P^2 + \sigma_Q^2$, and $\Phi$ is the standard normal CDF.

Fig. 4(b)-(c) shows the maps of the expected values of SSD and SAD, normalized to the range [0, 1]. As can be seen, the SSD and SAD result in a much wider spread around their modes. Thus, we have shown that the likelihood of a pair of points to be a BBP (and hence the expectation of the BBS) is highest when P and Q are drawn from the same distribution, and drops sharply as the distance between the distributions increases. This makes the BBS a robust and distinctive measure that results in well-localized modes.

Multi-dimensional Case:  With the result of the 1D case in hand, we can bound the expectation of BBS when $P$ and $Q$ are sets of multi-dimensional points, i.e., $p_i, q_j \in \mathbb{R}^d$.

If the $d$ dimensions are uncorrelated (i.e., the covariance matrices are diagonal in the Gaussian case), a sufficient (but not necessary) condition for a pair of points to be a BBP is that the pair is a BBP in each of the dimensions. In this case, the analysis can be done for each dimension independently, similar to what was done in Eq. 5. The expectation of the BBS in the multi-dimensional case is then bounded from below by the product of the expectations in each of the dimensions. That is,

$E[BBS_d(P,Q)] \;\ge\; \prod_{k=1}^{d} E[BBS_k(P,Q)]$,   (16)

where $E[BBS_k(P,Q)]$ denotes the expectation of BBS in the $k$-th dimension. This means that the BBS is expected to be more distinctive, i.e., to drop faster as $d$ increases. Note that if a pair of points is not a BBP in one of the dimensions, it does not necessarily imply that the multi-dimensional pair is not a BBP. Thus, this condition is sufficient but not necessary.

4.2 BBS and Chi-Square

Chi-Square is often used to measure the distance between histograms of two sets of features. For example, in face recognition, $\chi^2$ is used to measure the similarity between local binary patterns (LBP) of two faces [35], where it achieves superior performance relative to other distance measures.

In this section, we discuss the connection between this well-known statistical distance measure and BBS, showing that, for sufficiently large point sets, BBS converges to the $\chi^2$ distance.

We assume, as before, that the point sets $P$ and $Q$ are drawn i.i.d. from the 1D distribution functions $f_P$ and $f_Q$, respectively. We begin by considering the following lemma:

Lemma 1.

Given a point $p \in P$, let $BB(p)$ be the probability that $p$ has a best buddy in $Q$. Then we have:

$\lim_{N \to \infty} BB(p) = \frac{f_Q(p)}{f_P(p) + f_Q(p)}$.   (17)

For the proof of the lemma see Appendix A. Intuitively, if there are many points from $P$ in the vicinity of the point $p$ but only few points from $Q$, i.e., $f_P(p)$ is large but $f_Q(p)$ is small, then it is hard for $p$ to find a best buddy in $Q$, as illustrated in Figure 5(a). Conversely, if there are few points from $P$ in the vicinity of $p$ but many points from $Q$, i.e., $f_P(p)$ is small and $f_Q(p)$ is large, then it is easy for $p$ to find a best buddy, as illustrated in Figure 5(b).

(a) (b)
Figure 5: Finding a Best-Buddy: We illustrate how the underlying density functions affect the probability that a point (bold red circle) has a best buddy. (a) Points from set $P$ (red circles) are dense, but points from set $Q$ (blue crosses) are sparse: although $q$ is the nearest neighbor of $p$ in $Q$, $p$ is not the nearest neighbor of $q$ in $P$ (another point of $P$ is closer). (b) Points from set $Q$ are dense and points from set $P$ are sparse: in this case, $p$ and $q$ are best buddies, as $p$ is the closest point to $q$.
(a) Density functions of two Gaussian mixtures (b) The probability that a point in P has a best buddy
Figure 6: Illustrating Lemma 1: Point sets $P$ and $Q$ are sampled i.i.d. from the two Gaussian mixtures shown in (a). The probability that a point in set $P$ has a best buddy in set $Q$ is empirically computed for different set sizes (b). When the size of the sets increases, the empirical probability converges to the analytical solution in Lemma 1 (dashed black line).

A synthetic experiment illustrating the lemma is shown in Figure 6. Two Gaussian mixtures, each consisting of two 1D Gaussian distributions, are used (Figure 6(a)). Sets $P$ and $Q$ are sampled from these distributions (each set from a different mixture model). We then empirically calculate the probability that a certain point from set $P$ has a best buddy in set $Q$, for different set sizes ranging from 10 to 10000 points (Figure 6(b)). As the set size increases, the empirical probability converges to the analytical value given by Lemma 1, marked by the dashed black line. Note how the results agree with our intuition: at points where $f_P$ is very large but $f_Q$ is almost zero, the probability of having a best buddy is almost 0, whereas at points where $f_P$ is very small and $f_Q$ is large, it is almost 1.
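The experiment can be sketched as follows, using the limit of Eq. 17 as reconstructed above; the mixture parameters are our own illustrative choices:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
MUS_P, MUS_Q, SIGMAS = [-2.0, 1.0], [2.0, 1.0], [0.4, 1.0]

f_P = lambda x: 0.5 * (norm.pdf(x, MUS_P[0], SIGMAS[0]) + norm.pdf(x, MUS_P[1], SIGMAS[1]))
f_Q = lambda x: 0.5 * (norm.pdf(x, MUS_Q[0], SIGMAS[0]) + norm.pdf(x, MUS_Q[1], SIGMAS[1]))

def sample_mix(n, mus):
    comp = rng.integers(0, 2, n)  # pick a mixture component per sample
    return rng.normal(np.take(mus, comp), np.take(SIGMAS, comp))

def empirical_bb(x0, n, trials=2000):
    """Fraction of trials in which a point fixed at x0 (added to P) has a best buddy in Q."""
    hits = 0
    for _ in range(trials):
        P = np.append(sample_mix(n - 1, MUS_P), x0)   # x0 is P[n - 1]
        Q = sample_mix(n, MUS_Q)
        D = np.abs(P[:, None] - Q[None, :])
        j = D[n - 1].argmin()                  # nearest neighbor of x0 in Q
        hits += (D[:, j].argmin() == n - 1)    # is x0 in turn the NN of q_j in P?
    return hits / trials

x0 = 1.0
print("empirical:   ", empirical_bb(x0, n=500))
print("Lemma 1 limit:", f_Q(x0) / (f_P(x0) + f_Q(x0)))
```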

Lemma 1 assumes the value of the point $p$ is fixed. However, we need to consider that $p$ itself is also sampled from the distribution $f_P$, in which case the probability that a point has a best buddy is:

$\lim_{N\to\infty} \Pr(p \text{ has a best buddy}) = \int_a^b f_P(p)\, \frac{f_Q(p)}{f_P(p) + f_Q(p)}\, dp$,   (18)

where we assume both density functions are defined on the closed interval $[a, b]$.

We are now ready to show that BBS converges to Chi-Square,

Theorem 1.

Suppose both density functions are defined on a closed interval $[a, b]$, and are non-zero and Lipschitz continuous (note that most density functions, like the density function of a Gaussian distribution, are non-zero and Lipschitz continuous on their domain). That is,

  1. $f_P(x) > 0$ and $f_Q(x) > 0$ for all $x \in [a, b]$,

  2. $|f_P(x_1) - f_P(x_2)| \le L\,|x_1 - x_2|$ and $|f_Q(x_1) - f_Q(x_2)| \le L\,|x_1 - x_2|$ for some constant $L$,

then we have

$\lim_{N \to \infty} E[BBS(P,Q)] = \int_a^b \frac{f_P(x)\, f_Q(x)}{f_P(x) + f_Q(x)}\, dx = \frac{1}{2}\Big(1 - \chi^2(f_P, f_Q)\Big)$,   (19)

where $\chi^2(f_P, f_Q) = \frac{1}{2}\int_a^b \frac{\big(f_P(x) - f_Q(x)\big)^2}{f_P(x) + f_Q(x)}\, dx$ is the Chi-Square distance between the two distributions.

To see why this theorem holds, consider the BBS measure between two sets $P$ and $Q$ of the same size $N$. In this case, the BBS measure equals the fraction of points in $P$ that have a best buddy, that is, $BBS(P,Q) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[p_i \text{ has a best buddy in } Q]$. Taking expectations on both sides of the equation, we get:

$\lim_{N\to\infty} E[BBS(P,Q)] = \lim_{N\to\infty} \Pr(p \text{ has a best buddy}) = \int_a^b \frac{f_P(x)\, f_Q(x)}{f_P(x) + f_Q(x)}\, dx$,   (20)

where for the last equality we used Lemma 1 (via Eq. 18). This completes the proof of Theorem 1.

The theorem helps illustrate why BBS is robust to outliers. To see this, consider the signals in Figure 6(a). As can be seen, $f_P$ and $f_Q$ are both Gaussian mixtures. Let us assume that the Gaussian component shared by both mixtures represents the foreground (in both signals), and that the second Gaussian in each mixture represents a different background. Note how $f_Q$ is very close to zero around the mean of the background component of $f_P$, and similarly $f_P$ is very close to zero around the mean of the background component of $f_Q$. This means that the background distributions make very little contribution to the distance, as the numerator $f_P(x)\, f_Q(x)$ of the integrand in Eq. 19 is very close to 0 in both cases.

We note that using BBS has several advantages compared to using $\chi^2$. One such advantage is that BBS does not require binning the data into histograms. It is not trivial to set the bin size, as it depends on the distribution of the features. A second advantage is the ability to use high-dimensional feature spaces. The computational complexity and the amount of data needed for generating histograms quickly explode as the feature dimension grows. In contrast, the nearest-neighbor computation used by BBS easily scales to high-dimensional features, like Deep features.

5 Implementation Details

In this section we provide information on the specific feature spaces used in our experiments. Additionally, we analyze the computational complexity of BBS and propose a caching scheme allowing for more efficient computation.

5.1 Feature Spaces

In order to perform template matching, BBS is computed exhaustively in a sliding-window fashion. A joint spatial-appearance representation is used in order to convert both the template and candidate windows into point sets. For the spatial component, normalized coordinates within the window are used. For the appearance descriptor, we experiment with both color features and Deep features.

Color features:  When using color features, we break the template and candidate windows into distinct, non-overlapping patches. Each such patch is represented by its color channel values and the location of its central pixel, relative to the patch coordinate system. For our toy examples and qualitative experiments the RGB color space is used; however, for our quantitative evaluation HSV was used, as it was found to produce better results. Both spatial and appearance channels were normalized to the range [0, 1]. The point-wise distance measure used with our color features is:

$d(p_i, q_j) = \big\|p_i^{(A)} - q_j^{(A)}\big\|_2^2 + \lambda\, \big\|p_i^{(L)} - q_j^{(L)}\big\|_2^2$,   (21)

where superscript $(A)$ denotes a point's appearance and superscript $(L)$ denotes a point's location. The parameter $\lambda$ was chosen empirically and was fixed in all of our experiments.
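A rough sketch of this conversion follows (the patch size, the use of the mean patch color, and the value of λ are our simplifications; the paper keeps all color values of each patch):

```python
import numpy as np

def window_to_points(win, patch=3, lam=0.25):
    """Convert an image window (H x W x 3, values in [0, 1]) into a point set
    in the joint location-appearance space. Each non-overlapping patch
    contributes one point; scaling the normalized coordinates by sqrt(lam)
    makes plain squared Euclidean distance equal to Eq. 21."""
    H, W, _ = win.shape
    pts = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            app = win[y:y + patch, x:x + patch].mean(axis=(0, 1))  # mean patch color
            loc = np.sqrt(lam) * np.array([x / W, y / H])          # normalized x, y
            pts.append(np.concatenate([loc, app]))
    return np.array(pts)

# Together with the bbs() sketch of Sec. 3, sliding-window matching becomes:
#   score[y, x] = bbs(window_to_points(template), window_to_points(window_at(y, x)))
```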

Deep features:  For our Deep features we use features taken from the VGG-Deep-Net [36]. Specifically, we take features from two layers of the network: one from the first convolutional block (64 feature channels) and one from the third convolutional block (256 feature channels). The feature maps of the earlier layer are down-sampled twice, using max-pooling, to reach the size of the later layer's maps, which are down-sampled by a factor of 4 with respect to the original image. In this case we treat every pixel in the down-sampled feature maps as a point. Each such point is represented by its $(x, y)$ location in the down-sampled window, and its appearance is given by the 320 feature channels. Prior to computing the point-wise distances, each feature channel is independently normalized to have zero mean and unit variance over the window. The point-wise distance in this case is:

(22)

where $\cdot$ denotes the inner product operator between feature vectors. Unlike the color features, we now want to maximize $d(p_i, q_j)$ rather than minimize it (we can always minimize $-d$ instead). The parameter $\lambda$ was chosen empirically and was fixed in all of our experiments.


Figure 7: BBS results on Real Data: (a) The templates are marked in green over the input images. (b) The target images, marked with the detection results of 6 different methods (see text for more details); BBS results are marked in blue. (c)-(e) The resulting likelihood maps using BBS, EMD, and NCC, respectively; each map is marked with its detection result, i.e., its global maximum.

5.2 Complexity

Computing BBS between two point sets $P$ and $Q$, with $|P| = N$ and $|Q| = M$, requires computing the distance between each pair of points, that is, constructing a distance matrix $D$ where $D_{ij} = d(p_i, q_j)$. Given $D$, the nearest neighbor of $p_i$ in $Q$, i.e., $\mathrm{NN}(p_i, Q)$, is the minimal element in the $i$-th row of $D$. Similarly, $\mathrm{NN}(q_j, P)$ is the minimal element in the $j$-th column of $D$. BBS is then computed by counting the number of mutual nearest neighbors (divided by a constant).
In this section we analyze the computational complexity of computing BBS exhaustively for every window in a query image. We then propose a caching scheme, allowing extensive computation reuse which dramatically reduces the computational complexity, trading it off with increased memory complexity.

Naive implementation:  For our analysis we consider a target window of size $w \times h$ and a query image $I$ with $|I|$ pixels, both represented in a feature space with $l$ feature channels. Let us begin by considering each pixel in the target window as a point in the target point set $P$, and similarly every pixel in some query window as a point in the query point set $Q$. In this case $N = M = wh$, and our distance matrices are of size $wh \times wh$. Assuming some arbitrary image padding, we have $|I|$ query windows for which BBS has to be computed. Computing all the distance matrices requires $O(l \cdot |I| \cdot (wh)^2)$. For each such distance matrix we need to find the minimal element in every row and column. The minimum computation for a single row or column is done in $O(wh)$, and for the entire matrix in $O((wh)^2)$. Therefore, the complexity of computing BBS naively for all query windows of image $I$ is

$O\big(l \cdot |I| \cdot (wh)^2\big)$.   (23)

This is a high computational load compared to simpler methods such as the sum-of-squared-differences (SSD), which requires only $O(l \cdot |I| \cdot wh)$.

Distance computation reuse:  When carefully examining the naive scheme above we notice that many pairwise distance computations are performed multiple times. This observation is key to our proposed caching scheme.

Assuming our sliding window proceeds column by column, the first distance matrix in the image has to be fully computed. The second distance matrix, after sliding the query window down by one pixel, shares many distance computations with the previously computed matrix. Specifically, we only have to compute the distances between the pixels in the new row added to the query window and the target window; this means we have to recompute only $w$ columns of $D$ and not the entire matrix. Taking this one step further, if we cache all the distance matrices computed along the first image column, then, starting from the second matrix in the second image column, we only have to compute the distances between one new candidate pixel and the target window, i.e., recompute a single column of $D$, which requires only $O(l \cdot wh)$. The majority of distance matrices can thus be computed in $O(l \cdot wh)$ instead of $O(l \cdot (wh)^2)$, which means that computing BBS for the entire image now requires

$O\big(|I| \cdot (l \cdot wh + (wh)^2)\big)$,   (24)

where the second term accounts for the minimum computations, which are reduced next.
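A minimal sketch of this column reuse (the indexing convention is our choice):

```python
import numpy as np

def full_distance_matrix(T, W):
    """O(l * |T|^2): all squared distances between target points T (N x l)
    and query-window points W (N x l). Needed only for the first window."""
    return ((T[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)

def slide_one_pixel(D, T, new_point, out_col):
    """O(l * |T|): after the query window slides by one pixel, only the
    column belonging to the pixel that left the window changes; overwrite
    it with the distances to the pixel that entered."""
    D[:, out_col] = ((T - new_point[None, :]) ** 2).sum(axis=1)
    return D
```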

Minimum operator load reduction:  So far we have shown how caching can be used to reduce the load of building the distance matrices. We now show how additional caching can reduce the computational load of the minimum operator applied to each row and column of $D$ in order to find the BBPs.

As discussed earlier, for the majority of query windows we only have to recompute one column of $D$. This means that for all other columns we have already computed the minimum, so we obtain the minima over all columns in just $O(wh)$. For the minimum computation along the rows there are two cases to consider. The first is that the minimal value of a certain row was in the column that was pushed out of $D$; in this case we have to find the minimum value for that row anew, which requires $O(wh)$. The second option is that the minimal value of the row was not pushed out, and we know where it is from previous computations; in this case we only have to compare the new element added to the row (by the new column introduced into $D$) with the previous minimum value, which requires only $O(1)$. Assuming the position of the minimal value along a row is uniformly distributed, on average there will be only one row where the minimum value needs to be recomputed. To see this, consider a set of random variables $X_1, \ldots, X_{wh}$ such that $X_i = 1$ if and only if the minimal value in the $i$-th row of $D$ was pushed out of the matrix when a new column was introduced, and $X_i = 0$ otherwise. Assuming a uniform distribution, $\Pr(X_i = 1) = \frac{1}{wh}$. The number of rows for which the minimum has to be recomputed is given by $\sum_i X_i$, and the expected number of such rows is

$E\Big[\sum_{i=1}^{wh} X_i\Big] = \sum_{i=1}^{wh} \Pr(X_i = 1) = wh \cdot \frac{1}{wh} = 1$.   (25)

This means that, on average, there will be only one row for which the minimum has to be computed in $O(wh)$ time (for the other rows only $O(1)$ is required). Therefore, on average, we are able to find the minima of all rows and columns of $D$ in $O(wh)$ instead of $O((wh)^2)$. By combining the efficient minimum computation scheme with the reuse of distance computations for building $D$, we reduce the overall BBS complexity over the entire image to

$O(l \cdot |I| \cdot wh)$.   (26)

Additional load reduction:  When using color features, we note that the actual complexity of computing BBS for the entire image is even lower due to the use of non-overlapping patches instead of individual pixels. Both the image and the target window are then sampled on a grid with spacing $k$, which in turn leads to an overall complexity of

$O\left(\frac{l \cdot |I| \cdot wh}{k^4}\right)$.   (27)

We note that the reuse schemes presented above cannot be used with our Deep features due to the fact that we normalize the features differently, with respect to each query window. Also the above analysis does not consider the complexity of extracting the Deep features themselves.

6 Results

We perform qualitative as well as extensive quantitative evaluation of our method on real-world data. We compare BBS with several measures commonly used for template matching: 1) Sum-of-Squared-Differences (SSD), 2) Sum-of-Absolute-Differences (SAD), 3) Normalized Cross-Correlation (NCC), 4) color Histogram Matching (HM) using the $\chi^2$ distance, 5) Bidirectional Similarity [20] (BDS), computed in the same appearance-location space as BBS.

(a) Color feature, best mode only. (b) Deep feature, best mode only.
(c) Color feature, top 3 modes. (d) Deep feature, top 3 modes.
Figure 8: Template matching accuracy: Evaluation of method performance using 270 template-image pairs. BBS outperforms competing methods, as can be seen in ROC curves showing the fraction of examples with overlap greater than threshold values in [0, 1]. Top: only the best mode is considered. Bottom: the best of the top 3 modes is taken. Left: Color features. Right: Deep features. Mean average precision (mAP) values, taken as the area under the curve, are shown in the legend. Best viewed in color.
(a) (b) (c) (d)
Figure 9: Example results using color features. Top, input images with the annotated template marked in green. Middle, target images and detected bounding boxes (see legend); ground truth (GT) marked in green (our results in blue). Bottom, BBS likelihood maps. BBS successfully matches the template in all these examples.
(a) (b) (c) (d)
Figure 10: Example results using Deep features. Top, input images with the annotated template marked in green. Middle, target images and detected bounding boxes (see legend); ground truth (GT) marked in green (our results in blue). Bottom, BBS likelihood maps. BBS successfully matches the template in all these examples.

6.1 Qualitative Evaluation

Four template-image pairs taken from the Web are used for qualitative evaluation. The templates, which were manually chosen, and the target images are shown in Figure  1(a)-(b), and in Figure  7. In all examples, the template drastically changes its appearance due to large geometric deformation, partial occlusions, and change of background.

Detection results using color features are presented in Figure 1(a)-(b) and in Figure 7(b), and compared to the above-mentioned methods as well as to the Earth Mover's Distance [19] (EMD). BBS is the only method successfully matching the template in all these challenging examples. The confidence maps of BBS, presented in Figure 7(c), show distinct and well-localized modes compared to the other methods. (Our data and code are publicly available at: http://people.csail.mit.edu/talidekel/Best-BuddiesSimilarity.html.) The BBPs for the first example are shown in Figure 1(c). As discussed in Sec. 3, BBS captures the bidirectional inliers, which are mostly found on the object of interest. Note that the BBPs, as discussed, are not necessarily true physical corresponding points.

(a) Color features (b) Deep features
Figure 11: Effect of the space-time baseline: Method performance evaluated on datasets with different space-time baselines $d_T$. Left: Color features. Right: Deep features. BBS outperforms competing methods for both feature choices and for all $d_T$ values. Best viewed in color.

6.2 Quantitative Evaluation

We now turn to the quantitative evaluation. The data for our experiments was generated from a dataset of 100 annotated video sequences (https://sites.google.com/site/benchmarkpami/) [37], both color and gray-scale. These videos capture a wide range of challenging scenes in which the objects of interest are diverse and typically undergo nonrigid deformations, photometric changes, motion blur, in/out-of-plane rotation, and occlusions.

Three template matching datasets were randomly sampled from the annotated videos. Each dataset is comprised of template-image pairs, where each such pair consists of frames $f$ and $f + d_T$, with the frame $f$ chosen randomly. For each dataset, a different temporal baseline $d_T$ was used. The ground-truth annotated bounding box in frame $f$ is used as the template, while frame $f + d_T$ is used as the query image. This random choice of frames creates a challenging benchmark with a wide baseline in both time and space (see examples in Figure 9 and Figure 10).

BBS, using both color and Deep features, was compared with the five similarity measures mentioned above. The ground-truth annotations were used for quantitative evaluation. Specifically, we measure the accuracy of both the top match as well as the top $k$ ranked matches, as follows.

Accuracy:  was measured using the common bounding-box overlap measure $\mathrm{Acc} = \frac{|B_e \cap B_g|}{|B_e \cup B_g|}$, where $B_e$ and $B_g$ are the estimated and ground-truth bounding boxes, respectively. The ROC curves show the fraction of examples with overlap larger than a threshold in [0, 1]. Mean average precision (mAP) is taken as the area under the curve (AUC). The success rates of all methods were evaluated considering only the global maximum (best mode) prediction, as well as considering the best out of the top 3 modes (using non-maximum suppression, NMS).
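For reference, a small sketch of this overlap measure and the resulting success curve (the [x, y, w, h] box format is our convention):

```python
import numpy as np

def overlap(a, b):
    """Intersection-over-union of boxes given as [x, y, w, h]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def success_curve(overlaps, thresholds=np.linspace(0, 1, 101)):
    """Fraction of examples whose overlap exceeds each threshold; the
    area under this curve is the mAP value reported in Figure 8."""
    overlaps = np.asarray(overlaps)
    return np.array([(overlaps > t).mean() for t in thresholds])
```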

Results for both color and Deep features, for a fixed temporal baseline $d_T$, are shown in Figure 8. Overall, it can be seen that BBS outperforms the competing methods using both color and Deep features. Using color features and considering only the top mode, Figure 8(a), BBS outperforms the competing methods, with the smallest margin obtained relative to BDS and the largest relative to SSD. When considering the top 3 modes, Figure 8(c), the performance of all methods improves. However, we can clearly see the dominance of BBS, which increases its margin over the competing methods, reaching an mAP of 0.648 (compared to 0.589 with only the top mode); for example, the margin between BBS and BDS, the runner-up, increases further. The increase in performance when considering the top 3 modes suggests that there are cases where BBS produces a mode at the correct target position, but this mode is not the global maximum of the entire map.

Some successful template matching examples, along with the likelihood maps produced by BBS, using the color features, are shown in Figure  9. Notice how BBS can overcome non-rigid deformations of the target.

Typical failure cases are presented in Figure 12. Most of the failure cases using the color features can be attributed to either illumination variations (c), distracting objects with a similar appearance to the target (a)-(b), or cases where BBS matches the background or an occluding object rather than the target (d). This usually happens when the target is heavily occluded or when the background region in the target window is very large.

Results using our Deep features and considering only the top mode are shown in Figure 8(b). We note that HM was not evaluated in this case due to the high dimensionality of the feature space. We observe that BBS outperforms the second-best method by only a small margin in this setting. Considering the top 3 modes allows BBS to reach an mAP of 0.684, increasing its margin relative to the competing methods, for example relative to the second-best method (SSD).

Some template matching examples, along with their associated likelihood maps, using the Deep features, are shown in Figure 10. The Deep features are not sensitive to illumination variations and can capture both low-level information as well as higher-level object semantics. As can be seen, the combination of Deep features and BBS delivers superior results due to its ability to explain nonrigid deformations. Note how, when using the Deep features, we can correctly match the bike rider in Figure 10(c), for which the color features failed (Figure 12(d)). BBS with Deep features produces well-localized and compact modes compared with the color features.

Some typical failure cases when using the Deep features are presented in Figure 13. As for the color features, many failure cases are due to distracting objects with a similar appearance (a)-(b), or cases where BBS matches the background or an occluding object (d).

It is interesting to see that BDS, which was the runner-up when color features were used, comes in last when using Deep features, switching places with SSD, which was previously worst and is now second in line. This also demonstrates the robustness of BBS, which is able to successfully use different features. Additionally, we see that overall BBS with Deep features outperforms BBS with color features (when the top 3 modes are considered). However, this performance gain requires a significant increase in computational load, both because the features have to be extracted and because the proposed efficient computation scheme cannot be used in this case. It is interesting to see that BBS with color features is able to perform as well as SSD with Deep features.

Finally, we note that, when using the color features, BBS outperforms HM, which uses the $\chi^2$ distance. Although BBS converges to $\chi^2$ for large sets, there are clear benefits to using BBS over $\chi^2$. Computing BBS does not require modeling the distributions (i.e., building normalized histograms) and can be performed on the raw data itself. This alleviates the need to choose the histogram bin size, which is known to be a delicate issue. Moreover, BBS can be performed on high-dimensional data, such as our Deep features, for which modeling the underlying distribution is not practical.

The space-time baseline:  The effect of the space-time baseline on performance was examined using datasets with different $d_T$ values. Figure 11 shows the mAP of the competing methods for the different values of $d_T$. Results using color features are shown on the left, and using Deep features on the right. All results were analyzed taking the best of the top 3 modes. It can be seen that BBS outperforms the competing methods for all $d_T$ values, with the only exception being Deep features at the largest $d_T$, in which case BBS and SSD produce similar results, reaching an mAP of 0.6.

(a) (b) (c) (d)
Figure 12: Example of failure cases using color features. Top, input images with annotated template marked in green. Bottom, target images and detected bounding boxes (see legend); ground-truth (GT) marked in green (our results in blue). As can be seen, some common failure causes are illumination changes, similar distracting targets or locking onto the background.
(a) (b) (c) (d)
Figure 13: Example of failure cases using Deep features. Top, input images with annotated template marked in green. Bottom, target images and detected bounding boxes (see legend); ground-truth (GT) marked in green (our results in blue). Some common failure causes are similar distracting targets or locking onto the background.

7 Conclusions

We have presented a novel similarity measure between sets of objects called the Best-Buddies Similarity (BBS). BBS leverages statistical properties of mutual nearest neighbors and was shown to be useful for template matching in the wild. Key features of BBS were identified and analyzed, demonstrating its ability to overcome several challenges that are common in real-life template matching scenarios. It was also shown that, for sufficiently large point sets, BBS converges to the Chi-Square distance. This result provides interesting insights into the statistical properties of mutual nearest neighbors, and the advantages of using BBS over $\chi^2$ were discussed.

Extensive qualitative and quantitative experiments on challenging data were performed and a caching scheme allowing for an efficient computation of BBS was proposed. BBS was shown to outperform commonly used template matching methods such as normalized cross correlation, histogram matching and bi-directional similarity. Different types of features can be used with BBS, as was demonstrated in our experiments, where superior performance was obtained using both color features as well as Deep features.

Our method may fail when the template is very small compared to the target image, when similar targets are present in the scene, or when the outliers (occluding objects or background clutter) cover most of the template. In some of these cases it was shown that BBS can predict the correct position (produce a mode) but does not necessarily give it the highest score.

Finally, we note that since BBS is generally defined between sets of objects it might have additional applications in computer-vision or other fields that could benefit from its properties. A natural future direction of research is to explore the use of BBS as an image similarity measure, for object localization or even for document matching.

Appendix A Proof of Lemma 1

Because of the independent sampling, all points in $Q$ have equal probability of being the best buddy of $p$. From this we have:

$BB(p) = N \cdot \Pr\big(p, q \text{ are best buddies}\big)$,   (28)

where $q$ is a point from $Q$, and the subscript of $q$ is dropped for ease of description.

The probability that two points, $p$ and $q$, are best buddies is given by:

$\Pr\big(p, q \text{ are best buddies}\big) = \big[F_P(q^-) + 1 - F_P(q^+)\big]^{N-1} \cdot \big[F_Q(p^-) + 1 - F_Q(p^+)\big]^{N-1}$,   (29)

where $F_P$ and $F_Q$ denote the CDFs of the two distributions, that is, $F_P(x) = \Pr(p \le x)$; here $p^- = p - d(p, q)$, $p^+ = p + d(p, q)$, and $q^-$, $q^+$ are similarly defined. Combining Eq. 28 and Eq. 29, the probability that $p$ has a best buddy equals

$BB(p) = N \int_{-\infty}^{\infty} f_Q(q)\, \big[F_P(q^-) + 1 - F_P(q^+)\big]^{N-1} \big[F_Q(p^-) + 1 - F_Q(p^+)\big]^{N-1}\, dq$.   (30)

We denote the signed distance between the two points by $r = q - p$. Intuitively, because the density functions are non-zero everywhere in their domain, when $N$ goes to infinity the probability that two points are a BBP decreases rapidly as $|r|$ increases. Therefore, we only need to consider the case when the distance between $p$ and $q$ is very small. Formally, for any positive $\epsilon$, changing the integration limits in Eq. 30 from $(-\infty, \infty)$ to $[p - \epsilon, p + \epsilon]$ does not change the result (see Claim 2 in the supplementary material).

Then let us break down $F_P$ and $F_Q$ in Eq. 30. Given that the density functions $f_P$ and $f_Q$ are Lipschitz continuous (Condition 2 in Theorem 1), we can assume that they take constant values on the small intervals $[p - \epsilon, p + \epsilon]$ and $[q - \epsilon, q + \epsilon]$. That is,

$f_P(x) \approx f_P(p), \qquad f_Q(x) \approx f_Q(q)$,   (31)

for $x$ in these intervals.

And thus, the expression $F_P(q^+) - F_P(q^-)$ can be approximated as follows:

$F_P(q^+) - F_P(q^-) = \int_{q^-}^{q^+} f_P(x)\, dx \approx 2\,|r|\, f_P(p)$.   (32)

Similarly, $F_Q(p^+) - F_Q(p^-) \approx 2\,|r|\, f_Q(p)$. Note that this approximation can also be obtained using a Taylor expansion of $F_P$ and $F_Q$. At last, since $p$ and $q$ are very close to each other, we assume:

$f_Q(q) \approx f_Q(p)$.   (33)

Plugging all these approximations (Eq. 32 and Eq. 33) into Eq. 30 and replacing $q$ by $p + r$, we get:

$BB(p) \approx N f_Q(p) \int_{-\epsilon}^{\epsilon} \big[(1 - 2|r| f_P(p))\,(1 - 2|r| f_Q(p))\big]^{N-1}\, dr$   (34)
$= N f_Q(p) \int_{-\epsilon}^{\epsilon} \big[1 - 2|r|\,(f_P(p) + f_Q(p)) + 4 r^2 f_P(p) f_Q(p)\big]^{N-1}\, dr$   (35)
$\approx N f_Q(p) \int_{-\epsilon}^{\epsilon} \big[1 - 2|r|\,(f_P(p) + f_Q(p))\big]^{N-1}\, dr$.   (36)

It is worth mentioning that the approximate equalities in Eq. 32 and Eq. 33 become strict equalities when $N$ goes to infinity (for the proof see Claim 3 in the supplementary material). Also, since the distance between the two points is very small, the second-order term in Eq. 35 is negligible and is dropped in Eq. 36 (for full justification see Claim 4 in the supplementary material).

At last, $\lim_{N\to\infty} N \int_{-\epsilon}^{\epsilon} \big[1 - 2|r|\,\beta\big]^{N-1} dr = \frac{1}{\beta}$ for any $\beta > 0$ (see Claim 1 in the supplementary material). Thus, Eq. 36 equals:

$\lim_{N \to \infty} BB(p) = \frac{f_Q(p)}{f_P(p) + f_Q(p)}$,   (37)

which completes the proof of Lemma 1.

Acknowledgments.

This work was supported in part by an Israel Science Foundation grant 1556/10, National Science Foundation Robust Intelligence 1212849 Reconstructive Recognition, and a grant from Shell Research.

References

  • [1] T. Dekel, S. Oron, S. Avidan, M. Rubinstein, and W. Freeman, “Best buddies similarity for robust template matching,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on.    IEEE, 2015.
  • [2] W. Ouyang, F. Tombari, S. Mattoccia, L. Di Stefano, and W.-K. Cham, “Performance evaluation of full search equivalent pattern matching algorithms,” PAMI, 2012.
  • [3] Y. Hel-Or, H. Hel-Or, and E. David, “Matching by tone mapping: Photometric invariant template matching,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 2, pp. 317–330, 2014. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/TPAMI.2013.138
  • [4] E. Elboher and M. Werman, “Asymmetric correlation: a noise robust similarity measure for template matching,” Image Processing, IEEE Transactions on, 2013.
  • [5] J.-H. Chen, C.-S. Chen, and Y.-S. Chen, “Fast algorithm for robust template matching with m-estimators,” Signal Processing, IEEE Transactions on, 2003.
  • [6] A. Sibiryakov, “Fast and high-performance template matching method,” in CVPR, 2011.
  • [7] B. G. Shin, S.-Y. Park, and J. J. Lee, “Fast and robust template matching algorithm in noisy image,” in Control, Automation and Systems, 2007. ICCAS’07. International Conference on, 2007.
  • [8] O. Pele and M. Werman, “Robust real-time pattern matching using bayesian sequential hypothesis testing,” PAMI, 2008.
  • [9] D.-M. Tsai and C.-H. Chiang, “Rotation-invariant pattern matching using wavelet decomposition,” Pattern Recognition Letters, 2002.
  • [10] H. Y. Kim and S. A. De Araújo, “Grayscale template-matching invariant to rotation, scale, translation, brightness and contrast,” in AIVT.    Springer, 2007.
  • [11] S. Korman, D. Reichman, G. Tsur, and S. Avidan, “Fast-match: Fast affine template matching,” in CVPR, 2013.
  • [12] Y. Tian and S. G. Narasimhan, “Globally optimal estimation of nonrigid image distortion,” IJCV, 2012.
  • [13] D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non-rigid objects using mean shift,” in CVPR, 2000.
  • [14] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, “Color-based probabilistic tracking,” in ECCV 2002, 2002.
  • [15] C. Bao, Y. Wu, H. Ling, and H. Ji, “Real time robust l1 tracker using accelerated proximal gradient approach,” CVPR, 2012.
  • [16] X. Jia, H. Lu, and M. Yang, “Visual tracking via adaptive structural local sparse appearance model,” CVPR, 2012.
  • [17] C. F. Olson, “Maximum-likelihood image matching,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 6, pp. 853–857, 2002. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/TPAMI.2002.1008392
  • [18] S. Oron, A. Bar-Hillel, D. Levi, and S. Avidan, “Locally orderless tracking,” IJCV, 2014.
  • [19] Y. Rubner, C. Tomasi, and L. Guibas, “The earth mover’s distance as a metric for image retrieval,” IJCV, 2000.
  • [20] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani, “Summarizing visual data using bidirectional similarity,” in CVPR, 2008.
  • [21] D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge, “Comparing images using the hausdorff distance,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 15, no. 9, pp. 850–863, 1993.
  • [22] M.-P. Dubuisson and A. Jain, “A modified hausdorff distance for object matching,” in Pattern Recognition, 1994. Vol. 1 - Conference A: Computer Vision & Image Processing., Proceedings of the 12th IAPR International Conference on, vol. 1, Oct 1994, pp. 566–568.
  • [23] G. W. Snedecor, W. G. Cochran et al., Statistical Methods, 6th ed., 1967.
  • [24] M. Varma and A. Zisserman, “A statistical approach to material classification using image patch exemplars,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11, pp. 2032–2047, 2009.
  • [25] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 4, pp. 509–522, 2002.
  • [26] P.-E. Forssén and D. G. Lowe, “Shape descriptors for maximally stable extremal regions,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on.    IEEE, 2007, pp. 1–8.
  • [27] D. R. Martin, C. C. Fowlkes, and J. Malik, “Learning to detect natural image boundaries using local brightness, color, and texture cues,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 26, no. 5, pp. 530–549, 2004.
  • [28] D. Pomeranz, M. Shemesh, and O. Ben-Shahar, “A fully automated greedy square jigsaw puzzle solver,” in The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011, 2011, pp. 9–16. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2011.5995331
  • [29] T.-t. Li, B. Jiang, Z.-z. Tu, B. Luo, and J. Tang, Intelligent Computation in Big Data Era.    Berlin, Heidelberg: Springer Berlin Heidelberg, 2015, ch. Image Matching Using Mutual k-Nearest Neighbor Graph, pp. 276–283. [Online]. Available: http://dx.doi.org/10.1007/978-3-662-46248-5_34
  • [30] H. Liu, S. Zhang, J. Zhao, X. Zhao, and Y. Mo, “A new classification algorithm using mutual nearest neighbors,” in 2010 Ninth International Conference on Grid and Cloud Computing, Nov 2010, pp. 52–57.
  • [31] K. Ozaki, M. Shimbo, M. Komachi, and Y. Matsumoto, “Using the mutual k-nearest neighbor graphs for semi-supervised classification of natural language data,” in Proceedings of the Fifteenth Conference on Computational Natural Language Learning, ser. CoNLL ’11.    Stroudsburg, PA, USA: Association for Computational Linguistics, 2011, pp. 154–162. [Online]. Available: http://dl.acm.org/citation.cfm?id=2018936.2018954
  • [32] Z. Hu and R. Bhatnagar, “Clustering algorithm based on mutual k-nearest neighbor relationships,” Statistical Analy Data Mining, vol. 5, no. 2, pp. 110–113, 2012.
  • [33] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang, “Hierarchical convolutional features for visual tracking,” in Proceedings of the IEEE International Conference on Computer Vision), 2015.
  • [34] L. Wang, W. Ouyang, X. Wang, and H. Lu, “Visual tracking with fully convolutional networks,” in The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [35] T. Ahonen, A. Hadid, and M. Pietikainen, “Face description with local binary patterns: Application to face recognition,” in PAMI, vol. 28, no. 12.    IEEE, 2006, pp. 2037–2041.
  • [36] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
  • [37] Y. Wu, J. Lim, and M. Yang, “Online object tracking: A benchmark,” in CVPR, 2013.