CoMIC: Good features for detection and matching at object boundaries

12/05/2014 ∙ by Swarna Kamlam Ravindran, et al. ∙ Indian Institute of Technology, Madras

Feature or interest point detectors typically aggregate information in 2D patches, which does not remain stable at object boundaries when the object moves against a significantly varying background. Level or iso-intensity curves are much more stable under such conditions, especially the longer ones. In this paper, we identify stable portions on long iso-curves and detect corners on them. Further, the iso-curve associated with a corner is used to discard portions from the background and improve matching. Such CoMIC (Corners on Maximally-stable Iso-intensity Curves) points yield superior results in the object boundary regions compared to state-of-the-art detectors, while performing comparably in the interior regions as well. This is illustrated in exhaustive matching experiments for both boundary and non-boundary regions in applications such as stereo and point tracking for structure from motion in video sequences.


1 Introduction

Features are the basic building blocks of several tasks in Computer Vision, such as Visual Odometry[31], Structure from Motion (SfM)[37] and Simultaneous Localisation and Mapping (SLAM)[21]. Basic corner detectors or point features popularly used in these applications include Harris[20], Shi and Tomasi[40] and Hessian[9], which aggregate image gradients in a patch to find corners. Many fast variants of point detectors, such as SUSAN[41], AGAST[25], FAST[35] and FAST-ER[36], have emerged, which perform fast approximations of the gradient computation; the latter three use machine learning to train a classifier on a corner model. The performance of these fast detectors is quite similar to that of the best-performing point detectors, Harris and Hessian[45, 25]. Scale and affine invariant extensions of these detectors[24, 27, 28, 3, 4] also find use in applications such as Object Recognition and Mosaicing.

Figure 1: Regions shown in boxes correspond to corners in one image that are missed in another or missed in both due to the change in gradients associated with a changing background.

The detection and matching of features across frames of a video is an important first step in several applications. However, features on object boundaries are typically not utilized in any further step since the detections and matches are poor in these regions, especially when the object moves against a significantly varying background as shown in Fig.1. This can be attributed to two reasons. First, the point detectors rely on gradient aggregation in image patches which may span across multiple objects in the scene, leading to errors in the boundary region when the object or the camera moves. This effect is more pronounced when the background changes and is compounded across frames, causing a significant drift in the tracks after a number of frames. Second, there is a further error introduced in the local template matching stage, where the correlation values due to varying non-object portions in the patch introduce errors in the matching.

We address the above problems by proposing a “corner” detector on iso-intensity curves. Iso-curves are the boundaries of connected components in an image thresholded at a particular intensity level. We note that the boundaries of objects are typically traced by iso-curves which often move along with the object (Fig.2) and can thus be used to detect an object or its parts accurately even in a changing background.
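As a concrete illustration, the following is a minimal sketch (not the authors' code) of extracting iso-curves with OpenCV (version 4 or later) and NumPy: threshold the image at an intensity level and take the boundaries of the resulting connected components. The file name and the length cutoff are illustrative assumptions:

import cv2
import numpy as np

def iso_curves(gray: np.ndarray, level: int):
    """Iso-intensity curves of `gray` at `level`, each as an (N, 2) point array."""
    # Pixels above the level form the connected components...
    _, mask = cv2.threshold(gray, level, 255, cv2.THRESH_BINARY)
    # ...and the boundaries of those components are the iso-curves.
    contours, _ = cv2.findContours(mask, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    return [c.reshape(-1, 2) for c in contours]

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # illustrative input
curves = iso_curves(gray, level=128)
long_curves = [c for c in curves if len(c) > 100]      # long curves trace object boundaries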

Our approach to improve the matching accuracy is two-fold. First, we find points on iso-curves that are more stable and robust to changes in the background, as compared to points found using patch-aggregation techniques. Second, we block out irrelevant background portions of the patch in the template matching stage using the iso-curve which acts as an effective curve of separation between the object and the background.

Features have been detected using iso-curves before, the most popular being the Maximally Stable Extremal Regions (MSER) detector[26]. MSERs are stable iso-curves that have high Repeatability and Matching scores in image matching experiments[29], but they return very few detections. These may not be sufficient in SfM or point tracking, where the overall displacement and geometry are drawn from a consensus on corresponding feature points. The detections are fewer because MSER considers only small, closed iso-curves, since features by definition must be local in order to deal with factors such as occlusion. This causes MSER to miss the information along long iso-curves completely (Fig. 2). Other approaches have detected corners on iso-curves[12, 33] or edges[45]. However, these approaches are again dependent on gradients, or use very few points to compute the corner, which makes them quite noisy.

Figure 2: A long iso-curve that forms the boundary of an object. The information present along such an iso-curve is discarded by MSER.

For matching, the most popular approach is to use gradient distributions[24] built on the entire patch around the point, which again has problems at object boundaries with changing backgrounds. Shape based descriptors for MSER were proposed by Forssén and Lowe[15]. While a purely shape based description may be too generic, our approach effectively combines the use of a curve with information from the patch, which leads to more distinctiveness of the patch[34].

In this paper, we use the information along long iso-curves and detect corners on portions of them. We use a measure based on area-change (similar to MSER) for determining the local stability of an iso-curve. Furthermore, we improve the matching using the iso-curve. We demonstrate through extensive visual and quantitative results that such an approach yields corner points that perform well on the boundary regions and are therefore useful in 3D Tracking, SfM and 3D reconstruction applications.

The rest of the paper is organized as follows. Sec. 2 defines our features. Sec. 3 contains the algorithm and some implementation details for detecting such features efficiently. Sec. 4 describes our matching strategy. Finally, Sec. 5 presents experimental results compared to state-of-the-art detectors on a variety of datasets.

2 Corner Definition

We define our feature point such that it satisfies two properties. First, it must be found on an iso-curve segment (ICS), i.e., a portion of an iso-curve, that remains largely unchanged with respect to intensity perturbations. Such an ICS is called locally stable in our work. Second, it must be a corner along the iso-curve according to a measure that evaluates the distribution of the points of the ICS in orthogonal directions. We first consider the idea of the local stability of an ICS.

2.1 Local Stability of an Iso-Curve Segment

We denote an ICS centered at a candidate corner point $p$ on an iso-curve at intensity $l$, with $k$ points on either side of $p$, as $\gamma_l(p, k)$. Equivalently, we may denote the ICS as $\gamma_l(p, s)$, where $k = c \cdot s$, $s$ being the scale at which the ICS is detected and $c$ being a constant. An approximation used in the implementation is described in Sec. 3.

In order to define the stability of an ICS, we locate corresponding portions on the nearby iso-curves at intensities $l + \Delta$ and $l - \Delta$, which are denoted by $\gamma^{up}$ and $\gamma^{down}$ respectively. A corresponding portion is identified on a $\gamma^{up}$ or $\gamma^{down}$ by finding the points on it that are closest to the endpoints of the ICS. A few examples are shown diagrammatically in Fig. 3. Since iso-curves do not intersect, the $\gamma^{up}$ and $\gamma^{down}$ of an ICS are unique for a particular $\Delta$.

The stability of $\gamma_l(p, k)$ can be calculated in terms of a distance measure between these two open curves, $\gamma^{up}$ and $\gamma^{down}$. Such a distance measure may be defined by finding corresponding points on the two curves and measuring a quantity between some or all of them[33]. These measures can be noisy and are typically not symmetric.


Figure 3: Two ICSs from the image blocks in (a) are shown in red in (b), with the corresponding portions on their Up and Down iso-curves at $l \pm \Delta$ in white and their mid-points in green. The shaded area enclosed between the Up and Down curves is obtained by connecting their corresponding end-points, shown with red dotted lines. While the top row of (b) shows an unstable ICS, the bottom row shows a stable ICS whose Up and Down curves are close together.

In this work, we use the area between the two curves as the measure of stability, as first proposed in MSER. While MSER computes this area between two closed curves, we approximate the area between the two open curve segments $\gamma^{up}$ and $\gamma^{down}$ by connecting their end points (Fig. 3(b)). Such a measure is simple and robust, as also seen from the stable performance of MSER.

Given such a variation measure $A(\gamma)$, we define the stability $S$ of an ICS $\gamma$ as the inverse of its variation $A(\gamma)$ divided by its length $L(\gamma)$:

$S(\gamma) = \left( \frac{A(\gamma)}{L(\gamma)} \right)^{-1}$     (1)

Essentially, $A(\gamma)/L(\gamma)$ measures the average motion of a point on the ICS when the intensity is varied. Thus, lower values of $A/L$ (or higher values of $S$) specify ICSs that are relatively stable with respect to intensity variations and can thus be found reliably in another image of the same scene.
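A minimal sketch of Eq. 1 in NumPy, assuming the ICS and its Up and Down segments are given as (N, 2) point arrays (our representation, not the authors' code); the area is approximated by closing the two open curves at their end points and applying the shoelace formula:

import numpy as np

def curve_length(pts: np.ndarray) -> float:
    """Arc length of an open polyline given as an (N, 2) point array."""
    return float(np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1)))

def area_between(up: np.ndarray, down: np.ndarray) -> float:
    """Area enclosed between two open curves, closed by joining their
    end points (the shaded region of Fig. 3(b)), via the shoelace formula."""
    poly = np.vstack([up, down[::-1]])   # traverse up, then down reversed, closing the loop
    x, y = poly[:, 0], poly[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def stability(ics: np.ndarray, up: np.ndarray, down: np.ndarray) -> float:
    """Eq. 1: inverse of the average motion A/L of a point on the ICS."""
    return curve_length(ics) / max(area_between(up, down), 1e-9)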

We make a further modification to the above measure to make it more robust. Portions of the curve near the candidate corner are more important than portions away from it. A large distance between the two iso-curve portions near a corner must not be averaged down to a low value by smaller distances between portions further away from it. Fig. 3(b) shows such a corner that is unstable, even though it has relatively stable end portions. At the same time, a relatively less stable portion far away from the corner should not bring its stability down.

We improve the measure of corner stability by giving Gaussian weights to the points in the image that are used to compute $A(\gamma)$, as well as to the points used to compute $L(\gamma)$. In order to do so consistently, we first assign weights to the points on the iso-curve based on their distance along the curve from $p$, using a 1D Gaussian. While $L(\gamma)$ is calculated from such a weighted curve, the $A(\gamma)$ computation is done by assigning weights to all the points in the 2D image: each point in the image is given the weight of the ICS point closest to it. This can be done very fast using the distance transform. Such an approach still measures the average motion of a point on the ICS, but now does so with a Gaussian weight assigned to the points.
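The weight propagation can be sketched with SciPy's Euclidean distance transform, which also returns the index of the nearest curve pixel for every image pixel. The function name and the (row, col) point convention below are our assumptions:

import numpy as np
from scipy.ndimage import distance_transform_edt

def propagate_curve_weights(shape, ics, sigma):
    """Give every pixel the Gaussian weight of the ICS point nearest to it.
    `ics` is an (N, 2) integer array of (row, col) curve points."""
    n = len(ics)
    arc = np.arange(n) - n // 2                    # signed arc length from the centre point p
    w_curve = np.exp(-0.5 * (arc / sigma) ** 2)    # 1D Gaussian along the curve

    off_curve = np.ones(shape, dtype=bool)
    off_curve[ics[:, 0], ics[:, 1]] = False        # zeros mark the curve pixels
    # Per-pixel indices of the nearest curve pixel, via the distance transform.
    _, (ir, ic) = distance_transform_edt(off_curve, return_indices=True)

    weights = np.zeros(shape)
    weights[ics[:, 0], ics[:, 1]] = w_curve        # weights on the curve itself...
    return weights[ir, ic]                         # ...propagated to every pixel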

Using this stability measure for ICSs, a non-maximal suppression is done to accurately localize them: an ICS is retained only if its stability is higher than that of its respective $\gamma^{up}$ and $\gamma^{down}$, i.e., the iso-curve segments immediately above and below it. We refer to each such segment as a maximally stable ICS.

The locally maximally stable iso-curve segments thus obtained should be as different from a straight line as possible, since points on straight lines cannot be localized accurately in another image. The following approach is used to obtain maximally stable ICSs of the appropriate shape.

2.2 Corners on Iso-Curves

The second condition we enforce on the feature point is that it must not lie on an ICS that is nearly straight. To detect distinct and well-localized points, we find corners on the ICS. A popular concept used to measure the change in direction at a particular point of a curve is the curvature[33, 12]. However, curvature based methods can be quite sensitive to noise.

Figure 4: The distribution of the points on an ICS w.r.t. their mean point (in red), which is used to determine a corner.

In this work, we detect a corner by measuring the distribution of the points of an ICS centered at a given point (Fig. 4). A similar technique is used by Tsai et al.[44] to find corners on a curve, and was shown to return fewer spurious corners than the curvature approach. We compute the covariance matrix $C$ of the ICS points $p_i$ using:

$C = \frac{1}{\sum_i w_i} \sum_i w_i \, (p_i - \mu)(p_i - \mu)^T$     (2)

where $i$ is used to index the points on the ICS, $w_i$ are their weights and $\mu$ is their weighted mean. The points are Gaussian weighted according to a variance that is proportional to the scale at which the corner is being detected.

The eigenvalues $\lambda_1 \geq \lambda_2$ of $C$ reflect the distribution of the points of the ICS along two principal orthogonal directions, and high values of both indicate a corner point. The idea is similar to the Harris corner detector[20], which works on the second moment matrix of the image gradients. Several eigenvalue-based measures have been used in the literature: $\det(C) - \kappa \operatorname{tr}^2(C)$[20], the minimum of the two eigenvalues[40, 44], $\det(C)/\operatorname{tr}(C)$[11] and the measure proposed in[43]. We use the last measure as it is suitable for point distributions where the number of points on the curve is constant. Such a measure is also rotation invariant. A non-maximal suppression is applied to localize the corner on the iso-curve when there are multiple corners in a neighborhood.
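A short sketch of this cornerness computation, assuming the ICS points form an (N, 2) array; note that we substitute the $\det(C)/\operatorname{tr}(C)$ ratio of [11] for the exact measure of [43], which is not reproduced here:

import numpy as np

def cornerness(ics: np.ndarray, sigma: float) -> float:
    # Gaussian weights along the curve, centred on the candidate corner (Eq. 2).
    n = len(ics)
    arc = np.arange(n) - n // 2
    w = np.exp(-0.5 * (arc / sigma) ** 2)
    w /= w.sum()

    mu = (w[:, None] * ics).sum(axis=0)      # weighted mean point (the red dot in Fig. 4)
    d = ics - mu
    C = (w[:, None, None] * d[:, :, None] * d[:, None, :]).sum(axis=0)

    lmin, lmax = np.linalg.eigvalsh(C)       # eigenvalues in ascending order
    # High values of both eigenvalues indicate a corner; det/trace is one
    # rotation-invariant combination (substituted for the measure of [43]).
    return (lmin * lmax) / max(lmin + lmax, 1e-9)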

Apart from dealing with the problem of spurious spike detections, our approach of finding corners on ICSs has the benefit of not needing exact derivatives. This lends the method to fast approximations, as explained in the next section. Compared to traditional 2D corner algorithms, there is a reduction in computation since we work on a 1D curve. Finally, we define our corners as follows:

Definition

A point $p$ is said to be a corner at a particular scale $s$ if $\gamma_l(p, s)$ is maximally stable according to the stability measure $S$ and $p$ is a local maximum of the cornerness measure along $\gamma_l(p, s)$ at scale $s$.

An exhaustive search for such maximally stable corner points by investigating the stability of each segment on each iso-curve present in an image would be prohibitively slow. We next discuss a method to detect such points efficiently using some approximations.

3 Algorithm and Implementation Details

The first approximation is made in the scale and stability of an ICS by running an MSER-like algorithm in an image block, as shown in Fig. 5(d). The portion of an iso-curve at intensity $l$ contained within such a block yields the corresponding $\gamma_l(p, s)$, where $p$ is the block center and $s$ is related to the block size by a constant.

The MSER algorithm that we run in this block is modified in a few ways. First, we use our stability formulation of Eq. 1, which involves a division by the iso-curve length, rather than the division by the area of the (closed) iso-curve of the standard MSER stability formulation. Second, we calculate the areas and the curve lengths using a Gaussian weight on the image points, as explained before. The Gaussian weighting also ensures that the blocking has a negligible effect on the accuracy of the method, since points near the block edges have low weights. Furthermore, since we have to compute the measure only in a small intensity neighborhood of the ICS (between $l - \Delta$ and $l + \Delta$), we can use a region-growing algorithm for the MSER computation[32]. It is linear in the number of pixels used in computing the MSER, which is a very small number of pixels in the neighborhood of the ICS, rather than the whole block.

Figure 5: (a) Image divided into blocks of size $B \times B$. (b) Maximally stable ICSs and corners found for one sample block. For one sample initial corner in (c), (d) shows a smaller weighted block centered on it and the redetected ICS and corner converging to a more accurate position in two iterations.

A second approximation is used in the convolutions/summations of the cornerness calculation, where the 1D Gaussian function is replaced by an average filter that is run for multiple iterations to yield approximately the same result, owing to the Central Limit Theorem. The idea is similar in spirit to the approximate 2D Gaussians implemented in SURF[8] and is much faster due to the use of running sums (Dynamic Programming). Such an approximation is possible because our cornerness measure is much more robust to weight errors than alternatives such as the curvature.
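The idea can be sketched as follows: each pass of a running-sum (prefix-sum) box filter costs $O(n)$ regardless of the filter width, and a few passes approach a Gaussian by the Central Limit Theorem. The width and pass count below are illustrative assumptions:

import numpy as np

def approx_gaussian_1d(signal: np.ndarray, width: int, passes: int = 3) -> np.ndarray:
    """Approximate 1D Gaussian smoothing by repeated O(n) moving averages."""
    out = signal.astype(float)
    for _ in range(passes):
        c = np.cumsum(np.pad(out, (1, 0)))      # prefix sums: c[i] = sum of out[:i]
        box = (c[width:] - c[:-width]) / width  # moving average of the given width
        out = np.pad(box, (width // 2, width - 1 - width // 2), mode="edge")
    return out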

Finally, we describe a two-stage approach that reduces the number of points that have to be analyzed. We detect initial corners at the coarser scale $2s$ and obtain the corners desired at scale $s$ from them through an iterative procedure. In the first, Initialization, stage, we detect corners at scale $2s$ using blocks of size $B \times B$ (Fig. 5(a)). No weights are used in this stage. For an image of size $M \times N$ and overlapping blocks shifted by a fixed fraction of $B$, there are $O(MN/B^2)$ blocks. Since the computation is linear in the number of pixels in a block, the time for the initial computation remains $O(MN)$.

Assuming that stable iso-curves at a higher scale do not change drastically when the scale is reduced by half, the Feature Convergence procedure proceeds by centering a weighted block of scale $s$ on each corner detected at $2s$, as shown in Fig. 5(d). The modified local MSER and corner detection algorithms are applied on it again. If the redetected corner does not change, it is taken to be a maximally stable feature point. However, in case the corner shifts to a new point $p'$ on a nearby ICS, a weighted image block of the same scale is centered at $p'$ and used to redetermine the maximally stable ICS and the nearest corner point on it, and the process is iterated. The fixed point of this iteration, if present, yields a point that satisfies both of the conditions for our corners.

The detection takes $O(s^2)$ computations on this smaller block. For $n$ initial corners and an average of $t$ iterations per corner, the convergence takes $O(n t s^2)$ operations. Typically $n t s^2 \ll MN$, since the average value of $t$ was experimentally found to be small (corners converge in about two iterations, Fig. 5(d)), so the total time complexity remains $O(MN)$.

The whole iterative procedure is illustrated in Fig. 5 while Algorithm 1 describes the entire algorithm.

It is to be noted that the purpose of using a larger scale during Initialization is only to reduce the computation time. The Feature Convergence stage ensures that the detected corner is at the center of the block, so that the approximations used in the calculation are consistent. The desired scale $s$ is taken to be 8.4 and the block size $B$ to be 100, so that each final ICS has about 25 points.

1: procedure Feature Extraction
2:     Initialization:
3:     Divide the image into overlapping blocks of size B x B
4:     Compute the set of maximally stable ICSs and the initial corners at scale 2s in each block
5:     Feature Convergence:
6:     for each initial corner p do
7:         repeat
8:             Center a weighted block of scale s at p
9:             Redetect the maximally stable ICS in the block that is similar in shape to the previous one
10:            Redetect the corner p' on it
11:            p ← p'
12:        until the corner position converges
Algorithm 1: Iterating the initialized corners to convergence

4 Matching Strategy


Figure 6: Regions used for the SSD computation. (a) Harris: the full patch. (b), (c): the two CoMIC regions on either side of the ICS.

A simple translational motion model is used to match points in a video sequence, similar to the protocol followed in Visual Odometry[31]. This is sufficient due to the small time gap between consecutive frames. Each point in the current frame is assumed to have moved to a location within a radius $r$ (dependent on the scale used to detect the point) of that point in the next frame. A fixed value of $r$ in pixels was sufficient for our experiments.

We extract a patch as a 23x23 window centered on the candidate feature point. Matching is done using a simple Sum of Squared Differences (SSD) on the appearance of the two patches. Simple SSD-based matching is sufficient[7] in tracking applications, although more complicated invariant descriptors can be used for more complex matching tasks within a similar framework.

The ICS is used as the boundary separating the regions on either side of it, which are denoted as $R_1$ and $R_2$ respectively. This is shown pictorially in Fig. 6(b), where they correspond to the Background (BG)/Foreground (FG) regions for patches on the object boundary. Since the FG and BG regions are not known in advance, the two sides are matched separately and the minimum of their match distances is taken as the match score for that feature patch. The steps in matching are listed in Algorithm 2, where steps 6-8 are unique to our matching approach. The other detectors compute the SSD between the two full patches, as they have no FG/BG separation technique (Fig. 6(a)).

1: Input: points detected in frame t, search radius r, ICSs, threshold τ
2: for each point p in frame t do
3:     P ← patch around p
4:     for each candidate point q within radius r of p in frame t+1 do
5:         Q ← patch around q
6:         Obtain R1(P) and R2(P) from P and the ICS of p
7:         Obtain R1(Q) and R2(Q) from Q and the ICS of p
8:         d ← min( SSD(R1(P), R1(Q)), SSD(R2(P), R2(Q)) )
9:         if d < τ then
10:            record q as the best match for p if d is the lowest so far
Algorithm 2: Matching
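The distance computation of step 8 can be sketched as below, assuming boolean masks for the two regions that the ICS separates in the patch; the mask representation and the per-region normalisation are our assumptions:

import numpy as np

def comic_match_distance(P, Q, mask_r1, mask_r2):
    """SSD computed on each side of the ICS separately; the better
    (minimum) side is kept, so that a changing background half of the
    patch cannot corrupt the match."""
    P, Q = P.astype(float), Q.astype(float)
    def ssd(mask):
        return ((P[mask] - Q[mask]) ** 2).sum() / max(mask.sum(), 1)
    return min(ssd(mask_r1), ssd(mask_r2))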

5 Experimental Results

We demonstrate the effectiveness of our technique on a variety of videos where the object moves in a changing background, using the SSD-based template matching technique for point features as described above.

5.1 Experimental Setup

Datasets: The performance of some detectors on 3D objects has been evaluated in [17, 30]. The changes in the background are negligible in these controlled environments, and often the background itself is homogeneous. Furthermore, there is no means to analyse the performance of the feature points lying at the object boundary.

In order to evaluate the performance of point detectors at the object boundary regions and under a varying background, we have designed the challenging CoMIC dataset which has objects, homogeneous and textured, moving against a set of differently textured books. With the knowledge of the static background image, the delineation of the foreground object boundary is made possible through background subtraction. This enables an analysis of the performance of the features at the boundary and non-boundary regions of the object.

We also show results on sequences from the Middlebury stereo dataset[39], where the object boundary regions are affected by parallax and motion against a textured background.

We also demonstrate the overall effectiveness of our approach through experiments on a subset of sequences from popular tracking datasets such as KITTI Vehicle dataset[18], PROST[38], VoT[22] and Cehovin[13] which have mostly rigid objects moving against a changing background, samples and description of which are given in the supplementary section. The evaluation is done on the features on the object alone, obtained after the background is subtracted out in the case of the CoMIC dataset, using the groundtruth depth discontinuity map in the case of Middlebury sequences and approximated using the groundtruth bounding box for the other datasets. All the experiments were run on images resized to a height of 700 pixels.

Detectors compared: Out of the point detectors in the literature, we compare the performance of our approach with Harris, Hessian and FAST-9 (performance is quite similar to FAST-ER[36] and AGAST[25]) detectors that have been found to perform best in comparative studies[46, 36]. Experiments against these detectors were performed using codes from [1] and [2] respectively. We do not compare with scale and affine-invariant detectors (Harris-affine and Hessian-affine) since these do not perform as well as basic detectors in these tasks, where there is no significant scale or affine variations between consecutive frames. However, we compare with MSER since their method is closely related to ours.

Parameters: For a fair comparison, we equalize the number of detected features on the object across detectors. We vary the threshold of each detector to get the same number of points in the first frame of the sequence in a manner similar to the evaluation in CenSurE[3]. For this value of the threshold, features are detected on all other images in the sequence. In cases where the detector returns very few points in general, the threshold is fixed to yield the maximum number of points it can return. For CoMIC, only the stability value is varied to obtain a given number of points.

Evaluation Criteria: The Matching Score is obtained as the ratio of the number of matches in a frame to the number of detections in that frame. This measure can be extended to a score for the number of residual matches $M_n$ in the $(t+n)$-th frame that match consistently across frames, when the detection is done in the $t$-th frame:

$MS_n = \frac{M_n}{N_d}$     (3)

Thus, out of the $N_d$ points detected in frame $t$, $M_n$ points have matches in each of the $n$ frames in between. Such a normalization gives a quantitative indication of the resilience of the detector, which determines the number of points that can reliably be tracked over a number of frames. For the sequences used in the experiments, this corresponds to resilience to changes in the background. It also gives an idea of the interval at which one needs to redetect points in the frames.
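A sketch of Eq. 3 over a toy track representation (the data structure is our assumption for illustration): `detections[t]` is the list of points detected in frame t, and `matched[p]` is the set of frame indices in which detection p found a match.

def residual_matching_score(matched, detections, t, n):
    """Eq. 3: fraction of the points detected in frame t that are
    matched in every one of the frames t+1 .. t+n."""
    detected = detections[t]
    survived = [p for p in detected
                if all(f in matched[p] for f in range(t + 1, t + n + 1))]
    return len(survived) / max(len(detected), 1)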

Groundtruthing: Matches that are stable for $n = 5$ frames were taken to be true matches, since it is highly unlikely that these points match incorrectly in all five frames. The error due to false matches was found to be about 10%, which becomes negligible for larger values of $n$. Duplicate matches are removed by one-to-one matching.

5.2 Results


Figure 7: (a) Harris vs. (b) CoMIC: CoMIC misses fewer points on the object boundary than Harris, as seen from the regions in the colored boxes.

5.2.1 Results on CoMIC dataset

Object Sequence Boundary Non-Boundary Overall
CoMIC Har Hes Mser Fast CoMIC Har Hes Mser Fast CoMIC Har Hes Mser Fast
Textured Pens 40.9 24.7 23.4 9.8 13.3 50.8 45.8 43.5 30.9 36.1 46.2 37.8 35.5 25.9 27.1
Doll 34.6 19.5 18.2 11.3 12.0 44.9 46.5 44.0 32.1 28.6 40.8 36.1 34.8 27.9 20.4
Toy 28.4 21.6 18.5 12.6 13.0 38.7 37.7 38.2 28.5 25.6 34.8 31.5 31.8 25.0 21.1
Hero 39.2 27.0 24.5 12.6 16.6 45.9 46.8 43.6 29.2 28.4 43.9 41.1 39.2 27.1 25.3
Race-car 37.0 23.2 22.7 11.9 12.8 49.9 53.2 48.5 32.4 36.3 44.7 42.3 40.6 29.3 29.4
Homogeneous Box 36.5 33.1 28.2 10.3 20.5 53.3 47.0 30.2 35.5 23.7 40.1 38.6 29.2 25.3 21.8
Tape-Box 47.6 31.9 26.5 10.5 16.9 56.9 37.7 37.3 25.5 25.8 51.2 35.3 31.5 20.9 19.7
House 37.2 21.5 24.4 8.7 15.8 39.9 41.1 32.9 30.0 19.1 38.8 35.0 29.3 26.3 17.0
Stereo Tsukuba 32.6 20.9 29.2 11.6 30.7 48.2 47.8 51.1 43.3 43.9 81.1 68.4 79.7 54.9 74.2
Cones 22.1 10.1 19.6 4.4 13.5 35.1 44.1 37.8 43.1 32.2 57.4 53.3 56.1 47.5 44.9

Table 1: Average Matching Score ($MS_5$) on the CoMIC dataset and Middlebury stereo pairs
Object Sequence Boundary Non-Boundary Overall
CoMIC Har Hes Mser Fast CoMIC Har Hes Mser Fast CoMIC Har Hes Mser Fast
Textured Pens 27.5 5.7 15.5 2.6 6.4 45 20.1 45.1 28.8 32.5 72.5 25.8 60.7 31.5 38.9
Doll 36.4 7.0 19.7 5.5 15.2 78.2 27.7 90.5 61.7 40.6 114.6 34.7 110.2 67.2 55.8
Toy 23.1 5.0 13.6 4.4 9.4 56.6 16.3 63.1 38.7 39.4 79.7 21.4 76.7 43.1 48.8
Hero 38.7 8.5 19.1 5.7 15.5 106.9 36.4 110.0 76.0 79.1 145.6 44.9 129.1 81.7 94.7
Race-car 40.5 7.9 20.5 4.1 11.4 87.2 32.6 101.7 65.1 84.3 128.8 40.7 122.2 69.2 95.7
Homogeneous Box 27.3 9.7 21.1 3.5 17.9 14.2 11.7 15.9 16.8 10.4 41.5 21.4 37.0 20.3 28.2
Tape-Box 33.1 9.6 16.1 2.8 15.4 21.8 14.5 15.6 14.4 9.2 54.9 24.0 31.7 17.2 24.6
House 26.7 6.3 21.1 3.5 19.0 49.2 28.4 43.9 63.5 21.0 75.9 34.7 65.0 67.0 40.0
Stereo Tsukuba 278.0 112.0 266.0 56.0 171.0 503.0 489.0 514.0 544.0 409.0 881.0 591.0 763.0 599.0 571.0
Cones 338.0 188.0 299.0 121.0 330.0 514.0 429.0 524.0 450.0 471.0 852 614.0 817.0 571.0 797.0

Table 2: Average Number of Matches ($NM_5$) on the CoMIC dataset and Middlebury stereo pairs

Quantitative scores for the resilience of the features are computed using Eq. 3 with $n = 5$. The Matching Score over 5 frames ($MS_5$) and the Number of Matches that survive through 5 frames ($NM_5$) are computed for every fifth frame and averaged over all the frames in the dataset. Scores are shown separately for points on the object boundary, internal points and all the points on the object, in Table 1 ($MS_5$) and Table 2 ($NM_5$).

CoMIC generally outperforms the state-of-the-art detectors, in terms of both $MS_5$ and $NM_5$, at the boundary regions of both homogeneous and textured objects moving against a textured background. It performs comparably at non-boundary regions, where it closely follows Hessian and Harris, to yield the best scores on the full object.

CoMIC yields substantially more matches with high resilience in the boundary regions and overall, starting with approximately the same number of detections on the object as the others in the initial frame. This is useful in applications where a high number of matches increases the consensus on the pose of the object, especially at the boundaries. This is seen in Fig. 7, where several boundary points missed by Harris are detected by CoMIC. While these points are also correctly matched by CoMIC, very few points from the gradient based detectors match at the boundary regions. These are shown in the videos attached with the supplementary section.

The superior performance at the boundaries can be attributed to iso-curves being relatively unaffected by the change in the gradients at boundaries when the object moves with respect to its background. The feature patch is treated as a whole in gradient based methods, causing a matching failure when the background portion of the patch changes. CoMIC's ICS acts as a reliable local segmentation of the object in the neighborhood of the point. Examining the regions on either side of it separately ensures that a matching portion is consistently associated with the FG or the BG, as shown in Fig. 6. This reduces the ambiguity in matching and results in higher matching scores. Such a technique may also be incorporated more generally into a sophisticated descriptor built using the image intensities on each side of the separation boundary for other applications. It may also lead to better learning and discrimination of the FG portions in object tracking[19].

Harris and Hessian mostly perform well in the internal object regions, especially when the object is textured, while MSER and CoMIC perform well in homogeneous internal regions with distinctive boundaries. FAST, while being the fastest detector, yields less than remarkable scores in the comparison. Internal points are not as affected by changes in the background, and therefore the information from the entire patch used in gradient based detectors benefits the matching. On the other hand, the information lost on one side of the curve costs our performance slightly for internal features. However, apart from being useful in boundary regions, such an approach may even help in non-boundary regions in the case of partial occlusion.

5.2.2 Stereo Matching

We observe similar results in the case of stereo matching on the Tsukuba and Cones stereo pairs of the Middlebury dataset. The features, evaluated with groundtruth information in terms of $MS_5$ and $NM_5$, are shown in Table 1 and Table 2. Again, the spatial windows used for gradient aggregation in Harris and similar detectors are affected when the windows span multiple objects in the scene.

5.2.3 KITTI and other datasets

Sequence Matching Score Number of Matches
CoMIC Harris Hessian MSER FAST CoMIC Harris Hessian MSER FAST
Board 49.6 42.7 40.1 26.4 24.5 111.1 41.9 85.9 59.3 51.6
Lemming 39.2 43.3 35.8 27.2 14.0 30.6 14.6 29.3 18.0 5.4
Box 35.5 35.9 36.9 31.7 19.1 32.5 11.8 28.7 23.7 12.0
Cycle 28.9 24.1 26.3 17.0 11.6 45.7 7.8 43.2 17.3 16.9
Cup 44.0 45.1 37.9 23.9 14.2 32.1 14.5 27.7 16.5 4.1
Can 27.6 18.7 18.3 15.4 9.8 42.2 10.9 26.9 23.6 11.9
Dino 39.4 39.8 38.1 28.8 17.7 87.6 47.9 77.1 71.9 22.7
Car-A 18.8 21.5 16.3 12.1 9.1 65.1 45.4 50.2 48.5 31.4
Car-B 13.7 12.6 10.2 8.1 6.3 87.3 55.1 71.2 48.0 37.3
Car-C 12.8 12.6 10.2 8.9 5.5 54.2 17.0 48.1 31.0 24.6
Car-D 14.5 14.6 10.0 8.7 6.8 35.6 22.3 24.7 21.2 20.2
Car-E 7.4 5.4 3.5 4.1 2.7 23.5 5.7 10.9 13.0 8.0
Car-F 28.6 29.7 29.1 19.5 18.3 55.3 13.6 59.1 28.9 35.4
Car-G 15.1 14.8 13.1 10.1 9.1 42.7 22.3 34.7 30.7 27.2
Car-H 12.7 12.3 9.9 8.3 5.8 93.1 50.4 47.7 74.6 33.5
Car-I 18.6 14.4 12.9 6.4 8.6 13.6 4.8 9.3 4.3 8.2
Car-J 18.9 17.6 13.0 11.0 8.2 114.1 72.0 113.2 78.2 49.5
Car-K 19.9 18.9 13.6 13.3 7.7 26.9 9.8 15.9 11.4 10.6
Car-L 18.7 20.8 14.6 13.3 8.4 82.3 54.4 61.1 56.2 39.5
Car-M 51.7 58.2 57.5 42.1 50.4 31.3 7.3 23.9 13.7 30.4
Car-N 12.3 17.2 10.6 8.5 5.2 33.0 27.6 29.4 22.6 15.4
Car-O 49.5 68.5 62.0 52.2 47.6 74.9 56.1 85.0 72.0 50.2
Car-P 23.4 20.2 17.0 13.8 9.6 74.4 38.1 66.2 26.1 43.9
Car-Q 30.5 32.6 30.7 21.5 17.4 220.4 126.6 219.1 166.8 125.5
Table 3: Average $MS_5$ and $NM_5$ values for sequences from the PROST, VoT, Cehovin and KITTI datasets.

We further demonstrate CoMIC's effectiveness in matching points on real-world vehicle sequences (KITTI) and popular datasets (PROST, VoT and Cehovin) that have changes in the background. We present results in terms of $MS_5$ and $NM_5$ in Table 3. We observe that CoMIC yields more matches in almost every sequence and returns the best or second-best scores in most cases, closely followed by Hessian and Harris. These results are especially relevant given the popular use of point tracking and SfM in vehicle tracking in recent times[16, 42, 6].

5.3 Discussion

Point matching is used in a host of 3D applications. The seminal work by Intille and Bobick[10] uses keypoint matches in a DP-based stereo matching. Point tracking is used extensively in Visual Odometry[31, 3, 16, 42, 6], SfM from video[48] and SLAM[21], where the object or vehicle may move against different backgrounds. Even after several years of work in feature tracking, Kanade-Lucas-Tomasi (KLT) is still the leading algorithm for tracking points across frames[7]. When geometric distortion is present, image pyramids[47] or affine motion models[40] are used. While KLT traditionally uses Harris points, FAST features have been used more recently for much higher speed. Apart from the problems in Harris points, the gradient descent used in KLT will have errors when an object with thin wire-like structures, such as a cycle wheel, moves in a changing background. For these reasons, many algorithms that use KLT[5, 14, 23], such as SfM from videos, are unable to use boundary points[48]. CoMIC features integrated into these applications can improve the point detection module and thereby the overall performance of such systems. If one could identify the boundary regions (perhaps in an adaptive or probabilistic process), one could also use different detectors in the boundary and non-boundary regions, or a combination of them, for optimal results. Hessian, for instance, is a blob detector that returns physically different and complementary points that can be used along with CoMIC to yield the best results at both the boundary and non-boundary regions.

6 Conclusion

We have presented an iso-curve based method that combines contour and appearance information for corner detection and matching. Points on the object boundary are detected and matched consistently even in changing backgrounds. This is shown in experiments where they perform better than the state-of-the-art detectors at the object boundaries and comparably at internal regions leading to an overall improvement in performance for matching the full object. It yields a sufficient number of stable points on the object that can be used as part of algorithms for SfM in video sequences, Visual Odometry, stereo etc. There are several avenues for future work, where CoMIC can be used with more sophisticated motion models in a complete tracking setup. Scale and affine invariant extensions can be built based on the curve or intensity information inside, or both.

References

  • [1] http://lear.inrialpes.fr/people/mikolajczyk/.
  • [2] http://www.edwardrosten.com/work/fast-er-1.5.tar.gz.
  • [3] M. Agrawal, K. Konolige, and M. R. Blas. CenSurE: Center surround extremas for realtime feature detection and matching. In ECCV (4), volume 5305 of Lecture Notes in Computer Science, pages 102–115. Springer, 2008.
  • [4] P. F. Alcantarilla, A. Bartoli, and A. J. Davison. KAZE features. In ECCV (6), volume 7577 of Lecture Notes in Computer Science, pages 214–227. Springer, 2012.
  • [5] S. Ali. Measuring flow complexity in videos. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1097–1104. IEEE, 2013.
  • [6] H. Badino, A. Yamamoto, and T. Kanade. Visual odometry by multi-frame feature integration. In Computer Vision Workshops (ICCVW), 2013 IEEE International Conference on, pages 222–229. IEEE, 2013.
  • [7] S. Baker and I. Matthews. Lucas-kanade 20 years on: A unifying framework. International journal of computer vision, 56(3):221–255, 2004.
  • [8] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In ECCV, pages I: 404–417, 2006.
  • [9] P. R. Beaudet. Rotationally invariant image operators. In Proceedings of the 4th International Joint Conference on Pattern Recognition, pages 579–583, Nov. 1978.
  • [10] A. F. Bobick and S. S. Intille. Large occlusion stereo. International Journal of Computer Vision, 33(3):181–200, 1999.
  • [11] M. Brown, R. Szeliski, and S. Winder. Multi-image matching using multi-scale oriented patches. In Computer Vision and Pattern Recognition, 2005., volume 1, pages 510–517. IEEE, June 2005.
  • [12] F. Cao, P. Musé, and F. Sur. Extracting meaningful curves from images. Journal of Mathematical Imaging and Vision, 22(2-3):159–181, 2005.
  • [13] L. Cehovin, M. Kristan, and A. Leonardis. An adaptive coupled-layer visual model for robust visual tracking. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1363–1370. IEEE, 2011.
  • [14] L. B. Dorini and S. K. Goldenstein. Unscented feature tracking. Computer Vision and Image Understanding, 115(1):8–15, 2011.
  • [15] P.-E. Forssen and D. Lowe. Shape descriptors for maximally stable extremal regions. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, Oct 2007.
  • [16] C. Forster, M. Pizzoli, and D. Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In Proc. IEEE Intl. Conf. on Robotics and Automation, 2014.
  • [17] F. Fraundorfer and H. Bischof. Evaluation of local detectors on non-planar scenes. In In Proc. 28th workshop of the Austrian Association for Pattern Recognition, pages 125–132, 2004.
  • [18] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, page 0278364913491297, 2013.
  • [19] M. Grabner, H. Grabner, and H. Bischof. Learning features for tracking. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE, 2007.
  • [20] C. Harris and M. Stephens. A combined corner and edge detector. International Journal of Computer Vision, 1988.
  • [21] G. Klein and D. Murray. Improving the agility of keyframe-based slam. In Computer Vision–ECCV 2008, pages 802–815. Springer, 2008.
  • [22] M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, F. Porikli, L. Cehovin, G. Nebehay, G. Fernandez, T. Vojir, A. Gatt, et al. The visual object tracking vot2013 challenge results. In Computer Vision Workshops (ICCVW), 2013 IEEE International Conference on, pages 98–111. IEEE, 2013.
  • [23] M. Lourenço and J. P. Barreto. Tracking feature points in uncalibrated images with radial distortion. In Computer Vision–ECCV 2012, pages 1–14. Springer, 2012.
  • [24] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.
  • [25] E. Mair, G. D. Hager, D. Burschka, M. Suppa, and G. Hirzinger. Adaptive and generic corner detection based on the accelerated segment test. In ECCV 2010, volume 6312 of Lecture Notes in Computer Science, pages 183–196. Springer, 2010.
  • [26] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In British Machine Vision Conference, pages 36.1–36.10. BMVA Press, 2002.
  • [27] K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In ICCV 2001. IEEE International Conference on Computer Vision, volume 1, pages 525–531, 2001.
  • [28] K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. International Journal of Computer Vision, 60:63–86, 2004.
  • [29] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Gool. A comparison of affine region detectors. International Journal of Computer Vision, 65:43–72, 2005.
  • [30] P. Moreels and P. Perona. Evaluation of features detectors and descriptors based on 3d objects. In IJCV, pages 800–807, 2005.
  • [31] D. Nistér, O. Naroditsky, and J. Bergen. Visual odometry. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 1, pages I–652. IEEE, 2004.
  • [32] D. Nistér and H. Stewénius. Linear time maximally stable extremal regions. In ECCV 2008, volume 5303 of Lecture Notes in Computer Science, pages 183–196. Springer, 2008.
  • [33] M. Perdoch, J. Matas, and S. Obdrzalek. Stable affine frames on isophotes. In ICCV 2007. IEEE International Conference on Computer Vision, pages 1–8, 2007.
  • [34] A. Pinz. Object categorization. Foundations and Trends in Computer Graphics and Vision, 1(4):255–353, 2005.
  • [35] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In European Conference on Computer Vision, volume 1, pages 430–443, May 2006.
  • [36] E. Rosten, R. Porter, and T. Drummond. Faster and better: A machine learning approach to corner detection. IEEE Trans. Pattern Analysis and Machine Intelligence, 32:105–119, 2010.
  • [37] K. Sakurada, T. Okatani, and K. Deguchi. Detecting changes in 3d structure of a scene from multi-view images captured by a vehicle-mounted camera. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 137–144. IEEE, 2013.
  • [38] J. Santner, C. Leistner, A. Saffari, T. Pock, and H. Bischof. Prost: Parallel robust online simple tracking. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 723–730. IEEE, 2010.
  • [39] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47:7–42, 2002.
  • [40] J. Shi and C. Tomasi. Good features to track. In Computer Vision and Pattern Recognition, 1994, pages 593–600, 1994.
  • [41] S. M. Smith and J. M. Brady. SUSAN—a new approach to low level image processing. International Journal of Computer Vision, 23(1):45–78, May 1997.
  • [42] H. Song, S. Lu, X. Ma, Y. Yang, X. Liu, and P. Zhang. Vehicle behavior analysis using target motion trajectories. 2013.
  • [43] B. Triggs. Detecting keypoints with stable position, orientation, and scale under illumination changes. In ECCV 2004, volume 3024 of Lecture Notes in Computer Science, pages 100–113. Springer, 2004.
  • [44] D.-M. Tsai, H.-T. Hou, and H.-J. Su. Boundary-based corner detection using eigenvalues of covariance matrices. Pattern Recognition Letters, 20(1):31–40, Jan. 1999.
  • [45] T. Tuytelaars and L. J. V. Gool. Matching widely separated views based on affine invariant regions. International Journal of Computer Vision, 59(1):61–85, 2004.
  • [46] T. Tuytelaars and K. Mikolajczyk. Local invariant feature detectors: A survey. FnT Comp. Graphics and Vision, pages 177–280, 2008.
  • [47] J.-Y. Bouguet. Pyramidal implementation of the Lucas-Kanade feature tracker. Intel Corporation, Microprocessor Research Labs, 2000.
  • [48] G. Zhang, Z. Dong, J. Jia, T.-T. Wong, and H. Bao. Efficient non-consecutive feature tracking for structure-from-motion. In Computer Vision–ECCV 2010, pages 422–435. Springer, 2010.