Interest Point Detection based on Adaptive Ternary Coding

12/31/2018 ∙ by Zhenwei Miao, et al. ∙ Nanyang Technological University

In this paper, an adaptive pixel ternary coding mechanism is proposed, and a contrast invariant and noise resistant interest point detector is developed on the basis of this mechanism. Every pixel in a local region is adaptively encoded into one of three statuses: bright, uncertain and dark. The blob significance of the local region is measured by the spatial distribution of the bright and dark pixels, and interest points are extracted from this blob significance measurement. By labeling pixels with the ternary statuses bright, uncertain and dark, the proposed detector is more robust to image noise and quantization errors. Moreover, the adaptive strategy for the ternary coding, which relies on two thresholds that automatically converge to the median of the local region, makes this coding insensitive to the image local contrast. As a result, the proposed detector is invariant to illumination changes. State-of-the-art results are achieved on the standard datasets and in a face recognition application.







I Introduction

A well-designed interest point detector is supposed to effectively represent images across variations of scale and viewpoint, cluttered backgrounds and occlusion [1, 2]. For years, interest point detectors have been extensively studied and widely used in many applications [3, 4, 5, 6, 7]. Nevertheless, extracting stable points under illumination variations remains an open question. The Hessian-Laplace/Affine [8], Harris-Laplace/Affine [8], SIFT [9] and SURF [10] detectors are built upon the derivatives of the Gaussian filter. Either the first or the second derivative of the Gaussian filter is used to compute the strength of the image local contrast. As the Gaussian filter responds proportionally to the image local contrast, these detectors perform poorly in detecting low contrast structures, even if these structures are stable under different variations and significant in computer vision applications. Moreover, these detectors are susceptible to abrupt structures and image noise. To mitigate the influence of image noise and nearby image structures, a rank-ordered Laplacian of Gaussian filter is proposed in [11]. However, such a detector still partially relies on the image local contrast.

To address the problems caused by illumination changes in particular, image segmentation has been utilized in designing interest point detectors. For example, the MSER [12, 13], PCBR [14] and BPLR [15] detectors use watershed-like segmentation algorithms to extract the image structures. However, the performance of these detectors is unsatisfactory under image blurring, in which the boundaries of image structures are unclear [3]. Self-dissimilarity and self-similarity of image patches are used in the SUSAN [16], FAST [17] and self-similar [18] detectors to alleviate the problems caused by lighting variation. In particular, the SUSAN and FAST detectors use the number of pixels that are dissimilar from the pixel at the region center to detect corners. The weakness of these two detectors is that they are not scale-invariant and are inefficient in detecting blob-like structures. Although local pixel variance is adopted in [18] to estimate the self-similarity, the robustness of this detector is uncertain when there are strong abrupt changes within the image patch.

Considering the above-mentioned limitations of existing detectors, this paper aims to develop a contrast invariant and noise resistant interest point detector. Inspired by the recent work on the Iterative Truncated Mean (ITM) algorithms [19, 20, 21, 22], an adaptive ternary coding (ATC) is proposed to adaptively encode the pixels into bright, dark and uncertain statuses. The ternary status of each pixel in a local region is determined by dynamic thresholds that are automatically computed by the ITM algorithm. Interest points are extracted from the blob significance map, which is measured by the number of bright and dark pixels. As expected, the proposed ATC shows robustness to illumination variations and is effective in dealing with cluttered structures.

II The Proposed Interest Point Detector

Fig. 1: (a) input image. (b) an enlarged image patch. For this image patch, (c) shows the pixels divided by the median and (d) shows the pixels divided by the upper and lower bounds of the ITM algorithm. (f) shows the corresponding blob significance against the number of iterations. (Best viewed in color).

II-A Problem Formulation

Blobs, as shown in Fig. 1, are image local structures in which the majority of the bright (or dark) pixels concentrate in the center while the majority of the pixels of opposite intensity reside in the peripheral region. This property of the blob structure is preserved under a wide range of image variations. Moreover, blob-like structures occur throughout a pictorial image. These properties make the blob-like structure suitable for anchoring the local descriptor [23, 9] under various image conditions. Hence, many methods have been proposed to extract blob-like structures from images [9, 12, 18, 24]. However, the linear filter based detectors, such as SIFT and SURF, are sensitive to illumination changes. In contrast, the relative bright-dark order of pixels in a local region is more stable than the pixel intensity value under illumination changes. In view of this, we propose to detect interest points using the bright/dark labels of pixels.

An issue that needs to be addressed is how to differentiate and label the pixels as bright or dark ones. One way is to dichotomize the pixels into bright and dark ones by a certain threshold, which could be set to the mean or median value of the local region. Taking the image patch shown in Fig. 1(b), a zoom-in from Fig. 1(a), as an example, the bright and dark pixels dichotomized by the median value are identified in Fig. 1(c). The median is more robust to outliers and abrupt variations than the mean. However, the median-based threshold is sensitive to quantization error because of its inefficiency in suppressing this type of noise. This may lead to unreliable labelling. To solve this problem, we propose to introduce a fuzzy label for the pixels that are not clear enough to be labelled as either bright or dark. This results in our proposed adaptive ternary coding algorithm.

II-B Adaptive Ternary Coding Algorithm

Instead of using one threshold to binarize the pixels into bright or dark labels, a pixel intensity margin spanned by two thresholds is proposed to ternarize the pixels, as

$$T(x) = \begin{cases} 1, & x > b_u \\ 0, & b_l \le x \le b_u \\ -1, & x < b_l \end{cases} \tag{1}$$

where $x$ is the pixel intensity value, and $b_l$ and $b_u$ are the lower and upper bounds for the pixel ternarization. Pixel intensities that are close to the median value in a local region are labeled as uncertain to reduce their sensitivity to noise. Properly choosing the two thresholds is essential in the ternarization. The two thresholds should be invariant to illumination changes, and should be located on both sides of the median value to ensure the correctness of pixel labeling.

Let the half width of the margin spanned by $b_l$ and $b_u$ be $\tau$, and the mean of $b_l$ and $b_u$ be $\mu$. Choosing $b_l$ and $b_u$ is equivalent to choosing $\mu$ and $\tau$. One solution for the ternary coding is setting $\mu$ equal to the median of the local region and $\tau$ equal to some fixed threshold. However, this has two limitations: 1) computing the median is time consuming and 2) a fixed threshold cannot adapt to contrast changes. Compared to the median, the mean of the pixel intensities in a local region is cheaper to compute. By setting $\tau$ equal to the Mean Absolute Deviation (MAD) of the pixel intensities from the mean $\mu$, the two thresholds $b_l = \mu - \tau$ and $b_u = \mu + \tau$ are located on both sides of the median [19] and are invariant to illumination changes. Moreover, by iteratively truncating the extreme samples with the ITM algorithm proposed in [19, 20], the mean of the truncated data starts from the mean and approaches the median of the input data. Meanwhile, the MAD of the truncated data converges to zero [19, 20]. As a result, the two boundaries $b_l$ and $b_u$ computed by the ITM algorithm automatically converge to the median while keeping the median within the margin spanned by $b_l$ and $b_u$. Therefore, this margin (as shown in Fig. 1(d)) separates the pixels into bright and dark ones and tolerates noise and quantization errors. Given the advantage of the ITM filter, we propose an adaptive ternary coding algorithm and a blob significance measure based on the ITM algorithm, which are presented as follows.
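The behaviour of the two bounds can be illustrated with a minimal sketch of the unweighted ITM iteration, assuming the bounds are the mean plus and minus the mean absolute deviation and that truncation clips samples to those bounds. The function name `itm_bounds` and the sample patch are hypothetical and are not taken from the paper.

```python
import numpy as np

def itm_bounds(values, n_iter):
    """Run n_iter unweighted ITM iterations (illustrative sketch).

    Returns the (lower, upper) bounds computed in each iteration.
    """
    x = np.asarray(values, dtype=float).copy()
    history = []
    for _ in range(n_iter):
        mu = x.mean()                  # mean of the (truncated) data
        tau = np.abs(x - mu).mean()    # mean absolute deviation from the mean
        lower, upper = mu - tau, mu + tau
        history.append((lower, upper))
        x = np.clip(x, lower, upper)   # truncate extreme samples to the bounds
    return history

# Hypothetical 3x3 patch, flattened, with one strong outlier (200).
patch = np.array([10, 12, 11, 13, 12, 200, 11, 12, 14], dtype=float)
bounds = itm_bounds(patch, 8)
for k, (lo, hi) in enumerate(bounds, start=1):
    print(f"iteration {k}: lower={lo:.3f}, upper={hi:.3f}")
```

On this patch the margin shrinks from one iteration to the next while the median of the patch stays between the two bounds, which is the property the ternary coding relies on.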

Let $\Omega_c(\mathbf{p})$ and $\Omega_r(\mathbf{p})$ be the central region and the corresponding peripheral ring of a filter mask centered at $\mathbf{p}$. For blob detection, both $\Omega_c$ and $\Omega_r$ are chosen to be circular, and the outer radius of the ring is $\sqrt{2}$ times the radius of the inner region so that the two regions have the same area. Two pixel sets centered at $\mathbf{p}$ are defined as $X_c = \{I(\mathbf{q}) : \mathbf{q} \in \Omega_c(\mathbf{p})\}$ and $X_r = \{I(\mathbf{q}) : \mathbf{q} \in \Omega_r(\mathbf{p})\}$, where $\mathbf{p}$ is the region center and $I(\mathbf{q})$ is the pixel gray value at the location $\mathbf{q}$. In order to ensure that the two pixel sets $X_c$ and $X_r$ have the same effect on estimating the thresholds for pixel labeling, the weighted ITM algorithm [20] is adopted so that they contribute as if they had an equal number of pixels. The pixel numbers $n_c$ and $n_r$ of the two sets are used to weight the pixels for this purpose. The proposed adaptive pixel ternary coding is shown in Algorithm 1.
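The equal-area construction of the two regions can be sketched on a pixel grid as follows. This is only an illustration under the assumption that the inner region is a disc of a given radius and the ring extends to $\sqrt{2}$ times that radius; the helper name `disc_and_ring` is hypothetical.

```python
import numpy as np

def disc_and_ring(radius):
    """Boolean masks for a central disc and its equal-area peripheral ring."""
    inner2 = float(radius) ** 2
    outer2 = 2.0 * inner2                      # (sqrt(2) * radius) squared
    half = int(np.ceil(np.sqrt(outer2)))
    yy, xx = np.mgrid[-half:half + 1, -half:half + 1]
    dist2 = xx ** 2 + yy ** 2                  # squared distance to the center
    center = dist2 <= inner2
    ring = (dist2 > inner2) & (dist2 <= outer2)
    return center, ring

center, ring = disc_and_ring(10)
n_c, n_r = int(center.sum()), int(ring.sum())
print(n_c, n_r)
```

Because of pixel discretization the two counts are close but not necessarily identical, which is one reason a weighting of the two sets is useful.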

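Algorithm 1 Adaptive Pixel Ternary Coding for the Proposed Detector
Input: $X_c$, $X_r$, $n_c$, $n_r$;
Output: Blob significance $B$;
1 do
2       Compute the weighted mean $\mu$; compute the weighted dynamic threshold $\tau$; set $b_l = \mu - \tau$, $b_u = \mu + \tau$, $k \leftarrow k + 1$; compute the blob significance by (2); and truncate each pixel value $x$ by: $x \leftarrow b_u$ if $x > b_u$, and $x \leftarrow b_l$ if $x < b_l$;
3 while the stopping criterion is violated;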

The lower and upper bounds $b_l$ and $b_u$ in Algorithm 1 are used to ternarize the pixels into bright, uncertain or dark ones by (1), as shown in Fig. 1(d). A bright pixel is one that is larger than the upper threshold. A dark pixel is one that is smaller than the lower threshold. Blob structures have the attribute that the majority of bright (or dark) pixels are concentrated in the inner region while the majority of the opposite ones lie in the surrounding region. As a result, we measure the blob significance by the distribution of the bright and dark pixels. First, the dominances of bright/dark pixels in the central region and in the peripheral ring are measured by the difference of the numbers of bright and dark pixels in the corresponding region. The bright and dark pixels are respectively labeled as $1$ and $-1$ by (1) and the uncertain pixels are labeled as 0. Therefore, the normalized dominances of the bright/dark pixels in the two regions are $D_c^k = \frac{1}{n_c}\sum_{x \in X_c} T_k(x)$ and $D_r^k = \frac{1}{n_r}\sum_{x \in X_r} T_k(x)$, respectively, where $X_c$ and $X_r$ are the pixel sets of the central region and the ring, $n_c$ and $n_r$ are their pixel numbers, and $T_k$ is the ternary code (1) evaluated with the lower and upper bounds $b_l^k$ and $b_u^k$ of the $k$th iteration. Second, these two parts are linearly combined as the blob significance in the $k$th iteration:

$$B_k = D_c^k - D_r^k \tag{2}$$

From Algorithm 1 it is seen that the margin between the lower and upper bounds equals $2\tau$. It monotonically decreases to zero as the number of iterations increases [20]. In the first few iterations, the margin is large as only a few extreme samples are truncated by the ITM algorithm. As the number of iterations increases, both the lower and upper thresholds converge to the median value of the local region. As a result, the margin between these two thresholds shrinks, and the number of pixels categorized into the intermediate group decreases. The blob significance $B_k$ (shown in Fig. 1(f)) is a function of the number of iterations $k$. The maximum value of $B_k$ over $k$ is selected as the blob significance map for interest point detection, defined as


However, exhaustively searching for the global peak over all iterations is time-consuming. The following stopping criteria are used so that the global maximum is reached in most cases within a reasonable number of iterations.

Let , the corresponding weight set be and the two sets separated by the weighted mean be and . Let and denote the summation of the weights of and , respectively. One stopping criterion [20], which enables the truncated mean to be close to the weighted median, is to meet the condition


In some cases, after (4) is met, the amplitude of the blob significance still increases because the number of pixels with uncertain status is still large. Therefore, an additional constraint is applied:


The third condition is to limit the maximum number of iterations as


which is chosen experimentally. The truncating procedure in Algorithm 1 is terminated if the following condition is satisfied:

Fig. 2: Results on (a) the textured scene ‘wall’ vs. viewpoint angle changes from 20 degrees to 60 degrees, (b) the structured sequence ‘boat’ vs. scale changes from 1.1 to 2.8, (c) ‘leuven’, the illumination change sequence with decreasing light, and (d) ‘desktop’ and (e) ‘corridor’ with complex illumination changes.

From (2) we find that the blob significance value is within the range $[-2, 2]$. A local region is bright if its blob significance is positive, and the maximum value of its blob significance is 2. Similarly, a local region is dark if its blob significance is negative, and the minimum value of its blob significance is -2.
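The extreme values can be checked with a small sketch. It assumes that bright, uncertain and dark pixels are coded as 1, 0 and -1, and that the blob significance is the normalized dominance in the central region minus that in the peripheral ring, which is consistent with the stated range. The function names and the sample intensities are hypothetical.

```python
import numpy as np

def ternary_code(x, lower, upper):
    """Code pixels as 1 (bright), 0 (uncertain) or -1 (dark)."""
    x = np.asarray(x, dtype=float)
    return np.where(x > upper, 1, np.where(x < lower, -1, 0))

def blob_significance(center_vals, ring_vals, lower, upper):
    """Normalized dominance in the center minus that in the ring."""
    d_c = ternary_code(center_vals, lower, upper).mean()
    d_r = ternary_code(ring_vals, lower, upper).mean()
    return d_c - d_r

lower, upper = 100.0, 150.0           # hypothetical bounds for illustration
bright_blob = blob_significance([200, 210, 205], [50, 40, 45], lower, upper)
dark_blob = blob_significance([50, 40, 45], [200, 210, 205], lower, upper)
flat_patch = blob_significance([120, 125, 130], [118, 122, 127], lower, upper)
print(bright_blob, dark_blob, flat_patch)
```

An ideal bright blob reaches the upper end of the range, an ideal dark blob the lower end, and a patch whose pixels all fall inside the margin yields zero.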

II-C The Proposed ATC Detector

II-C1 Ridge and Edge Suppression

Interest points are extracted by detecting the local peaks from the blob significance map (3). In order to suppress the unreliable points detected on ridges and edges, the ratio


is used. A small ratio means that the peak value is quite similar to the values in its surrounding regions. We remove candidates whose ratio falls below a threshold, which is chosen empirically.

II-C2 Algorithm for ATC Detector

Detecting interest points at multiple scales is essential in many vision applications where the same objects can appear with different sizes. By changing the size of the central region and the peripheral ring, the ATC detector can identify local structures of various scales. Similar to [25], we implement the multi-scale ATC detector by detecting the points in each scale. The procedures of the proposed ATC detector are summarized as follows:

  1. Generate the blob significance map on multi-scales by Algorithm 1.

  2. Detect the local peaks of the blob significance on spatial dimensions.

  3. Remove the peaks on ridges and edges by (8). The remaining peaks are the interest points to be detected.
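Step 2 of the procedure can be sketched as a strict 3x3 local-maximum search on a single-scale significance map. This is an illustrative assumption rather than the authors' implementation: it treats bright and dark blobs alike by working on the absolute value, the parameter `min_abs` is hypothetical, and the ridge and edge suppression of step 3 is not included.

```python
import numpy as np

def local_peaks(sig_map, min_abs=0.0):
    """Return (row, col) positions that are strict 3x3 maxima of |sig_map|."""
    s = np.abs(np.asarray(sig_map, dtype=float))
    h, w = s.shape
    # Pad with -inf so that border pixels can still be peaks.
    padded = np.pad(s, 1, mode="constant", constant_values=-np.inf)
    is_peak = s > min_abs
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            is_peak &= s > neighbor
    return np.argwhere(is_peak)

demo = np.zeros((5, 5))
demo[2, 2] = 1.5     # bright blob response
demo[0, 4] = -1.0    # dark blob response at the image corner
peaks = local_peaks(demo)
print(peaks)
```

Running the same search on the map of every scale gives the per-scale candidates that the ridge and edge test would then filter.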

  wall 1508 1460 1520 1568 1593 1514
  boat 1546 1501 1549 1429 1524 1501
  leuven 1527 1426 1476 1501 1648 1488
  desktop 1539 1539 868 1526 1698 1451
  corridor 1526 1564 1540 1544 1583 1578
TABLE I: Number of Detected Points on the First Image of Each Data Set.

III Experiments

III-A Repeatability

Two detected regions are regarded as repeated if their overlap is above 60% as suggested in [26]. For an image pair {Img1, Img2}, the repeatability score is defined as $n_{\mathrm{rep}} / \min(n_1, n_2)$, where $n_{\mathrm{rep}}$ is the number of repeated points, and $n_1$ and $n_2$ are the numbers of the points detected from the common area and scale of Img1 and Img2, respectively. We use the repeatability to evaluate the detectors under different variations. The three datasets ‘wall’, ‘boat’ and ‘leuven’ from the Oxford database in [26] and the ‘desktop’ and ‘corridor’ datasets from [27] with complex illumination changes are used for testing.
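As a small worked example, the score can be computed as below, assuming the usual normalization by the smaller of the two point counts as in [26]. The function name and the numbers are hypothetical and are not results from the paper.

```python
def repeatability(n_repeated, n_img1, n_img2):
    """Repeatability score: repeated points over the smaller point count."""
    denom = min(n_img1, n_img2)
    if denom == 0:
        return 0.0           # no points in the common area of one image
    return n_repeated / denom

# Hypothetical counts: 600 repeated points, 1500 and 1200 detected points.
score = repeatability(600, 1500, 1200)
print(f"{score:.1%}")
```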

Similar to [18], half-sampled images are used for evaluation. For the ATC detector, interest points are extracted on 5 octaves by half-sampling the previous octave. In each octave, local extrema are detected on 3 scales. The ATC detector is compared with five detectors: the SIFT [9], Harris-affine (HR-A) [8], Hessian-affine (HS-A) [8], MSER [12] and ROLG [11] detectors. For each data set, the detector parameters are adjusted so that roughly the same number of interest points (shown in Table I) is detected on the first image for all detectors. The number of interest points detected by the HR-A detector on the first image of the ‘desktop’ set is smaller than for the other detectors, even though its contrast threshold is already set to zero, because of the dark illumination in this image. Fig. 2 (a) and (b) illustrate the experimental results under changes of viewpoint and scale, respectively. Fig. 2 (c), (d) and (e) show the performance under complex illumination changes. These results show that the ATC detector achieves better performance than the other five detectors under almost all the experimental settings.

III-B Application to Face Recognition

To demonstrate the implications of the proposed ATC detector, we evaluate it in the face recognition application [28, 29, 30]. Specifically, the ATC detector is compared with the SIFT [9], HR-A [8], HS-A [8], MSER [12] and ROLG [11] detectors. As the default setting produces too few interest points for the face recognition for all detectors, the thresholds that are used to remove the low response interest points are set to be zero for all detectors in the present experiment. For the MSER detector, the minimum size of its output region is set to be 1/4 of the default setting to ensure it is applicable to all of the testing databases. All the detected interest points are described by the SIFT descriptor. The matching algorithm for face recognition, which consists of interest point matching and geometric verification with Hough transform, is described in [9].

Four standard face recognition databases, AR [31], GT [32], ORL [33] and FERET [34], are used to evaluate these detectors. The database settings are shown in Table II. The face images in these databases have variations in illumination, expression and pose. The recognition rate, which is the percentage of test images correctly identified by their rank-1 best match in the gallery, is used to measure the performance of the interest point detectors. Table III shows that the proposed detector achieves the highest recognition rate on all four databases, tying with ROLG on AR. This suggests that the interest points detected by the proposed ATC detector are more robust and discriminative than those of the other detectors.

          image size   subjects   gallery   test
  AR      60×85        75         7         7
  GT      60×80        50         8         7
  ORL     50×57        40         5         5
  FERET   60×80        1194       1         1
TABLE II: Face Database Settings.
          AR       GT       ORL      FERET
  ATC     98.3%    94.0%    97.5%    98.5%
  SIFT    94.3%    84.0%    90.0%    89.9%
  HS-A    88.6%    74.0%    80.0%    85.3%
  HR-A    74.5%    47.4%    66.5%    49.7%
  MSER    92.7%    81.1%    91.0%    89.3%
  ROLG    98.3%    91.1%    96.5%    98.2%
TABLE III: Recognition Rate on AR, GT, ORL and FERET Databases.

IV Conclusions

In this paper, an interest point detector is designed based on the adaptive ternary coding (ATC) algorithm, which is inspired by the ITM algorithm and categorizes the pixels into bright, dark and uncertain statuses. As the blob significance is measured by counting the number of bright and dark pixels, the detection result is invariant to illumination changes. Evaluations on the Oxford dataset [26] and the complex illumination dataset in [27] show that the ATC detector outperforms the other five detectors in terms of repeatability under the variations caused by scale, viewpoint and illumination changes. The superior performance of the proposed detector is also verified in the application of face recognition.


  • [1] R. Unnikrishnan and M. Hebert, “Extracting scale and illuminant invariant regions through color,” in Proc. British Machine Vision Conference, 2006.
  • [2] Z. W. Miao, Median based approaches for noise suppression and interest point detection, Ph.D. thesis, 2013.
  • [3] T. Tuytelaars and K. Mikolajczyk, “Local invariant feature detectors: a survey,” Foundations and Trends in Computer Graphics and Vision, vol. 3, no. 3, pp. 177–280, 2008.
  • [4] G. Guan, Z. Wang, S. Lu, J. D. Deng, and D. D. Feng, “Keypoint-based keyframe selection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 4, pp. 729–734, 2013.
  • [5] X. Wu, D. Xu, L. Duan, J. Luo, and Y. Jia, “Action recognition using multilevel features and latent structural svm,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 8, pp. 1422–1431, 2013.
  • [6] T. Chen and K. H. Yap, “Context-aware discriminative vocabulary learning for mobile landmark recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 9, pp. 1611–1621, 2013.
  • [7] Z. W. Miao and X. D. Jiang, “A novel rank order LoG filter for interest point detection,” in IEEE Conf. Acoustics, Speech and Signal Processing, 2012, pp. 937–940.
  • [8] K. Mikolajczyk and C. Schmid, “Scale & affine invariant interest point detectors,” Int. J. Computer Vision, vol. 60, no. 1, pp. 63–86, 2004.
  • [9] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
  • [10] H. Bay, T. Tuytelaars, and L. van Gool, “SURF: Speeded up robust features,” in Proc. European Conference on Computer Vision, vol. 3951, pp. 404–417. 2006.
  • [11] Z. W. Miao and X. D. Jiang, “Interest point detection using rank order LoG filter,” Pattern Recognition, vol. 46, no. 11, pp. 2890–2901, November 2013.
  • [12] J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide-baseline stereo from maximally stable extremal regions,” Image and Vision Computing, vol. 22, no. 10, pp. 761–767, 2004.
  • [13] R. Kimmel, C. P. Zhang, A. M. Bronstein, and M. M. Bronstein, “Are mser features really interesting?,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2316–2320, 2011.
  • [14] H. L. Deng, W. Zhang, E. Mortensen, T. Dietterich, and L. Shapiro, “Principal curvature-based region detector for object recognition,” in Proc. Conf. Computer Vision and Pattern Recognition, 2007, pp. 1–8.
  • [15] J. Kim and K. Grauman, “Boundary preserving dense local regions,” in Proc. Conf. Computer Vision and Pattern Recognition, 2011, pp. 1553–1560.
  • [16] S. M. Smith and J. M. Brady, “SUSAN - a new approach to low level image processing,” Int. J. Computer Vision, vol. 23, no. 1, pp. 45–78, 1997.
  • [17] E. Rosten, R. Porter, and T. Drummond, “Faster and better: A machine learning approach to corner detection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 105–119, 2010.
  • [18] J. Maver, “Self-similarity and points of interest,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 7, pp. 1211–1226, 2010.
  • [19] X. D. Jiang, “Iterative truncated arithmetic mean filter and its properties,” IEEE Trans. Image Processing, vol. 21, no. 4, pp. 1537–1547, 2012.
  • [20] Z. W. Miao and X. D. Jiang, “Weighted iterative truncated mean filter,” IEEE Trans. Signal Processing, vol. 61, no. 16, pp. 4149–4160, August 2013.
  • [21] Z. W. Miao and X. D. Jiang, “Further properties and a fast realization of the iterative truncated arithmetic mean filter,” IEEE Trans. Circuits and Systems Part II: Express Briefs, vol. 59, no. 11, pp. 810–814, November 2012.
  • [22] Z. W. Miao and X. D. Jiang, “Additive and exclusive noise suppression by iterative trimmed and truncated mean algorithm,” Signal Processing, vol. 99, pp. 147 – 158, 2014.
  • [23] Z. W. Miao, K. H. Yap, X. D. Jiang, S. Sinduja, and Z. H. Wang, “Laplace gradient based discriminative and contrast invertible descriptor,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 1842–1846.
  • [24] Z. W. Miao, X. D. Jiang, and K. H. Yap, “Contrast invariant interest point detection by zero-norm log filter,” Image Processing, IEEE Transactions on, vol. 25, no. 1, pp. 331–342, Jan 2016.
  • [25] W. T. Lee and H. T. Chen, “Histogram-based interest point detectors,” in Proc. Conf. Computer Vision and Pattern Recognition, 2009, pp. 1590–1596.
  • [26] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. van Gool, “A comparison of affine region detectors,” Int. J. Computer Vision, vol. 65, no. 1-2, pp. 43–72, 2005.
  • [27] Z. H. Wang, B. Fan, and F. Wu, “Local intensity order pattern for feature description,” in IEEE International Conference on Computer Vision, 2011, pp. 603–610.
  • [28] X. D. Jiang, B. Mandal, and A. Kot, “Eigenfeature regularization and extraction in face recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 3, pp. 383–394, 2008.
  • [29] Z. W. Miao, W. Ji, Y. Xu, and J. Yang, “A novel ultrasonic sensing based human face recognition,” in IEEE Ultrasonics Symposium, 2008, pp. 1873–1876.
  • [30] Z. W. Miao, W. Ji, Y. Xu, and J. Yang, “Human face classification using ultrasonic sonar imaging,” Japanese Journal of Applied Physics, vol. 48, no. 7S, pp. 07GC11, 2009.
  • [31] A. M. Martinez, “Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 748–763, 2002.
  • [32] “Georgia Tech Face Database,” reco.htm, 2007.
  • [33] F. Samaria and A. Harter, “Parameterisation of a stochastic model for human face identification,” in Second IEEE Workshop Applications of Computer Vision, 1994, pp. 138–142.
  • [34] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, “The FERET evaluation methodology for face-recognition algorithms,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1090–1104, 2000.