Compact Binary Fingerprint for Image Copy Re-Ranking

by Nazar Mohammad, et al.

Image copy detection is a challenging and appealing topic in computer vision and signal processing. Recent advances in multimedia have made the distribution of images across the globe easy and fast, which leads to other issues such as forgery and image copy retrieval. Local keypoint descriptors such as SIFT are used to represent images, and images are matched and retrieved by matching those descriptors. Features are quantized so that searching and matching become feasible for large databases, at the cost of some accuracy. In this paper, we propose a binary feature obtained by quantizing SIFT into a binary vector, and the rank list is re-examined to remove false positives. Experiments on challenging datasets show gains in accuracy and time.



1 Introduction

The Internet is one of the most powerful tools in use today. Visual content such as videos and images attracts more attention than text, and image manipulation and editing software is easily available online and offline. As a result, thousands of images are edited and shared across social media and other sites such as Facebook, Flickr, and ImageShack. Image databases are expanding rapidly in scope and size, which makes image retrieval inefficient and gives rise to problems such as image infringement and redundancy. Many researchers have proposed solutions to this problem; image copy retrieval and partial-duplicate image retrieval address it to some extent. An image copy is defined as a segment of an image derived from another image, usually by means of challenging transformations such as pattern addition, content deletion, modification of image properties (such as aspect ratio, color, contrast, and encoding), camcording, and morphing.

Local features are extensively used in image and video applications. First, keypoints are detected in the images and represented by robust descriptors such as SIFT David2004, GLOH Mikolajczyk2005, and CSLBP CSLBP.

Affine regions are critical when computing descriptors at keypoints; descriptor performance is very sensitive to the estimation of the affine regions Mikolajczyk2005. Therefore, only stable and highly repeatable affine regions are preferred for descriptor computation Krystian2004; Mikolajczyk2005. Descriptors computed from affine regions centered on keypoints are repeatable under image transformation, but that is not a sufficient solution because two totally different patches can have the same texture. Previous research has shown that descriptors lack robustness and distinctiveness JICTAbaber. Robustness is generally understood as the ability of a descriptor to survive image noise, and distinctiveness as its ability to tell two image patches or textures apart. Descriptors computed from different scenes or contexts may be judged similar (lack of distinctiveness), or similar descriptors may be judged different because of image noise (lack of robustness); similarity between descriptors is usually computed from the distance between them in feature space.

Similarity between a pair of images is computed from the distances between their local keypoint descriptors. Since one image may have several thousand keypoints, finding the nearest neighbor of a single keypoint in the other image and then repeating this for every other keypoint is computationally very expensive. To make search feasible on large databases, features are quantized. The most common feature quantization is the bag of visual words (BoVW) model James07; Zhong2009; Zhou2010; Baber2014. In BoVW, an image is represented by a histogram of the visual words present in it; visual words are the centroids obtained by clustering SIFT features. The BoVW model is computationally efficient compared with raw SIFT-based retrieval, whereas SIFT is more accurate than BoVW. Geometric relations between keypoints and their descriptors are also used in image copy detection Fergus; LiuCVPR; Satoh2; Satoh1; James07; Zhong2009; Zhou2010. It has been argued that the performance of local features improves significantly when moving from bags of visual words to bags of pairs of visual words LiuCVPR. Many researchers have tried to improve the robustness and distinctiveness of content-based copy detection (CBCD) either before BoVW quantization Xu2011; Ke2004511; Ke04 or after it James07; Zhong2009; Zhou2010.

In this paper, we propose re-ranking of BoVW results using compact binary features. Images are first represented by SIFT features, which are then quantized into the BoVW model. The initial ranking is obtained by computing the distances between the query image and the database images.

Furthermore, we attempt to improve the robustness and distinctiveness of local keypoint descriptors for image copy detection applications. More robust and distinctive descriptors carry spatial information within the local patch. Increasing or decreasing the size of the keypoint region can help achieve this goal and makes the descriptor vector more effective. We experimentally demonstrate the improved CBCD performance, using descriptors that are influenced by their surrounding region.

2 Related work

This section is divided into two parts. The first reviews image copy detection and its methods; the second describes the well-known patch-based feature descriptors used in our experiments.

2.1 Image copy detection

Watermarking and CBCD techniques are the main approaches used to address copyright issues, and CBCD is complementary to watermarking. Early CBCD methods used global features, which were later replaced by local features.

Chang et al. Edward1998 propose RIME (Replicated IMage dEtector), which identifies pirated copies of images on the World Wide Web using color space and wavelets chang1998rime. The system handles basic types of transformations with good accuracy. Kim Kim2003 uses the Discrete Cosine Transform (DCT) for CBCD, as the DCT is more robust to many distortions and image transformations. Images are converted to YUV format and only the Y component is used, since it is argued that colors do not play an essential role in image retrieval. RIME successfully detects copies of the test image with and without modifications; however, it fails to detect copies under certain rotations. Global features are efficient for simple types of transformations; however, under severe transformations such as cropping, occlusion, and aspect ratio change, their performance is poor.

Many schemes have been proposed for image copy detection and forgery detection, and SIFT and SURF are the most widely used methods for feature extraction. Cong Lin et al. lin2019copy propose a method that combines SIFT and the Local Intensity Order Pattern (LIOP) to obtain good features. SIFT is invariant to image scale, rotation, and noise, while the LIOP descriptor handles image scale, rotation, viewpoint change, image blur, and JPEG compression. Local features have proven more robust than global features under severe image transformations. Numerous CBCD and image retrieval frameworks based on SIFT and other local features have been proposed Chum2011; Nister2006; James07; Philbin2011; Zhong2009; Xu2011; Zhou2011; Zhou2010. Zhou et al. Zhou2010 propose a framework for partial-duplicate image detection in large-scale applications using the bag-of-visual-words model. They quantize SIFT in descriptor space and orientation space, and they encode the spatial layout of keypoints with the XMAP and YMAP technique, which helps remove outliers. However, their framework is sensitive to digital errors, such as drifting or shifting of keypoints caused by transformations, which leads to missed true matches.

Various methods are used to detect keypoints and compute descriptors. FAST and its variants rosten2006machine; rosten2008faster are used to find keypoints in real-time systems that match visual features, such as Parallel Tracking and Mapping klein2007parallel. FAST is efficient and finds reasonable corner keypoints, although it must be augmented with pyramid schemes for scale klein2008improving, and in that case a Harris corner filter harris1988combined is used to reject edges and provide a reasonable score. For descriptors, BRIEF calonder2010brief is a recent feature descriptor that uses simple binary tests between pixels in a smoothed image patch. BRIEF is comparable to SIFT in many respects, including robustness to blur and perspective distortion; however, it is sensitive to in-plane rotation. BRIEF grew out of research that uses binary tests to train a set of classification trees calonder2008keypoint. Once trained on a set of 500 or so typical keypoints, the trees can be used to return a signature for any arbitrary keypoint calonder2009high.

Joly et al. joly2007content; joly2003robust propose a method that assigns 20D spatio-temporal descriptors, computed around Harris interest points, to each keyframe. However, it requires a large amount of data per keyframe, possibly hundreds of Harris keypoints and their descriptors.

Recently, image retrieval based on hashing has attracted the attention of researchers. Torralba et al. torralba2008small present a method that learns short descriptors to retrieve similar images from a large database. The technique relies on a dense 128D global image descriptor, which limits the approach to settings with no geometric or viewpoint invariance. Jain et al. jain2008fast present an efficient extension of the Locality Sensitive Hashing scheme indyk2000stable to the Mahalanobis distance. Both techniques use a bit string as a fingerprint of the image. In such a representation, direct collision of similar images in a single bin of the hash table is unlikely, and a search over multiple bins must be performed. This is feasible (or even advantageous) for approximate nearest neighbor or range search when the query example is given. However, for clustering tasks (for example, finding all groups of near-duplicate images in the database) the bit string representation is less suitable.

Compared with other low-dimensional descriptors, the SIFT detector and descriptor perform well mikolajczyk2005performance. SIFT has therefore been widely used in copy image retrieval chum2008near; auclair2009hash, image classification nister2006scalable, and medical image processing jiang2014computer, for instance to track the growth of cancer. Several methods have been proposed to accelerate feature indexing and reduce the length of the original SIFT descriptor. This can be done by ignoring some patches of the original descriptor khan2011sift to get 96D, 64D, and 32D descriptors, or by applying principal component analysis to obtain 64D SIFT descriptors ke2004pca.

Another approach to global feature extraction analyzes the information encoded in image texture. Examples of algorithms that follow this approach are the Steerable Pyramid simoncelli1995steerable, Gabor Wavelet Transform he2002object, Contourlet Transform do2005contourlet, and Complex Directional Filter Bank vo2006texture. Global feature algorithms are generally considered simple and fast, which often results in a lack of invariance to changes of viewpoint or illumination. To overcome these issues, local feature techniques were introduced gorecki2012ranking. For example, Schmid schmid1997local uses the Harris corner detector to detect interest points, which is insensitive to changes of image orientation.

Górecki et al. gorecki2012ranking propose the Ranking by K-means Voting algorithm, whose purpose is to produce a ranking of similar images. The results of their experiments show the advantage of the proposed method over standard similarity measures, in their case the Euclidean distance. Afra et al. nurnberger2016near evaluate the performance of the RC-SIFT 64D descriptor on the near-duplicate retrieval task in two cases: first, for benchmarks of various sizes; second, on the same benchmark but with different numbers of extracted features.

Zhihua Xu et al. xu2011novel present a robust content-based image copy detection scheme. The paper makes three main contributions: (1) a review of the published literature on schemes robust to geometric distortion, showing that combining local invariant regions with a global descriptor is an effective way to cope with geometric distortion, particularly cropping and rotation; (2) construction of a series of robust, homogeneous, larger circular patches using the SIFT detector; and (3) adoption of MHD to generate a feature vector for each circular patch.

Wu et al. Zhong2009 propose a framework for near-duplicate image detection that uses groups of keypoints rather than single keypoints. They combine SIFT keypoints with maximally stable extremal region (MSER) keypoints. MSER is a region-based approach that detects affine-covariant stable regions, whereas SIFT is a point-based detector; MSER regions have higher repeatability than SIFT keypoints, are larger in scale, and are usually fewer per image. Each detected elliptical region is normalized into a circular region from which a SIFT descriptor is computed. When SIFT keypoints are bundled with MSER regions, the grouped feature is more distinctive than a single feature and is more robust, with a 45% precision gain in the BoVW model over the plain SIFT BoW model. However, the search time is higher than that of state-of-the-art methods. Like SIFT, the MSER extractor is widely used in image retrieval systems.

Ahmed Alzu’bi et al. alzu2016improving investigate compact and discriminative image representations using a global and local multi-feature scheme. Their experiments give insight into the relationship between image features and other retrieval factors, including distance measures, quantization and visual codebooks, retrieval speed, and memory requirements. A bank of image features is extracted and then formulated into compact image representations. All extracted features are evaluated against eight different distance measures for similarity. The experimental results show that different image features and combinations give different performance. In the final evaluation, the Euclidean, cosine, and correlation measures show nearly the same effect on both retrieval accuracy and efficiency. The Spearman distance shows the highest retrieval accuracy for single local descriptors compared with combined global or local ones; however, it takes more matching time than the other distance measures.

2.2 Patch based descriptors

There are three well-known approaches to computing keypoint descriptors. The first operates on raw pixel values, as in CSLBP and LBP CSLBP; LBP1. The second computes gradients, as in SIFT David2004. The third uses binary features, which can be computed by quantizing gradient histograms, as in BIG-OH and CARD Baber2014; CARD, or by comparing pixel values, as in BRIEF Michael2010. SIFT is a widely used descriptor, and to make this section self-contained, the SIFT descriptor is explained below.

The Scale-Invariant Feature Transform (SIFT) descriptor is a representation of gradient orientation histograms. SIFT was introduced by David Lowe and was analyzed in detail and further developed in 2004 David2004, achieving large improvements in stability and feature invariance. The standard SIFT descriptor is built by sampling the magnitudes and orientations of the image gradient in the region around the interest point and summarizing the important information of the local region in smoothed orientation histograms liu2019top. To compute the SIFT descriptor, the given patch is divided into a grid of $4 \times 4$ cells. In each cell, the gradient magnitude, $m$, and orientation, $\theta$, are computed for each pixel. The gradient orientations are quantized into 8 directions, and a histogram over these directions is computed for each cell. Each gradient sample is added to its histogram bin weighted by its gradient magnitude and a Gaussian weight. For the Gaussian weight, a circular window with a $\sigma$ that is 1.5 times the scale of the keypoint is used David2004.
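The histogram construction described above can be sketched in a few lines of NumPy. This is only a simplified illustration, not the reference SIFT implementation: it assumes a 4 x 4 grid of cells, omits trilinear interpolation and rotation normalization, and uses a hypothetical Gaussian width (`sigma = 0.5 * width`) because a bare patch carries no keypoint scale. The function name `sift_like_descriptor` is introduced here for illustration.

```python
import numpy as np

def sift_like_descriptor(patch, grid=4, bins=8):
    """Toy gradient-orientation histogram descriptor (not full SIFT)."""
    patch = patch.astype(np.float64)
    gy, gx = np.gradient(patch)                 # derivatives along rows, columns
    mag = np.hypot(gx, gy)                      # gradient magnitude m
    theta = np.arctan2(gy, gx) % (2 * np.pi)    # orientation in [0, 2*pi)

    # Gaussian weight centred on the patch; sigma is an assumed value.
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    sigma = 0.5 * w
    weight = np.exp(-((yy - (h - 1) / 2) ** 2 + (xx - (w - 1) / 2) ** 2)
                    / (2 * sigma ** 2))

    # Quantize orientations into `bins` directions (clamp guards rounding).
    bin_idx = np.minimum((theta / (2 * np.pi) * bins).astype(int), bins - 1)
    votes = mag * weight

    desc = np.zeros((grid, grid, bins))
    cell_h, cell_w = h // grid, w // grid
    for r in range(grid):
        for c in range(grid):
            rows = slice(r * cell_h, (r + 1) * cell_h)
            cols = slice(c * cell_w, (c + 1) * cell_w)
            # One weighted orientation histogram per cell.
            desc[r, c] = np.bincount(bin_idx[rows, cols].ravel(),
                                     weights=votes[rows, cols].ravel(),
                                     minlength=bins)
    desc = desc.ravel()                         # grid * grid * bins values
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc
```

With the default arguments the result has 4 x 4 x 8 = 128 entries, which matches the SIFT dimensionality used throughout the paper.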

3 Proposed Re-Ranking Framework

To improve the accuracy of BoVW efficiently, the obtained rank list is re-evaluated by image-to-image matching based on binary features. The binary features map the Euclidean space to a binary space, which makes distance computation very fast without a significant loss of accuracy. The BoVW model and image-to-image matching are explained later in this section.

3.1 Image Representation

Most images are in RGB format and are converted to gray-scale. A given image is $I \in \mathbb{R}^{w \times h \times c}$, where $w$ and $h$ denote the dimensions (width and height) and $c$ denotes the number of channels; in gray-scale format $c = 1$, and the image can be defined as $I \in \mathbb{R}^{w \times h}$. Color is an important element of an image when it is handled in a perceptually meaningful way, and content-based image retrieval (CBIR) systems have treated color as a significant cue. The color histogram was introduced to capture the relationship between the color distribution of pixels and the spatial relationship of colors galshetwar2019local.

Image matching is the process of finding geometric correspondence between two or more images of the same scene. These images may come from different frames, different times, or different viewpoints. Many detectors are used for feature extraction and image matching. The phase correlation method has been applied to overcome mismatches by estimating the translation parameters between the image and the image to be matched zheng2019image. Although there is a subtle difference between corner detection and feature extraction, the two are treated as the same in this paper.

We use the Harris-Affine (HA) keypoint detector Krystian2004, which is based on the Harris-Laplace detector, a corner detector widely used in image matching. HA keypoints have a corner-like structure with lower localization error and higher repeatability than the Difference of Gaussian (DoG) keypoints used by the SIFT algorithm Krystian2004. Each keypoint is represented by a local region determined by scale, gradient, and the second moment matrix Krystian2004. The detected elliptical region is mapped to a circular region and then normalized to a fixed-size pixel patch on a Cartesian grid CSLBP; Mikolajczyk2005.

3.2 Image-to-Image Matching

Two images can be matched by computing the distance between their feature vectors; if the distance is small enough, the two images are said to be similar. The rank list is obtained by sorting the distance scores in ascending order. Image-to-image matching is easy to model when there is only one feature vector per image, such as a gray-scale or RGB histogram. It is harder to model for local keypoint descriptors, where one image is represented by a set of features. For example, in the case of SIFT, an image is represented by a feature matrix $F \in \mathbb{R}^{d \times n}$, where $d$ is the dimension of one descriptor and $n$ is the number of keypoints. The value of $n$ varies from image to image, depending on the image content. On average, there are three to four thousand keypoints per image if the Difference of Gaussian keypoint detector is used. Two images are considered similar if they have many matching keypoints in common, and the rank list is obtained from image-to-image keypoint matching. Let $I_q$ and $I_r$ be two images with local keypoint descriptors $F_q$ and $F_r$, respectively. A point pair $f_i^q \in F_q$ and $f_j^r \in F_r$ is considered a match if the following two conditions hold.

  • Nearest neighbor condition: $D(f_i^q, f_j^r) = \min_{k} D(f_i^q, f_k^r)$   (1)

  • Reliable match: $D(f_i^q, f_j^r) \le \tau \, D(f_i^q, f_{j'}^r)$   (2)

where $D(\cdot, \cdot)$ is some distance measure, $f_{j'}^r$ is the second nearest neighbor of $f_i^q$ in $F_r$, and $\tau$ is the threshold, used for stable matching under noisy conditions, as recommended by Lowe David2004.

The distance used by the matching algorithm is arbitrary and can be any distance measure. The one most widely used in computer vision is the Euclidean distance, which for two descriptors $f^q$ and $f^r$ is defined as:

$$D(f^q, f^r) = \sqrt{\sum_{k=1}^{d} \left(f^q_k - f^r_k\right)^2} \qquad (3)$$

where $d$ represents the dimension of the vector, i.e., 128 in the case of SIFT.
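A minimal sketch of keypoint matching under the two conditions may make the procedure concrete. It assumes that the reliable-match condition is the nearest-to-second-nearest distance ratio test, and the threshold value `tau=0.8` is an illustrative choice, not a value taken from this paper. Descriptors are stored one per row here, and `match_keypoints` is a hypothetical helper name.

```python
import numpy as np

def match_keypoints(F_q, F_r, tau=0.8):
    """Return (i, j) pairs where descriptor j of F_r matches descriptor i of F_q.

    F_q: (n_q, d) array, F_r: (n_r, d) array with n_r >= 2.
    tau is an assumed ratio threshold.
    """
    # Pairwise Euclidean distances via ||a - b||^2 = ||a||^2 - 2ab + ||b||^2.
    d2 = ((F_q ** 2).sum(axis=1)[:, None]
          - 2.0 * F_q @ F_r.T
          + (F_r ** 2).sum(axis=1)[None, :])
    dist = np.sqrt(np.maximum(d2, 0.0))

    # Nearest and second nearest neighbor in F_r for every query descriptor.
    order = np.argsort(dist, axis=1)[:, :2]
    rows = np.arange(len(F_q))
    nearest, second = order[:, 0], order[:, 1]

    # Keep the nearest neighbor only if it is clearly closer than the runner-up.
    reliable = dist[rows, nearest] < tau * dist[rows, second]
    return [(int(i), int(nearest[i])) for i in rows[reliable]]
```

The number of pairs returned can serve as an image-to-image similarity score: the more reliable matches two images share, the higher the database image is ranked.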

3.3 Bag of Visual Word Model

The computational cost of image-to-image matching is very high because SIFT is a high-dimensional vector and finding, for each keypoint in one image, its nearest neighbor in the other image is expensive. Image-to-image matching is therefore practical only for moderately sized databases. The Bag-of-Visual-Words (BoVW) model is widely used for accurate and feasible search in large databases, and it captures local pattern variations well. It is motivated by the Bag-of-Words (BoW) model, which is well known in text retrieval across many domains. A text document contains many significant words and can be represented by a feature vector of word counts; similarly, the BoVW model describes an image by the number of occurrences of each visual word in it kulkarni2019comparing.

The BoVW model, $Q$, is essentially a quantizer that maps a descriptor to a visual word:

$$Q : \mathbb{R}^{d} \rightarrow \{1, 2, \dots, K\}$$

$Q$ quantizes a descriptor $f \in \mathbb{R}^{d}$ to an integer cluster index, known as a visual word. Visual words quantize the local keypoint descriptors, and histograms of visual words are computed over large image datasets. It is generally suggested to keep the vocabulary size $K$ large. A small $K$ makes the model more robust but less distinctive, whereas a very large $K$ makes it more distinctive but less robust. Recent work uses values of $K$ between 1 million and 1.5 million amato2016reducing; philbin2007object. A dataset of 2 to 3 thousand images contains only about 1.0 to 1.5 million keypoints, which makes a vocabulary of this size inadequate for such a dataset amato2016reducing; philbin2007object.

Flat k-means sivic2003video or hierarchical k-means nister2006scalable clustering is commonly used to learn the vocabulary of visual words.
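As a rough sketch of the BoVW pipeline, the following code learns a small vocabulary with plain (flat) k-means and turns a set of descriptors into a normalized visual-word histogram. It is a brute-force illustration for small inputs; the function names, the iteration count, and the random initialization are assumptions made here, and a real system with a 100K or larger vocabulary would need approximate nearest neighbor search.

```python
import numpy as np

def quantize(descriptors, centroids):
    """Map each descriptor (row) to the index of its nearest centroid."""
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def learn_vocabulary(X, k, iters=10, seed=0):
    """Flat k-means (Lloyd iterations) over training descriptors X of shape (n, d)."""
    X = np.asarray(X, dtype=np.float64)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        words = quantize(X, centroids)
        for j in range(k):
            members = X[words == j]
            if len(members):                 # leave empty clusters unchanged
                centroids[j] = members.mean(axis=0)
    return centroids

def bovw_histogram(descriptors, centroids):
    """Normalized histogram of visual words for one image."""
    words = quantize(descriptors, centroids)
    hist = np.bincount(words, minlength=len(centroids)).astype(np.float64)
    return hist / hist.sum()
```

Each image is then a single vector of length equal to the vocabulary size, so ranking a database reduces to comparing one histogram per image instead of thousands of descriptors.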

Figure 1: CBCD and Oxford-5K dataset samples. (a) shows a CBCD sample of 10 different scenes with their 10 copies, and (b) shows an Oxford-5K sample of 4 landmarks with their copies.

3.4 Compact Binary Fingerprint: BiSIFT

Finding similar images using SIFT is very expensive because the descriptors contain floating-point values. SIFT is not only computationally expensive for similarity search but also storage intensive. We quantize SIFT into a binary vector, Binary SIFT (BiSIFT), similar to BIG-OH, and then compute image similarity. BIG-OH binarizes each gradient histogram on the $4 \times 4$ spatial grid and concatenates the results, whereas BiSIFT binarizes the whole vector at once. Experiments show a negligible loss in accuracy but a significant gain in computation. For a given SIFT descriptor $s = (s_1, s_2, \dots, s_{128})$, the binary descriptor $b = (b_1, b_2, \dots, b_{128})$, with $b_i \in \{0, 1\}$, is obtained by binarizing the elements of $s$. The resulting $b$ is a 128-bit binary vector; it is as efficient as BIG-OH but simpler.
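To illustrate the idea of binarizing a whole 128-dimensional descriptor at once, the sketch below packs a descriptor into 16 bytes and compares two packed descriptors by Hamming distance. The thresholding rule is an assumption made purely for illustration (a bit is set when the element exceeds the descriptor's median); it should not be read as the exact BiSIFT rule. The names `binarize` and `hamming` are hypothetical.

```python
import numpy as np

def binarize(desc):
    """Turn a 128-D descriptor into 128 bits packed in 16 bytes.

    ASSUMED rule: bit i is 1 when desc[i] is above the descriptor's median.
    """
    desc = np.asarray(desc)
    bits = (desc > np.median(desc)).astype(np.uint8)
    return np.packbits(bits)            # 128 bits -> 16 uint8 values

def hamming(a, b):
    """Number of differing bits between two packed binary descriptors."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())
```

Whatever thresholding rule is used, the benefit is the same: the distance becomes an XOR followed by a bit count instead of a floating-point sum of squares.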

4 Experimental Analysis

This section presents the dataset used for the experiments and evaluation metrics followed by the experimental results.

4.1 Datasets

Three datasets are used for evaluation. The first is a synthetic dataset, used to show the efficiency of the descriptors for similarity computation and storage. The second and third datasets are used to show the effectiveness of the proposed binary descriptor.

4.1.1 Synthetic Dataset

For the computational speed-up tests, we use synthetic data consisting of uniform random bytes for all descriptors, since real data is not needed to evaluate raw match-score computation performance. This dataset is used only to measure the total execution time of distance computation on raw descriptors. Two sets of synthetic descriptors are generated: the first in floating point and the second as binary descriptors.

Figure 2: Number of descriptors per image on each dataset. There are 745563 keypoints in total (682.26 per image on average) on the CBCD dataset and 13800780 keypoints in total (2726 per image on average) on the Oxford-5K dataset.

4.1.2 Copy Detection Dataset

The second dataset, CBCD, is provided by Zhou et al. Zhou2010. This dataset covers about 36 different camera scenes with a total of 1088 images. The images are severely distorted by challenging transformations commonly encountered in copy detection scenarios. Some were produced with a camcorder by recording an image played on a screen, and some place one image inside another image. Other transformations include pattern insertion, JPEG compression, brightness change, cropping, blurring, flipping, text insertion, zooming in or out, viewpoint change, and quality reduction, along with image shifting, contrast change, and image warping. Figure 1(a) shows one randomly chosen image from each scene with its copies.

The third dataset is the Oxford-5K dataset James07, one of the most widely used datasets for evaluating image retrieval. It includes 55 query images of 11 landmarks and a total of 5063 images gathered from Flickr. Figure 1(b) shows 4 landmarks with their copies.

We also use the unlabeled Paris 100K dataset, which serves to learn the BoVW model. Philbin et al. James07 describe Oxford-5K and Paris 100K in detail, including where they can be obtained.

A summary of the CBCD and Oxford-5K datasets, including the number of descriptors per image, is given in Figure 2.

Figure 3: Computational gain in computing the nearest neighbor of one descriptor for varying database size. (a) shows different ways of computing the distance for integer SIFT and BiSIFT, and (b) shows the actual computational gain of BiSIFT over integer SIFT.

4.2 Experiment on Synthetic Dataset

The synthetic dataset is used to illustrate the computational gain in finding the nearest neighbor (NN) of one keypoint descriptor in a large dataset. The original SIFT uses floating-point values and requires 512 bytes per descriptor David2004. There are a number of different implementations of SIFT; VLFeat is one of them. Using the VLFeat API, SIFT requires 128 bytes per descriptor, as each value is an 8-bit unsigned integer. The 128-byte implementation is very fast compared with the original SIFT implementation, at the cost of a small accuracy loss. The original SIFT takes 252 seconds to find the NN of a given keypoint in a database of 500K descriptors, whereas the integer implementation takes only 0.04 seconds. Hamming space is faster still than the integer Euclidean space, which is why the original SIFT is quantized into binary, known as BiSIFT, to make NN search faster. Figure 3 shows the computational gain of BiSIFT. In Figure 3, the first two curves are variants of Equation 3, the third curve is the sum of the differences of binary vectors, and the last curve shows a lookup-based implementation of the distance computation; the idea of lookup-based search is inspired by BIG-OH Baber2014.
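The storage figures and the lookup-based distance can be illustrated with a short sketch. It assumes each binary descriptor is packed into 16 bytes and uses a 256-entry table that stores the number of set bits of every byte value; this is one plausible reading of "lookup-based", not necessarily the implementation behind Figure 3. All names below are introduced for illustration.

```python
import numpy as np

# Storage per 128-D descriptor under the three representations.
BYTES_PER_DESCRIPTOR = {
    "float32 SIFT": 128 * 4,    # 512 bytes
    "uint8 SIFT": 128 * 1,      # 128 bytes
    "binary (packed)": 128 // 8 # 16 bytes
}

# Lookup table: POPCOUNT[v] is the number of 1 bits in the byte value v.
POPCOUNT = np.array([bin(v).count("1") for v in range(256)], dtype=np.uint8)

def hamming_to_database(query, database):
    """Hamming distance from one packed descriptor to every database row.

    query: (16,) uint8 array; database: (n, 16) uint8 array.
    """
    xor = np.bitwise_xor(database, query)          # differing bits, per byte
    return POPCOUNT[xor].sum(axis=1, dtype=np.int64)

def nearest_neighbor(query, database):
    """Index of the database descriptor closest to the query in Hamming space."""
    return int(hamming_to_database(query, database).argmin())
```

The table replaces per-bit work with one indexed read per byte, so a full 128-bit comparison costs 16 XORs and 16 lookups.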

4.3 Experiment on Information Retrieval

This experiment evaluates the effectiveness of BiSIFT. As stated earlier, BiSIFT achieves competitive performance with a significant computational gain; the computational gain is shown in Figure 3. Two datasets, CBCD and Oxford-5K, are used for the experiments. Precision, recall, and mean average precision (mAP) are used as evaluation metrics.

Precision ($P$): Precision is the ratio between correctly retrieved and total retrieved images. The correctly retrieved images are the true positives (TP), and the total retrieved is the sum of the true positives and the incorrectly retrieved images, which are the false positives (FP). It is computed as follows:

$$P = \frac{TP}{TP + FP}$$

Recall ($R$): Recall is the ratio between correctly retrieved and total correct images. The total correct refers to all the true positives that should have been retrieved for the given query image, which can be modeled as TP + FN, where FN denotes the false negatives (images that could not be retrieved). It is computed as follows:

$$R = \frac{TP}{TP + FN}$$

Mean average precision (mAP): An average precision is computed for each theme, where each theme has more than one query, and the average of these averages is then computed, denoted as mAP James07.
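The three metrics can be computed with a few lines of plain Python. The sketch assumes that each rank list contains no duplicates and uses the common non-interpolated form of average precision (mean of the precision values at the ranks where relevant images appear); the exact averaging protocol used for the reported numbers may differ. The function names are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given retrieved and relevant items."""
    relevant = set(relevant)
    tp = sum(1 for item in retrieved if item in relevant)   # true positives
    fp = len(retrieved) - tp                                # false positives
    fn = len(relevant) - tp                                 # false negatives
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

def average_precision(ranked, relevant):
    """Mean of precision@rank over the ranks holding a relevant item."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """queries: iterable of (ranked_list, relevant_set) pairs."""
    aps = [average_precision(ranked, relevant) for ranked, relevant in queries]
    return sum(aps) / len(aps)
```

For example, a rank list `["a", "b", "c"]` with relevant set `{"a", "c"}` has precision 1/1 at rank 1 and 2/3 at rank 3, giving an average precision of about 0.83.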

Table 1 shows the mAP of BiSIFT along with the SIFT, BIG-OH, and BoVW models. For BoVW, a vocabulary of size 100K is learned on the Paris 100K dataset using k-means clustering. Original SIFT is more accurate than BoVW, but BoVW is far more efficient than SIFT. On the Oxford-5K dataset, searching one query image with SIFT takes 4 minutes on average on a Dell server with 32 GB of RAM. Figure 4 shows the time taken by each Oxford-5K query to be searched against the 5K images. Since every image has a different number of keypoints, the time differs per query. In the BoVW model, by contrast, each image is represented by a single feature: a normalized histogram of the visual words present in the image. The BoVW model takes only 0.9 seconds to search one query image. The BoVW feature matrix has size $K \times N$, where the rows correspond to the vocabulary size $K$ and the columns to the number of images $N$.

Table 1 also shows that accuracy decreases with BoVW. To improve BoVW accuracy, spatial verification is usually applied to the top images in the rank list, using RANSAC or other techniques James07; Zhou2010. Instead of geometric or spatial verification, we apply image-to-image matching to the top-ranked images. Since image-to-image matching is more effective but computationally expensive, it is applied not to the whole dataset but only to the top $k$ images of the rank list. Image-to-image verification improves the accuracy of BoVW with little time overhead: it takes only 0.1 seconds using BiSIFT and 1.27 seconds using SIFT, and accuracy improves when BoVW is piped with SIFT or BiSIFT.
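The two-stage search described above can be sketched as follows. The sketch assumes that the BoVW stage returns one distance per database image and that the verification stage is an arbitrary scoring function (for instance, the number of matched keypoints between the query and a database image); both the function name `rerank` and the `match_score` callback are hypothetical.

```python
import numpy as np

def rerank(bovw_dist, match_score, k):
    """Re-order the top-k entries of a BoVW rank list by a costlier score.

    bovw_dist: (N,) distances from the query to each database image
               (smaller means more similar).
    match_score: callable taking a database index and returning a
                 similarity score (larger means more similar).
    k: number of top-ranked images to verify.
    """
    initial = np.argsort(bovw_dist, kind="stable")     # cheap initial ranking
    head, tail = initial[:k], initial[k:]
    scores = np.array([match_score(int(i)) for i in head], dtype=np.float64)
    head = head[np.argsort(-scores, kind="stable")]    # best verified first
    return np.concatenate([head, tail])                # tail order is untouched
```

Only `k` expensive comparisons are made per query regardless of database size, which is why the overhead on top of BoVW stays small.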

BoVW +
BoVW +
BoVW +
CBCD 0.47 0.57 0.58 0.60 0.61 0.59 0.60
Oxford-5K 0.51 0.53 0.52 0.53 0.55 0.52 0.53
Table 1: mAP of BiSIFT and SIFT
Figure 4: Time to search each query in Oxford-5K dataset. There are 55 queries in Oxford-5K dataset.

5 Conclusion

Image retrieval is a challenging task in large databases because images are easily distorted, modified, and forged. To retrieve copies or similar images, images are represented by robust and distinctive local descriptors such as SIFT. However, matching and retrieving images based on local descriptors is very expensive. As shown in the experiments, it takes 4 minutes on average to search one query image in a database of only 5K images, excluding feature extraction time: only feature matching and sorting time are included. To make search feasible in real time, features are quantized into a smaller feature space at the cost of accuracy. BoVW is widely used in image retrieval for feature quantization. It takes only 0.9 seconds to search a query image in a database of 5K images using the BoVW model with a vocabulary size of 100K. In this paper, we quantize SIFT into a binary vector, since search in binary space is faster than in Euclidean space (integer values), as shown in Figure 3, without compromising accuracy, as shown in Table 1.

Conflict of interest

The authors declare that they have no conflict of interest.