Efficient Multimedia Similarity Measurement Using Similar Elements

09/08/2018 ∙ by Chengyuan Zhang, et al. ∙ Central South University

Online social networking techniques and large-scale multimedia systems are developing rapidly, which has not only brought great convenience to our daily life, but also generated, collected, and stored large-scale multimedia data. This trend has put forward higher requirements and greater challenges for massive multimedia data retrieval. In this paper, we investigate the problem of image similarity measurement, which underlies a wide range of applications. At first we propose the definition of similarity measurement of images and the related notions. Based on it we present a novel basic method of similarity measurement named SMIN. To improve the performance of calculation, we propose a novel indexing structure called SMI Temp Index (SMII for short). Besides, we establish an index of potential similar visual words off-line to solve the problem that the index cannot be reused. Experimental evaluations on two real image datasets demonstrate that our solution outperforms the state-of-the-art method.


1 Introduction

In recent years, online social networking techniques and large-scale multimedia systems DBLP:conf/cikm/WangLZ13 ; DBLP:conf/mm/WangLWZZ14 ; DBLP:conf/mm/WangLWZ15 ; DBLP:journals/tip/WangLWZZH15 have been developing rapidly, which has not only brought great convenience to our daily life, but also generated, collected, and stored large-scale multimedia data DBLP:journals/tip/WangLWZ17 , such as text, image DBLP:conf/sigir/WangLWZZ15 , audio, video LINYANGARXIV and 3D data. For example, in China, Weibo (https://weibo.com/) is the largest online social networking service, which has 376 million active users who post more than 100 million micro-blogs containing short text, images, or short videos. Facebook (https://facebook.com/), the most famous social networking platform in the world, reported 350 million images uploaded every day at the end of November 2013. More than 400 million tweets with texts and images have been generated by 140 million users on Twitter (http://www.twitter.com/), another popular social networking web site. Another type of common application on the Internet is multimedia data sharing services. Flickr (https://www.flickr.com/) is one of the most famous photo sharing web sites in the world; more than 3.5 million new images were uploaded to this platform every day as of March 2013. More than 14 million articles are clicked every day on Pinterest, which is an attractive image social networking web site. YouTube (https://www.youtube.com/), the most famous video sharing platform, stored more than 2 billion videos in total by the end of 2013, and every minute 100 hours of video are uploaded to this service. The total watch time exceeded 42 billion minutes on IQIYI (http://www.iqiyi.com/), the most famous online video sharing service in China, and the number of independent users is more than 230 million monthly.
For audio sharing services, the total amount of audio on Himalaya (https://www.ximalaya.com/) had exceeded 15 million as of December 2015. Other web services like Wikipedia (https://en.wikipedia.org/), the largest and most popular free encyclopedia on the Internet, contain more than 40 million articles with pictures in 301 different languages. Mobile applications such as WeChat, Instagram, etc., provide great convenience for us to share multimedia data. Thanks to these rich multimedia services and applications, multimedia techniques DBLP:conf/ijcai/WangZWLFP16 ; DBLP:journals/cviu/WuWGHL18 are changing every aspect of our lives. On the other hand, the emergence of massive multimedia data DBLP:journals/corr/abs-1708-02288 and applications puts forward greater challenges for techniques of information retrieval.

Figure 1: An example of multimedia retrieval via similarity measurement
Figure 2: An example of multimedia retrieval via similarity measurement

Motivation. Textual similarity measurement is a classical issue in the communities of information retrieval and data mining, and lots of approaches have been proposed. Guo et al. Guo2016Sentence proposed to use vectors as basic elements, with the edit distance and Jaccard coefficient used to calculate sentence similarity. Li et al. LiFeng2017 proposed the use of word vectors to represent the meaning of words, and considered the influence of multiple factors such as word meaning, word order and sentence length on the calculation of sentence similarity. Unlike these studies of textual similarity measurement, in this paper we investigate the problem of image similarity measurement, which is a widely applied technique in lots of application scenarios such as image retrieval DBLP:conf/pakdd/WangLZW14 ; DBLP:journals/pr/WuWGL18 ; DBLP:journals/tnn/WangZWLZ17 , image similarity calculation and matching NNLS2018 ; TC2018 . Two examples, shown in Figure 1 and Figure 2, describe this problem in a clearer way.

Example 1

In Figure 1, a user has a photo and she wants to find other pictures on the Internet which are highly similar to it. She can submit an image query containing this photo to the multimedia retrieval system. The system measures the visual content similarity between this photo and the images in the database, and after that a set of similar images is returned.

Example 2

Figure 2 demonstrates another example of an application of image similarity measurement. A user wants to measure the similarity between two pictures in a dataset quantitatively. She selects two pictures from the image dataset and inputs them into the similarity measurement system. According to the image similarity measurement algorithm, the system calculates the value of similarity between these images (e.g., 90%).

In order to improve the efficiency and accuracy of image similarity measurement, we present the definition of similarity measurement of images and the relevant notions. We introduce the measurement of similar visual words named SMI Naive (SMIN for short), which is the basic method for similarity measurement, and then propose the SMIN algorithm. After that, to optimize this method, we design a novel indexing structure named SMI Temp Index to reduce the time complexity of the calculation. In addition, another technique named index of potential similar visual words is proposed to solve the problem that the index cannot be reused. We can search this index to perform the measurement of similar visual words without having to repeatedly create a temporary index.

Contributions. Our main contributions can be summarized as follows:

  • Firstly, we introduce the definition of similarity measurement of images and the related conceptions. The image similarity calculation function is designed.

  • We introduce the basic method of image similarity measurement, called SMI Naive (SMIN for short). In order to improve the performance of similarity measurement, based on it we design two indexing techniques named SMI Temp Index (SMII for short) and Index of Potential Similar Visual Words (PSMI for short).

  • We have conducted extensive experiments on two real image datasets. Experimental results demonstrate that our solution outperforms the state-of-the-art method.

Roadmap. In the remainder of this paper, Section 2 presents the related work on image similarity measurement and image retrieval. In Section 3, the definition of image similarity measurement and related conceptions are proposed. We present the basic similarity measurement method named SMIN and two improved indexing techniques and algorithms in Section 4. Our experimental results are presented in Section 5. Finally, we conclude the paper in Section 6.

2 Related Work

In this section, we present the related work on image similarity measurement and image retrieval, which are relevant to this study.

Image Similarity Measurement. In recent years, image similarity measurement has become a hot issue in the communities of multimedia systems DBLP:journals/ivc/WuW17 and information retrieval, since massive image data can be accessed on the Internet. On the other hand, like textual similarity measurement, image similarity measurement is an important technique which can be applied in lots of applications, such as image retrieval, image matching, image recognition and classification, computer vision, etc. Many researchers work on this issue and numerous approaches have been proposed. For example, Coltuc et al. DBLP:journals/entropy/ColtucDC18 studied the usefulness of the normalized compression distance (NCD for short) for image similarity detection. In their work, they considered the correlation between NCD based feature vectors extracted for each image. Albanesi et al. DBLP:journals/jmiv/AlbanesiABM18 proposed a novel class of image similarity metrics based on a wavelet decomposition. They investigated the theoretical relationship between this novel class of metrics and the well-known structural similarity index. Abe et al. DBLP:conf/iccae/AbeMH18 studied similarity retrieval of trademark images represented by vector graphics. To improve the performance of the system, they introduced centroid distance into the feature extraction. Cicconet et al. DBLP:journals/corr/abs-1802-06515 studied the problem of detecting duplication of scientific images. They introduced a data-driven solution based on a 3-branch Siamese Convolutional Neural Network which can serve to narrow down the pool of images. For multi-label image retrieval, Zhang et al. DBLP:journals/corr/abs-1803-02987 proposed a novel deep hashing method named ISDH in which an instance-similarity definition was applied to quantify the pairwise similarity for images holding multiple class labels. Kato et al. DBLP:journals/ipsjtcva/KatoSP17 proposed a novel solution for the problem of selecting image pairs that are more likely to match in Structure from Motion. They used Jaccard similarity and bag-of-visual-words in addition to tf-idf to measure the similarity between images. Wang et al. Wang2016Semantic designed a regularized distance metric framework named semantic discriminative metric learning (SDML for short). This framework combines geometric mean with normalized divergences and separates images from different classes simultaneously. Guha et al. Guha2013Image proposed a new approach called Sparse SNR (SSNR for short) to measure the similarity between two images using sparse reconstruction. Their measurement does not need any prior knowledge about the data type or the application. Khan et al. DBLP:journals/ieicet/KhanAT12 proposed two halftoning methods to improve efficiency in generating structurally similar halftone images using the Structural Similarity Index Measurement. Their Method I improves efficiency as well as image quality, and Method II reaches a better image quality with fewer evaluations than the pixel-swapping algorithm used in Method I.

Near-duplicate image detection is another problem related to image similarity measurement. To solve the problem of near-duplicate image retrieval, Wang et al. DBLP:journals/ijes/WangZ18 developed a novel spatial descriptor embedding method which encodes the relationship of the SIFT dominant orientation and the exact spatial position between local features and their context. Gadeski et al. DBLP:journals/mta/ZhaoLPF17 proposed an effective algorithm based on the MapReduce framework to identify near duplicates of images in large-scale image sets. Nian et al. DBLP:journals/mta/NianLWGL16 investigated this type of problem and presented an effective and efficient local-based representation method named Local-based Binary Representation to encode an image as a binary vector. Zlabinger et al. DBLP:conf/sac/ZlabingerH17 developed a semi-automatic duplicate detection approach in which single-image duplicates are detected between sub-images based on a connected component approach, and duplicates between images are detected by a min-hashing method. Hsieh et al. DBLP:journals/mta/HsiehCC15 designed a novel framework that adopts multiple hash tables in a novel way for quick image matching and efficient duplicate image detection. Based on a hierarchical model, Li et al. DBLP:journals/mta/LiQLZWT15 introduced an automatic NDIG mining approach utilizing adaptive global feature clustering and local feature refinement to solve the problem of mining near duplicate image groups. Liu et al. DBLP:journals/tip/LiuLS15 presented a variable-length signature to address the problem of near-duplicate image matching. They used the earth mover's distance to handle variable-length signatures. Yao et al. DBLP:journals/spl/YaoYZ15 developed a novel contextual descriptor which measures the contextual similarity of visual words to immediately discard mismatches and reduce the count of candidate images. For large scale near-duplicate image retrieval, Fedorov et al. DBLP:journals/corr/FedorovK16 introduced a feature representation combining three local descriptors, which is reproducible and highly discriminative. To improve the efficiency of near-duplicate image retrieval, Yıldız et al. DBLP:conf/ssiai/YildizD16 proposed a novel interest point selection method in which the distinctive subset is created with a ranking according to a density map.

Image Retrieval. Content-based image retrieval (CBIR for short) DBLP:journals/pami/JingB08 ; DBLP:journals/tomccap/LewSDJ06 ; DBLP:conf/mm/WuWS13 retrieves images by analyzing their visual contents, and therefore image representation DBLP:conf/mm/WanWHWZZL14 ; DBLP:journals/pr/WuWGL18 plays an important role in this task. In recent years, the task of CBIR has attracted more and more attention in the multimedia TC2018 ; DBLP:journals/pr/WuWLG18 and computer vision communities DBLP:journals/tnn/WangZWLZ17 ; NNLS2018 . Many techniques have been proposed to support efficient multimedia queries and image recognition. Scale Invariant Feature Transform (SIFT for short) DBLP:conf/iccv/Lowe99 ; DBLP:journals/ijcv/Lowe04 is a classical method to extract visual features, which transforms an image into a large collection of local feature vectors. SIFT includes four main steps: (1) scale-space extrema detection; (2) keypoint localization; (3) orientation assignment; (4) keypoint descriptor. It is widely applied in lots of research works and applications. For example, Ke et al. DBLP:conf/cvpr/KeS04 proposed a novel image descriptor named PCA-SIFT which combines SIFT techniques with the principal components analysis (PCA for short) method. Mortensen et al. DBLP:conf/cvpr/MortensenDS05 proposed a feature descriptor that augments SIFT with a global context vector. This approach adds curvilinear shape information from a much larger neighborhood to reduce mismatches. Liu et al. DBLP:journals/inffus/LiuLW15 proposed a novel image fusion method for multi-focus images with dense SIFT. The dense SIFT descriptor can not only be employed as the activity level measurement, but can also be used to match the mis-registered pixels between multiple source images to improve the quality of the fused image. Su et al. Su2017MBR designed a horizontal or vertical mirror reflection invariant binary descriptor named MBR-SIFT to solve the problem of image matching. Nam et al. DBLP:journals/mta/NamKMHCL18 introduced a SIFT feature based blind watermarking algorithm to address the issue of copyright protection for DIBR 3D images. Charfi et al. DBLP:journals/mta/CharfiTAS17 developed a bimodal hand identification system based on SIFT descriptors which are extracted from hand shape and palmprint modalities.

The bag-of-visual-words DBLP:conf/iccv/SivicZ03 ; DBLP:journals/tnn/WangZWLZ17 ; DBLP:journals/corr/abs-1804-11013 (BoVW for short) model is another popular technique for CBIR and image recognition, which was first used in textual classification. This model transforms images into sparse hierarchical vectors by using visual words, so that a large number of images can be manipulated. Santos et al. DBLP:journals/mta/SantosMST17 presented the first method based on the signature-based bag of visual words (S-BoVW for short) paradigm that considers texture information to generate textual signatures of image blocks for representing images. Karakasis et al. DBLP:journals/prl/KarakasisAGC15 presented an image retrieval framework that uses affine image moment invariants as descriptors of local image areas in a BoVW representation. Wang et al. DBLP:conf/mmm/WangWLZ13 presented an improved practical spatial weighting for BoV (PSW-BoV for short) to alleviate the loss of spatial information in BoVW while keeping its efficiency.
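To make the BoVW idea concrete, the following minimal sketch quantizes each local descriptor of an image to its nearest visual word (centroid) and accumulates an occurrence histogram. The two-dimensional vocabulary and descriptors are toy values chosen for illustration; a real system would cluster high-dimensional SIFT descriptors to learn the vocabulary.

```python
import math

def nearest_word(descriptor, vocabulary):
    """Index of the visual word (centroid) closest to a local descriptor."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(vocabulary)), key=lambda i: dist(descriptor, vocabulary[i]))

def bovw_histogram(descriptors, vocabulary):
    """Represent an image as a histogram of visual word occurrences."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1
    return hist

# Toy vocabulary of three 2-D visual words and four local descriptors.
vocab = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descs = [(0.1, 0.2), (0.9, 1.1), (5.2, 4.8), (1.1, 0.9)]
print(bovw_histogram(descs, vocab))  # [1, 2, 1]
```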

3 Preliminaries

In this section, we first propose the definition of image similarity measurement, then present the related notions such as the similarity of visual words and similar visual word pairs. Besides, we review the techniques of image retrieval on which our work is based. Table 1 summarizes the notations frequently used throughout this paper to facilitate the discussion.

Notation Definition
  D  A given database of images
  I_i  The i-th image
  V  A visual words set
  n  The number of visual words in V
  v_i  The i-th visual word in the visual words set
  sim(v_i, v_j)  The similarity of two visual words
  svwp(v_i, v_j)  The similar visual word pair
  ⊗  The operator that generates the set of SVWPs
  ε  The predefined similarity threshold
  W  The set of visual word weights
  SMI  The image similarity measurement
  s_i  The similarity of a visual word
Table 1: The summary of notations

3.1 Problem Definition

Definition 1 (Image object)

Let D be an image dataset and I_a and I_b be two images, I_a, I_b ∈ D. We define the image objects represented by the bag-of-visual-words model as O_a and O_b, wherein V_a and V_b are the visual word sets generated by low-level feature extraction from I_a and I_b, and m and n are the numbers of visual words in these two sets respectively. In this study, we utilize the image object as the representation model of images for the task of image similarity measurement.

Definition 2 (Similarity of visual word)

Given two image objects O_a and O_b, wherein V_a and V_b are their visual word sets. The similarity of two visual words v_i ∈ V_a and v_j ∈ V_b is represented by sim(v_i, v_j), and if these visual words are identical, i.e., v_i = v_j, then sim(v_i, v_j) = 1.

Definition 3 (Similar visual word pair)

Given two visual words v_i and v_j whose similarity is sim(v_i, v_j). Let ε be the predefined similarity threshold; if sim(v_i, v_j) ≥ ε, this visual word pair is called a similar visual word pair (SVWP for short), represented as svwp(v_i, v_j).
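As an illustration, assuming visual words are represented as vectors and compared by cosine similarity (the measure Section 4 adopts), the SVWP test reduces to a simple threshold check. The vectors and threshold below are hypothetical values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two visual word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_svwp(u, v, epsilon):
    """Two visual words form a similar visual word pair (SVWP)
    when their similarity reaches the threshold epsilon."""
    return cosine_similarity(u, v) >= epsilon

print(is_svwp((1.0, 0.0), (0.9, 0.1), 0.9))  # True
print(is_svwp((1.0, 0.0), (0.0, 1.0), 0.9))  # False
```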

Definition 4 (Similarity measurement of two image objects)

Given two image objects O_a and O_b. Let the operation V_a ⊗ V_b generate the set of SVWPs which contains the visual words in V_a and V_b, and let the similarity set of these pairs be denoted as SIM. Let w_i and w_j be the weights of the visual words v_i and v_j. For the image objects O_a and O_b, the sets of their visual word weights are denoted as W_a and W_b. The definitional equation of the similarity between O_a and O_b is as follows:

(1)

where m and n are the numbers of visual words of O_a and O_b respectively. Clearly, SMI can meet the systematic similarity measurement criterion.

Theorem 3.1 (Monotonicity of similarity function)

The similarity measurement SMI has the following five monotonicity properties:

  • SMI is a monotonically increasing function of the weights of the visual words in SVWPs.

  • SMI is a monotonically increasing function of the similarities of the SVWPs.

  • SMI is a monotonically increasing function of the number of SVWPs.

  • SMI is a monotonically decreasing function of the weights of the visual words which are not in SVWPs.

  • SMI is a monotonically decreasing function of the number of visual words which are not in SVWPs.

According to Definition 4 and Theorem 3.1, the similarity measurement for two image objects is proposed, which is described formally as follows.

Given two image objects O_a and O_b with visual word sets V_a and V_b. The sets of their visual word weights are W_a and W_b. The SVWP set of O_a and O_b is generated by V_a ⊗ V_b, and the similarity set of these pairs is SIM. The similarity measurement function is:

(2)

Function 2 apparently meets the monotonicity conditions described in Theorem 3.1. On the other hand, if the two image objects are identical, i.e., V_a = V_b and W_a = W_b, then SMI(O_a, O_b) = 1.

Theorem 3.2 (Dissatisfying the commutative law)

The similarity measurement does not satisfy the commutative law, i.e., SMI(O_a, O_b) ≠ SMI(O_b, O_a).

In general, some visual words (e.g., noise words) in image objects have negative or reverse effects on the expression of the whole image. According to Theorem 3.1, SMI has a penalty effect on non-similar visual elements. This feature gives SMI high accuracy for the similarity measurement of images.

4 Image Similarity Measurement Algorithm

4.1 The Measurement of Similar Visual Words

SMI is subject to the time complexity of the calculation of similar visual words. The similarity of a similar visual word is computed as shown in the following formula:

(3)

where cos represents the cosine of the angle between two vectors, used as the measurement of similarity, and ε is the similarity threshold for the judgment.

We give an intuitive way to measure similar visual words. The pseudo-code of the algorithm is shown in Algorithm 1. In this work, the double loop cosine calculation method is called SMI Naive (SMIN for short).

0:  Input: V_a, V_b, ε.
0:  Output: SP, SIM.
1:  Initializing: SP ← ∅;
2:  Initializing: SIM ← ∅;
3:  Initializing: m ← |V_a|;
4:  Initializing: n ← |V_b|;
5:  for each v_i ∈ V_a do
6:      for each v_j ∈ V_b do
7:          if v_i = v_j then
8:              sim(v_i, v_j) ← 1;
9:          end if
10:          if sim(v_i, v_j) ≥ ε then
11:              SP ← SP ∪ {svwp(v_i, v_j)};
12:              SIM ← SIM ∪ {sim(v_i, v_j)};
13:          else
14:              SP ← SP ∪ {v_i};
15:              SIM ← SIM ∪ {0};
16:          end if
17:      end for
18:  end for
19:  return SP, SIM;
Algorithm 1 SMIN Algorithm
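A direct Python sketch of the SMIN double loop follows. The dictionaries mapping word ids to vectors, the helper names, and the triple return format are assumptions made for illustration, not part of the paper's specification.

```python
import math

def cosine(u, v):
    """Cosine similarity between two visual word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def smin(words_a, words_b, epsilon):
    """Double-loop search for similar visual word pairs (SVWPs).

    words_a / words_b map visual word ids to vectors; returns the list of
    (id_a, id_b, similarity) triples whose similarity reaches epsilon.
    """
    pairs = []
    for id_a, vec_a in words_a.items():
        for id_b, vec_b in words_b.items():
            # Identical words are defined to have similarity 1 (Definition 2).
            s = 1.0 if id_a == id_b else cosine(vec_a, vec_b)
            if s >= epsilon:
                pairs.append((id_a, id_b, s))
    return pairs

a = {"w1": (1.0, 0.0), "w2": (0.0, 1.0)}
b = {"w3": (0.9, 0.1), "w4": (0.0, -1.0)}
print(smin(a, b, 0.9))  # one pair: ('w1', 'w3', ~0.994)
```

Note the quadratic cost: every vector of one object is compared against every vector of the other, which is exactly what the indexing techniques of the next subsection aim to avoid.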

4.2 The Optimization of Calculating Similar Visual Words

In the context of massive multimedia data, a multimedia retrieval system or image similarity measurement system requires an efficient image similarity measurement algorithm. The optimization of SMI therefore focuses on reducing the time complexity of calculating similar visual words.

SMI Temp Index. To reduce the double loop cosine calculation to a single loop, a further approach is to construct a temporary index over the vectors of one of the two image objects. According to experience, the dimension of a visual word vector is generally 200 to 300 to get better results.

For each vector of the other image object, we search for the vector with the highest similarity in the temp index, so that the process requires only one index search per vector. The repeated calculations of similar visual words are reduced to vector searches, thereby reducing the execution time of SMI. However, there is a flaw: every time the similar elements of an image are calculated, a temp index needs to be built, and the index cannot be reused. The temp index approach is called SMI Temp Index (SMII for short), as shown in Figure 3.

Figure 3: The processing of similarity measurement via SMII
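The SMII idea can be sketched as follows: the vectors of one image object are pre-normalized once into a temporary index, and each vector of the other object then needs only a single search of that index instead of a full inner loop. A plain dictionary stands in here for whatever index structure an implementation would actually use; all names are hypothetical.

```python
import math

def build_temp_index(words_b):
    """Temp index: visual word id -> unit-length vector, built once per query."""
    index = {}
    for wid, vec in words_b.items():
        norm = math.sqrt(sum(x * x for x in vec))
        index[wid] = tuple(x / norm for x in vec)
    return index

def best_match(query, index):
    """One search returns the most similar visual word and its cosine score."""
    norm = math.sqrt(sum(x * x for x in query))
    q = tuple(x / norm for x in query)
    best = max(index, key=lambda w: sum(a * b for a, b in zip(q, index[w])))
    return best, sum(a * b for a, b in zip(q, index[best]))

index = build_temp_index({"w2": (1.0, 0.0), "w3": (0.0, 1.0)})
wid, score = best_match((0.9, 0.1), index)
print(wid)  # w2
```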

Index of Potential Similar Visual Words. In order to solve the problem that the index cannot be reused, we establish an index of potential similar visual words off-line in the process of word vector training. We can search this index to perform the measurement of similar visual words without having to repeatedly create a temporary index. The main steps of the construction of the index of potential similar visual words are as follows:

  • Establish an index for the whole visual word vector set using the trained word vector model.

  • Traverse each vector and search the index to get a returned set. In this set, the potential similar visual words whose similarity is greater than the threshold ε are obtained, in descending order of similarity.

  • The physical indexing structure of potential similar visual words can be implemented by a Huffman tree.
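The steps above can be sketched as an offline pass over a hypothetical trained vocabulary: for each visual word we precompute the list of other words whose cosine similarity reaches the threshold, sorted in descending order. A flat dictionary stands in for the Huffman-tree structure the last bullet mentions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two visual word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def build_psmi_index(vocabulary, epsilon):
    """Offline index: visual word id -> potential similar words,
    sorted by similarity in descending order (only pairs reaching epsilon)."""
    index = {}
    for wid, vec in vocabulary.items():
        similar = [(other, cosine(vec, ovec))
                   for other, ovec in vocabulary.items()
                   if other != wid and cosine(vec, ovec) >= epsilon]
        index[wid] = sorted(similar, key=lambda p: p[1], reverse=True)
    return index

vocab = {"a": (1.0, 0.0), "b": (0.9, 0.1), "c": (0.0, 1.0)}
index = build_psmi_index(vocab, 0.9)
print(index["a"])  # [('b', ~0.994)] -- 'c' is filtered out by the threshold
```

The pairwise pass is quadratic in the vocabulary size, but it runs once offline, so its cost is amortized over every subsequent similarity measurement.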

According to the hierarchical Softmax strategy in Word2Vec, an original Word2Vec Huffman tree is constructed on the basis of visual word frequency, and each node (except the root node) represents a visual word and its corresponding vector.

We replace the vector in each node with its potential similar visual words. Thus each node of the tree represents a visual word and its corresponding potential similar visual words. The index structure is illustrated by Figure 4:

Figure 4: The index structure of potential similar visual words

We call the method using the global index of potential similar visual words PSMI. Algorithm 2 illustrates the pseudo-code of PSMI.

0:  Input: V_a, V_b, H.
0:  Output: SP, SIM.
1:  Initializing: SP ← ∅;
2:  Initializing: SIM ← ∅;
3:  Initializing: node ← null;
4:  Initializing: m ← |V_a|;
5:  Initializing: n ← |V_b|;
6:  for each v_i ∈ V_a do
7:      node ← Search(H, v_i);
8:      for each p ∈ node.PSW do
9:          for each v_j ∈ V_b do
10:              if p = v_j then
11:                 SP ← SP ∪ {svwp(v_i, v_j)};
12:                 SIM ← SIM ∪ {sim(v_i, p)};
13:                 Break to loop line 6;
14:              end if
15:          end for
16:      end for
17:      SP ← SP ∪ {v_i};
18:      SIM ← SIM ∪ {0};
19:  end for
20:  return SP, SIM;
Algorithm 2 PSMI Algorithm

Algorithm 2 demonstrates the processing of the PSMI Algorithm. Firstly, for each visual word vector v_i ∈ V_a, the algorithm executes the search procedure to get the node of the Huffman tree which contains v_i. Then, for each potential similar word p of this node, the algorithm selects each v_j from V_b and checks whether p is equal to v_j or not. If they are equal, the algorithm adds the pair into the set SP, adds the similarity into SIM, and then breaks to the outer loop. If no candidate p is equal to any v_j, the algorithm adds v_i into the set SP and adds 0 into SIM.
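Following the description above, a PSMI lookup can be sketched like this: the offline index (here a hypothetical dict from word id to its precomputed candidates, sorted by similarity descending) replaces any query-time temporary index, so measuring similar words reduces to membership checks.

```python
def psmi_pairs(words_a, words_b, psmi_index):
    """Measure similar visual words via the offline PSMI index.

    For each word of V_a, scan its precomputed candidates (best first) and
    stop at the first one present in V_b; unmatched words contribute 0.
    """
    pairs, sims = [], []
    b_set = set(words_b)
    for wid in words_a:
        for cand, sim in psmi_index.get(wid, []):
            if cand in b_set:
                pairs.append((wid, cand))
                sims.append(sim)
                break  # best candidate found, move to the next word
        else:
            sims.append(0.0)  # no similar word of wid occurs in V_b
    return pairs, sims

index = {"a": [("b", 0.99)], "x": []}
pairs, sims = psmi_pairs(["a", "x"], ["b", "y"], index)
print(pairs, sims)  # [('a', 'b')] [0.99, 0.0]
```

Because the candidate lists are sorted offline, the first hit is also the most similar one, which is what makes the early break safe.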

5 Performance Evaluation

In this section, we present the results of a comprehensive performance study on the real image datasets Flickr and ImageNet to evaluate the efficiency and scalability of the proposed techniques. Specifically, we evaluate the effectiveness of the following techniques for image similarity measurement.

  • WJ: the Word2Vec technique proposed in https://github.com/jsksxs360/Word2Vec.

  • WMD: the Word2Vec technique based on word mover's distance, proposed in https://github.com/crtomirmajer/wmd4j.

  • SMIN: the double loop cosine calculation technique proposed in Section 4.

  • SMII: the advanced technique based on SMIN, which is proposed in Section 4.

  • PSMI: the potential similar visual words technique based on SMII, which is also proposed in Section 4.

Datasets. The performance of the various algorithms is evaluated on two real image datasets.

We first evaluate these algorithms on Flickr, which is obtained by crawling millions of images from the photo-sharing site Flickr (http://www.flickr.com/). For the scalability and performance evaluation, we randomly sampled five sub-datasets whose sizes vary from 200,000 to 1,000,000 from the image dataset. Similarly, another image dataset, ImageNet, which is widely used in image processing and computer vision, is used to evaluate the performance of these algorithms. The ImageNet dataset includes 14,197,122 images, of which 1.2 million images have SIFT features. We generate ImageNet datasets with sizes varying from 20K to 1M.

Workload. A workload for the image similarity query consists of a set of queries. The accuracy of the algorithms and the query response time are employed to evaluate their performance. The image dataset size grows from 0.2M to 1M; the number of query visual words of dataset Flickr changes from 20 to 100; the number of query visual words of dataset ImageNet varies from 50 to 250. By default, the image dataset size, the number of query visual words of dataset Flickr, and the number of query visual words of dataset ImageNet are set to 0.2M, 40, and 100 respectively. Experiments are run on a PC with Intel Xeon 2.60GHz dual CPU and 16GB memory running Ubuntu. All algorithms in the experiments are implemented in Java. Note that we only consider the algorithms WJ, SMI, and WMD in the accuracy comparison, because the SMIN, SMII, and PSMI algorithms have the same error tolerance.

(a) Evaluation on Flickr (b) Evaluation on ImageNet
Figure 5: Evaluation on the number of visual words on Flickr and ImageNet
(a) Evaluation on Flickr (b) Evaluation on ImageNet
Figure 6: Evaluation on the number of visual words on Flickr and ImageNet

Evaluating hit rate on the number of visual words. We evaluate the hit rate on the number of query visual words on the Flickr and ImageNet datasets, as shown in Figure 5. The experiment on Flickr is shown in Figure 5(a). It is clear that the hit rates of WJ, SMI and WMD decrease as the number of visual words rises. Specifically, the hit rate of our method, SMI, is the highest all the time. It descends slowly from around 90% to about 85%. On the other hand, the hit rates of WJ and WMD are very close. At first they go down rapidly, and after that their decline becomes moderate. At 100, the hit rate of WJ is a little higher than WMD, and both of them are much lower than SMI. In Figure 5(b), all of the decreasing trends are similar. Apparently, the hit rate of SMI is the highest, and it goes down gradually with the increase of the number of visual words. On the ImageNet dataset, the hit rate of WMD is a little higher than WJ all the time.

Evaluating response time on the number of visual words. We evaluate the response time on the number of visual words on the Flickr and ImageNet datasets, as shown in Figure 6. In Figure 6(a), with the increase of the number of visual words, the response time of PSMI grows slightly, and it is the lowest among these methods. The increasing trend of SMII is very moderate too, but it is slightly inferior to PSMI. Like PSMI and SMII, SMIN shows a moderate growth in response time as the number of visual words rises. Although its response time is higher than that of the former two, it is much lower than that of WMD, which grows fast. Figure 6(b) illustrates that the efficiency of PSMI is almost unchanged with the increase of the number of visual words, and it is the best among these four methods. Like the experiment on Flickr, the response times of both SMII and SMIN increase gradually, and they are much better than WMD.

(a) Evaluation on Flickr (b) Evaluation on ImageNet
Figure 7: Evaluation on the number of images on Flickr and ImageNet
(a) Evaluation on Flickr (b) Evaluation on ImageNet
Figure 8: Evaluation on the number of images on Flickr and ImageNet

Evaluating hit rate on the number of images. We evaluate the hit rate on the number of images on the Flickr and ImageNet datasets, as shown in Figure 7. Figure 7(a) demonstrates clearly that the hit rate of SMI is much higher than WJ and WMD. With the increase of the number of images, it fluctuates slightly. The hit rate of WMD is almost unchanged as the number of images increases. On the other hand, the hit rate of WJ shows a moderate growth at first, and after that it drops to a little lower than WMD. Clearly, the performance of SMI is the best. Figure 7(b) shows that the hit rate of SMI grows slightly at first and then declines weakly, remaining higher than the other two. Like the trend of SMI, the hit rate of WMD reaches its maximum value at 0.6 and decreases after that. Just the opposite, the hit rate of WJ shows a moderate decrease before 0.6 and rises after it.

Evaluating response time on the number of images. We evaluate the response time on different dataset sizes on the Flickr and ImageNet datasets, as shown in Figure 8. We can find from Figure 8(a) that the response times of PSMI and SMII increase slowly with the growth of the dataset size. Both of them are much better than the others. The growth rate of SMIN is a little higher than that of the former two. The response time of WMD is the worst; it grows rapidly, and at 1.0 it is more than 30000ms. In Figure 8(b), we see that the growth of WMD is the fastest too. Like the situation on Flickr, the performance of WMD is the worst among them. By comparison, the upward trends of SMII and PSMI are much more moderate, and PSMI shows the best performance.

6 Conclusion

In this paper, we investigate the problem of image similarity measurement, which is a significant issue in many applications. First, we propose the definitions of image objects, of similarity measurement between two images, and of related notions. We then present SMIN, a basic image similarity measurement method based on Word2Vec. To improve the performance of similarity calculation, we refine this method and propose the SMI Temp Index (SMII). To solve the problem that this index cannot be reused, we develop a novel indexing technique, built off-line, called the Index of Potential Similar Visual Words (PSMI). Experimental evaluation on two real image datasets shows that our solution outperforms the state-of-the-art method.
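The paper's algorithms are not reproduced here, but the core idea summarized above, measuring image similarity through similar visual words whose similarity comes from Word2Vec-style embeddings, can be sketched as follows. The greedy best-match scoring, the 0.8 threshold, and the toy embeddings are all assumptions for illustration, not the actual SMIN/SMII/PSMI procedures.

```python
import math

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def image_similarity(words_a, words_b, embed, threshold=0.8):
    """Score two images (bags of visual words) as the fraction of words in A
    that have a sufficiently similar word in B under the embedding `embed`."""
    if not words_a:
        return 0.0
    matched = 0
    for w in words_a:
        best = max((cosine(embed[w], embed[v]) for v in words_b), default=0.0)
        if best >= threshold:
            matched += 1
    return matched / len(words_a)

# Toy embeddings for three visual words: w1 and w2 are near-identical.
emb = {"w1": [1.0, 0.0], "w2": [0.9, 0.1], "w3": [0.0, 1.0]}
print(image_similarity(["w1", "w3"], ["w2"], emb))  # w1 matches w2, w3 does not -> 0.5
```

An off-line index in the spirit of PSMI would precompute, for each visual word, the set of words whose embedding similarity exceeds the threshold, so the inner `max` loop is replaced by a set lookup at query time.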

Acknowledgments: This work was supported in part by the National Natural Science Foundation of China (61702560), projects (2018JJ3691, 2016JC2011) of the Science and Technology Plan of Hunan Province, and the Research and Innovation Project of Central South University Graduate Students (2018zzts177, 2018zzts588).

References

  • (1) Abe, K., Morita, H., Hayashi, T.: Similarity retrieval of trademark images by vector graphics based on shape characteristics of components. In: Proceedings of the 2018 10th International Conference on Computer and Automation Engineering, ICCAE 2018, Brisbane, Australia, February 24-26, 2018, pp. 82–86 (2018)
  • (2) Albanesi, M.G., Amadeo, R., Bertoluzza, S., Maggi, G.: A new class of wavelet-based metrics for image similarity assessment. Journal of Mathematical Imaging and Vision 60(1), 109–127 (2018)
  • (3) Charfi, N., Trichili, H., Alimi, A.M., Solaiman, B.: Bimodal biometric system for hand shape and palmprint recognition based on SIFT sparse representation. Multimedia Tools Appl. 76(20), 20457–20482 (2017)
  • (4) Cicconet, M., Elliott, H., Richmond, D.L., Wainstock, D., Walsh, M.: Image forensics: Detecting duplication of scientific images with manipulation-invariant image similarity. CoRR abs/1802.06515 (2018)
  • (5) Coltuc, D., Datcu, M., Coltuc, D.: On the use of normalized compression distances for image similarity detection. Entropy 20(2), 99 (2018)
  • (6) Fedorov, S., Kacher, O.: Large scale near-duplicate image retrieval using triples of adjacent ranked features (TARF) with embedded geometric information. CoRR abs/1603.06093 (2016)
  • (7) Guha, T., Ward, R.K., Aboulnasr, T.: Image similarity measurement from sparse reconstruction errors. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1937–1941 (2013)
  • (8) Guo, S., Xing, D., Computer, D.O.: Sentence similarity calculation based on word vector and its application research. Modern Electronics Technique (2016)
  • (9) Hsieh, S., Chen, C., Chen, C.: A novel approach to detecting duplicate images using multiple hash tables. Multimedia Tools Appl. 74(13), 4947–4964 (2015)
  • (10) Jing, Y., Baluja, S.: Visualrank: Applying pagerank to large-scale image search. IEEE Trans. Pattern Anal. Mach. Intell. 30(11), 1877–1890 (2008)
  • (11) Karakasis, E.G., Amanatiadis, A., Gasteratos, A., Chatzichristofis, S.A.: Image moment invariants as local features for content based image retrieval using the bag-of-visual-words model. Pattern Recognition Letters 55, 22–27 (2015)
  • (12) Kato, T., Shimizu, I., Pajdla, T.: Selecting image pairs for sfm by introducing jaccard similarity. IPSJ Trans. Computer Vision and Applications 9, 12 (2017)
  • (13) Ke, Y., Sukthankar, R.: PCA-SIFT: A more distinctive representation for local image descriptors. In: 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), with CD-ROM, 27 June - 2 July 2004, Washington, DC, USA, pp. 506–513 (2004)
  • (14) Khan, A., Aguirre, H.E., Tanaka, K.: Improving the efficiency in halftone image generation based on structure similarity index measurement. IEICE Transactions 95-D(10), 2495–2504 (2012)
  • (15) Lew, M.S., Sebe, N., Djeraba, C., Jain, R.: Content-based multimedia information retrieval: State of the art and challenges. TOMCCAP 2(1), 1–19 (2006)
  • (16) Li, J., Qian, X., Li, Q., Zhao, Y., Wang, L., Tang, Y.Y.: Mining near duplicate image groups. Multimedia Tools Appl. 74(2), 655–669 (2015)
  • (17) Li Feng Hou Jiaying, Z.R.L.C.: Research on multi-character sentence similarity calculation method of fusion word vector. J. Computer Science and Exploration (2017)
  • (18) Liu, L., Lu, Y., Suen, C.Y.: Variable-length signature for near-duplicate image matching. IEEE Trans. Image Processing 24(4), 1282–1296 (2015)
  • (19) Liu, Y., Liu, S., Wang, Z.: Multi-focus image fusion with dense SIFT. Information Fusion 23, 139–155 (2015)
  • (20) Lowe, D.G.: Object recognition from local scale-invariant features. In: ICCV, pp. 1150–1157 (1999)
  • (21) Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
  • (22) Mortensen, E.N., Deng, H., Shapiro, L.G.: A SIFT descriptor with global context. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20-26 June 2005, San Diego, CA, USA, pp. 184–190 (2005)
  • (23) Nam, S., Kim, W., Mun, S., Hou, J., Choi, S., Lee, H.: A SIFT features based blind watermarking for DIBR 3d images. Multimedia Tools Appl. 77(7), 7811–7850 (2018)
  • (24) Nian, F., Li, T., Wu, X., Gao, Q., Li, F.: Efficient near-duplicate image detection with a local-based binary representation. Multimedia Tools Appl. 75(5), 2435–2452 (2016)
  • (25) dos Santos, J.M., de Moura, E.S., da Silva, A.S., da Silva Torres, R.: Color and texture applied to a signature-based bag of visual words method for image retrieval. Multimedia Tools Appl. 76(15), 16855–16872 (2017)
  • (26) Sivic, J., Zisserman, A.: Video google: A text retrieval approach to object matching in videos. In: 9th IEEE International Conference on Computer Vision (ICCV 2003), 14-17 October 2003, Nice, France, pp. 1470–1477 (2003)
  • (27) Su, M., Ma, Y., Zhang, X., Wang, Y., Zhang, Y.: MBR-SIFT: A mirror reflected invariant feature descriptor using a binary representation for image matching. PLoS ONE 12(5) (2017)
  • (28) Wan, J., Wang, D., Hoi, S.C., Wu, P., Zhu, J., Zhang, Y., Li, J.: Deep learning for content-based image retrieval: A comprehensive study. In: Proceedings of the ACM International Conference on Multimedia, MM ’14, Orlando, FL, USA, November 03 - 07, 2014, pp. 157–166 (2014)
  • (29) Wang, F., Wang, H., Li, H., Zhang, S.: Large scale image retrieval with practical spatial weighting for bag-of-visual-words. In: Advances in Multimedia Modeling, 19th International Conference, MMM 2013, Huangshan, China, January 7-9, 2013, Proceedings, Part I, pp. 513–523 (2013)
  • (30) Wang, H., Feng, L., Zhang, J., Liu, Y.: Semantic discriminative metric learning for image similarity measurement. IEEE Transactions on Multimedia 18(8), 1579–1589 (2016)
  • (31) Wang, Y., Lin, X., Wu, L., Zhang, W.: Effective multi-query expansions: Robust landmark retrieval. In: Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM ’15, Brisbane, Australia, October 26 - 30, 2015, pp. 79–88 (2015)
  • (32) Wang, Y., Lin, X., Wu, L., Zhang, W.: Effective multi-query expansions: Collaborative deep networks for robust landmark retrieval. IEEE Trans. Image Processing 26(3), 1393–1404 (2017)
  • (33) Wang, Y., Lin, X., Wu, L., Zhang, W., Zhang, Q.: Exploiting correlation consensus: Towards subspace clustering for multi-modal data. In: Proceedings of the ACM International Conference on Multimedia, MM ’14, Orlando, FL, USA, November 03 - 07, 2014, pp. 981–984 (2014)
  • (34) Wang, Y., Lin, X., Wu, L., Zhang, W., Zhang, Q.: LBMCH: learning bridging mapping for cross-modal hashing. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, August 9-13, 2015, pp. 999–1002 (2015)
  • (35) Wang, Y., Lin, X., Wu, L., Zhang, W., Zhang, Q., Huang, X.: Robust subspace clustering for multi-view data by exploiting correlation consensus. IEEE Trans. Image Processing 24(11), 3939–3949 (2015)
  • (36) Wang, Y., Lin, X., Zhang, Q.: Towards metric fusion on multi-view data: a cross-view based graph random walk approach. In: 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 27 - November 1, 2013, pp. 805–810 (2013)
  • (37) Wang, Y., Lin, X., Zhang, Q., Wu, L.: Shifting hypergraphs by probabilistic voting. In: Advances in Knowledge Discovery and Data Mining - 18th Pacific-Asia Conference, PAKDD 2014, Tainan, Taiwan, May 13-16, 2014. Proceedings, Part II, pp. 234–246 (2014)
  • (38) Wang, Y., Wu, L.: Beyond low-rank representations: Orthogonal clustering basis reconstruction with optimized graph structure for multi-view spectral clustering. Neural Networks 103, 1–8 (2018)
  • (39) Wang, Y., Wu, L., Lin, X., Gao, J.: Multiview spectral clustering via structured low-rank matrix factorization. IEEE Trans. Neural Networks and Learning Systems (2018)
  • (40) Wang, Y., Zhang, W., Wu, L., Lin, X., Fang, M., Pan, S.: Iterative views agreement: An iterative low-rank based structured optimization method to multi-view spectral clustering. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pp. 2153–2159 (2016)
  • (41) Wang, Y., Zhang, W., Wu, L., Lin, X., Zhao, X.: Unsupervised metric fusion over multiview data by graph random walk-based cross-view diffusion. IEEE Trans. Neural Netw. Learning Syst. 28(1), 57–70 (2017)
  • (42) Wang, Y., Zhou, Z.: Spatial descriptor embedding for near-duplicate image retrieval. IJES 10(3), 241–247 (2018)
  • (43) Wu, L., Wang, Y.: Robust hashing for multi-view data: Jointly learning low-rank kernelized similarity consensus and hash functions. Image Vision Comput. 57, 58–66 (2017)
  • (44) Wu, L., Wang, Y., Gao, J., Li, X.: Deep adaptive feature embedding with local sample distributions for person re-identification. Pattern Recognition 73, 275–288 (2018)
  • (45) Wu, L., Wang, Y., Gao, J., Li, X.: Where-and-when to look: Deep siamese attention networks for video-based person re-identification. arXiv:1808.01911 (2018)
  • (46) Wu, L., Wang, Y., Ge, Z., Hu, Q., Li, X.: Structured deep hashing with convolutional neural networks for fast person re-identification. Computer Vision and Image Understanding 167, 63–73 (2018)
  • (47) Wu, L., Wang, Y., Li, X., Gao, J.: Deep attention-based spatially recursive networks for fine-grained visual recognition. IEEE Trans. Cybernetics (2018)
  • (48) Wu, L., Wang, Y., Li, X., Gao, J.: What-and-where to match: Deep spatially multiplicative integration networks for person re-identification. Pattern Recognition 76, 727–738 (2018)
  • (49) Wu, L., Wang, Y., Shao, L.: Cycle-consistent deep generative hashing for cross-modal retrieval. CoRR abs/1804.11013 (2018)
  • (50) Wu, L., Wang, Y., Shepherd, J.: Efficient image and tag co-ranking: a bregman divergence optimization method. In: ACM Multimedia Conference, MM ’13, Barcelona, Spain, October 21-25, 2013, pp. 593–596 (2013)
  • (51) Yao, J., Yang, B., Zhu, Q.: Near-duplicate image retrieval based on contextual descriptor. IEEE Signal Process. Lett. 22(9), 1404–1408 (2015)
  • (52) Yildiz, B., Demirci, M.F.: Distinctive interest point selection for efficient near-duplicate image retrieval. In: 2016 IEEE Southwest Symposium on Image Analysis and Interpretation, SSIAI 2016, Santa Fe, NM, USA, March 6-8, 2016, pp. 49–52 (2016)
  • (53) Zhang, Z., Zou, Q., Wang, Q., Lin, Y., Li, Q.: Instance similarity deep hashing for multi-label image retrieval. CoRR abs/1803.02987 (2018)
  • (54) Zhao, W., Luo, H., Peng, J., Fan, J.: Mapreduce-based clustering for near-duplicate image identification. Multimedia Tools Appl. 76(22), 23291–23307 (2017)
  • (55) Zlabinger, M., Hanbury, A.: Finding duplicate images in biology papers. In: Proceedings of the Symposium on Applied Computing, SAC 2017, Marrakech, Morocco, April 3-7, 2017, pp. 957–959 (2017)