In the past decade, we have witnessed the rapid development of Internet technologies such as online social networking services, search engines and multimedia sharing services, which generate massive amounts of multimedia data DBLP:conf/cikm/WangLZ13 ; DBLP:journals/tip/WangLWZZH15 , e.g., text, images, audio and video. For example, among online social networking services, the most famous social networking site, Facebook (https://facebook.com/), reported 350 million images uploaded every day at the end of November 2013. On another popular social networking service, Twitter (http://www.twitter.com/), more than 400 million tweets with texts and images have been generated by 140 million users. In September 2017, the largest online social networking platform in China, Weibo (https://weibo.com/), had 376 million active users, who posted more than 100 million micro-blogs. Among multimedia data DBLP:journals/corr/abs-1804-11013 sharing services, more than 3.5 million new images were uploaded every day in March 2013 to Flickr (https://www.flickr.com/), the most famous photo sharing web site. Every minute, 100 hours of video are uploaded to YouTube (https://www.youtube.com/), and more than 2 billion videos in total were stored on this platform by the end of 2013. In China, the total monthly watch time of the largest online video service, IQIYI (http://www.iqiyi.com/), exceeded 42 billion minutes, and its number of monthly users is more than 230 million. The total amount of audio had exceeded 15 million on Himalaya (https://www.ximalaya.com/), a very popular audio sharing platform in China, as of December 2015. These web services not only provide great convenience for our daily life, but also create possibilities for the generation, storage and sharing of large-scale multimedia data DBLP:conf/mm/WangLWZZ14 ; DBLP:journals/cviu/WuWGHL18 . 
Moreover, this trend places higher demands on massive multimedia data retrieval DBLP:conf/mm/WangLWZ15 ; DBLP:journals/tip/WangLWZ17 ; DBLP:conf/sigir/WangLWZZ15 ; DBLP:conf/ijcai/WangZWLFP16 .
As shown in Figure 1, mobile smart devices such as smartphones and tablets, equipped with mobile communication modules (e.g., WiFi and 4G modules) and position sensor modules (e.g., GPS modules), collect huge amounts of multimedia data DBLP:journals/corr/abs-1708-02288 ; DBLP:conf/pakdd/WangLZW14 ; DBLP:journals/ivc/WuW17 with geo-tags. For example, users can take photos or videos LINYANGARXIV tagged with the geo-location of the place where they were shot. Many mobile applications such as WeChat, Twitter and Instagram support uploading and sharing images and text with geo-tags. Other location-based services such as Google Places, Yahoo!Local, and Dianping provide query services for geo-multimedia data that take into account both geographical proximity and multimedia data similarity. With these query services, users can find the place they want to go, for example the nearest restaurant serving steak and seafood, or a shop within a kilometre that sells a particular type of hat and scarf. We call this type of query a geo-multimedia query.
Motivation. The spatial keyword search problem has become a hot issue in the spatial database community due to the rise of mobile applications and location-based services. This type of problem is to find spatial objects by considering two aspects of relevance, i.e., geo-location proximity and textual similarity. Many spatial indexing techniques have been developed, such as R-Tree DBLP:conf/sigmod/Guttman84 , R*-Tree DBLP:conf/sigmod/BeckmannKSS90 , IL-Quadtree DBLP:journals/tkde/ZhangZZL16 , KR-Tree DBLP:conf/ssdbm/HariharanHLM07 , and IR-Tree DBLP:conf/icde/FelipeHR08 . Deng et al. DBLP:journals/tkde/DengLLZ15 studied a generic version of closest keywords search called best keyword cover. Cao et al. DBLP:conf/sigmod/CaoCJO11 proposed the problem of collective spatial keyword querying, and Fan et al. Fan2012Seal studied the problem of spatio-textual similarity search for regions-of-interest queries. However, these studies consider only textual data such as keywords together with spatial information; they do not take into account the other multimedia data mentioned above, such as images. Therefore, these solutions cannot be applied to the problem of geo-multimedia data query. In this paper, we aim to design an efficient solution to overcome the challenge of geo-multimedia query. We propose, for the first time, a novel geo-multimedia query called region of visual interests query. Figure 2 gives a simple but intuitive example of this query problem.
As illustrated in Figure 2, consider a user who works for a chain restaurant that mainly serves pizza and coffee. She wants to investigate in which blocks the people who like pizza and coffee reside. This can guide advertising in a specific area or district by means of an Internet advertising push service. As more and more people share photos of their favorite food on the Internet together with the geo-location of the place where the photos were shot, the user can submit a region of visual interests query which contains several images of pizza and coffee and the geographical information of a region. The system then returns a set of users who are highly relevant to the query in terms of both geographical proximity and visual content similarity.
To the best of our knowledge, we are the first to propose the problem of region of visual interests (RoVI) query. To solve this problem effectively and efficiently, we present the definition of region of visual interests query and the relevant notions. We show how to exploit existing techniques to address the RoVI query problem by presenting three baselines, namely double index, visual first index, and spatial first index. After that, we propose a novel spatial indexing structure named quadtree based inverted visual index, which is a combination of quadtree, inverted index and visual words. Based on this indexing technique, we design a novel efficient search algorithm named region of visual interests search to improve search performance.
Contributions. Our main contributions can be summarized as follows:
To the best of our knowledge, we are the first to propose region of visual interests query. Firstly we introduce the definition of geo-image and region of visual interests query and relevant notions. The visual similarity function and geographical similarity function are proposed.
We introduce three hybrid indexes namely double index, visual first index, and spatial first index to explain how to exploit existing techniques to address the problem mentioned above.
To solve the region of visual interests query problem efficiently, we present a novel spatial indexing structure named quadtree based inverted visual index, which combines quadtree, inverted index and visual words. Based on it, we develop an efficient search algorithm called region of visual interests search (RoVI Search) for this problem.
We have conducted extensive experiments on real geo-image datasets. Experimental results demonstrate that our solution outperforms the state-of-the-art method.
Roadmap. The remainder of this paper is organized as follows. In Section 2 we present the related work on spatial keyword query and image retrieval. In Section 3 we give the definition of region of visual interests query and related concepts. We introduce three baselines, namely double index, visual first index, and spatial first index, in Section 4. In Section 5 we propose a novel spatial indexing technique named quadtree based inverted visual index and an efficient search algorithm named region of visual interests search. In Section 6 we present the experimental results. Finally, we conclude the paper in Section 7.
2 Related Work
In this section, we review the work on spatial keyword query and content-based image retrieval that is related to ours.
Spatial Keywords Query. Recently, spatial keyword query has become a hot topic attracting many researchers in the field of spatial databases. A spatial keyword query aims to return a set of spatial textual objects which are relevant to the query in terms of both spatial proximity and textual similarity DBLP:conf/er/CaoCCJQSWY12 . Many efficient spatial indexing techniques have been presented, such as R-Tree DBLP:conf/sigmod/Guttman84 , R*-Tree DBLP:conf/sigmod/BeckmannKSS90 , and IR-Tree DBLP:journals/pvldb/CongJW09 . João B. Rocha-Junior et al. DBLP:conf/ssd/RochaGJN11 presented a novel spatial indexing structure called Spatial Inverted Index (S2I) to solve the top-k spatial keyword query problem. This index maps each distinct term to a set of objects containing the term. They designed two efficient algorithms named SKA and MKA based on S2I to enhance the performance of top-k spatial keyword search. Wang et al. DBLP:conf/icde/WangZZLW15 studied the problem of processing a large number of continuous spatial keyword queries over streaming data. They proposed a novel adaptive index called AP-Tree which adaptively groups registered queries using keyword and spatial partitions. Li et al. DBLP:journals/tkde/LiLZLLW11 developed a new spatial indexing structure named IR-Tree which, together with a top-k document search algorithm, facilitates four tasks in document search. Zhang et al. DBLP:conf/edbt/ZhangTT13 proposed a scalable integrated inverted index which uses a quadtree to hierarchically partition the data space into cells. Zheng et al. DBLP:conf/icde/ZhengSZSXLZ15 investigated the interactive top-k spatial keyword (ITSK) query and designed a three-phase solution focusing on both effectiveness and efficiency. To solve the problem of top-k spatial keyword search, a novel index structure called IL-Quadtree was proposed by Zhang et al. DBLP:journals/tkde/ZhangZZL16 . This technique uses both spatial and keyword information to effectively reduce the search space.
Other spatial keyword search problems have also been proposed and well studied. Deng et al. DBLP:journals/tkde/DengLLZ15 proposed a novel spatial keyword search problem named Best Keyword Cover, and presented a novel scalable algorithm named keyword nearest neighbor expansion. Guo et al. DBLP:conf/sigmod/GuoCC15 proved that the problem of answering mCK queries is NP-hard. In addition, they proposed a 2-approximation greedy approach as a baseline and designed two approximation algorithms called SKECa and SKECa+ to solve this problem efficiently. João B. Rocha-Junior et al. DBLP:conf/edbt/Rocha-JuniorN12 solved the top-k spatial keyword query problem on road networks. Unlike the spatial search problems mentioned above, the geographical distance between the query location and an object is the length of the shortest path in the road network. They developed new spatial indexing structures and algorithms that are able to solve this problem efficiently. Lee et al. DBLP:journals/tkde/LeeLZT12 developed a novel system framework called ROAD to solve the problem of spatial search on road networks. Zhang et al. DBLP:conf/edbt/ZhangZZLCW14 studied the problem of diversified spatial keyword search on road networks. They designed an efficient signature-based inverted index to improve the search performance. Guo et al. DBLP:journals/geoinformatica/GuoSAT15 studied continuous top-k spatial keyword queries on road networks for the first time. They proposed two approaches which can monitor moving queries in an incremental manner and improve the search performance.
The aforementioned studies aim to find spatial textual objects which are similar to the query, but they do not take into account the situation in which the query and the objects contain geo-tagged images. Consequently, these approaches are not adequate to overcome the challenge of region of visual interests query, which involves geo-images.
Content-Based Image Retrieval. Content-based image retrieval (CBIR for short) DBLP:journals/pami/JingB08 ; DBLP:journals/tomccap/LewSDJ06 ; DBLP:conf/mm/WuWS13 aims to search for images by analyzing their visual contents, and thus depends on image representation DBLP:conf/mm/WanWHWZZL14 ; DBLP:journals/pr/WuWGL18 . In recent years, CBIR has attracted more and more attention in the multimedia TC2018 ; DBLP:journals/pr/WuWLG18 and computer vision communities DBLP:journals/tnn/WangZWLZ17 ; NNLS2018 . Many techniques have been proposed to support efficient multimedia query and image recognition. Scale Invariant Feature Transform (SIFT for short) DBLP:conf/iccv/Lowe99 ; DBLP:journals/ijcv/Lowe04 is a classical method to extract visual features, which transforms an image into a large collection of local feature vectors. SIFT includes four main steps: (1) scale-space extrema detection; (2) keypoint localization; (3) orientation assignment; (4) keypoint descriptor. It is widely applied in research and applications. For example, Ke et al. DBLP:conf/cvpr/KeS04 proposed a novel image descriptor named PCA-SIFT which combines SIFT with principal components analysis (PCA for short). Mortensen et al. DBLP:conf/cvpr/MortensenDS05 proposed a feature descriptor that augments SIFT with a global context vector. This approach adds curvilinear shape information from a much larger neighborhood to reduce mismatches. Liu et al. DBLP:journals/inffus/LiuLW15 proposed a novel image fusion method for multi-focus images with dense SIFT. This dense SIFT descriptor can not only be employed as the activity level measurement, but also be used to match the mis-registered pixels between multiple source images to improve the quality of the fused image. Su et al. Su2017MBR designed a horizontal or vertical mirror reflection invariant binary descriptor named MBR-SIFT to solve the problem of image matching. Nam et al. DBLP:journals/mta/NamKMHCL18 introduced a SIFT feature based blind watermarking algorithm to address the issue of copyright protection for DIBR 3D images. Charfi et al. DBLP:journals/mta/CharfiTAS17 developed a bimodal hand identification system based on SIFT descriptors which are extracted from hand shape and palmprint modalities.
The bag-of-visual-words DBLP:conf/iccv/SivicZ03 ; DBLP:journals/tnn/WangZWLZ17 ; DBLP:journals/corr/abs-1804-11013 (BoVW for short) model is another popular technique for CBIR and image recognition; the underlying bag-of-words idea was first used in text classification. This model transforms images into sparse hierarchical vectors by using visual words, so that a large number of images can be manipulated. Santos et al. DBLP:journals/mta/SantosMST17 presented the first method based on the signature-based bag of visual words (S-BoVW for short) paradigm that considers texture information to generate textual signatures of image blocks for representing images. Karakasis et al. DBLP:journals/prl/KarakasisAGC15 presented an image retrieval framework that uses affine image moment invariants as descriptors of local image areas in a BoVW representation. Wang et al. DBLP:conf/mmm/WangWLZ13 presented an improved practical spatial weighting for BoV (PSW-BoV) to alleviate this effect while keeping the efficiency.
The aforementioned studies greatly improved the performance of image retrieval and recognition, but they do not consider images with geo-tags, which are generated by smart devices and location-based services. Consequently, these techniques cannot be applied to address the region of visual interests query problem.
In this section, we propose the definition of region of visual interests (RoVI for short) for the first time, and then present the notion of region of visual interests query (RoVIQ for short) and the similarity measurements. Besides, we review the image retrieval techniques which are the basis of our work. Table 1 summarizes the notations frequently used throughout this paper to facilitate the discussion.
|A given database of geo-tagged images|
|The -th geo-tagged images|
|The geo-location component of|
|The visual component of|
|A region of visual interest user dataset|
|A region of visual interest user in|
|The geographical information component of|
|The visual information component of|
|A weight set of visual words|
|The weight of visual word|
|A region of visual interests query|
|The result set of a query|
|The geographical similarity between and|
|The visual similarity between and|
|The geographical similarity threshold|
|The visual similarity threshold|
|The region intersection of and|
|The region union of and|
|The function to compute the area of|
|A node of quadtree|
3.1 Problem Definition
Definition 1 (Geo-Image)
Let be a database of geo-images in which is the number of geo-images in it, each is an image with a geo-tag which is the geo-location information about the shoot place. We denote a geo-image as , wherein is the geo-location component and is the visual component.
Definition 2 (Region of Visual Interests (RoVI))
Let be a region of visual interests user dataset in which is the number of users in it. Each user is denoted as , where represents the geographical information component and is the visual information component. The former is a geographical region, which we represent by a minimum bounding rectangle (MBR for short). Let and be the top-left point and bottom-right point of an MBR; then can be denoted as . The visual information component is a vector of visual words , which is generated from the geo-images located in this region. Let be the weight set of visual words. Hereafter, whenever there is no ambiguity, "region of visual interest user" is abbreviated to "user".
Definition 3 (Region of Visual Interests Query (RoVIQ))
Given a region of visual interest users dataset , a region of visual interest query aims to return a set of users from , which are similar to , where denotes geographical region information and represents visual information. Formally, and ,
where and are the geographical similarity function and visual similarity function respectively. To facilitate the computation, two thresholds and are defined, which represent geographical and visual similarity threshold respectively, and , . Therefore, for each user in results set ,
Consequently, a RoVIQ is to return a set
In order to measure the geographical similarity, we introduce two important concepts, i.e., region intersection and region union, in the following part.
Definition 4 (Region Intersection)
Given a region of visual interest users dataset , , the region intersection of these two users is defined as , denoted as follows:
where the function computes the area of . Similarly, for a RoVIQ and a user , we can compute the region intersection of them as follows:
Definition 5 (Region Union)
Given a region of visual interest users dataset , , the region union of these two users is defined as , denoted as follows:
As in Definition 4, we can compute the region union of a query and a user as follows:
Definition 6 (Geographical Similarity)
Given a region of visual interest users dataset and a RoVIQ , , the geographical similarity between and is defined by
Definition 7 (Visual Similarity)
Given a region of visual interest users dataset and a RoVIQ , , the visual similarity between and is measured by Jaccard similarity measurement, shown as below:
where denotes the weight of visual word .
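To make the two measures concrete, the following minimal Python sketch computes them under stated assumptions. Geographical similarity is taken as the ratio of the intersection area to the union area of two MBRs, and visual similarity as a weighted Jaccard ratio over visual words. The tuple layout `(x_min, y_min, x_max, y_max)`, the function names, and the rule that the union area equals the sum of the two areas minus their overlap are choices of this sketch, not notation taken from the definitions above.

```python
from typing import Dict, Set, Tuple

# An MBR stored as (x_min, y_min, x_max, y_max); this layout is an assumption.
MBR = Tuple[float, float, float, float]


def area(r: MBR) -> float:
    """Area of a rectangle; degenerate or inverted rectangles count as 0."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def intersection_area(a: MBR, b: MBR) -> float:
    """Area of the overlap of two MBRs (0 if they are disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def geo_similarity(a: MBR, b: MBR) -> float:
    """Intersection area divided by union area (assumed form of the measure)."""
    inter = intersection_area(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def visual_similarity(q_words: Set[str], u_words: Set[str],
                      weight: Dict[str, float]) -> float:
    """Weighted Jaccard: weight of shared words over weight of all words."""
    common = sum(weight[w] for w in q_words & u_words)
    total = sum(weight[w] for w in q_words | u_words)
    return common / total if total > 0 else 0.0
```

For two 2x2 squares offset by one unit in each direction, the overlap is 1 and the union is 7, so `geo_similarity` gives 1/7. Both functions return values in [0, 1], which is what allows the two thresholds to be applied in the same way.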
Figure 2 shows an example of region of visual interest query. There are 7 users which contain geographical information denoted as rectangle and visual words set . The light blue rectangles are the region of these users and the pink rectangle is the region of query.
4 Baseline Method
Before proceeding to present the proposed solution, we introduce how to exploit existing techniques to address the problem defined above. Three hybrid indexes, namely double index, visual first index, and spatial first index, are introduced in the following.
4.1 Double Index
Double index consists of two separate components: an R*-tree and visual inverted files. For each user , its geographical part is indexed by the R*-tree and its visual information is indexed by the inverted files. More specifically, the difference from a conventional R*-tree is that a leaf node is composed of a series of regions rather than a series of points. For example, in Figure 4 the R*-tree contains two inner nodes and . contains the elements and , while contains the elements and . Each leaf node includes a set of users. The visual inverted files are the same as conventional inverted files. They consist of a vocabulary of all distinct visual words in a collection of images and a set of posting lists related to this vocabulary. Each posting list is a sequence of visual pairs , where refers to the user that contains visual word , and is the weight of visual word in image .
When processing a region of visual interests query, the R*-tree is used to prune irrelevant nodes as early as possible to shrink the search space, and the visual inverted files are used so that only the users containing at least one query visual word are included in the search. The candidates retrieved from the R*-tree satisfy only the spatial constraint, and the candidates obtained from the visual inverted files satisfy only the visual word requirement. Therefore, to answer a region of visual interests query, the final result is obtained by merging the candidates from the two indexes.
We use an example to illustrate how double index works. Consider the users and a region of visual interests query with thresholds = 0.3 and = 0.4 in Figure 4. Given a query , the index first loads the root node of the R*-tree and gradually finds the intersecting leaf node . Then, we compute the geographical similarity between and . As , we discard and add only to the candidate list. Then, we probe the visual inverted lists of and and find the candidate users satisfying , i.e., and . Finally, we merge the candidate lists from the R*-tree and the visual inverted index, and the list is reported as the final result.
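The double index procedure can be sketched in a few lines of Python. This is an illustration under several assumptions rather than the implementation used in the paper: a linear scan stands in for the R*-tree traversal, the "merge" of the two candidate lists is taken to be a set intersection, and both similarity functions use assumed intersection-over-union forms. All names (`geo_sim`, `vis_sim`, `build_inverted_file`, `double_index_query`, `tau_g`, `tau_v`) are hypothetical.

```python
from collections import defaultdict


def geo_sim(a, b):
    """Intersection area over union area of two (x_min, y_min, x_max, y_max) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, w) * max(0.0, h)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def vis_sim(q_words, u_words, weight):
    """Weighted Jaccard similarity over two sets of visual words."""
    total = sum(weight[w] for w in q_words | u_words)
    return sum(weight[w] for w in q_words & u_words) / total if total > 0 else 0.0


def build_inverted_file(users):
    """Map each visual word to the ids of the users that contain it."""
    inv = defaultdict(set)
    for uid, (_, words) in users.items():
        for w in words:
            inv[w].add(uid)
    return inv


def double_index_query(users, inv, weight, q_region, q_words, tau_g, tau_v):
    # Spatial side: a linear scan stands in for the R*-tree traversal.
    spatial = {uid for uid, (mbr, _) in users.items()
               if geo_sim(mbr, q_region) >= tau_g}
    # Visual side: only users sharing at least one query word are examined.
    touched = set()
    for w in q_words:
        touched |= inv.get(w, set())
    visual = {uid for uid in touched
              if vis_sim(q_words, users[uid][1], weight) >= tau_v}
    # Merge step (assumed to be an intersection): keep users passing both filters.
    return spatial & visual
```

Here `users` maps a user id to a pair `(mbr, set_of_visual_words)`. The sketch makes the cost of this baseline visible: the two candidate sets are built independently, so neither side benefits from the pruning done by the other.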
4.2 Visual First Index
To efficiently facilitate the region of visual interests search, it is fairly natural for visual first index to employ spatial index techniques to organize the users for each visual word, instead of keeping them in a list, as shown in Figure 5. For a given region of visual interests query, only the R*-trees corresponding to the query visual words need to be accessed. Then, we can apply the region intersection operation on these R*-trees until arriving at the leaf level. For all users in a leaf node, we first add the users with to the candidate list. Then, we further verify whether each candidate satisfies the spatial constraint: if , we consider it a result; otherwise we discard it.
We use an example to illustrate how visual first index works. Consider the users and a region of visual interests query with thresholds = 0.3 and = 0.4 in Figure 5. Given a query , the index first loads the root nodes of , and , and finds the users whose regions intersect with the query region in these three R*-trees and satisfy , i.e., and . Then, for these users, we further verify whether they satisfy the spatial constraint. Only satisfies the spatial requirement and is reported as the final result.
4.3 Spatial First Index
As shown in Figure 6, an R*-tree is first built on all MBRs covering the spatial scopes of all users' images. Next, all the users in each R*-tree leaf node are indexed by visual inverted files based on their visual words. Hence, spatial first index consists of a primary R*-tree and a series of secondary visual inverted files corresponding to the R*-tree leaf nodes.
When processing a region of visual interests query, the geographical region of the query is first used to find the leaf nodes that may contain candidates. If a leaf node intersects with the query region and , then the visual words of the query are employed to load the corresponding visual inverted files of this node; otherwise the leaf node is discarded. For each user in the leaf node, we check whether ; if the user satisfies the spatial constraint, we add it to the candidate list. Then, we further verify whether it satisfies the visual requirement: if , we add it to the result list; otherwise we discard it.
We use an example to illustrate how spatial first index works. Consider the users and a region of visual interests query with thresholds = 0.3 and = 0.4 in Figure 6. Given a query , the R*-tree index first finds all leaf nodes that intersect with the query region, i.e., and . As , only leaf node needs further treatment. Then, we load the visual inverted index of for , , and . Both and satisfy the spatial constraint, but only satisfies the visual requirement. Thus, the final result is .
5 Quadtree Based Inverted Visual Index
In this section, we present a novel indexing technique named quadtree based inverted visual index based on shared quadtree and inverted index for the region of visual interest search problem. In subsection 5.1, we present our spatial indexing structure, and in subsection 5.2, we propose the efficient algorithm named RoVI Search to address the problem of RoVI query based on this novel index.
5.1 Spatial Indexing Structure
5.1.1 Virtual Quadtree
We utilize a shared quadtree to construct our index. Quadtree DBLP:journals/cacm/Gargantini82 ; DBLP:journals/pami/HunterS79 is a classical space-partitioning indexing technique which subdivides a d-dimensional space into 2^d regions in a recursive manner. Figure 7 illustrates a virtual quadtree. It decomposes the space into levels, and at the i-th level the space is split into 4^i equal-area regions. Each node in the quadtree represents a geographical region. The root node of the virtual quadtree represents the entire geographical region and corresponds to level 0. At level 1, there are four equal-area nodes which are partitioned from the root node, and the remaining levels are obtained in the same manner. There are three colors of nodes shown in Figure 7. Specifically, the root node and the intermediate nodes, which have already been split into four subnodes, are colored light gray. The dark gray nodes denote leaf nodes, which can be located at any level of the tree according to the split condition. The white node in level 2 is not actually maintained. Each node is associated with the users whose geo-locations fall in this node, but the users themselves, which contain both geographical and visual information, are stored in the leaf nodes, each of which maintains a user id list. To sum up, the whole geographical region is spatially partitioned into several nodes, and the users are distributed among them.
Inserting a user into a virtual quadtree can be done in the traditional manner, which traverses the quadtree from the root node to find the leaf nodes whose geographical region, denoted as , overlaps the geographical area of the user, i.e., . If the geographical region of a leaf node overlaps the region of user , i.e., , then can be inserted into .
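The insertion step can be illustrated with a small region quadtree in Python. This is a sketch under assumptions: the split condition is not specified above, so a hypothetical leaf capacity (`CAPACITY`) and depth limit (`MAX_LEVEL`) are used, a user whose region overlaps several leaves is recorded in each of them, and rectangles that only touch along an edge are treated as non-overlapping. The names `Node`, `insert`, `split` and `leaf_regions` are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

MBR = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout

CAPACITY = 2   # hypothetical split condition: max user ids per leaf
MAX_LEVEL = 4  # hypothetical depth limit, which also bounds recursion


def overlaps(a: MBR, b: MBR) -> bool:
    """True if the two rectangles share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


@dataclass
class Node:
    region: MBR
    level: int = 0
    users: List[int] = field(default_factory=list)  # user id list (leaves only)
    children: Optional[List["Node"]] = None          # None marks a leaf


def split(node: Node, regions: Dict[int, MBR]) -> None:
    """Turn a leaf into an inner node with four equal-area children."""
    x0, y0, x1, y1 = node.region
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    quads = [(x0, y0, xm, ym), (xm, y0, x1, ym), (x0, ym, xm, y1), (xm, ym, x1, y1)]
    node.children = [Node(q, node.level + 1) for q in quads]
    pending, node.users = node.users, []
    for uid in pending:  # redistribute the stored user ids to the children
        insert(node, uid, regions)


def insert(node: Node, uid: int, regions: Dict[int, MBR]) -> None:
    """Record uid in every leaf whose region overlaps the user's region."""
    if not overlaps(node.region, regions[uid]):
        return
    if node.children is None:
        node.users.append(uid)
        if len(node.users) > CAPACITY and node.level < MAX_LEVEL:
            split(node, regions)
        return
    for child in node.children:
        insert(child, uid, regions)


def leaf_regions(node: Node, uid: int) -> List[MBR]:
    """Regions of all leaves that hold uid (useful for inspection)."""
    if node.children is None:
        return [node.region] if uid in node.users else []
    found: List[MBR] = []
    for child in node.children:
        found.extend(leaf_regions(child, uid))
    return found
```

Storing only user ids in the leaves keeps the tree small, which is consistent with the tree being kept in memory while the user records live elsewhere.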
5.1.2 Z-order curve
Each node of the virtual quadtree can be encoded by space filling curve techniques, such as Morton code Morton2015A , Hilbert code and Gray code Faloutsos1986Multiattribute . In this study, we encode the quadtree nodes according to Morton code, i.e., the Z-order curve, since the code is derived from the partition sequence of a node. The Z-order code describes the path of the node in the virtual quadtree.
Figure 8(a) demonstrates how to obtain the Morton code of a node according to the partition sequence in 2-dimensional space. We assume that a node is split into four subnodes in the order of the Z-order curve. We denote these four subnodes by Z-order curve as 1, 2, 3, 4 respectively. For the situation shown in Figure 8(a), the sixteen nodes are coded from 0 to 15 in decimal.
Figure 8(a) shows our encoding method for quadtree nodes by Z-order. For level 1 of the quadtree, there are four subnodes, which we encode in binary as 00, 01, 10, 11 following the order of the Z-order curve. For the sixteen subnodes in level 2, the codes range from 0000 to 1111. In general, at the i-th level, the code of each node has 2i bits. Thus, in the example shown in Figure 8(b), the code of a node in level 2 has 4 bits. In addition, this example illustrates a distribution of 7 users, colored light blue, and a query, colored pink, over the whole geographical region. According to the code of each node, we can generate a region list for each user. For instance, for user , the region list is denoted as . The region list of is .
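The encoding can be illustrated with a short Python function that interleaves the bits of a cell's row and column indices. This is a sketch under assumptions: cells at a given level are addressed by integer `(col, row)` grid coordinates, and the row bit is placed before the column bit in each pair; the opposite convention yields an equally valid Z-order curve. The function name `morton_code` is hypothetical.

```python
def morton_code(col: int, row: int, level: int) -> str:
    """Return the Z-order code of a cell as a binary string of 2 * level bits.

    At each level, from the coarsest split to the finest, one row bit and one
    column bit are appended, so a node's code extends its parent's code by
    two bits.
    """
    bits = []
    for i in range(level - 1, -1, -1):
        bits.append(str((row >> i) & 1))
        bits.append(str((col >> i) & 1))
    return "".join(bits)
```

With this convention the four level-1 cells receive 00, 01, 10 and 11, and the sixteen level-2 cells receive the codes 0000 to 1111. Because the first two bits of a level-2 code equal the code of the enclosing level-1 cell, a code also records the path from the root to the node.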
5.1.3 Quadtree Based Inverted Visual Index
We propose a novel inverted index structure based on the quadtree mentioned above and visual words. This inverted index structure has three layers. The first layer is a visual word list which contains the visual words generated from all of the geo-images of the users. Each visual word has a node list pointer which points to a node list in the second layer. The node list is constructed from the virtual quadtree introduced above based on Z-order codes, and each node in it has a user list pointer. The third layer consists of several user lists. The user list pointer of a node points to the user list containing all the users included in this node. All the user lists are stored on disk. Figure 9 illustrates the inverted index of example 2. There are 5 visual words generated from 7 users, which form the first layer. For visual word , there are three users containing it, i.e., , and , and is in node , and are in node .
Visual Filter. We design a visual filter for candidate set generation during the RoVI Search. According to the visual similarity measurement defined in Section 3, we can define a candidate visual threshold . Clearly, the visual similarity between a query and a user only if . Therefore, we can use the threshold as a visual filter.
5.2 RoVI Search Algorithm Based on Quadtree Based Inverted Visual Index
Based on the quadtree based inverted visual index, we design an efficient search algorithm for the RoVI query. The pseudo-code is shown in Algorithm 1.
Algorithm 1 illustrates the RoVI Search procedure. First, the virtual quadtree is loaded and the algorithm finds the nodes that intersect with the query and saves them into a set . Then, for each visual word contained in the query, it finds the nodes which contain any of these visual words and stores them into . For each node in the set of intersecting nodes, the algorithm finds all the nodes in which overlap with it, and then is called to filter these nodes, which are stored in the node list . The next step is to generate the candidate set by calling the procedure . For each candidate, the algorithm computes the visual similarity and geographical similarity by executing and . If both similarities are greater than or equal to the corresponding thresholds, then will be added into the result set.
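The search steps described above can be sketched in Python as follows. This is a simplified illustration, not Algorithm 1 itself: a flat dictionary of leaf nodes keyed by Z-order code replaces the quadtree traversal, the three index layers are modelled as nested in-memory dictionaries (word, then node code, then user ids), the visual filter is left out, and both similarity functions use assumed intersection-over-union forms. All function and parameter names are hypothetical.

```python
from collections import defaultdict


def overlaps(a, b):
    """True if two (x_min, y_min, x_max, y_max) boxes share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def geo_sim(a, b):
    """Intersection area over union area of two boxes (assumed measure)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, w) * max(0.0, h)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def vis_sim(q_words, u_words, weight):
    """Weighted Jaccard similarity over two sets of visual words."""
    total = sum(weight[w] for w in q_words | u_words)
    return sum(weight[w] for w in q_words & u_words) / total if total > 0 else 0.0


def build_qiv(users, leaves):
    """Layer 1: visual word -> layer 2: node code -> layer 3: user ids."""
    index = defaultdict(lambda: defaultdict(set))
    for uid, (mbr, words) in users.items():
        for code, region in leaves.items():
            if overlaps(mbr, region):
                for w in words:
                    index[w][code].add(uid)
    return index


def rovi_search(users, leaves, index, weight, q_region, q_words, tau_g, tau_v):
    # Step 1: nodes whose region intersects the query region.
    hit = {code for code, region in leaves.items() if overlaps(region, q_region)}
    # Step 2: follow each query word to its node list, keep only hit nodes,
    # and collect the users stored under those nodes as candidates.
    candidates = set()
    for w in q_words:
        for code, uids in index.get(w, {}).items():
            if code in hit:
                candidates |= uids
    # Step 3: verify both similarities against their thresholds.
    return sorted(
        uid for uid in candidates
        if geo_sim(users[uid][0], q_region) >= tau_g
        and vis_sim(q_words, users[uid][1], weight) >= tau_v
    )
```

Compared with building spatial and visual candidate sets independently, only users that share a visual word with the query and lie in a node touched by the query region are ever verified, which is where the reduction in search space comes from.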
6 Performance Evaluation
In this section, we present the results of a comprehensive performance study on real image datasets to evaluate the efficiency and scalability of the proposed techniques. Specifically, we evaluate the following indexing techniques for region of visual interests search: the quadtree based inverted visual index (QIV), visual first index (VFI), spatial first index (SFI) and double index (DI).
Datasets. The performance of the algorithms is evaluated on both real spatial and image datasets. The following two datasets are deployed in the experiments. The real dataset Flickr is obtained by crawling millions of images from the photo-sharing site Flickr (http://www.flickr.com/). To evaluate the scalability of our proposed algorithm, the dataset size varies from 20K to 1M. The spatial locations for Flickr are obtained from the US Board on Geographic Names (http://geonames.usgs.gov). We selected 1 million POIs from this dataset as centers and extended the POIs with random widths and heights to generate regions. Similarly, the real dataset ImageNet is obtained from ImageNet, the largest image dataset, which is widely used in image processing and computer vision. It includes 14,197,122 images, 1.2 million of which come with SIFT features. We generate the ImageNet dataset with sizes varying from 20K to 1M. The regions of the users are randomly generated from the spatial dataset Rtree-Portal (http://www.rtreeportal.org) with the same method as for the Flickr dataset.
Workload. A workload for the region of visual interests query consists of queries. The query response time is employed to evaluate the performance of the algorithms. The image dataset size grows from 0.2M to 1M; the number of query visual words changes from 50 to 150; the spatial similarity and visual similarity thresholds vary from 0.1 to 0.5. To generate a query region, we randomly select a location from the above spatial dataset, and the query region is set to a rectangle centered at the selected location. The query region varies from 1% to 5% of the data space. By default, the image dataset size, the number of query visual words, the spatial similarity threshold, the visual similarity threshold, and the query region are set to 1M, 100, 0.3, 0.3 and 2% respectively. Experiments are run on a PC with a dual Intel Xeon 2.60GHz CPU and 16 GB of memory running Ubuntu. All algorithms in the experiments are implemented in Java. Note that the virtual quadtree of QIV is maintained in memory, because it occupies very little memory.
Evaluation on the size of dataset. We evaluate the effect of the dataset size on Flickr and ImageNet, as shown in Figure 10. The response times of QIV, VFI, SFI and DI rise with the dataset size on both datasets. Figure 10(a) illustrates that our method, QIV, performs much better than the others. The time cost of DI is the largest; it gradually increases from about 1000 ms to around 4500 ms. The response times of VFI and SFI are lower than that of DI, and both show a fluctuating upward trend from 0.2M to 1M. Figure 10(b) shows the experiment on ImageNet. As in the evaluation on Flickr, the time cost of QIV is the lowest of the four methods, and it grows moderately with the dataset size. Similarly, the response time of DI is the highest; it climbs slowly and finally reaches 1000 ms at 1M. The gentle upward trends of VFI and SFI are very similar, and both methods are clearly inferior to QIV.
Evaluation on the number of query visual words. We evaluate the effect of the number of query visual words on the Flickr and ImageNet datasets, as shown in Figure 10. The experiment on Flickr is shown in Figure 10(a). The response times of all methods rise with the number of query visual words. Specifically, our method performs best: its response time increases very slowly as the number of visual words grows from 50 to 150. On the other hand, the time cost of DI is the highest. The growth trends of VFI and SFI are similar; both increase smoothly. In Figure 10(b), all of the trends are similar to those in Figure 10(a). In other words, all of them climb gradually, and QIV is more efficient than VFI, SFI and DI.
Evaluation on the spatial similarity threshold. We evaluate the effect of the spatial similarity threshold on the Flickr and ImageNet datasets, as shown in Figure 12. In Figure 12(a), as the spatial similarity threshold increases, the response time of QIV fluctuates moderately around 80 ms, which is the lowest among these methods. The trends of VFI and DI are almost flat, but the time cost of the latter is much higher than that of the former. On the other hand, the response time of SFI decreases moderately as the spatial similarity threshold rises, and SFI performs better than DI. Figure 12(b) shows that the efficiency of DI is almost unchanged as the spatial similarity threshold increases and is much lower than that of SFI, VFI and QIV. The response time of SFI decreases slowly but remains higher than that of VFI. As in the experiment on Flickr, our method has the best performance, and its response time fluctuates only slightly as the threshold increases.
Evaluation on the visual similarity threshold. We evaluate the effect of the visual similarity threshold on the Flickr and ImageNet datasets, as shown in Figure 12. Figure 12(a) shows that the response times of DI and VFI are almost unchanged as the visual similarity threshold increases. The former is the highest of all methods; by comparison, the time cost of VFI is much lower than that of DI. On the other hand, the response times of our method QIV and of SFI decrease: both go down very slowly at first and then drop markedly. Clearly, QIV performs best. Figure 12(b) illustrates that all of these trends are similar to those in Figure 12(a). QIV has the best performance, and its response time decreases step by step as the threshold rises. The response time of SFI gradually goes down, and at 0.5 it is a little lower than that of VFI. Clearly, DI shows the worst efficiency on this dataset.
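To make the role of the two thresholds concrete, the following sketch answers a threshold query by a plain linear scan. It is not QIV or any of the baselines, and the similarity measures are assumptions: rectangle Jaccard (intersection area over union area) stands in for the spatial similarity, and set Jaccard over visual words stands in for the visual similarity. The similarity functions defined earlier in the paper should be substituted where they differ.

```python
from typing import Iterable, List, Set, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
User = Tuple[str, Rect, Set[int]]         # (user id, region, visual word ids)


def rect_area(r: Rect) -> float:
    """Area of a rectangle; degenerate or inverted rectangles count as 0."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def spatial_jaccard(a: Rect, b: Rect) -> float:
    """Assumed spatial similarity: intersection area over union area."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    i = rect_area(inter)
    u = rect_area(a) + rect_area(b) - i
    return i / u if u > 0 else 0.0


def visual_jaccard(a: Set[int], b: Set[int]) -> float:
    """Assumed visual similarity: Jaccard over sets of visual words."""
    u = len(a | b)
    return len(a & b) / u if u else 0.0


def rovi_scan(users: Iterable[User], q_region: Rect, q_words: Set[int],
              tau_s: float, tau_v: float) -> List[str]:
    """Return ids of users meeting both thresholds (linear scan, no index)."""
    return [uid for uid, region, words in users
            if spatial_jaccard(region, q_region) >= tau_s
            and visual_jaccard(words, q_words) >= tau_v]
```

Under these assumed measures, raising either threshold can only shrink the result set, which is the property a spatial-first or visual-first filter can exploit to discard candidates early.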
Evaluation on the size of query region. We evaluate the effect of different query region sizes on the Flickr and ImageNet datasets, as shown in Figure 14. Figure 14(a) shows that the response times of SFI, VFI and QIV increase step by step as the query region size increases. The time cost of all three methods rises evidently at first, and after that the upward trends become very gentle. The performance of DI is almost unaffected by the change of query region size, and it is much lower than the performance of our method. In Figure 14(b), the growth trends of SFI and VFI are obvious, and at 5% their response times are almost the same. By comparison, the upward trend of QIV is moderate, and the performance of DI is almost unchanged as the region size increases. On both datasets, our method shows the best performance.
In this paper, we study a novel query problem named the region of visual interests (RoVI) query. Given a set of region of visual interests users, each of which contains geographical information and visual information, a RoVI query aims to find the users that are similar to the query in terms of both geographical similarity and visual similarity. We first define the RoVI query formally and then propose the geographical and visual similarity functions. To improve search efficiency, we design a novel spatial indexing technique called quadtree based inverted visual index and an efficient algorithm called region of visual interests search. In addition, we introduce three baselines to explain how existing techniques can be exploited to address this problem. The experimental evaluation on real geo-multimedia datasets shows that our solution outperforms the state-of-the-art methods.
Acknowledgments: This work was supported in part by the National Natural Science Foundation of China (61702560), projects (2018JJ3691, 2016JC2011) of the Science and Technology Plan of Hunan Province, and the Research and Innovation Project of Central South University Graduate Students (2018zzts177, 2018zzts588).
- (1) Beckmann, N., Kriegel, H., Schneider, R., Seeger, B.: The r*-tree: An efficient and robust access method for points and rectangles. In: Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ, May 23-25, 1990., pp. 322–331 (1990)
- (2) Cao, X., Chen, L., Cong, G., Jensen, C.S., Qu, Q., Skovsgaard, A., Wu, D., Yiu, M.L.: Spatial keyword querying. In: Conceptual Modeling - 31st International Conference ER 2012, Florence, Italy, October 15-18, 2012. Proceedings, pp. 16–29 (2012)
- (3) Cao, X., Cong, G., Jensen, C.S., Ooi, B.C.: Collective spatial keyword querying. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011, pp. 373–384 (2011)
- (4) Charfi, N., Trichili, H., Alimi, A.M., Solaiman, B.: Bimodal biometric system for hand shape and palmprint recognition based on SIFT sparse representation. Multimedia Tools Appl. 76(20), 20457–20482 (2017)
- (5) Cong, G., Jensen, C.S., Wu, D.: Efficient retrieval of the top-k most relevant spatial web objects. PVLDB 2(1), 337–348 (2009)
- (6) Deng, K., Li, X., Lu, J., Zhou, X.: Best keyword cover search. IEEE Trans. Knowl. Data Eng. 27(1), 61–73 (2015)
- (7) Faloutsos, C.: Multiattribute hashing using gray codes. In: ACM SIGMOD International Conference on Management of Data, Washington, D.C., May, pp. 227–238 (1986)
- (8) Fan, J., Li, G., Zhou, L., Chen, S., Hu, J.: Seal: spatio-textual similarity search. PVLDB 5(9), 824–835 (2012)
- (9) Felipe, I.D., Hristidis, V., Rishe, N.: Keyword search on spatial databases. In: Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7-12, 2008, Cancún, Mexico, pp. 656–665 (2008)
- (10) Gargantini, I.: An effective way to represent quadtrees. Commun. ACM 25(12), 905–910 (1982)
- (11) Guo, L., Shao, J., Aung, H.H., Tan, K.: Efficient continuous top-k spatial keyword queries on road networks. GeoInformatica 19(1), 29–60 (2015)
- (12) Guo, T., Cao, X., Cong, G.: Efficient algorithms for answering the m-closest keywords query. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pp. 405–418 (2015)
- (13) Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMOD’84, Proceedings of Annual Meeting, Boston, Massachusetts, June 18-21, 1984, pp. 47–57 (1984)
- (14) Hariharan, R., Hore, B., Li, C., Mehrotra, S.: Processing spatial-keyword (SK) queries in geographic information retrieval (GIR) systems. In: 19th International Conference on Scientific and Statistical Database Management, SSDBM 2007, 9-11 July 2007, Banff, Canada, Proceedings, p. 16 (2007)
- (15) Hunter, G.M., Steiglitz, K.: Operations on images using quad trees. IEEE Trans. Pattern Anal. Mach. Intell. 1(2), 145–153 (1979)
- (16) Jing, Y., Baluja, S.: Visualrank: Applying pagerank to large-scale image search. IEEE Trans. Pattern Anal. Mach. Intell. 30(11), 1877–1890 (2008)
- (17) Karakasis, E.G., Amanatiadis, A., Gasteratos, A., Chatzichristofis, S.A.: Image moment invariants as local features for content based image retrieval using the bag-of-visual-words model. Pattern Recognition Letters 55, 22–27 (2015)
- (18) Ke, Y., Sukthankar, R.: PCA-SIFT: A more distinctive representation for local image descriptors. In: 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), with CD-ROM, 27 June - 2 July 2004, Washington, DC, USA, pp. 506–513 (2004)
- (19) Lee, K.C.K., Lee, W., Zheng, B., Tian, Y.: ROAD: A new spatial object search framework for road networks. IEEE Trans. Knowl. Data Eng. 24(3), 547–560 (2012). DOI 10.1109/TKDE.2010.243. URL https://doi.org/10.1109/TKDE.2010.243
- (20) Lew, M.S., Sebe, N., Djeraba, C., Jain, R.: Content-based multimedia information retrieval: State of the art and challenges. TOMCCAP 2(1), 1–19 (2006)
- (21) Li, Z., Lee, K.C.K., Zheng, B., Lee, W., Lee, D.L., Wang, X.: Ir-tree: An efficient index for geographic document search. IEEE Trans. Knowl. Data Eng. 23(4), 585–599 (2011)
- (22) Liu, Y., Liu, S., Wang, Z.: Multi-focus image fusion with dense SIFT. Information Fusion 23, 139–155 (2015)
- (23) Lowe, D.G.: Object recognition from local scale-invariant features. In: ICCV, pp. 1150–1157 (1999)
- (24) Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
- (25) Mortensen, E.N., Deng, H., Shapiro, L.G.: A SIFT descriptor with global context. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20-26 June 2005, San Diego, CA, USA, pp. 184–190 (2005)
- (26) Morton, G.M.: A computer oriented geodetic data base and a new technique in file sequencing. Tech. rep., IBM Ltd., Ottawa, Canada (1966)
- (27) Nam, S., Kim, W., Mun, S., Hou, J., Choi, S., Lee, H.: A SIFT features based blind watermarking for DIBR 3d images. Multimedia Tools Appl. 77(7), 7811–7850 (2018)
- (28) Rocha-Junior, J.B., Gkorgkas, O., Jonassen, S., Nørvåg, K.: Efficient processing of top-k spatial keyword queries. In: Advances in Spatial and Temporal Databases - 12th International Symposium, SSTD 2011, Minneapolis, MN, USA, August 24-26, 2011, Proceedings, pp. 205–222 (2011)
- (29) Rocha-Junior, J.B., Nørvåg, K.: Top-k spatial keyword queries on road networks. In: 15th International Conference on Extending Database Technology, EDBT ’12, Berlin, Germany, March 27-30, 2012, Proceedings, pp. 168–179 (2012)
- (30) dos Santos, J.M., de Moura, E.S., da Silva, A.S., da Silva Torres, R.: Color and texture applied to a signature-based bag of visual words method for image retrieval. Multimedia Tools Appl. 76(15), 16855–16872 (2017)
- (31) Sivic, J., Zisserman, A.: Video google: A text retrieval approach to object matching in videos. In: 9th IEEE International Conference on Computer Vision (ICCV 2003), 14-17 October 2003, Nice, France, pp. 1470–1477 (2003)
- (32) Su, M., Ma, Y., Zhang, X., Wang, Y., Zhang, Y.: Mbr-sift: A mirror reflected invariant feature descriptor using a binary representation for image matching. PLoS ONE 12(5) (2017)
- (33) Wan, J., Wang, D., Hoi, S.C., Wu, P., Zhu, J., Zhang, Y., Li, J.: Deep learning for content-based image retrieval: A comprehensive study. In: Proceedings of the ACM International Conference on Multimedia, MM ’14, Orlando, FL, USA, November 03 - 07, 2014, pp. 157–166 (2014)
- (34) Wang, F., Wang, H., Li, H., Zhang, S.: Large scale image retrieval with practical spatial weighting for bag-of-visual-words. In: Advances in Multimedia Modeling, 19th International Conference, MMM 2013, Huangshan, China, January 7-9, 2013, Proceedings, Part I, pp. 513–523 (2013)
- (35) Wang, X., Zhang, Y., Zhang, W., Lin, X., Wang, W.: Ap-tree: Efficiently support continuous spatial-keyword queries over stream. In: 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015, pp. 1107–1118 (2015)
- (36) Wang, Y., Lin, X., Wu, L., Zhang, W.: Effective multi-query expansions: Robust landmark retrieval. In: Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM ’15, Brisbane, Australia, October 26 - 30, 2015, pp. 79–88 (2015)
- (37) Wang, Y., Lin, X., Wu, L., Zhang, W.: Effective multi-query expansions: Collaborative deep networks for robust landmark retrieval. IEEE Trans. Image Processing 26(3), 1393–1404 (2017)
- (38) Wang, Y., Lin, X., Wu, L., Zhang, W., Zhang, Q.: Exploiting correlation consensus: Towards subspace clustering for multi-modal data. In: Proceedings of the ACM International Conference on Multimedia, MM ’14, Orlando, FL, USA, November 03 - 07, 2014, pp. 981–984 (2014)
- (39) Wang, Y., Lin, X., Wu, L., Zhang, W., Zhang, Q.: LBMCH: learning bridging mapping for cross-modal hashing. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, August 9-13, 2015, pp. 999–1002 (2015)
- (40) Wang, Y., Lin, X., Wu, L., Zhang, W., Zhang, Q., Huang, X.: Robust subspace clustering for multi-view data by exploiting correlation consensus. IEEE Trans. Image Processing 24(11), 3939–3949 (2015)
- (41) Wang, Y., Lin, X., Zhang, Q.: Towards metric fusion on multi-view data: a cross-view based graph random walk approach. In: 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 27 - November 1, 2013, pp. 805–810 (2013)
- (42) Wang, Y., Lin, X., Zhang, Q., Wu, L.: Shifting hypergraphs by probabilistic voting. In: Advances in Knowledge Discovery and Data Mining - 18th Pacific-Asia Conference, PAKDD 2014, Tainan, Taiwan, May 13-16, 2014. Proceedings, Part II, pp. 234–246 (2014)
- (43) Wang, Y., Wu, L.: Beyond low-rank representations: Orthogonal clustering basis reconstruction with optimized graph structure for multi-view spectral clustering. Neural Networks 103, 1–8 (2018)
- (44) Wang, Y., Wu, L., Lin, X., Gao, J.: Multiview spectral clustering via structured low-rank matrix factorization. IEEE Trans. Neural Networks and Learning Systems (2018)
- (45) Wang, Y., Zhang, W., Wu, L., Lin, X., Fang, M., Pan, S.: Iterative views agreement: An iterative low-rank based structured optimization method to multi-view spectral clustering. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pp. 2153–2159 (2016)
- (46) Wang, Y., Zhang, W., Wu, L., Lin, X., Zhao, X.: Unsupervised metric fusion over multiview data by graph random walk-based cross-view diffusion. IEEE Trans. Neural Netw. Learning Syst. 28(1), 57–70 (2017)
- (47) Wu, L., Wang, Y.: Robust hashing for multi-view data: Jointly learning low-rank kernelized similarity consensus and hash functions. Image Vision Comput. 57, 58–66 (2017)
- (48) Wu, L., Wang, Y., Gao, J., Li, X.: Deep adaptive feature embedding with local sample distributions for person re-identification. Pattern Recognition 73, 275–288 (2018)
- (49) Wu, L., Wang, Y., Gao, J., Li, X.: Where-and-when to look: Deep siamese attention networks for video-based person re-identification. arXiv:1808.01911 (2018)
- (50) Wu, L., Wang, Y., Ge, Z., Hu, Q., Li, X.: Structured deep hashing with convolutional neural networks for fast person re-identification. Computer Vision and Image Understanding 167, 63–73 (2018)
- (51) Wu, L., Wang, Y., Li, X., Gao, J.: Deep attention-based spatially recursive networks for fine-grained visual recognition. IEEE Trans. Cybernetics (2018)
- (52) Wu, L., Wang, Y., Li, X., Gao, J.: What-and-where to match: Deep spatially multiplicative integration networks for person re-identification. Pattern Recognition 76, 727–738 (2018)
- (53) Wu, L., Wang, Y., Shao, L.: Cycle-consistent deep generative hashing for cross-modal retrieval. CoRR abs/1804.11013 (2018)
- (54) Wu, L., Wang, Y., Shepherd, J.: Efficient image and tag co-ranking: a bregman divergence optimization method. In: ACM Multimedia Conference, MM ’13, Barcelona, Spain, October 21-25, 2013, pp. 593–596 (2013)
- (55) Zhang, C., Zhang, Y., Zhang, W., Lin, X.: Inverted linear quadtree: Efficient top K spatial keyword search. IEEE Trans. Knowl. Data Eng. 28(7), 1706–1721 (2016)
- (56) Zhang, C., Zhang, Y., Zhang, W., Lin, X., Cheema, M.A., Wang, X.: Diversified spatial keyword search on road networks. In: Proceedings of the 17th International Conference on Extending Database Technology, EDBT 2014, Athens, Greece, March 24-28, 2014., pp. 367–378 (2014)
- (57) Zhang, D., Tan, K., Tung, A.K.H.: Scalable top-k spatial keyword search. In: Joint 2013 EDBT/ICDT Conferences, EDBT ’13 Proceedings, Genoa, Italy, March 18-22, 2013, pp. 359–370 (2013)
- (58) Zheng, K., Su, H., Zheng, B., Shang, S., Xu, J., Liu, J., Zhou, X.: Interactive top-k spatial keyword queries. In: 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015, pp. 423–434 (2015)