CNN-VWII: An Efficient Approach for Large-Scale Video Retrieval by Image Queries

10/14/2018 ∙ by Chengyuan Zhang, et al. ∙ Central South University

This paper aims to solve the problem of large-scale video retrieval by a query image. First, we define the problem of the top-k image to video query. Then, we combine the merits of convolutional neural networks (CNN for short) and the Bag of Visual Words (BoVW for short) model to design a model for extracting and representing video frame information. To meet the requirements of large-scale video retrieval, we propose a visual weighted inverted index (VWII for short) and a related algorithm to improve the efficiency and accuracy of the retrieval process. Comprehensive experiments show that our proposed technique achieves substantial improvements (up to an order of magnitude speed-up) over state-of-the-art techniques with similar accuracy.




1 Introduction

With advances in Internet technologies and the proliferation of smartphones, digital cameras and storage devices, a rapidly growing amount of video-related data is being collected in many applications, such as video search, recommendation, sharing, broadcasting and advertising websites. As a result, various visual search applications have emerged in recent years, such as image to video retrieval, video to image retrieval and video to video retrieval.

In this paper, we investigate the problem of retrieving the top-$k$ most relevant videos from a large video database by using an image as the query; that is, given a set of videos and a query image, we aim to retrieve the $k$ most relevant videos, each of which contains frames that are similar to the query image. For example, a user can find the video of a movie on the Internet from a snapshot, without knowing its name. Another application is searching for a lecture video online by using a slide.

Considerable effort has been invested by the research community in the image to video retrieval problem. The simplest solution is to consider each frame in a video as an individual image. Thus, the problem of detecting whether an image is contained in a video is converted into the problem of deciding whether an image is similar to another image. Although this solution can achieve high accuracy, a large number of unnecessary image to image comparisons are carried out, which is time-consuming and prohibitive for large-scale video databases.

Most recently, replacing images by video clips has shown great success in image to video retrieval in de Araújo and Girod (2018). The search space is thereby reduced from all frames in the video database to a small group of video clips, which significantly improves the scalability of the retrieval process. Specifically, the query image's descriptor is first compared to a Bloom filter index that contains video clip level information. Then, the frames belonging to the most promising video clips are examined one by one to calculate their similarities with the query image. Although the Bloom filter technique can effectively improve the retrieval efficiency, it still faces the following challenges. First, comparing video clip level Bloom filters one by one is still unacceptable for large video databases. Second, the false positive probability of a Bloom filter increases as the number of frames in a video clip increases. Third, a Bloom filter at the video clip level cannot preserve the aggregate information from the frame level well.

This work aims to improve the performance of video retrieval by a query image in a large-scale video database. First, we introduce the definition of the image to video query, and devise the visual similarity function to measure the visual similarity between a query image and a frame of a video. Based on these notions, we propose a novel solution to improve the performance of video retrieval by a query image, which is a combination of CNNs, bag-of-visual-words and the inverted index technique. Specifically, we utilize a convolutional neural network to construct the visual feature extractor in our solution. All the frames of the videos in the database are fed into this extractor, and the extracted features are used to generate the BoVW model of visual representation. The frames of videos are grouped by the K-means algorithm to reduce the size of the index significantly. Besides, a novel indexing structure named VWII and an efficient search algorithm are developed to boost the search performance.

Contributions. The main contributions of this work can be summarized as follows:

  • We introduce the formal definition of the image to video top-$k$ query problem, and present the similarity function to measure the visual similarity between a query image and a frame of a video.

  • We propose a novel solution to improve the efficiency and precision of the query, which is based on convolutional neural networks, bag-of-visual-words and a visual weighted inverted index.

  • We devise an efficient search method named the VWII Search algorithm, based on the visual weighted inverted index technique. This algorithm can boost the performance of video search in a large-scale database.

  • We have conducted an extensive performance evaluation on two video datasets. Experimental results demonstrate that our approach outperforms the state-of-the-art method.

Roadmap. In the remainder of this paper, Section 2 introduces the related work. In Section 3, we propose the definition of the image to video top-$k$ query and give the visual similarity function. In Section 4, we introduce our model to represent videos and query images and to measure the similarity between an image and a video. In Section 5, we devise a novel indexing structure named VWII and present an efficient search algorithm. Section 6 presents the experimental results on two video datasets. In Section 7, we conclude the paper.

2 Related Work

In this section, we introduce the related work on large-scale video retrieval, which includes the video to video problem, the image to video problem and the video to image problem.

Video to Video Problem. The video to video problem is one of the most common search problems in the area of video retrieval, such as Wu et al. (2018a). It aims at finding the videos in the database that are most similar in terms of visual content. Many visual representation and search techniques have been proposed by researchers in recent years. Bag-of-words (BoW for short) is one of the traditional methods for image and video representation; it was proposed by Sivic and Zisserman (2003), who used this model to index videos throughout a movie database. Ulutas et al. (2018) presented a novel approach to solve the problem of frame duplication detection in a video. They utilized the BoW model to generate visual words and build a dictionary from Scale-Invariant Feature Transform (SIFT for short) keypoints of frames in a video. They adopted Hierarchical $k$-means (HKM) to generate a large vocabulary tree for quantization, and represented each video clip and query topic by a BoW model. Wang et al. (2012) integrated spatial-temporal information with the BoW model to improve computational efficiency significantly. Specifically, they modeled the pair-wise spatial-temporal correlations by a Gaussian distribution and devised a novel similarity measurement that emphasizes the visual words that are discriminative for the query. Jiang and Ngo (2008) introduced a novel method extending the BoW model to reduce the effect of visual word correlation. In their approach, a visual ontology was generated to model the relationship between visual words, in which visual relatedness was defined strictly and then incorporated into the BoW model.

Due to the great advantages of deep learning technology in multimedia retrieval, such as Wu et al. (2018e), more and more researchers utilize deep neural networks such as CNNs for the task of video retrieval, such as Wu et al. (2018b). Podlesnaya and Podlesnyy (2016) utilized a CNN to extract visual features of videos, which serve as a universal signature for retrieval tasks. For the problem of near-duplicate video retrieval, which is one type of video to video problem, Kordopatis-Zilos et al. (2017) proposed a novel scheme that uses CNN features from intermediate layers to create discriminative global video representations, integrated with a deep metric learning framework. Lou et al. (2017) used novel deep-learning features in the CDVA evaluation framework for the video retrieval problem. Specifically, they devised a Nested Invariance Pooling (NIP for short) method to generate compact and robust CNN descriptors. Gu et al. (2016) combined CNNs and Long Short-Term Memory (LSTM for short) networks to implement Supervised Recurrent Hashing (SRH for short) to overcome the challenge of large-scale video retrieval.

Image to Video Problem. For the image to video problem, many previous studies were inspired by the solutions for traditional image retrieval tasks. Sivic et al. (2006) presented a novel method which automatically generates object representations for the task of image query, that is, returning the objects of interest in video shots. For the task of searching relevant video segments, Zhu and Satoh (2012); Wang et al. (2015b); Wang and Wu (2018) formulated this problem with a large-vocabulary quantization based Bag-of-Words framework. de Araújo et al. (2015) proposed to solve the problem of large-scale video retrieval by image query based on binarized Fisher Vectors. They presented an asymmetric comparison scheme to improve the mean average precision. Besides, they developed several shot-based aggregation techniques to achieve retrieval performance with a 3X speed-up. For the task of face video retrieval by image query, Li et al. (2015) presented Hashing across Euclidean space and Riemannian manifold, based on a unified framework that integrates the two heterogeneous spaces into a common discriminant Hamming space. The intra-space and inter-space Hamming distances are then optimized in a max-margin framework to generate hash functions. Zhu et al. (2017) introduced a joint feature projection matrix and heterogeneous dictionary pair learning (PHDL for short) approach to solve the problem of image to video person re-identification, which is an important technique in video surveillance. This approach learns an intra-video projection matrix and a pair of heterogeneous image and video dictionaries, with which the heterogeneous visual features can be turned into coding coefficients. These coding coefficients can be used to implement visual matching. Tang et al. (2012); Wu et al. (2018f); Wang et al. (2017b, 2018, 2017a, 2015a, 2016) investigated the problem of adapting detectors from image to video and introduced a novel approach to overcome this challenge. They classify tracks to discover examples in the unlabeled videos and leverage temporal continuity. In addition, they designed a new self-paced domain adaptation algorithm to iteratively adapt from the source domain to the target domain. de Araújo and Girod (2018); Wu et al. (2018d, c) developed an asymmetric comparison method using Fisher vectors for large-scale video retrieval by image query. They devised novel video descriptors which can be compared directly with image descriptors.

3 Problem Definition

In this section, we introduce the definition of video retrieval by image query and related notions. Table 1 summarizes the mathematical notations used throughout this paper to facilitate the discussion of our work.

Notation          Definition
  $\mathcal{D}$     A given database of videos
  $|\mathcal{D}|$   The number of videos in $\mathcal{D}$
  $V$               A video
  $f$               A frame in a video
  $F_V$             The frame set of video $V$
  $q$               A query image
  $W_q$             The visual word set of image $q$
  $Q$               An image to video query
  $\mathcal{R}$     The result set of $Q$
  $Sim(q, V)$       The similarity between $q$ and $V$
  $w$               A visual word
  $m$               The number of visual words in image $q$
  $n$               The number of visual words in frame $f$
  $VSim(q, f)$      The visual similarity between $q$ and $f$
  $\omega(w)$       The weight of visual word $w$
  $\mathcal{C}_V$   The frame cluster set of video $V$
  $\mathcal{W}$     The visual dictionary
Table 1: The summary of notations

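Top-$k$ Image to Video Query. Let $\mathcal{D} = \{V_1, V_2, \dots, V_{|\mathcal{D}|}\}$ be a video database which contains $|\mathcal{D}|$ videos, where each $V_i$ is a video containing successive frames. Let $q$ be a query image and $k$ be a positive integer. A top-$k$ image to video query is denoted as $Q(q, k)$, which aims to return a set $\mathcal{R}$ containing $k$ videos, each of which contains the visual content of $q$. Formally, we denote $\mathcal{R}$ as,

$$\mathcal{R} \subseteq \mathcal{D}, \quad |\mathcal{R}| = k, \quad \forall V \in \mathcal{R},\ \forall V' \in \mathcal{D} \setminus \mathcal{R}:\ Sim(q, V) \ge Sim(q, V'),$$

where $Sim(q, V)$ represents a similarity function which measures the visual similarity between an image $q$ and a video $V$. The more similar $q$ and $V$ are in terms of visual content, the higher the value of $Sim(q, V)$ is.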
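As a minimal sketch of the query semantics only, the following brute-force baseline scores every video and keeps the $k$ best. The names `top_k_videos` and `max_frame_overlap` are hypothetical, and the video-level similarity used here (the best unweighted word overlap over a video's frames) is an assumption made for illustration, not the similarity function of this paper.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple


def max_frame_overlap(query_words: Set[str], frames: List[Set[str]]) -> float:
    """Assumed video-level similarity: best Jaccard overlap over the frames."""
    return max(
        (len(query_words & f) / len(query_words | f)
         for f in frames if query_words | f),
        default=0.0,
    )


def top_k_videos(
    query_words: Set[str],
    videos: Dict[str, List[Set[str]]],
    sim: Callable[[Set[str], List[Set[str]]], float],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every video and return the k (video_id, score) pairs with the highest score."""
    scored = ((vid, sim(query_words, frames)) for vid, frames in videos.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    videos = {"v1": [{"a", "b"}, {"c"}], "v2": [{"a"}], "v3": [{"x"}]}
    print(top_k_videos({"a", "b"}, videos, max_frame_overlap, 2))
```

This exhaustive scan touches every video, which is exactly the cost that an index is meant to avoid on a large database.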

In this work, we use the traditional visual representation technique named Bag-of-Visual-Words (BoVW for short) to represent the visual content of images and videos. Specifically, the query image $q$ is encoded by a bag-of-words vector with several visual words, denoted as $W_q = \{w_1, w_2, \dots, w_m\}$, where $m$ represents the number of visual words. In the same way, a frame $f$ in a video can be denoted as $W_f = \{w_1, w_2, \dots, w_n\}$.

Visual Similarity between image and frame. Given an image $q$ and a video $V$ with a frame $f \in F_V$, the visual similarity between $q$ and $f$ is measured by the following similarity function:


where $\omega(w)$ is the weight of the visual word $w$. Inspired by the solution in text mining, in this work we measure the weight of a visual word by term frequency-inverse document frequency (TF-IDF), shown as follows:

$$\omega(w) = tf(w) \cdot \log \frac{N}{N_w},$$

where $tf(w)$ is the word frequency of $w$ in the current visual word dictionary, $N$ is the total number of frames in the database, and $N_w$ represents the total number of frames containing $w$ in the database.
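The weighting can be sketched as follows. This is an illustration under stated assumptions: frames are given as lists of visual word occurrences, the term frequency is taken as the share of a word among all occurrences in the collection, and the inverse document frequency uses the plain logarithm of total frames over frames containing the word. The function `weighted_overlap` is a weighted Jaccard measure used only as a stand-in for a weighted similarity; it is not claimed to be the similarity function of this paper. All names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def tfidf_weights(frames: List[List[str]]) -> Dict[str, float]:
    """TF-IDF weight per visual word, treating each frame as a document."""
    occurrences = Counter(w for frame in frames for w in frame)
    total = sum(occurrences.values())
    frame_freq = Counter(w for frame in frames for w in set(frame))
    n_frames = len(frames)
    return {
        w: (occurrences[w] / total) * math.log(n_frames / frame_freq[w])
        for w in occurrences
    }


def weighted_overlap(query_words: Set[str], frame_words: Set[str],
                     weights: Dict[str, float]) -> float:
    """Stand-in similarity: weight of shared words over weight of all words."""
    shared = sum(weights.get(w, 0.0) for w in query_words & frame_words)
    total = sum(weights.get(w, 0.0) for w in query_words | frame_words)
    return shared / total if total > 0 else 0.0


if __name__ == "__main__":
    weights = tfidf_weights([["a", "a", "b"], ["a"], ["c"]])
    print(weights)
    print(weighted_overlap({"a", "b"}, {"b"}, weights))
```

A visual word that occurs in every frame receives weight zero under this scheme, so it does not contribute to the similarity.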

4 Our Model

In this section, we propose our model for visual feature extraction and representation based on convolutional neural networks and the bag-of-visual-words model. First, we present an overview of this model. Then, we introduce the visual feature extraction by CNNs and present how to represent a video based on a visual dictionary constructed from the video database.

Figure 1: The overview of our model

4.1 Overview of Our Model

Fig. 1 shows the overview of our model for visual feature extraction and video representation using CNNs and the BoVW model. It aims to extract the visual features of videos from a database and then construct a visual dictionary by using a clustering method. We represent each video as a set of blocks, in which each block consists of several frames that are similar to each other. More concretely, given a video database $\mathcal{D}$, for each video $V$ we construct a frame set $F_V$ by sampling frames from the video at fixed time intervals. After frame selection, all the frames in $F_V$ are fed into the feature extractor, implemented by convolutional neural networks, to extract the visual features. For each video, the CNN extracts the low-level visual features and constructs a feature vector per frame; all the vectors are then put into the BoVW module to train the visual dictionary $\mathcal{W}$ by using clustering methods. Based on the visual dictionary $\mathcal{W}$, the frames in each video are clustered into several frame clusters according to their visual similarity, and the video is then represented by a series of frame clusters, denoted as $\mathcal{C}_V = \{C_1, C_2, \dots, C_l\}$, where each $C_i \subseteq F_V$.

For the task of query image representation, we treat the query image as a frame of a video. That is, we directly feed it into the feature extractor and then use the same technique mentioned above to generate the visual word vector $W_q$.

4.2 The Feature Extractor

We utilize convolutional neural networks to implement the visual feature extractor. CNNs are among the most significant neural networks and have become a hotspot in fields such as image classification, computer vision and speech analysis.

In this work, we use the AlexNet architecture to construct our model, which is an important deep neural network model for the task of image recognition introduced by Krizhevsky et al. (2012). AlexNet has 650,000 neurons and 60 million parameters. It consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. The activation functions of the neurons in this network are Rectified Linear Units (ReLUs). A 4096-dimensional feature vector can be generated from the first seven layers (the five convolutional layers and the first two fully-connected layers) of the network. For an input video $V$, this CNN-based feature extractor can be seen as a non-linear function $\phi$, which returns a visual feature vector for an input frame $f$ of the video $V$, i.e.,

$$\phi(f) = (x_1, x_2, \dots, x_{4096}),$$

where $x_i$ represents the $i$-th dimension of the visual feature of $f$. Accordingly, the output of the feature extractor for a video $V$ is a matrix of size $|F_V| \times 4096$.

Each frame is resized to a fixed resolution before being fed into the network. The second-to-last fully-connected layer has 4096 neural units, and its 4096-dimensional output is used as the input of the BoVW module.

4.3 The BoVW Module

Similar to the processing in text similarity computation, the BoVW module aims at constructing a visual dictionary for the target video database based on the output of the feature extractor. We use the K-means clustering method to convert visual features into visual words. As one of the most widely used unsupervised learning algorithms, K-means accepts an unlabeled data set and clusters the data into different groups. Given the visual feature vector set $X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$, the K-means algorithm partitions the sample set into $K$ clusters $\{G_1, G_2, \dots, G_K\}$; the objective function can be defined as follows:

$$E = \sum_{i=1}^{K} \sum_{\mathbf{x} \in G_i} \lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert_2^2,$$

where $\boldsymbol{\mu}_i$ is the mean vector of the cluster $G_i$, i.e., $\boldsymbol{\mu}_i = \frac{1}{|G_i|} \sum_{\mathbf{x} \in G_i} \mathbf{x}$.
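The dictionary construction can be illustrated with a small, dependency-free version of Lloyd's iteration for K-means. It is a sketch under simple assumptions (random initial centers drawn from the data, a fixed number of iterations, empty clusters keep their previous center); the names `kmeans` and `nearest` are hypothetical, and a real system would cluster 4096-dimensional features with an optimized library rather than this toy code.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Vector, centers: List[Vector]) -> int:
    """Index of the closest center, i.e. the visual word id assigned to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def kmeans(points: List[Vector], k: int, iterations: int = 20, seed: int = 0) -> List[List[float]]:
    """Lloyd's algorithm: alternate assignment and mean update to reduce the objective E."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers


if __name__ == "__main__":
    features = [(0, 0), (0, 1), (10, 10), (10, 11)]
    dictionary = sorted(kmeans(features, k=2))
    print(dictionary)                    # the cluster means act as visual words
    print(nearest((9, 9), dictionary))   # quantize a new feature to a word id
```

Each center plays the role of one visual word, and `nearest` quantizes a frame feature (or a query image feature) to the id of its closest word.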

4.4 Frame Clusters Construction

In order to speed up retrieval, we improve the frame-based video representation and propose a novel representation named frame clusters. Specifically, for a video $V$, we cluster its frames according to their visual similarity by using the K-means algorithm. After the clustering, all the frames of $V$ are grouped into $l$ frame clusters, denoted as $\mathcal{C}_V = \{C_1, C_2, \dots, C_l\}$, where each $C_i \subseteq F_V$. For each frame cluster $C_i$, we can construct its visual word set in the following manner:


In the process of video retrieval, we only need to compute the visual similarity between the query image and each frame cluster of each video. We set a threshold $\tau$ to measure the similarity. That is, for a query image $q$ and a frame cluster $C_i \in \mathcal{C}_V$, if $VSim(q, C_i) \ge \tau$, then $C_i$ is similar to $q$.
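The following sketch illustrates the threshold test on frame clusters. It rests on two assumptions that are not taken from the text: the visual word set of a cluster is formed as the union of the word sets of its frames, and the similarity is a weighted Jaccard overlap. The names `cluster_word_set` and `similar_clusters` are hypothetical.

```python
from typing import Dict, List, Set


def cluster_word_set(frame_word_sets: List[Set[int]]) -> Set[int]:
    """Assumed construction: a cluster's words are the union of its frames' words."""
    return set().union(*frame_word_sets)


def weighted_overlap(a: Set[int], b: Set[int], weights: Dict[int, float]) -> float:
    """Stand-in similarity: weight of shared words over weight of all words."""
    shared = sum(weights.get(w, 0.0) for w in a & b)
    total = sum(weights.get(w, 0.0) for w in a | b)
    return shared / total if total > 0 else 0.0


def similar_clusters(query_words: Set[int],
                     clusters: Dict[str, List[Set[int]]],
                     weights: Dict[int, float],
                     threshold: float) -> List[str]:
    """Ids of the frame clusters whose similarity to the query reaches the threshold."""
    return [
        cid for cid, frames in clusters.items()
        if weighted_overlap(query_words, cluster_word_set(frames), weights) >= threshold
    ]


if __name__ == "__main__":
    clusters = {"c1": [{1, 2}, {2, 3}], "c2": [{7}]}
    weights = {1: 1.0, 2: 1.0, 3: 2.0, 7: 1.0}
    print(similar_clusters({1, 2}, clusters, weights, threshold=0.5))
```

Comparing the query against one word set per cluster, instead of one per frame, is what reduces the number of similarity computations.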

5 Visual Weighted Inverted Index

Figure 2: VWII structure

In order to improve the search performance in a large-scale video database, we propose to construct an efficient index for videos and develop an algorithm based on it. In this section, we present a novel indexing structure named VWII, which is a combination of the visual words of videos and an inverted index. First we introduce the framework of our indexing structure, and then we propose our search algorithm.

5.1 Our Index Framework: VWII

We model the image to video query as a visual $\tau$-aggregation problem, which is inspired by the aggregation problem in the area of databases, Fagin et al. (2001); Zhang et al. (2014).

Visual $\tau$-Aggregation Problem. Consider a video frame set $F$ where each frame $f \in F$ has $m$ visual words. Let $g$ be a monotonic aggregation function, where $g(f)$ denotes the overall visual score of frame $f$. Given a similarity threshold $\tau$, the visual $\tau$-aggregation problem aims to return the set of frames whose visual score is larger than or equal to $\tau$, i.e., return a set $S$ in which $S \subseteq F$ and $\forall f \in S$, $g(f) \ge \tau$.

It is natural to define the image to video query problem as a visual $\tau$-aggregation problem. For a large-scale video database $\mathcal{D}$, frames can be selected by frame sampling techniques and a frame set $F$ can be constructed from them. In other words, the database $\mathcal{D}$ is straightforwardly modeled by $F$. Given a query image $q$, the similarity between $q$ and a frame $f \in F$ can be measured by the visual similarity function $VSim(q, f)$ and the similarity threshold $\tau$. For a frame $f$, if $VSim(q, f) \ge \tau$, then $f$ can be added into $S$. Therefore, when the search terminates, we know that $S \subseteq F$ and, $\forall f \in S$, $VSim(q, f) \ge \tau$.

By defining the image to video query problem as a visual $\tau$-aggregation problem, we can devise an indexing structure which depends on the inverted index, rather than on complex hybrid methods, to solve this problem efficiently. Thus, we integrate the visual words of frames and the inverted index technique to construct a novel index named Visual Weighted Inverted Index (VWII for short). Fig. 2 illustrates an example of the VWII structure, combining visual words and the inverted indexing structure used in the image to video query. Given a query image $q$ with $m$ visual words, a list set $\mathcal{L} = \{L_1, L_2, \dots, L_m\}$ containing $m$ sorted lists is generated, and each list $L_i$ is sorted in descending order based on the score of the $i$-th visual word.
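A minimal sketch of such an index is given below, assuming that each frame cluster carries a per-word score (for example a weight derived from TF-IDF) and that one posting list is kept per visual word. The layout and the names `build_inverted_index` and `query_lists` are illustrative assumptions rather than the exact structure of Fig. 2.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[str, float]  # (frame cluster id, score of the word in that cluster)


def build_inverted_index(clusters: Dict[str, Dict[int, float]]) -> Dict[int, List[Posting]]:
    """Map each visual word to its postings, sorted by descending score."""
    index: Dict[int, List[Posting]] = defaultdict(list)
    for cluster_id, word_scores in clusters.items():
        for word, score in word_scores.items():
            index[word].append((cluster_id, score))
    for postings in index.values():
        postings.sort(key=lambda entry: (-entry[1], entry[0]))  # ties broken by id
    return dict(index)


def query_lists(index: Dict[int, List[Posting]], query_words: List[int]) -> List[List[Posting]]:
    """One sorted list per query visual word; unknown words yield an empty list."""
    return [index.get(word, []) for word in query_words]


if __name__ == "__main__":
    clusters = {"c1": {1: 0.9, 2: 0.1}, "c2": {1: 0.4}, "c3": {2: 0.7}}
    index = build_inverted_index(clusters)
    print(query_lists(index, [2, 5]))
```

Because the lists are built once offline, a query only needs to look up the lists of its own visual words.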

5.2 Search Algorithm

Input:  A query image $q$, a positive integer $k$, the number of query visual words $m$.
Output:  A result set $\mathcal{R}$.
1:  Initializing: ;
2:  Initializing: ;
3:  ;
4:  for ;; do
5:      ;
6:  end for
7:  while  do
8:      for ;; do
9:          for ;; do
10:              ;
11:              ;
12:              ;
13:          end for
14:      end for
15:      for each available frame  do
16:          ;
17:          ;
18:      end for
19:      ;
20:  end while
21:  for each  do
22:      if  && is available then
23:          ;
24:          ;
25:      end if
26:  end for
27:  return ;
Algorithm 1 VWII Search Algorithm

Based on the VWII structure, we develop an efficient search algorithm named VWII Search to improve the search performance in large-scale video databases. In this subsection, we discuss this algorithm in detail.

In the VWII Search algorithm, we use a parameter $h$ to control the successive access. In each iteration, $h$ frame clusters in each list are accessed successively. For each accessed frame cluster $C$, $B(C)$ is defined as an upper bound of the aggregated score of this cluster. A frame cluster is considered available if its bound is greater than the highest score that has been computed so far. When an iteration ends, the available frame cluster with the maximum bound is chosen for random access to compute its aggregated score. Algorithm 1 shows the pseudo-code of the VWII Search algorithm.
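To make the interplay of successive (sorted) access, bounds and random access concrete, the sketch below implements a simplified threshold-style top-$k$ search over the sorted lists. It is not a transcription of Algorithm 1: it assumes non-negative scores, uses the sum of the per-word scores as the monotonic aggregation function, and performs random access for every newly seen frame cluster instead of only for the available cluster with the maximum bound. The name `vwii_search_sketch` and the parameter `batch` (the number of entries read per list in each iteration) are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Posting = Tuple[str, float]  # (frame cluster id, score), lists sorted by descending score


def vwii_search_sketch(lists: List[List[Posting]], k: int, batch: int = 2) -> List[Tuple[str, float]]:
    """Return the k clusters with the largest summed score, stopping early when safe."""
    lookup: List[Dict[str, float]] = [dict(postings) for postings in lists]  # random access
    exact: Dict[str, float] = {}  # cluster id -> exact aggregated score
    position = 0
    longest = max((len(postings) for postings in lists), default=0)
    while position < longest:
        # Successive access: read the next `batch` entries of every list.
        for postings in lists:
            for cluster_id, _ in postings[position:position + batch]:
                if cluster_id not in exact:
                    # Random access: fetch this cluster's score from every list.
                    exact[cluster_id] = sum(t.get(cluster_id, 0.0) for t in lookup)
        position += batch
        # No unseen cluster can score more than the sum of the next unread entries.
        bound = sum(p[position][1] if position < len(p) else 0.0 for p in lists)
        best = heapq.nlargest(k, exact.items(), key=lambda item: item[1])
        if len(best) == k and best[-1][1] >= bound:
            return best
    return heapq.nlargest(k, exact.items(), key=lambda item: item[1])


if __name__ == "__main__":
    lists = [
        [("c1", 0.75), ("c2", 0.5), ("c3", 0.125)],
        [("c3", 0.75), ("c1", 0.25), ("c2", 0.125)],
    ]
    print(vwii_search_sketch(lists, k=2, batch=1))
```

In the usage example, the search stops after one entry per list has been read, because the two clusters seen so far already score at least as high as any unseen cluster could.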

6 Performance Evaluation

Figure 3: Precision evaluation on the size of dataset. (a) YouTube-8M (b) Sports-1M
Figure 4: Precision evaluation on the number of visual words. (a) YouTube-8M (b) Sports-1M
Figure 5: Precision evaluation on the number of results. (a) YouTube-8M (b) Sports-1M
Figure 6: Efficiency evaluation on the size of dataset. (a) YouTube-8M (b) Sports-1M

In this section, we present the results of a comprehensive performance study on real video datasets to evaluate the accuracy, efficiency and scalability of the proposed techniques. Specifically, we evaluate the accuracy and efficiency of the following indexing techniques for the top-$k$ image to video query.

  • CNN-VWII. CNN-VWII is the CNN-based visual weighted inverted index technique proposed in this paper.

  • SIFT-VWII. SIFT-VWII is the visual weighted inverted index technique proposed in Section 5, combined with the traditional SIFT technique.

  • BF-PI. BF-PI is the state-of-the-art technique proposed in de Araújo and Girod (2018), designed for image to video retrieval.

Dataset. The performance of the algorithms is evaluated comprehensively on the following two video datasets. YouTube-8M is a large-scale labeled video dataset. It consists of 6.1 million YouTube video IDs with high-quality machine-generated annotations from a diverse vocabulary of more than 3800 visual entities. This dataset comes with precomputed audio-visual features from 2.6 billion frames and audio segments. Each video in it is between 120 and 500 seconds long. Sports-1M is another large-scale video dataset of 1 million YouTube videos belonging to a taxonomy of 487 classes of sports.

Workload. A workload for the top-$k$ image to video query consists of 100 queries. The query response time is employed to evaluate the performance of the algorithms. The video dataset size grows from 2000 to 8000 hours; the number of query visual words changes from 50 to 150; the number of results changes from 10 to 50. By default, the video dataset size, the number of query visual words and the number of results are set to 2000, 100 and 10 respectively. Experiments are run on a PC with an Intel Xeon 2.60GHz dual CPU and 16 GB of memory running Ubuntu. All algorithms in the experiments are implemented in Java and Python.

Precision evaluation on the size of dataset. We evaluate the effect of the dataset size on precision on YouTube-8M and Sports-1M. Fig. 3(a) illustrates that the precision of CNN-VWII, SIFT-VWII and BF-PI decreases as the size of YouTube-8M rises. Specifically, when the dataset size is 500 hours, the mAP of BF-PI is the highest among them. However, its precision drops faster than that of CNN-VWII as the dataset grows further. Clearly, our method, CNN-VWII, performs better as the scale of the dataset increases. On the Sports-1M dataset, shown in Fig. 3(b), the mAP of all three methods descends step by step, similar to the situation on YouTube-8M. However, the precision of CNN-VWII is the highest throughout. Besides, when the dataset size is 500, the mAP of SIFT-VWII is a little higher than that of BF-PI. As the dataset grows further, the precision of these two methods shows a downward and fluctuating trend and stays lower than that of our method.

Precision evaluation on the number of visual words. We evaluate the effect of the number of visual words on precision on YouTube-8M and Sports-1M. As BF-PI does not generate visual words, we only compare the results of CNN-VWII and SIFT-VWII. In Fig. 4(a), as the number of visual words increases, the mAP of both methods grows gradually. The mAP of our method is higher than that of SIFT-VWII throughout, owing to the advantage of CNN in visual representation. The situation on the Sports-1M dataset is a little different from that on YouTube-8M. Specifically, as shown in Fig. 4(b), when the number of visual words increases from 100 to 150, the precision of SIFT-VWII rises faster, and after that its growth rate slows down. When the number of visual words is 250, the mAP of SIFT-VWII is close to that of CNN-VWII.

Efficiency evaluation on the number of results $k$. Fig. 5 shows the results of the evaluation on the number of results $k$ on YouTube-8M and Sports-1M. The experimental results on YouTube-8M are shown in Fig. 5(a). As $k$ increases, the response time of BF-PI rises gradually and is much higher than that of CNN-VWII and SIFT-VWII. The performance of our method is better than that of SIFT-VWII over part of the range of $k$, owing to the advantage of CNN in visual feature extraction and representation, which makes it easier to find better results. On Sports-1M, illustrated in Fig. 5(b), CNN-VWII has the best performance among them, slightly better than SIFT-VWII. The response time of BF-PI is again the highest.

Efficiency evaluation on the size of dataset. We evaluate the effect of the dataset size on search efficiency on YouTube-8M and Sports-1M. Not surprisingly, Fig. 6(a) shows that the response time of CNN-VWII, SIFT-VWII and BF-PI goes up gradually as the dataset size rises. More concretely, while the dataset size increases to 2000, the response time of all these methods rises quickly. Beyond that, their growth becomes a little gentler. The efficiency of CNN-VWII and SIFT-VWII is much higher than that of BF-PI, because the VWII index greatly improves search efficiency. The trends of CNN-VWII and SIFT-VWII are very similar. Fig. 6(b) shows the evaluation on Sports-1M. As on YouTube-8M, the performance of CNN-VWII and SIFT-VWII is much better than that of BF-PI.

7 Conclusion

In this paper, we study the problem of top-$k$ video retrieval by a query image, which aims to return the $k$ most relevant videos from a large-scale video database for a query image. We define the top-$k$ image to video query formally and present the visual similarity function. To improve the retrieval precision and efficiency, we propose a novel model based on CNN and BoVW techniques for image and video representation. Besides, to boost the search efficiency, we design a novel indexing structure called VWII, which is a combination of visual words and an inverted index. Based on it, we propose a novel algorithm for the task of the top-$k$ image to video query. The experimental evaluation on two video datasets shows that our approach outperforms the state-of-the-art method.


This work was supported in part by the National Natural Science Foundation of China (61702560) and the Natural Science Foundation of Hunan Province (2018JJ3691, 2016JC2011).


  • de Araújo et al. (2015) de Araújo, A.F., Chaves, J., Angst, R., Girod, B., 2015. Temporal aggregation for large-scale query-by-image video retrieval, in: 2015 IEEE International Conference on Image Processing, ICIP 2015, Quebec City, QC, Canada, September 27-30, 2015, pp. 1519–1522.
  • de Araújo and Girod (2018) de Araújo, A.F., Girod, B., 2018. Large-scale video retrieval using image queries. IEEE Trans. Circuits Syst. Video Techn. 28, 1406–1420.
  • Fagin et al. (2001) Fagin, R., Lotem, A., Naor, M., 2001. Optimal aggregation algorithms for middleware, in: Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 21-23, 2001, Santa Barbara, California, USA.
  • Gu et al. (2016) Gu, Y., Ma, C., Yang, J., 2016. Supervised recurrent hashing for large scale video retrieval, in: Proceedings of the 2016 ACM Conference on Multimedia Conference, MM 2016, Amsterdam, The Netherlands, October 15-19, 2016, pp. 272–276.
  • Jiang and Ngo (2008) Jiang, Y., Ngo, C., 2008. Bag-of-visual-words expansion using visual relatedness for video indexing, in: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, Singapore, July 20-24, 2008, pp. 769–770.
  • Kordopatis-Zilos et al. (2017) Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., Kompatsiaris, Y., 2017. Near-duplicate video retrieval with deep metric learning, in: 2017 IEEE International Conference on Computer Vision Workshops, ICCV Workshops 2017, Venice, Italy, October 22-29, 2017, pp. 347–356.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pp. 1106–1114.
  • Li et al. (2015) Li, Y., Wang, R., Huang, Z., Shan, S., Chen, X., 2015. Face video retrieval with image query via hashing across euclidean space and riemannian manifold, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 4758–4767.
  • Lou et al. (2017) Lou, Y., Bai, Y., Lin, J., Wang, S., Chen, J., Chandrasekhar, V., Duan, L., Huang, T., Kot, A.C., Gao, W., 2017. Compact deep invariant descriptors for video retrieval, in: 2017 Data Compression Conference, DCC 2017, Snowbird, UT, USA, April 4-7, 2017, pp. 420–429.
  • Podlesnaya and Podlesnyy (2016) Podlesnaya, A., Podlesnyy, S., 2016. Deep learning based semantic video indexing and retrieval. CoRR abs/1601.07754.
  • Sivic et al. (2006) Sivic, J., Schaffalitzky, F., Zisserman, A., 2006. Object level grouping for video shots. International Journal of Computer Vision 67, 189–210.
  • Sivic and Zisserman (2003) Sivic, J., Zisserman, A., 2003. Video google: A text retrieval approach to object matching in videos, in: 9th IEEE International Conference on Computer Vision (ICCV 2003), 14-17 October 2003, Nice, France, pp. 1470–1477.
  • Tang et al. (2012) Tang, K.D., Ramanathan, V., Li, F., Koller, D., 2012. Shifting weights: Adapting object detectors from image to video, in: Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pp. 647–655.
  • Ulutas et al. (2018) Ulutas, G., Ustubioglu, B., Ulutas, M., Nabiyev, V.V., 2018. Frame duplication detection based on BoW model. Multimedia Syst. 24, 549–567.
  • Wang et al. (2012) Wang, L., Song, D., Elyan, E., 2012. Improving bag-of-visual-words model with spatial-temporal correlation for video retrieval, in: 21st ACM International Conference on Information and Knowledge Management, CIKM’12, Maui, HI, USA, October 29 - November 02, 2012, pp. 1303–1312.
  • Wang et al. (2015a) Wang, Y., Lin, X., Wu, L., Zhang, W., 2015a. Effective multi-query expansions: Robust landmark retrieval, in: Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM ’15, Brisbane, Australia, October 26 - 30, 2015, pp. 79–88.
  • Wang et al. (2017a) Wang, Y., Lin, X., Wu, L., Zhang, W., 2017a. Effective multi-query expansions: Collaborative deep networks for robust landmark retrieval. IEEE Transactions on Image Processing 26, 1393–1404.
  • Wang et al. (2015b) Wang, Y., Lin, X., Wu, L., Zhang, W., Zhang, Q., Huang, X., 2015b. Robust subspace clustering for multi-view data by exploiting correlation consensus. IEEE Transactions on Image Processing 24, 3939–3949.
  • Wang and Wu (2018) Wang, Y., Wu, L., 2018. Beyond low-rank representations: Orthogonal clustering basis reconstruction with optimized graph structure for multi-view spectral clustering. Neural Networks 103, 1–8.
  • Wang et al. (2018) Wang, Y., Wu, L., Lin, X., Gao, J., 2018. Multiview spectral clustering via structured low-rank matrix factorization. IEEE Transactions on Neural Networks and Learning Systems 29, 4833–4843.
  • Wang et al. (2016) Wang, Y., Zhang, W., Wu, L., Lin, X., Fang, M., Pan, S., 2016. Iterative views agreement: An iterative low-rank based structured optimization method to multi-view spectral clustering, in: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pp. 2153–2159.
  • Wang et al. (2017b) Wang, Y., Zhang, W., Wu, L., Lin, X., Zhao, X., 2017b. Unsupervised metric fusion over multiview data by graph random walk-based cross-view diffusion. IEEE Transactions on Neural Networks and Learning Systems 28, 57–70.
  • Wu et al. (2018a) Wu, L., Wang, Y., Gao, J., Li, X., 2018a. Deep adaptive feature embedding with local sample distributions for person re-identification. Pattern Recognition 73, 275–288.
  • Wu et al. (2018b) Wu, L., Wang, Y., Gao, J., Li, X., 2018b. Where-and-when to look: Deep siamese attention networks for video-based person re-identification. IEEE Transactions on Multimedia.
  • Wu et al. (2018c) Wu, L., Wang, Y., Ge, Z., Hu, Q., Li, X., 2018c. Structured deep hashing with convolutional neural networks for fast person re-identification. Computer Vision and Image Understanding 167, 63–73.
  • Wu et al. (2018d) Wu, L., Wang, Y., Li, X., Gao, J., 2018d. Deep attention-based spatially recursive networks for fine-grained visual recognition. IEEE Transactions on Cybernetics.
  • Wu et al. (2018e) Wu, L., Wang, Y., Li, X., Gao, J., 2018e. What-and-where to match: Deep spatially multiplicative integration networks for person re-identification. Pattern Recognition 76, 727–738.
  • Wu et al. (2018f) Wu, L., Wang, Y., Shao, L., 2018f. Cycle-consistent deep generative hashing for cross-modal retrieval. CoRR abs/1804.11013.
  • Zhang et al. (2014) Zhang, D., Chan, C., Tan, K., 2014. Processing spatial keyword query as a top-k aggregation query, in: The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’14, Gold Coast, QLD, Australia, July 06-11, 2014, pp. 355–364.
  • Zhu and Satoh (2012) Zhu, C., Satoh, S., 2012. Large vocabulary quantization for searching instances from videos, in: International Conference on Multimedia Retrieval, ICMR ’12, Hong Kong, China, June 5-8, 2012, p. 52.
  • Zhu et al. (2017) Zhu, X., Jing, X., Wu, F., Wang, Y., Zuo, W., Zheng, W., 2017. Learning heterogeneous dictionary pair with feature projection matrix for pedestrian video retrieval via single query image, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pp. 4341–4348.