Binary Subspace Coding for Query-by-Image Video Retrieval

by Ruicong Xu, et al.
The University of Queensland

The query-by-image video retrieval (QBIVR) task has been attracting considerable research attention recently. However, most existing methods represent a video by either aggregating or projecting all its frames into a single datum point, which may easily cause severe information loss. In this paper, we propose an efficient QBIVR framework that enables effective video search with an image query. We first define a similarity-preserving distance metric between an image and its orthogonal projection in the subspace of the video, which can be equivalently transformed to a Maximum Inner Product Search (MIPS) problem. To boost the efficiency of solving the MIPS problem, we further propose two asymmetric hashing schemes, which bridge the domain gap between images and videos. The first approach, termed Inner-product Binary Coding (IBC), preserves the inner relationships of images and videos in a common Hamming space. To further improve the retrieval efficiency, we devise a Bilinear Binary Coding (BBC) approach, which employs compact bilinear projections instead of a single large projection matrix. Extensive experiments on four real-world video datasets verify the effectiveness of the proposed approaches compared with state-of-the-art methods.



1 Introduction

In recent years, we have witnessed a tremendous explosion of multimedia data (e.g., images, videos) on the Web, driven by advances in digital cameras, high-speed Internet, massive storage, etc. Among all types of multimedia data, videos have been playing a significant role in reshaping the ways of recording daily life, self-expression and communication [31]. Video retrieval has therefore drawn considerable research attention owing to its extensive application value in social networking. One example is query-by-image video retrieval (QBIVR), which serves a variety of real-life applications, ranging from searching video lectures using a slide, to recommending relevant videos based on images, to searching news videos using photos [3]. Hashing [19, 29] has been proposed to efficiently solve the problem of multimedia retrieval.

However, the QBIVR task faces two challenges: a similarity-preserving measurement between images and videos, and an efficient retrieval method for huge datasets. Similarity preservation is closely connected to the feature representation of videos [30]. Considerable research endeavors [5, 17] have been dedicated to developing effective schemes for improving the global signature of the whole video, and promising subspace representations, such as a single linear subspace or a mixture of linear subspaces [25], affine subspaces [11], and covariance matrices [24], have demonstrated their superiority in underpinning a range of multimedia applications. Subspace representations of videos preserve the rich structural properties of visual objects, such as viewpoint, location, spatial transformation, movement, etc., which makes them superior to a single-point representation in a high-dimensional feature space. However, they also make the similarity-preserving measurement between subspace and point data even harder. One existing method is to compute the similarity between the query image and each frame of the video and then integrate these similarities by averaging or taking the maximum. This measurement suffers from high computational cost and massive storage, and it ignores the correlations among the video frames.

Meanwhile, to achieve fast large-scale retrieval, several powerful hashing methods have been proposed. However, they are unsatisfactory for the QBIVR task, because applying them is non-trivial given the different modalities of videos and images. By projecting each video (i.e., subspace) into a datum point in a certain high-dimensional space, Basri et al. [1] proposed an Approximate Nearest Subspace algorithm to solve the QBIVR task. The point-to-subspace search problem is thereby reduced to the well-known point search problem, which can be addressed by approximate nearest neighbor (ANN) techniques. Nonetheless, the performance is far from ideal due to the inevitable loss of physical and/or geometric structural information, in both inter-frame structure and intra-frame relationships, caused by aggregating and/or projecting the video frames. An effective strategy is therefore needed to define the similarity between the two distinct modalities, the query point and the subspace database, while achieving high search efficiency.

Figure 1: The flowchart of the BSC framework. We first represent videos as subspaces of frames, and then asymmetrically project images and videos into a common Hamming space, where we can efficiently retrieve the most relevant videos given an image query.

To tackle the aforementioned two challenges in the QBIVR task, we propose a novel retrieval framework, termed Binary Subspace Coding (BSC), which explores the genuine geometric relationships between the query point (image) and the subspace (video) and provides significant efficiency improvements. In particular, we measure image-to-video similarity by the distance between the image and its orthogonal projection in the subspace of the video, and then equivalently transform the objective into a Maximum Inner Product Search (MIPS) problem. To further accelerate the search process and simplify the optimization, our BSC framework employs an asymmetric learning strategy to generate different hash codes/functions for images and videos, which narrows their domain gap in the common Hamming space and is suitable for high-dimensional computation. Specifically, two asymmetric hashing models are designed. We first propose an Inner-product Binary Coding (IBC) approach, which preserves image-to-video inner relationships and guarantees high-quality binary codes through a tractable discrete optimization method. Moreover, we devise a Bilinear Binary Coding (BBC) approach to significantly lower the computational cost by exploiting compact bilinear projections instead of a single large projection matrix.

We illustrate the flowchart of our proposed BSC framework in Figure 1. The main contributions of our work are summarized as follows:

  • We devise a novel query-by-image video retrieval framework, termed Binary Subspace Coding (BSC). We define an image-to-video distance metric that better preserves the geometric information of the data.

  • We propose two asymmetric hashing schemes to unify images and videos in a common Hamming space, where the domain gap is narrowed and efficient retrieval is supported.

  • Extensive experiments on four datasets, i.e., BBT1, UQE50, FCVID and a micro video dataset collected by ourselves, demonstrate the effectiveness of our approaches.

The remainder of this paper is organized as follows. Section 2 introduces the related work. In Section 3, we elaborate the details of our BSC framework. Section 4 presents the experimental results and analysis, followed by the conclusion of this work in Section 5.

2 Related Work

In this section, we give a brief review of the previous literature that is closely related to our work.

Recently, there has been significant research interest in video retrieval, such as event detection [28, 6] and recommendation systems [13]. One type of video retrieval, the QBIVR task, urgently needs a performance boost given its widespread applications, and an effective, informative video representation undoubtedly needs to be studied as a guarantee of retrieval quality [10]. Typically, shot indexing, set aggregation and global signature representation are three prevailing video retrieval methods. The first two suffer from a severe scarcity of intra-frame relationships and from sensitivity to the choice of training sets, respectively. In contrast, global signature representation methods can retain rich structural information, such as inter-frame and intra-frame relationships, which contributes to pertinent and accurate modelling. However, their high computational cost makes efficient retrieval hard to achieve. Araújo et al. [4] proposed an integral representation method that reduces retrieval latency and memory requirements with Scalable Compressed Fisher Vectors (SCFV), but it sacrifices a plethora of original spatial-temporal features which are crucial for integral representations. At the same time, a suitable similarity-preserving measurement between images and videos also merits attention, which adds further difficulty to the QBIVR task.

By preserving an invariant domain with low-dimensional structural information and then projecting the affinity matrix into a datum point, an easier similarity-preserving measurement between images and videos was proposed for the QBIVR task using approximate nearest neighbor (ANN) search methods [1]. Meanwhile, several powerful hashing methods [20], i.e., supervised hashing, semi-supervised hashing and unsupervised hashing, have shed light on the ANN search problem in the pursuit of efficiency. Although supervised hashing methods have demonstrated promising performance in some applications with semantic labels, it is troublesome or even impossible to obtain semantic labels in many real-life applications. Besides, their learning process is far more complex and time-consuming than that of unsupervised techniques, especially when dealing with high-resolution videos or images. Some classical unsupervised methods include Spectral Hashing (SH) [26], which preserves the Euclidean distance in the database; Inductive Manifold Hashing (IMH) [23], which adopts manifold learning techniques to better model the intrinsic structure embedded in the feature space; and Iterative Quantization (ITQ) [9], which focuses on minimizing the quantization error during unsupervised training. Other notable unsupervised hashing methods, including anchor graph hashing (AGH) [16] and scalable graph hashing with feature transformation (SGH) [12], directly exploit the similarity to guide the hash code learning procedure and achieve good performance.

Although the above hashing methods can efficiently reduce computational cost and storage, they neglect the different modalities of images and videos, which causes a domain gap between the two. Some work has already focused on this domain gap. One solution proposed by Li et al. [15], dubbed Hashing across Euclidean Space and Riemannian Manifold (HER), learns hash functions in a max-margin framework across Euclidean space and Riemannian manifold, but becomes unsuitable for large-scale databases owing to its unaffordable running time as the dimension grows. Shrivastava and Li [22] also proposed Asymmetric Locality-Sensitive Hashing (ALSH), which performs a simple asymmetric transformation on data pairs for different learning. Inspired by their work and by the dimensionality-reducing bilinear projection method of Gong et al. [7], we aim to seek a more powerful asymmetric binary learning framework that properly balances the QBIVR task and high-quality hashing based on subspace learning.

3 Binary Subspace Coding

In this section, we describe the details of our BSC framework and its two schemes. We first present the geometry-preserving distance metric between images and videos and derive how the objective is transformed into a MIPS problem. Then, we introduce in turn the two asymmetric approaches for learning hash codes/functions, which address the domain gap between images and videos.

3.1 Problem Formulation

Given a database of videos, denoted as , where () represents the subspace covering all the video keyframes. Given a query image , the main objective of the query-by-image video retrieval task can be formulated as below:


where is a certain distance measure between two data points. As shown, the major objective of query-by-image video retrieval is to find the subspace whose distance from the query point is the shortest.

3.2 Geometry-Preserving Distance

QBIVR is essentially a point-to-subspace search problem in which the query is represented as a point and the database comprises subspaces. Recall that existing solutions to the above problem either aggregate or project all the frames into a single datum compatible with the given query, which may cause serious loss of information, such as spatial arrangement and/or temporal order. To compensate for such drawbacks, we propose to measure the image-to-video relationship by the distance between the query and its corresponding projection onto the subspace. In this way, the geometric property and structural information of the subspace can be preserved. We denote the new image-to-video distance as


where is the norm. It is easy to see that the nearest point is the orthogonal projection of on , which is calculated as follows:


where is the orthogonal projection of on , . Note that can be computed offline to increase efficiency. Substituting Eq. (3) into Eq. (2), we obtain the point-to-subspace distance:



where is an identity matrix of size . Denoting , given that we have


Therefore, we can obtain the further conclusion


where is the trace of a matrix. Based on Eq. (6), our objective is equivalent to the following problem:


Note that , where denotes the inner product of two vectors. Eq. (7) can then be seen as a Maximum Inner Product Search (MIPS) problem w.r.t. the query and its orthogonal projection in the subspace. However, all the in the dataset would have to be processed again every time a new query is provided, which increases the computational cost. To bypass this issue, based on the linear algebra manipulation , we rewrite the problem (7) as in (8):


where is the function that transforms a matrix of size into a column vector of size by column-wise concatenation of the matrix. In this way, we obtain an equivalent MIPS problem w.r.t. and from the original QBIVR problem.
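For reference, the chain of equalities behind Eqs. (2)–(8) can be written out compactly. The notation here is assumed for illustration and may differ from the original typesetting: q ∈ R^d is the query feature, S_i ∈ R^{d×m_i} is a full-column-rank basis of the i-th video subspace, P_i = S_i (S_i^T S_i)^{-1} S_i^T is the orthogonal projector onto its span, I is the d × d identity, and vec(·) stacks the columns of a matrix.

```latex
\begin{align*}
d(q, S_i) &= \min_{v \in \operatorname{span}(S_i)} \lVert q - v \rVert_2
           = \lVert q - P_i q \rVert_2
           = \lVert (I - P_i)\, q \rVert_2, \\
d(q, S_i)^2 &= q^\top (I - P_i)^\top (I - P_i)\, q
           = q^\top (I - P_i)\, q
           \qquad \text{since } P_i^\top = P_i,\ P_i^2 = P_i, \\
          &= q^\top q - \operatorname{Tr}\!\left(q^\top P_i\, q\right), \\
\arg\min_i d(q, S_i) &= \arg\max_i \; q^\top P_i\, q
           = \arg\max_i \; \langle q,\, P_i q \rangle, \\
q^\top P_i\, q &= \operatorname{Tr}\!\left(q q^\top P_i\right)
           = \operatorname{vec}(q q^\top)^\top \operatorname{vec}(P_i).
\end{align*}
```

The last identity separates the two sides of the search: vec(P_i) depends only on the video and can be stored offline, while vec(q q^T) depends only on the query.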

Considering the unaffordable computation of and when is large, i.e., , we employ hashing approaches to binarize the image query and video data. The different properties of query images and database videos are non-negligible factors for obtaining accurate binary codes. Therefore, we learn asymmetric hash functions for the image query and the video data respectively, and the MIPS problem is then reformulated as


where and are hash functions for videos and images, respectively.
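The equivalence between the point-to-subspace distance and the inner-product search derived in this subsection can be checked numerically before any hashing is applied. The following is a minimal NumPy sketch on random toy data; the helper names (`projector`, `vec`) and all sizes are illustrative assumptions, and the projector is built from a QR factorization rather than the explicit inverse, which yields the same matrix for full-column-rank bases.

```python
import numpy as np

def projector(S):
    """Orthogonal projector onto span(S); S is d x m with full column rank."""
    Q, _ = np.linalg.qr(S)      # columns of Q form an orthonormal basis of span(S)
    return Q @ Q.T

def vec(M):
    """Column-wise concatenation of a matrix into a vector."""
    return M.ravel(order="F")

rng = np.random.default_rng(0)
d = 8                                              # toy feature dimension (assumed)
videos = [rng.standard_normal((d, m)) for m in (2, 3, 4, 3)]
q = rng.standard_normal(d)                         # toy query image feature

# Offline: one projector per video, vectorized once.
projectors = [projector(S) for S in videos]
database = np.stack([vec(P) for P in projectors])  # shape (n_videos, d * d)

# Online: a single vectorized query, then plain inner products.
scores = database @ vec(np.outer(q, q))

# Reference: explicit distance between q and its orthogonal projection.
distances = np.array([np.linalg.norm(q - P @ q) for P in projectors])

best_by_mips = int(np.argmax(scores))
best_by_distance = int(np.argmin(distances))
```

In this sketch the video with the largest inner product is also the one with the smallest distance, and the squared distance equals the squared query norm minus the score. The d² length of each database vector is what motivates the binary codes introduced next.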

3.3 Inner-product Binary Coding

We first present the Inner-product Binary Coding approach. To facilitate the asymmetric learning of hash functions, we first construct video data and image data for training:

where are images randomly sampled from video frames for training. Let be the correlation matrix of and . We choose to use the inner product to represent the similarity, i.e., . Following [18], we now consider the following optimization problem:


where , , and is the Frobenius norm. For simplicity, we choose to learn linear hash functions, i.e., and , where and are the two mapping variables for binarizing videos and images, respectively.

In practice, to further speed up the optimization, we deliberately discard the quadratic term , since this term does not help in leveraging the ground-truth similarity. In fact, the term can be treated as a regularization on the magnitude of the learned inner product. Hence, we arrive at the new objective:


which can be optimized by alternatingly updating and . In particular, when learning with fixed, we have


When updating with fixed, we arrive at


Both of the above sub-problems are of the same form. Next, we show how to solve (12); the sub-problem (13) can be solved in the same way.

It is non-trivial to optimize the sub-problem (12) due to the existence of the sign function . To bypass the obstacle, we introduce an auxiliary variable to approximate , and thus we have


where is a balance parameter. Setting the derivative of the above objective w.r.t. to zero, we have


Fixing , then we can update with


The above analytical solution of significantly reduces the training cost, which makes the algorithm easy to apply to large-scale databases.
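The alternating scheme described above can be sketched in NumPy. Because the displayed objectives are not reproduced here, the form below is an assumption modeled on the description: with video features X (D × n), image features Y (D × m), similarity S (n × m) and r-bit codes, the video-side sub-problem is taken to be max over B and W of Tr(B S B_y^T) − λ‖B − W^T X‖_F², and symmetrically for images. Under that assumption, setting the derivative w.r.t. W to zero gives a least-squares solution, and the codes have the closed form sgn(B_y S^T + 2λ W^T X). All function names, the small ridge term and the toy data are illustrative, not the authors' implementation.

```python
import numpy as np

def sign(M):
    """Sign function that maps zeros to +1 so codes stay in {-1, +1}."""
    return np.where(M >= 0, 1.0, -1.0)

def fit_projection(data, codes, ridge=1e-6):
    """W minimizing ||codes - W^T data||_F^2; the ridge is an added numerical safeguard."""
    D = data.shape[0]
    return np.linalg.solve(data @ data.T + ridge * np.eye(D), data @ codes.T)

def ibc_sketch(X, Y, S, n_bits, lam=1.0, n_outer=5, n_inner=3, seed=0):
    """Alternately update video-side (W, Bx) and image-side (V, By) variables."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[0], n_bits))
    V = rng.standard_normal((Y.shape[0], n_bits))
    Bx, By = sign(W.T @ X), sign(V.T @ Y)
    for _ in range(n_outer):
        for _ in range(n_inner):                 # video side, image codes fixed
            W = fit_projection(X, Bx)
            Bx = sign(By @ S.T + 2.0 * lam * (W.T @ X))
        for _ in range(n_inner):                 # image side, video codes fixed
            V = fit_projection(Y, By)
            By = sign(Bx @ S + 2.0 * lam * (V.T @ Y))
    # Refit the linear maps to the final codes before returning.
    return fit_projection(X, Bx), fit_projection(Y, By), Bx, By

# Toy usage with inner-product similarity between random features.
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 20))     # 20 "videos", 6-d features
Y = rng.standard_normal((6, 15))     # 15 "images"
S = X.T @ Y
W, V, Bx, By = ibc_sketch(X, Y, S, n_bits=4)
video_codes = sign(W.T @ X)          # hash function applied to videos
image_codes = sign(V.T @ Y)          # hash function applied to images
```

Each update is either a linear solve or an element-wise sign, which is what keeps the per-iteration cost low in this assumed formulation.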

3.4 Bilinear Binary Coding

Note that in IBC, hashing the vectorized image and video data and with a full projection matrix may cause high computational cost. In this part, we propose a Bilinear Binary Coding (BBC) approach to further improve the efficiency of the QBIVR task.

We first present a bilinear rotation that maintains the matrix structure instead of a single large projection matrix, denoted as , which is remarkably successful in lowering the running time and storage for code generation. It has also been proved in [7] that a bilinear rotation to is equivalent to a rotation to , denoted as , where is the Kronecker product [14]. Now, we can equivalently learn asymmetric hash functions for images and videos as follows:


where , , . Then, we can generate binary codes for the vectorized images and videos with code length for efficient retrieval.
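The equivalence between a bilinear projection and a single large projection can be verified directly with the identity vec(R1^T X R2) = (R2^T ⊗ R1^T) vec(X), where vec stacks columns. The following NumPy sketch uses random matrices in place of learned rotations; the names `R1`, `R2` and the toy sizes are assumptions for illustration.

```python
import numpy as np

def vec(M):
    """Column-wise concatenation of a matrix into a vector."""
    return M.ravel(order="F")

rng = np.random.default_rng(0)
d1, d2, c1, c2 = 4, 5, 2, 3              # toy sizes; the code length is c1 * c2

X = rng.standard_normal((d1, d2))        # one feature reshaped into a d1 x d2 matrix
R1 = rng.standard_normal((d1, c1))       # left projection (random stand-in)
R2 = rng.standard_normal((d2, c2))       # right projection (random stand-in)

bilinear = vec(R1.T @ X @ R2)            # two small matrix products
full = np.kron(R2.T, R1.T) @ vec(X)      # one large (c1*c2) x (d1*d2) projection

code = np.where(bilinear >= 0, 1, -1)    # binary code of length c1 * c2

params_bilinear = R1.size + R2.size      # d1*c1 + d2*c2 stored values
params_full = (d1 * d2) * (c1 * c2)      # values in the explicit Kronecker matrix
```

Both routes give the same projected vector, but the bilinear form never materializes the Kronecker product, so storage and multiplication cost scale with the two small matrices rather than with their product.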

Following [9], a feasible objective is to learn a bilinear rotation which minimizes the angle between and its binary encoding . We preprocess the video dataset and image dataset to be zero-centered and have unit norm, then our goal is to maximize the following objective:


where is the angle of the -th rotated image/video and its binary code. . is the angle between the binary codes of and , where and are the binary codes of the -th image and -th video, respectively. preserves the similarity property of images and videos in the different or same category:

For images, is expressed as:


To simplify the subsequent optimization, we follow [8] to relax the above objective function by ignoring , and arrive at:


Similarly, we can derive the objectives of video angles and image-to-video angles:


Hence, the objective function is transformed to


where . .

For optimization, we use block coordinate ascent to alternately update ,,. The updating processes w.r.t. images and videos are symmetric. Hence, we only describe the updates of the video variables with all the image variables fixed.

Step 1: Update , with all other variables fixed. We have the following reduced problem:


where . We can solve the above optimization problem using polar decomposition:


where and are the left-singular vectors and the top right-singular vectors of , respectively, by performing SVD.

Step 2: Update , with all the others fixed, we have


where . Similar to the previous step, the update for is , where and are the top left-singular vectors and the right-singular vectors of , respectively, by performing SVD.

Step 3: Update , by fixing all the other variables, we obtain


where . It can be easily seen that the solution to the above problem is as below:


Then, we can similarly update , and .
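Steps 1 and 2 above both reduce to maximizing a trace under an orthogonality constraint, which has a closed-form solution through the SVD (the orthogonal Procrustes / polar decomposition argument). The sketch below assumes the reduced problem has the generic form max Tr(R^T M) subject to R^T R = I for some matrix M assembled from the fixed variables; the exact M of each step is not reproduced here, so a random matrix stands in for it.

```python
import numpy as np

def procrustes_update(M):
    """Maximizer of trace(R^T M) over R with orthonormal columns (M is d x c, d >= c)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)   # thin SVD: M = U diag(s) Vt
    return U @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 3))          # hypothetical stand-in for the step's matrix
R = procrustes_update(M)

objective = np.trace(R.T @ M)
singular_value_sum = np.linalg.svd(M, compute_uv=False).sum()

# Any other matrix with orthonormal columns should score no higher.
Q, _ = np.linalg.qr(rng.standard_normal((6, 3)))
other_objective = np.trace(Q.T @ M)
```

At the optimum the objective equals the sum of the singular values of M, which gives a simple check that an update step was carried out correctly.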

Compared with the time complexity of the full rotation , i.e., , the asymmetric bilinear hashing learning of videos and images significantly reduces the training cost to , where . We summarize the algorithm for optimizing the proposed Bilinear Binary Coding (BBC) approach in Algorithm 1.

Input:  Subspaces of videos and images ;
Output:  Hash functions and ;
1:  Compute , ;
2:  Construct video and image training data as below:
3:  Randomly initialize ,,;
4:  repeat
5:     Update by solving the problem (23);
6:     Update by solving the problem (25);
7:     Sequentially update by solving the problem (26);
8:     Update according to the problem (23);
9:     Update according to the problem (25);
10:     Sequentially update according to the problem (26);
11:  until there is no change to all the variables;
12:  return  ,,.
Algorithm 1 Optimization of Bilinear Binary Coding.

4 Experiments

In this section, we evaluate our two proposed approaches, IBC and BBC, on four datasets for the query-by-image video retrieval (QBIVR) task.

4.1 Data and Experimental Settings

We used four video datasets: the face video dataset BBT1 (Big Bang Theory1) [15], the UQE50 (UQ Event dataset with 50 pre-defined events) dataset [32], a micro object-based video dataset collected by ourselves from Vine, and the Fudan-Columbia Video Dataset (FCVID), which covers a wide range of objects and events. BBT1, a low-dimensional small dataset, was used to validate the HER method in [15]. We conduct experiments on this publicly available video dataset to verify the effectiveness of our approaches. For the other datasets, we first adopted FFmpeg to sample the videos at the rate of frames per second as keyframes, and subsequently extracted the visual features of the keyframes using the fc7 layer (-d) of the VGG Net model [27]. In view of the potential redundancy, the data in our experiments were further reduced to -d by PCA.

We compared our IBC and BBC approaches against several state-of-the-art unsupervised hashing methods for large-scale video retrieval, including ALSH [21], IMH [23], SH [26] and ITQ [9]. In our IBC approach, each column of the original inner product matrix is binarized, where the top largest elements are set to and the remaining ones to . is set to for the Vine dataset, for the UQE50 and BBT1 datasets, and for the FCVID dataset. The balance parameter is empirically set to and the number of local iterations is set to . In our BBC approach, we first initialize the bilinear rotation parameters randomly and then learn the two asymmetric hash functions respectively. The number of local iterations is set to in light of the good convergence property of our approach. The parameters of the remaining compared approaches are set as suggested above. In the experiments, the code length is tested in the range of .
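The column-wise binarization of the inner product matrix used for IBC can be sketched as follows. The two target values and the number of retained elements are not reproduced in the text above, so the sketch assumes +1 for the top entries and -1 for the rest, and uses a hypothetical parameter `k`; both are illustrative assumptions.

```python
import numpy as np

def binarize_top_k(S, k, high=1.0, low=-1.0):
    """Per column, set the k largest entries to `high` and all others to `low`."""
    n_rows = S.shape[0]
    if not 0 < k <= n_rows:
        raise ValueError("k must be between 1 and the number of rows")
    # Row indices of the k largest entries in each column (order within the k is arbitrary).
    top = np.argpartition(-S, k - 1, axis=0)[:k]
    out = np.full(S.shape, low, dtype=float)
    np.put_along_axis(out, top, high, axis=0)
    return out

# Toy 3 x 2 similarity matrix with k = 2.
S = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.3]])
S_binary = binarize_top_k(S, k=2)
```

For this toy matrix the first column keeps rows 0 and 2 and the second column keeps rows 1 and 2, so every column has exactly `k` positive entries regardless of the scale of the raw inner products.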

The evaluation metrics, computed under Hamming ranking, are the mean of average precision (mAP) and the mean precision of the top 500 retrieved neighbors (Precision@500).
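As an illustration of how these metrics are typically computed under Hamming ranking, the sketch below ranks database codes by Hamming distance to each query code and scores the ranking against category labels. The average-precision definition used here (mean of the precision values at the ranks of relevant items) is a common convention and an assumption about the exact protocol; function names and the toy data are illustrative.

```python
import numpy as np

def hamming_distances(query_code, db_codes):
    """Hamming distances between one {-1,+1} code and each row of db_codes."""
    n_bits = db_codes.shape[1]
    return (n_bits - db_codes @ query_code) / 2   # number of differing bits

def average_precision(relevant_sorted):
    """AP of a ranked list given per-rank relevance flags."""
    relevant_sorted = np.asarray(relevant_sorted, dtype=bool)
    n_relevant = relevant_sorted.sum()
    if n_relevant == 0:
        return 0.0
    hits = np.cumsum(relevant_sorted)
    ranks = np.arange(1, len(relevant_sorted) + 1)
    return float(np.sum((hits / ranks)[relevant_sorted]) / n_relevant)

def precision_at_k(relevant_sorted, k):
    """Fraction of relevant items among the first k ranked results."""
    return float(np.mean(np.asarray(relevant_sorted, dtype=float)[:k]))

def evaluate(query_codes, query_labels, db_codes, db_labels, k=500):
    """Mean AP and mean precision@k over all queries."""
    aps, precisions = [], []
    for code, label in zip(query_codes, query_labels):
        order = np.argsort(hamming_distances(code, db_codes), kind="stable")
        relevant = db_labels[order] == label
        aps.append(average_precision(relevant))
        precisions.append(precision_at_k(relevant, k))
    return float(np.mean(aps)), float(np.mean(precisions))

# Toy example: one 4-bit query against three database codes.
db_codes = np.array([[1, 1, -1, -1], [-1, -1, 1, 1], [1, -1, -1, -1]])
db_labels = np.array([0, 1, 0])
query_codes = np.array([[1, 1, -1, -1]])
query_labels = np.array([0])
toy_distances = hamming_distances(query_codes[0], db_codes)
toy_map, toy_precision = evaluate(query_codes, query_labels, db_codes, db_labels, k=2)
```

With ±1 codes the Hamming distance follows from a single matrix-vector product, which is why ranking an entire database against a query stays cheap even for long codes.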

4.2 BBT1: video retrieval with face images

The Big Bang Theory (BBT) [15] is a sitcom (20 minutes per episode) which includes many full-view shots of multiple characters at a time. It takes place mostly indoors and mainly focuses on characters. BBT1 consists of 3,341 face videos from the first 6 episodes of season 1 of BBT, which are represented by block Discrete Cosine Transform (DCT) features as used in [2], forming a covariance video representation.

4.2.1 Compared to other state-of-the-art methods

To compare with the HER method [15], an effective hashing method for heterogeneous spaces, we tested our approaches on BBT1, on which HER has been successfully verified. Following [15], for each database, we randomly extracted 300 image-video pairs (both elements of the pair come from the same subject) for training and 100 images from the rest as queries for the retrieval task. The results are shown in Table 1. Our proposed approaches not only achieve higher mAP than the HER method, but also overcome its limitation to low-dimensional data: the subsequent experiments show that our approaches can also be applied to high-dimensional large datasets.

Method   mAP
         16-bit   32-bit   64-bit   96-bit
HER      0.5049   0.5227   0.5490   0.5531
IBC      0.5152   0.5369   0.5561   0.5638
BBC      0.5080   0.5401   0.5643   0.5711
Table 1: Comparison of the HER method and the IBC and BBC approaches on the BBT1 dataset with 16-, 32-, 64- and 96-bit codes.

4.3 Vine: video retrieval with object images

Vine is a micro video sharing platform, where users can only share videos of no more than six seconds from mobile devices. We collected a micro object-based video dataset from Vine comprising micro videos in categories. We randomly sampled 11,000 videos and 11,000 images from the videos for training, and used the remaining 1,000 images for testing.

4.3.1 Compared to other state-of-the-art methods

We report comparisons with several state-of-the-art methods, i.e., hashing methods and a vector-based approach, in terms of both mAP and Precision@500. The compared vector-based method, temporal aggregation [3] (denoted TAVR), employs Scalable Compressed Fisher Vectors (SCFV) to reduce retrieval latency and memory requirements, achieving higher speed while maintaining good performance. However, the approach sacrifices useful information such as structural similarity in pursuit of higher speed. Moreover, although the binarized Fisher features that TAVR uses are more representative and effective than low-level image features, they still fall short of deep features, especially after redundancy removal. The performance of the vector-based approach and the proposed approaches with 64-bit codes is shown in Table 2.


Method   mAP      Precision@500
TAVR     0.3785   0.3021
IBC      0.4997   0.4990
BBC      0.5045   0.5039
Table 2: Comparison of vector-based retrieval (TAVR) and IBC, BBC on the Vine dataset with the code length set to 64 bits.

In the comparisons with hashing methods, we treat a query as a false case if no point is returned when calculating precision. Ground truths are defined by the category information of the datasets. As Table 3 shows, the two proposed approaches outperform all the other state-of-the-art methods on every metric at all code lengths, and retain notably stronger representation ability even when the code length is as large as 128 bits.

Method   mAP                                          Precision@500
         16-bit  32-bit  64-bit  96-bit  128-bit      16-bit  32-bit  64-bit  96-bit  128-bit
ALSH     0.2965  0.2983  0.3006  0.3067  0.3122       0.2942  0.2976  0.3014  0.3142  0.3224
ITQ      0.3121  0.3156  0.3256  0.3325  0.3378       0.2846  0.3256  0.3274  0.3302  0.3544
IMH      0.3308  0.3569  0.3596  0.3628  0.3642       0.3282  0.3366  0.3644  0.3722  0.3804
SGH      0.3849  0.4135  0.4258  0.4322  0.4371       0.4165  0.4276  0.4294  0.4322  0.4424
IBC      0.4970  0.4985  0.4997  0.5012  0.5064       0.4892  0.4953  0.4990  0.5010  0.5048
BBC      0.4933  0.5012  0.5045  0.5082  0.5125       0.4842  0.4993  0.5039  0.5115  0.5202
Table 3: Comparison of IBC, BBC and other hashing methods on Vine with code lengths of 16, 32, 64, 96 and 128 bits.

4.4 UQE50: video retrieval with event images

The video dataset UQE50 (UQ Event dataset with 50 pre-defined events), which was downloaded from YouTube [32], targets event analysis tasks. The dataset contains videos that belong to different event categories, and all the videos are from trending events that happened in the last few years, whose granularity is comparably larger than that of existing video event datasets. Compared with the Vine dataset of object-based videos, UQE50 is a smaller event-based dataset with longer videos. To verify the generality of our proposed approaches across different types of video retrieval, we used the UQE50 video dataset to compare the performance of our approaches with that of other state-of-the-art methods. We randomly chose videos and images from the videos as training samples, and the remaining 200 images were used as test samples.

4.4.1 Compared with other state-of-the-art methods

To examine the practical efficacy of our proposed approaches, we conducted on UQE50 the same experiment as on Vine, comparing against other state-of-the-art hashing methods. As shown in Figure 2(a)(b), even on a smaller dataset with longer videos, the proposed approaches achieve the best performance in terms of both mAP and Precision@500, and they remain consistently better than the other methods as the code length increases.

Figure 2: Comparison of IBC, BBC and other hashing methods on the UQE50 and FCVID datasets with different code lengths: (a) mAP on UQE50, (b) Precision@500 on UQE50, (c) mAP on FCVID, (d) Precision@500 on FCVID.

4.5 FCVID: video retrieval with event/object images

The Fudan-Columbia Video Dataset (FCVID) is a dataset containing Web videos manually annotated according to 239 categories. The categories in FCVID cover a wide range of topics such as social events (e.g., tailgate party), procedural events (e.g., making cake), objects (e.g., panda), scenes (e.g., beach), etc. These categories were defined very carefully and organized in a hierarchy of 11 high-level groups. Specifically, the categories were identified through user surveys, with the organization structures of YouTube and Vimeo used as references. In this section, we chose as our video dataset and randomly selected images and videos for training, then we tested another images in the video dataset.

4.5.1 Compared with other state-of-the-art methods

In this part, we conducted the same experiment as for Vine and UQE50 on the FCVID dataset to study the performance of event/object image retrieval. The wide range of image types and the higher reliability of this dataset allow the effect of our approaches to be demonstrated clearly. As Figure 2(c)(d) shows, our approaches outperform the other state-of-the-art hashing methods in both mAP and Precision@500 at different code lengths. The better performance on four datasets of different types and sizes supports the validity of the proposed IBC and BBC approaches.

Figure 3: Training size effects of IBC, BBC on mAP performance over Vine and UQE50 with 64 bits fixed. Panels: (a) IBC on Vine, (b) BBC on UQE50, (c) IBC on Vine, (d) BBC on UQE50.

4.6 Effect of training size on UQE50 and Vine

This experiment evaluates the effect of training size on the search quality of our IBC and BBC approaches. We performed the experiments on the object-based Vine and event-based UQE50 datasets and selected the mean of average precision (mAP) as the assessment index. We fixed the code length at 64 bits and varied the training size of the Vine dataset from to with a regular interval of , while that of UQE50 was varied from to with a regular interval of . The results are shown in Figure 3. As we can see, both IBC and BBC have a suitable training size for the best performance; beyond it, adding more training data does not yield a noticeable performance boost. The ideal training size on Vine is for the IBC approach and for the BBC approach. For the UQE50 dataset, the performance is optimal with a training size of for the IBC approach and for the BBC approach. This experiment also gives guidance for choosing a suitable training size.

5 Conclusion

In this paper, we developed the Binary Subspace Coding (BSC) framework, which includes two different approaches for query-by-image video retrieval. Different from traditional video retrieval methods, we focused on subspace-based video representation and discovered a common Hamming space for both images and videos to enable efficient retrieval. Through a new distance metric, our proposed similarity-preserving measurement preserves the geometric structure of videos better than traditional methods. Furthermore, we derived an equivalent MIPS formulation of the point-to-subspace problem, which decreases the computational cost significantly. Besides, the BSC framework is an asymmetric learning model that efficiently handles the complexity of computation and memory as well as the domain differences between videos and images. Extensive experiments on four datasets, BBT1, UQE50, Vine and FCVID, demonstrated the advantages of our two approaches over several state-of-the-art methods.


References

  • [1] R. Basri, T. Hassner, and L. Zelnik-Manor. Approximate nearest subspace search. TPAMI, 33(2):266–278, 2011.
  • [2] M. Bäuml, M. Tapaswi, and R. Stiefelhagen. Semi-supervised learning with constraints for person identification in multimedia data. In CVPR, 2013.
  • [3] A. F. de Araújo, J. Chaves, R. Angst, and B. Girod. Temporal aggregation for large-scale query-by-image video retrieval. In ICIP, 2015.
  • [4] A. F. de Araújo, J. Chaves, R. Angst, and B. Girod. Temporal aggregation for large-scale query-by-image video retrieval. In ICIP, 2015.
  • [5] A. F. de Araújo, M. Makar, V. Chandrasekhar, D. M. Chen, S. S. Tsai, H. Chen, R. Angst, and B. Girod. Efficient video search using image queries. In ICIP, 2014.
  • [6] C. Gan, N. Wang, Y. Yang, D. Yeung, and A. G. Hauptmann. Devnet: A deep event network for multimedia event detection and evidence recounting. In CVPR, 2015.
  • [7] Y. Gong, S. Kumar, H. A. Rowley, and S. Lazebnik. Learning binary codes for high-dimensional data using bilinear projections. In CVPR, 2013.
  • [8] Y. Gong, S. Kumar, V. Verma, and S. Lazebnik. Angular quantization-based binary codes for fast similarity search. In NIPS, 2012.
  • [9] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. TPAMI, 35(12):2916–2929, 2013.
  • [10] R. Hong, Y. Yang, M. Wang, and X. S. Hua. Learning visual semantic relationships for efficient visual retrieval. IEEE Transactions on Big Data, 1(4):152–161, 2015.
  • [11] Y. Hu, A. S. Mian, and R. A. Owens. Sparse approximated nearest points for image set classification. In CVPR, 2011.
  • [12] Q. Jiang and W. Li. Scalable graph hashing with feature transformation. In IJCAI, 2015.
  • [13] Y. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yang. Understanding and predicting interestingness of videos. In AAAI, 2013.
  • [14] A. J. Laub. Matrix analysis - for scientists and engineers. SIAM, 2005.
  • [15] Y. Li, R. Wang, Z. Huang, S. Shan, and X. Chen. Face video retrieval with image query via hashing across euclidean space and riemannian manifold. In CVPR, 2015.
  • [16] W. Liu, J. Wang, S. Kumar, and S. Chang. Hashing with graphs. In ICML, 2011.
  • [17] F. Perronnin, J. Sánchez, and T. Mensink. Improving the fisher kernel for large-scale image classification. In ECCV, 2010.
  • [18] F. Shen, W. Liu, S. Zhang, Y. Yang, and H. T. Shen. Learning binary codes for maximum inner product search. In ICCV, 2015.
  • [19] F. Shen, X. Zhou, Y. Yang, J. Song, H. T. Shen, and D. Tao. A fast optimization method for general binary code learning. IEEE Transactions on Image Processing, 25(12), 2016.
  • [20] X. Shen, F. Shen, Q. S. Sun, Y. Yang, Y. H. Yuan, and H. T. Shen. Semi-paired discrete hashing: Learning latent hash codes for semi-paired cross-view retrieval. IEEE Transactions on Cybernetics, 2016.
  • [21] A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In NIPS, 2014.
  • [22] A. Shrivastava and P. Li. Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). In UAI, 2015.
  • [23] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In SIGMOD, 2013.
  • [24] R. Vemulapalli, J. K. Pillai, and R. Chellappa. Kernel learning for extrinsic classification of manifold features. In CVPR, 2013.
  • [25] R. Wang and X. Chen. Manifold discriminant analysis. In CVPR, 2009.
  • [26] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.
  • [27] Z. Xu, Y. Yang, and A. G. Hauptmann. A discriminative CNN video representation for event detection. In CVPR, 2015.
  • [28] Y. Yan, Y. Yang, D. Meng, G. Liu, W. Tong, A. G. Hauptmann, and N. Sebe. Event oriented dictionary learning for complex event detection. TIP, 24(6):1867–1878, 2015.
  • [29] Y. Yang, F. Shen, H. T. Shen, H. Li, and X. Li. Robust discrete spectral hashing for large-scale image semantic indexing. IEEE Transactions on Big Data, 1(4):162–171, 2015.
  • [30] B. Yi, Y. Yang, F. Shen, X. Xu, and H. T. Shen. Bidirectional long-short term memory for video description. In ACM, 2016.
  • [31] L. Yu, Y. Yang, Z. Huang, P. Wang, J. Song, and H. Shen. Web video event recognition by semantic analysis from ubiquitous documents. IEEE Transactions on Image Processing, PP(99):1–1, 2016.
  • [32] L. Yu, Y. Yang, Z. Huang, P. Wang, J. Song, and H. T. Shen. Web video event recognition by semantic analysis from ubiquitous documents. In TIP, preprint.