1 Introduction
In recent years, we have witnessed an explosion of multimedia data (e.g., images and videos) on the Web, driven by advances in digital cameras, high-speed Internet, and massive storage. Among all types of multimedia data, videos play a significant role in reshaping the ways people record daily life, express themselves, and communicate [31]. Video retrieval has therefore drawn considerable research attention owing to its application value in social networking. One instance is query-by-image video retrieval (QBIVR), which serves a variety of real-life applications, ranging from searching video lectures using a slide, to recommending relevant videos based on images, to searching news videos using photos [3]. Hashing [19, 29] has been proposed to solve the multimedia retrieval problem efficiently.
However, the QBIVR task faces two challenges: a similarity-preserving measurement between images and videos, and an efficient retrieval method for huge datasets. Similarity preservation is closely connected to the feature representation of videos [30]. Considerable research effort [5, 17] has been dedicated to developing effective schemes for improving the global signature of the whole video, and subspace representations such as a single linear subspace or a mixture of linear subspaces [25], affine subspaces [11], and covariance matrices [24] have demonstrated their strength in underpinning a range of multimedia applications. Subspace representations of videos preserve the rich structural properties of visual objects, such as viewpoint, location, spatial transformation, and movement, which makes them superior to a single-point representation in a high-dimensional feature space. However, they also make the similarity-preserving measurement between subspace and point data harder. One existing method computes the similarity between the query image and each frame of the video and then integrates these similarities by averaging or taking the maximum. This measurement suffers from high computational cost and massive storage, and it ignores the correlations among the video frames.
Meanwhile, several powerful hashing methods have been proposed to achieve fast large-scale retrieval. However, they are unsatisfactory for the QBIVR task, which is nontrivial because videos and images belong to different modalities. By projecting each video (i.e., subspace) into a datum point in a certain high-dimensional space, Basri et al. [1] proposed an Approximate Nearest Subspace algorithm to solve the QBIVR task. The point-to-subspace search problem is thereby reduced to the well-known point search problem, which can be addressed by approximate nearest neighbor (ANN) techniques. Nonetheless, the performance is far from ideal, because aggregating and/or projecting video frames inevitably loses physical and/or geometric structural information about inter-frame structure and intra-frame relationships. An effective strategy is therefore needed that defines the similarity between the two distinct modalities, the query point and the subspace database, while retaining high search efficiency.
To tackle these two challenges in the QBIVR task, we propose a novel retrieval framework, termed Binary Subspace Coding (BSC), which explores the genuine geometric relationship between a query point (image) and a subspace (video) while providing significant efficiency improvements. In particular, we measure image-to-video similarity by the distance between the image and its orthogonal projection onto the subspace of the video, and then equivalently transform the objective into a Maximum Inner Product Search (MIPS) problem. To further accelerate the search and simplify the optimization, the BSC framework employs an asymmetric learning strategy that generates different hash codes/functions for images and videos, which narrows their domain gap in the common Hamming space and is suitable for high-dimensional computation. Specifically, two asymmetric hashing models are designed. We first propose an Inner-product Binary Coding (IBC) approach, which preserves image-to-video inner relationships and guarantees high-quality binary codes through a tractable discrete optimization method. We also devise a Bilinear Binary Coding (BBC) approach, which significantly lowers the computational cost by exploiting compact bilinear projections instead of a single large projection matrix.
We illustrate the flowchart of our proposed BSC framework in Figure 1. The main contributions of our work are summarized as follows:

• We devise a novel query-by-image video retrieval framework, termed Binary Subspace Coding (BSC), and define an image-to-video distance metric that better preserves the geometric information of the data.

• We propose two asymmetric hashing schemes to unify images and videos in a common Hamming space, where the domain gap is minimized and efficient retrieval is fully supported.

• Extensive experiments on four datasets, i.e., BBT1, UQE50, FCVID, and a micro-video dataset collected by ourselves, demonstrate the effectiveness of our approaches.

The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 elaborates the details of our BSC framework. Section 4 presents the experimental results and analysis, followed by the conclusion in Section 5.
2 Related Work
In this section, we briefly review the previous literature that is closely related to our work.
Recently, there has been significant research interest in video retrieval tasks such as event detection [28, 6] and recommendation systems [13]. One type of video retrieval, the QBIVR task, urgently needs a performance boost given its widespread applications, and an effective, informative video representation is needed to guarantee retrieval quality [10]. Typically, shot indexing, set aggregation, and global signature representation are the three prevailing video retrieval approaches. The first suffers from a severe scarcity of intra-frame relationships and the second is sensitive to the choice of training sets, whereas global signature representation methods retain rich structural information, such as inter-frame and intra-frame relationships, which contributes to accurate modelling. However, high computational cost and retrieval speed remain hard to handle. Araujo et al. [4] proposed an integral representation method that reduces retrieval latency and memory requirements through Scalable Compressed Fisher Vectors (SCFV), but it sacrifices many of the original spatial-temporal features that are crucial for integral representations. At the same time, a suitable similarity-preserving measurement between images and videos also merits attention, which makes the QBIVR task more difficult.
By preserving an invariant domain with low-dimensional structure information and then projecting the affinity matrix into a datum point, an easier similarity-preserving measurement between images and videos was proposed for the QBIVR task using approximate nearest neighbor (ANN) search methods [1]. Meanwhile, several powerful hashing methods [20], i.e., supervised, semi-supervised, and unsupervised hashing, have improved the efficiency of ANN search. Although supervised hashing methods have demonstrated promising performance in some applications with semantic labels, it is troublesome or even impossible to obtain semantic labels in many real-life applications. Besides, their learning process is far more complex and time-consuming than that of unsupervised techniques, especially when dealing with high-resolution videos or images. Classical unsupervised methods include Spectral Hashing (SH) [26], which preserves the Euclidean distance in the database; Inductive Manifold Hashing (IMH) [23], which adopts manifold learning techniques to better model the intrinsic structure embedded in the feature space; and Iterative Quantization (ITQ) [9], which focuses on minimizing the quantization error during unsupervised training. Other notable unsupervised hashing methods, including anchor graph hashing (AGH) [16] and scalable graph hashing with feature transformation (SGH) [12], directly exploit similarity to guide the hash code learning procedure and achieve good performance.

Although the above hashing methods deal efficiently with computational cost and storage, they neglect the different modalities of images and videos, which can cause a domain gap between the two. Some work has already focused on this domain gap. One solution, proposed by Li et al. [15] and dubbed Hashing across Euclidean Space and Riemannian Manifold (HER), learns hash functions in a max-margin framework across Euclidean space and a Riemannian manifold, but becomes unsuitable for large-scale databases because its running time becomes unaffordable as the dimension grows. Shrivastava and Li [22] also proposed Asymmetric Locality-Sensitive Hashing (ALSH), which applies a simple asymmetric transformation to data pairs so that queries and database items are hashed differently.
Inspired by their work and by the bilinear-projection dimensionality reduction method of Gong et al. [7], we seek a more powerful asymmetric binary learning framework, based on subspace learning, that properly balances the QBIVR task and high-quality hashing.
3 Binary Subspace Coding
In this section, we describe the details of our BSC framework in two different schemes. We first present the geometry-preserving distance metric between images and videos and derive how the objective is transformed into a MIPS problem. Then, we introduce the two asymmetric approaches to learning hash codes/functions, which address the domain gap between images and videos.
3.1 Problem Formulation
Given a database of videos, denoted as , where () represents the subspace covering all the video keyframes. Given a query image , the main objective of the query-by-image video retrieval task can be formulated as below:
(1) 
where is a certain distance measurement between two data points. As shown, the major objective of query-by-image video retrieval is to find the subspace whose distance from the query point is the shortest.
3.2 Geometry-Preserving Distance
The QBIVR task is essentially a point-to-subspace search problem in which the query is represented as a point and the database comprises subspaces. Recall that existing solutions to the above problem either aggregate or project all the frames into a single datum compatible with the given query, which may cause serious loss of information such as spatial arrangement and/or temporal order. To overcome this drawback, we propose to measure the image-to-video relationship by the distance between the query and its corresponding projection onto the subspace. In this way, the geometric property and structural information of the subspace are preserved. We denote the new image-to-video distance as
(2) 
where is the norm. It is easy to see that the nearest point is the orthogonal projection of on , which is calculated as follows:
(3) 
where is the orthogonal projection of on , . Note that can be computed offline to increase efficiency. Substituting Eq. (3) into Eq. (2), we obtain the point-to-subspace distance:
(4) 
where is an identity matrix of size . Denoting , given that we have
(5) 
Therefore, we can obtain the further conclusion
(6)  
where is the trace of a matrix. Based on Eq. (6), our objective is equivalent to the following problem:
(7) 
Noting that , where denotes the inner product of two vectors, Eq. (7) can be seen as a Maximum Inner Product Search (MIPS) problem w.r.t. the query and its orthogonal projection in the subspace. However, all the in the dataset have to be preprocessed every time a new query is provided, which increases the computational cost. To bypass this issue, based on the linear algebra manipulation , we rewrite problem (7) as (8):
(8) 
where is the function that transforms a matrix of size into a column vector of size by column-wise concatenation of the matrix. In this way, we obtain an equivalent MIPS problem w.r.t. and from the original QBIVR problem.
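The chain of equivalences above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the names `q` (query image feature), `S` (a matrix whose columns span one video subspace), and `P` (the orthogonal projector onto that subspace) are assumed notation introduced only for this illustration, and the data are random. It relies on the standard identities that the squared point-to-subspace distance equals `q.T @ q - q.T @ P @ q` and that `q.T @ P @ q` equals the inner product of `vec(P)` and `vec(q q^T)`.

```python
import numpy as np

def projection_matrix(S):
    """Orthogonal projector onto the column space of S (d x k, full column rank)."""
    # Q has orthonormal columns spanning the same subspace as S, so P = Q Q^T.
    Q, _ = np.linalg.qr(S)
    return Q @ Q.T

def point_to_subspace_distance(q, S):
    """Euclidean distance between q and its orthogonal projection onto span(S)."""
    P = projection_matrix(S)
    return np.linalg.norm(q - P @ q)

def mips_score(q, P):
    """Inner product <vec(P), vec(q q^T)> with column-wise vec; equals q^T P q."""
    return P.flatten(order="F") @ np.outer(q, q).flatten(order="F")

# Hypothetical toy data: n random k-dimensional subspaces of R^d and one query.
rng = np.random.default_rng(0)
d, k, n = 16, 3, 20
videos = [rng.standard_normal((d, k)) for _ in range(n)]
q = rng.standard_normal(d)

distances = np.array([point_to_subspace_distance(q, S) for S in videos])
scores = np.array([mips_score(q, projection_matrix(S)) for S in videos])

nearest_by_distance = int(np.argmin(distances))  # nearest-subspace search
nearest_by_mips = int(np.argmax(scores))         # maximum inner product search
```

Because the projector of each video does not depend on the query, it can be vectorized once offline, and only the outer product of the query needs to be formed at search time.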
Considering the unaffordable computation of and when is large, i.e., , we employ hashing approaches to binarize the image query and video data. The different properties of query images and database videos are non-negligible factors for accurate binary codes. Therefore, we learn asymmetric hash functions for the image query and video data respectively, and the MIPS problem is reformulated as
(9) 
where and are hash functions for videos and images, respectively.
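As an illustration of why binary codes make this search cheap, the sketch below uses two different (asymmetric) linear hash functions and ranks videos by the inner product of their codes. It is a hedged example under stated assumptions: the projections `W_video` and `W_image` are random rather than learned, the feature vectors are random stand-ins for the vectorized video and image data, and all names are hypothetical. For codes in {-1, +1} of length `r`, the inner product equals `r` minus twice the Hamming distance, so maximizing the inner product is the same as minimizing the Hamming distance.

```python
import numpy as np

def binarize(W, X):
    """Linear hash function sgn(W^T X) with codes in {-1, +1}; zeros map to +1."""
    return np.where(W.T @ X >= 0, 1, -1).astype(np.int8)

rng = np.random.default_rng(0)
dim, r, n_videos = 32, 16, 100

# Hypothetical (random, not learned) projections: one per modality.
W_video = rng.standard_normal((dim, r))
W_image = rng.standard_normal((dim, r))

# Random stand-ins for vectorized video data and one vectorized image query.
video_feats = rng.standard_normal((dim, n_videos))
query_feat = rng.standard_normal((dim, 1))

video_codes = binarize(W_video, video_feats)      # shape (r, n_videos)
query_code = binarize(W_image, query_feat)[:, 0]  # shape (r,)

# Inner products of +/-1 codes, computed in a wider integer type.
inner = query_code.astype(np.int32) @ video_codes.astype(np.int32)
hamming = np.count_nonzero(video_codes != query_code[:, None], axis=0)

best_video = int(np.argmax(inner))
```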
3.3 Inner-product Binary Coding
We first present the Inner-product Binary Coding approach. To facilitate the asymmetric learning of hash functions, we first construct video data and image data for training:
where are images randomly sampled from video frames for training. Let be the correlation matrix of and . We choose to use the inner product to represent the similarity, i.e., . Following [18], we now consider the following optimization problem:
(10) 
where , , and is the Frobenius norm. For simplicity, we choose to learn linear hash functions, i.e., and , where and are the two mapping variables for binarizing videos and images, respectively.
In practice, to further speed up the optimization, we deliberately discard the quadratic term , since it does not help in leveraging the ground-truth similarity. In fact, the term can be treated as a regularization on the magnitude of the learned inner product. Hence, we arrive at the new objective:
(11) 
which can be optimized by alternatingly updating and . In particular, when learning with fixed, we have
(12) 
When updating with fixed, we arrive at
(13) 
Both of the above subproblems have the same form. Next, we show how to solve (12); subproblem (13) can be solved in the same way.
It is nontrivial to optimize the subproblem (12) due to the existence of the sign function . To bypass the obstacle, we introduce an auxiliary variable to approximate , and thus we have
(14)  
s.t. 
where is a balance parameter. Setting the derivative of the above objective w.r.t. to zero, we have
(15) 
Fixing , then we can update with
(16) 
The above analytical solution of significantly reduces the training cost, making the algorithm easy to run on large-scale databases.
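The alternating scheme described in this subsection can be sketched as follows. This is one plausible instantiation under loudly stated assumptions, not a reproduction of Eqs. (14) to (16): the notation `A` (video data, one column per video), `X` (image data), `S` (similarity matrix of shape videos by images), `B` and `Z` (binary codes), `W` and `V` (linear projections), and `lam` (balance parameter) is introduced here for illustration, and the update rules are derived from the assumed objective `sum((B.T @ Z) * S) - lam * (||B - W.T A||^2 + ||Z - V.T X||^2)`. Under that objective, the projection step is an ordinary least-squares fit and the code step has a closed-form sign solution.

```python
import numpy as np

def sgn(M):
    """Elementwise sign with zeros mapped to +1, so outputs lie in {-1, +1}."""
    return np.where(M >= 0, 1.0, -1.0)

def objective(A, X, S, B, Z, W, V, lam):
    """Assumed relaxed objective: similarity fit minus quantization penalties."""
    fit = np.sum((B.T @ Z) * S)
    penalty = np.linalg.norm(B - W.T @ A) ** 2 + np.linalg.norm(Z - V.T @ X) ** 2
    return fit - lam * penalty

def ibc_sketch(A, X, S, r, lam=1.0, iters=10, seed=0):
    """Alternate between least-squares projections and closed-form sign updates."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((A.shape[0], r))
    V = rng.standard_normal((X.shape[0], r))
    B, Z = sgn(W.T @ A), sgn(V.T @ X)
    history = []
    for _ in range(iters):
        # Video side: fit W to the current codes, then update the codes B.
        W = np.linalg.lstsq(A.T, B.T, rcond=None)[0]
        B = sgn(Z @ S.T + 2 * lam * (W.T @ A))
        # Image side: the same two steps with the roles of videos and images swapped.
        V = np.linalg.lstsq(X.T, Z.T, rcond=None)[0]
        Z = sgn(B @ S + 2 * lam * (V.T @ X))
        history.append(objective(A, X, S, B, Z, W, V, lam))
    return W, V, B, Z, history

# Hypothetical toy data; S uses the inner product as the similarity.
rng = np.random.default_rng(1)
D, n, m, r = 8, 30, 40, 6
A = rng.standard_normal((D, n))
X = rng.standard_normal((D, m))
S = A.T @ X
W, V, B, Z, history = ibc_sketch(A, X, S, r)
```

Each of the four updates maximizes the assumed objective over its own block with the others fixed, so the recorded objective values do not decrease across iterations.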
3.4 Bilinear Binary Coding
Note that in IBC, hashing the vectorized image and video data and with a full projection matrix may incur a high computational cost. In this part, we propose a Bilinear Binary Coding (BBC) approach to further improve the efficiency of the QBIVR task.
We first present a bilinear rotation that maintains the matrix structure instead of using a single large projection matrix, denoted as , which is remarkably successful in lowering the running time and storage for code generation. It has also been proved in [7] that a bilinear rotation of is equivalent to a rotation of , denoted as , where is the Kronecker product [14]. Now, we can equivalently learn asymmetric hash functions for images and videos as follows:
(17) 
where , , . Then, we can generate binary codes for vectorized images and videos with code length to perform efficient retrieval.
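The equivalence between a bilinear projection and a single large Kronecker-structured projection can be verified directly. The sketch below is illustrative only: `R1`, `R2`, and `X` are assumed names with random toy values, and `vec` denotes column-wise vectorization. It uses the standard identity vec(R1^T X R2) = (R2^T ⊗ R1^T) vec(X) and compares the number of parameters stored in each form.

```python
import numpy as np

def vec(M):
    """Column-wise vectorization of a matrix."""
    return M.flatten(order="F")

rng = np.random.default_rng(0)
d1, d2, c1, c2 = 6, 5, 3, 2

# Hypothetical data matrix and two small projections with orthonormal columns.
X = rng.standard_normal((d1, d2))
R1, _ = np.linalg.qr(rng.standard_normal((d1, c1)))
R2, _ = np.linalg.qr(rng.standard_normal((d2, c2)))

# Bilinear form: two small multiplications on the matrix-shaped data.
bilinear = vec(R1.T @ X @ R2)

# Equivalent full form: one (c1*c2) x (d1*d2) matrix applied to vec(X).
full_projection = np.kron(R2.T, R1.T)
full = full_projection @ vec(X)

# Binary code of length c1*c2 obtained from the projected values.
bilinear_code = np.where(bilinear >= 0, 1, -1)

params_bilinear = R1.size + R2.size  # d1*c1 + d2*c2
params_full = full_projection.size   # (c1*c2) * (d1*d2)
```

With these toy sizes the bilinear form stores 28 numbers against 180 for the full matrix, and the gap widens quickly as the dimensions grow.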
Following [9], a feasible objective is to learn a bilinear rotation which minimizes the angle between and its binary encoding . We preprocess the video dataset and image dataset to be zerocentered and have unit norm, then our goal is to maximize the following objective:
(18) 
where is the angle of the th rotated image/video and its binary code. . is the angle between the binary codes of and , where and are the binary codes of the th image and th video, respectively. preserves the similarity property of images and videos in the different or same category:
For images, is expressed as:
(19) 
To simplify the subsequent optimization, we follow [8] to relax the above objective function by ignoring , and arrive at:
(20)  
Similarly, we can derive the objectives of video angles and imagetovideo angles:
(21) 
Hence, the objective function is transformed to
(22)  
where . .
For optimization, we use block coordinate ascent to alternatingly update ,,. The updating processes w.r.t. images and videos are symmetric. Hence we just describe the updates of variables of videos by fixing all the variables of images.
Step 1: Update , with all other variables fixed. We have the following reduced problem:
(23) 
where . We can solve the above optimization problem using polar decomposition:
(24) 
where and are the left singular vectors and the top right singular vectors of , respectively, obtained by SVD.
Step 2: Update , with all the others fixed, we have
(25) 
where . Similar to the previous step, the update for is , where and are the top left singular vectors and the right singular vectors of , respectively, obtained by SVD.
Step 3: Update , by fixing all the other variables, we obtain
(26) 
where . It can be easily seen that the solution to the above problem is as below:
(27) 
Then, we can similarly update , and .
Compared with the time complexity of a full rotation, i.e., , the asymmetric bilinear hashing learning of videos and images significantly reduces the training cost to , where . We summarize the algorithm for optimizing the proposed Bilinear Binary Coding (BBC) approach in Algorithm 1.
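The rotation updates in Steps 1 and 2 have the form of an orthogonal Procrustes problem, whose solution comes from an SVD. The following minimal sketch shows that building block in isolation; the matrix `M` stands in for whatever product of data, codes, and the other rotation appears in the reduced problem, and both `M` and the function name are assumptions made for illustration. Given `M`, the matrix with orthonormal columns that maximizes trace(R^T M) is the product of the left and right singular vectors of `M`.

```python
import numpy as np

def procrustes_update(M):
    """Return R with orthonormal columns that maximizes trace(R^T M)."""
    # Thin SVD: M = U diag(s) Vt, so R = U Vt attains trace(R^T M) = sum(s).
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
d, c = 8, 4
M = rng.standard_normal((d, c))  # hypothetical stand-in for the reduced problem

R = procrustes_update(M)
best = np.trace(R.T @ M)

# Compare against random matrices with orthonormal columns.
random_values = []
for _ in range(50):
    Q, _ = np.linalg.qr(rng.standard_normal((d, c)))
    random_values.append(np.trace(Q.T @ M))
```

The binary-code step needs no such decomposition, since with the rotations fixed it reduces to an elementwise sign.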
4 Experiments
In this section, we evaluate our two proposed approaches, IBC and BBC, on four datasets for the query-by-image video retrieval (QBIVR) task.
4.1 Data and Experimental Settings
We used four video datasets: the face video dataset BBT1 (Big Bang Theory1) [15]; UQE50 (UQ Event dataset with 50 predefined events) [32]; a micro object-based video dataset collected by ourselves from Vine (https://vine.co/); and the Fudan-Columbia Video Dataset (FCVID) (http://bigvid.fudan.edu.cn/FCVID/), which covers a wide range of objects and events. BBT1, a small low-dimensional dataset, was previously used to validate the HER method [15], and we conduct experiments on this publicly available video dataset to verify the effectiveness of our approaches. For the other datasets, we first adopted FFmpeg (http://ffmpeg.org/) to sample the videos at the rate of frames per second as keyframes, and subsequently extracted the visual features of keyframes using fc7 layer (d) of VGG Net model [27]. In view of the potential redundancy, the data in our experiment were further reduced to d by PCA.
We compared our IBC and BBC approaches against several state-of-the-art unsupervised hashing methods for large-scale video retrieval, including ALSH [21], IMH [23], SH [26], and ITQ [9]. In our IBC approach, each column of the original inner product matrix is binarized, where the top largest elements are set to and the rest ones to . is set to for the Vine dataset, for UQE50 and BBT1 datasets, and for FCVID dataset. Notably, the balance parameter is empirically set to and the number of local iterations is set to . In our BBC approach, we first initialize the bilinear rotation parameters randomly and then learn the two asymmetric hash functions. The number of local iterations is set to in light of the excellent convergence of our devised approach. The parameters of the rest compared approaches are set as suggested above. In the experiment, the code length is tested in the range of .
The evaluation metrics, computed under Hamming ranking, are mean average precision (mAP) and mean precision of the top 500 retrieved neighbors (Precision@500).
4.2 BBT1: video retrieval with face images
The Big Bang Theory (BBT) [15] is a sitcom (20 minutes per episode) that includes many full-view shots of multiple characters at a time. It takes place mostly indoors and mainly focuses on characters. BBT1 consists of 3,341 face videos from the first 6 episodes of season 1 of BBT, represented by block Discrete Cosine Transformation (DCT) features as used in [2], which form a covariance video representation.
4.2.1 Comparison with other state-of-the-art methods
To compare with the HER method [15], an effective heterogeneous-space hashing method, we tested our approaches on BBT1, on which HER has been validated. Following [15], for each database, we randomly extracted 300 image-video pairs (both elements of a pair come from the same subject) for training and 100 images from the rest as queries for the retrieval task. The results are shown in Table 1. Our proposed approaches not only achieve higher mAP than HER, but also overcome HER's limitation to low dimensions; the subsequent experiments show that our approaches can also be applied to high-dimensional large datasets.
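Table 1: mAP on BBT1 at different code lengths.

| Method | 16-bit | 32-bit | 64-bit | 96-bit |
|--------|--------|--------|--------|--------|
| HER    | 0.5049 | 0.5227 | 0.5490 | 0.5531 |
| IBC    | 0.5152 | 0.5369 | 0.5561 | 0.5638 |
| BBC    | 0.5080 | 0.5401 | 0.5643 | 0.5711 |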
4.3 Vine: video retrieval with object images
Vine is a micro-video sharing platform where users can only share videos of no more than six seconds from mobile devices. We collected a micro object-based video dataset from Vine comprising micro videos in categories, and randomly sampled 11,000 videos and 11,000 images from videos for training, with the remaining 1,000 images used for testing.
4.3.1 Comparison with other state-of-the-art methods
We report comparisons with state-of-the-art methods, i.e., hashing methods and a vector-based approach, in terms of two metrics: mAP and Precision@500. The compared vector-based method, a temporal aggregation approach [3] (listed as TAVR in Table 2), employs Scalable Compressed Fisher Vectors (SCFV) to reduce retrieval latency and memory requirements, achieving higher speed while maintaining good performance. However, the approach sacrifices useful information such as structural similarity in pursuit of speed. Moreover, although the binarized Fisher features that TAVR uses are more representative and effective than low-level image features, they still fall short of deep features, especially after redundancy removal. The performance of the vector-based approach and the proposed approaches at 64 bits is shown in Table 2.

Table 2: Comparison with the vector-based approach on Vine at 64 bits.

| Method | mAP    | Precision@500 |
|--------|--------|---------------|
| TAVR   | 0.3785 | 0.3021        |
| IBC    | 0.4997 | 0.4990        |
| BBC    | 0.5045 | 0.5039        |
In the comparisons with hashing methods, we treat a query as a failure case if no point is returned when calculating precision. Ground truths are defined by the category information from the datasets. As Table 3 shows, the two proposed approaches outperform all the other state-of-the-art methods on every metric at all code lengths, and they retain much greater expressive ability even when the code length is as large as 128 bits.
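Table 3: mAP and Precision@500 (P@500) on Vine at different code lengths.

| Method | mAP 16-bit | mAP 32-bit | mAP 64-bit | mAP 96-bit | mAP 128-bit | P@500 16-bit | P@500 32-bit | P@500 64-bit | P@500 96-bit | P@500 128-bit |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| ALSH   | 0.2965 | 0.2983 | 0.3006 | 0.3067 | 0.3122 | 0.2942 | 0.2976 | 0.3014 | 0.3142 | 0.3224 |
| ITQ    | 0.3121 | 0.3156 | 0.3256 | 0.3325 | 0.3378 | 0.2846 | 0.3256 | 0.3274 | 0.3302 | 0.3544 |
| IMH    | 0.3308 | 0.3569 | 0.3596 | 0.3628 | 0.3642 | 0.3282 | 0.3366 | 0.3644 | 0.3722 | 0.3804 |
| SGH    | 0.3849 | 0.4135 | 0.4258 | 0.4322 | 0.4371 | 0.4165 | 0.4276 | 0.4294 | 0.4322 | 0.4424 |
| IBC    | 0.4970 | 0.4985 | 0.4997 | 0.5012 | 0.5064 | 0.4892 | 0.4953 | 0.4990 | 0.5010 | 0.5048 |
| BBC    | 0.4933 | 0.5012 | 0.5045 | 0.5082 | 0.5125 | 0.4842 | 0.4993 | 0.5039 | 0.5115 | 0.5202 |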
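For readers who want to reproduce this kind of evaluation, the two metrics can be computed along the following lines. This is a hedged sketch rather than the evaluation script used in the experiments: the function names and toy codes are hypothetical, relevance is taken to mean that a retrieved item shares the query's category label, and average precision is normalized by the number of relevant items in the ranked list, which is one common convention. The failure-case rule for queries that return no point is not modelled here.

```python
import numpy as np

def hamming_ranking(query_code, db_codes):
    """Indices of database codes sorted by increasing Hamming distance to the query."""
    dist = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dist, kind="stable")

def precision_at_k(relevant_ranked, k):
    """Fraction of relevant items among the top-k entries of a ranked 0/1 list."""
    return float(np.mean(relevant_ranked[:k]))

def average_precision(relevant_ranked):
    """Mean of the precision values taken at the rank of each relevant item."""
    hits = np.flatnonzero(relevant_ranked)
    if hits.size == 0:
        return 0.0
    precisions = (np.arange(hits.size) + 1) / (hits + 1)
    return float(precisions.mean())

# Hypothetical toy database: three 4-bit codes with category labels.
codes = np.array([[1, 1, -1, -1],
                  [-1, -1, 1, 1],
                  [1, 1, -1, 1]])
labels = np.array([0, 1, 0])
query_code, query_label = np.array([1, 1, -1, -1]), 0

order = hamming_ranking(query_code, codes)
relevant = (labels[order] == query_label).astype(int)
ap = average_precision(relevant)
p_at_2 = precision_at_k(relevant, 2)
```

Averaging `ap` over all queries gives mAP, and averaging `precision_at_k(relevant, 500)` over all queries gives Precision@500.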
4.4 UQE50: video retrieval with event images
The video dataset UQE50 (UQ Event dataset with 50 predefined events), downloaded from YouTube (http://www.youtube.com/) [32], targets event analysis tasks. The dataset contains videos that belong to different event categories, and all the videos come from trending events of the last few years, whose granularity is comparably larger than that of existing video event datasets. Compared with the Vine dataset of object-based videos, UQE50 is a smaller dataset of longer event-based videos. To verify the generality of our proposed approaches across different types of video retrieval, we used UQE50 to compare the performance of our approaches with that of other state-of-the-art methods. We randomly chose videos and images from videos as training samples respectively and the rest 200 images were used as test samples.
4.4.1 Comparison with other state-of-the-art methods
To examine the practical efficacy of our proposed approaches, we conducted an experiment on UQE50 similar to the Vine experiment, evaluating their scalability against other state-of-the-art hashing methods. Even for this smaller dataset of longer videos, the performance (i.e., mAP and Precision@500) remains strong, as shown in Figure 2(a)(b). As the code length grows, the proposed approaches remain consistently better than the other methods on both metrics.
4.5 FCVID: video retrieval with event/object images
The video dataset FCVID (Fudan-Columbia Video Dataset) (http://bigvid.fudan.edu.cn/FCVID/) is a dataset containing Web videos annotated manually according to 239 categories. The categories in FCVID cover a wide range of topics such as social events (e.g., tailgate party), procedural events (e.g., making cake), objects (e.g., panda), and scenes (e.g., beach). These categories were defined very carefully and organized in a hierarchy of 11 high-level groups. Specifically, the categories were identified through user surveys, with the organization structures on YouTube and Vimeo used as references. In this section, we chose as our video dataset and randomly selected images and videos for training, then we tested another images in the video dataset.
4.5.1 Comparison with other state-of-the-art methods
In this part, we conducted the same experiment as on Vine and UQE50 on the FCVID dataset to study the performance of event/object image retrieval. The wide range of image types and the higher reliability of this dataset allow the effect of our approaches to be demonstrated clearly. As Figure 2(c)(d) shows, our approaches outperform the other state-of-the-art hashing methods on both mAP and Precision@500 at different code lengths. The better performance on four datasets of different types and sizes supports the validity of the proposed IBC and BBC approaches.
4.6 Effect of training size on UQE50 and Vine
This experiment evaluates the effect of training size on the search quality of our IBC and BBC approaches. We performed the experiments on the object-based Vine and event-based UQE50 datasets and selected mean average precision (mAP) as the comprehensive assessment index. We fixed the code as bit and varied the training size of Vine dataset from to with a regular interval of and UQE50 is tuned from to with a regular interval of . The results are shown in Figure 3. As we can see, both IBC and BBC have a suitable training size for the best performance; beyond it, adding more training data does not yield a noticeable performance boost. The ideal training size of Vine is for IBC approach and for BBC approach. And for UQE50 dataset, the performance ultimately is optimal with the training size of for IBC approach and for BBC approach. This section also gives guidance for choosing a suitable training size.
5 Conclusion
In this paper, we developed the Binary Subspace Coding (BSC) framework, which includes two approaches for query-by-image video retrieval. Different from traditional video retrieval methods, we focused on subspace-based video representation and learned a common Hamming space for both images and videos to enable efficient retrieval. Through a new distance metric, our similarity-preserving measurement preserves the geometric structure of videos better than traditional methods. Furthermore, we derived an equivalent MIPS formulation of the point-to-subspace problem, which significantly decreases the computational cost. Besides, the BSC framework is an asymmetric learning model that efficiently handles the complexity of computation and memory as well as the domain differences between videos and images. Extensive experiments on four datasets, BBT1, UQE50, Vine, and FCVID, demonstrated the advantages of our two approaches over several state-of-the-art methods.
References
[1] R. Basri, T. Hassner, and L. Zelnik-Manor. Approximate nearest subspace search. TPAMI, 33(2):266–278, 2011.
[2] M. Bäuml, M. Tapaswi, and R. Stiefelhagen. Semi-supervised learning with constraints for person identification in multimedia data. In CVPR, 2013.
[3] A. F. de Araújo, J. Chaves, R. Angst, and B. Girod. Temporal aggregation for large-scale query-by-image video retrieval. In ICIP, 2015.
[4] A. F. de Araújo, J. Chaves, R. Angst, and B. Girod. Temporal aggregation for large-scale query-by-image video retrieval. In ICIP, 2015.
[5] A. F. de Araújo, M. Makar, V. Chandrasekhar, D. M. Chen, S. S. Tsai, H. Chen, R. Angst, and B. Girod. Efficient video search using image queries. In ICIP, 2014.
[6] C. Gan, N. Wang, Y. Yang, D. Yeung, and A. G. Hauptmann. Devnet: A deep event network for multimedia event detection and evidence recounting. In CVPR, 2015.
[7] Y. Gong, S. Kumar, H. A. Rowley, and S. Lazebnik. Learning binary codes for high-dimensional data using bilinear projections. In CVPR, 2013.
[8] Y. Gong, S. Kumar, V. Verma, and S. Lazebnik. Angular quantization-based binary codes for fast similarity search. In NIPS, 2012.
[9] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. TPAMI, 35(12):2916–2929, 2013.
[10] R. Hong, Y. Yang, M. Wang, and X. S. Hua. Learning visual semantic relationships for efficient visual retrieval. IEEE Transactions on Big Data, 1(4):152–161, 2015.
[11] Y. Hu, A. S. Mian, and R. A. Owens. Sparse approximated nearest points for image set classification. In CVPR, 2011.
[12] Q. Jiang and W. Li. Scalable graph hashing with feature transformation. In IJCAI, 2015.
[13] Y. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yang. Understanding and predicting interestingness of videos. In AAAI, 2013.
[14] A. J. Laub. Matrix analysis for scientists and engineers. SIAM, 2005.
[15] Y. Li, R. Wang, Z. Huang, S. Shan, and X. Chen. Face video retrieval with image query via hashing across euclidean space and riemannian manifold. In CVPR, 2015.
[16] W. Liu, J. Wang, S. Kumar, and S. Chang. Hashing with graphs. In ICML, 2011.
[17] F. Perronnin, J. Sánchez, and T. Mensink. Improving the fisher kernel for large-scale image classification. In ECCV, 2010.
[18] F. Shen, W. Liu, S. Zhang, Y. Yang, and H. T. Shen. Learning binary codes for maximum inner product search. In ICCV, 2015.
[19] F. Shen, X. Zhou, Y. Yang, J. Song, H. T. Shen, and D. Tao. A fast optimization method for general binary code learning. IEEE Transactions on Image Processing, 25(12), 2016.
[20] X. Shen, F. Shen, Q. S. Sun, Y. Yang, Y. H. Yuan, and H. T. Shen. Semi-paired discrete hashing: Learning latent hash codes for semi-paired cross-view retrieval. IEEE Transactions on Cybernetics, 2016.
[21] A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In NIPS, 2014.
[22] A. Shrivastava and P. Li. Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). In UAI, 2015.
[23] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In SIGMOD, 2013.
[24] R. Vemulapalli, J. K. Pillai, and R. Chellappa. Kernel learning for extrinsic classification of manifold features. In CVPR, 2013.
[25] R. Wang and X. Chen. Manifold discriminant analysis. In CVPR, 2009.
[26] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.
[27] Z. Xu, Y. Yang, and A. G. Hauptmann. A discriminative CNN video representation for event detection. In CVPR, 2015.
[28] Y. Yan, Y. Yang, D. Meng, G. Liu, W. Tong, A. G. Hauptmann, and N. Sebe. Event oriented dictionary learning for complex event detection. TIP, 24(6):1867–1878, 2015.
[29] Y. Yang, F. Shen, H. T. Shen, H. Li, and X. Li. Robust discrete spectral hashing for large-scale image semantic indexing. IEEE Transactions on Big Data, 1(4):162–171, 2015.
[30] B. Yi, Y. Yang, F. Shen, X. Xu, and H. T. Shen. Bidirectional long-short term memory for video description. In ACM, 2016.
[31] L. Yu, Y. Yang, Z. Huang, P. Wang, J. Song, and H. Shen. Web video event recognition by semantic analysis from ubiquitous documents. IEEE Transactions on Image Processing, PP(99):1–1, 2016.
[32] L. Yu, Y. Yang, Z. Huang, P. Wang, J. Song, and H. T. Shen. Web video event recognition by semantic analysis from ubiquitous documents. In TIP, preprint.