Deep Hashing with Category Mask for Fast Video Retrieval

12/22/2017 ∙ by Xu Liu, et al.

This paper proposes an end-to-end deep hashing framework with a category mask for fast video retrieval. We train our network in a supervised way by fully exploiting inter-class diversity and intra-class identity. A classification loss is optimized to maximize inter-class diversity, while an intra-pair loss is introduced to learn representative intra-class identity. We investigate the distribution of binary bits with respect to categories and find that the effectiveness of binary bits is highly category-dependent: certain bits may even degrade the classification performance of some categories. We then design a hash code generation scheme with a category mask to filter out bits with negative contribution. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods under various evaluation metrics on public datasets. We will make our code and models publicly available online.

1 Introduction

Over recent years, industry has witnessed the boom of short-video sharing apps and platforms, through which people record and share their daily moments in the form of short videos lasting a few seconds. This has encouraged the development of advanced techniques for a wide range of multimedia understanding applications. One open question is how to efficiently retrieve relevant videos from a large-scale video database, which requires efficient video representation learning. In industrial applications, learned video representations should satisfy three requirements: they should be representative of the content, efficient to store, and cheap to compute. One approach that meets these requirements is learning-based video hashing [Wu et al.2017, Liong et al.2017b].

Existing learning-to-hash methods can be classified into two approaches: non-deep hash learning [Weiss et al.2009, Liu et al.2012, Jiang and Li2015] and deep hash learning [Xia et al.2014, Lai et al.2015, Liu et al.2016, Liong et al.2017a, Jain et al.2017]. The non-deep approach uses various statistical learning techniques to learn hash functions which map samples into binary codes. In the past few years, deep convolutional neural networks (CNNs) [Krizhevsky et al.2012, Simonyan and Zisserman2014, He et al.2016] have demonstrated state-of-the-art performance on various visual tasks. Inspired by the advancement of deep CNN techniques, many deep hashing methods have been proposed. By training an end-to-end CNN model, existing deep hashing techniques manage to simultaneously learn image representations as well as binary codes [Xia et al.2014, Liong et al.2015, Jain et al.2017, Lin et al.2016, Venkateswara et al.2017, Duan et al.2017].

Although existing deep hashing approaches have achieved remarkable performance, they were mostly designed for image-based binary code learning. By contrast, relatively few deep hashing methods in the literature are specially designed for videos. Learning to hash for videos is much more challenging than for images, as videos provide far more diverse and complex visual information than images do. Existing video hashing approaches [Coskun et al.2006, Weng and Preneel2010, Cao et al.2012, Ye et al.2013, Song et al.2013, Sun et al.2016] mostly focus on first extracting statistical or perceptual features from videos and then applying image hashing methods to those features to obtain the binary codes, which is a one-directional pipeline. As a result, the quality of the produced hash code heavily depends on the quality of the obtained features, while the hash code is not used to guide the learning of the features. To address this problem, Wu et al. [Wu et al.2017] integrated video feature learning and hash value learning into a joint learning model, where offline processing such as K-means clustering and Canonical Correlation Analysis (CCA) needs to be employed on the learned video features to learn binary codes and hash functions. Liong et al. [Liong et al.2017b] simplified the learning pipeline by integrating the learning of video features, binary codes and hash functions into a single deep neural network, whose parameters are learned with a siamese network. In this work we propose an end-to-end deep video hashing network that learns intra-class identity while maximizing inter-class diversity. Furthermore, inspired by [Li et al.2017, Molchanov et al.2017], which demonstrated that convolutional neural networks involve massive redundant parameters, we study the distribution of binary bits with respect to categories and present a category-mask-based binary code generation approach. The contributions of the present work can be summarized as follows:

  • We present an end-to-end deep video hashing framework which simultaneously learns the feature representation and the hash code.

  • We propose to employ inter-class diversity and intra-class identity as training objectives to learn a discriminative yet representative binary descriptor. The video intra-pair is introduced in this work to learn intra-class identity.

  • We investigate the distribution of hash code bits with respect to data categories and propose a category mask based hash code generation scheme for efficient video retrieval. To the best of our knowledge, no existing work has investigated the relationship between data categories and the distribution of hash code bits for efficient video retrieval.

2 Related Work

Learning to Hash. Typical works on statistical hash learning include supervised hashing with kernels (KSH) [Liu et al.2012], PCA-random rotation (PCA-RR) [Gong et al.2013], spectral hashing (SH) [Weiss et al.2009], iterative quantization (ITQ) [Gong et al.2013], scalable graph hashing (SGH) [Jiang and Li2015], and sparse embedding and least variance encoding (SELVE) [Zhu et al.2014]. All these hashing methods take a vector of hand-crafted visual features extracted from an image as input. In the last few years, many deep hashing methods have been proposed to simultaneously learn image representations and binary codes. Xia et al. [Xia et al.2014] presented a two-stage supervised hashing method via image representation learning, where the learned approximate hash codes are used to guide the learning of the image representation, but the learned image representation cannot give feedback for learning better approximate hash codes. To address this issue, simultaneous feature learning and hashing techniques within a single neural network were proposed [Lai et al.2015, Lin et al.2015]. Jain et al. [Jain et al.2017] presented a structured hash code learning framework by introducing a block-softmax nonlinearity. Venkateswara et al. [Venkateswara et al.2017] proposed a supervised deep hashing framework that addresses the domain adaptation problem. Regarding training objectives, Liong et al. [Liong et al.2015] incorporated pair-wise supervision to train the deep hashing model; similar pair-wise based works can be found in [Li et al.2016, Liu et al.2016]. Triplet ranking was employed to learn parameters in [Lai et al.2015], which achieved improved performance. Lin et al. [Lin et al.2016] learned a deep hashing function in an unsupervised way with three training objectives: invariance to image rotation, quantization loss minimization, and evenly distributed learned bits. Duan et al. [Duan et al.2017] also proposed an unsupervised binary descriptor learning framework, where K-Auto-Encoders are used to minimize the multi-quantization loss.

Video Hashing. Conventional approaches for video hashing usually select frames from a video, treat the selected frames as separate images, and then employ image hashing techniques on them [Coskun et al.2006, Weng and Preneel2010, Cao et al.2012, Ye et al.2013, Song et al.2013, Hao et al.2017b]. For example, Weng and Preneel [Weng and Preneel2010] proposed to extract a feature from each frame and then generate the hash code based on the extracted statistical feature vector, while Cao et al. [Cao et al.2012] and Hao et al. [Hao et al.2017b] utilized multiple frame sets and multiple key frames to learn hash functions. Sun et al. [Sun et al.2016] proposed a hash learning method based on a deep belief network, where a fusion of visual-appearance and visual-attention features is used as input. All the above methods employ hand-crafted features which are fixed during the hash learning process. In [Hao et al.2017a], CNN features were extracted to learn the hash function. Zhang et al. [Zhang et al.2016] presented an unsupervised video hash learning framework that uses a binary LSTM module and a normal LSTM module as the encoder and the decoder respectively, with frame-level features extracted by a deep CNN; still, feature generation and hash code generation are processed separately. Wu et al. [Wu et al.2017] proposed an integrated framework in which feature extraction, binary code learning and hash function learning are optimized in a self-taught manner, yet this method does not learn the binary code and hash function as part of the deep architecture.

We propose to learn both the video feature representation and the hash functions within a single deep learning pipeline. As far as we know, only one other approach has provided an end-to-end deep hash learning pipeline [Liong et al.2017b]. Our approach differs from that work both in the neural network structure and in the supervised learning objective. We also investigate the influence of category on the hash code bit distribution and devise a category mask based hash code generation method for efficient video retrieval.

Figure 1: Overview of the proposed deep hashing framework. At the training phase, the model is trained with batches of intra-pair data by optimizing the cross-entropy loss for classification and the intra-pair loss. At the retrieval phase, shown in the bottom part, the Hamming distance between the query video and the target is calculated. First, the query video is forwarded through the network and the binary code is produced by thresholding the outputs of the binary encoding module. Second, the query code is XORed with the hash code of the target video, followed by category masking on the XOR output to filter out bits that degrade classification accuracy, producing a binary code for counting the Hamming distance. The category mask is a K x L matrix, where K is the number of categories and L denotes the length of the binary vector.

3 The Proposed Approach

In this section, we present the design of our deep video hashing architecture, its supervised learning and the category mask for fast retrieval.

3.1 Architecture

The proposed deep video hashing architecture is shown in Fig. 1, which consists of the input module, the backbone network, the binary encoding module and the output module.

The input module selects frames from the video. We introduce intra-pairs for training purposes. An intra-pair (S_1, S_2) is defined as a pair of frame sets extracted from the same video. Each frame set consists of a group of N video frames randomly selected at even intervals, i.e., S_1 = {x_1, ..., x_N} and S_2 = {x'_1, ..., x'_N}. There are no overlapping frames between the two frame sets within an intra-pair, i.e., S_1 ∩ S_2 = ∅. At the training phase, the input is an intra-pair (S_1, S_2), while at the retrieval phase it is a single frame set S.
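For illustration, below is a minimal sketch of how such an intra-pair could be sampled. The segment-wise sampling strategy, the function name, and the default of five frames per set are assumptions made for this example, not the paper's exact procedure.

```python
import numpy as np

def sample_intra_pair(num_frames, n=5, seed=None):
    """Sample two disjoint frame sets of n frames each, spread evenly over
    the video. Segment-wise sampling is an assumption of this sketch."""
    rng = np.random.default_rng(seed)
    bounds = np.linspace(0, num_frames, n + 1, dtype=int)   # n even segments
    set_a, set_b = [], []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        picks = rng.choice(np.arange(lo, hi), size=2, replace=False)
        set_a.append(int(picks[0]))     # one frame per segment for S_1
        set_b.append(int(picks[1]))     # a different frame for S_2
    return set_a, set_b                 # disjoint by construction

# Example: a 150-frame clip with N = 5 frames per set
s1, s2 = sample_intra_pair(150, n=5, seed=0)
```

Picking one frame per segment for each set keeps both sets evenly spread over the clip while guaranteeing they never share a frame.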

The backbone network is a deep CNN consisting of multiple convolutional layers followed by a fully-connected layer. The deep CNN is used to learn the video representation: frames selected evenly from the input video are forwarded to the backbone CNN module, which extracts feature maps for each selected frame and fuses them into a single set of feature maps. The fused feature maps are connected to a fully-connected layer to generate the video representation. To better capture the temporal evolution across consecutive frames, we fuse the feature maps in a weighted way, as illustrated in Equation 1.

F^c = \sum_{i=1}^{N} w_i f_i^c, \quad c = 1, \ldots, C    (1)

where F = {F^c} represents the fusion output over the feature maps of the N input frames, f_i = {f_i^c} is the combination of feature maps in a layer for frame i with i = 1, ..., N, f_i^c is the feature map output for the c-th channel, c = 1, ..., C, and w_i are the fusion weights that are learned during the training process.
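A minimal numerical sketch of this weighted fusion is given below; the per-frame map shape (2048 channels of 7x7 maps) and the uniform initial weights are assumptions for illustration only.

```python
import numpy as np

def fuse_feature_maps(frame_maps, weights):
    """Eq. 1 sketch: weighted sum of per-frame feature maps.

    frame_maps: (N, C, H, W) feature maps of the N selected frames.
    weights:    (N,) learned fusion weights w_i.
    Returns the fused map of shape (C, H, W)."""
    weights = np.asarray(weights, dtype=np.float32)
    return np.tensordot(weights, frame_maps, axes=([0], [0]))

# Example: N = 5 frames, C = 2048 channels of 7x7 maps (shapes are assumptions)
maps = np.random.rand(5, 2048, 7, 7).astype(np.float32)
w = np.full(5, 1.0 / 5, dtype=np.float32)   # in the paper the weights are learned
fused = fuse_feature_maps(maps, w)          # shape (2048, 7, 7)
```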

The backbone network is loosely coupled with the other modules in the proposed architecture and can therefore be replaced with other efficient CNN modules, such as AlexNet [Krizhevsky et al.2012], VGG [Simonyan and Zisserman2014] and ResNet [He et al.2016], or with LSTM modules employed to extract video representations.

The binary encoding module consists of a fully-connected layer which encodes the video representation vector into binary-like outputs by applying a sigmoid activation.

The output module outputs both the class-probability estimates and the binary-like vector, which is then thresholded to produce the binary hash code.
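The binary encoding step can be sketched as a single function; the 0.5 threshold on the sigmoid outputs and the weight shapes below are assumptions of this sketch, not values specified by the paper.

```python
import numpy as np

def binary_encode(video_repr, W_enc, b_enc, threshold=0.5):
    """Binary encoding module sketch: a fully-connected layer with a sigmoid
    produces the binary-like vector u; thresholding u yields the hash code."""
    u = 1.0 / (1.0 + np.exp(-(W_enc @ video_repr + b_enc)))   # binary-like outputs in (0, 1)
    code = (u > threshold).astype(np.uint8)                    # hash code used at retrieval
    return u, code

# Example: a 2048-d video representation mapped to a 64-bit code (shapes assumed)
repr_2048 = np.random.randn(2048).astype(np.float32)
W_enc = np.random.randn(64, 2048).astype(np.float32) * 0.01
b_enc = np.zeros(64, dtype=np.float32)
u, code = binary_encode(repr_2048, W_enc, b_enc)
```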

3.2 Supervised Learning with Intra-Pair Loss and Classification Loss

The proposed deep hashing model is learned in a supervised way. We aim to learn a set of network parameters W that quantizes the input video into a compact binary vector while preserving high-level semantic information. We enforce two criteria on a compact yet discriminative binary descriptor. First, the learned binary descriptor should maximize inter-class diversity. Second, the learned binary descriptor should be highly representative of intra-class identity. To achieve these two objectives, we formulate the following optimization problem to learn W with the proposed deep hashing network:

\min_{W} L = \alpha L_c + \beta L_p    (2)

where L_c defines the loss that maximizes inter-class diversity, L_p is minimized to learn intra-class identity, and \alpha and \beta are parameters that balance the two objectives.

Inter-class diversity. As classification information describes the high-level semantic category of video data, it is a simple yet effective signal for representing inter-class diversity. As such, we employ the cross-entropy loss to define L_c and maximize inter-class diversity, as given in Equation 3.

L_c = -\sum_{k=1}^{K} y_k \log p_k    (3)

where y_k is the ground-truth label indicator for class k and p_k is the predicted probability of class k.

Intra-class identity. Ideally, the intra-class identity is the unique identifier of a specific class. In real scenarios, training data usually exhibits two characteristics: 1) intra-class sample distances vary across classes; 2) a large amount of noisy data exists, especially in complex datasets. As a result, it is difficult to learn a representative class identity directly from video pairs within the same class. Instead, learning a high-quality intra-class identity requires sample pairs that are highly identical in semantics, contain minimal noise, and still exhibit maximal low-level visual variation.

To address this issue, we introduce the intra-pair, defined as a pair of frame sets extracted from the same video, as described in Section 3.1. For any frame set extracted over the whole duration of a video, a perfectly learned hash function would output the same binary code, as frame sets extracted from the same video share the same semantic information. Consequently, we learn intra-class identity by minimizing the distance between the two frame sets within an intra-pair. In practice, the intra-pair loss is defined over the binary-like outputs of an intra-pair, as Equation 4 shows.

L_p = \max(\|u_1 - u_2\|_2^2 - m, 0)    (4)

where u_1 and u_2 are the binary-like vectors learned to represent S_1 and S_2 respectively, \|\cdot\|_2 defines the \ell_2-norm distance, and m is the positive margin value as defined in [Liu et al.2016]. Without m, minimizing the loss function would collapse the representation toward all 0s, seriously affecting system performance.
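As a rough illustration, the combined objective of Equations 2-4 could be computed as in the sketch below. The exact margin formulation of Equation 4, the default weights, and all function names here are assumptions of this sketch rather than the paper's verified implementation.

```python
import numpy as np

def cross_entropy(probs, label):
    """L_c sketch: classification (inter-class) loss for one frame set."""
    return -np.log(probs[label] + 1e-12)

def intra_pair_loss(u1, u2, margin=1.0):
    """L_p sketch: penalize the intra-pair distance only beyond a margin m,
    so minimizing it is not satisfied by collapsing all outputs toward zero.
    (The exact margin form of Eq. 4 is an assumption of this sketch.)"""
    dist = np.sum((u1 - u2) ** 2)
    return max(dist - margin, 0.0)

def total_loss(probs1, probs2, label, u1, u2, alpha=1.0, beta=1.0, margin=1.0):
    """Eq. 2 sketch: weighted sum of the two objectives."""
    l_c = cross_entropy(probs1, label) + cross_entropy(probs2, label)
    l_p = intra_pair_loss(u1, u2, margin)
    return alpha * l_c + beta * l_p

# Example with a 5-class problem and 64-bit binary-like outputs (values assumed)
p1 = np.array([0.1, 0.6, 0.1, 0.1, 0.1]); p2 = np.array([0.2, 0.5, 0.1, 0.1, 0.1])
u1 = np.random.rand(64); u2 = np.random.rand(64)
loss = total_loss(p1, p2, label=1, u1=u1, u2=u2)
```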

3.3 Category Mask

Previous works have shown that convolutional neural networks involve massive redundant parameters [Li et al.2017, Molchanov et al.2017]. Inspired by this observation, we investigate the distribution of the hash bits produced by thresholding the outputs of the binary encoding module in the proposed method.

3.3.1 Observation

For a dataset with K categories, we define a binary vector v_k of dimension L for each category, k = 1, ..., K. Given a ratio r, the binary vector for the k-th category is set as in Equation 5.

v_k^{(j)} = \begin{cases} 1, & \text{if bit } j \text{ is among the top } \lceil rL \rceil \text{ bits contributing to category } k \\ 0, & \text{otherwise} \end{cases}, \quad j = 1, \ldots, L    (5)

The sum of contributed categories is then represented as a natural-number vector s of dimension L, calculated as:

s^{(j)} = \sum_{k=1}^{K} v_k^{(j)}, \quad j = 1, \ldots, L    (6)

where s^{(j)} and v_k^{(j)} denote the j-th entry of vector s and of vector v_k, respectively.
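The following sketch computes the per-category bit vectors of Equation 5 and the per-bit category counts of Equation 6. Ranking bits by the magnitude of the classification-layer weights is an assumption of this sketch, since the text only refers to a "top weight ratio".

```python
import numpy as np

def category_bit_vectors(W, ratio):
    """Eq. 5 sketch: for each category k, mark the top ceil(ratio * L) bits.
    Ranking bits by |W[k, j]| (classification-layer weight magnitude) is an
    assumption of this sketch."""
    K, L = W.shape
    top = max(1, int(np.ceil(ratio * L)))
    v = np.zeros((K, L), dtype=np.uint8)
    for k in range(K):
        idx = np.argsort(-np.abs(W[k]))[:top]   # bits most relevant to class k
        v[k, idx] = 1
    return v

def contributed_categories(v):
    """Eq. 6: per-bit count of categories whose vector selects that bit."""
    return v.sum(axis=0)   # length-L vector; the ideal value per bit is ratio * K

# Example: K = 101 classes, L = 64 bits, ratio r = 0.3 (weights are random here)
W = np.random.randn(101, 64)
v = category_bit_vectors(W, 0.3)
s = contributed_categories(v)   # entries should hover around 0.3 * 101 ≈ 30
```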

Obviously, for an evenly distributed hash code of length L, the value of s^{(j)}, j = 1, ..., L, should be rK, given ratio r. We learn a 64-bit hash code on the UCF101 dataset (a dataset of 101 human action classes from videos in the wild, arXiv:1212.0402) and plot the values of vector s under ratios 0.3, 0.5 and 0.7, as shown in Fig. 2, where the solid curves represent the experimental results and the dashed lines the ideal distribution. It can be observed that the deviation from the ideal value for each bit stays within a small range and that, on average, the number of contributed categories per bit equals the ideal value, showing that every bit in the learned binary code contributes evenly to category classification.

Figure 2: Sum of contributed categories for each bit, where the x-axis denotes the bit index from 0 to 63 and the y-axis is the total number of categories calculated using Equation 6.
Figure 3: Visualization of the mapping relationship between bits and categories under ratio 0.3. The horizontal axis is the bit index from 0 to 63, and the vertical axis shows 10 categories from UCF101. Each block defines a mapping relationship: a green block at (k, j) means that for category k, the value of v_k^{(j)} is set to 1 according to Equation 5, while a white block means v_k^{(j)} = 0.

To illustrate the influence of different bits on the classification output, we further visualize the mapping relationship between each bit and each category, as shown in Fig. 3. Due to space limitations, the results of only 10 categories are presented here. We can see that every single bit behaves differently across categories, and that for each category only specific bits contribute to its classification output.

From the above analysis, we can see that the learned hash bits are evenly distributed over all categories, while each bit contributes differently to different categories. In other words, every bit of the hash code is essential for the classification task, but redundancy does exist locally with respect to individual categories.

Consequently, we argue that hash code generation should be category-aware and propose a category mask based hash code generation method for fast video retrieval.

3.3.2 Methodology

The category mask M(r) is defined as a binary matrix of dimension K x L with respect to a ratio value r, where K is the number of classes in the database and L is the length of the hash code. Each row vector of the category mask is assigned the binary vector of the corresponding class, as defined in Equation 7.

M(r) = [v_1; v_2; \ldots; v_K] \in \{0, 1\}^{K \times L}    (7)

where v_k is calculated as in Equation 5.

The category mask is calculated after training is completed. Let B = {b_t} represent the hash code collection of the retrieval dataset with K classes, and let b_t \in \{0, 1\}^L denote a binary hash code. At the retrieval phase, the classification output indicating the category k of the query video is used to index a row of the binary mask, which is then applied to the output of the XOR operation between the query and the target:

d = M_k \wedge (b_q \oplus b_t)    (8)

where b_q denotes the query hash code and the output d is a binary vector of length L. The Hamming distance between the query and the target is then calculated as the number of 1s in d.

As can be seen from Equation 7, the mask acts as a filter on the hash code bits, with r as the adjusting factor that controls its strength. When r is set to 1.0, the mask is filled with all 1s and Equation 8 reduces to a general Hamming distance computation between the query and the target.
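A minimal sketch of the masked Hamming distance of Equation 8 is given below; the helper names and the mask construction referenced in the comments carry over the assumptions of the earlier sketches.

```python
import numpy as np

def masked_hamming(b_query, b_target, mask_row):
    """Eq. 8 sketch: XOR the two codes, keep only the bits selected by the
    query category's mask row, and count the remaining differing bits."""
    diff = np.bitwise_xor(b_query, b_target)     # 0/1 vectors of length L
    return int(np.sum(diff & mask_row))

# Illustrative usage (the helpers and variable names below are assumptions):
# M = category_bit_vectors(W, ratio)     # K x L mask, see the earlier sketch
# k = int(np.argmax(class_probs))        # category predicted for the query video
# dist = masked_hamming(b_query, b_target, M[k])
```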

The overall hash code generation process with category mask is illustrated in Fig. 1. The experimental section demonstrates the effectiveness of the proposed category masking scheme.

4 Experimental Results

In this section, we verify the efficacy of the proposed category mask based deep video hashing method, named DVHCM. The experimental settings are described in subsection 4.1, including the benchmark datasets used for evaluation and the network layer setup of the proposed method. For performance verification, we first demonstrate in subsection 4.2 the efficacy of the proposed category mask scheme by employing various masks generated under different ratios, and then in subsection 4.3 we provide extensive evaluations of the proposed DVHCM by comparing it with the state-of-the-art.

4.1 Experimental settings

We evaluate our approach on two benchmark datasets for action recognition: UCF101 and HMDB51 [Kuehne et al.2011].

  • The UCF101 dataset consists of 101 action categories from 13320 realistic action videos, covering five activity types: human-object interaction, body-motion only, human-human interaction, playing musical instruments, and sports. The clip duration for most videos in UCF101 is less than 10 seconds.

  • The HMDB51 dataset contains 6766 videos with 51 distinct action categories, covering various facial actions and body movements. The clip duration for most videos in HMDB51 is less than 5 seconds.

The training sets of UCF101 and HMDB51 consist of 9624 and 5115 videos respectively, and the remaining videos form the test sets. Retrieval is performed by using the videos from the test set as queries for the system to retrieve relevant videos from the training set. Semantic-level labels define similarity, i.e., a retrieved video is relevant to the query if they share the same semantic label.

Following previous hashing works, three standard evaluation metrics are used to measure the accuracy of our proposed method and other baselines: mean Average Precision (mAP), Precision-Recall curves and Precision curves w.r.t. different numbers of top returned samples.
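For reference, the Average Precision underlying mAP can be computed per query as in the sketch below; this is the standard definition applied to a Hamming-ranked result list, not code taken from the paper.

```python
import numpy as np

def average_precision(relevant):
    """AP for one query, given the relevance (True/False) of the ranked results."""
    relevant = np.asarray(relevant, dtype=bool)
    if not relevant.any():
        return 0.0
    precision_at_k = np.cumsum(relevant) / (np.arange(relevant.size) + 1)
    return float(precision_at_k[relevant].mean())

# mAP is the mean AP over all queries, e.g.:
# mAP = np.mean([average_precision(r) for r in per_query_relevance])
print(average_precision([True, False, True, False]))   # (1/1 + 2/3) / 2 ≈ 0.833
```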

For the proposed method, we employ ResNet-50 [He et al.2016] as the backbone module of the proposed deep hashing architecture. The model is trained in a supervised way by optimizing the proposed intra-pair loss and the classification loss. At the training phase, the input to the network is an intra-pair consisting of two frame sets, while at the retrieval phase the input is a single frame set. Each frame set contains N frames randomly selected from a video; in the experiments we set N to 5. Each frame in a frame set is resized to the backbone input resolution and then forwarded to the ResNet-50 module to generate a 2048-d image representation. As described in Section 3.1, the image representations from the same frame set are fused with the learned weights to produce a single output, which is regarded as the video representation. The fusion weights are learned at the training phase. The binary encoding module consists of a fully-connected layer followed by a sigmoid activation layer to produce binary-like vectors. To evaluate the performance of hash codes of various lengths, we set the length of the binary-like vectors to 32, 64, 128, 256 and 512, respectively. Following the binary encoding module, the output module employs a classification layer.

4.2 Evaluation on category mask

To explore how well the proposed category mask scheme improves retrieval performance, we generate category masks with top weight ratios ranging from 0.1 to 1.0 for different binary code lengths.

Table 1 presents the mAP values of different mask ratios @32, 64, 128, and 256 bits. Items in bold show the best mAP value for the corresponding code length in each column. We observe that, in general, longer binary codes yield better mAP performance, and as the binary code length grows, the best mask ratio increases. Hence, at the retrieval phase, we use larger ratios to produce masks for longer binary codes and smaller ratios for shorter ones. For masks with very low ratios, retrieval accuracy degrades at short code lengths; for example, the curves of ratio 0.1 @64 bits degrade the precision-recall performance. We attribute this degradation to the lack of enough bits to represent the video, which is the opposite of bit redundancy.

Fig. 4 shows the precision curves w.r.t. the top-N retrieved videos @64 bits under various mask ratios. Precision values for at most 60 returned samples are presented, as each category in the UCF101 and HMDB51 training sets contains more than 60 videos on average. We find that the retrieval performance with mask ratios larger than 0.4 clearly outperforms the variant without category masking, shown by the black curve in the figure.

Table 1: mAP performance by Hamming ranking for category mask ratios from 0.1 to 1.0 at 32, 64, 128 and 256 bits on UCF101 and HMDB51. Bold items mark the best mAP for each code length.
Figure 4: Precision curve w.r.t. top-N @64 bits under different mask ratios on UCF101 and HMDB51.

4.3 Comparison with the state-of-the-art

We compare the proposed DVHCM method with state-of-the-art baselines on video retrieval tasks, including seven non-deep approaches: SH  [Weiss et al.2009], ITQ  [Gong et al.2013], AGH  [Liu et al.2011], PCA-RR  [Gong et al.2013], SGH  [Jiang and Li2015], SELVE  [Zhu et al.2014] and KSH  [Liu et al.2012], and three deep approaches: DBH  [Lin et al.2015], DNNH [Lai et al.2015] and SUBIC [Jain et al.2017]. All of the methods use identical training and query sets.

For the proposed method, we present results with the best mask ratios as well as results without using category mask. For dataset UCF101, mask ratios are set to 0.6, 0.4, 0.4 and 0.3 respectively at 64 bits, 128 bits, 256 bits and 512 bits, while for dataset HMDB51, they are set to 0.5, 0.3, 0.3 and 0.2. The network setting is described in the experimental setting section.

For a fair comparison, the deep baselines use the same backbone network as the proposed method, i.e., ResNet-50 initialized with a model pre-trained on the ImageNet dataset. The remaining layers of the baselines are set according to their reference papers. The triplets used to train DNNH are generated with the method presented in the original paper [Lai et al.2015], replacing images with videos. For SUBIC [Jain et al.2017], the dimension of each one-hot vector is set to 8. We implemented the proposed method and the deep baselines on MXNet (a flexible and efficient machine learning library for heterogeneous distributed systems, arXiv:1512.01274), and as the baselines were originally designed for image retrieval, we first trained and tested the hash models for image retrieval on the Cifar-10 dataset [Krizhevsky2009] to make sure our implementations reproduce results similar to those reported in the reference papers.

For the non-deep baselines, instead of using hand-crafted visual features as most previous hashing works did, we represent each video with a 2048-dimensional deep feature extracted with DBH [Lin et al.2015] to avoid a retrieval performance gap caused by feature representations. All non-deep baselines are evaluated using the implementations from the HABIR toolkit (hashing baselines for image retrieval, https://github.com/willard-yuan/hashing-baseline-for-image-retrieval). As the original SUBIC paper proposes to use the floating-point vector instead of the hash code for retrieval, we compare our method with SUBIC using mAP calculated on the top returned samples, while for the rest of the baselines we compare mAP by Hamming ranking.

                    UCF101                                     HMDB51
Method              512 bits  256 bits  128 bits  64 bits      512 bits  256 bits  128 bits  64 bits
AGH                 -         -         -         -            -         -         -         -
ITQ                 -         -         -         -            -         -         -         -
KSH                 0.848     -         -         -            -         -         -         -
PCA-RR              -         -         -         -            -         -         -         -
SELVE               -         -         -         -            -         -         -         -
SGH                 -         -         -         -            -         -         -         -
SH                  -         -         -         -            -         -         -         -
DBH                 -         -         -         -            -         -         -         -
DNNH                -         0.817     0.789     0.740        0.480     0.493     0.503     0.487
Proposed w/o CM     -         -         -         -            -         -         -         -
Proposed            0.953     0.959     0.949     0.901        0.672     0.588     0.588     0.605
Table 2: mAP performance by Hamming ranking of the proposed methods and the baselines. Note that deep features are used for the non-deep hashing baselines.
                    UCF101                                     HMDB51
Method              512 bits  256 bits  128 bits  64 bits      512 bits  256 bits  128 bits  64 bits
SUBIC               -         -         -         -            -         -         -         -
Proposed w/o CM     -         -         -         -            -         -         -         -
Proposed            0.870     0.843     0.817     0.759        0.372     0.368     0.367     0.356
Table 3: mAP performance of the proposed method and SUBIC calculated on the top returned samples.

Table 2 and Table 3 compare the mAP performance of the proposed method and the baselines at 64, 128, 256 and 512 bits. For the proposed method, results both with and without category masks are reported, denoted 'Proposed' and 'Proposed w/o CM' respectively. On both datasets the proposed DVHCM dramatically outperforms the other baselines at every bit length.

Specifically, on UCF101 the proposed DVHCM achieves mAP of 0.901, 0.949, 0.959 and 0.953 at 64 to 512 bits respectively, yielding 10.5% to 16.1% retrieval improvement over the best baselines, DNNH and KSH with deep features. On HMDB51, DVHCM achieves around 8.5% to 19.2% improvement over the best baseline. In contrast to DVHCM, the proposed method without category masking shows inferior performance, but it still outperforms all the baselines on UCF101 with a maximum improvement of 12.7%, showing that the proposed supervised training with intra-pair loss and classification loss does have a positive effect on hash code learning. On HMDB51, the proposed method w/o CM exceeds the other baselines except for DNNH, which is trained with triplet ranking. We attribute this degradation to training with videos of very short duration: videos from HMDB51 are all shorter than 5 seconds, and some of them last even less than 2 seconds. With such short videos, the two frame sets in an intra-pair are usually very close to each other in low-level or even pixel-level features, so the intra-pair loss is very small and contributes little during training compared with the classification loss.

Fig. 5 and Fig. 6 show the retrieval performance of the proposed method and the baselines in terms of precision-recall curves and precision curves with respect to the top returned samples at 64 bits on UCF101 and HMDB51. The proposed DVHCM outperforms the baselines by large margins under both evaluation metrics.

(a) Precision-Recall curve
(b) Precision curve w.r.t. the top-N
Figure 5: Performance of DVHCM and baselines @64 bits on UCF101 dataset
(a) Precision-Recall curve
(b) Precision curve w.r.t. the top-N
Figure 6: Performance of DVHCM and baselines @64 bits on HMDB51 dataset

5 Conclusion

In this work we presented DVHCM, an end-to-end deep hashing approach with a category mask for fast video retrieval. We introduced the intra-pair and proposed to learn the hash model by optimizing the classification loss and the intra-pair loss. The distribution of binary bits with respect to categories was investigated, and a category masking scheme was proposed to improve retrieval accuracy. Experimental results show that the proposed method achieves superior performance under various evaluation metrics compared with both deep and non-deep state-of-the-art methods.

References

  • [Cao et al.2012] Liangliang Cao, Zhenguo Li, Yadong Mu, and Shih-Fu Chang. Submodular video hashing: a unified framework towards video pooling and indexing. In ACM MM, pages 299–308. ACM, 2012.
  • [Coskun et al.2006] Baris Coskun, Bulent Sankur, and Nasir Memon. Spatio–temporal transform based video hashing. IEEE Transactions on Multimedia, 8(6):1190–1208, 2006.
  • [Duan et al.2017] Yueqi Duan, Jiwen Lu, Ziwei Wang, Jianjiang Feng, and Jie Zhou. Learning deep binary descriptor with multi-quantization. In CVPR, pages 1183–1192, 2017.
  • [Gong et al.2013] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.
  • [Hao et al.2017a] Yanbin Hao, Tingting Mu, John Y Goulermas, Jianguo Jiang, Richang Hong, and Meng Wang. Unsupervised t-distributed video hashing and its deep hashing extension. IEEE Transactions on Image Processing, 26(11):5531–5544, 2017.
  • [Hao et al.2017b] Yanbin Hao, Tingting Mu, Richang Hong, Meng Wang, Ning An, and John Y Goulermas. Stochastic multiview hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia, 19(1):1–14, 2017.
  • [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [Jain et al.2017] Himalaya Jain, Joaquin Zepeda, Patrick Pérez, and Rémi Gribonval. Subic: A supervised, structured binary code for image search. ICCV, 2017.
  • [Jiang and Li2015] Qing-Yuan Jiang and Wu-Jun Li. Scalable graph hashing with feature transformation. In IJCAI, pages 2248–2254, 2015.
  • [Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
  • [Krizhevsky2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Report, 2009.
  • [Kuehne et al.2011] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In ICCV, pages 2556–2563. IEEE, 2011.
  • [Lai et al.2015] Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan. Simultaneous feature learning and hash coding with deep neural networks. In CVPR, pages 3270–3278, 2015.
  • [Li et al.2016] Wu-Jun Li, Sheng Wang, and Wang-Cheng Kang. Feature learning based deep supervised hashing with pairwise labels. IJCAI, 2016.
  • [Li et al.2017] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. ICLR, 2017.
  • [Lin et al.2015] Kevin Lin, Huei-Fang Yang, Jen-Hao Hsiao, and Chu-Song Chen. Deep learning of binary hash codes for fast image retrieval. In CVPR Workshops, pages 27–35, 2015.
  • [Lin et al.2016] Kevin Lin, Jiwen Lu, Chu-Song Chen, and Jie Zhou. Learning compact binary descriptors with unsupervised deep neural networks. In CVPR, pages 1183–1192, 2016.
  • [Liong et al.2015] Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, and Jie Zhou. Deep hashing for compact binary codes learning. In CVPR, pages 2475–2483, 2015.
  • [Liong et al.2017a] Venice Erin Liong, Jiwen Lu, Yap-Peng Tan, and Jie Zhou. Cross-modal deep variational hashing. In ICCV, pages 4077–4085, 2017.
  • [Liong et al.2017b] Venice Erin Liong, Jiwen Lu, Yap-Peng Tan, and Jie Zhou. Deep video hashing. IEEE Transactions on Multimedia, 19(6):1209–1219, 2017.
  • [Liu et al.2011] Wei Liu, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Hashing with graphs. In ICML, pages 1–8, 2011.
  • [Liu et al.2012] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. Supervised hashing with kernels. In CVPR, pages 2074–2081. IEEE, 2012.
  • [Liu et al.2016] Haomiao Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. Deep supervised hashing for fast image retrieval. In CVPR, pages 2064–2072, 2016.
  • [Molchanov et al.2017] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. ICLR, 2017.
  • [Simonyan and Zisserman2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [Song et al.2013] Jingkuan Song, Yi Yang, Zi Huang, Heng Tao Shen, and Jiebo Luo. Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia, 15(8):1997–2008, 2013.
  • [Sun et al.2016] Jiande Sun, Xiaocui Liu, Wenbo Wan, Jing Li, Dong Zhao, and Huaxiang Zhang. Video hashing based on appearance and attention features fusion via dbn. Neurocomputing, 213:84–94, 2016.
  • [Venkateswara et al.2017] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. CVPR, 2017.
  • [Weiss et al.2009] Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In NIPS, pages 1753–1760, 2009.
  • [Weng and Preneel2010] Li Weng and Bart Preneel. From image hashing to video hashing. In MMM, pages 662–668. Springer, 2010.
  • [Wu et al.2017] Gengshen Wu, Li Liu, Yuchen Guo, Guiguang Ding, Jungong Han, Jialie Shen, and Ling Shao. Unsupervised deep video hashing with balanced rotation. IJCAI, 2017.
  • [Xia et al.2014] Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, volume 1, pages 2156–2162, 2014.
  • [Ye et al.2013] Guangnan Ye, Dong Liu, Jun Wang, and Shih-Fu Chang. Large-scale video hashing via structure learning. In ICCV, pages 2272–2279, 2013.
  • [Zhang et al.2016] Hanwang Zhang, Meng Wang, Richang Hong, and Tat-Seng Chua. Play and rewind: Optimizing binary representations of videos by self-supervised temporal hashing. In ACM MM, pages 781–790. ACM, 2016.
  • [Zhu et al.2014] Xiaofeng Zhu, Lei Zhang, and Zi Huang. A sparse embedding and least variance encoding approach to hashing. IEEE transactions on image processing, 23(9):3737–3750, 2014.