I Introduction
Advances in transmission technologies, capture devices and display techniques have led to rapid growth in videos and video retrieval services. Take YouTube and Facebook as examples. According to YouTube Statistics 2017, 300 hours of video are uploaded to YouTube every minute, and the site attracts over 30 million visitors per day. Facebook likewise has 8 billion average daily video views from 500 million users. This explosive growth of online videos makes large-scale content-based video retrieval (CBVR) [1, 2, 3] an urgent need. However, unlike content-based image retrieval (CBIR), which has been extensively studied in the past decades [4, 5, 6] and has seen considerable progress, CBVR has not received sufficient attention in the multimedia community [7, 8, 9].
On the other hand, hashing methods have been widely acknowledged as a good solution for approximate nearest neighbor search: they transform high-dimensional features into short binary codes to support efficient large-scale retrieval and data storage. Content-based video hashing is therefore a promising direction for CBVR. However, a video is more than a set of frames, and video hashing is more challenging than image hashing. Essentially, the rich temporal information in videos is a barrier to directly applying image-based hashing methods. Most current works on video analytics nevertheless resort to pooling frame-level features into a single video-level feature, discarding the temporal order of the frame sequence [2, 3, 10]. Such bag-of-frames degeneration works well when high-dimensional frame-level features such as CNN responses [11] and motion trajectories [12] are used, as certain temporal information encoded in a high dimension can be preserved after pooling. However, for large-scale CBVR, where hashing (or indexing) of these high-dimensional features into short binary codes is necessary, the temporal information loss caused by frame pooling will inevitably result in suboptimal binary codes for videos. The loss usually takes place during hash function learning [13], which is a post-step after pooling. Compared to dominant video appearances (e.g., objects, scenes and short-term motions), nuanced video dynamics (e.g., long-term event evolution) are more likely to be discarded as noise in the drastic feature dimensionality reduction during hashing [14].
Recently, deep learning has dramatically improved the state of the art in speech recognition, visual object detection and image classification [15, 16, 17]. Inspired by this, researchers [18] started to extract features from various deep Convolutional Neural Networks (ConvNets), e.g., VGG [19] and ResNet [20], to support video hashing, but these approaches largely ignore video temporal information. To capture temporal information, Recurrent Neural Networks (RNNs) have been used to achieve state-of-the-art performance on sequential data streams [21, 22]. RNN-based hashing methods [23] use an RNN and video labels to learn discriminative hash codes for videos. However, human labeling is time-consuming and labor-intensive, especially for large-scale datasets. We argue that the key reason for the above defects is that neither the temporal pooling step nor the hash code learning step adequately addresses the temporal nature of videos. We also argue that hash codes that can merely reconstruct the video content are unable to achieve high accuracy for video retrieval.
Self-Supervised Temporal Hashing (SSTH) [1] aims to improve video hashing by taking temporal information into consideration. It proposes a stacking strategy that integrates Long Short-Term Memory networks (LSTMs) [24] with hashing, so that the hash codes simultaneously preserve the temporal and spatial information of a video. Despite the improved performance, a major disadvantage of the stacked LSTM is that it introduces a long path from the input to the output video vector representation, which may result in heavier computational cost.
To address the issues of leveraging temporal information and reducing computational cost mentioned above, we improve on SSTH [1] by proposing Self-Supervised Video Hashing (SSVH), which encodes video temporal and visual information simultaneously using an end-to-end hierarchical binary auto-encoder and a neighborhood structure. Our main contributions are as follows.

We develop a novel LSTM variant dubbed Binary LSTM (BLSTM), which serves as the building block of the temporal-aware hash function. We also develop an efficient backpropagation rule that directly tackles the challenging problem of binary optimization for BLSTM without any relaxation.

We design a hierarchical binary auto-encoder to model the temporal dependencies in videos at multiple granularities, and to embed the videos into binary codes with fewer computations. To achieve accurate video retrieval, we require the binary codes to simultaneously reconstruct the visual content and the neighborhood structure of the videos.

We show that our SSVH model achieves promising results on two large-scale video datasets, FCVID and YFCC. Experimental results on both datasets show superior performance compared to other unsupervised video hashing methods.
Compared with SSTH, we replace the stacked LSTMs with a hierarchical structure, which can efficiently learn long-range dependencies and reduce computational cost. In addition, we use a neighborhood structure to enhance the representation ability of the binary codes. Experimental results verify the effectiveness of our proposed method.
II Related Work
In this section, we describe related work. Our SSVH is closely related to hashing in terms of the algorithm, and video content analysis in terms of the application.
II-A Hashing
The rapid growth of massive databases in various applications has promoted research on hash-based indexing algorithms. Learning to hash [25] has been widely applied to approximate nearest neighbor search for large-scale multimedia data, owing to its computational efficiency and retrieval quality. Learning-based hashing learns a hash function, $y = h(x)$, mapping an input item $x$ to a compact code $y$. By mapping data into binary codes, efficient storage and search can be achieved thanks to fast bitwise XOR operations in Hamming space. Current hashing methods can be generally categorized into supervised and unsupervised ones.
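To make the XOR-based search concrete, the following minimal sketch packs binary codes into Python integers and ranks a small database by Hamming distance. It is illustrative only: the function names and the toy 8-bit codes are assumptions made for this example and are not part of any cited method.

```python
def hamming_distance(code_a, code_b):
    """Number of differing bits between two codes packed as integers."""
    # XOR leaves a 1 exactly where the two codes disagree; count those bits.
    return bin(code_a ^ code_b).count("1")


def hamming_rank(query, database):
    """Return database indices sorted by increasing Hamming distance to the query."""
    return sorted(range(len(database)),
                  key=lambda i: hamming_distance(query, database[i]))


# Toy 8-bit codes (hypothetical values chosen for illustration).
database = [0b10110010, 0b10110011, 0b01001101]
query = 0b10110110
ranking = hamming_rank(query, database)
```

Because each comparison is one XOR followed by a bit count, ranking a database this way avoids any floating-point distance computation, which is the efficiency argument made above.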
Supervised hashing methods [23, 2, 18, 26, 6] utilize available supervision information, such as class labels or pairwise labels of the training data, to improve the performance of hash codes. Ye et al. [2] proposed a supervised framework, Video Hashing with both Discriminative commonality and Temporal consistency (VHDT), which uses structure learning to design efficient linear hash functions for video indexing and formulates the objective as the minimization of a structure-regularized empirical loss function. However, it only generates frame-level codes. Cao et al. [27] proposed a deep learning architecture named HashNet that learns hash codes by a continuation method; it learns exactly binary codes from imbalanced similarity labels but ignores the temporal information of video frames. Deep Pairwise-Supervised Hashing (DPSH) [26] takes the pairwise relationship into account and performs feature extraction and hash code learning simultaneously. Supervised Recurrent Hashing (SRH) [23] deploys Long Short-Term Memory (LSTM) to model the structure of video samples and introduces a max-pooling mechanism to embed the frames into fixed-length representations, which are fed into a supervised hashing loss. In addition, Liong et al. [18] proposed Deep Video Hashing (DVH), which utilizes spatial-temporal information after the stacked convolutional pooling layers to extract representative video features and then obtains compact binary codes.
Unsupervised hashing methods [5, 4, 28, 1, 3, 29] often utilize data properties such as distribution or manifold structure to design effective indexing schemes. ITQ [5] rotates data points iteratively to minimize the quantization loss. Liong et al. [28] proposed to use a deep neural network to learn hash codes with three objectives: (1) minimizing the loss between the real-valued feature descriptor and the learned binary codes, (2) distributing binary codes evenly on each bit, and (3) making different bits as independent as possible. Most of these methods are devoted to image retrieval and cannot be directly applied to video hashing because of the inherent temporal information of videos. A more complete survey of hashing methods can be found in [30].
There is also some research focusing on video hashing. For example, the hash functions proposed by Song et al. [3] and Cao et al. [27] are unaware of the temporal order of video frames. Ye et al. [2] exploited the pairwise frame order, but their method requires video labels and only generates frame-level codes. Although Revaud et al. [31] exploited the short-term temporal order, their video quantization codes were not binary. In contrast to the above methods, our SSVH is an unsupervised binary code learning framework that explicitly exploits long-term video temporal information.
II-B Video Content Analysis Based on LSTM
A video is a sequence of frames, but it is more than a set of frames: the temporal information in a video is important for video content analysis. To capture the temporal order, Ng et al. [32] introduced Long Short-Term Memory (LSTM) into video analysis, inspired by the general sequence-to-sequence learning model proposed by Sutskever et al. [33]. Since then, LSTM has been widely used in different research fields such as image and video captioning [17, 21, 34, 35, 36] and visual question answering [37, 38]. The basic video content analysis model is an encoder-decoder framework composed of LSTMs, which usually uses a deep neural network as an encoder to extract frame-level features. Vinyals et al. [39] proposed to use an LSTM to decode the latent semantic information of images and generate words in order. The sequence-to-sequence model [21] used a two-layer stacked LSTM to encode and decode static and motion information to generate video captions. The Hierarchical Recurrent Neural Encoder [22] was able to exploit video temporal structure over a longer range by reducing the length of the input information flow. Its hierarchical LSTM layer uncovers the temporal transitions between frame chunks at different granularities and can model the temporal transitions between frames as well as the transitions between segments. Unlike the stacked LSTM, which simply aims to introduce more nonlinearity into the neural model, the hierarchical LSTM aims to abstract the visual information at different time scales and to learn visual features at multiple granularities. Song et al. [34] proposed a hierarchical LSTM with adaptive attention for captioning, which uses LSTMs to encode video content and semantic information and decodes the video information with adaptive attention to generate the caption. The Stacked Attention Network (SAN) was proposed with a CNN and an RNN as the encoder to explore visual structure. The LSTM-based encoder-decoder framework is popular and highly successful in sequential data understanding, so we also choose LSTM as the basic unit of our encoder-decoder model.
III Proposed Method
In this section, we formulate the proposed Self-Supervised Video Hashing framework (Fig. 1). We first introduce the notation and problem definition used in the rest of the paper. We then present the two major components of our framework, the Hierarchical Binary Auto-Encoder and the Neighborhood Structure. Finally, we introduce the optimization method used to train our model.
III-A Notations and Problem Definition
Each video is represented by a matrix of frame features whose size is the number of frames in the video by the feature dimensionality of each frame; we extract an equal number of frames from every video. The features of each frame are extracted in a preprocessing step. The goal of a video hashing method is to learn a binary code $b$ of a fixed code length for each video. $h$ is the hidden state of the BLSTM unit before the $\mathrm{sgn}$ function, and $b = \mathrm{sgn}(h)$.
III-B Hierarchical Binary Auto-Encoder
Long Short-Term Memory (LSTM) has become popular in video analysis in recent years because of its effectiveness in processing sequential information. However, the original LSTM can only generate continuous values instead of binary codes. A straightforward solution is to add a hashing layer, which consists of a fully connected layer that produces the hidden variable h and an activation layer that binarizes h into binary codes b. However, as pointed out in [1], this strategy is essentially based on frame pooling, where the pooling function is an RNN. Even though the pooling is temporal-aware, the hash codes do not directly capture the temporal nature of videos.
In order to design an architecture that can simultaneously capture the temporal information of videos and generate binary codes, we propose a novel hierarchical binary auto-encoder based on the hierarchical LSTM [22]. Specifically, our architecture consists of an encoder and a decoder, where the decoder is divided into a forward hierarchical binary decoder, a backward hierarchical binary decoder and a global hierarchical binary decoder.
III-B1 Hierarchical Binary Encoder
As can be seen in Fig. 2, the binary encoder is a two-layer hierarchical RNN structure consisting of a vanilla LSTM and a binary LSTM (BLSTM). The first layer is an original LSTM, which can be considered a higher-level feature extractor for the frames. Its hidden state at each time step is:
(1) 
where indicates the features of the th frame, is the hidden state at time step () and f() is the function of this LSTM layer. The hidden state is used as input to the second layer (BLSTM).
The second layer is a binary LSTM layer (Fig. 2), which embeds the higher-level real-valued features of a video into a binary code. A straightforward way to achieve this is to stack another layer, as introduced in [1], but this increases the number of computation operations. Inspired by [22], we propose a hierarchical binary LSTM. Unlike the stacked LSTM [22], in the hierarchical binary LSTM not all outputs of the first-layer LSTM are connected to the second-layer BLSTM. The motivation is that this hierarchical structure can efficiently exploit video temporal structure over a longer range by reducing the length of the input information flow and by compositing multiple consecutive frames at a higher level. The number of computation operations is also significantly reduced. The difference between a stacked LSTM and a hierarchical LSTM is illustrated in Fig. 3.
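As a rough illustration of the savings, the sketch below counts recurrent cell evaluations for a two-layer stacked LSTM and for a two-layer hierarchical LSTM in which the second layer only runs once per stride. The function names and the simple "one evaluation per active time step" cost model are assumptions made for illustration; the example values (24 frames, stride 2) are those reported in the implementation details.

```python
def stacked_steps(num_frames):
    """Cell evaluations for a two-layer stacked LSTM: both layers run at every frame."""
    return 2 * num_frames


def hierarchical_steps(num_frames, stride):
    """Cell evaluations when the second layer runs only once per `stride` frames."""
    return num_frames + num_frames // stride


def forwarded_steps(num_frames, stride):
    """1-based first-layer time steps whose outputs are passed to the second layer."""
    return list(range(stride, num_frames + 1, stride))


# Values taken from the implementation details: 24 frames per video, stride 2.
stacked = stacked_steps(24)
hierarchical = hierarchical_steps(24, 2)
second_layer_inputs = forwarded_steps(24, 2)
```

Under this cost model the hierarchical layout needs 36 cell evaluations instead of 48, and the second layer sees 12 inputs, which matches the 12 second-layer units mentioned in the implementation details.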
Given a stride for the hierarchical BLSTM, only the output at every stride-th time step of the first-layer LSTM is used as input to the corresponding time step of the second-layer BLSTM. Our BLSTM follows a data flow similar to that of the LSTM, and the detailed implementation of BLSTM is given as follows:
(2)  
(3)  
(4)  
(5)  
(6)  
(7)  
(8) 
The equations above use element-wise multiplication and batch normalization. The output of our encoder is therefore a binary code. The behaviors of “forget”, “input” and “output” in the BLSTM unit are respectively controlled by three gate variables: the forget gate, the input gate and the output gate. The shared weight matrices of the BLSTM are learned together with the bias terms. The input to the memory cell is gated by the input gate. The nonlinearities used are the element-wise logistic sigmoid function and the hyperbolic tangent function tanh.
III-B2 Forward Hierarchical Binary Decoder
The forward hierarchical binary decoder reconstructs the input frame features in a forward order using the binary codes . The decoder also has a hierarchical structure which consists of two layers of LSTM. Specifically, the hidden state at time step of the first layer LSTM is :
(9) 
where is the function for the forward LSTM, is the hidden state at time step and .
Similarly, the output in the first layer LSTM will not be connected to all the units in the second layer LSTM. Suppose we have the same stride of . Then, the output of the th time step in the first layer LSTM will be used as input to th time step in the second LSTM layer. The reconstructed is formulated as:
(10) 
where is the function for the second layer of the forward LSTM, is the hidden state at time step of the second layer LSTM (), if is a multiple of and otherwise.
Then the reconstructed features will be attained by linear reconstructions for the output of the decoder LSTMs:
(11) 
where is weight matrix and means the bias.
We can define the forward decoder loss of a video as the Euclidean distance of the original features and the reconstructed features as:
(12) 
III-B3 Backward Hierarchical Binary Decoder
The backward hierarchical binary decoder is similar to the forward hierarchical decoder. It reconstructs framelevel features in a reverse order, i.e., . Specifically, the hidden state at time step of the first layer LSTM is :
(13) 
where is the function for the backward LSTM, is the hidden state at time step and .
The reconstructed is formulated as:
(14) 
where is the function for the second layer of the backward LSTM, is the hidden state at time step of the second layer LSTM (), if is a multiple of and otherwise.
Then the reconstructed features will be attained by linear reconstructions for the output of the decoder LSTMs:
(15) 
where is weight matrix and means the bias.
The backward decoder loss is defined as:
(16) 
III-B4 Global Hierarchical Binary Decoder
Apart from the forward and backward hierarchical binary decoders, we also use a global hierarchical decoder to reconstruct the video-level features. Here, we use mean pooling of all the frame-level features as the video-level feature. Both layers of this decoder use basic LSTMs. The global reconstruction loss of a video is defined as:
(17) 
Here the reconstruction target is the mean of all frame-level features of a video.
III-B5 Hierarchical Binary Auto-Encoder Loss Function
The hierarchical binary auto-encoder consists of three components, so the reconstruction loss is composed of the forward loss, the backward loss and the global loss, and is defined as:
(18) 
III-C Neighborhood Structure
We argue that good video content reconstruction alone is not enough to equip the binary codes with the ability to retrieve videos accurately. With only the basic reconstruction loss, we learn a binary code that can only reconstruct the video. Many previous studies show that it is beneficial to exploit the data structure when learning a low-dimensional embedding for retrieval. A neighborhood structure enforces that similar videos have close binary codes and dissimilar videos have different binary codes. Inspired by this, we propose a novel method to exploit the neighborhood structure of videos, and we train our model to encourage the binary codes to preserve this structure.
III-C1 Neighborhood Structure Construction
We construct this neighborhood structure as a preprocessing step. (The neighborhood structure could instead be built directly from the features during the training of our neural network, without this preprocessing step. However, we found that constructing it is time-consuming and that updating the neighborhood structure as the video features change in each epoch does not significantly improve performance. We therefore fix the neighborhood structure computed from the pre-extracted features.)
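First, we use the VGG [19] network to extract frame-level features of the videos, and use mean pooling to obtain a video-level representation. We expect neighboring videos to have similar video-level representations. We use cosine similarity to measure the similarity between any two videos $i$ and $j$ with video-level representations $v_i$ and $v_j$:
$$\cos(v_i, v_j) = \frac{v_i^{\top} v_j}{\lVert v_i \rVert_2 \, \lVert v_j \rVert_2} \qquad (19)$$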
For each video, we find its K1 nearest neighbors (K1-NN) and store their indexes. If we set K1 to a small number, there is not enough neighboring information to preserve for our large dataset. On the other hand, if we simply increase K1, the accuracy of the retrieved neighbors drops. We therefore design a new strategy to obtain more neighborhood structure information. Specifically, we compute the intersection between a video's index list and the index lists of other videos, and keep the top K2 indexes ranked by the size of the intersection. Every video kept in this way shares at least one common neighbor with the given video. For example, after retrieving the K1-NN, we have the relevant video indexes of video $i$ and those of video $j$. If the size of their intersection is among the top K2 for video $i$, all videos in the intersection are regarded as neighbors of video $i$. We construct the similarity matrix by preprocessing the training data as described above; each of its entries records whether video $i$ and video $j$ are considered neighbors or not.
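The two-stage construction above can be sketched as follows. This is one possible reading rather than the authors' implementation: the function names, the 0/1 encoding of the similarity matrix, and the choice to mark the top-K2 videos ranked by shared-neighbor count as the neighbors of each video are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two video-level feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def knn_sets(features, k1):
    """Stage 1: for each video, the set of its k1 most cosine-similar other videos."""
    n = len(features)
    result = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: cosine(features[i], features[j]), reverse=True)
        result.append(set(others[:k1]))
    return result


def neighborhood_matrix(features, k1, k2):
    """Stage 2: rank candidates by the number of shared k1-NN and keep the top k2."""
    n = len(features)
    knn = knn_sets(features, k1)
    sim = [[0] * n for _ in range(n)]  # assumed encoding: 1 = neighbors, 0 = not
    for i in range(n):
        overlap = {j: len(knn[i] & knn[j]) for j in range(n) if j != i}
        # Only candidates sharing at least one neighbor qualify, largest overlap first.
        ranked = sorted((j for j in overlap if overlap[j] > 0),
                        key=lambda j: overlap[j], reverse=True)
        for j in ranked[:k2]:
            sim[i][j] = 1
    return sim


# Toy 2-d "video-level features" forming two clusters (hypothetical data).
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
toy_sim = neighborhood_matrix(toy_features, k1=2, k2=2)
```

On the toy data, each video ends up marked as a neighbor of the other two videos in its own cluster and of none in the other cluster. A brute-force pairwise scan like this is quadratic in the number of videos, which is consistent with the need to split a large training set into parts before building the matrix.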
Neighborhood Structure Loss: As described in Sec. III-B, we obtain the binary codes of all videos from the binary encoder. We then define a loss function based on the neighborhood structure as:
(20) 
Here the loss involves the hash codes of each pair of videos and the similarity between the two videos, and the hidden state of the encoder before binarization is as in Eq. 8. Instead of defining the loss function on the binary codes, we place the neighborhood-structure-preserving constraints on the hidden states before binarization. We then obtain the following regularized problem by replacing the equality constraint in Eq. 20 with a regularization term:
(21) 
where the regularization term is scaled by a weight. This term impels neighboring videos to have similar hash codes. The overview is illustrated in Fig. 4. The final loss function is composed of the reconstruction loss in Eq. 18 and the pairwise loss in Eq. 21:
(22) 
The final loss includes a hyperparameter that balances the reconstruction loss and the neighbor loss.
III-D Optimization Method
In this section, we formulate our loss function and describe a scheme to train our model. The hierarchical binary encoder has its own parameters, and the decoder parameters comprise the forward decoder parameters, the backward decoder parameters and the global decoder parameters. Binary codes are obtained from the encoder as:
(23) 
The binary code is then used to reconstruct the video frame features in forward, backward and global modes:
(24)  
(25)  
(26) 
As illustrated in Sec. III-B, we can rewrite the reconstruction loss as:
(27) 
We can update the encoder and decoder parameters by back-propagation to optimize our model.
However, training SSVH equipped with BLSTM is essentially NP-hard, as it involves binary optimization of the hash codes over a combinatorial search space. We follow SSTH [1] to deal with this binary optimization problem and use the approximated sgn function shown in Fig. 5:
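$$\mathrm{sgn}(h) \approx \begin{cases} -1, & \text{for } h < -1 \\ h, & \text{for } -1 \le h \le 1 \\ 1, & \text{for } h > 1 \end{cases} \qquad (28)$$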
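Then we can get the derivative of sgn(h) as:
$$\mathrm{sgn}'(h) = \begin{cases} 1, & \text{for } -1 \le h \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad (29)$$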
The derivative states a simple back-propagation rule for BLSTM: when the gradients back-propagate to the sgn function, we only allow gradients whose neural responses are between -1 and +1 to pass through. Note that other functions can also be used to approximate sgn(h). We evaluate the performance differences in the experiments.
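The rule above can be illustrated with a small scalar sketch: the forward pass binarizes with sgn, while the backward pass masks incoming gradients using the derivative of the clipped-linear surrogate. The function names are hypothetical, and mapping an input of exactly zero to +1 is an assumption made here, since the text does not specify that case.

```python
def sgn(h):
    """Forward pass: binarize to -1 / +1 (zero is mapped to +1 by assumption)."""
    return 1.0 if h >= 0 else -1.0


def approx_sgn(h):
    """Piecewise-linear surrogate of sgn: clip h to the interval [-1, 1]."""
    return max(-1.0, min(1.0, h))


def approx_sgn_grad(h):
    """Derivative of the surrogate: 1 inside [-1, 1], 0 outside."""
    return 1.0 if -1.0 <= h <= 1.0 else 0.0


def backprop_through_sgn(hidden, upstream_grads):
    """Pass each upstream gradient only where the pre-activation lies in [-1, 1]."""
    return [g * approx_sgn_grad(h) for h, g in zip(hidden, upstream_grads)]


hidden = [-2.0, -0.5, 0.3, 1.7]
codes = [sgn(h) for h in hidden]
grads = backprop_through_sgn(hidden, [0.1, 0.2, 0.3, 0.4])
```

In this example the first and last units are saturated, so their gradients are blocked, while the two middle units receive their upstream gradients unchanged.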
III-E Comparison with SSTH
We improve on SSTH [1] by defining a hierarchical recurrent structure rather than using a stacked structure. Despite the improved performance of SSTH, a major disadvantage of stacking is that it introduces a long path from the input to the output video vector representation, resulting in heavier computational cost. Compared with SSTH, the proposed SSVH dramatically shortens this path while retaining the capability of adding nonlinearity, providing a better trade-off between efficiency and effectiveness. In other words, SSVH extracts the video information at different time scales and learns the video hash codes at multiple granularities. Moreover, in [1], the encoder RNN with BLSTM runs through the sequence to generate a set of hash codes, and the decoder RNN then decodes them to reconstruct the frame-level feature sequence in both forward and reverse orders. In contrast, we reconstruct the video not only with forward and backward hierarchical decoders but also with a new global hierarchical binary decoder. Global reconstruction ensures the accuracy of the reconstructed appearance information. In addition, we propose a neighborhood structure to further improve performance: it makes similar videos have similar hash codes and different videos have different hash codes.
IV Experiments
All our experiments for SSVH are conducted with Theano [40] on an NVIDIA TITAN X GPU server. Our model can be trained at a speed of about 32 videos per second with a single TITAN X GPU.
IV-A Datasets and Setting
FCVID: The Fudan-Columbia Video Dataset (FCVID) is a large video dataset containing 91,223 web videos annotated manually according to 239 categories. The categories in FCVID cover a wide range of topics such as social events (e.g., “tailgate party”), procedural events (e.g., “making cake”), objects (e.g., “panda”), scenes (e.g., “beach”), etc. We downloaded 91,185 videos from this dataset, because some videos were damaged. The dataset is split into a training set containing 45,585 videos and a test set containing 45,600 videos. We use the training set for unsupervised learning and the test set as the retrieval database and queries.
YFCC: The Yahoo Flickr Creative Commons 100 Million Dataset is a huge collection of multimedia data. Officially, there are 0.8M videos in this dataset, but we collected only 700,882 videos because of invalid URLs and corrupted videos. We select 511,044 of these videos as our dataset. We use the 101,256 labeled videos, as in [1], for retrieval and the remaining 409,788 unlabeled videos as training data.
IV-B Implementation Details
In this section, we introduce some implementation details of our model. For each video, we sample 24 equally spaced frames, which we consider sufficient to represent a video. We use the VGG [19] network to extract the frame-level features in our experiments and obtain 4096-d features as the input of our model. The stride of the hierarchical auto-encoder is 2, so the second layer of the decoders has 12 units. Because of the huge scale of YFCC, we cannot construct a neighbor similarity matrix of size 400K × 400K. Owing to memory limitations, we split the training data into 9 parts, each with around 45K videos. For each part, we compute its neighbor structure matrix, and we then train our model on the parts in order. Some videos in YFCC (around 50K) do not belong to the 80 categories; following the instructions of the data provider, we regard these unlabeled videos as “others”.
During training, we use the Stochastic Gradient Descent (SGD) algorithm to update the parameters with a mini-batch size of 256. The regularization weight in Eq. 21 is set to 0.2, and the effect of the balance hyperparameter in Eq. 22 is studied in Sec. IV-E2. In the neighborhood structure, we choose K1 as 20 and K2 as 10. To compare with baseline methods, we use the publicly available code and run it on both FCVID and YFCC.
IV-C Evaluation Metrics
We adopt Average Precision at the top K retrieved videos (AP@K) for retrieval performance evaluation [43]. AP is the average of the precisions at each correctly retrieved data point. Let $R$ denote the total number of relevant videos, $R_j$ the number of relevant videos among the top $j$ results, and $I_j = 1$ if the $j$-th retrieved video is relevant and $I_j = 0$ otherwise. AP@K is defined as $\frac{1}{R}\sum_{j=1}^{K} \frac{R_j}{j} I_j$. We use the whole test video set as both the queries and the database. We then obtain mAP@K by taking the mean of AP@K over all queries. Hamming ranking is used as the search protocol.
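A minimal sketch of this metric is given below. It assumes that the precision sum is normalized by the total number of relevant videos, as the description above suggests; some implementations normalize differently. The function and argument names are hypothetical.

```python
def ap_at_k(relevance, total_relevant, k):
    """AP@K for one query.

    relevance: 0/1 flags for the ranked results, best match first.
    total_relevant: total number of relevant videos for this query.
    """
    if total_relevant == 0:
        return 0.0
    hits = 0
    score = 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            score += hits / rank
    return score / total_relevant


def map_at_k(all_relevance, all_totals, k):
    """Mean of AP@K over all queries."""
    aps = [ap_at_k(r, t, k) for r, t in zip(all_relevance, all_totals)]
    return sum(aps) / len(aps)


example_ap = ap_at_k([1, 0, 1, 0], total_relevant=2, k=4)
example_map = map_at_k([[1, 0, 1, 0], [0, 0, 0, 0]], [2, 1], k=4)
```

For the first example query, relevant videos appear at ranks 1 and 3, so the precisions 1/1 and 2/3 are averaged over the two relevant videos.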
IV-D Components and Baseline Methods
In this subsection, we investigate the effect of each component in our framework. We consider the following combinations of components:

FB (Forward and Backward Reconstruction): FB consists of the forward and backward reconstruction losses, as illustrated in Fig. 1. FB reconstructs the frame-level features and trains the hash functions simultaneously.

FB + GR (FB + Global Reconstruction): On top of the forward and backward reconstruction losses, we add the global reconstruction loss.

Neighborhood Structure: The neighborhood structure loss preserves the neighborhood structure of the original space, as described in Sec. III-C.

SSVH: SSVH combines the forward, backward and global reconstruction losses with the neighborhood structure loss, as illustrated in Fig. 1.

FB + GR + GTHNS (FB + GR + Ground-truth Neighborhood Structure): Here, we compute the neighbor similarity using the labels of the training data, for comparison with our SSVH.
We also compare our method with several state-of-the-art unsupervised hashing methods to validate its performance. These methods are:
ITQ. Iterative Quantization (ITQ) [5] is a representative unsupervised hashing method for image retrieval, which we extend to video retrieval. We obtain a video-level feature by applying mean pooling to the frame-level features.
Submod. Submodular Video Hashing (Submod) [10] is also a common video retrieval method. We first use mean pooling to extract the video representation and then hash it into a 1024-dimensional code using the traditional hashing method LSH. We then measure the informativeness of the training data to select the most informative hash functions.
MFH. Multiple Feature Hashing (MFH) [3] learns hash functions based on the similarity graph of the frames. It learns frame-level hash codes and uses average pooling to obtain a real-valued video-level representation, followed by binarization to obtain the hash codes.
DH. Deep Hashing (DH) [28] learns hash functions with a deep neural network by adding a binarization loss. We use the original encoder-decoder model and take the output of the encoder as the video representation.
SSTH. Self-Supervised Temporal Hashing (SSTH) [1] trains hash functions as an auto-encoder that reconstructs the frame features in forward and backward order.
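TABLE I: mAP@K of different combinations of components on the FCVID dataset.
| mAP@K | 20 | 40 | 60 | 80 | 100 |
|---|---|---|---|---|---|
| SSTH | 28.37% | 24.12% | 21.26% | 19.51% | 17.93% |
| FB (Forward + Backward) | 30.90% | 24.52% | 21.25% | 19.05% | 17.38% |
| FB + GR (Global Reconstruction) | 31.72% | 25.36% | 22.09% | 19.87% | 19.18% |
| Neighborhood Structure | 36.51% | 32.40% | 30.15% | 28.40% | 26.82% |
| FB + GR + Neighborhood Structure | 37.92% | 33.40% | 30.92% | 29.00% | 27.29% |
| FB + GR + Ground Truth | 42.41% | 39.27% | 37.67% | 36.38% | 35.17% |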
TABLE II: mAP results (top 5, 10, 20, 40, 60, 80 and 100 retrieval results) on the FCVID dataset with 256-bit codes, showing how performance varies with K1 or K2. Rows are labeled K1_K2.
K2 = 10, varying K1:
| K1_K2 | 5 | 10 | 20 | 40 | 60 | 80 | 100 |
|---|---|---|---|---|---|---|---|
| 5_10 | 52.76% | 42.46% | 35.21% | 29.66% | 26.76% | 24.70% | 23.05% |
| 10_10 | 54.75% | 45.53% | 39.25% | 34.39% | 31.71% | 29.66% | 27.91% |
| 20_10 | 53.10% | 44.09% | 38.10% | 33.63% | 31.19% | 29.29% | 27.58% |
| 30_10 | 52.08% | 43.06% | 37.06% | 32.56% | 30.12% | 28.21% | 26.57% |
| 40_10 | 51.31% | 42.19% | 36.13% | 31.72% | 29.40% | 27.58% | 25.99% |
| 50_10 | 50.19% | 40.37% | 35.78% | 31.03% | 28.63% | 26.83% | 24.76% |
K1 = 10, varying K2:
| K1_K2 | 5 | 10 | 20 | 40 | 60 | 80 | 100 |
|---|---|---|---|---|---|---|---|
| 10_5 | 53.03% | 42.82% | 35.67% | 30.09% | 27.16% | 25.06% | 23.39% |
| 10_10 | 54.75% | 45.53% | 39.25% | 34.39% | 31.71% | 29.66% | 27.91% |
| 10_20 | 53.18% | 44.19% | 38.25% | 33.74% | 31.26% | 29.32% | 27.60% |
| 10_30 | 52.13% | 43.00% | 36.99% | 32.56% | 30.15% | 28.29% | 26.69% |
| 10_40 | 51.47% | 42.26% | 36.25% | 31.85% | 29.52% | 27.73% | 26.16% |
| 10_50 | 51.18% | 41.18% | 35.72% | 31.22% | 28.85% | 27.07% | 25.57% |
IV-E Performance Analysis
IV-E1 Effect of Components of SSVH
We first test different combinations of the components of our SSVH on the FCVID dataset. As it is impossible to test all combinations within the space limit, we focus on the following questions: 1) Is the hierarchical structure better than the stacked LSTM? 2) What is the effect of each component? 3) How does our neighborhood structure perform compared with human labels?
To answer these questions, we report the mAP@K results of different combinations for K from 20 to 100 in Table I. We observe the following. 1) Compared to SSTH, a stacked RNN structure that also uses forward and backward reconstructions, our hierarchical structure (FB) obtains higher mAP at K = 20 and 40 and comparable mAP at larger K, with less computational cost. 2) Using FB only, our method achieves promising results, and the global loss further improves performance: adding GR to FB improves the mAP by about 1% at the different values of K. GR also makes training more stable, as shown in Fig. 6. The neighborhood structure model also performs well, which shows that exploiting the neighborhood structure of the training data is beneficial for video retrieval. SSVH achieves the best performance among the unsupervised combinations, which indicates that each component contributes to its performance. 3) We also construct the similarity matrix using the labels. Our model with human labels significantly outperforms SSVH with the unsupervised neighborhood structure, which suggests that improving the accuracy of the predicted neighbor similarity would further enhance performance.
IV-E2 Trade-off between Neighbor Loss and Reconstruction Loss
The hyperparameter in Eq. 22 is crucial in our method, as it balances the neighborhood similarity loss and the reconstruction loss of the training videos. We therefore tune this hyperparameter over its range and show the performance in Fig. 7. The curves in Fig. 7 show how the mAP at top-20, top-40, top-60, top-80 and top-100 varies with respect to it. At the two extreme settings, i.e., when only the neighborhood similarity loss or only the reconstruction loss is used, SSVH cannot achieve the best performance. Instead, the best performance is obtained at an intermediate setting, which indicates that it is necessary to use both sources of information, and that the neighborhood structure loss contributes more to the performance.
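The trade-off can be made concrete with a small sketch. It assumes, purely for illustration, that the objective is a convex combination of the two losses controlled by a single weight in [0, 1], so that the two end points each keep only one loss. The exact form of Eq. 22, the choice of which term receives the weight, and the names combined_loss and sweep_weight are assumptions and not the formulation used in this paper.

```python
from typing import Callable, Dict, Iterable, Tuple


def combined_loss(neighbor_loss: float, reconstruction_loss: float, weight: float) -> float:
    """Convex combination of the two losses (assumed form, not Eq. 22 itself).

    weight = 1 keeps only the neighborhood loss, weight = 0 only the reconstruction loss.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * neighbor_loss + (1.0 - weight) * reconstruction_loss


def sweep_weight(
    evaluate: Callable[[float], float], weights: Iterable[float]
) -> Tuple[float, Dict[float, float]]:
    """Evaluate a validation score (e.g., mAP) for each weight and return the best one.

    `evaluate` is a hypothetical callback that trains with the given weight
    and returns the resulting score; higher is better.
    """
    scores = {w: evaluate(w) for w in weights}
    best = max(scores, key=scores.get)
    return best, scores
```

A sweep of this kind is what a curve such as the one in Fig. 7 summarizes: each point corresponds to one training run with a fixed weight.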
IV-E3 Effect of K1 and K2 in Neighborhood Structure
In this subsection, we evaluate the mAP@K of different combinations of K1 and K2 on the FCVID dataset. We tune both K1 and K2 over a range of values, and the results are shown in Tab. II. When K1 or K2 is relatively high (e.g., 50), the worst performance is achieved. The best performance is achieved when K1 and K2 are set to the same value (see Tab. II). If K1 or K2 is set to a small number, there is not enough neighborhood information to preserve; if K1 or K2 is set to a large number, similar video categories will be treated as one category.
IV-E4 Comparison with Different Approximate Activation Functions
There are different ways to approximate the sgn function. One is the function illustrated in Sec. III-D, and we also consider a second approximation. In this subsection, we compare these two activation functions on the FCVID dataset, and the experimental results are shown in Tab. III. Note that the mAP is calculated using the top-20 retrieval results. From Tab. III, both activation functions achieve satisfactory performance, and the function in the second row performs slightly better than the one in the first row at every code length.
64bits  128bits  256bits
(h)  24.87%  31.58%  35.84%
26.12%  33.65%  39.25%
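As a generic illustration of what approximating sgn involves, the sketch below shows two commonly used surrogates: a clipped linear function and a scaled tanh. These are assumptions chosen for illustration and are not claimed to be the two functions compared in Tab. III; the names hard_clip, scaled_tanh, beta and quantization_gap are hypothetical.

```python
import math
from typing import Callable, Sequence


def sgn(x: float) -> int:
    """Exact sign function used for binarization; 0 is mapped to +1 by convention here."""
    return 1 if x >= 0 else -1


def hard_clip(x: float) -> float:
    """Piecewise-linear surrogate: identity on [-1, 1], saturated at -1 or +1 outside."""
    return max(-1.0, min(1.0, x))


def scaled_tanh(x: float, beta: float = 1.0) -> float:
    """Smooth surrogate; a larger beta pushes outputs closer to -1 or +1."""
    return math.tanh(beta * x)


def quantization_gap(values: Sequence[float], surrogate: Callable[[float], float]) -> float:
    """Mean absolute difference between the surrogate output and the exact sign."""
    return sum(abs(surrogate(v) - sgn(v)) for v in values) / len(values)
```

Both surrogates have usable gradients during training, while the exact sgn is only applied when the final binary codes are produced. A smaller quantization gap means the relaxed outputs are closer to the codes actually used for retrieval.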
IV-E5 Cross-dataset Evaluation Comparison
Tab. IV lists the cross-dataset performance of all the hashing methods, following [1]. We can observe that all the methods suffer a performance drop when trained on FCVID and tested on YFCC. This indicates that the performance is related to the scale of the training dataset: when the scale of the training dataset decreases, the mAP drops accordingly. Domain shift is another possible reason for the performance decrease. On the other hand, when we train on YFCC and test on FCVID, the mAP is significantly higher than when both training and testing on FCVID. This demonstrates that for unsupervised hashing models, more training data is beneficial for performance, even when the data come from a different domain.
mAP (256 bits)  Submod  MFH  ITQ  DH  SSTH  SSVH
33.8  24.7  4.76  2.04  11.6  19.51
20.3  2.38  8.26  3.93  7.58  8.45
IV-E6 Comparison with State-of-the-Arts on FCVID
Figure 8 shows the comparison of our SSVH with several state-of-the-art video hashing methods. 1) SSVH achieves the best performance on the FCVID dataset at longer code lengths. Specifically, it outperforms the state-of-the-art method SSTH by 9.6%, 9.3%, 9.6%, 9.4% and 9.4% for mAP@K (K = 20, 40, 60, 80, 100) when the code length is 256. 2) The advantage of our SSVH is obvious when the code length is relatively large, e.g., 32, 64, 128 and 256 bits. However, our method does not show superiority over SSTH on short codes such as 8 and 16 bits, even though it performs better than the other baselines. One possible reason is that short codes carry too little information for reconstructing the video content and preserving the neighborhood structure. 3) In general, the mAP increases with the code length. For example, the mAP@20 of our method increases from 16.2% with a code length of 8 to 37.9% with a code length of 256. This indicates that the code length plays an important role in video retrieval, and that our method is well suited to longer hash codes.
IV-E7 Comparison with State-of-the-Arts on YFCC
On the YFCC dataset, the proposed SSVH consistently outperforms the other methods, as shown in Fig. 8. 1) SSVH shows significantly better performance than the baselines (ITQ, MFH, Submod, DH and SSTH) for different code lengths. Specifically, compared to the best counterpart SSTH, the performance is improved by 13% on average in terms of mAP for 256 bits on YFCC. 2) An interesting phenomenon is that the performance improvement of SSVH is more significant for short codes such as 8 or 16 bits, which differs from the FCVID dataset. 3) Similar to the results on the FCVID dataset, the mAP increases with the code length. SSTH is a strong competitor when the code length is 64, 128 and 256: the performance gap between our method and SSTH becomes marginal in terms of mAP@60, mAP@80 and mAP@100. This is because the training strategy on YFCC is different from that on FCVID, and we ignore much of the similarity information of video pairs due to the limitation of the dataset scale.
IV-F Qualitative Results
The qualitative results are shown in Fig. 9. The results on the left are obtained from the FCVID dataset, while the results on the right are obtained from the YFCC dataset. In this experiment, both SSTH and SSVH generate 256-bit hash codes. Videos marked in green are correct results, while videos marked in red are wrong results. From Fig. 9, we can see that SSVH generally obtains better results. Given the two queries “Guitar Performance” and “Ski Slope”, both SSTH and SSVH retrieve correct top-5 videos. This indicates that modeling temporal information is beneficial for discriminating concepts that involve human actions. More interestingly, we can observe that SSVH consistently outperforms SSTH on both the FCVID and YFCC datasets. This indicates that SSVH captures better temporal information for video hashing than SSTH. For the other examples illustrated in Fig. 9 (i.e., “Tornado” in FCVID and “Patio” in YFCC), capturing visual appearance seems sufficient for retrieval. This indicates that both SSVH and SSTH are also powerful for video categories that are unlikely to be distinguished by temporal information. We also illustrate some failure cases for both methods. Two examples are the food-related events “Making Hotdog” in the FCVID dataset and “Pantry” in the YFCC dataset. Neither method can distinguish these actions from similar videos, e.g., “nail painting”.
V Conclusion
In this paper, we have extended the unsupervised deep hashing method SSTH into a new method, named self-supervised video hashing (SSVH). To the best of our knowledge, SSVH is the first method that learns video hash codes by simultaneously reconstructing the video content and the neighborhood structure. Experiments on real datasets show that SSVH significantly outperforms the other methods and achieves state-of-the-art performance for video retrieval. However, we also find some shortcomings of our model. We use a pretrained VGGNet to extract video frame features, which ignores the consecutiveness and temporal information of videos. In the future, we will consider adding motion features to our model and fusing multiple features to improve video retrieval performance.
VI Acknowledgments
This work is supported by the Fundamental Research Funds for the Central Universities (Grant No. ZYGX2016J085), the National Natural Science Foundation of China (Grant No. 61772116, No. 61502080, No. 61632007) and the 111 Project (Grant No. B17008).
References
 [1] H. Zhang, M. Wang, R. Hong, and T.S. Chua, “Play and rewind: Optimizing binary representations of videos by selfsupervised temporal hashing,” in Proceedings of the 2016 ACM on Multimedia Conference. ACM, 2016, pp. 781–790.

 [2] G. Ye, D. Liu, J. Wang, and S.F. Chang, “Largescale video hashing via structure learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2272–2279.
 [3] J. Song, Y. Yang, Z. Huang, H. T. Shen, and R. Hong, “Multiple feature hashing for realtime large scale nearduplicate video retrieval,” in Proceedings of the 19th ACM international conference on Multimedia. ACM, 2011, pp. 423–432.
 [4] J. Song, T. He, L. Gao, X. Xu, and H. T. Shen, “Deep region hashing for efficient largescale instance search from images,” arXiv preprint arXiv:1701.07901, 2017.
 [5] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for largescale image retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2916–2929, 2013. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2012.193
 [6] W. Liu, J. Wang, R. Ji, Y. Jiang, and S. Chang, “Supervised hashing with kernels,” in CVPR, 2012.
 [7] A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, “Contentbased image retrieval at the end of the early years,” IEEE Transactions on pattern analysis and machine intelligence, vol. 22, no. 12, pp. 1349–1380, 2000.
 [8] R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Image retrieval: Ideas, influences, and trends of the new age,” ACM Computing Surveys (Csur), vol. 40, no. 2, p. 5, 2008.
 [9] J. Wang, S. Kumar, and S.F. Chang, “Semisupervised hashing for largescale search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 12, pp. 2393–2406, 2012.
 [10] L. Cao, Z. Li, Y. Mu, and S.F. Chang, “Submodular video hashing: a unified framework towards video pooling and indexing,” in Proceedings of the 20th ACM international conference on Multimedia. ACM, 2012, pp. 299–308.

 [11] Z. Wu, Y. Fu, Y.G. Jiang, and L. Sigal, “Harnessing object and scene semantics for largescale video understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3112–3121.
 [12] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 3551–3558.
 [13] J. Wang, W. Liu, S. Kumar, and S.F. Chang, “Learning to hash for indexing big data—a survey,” Proceedings of the IEEE, vol. 104, no. 1, pp. 34–57, 2016.
 [14] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Longterm recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2625–2634.

 [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [16] R. Girshick, “Fast rcnn,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
 [17] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4651–4659.
 [18] V. E. Liong, J. Lu, Y.P. Tan, and J. Zhou, “Deep video hashing,” IEEE Transactions on Multimedia, 2016.
 [19] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [21] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequencevideo to text,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4534–4542.
 [22] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, “Hierarchical recurrent neural encoder for video representation with application to captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1029–1038.
 [23] Y. Gu, C. Ma, and J. Yang, “Supervised recurrent hashing for large scale video retrieval,” in Proceedings of the 2016 ACM on Multimedia Conference. ACM, 2016, pp. 272–276.
 [24] S. Hochreiter and J. Schmidhuber, “Long shortterm memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [25] J. Wang, T. Zhang, N. Sebe, H. T. Shen et al., “A survey on learning to hash,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
 [26] W.J. Li, S. Wang, and W.C. Kang, “Feature learning based deep supervised hashing with pairwise labels,” arXiv preprint arXiv:1511.03855, 2015.
 [27] Z. Cao, M. Long, J. Wang, and P. S. Yu, “Hashnet: Deep learning to hash by continuation,” arXiv preprint arXiv:1702.00758, 2017.
 [28] V. Erin Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou, “Deep hashing for compact binary codes learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2475–2483.
 [29] J. Song, L. Gao, L. Liu, X. Zhu, and N. Sebe, “Quantizationbased hashing: a general framework for scalable image and video retrieval,” Pattern Recognition, vol. 75, pp. 175–187, 2018.
 [30] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, “A survey on learning to hash,” arXiv preprint arXiv:1606.00185, 2016.
 [31] J. Revaud, M. Douze, C. Schmid, and H. Jégou, “Event retrieval in large video collections with circulant temporal encoding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2459–2466.
 [32] J. YueHei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4694–4702.
 [33] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
 [34] J. Song, Z. Guo, L. Gao, W. Liu, D. Zhang, and H. T. Shen, “Hierarchical lstm with adjusted temporal attention for video captioning,” arXiv preprint arXiv:1706.01231, 2017.
 [35] J. Li, W. Monroe, T. Shi, A. Ritter, and D. Jurafsky, “Adversarial learning for neural dialogue generation,” arXiv preprint arXiv:1701.06547, 2017.
 [36] D. Britz, A. Goldie, T. Luong, and Q. Le, “Massive exploration of neural machine translation architectures,” arXiv preprint arXiv:1703.03906, 2017.
 [37] K. Kafle and C. Kanan, “Answertype prediction for visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4976–4984.
 [38] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 21–29.
 [39] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 3156–3164.
 [40] B. James, B. Olivier, B. Frédéric, L. Pascal, and P. Razvan, “Theano: a cpu and gpu math expression compiler,” in Proceedings of the Python for Scientific Computing Conference (SciPy).
 [41] Y.G. Jiang, Z. Wu, J. Wang, X. Xue, and S.F. Chang, “Exploiting feature and class relationships in video categorization with regularized deep neural networks,” arXiv preprint arXiv:1502.07209, 2015.
 [42] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.J. Li, “The new data and new challenges in multimedia research,” arXiv preprint arXiv:1503.01817, vol. 1, no. 8, 2015.
 [43] P. Over, J. Fiscus, G. Sanders, D. Joy, M. Michel, G. Awad, A. Smeaton, W. Kraaij, and G. Quénot, “Trecvid 2014–an overview of the goals, tasks, data, evaluation mechanisms and metrics,” in Proceedings of TRECVID, 2014, p. 52.