Temporal Context Aggregation for Video Retrieval with Contrastive Learning

08/04/2020 ∙ Jie Shao, et al. ∙ ByteDance Inc., Fudan University

The current research focus on Content-Based Video Retrieval requires higher-level video representations that describe the long-range semantic dependencies of relevant incidents, events, etc. However, existing methods commonly process the frames of a video as individual images or short clips, making the modeling of long-range semantic dependencies difficult. In this paper, we propose TCA (Temporal Context Aggregation for Video Retrieval), a video representation learning framework that incorporates long-range temporal information between frame-level features using the self-attention mechanism. To train it on video retrieval datasets, we propose a supervised contrastive learning method that performs automatic hard negative mining and utilizes the memory bank mechanism to increase the capacity of negative samples. Extensive experiments are conducted on multiple video retrieval tasks, such as CC_WEB_VIDEO, FIVR-200K, and EVVE. The proposed method shows a significant performance advantage (around 17% in mAP) over state-of-the-art methods with video-level features, and delivers competitive results with 22x faster inference time compared with frame-level features.


1. Introduction

Content-based video retrieval is critical for applications like video recommendation, copyright protection, etc. The content on the Internet has evolved from plain text to various forms of multimedia, such as images, audio, and video. In particular, the rapid growth of diverse video content (long videos, short videos, live broadcasts, etc.) has brought huge challenges to video retrieval methods.

Video retrieval approaches mainly follow the scheme of calculating the similarity between videos based on video-level or frame-level representations. For those based on video-level representations, code books (Cai et al., 2011; Kordopatis-Zilos et al., 2017a; Liao et al., 2018) or hashing functions (Song et al., 2011, 2013) were employed in the early studies, and later Deep Metric Learning (DML) was used to train a network with a triplet loss to learn better video-level representations (Kordopatis-Zilos et al., 2017b). Other approaches typically extract frame-level representations, apply frame-to-frame similarity measurement, and then aggregate the results into video-level similarities (Chou et al., 2015; Liu et al., 2017; Kordopatis-Zilos et al., 2019b; Tan et al., 2009). With more elaborate similarity measures, they typically outperform those based on video-level representations. Recently, ViSiL (Kordopatis-Zilos et al., 2019b) trained a subnet to refine the frame-to-frame similarity matrix for similarity measurement and reached state-of-the-art performance in several video retrieval tasks; however, its computational cost is heavy.

(a) DML  (Kordopatis-Zilos et al., 2017b)
(b) Ours
Figure 1. Visualization of video-level features on a subset of FIVR-5K (Kordopatis-Zilos et al., 2019b) with t-SNE (Maaten and Hinton, 2008). Each color represents samples corresponding to a single query, and distractors are colored in faded gray. Both our method and DML are trained on the VCDB (Jiang et al., 2014) dataset. (Best viewed in color)

With video-level or frame-level representations, the above approaches lead to two research focuses: learning a better representation or a better similarity measure. Although the latter reaches better performance, we argue that a more versatile and efficient way is to optimize the video representation rather than the similarity measure. Another issue is that, for both approaches, the initial frame-level representations are extracted independently as image representations. However, in contrast to images, frames extracted from videos often suffer from motion blur, occlusion, and defocus, and such inferior frames usually convey less information. A natural idea is to exploit the context information along the temporal axis. Considering spatio-temporal video representation and matching, methods based on Recurrent Neural Networks (RNN) (Feng et al., 2018; Hu and Lu, 2018) or operating in the Fourier domain (Baraldi et al., 2018; Revaud et al., 2013; Poullot et al., 2015) achieve high performance in video alignment or copy detection. Although they show poor performance in more general retrieval tasks (Kordopatis-Zilos et al., 2019b), we argue that the idea of exploiting temporal context information is promising for optimizing the video representation.

In this paper, we address the problem of context-aware video representation learning, denoted as context encoding for video retrieval. The major contributions of this work are: (1) We propose CECL (Context Encoding for video retrieval with Contrastive Learning), a video representation learning network that aggregates the context information of frame-level descriptors. As illustrated in Figure 2, we first extract frame-level features independently following the tradition, then a feature aggregation model is applied to aggregate the contextual information of all frames in a video, resulting in a compact video-level descriptor or a sequence of frame-level descriptors (depending on the aggregation model). (2) We propose a supervised contrastive learning method to train the feature aggregation model with pair-wise labels. As shown in Figure 3, the model is trained to distinguish the positive sample with respect to the anchor sample from distractors contained in a shared memory bank with a contrastive loss. (3) By conducting gradient analysis, the property of automatic hard negative mining is also discovered in the proposed method.

To the best of our knowledge, we are the first to train feature aggregation models for context encoding in video retrieval with contrastive learning. As the aggregation is conducted in the feature space, the whole video can easily be fed to the GPU, thus capturing contextual information over a long range. Compared with the commonly used triplet loss, our method utilizes the large number of distractors in video retrieval datasets more effectively and obtains substantially larger gains. Extensive experiments are conducted on multiple video retrieval tasks, and the proposed method shows a clear performance advantage over state-of-the-art methods with video-level features, and delivers competitive results with a much lower computational cost for similarity measurement when compared with frame-level features.

2. Related Work

2.1. Frame Feature Representation

A common strategy is to extract frame-level representations independently as image representations. Early approaches employed handcrafted features including the Scale-Invariant Feature Transform (SIFT) features  (Jiang et al., 2007; Lowe, 2004; Wu et al., 2007), the Speeded-Up Robust Features (SURF)  (Bay et al., 2006; Chou et al., 2015), Colour Histograms in HSV space  (Hao et al., 2016; Jing et al., 2019; Song et al., 2013), and Local Binary Patterns (LBP)  (Zhao and Pietikainen, 2007; Shang et al., 2010; Wu and Aizawa, 2014), etc.

Deep Convolutional Neural Networks (CNNs) have proved to be versatile representation tools in recent approaches. The application of Maximum Activation of Convolutions (MAC) and its variants (Razavian et al., 2016; Zheng et al., 2016; Radenović et al., 2016; Tolias et al., 2015; Zheng et al., 2017; Seddati et al., 2017; Gordo et al., 2017), which extract frame descriptors from the activations of a pre-trained CNN model, has achieved great success in both fine-grained image retrieval and video retrieval tasks (Gordo et al., 2017; Kordopatis-Zilos et al., 2017a; Li et al., 2017; Kordopatis-Zilos et al., 2017b, 2019b). Intermediate Maximum Activation of Convolutions (iMAC) (Gordo et al., 2017) applies MAC to different intermediate layers of a CNN and then concatenates the results. Regional Maximum Activation of Convolutions (R-MAC) (Tolias et al., 2015) builds feature vectors that encode several image regions rather than the whole image, and $L^3$-iMAC (Kordopatis-Zilos et al., 2019b) applies R-MAC to the activations of the intermediate convolutional layers, but the regional feature maps are stacked rather than summed. Besides variants of MAC, Sum-Pooled Convolutional features (SPoC) (Babenko and Lempitsky, 2015) and Generalized Mean (GeM) (Hao et al., 2017) pooling are also considerable counterparts.

2.2. Feature Aggregation

Typically, the video feature aggregation paradigm can be divided into two categories: (1) local feature aggregation models (Csurka et al., 2004; Sivic and Zisserman, 2003; Perronnin and Dance, 2007; Jégou et al., 2010) which are derived from traditional local image feature aggregation models, and (2) sequence models  (Hochreiter and Schmidhuber, 1997; Cho et al., 2014; Donahue et al., 2015; Feng et al., 2018; Vaswani et al., 2017; Xia et al., 2019) that model the temporal order of the video representation.

The commonly used local feature aggregation models include Bag-of-Words (Csurka et al., 2004; Sivic and Zisserman, 2003), Fisher Vector (Perronnin and Dance, 2007), and Vector of Locally Aggregated Descriptors (VLAD) (Jégou et al., 2010), all of which require the unsupervised learning of a visual code book. NetVLAD (Arandjelovic et al., 2016) turns VLAD into a differentiable version in which the clusters are tuned via back-propagation instead of k-means clustering. NeXtVLAD (Lin et al., 2018) further decomposes the high-dimensional feature into a group of relatively low-dimensional vectors with attention before applying NetVLAD aggregation over time, which is both effective and parameter-efficient. In terms of sequence models, the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Unit (GRU) (Cho et al., 2014) are commonly used to model long-range contextual information for video re-localization and copy detection (Feng et al., 2018; Hu and Lu, 2018). Besides, the effectiveness of self-attention in capturing short- and long-range dependencies has been proved by the success of the Transformer (Vaswani et al., 2017). Self-attention has also shown success in video classification (Wang et al., 2018) and object detection (Hu et al., 2018), opening new possibilities for feature aggregation for video retrieval.

2.3. Metric Learning

Metric learning aims to learn an embedding that minimizes the distance between related samples and maximizes it between irrelevant ones. It has been commonly used in face recognition (Schroff et al., 2015; Cao et al., 2013; Wen et al., 2016), image retrieval (Oh Song et al., 2016; Sohn, 2016; Wang et al., 2019; Wu et al., 2017), and video retrieval (Kordopatis-Zilos et al., 2017b, 2019b). With only pair-wise labels available, the triplet loss (Weinberger and Saul, 2009) is commonly used in video retrieval tasks (Kordopatis-Zilos et al., 2017b, 2019b). The classic approach in (Kordopatis-Zilos et al., 2017b) performs hard negative mining to generate hard triplets, but both the off-line triplet generation stage and the training stage are time-consuming, and the information that triplets can convey is limited (Sohn, 2016). Although (Hermans et al., 2017) showed that the triplet loss can perform competitively against other popular metric learning approaches with a proper hard negative sampling strategy, their PK sampling strategy is only compatible with datasets that have class-level labels.

Contrastive learning has become the common training paradigm of recent self-supervised learning works (Oord et al., 2018; Hjelm et al., 2018; Tian et al., 2019; He et al., 2019; Chen et al., 2020), in which positive and negative sample pairs are constructed with a pretext task in advance, and the model tries to distinguish the positive sample from massive randomly sampled negative samples in a classification manner. The contrastive loss typically performs better than the triplet loss on representation tasks (Chen et al., 2020), as the triplet loss can only handle one positive and one negative at a time. The core of the effectiveness of contrastive learning is the use of rich negative samples (Tian et al., 2019): one approach is to sample them from a shared memory bank (Wu et al., 2018), and (He et al., 2019) replaced the bank with a queue and used a moving-averaged encoder to build a larger and more consistent dictionary on-the-fly. Apart from self-supervised learning, supervised contrastive learning for classification tasks is discussed in (Khosla et al., 2020), where a modified batch contrastive loss that supports an arbitrary number of positives is proposed to leverage label information effectively. As we only have pair-wise labels, our supervised contrastive learning approach is more similar to the self-supervised approach, where each anchor is coupled with only one positive.

3. Method

In this section, we first formally define the video representation learning problem (Section 3.1) and describe the frame-level feature extraction step (Section 3.2). Then, we present the feature aggregation approach (Section 3.3) and the contrastive learning method based on pair-wise video labels (Section 3.4), and conduct further analysis on the gradients of the loss function (Section 3.5). Finally, we discuss the similarity measures for the aggregated video-level and frame-level video descriptors (Section 3.6).

3.1. Problem Setting

Video representation learning is the task of learning an embedding function $f(\cdot)$ that transforms the original video descriptor into another representation from which useful information for downstream tasks is easier to extract. As we only consider the RGB data of a video, each video representation can be the raw pixels ($V \in \mathbb{R}^{N \times H \times W \times 3}$, where each video contains $N$ frames and each frame is $H \times W \times 3$ dimensional), a sequence of frame-level descriptors ($X \in \mathbb{R}^{N \times D}$, where $D$ is the dimensionality of the frame-level feature) in which each frame is encoded separately and the results are stacked together, or a compact video-level descriptor ($v \in \mathbb{R}^{D'}$, where $D'$ is the dimensionality of the video-level feature).

We address the problem of video representation learning for the Near-Duplicate Video Retrieval (NDVR), Fine-grained Incident Video Retrieval (FIVR), and Event Video Retrieval (EVR) tasks. For all three tasks, there are no explicit classes, as the content of a single video can be complicated, making it hard to apply popular classification-based video representation learning models. What we have are pair-wise labels describing whether two videos are similar (near-duplicate, complementary scene, same event, etc.) or not (distractors). Given such pair-wise labels, metric learning is a natural way to tackle the problem.

We view metric learning from a similarity optimization perspective. Denote the similarity function as $s(\cdot, \cdot)$; the similarity of two video descriptors $v_i$ and $v_j$ can then be written as $s(f(v_i), f(v_j))$. Given these, our task is to optimize the embedding function $f(\cdot)$ such that $s(f(v_i), f(v_j))$ is maximized if $v_i$ and $v_j$ are similar videos, and minimized otherwise. The similarity function is typically the Euclidean or cosine similarity, but can be any other function with range $[0, 1]$ or $[-1, 1]$. The embedding function typically takes a video-level descriptor and returns an embedding of dimension $D' < D$. In our work, however, $f(\cdot)$ is a feature aggregation model: it takes the frame-level descriptors $X \in \mathbb{R}^{N \times D}$ as input, and the output can be either an aggregated video-level descriptor ($v \in \mathbb{R}^{D'}$) or refined frame-level descriptors ($X' \in \mathbb{R}^{N \times D'}$).
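To make the notation concrete, here is a minimal sketch (ours, not the authors' code; the mean-pooling placeholder and all shapes are assumptions consistent with the notation above) of the embedding-function interface and the similarity it is trained to optimize:

```python
import torch
import torch.nn.functional as F

# Assumed shapes: a video is a sequence of N frame-level descriptors of dimension D.
N, D = 120, 1024

def embed(frame_descriptors: torch.Tensor) -> torch.Tensor:
    """Placeholder embedding function f: R^{N x D} -> R^{D'}.
    Here it is just mean pooling + L2 normalization; in the paper, f is a
    learned feature aggregation model (e.g. a Transformer encoder)."""
    v = frame_descriptors.mean(dim=0)
    return F.normalize(v, dim=0)

# Two videos represented by (random) frame-level descriptors.
x_a = torch.randn(N, D)
x_b = torch.randn(N, D)

# Cosine similarity of the embedded video-level descriptors: the quantity that
# metric learning maximizes for similar videos and minimizes for dissimilar ones.
similarity = embed(x_a) @ embed(x_b)
print(float(similarity))
```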

3.2. Feature Extraction

Here we consider the frame-level feature extraction process. According to the results reported in (Kordopatis-Zilos et al., 2019b) (Table 2), we select iMAC (Gordo et al., 2017) and $L^3$-iMAC (Kordopatis-Zilos et al., 2019b) as our benchmark frame-level feature extraction methods. Given a pre-trained CNN with $L$ convolutional layers, a forward pass of a frame generates $L$ feature maps $\mathcal{M}^{(l)}$ ($l = 1, \dots, L$), where each $\mathcal{M}^{(l)}$ has its own spatial dimension and $C_l$ is its total number of channels.

For the iMAC feature, the maximum value of every channel of each layer is extracted to generate $L$ layer vectors, as formulated in Eq. 1:

$v^{(l)} = \left[ \max \mathcal{M}^{(l)}_1,\ \max \mathcal{M}^{(l)}_2,\ \dots,\ \max \mathcal{M}^{(l)}_{C_l} \right], \quad (1)$

where the layer vector $v^{(l)}$ is a $C_l$-dimensional vector derived from max pooling over every channel $\mathcal{M}^{(l)}_c$ of feature map $\mathcal{M}^{(l)}$.
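The following is a minimal sketch of this kind of iMAC-style extraction, assuming a torchvision ResNet-50 backbone with forward hooks on its four residual blocks (the hook mechanism, the layer choice, and the tensor names are our illustrative assumptions, not the released implementation):

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50().eval()  # in practice, ImageNet-pretrained weights are loaded

# Collect intermediate activations of the four residual blocks via forward hooks.
activations = {}
for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(
        lambda module, inp, out, name=name: activations.__setitem__(name, out)
    )

frame = torch.randn(1, 3, 224, 224)  # one RGB frame
with torch.no_grad():
    model(frame)

# iMAC: global max pooling over the spatial dimensions of every channel (Eq. 1),
# one C_l-dimensional vector per layer, concatenated into a single descriptor.
layer_vectors = [act.amax(dim=(2, 3)) for act in activations.values()]
imac = torch.cat(layer_vectors, dim=1)  # (1, 256 + 512 + 1024 + 2048) = (1, 3840)
imac = F.normalize(imac, dim=1)         # per-frame L2 normalization
print(imac.shape)
```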

For the $L^3$-iMAC feature, max pooling with different kernel sizes and strides is applied to every channel of the different layers to generate regional feature vectors at three granularity levels. Unlike the original $L^3$-iMAC setting, we then follow the tradition of R-MAC and sum the regional vectors of each layer together, then apply $\ell_2$-normalization on each channel to form a single layer vector. This keeps the dimensionality as low as that of the iMAC feature; we denote this approach as $L^3$-iRMAC. It presents a trade-off between the preservation of fine-grained spatial information and a low feature dimensionality.

For both iMAC and $L^3$-iRMAC, all layer vectors are concatenated into a single descriptor after extraction, then PCA is applied to perform whitening and dimensionality reduction following common practice (Jégou and Chum, 2012; Kordopatis-Zilos et al., 2019b), and finally $\ell_2$-normalization is applied on each channel, resulting in a compact frame-level descriptor. By applying this process to every extracted frame of a video, we obtain the frame-level video descriptor $X$.
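A minimal sketch of this post-processing step, assuming the concatenated per-frame descriptors are already available and using scikit-learn's whitening PCA as a stand-in (the sample counts and dimensions are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed: a sample of frame-level descriptors used to fit the whitening PCA,
# and the descriptors of one video to be post-processed (numbers are illustrative).
train_descriptors = np.random.randn(4096, 3840).astype(np.float32)
video_descriptors = np.random.randn(120, 3840).astype(np.float32)

# Whitening PCA reduces the concatenated descriptor (e.g. 3840-d) to 1024-d.
pca = PCA(n_components=1024, whiten=True)
pca.fit(train_descriptors)

reduced = pca.transform(video_descriptors)
# Final L2 normalization per frame, giving the frame-level video descriptor X.
reduced /= np.linalg.norm(reduced, axis=1, keepdims=True)
print(reduced.shape)  # (num_frames, 1024)
```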

Figure 2. Feature extraction and aggregation pipeline. Raw frames are fed to the feature extractor to extract the frame-level video descriptor $X$. Then a compact video-level descriptor $v$ or refined frame-level descriptors $X'$ are generated by the feature aggregation model $f(\cdot)$. The refined frame-level descriptors can also be compressed into a video-level descriptor by applying average pooling and $\ell_2$-normalization.


3.3. Feature Aggregation

In this section, we discuss the details of the feature aggregation model $f(\cdot)$. After feature extraction, a sequence of frame-level descriptors $X$ of a video is obtained. This is then passed to the feature aggregation model to generate a video-level descriptor ($v$) or refined frame-level descriptors ($X'$), as illustrated in Figure 2.

Local feature aggregation models

For the NetVLAD (Arandjelovic et al., 2016) and NeXtVLAD (Lin et al., 2018) models, when applied to the aggregation of frame-level video descriptors, each descriptor is treated as a local image descriptor, as in (Miech et al., 2017). Following the setting of (Miech et al., 2017; Lin et al., 2018), the Context Gating module is used in both models. As the local feature aggregation models do not model the temporal order, we only use them to aggregate compact video-level descriptors ($v$).

Sequence aggregation models

Typically, a sequence model consumes the input sequence one element at a time, generating a sequence of hidden states as a function of the previous hidden state and the current input $x_t$ at position $t$. Denoting the hidden state at time step $t$ as $h_t$, the encoding process of the LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014) can be written as:

$h_t = \mathrm{LSTM}(x_t, h_{t-1}), \qquad h_t = \mathrm{GRU}(x_t, h_{t-1}), \quad (2)$

respectively. Due to the natural characteristics of recurrent models (LSTM and GRU), the hidden states encode and aggregate the previous contextual information. By concatenating all the yielded hidden states in time order, we get the aggregated video representation:

$X' = [h_1, h_2, \dots, h_N]. \quad (3)$
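A minimal sketch of this kind of sequence aggregation with a GRU, corresponding to Eq. 2-3 (the batch-first layout and the hyper-parameter values are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, D = 2, 120, 1024  # batch of videos, frames per video, feature dimension

gru = nn.GRU(input_size=D, hidden_size=D, num_layers=2, batch_first=True)
frames = torch.randn(B, N, D)      # frame-level video descriptors X

# Eq. 2-3: each hidden state aggregates the previous context; stacking all
# hidden states gives the aggregated frame-level representation X'.
hidden_states, _ = gru(frames)     # (B, N, D)

# Averaging over time and L2-normalizing yields a compact video-level descriptor.
video_descriptor = F.normalize(hidden_states.mean(dim=1), dim=1)  # (B, D)
print(hidden_states.shape, video_descriptor.shape)
```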

For the Transformer (Vaswani et al., 2017) model, following the setting of (Feng et al., 2018; Xia et al., 2019), only the encoder structure is used. With the parameter matrices written as $W_Q$, $W_K$, and $W_V$, the entire video descriptor $X$ is first encoded into the Query $Q$, Key $K$, and Value $V$ by three different linear transformations: $Q = X W_Q$, $K = X W_K$, and $V = X W_V$. These are further processed by the self-attention layer as:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V. \quad (4)$

The result is then passed to the LayerNorm layer (Ba et al., 2016) and the feed-forward layer (Vaswani et al., 2017) to obtain the output of the Transformer encoder, i.e. the refined frame-level descriptors $X'$. The multi-head attention mechanism is also used in our implementation. Although the encoded feature keeps the same shape as the input, contextual information within a longer range is incorporated into each frame-level descriptor. Besides, for both the recurrent models (LSTM and GRU) and the Transformer model, by simply averaging the encoded frame-level video descriptors along the time axis, we can also obtain the compact video-level representation $v$.
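A minimal sketch of Transformer-encoder aggregation using PyTorch's built-in encoder, with the hyper-parameters matching those reported in Section 4.1 (this is an illustration, not the released implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, D = 2, 120, 1024

encoder_layer = nn.TransformerEncoderLayer(
    d_model=D, nhead=8, dim_feedforward=2048, dropout=0.5, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)

frames = torch.randn(B, N, D)   # frame-level video descriptors X
refined = encoder(frames)       # refined frame-level descriptors X', same shape

# Video-level descriptor: average over the time axis, then L2-normalize.
video_descriptor = F.normalize(refined.mean(dim=1), dim=1)
print(refined.shape, video_descriptor.shape)
```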

3.4. Contrastive Learning

Figure 3. Learning representation with pair-wise labels. For each batch, we take one positive pair from the core dataset and randomly sample negative samples from the distractors, then the video-level descriptors are generated with a shared encoder. The negative samples of all batches and all GPUs are concatenated together to form the memory bank. We compare the similarity of the anchor sample against the positive sample and all negatives in the memory bank, resulting in one positive similarity score and a set of negative similarity scores. The loss can then be calculated in a classification manner following Eq. 5 and 6.


Denote by $\tilde{z}_a$, $\tilde{z}_p$, and $\tilde{z}_{n_i}$ the video-level representations (before $\ell_2$-normalization) of the anchor, the positive, and the $i$-th negative example, and by $z_a$, $z_p$, $z_{n_i}$ their normalized counterparts. We obtain the similarity scores as $s_p = z_a^{\top} z_p$ and $s_{n_i} = z_a^{\top} z_{n_i}$. The InfoNCE (Oord et al., 2018) loss is then written as:

$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp(s_p / \tau)}{\exp(s_p / \tau) + \sum_{i=1}^{K} \exp(s_{n_i} / \tau)}, \quad (5)$

where $\tau$ is a temperature hyper-parameter (Wu et al., 2018) and $K$ is the number of negatives. To utilize more negative samples for better performance, we borrow the idea of the memory bank from (Wu et al., 2018). For each batch, we take one positive pair from the core dataset and randomly sample negative samples from the distractors, then the compact video-level descriptors are generated with a shared encoder. The negative samples of all batches and all GPUs are concatenated together to form the memory bank. We compare the similarity of the anchor sample against the positive sample and all negatives in the memory bank, yielding $s_p$ and $\{s_{n_i}\}_{i=1}^{K}$, and the loss is calculated in a classification manner. The momentum mechanism (He et al., 2019) is not adopted, as we did not observe any improvement in our experiments. Besides the InfoNCE loss, the recently proposed Circle loss (Sun et al., 2020) is also considered:

$\mathcal{L}_{\mathrm{Circle}} = \log \left[ 1 + \sum_{i=1}^{K} \exp\!\big(\gamma \, \alpha_{n_i} (s_{n_i} - \Delta_n)\big) \exp\!\big(-\gamma \, \alpha_p (s_p - \Delta_p)\big) \right], \quad (6)$

where $\gamma$ is the scale factor (playing the role of $1/\tau$ in Eq. 5), $m$ is the relaxation margin, $\alpha_p = [1 + m - s_p]_+$ and $\alpha_{n_i} = [s_{n_i} + m]_+$ are the adaptive weights, and $\Delta_p = 1 - m$, $\Delta_n = m$ are the margins. Compared with the InfoNCE loss, the Circle loss optimizes $s_p$ and $s_{n_i}$ separately with adaptive penalty strengths and adds within-class and between-class margins.
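A minimal sketch of both losses computed against a memory bank of negatives (Eq. 5 and 6); the temperature, scale factor, and margin values shown are illustrative defaults and not necessarily the values used in the paper:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.07):
    """Eq. 5: anchor/positive are (B, D); negatives is the (K, D) memory bank.
    All inputs are assumed L2-normalized."""
    s_p = (anchor * positive).sum(dim=1, keepdim=True)       # (B, 1)
    s_n = anchor @ negatives.t()                              # (B, K)
    logits = torch.cat([s_p, s_n], dim=1) / tau
    labels = torch.zeros(anchor.size(0), dtype=torch.long)    # positive at index 0
    return F.cross_entropy(logits, labels)

def circle_loss(anchor, positive, negatives, gamma=256.0, m=0.25):
    """Eq. 6: Circle loss with adaptive penalty strengths and margins."""
    s_p = (anchor * positive).sum(dim=1, keepdim=True)        # (B, 1)
    s_n = anchor @ negatives.t()                               # (B, K)
    alpha_p = torch.clamp_min(1 + m - s_p, 0)
    alpha_n = torch.clamp_min(s_n + m, 0)
    delta_p, delta_n = 1 - m, m
    logit_p = -gamma * alpha_p * (s_p - delta_p)               # (B, 1)
    logit_n = gamma * alpha_n * (s_n - delta_n)                # (B, K)
    # log(1 + exp(x)) == softplus(x), with x = logsumexp(logit_n) + logit_p
    return F.softplus(torch.logsumexp(logit_n, dim=1) + logit_p.squeeze(1)).mean()

# Toy usage: a batch of anchor/positive pairs and a memory bank of negatives.
B, K, D = 8, 4096, 1024
anchor = F.normalize(torch.randn(B, D), dim=1)
positive = F.normalize(torch.randn(B, D), dim=1)
bank = F.normalize(torch.randn(K, D), dim=1)
print(info_nce(anchor, positive, bank).item(), circle_loss(anchor, positive, bank).item())
```

Because the bank entries only enter through dot products with the anchors, comparing against thousands of negatives per step stays cheap, which is the point of the memory bank mechanism described above.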

3.5. One Step Further on the Gradients

In the recent work of Khosla et al. (Khosla et al., 2020), the proposed batch contrastive loss is shown, via gradient analysis, to automatically focus on hard positives and negatives with the help of feature normalization. We further reveal that this is a common property of the softmax loss and its variants when combined with feature normalization. For simplicity, we analyze the gradients of the softmax loss, the origin of both the InfoNCE loss and the Circle loss:

$\mathcal{L} = -\log \frac{\exp(s_p)}{\exp(s_p) + \sum_{i=1}^{K} \exp(s_{n_i})}, \quad (7)$

with the notation as aforementioned. Here we show that easy negatives contribute weakly to the gradient while hard negatives contribute more. With the notation declared in Section 3.4, denoting the normalized video-level representation of the $i$-th negative as $z_{n_i} = \tilde{z}_{n_i} / \lVert \tilde{z}_{n_i} \rVert$, the gradient of Eq. 7 with respect to $\tilde{z}_{n_i}$ is:

$\frac{\partial \mathcal{L}}{\partial \tilde{z}_{n_i}} = \frac{p_{n_i}}{\lVert \tilde{z}_{n_i} \rVert} \left( z_a - s_{n_i}\, z_{n_i} \right), \quad (8)$

in which $p_p = \frac{\exp(s_p)}{\exp(s_p) + \sum_{j} \exp(s_{n_j})}$ and $p_{n_i} = \frac{\exp(s_{n_i})}{\exp(s_p) + \sum_{j} \exp(s_{n_j})}$ follow the common notation of the softmax function, and the magnitude of the gradient in Eq. 8 is proportional to $p_{n_i} \sqrt{1 - s_{n_i}^2}$. For an easy negative, the similarity between it and the anchor is close to $-1$, thus both $p_{n_i}$ and $\sqrt{1 - s_{n_i}^2}$ are close to $0$, and therefore

$\left\lVert \frac{\partial \mathcal{L}}{\partial \tilde{z}_{n_i}} \right\rVert \approx 0. \quad (9)$

For a hard negative, $s_{n_i} > 0$ (this represents the majority of hard negatives; if the similarity is close to 1, the sample is either too hard and may cause the model to collapse, or wrongly annotated), and $p_{n_i}$ is moderate, thus the gradient magnitude is clearly greater than 0 and its contribution to the gradient of the loss function is greater. Former research only explained this intuitively (Wang, 2018); we prove this property for the first time by conducting gradient analysis. The derivations for Eq. 5 and 6 are alike. Compared with the triplet loss commonly used in video retrieval tasks (Kordopatis-Zilos et al., 2017b, 2019b), which requires computationally expensive hard negative mining, the proposed contrastive learning method takes advantage of the nature of softmax-based losses combined with feature normalization to perform hard negative mining automatically, and uses the memory bank mechanism to increase the capacity of negative samples, which greatly improves training efficiency and effectiveness.
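The claim can be checked numerically with autograd; the small sketch below (illustrative only; the way the negative is synthesized to have a target cosine similarity is our own construction) compares the gradient norm for an easy versus a moderately hard negative under the normalized softmax loss of Eq. 7:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D = 64
anchor = F.normalize(torch.randn(D), dim=0)

def negative_gradient_norm(similarity_target):
    """Gradient norm of the softmax loss (Eq. 7) w.r.t. one pre-normalization
    negative whose cosine similarity to the anchor equals similarity_target."""
    # Build a unit-norm negative with the desired similarity to the anchor.
    noise = F.normalize(torch.randn(D), dim=0)
    noise = F.normalize(noise - (noise @ anchor) * anchor, dim=0)  # orthogonal part
    neg = similarity_target * anchor + (1 - similarity_target**2) ** 0.5 * noise
    neg = (2.0 * neg).requires_grad_(True)   # un-normalized representation
    pos = F.normalize(torch.randn(D), dim=0)

    s_p = anchor @ pos
    s_n = anchor @ F.normalize(neg, dim=0)
    loss = -torch.log(torch.exp(s_p) / (torch.exp(s_p) + torch.exp(s_n)))
    loss.backward()
    return neg.grad.norm().item()

print("easy negative (s ~ -0.99):", negative_gradient_norm(-0.99))
print("hard negative (s ~  0.50):", negative_gradient_norm(0.50))
```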

3.6. Similarity Measure

To save computation and memory cost at the training stage, all feature aggregation models are trained with $\ell_2$-normalized video-level descriptors ($v$) as output, so the similarity between video pairs is simply computed by the dot product. Besides, for the sequence aggregation models, the refined frame-level video descriptors ($X'$) can also be easily extracted before the average pooling along the time axis. Following the setting in (Kordopatis-Zilos et al., 2019b), at the evaluation stage we also use the chamfer similarity to calculate the similarity between two frame-level video descriptors. Denote the representations of two videos as $X = [x_1, \dots, x_N]$ and $Y = [y_1, \dots, y_M]$, where $x_i, y_j \in \mathbb{R}^{D'}$; the chamfer similarity between them is:

$\mathrm{CS}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \max_{1 \le j \le M} x_i^{\top} y_j, \quad (10)$

and the symmetric version:

$\mathrm{SCS}(X, Y) = \frac{1}{2} \big( \mathrm{CS}(X, Y) + \mathrm{CS}(Y, X) \big). \quad (11)$

Note that this approach (chamfer similarity) seems to be inconsistent with the training target (cosine similarity), where the frame-level video descriptors are averaged into a compact representation and the similarity is calculated by the dot product. However, the similarity calculation process of the compact video descriptors can be written (up to normalization) as:

$\left( \frac{1}{N} \sum_{i=1}^{N} x_i \right)^{\!\top} \left( \frac{1}{M} \sum_{j=1}^{M} y_j \right) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{M} \sum_{j=1}^{M} x_i^{\top} y_j. \quad (12)$

Therefore, given frame-level features, the chamfer similarity averages the maximum value of each row of the frame-to-frame similarity matrix, while the cosine similarity averages the mean value of each row. Since $\frac{1}{M} \sum_{j} x_i^{\top} y_j \le \max_{j} x_i^{\top} y_j$ always holds, by optimizing the cosine similarity we are optimizing a lower bound of the chamfer similarity. As only the compact video-level feature is required, both time and space complexity are greatly reduced, since the cosine similarity is much more computationally efficient.
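A minimal sketch of Eq. 10-12 with a numeric check of the lower-bound relation (random unit-norm descriptors stand in for real frame features):

```python
import torch
import torch.nn.functional as F

def chamfer(x, y):
    """Eq. 10: average over frames of x of the max similarity to any frame of y."""
    sim = x @ y.t()                     # (N, M) frame-to-frame similarity matrix
    return sim.max(dim=1).values.mean()

def symmetric_chamfer(x, y):
    """Eq. 11: symmetric chamfer similarity."""
    return 0.5 * (chamfer(x, y) + chamfer(y, x))

def cosine_of_means(x, y):
    """Eq. 12 (up to re-normalization): dot product of averaged descriptors,
    i.e. the mean of the frame-to-frame similarity matrix."""
    return (x @ y.t()).mean()

x = F.normalize(torch.randn(30, 1024), dim=1)   # frame descriptors of video X
y = F.normalize(torch.randn(40, 1024), dim=1)   # frame descriptors of video Y

cs, scs, cos = chamfer(x, y), symmetric_chamfer(x, y), cosine_of_means(x, y)
print(float(cs), float(scs), float(cos))
assert cos <= cs + 1e-6   # the mean of each row never exceeds its row maximum
```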

4. Experiments

4.1. Experiment Setting

We evaluate the proposed approach on three video retrieval tasks, namely Near-Duplicate Video Retrieval (NDVR), Fine-grained Incident Video Retrieval (FIVR), and Event Video Retrieval (EVR). In all cases, we report the mean Average Precision (mAP).

Training dataset

We leverage the VCDB (Jiang et al., 2014) dataset and a subset of the FIVR-200K (Kordopatis-Zilos et al., 2019a) dataset as training datasets. The core dataset of VCDB has 528 query videos and 6,139 positive pairs, and the distractor dataset has 100,000 distractor videos, of which we successfully downloaded 99,181. The FIVR-200K dataset includes 225,960 videos and 100 queries, of which we successfully downloaded 225,950; it defines three different fine-grained video retrieval tasks: (1) Duplicate Scene Video Retrieval (DSVR), (2) Complementary Scene Video Retrieval (CSVR), and (3) Incident Scene Video Retrieval (ISVR). Following the setting in (Kordopatis-Zilos et al., 2019a), we order the videos by publication time and split them in half, resulting in the earlier FIVR-TRAIN dataset with 31 queries and the later FIVR-TEST dataset with 69 queries. In our implementation, we use all 6,217 positive pairs of the ISVR task of the FIVR-TRAIN dataset as positive pairs and all 108,576 other unannotated videos as distractors. For a quick comparison of the different variants, the FIVR-5K dataset as in (Kordopatis-Zilos et al., 2019b) is also used.

Evaluation dataset

For models trained on the VCDB dataset, we test on the CC_WEB_VIDEO (Wu et al., 2007) dataset for the NDVR task, on FIVR-200K for the FIVR task, and on EVVE (Revaud et al., 2013) for the EVR task; for the models trained on the FIVR-TRAIN dataset, we test on the FIVR-TEST dataset. The CC_WEB_VIDEO dataset contains 24 query videos and 13,129 labeled videos, of which we managed to download 13,099. The EVVE dataset consists of 2,375 videos and 620 queries, and we successfully downloaded the whole dataset.

Implementation Details

For feature extraction, we extract one frame per second for all videos. For all retrieval tasks, we extract the frame-level features following the scheme in Section 3.2. The intermediate features are extracted from the outputs of the four residual blocks of ResNet-50 (He et al., 2016). A PCA trained on 997,090 randomly sampled frame-level descriptors from VCDB is applied to both the iMAC and $L^3$-iRMAC features to perform whitening and reduce their dimensionality from 3840 to 1024. Finally, $\ell_2$-normalization is applied.

For both NetVLAD and NeXtVLAD, the number of clusters is set to 256, the Context Gating mechanism is used with gating_reduction=8, one fully connected layer is used to reduce the dimension of the final flattened representation to 1024, and a dropout layer with drop_rate=0.5 is applied before the fully connected layer. The expansion ratio of NeXtVLAD is set to 2. For both LSTM and GRU, the number of hidden units is set to 1024, the number of layers to 2, and the dropout_rate to 0.2. For all these four models, batch normalization (Ioffe and Szegedy, 2015) is applied before each non-linear layer. The Transformer is implemented with a single layer, 8 attention heads, dropout_rate set to 0.5, and the dimension of the feed-forward layer set to 2048. No batch normalization is used in the Transformer model, as it may speed up over-fitting in practice; interestingly, all four other models do not converge without it. For both the InfoNCE loss and the Circle loss, the parameters are set to their default values.

During training, all videos are padded to 300 frames (if longer, a random segment with a length of 300 is extracted), and the full video is used at the evaluation stage. We use Adam (Kingma and Ba, 2014) as our optimizer, with separate initial learning rates for NetVLAD/NeXtVLAD and for the sequence models, and a cosine annealing learning rate scheduler (Loshchilov and Hutter, 2016). All models are trained with batch size 64, and the negative samples drawn from the distractors are sent to the memory bank at each batch; on a single machine with 4 Tesla-V100-SXM2-32GB GPUs, the size of the memory bank equals 4096. The training of each model stops when over-fitting is observed, i.e. after 5, 5, 20, 30, and 40 epochs for NetVLAD, NeXtVLAD, LSTM, GRU, and Transformer, respectively. All models are implemented with PyTorch (Paszke et al., 2019), and distributed training is implemented with Horovod (Sergeev and Balso, 2018).
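A condensed sketch of this optimization setup (padding or cropping to 300 frames, Adam, cosine annealing); the learning rate value, the placeholder objective, and the random data are assumptions for illustration only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_FRAMES = 300

def pad_or_crop(frames: torch.Tensor) -> torch.Tensor:
    """Pad an (N, D) frame-level descriptor to MAX_FRAMES with zeros, or take a
    random 300-frame segment if the video is longer, as described above."""
    n, d = frames.shape
    if n >= MAX_FRAMES:
        start = torch.randint(0, n - MAX_FRAMES + 1, (1,)).item()
        return frames[start:start + MAX_FRAMES]
    return torch.cat([frames, frames.new_zeros(MAX_FRAMES - n, d)], dim=0)

# Placeholder aggregation model (the Transformer encoder variant in practice).
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8, dim_feedforward=2048,
                               dropout=0.5, batch_first=True),
    num_layers=1,
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # lr value is illustrative
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=40)

for epoch in range(2):  # up to 40 epochs for the Transformer in the paper
    # One illustrative step; random tensors stand in for real frame descriptors.
    batch = torch.stack([pad_or_crop(torch.randn(200, 1024)) for _ in range(4)])
    video_desc = F.normalize(model(batch).mean(dim=1), dim=1)
    # Placeholder objective; the paper uses the contrastive losses of Section 3.4.
    loss = (1.0 - video_desc @ video_desc.t()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```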

4.2. Feature Aggregation Model Comparison

Model         CC_WEB_VIDEO         FIVR-200K
              cc_web    cc_web*    DSVR    CSVR    ISVR
NetVLAD       0.971     0.944      0.513   0.494   0.412
NeXtVLAD      0.967     0.935      0.495   0.471   0.389
LSTM          0.969     0.937      0.505   0.483   0.400
GRU           0.969     0.940      0.515   0.495   0.415
Transformer   0.972     0.943      0.551   0.532   0.454
Table 1. Comparison between feature aggregation models. As in (Kordopatis-Zilos et al., 2019b), we use two evaluation settings on CC_WEB_VIDEO: one measuring performance only on the query sets (cc_web), and one on the entire dataset (cc_web*).

This section presents a comparison of the five feature aggregation models. All models are trained on the VCDB dataset with the iMAC feature to generate compact video-level descriptors, and the dot product is used for similarity calculation in both training and evaluation. Table 1 presents the results of the comparison on both CC_WEB_VIDEO and FIVR-200K. As in (Miech et al., 2017), NetVLAD outperforms the classic sequence models (LSTM, GRU), but interestingly NeXtVLAD shows the worst performance. Moreover, the Transformer model demonstrates excellent performance in almost all tasks, indicating that with the spatio-temporal information fully utilized, there is huge potential for the aggregation model to improve. We also present a comparison between feature extraction methods and loss functions in Table 2, with the loss function fixed to the Circle loss for the feature comparison. $L^3$-iRMAC shows a consistent improvement over iMAC, indicating that local spatial information is leveraged by the $L^3$-iRMAC feature while a low dimensionality is maintained. For the loss functions, the InfoNCE loss shows notable inferiority compared with the Circle loss under default parameters; even with the temperature tuned (equivalent to the scale factor of the Circle loss), it still shows around 0.005 lower mAP. In the following experiments, we therefore only consider the Transformer model trained with the $L^3$-iRMAC feature and the Circle loss, denoted as CECL.

(a) Feature (FIVR-200K)
Feature       DSVR    CSVR    ISVR
iMAC          0.547   0.526   0.447
$L^3$-iRMAC   0.570   0.553   0.473

(b) Loss function (FIVR-200K)
Loss                            DSVR    CSVR    ISVR
InfoNCE (default temperature)   0.493   0.473   0.394
InfoNCE (tuned temperature)     0.566   0.548   0.468
Circle                          0.570   0.553   0.473
Table 2. Comparison between features and loss functions
(a) Training mechanism (FIVR-5K)
Method      Bank size   DSVR    CSVR    ISVR
baseline    4096        0.609   0.617   0.578
triplet     -           0.510   0.509   0.455
m = 0.1     65536       0.606   0.612   0.569
m = 0.9     65536       0.606   0.612   0.569
m = 0.99    65536       0.602   0.606   0.561
m = 0.999   65536       0.581   0.577   0.520
(b) Similarity measure (FIVR-5K)
Method                 DSVR    CSVR    ISVR
cosine (video-level)   0.609   0.617   0.578
chamfer                0.844   0.834   0.763
symmetric chamfer      0.763   0.766   0.711
video comparator       0.726   0.735   0.701
Table 3. Ablation study

4.3. Ablation Study

In this section, we present an ablation study on different training mechanisms and similarity calculation methods. For the training mechanism, we compare the baseline (contrastive learning with a memory bank (Wu et al., 2018)) with a triplet-based approach with hard negative mining (Kordopatis-Zilos et al., 2017b) and a modified MoCo (He et al., 2019)-like approach, where a large queue is maintained to store the negative samples and the weights of the key encoder are updated in a moving-average manner. For the triplet-based approach, the training process is extremely time-consuming (5 epochs, 5 hours on 32 Tesla-V100-SXM2-32GB GPUs), yet it still shows around 10% lower mAP compared with the baseline (40 epochs, 15 minutes on 4 Tesla-V100-SXM2-32GB GPUs), indicating that, compared with learning from mined hard negatives, utilizing a large number of randomly sampled negative samples is not only more efficient but also more effective. For the MoCo-like approach, we experimented with different momentum values (parameter $m$) ranging from 0.1 to 0.999, but none of them shows better performance than the baseline, as reported in Table 3(a). We argue that the momentum mechanism is a compromise made for a larger memory; as the memory bank is already big enough in our case, the momentum mechanism is not needed.
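For reference, the moving-average (momentum) update used in such a MoCo-like variant can be sketched as follows (a generic illustration with a linear layer standing in for the encoder; m is the momentum coefficient swept in Table 3(a)):

```python
import copy
import torch
import torch.nn as nn

# Query encoder and its momentum (key) encoder, initialized identically.
encoder_q = nn.Linear(1024, 1024)
encoder_k = copy.deepcopy(encoder_q)
for p in encoder_k.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def momentum_update(m: float = 0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied after each step."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)

# After each optimizer step on encoder_q, the key encoder drifts slowly:
momentum_update(m=0.999)
```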

For the similarity measures, we evaluate both the aggregated video-level feature and the frame-level features. For the video-level features, we evaluate with the cosine similarity. For the frame-level features, we evaluate the similarity between frames with the cosine similarity and, for the resulting frame-to-frame similarity matrix, calculate the similarity between videos following the settings of (Kordopatis-Zilos et al., 2019b), i.e. chamfer similarity, symmetric chamfer similarity, and chamfer similarity with the similarity comparator (with the weights kept as provided by the authors). The four approaches are denoted as cosine, chamfer, symmetric-chamfer, and video-comparator for simplicity. Table 3(b) presents the results on the FIVR-5K dataset. Interestingly, the frame-level similarity calculation approaches outperform the video-level approach by a large margin, indicating that frame-level comparison is important for fine-grained similarity calculation between videos. Besides, the comparator network does not perform as well as reported; we argue that this may be due to the bias between features. We did not re-train the comparator because our target is to learn a good video representation, and the similarity measure is expected to be as simple and computationally efficient as possible.

4.4. Comparison with the state-of-the-art

Method                                    mAP     Per event class
LAMV+QE (Baraldi et al., 2018)            0.587   0.837 0.500 0.126 0.588 0.455 0.343 0.267 0.142 0.230 0.293 0.216 0.950 0.776
ViSiL (Kordopatis-Zilos et al., 2019b)    0.597   0.881 0.643 0.155 0.592 0.333 0.360 0.263 0.142 0.349 0.577 0.358 0.880 0.812
ViSiL (Kordopatis-Zilos et al., 2019b)    0.616   0.892 0.690 0.177 0.518 0.456 0.302 0.277 0.183 0.372 0.446 0.314 0.936 0.775
ViSiL (Kordopatis-Zilos et al., 2019b)    0.623   0.921 0.720 0.234 0.590 0.356 0.354 0.285 0.175 0.450 0.569 0.391 0.912 0.847
CECL (ours)                               0.598   0.809 0.585 0.110 0.603 0.316 0.328 0.267 0.207 0.371 0.641 0.342 0.884 0.804
CECL (ours)                               0.603   0.929 0.662 0.196 0.634 0.318 0.339 0.226 0.163 0.377 0.644 0.310 0.880 0.806
CECL (ours)                               0.630   0.941 0.711 0.214 0.631 0.409 0.323 0.277 0.238 0.367 0.600 0.209 0.928 0.678
Table 4. mAP comparison of three similarity calculation setups of CECL with the state-of-the-art approaches on EVVE. The three ViSiL rows and the three CECL rows correspond to different similarity calculation setups.

Near-duplicate Video Retrieval

              Method                                   cc_web   cc_web*   cc_web_c   cc_web_c*
Video-level   DML (Kordopatis-Zilos et al., 2017b)     0.971    0.941     0.979      0.959
              CECL (ours)                              0.973    0.947     0.983      0.965
Frame-level   CTE (Revaud et al., 2013)                0.996    -         -          -
              ViSiL (Kordopatis-Zilos et al., 2019b)   0.984    0.969     0.993      0.987
              ViSiL (Kordopatis-Zilos et al., 2019b)   0.982    0.969     0.991      0.988
              ViSiL (Kordopatis-Zilos et al., 2019b)   0.985    0.971     0.996      0.993
              CECL (ours)                              0.983    0.969     0.994      0.990
              CECL (ours)                              0.982    0.962     0.992      0.981
Table 5. mAP on four different versions of CC_WEB_VIDEO

We first compare the performance of CECL against state-of-the-art approaches on several versions of CC_WEB_VIDEO (Wu et al., 2007), following the setting in (Kordopatis-Zilos et al., 2019b). The benchmark approaches are Deep Metric Learning (DML) (Kordopatis-Zilos et al., 2017b), Circulant Temporal Encoding (CTE) (Revaud et al., 2013), and Fine-grained Spatio-Temporal Video Similarity Learning (ViSiL); we report the best results of the original papers. As listed in Table 5, for the aggregated video-level descriptor we report the state-of-the-art result on all tasks, and for the refined frame-level descriptor we report results comparable with ViSiL. To emphasize again, our target is to learn a good video representation, and the similarity calculation stage is expected to be as simple and computationally efficient as possible; therefore, it is fairer to compare our approach with the ViSiL variants that use a plain chamfer-based similarity, as they hold an akin similarity calculation approach.

Fine-grained Incident Video Retrieval

              Method                                   FIVR-200K               FIVR-TEST
                                                       DSVR   CSVR   ISVR      DSVR   CSVR   ISVR
Video-level   DML (Kordopatis-Zilos et al., 2017b)     0.398  0.378  0.309     0.465  0.443  0.381
              HC (Song et al., 2013)                   0.265  0.247  0.193     0.468  0.444  0.382
              CECL (trained on VCDB)                   0.570  0.553  0.473     0.607  0.585  0.501
              CECL (trained on FIVR-TRAIN)             -      -      -         0.642  0.617  0.528
Frame-level   ViSiL (Kordopatis-Zilos et al., 2019b)   0.843  0.797  0.660     -      -      -
              ViSiL (Kordopatis-Zilos et al., 2019b)   0.833  0.792  0.654     -      -      -
              ViSiL (Kordopatis-Zilos et al., 2019b)   0.892  0.841  0.702     -      -      -
              CECL (trained on VCDB)                   0.877  0.830  0.703     0.885  0.846  0.699
              CECL (trained on FIVR-TRAIN)             -      -      -         0.899  0.860  0.715
Table 6. mAP on FIVR-200K and FIVR-TEST
              Method                                  DSVR    CSVR    ISVR
Video-level   LBoW (Kordopatis-Zilos et al., 2017a)   0.710   0.675   0.572
              CECL (ours)                             0.777   0.791   0.795
Frame-level   CECL (ours)                             0.913   0.876   0.763
Table 7. mAP on FIVR-200K when it is used for both development and evaluation

For the FIVR task, we evaluate the performance of CECL against state-of-the-art approaches on the FIVR-200K (Kordopatis-Zilos et al., 2019a) dataset. We report the best results from the original papers of Deep Metric Learning (DML) (Kordopatis-Zilos et al., 2017b), Layer Bag-of-Words (LBoW) (Kordopatis-Zilos et al., 2017a), Hashing Codes (HC) (Song et al., 2013), and Fine-grained Spatio-Temporal Video Similarity Learning (ViSiL) (Kordopatis-Zilos et al., 2019b). For models trained on the VCDB dataset we report the results on FIVR-200K, and for models trained on FIVR-TRAIN we report the results on FIVR-TEST. As shown in Table 6, the proposed feature aggregation approach again shows a clear performance advantage over state-of-the-art methods on video-level features, and delivers competitive results when compared with frame-level features at a low cost for similarity measurement. Compared with the chamfer-based ViSiL variants, we show a clear performance advantage even with a more compact frame-level feature and a simpler frame-to-frame similarity measure, opening up new possibilities for incorporating contextual information into frame-level features. When compared with the full ViSiL model with its similarity comparator, we show competitive results with a much lower cost for similarity measurement; interestingly, our method slightly outperforms it on the ISVR task, indicating that our model might have an advantage in modeling semantic information. Besides, when trained on FIVR-TRAIN, all of our approaches see an improvement between 1% and 4%.

To make a fair comparison against LBoW, we also report results where FIVR-200K is used as both the development and the evaluation dataset (only in this case is our model trained on the complete FIVR-200K dataset) in Table 7. Again, we present the best video-level feature, and the results on the DSVR and CSVR tasks are further boosted by a large margin when the frame-level feature is used for similarity calculation. Besides, it is interesting that the video-level feature shows a clear performance advantage on the ISVR task over the frame-level feature. This may indicate that fine-grained frame-level comparison is only effective for tasks in which similar videos share visually similar scenes, while for tasks in which similar videos are only semantically similar, the video-level feature is more robust to visually similar distractor frames.

Event Video Retrieval

For EVR, we also compare CECL with the state-of-the-art approaches, i.e. Learning to Align and Match Videos (LAMV) (Baraldi et al., 2018) with Average Query Expansion (AQE) (Douze et al., 2013) and, once again, Fine-grained Spatio-Temporal Video Similarity Learning (ViSiL) (Kordopatis-Zilos et al., 2019b), on EVVE (Revaud et al., 2013). We report the results of LAMV from the original paper, and re-evaluate ViSiL since its reported results were obtained on only a fraction of the original EVVE dataset. As shown in Table 4, our frame-level variant shows the best overall mAP and the best results on some of the events, and remains competitive against ViSiL, which achieves the best result on the majority of the events, but at a much lower computational cost. Surprisingly, our video-level feature version also reports notable results, indicating that temporal information and fine-grained spatial information are not necessary for the event video retrieval task.

(a) Training time on VCDB
Method                                                                                   # Epochs   # GPU hours
Triplet (used in both DML (Kordopatis-Zilos et al., 2017b) and ViSiL (Kordopatis-Zilos et al., 2019b))   5   160
Ours                                                                                     40         1

(b) Evaluation time on FIVR-5K
Method                                   # Seconds
ViSiL (Kordopatis-Zilos et al., 2019b)   2211
ViSiL (Kordopatis-Zilos et al., 2019b)   2608
Ours                                     116
Table 8. Comparison on efficiency

In Table 8, we demonstrate the efficiency of our method. For training, our method (contrastive learning with a memory bank) is not only much more efficient than the commonly used triplet-based approach, but also shows significantly higher performance, as reported in Table 3(a). For evaluation, our method is about 22x faster compared with ViSiL (Kordopatis-Zilos et al., 2019b), while achieving competitive performance. All this shows that our method achieves a good trade-off between efficiency and performance, and holds great potential for applications.

5. Conclusion

In this paper, we present CECL (Context Encoding for video retrieval with Contrastive Learning), a video representation learning network that aggregates the context information of frame-level descriptors. To train the feature aggregation models with pair-wise labels, we propose a supervised contrastive learning method in which the models are trained to distinguish the positive sample with respect to the anchor from distractors contained in a shared memory bank using a contrastive loss. By conducting gradient analysis, the property of automatic hard negative mining is also revealed in the proposed method. Extensive experiments are conducted on multiple video retrieval tasks, and the proposed method shows a clear performance advantage over state-of-the-art methods with video-level features and delivers competitive results with a much lower computational cost for similarity measurement when compared with frame-level features.

References

  • Arandjelovic et al. (2016) Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. 2016. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5297–5307.
  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
  • Babenko and Lempitsky (2015) Artem Babenko and Victor Lempitsky. 2015. Aggregating local deep features for image retrieval. In Proceedings of the IEEE international conference on computer vision. 1269–1277.
  • Baraldi et al. (2018) Lorenzo Baraldi, Matthijs Douze, Rita Cucchiara, and Hervé Jégou. 2018. LAMV: Learning to align and match videos with kernelized temporal layers. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7804–7813.
  • Bay et al. (2006) Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. Surf: Speeded up robust features. In European conference on computer vision. Springer, 404–417.
  • Cai et al. (2011) Yang Cai, Linjun Yang, Wei Ping, Fei Wang, Tao Mei, Xian-Sheng Hua, and Shipeng Li. 2011. Million-scale near-duplicate video retrieval system. In Proceedings of the 19th ACM international conference on Multimedia. 837–838.
  • Cao et al. (2013) Qiong Cao, Yiming Ying, and Peng Li. 2013. Similarity metric learning for face recognition. In Proceedings of the IEEE international conference on computer vision. 2408–2415.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709 (2020).
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014).
  • Chou et al. (2015) Chien-Li Chou, Hua-Tsung Chen, and Suh-Yin Lee. 2015. Pattern-based near-duplicate video retrieval and localization on web-scale videos. IEEE Transactions on Multimedia 17, 3 (2015), 382–395.
  • Csurka et al. (2004) Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray. 2004. Visual categorization with bags of keypoints. In Workshop on statistical learning in computer vision, ECCV, Vol. 1. Prague, 1–2.
  • Donahue et al. (2015) Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2625–2634.
  • Douze et al. (2013) Matthijs Douze, Jérôme Revaud, Cordelia Schmid, and Hervé Jégou. 2013. Stable hyper-pooling and query expansion for event detection. In Proceedings of the IEEE International Conference on Computer Vision. 1825–1832.
  • Feng et al. (2018) Yang Feng, Lin Ma, Wei Liu, Tong Zhang, and Jiebo Luo. 2018. Video re-localization. In Proceedings of the European Conference on Computer Vision (ECCV). 51–66.
  • Gordo et al. (2017) Albert Gordo, Jon Almazan, Jerome Revaud, and Diane Larlus. 2017. End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision 124, 2 (2017), 237–254.
  • Hao et al. (2017) Yanbin Hao, Tingting Mu, John Y Goulermas, Jianguo Jiang, Richang Hong, and Meng Wang. 2017. Unsupervised t-distributed video hashing and its deep hashing extension. IEEE Transactions on Image Processing 26, 11 (2017), 5531–5544.
  • Hao et al. (2016) Yanbin Hao, Tingting Mu, Richang Hong, Meng Wang, Ning An, and John Y Goulermas. 2016. Stochastic multiview hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia 19, 1 (2016), 1–14.
  • He et al. (2019) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2019. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722 (2019).
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Hermans et al. (2017) Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737 (2017).
  • Hjelm et al. (2018) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 (2018).
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
  • Hu et al. (2018) Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. 2018. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3588–3597.
  • Hu and Lu (2018) Yaocong Hu and Xiaobo Lu. 2018. Learning spatial-temporal features for video copy detection by the combination of CNN and RNN. Journal of Visual Communication and Image Representation 55 (2018), 21–29.
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
  • Jégou and Chum (2012) Hervé Jégou and Ondřej Chum. 2012. Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening. In European conference on computer vision. Springer, 774–787.
  • Jégou et al. (2010) Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. 2010. Aggregating local descriptors into a compact image representation. In 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 3304–3311.
  • Jiang et al. (2014) Yu-Gang Jiang, Yudong Jiang, and Jiajun Wang. 2014. VCDB: a large-scale database for partial copy detection in videos. In European conference on computer vision. Springer, 357–371.
  • Jiang et al. (2007) Yu-Gang Jiang, Chong-Wah Ngo, and Jun Yang. 2007. Towards optimal bag-of-features for object categorization and semantic video retrieval. In Proceedings of the 6th ACM international conference on Image and video retrieval. 494–501.
  • Jing et al. (2019) Weizhen Jing, Xiushan Nie, Chaoran Cui, Xiaoming Xi, Gongping Yang, and Yilong Yin. 2019. Global-view hashing: harnessing global relations in near-duplicate video retrieval. World wide web 22, 2 (2019), 771–789.
  • Khosla et al. (2020) Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised Contrastive Learning. ArXiv abs/2004.11362 (2020).
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Kordopatis-Zilos et al. (2019a) Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, and Ioannis Kompatsiaris. 2019a. FIVR: Fine-grained incident video retrieval. IEEE Transactions on Multimedia 21, 10 (2019), 2638–2652.
  • Kordopatis-Zilos et al. (2019b) Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, and Ioannis Kompatsiaris. 2019b. ViSiL: Fine-grained Spatio-Temporal Video Similarity Learning. In Proceedings of the IEEE International Conference on Computer Vision. 6351–6360.
  • Kordopatis-Zilos et al. (2017a) Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, and Yiannis Kompatsiaris. 2017a. Near-duplicate video retrieval by aggregating intermediate CNN layers. In International conference on multimedia modeling. Springer, 251–263.
  • Kordopatis-Zilos et al. (2017b) Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, and Yiannis Kompatsiaris. 2017b. Near-Duplicate Video Retrieval with Deep Metric Learning. In 2017 IEEE International Conference on Computer Vision Workshop (ICCVW).
  • Li et al. (2017) Yang Li, Yulong Xu, Jiabao Wang, Zhuang Miao, and Yafei Zhang. 2017. Ms-rmac: Multiscale regional maximum activation of convolutions for image retrieval. IEEE Signal Processing Letters 24, 5 (2017), 609–613.
  • Liao et al. (2018) Kaiyang Liao, Hao Lei, Yuanlin Zheng, Guangfeng Lin, Congjun Cao, Mingzhu Zhang, and Jie Ding. 2018. IR feature embedded bof indexing method for near-duplicate video retrieval. IEEE Transactions on Circuits and Systems for Video Technology 29, 12 (2018), 3743–3753.
  • Lin et al. (2018) Rongcheng Lin, Jing Xiao, and Jianping Fan. 2018. Nextvlad: An efficient neural network to aggregate frame-level features for large-scale video classification. In Proceedings of the European Conference on Computer Vision (ECCV). 0–0.
  • Liu et al. (2017) Hao Liu, Qingjie Zhao, Hao Wang, Peng Lv, and Yanming Chen. 2017. An image-based near-duplicate video retrieval and localization using improved Edit distance. Multimedia Tools and Applications 76, 22 (2017), 24435–24456.
  • Loshchilov and Hutter (2016) Ilya Loshchilov and Frank Hutter. 2016. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016).
  • Lowe (2004) David G Lowe. 2004. Distinctive image features from scale-invariant keypoints. International journal of computer vision 60, 2 (2004), 91–110.
  • Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.
  • Miech et al. (2017) Antoine Miech, Ivan Laptev, and Josef Sivic. 2017. Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905 (2017).
  • Oh Song et al. (2016) Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. 2016. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4004–4012.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS.
  • Perronnin and Dance (2007) Florent Perronnin and Christopher Dance. 2007. Fisher kernels on visual vocabularies for image categorization. In 2007 IEEE conference on computer vision and pattern recognition. IEEE, 1–8.
  • Poullot et al. (2015) Sébastien Poullot, Shunsuke Tsukatani, Anh Phuong Nguyen, Hervé Jégou, and Shin’Ichi Satoh. 2015. Temporal matching kernel with explicit feature maps. In Proceedings of the 23rd ACM international conference on Multimedia. 381–390.
  • Radenović et al. (2016) Filip Radenović, Giorgos Tolias, and Ondřej Chum. 2016. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In European conference on computer vision. Springer, 3–20.
  • Razavian et al. (2016) Ali S Razavian, Josephine Sullivan, Stefan Carlsson, and Atsuto Maki. 2016. Visual instance retrieval with deep convolutional networks. ITE Transactions on Media Technology and Applications 4, 3 (2016), 251–258.
  • Revaud et al. (2013) Jérôme Revaud, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. 2013. Event retrieval in large video collections with circulant temporal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2459–2466.
  • Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 815–823.
  • Seddati et al. (2017) Omar Seddati, Stéphane Dupont, Saïd Mahmoudi, and Mahnaz Parian. 2017. Towards good practices for image retrieval based on CNN features. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 1246–1255.
  • Sergeev and Balso (2018) Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. ArXiv abs/1802.05799 (2018).
  • Shang et al. (2010) Lifeng Shang, Linjun Yang, Fei Wang, Kwok-Ping Chan, and Xian-Sheng Hua. 2010. Real-time large scale near-duplicate web video retrieval. In Proceedings of the 18th ACM international conference on Multimedia. 531–540.
  • Sivic and Zisserman (2003) Josef Sivic and Andrew Zisserman. 2003. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, 1470.
  • Sohn (2016) Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. In Advances in neural information processing systems. 1857–1865.
  • Song et al. (2011) Jingkuan Song, Yi Yang, Zi Huang, Heng Tao Shen, and Richang Hong. 2011. Multiple feature hashing for real-time large scale near-duplicate video retrieval. In Proceedings of the 19th ACM international conference on Multimedia. 423–432.
  • Song et al. (2013) Jingkuan Song, Yi Yang, Zi Huang, Heng Tao Shen, and Jiebo Luo. 2013. Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia 15, 8 (2013), 1997–2008.
  • Sun et al. (2020) Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, and Yichen Wei. 2020. Circle loss: A unified perspective of pair similarity optimization. arXiv preprint arXiv:2002.10857 (2020).
  • Tan et al. (2009) Hung-Khoon Tan, Chong-Wah Ngo, Richard Hong, and Tat-Seng Chua. 2009. Scalable detection of partial near-duplicate videos by visual-temporal consistency. In Proceedings of the 17th ACM international conference on Multimedia. 145–154.
  • Tian et al. (2019) Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019).
  • Tolias et al. (2015) Giorgos Tolias, Ronan Sicre, and Hervé Jégou. 2015. Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879 (2015).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
  • Wang (2018) Feng Wang. 2018. Research on Deep Learning Based Face Verification. Ph.D. Dissertation. University of Electronic Science and Technology of China.
  • Wang et al. (2018) Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7794–7803.
  • Wang et al. (2019) Xinshao Wang, Yang Hua, Elyor Kodirov, Guosheng Hu, Romain Garnier, and Neil M Robertson. 2019. Ranked list loss for deep metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5207–5216.
  • Weinberger and Saul (2009) Kilian Q Weinberger and Lawrence K Saul. 2009. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10, Feb (2009), 207–244.
  • Wen et al. (2016) Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. 2016. A discriminative feature learning approach for deep face recognition. In European conference on computer vision. Springer, 499–515.
  • Wu et al. (2017) Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. 2017. Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision. 2840–2848.
  • Wu et al. (2007) Xiao Wu, Alexander G Hauptmann, and Chong-Wah Ngo. 2007. Practical elimination of near-duplicates from web video search. In Proceedings of the 15th ACM international conference on Multimedia. 218–227.
  • Wu and Aizawa (2014) Zhipeng Wu and Kiyoharu Aizawa. 2014. Self-similarity-based partial near-duplicate video retrieval and alignment. International Journal of Multimedia Information Retrieval 3, 1 (2014), 1–14.
  • Wu et al. (2018) Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3733–3742.
  • Xia et al. (2019) Jin Xia, Jie Shao, Cewu Liu, and Changhu Wang. 2019. Weakly Supervised EM Process For Temporal Localization Within Video. In 2019 IEEE International Conference on Computer Vision Workshop (ICCVW).
  • Zhao and Pietikainen (2007) Guoying Zhao and Matti Pietikainen. 2007. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE transactions on pattern analysis and machine intelligence 29, 6 (2007), 915–928.
  • Zheng et al. (2017) Liang Zheng, Yi Yang, and Qi Tian. 2017. SIFT meets CNN: A decade survey of instance retrieval. IEEE transactions on pattern analysis and machine intelligence 40, 5 (2017), 1224–1244.
  • Zheng et al. (2016) Liang Zheng, Yali Zhao, Shengjin Wang, Jingdong Wang, and Qi Tian. 2016. Good practice in CNN feature transfer. arXiv preprint arXiv:1604.00133 (2016).