Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence

by   Huy Manh Nguyen, et al.

Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances lie close to each other. Most existing methods place instances in a single embedding space. However, they struggle to embed instances well because of the difficulty of matching the visual dynamics in videos to the textual features in sentences; a single space is not enough to accommodate the variety of videos and sentences. In this paper, we propose a novel framework that maps instances into multiple individual embedding spaces so that we can capture multiple relationships between instances, leading to compelling video retrieval. We produce a final similarity between instances by fusing the similarities measured in each embedding space with a weighted-sum strategy, where the weights are determined according to the sentence; therefore, we can flexibly emphasize a particular embedding space. We conducted sentence-to-video retrieval experiments on a benchmark dataset. The proposed method achieved superior performance, with results competitive with state-of-the-art methods, demonstrating the effectiveness of the proposed multiple-embedding approach.




I Introduction

Video has become an essential source for humans to learn and acquire knowledge. Due to the increased demand for sharing and accumulating information, a massive amount of video is being produced in the world every day. However, compared to images, videos usually contain much richer semantic information, and thus it is hard for humans to organize videos. Therefore, it is critical to develop an algorithm that can efficiently perform multimedia analysis and automatically understand the semantic information of videos.

A common approach for video analysis and understanding is to form a joint embedding space of video and sentence using multimodal learning. Similar to the way humans experience the world with multiple senses, the goal of multimodal learning is to develop a model that can simultaneously process multiple modalities, such as visual, text, and audio, in an integrated manner by constructing a joint embedding space. Such models can map various modalities into a shared Euclidean space where distances and directions capture useful semantic relationships. They not only enable us to learn semantic representations of videos by leveraging information from other modalities, but also show great potential in tasks such as cross-modal retrieval and visual recognition.

Recent works in learning an embedding space bridge the gap between sentences and visual information by utilizing advancements in image and language understanding [10, 6]. Most approaches build the embedding space by connecting visual and textual embedding paths. Generally, a visual path uses a Convolutional Neural Network (CNN) to transform visual appearances into a vector. Likewise, a Recurrent Neural Network (RNN) embeds sentences in a textual path. However, capturing the relationship between video and sentence remains challenging, and recent works suffer from extracting the visual dynamics in a video [26, 15, 11, 41].

In this paper, we propose a novel framework equipped with multiple embedding networks so that we can capture various relationships between video and sentence, leading to more compelling video retrieval. Specifically, one network captures the relationship between the overall appearance of the video and a textual feature; the others consider consecutive appearances or action features. Thus, the networks learn their own embedding spaces. We fuse the similarities measured in the multiple spaces using a weighted-sum strategy to produce the final similarity between video and sentence.

The main contribution of this paper is a novel approach to measuring the similarity between video and sentence by fusing similarities in multiple embedding spaces. Consequently, we can measure the similarity with multiple understandings of the relationships between video and sentence. Besides, we emphasize that the proposed method can easily be extended with additional embedding spaces. We demonstrated the effectiveness of the proposed method through video retrieval experiments with query sentences on a standard benchmark dataset, showing an improvement over existing methods.

II Related Work

Vision and Language Understanding

There have been many efforts to connect vision and language by building a joint visual-semantic space [21]. Various applications in computer vision need such a joint space to realize tagging [10], retrieval [12], captioning [12], and visual question answering [19].

Recent works in image-to-text retrieval embed image and sentence into a joint space trained with a ranking loss: a penalty is applied when an incorrect sentence is ranked higher than the correct one [36, 20, 40, 13, 17]. Another popular loss function is the triplet ranking loss [10, 23, 37, 43]. The VSE++ model improves the triplet ranking loss by focusing on the hardest negative samples (the most similar yet incorrect samples in a mini-batch) [7].

Video and Sentence Embedding

As with image-text retrieval approaches, most video-to-text retrieval methods learn a joint embedding space [44, 34, 32]. The method of [26] incorporates web images searched with a sentence into the embedding process to take fine-grained visual concepts into account. However, it treats video frames as a set of unrelated images and averages them to get a final video embedding vector. This may lead to inefficiency in learning an embedding space, since the temporal information of the video is lost.

Mithun et al. tackle this issue [25] by learning two different embedding spaces to consider temporal and appearance information; they also extract audio features from the video for learning the space. This approach achieved accurate performance on the sentence-to-video retrieval task. However, it puts equal importance on both embedding spaces. In practice, equal importance does not work well: there are cases in which one embedding space is more important than the others for capturing the semantic similarity between video and sentence. Therefore, we propose a novel mechanism for emphasizing a space so that the critical visual cues can be identified.

III The Proposed Method

III-A Overview

We describe the problem of learning a joint embedding space for video and sentence embedding. Given video and sentence instances, which are sequences of frames and words, the aim is to train embedding functions that map them into a joint embedding space. Formally, we use embedding functions f : V → E and g : S → E, where V and S are the video and sentence domains, respectively, and E is the joint embedding space. The similarity between f(v) and g(s) is calculated in E by a certain measurement sim(·, ·). Therefore, the ultimate goal is to learn f and g satisfying sim(f(v), g(s)) > sim(f(v), g(s′)) for any non-matching sentence s′. This encourages the similarity to increase for a matching pair (v, s), whereas it decreases for a non-matching pair (v, s′).
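The joint-embedding objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the projection matrices, their dimensions, and the random features are all hypothetical stand-ins for the learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x) + eps)

# Hypothetical linear embedding functions f (video -> joint space) and
# g (sentence -> joint space); dimensions are illustrative only.
W_v = rng.standard_normal((512, 2048)) * 0.01   # video projection
W_s = rng.standard_normal((512, 1024)) * 0.01   # sentence projection

def embed_video(v):
    return l2_normalize(W_v @ v)

def embed_sentence(s):
    return l2_normalize(W_s @ s)

def similarity(v, s):
    # Cosine similarity in the joint space (both vectors unit-normalized).
    return float(embed_video(v) @ embed_sentence(s))

v = rng.standard_normal(2048)   # a pooled video feature
s = rng.standard_normal(1024)   # a pooled sentence feature
score = similarity(v, s)
assert -1.0 <= score <= 1.0
```

Training would then push `similarity(v, s)` above `similarity(v, s')` for every non-matching sentence `s'`, which is exactly what the ranking loss in Section III-F enforces.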

As illustrated in the overview of the proposed framework in Fig. 1, there are two networks for embedding videos: the global and the sequential visual networks. These networks have counterparts that embed the sentence. Thus, we develop multiple visual and textual networks that are responsible for embedding a video and a sentence, respectively. We form one embedding space by merging two networks (one visual and one textual) so that we can align a visual feature to textual information. Specifically, the global visual network aligns the average visual information of the video to the sentence; likewise, the sequential visual network aligns the temporal context to the sentence. Consequently, the networks receive a video and a sentence as inputs and map them into the joint embedding spaces.

Fig. 1: Overview of the proposed framework and its embedding functions.

We propose to fuse similarities measured in the embedding spaces to produce the final similarity. One embedding space has the global visual network extracting a visual object feature from the video. Thus, this embedding space describes a relationship between global visual appearances and textual features. The other embedding space captures the relationship between sequential appearances and textual features. The similarity scores of the two embedding spaces are then combined using the weighted sum to put more emphasis on one embedding space than the other space.

Our motivation is twofold. Firstly, since videos and sentences require different attention, we aim to develop a more robust similarity measurement by combining multiple embedding spaces in a weighted manner instead of using a hard-coded average. Secondly, in order to highlight spatial and temporal information of videos according to a sentence, we need to develop a mechanism that utilizes textual information to emphasize spatial and temporal features.

III-B Textual Embedding Network

We decompose the sentence into a variable-length sequence of tokens. Then, we transform each token into a 300-dimensional vector representation using the pre-trained GloVe model [27]. The number of tokens depends on the sentence; therefore, in order to obtain a fixed-length meaningful representation, we encode the GloVe vectors using a Gated Recurrent Unit (GRU) [3], whose hidden state yields a fixed-length sentence vector. This embedded vector goes to four processing modules: the global and sequential visual networks, the spatial attention mechanism, and the similarity aggregation. We further transform the vector in each processing module.
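The token-to-sentence encoding can be sketched with a bare-bones GRU cell in NumPy. The weights here are random and untrained, and the hidden size of 1024 is an illustrative assumption; the point is only the mechanics of folding a variable-length sequence of 300-d word vectors into one fixed-length vector.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MinimalGRU:
    """A single GRU cell in NumPy (illustrative, untrained weights)."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        r = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(hidden_dim)
        # One stacked weight matrix per gate, acting on [x | h].
        self.Wz = r.uniform(-scale, scale, (hidden_dim, input_dim + hidden_dim))
        self.Wr = r.uniform(-scale, scale, (hidden_dim, input_dim + hidden_dim))
        self.Wh = r.uniform(-scale, scale, (hidden_dim, input_dim + hidden_dim))
        self.hidden_dim = hidden_dim

    def encode(self, tokens):
        # tokens: (seq_len, input_dim) word vectors, e.g. GloVe embeddings.
        h = np.zeros(self.hidden_dim)
        for x in tokens:
            xh = np.concatenate([x, h])
            z = sigmoid(self.Wz @ xh)                     # update gate
            r_g = sigmoid(self.Wr @ xh)                   # reset gate
            h_tilde = np.tanh(self.Wh @ np.concatenate([x, r_g * h]))
            h = (1 - z) * h + z * h_tilde
        return h  # fixed-length sentence representation

gru = MinimalGRU(input_dim=300, hidden_dim=1024)
sentence = rng.standard_normal((7, 300))  # 7 tokens of 300-d GloVe-like vectors
vec = gru.encode(sentence)
assert vec.shape == (1024,)
```

Whatever the sentence length, `encode` always returns one 1024-d vector, which is what the downstream modules consume.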

III-C Global Visual Network

The global visual network aims to obtain a vector representation of the general visual content over the video frames. We divide the input video frames into chunks, and one frame is randomly sampled from each chunk. We extract visual features from the sampled frames using ResNet-152 pre-trained on the ImageNet dataset [14]. Specifically, we resize the sampled frames, and the ResNet encodes them into 2048-dimensional vectors; note that we take the vectors directly from the last fully connected layer of the ResNet. Subsequently, we apply average-pooling to merge the vectors. Consequently, we obtain a single feature vector containing the global visual feature of the video.
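The chunked sampling and average-pooling step can be sketched as follows. The CNN itself is outside this sketch: the 2048-d features stand in for per-frame ResNet-152 outputs, and the chunk count of 10 is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_one_frame_per_chunk(frames, n_chunks, rng):
    """Split frames into n_chunks contiguous chunks; sample one frame from each."""
    chunks = np.array_split(np.arange(len(frames)), n_chunks)
    idx = [rng.choice(c) for c in chunks]
    return frames[idx]

# 120 frames, each already encoded into a 2048-d ResNet-style feature.
features = rng.standard_normal((120, 2048))
sampled = sample_one_frame_per_chunk(features, n_chunks=10, rng=rng)
global_feature = sampled.mean(axis=0)   # average-pool into one vector
assert sampled.shape == (10, 2048)
assert global_feature.shape == (2048,)
```

Sampling one frame per contiguous chunk keeps coarse temporal coverage cheap, while the mean collapses the samples into the single global feature the network embeds.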

We learn a joint embedding space of the global visual and textual networks. As defined in Eqs. (1) and (2), we use embedding functions with learnable weight matrices and bias vectors to project the global visual vector and the sentence vector into a common-dimensional joint space.

We use the cosine similarity to measure the similarity between the video and sentence in the joint embedding space: sim(x, y) = (x · y) / (‖x‖‖y‖), where x and y are the embedded video and sentence vectors.
III-D Sequential Visual Network

Similar to the global visual network, we divide the input video frames into chunks and take the first frame of each chunk as the input of the sequential visual network. Then, we use ResNet-152 to extract a feature map from each frame at its last convolutional layer; the map contains visual features for the spatial regions of the frame. Because the spatial regions that deserve attention may change with the sentence, we need to explore the relationship between spatial and textual features. Therefore, we incorporate a spatial attention mechanism into the sequential visual network and apply it to the feature maps to emphasize spatial regions. Finally, we capture the sequential information of the video using a single-layer Long Short-Term Memory (LSTM) [16]; its final hidden state is the vector embedded by the sequential visual network.

The sequential visual network uses the LSTM to capture a meaningful sequential representation of the video. However, the LSTM transforms all spatial details of a video into a flat representation, losing the spatial reasoning with the sentence. Therefore, we employ the spatial attention mechanism in order to obtain a spatial relationship between video and sentence. Inspired by the work [19], we develop the spatial attention mechanism to learn which regions in a frame to attend to for each word in the sentence.

Fig. 2: The spatial attention mechanism used in the sequential visual network.

We illustrate the spatial attention mechanism in Fig. 2. The mechanism associates the region-wise visual features with the textual vector to produce a spatial attention map (Eq. (5)), and then combines the attention map with the visual features, using an element-wise product, to produce the spatially attended feature (Eq. (6)). Eq. (4) defines the embedding function of the sequential visual network, in which the LSTM processes the attended features. The intermediate outputs are 512-dimensional vectors, and the mechanism has learnable projection parameters.
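A sentence-conditioned spatial attention step can be sketched like this. The region count (a 7×7 grid, R = 49), the projection sizes, and the random weights are illustrative assumptions; the mechanics are: score every region against the sentence, softmax the scores into an attention map, and pool the regions with those weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One frame gives R region features (e.g. a 7x7 ResNet grid -> R = 49),
# each d_v-dimensional; s is the sentence vector. Dimensions are illustrative.
R, d_v, d_s, d_a = 49, 2048, 1024, 512
V = rng.standard_normal((R, d_v))      # region features of one frame
s = rng.standard_normal(d_s)           # sentence embedding

W_v = rng.standard_normal((d_a, d_v)) * 0.01
W_s = rng.standard_normal((d_a, d_s)) * 0.01
w_a = rng.standard_normal(d_a) * 0.01

# Score each region against the sentence, then normalize into an attention map.
scores = np.tanh(V @ W_v.T + W_s @ s) @ w_a   # (R,) one score per region
alpha = softmax(scores)                        # spatial attention map
attended = alpha @ V                           # sentence-conditioned frame feature

assert np.isclose(alpha.sum(), 1.0)
assert attended.shape == (d_v,)
```

The attended feature of each sampled frame would then be fed to the LSTM, so the sequence the LSTM sees is already weighted toward the regions the sentence cares about.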


The joint embedding space of the sequential visual and textual networks is formed by the two embedding functions, and we measure the similarity in this joint embedding space using additional learnable projection parameters.
III-E Similarity Aggregation

There are many approaches to aggregating similarities. Averaging is the most straightforward, and it works well in some cases. However, averaging may cause unexpected behavior when an inaccurate similarity is considerably high or low. Therefore, we adopt a Mixture-of-Experts fusion strategy [18] that aggregates similarities with weights that change according to the input sentence. Consequently, we can emphasize one embedding space through the weights used to merge the multiple similarities.

We propose to aggregate the similarities measured in the multiple embedding spaces to produce a final similarity that reflects various understandings of videos and sentences. We illustrate the proposed similarity aggregation in Fig. 3. Specifically, we concatenate the similarities and merge them using weights generated from the textual feature by a learnable projection.

Fig. 3: Similarity aggregation

III-F Optimization

As described in Section III-A, we optimize the proposed architecture by enforcing that the similarity of a video and its counterpart sentence is greater than the similarities of the video with other sentences, and vice versa. We achieve this by using the triplet ranking loss [22, 2, 9], which penalizes violations of this ordering by at least a margin.


Given a dataset of video-sentence pairs, we optimize the loss summed over all pairs by stochastic gradient descent [31, 46].


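A batch-wise triplet ranking loss of the kind described above can be sketched in NumPy. This is the plain (non-hard-negative) variant with a hypothetical margin of 0.2; `S[i, j]` holds the similarity between video i and sentence j, with the diagonal being the matching pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

def triplet_ranking_loss(S, margin=0.2):
    """S[i, j] = similarity(video_i, sentence_j); diagonal pairs match.

    Sums hinge penalties over all wrong sentences for each video and
    all wrong videos for each sentence.
    """
    n = S.shape[0]
    pos = np.diag(S)                                      # matching-pair scores
    cost_s = np.maximum(0.0, margin + S - pos[:, None])   # wrong sentences
    cost_v = np.maximum(0.0, margin + S - pos[None, :])   # wrong videos
    mask = 1.0 - np.eye(n)                                # exclude the positives
    return float(((cost_s + cost_v) * mask).sum() / n)

S = rng.uniform(-1, 1, (8, 8))
loss = triplet_ranking_loss(S)
assert loss >= 0.0

# A matrix whose positives beat every negative by more than the margin
# incurs zero loss.
perfect = -np.ones((4, 4)) + 2 * np.eye(4)
assert triplet_ranking_loss(perfect, margin=0.2) == 0.0
```

Stochastic gradient descent over mini-batches of such matrices is what drives matching pairs together and non-matching pairs apart in each joint space.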
IV Extendability

The proposed architecture is extendable with new embedding networks, and the steps of the extension are straightforward: we build visual and textual networks and then merge them to form a joint embedding space. In this paper, we add embedding networks using the Inflated 3D Convolutional Neural Network (I3D) model [1] so that the networks can capture video activities. We utilize the pre-trained RGB-I3D model to extract embedding vectors from 16 consecutive frames of video. Consequently, a 1024-dimensional embedding vector is produced for each video.

The joint space is learnt from the I3D vector and the textual embedding vector, transformed with additional learnable parameters, and the similarity is measured in this joint embedding space. The similarity aggregation is also straightforward to extend: we concatenate all similarities and merge them with a weight vector whose length equals the number of embedding spaces, which is three in this case.


We stress that this extendability is vital, since it enables us to quickly incorporate other feature-extraction models into the framework. There are abundant approaches to extracting features from videos [28, 39, 38, 33, 8, 45, 30, 24, 4, 35], and various aspects and approaches are necessary for understanding videos and sentences.

V Experiments

We carried out the sentence-to-video retrieval task on a benchmark dataset to evaluate the proposed method. The task is to retrieve the video associated with a query sentence from the test videos. We calculated the similarities between the query sentence and all test videos, and then ranked the videos by similarity in descending order.

We report the experimental results using rank-based performance metrics, i.e., Recall@k and median rank. Recall@k is the percentage of queries for which the correct video appears in the top-k retrieved results; we report k = 1, 5, and 10. Median rank is the median rank of the ground-truth videos in the retrieved ranking. For Recall@k, a bigger value indicates better performance; for median rank, a lower value means the retrieved results are closer to the ground-truth items, i.e., better retrieval performance.
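Both metrics can be computed from a similarity matrix in a few lines. The sketch below assumes, for illustration, one matching video per query sentence with matching indices on the diagonal.

```python
import numpy as np

def retrieval_metrics(S, ks=(1, 5, 10)):
    """S[i, j] = similarity(sentence_i, video_j); video i matches sentence i.

    Returns Recall@k (in percent) and the median rank of the correct video.
    """
    n = S.shape[0]
    # Rank of the ground-truth video for each query sentence (1 = best).
    order = np.argsort(-S, axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])
    recalls = {k: 100.0 * np.mean(ranks <= k) for k in ks}
    return recalls, float(np.median(ranks))

# Toy example: the correct video is always the most similar one.
S = np.eye(6)
recalls, medr = retrieval_metrics(S)
assert recalls[1] == 100.0
assert medr == 1.0
```

With a real model, `S` would hold the aggregated similarities between every test sentence and every test video, and the same function yields the R@1/R@5/R@10 and MR numbers reported in the tables.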

Following the convention in sentence-to-video retrieval, we used the Microsoft Research Video to Text dataset (MSR-VTT) [42], which is a large-scale video benchmark for video understanding. The MSR-VTT contains 10000 video clips from YouTube with 41.2 hours in total. The dataset provides videos in 20 categories, e.g., music, people, and sport. Each video is associated with 20 different description sentences. The dataset consists of 6513 videos for training, 497 videos for validation, and 2990 videos for testing.

We evaluated four variants of the proposed method: the single-space, sequential dual-space, I3D dual-space, and triple-space models. Firstly, the single-space model is the proposed framework composed of only the global visual and textual embedding networks; these two networks form a single embedding space, in which we measure the final similarity. Secondly, the sequential dual-space model (dual-S) uses two embedding spaces learned by the global and sequential visual networks together with their textual counterparts; we measure the final similarity by merging the two similarities as described in Eq. (11). Thirdly, the I3D dual-space model (dual-I) has the global visual and I3D embedding networks. Lastly, the triple-space model adds the I3D and textual embedding networks to the dual-S model; we produce the final similarity by merging the three similarities with the similarity aggregation.

V-A Sentence-to-Video Retrieval Results

We summarize the results of the sentence-to-video retrieval task on the MSR-VTT dataset in Table I, comparing the proposed method to the existing methods [22, 7, 25, 5]. The proposed method obtained 7.1 (dual-S) at R@1, 21.2 (triple) at R@5, 32.4 (triple) at R@10, and 29 (triple) at MR. These results are competitive with [25, 5], which are the state of the art.

Method R@1 R@5 R@10 MR
VSE [22] 5.0 16.4 24.6 47
VSE++ [7] 5.7 17.1 24.8 65
Multi-Cues [25] 7.0 20.9 29.7 38
Cat-Feats [5] 7.7 22.0 31.8 32
Ours (single) 5.6 18.4 28.3 41
Ours (dual-S) 7.1 19.8 31.0 30
Ours (triple) 6.7 21.2 32.4 29
TABLE I: The results of the sentence-to-video retrieval task on the MSR-VTT dataset. The bold and underlined results represent the first- and second-best, respectively.

VSE and VSE++ adopt a triplet ranking loss, and VSE++ incorporates hard-negative samples into the loss to facilitate effective training [29]; we adopted this strategy. The results show that VSE++ performs significantly better than VSE at every R@k. Although the single-space model and VSE++ adopt similar loss functions, the single-space model shows only slight improvements in performance. However, the dual-space model achieves much better performance than VSE++. These results demonstrate the importance of using the sequential visual information of videos for learning an embedding space.

Multi-Cues [25] calculates two similarities in separate embedding spaces and then averages them to produce a final similarity. The proposed method achieves higher performance than Multi-Cues, and the similarity aggregation strategy is the main difference between the two. Note that the average struggles to align videos and sentences because of their variety: some videos need global visual features, and some need sequential visual features. The proposed method has a flexible strategy for merging similarities, and the experimental results show that this strategy is more useful for measuring similarity than a naive merging approach that gives equal importance to each embedding space, such as the average.

Cat-Feats [5] embeds videos and sentences by concatenating feature vectors extracted by three embedding modules: a CNN, a bi-directional GRU, and max pooling. Cat-Feats is slightly better than the proposed method at R@1 and R@5 (e.g., 7.7 vs. 7.1 at R@1 for Cat-Feats and dual-S, respectively), whereas the proposed method (triple-space) outperforms Cat-Feats at R@10 and median rank (e.g., 32 vs. 29 at median rank for Cat-Feats and the triple-space model, respectively). These results imply that the proposed method and Cat-Feats can complement each other; there is a possibility of improving performance by incorporating the feature-concatenation mechanism of Cat-Feats into the proposed method.

Finally, the proposed method with the triple space achieves better results than the single- and dual-space models at three metrics: R@5, R@10, and median rank. The results show that integrating multiple similarities leads to better, more reliable retrieval performance.

VI Ablation Study

We carried out ablation studies to evaluate the components of the proposed method. We conducted the following three experiments.

VI-A Embedding Spaces

We changed the number of embedding spaces and developed four variants of the proposed architecture. The experimental results are shown in Table II. There are clear improvements from a single space to multiple spaces at all metrics; therefore, we can verify the effectiveness of integrating multiple spaces. Comparing the two dual models, dual-S is better than dual-I at R@1 and R@10, whereas dual-I is superior at R@5; thus, dual-S and dual-I can complement each other. The triple-space model contains the embedding spaces of both duals and outperforms the single and dual models at R@5, R@10, and median rank. Therefore, we can confirm the effectiveness of combining multiple embedding spaces, which is the key insight of the proposed method for video and sentence understanding.

Model Global Sequential I3D R@1 R@5 R@10 MR
single ✓ – – 5.6 18.4 28.3 41
dual-S ✓ ✓ – 7.1 19.8 31.0 30
dual-I ✓ – ✓ 6.8 20.2 30.6 30
triple ✓ ✓ ✓ 6.7 21.2 32.4 29
TABLE II: Evaluation of combinations of embedding spaces

VI-B Spatial Attention Mechanism

We conducted experiments using dual-S with and without the spatial attention mechanism. Without the attention, dual-S encodes each frame with the ResNet into a 2048-dimensional vector, and the LSTM processes the sequence of vectors. Table III shows the results. Dual-S with attention achieves better results than without attention at R@1, R@10, and median rank, and ties at R@5. Therefore, we can confirm that the proposed spatial attention mechanism improves performance significantly.

Moreover, we observed that the performance of dual-S with attention is almost competitive with dual-I, whereas dual-S without attention is worse than dual-I. Considering that the I3D model extracts useful spatial and temporal features for action recognition, dual-S without attention could not extract sequential features effectively, whereas dual-S with attention did. We stress that this is further supporting evidence of the effectiveness of the proposed attention mechanism.

R@1 R@5 R@10 MR
w/o attention 5.8 19.8 28.6 34
w/ attention 7.1 19.8 31.0 30
TABLE III: Effectiveness of the spatial attention

VI-C Similarity Aggregation

We investigated the impact of the similarity aggregation using either the average or the proposed weighted sum, with the dual-S and dual-I models. The experimental results are shown in Table IV. The weighted sum brings improvements at R@1, R@10, and MR for both dual-S and dual-I. Therefore, we confirmed the effectiveness of the proposed similarity aggregation module.

Model Aggregation R@1 R@5 R@10 MR
dual-S average 6.6 19.9 29.9 31
dual-S weighted 7.1 19.9 31.0 30
dual-I average 6.7 20.2 30.4 30
dual-I weighted 6.8 20.2 30.6 30
TABLE IV: Performance comparison of similarity aggregation using the average or the proposed weighted sum

Furthermore, we analyzed the weights in the similarity aggregation. As described in Eq. (12), the weights are flexibly determined according to the given sentence; in other words, each weight represents the importance of an embedding space. We sought further understanding by analyzing the weights. For simplicity, we used the dual-S model in this analysis, so the analysis concerns the relative importance of global and sequential visual features. We summarize the statistics of the weights in Table V. The statistics show that the proposed method tends to assign larger weights to the global features.

Average Min Max
Global 0.52 0.399 0.61
Sequential 0.48 0.393 0.60
TABLE V: Statistics of the weights in similarity aggregation for the global and sequential embedding spaces

We show the cumulative step histogram of the global weight in Fig. 4. The ratio reaches 0.75 at a weight of 0.5; thus, 75% of the instances received the larger weight for the global feature, while the sequential feature received the larger weight only for the remaining 25%. Therefore, the global feature is relied on more heavily than the sequential feature, and dual-S aggressively uses global features.

Figure 5 shows examples of videos and sentences with the weight assigned to the global feature. Videos containing clear scenes tend to have larger weights on the global visual feature, since objects in those videos are relatively easy to detect. In contrast, videos containing unclear scenes tend to assign larger weights to the sequential visual feature.

Fig. 4: Cumulative step histogram of the weights of the global visual features used in similarity aggregation
Fig. 5: Examples of videos and sentences with the weight assigned to the global feature. The numbers and the sentences represent the weight for the global feature and the query sentences, respectively.

VII Conclusion

In this paper, we presented a novel framework for embedding videos and sentences into multiple embedding spaces. The proposed method uses distinct embedding networks to capture various relationships between visual and textual features, such as global appearance, sequential visual, and action features. We produce the final similarity between a video and a sentence by merging the similarities measured in the embedding spaces in a weighted-sum manner. The proposed method can flexibly determine the weights according to a given sentence; hence, the final similarity can incorporate the essential relationship between video and sentence.

We carried out sentence-to-video retrieval experiments on the MSR-VTT dataset and demonstrated that the proposed framework significantly improves performance as the number of embedding spaces increases. The proposed method achieved competitive results compared to the state-of-the-art methods [25, 5]. Furthermore, we verified all the critical components of the proposed method through the ablation experiments: even though the components are individually useful, their cooperation generates significant improvements.


This study was partially supported by JSPS KAKENHI Grant Number 18K19772, 19K11848, and Yotta Informatics Project by MEXT, Japan.


  • [1] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In

    IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    pp. 4724–4733. Cited by: §IV.
  • [2] G. Chechik, V. Sharma, U. Shalit, and S. Bengio (2010)

    Large scale online learning of image similarity through ranking

    J. Mach. Learn. Res. 11, pp. 1109–1135. Cited by: §III-F.
  • [3] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. In

    NIPS 2014 Workshop on Deep Learning, December 2014

    Cited by: §III-B.
  • [4] N. Dalal, B. Triggs, and C. Schmid (2006) Human detection using oriented histograms of flow and appearance. In European Conference on Computer Vision, pp. 428–441. Cited by: §IV.
  • [5] J. Dong, X. Li, C. Xu, S. Ji, Y. He, G. Yang, and X. Wang (2019) Dual encoding for zero-example video retrieval. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9338–9347. Cited by: §V-A, §V-A, TABLE I, §VII.
  • [6] M. Engilberge, L. Chevallier, P. Perez, and M. Cord (2018) Finding beans in burgers: deep semantic-visual embedding with localization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3984–3993. Cited by: §I.
  • [7] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler (2018) VSE++: improving visual-semantic embeddings with hard negatives. In British Machine Vision Conference (BMVC), pp. 12. Cited by: §II, §V-A, TABLE I.
  • [8] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019) SlowFast networks for video recognition. In IEEE/CVF International Conference on Computer Vision, Vol. , pp. 6201–6210. Cited by: §IV.
  • [9] A. Frome, Y. Singer, F. Sha, and J. Malik (2007) Learning globally-consistent local distance functions for shape-based image retrieval and classification. In IEEE International Conference on Computer Vision, pp. 1–8. Cited by: §III-F.
  • [10] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) DeViSE: a deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pp. 2121–2129. Cited by: §I, §II, §II.
  • [11] J. Gao, C. Sun, Z. Yang, and R. Nevatia (2017) TALL: temporal activity localization via language query. In IEEE International Conference on Computer Vision (ICCV), pp. 5277–5285. Cited by: §I.
  • [12] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik (2014) A multi-view embedding space for modeling internet images, tags, and their semantics. Int. J. Comput. Vision 106 (2), pp. 210–233. Cited by: §II.
  • [13] J. Gu, J. Cai, S. Joty, L. Niu, and G. Wang (2018) Look, imagine and match: improving textual-visual cross-modal retrieval with generative models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7181–7189. Cited by: §II.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §III-C.
  • [15] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell (2017)

    Localizing moments in video with natural language

    In IEEE International Conference on Computer Vision (ICCV), pp. 5804–5813. Cited by: §I.
  • [16] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. Cited by: §III-D.
  • [17] Y. Huang, Q. Wu, W. Wang, and L. Wang (2020) Image and sentence matching via semantic concepts and order learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (3), pp. 636–650. Cited by: §II.
  • [18] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton (1991) Adaptive mixtures of local experts. Neural Computation 3 (1), pp. 79–87. Cited by: §III-E.
  • [19] Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim (2017) TGIF-QA: toward spatio-temporal reasoning in visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1359–1367. Cited by: §II, §III-D.
  • [20] Z. Ji, H. Wang, J. Han, and Y. Pang (2019) Saliency-guided attention network for image-sentence matching. In IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5753–5762. Cited by: §II.
  • [21] A. Karpathy and L. Fei-Fei (2017) Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 664–676. Cited by: §II.
  • [22] R. Kiros, R. Salakhutdinov, and R. Zemel (2014) Multimodal neural language models. In International Conference on Machine Learning, Vol. 32, pp. 595–603. Cited by: §III-F, §V-A, TABLE I.
  • [23] R. Kiros, R. Salakhutdinov, and R. S. Zemel (2014) Unifying visual-semantic embeddings with multimodal neural language models. ArXiv abs/1411.2539. Cited by: §II.
  • [24] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld (2008) Learning realistic human actions from movies. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §IV.
  • [25] N. C. Mithun, J. Li, F. Metze, and A. K. Roy-Chowdhury (2018) Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In ACM on International Conference on Multimedia Retrieval, pp. 19–27. Cited by: §II, §V-A, §V-A, TABLE I, §VII.
  • [26] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya (2016) Learning joint representations of videos and sentences with web image search. In European Conference on Computer Vision (ECCV) Workshops, pp. 651–667. Cited by: §I, §II.
  • [27] J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §III-B.
  • [28] Z. Qiu, T. Yao, and T. Mei (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In IEEE International Conference on Computer Vision, pp. 5534–5542. Cited by: §IV.
  • [29] A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769. Cited by: §V-A.
  • [30] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In International Conference on Neural Information Processing Systems, pp. 568–576. Cited by: §IV.
  • [31] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng (2014) Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics 2, pp. 207–218. Cited by: §III-F.
  • [32] Y. Song and M. Soleymani (2019) Polysemous visual-semantic embedding for cross-modal retrieval. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1979–1988. Cited by: §II.
  • [33] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In IEEE International Conference on Computer Vision, pp. 4489–4497. Cited by: §IV.
  • [34] Y. H. Tsai, S. Divvala, L. Morency, R. Salakhutdinov, and A. Farhadi (2019) Video relationship reasoning using gated spatio-temporal energy graph. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10416–10425. Cited by: §II.
  • [35] H. Wang and C. Schmid (2013) Action recognition with improved trajectories. In IEEE International Conference on Computer Vision, pp. 3551–3558. Cited by: §IV.
  • [36] K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang (2016) A comprehensive survey on cross-modal retrieval. ArXiv abs/1607.06215. Cited by: §II.
  • [37] L. Wang, Y. Li, and S. Lazebnik (2016) Learning deep structure-preserving image-text embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5005–5013. Cited by: §II.
  • [38] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In European Conference on Computer Vision, pp. 20–36. Cited by: §IV.
  • [39] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §IV.
  • [40] H. Wu, J. Mao, Y. Zhang, Y. Jiang, L. Li, W. Sun, and W. Ma (2019) Unified visual-semantic embeddings: bridging vision and language with structured meaning representations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6602–6611. Cited by: §II.
  • [41] H. Xu, K. He, B. A. Plummer, L. Sigal, S. Sclaroff, and K. Saenko (2019) Multilevel language and vision integration for text-to-clip retrieval.. In AAAI, pp. 9062–9069. Cited by: §I.
  • [42] J. Xu, T. Mei, T. Yao, and Y. Rui (2016) MSR-VTT: a large video description dataset for bridging video and language. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5288–5296. Cited by: §V.
  • [43] K. Ye and A. Kovashka (2018) ADVISE: symbolism and external knowledge for decoding advertisements. In European Conference on Computer Vision (ECCV). Cited by: §II.
  • [44] D. Zhang, X. Dai, X. Wang, Y. Wang, and L. S. Davis (2019) MAN: moment alignment network for natural language moment retrieval via iterative graph adjustment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1247–1257. Cited by: §II.
  • [45] S. Zhang, S. Guo, W. Huang, M. R. Scott, and L. Wang (2020) V4D: 4d convolutional neural networks for video-level representation learning. In International Conference on Learning Representations. Cited by: §IV.
  • [46] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In IEEE International Conference on Computer Vision, pp. 19–27. Cited by: §III-F.