Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos

06/06/2019 ∙ by Zhu Zhang, et al. ∙ Zhejiang University

Query-based moment retrieval aims to localize the most relevant moment in an untrimmed video according to a given natural language query. Existing works often focus on only one aspect of this emerging task, such as query representation learning, video context modeling or multi-modal fusion, and thus fail to develop a comprehensive system for further performance improvement. In this paper, we introduce a novel Cross-Modal Interaction Network (CMIN) that considers multiple crucial factors for this challenging task, including (1) the syntactic structure of natural language queries; (2) long-range semantic dependencies in video context; and (3) sufficient cross-modal interaction. Specifically, we devise a syntactic GCN to leverage the syntactic structure of queries for fine-grained representation learning, propose a multi-head self-attention mechanism to capture long-range semantic dependencies from video context, and then employ a multi-stage cross-modal interaction to explore the potential relations of video and query contents. Extensive experiments demonstrate the effectiveness of our proposed method.


1. Introduction

Multimedia information retrieval is an important topic in information retrieval systems. Recently, query-based video retrieval (Lin et al., 2014; Xu et al., 2015; Otani et al., 2016) has been well studied, which searches for the most relevant video in a large collection according to a given natural language query. However, in practical applications, untrimmed videos often contain multiple complex events that evolve over time, where a large part of the video content is irrelevant to the query and only a small clip satisfies the query description. Thus, as a natural extension, query-based moment retrieval aims to automatically localize the start and end boundaries of the target moment semantically corresponding to the given query within an untrimmed video. Different from retrieving an entire video, moment retrieval offers more fine-grained temporal localization in a long video, which avoids manually searching for the moment of interest.

However, localizing the precise moment in a continuous, complicated video is more challenging than simply selecting a video from pre-defined candidate sets. As shown in Figure 1, the query  “A man throws the ball and hits the boy in the face before landing in a cup” describes two successive actions, corresponding to complex object interactions within the video. Hence, the accurate retrieval of the target moment requires sufficient understanding of both video and query contents by cross-modal interactions.

Most existing moment retrieval works (Gao et al., 2017; Hendricks et al., 2018, 2017; Liu et al., 2018a, b; Chen et al., 2018; Xu et al., 2019) only focus on one aspect of this emerging task, such as query representation learning (Liu et al., 2018b), video context modeling (Gao et al., 2017; Liu et al., 2018a) or cross-modal fusion (Chen et al., 2018; Xu et al., 2019), and thus fail to develop a comprehensive system to further improve the performance of query-based moment retrieval. In this paper, we consider multiple crucial factors for high-quality moment retrieval.

Firstly, the query description often contains causal temporal actions, so it is fundamental and crucial to learn fine-grained query representations. Existing works generally adopt widely-used recurrent neural networks, such as GRU networks, to model natural language queries. However, these approaches ignore the syntactic structure of queries. As shown in Figure 1, the syntactic relations imply dependencies between word pairs, which are helpful for query semantic understanding. Recently, graph convolution networks (GCN) have been proposed to model graph structures (Kipf and Welling, 2017; Velickovic et al., 2018), including the visual relationship graph (Yao et al., 2018) and the syntactic dependency graph (Marcheggiani and Titov, 2017). Inspired by these works, we develop a syntactic GCN to exploit the syntactic structure of queries. Concretely, we first build the syntactic dependency graph as shown in Figure 1, and then pass information along the dependency edges to learn syntactic-aware query representations. In particular, we consider the direction and label of each dependency edge to adequately incorporate syntactic clues.

Secondly, the target moments of complex queries generally contain object interactions over a long time interval, so there exist long-range semantic dependencies in video context. That is, each frame is not only relevant to adjacent frames, but also associated with distant ones. Existing approaches often apply RNN-based temporal modeling (Chen et al., 2018), or R-C3D networks that learn spatio-temporal representations from raw video streams (Xu et al., 2019). Although these methods are able to absorb contextual information for each frame, they still fail to build direct interactions between distant frames. To eliminate these local restrictions, we propose a multi-head self-attention mechanism (Vaswani et al., 2017) to capture long-range semantic dependencies from video context. The self-attention method can build frame-to-frame interactions at arbitrary positions, and the multi-head setting ensures a sufficient understanding of complicated dependencies.

Thirdly, query-based moment retrieval requires comprehensive reasoning over video and query contents, so cross-modal interaction is necessary for high-quality retrieval. Early approaches (Gao et al., 2017; Hendricks et al., 2017, 2018) ignore this factor and simply combine the query and moment features for correlation estimation. Although recent methods (Liu et al., 2018a, b; Chen et al., 2018; Xu et al., 2019) have developed cross-modal interaction with widely-used attention mechanisms, they still remain at a rough one-stage interaction, for example, highlighting the crucial context information of moments under the guidance of queries (Liu et al., 2018a). Different from previous works, we adopt a multi-stage cross-modal interaction method to further exploit the potential relations of video and query contents. Specifically, we first adopt a normal attention method to aggregate syntactic-aware query representations for each frame, then apply a cross gate (Feng et al., 2018) to emphasize crucial contents and weaken inessential parts, and finally develop a low-rank bilinear fusion to learn cross-modal semantic representations.

In summary, the key contributions of this work are four-fold:

  • We design a novel cross-modal interaction network for query-based moment retrieval, which is a comprehensive system that considers multiple crucial factors of this challenging task: (1) the syntactic structure of natural language queries; (2) long-range semantic dependencies in video context; and (3) sufficient cross-modal interaction.

  • We propose the syntactic GCN to leverage the syntactic structure of queries for fine-grained representation learning, and adopt a multi-head self-attention method to capture long-range semantic dependencies from video context.

  • We employ a multi-stage cross-modal interaction to further exploit the potential relations of video and query contents, where an attentive aggregation method extracts relevant syntactic-aware query representations for each frame, a cross gate emphasizes crucial contents and a low-rank bilinear fusion method learns cross-modal semantic representations.

  • The proposed CMIN method achieves state-of-the-art performance on the ActivityCaption and TACoS datasets.

The rest of this paper is organized as follows. We briefly review some related works in Section 2. In Section 3, we introduce our proposed method. We then present a variety of experimental results in Section 4. Finally, Section 5 concludes this paper.

2. Related Work

In this section, we briefly review some related works on image/video retrieval, temporal action localization and query-based moment retrieval.

2.1. Image/Video Retrieval

Given a set of candidate images/videos and a natural language query, image/video retrieval aims to select the image/video that matches the query. Karpathy et al. (Karpathy and Fei-Fei, 2015) propose a deep visual-semantic alignment (DVSA) model for image retrieval, which uses a BiLSTM to encode query features and an R-CNN detector (Girshick et al., 2014) to extract object representations. Sun et al. (Sun et al., 2015) devise an automatic visual concept discovery algorithm to boost the performance of image retrieval. Moreover, Hu et al. (Hu et al., 2016) and Mao et al. (Mao et al., 2016) regard this problem as natural language object retrieval. As for video retrieval, some methods (Otani et al., 2016; Xu et al., 2015) incorporate deep video-language embeddings to boost retrieval performance, similar to the image-language embedding approach (Socher et al., 2014). And Lin et al. (Lin et al., 2014) first parse the query descriptions into a semantic graph and then match them to visual concepts in videos. Different from these works, query-based moment retrieval aims to localize a moment within an untrimmed video, which is more challenging than simply selecting a video from pre-defined candidate sets.

Figure 2. The Framework of Cross-Modal Interaction Networks for Query-Based Moment Retrieval. (a) The syntactic GCN module leverages the syntactic structure to learn syntactic-aware query representations. (b) The multi-head self-attention module captures long-range semantic dependencies from context. (c) The multi-stage cross-modal interaction module explores the intrinsic relations between video and query. (d) The moment retrieval module localizes the boundaries of target moments.

2.2. Temporal Action Localization

Temporal action localization is a challenging task that localizes action instances in an untrimmed video. Shou et al. (Shou et al., 2016) develop three segment-based 3D ConvNets with a localization loss to explicitly explore the temporal overlap in videos. Singh et al. (Singh et al., 2016) propose a multi-stream bi-directional RNN with two additional streams on motion and appearance to achieve fine-grained action detection.

To leverage the context structure of actions, Zhao et al. (Zhao et al., 2017) devise a structured segment network to model the structure of action instances by a structured temporal pyramid. And Chao et al. (Chao et al., 2018) boost action localization performance by imitating the Faster R-CNN object detection framework (Ren et al., 2015). Although these works have achieved promising performance, they are still limited to a pre-defined list of actions, while query-based moment retrieval tackles this problem by introducing natural language queries.

2.3. Query-Based Moment Retrieval

Query-based moment retrieval is to detect the target moment depicting the given natural language query in an untrimmed video. Early works study this task in constrained settings, including fixed spatial prepositions (Tellex and Roy, 2009; Lin et al., 2014), instruction videos (Alayrac et al., 2016; Sener et al., 2015; Song et al., 2016) and ordering constraints (Bojanowski et al., 2015; Tapaswi et al., 2015). Recently, unconstrained query-based moment retrieval has attracted a lot of attention (Hendricks et al., 2017; Gao et al., 2017; Liu et al., 2018a; Hendricks et al., 2018; Liu et al., 2018b; Xu et al., 2019; Chen et al., 2018). These methods are mainly based on a sliding window framework, which first samples candidate moments and then ranks them. Hendricks et al. (Hendricks et al., 2017) propose a moment context network to integrate global and local video features for natural language retrieval, and the subsequent work (Hendricks et al., 2018) considers temporal language by explicitly modeling the context structure of videos. Gao et al. (Gao et al., 2017) develop a cross-modal temporal regression localizer to estimate the alignment scores of candidate moments and textual queries, and then adjust the boundaries of high-score moments. With the development of attention mechanisms in the field of vision-language interaction (Anderson et al., 2018; Zhao et al., 2018), Liu et al. (Liu et al., 2018a) devise a memory attention to emphasize the visual features and simultaneously utilize the context information. A similar attention strategy (Liu et al., 2018b) is designed to highlight the crucial parts of query contents. From a holistic view, Chen et al. (Chen et al., 2018) capture the evolving fine-grained frame-by-word interactions between video and query. Xu et al. (Xu et al., 2019) introduce a multi-level model to integrate visual and textual features earlier and further re-generate queries as an auxiliary task.

Unlike these previous methods, we propose a novel cross-modal interaction network to consider three critical factors for query-based moment retrieval, including the syntactic structure of natural language queries, long-range semantic dependencies in video context and the sufficient cross-modal interaction.

3. Cross-Modal Interaction Networks

As Figure 2 illustrates, our cross-modal interaction networks consist of four components: 1) the syntactic GCN module leverages the syntactic structure to enhance the query representation learning; 2) the multi-head self-attention module captures long-range semantic dependencies from video context; 3) the multi-stage cross-modal interaction module aggregates syntactic-aware query representations for each frame, emphasizes crucial contents and learns cross-modal semantic representations; 4) the moment retrieval module finally localizes the boundaries of target moments.

3.1. Problem Formulation

We present a video as a sequence of frames $v = \{v_t\}_{t=1}^{T}$, where $v_t$ is the feature of the $t$-th frame and $T$ is the number of frames in the video. Each video is associated with a natural language query, denoted by $q = \{w_j\}_{j=1}^{N}$, where $w_j$ is the feature of the $j$-th word and $N$ is the number of words in the query. The query description corresponds to a target moment in the untrimmed video, and we denote the start and end boundaries of the target moment by $(\tau_s, \tau_e)$. Thus, given the training set $\{(v, q, \tau_s, \tau_e)\}$, our goal is to learn the cross-modal interaction networks to predict the boundaries of the most relevant moment during inference.

3.2. Syntactic GCN Module

In this section, we introduce the syntactic GCN module based on a syntactic dependency graph. By passing information along the dependency edges between relevant words, we learn syntactic-aware query representations for subsequent cross-modal interactions.

We first extract word features for the query using pre-trained Glove word2vec embeddings (Pennington et al., 2014), denoted by $w = \{w_j\}_{j=1}^{N}$, where $w_j$ is the feature of the $j$-th word. After that, we develop a bi-directional GRU network (BiGRU) to learn the query semantic representations. The BiGRU network incorporates contextual information for each word by combining a forward and a backward GRU (Chung et al., 2014). Specifically, we input the sequence of word features to the BiGRU network, and obtain the contextual representation of each word, given by

$\overrightarrow{h}_j = \overrightarrow{\mathrm{GRU}}(w_j, \overrightarrow{h}_{j-1}), \quad \overleftarrow{h}_j = \overleftarrow{\mathrm{GRU}}(w_j, \overleftarrow{h}_{j+1}), \quad h_j = [\overrightarrow{h}_j ; \overleftarrow{h}_j]$  (1)

where $\overrightarrow{\mathrm{GRU}}$ and $\overleftarrow{\mathrm{GRU}}$ represent the forward and backward GRU networks, respectively. The contextual representation $h_j$ is the concatenation of the forward and backward hidden states at the $j$-th step. Thus, we get the query semantic representations $h = \{h_j\}_{j=1}^{N}$.

Although the BiGRU network encodes the temporal context of word sequences, it still ignores the syntactic information of natural language, which implies underlying dependencies between word pairs. So we then devise syntactic graph convolution networks to leverage the syntactic dependencies for better query understanding. We first build the syntactic dependency graph with an NLP toolkit, where each word is regarded as a node and each dependency relation is represented as a directed edge. Formally, we denote a query by a graph $(\mathcal{V}, \mathcal{E})$, where the node set $\mathcal{V}$ contains all words and the edge set $\mathcal{E}$ contains all directed syntactic dependencies of word pairs. Note that we add a self-loop for each node into the edge set. As the dependency relations have different types, the directed edges also correspond to different labels, where the self-loop is given a unique label. The original GCN regards the syntactic dependency graph as an undirected graph, denoted by

$h_i^{(1)} = \mathrm{ReLU}\Big(\sum_{j \in \mathcal{N}(i)} W h_j^{(0)} + b\Big)$  (2)

where $W$ is the transformation matrix, $b$ is the bias vector and $\mathrm{ReLU}$ is the rectified linear unit. The $\mathcal{N}(i)$ represents the set of nodes with a dependency edge to node $i$ or from node $i$ (including the self-loop). And $h_j^{(0)}$ is the original representation of node $j$ from the preceding modeling.

Figure 3. Three Directions of Information Passing.

Although the original GCN enhances the word semantic representation by aggregating the clues from its neighbors, it fails to leverage the direction and label information of edges. Thus, we consider a syntactic GCN to exploit the directional and labeled dependency edges between nodes, given by

$g_i = \mathrm{ReLU}\Big(\sum_{j \in \mathcal{N}(i)} W_{\mathrm{dir}(i,j)} h_j + b_{\mathrm{lab}(i,j)}\Big)$  (3)

where $\mathrm{dir}(i,j)$ indicates the direction of edge $(i,j)$: 1) a dependency edge from node $i$ to $j$; 2) a dependency edge from node $j$ to $i$; 3) a self-loop if $i = j$. Since there is no reason to assume the information transmits only along the syntactic dependency arcs (Marcheggiani and Titov, 2017), we also allow the information to transmit in the opposite direction of directed dependencies here. Figure 3 describes the three types of information passing, which correspond to three transformation matrices $W_1$, $W_2$ and $W_3$, respectively. On the other hand, $\mathrm{lab}(i,j)$ represents the label of edge $(i,j)$ and selects a distinct bias vector for each type of dependency. Next, we employ a residual connection (He et al., 2016) to keep the original representation of each node, given by

$\tilde{h}_i = g_i + h_i$  (4)

Furthermore, we stack a multi-layer syntactic GCN to adequately explore the syntactic structure as follows:

$h_i^{(l+1)} = \mathrm{ReLU}\Big(\sum_{j \in \mathcal{N}(i)} W_{\mathrm{dir}(i,j)}^{(l)} h_j^{(l)} + b_{\mathrm{lab}(i,j)}^{(l)}\Big) + h_i^{(l)}$  (5)

By the syntactic GCN with $L$ layers, we obtain the syntactic-aware query representations $\tilde{h} = \{\tilde{h}_j\}_{j=1}^{N}$.
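A single syntactic GCN layer with a residual connection can be sketched in NumPy as follows; the edge encoding, function and parameter names here are hypothetical illustrations (not the paper's implementation):

```python
import numpy as np

def syntactic_gcn_layer(H, edges, W_dir, b_lab):
    """One syntactic GCN layer with a residual connection (a sketch of
    Eqs. (3)-(4); names and edge encoding are assumptions).

    H     : (N, d) word representations from the BiGRU
    edges : list of (i, j, direction, label); direction 0 = arc i->j,
            1 = arc j->i, 2 = self-loop; label indexes the dependency type
    W_dir : (3, d, d) one transformation matrix per edge direction
    b_lab : (num_labels, d) one bias vector per dependency label
    """
    G = np.zeros_like(H)
    for i, j, direction, label in edges:
        # message from neighbor j to node i with direction-specific weights
        G[i] += H[j] @ W_dir[direction] + b_lab[label]
    return np.maximum(G, 0.0) + H   # ReLU, then residual (Eq. 4)
```

Stacking $L$ such layers (Eq. 5) amounts to calling this function repeatedly on its own output with per-layer parameters.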

3.3. Multi-Head Self-Attention Module

In this section, we present the multi-head self-attention module to capture long-range semantic dependencies from video context. By the self-attention method, each frame is able to interact not only with adjacent frames but also with distant ones. And the multi-head setting is beneficial to sufficiently understand the complicated dependencies.

We first extract frame features from the untrimmed video by a pre-trained 3D-ConvNet (Tran et al., 2015), denoted by $v = \{v_t\}_{t=1}^{T}$, where $v_t$ is the visual feature of the $t$-th frame. We then introduce the multi-head self-attention based on the scaled dot-product attention, which was originally proposed in the field of machine translation (Vaswani et al., 2017).

Scaled dot-product attention. We assume the input of the scaled dot-product attention is a sequence of queries $Q \in \mathbb{R}^{n_q \times d_q}$, keys $K \in \mathbb{R}^{n_k \times d_k}$ and values $V \in \mathbb{R}^{n_v \times d_v}$, where $n_q$, $n_k$ and $n_v$ represent the numbers of queries, keys and values, and $d_q$, $d_k$ and $d_v$ are their respective dimensions. The scaled dot-product attention is then calculated by

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{Q K^{\top}}{\sqrt{d_k}}\Big) V$  (6)

where the softmax operation is performed on every row. The values are aggregated for each query according to the dot-product score between the query and the corresponding key of the values.

Multi-head attention. The multi-head attention consists of $H$ parallel scaled dot-product attention layers. For each independent attention layer, the input queries, keys and values are linearly projected to $\tilde{d}_q$, $\tilde{d}_k$ and $\tilde{d}_v$ dimensions. Concretely, the result of multi-head attention is given by

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H) W^{O}, \quad \mathrm{head}_h = \mathrm{Attention}(Q W_h^{Q}, K W_h^{K}, V W_h^{V})$  (7)

where $W_h^{Q}$, $W_h^{K}$, $W_h^{V}$ and $W^{O}$ are linear projection matrices. The $d_q$, $d_k$, $d_v$ are the initial input dimensions and $d_o$ is the output dimension.

When the queries $Q$, keys $K$ and values $V$ are set to the same video feature matrix $v$ and the input dimension is equal to the output dimension, we get a multi-head self-attention method. Based on it, we obtain the self-attentive video representation $\hat{v}$, given by

$\hat{v} = \mathrm{MultiHead}(v, v, v) + v$  (8)

where a residual connection is applied, similar to the syntactic GCN. Here we establish frame-to-frame correlations among video sequences, and the multi-head setting allows the attention operation to aggregate information from different representation subspaces. But since temporal modeling is still critical for video semantic understanding, we cannot depend only on the multi-head self-attention and ignore contextual representation learning. Hence, we next employ another BiGRU to learn the self-attentive video semantic representations $\tilde{v} = \{\tilde{v}_t\}_{t=1}^{T}$.
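The self-attention computation above can be sketched in NumPy as follows; the projection shapes and names are assumptions for illustration, not the authors' code:

```python
import numpy as np

def softmax(X):
    E = np.exp(X - X.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def scaled_dot_attention(Q, K, V):
    # Eq. (6): row-wise softmax of Q K^T / sqrt(d_k), then aggregate values
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    """Sketch of Eqs. (7)-(8): X (T, d) is the frame feature matrix used as
    queries, keys and values; Wq/Wk/Wv are (H, d, d_h), Wo is (H*d_h, d)."""
    heads = [scaled_dot_attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo + X   # residual connection
```

Each row of the attention matrix sums to one, so every frame receives a convex combination of all other frames, including distant ones.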

3.4. Multi-Stage Cross-Modal Interaction Module

In this section, we introduce the cross-modal interaction module to exploit the potential relations of video and query contents, which consists of the attentive aggregation, cross-gated interaction and low-rank bilinear fusion.

Attentive aggregation. Given the syntactic-aware query representations $\tilde{h} = \{\tilde{h}_j\}_{j=1}^{N}$ and the self-attentive video semantic representations $\tilde{v} = \{\tilde{v}_t\}_{t=1}^{T}$, we apply a typical attention mechanism to aggregate the query clues for each frame. Concretely, we first compute the attention score between each pair of frame and word, and obtain a video-to-query attention matrix $A \in \mathbb{R}^{T \times N}$. The attention score of the $t$-th frame and the $j$-th word is given by

$A_{tj} = w^{\top} \tanh(W^{v} \tilde{v}_t + W^{q} \tilde{h}_j + b)$  (9)

where $W^{v}$, $W^{q}$ are parameter matrices, $b$ is the bias vector and $w^{\top}$ is the row vector. We then apply the softmax operation to each row of $A$, given by

$\alpha_{tj} = \frac{\exp(A_{tj})}{\sum_{k=1}^{N} \exp(A_{tk})}$  (10)

where $\alpha_{tj}$ represents the correlation of the $t$-th frame and the $j$-th word. Next, we extract the crucial query clues for each frame based on $\alpha$, given by

$q_t = \sum_{j=1}^{N} \alpha_{tj} \tilde{h}_j$  (11)

where $q_t$ represents the aggregated query representation relevant to the $t$-th frame.
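The aggregation steps above can be sketched with NumPy broadcasting; parameter names and shapes are hypothetical:

```python
import numpy as np

def softmax_rows(X):
    E = np.exp(X - X.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def attentive_aggregation(V, H, Wv, Wq, b, w):
    """Sketch of Eqs. (9)-(11).
    V: (T, d) frame reps, H: (N, d) word reps, Wv/Wq: (d, d), b/w: (d,)."""
    # A[t, j] = w . tanh(Wv v_t + Wq h_j + b)                      (Eq. 9)
    A = np.tanh((V @ Wv.T)[:, None, :] + (H @ Wq.T)[None, :, :] + b) @ w
    alpha = softmax_rows(A)            # Eq. 10: softmax over words per frame
    return alpha @ H                   # Eq. 11: (T, d) aggregated query clues
```

The broadcasted sum builds a (T, N, d) tensor of pairwise frame-word projections, so every frame gets its own soft summary of the query.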

Cross-gated interaction. With the aggregated query representation $q_t$ and the frame semantic representation $\tilde{v}_t$, we then apply a cross gate (Feng et al., 2018) to emphasize crucial contents and weaken inessential parts. In the cross gate method, the gate of the query representation depends on the frame representation, and meanwhile the frame representation is also gated by its corresponding query representation, denoted by

$g_t^{q} = \sigma(W^{gq} \tilde{v}_t + b^{gq}), \quad \hat{q}_t = q_t \odot g_t^{q}, \quad g_t^{v} = \sigma(W^{gv} q_t + b^{gv}), \quad \hat{v}_t = \tilde{v}_t \odot g_t^{v}$  (12)

where $W^{gq}$, $W^{gv}$ are parameter matrices, $b^{gq}$ and $b^{gv}$ are the bias vectors, $\sigma$ is the sigmoid function, and $\odot$ represents element-wise multiplication. If the aggregated query representation $q_t$ is irrelevant to the frame semantic representation $\tilde{v}_t$, both representations are filtered to decrease their influence on subsequent networks. On the contrary, the cross gate can further enhance the effects of relevant frame-query pairs.
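A minimal sketch of the cross gate, with assumed parameter names:

```python
import numpy as np

def sigmoid(X):
    return 1.0 / (1.0 + np.exp(-X))

def cross_gate(V, Q, Wgq, bgq, Wgv, bgv):
    """Sketch of Eq. (12): each modality is gated by the other.
    V, Q: (T, d) frame and aggregated-query representations (per-row pairs);
    Wgq/Wgv: (d, d); bgq/bgv: (d,). Names are hypothetical."""
    gate_q = sigmoid(V @ Wgq.T + bgq)   # gate for query reps, driven by frames
    gate_v = sigmoid(Q @ Wgv.T + bgv)   # gate for frame reps, driven by query
    return V * gate_v, Q * gate_q
```

Because sigmoid gates lie in (0, 1), gating can only attenuate each feature, so irrelevant frame-query pairs are suppressed rather than amplified.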

Bilinear fusion. After the attentive aggregation and the cross gate, we propose a low-rank bilinear fusion method (Kim et al., 2017) to further exploit the cross-modal interaction between $\hat{v}_t$ and $\hat{q}_t$. The original bilinear fusion method is written as

$f_k^{t} = \hat{v}_t^{\top} W_k \hat{q}_t + b_k$  (13)

where $f_k^{t}$ represents the $k$-th dimension of the bilinear output at time step $t$, and $f^{t}$ is the fusion result of $\hat{v}_t$ and $\hat{q}_t$. The $W_k$, $b_k$ are the parameter matrix and the bias for the $k$-th dimension. We can note that the original bilinear fusion method requires too many parameters and suffers from a heavy computation cost. Thus, we replace it with the low-rank version (Kim et al., 2017), given by

$f^{t} = P^{\top}\big(\tanh(U^{\top} \hat{v}_t) \odot \tanh(V^{\top} \hat{q}_t)\big) + b$  (14)

where $f^{t}$ is the bilinear fusion result at time step $t$, and $U$, $V$, $P$ are low-rank projection matrices.

Eventually, by the attentive aggregation, cross-gated interaction and low-rank bilinear fusion, we obtain the cross-modal semantic representation for each frame, denoted by $f = \{f^{t}\}_{t=1}^{T}$.
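The low-rank factorization can be sketched as below, in the style of Kim et al. (2017); shapes and names are assumptions:

```python
import numpy as np

def low_rank_bilinear(v, q, U, V, P, b):
    """Sketch of Eq. (14): f = P^T (tanh(U^T v) * tanh(V^T q)) + b.
    v, q: (d,) gated frame/query vectors; U, V: (d, r); P: (r, d_out);
    b: (d_out,). The rank r trades expressiveness for parameter count."""
    return ((np.tanh(v @ U) * np.tanh(q @ V)) @ P) + b
```

A full bilinear layer needs on the order of d * d * d_out parameters, while this low-rank form needs only 2 * d * r + r * d_out, which is the motivation stated in the text.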

3.5. Moment Retrieval Module

In this section, we present the moment retrieval module, which simultaneously scores a set of candidate moments with multi-scale windows at each time step, and further adopts a temporal boundary regression mechanism to adjust the moment boundaries.

By the cross-modal interaction module, we get the cross-modal semantic representations $f = \{f^{t}\}_{t=1}^{T}$. To absorb the contextual evidence, we likewise develop another BiGRU network to learn the final semantic representations $m = \{m^{t}\}_{t=1}^{T}$. We then pre-define a set of candidate moments with multi-scale windows at each time step $t$, denoted by $C^{t} = \{(c_s^{t,k}, c_e^{t,k})\}_{k=1}^{K}$, where $c_s^{t,k}$ and $c_e^{t,k}$ are the start and end boundaries of the $k$-th candidate moment at time $t$, $l_k$ is the width of the $k$-th moment and $K$ is the number of moments. Note that we set a fixed window width $l_k$ for the $k$-th candidate moment at every time step. Thus, we can simultaneously produce the confidence scores for these moments at time $t$ by a fully connected layer with sigmoid nonlinearity, given by

$cs^{t} = \sigma(W^{s} m^{t} + b^{s})$  (15)

where $cs^{t} \in \mathbb{R}^{K}$ represents the confidence scores of the moments at time $t$ and $cs^{t,k}$ corresponds to the $k$-th moment. Likewise, we produce the predicted offsets for these moments by

$o^{t} = W^{o} m^{t} + b^{o}$  (16)

where $o^{t} = \{(\hat{o}_s^{t,k}, \hat{o}_e^{t,k})\}_{k=1}^{K}$ represents the predicted offsets of the moments at time $t$ and $(\hat{o}_s^{t,k}, \hat{o}_e^{t,k})$ corresponds to the $k$-th moment.
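Candidate generation and scoring can be sketched as follows; widths, weight names and the boundary cut-off behavior follow the text, the rest is an illustrative assumption:

```python
import numpy as np

def sigmoid(X):
    return 1.0 / (1.0 + np.exp(-X))

def candidate_moments(T, widths):
    """Multi-scale windows at each time step; candidates extending past the
    video boundary are cut off, as described in Section 4.2."""
    return [(t, t + w) for t in range(T) for w in widths if t + w <= T]

def score_moments(M, Ws, bs):
    """Sketch of Eq. (15): confidence scores for K window widths per step.
    M: (T, d) final representations; Ws: (K, d); bs: (K,)."""
    return sigmoid(M @ Ws.T + bs)   # (T, K) scores in (0, 1)
```

The offset head of Eq. (16) is identical in shape but uses a plain linear layer (no sigmoid), since offsets may be negative.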

Alignment loss. We first adopt an alignment loss to make moments aligned with the target moment have high confidence scores and misaligned moments have low confidence scores. Formally, we first compute the IoU (i.e. Intersection over Union) score $s^{t,k}$ of each candidate moment $(c_s^{t,k}, c_e^{t,k})$ with the target moment $(\tau_s, \tau_e)$. If the IoU score of a candidate is less than a clearing threshold, we reset it to 0. Next, we calculate the alignment loss by

$\mathcal{L}^{align} = -\frac{1}{N_c} \sum_{t=1}^{T} \sum_{k=1}^{K} \big( s^{t,k} \log cs^{t,k} + (1 - s^{t,k}) \log(1 - cs^{t,k}) \big)$  (17)

where $N_c$ is the total number of candidate moments. We consider all candidate moments during alignment training, and apply the concrete IoU score rather than setting 0 or 1 according to a threshold value for every candidate. This setting is helpful for distinguishing high-score candidates.
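A minimal sketch of this soft-target cross-entropy, assuming scores and candidates are flattened into parallel lists:

```python
import numpy as np

def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def alignment_loss(scores, cands, target, clear_thresh=0.3):
    """Sketch of Eq. (17): binary cross-entropy against soft IoU targets,
    with IoUs below the clearing threshold reset to 0."""
    total = 0.0
    for cs, c in zip(scores, cands):
        s = temporal_iou(c, target)
        if s < clear_thresh:
            s = 0.0
        total += -(s * np.log(cs) + (1.0 - s) * np.log(1.0 - cs))
    return total / len(cands)
```

Using the concrete IoU rather than a hard 0/1 label means two overlapping candidates with IoUs 0.9 and 0.5 receive different supervision, which is what lets the model rank high-score candidates.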

Regression loss. As these multi-scale temporal windows have fixed widths, our candidate moments are restricted to discrete boundaries. To go beyond this limitation, we apply a boundary regression mechanism to adjust the temporal boundaries of high-score moments. Concretely, we fine-tune the localization offsets of high-score moments by a regression loss. First, we define the set $\mathcal{H}$ of high-score moments whose scores are larger than a high-score threshold, and then compute the start and end offset values for those high-score moments as follows:

$o_s = \tau_s - c_s, \quad o_e = \tau_e - c_e$  (18)

where $(\tau_s, \tau_e)$ are the boundaries of the target moment, and $(c_s, c_e)$ are the boundaries of a high-score moment in $\mathcal{H}$. Thus, $(o_s, o_e)$ denote its ground-truth offsets, and the predicted offsets $(\hat{o}_s, \hat{o}_e)$ are given by the preceding fully connected layer. Next, we design the regression loss as follows:

$\mathcal{L}^{reg} = \frac{1}{N_h} \sum_{\mathcal{H}} \big( R(\hat{o}_s - o_s) + R(\hat{o}_e - o_e) \big)$  (19)

where $\mathcal{H}$ is the set of high-score moments, $N_h$ is the size of $\mathcal{H}$ and $R$ represents the smooth L1 function.
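The regression target can be sketched as below, with the standard smooth L1 definition; list layouts are assumptions:

```python
def smooth_l1(x):
    """Smooth L1: 0.5 x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = abs(x)
    return 0.5 * x * x if x < 1.0 else x - 0.5

def regression_loss(pred_offsets, high_score_moments, target):
    """Sketch of Eqs. (18)-(19): regress predicted offsets of high-score
    moments toward the ground-truth offsets."""
    total = 0.0
    for (po_s, po_e), (c_s, c_e) in zip(pred_offsets, high_score_moments):
        o_s, o_e = target[0] - c_s, target[1] - c_e   # Eq. 18
        total += smooth_l1(po_s - o_s) + smooth_l1(po_e - o_e)
    return total / len(high_score_moments)
```

Smooth L1 behaves quadratically near zero (stable gradients for nearly-correct offsets) and linearly for large errors (robustness to outlier candidates).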

With the alignment loss and the regression loss, we eventually propose a multi-task loss to train the cross-modal interaction networks in an end-to-end manner, denoted by

$\mathcal{L} = \mathcal{L}^{align} + \beta \, \mathcal{L}^{reg}$  (20)

where $\beta$ is a hyper-parameter to control the balance of the two losses.

During inference, we simply choose the candidate moment with the highest confidence score. If we need to select multiple moments (i.e. top-K), we first rank all candidates according to their confidence scores and then adopt non-maximum suppression (NMS) to select moments in order.
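The top-K selection with NMS can be sketched as follows; the IoU suppression threshold is an assumed parameter, not a value given in the text:

```python
def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def nms_top_k(cands, scores, k, iou_thresh=0.5):
    """Rank candidates by confidence, then greedily keep moments whose IoU
    with every already-kept moment stays below the threshold."""
    order = sorted(range(len(cands)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(temporal_iou(cands[i], cands[j]) < iou_thresh for j in kept):
            kept.append(i)
        if len(kept) == k:
            break
    return [cands[i] for i in kept]
```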

4. Experiments

4.1. Datasets

We first introduce two public datasets for query-based moment retrieval.

ActivityCaption (Krishna et al., 2017): The ActivityCaption dataset was originally developed for the task of dense video captioning; it contains 20,000 untrimmed videos and each video includes multiple natural language descriptions with temporal annotations. The video contents of this dataset are diverse and open. For query-based moment retrieval, each description is regarded as a query and corresponds to a target moment. Since the caption annotations of the ActivityCaption test data are not publicly available, we take val_1 as the validation set and val_2 as the test data. The details of the ActivityCaption dataset are summarized in Table 1.

TACoS (Regneri et al., 2013): The TACoS dataset is built on MPII Composites (Rohrbach et al., 2012) and contains only 127 videos, but each video in TACoS has a large number of temporal textual annotations. The contents of TACoS are limited to cooking scenes and thus lack diversity. Moreover, the videos of TACoS are longer and the target moments are shorter than in ActivityCaption, which makes query-based moment retrieval harder. The details of this dataset are also summarized in Table 1.

ActivityCaption
Number Video Time Target Time Query Len
Train 37,421 117.30 35.45 13.48
Valid 17,505 118.23 37.73 13.58
Test 17,031 118.21 40.25 12.02
All 71,957 117.74 37.14 13.16
TACoS
Number Video Time Target Time Query Len
Train 10,146 224.16 5.70 8.69
Valid 4,589 387.46 6.23 9.12
Test 4,083 367.70 6.96 9.00
All 18,818 296.21 6.10 8.86
Table 1. Summaries of ActivityCaption and TACoS Datasets, including number of samples, average video duration, average target moment duration and average query length.
Method  R@1, IoU=0.3  R@1, IoU=0.5  R@1, IoU=0.7  R@5, IoU=0.3  R@5, IoU=0.5  R@5, IoU=0.7
MCN 39.35 21.36 6.43 68.12 53.23 29.70
VSA-RNN 39.28 23.43 9.01 70.84 55.52 32.12
VSA-STV 41.71 24.01 8.92 71.05 56.62 34.52
CTRL 47.43 29.01 10.34 75.32 59.17 37.54
ACRN 49.70 31.67 11.25 76.50 60.34 38.57
QSPN 52.13 33.26 13.43 77.72 62.39 40.78
CMIN 63.61 43.40 23.88 80.54 67.95 50.73
Table 2. Performance Evaluation Results on the ActivityCaption Dataset ($n \in \{1, 5\}$ and $m \in \{0.3, 0.5, 0.7\}$).

4.2. Implementation Details

In this section, we introduce some implementation details of our CMIN method, including the data preprocessing and model setting.

Data Preprocessing. We first resize every frame of the videos to 112 × 112 and extract visual features with a pre-trained 3D-ConvNet (Tran et al., 2015). Specifically, we define 16 continuous frames as a unit, and each unit overlaps 8 frames with adjacent units. We then input the units into the pre-trained 3D-ConvNet and obtain 4,096-dimensional features for each unit. We next reduce the dimensionality of the features from 4,096 to 500 using PCA, which helps decrease model parameters. These 500-d features are used as the frame features of our CMIN. Since some videos are overlong, we uniformly downsample their feature sequences to length 200.

For natural language queries, we first extract the syntactic dependency graph using the library of NLTK (Bird and Loper, 2004) and employ the pre-trained Glove word2vec (Pennington et al., 2014) to extract the embedding features for each word token. The dimension of word features is 300.

Model Setting. In our CMIN, we sample candidate moments with multi-scale windows at each time step, with one set of window widths for the ActivityCaption dataset and another for TACoS. Thus, we have 1,400 samples for each video on ActivityCaption and 800 samples on TACoS. Note that we cut off candidate moments that are beyond the boundaries of videos. We then set the clearing threshold to 0.3, the high-score threshold to 0.7, and the balance hyper-parameter to 0.001. Moreover, the dimension of the hidden state of the BiGRU networks is set to 512 (256 for each direction). The dimensions of the linear matrices in the multi-head self-attention and bilinear fusion are also set to 512. During training, we adopt the Adam optimizer (Duchi et al., 2011) to minimize the multi-task loss with the learning rate set to 0.001. We employ mini-batch training with a batch size of 128.

4.3. Evaluation Criteria

To measure the retrieval performance of our CMIN and the baselines, we adopt “R@n, IoU=m” as the evaluation criterion, as proposed in (Gao et al., 2017). Concretely, we first calculate the IoU (i.e. Intersection over Union) between each selected moment and the ground-truth moment, and “R@n, IoU=m” means the percentage of queries for which at least one of the top-n selected moments has an IoU larger than m. The metric is computed at the query level, so the overall performance is the average over all queries, $\frac{1}{N_q} \sum_{i=1}^{N_q} r(n, m, q_i)$, where $r(n, m, q_i)$ indicates whether one of the top-n selected moments of query $q_i$ has $\mathrm{IoU} > m$, and $N_q$ is the total number of testing queries.
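The metric can be sketched directly from its definition; input layouts are assumptions:

```python
def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def recall_at_n(ranked_moments, targets, n, m):
    """'R@n, IoU=m': fraction of queries for which at least one of the
    top-n ranked moments has IoU larger than m with the ground truth.
    ranked_moments: per-query lists of (start, end), best first."""
    hits = sum(1 for preds, gt in zip(ranked_moments, targets)
               if any(temporal_iou(p, gt) > m for p in preds[:n]))
    return hits / len(targets)
```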

Method  R@1, IoU=0.1  R@1, IoU=0.3  R@1, IoU=0.5  R@5, IoU=0.1  R@5, IoU=0.3  R@5, IoU=0.5
MCN 3.11 1.64 1.25 3.11 2.03 1.25
VSA-RNN 8.84 10.77 4.78 19.05 13.90 9.10
VSA-STV 15.01 10.77 7.56 32.82 23.92 15.50
CTRL 24.32 18.32 13.30 48.73 36.69 25.42
ACRN 24.22 19.52 14.62 47.42 34.97 24.88
QSPN 25.31 20.15 15.23 53.21 36.72 25.30
CMIN 32.48 24.64 18.05 62.13 38.46 27.02
Table 3. Performance Evaluation Results on the TACoS Dataset ($n \in \{1, 5\}$ and $m \in \{0.1, 0.3, 0.5\}$).

4.4. Performance Comparisons

We compare our proposed CMIN method with existing state-of-the-art methods to verify its effectiveness.

  • MCN (Hendricks et al., 2017): The MCN method adopts a moment context network to integrate local and global moment features for query-based moment retrieval.

  • VSA-RNN and VSA-STV (Gao et al., 2017): The two methods are extensions of the DVSA model (Karpathy and Fei-Fei, 2015). Both simply transform the visual features of candidate moments and the query features into a common space, and then estimate correlation scores to select the most relevant moment. VSA-RNN applies an LSTM network to encode queries, while VSA-STV adopts the off-the-shelf skip-thought (Kiros et al., 2015) feature extractor.

  • CTRL (Gao et al., 2017): The CTRL method proposes a cross-modal temporal regression localizer to estimate the alignment scores of candidate moments and textual queries by leveraging the contextual contents of these moments, and then adjusts the start and end boundaries of high-score moments.

  • ACRN (Liu et al., 2018a): The ACRN method emphasizes the visual moment features with attentive contextual contents and develops a cross-modal feature representation.

  • QSPN (Xu et al., 2019): The QSPN method introduces a multi-level model to integrate visual and textual features earlier with an attention mechanism, learn spatio-temporal visual representations and further re-generate queries as the auxiliary task.

The former three approaches only focus on the visual features within each moment and ignore contextual information. The latter three approaches incorporate contextual evidence to improve retrieval performance, where ACRN utilizes an attention mechanism to filter out irrelevant context and QSPN further develops an early interaction strategy for cross-modal features.

Table 2 and Table 3 show the overall performance evaluation results of our CMIN and all baselines on the ActivityCaption and TACoS datasets, respectively. We choose the evaluation criteria "R@n, IoU=m" with n ∈ {1, 5}, m ∈ {0.3, 0.5, 0.7} for ActivityCaption and n ∈ {1, 5}, m ∈ {0.1, 0.3, 0.5} for TACoS. Note that we report each baseline's performance as the higher of the numbers from its original paper and our implementation. The experimental results reveal a number of interesting points:

  • While modeling moment features, MCN applies a mean-pooling operation to aggregate all features of the video sequence as the context of the current moment, which may introduce noise into the moment representations and degrade retrieval accuracy. Thus, MCN achieves the worst performance on all criteria.

  • The context-based methods CTRL, ACRN, QSPN and CMIN outperform the simple VSA-RNN and VSA-STV, which suggests that context modeling is crucial for high-quality moment retrieval. The performance of VSA-STV is slightly better than that of VSA-RNN, demonstrating that the skip-thought feature extractor is helpful for query understanding.

  • QSPN adopts an attention mechanism to fuse the visual and textual features, which develops early interactions between moments and queries. The fact that QSPN achieves better performance than CTRL and ACRN verifies that cross-modal interaction is critical for query-based moment retrieval.

  • On all criteria of the two datasets, CMIN not only outperforms all previous state-of-the-art baselines but also achieves substantial improvements, especially on ActivityCaption. These results verify the effectiveness of our syntactic GCN, multi-head self-attention and multi-stage cross-modal interaction.

Moreover, we can find that the overall experimental results on TACoS are lower than on ActivityCaption, and that CMIN achieves a smaller improvement on TACoS. This is because the videos are longer and the target moments shorter on TACoS, as shown in Table 1, which increases the number of potential candidate moments and makes the task harder. The unvarying cooking scenes and shorter query descriptions may also increase the retrieval difficulty, since unvarying scenes require better discrimination ability and a short query can hardly describe a moment clearly.

4.5. Ablation Study

To assess the contribution of each component of our CMIN method, we next conduct ablation studies on the syntactic GCN, multi-head self-attention and multi-stage cross-modal interaction. Concretely, we discard one component at a time to generate the following ablation models.

  • CMIN(w/o. GCN): We first remove the syntactic GCN layer from the query representation learning and take the query representations from the BiGRU networks as the input of the multi-stage cross-modal interaction module.

  • CMIN(w/o. SA): We then discard the multi-head self-attention from the video representation learning to validate the importance of long-range semantic dependency modeling.

  • CMIN(w/o. CG): We next remove the cross gate in the multi-stage cross-modal interaction module, directly applying the bilinear fusion to the frame representations and aggregated query representations.

  • CMIN(w/o. BF): We finally replace the low-rank bilinear fusion method with a simple concatenation of the query and video features.
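
To make the two ablated fusion pieces concrete, here is a minimal numpy sketch of a cross gate followed by Hadamard-product low-rank bilinear fusion in the spirit of (Kim et al., 2017). The random matrices stand in for learned projections, and the exact gating and fusion forms in CMIN may differ; this only illustrates the mechanism.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_gate_bilinear(frame, query, Wg_v, Wg_q, Wv, Wq):
    # cross gate: each modality's feature is gated by a sigmoid
    # computed from the OTHER modality
    g_v = sigmoid(query @ Wg_q)           # gate applied to the frame feature
    g_q = sigmoid(frame @ Wg_v)           # gate applied to the query feature
    frame_g, query_g = frame * g_v, query * g_q
    # low-rank bilinear fusion via an elementwise (Hadamard) product
    return np.tanh(frame_g @ Wv) * np.tanh(query_g @ Wq)

rng = np.random.default_rng(2)
d = 512  # matches the 512-dim hidden states in the model setting
f, q = rng.normal(size=d), rng.normal(size=d)
Wg_v, Wg_q = (rng.normal(size=(d, d)) * 0.02 for _ in range(2))
Wv, Wq = (rng.normal(size=(d, d)) * 0.02 for _ in range(2))
fused = cross_gate_bilinear(f, q, Wg_v, Wg_q, Wv, Wq)
print(fused.shape)  # (512,)
```

The CMIN(w/o. CG) ablation corresponds to skipping the two gating lines, and CMIN(w/o. BF) to replacing the final product with a concatenation.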

The ablation results on ActivityCaption and TACoS datasets are shown in Table 4 and Table 5, respectively. By analyzing the ablation results, we can find several interesting points:

  • CMIN(full) outperforms all ablation models on both the ActivityCaption and TACoS datasets, which demonstrates that the syntactic GCN, multi-head self-attention, cross gate and low-rank bilinear fusion are all helpful for query-based moment retrieval.

  • CMIN(w/o. GCN) achieves the worst performance on ActivityCaption and also yields poor results on TACoS, indicating that the utilization of syntactic structure is critical for query semantic understanding and subsequent modeling.

  • All ablation models still yield better results than all baselines. This fact demonstrates that our comprehensive retrieval framework is well suited to this task and that its strong performance does not depend on a single key component.

Method    R@1,IoU=0.3  R@1,IoU=0.5  R@1,IoU=0.7  R@5,IoU=0.3  R@5,IoU=0.5  R@5,IoU=0.7
w/o. GCN      60.12        40.84        21.79        78.23        65.67        45.43
w/o. SA       61.22        41.56        22.36        79.43        66.91        48.12
w/o. CG       60.57        41.21        22.01        78.62        65.99        46.89
w/o. BF       61.32        41.89        22.12        79.27        66.21        47.92
full          63.61        43.40        23.88        80.54        67.95        50.73
Table 4. Performance Evaluation Results of Ablation Models on the ActivityCaption Dataset.
Method    R@1,IoU=0.1  R@1,IoU=0.3  R@1,IoU=0.5  R@5,IoU=0.1  R@5,IoU=0.3  R@5,IoU=0.5
w/o. GCN      30.54        23.22        10.03        57.69        37.12        26.16
w/o. SA       30.21        23.02        16.87        55.54        36.60        25.37
w/o. CG       31.96        23.59        17.47        61.87        38.11        26.79
w/o. BF       32.01        24.79        17.61        61.59        38.23        26.75
full          32.48        24.64        18.05        62.13        38.46        27.02
Table 5. Performance Evaluation Results of Ablation Models on the TACoS Dataset.

Moreover, for the syntactic GCN module, the number of stacked layers is a crucial hyper-parameter. We therefore explore its effect by varying the number of layers from 1 to 5. Figure 4 and Figure 5 show the impact of the layer number on the ActivityCaption and TACoS datasets, where we select "R@1, IoU=0.3" and "R@1, IoU=0.5" as evaluation criteria. From the figures, we note that CMIN achieves the best performance when the number of layers is set to 2, and that stacking too many or too few layers degrades the performance of query-based moment retrieval. A single syntactic GCN layer cannot sufficiently leverage the syntactic dependencies of natural language queries, while too many syntactic GCN layers result in over-smoothing, that is, every word representation converges to the same value.
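
The over-smoothing effect can be illustrated with a minimal numpy sketch: repeatedly averaging word features over a dependency graph drives all word representations toward each other. The paper's syntactic GCN additionally uses learned transformations, edge directions and gating, which are omitted here; this only demonstrates the propagation step that causes smoothing.

```python
import numpy as np

def gcn_layer(H, A):
    """One propagation step over the dependency graph: average each word
    with its neighbors (row-normalized adjacency with self-loops)."""
    A_hat = A + np.eye(A.shape[0])
    return (A_hat / A_hat.sum(axis=1, keepdims=True)) @ H

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))                     # 6 word tokens, feature dim 8
A = np.zeros((6, 6))
for head, dep in [(1, 0), (1, 2), (1, 4), (4, 3), (4, 5)]:  # toy parse tree
    A[head, dep] = A[dep, head] = 1.0

def word_variance(X):
    return np.var(X, axis=0).mean()  # spread of the word representations

X1 = gcn_layer(H, A)                 # 1 stacked layer
X10 = H
for _ in range(10):                  # 10 stacked layers
    X10 = gcn_layer(X10, A)
print(word_variance(X1), word_variance(X10))  # deeper stacks shrink the spread
```

Since the averaging operator is row-stochastic and the parse graph is connected, stacking many layers collapses all rows toward a common vector, which is exactly the over-smoothing the text describes.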

4.6. Qualitative Analysis

To qualitatively validate the effectiveness of the CMIN method, we display several typical examples of query-based moment retrieval. Figure 6 and Figure 7 show the retrieval results of the CMIN method and the best baseline QSPN on the ActivityCaption and TACoS datasets, respectively. We can find that natural language queries are very diverse and often contain successive temporal actions. By intuitive comparison, CMIN retrieves more accurate boundaries of the target moments than QSPN. Moreover, the retrieval precision on TACoS is lower than on ActivityCaption, which is consistent with the previous quantitative evaluations.

Furthermore, as the fundamental component of our multi-stage cross-modal interaction module, the video-to-query attentive aggregation builds a bridge between video and query information. Thus, we demonstrate how the attentive aggregation mechanism works to further understand the interaction process. As shown in Figure 8, the video-to-query attention results are visualized as a heat map, where a darker color indicates a higher correlation between the pair of frame and word representations. We note that each frame can attend to the semantically related words and ignore the irrelevant ones. For example, the word "line" has the highest attention score over the query for the fourth frame. This suggests the attentive aggregation strategy effectively establishes the relationship between visual and textual information, and is helpful for high-quality moment retrieval.
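
The mechanism visualized in Figure 8 can be sketched as follows: each frame attends over all query words and aggregates a frame-specific query representation, and the attention matrix itself is what the heat map shows. CMIN uses learned projections before the dot product; the raw dot products below are a simplifying assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def video_to_query_attention(frames, words):
    """frames: (T, d), words: (L, d) -> attended query features (T, d)
    and the (T, L) attention map (the heat map of Figure 8)."""
    scores = frames @ words.T          # frame-word correlation scores
    attn = softmax(scores, axis=1)     # normalize over words for each frame
    return attn @ words, attn

rng = np.random.default_rng(1)
T, L, d = 4, 6, 16                     # 4 frames, 6 query words
ctx, attn = video_to_query_attention(rng.normal(size=(T, d)),
                                     rng.normal(size=(L, d)))
print(ctx.shape, attn.shape)           # each row of attn sums to 1
```

A dark cell in row t, column l of `attn` corresponds to frame t attending strongly to word l, as with "line" and the fourth frame in the example above.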

Figure 4. Effect of the Number of Stacked Syntactic GCN layers on the ActivityCaption Dataset.
Figure 5. Effect of the Number of Stacked Syntactic GCN layers on the TACoS Dataset.
Figure 6. Examples on the ActivityCaption dataset.
Figure 7. Examples on the TACoS dataset.
Figure 8. The Video-to-Query Attention Results in the Multi-Stage Cross-Modal Interaction Module

5. Conclusion

In this paper, we propose a novel cross-modal interaction network for query-based moment retrieval, which considers three critical factors of this task: the syntactic structure of natural language queries, long-range semantic dependencies in video context and the fine-grained cross-modal interaction. Specifically, we devise a syntactic GCN to leverage the syntactic structure of queries for fine-grained representation learning, then propose a multi-head self-attention to capture long-range semantic dependencies from video context, and employ a multi-stage cross-modal interaction to explore the potential relations of video and query contents. Extensive experiments on the ActivityCaption and TACoS datasets demonstrate the effectiveness of our proposed method.

Acknowledgements.
This work was supported by the National Natural Science Foundation of China under Grants No. 61602405, No. U1611461, No. 61751209 and No. 61836002, and sponsored by the Joint Research Program of ZJU and Hikvision Research Institute.

References

  • Alayrac et al. (2016) Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon Lacoste-Julien. 2016. Unsupervised learning from narrated instruction videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4575–4583.
  • Anderson et al. (2018) Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Bird and Loper (2004) Steven Bird and Edward Loper. 2004. NLTK: the natural language toolkit. In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions. Association for Computational Linguistics, 31.
  • Bojanowski et al. (2015) Piotr Bojanowski, Rémi Lajugie, Edouard Grave, Francis Bach, Ivan Laptev, Jean Ponce, and Cordelia Schmid. 2015. Weakly-supervised alignment of video with text. In Proceedings of the IEEE International Conference on Computer Vision. 4462–4470.
  • Chao et al. (2018) Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A Ross, Jia Deng, and Rahul Sukthankar. 2018. Rethinking the Faster R-CNN Architecture for Temporal Action Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1130–1139.
  • Chen et al. (2018) Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, and Tat-Seng Chua. 2018. Temporally Grounding Natural Sentence in Video. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. ACL, 162–171.
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Advances in neural information processing systems.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, Jul (2011), 2121–2159.
  • Feng et al. (2018) Yang Feng, Lin Ma, Wei Liu, Tong Zhang, and Jiebo Luo. 2018. Video re-localization. In Proceedings of the European Conference on Computer Vision. 51–66.
  • Gao et al. (2017) Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. TALL: Temporal Activity Localization via Language Query. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, 5277–5285.
  • Girshick et al. (2014) Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 580–587.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
  • Hendricks et al. (2017) Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision. 5803–5812.
  • Hendricks et al. (2018) Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2018. Localizing Moments in Video with Temporal Language. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. ACL, 1380–1390.
  • Hu et al. (2016) Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. 2016. Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4555–4564.
  • Karpathy and Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3128–3137.
  • Kim et al. (2017) Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. 2017. Hadamard product for low-rank bilinear pooling. In Proceedings of the International Conference on Learning Representations.
  • Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations.
  • Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems. 3294–3302.
  • Krishna et al. (2017) Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-Captioning Events in Videos. In Proceedings of the IEEE International Conference on Computer Vision. 706–715.
  • Lin et al. (2014) Dahua Lin, Sanja Fidler, Chen Kong, and Raquel Urtasun. 2014. Visual semantic search: Retrieving videos via complex textual queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2657–2664.
  • Liu et al. (2018a) Meng Liu, Xiang Wang, Liqiang Nie, Xiangnan He, Baoquan Chen, and Tat-Seng Chua. 2018a. Attentive moment retrieval in videos. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 15–24.
  • Liu et al. (2018b) Meng Liu, Xiang Wang, Liqiang Nie, Qi Tian, Baoquan Chen, and Tat-Seng Chua. 2018b. Cross-modal Moment Localization in Videos. In Proceedings of the ACM International Conference on Multimedia. ACM, 843–851.
  • Mao et al. (2016) Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 11–20.
  • Marcheggiani and Titov (2017) Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • Otani et al. (2016) Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkilä, and Naokazu Yokoya. 2016. Learning joint representations of videos and sentences with web image search. In Proceedings of the European Conference on Computer Vision. Springer, 651–667.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 1532–1543.
  • Regneri et al. (2013) Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. 2013. Grounding action descriptions in videos. Transactions of the Association of Computational Linguistics 1 (2013), 25–36.
  • Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91–99.
  • Rohrbach et al. (2012) Marcus Rohrbach, Michaela Regneri, Mykhaylo Andriluka, Sikandar Amin, Manfred Pinkal, and Bernt Schiele. 2012. Script data for attribute-based recognition of composite activities. In Proceedings of the European Conference on Computer Vision. Springer, 144–157.
  • Sener et al. (2015) Ozan Sener, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. 2015. Unsupervised semantic parsing of video collections. In Proceedings of the IEEE International Conference on Computer Vision. 4480–4488.
  • Shou et al. (2016) Zheng Shou, Dongang Wang, and Shih-Fu Chang. 2016. Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1049–1058.
  • Singh et al. (2016) Bharat Singh, Tim K Marks, Michael Jones, Oncel Tuzel, and Ming Shao. 2016. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1961–1970.
  • Socher et al. (2014) Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association of Computational Linguistics 2, 1 (2014), 207–218.
  • Song et al. (2016) Young Chol Song, Iftekhar Naim, Abdullah Al Mamun, Kaustubh Kulkarni, Parag Singla, Jiebo Luo, Daniel Gildea, and Henry A Kautz. 2016. Unsupervised Alignment of Actions in Video with Text Descriptions. In Proceedings of the International Joint Conference on Artificial Intelligence. 2025–2031.
  • Sun et al. (2015) Chen Sun, Chuang Gan, and Ram Nevatia. 2015. Automatic concept discovery from parallel text and visual corpora. In Proceedings of the IEEE International Conference on Computer Vision. 2596–2604.
  • Tapaswi et al. (2015) Makarand Tapaswi, Martin Bauml, and Rainer Stiefelhagen. 2015. Book2movie: Aligning video scenes with book chapters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1827–1835.
  • Tellex and Roy (2009) Stefanie Tellex and Deb Roy. 2009. Towards surveillance video search by natural language query. In Proceedings of the ACM International Conference on Image and Video Retrieval. ACM, 38.
  • Tran et al. (2015) Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
  • Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of the International Conference on Learning Representations.
  • Xu et al. (2019) Huijuan Xu, Kun He, Leonid Sigal, Stan Sclaroff, and Kate Saenko. 2019. Multilevel Language and Vision Integration for Text-to-Clip Retrieval. In Proceedings of the American Association for Artificial Intelligence, Vol. 2. 7.
  • Xu et al. (2015) Ran Xu, Caiming Xiong, Wei Chen, and Jason J Corso. 2015. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. In Proceedings of the American Association for Artificial Intelligence, Vol. 5. 6.
  • Yao et al. (2018) Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2018. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision. 684–699.
  • Zhao et al. (2017) Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. 2017. Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision.
  • Zhao et al. (2018) Zhou Zhao, Zhu Zhang, Shuwen Xiao, Zhou Yu, Jun Yu, Deng Cai, Fei Wu, and Yueting Zhuang. 2018. Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks.. In Proceedings of the International Joint Conference on Artificial Intelligence. 3683–3689.