Temporal sentence localization in videos (TSLV) is an important yet challenging task in natural language processing, which has drawn increasing attention over the last few years due to its vast potential applications in information retrieval Dong et al. (2019); Yang et al. (2020) and human-computer interaction Singha et al. (2018). It aims to ground the video segment most relevant to a given sentence query. As shown in Figure 1 (a), most of the video content is irrelevant to the query (background), while only a short segment matches it (foreground). Therefore, video and query information need to be deeply integrated to distinguish the fine-grained details of different video segments.
Most previous works Gao et al. (2017); Chen et al. (2018); Zhang et al. (2019); Yuan et al. (2019a); Zhang et al. (2020b); Liu et al. (2021, 2020a, 2020b) follow the top-down framework, which pre-defines a large set of segment candidates (a.k.a. proposals) in the video with sliding windows and measures the similarity between the query and each candidate. The best segment is then selected according to the similarity. Although these methods achieve strong performance, they are sensitive to proposal quality and localize slowly because of redundant proposals. Recently, several works Rodriguez et al. (2020); Zhang et al. (2020a); Yuan et al. (2019b) exploit the bottom-up framework, which directly predicts the probability of each frame being the start or end boundary of the segment. These methods are proposal-free and much more efficient. However, they do not capture segment-level interaction and therefore neglect the rich information between the start and end boundaries. As a result, bottom-up models have so far lagged behind their top-down counterparts in performance.
To avoid the inherent drawbacks of proposal design in the top-down framework while maintaining localization performance, in this paper we propose an adaptive proposal generation network (APGN) for efficient and effective localization. Firstly, we perform boundary regression on the foreground frames to generate proposals, where the foreground frames are obtained by a foreground-background classification over the entire video. In this way, the noisy responses on the background frames are attenuated, and the generated proposals are more adaptive and discriminative than pre-defined ones. Secondly, we perform proposal ranking to select the target segment in a top-down manner over these generated proposals. As the number of generated proposals is much smaller than in methods with pre-defined proposals, the ranking stage is more efficient. Furthermore, we consider proposal-wise relations to distinguish the fine-grained semantic details of the proposals before the ranking stage.
To realize the above framework, APGN first generates query-guided video representations after encoding video and query features, and then predicts the foreground frames using a binary classification module. Subsequently, a regression module generates a proposal on each foreground frame by regressing the distances from that frame to the start and end segment boundaries. At this point, each generated proposal carries only independent, coarse semantics. To capture higher-level interactions among proposals, we encode proposal-wise features by incorporating both positional and semantic information, and represent the proposals as nodes of a proposal graph for reasoning about the correlations among them. Consequently, each updated proposal obtains more fine-grained details for the subsequent boundary refinement process.
Our contributions are summarized as follows:
We propose an adaptive proposal generation network (APGN) for the TSLV task, which adaptively generates discriminative proposals without handcrafted design, thus making localization both effective and efficient.
To further refine the semantics of the generated proposals, we introduce a proposal graph to consolidate proposal-wise features by reasoning their higher-order relations.
We conduct experiments on three challenging datasets (ActivityNet Captions, TACoS, and Charades-STA), and results show that our proposed APGN significantly outperforms the existing state-of-the-art methods.
2 Related Work
Temporal sentence localization in videos is a recently introduced task Gao et al. (2017); Anne Hendricks et al. (2017), which aims to localize the video segment most relevant to a sentence description. Various algorithms Anne Hendricks et al. (2017); Gao et al. (2017); Chen et al. (2018); Zhang et al. (2019); Yuan et al. (2019a); Zhang et al. (2020b); Qu et al. (2020); Yang et al. (2021) have been proposed within the top-down framework, which first samples candidate segments from a video, then integrates the sentence representation with each video segment individually and evaluates their matching relationships. Some of them Anne Hendricks et al. (2017); Gao et al. (2017) use sliding windows as proposals and then compare each proposal with the input query in a joint multi-modal embedding space. To improve the quality of the proposals, Zhang et al. (2019); Yuan et al. (2019a) pre-cut the video on each frame at multiple pre-defined temporal scales, and directly integrate sentence information with fine-grained video clips for scoring. Zhang et al. (2020b) further build a 2D temporal map to construct all possible segment candidates by treating each frame as the start or end boundary, and match their semantics with the query information. Although these methods achieve strong performance, they are severely limited by the heavy computation of proposal matching/ranking, and are sensitive to the quality of the pre-defined proposals.
Recently, many methods Rodriguez et al. (2020); Chen et al. (2020); Yuan et al. (2019b); Mun et al. (2020); Zeng et al. (2020); Zhang et al. (2020a); Nan et al. (2021) utilize the bottom-up framework to overcome the above drawbacks. They do not rely on segment proposals and directly select the starting and ending frames by leveraging cross-modal interactions between video and query. Specifically, they predict two probabilities at each frame, which indicate whether this frame is the start or end frame of the ground-truth video segment. Although these methods perform segment localization more efficiently, they lose the segment-level interaction, and the redundant regression on background frames may introduce noise into the boundary decision, leading to worse localization performance than top-down methods.
In this paper, we propose to preserve the segment-level interaction while improving localization efficiency. Specifically, we design a binary classification module over the entire video to filter out the background responses, which helps the model focus on the discriminative frames. At the same time, we replace the pre-defined proposals with generated ones and utilize a proposal graph for refinement.
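To make the computational cost of pre-defined proposals concrete, the following Python sketch enumerates candidates in two ways: multi-scale sliding windows, and a dense enumeration of every (start, end) frame pair in the spirit of a 2D temporal map. The function names, window sizes, and stride ratio are illustrative assumptions and are not taken from any of the cited methods.

```python
def sliding_window_proposals(num_frames, window_sizes, stride_ratio=0.5):
    """Enumerate (start, end) pairs (end exclusive) for each window size.

    window_sizes and stride_ratio are hypothetical hyper-parameters.
    """
    proposals = []
    for w in window_sizes:
        stride = max(1, int(w * stride_ratio))
        for start in range(0, num_frames - w + 1, stride):
            proposals.append((start, start + w))
    return proposals


def all_start_end_pairs(num_frames):
    """Every (start, end) frame pair with start <= end (dense enumeration)."""
    return [(s, e) for s in range(num_frames) for e in range(s, num_frames)]


if __name__ == "__main__":
    print(len(sliding_window_proposals(8, [2, 4])))
    print(len(all_start_end_pairs(200)))
```

Under these assumptions, a dense enumeration over a 200-frame video yields 200 x 201 / 2 = 20,100 candidates, each of which must be matched against the query; this is the ranking cost that adaptive generation aims to reduce.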
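The bottom-up decoding step described above can be sketched as follows. Given per-frame start and end probabilities, one common choice (an assumption here, not necessarily the rule used by any specific cited method) is to pick the pair that maximizes the product of the two probabilities subject to the start not following the end. The function name `select_boundaries` is hypothetical.

```python
def select_boundaries(start_probs, end_probs):
    """Return ((start, end), score) maximizing start_probs[s] * end_probs[e]
    over all pairs with s <= e. Assumes both lists have the same length."""
    best_pair = (0, 0)
    best_score = -1.0
    for s, p_start in enumerate(start_probs):
        for e in range(s, len(end_probs)):
            score = p_start * end_probs[e]
            if score > best_score:
                best_score = score
                best_pair = (s, e)
    return best_pair, best_score


if __name__ == "__main__":
    start = [0.1, 0.6, 0.2, 0.1]
    end = [0.5, 0.1, 0.1, 0.3]
    print(select_boundaries(start, end))
```

In this toy input, the most likely end frame taken on its own (frame 0) precedes the most likely start frame (frame 1), so the constraint is what keeps the decoded segment valid. Note that no feature of the frames between the two boundaries enters the decision, which is the missing segment-level interaction discussed above.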
3 The Proposed Method
Given an untrimmed video and a sentence query, the TSLV task aims to localize the start and end timestamps of the specific video segment referred to by the sentence query. We focus on addressing this task by adaptively generating proposals. To this end, we propose a binary classification module to filter out the redundant responses on background frames. Then, each foreground frame together with its regressed start-end boundaries is taken as a generated segment proposal. In this way, the number of generated proposals is much smaller than the number of pre-defined ones, making the model more efficient. In addition, a proposal graph is developed to refine proposal features by learning their higher-level interactions. Finally, the confidence score and boundary offset are predicted for each proposal. Figure 2 illustrates the overall architecture of our APGN.
3.2 Feature Encoders
Video encoder. Given a video , we represent it as , where is the -th frame and is the length of the entire video. We first extract the features by a pre-trained network, and then employ a self-attention Vaswani et al. (2017) module to capture the long-range dependencies among video frames. We also utilize a Bi-GRU Chung et al. (2014) to learn the sequential characteristic. The final video features are denoted as , where is the feature dimension.
Query encoder. Given a query , where is the -th word and is the length of the query. Following previous works Zhang et al. (2019); Zeng et al. (2020), we first generate the word-level embeddings using GloVe Pennington et al. (2014), and also employ a self-attention module and a Bi-GRU layer to further encode the query features as .
Video-Query interaction. After obtaining the encoded features , we utilize a co-attention mechanism Lu et al. (2019) to capture the cross-modal interactions between video and query features. Specifically, we first calculate the similarity scores between and as:
where projects the query features into the same latent space as the video. Then, we compute two attention weights as:
where and are the row- and column-wise softmax results of , respectively. We compose the final query-guided video representation by learning its sequential features as follows:
where , denotes the BiGRU layers, is the concatenate operation, and is the element-wise multiplication.
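Because the equations of this subsection are not reproduced in the text, the following Python sketch shows one plausible instance of the described interaction: a similarity matrix between frames and projected query words, row- and column-wise softmax, two attended feature sets, and a fusion by concatenation and element-wise multiplication. The exact form of the attention terms, the projection matrix `w_s`, and all function names are assumptions; the final BiGRU is omitted.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def co_attention(video, query, w_s):
    """video: T x D, query: N x D, w_s: D x D (hypothetical projection).

    Returns T fused vectors [v; a; v*a; v*b] of size 4D (assumed fusion).
    """
    q_proj = matmul(query, w_s)                                   # N x D
    scores = matmul(video, transpose(q_proj))                     # T x N
    s_row = [softmax(row) for row in scores]                      # softmax over words
    s_col = transpose([softmax(c) for c in transpose(scores)])    # softmax over frames
    a = matmul(s_row, q_proj)                                     # T x D
    b = matmul(matmul(s_row, transpose(s_col)), video)            # T x D
    fused = []
    for v, a_t, b_t in zip(video, a, b):
        va = [x * y for x, y in zip(v, a_t)]
        vb = [x * y for x, y in zip(v, b_t)]
        fused.append(v + a_t + va + vb)
    return fused


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    query = [[1.0, 0.0], [1.0, 0.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(len(co_attention(video, query, identity)[0]))
```

In a full model the fused sequence would then be passed through the BiGRU layers mentioned above to produce the query-guided video representation.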
3.3 Proposal Generation
Given the query-guided video features , we aim to generate the proposal tuple based on each foreground frame , where denotes the distances from frame to the starting and ending segment boundaries, respectively. To this end, we first perform binary classification over all frames to distinguish foreground from background frames, and then treat the foreground frames as positive samples and regress the segment boundaries on them to obtain the generated proposals.
In the TSLV task, most videos are more than two minutes long, while the annotated target segments only range from several seconds to one minute (e.g., on the ActivityNet Captions dataset). Therefore, the background frames introduce considerable noise that may disturb accurate segment localization. To alleviate this, we first classify the background frames and filter out their responses in the subsequent regression. Using the annotations to separate foreground from background frames, we design a binary classification module with three fully-connected (FC) layers to predict the class of each video frame. Considering the unbalanced foreground/background distribution, we formulate the balanced binary cross-entropy loss as:
where are the numbers of foreground and background frames. is the number of total video frames. Therefore, we can differentiate between frames from foreground and background during both training and testing.
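As the loss equation itself is not reproduced in the text, the sketch below shows one common way to balance a binary cross-entropy using the quantities named above (the numbers of foreground and background frames and the total number of frames). Weighting each class by the relative frequency of the other class is an assumption, as are the function name and the `eps` stabilizer.

```python
import math


def balanced_bce(probs, labels, eps=1e-8):
    """Class-balanced binary cross-entropy over the frames of one video.

    probs:  predicted foreground probability per frame.
    labels: 1 for foreground, 0 for background.
    Assumed weighting: foreground terms are scaled by N_bg / T and
    background terms by N_fg / T, so the rarer class counts more.
    """
    total = len(labels)
    n_fg = sum(labels)
    n_bg = total - n_fg
    w_fg = n_bg / total
    w_bg = n_fg / total
    loss = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            loss -= w_fg * math.log(p + eps)
        else:
            loss -= w_bg * math.log(1.0 - p + eps)
    return loss / total


if __name__ == "__main__":
    print(balanced_bce([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 0]))
```

With one foreground frame among four and uninformative predictions, the single foreground frame contributes as much to this loss as the three background frames combined, which is the intended balancing effect.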
Boundary regression. With the query-guided video representation and the predicted binary sequence of 0-1, we then design a boundary regression module to predict the distance from each foreground frame to the start (or end) frame of the video segment that corresponds to the query. We implement this module by three 1D convolution layers with two output channels. Given the predicted distance pair and ground-truth distance , we define the regression loss as:
where computes the Intersection over Union (IoU) score between the predicted segment and its ground-truth. After that, we can represent the generated proposal as tuples based on the regression results of the foreground frames.
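The proposal generation step can be sketched as follows: each foreground frame and its two regressed distances define one segment, and the regression is supervised through the IoU with the ground-truth segment. The loss form `1 - IoU` averaged over foreground frames is an assumption, since the equation is not reproduced here; all function names are hypothetical.

```python
def temporal_iou(seg_a, seg_b):
    """IoU of two 1D segments given as (start, end)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def generate_proposals(foreground_mask, distances):
    """Emit (t, start, end) for every frame t predicted as foreground.

    distances[t] = (d_start, d_end) are the regressed distances from
    frame t to the segment boundaries.
    """
    proposals = []
    for t, (is_fg, (d_start, d_end)) in enumerate(zip(foreground_mask, distances)):
        if is_fg:
            proposals.append((t, t - d_start, t + d_end))
    return proposals


def iou_regression_loss(proposals, gt_segment):
    """Mean of (1 - IoU) over generated proposals (assumed loss form)."""
    if not proposals:
        return 0.0
    losses = [1.0 - temporal_iou((s, e), gt_segment) for _, s, e in proposals]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    mask = [0, 1, 1, 0]
    dists = [(0, 0), (1, 2), (2, 1), (0, 0)]
    props = generate_proposals(mask, dists)
    print(props, iou_regression_loss(props, (1, 3)))
```

Only the two foreground frames produce proposals in this example, so the number of candidates scales with the foreground length rather than with the number of possible (start, end) pairs.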
3.4 Proposal Consolidation
So far, we have generated proposals that are far fewer in number than the pre-defined ones in existing top-down frameworks, making the final scoring and ranking process much more efficient. To further refine the proposal features for more accurate segment localization, we explicitly model higher-order interactions between the generated proposals to learn their relations. As shown in Figure 3, proposal 1 and proposal 2 contain the same semantics of “blue” and “hops”, so we need to model their positional distance to distinguish them and refine their features for a better understanding of the phrase “second time”. Likewise, for proposals that are local neighbors (proposals 2 and 3), we have to learn their semantic distance to refine their representations. Therefore, in our APGN, we first encode each proposal feature with both positional embedding and frame-wise semantic features, and then define a graph convolutional network (GCN) over the proposals for proposal refinement.
Proposal encoder. For each proposal tuple , we represent its segment boundary as . Before aggregating the features of its contained frames within this segment boundary, we first concatenate a position embedding to each frame-wise feature , in order to inject position information on frame as follows:
where denotes the position embedding of the -th position, and is the dimension of . We follow Vaswani et al. (2017) and use the sine and cosine functions of different frequencies to compose position embeddings:
are the even and odd indices of the position embedding. In this way, each dimension of the positional encoding corresponds to a sinusoid, allowing the model to easily learn to attend to absolute positions. Given the frame features and a proposal segment , we encode the vector feature of the -th proposal by aggregating the features of the frames contained in the segment as:
where each MLP has two FC layers, and denotes the max-pooling. The frames of each proposal are independently processed by the first MLP before being pooled (channel-wise) into a single feature vector and passed to the second MLP, where information from different frames is further combined. Thus, we can represent the encoded proposal feature as .
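A minimal Python sketch of the proposal encoder follows. The position embedding uses the sinusoidal form of Vaswani et al. (2017), in which even indices take a sine and odd indices a cosine of the position scaled by 10000^(2i/d). The two MLPs are omitted for brevity, so the sketch only shows the position concatenation and the channel-wise max-pooling; the function names and the inclusive segment convention are assumptions.

```python
import math


def position_embedding(pos, dim):
    """Sinusoidal embedding: sin on even indices, cos on odd indices."""
    pe = [0.0] * dim
    for k in range(0, dim, 2):
        angle = pos / (10000 ** (k / dim))
        pe[k] = math.sin(angle)
        if k + 1 < dim:
            pe[k + 1] = math.cos(angle)
    return pe


def encode_proposal(frame_feats, start, end, pos_dim):
    """Concatenate a position embedding to every frame in [start, end]
    (inclusive, assumed) and max-pool channel-wise into one vector."""
    augmented = [
        frame_feats[t] + position_embedding(t, pos_dim)
        for t in range(start, end + 1)
    ]
    return [max(channel) for channel in zip(*augmented)]


if __name__ == "__main__":
    feats = [[1.0, 5.0], [3.0, 2.0], [0.0, 9.0]]
    print(encode_proposal(feats, 0, 1, pos_dim=2))
```

Because the position channels are pooled together with the semantic channels, two proposals with near-identical content but different temporal locations receive different encodings, which is what the “second time” example relies on.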
Proposal graph. We construct a graph over the proposal features , where each node of the graph is a proposal associated with both positional and semantic features. We fully connect all node pairs, and define the relation between each proposal pair for edge convolution Wang et al. (2018) as:
where and are learnable parameters. We update each proposal feature to as follows:
This GCN module consists of stacked graph convolutional layers. After the above proposal consolidation with the graph, we obtain the refined proposal features.
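The edge convolution over a fully connected proposal graph can be sketched as below. The sketch follows the standard EdgeConv form of Wang et al. (2018), where the edge feature is a ReLU of a linear map of the neighbor difference plus a linear map of the node itself, and the node update is a channel-wise maximum over its edges. Whether APGN uses exactly this variant is an assumption; `theta` and `phi` stand for the two learnable matrices and are hypothetical names.

```python
def linear(weight, vec):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def edge_conv_layer(nodes, theta, phi):
    """One EdgeConv layer on a fully connected graph (no self-loops).

    e_ij = ReLU(theta @ (x_j - x_i) + phi @ x_i);  x_i' = max_j e_ij.
    """
    updated = []
    for i, x_i in enumerate(nodes):
        self_term = linear(phi, x_i)
        edge_feats = []
        for j, x_j in enumerate(nodes):
            if i == j:
                continue
            diff = [b - a for a, b in zip(x_i, x_j)]
            rel = linear(theta, diff)
            edge_feats.append([max(0.0, r + s) for r, s in zip(rel, self_term)])
        if not edge_feats:  # single-node graph: fall back to the self term
            edge_feats.append([max(0.0, s) for s in self_term])
        updated.append([max(channel) for channel in zip(*edge_feats)])
    return updated


def proposal_graph(nodes, layers):
    """Stack several EdgeConv layers; layers is a list of (theta, phi)."""
    for theta, phi in layers:
        nodes = edge_conv_layer(nodes, theta, phi)
    return nodes


if __name__ == "__main__":
    eye = [[1.0, 0.0], [0.0, 1.0]]
    print(proposal_graph([[1.0, 0.0], [0.0, 1.0]], [(eye, eye)]))
```

Since every proposal is connected to every other, one layer costs a number of edge evaluations quadratic in the number of proposals, which stays affordable only because the generated proposal set is small.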
3.5 Localization Head
After proposal consolidation, we feed the refined features into two separate heads to predict their confidence scores and boundary offsets for proposal ranking and refinement. Specifically, we employ two MLPs on each feature as:
where is the confidence score, and are the offsets. Therefore, the final predicted segment of proposal can be represented as . To learn the confidence scoring rule, we first compute the IoU score between each proposal segment and the ground truth, and then adopt the alignment loss function below:
Given the ground-truth boundary offsets of proposal , we also fine-tune its offsets by a boundary loss as:
where denotes the smooth L1 loss function.
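The two localization-head losses can be sketched as follows. The smooth L1 function is the standard one (quadratic below a threshold, linear above it). The alignment loss is written here as a cross-entropy between the predicted confidence and the proposal's IoU with the ground truth, which is a common choice but an assumption with respect to this paper; `beta`, `eps`, and the function names are hypothetical.

```python
import math


def smooth_l1(x, beta=1.0):
    """Standard smooth L1: 0.5 * x^2 / beta if |x| < beta, else |x| - 0.5 * beta."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def alignment_loss(scores, ious, eps=1e-8):
    """Cross-entropy between confidence scores and IoU targets (assumed form)."""
    total = 0.0
    for r, o in zip(scores, ious):
        total -= o * math.log(r + eps) + (1.0 - o) * math.log(1.0 - r + eps)
    return total / len(scores)


def boundary_loss(pred_offsets, gt_offsets):
    """Mean smooth L1 over (start, end) offset pairs."""
    total = 0.0
    for (p_s, p_e), (g_s, g_e) in zip(pred_offsets, gt_offsets):
        total += smooth_l1(p_s - g_s) + smooth_l1(p_e - g_e)
    return total / len(pred_offsets)


if __name__ == "__main__":
    print(boundary_loss([(0.5, -2.0)], [(0.0, 0.0)]))
    print(alignment_loss([0.5], [1.0]))
```

Using the IoU as a soft target lets proposals that partially overlap the ground truth receive intermediate confidence instead of a hard 0/1 label.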
Finally, our APGN model is trained end-to-end from scratch using the multi-task loss :
4.1 Datasets and Evaluation
ActivityNet Captions. It is a large dataset Krishna et al. (2017) which contains 20k videos with 100k language descriptions. This dataset focuses on more complicated human activities in daily life. Following the public split, we use 37,417, 17,505, and 17,031 sentence-video pairs for training, validation, and testing, respectively.
TACoS. This dataset Regneri et al. (2013) collects 127 long videos, which are mainly about cooking scenarios and thus lack diversity. We use the same split as Gao et al. (2017), which has 10,146, 4,589, and 4,083 sentence-video pairs for training, validation, and testing, respectively.
Charades-STA. This dataset Gao et al. (2017) consists of 9,848 videos of daily indoor activities. There are 12,408 sentence-video pairs for training and 3,720 pairs for testing.
4.2 Implementation Details
Following Zhang et al. (2020b); Zeng et al. (2020), for video input, we apply a pre-trained C3D network on all three datasets to obtain embedded features. We also extract I3D Carreira and Zisserman (2017) and VGG Simonyan and Zisserman (2014) features on Charades-STA. After that, we apply PCA to reduce the feature dimension to 500 in order to decrease the number of model parameters. We set the video length to 200 for ActivityNet Captions and TACoS, and to 64 for Charades-STA. For sentence input, we utilize the GloVe model to embed each word into a 300-dimensional feature. The dimension is set to 512, is set to 256. The number of graph layers is . We set the batch size to 64. We train our model with the Adam optimizer for 100 epochs. The initial learning rate is set to 0.0001 and is divided by 10 when the loss plateaus. The weights in the loss function are 0.1, 1, 1, 1, and are decided by the weight magnitude.
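The learning-rate rule stated above (start at 0.0001, divide by 10 when the loss plateaus) can be sketched in plain Python as follows. The plateau criterion is not specified in the text, so the `patience` and `min_delta` parameters and the class name `PlateauLR` are assumptions made for illustration.

```python
class PlateauLR:
    """Divide the learning rate by 10 when the loss stops improving.

    patience:  number of non-improving epochs tolerated (hypothetical).
    min_delta: minimum decrease that counts as an improvement (hypothetical).
    """

    def __init__(self, lr=1e-4, factor=0.1, patience=2, min_delta=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's loss and return the (possibly reduced) rate."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


if __name__ == "__main__":
    scheduler = PlateauLR()
    for epoch_loss in [1.0, 0.9, 0.9, 0.9, 0.9]:
        print(scheduler.step(epoch_loss))
```

Deep learning frameworks typically ship an equivalent scheduler; the standalone version is shown only to make the rule explicit.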
4.3 Performance Comparison
Compared methods. We compare our proposed APGN with state-of-the-art methods. We group them into: (1) top-down methods: TGN Chen et al. (2018), CTRL Gao et al. (2017), QSPN Xu et al. (2019), CBP Wang et al. (2020), SCDM Yuan et al. (2019a), CMIN Zhang et al. (2019), and 2DTAN Zhang et al. (2020b). (2) bottom-up methods: GDP Chen et al. (2020), LGI Mun et al. (2020), VSLNet Zhang et al. (2020a), DRN Zeng et al. (2020).
Quantitative comparison. As shown in Tables 1, 2, and 3, our APGN outperforms all the existing methods by a large margin. Specifically, on the ActivityNet Captions dataset, compared to the previous best top-down method 2DTAN, we do not rely on large numbers of pre-defined proposals and outperform it by 4.41%, 2.10%, 1.74%, and 1.23% across all metrics, respectively. Compared to the previous best bottom-up method DRN, our APGN brings significant improvements of 4.28% and 12.89% in the strict “R@1, IoU=0.7” and “R@5, IoU=0.7” metrics, respectively. Although the TACoS videos share similar kitchen backgrounds and cooking objects, our APGN still achieves significant improvements on this dataset. On the Charades-STA dataset, for fair comparison with other methods, we perform experiments with the same features (i.e., VGG, C3D, and I3D) reported in their papers. Our APGN reaches the highest results on all evaluation metrics.
Comparison on efficiency. We compare the efficiency of our APGN with previous methods on a single Nvidia TITAN XP GPU on the TACoS dataset. As shown in Table 4, we achieve much faster processing speeds with relatively fewer learnable parameters. The reason is twofold. First, APGN generates proposals without processing overlapping sliding windows as CTRL does, and generates fewer proposals than methods with pre-defined proposals such as 2DTAN and CMIN, and is thus more efficient. Second, APGN does not apply many convolution layers like 2DTAN or multi-level feature fusion modules like DRN for cross-modal interaction, and thus has fewer parameters.
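For reference, the “R@n, IoU=m” metric used in these comparisons is conventionally the percentage of queries for which at least one of the top-n ranked segments has an IoU with the ground truth above m. A minimal Python sketch under that convention is given below; the strictness of the threshold comparison and the function names are assumptions.

```python
def temporal_iou(seg_a, seg_b):
    """IoU of two 1D segments given as (start, end)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(ranked_predictions, ground_truths, n, iou_threshold):
    """Percentage of queries with a hit among their top-n predictions.

    ranked_predictions[i] is the list of segments for query i, best first.
    """
    hits = 0
    for preds, gt in zip(ranked_predictions, ground_truths):
        if any(temporal_iou(p, gt) > iou_threshold for p in preds[:n]):
            hits += 1
    return 100.0 * hits / len(ground_truths)


if __name__ == "__main__":
    preds = [[(0, 10), (5, 15)], [(0, 2), (20, 30)]]
    gts = [(5, 15), (21, 29)]
    print(recall_at_n(preds, gts, n=1, iou_threshold=0.5))
    print(recall_at_n(preds, gts, n=2, iou_threshold=0.5))
```

Higher IoU thresholds such as 0.7 make the metric stricter, since a prediction must overlap the ground truth more tightly to count as a hit.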
4.4 Ablation Study
Main ablation. As shown in Table 5, we verify the contribution of each part of our model. Starting from the backbone model (Figure 2 (a)), we first implement the baseline model ① by directly adding the top-down localization head (Figure 2 (d)). In this model, we adopt pre-defined proposals as in Zhang et al. (2019). After adding the binary classification module in ②, we find that the classification module effectively filters out redundant pre-defined proposals on the large number of background frames. When further applying adaptive proposal generation in ③, the generated proposals perform better than the pre-defined ones in ②. Note that, in ③, we directly encode proposal-wise features by max-pooling, and the classification module also contributes by filtering out the negative generated proposals. To capture more fine-grained semantics for proposal refinement, we introduce a proposal encoder (model ④) for discriminative feature aggregation and a proposal graph (model ⑤) for proposal-wise feature interaction. Although each of them alone brings only about 1-3% improvement, the performance increases significantly when both are used (model ⑥).
Investigation on the video/query encoder. To investigate whether a Transformer Vaswani et al. (2017) can boost our APGN, we replace the GRU in the video/query encoder with a simple Transformer and observe some improvement. However, it increases the number of model parameters and lowers the speed.
Effect of unbalanced loss. In the binary classification module, we reformulate the standard loss function as a balanced one. As shown in Table 7, the model with the balanced loss improves by 2.04% and 1.51% over the variant without it, which demonstrates that it is important to consider the unbalanced distribution in the classification process.
Investigation on proposal encoder. In the proposal encoder, we discard the positional embedding (w/o position), and also replace the max-pooling with mean-pooling (w/ mean pooling). From Table 8, we can observe that the positional embedding helps to learn the temporal distance (a boost of 2.46% and 1.95%), and that max-pooling aggregates more discriminative features than mean-pooling (a boost of 1.49% and 0.78%).
Investigation on proposal graph. In Table 9, we also analyze the proposal graph. Compared to the w/ edge convolution model Wang et al. (2018), the w/ edge attention variant directly utilizes co-attention Lu et al. (2016) to compute the similarity of each node pair and updates the nodes by a weighted summation, which performs worse than edge convolution.
| binary classification | w/o balanced loss | 46.88 | 27.13 |
| | w/ balanced loss | 48.92 | 28.64 |
| proposal encoder | w/o position | 46.46 | 26.69 |
| | w/ mean pooling | 47.41 | 27.86 |
| | w/ max pooling | 48.92 | 28.64 |
| proposal graph | w/ edge attention | 46.63 | 26.90 |
| | w/ edge convolution | 48.92 | 28.64 |
| graph layer | 1 layer | 47.60 | 27.57 |
Number of graph layers. As shown in Table 9, the model achieves the best result with 2 graph layers, and the performance drops as the number of layers grows. Our analysis is that more graph layers cause the over-smoothing problem Li et al. (2018), since the propagation between the nodes accumulates.
Plug-and-play. Our proposed adaptive proposal generation can serve as a plug-and-play module for existing methods. As shown in Table 10, for top-down methods, we keep their feature encoders and video-query interaction, and add the proposal generation and proposal consolidation before the localization heads. For bottom-up methods, we first replace their regression heads with our proposal generation process and then add the proposal consolidation process. The results show that our proposal generation and proposal consolidation bring large improvements to both types of methods.
4.5 Qualitative Results
To qualitatively validate the effectiveness of our APGN, we display two typical examples in Figure 4. It is challenging to accurately localize the semantics of “for a second time” in the first video, because there are two separate segments in which the same object, “girl in the blue dress”, performs the same activity, “hops”. For comparison, the previous method DRN fails to understand the meaning of the phrase “second time” and grounds both segments. By contrast, our method is able to distinguish these two segments in the temporal dimension thanks to the positional embedding in the developed proposal graph, and thus achieves more accurate localization results. Furthermore, we also display the foreground/background class of each frame in this video. With the help of the proposal consolidation module, the segment proposals of the “first time” are filtered out, and all of the final top-10 ranked positive frames fall in the target segment.
In this paper, we introduce APGN, a new method for temporal sentence localization in videos. Our core idea is to adaptively generate discriminative proposals and achieve both effective and efficient localization. That is, we first introduce binary classification before the boundary regression to identify the background frames, which helps to filter out the corresponding noisy responses. Then, the regressed boundaries on the predicted foreground frames are taken as segment proposals, which removes a large number of poor-quality proposals compared to the pre-defined ones in the top-down framework. We further learn higher-level feature interactions between the generated proposals for refinement via a graph convolutional network. Our framework achieves state-of-the-art performance on three challenging benchmarks, demonstrating the effectiveness of our proposed APGN.
This work was supported in part by the National Key Research and Development Program of China under No. 2018YFB1404102, and the National Natural Science Foundation of China under No. 61972448.
- Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Temporally grounding natural sentence in video. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Rethinking the bottom-up framework for query-based video localization. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Empirical evaluation of gated recurrent neural networks on sequence modeling. In Advances in Neural Information Processing Systems (NIPS).
- Dual encoding for zero-example video retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- TALL: temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Context-aware biaffine localizing network for temporal sentence grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11235–11244.
- Reasoning step-by-step: temporal sentence localization in videos via deep rectification-modulation network. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 1841–1851.
- Jointly cross-and self-modal graph attention network for query-based moment localization. In Proceedings of the ACM International Conference on Multimedia (ACM MM).
- DEBUG: a dense bottom-up grounding approach for natural language video localization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
- Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems (NIPS).
- Local-global video-text interactions for temporal grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Interventional video grounding with dual contrastive learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- GloVe: global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Fine-grained iterative attention network for temporal language localization in videos. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 4280–4288.
- Grounding action descriptions in videos. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
- Proposal-free temporal moment localization of a natural-language query in video using guided attention. In The IEEE Winter Conference on Applications of Computer Vision (WACV).
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Dynamic hand gesture recognition using vision-based approach for human–computer interaction. Neural Computing and Applications.
- Attention is all you need. In Advances in Neural Information Processing Systems (NIPS).
- Temporally grounding language queries in videos by contextual boundary-aware prediction. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics.
- Multilevel language and vision integration for text-to-clip retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Tree-augmented cross-modal encoding for complex-query video retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 1339–1348.
- Deconfounded video moment retrieval with causal intervention. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).
- Semantic conditioned dynamic modulation for temporal sentence grounding in videos. In Advances in Neural Information Processing Systems (NIPS).
- To find where you talk: temporal sentence localization in video with attention based location regression. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33.
- Dense regression network for video grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Span-based localizing network for natural language video localization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
- Learning 2D temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Cross-modal interaction networks for query-based moment retrieval in videos. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).