Temporal sentence grounding (TSG) is an important yet challenging task in video understanding, which has drawn increasing attention over the last few years due to its vast potential applications in video summarization song2015tvsum; chu2015video, video captioning jiang2018recurrent; chen2020learning, and temporal action localization shou2016temporal; zhao2017temporal. As shown in Figure 1, this task aims to ground the video segment most relevant to a given sentence query. It is substantially challenging, as it requires not only modeling the complex multi-modal interactions between video and query features, but also capturing complicated contextual information for their semantic alignment.
Most existing works anne2017localizing; ge2019mac; liu2018attentive; zhang2019man; chen2018temporally; zhang2019cross; liu2018cross; yuan2019semantic; xu2019multilevel exploit a proposal-ranking framework that first generates multiple candidate proposals and then ranks them according to their similarities with the sentence query. These methods rely heavily on the quality of the proposals. Instead of using complex proposals, some recent works rodriguez2020proposal; yuan2019find; chenrethinking; wang2019language; nan2021interventional; mun2020local; zeng2020dense utilize a proposal-free framework that directly regresses the temporal locations of the target segment. Compared to their proposal-ranking counterparts, these works are much more efficient.
Although the above two types of methods have achieved impressive results, we can still observe their performance bottlenecks on rarely appeared video-query samples, as shown in Figure 1. Here, we select certain pairs of video and sentence as rare samples, namely those containing at least one word (noun, verb, or adjective) whose frequency of appearance is less than 10. We observe that all existing models achieve good performance on the common cases, but their performance drops heavily when evaluated on the rare cases. This observation is consistent with the finding that deep networks tend to forget rare cases while learning on an imbalanced and diverse data distribution toneva2018empirical, especially in practical scenarios where the data distribution can be extremely imbalanced. To tackle this challenge, we aim to better match those video-query pairs with rarely appeared word-guided semantics for improved generalization. However, it is hard to strike a balance between the common and rare samples during the dynamic training process.
To this end, in this paper, we propose to learn and memorize discriminative and representative cross-modal shared semantics covering all samples, implemented by a memory-augmented network called the Memory-Guided Semantic Learning Network (MGSL-Net). Given a pair of video and query inputs, we first encode their contextual features individually and then align their semantics with a cross-modal graph convolutional network. After obtaining the aligned video-query feature pair, we design domain-specific persistent memories in both the video and query domains to record the most representative cross-modal shared semantic representations. The learned memories are updated and maintained as a compact dictionary shared by all samples. During training, the memory slots in each domain are dynamically associated with both common and rare samples across mini-batches throughout the whole training stage, alleviating the forgetting issue. During testing, the rare cases can thus be augmented by retrieving the stored semantics, leading to better generalization. Besides, we develop a heterogeneous attention module to integrate the augmented multi-modal features in the video and query domains by considering their contextual inter-modal interactions and video-based self-calibration.
Our main contributions are summarized as follows:
We propose a memory-augmented network MGSL-Net for temporal sentence grounding, by learning and memorizing the discriminative and representative cross-modal shared semantics covering all cases. The memory is dynamically associated with both the common and rare samples seen across mini-batches during the whole training, alleviating the forgetting issue on rare samples.
To obtain more domain-specific semantic contexts, we design the memory items in both video and query domains to be persistently read and updated. A heterogeneous attention module is further developed to integrate the enhanced multi-modal features in two domains.
The proposed MGSL-Net achieves state-of-the-art performance on three benchmarks (ActivityNet Caption, TACoS, and Charades-STA), boosting performance by a large margin not only on the entire datasets but also on the rarely appeared pairwise samples, with limited computation and memory cost.
Temporal sentence grounding. Various algorithms anne2017localizing; ge2019mac; liu2018attentive; zhang2019man; chen2018temporally; qu2020fine; liu2021progressively; liu2018cross; liu2021adaptive; liu2022exploring; liu2020jointly; liu2020reasoning have been proposed within the scan-and-ranking framework, which first generates multiple segment proposals and then ranks them according to their similarity with the query to select the best-matching one. Some of them gao2017tall; anne2017localizing apply sliding windows to generate proposals and subsequently integrate the query with segment representations via a matrix operation. To improve the quality of the proposals, later works wang2019temporally; zhang2019man; yuan2019semantic; zhang2019cross; cao2021pursuit directly integrate sentence information with each fine-grained video clip unit, and predict the scores of candidate segments by gradually merging the fused feature sequence over time. Instead of generating complex proposals, recent works rodriguez2020proposal; yuan2019find; chenrethinking; wang2019language; nan2021interventional; mun2020local; zeng2020dense; liu2022unsupervised directly regress the temporal locations of the target segment. They do not rely on segment proposals, and directly select the starting and ending frames by leveraging cross-modal interactions between video and query. Specifically, they either regress the start/end timestamps based on the entire video representation yuan2019find; mun2020local, or predict at each frame whether that frame is a start or end boundary rodriguez2020proposal; chenrethinking; zeng2020dense. Although the above methods achieve strong performance, they tend to forget the rare cases easily while learning on imbalanced and diverse data distributions. Different from them, we focus on storing and reading a cross-modal semantic memory to enhance the multi-modal feature representations.
Memory Networks. Memory-based approaches have been explored for various problems. NTM graves2014neural improves the generalization ability of a network by introducing an attention-based memory module. Memory networks such as vaswani2017attention; sukhbaatar2015end have external memory to which information can be written and from which it can be read. Xiong et al. xiong2016dynamic further extend this idea into dynamic memory networks. Different from these unimodal memory models, we propose a cross-modal shared memory that can alternately interact with multiple data modalities. Although other works ma2018visual; huang2019acmm also extend memory networks to multi-modal settings, most of them are episodic memory networks that are wiped during each forward pass. In contrast, our memory model persistently memorizes cross-modal semantic representations in multi-modal domains with aggregation over the whole training procedure, to better handle imbalanced data.
Given an untrimmed video $V$ and a sentence query $Q$, we represent the video frame-by-frame as $V=\{v_t\}_{t=1}^{T}$, where $v_t$ is the $t$-th frame and $T$ is the total number of frames. Similarly, the query with $N$ words is denoted word-by-word as $Q=\{q_n\}_{n=1}^{N}$. The TSG task aims to localize the start and end timestamps $(\tau_s,\tau_e)$ of the specific segment in video $V$ that corresponds to the semantics of query $Q$.
In this section, we present the Memory-Guided Semantic Learning Network (MGSL-Net) for the TSG task. The overall pipeline, shown in Figure 2, includes four main steps: first, we encode both video and query features with contextual information and align them with a cross-modal graph convolutional network; then, we utilize persistent memory items to learn and memorize the cross-modal shared semantic representations in both the video and query domains; after obtaining the memory-enhanced multi-modal features, we develop a heterogeneous attention module that considers inter- and intra-modality interactions for multi-modal feature integration; finally, the grounding heads localize the target segment.
Cross-modal Feature Alignment
Video encoder. For video encoding, we first extract the frame-wise features with a pre-trained C3D network tran2015learning, and then employ a self-attention vaswani2017attention module to capture the long-range dependencies among video frames. We further utilize a BiLSTM Mike1997 to learn the sequential characteristics. We denote the extracted video features as $V=\{v_t\}_{t=1}^{T}\in\mathbb{R}^{T\times D}$, where $D$ is the feature dimension.
Query encoder. For query encoding, we first generate the word-level features using Glove embeddings pennington2014glove, and also employ a self-attention module and a BiLSTM layer to further encode the query features as $Q=\{q_n\}_{n=1}^{N}\in\mathbb{R}^{N\times D}$.
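For concreteness, the self-attention step shared by both encoders can be sketched as plain scaled dot-product attention over a feature sequence. The following is a minimal NumPy sketch, not the authors' implementation; the BiLSTM and pre-trained feature extractors are omitted, and the projection matrices `W_q`, `W_k`, `W_v` and all shapes are illustrative assumptions:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a feature sequence.
    X: (T, D) frame (or word) features; W_q/W_k/W_v: (D, D) projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(X.shape[1])            # (T, T) pairwise affinities
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                 # row-wise softmax
    return w @ V                                      # long-range aggregation

rng = np.random.default_rng(0)
T, D = 5, 8
X = rng.standard_normal((T, D))
W_q, W_k, W_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
```

The same operation is applied to word features before the BiLSTM layer, letting every position attend to every other.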
Semantic Alignment. Considering that the obtained video and query representations are intrinsically heterogeneous, we propose a cross-modal graph convolutional network kipf2016semi to explicitly perform cross-modal alignment. Specifically, we first construct two adjacency matrices by measuring the cross-modal similarity between each frame-word pair in two directions as:
$$A^{v} = \mathrm{softmax}\big(Q (V W_v)^{\top}\big), \qquad A^{q} = \mathrm{softmax}\big(V (Q W_q)^{\top}\big),$$
where $W_v, W_q$ are two modality-specific linear mappings that project one modality's features into the latent space of the other, and the row-wise softmax yields the normalized adjacency matrices $A^{v}\in\mathbb{R}^{N\times T}$ and $A^{q}\in\mathbb{R}^{T\times N}$. We then obtain the aligned video representation $\widehat{V}$ and the aligned query representation $\widehat{Q}$ by:
$$\widehat{V} = \sigma(A^{v} V W_1), \qquad \widehat{Q} = \sigma(A^{q} Q W_2),$$
where $W_1, W_2$ are the weight matrices and $\sigma(\cdot)$ is an activation function. $\widehat{V}\in\mathbb{R}^{N\times D}$ has the same size as the query feature $Q$, and the two are semantically aligned. The $n$-th row $\widehat{v}_n$ of $\widehat{V}$ is an aggregated representation weighted by the cross-modal similarities between the $n$-th word and all the frames; therefore, $\widehat{v}_n$ can be viewed as a visual representation of the $n$-th word, sharing the same semantic meaning as the word $q_n$. Similarly, $\widehat{Q}\in\mathbb{R}^{T\times D}$ is semantically aligned with $V$, and $\widehat{q}_t$ can be viewed as a textual representation of the $t$-th frame, sharing the same semantics as the frame $v_t$.
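The alignment step can be sketched as follows. This is a hedged NumPy illustration, assuming a row-wise softmax normalization and a `tanh` activation; the exact normalization, activation, and weight shapes in the paper may differ:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_align(V, Q, W_v, W_q, W1, W2):
    """Cross-modal graph alignment sketch.
    V: (T, D) frame features, Q: (N, D) word features.
    Returns V_hat (N, D): visual reps of words, Q_hat (T, D): textual reps of frames."""
    A_v = softmax(Q @ (V @ W_v).T, axis=1)  # (N, T): word-to-frame affinities
    A_q = softmax(V @ (Q @ W_q).T, axis=1)  # (T, N): frame-to-word affinities
    V_hat = np.tanh(A_v @ V @ W1)           # each row aggregates all frames for one word
    Q_hat = np.tanh(A_q @ Q @ W2)           # each row aggregates all words for one frame
    return V_hat, Q_hat

rng = np.random.default_rng(0)
T, N, D = 6, 4, 8
V = rng.standard_normal((T, D))
Q = rng.standard_normal((N, D))
Ws = [rng.standard_normal((D, D)) * 0.1 for _ in range(4)]
V_hat, Q_hat = cross_modal_align(V, Q, *Ws)
```

Note that the output `V_hat` has one row per word and `Q_hat` one row per frame, matching the alignment described above.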
Based on the aligned representation pairs $(V, \widehat{Q})$ in the video domain and $(Q, \widehat{V})$ in the query domain, as shown in Figure 2, we propose a memory network to learn and memorize the cross-modal shared semantic features in each domain.
Memory Representation. The domain-specific cross-modal shared semantic memories in the video and query domains are designed as matrices $M^{v}\in\mathbb{R}^{K_v\times D}$ and $M^{q}\in\mathbb{R}^{K_q\times D}$, respectively. Here, $K_v$ and $K_q$ are hyper-parameters that define the number of memory slots, and $D$ is the feature dimension. Each memory item $m^{v}_{i}$ or $m^{q}_{j}$ can be updated by intra-domain features with similar semantic meanings, as well as read out to enhance previously obtained intra-domain features.
Given the aligned frame-word feature pair $(v_t, \widehat{q}_t)$ in the video domain, we aim to interact it with each memory item in $M^{v}$ to read and store their shared cross-modal semantic features. Before the interaction, we first utilize several linear layers to map $v_t$ and $\widehat{q}_t$ into a memory read key, write key, erase value, and write value, respectively. We denote these items of $v_t$ as $(v^{r}_{t}, v^{w}_{t}, v^{e}_{t}, v^{a}_{t})$, and those of $\widehat{q}_t$ as $(\widehat{q}^{\,r}_{t}, \widehat{q}^{\,w}_{t}, \widehat{q}^{\,e}_{t}, \widehat{q}^{\,a}_{t})$. In the query domain, we likewise map the aligned features $(q_n, \widehat{v}_n)$ into $(q^{r}_{n}, q^{w}_{n}, q^{e}_{n}, q^{a}_{n})$ and $(\widehat{v}^{\,r}_{n}, \widehat{v}^{\,w}_{n}, \widehat{v}^{\,e}_{n}, \widehat{v}^{\,a}_{n})$. Details of how these items are used to update and read the memory are given below.
Updating memory. Given the video-domain aligned pair $(v_t, \widehat{q}_t)$ and the query-domain aligned pair $(q_n, \widehat{v}_n)$, we determine which memory items in $M^{v}$ and $M^{q}$ to write to and erase. Specifically, we first calculate the memory addressing weights according to the similarity between each input feature and the corresponding domain-specific memory as:
$$\alpha^{v}_{t,i} = \mathrm{softmax}_{i}\big(\cos(v^{w}_{t}, m^{v}_{i})\big), \qquad \alpha^{q}_{n,j} = \mathrm{softmax}_{j}\big(\cos(q^{w}_{n}, m^{q}_{j})\big),$$
where $v^{w}_{t}$ and $q^{w}_{n}$ are the memory write keys and $\cos(\cdot,\cdot)$ measures the cosine similarity. Then, we can selectively update the memory items by adding new semantic features with the write value while deleting old memory with the erase value as:
$$m^{v}_{i} \leftarrow m^{v}_{i}\odot\big(\mathbf{1} - \alpha^{v}_{t,i}\, v^{e}_{t}\big) + \alpha^{v}_{t,i}\, v^{a}_{t}, \qquad m^{q}_{j} \leftarrow m^{q}_{j}\odot\big(\mathbf{1} - \alpha^{q}_{n,j}\, q^{e}_{n}\big) + \alpha^{q}_{n,j}\, q^{a}_{n},$$
where the erase values $v^{e}_{t}$ and $q^{e}_{n}$ are computed with a sigmoid function, and $\odot$ denotes element-wise multiplication. In the video domain, $M^{v}$ first updates its memory items with the information extracted from the frame and then from the word; in the query domain, $M^{q}$ first updates with the information extracted from the word and then from the frame. In practice, the update order can be swapped and does not have a significant impact on the final performance.
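A single write step can be sketched as below. This is a minimal NumPy sketch under stated assumptions (softmax-normalized cosine addressing and a sigmoid erase gate); the names `update_memory`, `write_key`, `write_val`, and `erase_val` are illustrative, not the authors' API:

```python
import numpy as np

def update_memory(M, write_key, write_val, erase_val):
    """One write step for a single input feature.
    M: (K, D) memory; write_key/write_val/erase_val: (D,) items of the feature."""
    # cosine-similarity addressing, normalized with a softmax over slots
    sim = (M @ write_key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(write_key) + 1e-8)
    alpha = np.exp(sim - sim.max()); alpha /= alpha.sum()    # (K,) addressing weights
    erase = 1.0 / (1.0 + np.exp(-erase_val))                 # sigmoid erase gate, (D,)
    M = M * (1.0 - alpha[:, None] * erase[None, :])          # erase old content
    return M + alpha[:, None] * write_val[None, :]           # add new content

rng = np.random.default_rng(0)
K, D = 4, 8
M = rng.standard_normal((K, D))
M = update_memory(M, rng.standard_normal(D), rng.standard_normal(D), rng.standard_normal(D))
```

Applying this step with the frame items and then the word items (or vice versa) realizes the per-domain update order described above.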
Reading memory. During memory reading, we retrieve the most relevant items from the domain-specific memories $M^{v}$ and $M^{q}$ to enhance the representations of $(v_t, \widehat{q}_t)$ and $(q_n, \widehat{v}_n)$, respectively. To this end, we first compute the cross-modal read weights $\beta^{v}_{t,i}$ and $\beta^{q}_{n,j}$ by comparing the read keys with the memory items, in the same manner as the addressing weights above. Then we read the memory by regarding the obtained read keys of $v_t$ and $q_n$ as queries:
$$r^{v}_{t} = \sum_{i=1}^{K_v} \beta^{v}_{t,i}\, m^{v}_{i}, \qquad r^{q}_{n} = \sum_{j=1}^{K_q} \beta^{q}_{n,j}\, m^{q}_{j},$$
where $r^{v}_{t}$ and $r^{q}_{n}$ are the read vectors, which can be regarded as memory-enhanced representations of the video and query domains, respectively.
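The read operation is simply an addressing-weighted sum over slots. A minimal NumPy sketch, assuming the same cosine-plus-softmax addressing as the write step (function and variable names are illustrative):

```python
import numpy as np

def read_memory(M, read_key):
    """Return the read vector: an addressing-weighted sum over memory slots.
    M: (K, D) memory; read_key: (D,) read key of a frame or word feature."""
    sim = (M @ read_key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(read_key) + 1e-8)
    beta = np.exp(sim - sim.max()); beta /= beta.sum()  # (K,) read weights
    return beta @ M                                     # (D,) read vector

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 8))
r = read_memory(M, rng.standard_normal(8))
```

At test time this retrieval is what augments rare samples: even an unusual input reads out the stored shared semantics of its nearest slots.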
Heterogeneous Multi-Modal Integration
After obtaining the memory-enhanced representations from both domains, we generate the new video representation $\widetilde{V}$ and new query representation $\widetilde{Q}$ by concatenation, where $\widetilde{V} = [V; R^{v}]$ and $\widetilde{Q} = [Q; R^{q}]$, with $R^{v}$ and $R^{q}$ stacking the read vectors $r^{v}_{t}$ and $r^{q}_{n}$. To further integrate these two representations, we develop a heterogeneous attention module that considers their inter- and intra-modality interactions. In particular, we additionally take the original video feature $V$ as global context to calibrate the learned memory contents in $\widetilde{V}$. As shown in Figure 3, the proposed heterogeneous attention mechanism first utilizes three linear layers to map the three representations $(\widetilde{V}, \widetilde{Q}, V)$ into the same latent space, and then exploits a self-attention unit to capture the semantic-aware intra-modality relations among the enhanced frame-frame and word-word pairs. After that, the inter-modality relations are captured by interacting the features of each frame-word pair. To further calibrate the memory-wise contents, we take $V$ as the global signal to supervise the enhanced video feature $\widetilde{V}$. These three attention units are combined in a modular way to define the heterogeneous attention mechanism, and all units are based on dot-product attention. Finally, the integrated feature $F$ is obtained by concatenating all the output features.
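One possible wiring of the three dot-product attention units is sketched below. The exact composition in the paper is not fully specified here, so this NumPy sketch is an assumption: the linear mappings are omitted, equal feature dimensions are assumed, and the way the units feed each other (`inter` consuming `intra_q`, `calib` consuming `intra_v`) is illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[1]), axis=1) @ V

def heterogeneous_attention(V_new, Q_new, V_orig):
    """V_new: (T, D) memory-enhanced video, Q_new: (N, D) memory-enhanced query,
    V_orig: (T, D) original video features used as the calibration signal."""
    intra_v = dot_attention(V_new, V_new, V_new)      # frame-frame self-attention
    intra_q = dot_attention(Q_new, Q_new, Q_new)      # word-word self-attention
    inter = dot_attention(V_new, intra_q, intra_q)    # frame-word inter-modality unit
    calib = dot_attention(intra_v, V_orig, V_orig)    # calibration with global video context
    return np.concatenate([intra_v, inter, calib], axis=1)  # (T, 3D) integrated feature

rng = np.random.default_rng(0)
F = heterogeneous_attention(rng.standard_normal((6, 8)),
                            rng.standard_normal((4, 8)),
                            rng.standard_normal((6, 8)))
```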
Taking the integrated feature $F$ as input, we process the frame-wise features with the grounding module, which consists of three components: a boundary regression head, a confidence scoring head, and an IoU regression head. Since the TSG task aims to localize a specific segment, the boundary regression head is designed to predict a temporal bounding box at each frame. To select the box that best matches the query, the confidence scoring head predicts scores indicating whether the content in each box matches the query semantically. An IoU regression head is also utilized to directly estimate the IoU between each box and the ground-truth segment.
Boundary regression head. We implement this head as two 1D convolution layers with two output channels, and we only assign regression targets to positive frames. If a location $t$ falls inside the ground-truth segment $(\tau_s, \tau_e)$, its regression targets are $(d^{s}_{t}, d^{e}_{t})$, where $d^{s}_{t} = t - \tau_s$ and $d^{e}_{t} = \tau_e - t$. For the predicted offsets $(\hat{d}^{s}_{t}, \hat{d}^{e}_{t})$ and the ground truth $(d^{s}_{t}, d^{e}_{t})$, we define the boundary loss $\mathcal{L}_{b}$ as:
$$\mathcal{L}_{b} = \frac{1}{N_{pos}} \sum_{t=1}^{T} \mathbb{1}_{t} \Big[ \mathcal{S}\big(\hat{d}^{s}_{t} - d^{s}_{t}\big) + \mathcal{S}\big(\hat{d}^{e}_{t} - d^{e}_{t}\big) + \mathcal{L}_{IoU}\big(\hat{b}_{t}, b_{t}\big) \Big],$$
where $\mathcal{S}$ is a smooth $L_1$ loss and the second term $\mathcal{L}_{IoU}$ is an IoU loss between the predicted box $\hat{b}_{t}$ and the ground-truth box $b_{t}$. $\mathbb{1}_{t}$ is the indicator function, being 1 if frame $t$ is positive and 0 otherwise, and $N_{pos}$ is the number of positive frames.
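The boundary loss can be computed directly from the per-frame offsets, since two boxes anchored at the same frame have an IoU fully determined by their offsets. The sketch below is a NumPy illustration under the assumption of a $1-\mathrm{IoU}$ form for the IoU loss (a $-\log \mathrm{IoU}$ variant is also common) and nonnegative offsets at positive frames:

```python
import numpy as np

def smooth_l1(x):
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def boundary_loss(dhat_s, dhat_e, d_s, d_e, pos):
    """dhat_*/d_*: (T,) predicted / ground-truth start-end offsets (nonnegative
    for positive frames); pos: (T,) 0/1 indicator of positive frames."""
    reg = smooth_l1(dhat_s - d_s) + smooth_l1(dhat_e - d_e)
    # IoU of two intervals sharing the anchor frame t, computed from offsets
    inter = np.minimum(dhat_s, d_s) + np.minimum(dhat_e, d_e)
    union = np.maximum(dhat_s, d_s) + np.maximum(dhat_e, d_e)
    iou_loss = 1.0 - inter / np.maximum(union, 1e-8)
    n_pos = max(float(pos.sum()), 1.0)
    return float((pos * (reg + iou_loss)).sum() / n_pos)

d_s = np.array([1.0, 2.0, 3.0, 0.0]); d_e = np.array([3.0, 2.0, 1.0, 0.0])
pos = np.array([1.0, 1.0, 1.0, 0.0])
loss = boundary_loss(d_s + 0.5, d_e, d_s, d_e, pos)
```

A perfect prediction yields zero loss, while the small start-offset error above produces a small positive loss on the three positive frames.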
Confidence scoring head. This head is implemented as two 1D convolution layers with one output channel. For each frame $t$, if it falls inside the ground-truth segment, we consider its generated box to match the query semantically and denote its label as $c_t = 1$; otherwise, $c_t = 0$. We utilize a binary cross-entropy loss for confidence evaluation as:
$$\mathcal{L}_{c} = -\frac{1}{T} \sum_{t=1}^{T} \big[ c_t \log \hat{c}_t + (1 - c_t) \log (1 - \hat{c}_t) \big],$$
where $\hat{c}_t$ is the predicted confidence score.
IoU regression head. We train a three-layer 1D convolution to estimate the IoU between the generated box at each frame and the corresponding ground truth. Denoting the ground-truth IoU as $o_t$ and the predicted one as $\hat{o}_t$, we have:
$$\mathcal{L}_{iou} = \frac{1}{T} \sum_{t=1}^{T} \mathcal{S}\big(\hat{o}_t - o_t\big),$$
where $\mathcal{S}$ is the same smooth $L_1$ loss as above.
Thus, the final loss is a multi-task loss combining the above three loss functions as:
$$\mathcal{L} = \mathcal{L}_{b} + \lambda_{1} \mathcal{L}_{c} + \lambda_{2} \mathcal{L}_{iou},$$
where $\lambda_{1}$ and $\lambda_{2}$ are hyper-parameters that balance the training weights of the different losses.
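The confidence and IoU-regression terms, and their combination with the boundary loss, can be sketched as follows. This NumPy sketch assumes a smooth $L_1$ form for the IoU-regression term and unit balancing weights; all names are illustrative:

```python
import numpy as np

def smooth_l1(x):
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def total_loss(l_b, c_hat, c, o_hat, o, lam1=1.0, lam2=1.0):
    """l_b: boundary loss (scalar); c_hat/c: (T,) predicted/target confidence;
    o_hat/o: (T,) predicted/target IoU; lam1, lam2: balancing weights."""
    eps = 1e-8
    l_c = -np.mean(c * np.log(c_hat + eps) + (1 - c) * np.log(1 - c_hat + eps))
    l_iou = np.mean(smooth_l1(o_hat - o))
    return float(l_b + lam1 * l_c + lam2 * l_iou)

c = np.array([1.0, 1.0, 0.0]); c_hat = np.array([0.9, 0.8, 0.2])
o = np.array([0.8, 0.6, 0.1]); o_hat = np.array([0.7, 0.6, 0.3])
L = total_loss(0.2, c_hat, c, o_hat, o, lam1=1.0, lam2=1.0)
```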
Datasets and Evaluation
ActivityNet Caption. ActivityNet Caption krishna2017dense contains 20,000 untrimmed videos with 100,000 descriptions from YouTube. The videos are 2 minutes long on average, and the annotated clips vary greatly in length, ranging from several seconds to over 3 minutes. Following the public split, we use 37,417, 17,505, and 17,031 sentence-video pairs for training, validation, and testing, respectively.
TACoS. TACoS regneri2013grounding is widely used for the TSG task and contains 127 videos. The videos are collected from cooking scenarios and thus lack diversity; they are around 7 minutes long on average. We use the same split as gao2017tall, which includes 10,146, 4,589, and 4,083 query-segment pairs for training, validation, and testing.
Charades-STA. Charades-STA is built on the Charades dataset sigurdsson2016hollywood, which focuses on indoor activities. In total, there are 12,408 and 3,720 moment-query pairs in the training and testing sets, respectively.
Evaluation Metric. Following previous works gao2017tall; zeng2020dense; zhang2019learning, we adopt "R@n, IoU=m" as our evaluation metric. It is defined as the percentage of queries for which at least one of the top-n selected moments has an IoU with the ground truth larger than m; higher is better.
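The metric can be computed directly from ranked predictions. A self-contained sketch (function names are illustrative):

```python
def temporal_iou(a, b):
    """IoU of two temporal segments a=(start, end), b=(start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def recall_at_n_iou(preds, gts, n, m):
    """preds: per-query list of ranked (start, end) moments; gts: one (start, end)
    ground truth per query. Returns the fraction of queries with at least one of
    the top-n moments whose IoU with the ground truth is larger than m."""
    hits = sum(any(temporal_iou(p, gt) > m for p in ranked[:n])
               for ranked, gt in zip(preds, gts))
    return hits / len(gts)

preds = [[(0, 10), (5, 15)], [(20, 30)]]
gts = [(5, 14), (0, 5)]
```

For the toy example above, the first query's second-ranked moment overlaps the ground truth with IoU 0.9, so it counts under R@2 but not R@1.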
We take the raw frames of each video as input, and apply C3D tran2015learning to encode the videos on ActivityNet Caption and TACoS, and I3D carreira2017quo on Charades-STA. We set the length of the video feature sequences to 200 for the ActivityNet Caption and TACoS datasets and 64 for the Charades-STA dataset. For sentence encoding, we set the length of the word feature sequences to 20, and utilize Glove embeddings pennington2014glove to embed each word into 300-dimensional features. The hidden state dimension of the BiLSTM networks is set to 512. The numbers of memory items $(K_v, K_q)$ are set to (1024, 1024), (512, 512), and (512, 512) for the three datasets, respectively; we empirically find that further increasing the number of memory items brings no additional performance gain. The balancing weights $\lambda_1$ and $\lambda_2$ of the loss are kept fixed throughout training. We use an Adam optimizer with a learning rate of 0.0001, and train the model for 50 epochs with a batch size of 128 to guarantee convergence. All experiments are implemented on a single NVIDIA TITAN XP GPU.
Comparisons with the State-of-the-Arts
Compared Methods. We compare the proposed MGSL-Net with state-of-the-art TSG methods on three datasets. These methods are grouped into three categories from the viewpoint of proposal-based versus proposal-free approaches: 1) proposal-based methods: TGN chen2018temporally, CTRL gao2017tall, ACRN liu2018attentive, QSPN xu2019multilevel, CBP wang2019temporally, SCDM yuan2019semantic, CMIN zhang2019cross, 2DTAN zhang2019learning, and CBLN liu2021context; 2) proposal-free methods: GDP chenrethinking, LGI mun2020local, VSLNet zhang2020span, and DRN zeng2020dense; 3) others: BPNet xiao2021boundary. Note that all of the above methods directly apply deep networks to learn cross-modal retrieval without considering rarely appeared video-query samples.
Comparison on ActivityNet Caption. Table 1 summarizes the results on ActivityNet Caption, showing that MGSL-Net outperforms all baselines in all metrics. Specifically, MGSL-Net works well even under stricter metrics: it achieves significant 3.82% and 3.30% absolute improvements in R@1, IoU=0.7 and R@5, IoU=0.7 over the previous state-of-the-art method CBLN, which demonstrates the superiority of our model. This is mainly because our memory stores useful cross-modal shared semantic representations, and thus better associates the rarely appeared videos and queries in the standard test sets.
Comparison on TACoS. From Table 1, we can also see that MGSL-Net achieves the best performance on the TACoS dataset. Note that our model shows much larger improvements on TACoS than on ActivityNet Caption. This is mainly because the smaller and less diverse training data of TACoS cannot guarantee that previous models capture the relations among small objects well, whereas our model can exploit the stored memory as an auxiliary resource for better learning.
Comparison on Charades-STA. As shown in Table 1, MGSL-Net achieves new state-of-the-art performance over all metrics on Charades-STA. Since this dataset contains fewer rarely appeared samples, the performance improvement is smaller than on the other two datasets.
Table 2: Efficiency comparison of run-time, model size, and R@1, IoU=0.5.
We evaluate the efficiency of MGSL-Net by fairly comparing its running time and parameter size with existing methods on a single NVIDIA TITAN XP GPU on the TACoS dataset. As shown in Table 2, our method achieves a much faster processing speed with relatively fewer learnable parameters. This is attributed to: 1) the proposal generation and proposal matching procedures of proposal-based methods (ACRN, CTRL, TGN, 2DTAN) are quite time-consuming; 2) the regression-based method DRN utilizes many convolutional layers for multi-level feature fusion in cross-modal interaction, which is also time-consuming; 3) our MGSL-Net is free from such complex and time-consuming operations, showing superiority in both effectiveness and efficiency.
Analysis on the Rare Cases
Analyzing the rarely appeared video-query samples is not straightforward, since such pairwise data is hard to define. Therefore, we first analyze the data distribution of each dataset as shown in Figures 1, 4, and 5, and then select certain pairs of video and sentence as rare samples, namely those containing at least one word (noun, verb, or adjective) whose frequency of appearance is less than 10. The remaining samples are treated as common samples. In these three figures, we show the performance of different methods on the rare and common cases. Our proposed method handles the rare cases more effectively, bringing much larger improvements than other methods. We also give qualitative examples of the grounding results in Figure 6, where our method learns and memorizes the semantics of the rare cases and grounds the segments more accurately.
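The rare/common split described above can be reproduced with a simple frequency count. A self-contained sketch, assuming tokenized queries and a precomputed set of content words (nouns, verbs, adjectives); the function name and threshold default are illustrative:

```python
from collections import Counter

def split_rare_common(queries, content_words, threshold=10):
    """queries: list of tokenized sentences; content_words: set of nouns/verbs/
    adjectives. A sample is 'rare' if any of its content words appears fewer
    than `threshold` times over the whole dataset."""
    freq = Counter(w for q in queries for w in q if w in content_words)
    rare, common = [], []
    for i, q in enumerate(queries):
        is_rare = any(freq[w] < threshold for w in q if w in content_words)
        (rare if is_rare else common).append(i)
    return rare, common

queries = [["a", "man", "runs"], ["a", "man", "jumps"], ["a", "man", "runs"]]
content = {"man", "runs", "jumps"}
rare, common = split_rare_common(queries, content, threshold=2)
```

In the toy example, "jumps" appears only once, so the second query is classified as rare while the other two are common.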
Table 3: Ablation on the domain-specific memory networks (video and query domains) and the attention module, measured by R@1, IoU=0.5 and R@1, IoU=0.7.
Main ablation. As shown in Table 3, we first study the influence of each main component of MGSL-Net. We take the MGSL-Net model without both the domain-specific memory networks and the heterogeneous multi-modal integration module as the baseline. The table shows that both the memory network and the heterogeneous attention contribute greatly to the final performance, with the memory network bringing the largest improvement of 5-6 absolute points.
Investigation on the memory network. We compare the memory network under different settings in Table 4, where "shared" means using a shared set of memory slots to learn the semantics of the two inputs in each domain. From the table, we find several points: 1) the two aligned semantics in each pairwise input, $(v_t, \widehat{q}_t)$ or $(q_n, \widehat{v}_n)$, are both important for memory learning; 2) in each domain, a shared memory performs better than two separate memories for reading and updating the pairwise data; 3) the memory-based contexts in both the video and query domains are helpful for grounding.
To further investigate what the shared memory actually learns, we reduce the dimensionality of the memory slots with PCA and show their two-dimensional representations (grey nodes) in Figure 7. All the nodes are distributed in a divergent shape, in which the top nodes are more compact while the bottom ones are more scattered. To figure out the semantic meanings of these memory slots, we take several representative nodes (marked with arrows) as queries to retrieve video-query pairs. We find that rarely appeared content is indeed captured and represented by the scattered nodes, while more commonly appeared content is captured by the compact ones.
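The 2-D projection of the memory slots used for this visualization is standard PCA, which can be sketched via an SVD of the centered slot matrix (function name illustrative, not the authors' code):

```python
import numpy as np

def pca_2d(M):
    """Project memory slots M (K, D) onto their top-2 principal components."""
    X = M - M.mean(axis=0)                       # center the slots
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                          # (K, 2) coordinates for a scatter plot

rng = np.random.default_rng(0)
coords = pca_2d(rng.standard_normal((16, 32)))
```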
Investigation on the heterogeneous attention. We also conduct ablation studies on the heterogeneous attention module in Table 5, where we take the inter-attention branch as the baseline. The self-attention unit brings the largest improvement, since it not only captures the intra-modality relations among the elements of each modality but also provides the enhanced video features in the video domain. The calibration unit also contributes to the final performance.
In this paper, we have proposed the Memory-Guided Semantic Learning Network (MGSL-Net) to handle the rarely appeared pairwise samples in the temporal sentence grounding task. The main contributions of this work are: 1) a cross-modal graph convolutional network to align the semantics between video and query, 2) two domain-specific persistent memories to learn and memorize the cross-modal shared semantic representations, and 3) a heterogeneous attention module to integrate the enhanced multi-modal features in both the video and query domains. Experimental results show the superiority of our method in both effectiveness and efficiency.
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant (No.61972448, No.62172068, No.61802048).