Memory-Guided Semantic Learning Network for Temporal Sentence Grounding

Temporal sentence grounding (TSG) is a crucial and fundamental task for video understanding. Although existing methods train well-designed deep networks with large amounts of data, we find that they easily forget rarely appearing cases during training due to the imbalanced data distribution, which weakens model generalization and leads to undesirable performance. To tackle this issue, we propose a memory-augmented network, called Memory-Guided Semantic Learning Network (MGSL-Net), that learns and memorizes the rarely appearing content in TSG. Specifically, MGSL-Net consists of three main parts: a cross-modal interaction module, a memory augmentation module, and a heterogeneous attention module. We first align the given video-query pair with a cross-modal graph convolutional network, and then utilize a memory module to record the cross-modal shared semantic features in domain-specific persistent memories. During training, the memory slots are dynamically associated with both common and rare cases, alleviating the forgetting issue. In testing, rare cases can thus be enhanced by retrieving the stored memories, resulting in better generalization. At last, the heterogeneous attention module is utilized to integrate the enhanced multi-modal features in both the video and query domains. Experimental results on three benchmarks show the superiority of our method in both effectiveness and efficiency, substantially improving accuracy not only on the entire datasets but also on the rare cases.

Introduction

Temporal sentence grounding (TSG) is an important yet challenging task in video understanding, which has drawn increasing attention over the last few years due to its vast potential applications in video summarization song2015tvsum; chu2015video, video captioning jiang2018recurrent; chen2020learning, and temporal action localization shou2016temporal; zhao2017temporal. As shown in Figure 1, this task aims to ground the most relevant video segment according to a given sentence query. It is particularly challenging as it needs to not only model the complex multi-modal interactions between video and query features, but also capture complicated contextual information for their semantic alignment.

Figure 1: (a) An illustrative example of the TSG task. (b) Data distribution on the ActivityNet Caption dataset, and the performance comparison on the corresponding rare cases.

Most existing works anne2017localizing; ge2019mac; liu2018attentive; zhang2019man; chen2018temporally; zhang2019cross; liu2018cross; yuan2019semantic; xu2019multilevel exploit a proposal-ranking framework that first generates multiple candidate proposals and then ranks them according to their similarities with the sentence query. These methods rely heavily on the quality of the proposals. Instead of using complex proposals, some recent works rodriguez2020proposal; yuan2019find; chenrethinking; wang2019language; nan2021interventional; mun2020local; zeng2020dense adopt a proposal-free framework that directly regresses the temporal locations of the target segment. Compared to their proposal-ranking counterparts, these works are much more efficient.

Although the above two types of methods have achieved impressive results, we still observe performance bottlenecks on rarely appearing video-query samples, as shown in Figure 1. Here, we regard a video-sentence pair as a rare sample if it contains at least one word (noun, verb, or adjective) whose frequency of appearance is less than 10. We can observe that all existing models achieve good performance on the common cases, but their performance drops heavily when evaluated on the rare cases. This observation is consistent with the finding that deep networks tend to forget rare cases when trained on an imbalanced and diverse data distribution toneva2018empirical, especially in practical scenarios where the distribution can be extremely skewed. To tackle this challenge, we aim to better match video-query pairs containing rarely appearing word-guided semantics so as to improve generalization. However, it is hard to strike a balance between the common and rare samples in the dynamic training process.
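As a rough illustration of this rare-sample criterion, the sketch below (our own illustration, not the authors' released code) counts word frequencies over the training queries and flags test pairs containing a noun, verb, or adjective seen fewer than 10 times. The function name, the NLTK-based tokenization and POS tagging, and the data format are assumptions.

```python
from collections import Counter
import nltk  # assumes NLTK with tokenizer and POS-tagger data installed

def split_rare_common(train_queries, test_pairs, min_freq=10):
    """Flag video-query pairs whose query contains a rare noun/verb/adjective.

    train_queries: list of query strings from the training split.
    test_pairs:    list of (video_id, query) tuples to categorize.
    """
    # Count how often each word appears in the training queries.
    freq = Counter(w.lower() for q in train_queries for w in nltk.word_tokenize(q))

    rare, common = [], []
    for video_id, query in test_pairs:
        tags = nltk.pos_tag(nltk.word_tokenize(query))  # e.g. [('opens', 'VBZ'), ...]
        # Nouns (NN*), verbs (VB*), adjectives (JJ*) appearing fewer than min_freq times.
        has_rare = any(
            tag[:2] in ("NN", "VB", "JJ") and freq[w.lower()] < min_freq
            for w, tag in tags
        )
        (rare if has_rare else common).append((video_id, query))
    return rare, common
```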

To this end, in this paper, we propose to learn and memorize the discriminative and representative cross-modal shared semantics covering all samples, implemented by a memory-augmented network called Memory-Guided Semantic Learning Network (MGSL-Net). Given a pair of video and query inputs, we first encode their contextual features individually and then align their semantics with a cross-modal graph convolutional network. After obtaining the aligned video-query feature pair, we design domain-specific persistent memories in both the video and query domains to record the most representative cross-modal shared semantic representations. The learned memories are updated and maintained as a compact dictionary shared by all samples. During training, the memory slots in each domain are dynamically associated with both the common and rare samples across mini-batches throughout the whole training stage, alleviating the forgetting issue. In testing, the rare cases can thus be augmented by retrieving the stored semantics, leading to better generalization. Besides, we develop a heterogeneous attention module to integrate the augmented multi-modal features in the video and query domains by considering their contextual inter-modal interactions and video-based self-calibration.

Our main contributions are summarized as follows:

  • We propose a memory-augmented network MGSL-Net for temporal sentence grounding, by learning and memorizing the discriminative and representative cross-modal shared semantics covering all cases. The memory is dynamically associated with both the common and rare samples seen across mini-batches during the whole training, alleviating the forgetting issue on rare samples.

  • To obtain more domain-specific semantic contexts, we design the memory items in both video and query domains to be persistently read and updated. A heterogeneous attention module is further developed to integrate the enhanced multi-modal features in two domains.

  • The proposed MGSL-Net achieves state-of-the-art performance on three benchmarks (ActivityNet Caption, TACoS, and Charades-STA), boosting the performance by a large margin not only on the entire datasets but also on the rarely appearing pairwise samples, with limited computation and memory overhead.

Related Work

Temporal sentence grounding. Various algorithms anne2017localizing; ge2019mac; liu2018attentive; zhang2019man; chen2018temporally; qu2020fine; liu2021progressively; liu2018cross; liu2021adaptive; liu2022exploring; liu2020jointly; liu2020reasoning have been proposed within the scan-and-rank framework, which first generates multiple segment proposals and then ranks them according to their similarity with the query to select the best-matching one. Some of them gao2017tall; anne2017localizing apply sliding windows to generate proposals and subsequently integrate the query with segment representations via a matrix operation. To improve the quality of the proposals, later works wang2019temporally; zhang2019man; yuan2019semantic; zhang2019cross; cao2021pursuit directly integrate sentence information with each fine-grained video clip unit, and predict the scores of candidate segments by gradually merging the fused feature sequence over time. Instead of generating complex proposals, recent works rodriguez2020proposal; yuan2019find; chenrethinking; wang2019language; nan2021interventional; mun2020local; zeng2020dense; liu2022unsupervised directly regress the temporal locations of the target segment. They do not rely on segment proposals and directly select the starting and ending frames by leveraging cross-modal interactions between video and query. Specifically, they either regress the start/end timestamps based on the entire video representation yuan2019find; mun2020local, or predict at each frame whether it is a start or end boundary rodriguez2020proposal; chenrethinking; zeng2020dense. Although the above methods achieve great performance, they tend to forget rare cases easily when trained on an imbalanced and diverse data distribution. Different from them, we focus on storing and reading a cross-modal semantic memory to enhance the multi-modal feature representations.

Memory Networks. Memory-based approaches have been explored for various problems. NTM graves2014neural improves the generalization ability of a network by introducing an attention-based memory module. Memory networks vaswani2017attention; sukhbaatar2015end maintain an external memory that can be written to and read from. Xiong et al. xiong2016dynamic further extend this line into dynamic memory networks. Different from these unimodal memory models, we propose a cross-modal shared memory that can alternately interact with multiple data modalities. Although other works ma2018visual; huang2019acmm also extend memory networks to multi-modal settings, most of them are episodic memory networks whose memory is wiped during each forward pass. Different from them, our memory persistently memorizes cross-modal semantic representations in the multi-modal domains, aggregating them over the whole training procedure to better deal with imbalanced data.

Method

Figure 2: Overall pipeline of the proposed MGSL-Net architecture. Given a pair of video and query inputs, we first encode their features and exploit a cross-modal graph convolutional network (GCN) to align their semantics. Then, for the aligned features in each domain, we utilize a domain-specific persistent memory to memorize and enhance the cross-modal shared semantic features. After that, we further develop a heterogeneous attention module to integrate the multi-modal features in both domains. At last, we locate the target segment with the regression-based grounding heads.

Overview

Given an untrimmed video $V$ and a sentence query $Q$, we represent the video frame-by-frame as $V = \{v_t\}_{t=1}^{T}$, where $v_t$ is the $t$-th frame and $T$ is the total number of frames. Similarly, the query with $N$ words is denoted word-by-word as $Q = \{q_n\}_{n=1}^{N}$. The TSG task aims to localize the start and end timestamps $(\tau_s, \tau_e)$ of the specific segment in video $V$ that corresponds to the semantics of query $Q$.

In this section, we present the Memory-Guided Semantic Learning Network (MGSL-Net) for the TSG task. The overall pipeline, shown in Figure 2, includes four main steps: we first encode both video and query features with contextual information and align them with a cross-modal graph convolutional network; then, we utilize the persistent memories to learn and memorize the cross-modal shared semantic representations in both the video and query domains; after obtaining the memory-enhanced multi-modal features, we develop a heterogeneous attention module to model inter- and intra-modality interactions for multi-modal feature integration; at last, the grounding heads are utilized to localize the target segment.

Cross-modal Feature Alignment

Video encoder. For video encoding, we first extract frame-wise features with a pre-trained C3D network tran2015learning, and then employ a self-attention vaswani2017attention module to capture the long-range dependencies among video frames. We further utilize a BiLSTM Mike1997 to learn the sequential characteristics. We denote the extracted video features as $\mathbf{V} = \{\mathbf{v}_t\}_{t=1}^{T} \in \mathbb{R}^{T \times D}$, where $D$ is the feature dimension.

Query encoder. For query encoding, we first generate word-level features using GloVe embeddings pennington2014glove, and likewise employ a self-attention module and a BiLSTM layer to further encode the query features as $\mathbf{Q} = \{\mathbf{q}_n\}_{n=1}^{N} \in \mathbb{R}^{N \times D}$.
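A minimal PyTorch sketch of the encoder pipeline described above, assuming pre-extracted C3D frame features and GloVe word embeddings are provided as tensors. The class name, layer widths, and number of attention heads are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Self-attention followed by a BiLSTM, applied to either modality."""
    def __init__(self, in_dim, hidden_dim=512, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.bilstm = nn.LSTM(hidden_dim, hidden_dim // 2, batch_first=True,
                              bidirectional=True)

    def forward(self, x):                        # x: (B, L, in_dim)
        x = self.proj(x)
        attn_out, _ = self.self_attn(x, x, x)    # long-range dependencies
        out, _ = self.bilstm(attn_out)           # sequential characteristics
        return out                               # (B, L, hidden_dim)

# Usage: C3D frame features (B, T, 4096) and GloVe word embeddings (B, N, 300).
video_encoder = ContextEncoder(in_dim=4096)
query_encoder = ContextEncoder(in_dim=300)
V = video_encoder(torch.randn(2, 200, 4096))    # (2, 200, 512)
Q = query_encoder(torch.randn(2, 20, 300))      # (2, 20, 512)
```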

Semantic Alignment. Considering that the obtained video and query representations are intrinsically heterogeneous, we propose a cross-modal graph convolutional network kipf2016semi to explicitly perform cross-modal alignment. Specifically, we first construct two adjacency matrices by measuring the cross-modal similarity between each frame-word pair in the two directions as:

$\mathbf{A}^{q2v} = \mathrm{softmax}\big((\mathbf{Q}\mathbf{W}_q)\,\mathbf{V}^{\top}\big), \qquad \mathbf{A}^{v2q} = \mathrm{softmax}\big((\mathbf{V}\mathbf{W}_v)\,\mathbf{Q}^{\top}\big),$   (1)

where $\mathbf{W}_v$ and $\mathbf{W}_q$ are two modality-specific linear mappings that project one modality's features into the same latent space as the other, and $\mathbf{A}^{q2v} \in \mathbb{R}^{N \times T}$, $\mathbf{A}^{v2q} \in \mathbb{R}^{T \times N}$ are the row-normalized adjacency matrices. Therefore, we can obtain the aligned video representation $\widetilde{\mathbf{V}}$ and the aligned query representation $\widetilde{\mathbf{Q}}$ by:

$\widetilde{\mathbf{V}} = \mathbf{A}^{q2v}\,\mathbf{V}\,\mathbf{W}_1, \qquad \widetilde{\mathbf{Q}} = \mathbf{A}^{v2q}\,\mathbf{Q}\,\mathbf{W}_2,$   (2)

where $\mathbf{W}_1$ and $\mathbf{W}_2$ are the weight matrices. $\widetilde{\mathbf{V}}$ has the same size as the query feature $\mathbf{Q}$, and they are semantically aligned. The $n$-th row $\widetilde{\mathbf{v}}_n$ of $\widetilde{\mathbf{V}}$ is an aggregated representation weighted by the cross-modal similarities between the $n$-th word and all frames. Therefore, $\widetilde{\mathbf{v}}_n$ can be viewed as a visual representation of the $n$-th word, sharing the same semantic meaning as the word $q_n$. Similarly, $\widetilde{\mathbf{Q}}$ is semantically aligned with $\mathbf{V}$, and $\widetilde{\mathbf{q}}_t$ can be viewed as a textual representation of the $t$-th frame, sharing the same semantics as the frame $v_t$.
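The following sketch implements the alignment of Eqs. (1)-(2) as reconstructed above. Since the original equations were lost in extraction, the projection and weight matrix names ($\mathbf{W}_v, \mathbf{W}_q, \mathbf{W}_1, \mathbf{W}_2$) and the softmax normalization are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAlignment(nn.Module):
    """Cross-modal GCN-style alignment between frame and word features (a sketch)."""
    def __init__(self, dim=512):
        super().__init__()
        self.W_v = nn.Linear(dim, dim, bias=False)  # projects video into the query space
        self.W_q = nn.Linear(dim, dim, bias=False)  # projects query into the video space
        self.W1 = nn.Linear(dim, dim)               # GCN weight for the video domain
        self.W2 = nn.Linear(dim, dim)               # GCN weight for the query domain

    def forward(self, V, Q):                        # V: (B, T, D), Q: (B, N, D)
        A_q2v = F.softmax(self.W_q(Q) @ V.transpose(1, 2), dim=-1)  # (B, N, T)
        A_v2q = F.softmax(self.W_v(V) @ Q.transpose(1, 2), dim=-1)  # (B, T, N)
        V_tilde = self.W1(A_q2v @ V)   # visual representation of each word, (B, N, D)
        Q_tilde = self.W2(A_v2q @ Q)   # textual representation of each frame, (B, T, D)
        return V_tilde, Q_tilde
```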

Memory Network

Based on the aligned representation pairs $(\mathbf{V}, \widetilde{\mathbf{Q}})$ and $(\mathbf{Q}, \widetilde{\mathbf{V}})$, as shown in Figure 2, we propose a memory network to learn and memorize the cross-modal shared semantic features in the video domain and the query domain, respectively.

Memory Representation. The domain-specific cross-modal shared semantic memories in the video and query domains are designed as matrices $\mathbf{M}^v \in \mathbb{R}^{K_v \times D}$ and $\mathbf{M}^q \in \mathbb{R}^{K_q \times D}$, respectively. Here, $K_v$ and $K_q$ are hyper-parameters that define the number of memory slots, and $D$ is the feature dimension. Each memory item $\mathbf{m}^v_k$ or $\mathbf{m}^q_k$ can be updated by intra-domain features with similar semantic meanings, as well as read out to enhance previously obtained intra-domain features.

Given the aligned frame-word feature pair $(\mathbf{v}_t, \widetilde{\mathbf{q}}_t)$ in the video domain, we aim to interact it with each memory item to read and store their shared cross-modal semantic features. Before the interaction, we first utilize several linear layers to map each feature into a memory read key, write key, erase value, and write value. We denote these items of $\mathbf{v}_t$ as $(\mathbf{k}^r_{v_t}, \mathbf{k}^w_{v_t}, \mathbf{e}_{v_t}, \mathbf{a}_{v_t})$, and the items of $\widetilde{\mathbf{q}}_t$ analogously. We also map the aligned features $(\mathbf{q}_n, \widetilde{\mathbf{v}}_n)$ in the query domain into the same kinds of items. Details of how these items are used to update and read the memory are given below.

Updating memory. Given the video-domain aligned pair $(\mathbf{v}_t, \widetilde{\mathbf{q}}_t)$ and the query-domain aligned pair $(\mathbf{q}_n, \widetilde{\mathbf{v}}_n)$, we determine which memory items in $\mathbf{M}^v$ and $\mathbf{M}^q$ to write and which to erase. Specifically, we first calculate the memory addressing weights according to the similarity between each input feature and the corresponding domain-specific memory as:

$w^v_{t,k} = \dfrac{\exp\big(\mathrm{sim}(\mathbf{k}^w_{v_t},\, \mathbf{m}^v_k)\big)}{\sum_{k'=1}^{K_v}\exp\big(\mathrm{sim}(\mathbf{k}^w_{v_t},\, \mathbf{m}^v_{k'})\big)},$   (3)
$w^q_{n,k} = \dfrac{\exp\big(\mathrm{sim}(\mathbf{k}^w_{q_n},\, \mathbf{m}^q_k)\big)}{\sum_{k'=1}^{K_q}\exp\big(\mathrm{sim}(\mathbf{k}^w_{q_n},\, \mathbf{m}^q_{k'})\big)},$   (4)

where $\mathbf{k}^w_{v_t}$ and $\mathbf{k}^w_{q_n}$ are the memory write keys, and $\mathrm{sim}(\cdot,\cdot)$ measures the cosine similarity. Then, we can selectively update the memory items by adding new semantic features with the write values while deleting old memory with the erase values as:

$\mathbf{m}^v_k \leftarrow \mathbf{m}^v_k \odot \big(\mathbf{1} - w^v_{t,k}\,\mathbf{e}_{v_t}\big),$   (5)
$\mathbf{m}^v_k \leftarrow \mathbf{m}^v_k + w^v_{t,k}\,\mathbf{a}_{v_t},$   (6)
$\mathbf{m}^q_k \leftarrow \mathbf{m}^q_k \odot \big(\mathbf{1} - w^q_{n,k}\,\mathbf{e}_{q_n}\big),$   (7)
$\mathbf{m}^q_k \leftarrow \mathbf{m}^q_k + w^q_{n,k}\,\mathbf{a}_{q_n},$   (8)

where the erase values $\mathbf{e}_{(\cdot)}$ are computed with a sigmoid function, and $\odot$ denotes element-wise multiplication. In the video domain, $\mathbf{M}^v$ first updates its memory items with the information extracted from the frame and then from the word. In the query domain, $\mathbf{M}^q$ first updates its memory items with the information extracted from the word and then from the frame. In fact, the update order can be swapped and does not show a significant impact on the final performance.
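Below is a compact sketch of one domain-specific persistent memory following the addressing, erase/add, and read steps reconstructed in Eqs. (3)-(10). The class name `PersistentMemory`, the linear key/value projections, the averaging of updates over a batch, and the detach-based persistent update are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PersistentMemory(nn.Module):
    """One domain-specific persistent memory with erase/add updates and reads (a sketch)."""
    def __init__(self, num_slots=512, dim=512):
        super().__init__()
        self.register_buffer("memory", torch.randn(num_slots, dim) * 0.01)  # M, (K, D)
        self.read_key = nn.Linear(dim, dim)
        self.write_key = nn.Linear(dim, dim)
        self.erase_val = nn.Linear(dim, dim)
        self.write_val = nn.Linear(dim, dim)

    def _address(self, key):
        # Cosine similarity between each key and every slot, then softmax (Eqs. 3-4).
        sim = F.cosine_similarity(key.unsqueeze(1), self.memory.unsqueeze(0), dim=-1)
        return F.softmax(sim, dim=-1)                       # (B, K)

    def update(self, x):                                    # x: (B, D) flattened features
        w = self._address(self.write_key(x))                # addressing weights
        e = torch.sigmoid(self.erase_val(x))                # erase values
        a = self.write_val(x)                               # write values
        erase = (w.unsqueeze(-1) * e.unsqueeze(1)).mean(0)  # (K, D)
        add = (w.unsqueeze(-1) * a.unsqueeze(1)).mean(0)    # (K, D)
        # Persistent, non-episodic update kept across mini-batches (Eqs. 5-8).
        self.memory = (self.memory * (1.0 - erase) + add).detach()

    def read(self, x):                                      # Eqs. (9)-(10)
        beta = self._address(self.read_key(x))              # read weights
        return beta @ self.memory                           # memory-enhanced representation
```

In training, one would call `update` on the flattened aligned features of each mini-batch (first the frame features, then the aligned word features in the video domain, and the reverse in the query domain, as described above), and call `read` at both training and test time.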

Reading memory. During memory reading, we read the most relevant items from the domain-specific memories $\mathbf{M}^v$ and $\mathbf{M}^q$ to enhance the representations in each domain. To this end, given $(\mathbf{v}_t, \widetilde{\mathbf{q}}_t)$ and $(\mathbf{q}_n, \widetilde{\mathbf{v}}_n)$, we first compute the cross-modal read weights $\beta^v_{t,k}$ and $\beta^q_{n,k}$ by comparing the read keys with the memory items as in Eq. (3) and (4). Then we can read the memory by regarding the obtained read keys of $\mathbf{v}_t$ and $\mathbf{q}_n$ as queries:

$\mathbf{c}^v_t = \sum_{k=1}^{K_v} \beta^v_{t,k}\,\mathbf{m}^v_k,$   (9)
$\mathbf{c}^q_n = \sum_{k=1}^{K_q} \beta^q_{n,k}\,\mathbf{m}^q_k,$   (10)

where $\mathbf{c}^v_t$ and $\mathbf{c}^q_n$ are the read vectors, which can be regarded as memory-enhanced representations in the video and query domains, respectively.
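Continuing the `PersistentMemory` sketch above (all names are ours), the read step is simply a weighted sum of memory slots:

```python
import torch  # reuses the PersistentMemory class defined in the previous sketch

mem_v = PersistentMemory(num_slots=1024, dim=512)   # video-domain memory M^v
v_aligned = torch.randn(16, 512)                    # flattened aligned video-domain features
c_v = mem_v.read(v_aligned)                         # (16, 512) read vectors c^v, Eq. (9)
```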

Figure 3: Illustration of Heterogeneous Attention module. The units of self-attention, inter-attention and calibration are implemented by dot product attention.

Heterogeneous Multi-Modal Integration

After obtaining the memory-enhanced representations from both the video and query domains, we generate a new video representation $\hat{\mathbf{V}}$ and a new query representation $\hat{\mathbf{Q}}$ by concatenation, where $\hat{\mathbf{v}}_t = [\mathbf{v}_t; \mathbf{c}^v_t]$ and $\hat{\mathbf{q}}_n = [\mathbf{q}_n; \mathbf{c}^q_n]$. To further integrate these two representations, we develop a heterogeneous attention module that considers their inter- and intra-modality interactions. In particular, we additionally take the original video feature $\mathbf{V}$ as global context to calibrate the learned memory contents in $\hat{\mathbf{V}}$. As shown in Figure 3, the proposed heterogeneous attention mechanism first utilizes three linear layers to map the three representations into the same latent space, and then exploits a self-attention unit to capture the semantic-aware intra-modality relations between the enhanced frame-frame pairs and word-word pairs. After that, the inter-modality relationship is captured by interacting the frame-word features. To further calibrate the memory-based contents, we take $\mathbf{V}$ as a global signal to supervise the enhanced video feature $\hat{\mathbf{V}}$. These three attentional units are combined in a modular way to define the heterogeneous attention mechanism, and all of them are based on dot-product attention. Finally, the integrated feature $\mathbf{F}$ is obtained by concatenating all of their output features.
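The sketch below illustrates one plausible composition of the three dot-product attention units described above. How the three outputs are fused, the latent dimension, and the module names are assumptions; Figure 3 of the paper defines the actual wiring.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dot_product_attention(query, key, value):
    """Scaled dot-product attention, the basic unit of all three branches."""
    scores = query @ key.transpose(-2, -1) / key.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ value

class HeterogeneousAttention(nn.Module):
    """Self-attention, inter-attention, and video-based calibration (a sketch)."""
    def __init__(self, enhanced_dim=1024, latent=512):
        super().__init__()
        self.map_v = nn.Linear(enhanced_dim, latent)  # memory-enhanced video \hat{V}
        self.map_q = nn.Linear(enhanced_dim, latent)  # memory-enhanced query \hat{Q}
        self.map_g = nn.Linear(latent, latent)        # original video V as global context
        self.out = nn.Linear(3 * latent, latent)

    def forward(self, V_hat, Q_hat, V):
        v, q, g = self.map_v(V_hat), self.map_q(Q_hat), self.map_g(V)
        v_self = dot_product_attention(v, v, v)                   # frame-frame relations
        q_self = dot_product_attention(q, q, q)                   # word-word relations
        v_inter = dot_product_attention(v_self, q_self, q_self)   # frame-word interaction
        v_calib = dot_product_attention(g, v_self, v_self)        # calibration by global V
        return self.out(torch.cat([v_self, v_inter, v_calib], dim=-1))  # integrated F, (B, T, latent)
```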

Grounding Heads

Taking the fine-grained integrated feature $\mathbf{F}$ as input, we process each frame-wise feature $\mathbf{f}_t$ with the grounding module, which consists of three components: a boundary regression head, a confidence scoring head, and an IoU regression head. Since the TSG task aims to localize a specific segment, the boundary regression head predicts a temporal bounding box at each frame. To select the box that best matches the query, the confidence scoring head predicts a score indicating whether the content in each box matches the query semantically. An IoU regression head is also utilized to directly estimate the IoU between each predicted box and the ground-truth segment.

Boundary regression head. We implement this head as two 1D convolution layers with two output channels, and we only assign regression targets to positive frames. If a frame location $t$ falls inside the ground-truth segment $(\tau_s, \tau_e)$, its regression targets are $\mathbf{b}_t = (d^s_t, d^e_t)$, where $d^s_t = t - \tau_s$ and $d^e_t = \tau_e - t$ are the distances from $t$ to the start and end boundaries. For the predicted $\hat{\mathbf{b}}_t$ and ground truth $\mathbf{b}_t$, we define $\mathcal{L}_{reg}$ as:

$\mathcal{L}_{reg} = \dfrac{1}{N_{pos}} \sum_{t=1}^{T} \mathbb{1}_{t}\,\Big[\mathcal{L}_{SL1}\big(\hat{\mathbf{b}}_t, \mathbf{b}_t\big) + \big(1 - \mathrm{IoU}(\hat{\mathbf{b}}_t, \mathbf{b}_t)\big)\Big],$   (11)

where $\mathcal{L}_{SL1}$ is a smooth L1 loss and the second term is an IoU loss. $\mathbb{1}_{t}$ is the indicator function, being 1 if frame $t$ is positive and 0 otherwise, and $N_{pos}$ is the number of positive frames.
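A sketch of this boundary loss as reconstructed in Eq. (11), assuming the boundary-distance target format described above and equal weighting of the two terms; the function and argument names are ours.

```python
import torch
import torch.nn.functional as F

def boundary_regression_loss(pred, target, pos_mask):
    """Smooth L1 plus (1 - IoU) averaged over positive frames.

    pred, target: (B, T, 2) distances (d_s, d_e) to the start/end boundaries.
    pos_mask:     (B, T) boolean, True where the frame lies inside the ground truth.
    """
    pred_p, target_p = pred[pos_mask], target[pos_mask]            # (P, 2)
    smooth_l1 = F.smooth_l1_loss(pred_p, target_p, reduction="none").sum(-1)

    # 1D IoU between the predicted and ground-truth segments centred at each frame.
    inter = torch.min(pred_p, target_p).sum(-1).clamp(min=0)
    union = torch.max(pred_p, target_p).sum(-1)
    iou_loss = 1.0 - inter / union.clamp(min=1e-6)

    n_pos = pos_mask.sum().clamp(min=1)
    return (smooth_l1 + iou_loss).sum() / n_pos
```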

Confidence scoring head. This head is implemented as two 1D convolution layers with one output channel. For each frame $t$, if it falls inside the ground-truth segment, we regard its generated box as semantically matching the query and denote its label as $y_t = 1$; otherwise, $y_t = 0$. We utilize a binary cross-entropy loss for confidence evaluation as:

$\mathcal{L}_{conf} = -\dfrac{1}{T}\sum_{t=1}^{T}\big[y_t \log \hat{y}_t + (1 - y_t)\log(1 - \hat{y}_t)\big],$   (12)

IoU regression head. We train a three-layer 1D convolution to estimate the IoU between the generated box at each frame and the corresponding ground truth. Denoting the ground-truth IoU as $o_t$ and the predicted one as $\hat{o}_t$, we have:

$\mathcal{L}_{IoU} = \dfrac{1}{T}\sum_{t=1}^{T}\mathcal{L}_{SL1}\big(\hat{o}_t, o_t\big),$   (13)

Thus, the final loss is a multi-task loss combining the above three loss functions:

$\mathcal{L} = \mathcal{L}_{reg} + \lambda_1\,\mathcal{L}_{conf} + \lambda_2\,\mathcal{L}_{IoU},$   (14)

where $\lambda_1$ and $\lambda_2$ are hyper-parameters that balance the training weights of the different losses.
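The sketch below shows one way the three heads described in this section could be laid out as 1D-convolutional stacks, together with the loss combination of Eq. (14). Kernel sizes, hidden widths, the use of sigmoids, and the placeholder loss weights are assumptions.

```python
import torch
import torch.nn as nn

class GroundingHeads(nn.Module):
    """Boundary, confidence, and IoU heads over frame-wise features (a sketch)."""
    def __init__(self, dim=512):
        super().__init__()
        self.boundary = nn.Sequential(                 # 2 layers, 2 output channels
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, 2, 3, padding=1))
        self.confidence = nn.Sequential(               # 2 layers, 1 output channel
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, 1, 3, padding=1))
        self.iou = nn.Sequential(                      # 3 layers, 1 output channel
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, 1, 3, padding=1))

    def forward(self, F_feat):                         # F_feat: (B, T, dim)
        x = F_feat.transpose(1, 2)                     # (B, dim, T) for Conv1d
        boxes = self.boundary(x).transpose(1, 2)       # (B, T, 2) distances (d_s, d_e)
        conf = torch.sigmoid(self.confidence(x)).squeeze(1)   # (B, T) matching scores
        iou = torch.sigmoid(self.iou(x)).squeeze(1)           # (B, T) predicted IoU
        return boxes, conf, iou

# Eq. (14) with placeholder weights (not the paper's values):
# loss = loss_reg + lam1 * loss_conf + lam2 * loss_iou
```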

Experiments

Datasets and Evaluation

ActivityNet Caption. ActivityNet Caption krishna2017dense contains 20,000 untrimmed videos with 100,000 descriptions collected from YouTube. The videos are 2 minutes long on average, and the annotated clips vary greatly in length, ranging from several seconds to over 3 minutes. Following the public split, we use 37,417, 17,505, and 17,031 sentence-video pairs for training, validation, and testing, respectively.

TACoS. TACoS regneri2013grounding is widely used for the TSG task and contains 127 videos. The videos are collected from cooking scenarios and thus lack diversity; they are around 7 minutes long on average. We use the same split as gao2017tall, which includes 10,146, 4,589, and 4,083 query-segment pairs for training, validation, and testing, respectively.

Charades-STA. Charades-STA is built on the Charades dataset sigurdsson2016hollywood, which focuses on indoor activities. In total, there are 12,408 and 3,720 moment-query pairs in the training and testing sets, respectively.

Evaluation Metric. Following previous works gao2017tall; zeng2020dense; zhang2019learning, we adopt "R@n, IoU=m" as our evaluation metric. It is defined as the percentage of queries for which at least one of the top-n retrieved moments has an IoU with the ground truth larger than m; higher is better.

Implementation Details

We take the raw video frames as input and apply C3D tran2015learning to encode the videos on ActivityNet Caption and TACoS, and I3D carreira2017quo on Charades-STA. We set the length of the video feature sequences to 200 for the ActivityNet Caption and TACoS datasets, and 64 for the Charades-STA dataset. For sentence encoding, we set the length of the word feature sequences to 20 and utilize GloVe embeddings pennington2014glove to embed each word into 300-dimensional features. The hidden state dimension of the BiLSTM networks is set to 512. The numbers of memory slots are set to (1024, 1024), (512, 512), and (512, 512) for the three datasets, respectively. We empirically find that further increasing the number of memory slots leads the performance to converge. The balancing weights $\lambda_1, \lambda_2$ of the losses are chosen empirically. During training, we use an Adam optimizer with a learning rate of 0.0001. The model is trained for 50 epochs with a batch size of 128 to guarantee convergence. All experiments are run on a single NVIDIA TITAN Xp GPU.

Method ActivityNet Captions TACoS Charades-STA
R@1, R@1, R@5, R@5, R@1, R@1, R@5, R@5, R@1, R@1, R@5, R@5,
IoU=0.5 IoU=0.7 IoU=0.5 IoU=0.7 IoU=0.3 IoU=0.5 IoU=0.3 IoU=0.5 IoU=0.5 IoU=0.7 IoU=0.5 IoU=0.7
TGN 28.47 - 43.33 - 21.77 18.90 39.06 31.02 - - - -
CTRL 29.01 10.34 59.17 37.54 18.32 13.30 36.69 25.42 23.63 8.89 58.92 29.57
ACRN 31.67 11.25 60.34 38.57 19.52 14.62 34.97 24.88 20.26 7.64 71.99 27.79
QSPN 33.26 13.43 62.39 40.78 20.15 15.23 36.72 25.30 35.60 15.80 79.40 45.40
CBP 35.76 17.80 65.89 46.20 27.31 24.79 43.64 37.40 36.80 18.87 70.94 50.19
SCDM 36.75 19.86 64.99 41.53 26.11 21.17 40.16 32.18 54.44 33.43 74.43 58.08
GDP 39.27 - - - 24.14 - - - 39.47 18.49 - -
LGI 41.51 23.07 - - - - - - 59.46 35.48 - -
BPNet 42.07 24.69 - - 25.96 20.96 - - 50.75 31.64 - -
VSLNet 43.22 26.16 - - 29.61 24.27 - - 54.19 35.22 - -
CMIN 43.40 23.88 67.95 50.73 24.64 18.05 38.46 27.02 - - - -
2DTAN 44.51 26.54 77.13 61.96 37.29 25.32 57.81 45.04 39.81 23.25 79.33 51.15
DRN 45.45 24.36 77.97 50.30 - 23.17 - 33.36 53.09 31.75 89.06 60.05
CBLN 48.12 27.60 79.32 63.41 38.98 27.65 59.96 46.24 61.13 38.22 90.33 61.69
MGSL-Net 51.87 31.42 82.60 66.71 42.54 32.27 63.39 50.13 63.98 41.03 93.21 63.85
Table 1: Performance compared with the state-of-the-arts on ActivityNet Caption, TACoS, and Charades-STA datasets.

Comparisons with the State-of-the-Arts

Compared Methods. We compare the proposed MGSL-Net with state-of-the-art TSG methods on three datasets. These methods are grouped into three categories from the viewpoint of proposal-based versus proposal-free approaches: 1) proposal-based methods: TGN chen2018temporally, CTRL gao2017tall, ACRN liu2018attentive, QSPN xu2019multilevel, CBP wang2019temporally, SCDM yuan2019semantic, CMIN zhang2019cross, 2DTAN zhang2019learning, and CBLN liu2021context; 2) proposal-free methods: GDP chenrethinking, LGI mun2020local, VSLNet zhang2020span, and DRN zeng2020dense; 3) others: BPNet xiao2021boundary. Note that all the above methods directly utilize deep networks to learn cross-modal retrieval without considering the rarely appearing video-query samples.

Comparison on ActivityNet Caption. Table 1 summarizes the results on ActivityNet Caption. It shows that our MGSL-Net outperforms all the baselines in all metrics. Specifically, MGSL-Net works well even under the stricter metrics, e.g., it achieves significant absolute improvements of 3.82% and 3.30% in R@1, IoU=0.7 and R@5, IoU=0.7, respectively, compared to the previous state-of-the-art method CBLN, which demonstrates the superiority of our model. This is mainly because our memory stores useful cross-modal shared semantic representations and thus better associates the rarely appearing video-query pairs in the standard test set.

Comparison on TACoS. From Table 1, we can also find that our MGSL-Net achieves the best performance on the TACoS dataset. Note that our model shows much larger improvements on TACoS than on ActivityNet Caption. This is mainly because the smaller and less diverse training data of TACoS cannot guarantee that previous models capture the relations among small objects well, whereas our model can better exploit the memorized semantics as auxiliary resources for learning.

Comparison on Charades-STA. As shown in Table 1, our MGSL-Net achieves new state-of-the-art performance over all metrics on Charades-STA. Since this dataset contains fewer rarely appearing samples, the performance improvement is smaller than on the other two datasets.

Method Run-Time Model Size R@1, IoU=0.5
ACRN 4.31s 128M 14.62
CTRL 2.23s 22M 13.30
TGN 0.92s 166M 18.90
2DTAN 0.57s 232M 25.32
DRN 0.15s 214M 23.17
MGSL-Net 0.10s 203M 32.27
Table 2: Efficiency comparison run on TACoS dataset.

Efficiency Comparison

We evaluate the efficiency of our MGSL-Net by fairly comparing its running time and parameter size with existing methods on the TACoS dataset using a single NVIDIA TITAN Xp GPU. As shown in Table 2, our model achieves a much faster processing speed with relatively fewer learnable parameters. This is attributed to: 1) the proposal generation and proposal matching procedures of proposal-based methods (ACRN, CTRL, TGN, 2DTAN) are quite time-consuming; 2) the regression-based method DRN utilizes many convolutional layers for multi-level feature fusion in cross-modal interaction, which is also time-consuming; 3) our MGSL-Net is free from such complex and time-consuming operations, showing superiority in both effectiveness and efficiency.

Figure 4: Data distribution on the TACoS dataset, and the performance comparison on its rare cases.
Figure 5: Data distribution on the Charades-STA dataset, and the performance comparison on its rare cases.

Analysis on the Rare Cases

Analyzing the rarely appearing video-query samples is not straightforward since such pairwise data is hard to define. Therefore, we first analyze the data distribution of each dataset, as shown in Figures 1, 4, and 5, and then select as rare samples those video-sentence pairs containing at least one word (noun, verb, or adjective) whose frequency of appearance is less than 10; the remaining samples are treated as common samples. In these three figures, we show the performance of different methods on the rare cases and the common cases. Our proposed method handles the rare cases more effectively, bringing much larger improvements than the other methods. We also show qualitative examples of the grounding results in Figure 6, where our method learns and memorizes the semantics of the rare cases and grounds the segments more accurately.

Domain-specific memory networks  Heterogeneous attention module  R@1, IoU=0.5  R@1, IoU=0.7
✗ ✗ 43.65 22.48
✓ ✗ 49.24 28.01
✗ ✓ 46.19 25.73
✓ ✓ 51.87 31.42
Table 3: Main ablation study on the ActivityNet Caption dataset.
Video domain Query domain R@1, R@1,
shared shared IoU=0.5 IoU=0.7
46.19 25.73
47.64 26.82
48.88 27.90
49.97 28.21
50.56 29.60
51.20 30.85
51.87 31.42
Table 4: Performance comparisons with the memory network in different settings on ActivityNet Caption dataset.
Self- Inter- Calibration R@1, R@1,
attention attention IoU=0.5 IoU=0.7
49.24 28.01
50.33 28.98
51.27 30.56
51.87 31.42
Table 5: Performance comparisons with the heterogeneous attention module in different settings on ActivityNet Caption dataset.

Ablation Study

Main ablation. As shown in Table 3, we first study the influence of each main component in our proposed MGSL-Net. We take the MGSL-Net model without both the domain-specific memory networks and the heterogeneous multi-modal integration module as the baseline. The table shows that both the memory network and the heterogeneous attention contribute greatly to the final performance, with the memory network bringing the largest improvement of 5-6 absolute points.

Investigation on memory network. We investigate the performance of the memory network under different settings in Table 4, where "shared" means utilizing shared memory slots to learn the semantics of the two inputs in each domain. From the table, we can draw several observations: 1) the two aligned semantics in each pairwise input, $(\mathbf{V}, \widetilde{\mathbf{Q}})$ or $(\mathbf{Q}, \widetilde{\mathbf{V}})$, are both important for memory learning; 2) in each domain, a shared memory performs better than two separate memories for reading and updating the pairwise data; 3) the memory-based contexts in both the video and query domains are helpful for grounding.

To further investigate what the shared memory actually learns, we reduce the dimensionality of the memory slots with PCA and show their two-dimensional representations (grey nodes) in Figure 7. We can see that the nodes distribute in a divergent shape, in which the top nodes are more compact while the bottom ones are more scattered. To figure out the semantic meanings of these memory slots, we take several representative nodes (marked with arrows) as queries to retrieve video-query pairs. We find that rarely appearing content is indeed captured and represented by the scattered nodes, while more commonly appearing content is captured and represented by the compact ones.
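A small sketch of this visualization step, roughly reproducing the PCA projection used for Figure 7 (our own illustration, not the authors' plotting script; the function name and output path are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def visualize_memory(memory, out_path="memory_pca.png"):
    """Project memory slots (K, D) to 2D with PCA and scatter-plot them."""
    slots = memory.detach().cpu().numpy() if hasattr(memory, "detach") else np.asarray(memory)
    coords = PCA(n_components=2).fit_transform(slots)     # (K, 2)
    plt.figure(figsize=(5, 5))
    plt.scatter(coords[:, 0], coords[:, 1], s=8, c="grey")
    plt.title("Memory slots (PCA)")
    plt.savefig(out_path, bbox_inches="tight")
```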

Investigation on heterogeneous attention. We also conduct ablation studies on the heterogeneous attention module in Table 5, where we set the inter-attention branch as the baseline. The self-attention brings the largest improvement, since it not only captures the intra-relations among the elements of each modality but also provides enhanced video features in the video domain. The calibration unit also contributes to the final performance.

Figure 6: Qualitative results of our proposed method. Rarely appearing words are marked in red.
Figure 7: Two-dimensional visualization of the learned memory items. The blue rectangle denotes the target segment. Rarely appearing words are marked in red.

Conclusion

In this paper, we have proposed the Memory-Guided Semantic Learning Network (MGSL-Net) to handle rarely appearing pairwise samples in the temporal sentence grounding task. The main contributions of this work are: 1) we propose a cross-modal graph convolutional network to align the semantics between video and query, 2) we develop domain-specific persistent memories to learn and memorize the cross-modal shared semantic representations, and 3) we devise a heterogeneous attention module to integrate the enhanced multi-modal features in both the video and query domains. Experimental results show the superiority of our method in both effectiveness and efficiency.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant (No.61972448, No.62172068, No.61802048).

References