Temporal sentence grounding (TSG) is a fundamental yet challenging task in multimedia understanding [54, 84, 50, 61, 17, 52, 47, 1, 70, 82, 85, 51, 48, 62, 86, 13, 12, 14]. As shown in Fig. 1(a), given an untrimmed video, this task aims to predict the specific segment containing the activity described by a sentence query. Traditional TSG approaches can be divided into two categories: 1) Top-down approaches [2, 7, 83, 71, 81, 30]: these methods first pre-define multiple segment proposals and align them with the query for cross-modal semantic matching; the proposal with the highest similarity score is selected as the predicted segment. 2) Bottom-up approaches [8, 72, 40, 77, 34]: these methods directly regress the start and end boundary frames of the target segment, or predict boundary probabilities frame-wisely and obtain the predicted segment through post-processing steps that group or aggregate all frame-wise predictions. Although both types of work achieve strong performance, they are not end-to-end, and still suffer from the redundant proposal generation/matching process (top-down) or complex post-processing steps (bottom-up) to refine the grounding results.
Recently, transformer-based approaches [4, 79, 63] have been introduced to handle the TSG task in an end-to-end manner. Different from the top-down and bottom-up approaches, they capture more fine-grained interactions between the video-query input and directly output segment predictions via an effective transformer encoder-decoder architecture [53, 68, 69, 28, 38], without any time-consuming pre- or post-processing operations. The general transformer-based pipeline is shown in Fig. 1(b). It first feeds the video frames and query words into the transformer to equally align the semantics between each frame-word pair. Then, a transformer decoder with direct set prediction [5, 59, 27] is utilized to predict a few learnable segment candidates with corresponding confidence scores. Thanks to this simple pipeline and the multi-modal relationship modeling capability of the transformer, these transformer-based approaches are both effective and computationally efficient.
However, we argue that existing transformer-based approaches are limited by a bottleneck in capturing the complete correspondence between visual contents and sentence semantics. Since previous transformer encoders interact equally with each frame-frame/frame-word/word-word pair, they fail to explore both query-related low-level entities and long-range dependent high-level contexts during semantic alignment. Thus, they are prone to getting stuck in the local minima of partial semantic matching, e.g., activating only the most discriminative parts instead of the full event extent, or matching only a few of the semantics in the sentence query. For instance, as shown in Fig. 1(b), without capturing the individual semantics of different phrases and correlating them for complete reasoning, existing transformer-based TSG methods focus only on the most discriminative semantics "jump up and down", leading to partial semantic grounding. Considering that many real-world video-query pairs involve different levels of granularity, such as frames and words or clips and phrases with distinct semantics, it is crucial to first capture the local query context for modeling the corresponding visual activity, and then comprehend the global sentence semantics by correlating all local activities.
To this end, in this paper, we propose a novel Hierarchical Local-Global Transformer (HLGT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities, learning more fine-grained cross-modal representations. As shown in Fig. 1(c), we first split the video and query into individual frames and words, and learn their clip- and phrase-level dependencies via a local temporal transformer encoder. A global temporal transformer encoder is also utilized to correlate the local semantics for complete sentence understanding. Then, we develop a global-local transformer encoder to further learn the complicated interactions between the extracted local- and global-level semantics. Besides, we develop a new cross-modal cycle consistency loss in the transformer decoder to enforce better cross-modal semantic alignment. Finally, we design a cross-modal parallel transformer decoder to integrate visual and textual features in parallel, reducing the computational cost. By capturing both local and global granularities in the multi-modal information, our HLGT can capture the complete query semantics for more accurate video grounding.
In summary, the main contributions of our work are:
We present a novel Hierarchical Local-Global Transformer (HLGT), which captures different levels of granularity in both video and query domains to reason the complete semantics for fine-grained grounding. To the best of our knowledge, it is the first time that a multi-level interaction network is proposed to alleviate the limitations of existing transformer-based TSG methods.
We design a cross-modal parallel transformer decoder with a brand-new cross-modal cycle-consistency loss to encourage semantic alignment between visual and language features in the joint embedding space.
We conduct extensive experiments on three challenging benchmarks (ActivityNet Captions, TACoS and Charades-STA), where our proposed HLGT outperforms the state-of-the-art methods by clear margins, demonstrating its effectiveness and computational efficiency.
II Related Work
II-A Traditional TSG Methods
As a recently introduced multimedia task [15, 2], temporal sentence grounding (TSG) aims to identify the start and end timestamps of the video segment most relevant to a sentence query within an untrimmed video. Most works [2, 7, 83, 71, 81, 43, 30] adopt a top-down framework, which first samples candidate segment proposals from the untrimmed video and then matches each proposal against the sentence query for ranking and selection. Although these top-down methods achieve good performance in some cases, they are severely proposal-dependent and time-consuming, which limits their applications. To avoid proposals, more recent methods [8, 72, 40, 77, 34] adopt a bottom-up framework, which directly regresses the start and end timestamps of the target segment after interacting the whole video with the query.
Although the above two types of work achieve significant performance, they are not end-to-end: they suffer from the redundant proposal generation/matching process (top-down) or complex post-processing steps (bottom-up) to refine the grounding results.
II-B Transformer-Based TSG Methods
Recently, some transformer-based TSG approaches [4, 79, 63] have been proposed to work in an end-to-end manner [74, 16, 19]. Different from the top-down and bottom-up approaches, they capture more fine-grained interactions between the video-query input and directly output segment predictions via an effective transformer encoder-decoder architecture, without time-consuming pre- and post-processing operations. These transformer-based methods first feed the video frames and query words into the transformer to equally align the semantics between each frame-word pair. Then, a transformer decoder with direct set prediction is used to predict a few learnable segment candidates with corresponding confidence scores. Owing to this simple pipeline and the multi-modal relationship modeling capability of the transformer, this approach is more effective than traditional methods.
Since many video-query pairs involve different levels of granularity (e.g., frame-word pairs and clip-phrase pairs), it is crucial to first capture the local query context for modeling the corresponding visual activity and then comprehend the global sentence semantics by correlating all local activities.
II-C Cycle-Consistent Learning
By utilizing transitivity as the training objective, cycle-consistent learning explores task correlations to regularize training, and is widely used in various multimedia fields, such as vision-language navigation [55, 6], text-to-image synthesis [56, 49], and image-text matching [37, 78]. For example, based on the assumption of a cyclical structure, Wang et al. learn visual supervision by tracking forward and backward in time. In the visual question answering task, Shah et al. use cycle-consistency to enforce consistency between the generated and the original questions. Although these methods address multi-modal tasks, they perform cycle-consistent learning only in the visual domain, which limits their performance; thus, they cannot be directly applied to our more complex TSG task. To the best of our knowledge, our study is the first attempt to apply cycle-consistent learning to the TSG task.
III The Proposed HLGT Network
As a significant multimedia task, temporal sentence grounding (TSG) aims to localize, from an untrimmed video $V=\{v_t\}_{t=1}^{T}$, the precise boundary $(\tau_s, \tau_e)$ of a specific segment semantically corresponding to a given query $Q=\{q_i\}_{i=1}^{N}$, where $q_i$ denotes the $i$-th word, $N$ denotes the word number, $\tau_s$ and $\tau_e$ denote the start and end timestamps of the specific segment, $v_t$ denotes the $t$-th frame, and $T$ denotes the frame number. Recently, some transformer-based approaches [4, 79, 63] have shown strong performance on the TSG task via an effective transformer encoder-decoder architecture in an end-to-end manner. However, they still suffer from the vanilla transformer design and fail to explore the different levels of granularity with distinct but fine-grained semantics in both video and query. Therefore, how to effectively capture and integrate these multi-level cross-modal contexts for better grounding is an emerging issue.
In this section, we present a novel Hierarchical Local-Global Transformer (HLGT), which leverages this hierarchy information and models the interactions between different levels of granularity and multiple modalities for learning more fine-grained multi-modal representations. As shown in Fig. 2, the proposed HLGT model consists of four parts: the multi-modal feature extractors, the multi-level transformer encoder, the cycle-consistent transformer decoder, and the boundary prediction head. Given the paired video-query input, we first split the video/query into clips/phrases and extract their internal frame- and word-level features via the multi-modal feature extractors. Then, we capture the relationship between frame/word features within each clip/phrase via a temporal transformer to integrate the local clip-/phrase-level features. Meanwhile, we also feed all frames/words into another temporal transformer to encode the corresponding global representations. We fuse the local and global visual and textual tokens with two global-local transformers to learn the contextualized semantics of each modality. After that, we introduce a cross-modal parallel transformer decoder to interact the video and query features for semantic alignment in parallel. Specifically, we develop a new cross-modal cycle consistency (CMCC) loss to assist the multi-modal interaction. Then, the boundary prediction head is utilized to predict the final temporal segments based on the interacted multi-modal representations. In the following, we present the details of each module.
III-B Feature Extraction
Video extractor. For video encoding, we first sample every 16 consecutive frames as a clip. Then, we use a pre-trained ResNet-152 network to extract the frame-level visual features in each clip. We denote the extracted video features as $V=\{v_t\}_{t=1}^{T}=\{c_j\}_{j=1}^{M}$, where $T$ denotes the frame number of the whole video, $v_t \in \mathbb{R}^{D}$ denotes the $t$-th frame feature, $c_j$ denotes the $j$-th clip, and $D$ denotes the visual feature dimension.
Query extractor. For query encoding, we first utilize the GloVe embedding to generate word-level features. The extracted query features are denoted as $Q=\{q_i\}_{i=1}^{N}$, where $N$ denotes the word number of the whole query, $q_i \in \mathbb{R}^{D}$ denotes the $i$-th word feature, $P$ denotes the phrase number, and $D$ denotes the textual feature dimension, which is the same as the visual feature dimension in the video extractor. The $j$-th phrase is $p_j=\{q_i^j\}_{i=1}^{N_j}$, where $q_i^j$ denotes the $i$-th word in the phrase and $N_j$ denotes the word number in the $j$-th phrase. Then, following prior work, we split the given query into multiple phrases as follows: to discover the potential phrase-level features, we apply 1D convolutions with different window sizes on the word-level features. At each word location, we compute the inner product of the word feature vectors with convolution filters of three window sizes, which captures phrase features at three different scales. To maintain the sequence length after convolution, we zero-pad the sequence vectors when the convolution window size is larger than one. The output at the $i$-th word location with window size $k$ is formulated as $p_i^{(k)}=\mathrm{Conv1d}_k(Q)_i$, where $\mathrm{Conv1d}_k(\cdot)$ operates on the windowed features with $D$ kernels, and $p_i^{(k)}$ is the phrase-level feature corresponding to the $i$-th word location with window size $k$. To find the most contributive phrase at each word location, we then apply max-pooling to obtain the final phrase-level feature $p_i = \max_k p_i^{(k)}$. Thus, we can split the query into multiple phrases.
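The multi-scale phrase extraction above can be sketched as follows. This is a minimal numpy version, with randomly initialized kernels standing in for the learned convolution filters; all names are illustrative, not from the paper's code:

```python
import numpy as np

def phrase_features(word_feats, window_sizes=(1, 2, 3), seed=0):
    """Sketch: 1D convolutions with several window sizes over word features
    (zero-padded to keep the sequence length), then max-pooling across scales.
    `word_feats` has shape (N, d) for N words with d-dimensional embeddings."""
    rng = np.random.default_rng(seed)
    n, d = word_feats.shape
    outputs = []
    for k in window_sizes:
        w = rng.standard_normal((k, d, d)) * 0.02   # hypothetical conv kernels
        pad = k - 1
        padded = np.pad(word_feats, ((pad // 2, pad - pad // 2), (0, 0)))
        # inner product of each length-k window with the kernels (a plain 1D conv)
        conv = np.stack([
            sum(padded[i + j] @ w[j] for j in range(k)) for i in range(n)
        ])
        outputs.append(conv)
    # max-pool across window sizes -> one phrase-level feature per word location
    return np.max(np.stack(outputs), axis=0)        # shape (N, d)
```

In a trained model the kernels would be learned jointly with the rest of the network; here they only illustrate the data flow.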
For the extracted visual features, we focus on two levels of frame features in the subsequent reasoning: the frame features within each clip and all the frame features in the whole video. Similarly, we focus on two levels of word features: the word features within each phrase and all the word features in the given query.
III-C Transformer Encoder
For the transformer encoder, we first feed the extracted frame/word features within each clip/phrase to a temporal transformer followed by a feature fusion module, which can fuse and generate corresponding clip-level/phrase-level features. Then, since these clip-level/phrase-level features can only learn local semantic information of the whole video/query, we also feed all the frame/word features within the whole video/query to the temporal transformer followed by a feature fusion module for learning the global semantic information. Finally, for each modality, we integrate both the global information and the local information by proposing a global-local transformer to generate more contextual features.
Temporal transformer. As shown in Fig. 2, given the extracted frame/word representations within each clip/phrase, we introduce a temporal transformer network with standard attention blocks to learn the correlations between frames/words for the later clip-/phrase-level fusion. Fig. 3 shows the details of the temporal transformer.
For ease of description, we first introduce the notation of a standard transformer block (called TRM). Since the transformer architecture contains multi-head self-attention blocks for correlating and updating multiple inputs, we define TRM as:

$$\mathrm{TRM}(Q, K, V) = \mathrm{FFN}\!\left(\mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V\right),$$

where FFN is the feed-forward network, $d$ is the feature dimension of the multi-head block, and $Q$, $K$ and $V$ are the query, key and value of TRM respectively, computed by:

$$Q = XW^{Q},\quad K = XW^{K},\quad V = XW^{V},$$

where $W^{Q}$, $W^{K}$ and $W^{V}$ are the embedding weight matrices, and $X$ denotes any sequence as the input of TRM.
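The core of the TRM block can be sketched as follows: a minimal single-head version with the FFN and multi-head splitting omitted; the function names are ours:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def trm_attention(X, Wq, Wk, Wv):
    """Q, K, V are linear projections of the input sequence X (n, d),
    followed by scaled dot-product attention over the sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))   # (n, n) attention weights, rows sum to 1
    return attn @ V                         # updated sequence, shape (n, d)
```

In the full model, this operation is repeated per head and followed by the feed-forward network and residual/normalization layers.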
Then, within each modality, all the temporal transformers share the same weights. In the visual branch, if the input of the temporal transformer is the frame-level feature of the $j$-th clip $c_j$, we denote its corresponding output as $\hat{c}_j$, obtained by:

$$\hat{c}_j = \mathrm{TRM}(Q_c, K_c, V_c),$$

where $Q_c$, $K_c$ and $V_c$ are the query, key and value projected from $c_j$ in the clip-based temporal transformer. To obtain the frame-level video representation, we employ the same temporal transformer on the whole video and denote its output as $\hat{V}$.

Similarly, in the textual branch, if the input of the temporal transformer is the $j$-th phrase $p_j$, its output $\hat{p}_j$ is the word-level phrase representation; likewise, the word-level query representation is $\hat{Q}$.
In the TSG task, both video and query are naturally represented at different levels of granularity: a video/query is composed of several clips/phrases, and each clip/phrase contains multiple frames/words. The frames/words in each clip/phrase carry only partial visual/textual information of that clip/phrase, i.e., a local representation of the given video/query. To obtain the global representation of the given video/query, we also feed all the frame/word features to the temporal transformer, yielding the global visual feature and the global textual feature.
Feature fusion. To integrate the frame/word features within each clip/phrase, we introduce an attention-aware feature fusion module. For any input sequence $X$, the corresponding attention matrix $A$ is calculated by:

$$A = \mathrm{Softmax}\!\left(W_2\,\mathrm{GELU}(W_1 X + b_1) + b_2\right),$$

where $W_1$ and $W_2$ are two learnable transformation weight matrices, $b_1$ and $b_2$ are two biases, and GELU is the activation function. Each element in $A$ denotes an attention score indicating whether the frame/word contents contribute to the clip-/phrase- (or video-/query-) level semantics. It helps to highlight the foregrounds and filter out the backgrounds for better representation learning. Thus, based on the matrix $A$, we obtain the fused feature $\bar{x}$ by:

$$\bar{x} = \sum_{i=1}^{n} a_i \odot h_i,$$

where $\odot$ is element-wise multiplication, $n$ is the frame/word number in the clip/phrase (or video/query), $h_i$ is the $i$-th output of the temporal transformer with the input $X$, and $a_i$ is the $i$-th attention column of $A$.
In the visual branch, for the $j$-th clip, the temporal transformer output $\hat{c}_j$ is the input of the attention-aware feature fusion module. Based on Eq. (4) and (5), we fuse all the frame features in $\hat{c}_j$ into the clip-level feature. Similarly, we fuse all the frame features in the whole video into the global visual feature.
For the textual branch, we obtain phrase-level features by fusing the word-level features within each phrase. For instance, we fuse all the word features in the $j$-th phrase into its phrase-level feature based on Eq. (5). Besides, we obtain the global textual feature by fusing all the word features in the given query.
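The attention-aware fusion can be sketched as follows: a minimal version that scores each position with a two-layer GELU network and pools by the resulting softmax weights; the weight shapes are hypothetical:

```python
import numpy as np

def attention_fusion(H, W1, b1, W2, b2):
    """Pool temporal-transformer outputs H (n, d) into a single clip/phrase
    (or video/query) feature via learned per-position attention scores."""
    def gelu(x):
        # tanh approximation of GELU
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
    scores = gelu(H @ W1 + b1) @ W2 + b2     # (n, 1) raw attention scores
    a = np.exp(scores - scores.max())
    a = a / a.sum()                          # softmax over the n positions
    return (a * H).sum(axis=0)               # attention-weighted sum -> (d,)
```

With zero-initialized weights the scores are uniform, so the fusion reduces to mean pooling; training would sharpen the weights toward foreground frames/words.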
Global-local transformer. In each modality, the generated global features (video- or query-level) and local features (clip- or phrase-level) lie at different levels. To effectively integrate these cross-level features for more fine-grained feature fusion, we propose a global-local transformer, shown in Fig. 3. Specifically, the global-local transformer contains two modules: a local transformer and a global transformer. Both the visual branch and the textual branch share the same global-local transformer architecture. For each modality, the local transformer learns the short-term interactions between low-level semantics (adjacent dependencies between clips/phrases), while the global transformer models the long-term interactions between local and global representations (global dependencies between clip-video or phrase-query).
For the local transformer, the core components include a multi-head attention, a feed-forward layer and a normalization layer. Following the standard transformer, we append the positional embedding (PE) using sine and cosine functions of different frequencies:

$$\mathrm{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right),\quad \mathrm{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right),$$

where $2i$ and $2i+1$ are the even and odd indices of the positional embedding, $\mathrm{PE}(pos, \cdot)$ denotes the positional embedding of the $pos$-th position, and $d$ is its dimension. Therefore, the output of the local transformer is:

$$F_L = \mathrm{TRM}(X_L + \mathrm{PE}),$$

where $X_L$ denotes the local (clip- or phrase-level) features of the corresponding modality; $F_L^{v}$ is the visual output of the local transformer and $F_L^{q}$ is the textual output of the local transformer.
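The sinusoidal positional embedding can be computed as follows (the standard formulation, assuming an even embedding dimension):

```python
import numpy as np

def positional_embedding(n_pos, d):
    """Sine at even indices, cosine at odd indices, with frequencies
    decreasing geometrically along the embedding dimension."""
    pe = np.zeros((n_pos, d))
    pos = np.arange(n_pos)[:, None]                       # (n_pos, 1)
    div = np.exp(np.arange(0, d, 2) * (-np.log(10000.0) / d))
    pe[:, 0::2] = np.sin(pos * div)                       # even dims
    pe[:, 1::2] = np.cos(pos * div[: d // 2])             # odd dims
    return pe
```

The embedding is added to the local features before the TRM block so that attention can distinguish positions in the sequence.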
For the global transformer, the keys and values come from the output of the normalization layer in the local transformer, while the query comes from the global representation; we feed both the global and local representations into the multi-head attention block to learn cross-level correlating and updating. As a result, the TRM block in the global transformer generates attention features for the global representation conditioned on the local representation:

$$F_G = \mathrm{TRM}(Q_G, K_L, V_L),$$

where $Q_G$ is projected from the global representation and $K_L$, $V_L$ from the normalized local output; $F_G^{v}$ is the visual output of the global transformer and $F_G^{q}$ is the textual output of the global transformer. Similar to the local transformer, we also add a feed-forward layer and a normalization layer to encode $F_G^{v}$ and $F_G^{q}$. Finally, for each modality, we concatenate the local and global representations to generate the final fine-grained visual/textual features:

$$F^{v} = [F_L^{v}; F_G^{v}],\quad F^{q} = [F_L^{q}; F_G^{q}],$$

where $F^{v}$ is the final visual feature and $F^{q}$ is the final textual feature.
III-D Transformer Decoder
After obtaining the fine-grained visual and textual features, we need a transformer decoder to handle the cross-modal interactions. Suppose we need to predict a set of segment candidates; as additional input, learnable segment queries [58, 10, 4] are utilized to capture possible segments by aligning the semantics between the visual and textual features. Based on these segment queries, we develop a cross-modal parallel transformer to integrate the features from the two modalities in parallel. To further assist the multi-modal semantic alignment and interaction, we also design a new cross-modal cycle consistency loss in this decoder for supervision.
Cross-modal parallel transformer. Given the visual features $F^{v}$, we employ several linear layers to generate a video-specific key $K^{v}$ and a video-specific value $V^{v}$. Similarly, we obtain the query-specific key $K^{q}$ and the query-specific value $V^{q}$:

$$K^{v} = F^{v}W_{K}^{v},\quad V^{v} = F^{v}W_{V}^{v},\quad K^{q} = F^{q}W_{K}^{q},\quad V^{q} = F^{q}W_{V}^{q},$$

where $W_{K}^{v}$, $W_{V}^{v}$, $W_{K}^{q}$ and $W_{V}^{q}$ are learnable parameters. Based on the modality-specific keys and values, we design a modality-specific attention module that fuses multi-modal features via two parallel branches (i.e., two MultiAtt modules) in Fig. 3, where MultiAtt is the standard multi-head attention module [53, 67]:

$$A^{v} = \mathrm{MultiAtt}(S, K^{v}, V^{v}),\quad A^{q} = \mathrm{MultiAtt}(S, K^{q}, V^{q}),$$

where $A^{v}$ is the attention output in the visual branch, $A^{q}$ is the attention output in the textual branch, and $S$ denotes the segment queries enhanced by a self-attention operation. To model fine-grained cross-modal interaction, we integrate these two attentions as:

$$A = w^{v} \odot A^{v} + w^{q} \odot A^{q},$$

where the additive sum uses learnable weights: $w^{v}$ weights the visual branch and $w^{q}$ weights the textual branch. Note that the main computational cost of the cross-modal parallel transformer is matrix multiplication (i.e., "matmul" in Fig. 3). Since the visual attention and the textual attention can be calculated in parallel, this design improves the computational efficiency.
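The two parallel branches can be sketched as follows: a minimal single-head version where fixed scalar weights `wv`, `wq` stand in for the learnable additive weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parallel_cross_modal(S, Kv, Vv, Kq, Vq, wv=0.5, wq=0.5):
    """Segment queries S (m, d) attend to video-specific (Kv, Vv) and
    query-specific (Kq, Vq) key/value pairs in two independent branches,
    whose outputs are combined by a weighted additive sum."""
    d = S.shape[-1]
    attn_v = softmax(S @ Kv.T / np.sqrt(d)) @ Vv   # visual branch
    attn_q = softmax(S @ Kq.T / np.sqrt(d)) @ Vq   # textual branch
    return wv * attn_v + wq * attn_q               # fused decoder features
```

Because the two branches share no intermediate results, they can be dispatched concurrently, which is the source of the parallel speedup discussed above.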
Cross-modal cycle consistency. In the TSG task, a phrase often corresponds to a specific clip. To enforce better semantic alignment between clips and phrases, we design a new cross-modal cycle-consistency loss during the cross-modal interaction. In general, if a clip and a phrase are semantically aligned, their representations should be nearest neighbors in the learned common space. Therefore, after obtaining the clip-level and phrase-level features, we impose a cross-modal cycle-consistency constraint for better cross-modal alignment.
Firstly, given a phrase $p_i$, we find its visual soft nearest neighbor (i.e., the most relevant clip) $\tilde{c}$ by:

$$\tilde{c} = \sum_{j=1}^{M} \alpha_j c_j,\quad \alpha_j = \frac{\exp\!\left(\mathrm{sim}(c_j, p_i)\right)}{\sum_{k=1}^{M} \exp\!\left(\mathrm{sim}(c_k, p_i)\right)},$$

where $\mathrm{sim}(c_j, p_i)$ computes the similarity score of clip $c_j$ and phrase $p_i$.

Then, we cycle back from $\tilde{c}$ to the phrase sequence by calculating the soft phrase location $\mu_i$ as follows:

$$\mu_i = \sum_{k=1}^{P} \beta_k \, k,\quad \beta_k = \frac{\exp\!\left(\mathrm{sim}(\tilde{c}, p_k)\right)}{\sum_{l=1}^{P} \exp\!\left(\mathrm{sim}(\tilde{c}, p_l)\right)}.$$

To learn semantically consistent representations, we penalize deviations from cycle-consistency for sampled sets of clips and sentences based on the following loss:

$$\mathcal{L}_{cyc} = \sum_{i} \left(\mu_i - i\right)^2,$$

where $\mathrm{sim}(\cdot, \cdot)$ denotes the cosine similarity of its two arguments.
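The soft-nearest-neighbor cycle can be sketched as follows; cosine similarity and the soft-location computation follow the description above, while the exact loss normalization in the paper may differ:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cycle_back_location(phrase, clips, phrases):
    """A phrase picks its soft nearest clip; that soft clip then cycles back
    to a soft (expected) location over the phrase sequence. A cycle-consistent
    embedding should return a location close to the phrase's own index."""
    def cos(a, B):
        return B @ a / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-8)
    alpha = softmax(cos(phrase, clips))        # attention over clips
    soft_clip = alpha @ clips                  # visual soft nearest neighbor
    beta = softmax(cos(soft_clip, phrases))    # cycle back to the phrases
    return beta @ np.arange(len(phrases))      # soft phrase location
```

Training then penalizes the squared distance between this soft location and the true phrase index, pulling matched clip/phrase pairs together in the common space.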
III-E Boundary Prediction
After obtaining the integrated attention output by Eq. (18), we utilize multiple feed-forward networks to obtain a series of fixed-length boundary predictions $\{(\hat{b}_k, \hat{s}_k)\}_{k=1}^{K}$, where $\hat{b}_k$ contains the $k$-th predicted segment coordinates and $\hat{s}_k$ is the corresponding confidence score. We denote the ground truth as $b$, which contains the ground-truth segment coordinates.
Based on the fixed-length boundary predictions and the ground-truth boundary, we design the set prediction loss as follows:

$$\mathcal{L}_{set} = \lambda_{1}\,\mathcal{L}_{L1}\!\left(b, \hat{b}_k\right) + \lambda_{2}\,\mathcal{L}_{GIoU}\!\left(b, \hat{b}_k\right),$$

where $\lambda_{1}$ and $\lambda_{2}$ are weighting parameters, $\mathcal{L}_{L1}$ is the boundary regression loss, and $\mathcal{L}_{GIoU}$ is a scale-invariant generalized intersection-over-union loss. By minimizing Eq. (22), we can determine the optimal prediction slot among the multiple boundary predictions. Assuming the $k^{*}$-th slot is optimal, we denote the corresponding optimal prediction as $\hat{b}_{k^{*}}$.
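A 1D version of the generalized IoU for temporal segments can be sketched as follows (one common formulation; the paper's exact implementation may differ):

```python
def giou_1d(pred, gt):
    """Generalized IoU for 1D segments given as (start, end).
    Returns a value in [-1, 1]; 1 means a perfect match, and disjoint
    segments are penalized by their enclosing-hull gap."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    hull = max(pe, ge) - min(ps, gs)          # smallest enclosing segment
    iou = inter / union if union > 0 else 0.0
    return iou - (hull - union) / hull if hull > 0 else iou
```

Unlike plain IoU, the hull term provides a nonzero gradient even when the predicted and ground-truth segments do not overlap, which is what makes the loss useful for boundary regression.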
Thus, the final loss is as follows:

$$\mathcal{L} = \mathcal{L}_{set} + \lambda\,\mathcal{L}_{cyc},$$

where $\mathcal{L}_{set}$ is the set prediction loss, $\mathcal{L}_{cyc}$ is the cross-modal cycle-consistency loss, and the parameter $\lambda$ is utilized to control the balance between them.
Inference: The inference process of our proposed HLGT is very simple. Without predefined threshold values or time-consuming post-processing processes, we generate the predicted segment boundary in only one forward pass. The predicted segment with the highest confidence score will be selected as the final predicted segment.
IV-A Datasets and Evaluation Metric
ActivityNet Captions. ActivityNet Captions contains 20k untrimmed videos with 100k language descriptions collected from YouTube. The videos mainly depict complicated human activities in daily life; they are 2 minutes long on average, and the annotated video segments vary widely in length, ranging from several seconds to over 3 minutes. Since the test split is withheld for competition, following the public split, we use 37421, 17505, and 17031 query-video pairs for training, validation, and testing, respectively.
Charades-STA. Built upon the Charades dataset, which was collected for video action recognition and video captioning, Charades-STA contains 6672 videos and 16128 video-query pairs, focusing on daily indoor activities. Following [15, 35], we utilize 12408 video-query pairs for training and 3720 pairs for testing.
TACoS. Collected for the video grounding and dense video captioning tasks, TACoS consists of 127 long videos on cooking activities with an average length of 4.79 minutes. For video grounding, it contains 18818 video-query pairs. For fair comparison, we follow the common split of the dataset, with 10146, 4589, and 4083 video-query pairs for training, validation, and testing, respectively.
Evaluation metric. Following previous works, we adopt "R@$n$, IoU=$m$" as the metric, which calculates the percentage of queries for which at least one of the top-$n$ selected segments has an intersection over union (IoU) larger than $m$ with the ground truth. A larger value means better performance. In our experiments, we utilize $n \in \{1, 5\}$ for all datasets, $m \in \{0.5, 0.7\}$ for ActivityNet Captions and Charades-STA, and $m \in \{0.3, 0.5\}$ for TACoS.
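The per-sample metric check can be sketched as:

```python
def recall_at_n(top_segments, gt, n, iou_thresh):
    """Per-sample 'R@n, IoU=m' check: True if at least one of the top-n
    predicted segments (start, end) overlaps the ground truth with IoU > m.
    The dataset-level metric is the fraction of samples where this holds."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0
    return any(iou(seg, gt) > iou_thresh for seg in top_segments[:n])
```

Averaging this boolean over all query-video pairs yields the reported percentages.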
IV-B Implementation Details
For video encoding, we define 16 consecutive frames as a clip, and adjacent clips overlap by 8 frames. Then, we use a pre-trained ResNet-152 network to extract the frame-level visual features in each clip. Since some videos are overlong, we uniformly downsample their frame-feature sequences to a fixed length. For query encoding, we utilize the GloVe embedding to embed each word into 300-dimensional features. For both the visual and textual branches, we set the hidden state size to 1024 and the number of attention heads to 8 in all transformer and attention blocks. We train the whole model for 80 epochs with a batch size of 16 and an early stopping strategy. The three hyperparameters are set to 0.8, 0.5 and 0.2, respectively, according to our empirical study. We perform the parameter optimization with the Adam optimizer, using separate learning rates for ActivityNet Captions/Charades-STA and for TACoS.
IV-C Comparison with State-of-the-Arts
Compared methods. We compare the proposed HLGT with state-of-the-art TSG methods on three datasets. These methods are grouped into three categories: top-down, bottom-up and transformer-based methods. For a fair comparison with these TSG methods, we cite their results from the corresponding works:
(i) Top-down approaches: CTRL, ACRN, QSPN, SCDM, BPNet, CMIN, 2DTAN, DRN, FIAN, CBLN. These methods first sample multiple candidate video segments, and then directly compute the semantic similarity between the query representation and each segment representation for ranking and selection.
(iii) Transformer-based approach: VIDGTR , De-VLTrans-MSA , GTR . Different from the above top-down and bottom-up approaches, they capture more fine-grained interaction between the video-query input and directly output the segment predictions via the effective transformer encoder-decoder architecture without using any time-consuming pre- and post-processing operation.
Comparison on ActivityNet Captions. We compare our proposed HLGT with the state-of-the-art top-down, bottom-up, and transformer-based TSG methods on the ActivityNet Captions dataset in Table I, where our HLGT reaches the best performance on all the metrics. Particularly, compared with the best top-down approach CBLN, our HLGT achieves 7.56%, 6.65%, 4.87% and 4.02% improvements on the four metrics, respectively. Our HLGT obtains an even larger improvement over the best bottom-up method IVG-DCL, by 11.84% and 7.15% in terms of R@1, IoU=0.5 and R@1, IoU=0.7. This verifies the benefit of our end-to-end architecture in effectively modeling the fine-grained visual-language alignment between video and language query. Besides, HLGT outperforms the other transformer-based methods by a large margin on all the metrics: compared to the best transformer-based method GTR, our proposed HLGT brings improvements of 5.51%, 5.14%, 3.76% and 2.29%. These significant improvements demonstrate the effectiveness of our multi-level interaction network and our new cross-modal cycle-consistency loss.
Comparison on Charades-STA. As shown in Table II, we also compare our proposed HLGT with the state-of-the-art top-down, bottom-up, and transformer-based TSG methods on the Charades-STA dataset. Our HLGT beats all the other methods on all evaluation metrics. Compared to the best top-down method CBLN, our HLGT achieves absolute improvements of 4.18% and 4.17% in terms of R@1, IoU=0.5 and R@5, IoU=0.5, respectively. Compared to the best bottom-up method ACRM, our HLGT improves the performance by 7.78% and 3.05% in terms of R@1, IoU=0.5 and R@1, IoU=0.7, respectively. Besides, our HLGT beats the best transformer-based method GTR by absolute improvements of 2.73% and 2.88% in terms of R@1, IoU=0.5 and R@5, IoU=0.5, respectively.
Comparison on TACoS. To further compare our proposed HLGT with the state-of-the-art top-down, bottom-up, and transformer-based TSG methods, we present the results in Table III. We can find that HLGT still outperforms all the other TSG methods in terms of all the metrics. Compared to the best top-down method CBLN, our HLGT outperforms it by 10.35%, 11.52%, 9.1% and 12.7% in terms of all metrics, respectively. HLGT also beats the best bottom-up method ACRM and brings the improvements by 10.54% and 12.13% in terms of R@1, IoU=0.3 and R@1, IoU=0.5, respectively. Compared to the best transformer-based method De-VLTrans-MSA, HLGT brings the improvements of 1.60% and 1.43% in terms of R@1, IoU=0.5 and R@5, IoU=0.3, respectively.
Efficiency comparison. We further evaluate the efficiency of our proposed HLGT by fairly comparing its inference speed (queries per second, QPS) with state-of-the-art methods on the challenging ActivityNet Captions dataset in Fig. 4. HLGT attains a high QPS, showing that it can efficiently process these challenging multi-modal data. Compared with other state-of-the-art methods, our HLGT runs faster while achieving better grounding performance. Particularly, our HLGT is 5.24% more accurate than the transformer-based method VIDTR with 38.71% faster speed, which shows both the effectiveness and the efficiency of our HLGT. This satisfactory performance is attributed to: (i) our cross-modal parallel transformer processes visual and textual features in parallel, which effectively reduces the time spent on multi-modal features; and (ii) for each modality, our global-local transformer learns the interactions between local and global semantics for better multi-modal reasoning and more accurate video grounding. Therefore, thanks to its efficiency and effectiveness on the challenging large-scale ActivityNet Captions dataset, our HLGT lends itself to a wide range of real-world multimedia applications.
IV-D Ablation Study
To examine the effectiveness of each component in our HLGT, we perform in-depth ablation studies on three challenging datasets: ActivityNet Captions, Charades-STA and TACoS.
[Table IV: main ablation results — columns: Model, Feature Extraction, Transformer Encoder, Transformer Decoder, Boundary Prediction, and R@1/R@5 at two IoU thresholds.]
Main ablation study. We first conduct the main ablation study to examine the effectiveness of all the modules in our model, including the multi-level feature extraction (local and global), the multi-level transformer encoders (local and global), the transformer decoder and the boundary prediction module. The ablation results are reported in Table IV: 1) Model i is the baseline model without the temporal transformer and feature fusion, where we directly employ the frame- and word-level features for grounding. 2) For each modality, Model ii only uses the local features and ignores the global features. 3) On the contrary, Model iii only utilizes the global features. 4) Model iv uses both local and global features for grounding. 5) Model v adds the CMCC loss to Model iv. 6) Model Full is our full HLGT.
From Table IV, we can find that: (i) Model Full performs the best and Model i the worst. (ii) Compared to Model i, Models ii and iii achieve improvements of 1.18% and 2.61% respectively in terms of "R@1, IoU=0.5" on the ActivityNet Captions dataset, which shows that both local and global features help align visual and textual representations. (iii) Compared to Models ii and iii, Model iv improves the performance by 1.49% and 1.82% respectively in terms of "R@1, IoU=0.7" on the Charades-STA dataset, because both local and global features are significant for learning the full visual/textual representation. (iv) Model v outperforms Model iv by 2.23% in terms of "R@5, IoU=0.5" on the TACoS dataset, because our cross-modal cycle consistency encourages semantic alignment between visual and language features in the joint embedding space. (v) Compared to Model v, Model Full achieves a further improvement of 1.36% in terms of "R@1, IoU=0.7" on the Charades-STA dataset.
Training process of different ablation models. We further analyze the training process and grounding performance of the different ablation models. Fig. 4 shows the experimental results, from which we can make the following observations: (i) At each epoch, HLGT(full) outperforms the other ablation models, which demonstrates the effectiveness of each module. For example, compared to the second-best model HLGT(v), HLGT(full) improves the performance by 3.27%. (ii) HLGT(full) converges faster than the ablation models, which shows that our full model is more time-efficient. For instance, HLGT(full) converges within 14 epochs, while HLGT(i) converges only after 18 epochs. Thus, our full HLGT can process these challenging datasets more efficiently.
[Table V: visual feature extractors — e.g., Our HLGT (Plain Transformer): 55.01 / 33.72 / 83.59 / 67.02.]
Analysis of different visual feature extractor networks. Most previous methods use a pre-trained C3D or I3D network to obtain visual features. However, both C3D and I3D produce clip-level rather than frame-level features. Different from them, we utilize a ResNet-152 network to obtain fine-grained frame-level features. In this subsection, we conduct an ablation study to analyze the performance of the ResNet-152 network. Table V shows the results, where the plain transformer is the baseline network. Our proposed HLGT with the ResNet-152 network performs better than with the clip-level pre-trained feature extractors (C3D and I3D). Specifically, compared to HLGT(C3D), HLGT(ResNet-152) improves the grounding performance by 0.39%, 0.41%, 0.42% and 0.30% over all metrics. This improvement illustrates the effectiveness of the ResNet-152 network.
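The granularity difference between the two extractor families can be sketched with scalar stand-ins for features. This is a toy illustration of the temporal resolution argument only, not the actual C3D/I3D or ResNet-152 computation (mean pooling stands in for the 3D convolution):

```python
def clip_level_features(frame_feats, clip_len=8):
    # C3D/I3D-style: one feature per non-overlapping clip of `clip_len`
    # frames, collapsing fine-grained temporal detail.
    return [sum(frame_feats[i:i + clip_len]) / len(frame_feats[i:i + clip_len])
            for i in range(0, len(frame_feats), clip_len)]

def frame_level_features(frame_feats):
    # ResNet-152-style: one feature per frame, preserving the full
    # temporal granularity that frame-wise grounding can exploit.
    return list(frame_feats)
```

For a 16-frame input, the clip-level extractor yields only 2 temporal positions while the frame-level extractor keeps all 16, which is why frame-level features support finer boundary localization.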
[Table VI: (top) different transformer modules on TACoS — e.g., ② Local Transformer: 42.51 / 30.87 / 82.51 / 83; Global Transformer: 44.87 / 31.79 / 117.28 / 251; (bottom) different transformer layers on Charades-STA.]
Effect of different transformers. To examine the effect of our transformers (i.e., the three designed transformers in Fig. 3), we replace them with several state-of-the-art transformer modules. For this ablation study, we consider two aspects: different transformer modules on TACoS and different transformer layers on Charades-STA. (Since TACoS is the most challenging dataset, we choose it to better compare the performance of different transformer modules; since Charades-STA is a large-scale dataset, we utilize it to show the performance differences across transformer layers.) Table VI shows the experimental results.
For different transformer modules on TACoS, our transformers perform better (R@1) and run faster (higher VPS) than these state-of-the-art transformer modules in all cases. Particularly, for module ①, our temporal transformer uses far fewer parameters and runs faster (larger VPS) than the other three transformers (TokShift, MVDeTr, and ViTAE), while also outperforming them: it beats the second-best transformer, ViTAE, by 0.41% in terms of "R@1, IoU=0.5". For module ②, our global-local transformer obtains a significant performance improvement by a large margin, surpassing the second-best transformer, SGLANet, by 3.41% in terms of "R@1, IoU=0.5". For module ③, our cross-modal parallel transformer beats the other transformers on all metrics; for instance, it improves on the second-best transformer, De-VLTrans-MSA, by 1.12% in terms of "R@1, IoU=0.5". These results show that each of our transformers contributes to the model performance.
Regarding the impact of different transformer layers, we find that for each component, the multi-layer transformer performs only marginally better than the single-layer one, at a higher computational cost (smaller VPS and larger Para.). An interesting finding is that the multi-layer global-local transformer brings more improvement than the other two transformers, because it can effectively integrate multi-grained features, which improves the grounding performance. For the temporal transformer and the cross-modal parallel transformer, "layer=1" achieves similar performance to "layer=3" while significantly decreasing the computation (Para.). Therefore, we set "layer=1" for all the transformers, which is the suggested value for our TSG task.
[Table VII: segment queries on Charades-STA — wo./ Segment Queries: 64.88 / 41.02 / 94.37 / 64.48; w./ Segment Queries: 65.31 / 41.38 / 94.50 / 64.72. Table VIII: shared weight — Transformer (Our Shared Weight): 65.31 / 41.38 / 94.50 / 64.72; Fusion (Our Shared Weight): 65.31 / 41.38 / 94.50 / 64.72.]
[Table IX: loss comparison — columns: Loss, then R@1/R@5 at IoU=0.5/0.7 (ActivityNet Captions, Charades-STA) and at IoU=0.3/0.5 (TACoS).]
Impact of segment queries. To investigate the effectiveness of segment queries in the transformer decoder, we conduct an ablation study on the Charades-STA dataset. With segment queries, our HLGT achieves a consistent performance improvement: compared with the first option (without segment queries), the second option (with segment queries) improves the performance by 0.43%, 0.36%, 0.13% and 0.24% over all metrics, respectively. This improvement shows the effectiveness of our segment queries.
Impact of shared weight. To analyze the impact of weight sharing in the temporal transformer and feature fusion components, we conduct an ablation study on the Charades-STA dataset. For the unshared-weight setting, we assign independent weights to each temporal transformer and feature fusion component. As shown in Table VIII, the shared-weight approach improves the grounding performance: compared to the unshared-weight setting, sharing weights in the temporal transformer improves performance by 0.37% in terms of "R@1, IoU=0.5", and sharing weights in the feature fusion improves it by 0.49% in terms of "R@1, IoU=0.5". This improvement shows the effectiveness of weight sharing in both components.
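The parameter saving behind weight sharing can be illustrated with a toy parameter count. This is our own sketch of the general idea, not the HLGT implementation; `Linear` is a hypothetical stand-in module that only tracks how many parameters it holds:

```python
class Linear:
    """Hypothetical module: records a parameter count only."""
    def __init__(self, n_in, n_out):
        self.n_params = n_in * n_out + n_out  # weight matrix + bias

def total_params(branches):
    # A shared module instance appearing in several branches is
    # stored (and counted) only once; independent copies add up.
    return sum(m.n_params for m in set(branches))

shared = Linear(512, 512)
shared_branches = [shared, shared]            # e.g., video and query branches reuse one module
unshared_branches = [Linear(512, 512), Linear(512, 512)]
```

Sharing halves the parameter budget of the duplicated component, and it also forces both modalities through the same projection, which can act as a regularizer.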
Choices for the CMCC loss. Cross-modal cycle consistency (CMCC) provides significant assistance in enforcing cross-modal interaction. We compare our proposed CMCC loss in Eq. (21) with the three following popular loss functions: 1) the L1 loss; 2) the L2 loss; 3) the cosine similarity loss. Table IX reports the results on all the datasets. The CMCC loss clearly improves grounding performance over the other losses, which illustrates its effectiveness. In particular, compared to the second-best loss, our CMCC loss improves the performance by 1.68% in terms of "R@5, IoU=0.5" on the TACoS dataset. Therefore, we choose the CMCC loss as the final loss of our cross-modal cycle consistency.
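The three baseline losses compared against CMCC follow their standard definitions for a pair of feature vectors; they can be written down explicitly as below (the CMCC loss itself follows Eq. (21) and is not reproduced here):

```python
import math

def l1_loss(a, b):
    # Sum of absolute element-wise differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def l2_loss(a, b):
    # Euclidean distance between the two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_loss(a, b):
    # 1 - cosine similarity: 0 for perfectly aligned directions.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm
```

Unlike these pointwise objectives, the cycle-consistency loss constrains a round trip across modalities, which is why it aligns the joint embedding space more effectively in Table IX.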
Analysis of the hyperparameters. To achieve the best performance, we analyze the impact of the three hyperparameters. Table X shows the experimental results. With the increase of each hyperparameter, the performance follows a general trend: it rises at first and then starts to decline. The optimal values of the three hyperparameters are 0.8, 0.5, and 0.2 respectively, where the best performance is obtained. Therefore, we use these values in our paper.
Performance on different datasets. To analyze the generalization ability of our HLGT, we test its running speed on different datasets. Table XI shows its performance on the three datasets. On the one hand, although these datasets are of different scales, our HLGT has a similar parameter scale on each, which shows that it can deal with different types of datasets with little or no model change. On the other hand, HLGT has similar running speeds (VPS) on the ActivityNet Captions and TACoS datasets: it processes 190.76 videos per second on ActivityNet Captions and 196.54 videos per second on TACoS, because the two datasets have similar average video lengths. The Charades-STA dataset has shorter videos, which leads to a larger VPS.
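Throughput (VPS) as reported above can be measured with a simple wall-clock loop. This is our own sketch of the measurement, not the authors' benchmarking harness; `process` stands in for a hypothetical grounding function applied to one video:

```python
import time

def videos_per_second(process, videos):
    """Average number of videos handled per wall-clock second."""
    start = time.perf_counter()
    for v in videos:
        process(v)
    elapsed = time.perf_counter() - start
    return len(videos) / elapsed
```

In practice one would warm up the model first and average over several runs; `time.perf_counter` is used because it is monotonic and high-resolution.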
To investigate the grounding results of our HLGT, we provide two qualitative examples of HLGT and VIDGTR in Fig. 5. We can observe that HLGT achieves more precise localization than the state-of-the-art method VIDGTR. The main reason is that VIDGTR only encodes frame-level visual features and ignores higher-level features (video- and clip-level features), so it fails to capture subtle cross-modal multi-granularity details and to understand complicated background visual content. In contrast, HLGT learns multi-level cross-modal interactions (clip-phrase and video-query), thus capturing more fine-grained visual contexts for more accurate grounding.
In this paper, we proposed a novel Hierarchical Local-Global Transformer (HLGT) for temporal sentence grounding, which leverages hierarchical information and models the interactions between multiple levels of granularity and different modalities to learn more fine-grained multi-modal representations. Experimental results on three challenging datasets (ActivityNet Captions, Charades-STA and TACoS) validate the effectiveness of HLGT. In the future, we will apply HLGT to other tasks/datasets [26, 25] to further improve its generalization. We will also extend HLGT to the weakly-supervised setting to explore how to exploit more unannotated data.
-  (2022) Dense video captioning with early linguistic information fusion. IEEE TMM. Cited by: §I.
-  (2017) Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pp. 5803–5812. Cited by: §I, §II-A.
-  (2015) Activitynet: a large-scale video benchmark for human activity understanding. In CVPR, pp. 961–970. Cited by: §IV-A.
-  (2021) On pursuit of designing multi-modal transformer for video grounding. In EMNLP, pp. 9810–9823. Cited by: §I, §II-B, §III-A, §III-D, §IV-C, TABLE I, TABLE II, TABLE III, TABLE VI.
-  (2020) End-to-end object detection with transformers. In ECCV, pp. 213–229. Cited by: §I, §II-B.
-  (2022) Boosting vision-and-language navigation with direction guiding and backtracing. ACM TOMM. Cited by: §II-C.
-  (2018) Temporally grounding natural sentence in video. In EMNLP, pp. 162–171. Cited by: §I, §II-A.
-  (2020) Rethinking the bottom-up framework for query-based video localization. In AAAI, Cited by: §I, §II-A, §IV-C, TABLE II.
-  (2019) Semantic proposal for activity localization in videos via sentence query. In AAAI, Vol. 33, pp. 8199–8206. Cited by: TABLE II.
-  (2021) Decoupling zero-shot semantic segmentation. arXiv preprint arXiv:2112.07910. Cited by: §III-D.
-  (2020) An image is worth 16x16 words: transformers for image recognition at scale. In ICLR, Cited by: TABLE VI.
-  (2020) Vh: view variation and view heredity for incomplete multiview clustering. IEEE TAI 1 (3), pp. 233–247. Cited by: §I.
-  (2021) Unbalanced incomplete multi-view clustering via the scheme of view evolution: weak views are meat; strong views do eat. IEEE TETCI. Cited by: §I.
-  (2021) ANIMC: a soft approach for autoweighted noisy and incomplete multiview clustering. IEEE TAI 3 (2), pp. 192–206. Cited by: §I.
-  (2017) Tall: temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, pp. 5267–5275. Cited by: §II-A, §IV-A, §IV-A, §IV-A, §IV-A, §IV-C, TABLE I, TABLE II, TABLE III.
-  (2020) Coot: cooperative hierarchical transformer for video-text representation learning. NeurIPS 33, pp. 22605–22618. Cited by: §II-B.
-  (2021) TaoHighlight: commodity-aware multi-modal video highlight detection in e-commerce. IEEE TMM 24, pp. 2606–2616. Cited by: §I.
-  (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §III-B, §IV-B.
-  (2021) End-to-end video object detection with spatial-temporal transformers. In ACM MM, pp. 1507–1516. Cited by: §II-B.
-  (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415. Cited by: §III-C.
-  (2021) Multiview detection with shadow transformer (and view-coherent data augmentation). In ACM MM, pp. 1673–1682. Cited by: TABLE VI.
-  (2016) Natural language object retrieval. In CVPR, pp. 4555–4564. Cited by: §IV-A.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-B.
-  (2017) Dense-captioning events in videos. In ICCV, pp. 706–715. Cited by: §IV-A, §IV-A.
-  (2020) TVR: a large-scale dataset for video-subtitle moment retrieval. In ECCV, Cited by: §V.
-  (2020) HERO: hierarchical encoder for video+ language omni-representation pre-training. In EMNLP, Cited by: §V.
-  (2021) Referring transformer: a one-step approach to multi-task visual grounding. NeurIPS 34. Cited by: §I.
-  (2022) Contextual transformer networks for visual recognition. IEEE TPAMI. Cited by: §I.
-  (2020) Weakly-supervised video moment retrieval via semantic completion network. In AAAI, Vol. 34, pp. 11539–11546. Cited by: §IV-D.
-  (2022) Memory-guided semantic learning network for temporal sentence grounding. In AAAI, Cited by: §I, §II-A.
-  (2021) Context-aware biaffine localizing network for temporal sentence grounding. In CVPR, pp. 11235–11244. Cited by: §IV-C, TABLE I, TABLE II, TABLE III.
-  (2021) Adaptive proposal generation network for temporal sentence localization in videos. In EMNLP, pp. 9292–9301. Cited by: §III-C.
-  (2020) Jointly cross-and self-modal graph attention network for query-based moment localization. In ACM MM, pp. 4070–4078. Cited by: §III-B.
-  (2022) Unsupervised temporal video grounding with deep semantic clustering. In AAAI, Cited by: §I, §II-A.
-  (2022) Exploring motion and appearance information for temporal sentence grounding. In AAAI, Cited by: §IV-A.
-  (2018) Attentive moment retrieval in videos. In SIGIR, pp. 15–24. Cited by: §IV-C, TABLE I, TABLE II, TABLE III.
-  (2019) CycleMatch: a cycle-consistent embedding network for image-text matching. PR 93, pp. 365–379. Cited by: §II-C.
-  (2022) Visualizing and understanding patch interactions in vision transformer. arXiv preprint arXiv:2203.05922. Cited by: §I.
-  (2020) Isia food-500: a dataset for large-scale food recognition via stacked global-local attention network. In ACM MM, pp. 393–401. Cited by: TABLE VI.
-  (2020) Local-global video-text interactions for temporal grounding. In CVPR, pp. 10810–10819. Cited by: §I, §II-A, §IV-C, TABLE I.
-  (2021) Interventional video grounding with dual contrastive learning. In CVPR, pp. 2765–2775. Cited by: §IV-C, TABLE I, TABLE II, TABLE III.
-  (2014) Glove: global vectors for word representation. In EMNLP, pp. 1532–1543. Cited by: §III-B, §IV-B.
-  (2020) Fine-grained iterative attention network for temporal language localization in videos. In ACM MM, pp. 4280–4288. Cited by: §II-A, §IV-C, TABLE I, TABLE II, TABLE III.
-  (2013) Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics 1, pp. 25–36. Cited by: §IV-A, §IV-A.
-  (2019) Cycle-consistency for robust visual question answering. In CVPR, Cited by: §II-C.
-  (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In ECCV, pp. 510–526. Cited by: §IV-A.
-  (2022) Contextual attention network for emotional video captioning. IEEE TMM. Cited by: §I.
-  (2021) Spatial-temporal graphs for cross-modal text2video retrieval. IEEE TMM. Cited by: §I.
Enhancing neural machine translation with dual-side multimodal awareness. IEEE TMM. Cited by: §II-C.
-  (2021) Frame-wise cross-modal matching for video moment retrieval. IEEE TMM. Cited by: §I, §IV-C, TABLE II, TABLE III.
-  (2021) Graph-based multimodal sequential embedding for sign language translation. IEEE TMM. Cited by: §I.
-  (2021) Regularized two granularity loss function for weakly supervised video moment retrieval. IEEE TMM 24, pp. 1141–1151. Cited by: §I.
-  (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §I, §II-B, §III-D, TABLE VI.
-  (2022) Cross-modal dynamic networks for video moment retrieval with text query. IEEE TMM 24, pp. 1221–1232. Cited by: §I.
-  (2022) Counterfactual cycle-consistent learning for instruction following and generation in vision-language navigation. arXiv preprint arXiv:2203.16586. Cited by: §II-C.
-  (2021) Cycle-consistent inverse gan for text-to-image synthesis. In ACM MM, pp. 630–638. Cited by: §II-C.
-  (2020) Temporally grounding language queries in videos by contextual boundary-aware prediction. In AAAI, Cited by: §IV-C, TABLE I, TABLE II, TABLE III.
-  (2022) RCL: recurrent continuous localization for temporal action detection. arXiv preprint arXiv:2203.07112. Cited by: §III-D.
-  (2021) Exploring sequence feature alignment for domain adaptive detection transformers. In ACM MM, pp. 1730–1738. Cited by: §I.
-  (2019) Learning correspondence from the cycle-consistency of time. In CVPR, pp. 2566–2576. Cited by: §II-C.
-  (2021) Weakly supervised temporal adjacent network for language grounding. IEEE TMM. Cited by: §I.
-  (2022) Siamese alignment network for weakly supervised video moment retrieval. IEEE TMM. Cited by: §I.
-  (2022) Explore and match: end-to-end video grounding with transformer. arxiv. Cited by: §I, §II-B, §III-A, §IV-C, TABLE I, TABLE II, TABLE VI.
-  (2022) Blind image restoration based on cycle-consistent network. IEEE TMM. Cited by: §II-C.
-  (2021) Boundary proposal network for two-stage natural language video localization. In AAAI, Cited by: §IV-C.
-  (2019) Multilevel language and vision integration for text-to-clip retrieval. In AAAI, Vol. 33, pp. 9062–9069. Cited by: §IV-C, TABLE I, TABLE II, TABLE III.
-  (2022) MDAN: multi-level dependent attention network for visual emotion analysis. arXiv preprint arXiv:2203.13443. Cited by: §III-D.
-  (2022) Dual vision transformer. arXiv preprint arXiv:2207.04976. Cited by: §I.
-  (2022) Wave-vit: unifying wavelet and transformers for visual representation learning. arXiv preprint arXiv:2207.04978. Cited by: §I.
Dual attention on pyramid feature maps for image captioning. IEEE TMM 24, pp. 1775–1786. Cited by: §I.
-  (2019) Semantic conditioned dynamic modulation for temporal sentence grounding in videos. In NeurIPS, Cited by: §I, §II-A, §IV-C, TABLE I, TABLE II, TABLE III.
-  (2019) To find where you talk: temporal sentence localization in video with attention based location regression. In AAAI, Vol. 33, pp. 9159–9166. Cited by: §I, §II-A.
-  (2020) Dense regression network for video grounding. In CVPR, pp. 10287–10296. Cited by: §IV-C, TABLE I, TABLE II.
-  (2018) Cross-modal and hierarchical modeling of video and text. In ECCV, pp. 374–390. Cited by: §II-B.
-  (2021) Token shift transformer for video classification. In ACM MM, pp. 917–925. Cited by: TABLE VI.
-  (2021) Natural language video localization: a revisit in span-based question answering framework. IEEE TPAMI. Cited by: §IV-C.
-  (2020) Span-based localizing network for natural language video localization. In ACL, pp. 6543–6554. Cited by: §I, §II-A, §IV-A, §IV-C, TABLE I, TABLE II, TABLE III.
-  (2022) Unified adaptive relevance distinguishable attention network for image-text matching. IEEE TMM. Cited by: §II-C.
-  (2021) Multi-stage aggregated transformer network for temporal language localization in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12669–12678. Cited by: §I, §II-B, §III-A, §IV-C, TABLE I, TABLE III, TABLE VI.
-  (2022) ViTAEv2: vision transformer advanced by exploring inductive bias for image recognition and beyond. arXiv preprint arXiv:2202.10108. Cited by: TABLE VI.
-  (2020) Learning 2d temporal adjacent networks for moment localization with natural language. In AAAI, Cited by: §I, §II-A, §IV-C, TABLE I, TABLE II, TABLE III.
-  (2020) Dense video captioning using graph-based sentence summarization. IEEE TMM 23, pp. 1799–1810. Cited by: §I.
-  (2019) Cross-modal interaction networks for query-based moment retrieval in videos. In SIGIR, pp. 655–664. Cited by: §I, §II-A, §IV-C, TABLE I, TABLE III.
-  (2020) Temporal textual localization in video via adversarial bi-directional interaction networks. IEEE TMM 23, pp. 3306–3317. Cited by: §I.
-  (2021) Conditional sentence generation and cross-modal reranking for sign language translation. IEEE TMM 24, pp. 2662–2672. Cited by: §I.
Multimedia intelligence: when multimedia meets artificial intelligence. IEEE TMM, pp. 1823–1835. Cited by: §I.