Unsupervised Temporal Video Grounding with Deep Semantic Clustering

Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query. Although existing works have made considerable progress on this task, they rely heavily on abundant video-query paired data, which is expensive and time-consuming to collect in real-world scenarios. In this paper, we explore whether a video grounding model can be learned without any paired annotations. To the best of our knowledge, this paper is the first work to address TVG in an unsupervised setting. Since there is no paired supervision, we propose a novel Deep Semantic Clustering Network (DSCNet) that leverages the semantic information from the whole query set to compose the possible activities in each video for grounding. Specifically, we first develop a language semantic mining module, which extracts implicit semantic features from the whole query set. Then, these language semantic features serve as guidance to compose the activities in the video via a video-based semantic aggregation module. Finally, we utilize a foreground attention branch to filter out redundant background activities and refine the grounding results. To validate the effectiveness of DSCNet, we conduct experiments on both the ActivityNet Captions and Charades-STA datasets. The results demonstrate that DSCNet achieves competitive performance and even outperforms most weakly-supervised approaches.





Temporal video grounding (TVG) is an important yet challenging task in video understanding, which has drawn increasing attention due to its wide range of potential applications, such as activity detection zhao2017temporal and human-computer interaction singha2018dynamic. As depicted in Figure 1 (a), it aims to localize a segment in a video according to the semantics of a sentence query.

Figure 1: (a) An example of temporal video grounding. (b) Different from the fully-supervised (paired video-query and detailed segment annotations) and weakly-supervised (only paired video-query knowledge) settings in TVG, no supervision is available in the unsupervised setting.

Most previous methods wang2020temporally; liu2021context; zhang2019man; liu2020jointly; Alpher10; chen2020rethinking proposed for this task operate under the fully-supervised setting. Some of them liu2021context; Alpher33; liu2020jointly match pre-defined segment proposals with the query and then select the best candidate. Others Alpher08; Alpher11; Alpher12; chen2020rethinking directly predict the temporal boundaries of the video segment. However, these methods are data-hungry, requiring a large amount of fully annotated data. For instance, the widely used ActivityNet Captions dataset contains 20,000 videos and 100,000 matched sentence queries with their corresponding segment boundaries. Manually annotating such a huge amount of data is very time-consuming and labor-intensive. To alleviate this problem, recent works explore a weakly-supervised setting Alpher14; Alpher15; Alpher17; zhang2020regularized where only paired videos and queries are available in the training stage. Although they leave out the segment annotations in training, these methods still need access to abundant matched video-query pairs.

In this paper, we focus on how to learn a video grounding model without any supervision, which excludes both paired video-query knowledge and the corresponding segment annotations, as shown in Figure 1 (b). Since there is no annotated information, all we can access is the internal information in the queries and videos. As different words or phrases in different queries may share similar semantics, we can mine deep semantic representations from the whole query set, and then compose the possible activities in each video according to these language semantics for further grounding. Therefore, the crucial and challenging part of our work lies in how to capture the deep semantic features of the queries and how to aggregate different semantics for composing the contents of the target segments.

To this end, we propose a novel approach, called Deep Semantic Clustering Network (DSCNet), which mines deep semantic features from the query set to compose the possible activities in each video. Specifically, we first leverage an encoder-decoder model to build a language-based semantic mining module for query encoding, where the learned hidden states are taken as the extracted deep semantic features. In particular, we collect such semantic features from the whole query set and then cluster them into different semantic centers, so that features with similar meanings lie close together. Subsequently, a video-based semantic aggregation module, containing a specific attention branch and a foreground attention branch, is developed to compose the corresponding activities guided by the extracted deep semantic features. The specific attention branch aggregates different semantics for matching and composing the contents of the activity segments. We use this branch to generate better video representations by ensuring that composed activities sharing the same semantic are closer to each other than dissimilar ones, and that the positive and negative activities of the same video are far apart. To further filter out the background information in each video, the foreground attention branch is designed to distinguish the foreground frames. The details of our main grounding process are shown in Figure 2. During the training stage, we utilize pseudo labels, obtained from the deep semantic features, as guidance to refine the video grounding model with an iterative learning procedure. To sum up, our main contributions are as follows:

  • To the best of our knowledge, this is the first work to address temporal video grounding in the unsupervised setting. Without supervision, we solve the task with the proposed DSCNet, which learns to compose the activity contents guided by deep language semantics.

  • We use an encoder-decoder module to obtain the semantic features of all queries and divide them into different clusters to represent different semantic meanings. Then we propose a two-branch video module, where the specific attention branch aggregates the query semantics to match the segment, and the foreground attention branch distinguishes foreground from background activities.

  • We conduct comprehensive experiments on the ActivityNet Captions and Charades-STA datasets. The results demonstrate the effectiveness of our proposed method, where DSCNet achieves decent results and outperforms most weakly-supervised methods.

Related Work

Fully-supervised temporal video grounding. Most existing methods adopt the fully-supervised setting, where all video-sentence pairs are labeled in detail, including the corresponding segment boundaries. Therefore, the main challenge in this setting is how to align multi-modal features well enough to predict precise boundaries. Some works qu2020fine; liu2021progressively; liu2021adaptive; liu2020reasoning; liu2022exploring; liu2022memory integrate sentence information with each fine-grained video clip unit, and predict the scores of candidate segments by gradually merging the fused feature sequence over time. Although these methods achieve good performance, they rely heavily on the quality of the proposals and are time-consuming. Without using proposals, the latest methods nan2021interventional; Alpher11; Alpher12 leverage the interaction between video and sentence to predict the starting and ending frames directly. These methods are more efficient than the proposal-based ones, but achieve lower performance.

Figure 2: The main idea of our proposed method. Only a few semantic clusters are shown as examples.

Weakly-supervised temporal video grounding. To ease the human labelling effort, several works Alpher14; Alpher15; Alpher16; Alpher17; zhang2020regularized; ma2020vlanet; tan2021logan consider a weakly-supervised setting, which only accesses the information of matched video-query pairs without accurate segment boundaries. Alpher15 utilize the dependency between video and sentence as the supervision while abandoning the temporal ordering information. Their text-guided attention provides scores for segment proposals. Alpher16 put forward a module to reconstruct sentences, with a proposal reward based on the loss between the target sentence and the reconstructed one. Although these weakly-supervised methods do not rely on temporal annotations, they still need the dependency between video and sentence as supervision. Different from them, we are the first to attempt to solve this task with an unsupervised approach that does not require any video-query dependency.

Figure 3: The overall architecture of the proposed DSCNet for the unsupervised TVG task. Given a query set, we first develop a language-based semantic mining module to learn the deep semantics of all queries with an encoder-decoder model. Then a video-based semantic aggregation module is proposed to compose the possible activities with reference to the deep semantic clusters.

Unsupervised Learning. Recently, unsupervised methods soomro2017unsupervised; laina2019towards; su2019deep; gong2020learning have received increasing attention in multi-modal retrieval tasks. Laina laina2019towards and Su su2019deep embed both video and text into a shared latent space, then maximally reconstruct the joint-semantics relations. Soomro soomro2017unsupervised and Gong gong2020learning propose unsupervised action localization and transform the task into an unsupervised frame-wise classification problem with pre-defined action categories. Different from these retrieval methods, the unsupervised TVG task requires fine-grained video-query alignment to predict accurate start and end timestamps. To learn the deep semantics of sentence queries, we utilize the widely used encoder-decoder architecture Alpher18; Alpher19; Alpher20; Alpher21; Alpher22 to learn unsupervised representations, which consists of an encoder to extract feature representations and a decoder to reconstruct the input data from these representations. Then we compose the activity contents in the video guided by these learned deep semantic features.

Deep Semantic Clustering Network


In the TVG task, we are provided with a training set of untrimmed videos $\{V_i\}_{i=1}^{N_v}$ and sentence queries $\{Q_i\}_{i=1}^{N_q}$, where $V_i$ and $Q_i$ are the $i$-th video and query, and $N_v$ and $N_q$ are their corresponding numbers. Since our method works under the unsupervised setting, we drop all label information between the videos and the queries, including both their correspondence and the annotated segment boundaries.

The overall architecture of our proposed Deep Semantic Clustering Network (DSCNet) is shown in Figure 3. We first develop an independent encoder-decoder network to learn deep semantic features for the whole query set. In particular, we extract the hidden representations of each sentence as deep semantics, called “necks”, and group the necks of all queries into different semantic clusters with a clustering algorithm. Furthermore, we devise a video-based aggregation model with two attention branches: a specific attention branch and a foreground attention branch. The specific attention branch matches the frame-wise features with each semantic cluster, and the foreground attention branch distinguishes foreground from background events.

Language-based Semantic Mining

In this section, we extract the internal information of each query by reconstructing its main meaning with an independent encoder-decoder network. Meanwhile, a well-designed loss function in Eq. (3) is proposed to learn the discriminative hidden features of the input queries, which are hereafter referred to as the deep semantic “necks”.

For the encoder, given a sentence query $Q$, we first employ the off-the-shelf GloVe model Alpher25 to obtain its word-level embedding $W \in \mathbb{R}^{L \times d_w}$, where $L$ is the sentence length and $d_w$ is the embedding dimension. Then we feed $W$ into a two-layer LSTM network and use the last hidden unit as the sentence-level representation $s$. To separate the different latent semantic representations that describe different aspects of $Q$, we further feed the sentence-level representation $s$ into multiple two-layer perceptrons to obtain the hidden features of the encoder-decoder model, called “necks”, which serve as the implicit semantics of queries. Specifically, we denote the necks of a query as $N = \{n_k\}_{k=1}^{K} \in \mathbb{R}^{K \times d}$, where $K$ is the neck number and $d$ is the neck dimension.

To ensure that the learned necks contain the most crucial information of the sentence, we further adopt a decoder module to reconstruct the original sentence from these necks. In detail, we first aggregate the information from all necks into a new sentence-level representation $\hat{s}$ using another set of two-layer perceptrons followed by concatenation. Then, we feed it into another two-layer LSTM network and a linear layer to reconstruct the word-level sentence, which is expected to be the same as the input one. Specifically, for the $l$-th output word, the decoder outputs a score vector $y_l \in \mathbb{R}^{N_w}$, where $N_w$ is the size of the vocabulary of the whole query set, and the $j$-th entry $y_l^{j}$ corresponds to the probability of the $j$-th word in the vocabulary appearing as the $l$-th output word. Denoting the predicted probability of the ground-truth word at position $l$ of the original $L$-length input sentence as $p_l$, we calculate the cross-entropy loss with the softmax function to supervise the output sentence as:

$$\mathcal{L}_{rec} = -\frac{1}{L} \sum_{l=1}^{L} \log p_l. \quad (1)$$
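The reconstruction objective described above can be illustrated with a small standard-library sketch. It assumes the loss is averaged over the output positions, which the text does not state, and the function names `softmax` and `reconstruction_loss` are hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one vocabulary score vector."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reconstruction_loss(score_vectors, target_ids):
    """Mean negative log-probability of the ground-truth word at each position.

    score_vectors[l] holds the decoder scores over the vocabulary for the
    l-th output word; target_ids[l] is the index of the ground-truth word.
    Averaging over positions is an assumption of this sketch.
    """
    if len(score_vectors) != len(target_ids):
        raise ValueError("one target word id is needed per output position")
    losses = []
    for scores, target in zip(score_vectors, target_ids):
        probs = softmax(scores)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)
```

With uniform scores over a four-word vocabulary the loss equals log 4, and it approaches zero as the decoder becomes confident in the correct word at every position.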


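Besides, to ensure that the input and output sentences have the same sentence-level meaning, we add a semantic loss $\mathcal{L}_{sem}$, calculated as the mean squared error between the input sentence-level representation $s$ and the reconstructed one $\hat{s}$. More importantly, since we expect each neck of a query to carry a unique semantic, we adopt a regularization term Alpher26 to force the necks to differ from each other:

$$\mathcal{L}_{reg} = \lVert N N^{\top} - \lambda I \rVert_2, \quad (2)$$

where $N$ stacks the necks of the query as rows, $\lVert \cdot \rVert_2$ denotes the L2-norm, $\lambda$ controls the extent of overlap between different necks, and $I$ is an identity matrix. By combining the above three losses with balancing parameters $\alpha_1$ and $\alpha_2$, we adopt a multi-task loss function in the semantic mining module as follows:

$$\mathcal{L}_{lang} = \mathcal{L}_{rec} + \alpha_1 \mathcal{L}_{sem} + \alpha_2 \mathcal{L}_{reg}. \quad (3)$$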
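To make the role of the neck regularizer concrete, the following sketch penalizes the difference between the Gram matrix of the necks and a scaled identity. It is only an illustration: the text says "L2-norm" without further detail, so the entry-wise (Frobenius) norm used here is an assumption, and `neck_regularizer` and its `overlap` parameter are hypothetical names with an arbitrary default.

```python
def neck_regularizer(necks, overlap=1.0):
    """Entry-wise L2 norm of (N N^T - overlap * I) for a list of neck vectors.

    The penalty is zero when the necks are mutually orthogonal and each has
    squared length equal to `overlap`; it grows as necks start to coincide.
    """
    count = len(necks)
    total = 0.0
    for i in range(count):
        for j in range(count):
            gram = sum(a * b for a, b in zip(necks[i], necks[j]))
            target = overlap if i == j else 0.0
            total += (gram - target) ** 2
    return total ** 0.5
```

Two orthogonal unit necks give a penalty of zero, while two identical unit necks are penalized through the off-diagonal entries of the Gram matrix.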


In this way, we obtain the neck features of all sentences in the query set, where each sentence has $K$ necks. Subsequently, for the $k$-th neck of all queries, we apply the K-means clustering algorithm na2010research to obtain a set of centers. Formally, the centers of the $k$-th necks are recorded as $C_k \in \mathbb{R}^{N_c \times d}$, where $N_c$ is the number of centers and $d$ is the dimension of each center feature. These centers can be regarded as discriminative semantic features. In the following video-based semantic aggregation module, these central semantic representations are used to compose activity contents.
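The per-neck clustering step can be sketched as below. This is a minimal illustration using plain Lloyd's iterations with random initialization, which is an assumption (the text only names K-means); the helper names `kmeans` and `cluster_necks` and the nested-list data layout are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, num_centers, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns the list of cluster centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, num_centers)]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: squared_distance(p, centers[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the previous center if a cluster is empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers


def cluster_necks(query_necks, num_centers, seed=0):
    """Cluster the k-th neck of every query separately.

    query_necks[q][k] is the k-th neck vector of query q. The result holds
    one list of centers per neck index.
    """
    num_necks = len(query_necks[0])
    return [
        kmeans([necks[k] for necks in query_necks], num_centers, seed=seed)
        for k in range(num_necks)
    ]
```

Clustering each neck index on its own keeps the centers of different necks apart, which matches the idea that each neck captures a different aspect of the query.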

Video-based Semantic Aggregation

To compose possible activities referring to the central semantic features from the queries, we develop a video-based semantic aggregation module which consists of a specific attention branch and a foreground attention branch. During the training of the video module, we initialize a pseudo label for video frames as weak guidance to assist grounding, and utilize an iterative learning strategy to update and refine the pseudo labels for better training.

Video feature encoding. Given a video, we first utilize a C3D Alpher27 network with a multi-head self-attention module vaswani2017attention to extract the frame-wise features $V \in \mathbb{R}^{T \times D}$, where $T$ is the number of frames in the video and $D$ is the channel dimension of the frame-wise representation.

Pseudo labels. For a specific semantic cluster, we initialize pseudo labels for all frames, where each label denotes whether a frame matches the center of that cluster. Specifically, we apply N-cut Alpher24 clustering with a Gaussian kernel to the concatenated feature to assign a binary label to each frame. This clustering process yields coarse activity information; fine-grained labels are obtained through the iterative learning process.

Specific attention branch. In this branch, we aim to aggregate different semantics of the query set to better compose the possible activities in the video. After obtaining the semantic centers of the $k$-th neck of the whole query set, we first project both the language and video features into a joint embedding space. Then, we calculate the correlations between all frame-semantic pairs as the specific attention matrix $A$:


where each row of $A$ holds the similarities of all frames to a specific semantic center, and the frames with the highest scores are composed into the activities.

In general, given a batch of training videos, we randomly sample semantic cluster centers and videos. For the feature of the $i$-th video, we calculate its specific positive activity feature guided by the $j$-th semantic cluster center as:


We can also generate its negative specific feature by:


where $\mathbf{1}$ is a matrix of ones with the same shape as the attention matrix, and the division is for normalization. We expect the positive features composed from different videos under the same semantic center to be similar, since they carry the same semantic information, while the positive and negative features of the same video should be distinct. The loss function of the specific attention branch, where $d(\cdot, \cdot)$ is the cosine distance, can therefore be formulated as follows:


Here, the loss uses two margins and a weight coefficient. In this way, the specific attention branch aggregates the frame-wise features most relevant to a specific semantic cluster center for activity composition.
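The composition of positive and negative activity features can be sketched as attention-weighted pooling over frames. The sketch makes several assumptions that the text does not confirm: attention values come from a sigmoid of a dot product, and both pooled features are divided by the number of frames. The names `frame_attention` and `compose_features` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def frame_attention(center, frames):
    """Attention of one semantic center over all frames (assumed sigmoid of dot product)."""
    return [sigmoid(sum(c * f for c, f in zip(center, frame))) for frame in frames]


def compose_features(attention, frames):
    """Pool frames into a positive feature (weights a) and a negative feature (weights 1 - a).

    Both pooled vectors are divided by the number of frames, which is the
    normalization assumed in this sketch.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    positive = [
        sum(a * frame[d] for a, frame in zip(attention, frames)) / num_frames
        for d in range(dim)
    ]
    negative = [
        sum((1.0 - a) * frame[d] for a, frame in zip(attention, frames)) / num_frames
        for d in range(dim)
    ]
    return positive, negative
```

Under this normalization the positive and negative features always sum to the mean frame feature, so the attention decides how the video content is split between the composed activity and its complement.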

Foreground attention branch. Using the specific attention alone is not enough to provide accurate grounding, as it cannot filter out all background activities composed by irrelevant semantic features. Therefore, as shown in Figure 3, we design a foreground attention branch that produces foreground features and a softmax attention output, both obtained with several CNN layers. We first develop a triplet loss to distinguish foreground from background frame-wise features according to the learned pseudo labels: we pull the representations of foreground frames (frames $a$ and $p$) closer together and push the representations of foreground and background frames (frames $a$ and $n$) further apart in the feature space. Specifically, frame $p$ is the foreground frame that has the minimum distance to $a$, and frame $n$ is the background frame that has the maximum distance to $a$. The triplet loss function can be formulated as:
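A triplet loss with the frame selection described above can be sketched as follows. The Euclidean distance, the hinge form, and the margin value are assumptions of this illustration, and `triplet_loss` is a hypothetical name.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, foreground, background, margin=0.5):
    """Hinge-style triplet loss for one anchor frame.

    The positive is the foreground frame closest to the anchor and the
    negative is the background frame farthest from it, following the
    selection rule in the text. The margin value is illustrative.
    """
    positive = min(foreground, key=lambda f: euclidean(anchor, f))
    negative = max(background, key=lambda b: euclidean(anchor, b))
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)
```

The loss vanishes once the selected background frame is farther from the anchor than the selected foreground frame by at least the margin.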


To predict the matching score of each frame, we learn both the specific attention matrix and the foreground attention matrix under the supervision of the pseudo labels. We first calculate the score of each frame with respect to a semantic center and the foreground attention matrix, and then formulate the grounding loss of each frame as:


where one term indicates whether the frame is a positive frame with respect to the semantic center, and another determines whether the frame is foreground. Then, we calculate the grounding loss over all frames as:


Combining the aforementioned three losses in two attention branches, we get the overall multi-task loss in the video grounding model as:


where the two loss weights are hyper-parameters.

Algorithm 1: Iterative learning process of the video module

Input: All semantic cluster centers of the whole query set; video feature.

  Initialize the pseudo labels based on the cluster centers and the video feature
  for each iteration do
    for each neck do
      Execute the specific attention branch with the cluster centers of this neck;
      Execute the foreground attention branch;
      Generate the training samples from the pseudo labels, and calculate the overall loss for back-propagation;
      Generate the new feature, and utilize it to update the pseudo labels;
    end
  end

Iterative learning. We use an iterative optimization strategy to train our video module. In each iteration: (1) we update the pseudo label of each frame by applying the clustering algorithm to the new feature, which is the feature learned in the specific attention branch, as it contains more semantic-aware context; (2) we calculate and back-propagate the loss to update the video model. The overall training process is shown in Algorithm 1. During the iterative training, the grounding module gradually finds the important frames of the video and yields a better frame-wise feature representation. Such precise feature representations lead to more precise pseudo labels from the clustering process, which in turn provide better supervision for grounding. We show the effectiveness of the iterative learning process in our experiments.
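The control flow of this alternating procedure can be sketched as a small skeleton. The three callables `init_labels`, `train_step`, and `relabel` are hypothetical hooks supplied by the caller, not functions from the paper; the sketch only shows how training and pseudo-label refreshing interleave per neck.

```python
def iterative_training(video, centers_per_neck, num_iterations,
                       init_labels, train_step, relabel):
    """Alternate between fitting the video module and refreshing pseudo labels.

    init_labels(video, centers)        -> initial pseudo labels for one neck
    train_step(video, centers, labels) -> refined frame feature after one update
    relabel(feature, centers)          -> new pseudo labels from the refined feature
    All three are caller-supplied placeholders (assumptions of this sketch).
    """
    labels = [init_labels(video, centers) for centers in centers_per_neck]
    for _ in range(num_iterations):
        for k, centers in enumerate(centers_per_neck):
            refined_feature = train_step(video, centers, labels[k])
            labels[k] = relabel(refined_feature, centers)
    return labels
```

Each neck keeps its own pseudo labels, so a label update for one semantic aspect does not overwrite the labels of another.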


When testing, we directly use the generated neck features of the input query as the semantic cluster centers, and feed them into the video module to match the frame-wise features, generating the corresponding specific attention and foreground attention for activity composition. Specifically, we multiply the specific attention and the foreground attention element-wise for each neck, and feed the results of all necks to softmax layers followed by a further element-wise multiplication. The final attention scores, of size $T$ (one per frame), denote the joint probability of all semantics. Finally, we take the frame with the highest score as the initial predicted segment, and add the left/right neighboring frames to the segment if the ratio of their scores to the score of the closest segment boundary frame is less than a threshold. We repeat this step until no frame can be added.
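The segment-growing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `grow_segment` and its parameters are hypothetical, and the acceptance rule is an assumption. The text leaves the direction of the score ratio open, so the sketch absorbs a neighboring frame when its score is at least `threshold` times the score of the adjacent boundary frame, and it assumes positive scores.

```python
def grow_segment(scores, threshold=0.8):
    """Grow a segment outward from the highest-scoring frame.

    Returns inclusive (start, end) frame indices. A neighbor joins the
    segment when its score is at least `threshold` times the score of the
    boundary frame next to it (assumed rule, see the lead-in).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    start = end = max(range(len(scores)), key=scores.__getitem__)
    changed = True
    while changed:
        changed = False
        if start > 0 and scores[start - 1] >= threshold * scores[start]:
            start -= 1
            changed = True
        if end < len(scores) - 1 and scores[end + 1] >= threshold * scores[end]:
            end += 1
            changed = True
    return start, end
```

For the scores `[0.1, 0.5, 0.9, 1.0, 0.85, 0.3]` and a threshold of 0.8, the segment starts at frame 3 and grows to cover frames 2 through 4.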

Experimental Results

Datasets and Evaluation

Charades-STA. This dataset is built from the Charades Alpher28 dataset and adapted to the video temporal grounding task by Alpher06. It contains 16128 video-sentence pairs, with 12408 pairs used for training and 3720 for testing. The videos are about 30 seconds long on average. The annotations are generated by sentence decomposition and keyword matching, followed by manual checking.

ActivityNet Captions. This dataset is built from the ActivityNet v1.3 dataset Alpher29 for dense video captioning. It contains 20000 YouTube videos with 100000 queries. We follow the public split of the dataset, which contains a training set and two validation sets, val 1 and val 2. On average, videos are about 120 seconds long and queries contain about 13.5 words.

                Charades-STA                                          ActivityNet Captions
Method    Mode  R@1      R@1      R@1      R@5      R@5      R@5      R@1      R@1      R@5      R@5
                IoU=0.3  IoU=0.5  IoU=0.7  IoU=0.3  IoU=0.5  IoU=0.7  IoU=0.3  IoU=0.5  IoU=0.3  IoU=0.5
Random    FS    20.12    8.51     3.03     68.42    37.12    14.06    18.64    7.73     52.78    29.49
VSA-RNN   FS    -        10.50    4.32     -        48.43    20.21    39.28    24.43    70.84    55.52
VSA-STV   FS    -        16.91    5.81     -        53.89    23.58    41.71    24.01    71.05    56.62
CTRL      FS    -        23.62    8.89     -        58.92    29.52    -        -        -        -
TGN       FS    -        -        -        -        -        -        43.81    27.93    54.56    44.20
EFRC      FS    53.00    33.80    15.00    94.60    77.30    43.90    -        -        -        -
2D-TAN    FS    -        39.81    23.25    -        79.33    52.15    59.45    44.51    85.53    77.13
DRN       FS    -        45.40    26.40    -        88.01    55.38    -        45.45    -        77.97
TGA       WS    32.14    19.94    8.84     86.58    65.52    33.51    -        -        -        -
SCN       WS    42.96    23.58    9.97     95.56    71.80    38.87    47.23    29.22    71.45    55.69
CTF       WS    39.80    27.30    12.90    -        -        -        44.30    23.60    -        -
MARN      WS    48.55    31.94    14.18    90.70    70.00    37.40    47.01    29.95    72.02    57.49
VGN       WS    -        32.21    15.68    -        73.50    41.87    50.12    31.07    77.36    61.29
DSCNet    US    44.15    28.73    14.67    91.48    70.68    35.19    47.29    28.16    72.51    57.24

Table 1: Performance comparisons for video grounding on both Charades-STA and ActivityNet Captions datasets, where FS: fully-supervised setting, WS: weakly-supervised setting and US: unsupervised setting.

Method Language Module Video Module R@1 R@1 R@1 R@5 R@5 R@5
IoU=0.3 IoU=0.5 IoU=0.7 IoU=0.3 IoU=0.5 IoU=0.7
baseline 28.38 16.12 7.86 76.18 48.65 20.21
+ 33.54 20.07 9.09 80.16 55.41 23.72
+ 36.78 23.14 10.65 82.98 60.29 26.31
+ + + 40.23 25.32 12.11 86.83 65.74 30.45
+ + + 41.97 26.01 12.39 87.61 67.33 32.60
+ + 44.15 28.73 14.67 91.48 70.68 35.19
Table 2: The analysis of each component in DSCNet, via ablation study on Charades-STA.

Evaluation. Following prior work Alpher06, we adopt “R@$n$, IoU=$m$” as our evaluation metric, which is defined as the percentage of queries for which at least one of the top-$n$ selected moments has an IoU larger than $m$.
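The recall-at-top-n metric with a temporal IoU threshold can be computed as in the sketch below. The function names and the data layout (ranked lists of `(start, end)` pairs in seconds) are assumptions made for illustration.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(predictions, ground_truths, n, iou_threshold):
    """Percentage of queries with at least one top-n moment above the IoU threshold.

    predictions[i] is the ranked list of (start, end) moments for query i,
    and ground_truths[i] is its annotated (start, end) segment.
    """
    hits = 0
    for ranked, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) > iou_threshold for p in ranked[:n]):
            hits += 1
    return 100.0 * hits / len(ground_truths)
```

For example, with two queries where only the second-ranked moment overlaps the ground truth well, R@1 at IoU=0.5 is 0 while R@5 at IoU=0.5 is 100.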


Implementation Details

To make a fair comparison with previous works, we utilize C3D to extract video features and GloVe to obtain word embeddings. As some videos are too long, we set the length of the video feature sequences to 128 for Charades-STA and 256 for ActivityNet Captions. We fix the query length to 10 in Charades-STA and 20 in ActivityNet Captions. We set the neck number to 4 for Charades-STA and 8 for ActivityNet Captions, and set the cluster number to 16. The LSTMs in the language encoder and decoder both have 2 layers with a hidden size of 512. The dimension of the joint embedding space is set to 1024. We utilize the Adam optimizer with an initial learning rate of 0.0001 for the language module and 0.0005 for the video module. The hyper-parameters are set as 1.0, 0.0001, 0.0001, 0.5. is set to 0.5. And in Eq. (3) and (15) are all set as 0.5. The inference threshold is set to 0.8 in ActivityNet Captions and 0.9 in Charades-STA.

Comparisons with state-of-the-arts

Comparison on Charades-STA. We first compare our model DSCNet with the state-of-the-art methods on the Charades-STA dataset, as shown in Table 1. Specifically, for R@1 at IoU={0.3, 0.5, 0.7}, our method in the unsupervised setting (US) achieves 44.15%, 28.73%, and 14.67%. It outperforms the weakly-supervised (WS) methods TGA, SCN, and CTF at all three thresholds, and also surpasses early fully-supervised (FS) methods such as VSA-RNN, VSA-STV, and CTRL, although it remains below recent fully-supervised methods such as 2D-TAN and DRN. For R@5, the results are similarly competitive.

Comparison on ActivityNet Captions. We also present the results on ActivityNet Captions in Table 1, comparing DSCNet with other recent state-of-the-art FS and WS video grounding methods. Even without any video-query annotations, our method achieves 47.29%, 28.16%, 72.51%, and 57.24% on R@1 IoU=0.3, R@1 IoU=0.5, R@5 IoU=0.3, and R@5 IoU=0.5, respectively, showing competitive performance compared to almost all weakly-supervised methods.

Ablation Study

In this section, we conduct ablation studies to validate the effectiveness of each component of our method. All experiments are conducted on the Charades-STA dataset.

Effect of each component. To analyze how each model component contributes to the task, we perform an ablation study, as shown in Table 2. Our baseline is the model in which the language-based encoder-decoder is trained with the reconstruction loss only, and the video module is trained with the grounding loss only. Adding the regularization loss to the baseline improves R@1 IoU=0.3 from 28.38% to 33.54% and R@1 IoU=0.5 from 16.12% to 20.07%, demonstrating the importance of learning different semantic features. The sentence-level semantic loss further improves the hidden feature learning of the auto-encoder model. The specific attention branch loss and the triplet loss also bring significant improvements by yielding better frame-wise representations. From the table, we can see that jointly combining all the loss functions achieves the best overall performance.

How to select the cluster center. As shown in Table 3, we investigate how the strategy of selecting the center affects the grounding results. Random selection without clustering performs poorly, since it lacks sufficient semantics to compose all possible activities. The “cluster center” strategy (we directly choose the averaged cluster center) performs much better than the “cluster sample” strategy, where we randomly choose one sample in each cluster. A likely reason is that the central embedding of each cluster is the most representative of the semantics of its samples.

Selection         R@1      R@1      R@5      R@5
                  IoU=0.3  IoU=0.5  IoU=0.3  IoU=0.5
random            23.79    11.24    73.61    42.58
cluster sample    39.67    23.96    86.13    66.49
cluster center    44.15    28.73    91.48    70.68

Table 3: Ablation study on the selection of the semantic cluster center.

Setting       R@1      R@1      R@5      R@5
              IoU=0.3  IoU=0.5  IoU=0.3  IoU=0.5
2 necks       33.06    19.70    79.39    53.53
4 necks       44.15    28.73    91.48    70.68
8 necks       40.90    25.03    87.56    65.40
1 iteration   29.38    15.61    81.13    60.08
3 iterations  39.98    24.87    88.09    66.77
5 iterations  44.15    28.73    91.48    70.68
7 iterations  44.06    28.39    91.82    70.41

Table 4: Ablation study on the neck number and iteration number on Charades-STA dataset.

Comparing different neck numbers. As shown in Table 4, using only 2 necks, each of which must encode more complex information, leads to poor scores. Using 4 necks achieves the best performance, while using 8 necks is lower (40.90% vs. 44.15% on R@1 IoU=0.3). Therefore, we choose 4 as the neck number on the Charades-STA dataset in our experiments.

Comparing different numbers of learning iterations. Table 4 also shows the performance with different numbers of iterations. The performance improves as the number of iterations increases from 1 to 5, and our model achieves its best overall results with 5 iterations. We see no further meaningful improvement with 7 iterations.

Visualization Results

Figure 4 shows some grounding results of our DSCNet on the Charades-STA dataset. Figure 5 shows the semantic clustering results of the language module, in which we visualize part of the clusters related to actions and objects using t-SNE Alpher37. We select sentences containing specific words from the test set of Charades-STA for visualization. For actions, we take “drink”, “eat”, “run”, and “walk” as examples. After clustering, the actions “drink” and “eat” are far from the actions “run” and “walk”, since they generally appear in different scenarios (indoor vs. outdoor); the left figure shows a distinct margin between the two groups. Furthermore, “drink” and “eat” are well separated from each other while “run” and “walk” are not, because “run” and “walk” are close in semantics and can be substituted for each other in some circumstances. The right figure shows the clustering results on multiple objects. It illustrates that the objects “cup” and “glass” form an intra-pair, while the objects “window” and “door” form an inter-pair.

Figure 4: Qualitative results on the Charades-STA dataset. “GT” is the annotation of the ground-truth segment and “Prediction” is our grounding result. The score value in the red curve of each video denotes the probability of each frame.
Figure 5: Example of the semantic clustering results of the language-based semantic mining module.


In this paper, we propose a novel Deep Semantic Clustering Network (DSCNet) to solve the temporal video grounding (TVG) task under the unsupervised setting. We first mine deep semantic features from all sentences and apply clustering to them to obtain a universal textual representation of the whole query set. Then, we compose the possible activities in the videos guided by the extracted deep semantic features. Specifically, we design two attention branches with novel loss functions for grounding. Our method is evaluated on two benchmark datasets and achieves decent performance compared with most fully/weakly-supervised baselines. Future work includes applying DSCNet to other tasks/datasets li2020hero; lei2020tvr, and leveraging local/global features to learn better video-text representations. Following the idea of DSCNet, we would also like to explore how to use more unannotated data in a supervised manner.


This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant (No.61972448, No.62172068, No.61802048).