Extensive work has been done on temporal action/activity localization Shou et al. (2016); Zhao et al. (2017); Dai et al. (2017); Buch et al. (2017); Gao et al. (2017c); Chao et al. (2018), where an action of interest is segmented from long, untrimmed videos. These methods only identify actions from a pre-defined set of categories, which limits their application in situations where only unconstrained language descriptions are available. This more general problem is referred to as natural language localization (NLL) Hendricks et al. (2017); Gao et al. (2017a). The goal is to retrieve a temporal segment from an untrimmed video based on an arbitrary text query. Recent work focuses on learning the mapping from visual segments to the input text Hendricks et al. (2017); Gao et al. (2017a); Liu et al. (2018); Hendricks et al. (2018); Zhang et al. (2018) and retrieving segments based on the alignment scores. However, successfully training an NLL model requires a large number of diverse language descriptions for different temporal segments of videos, which incurs a high human labeling cost.
We propose Weakly Supervised Language Localization Networks (WSLLN), which requires only video-sentence pairs during training, with no information about where the activities temporally occur. Intuitively, it is much easier to annotate video-level descriptions than segment-level descriptions. Moreover, when combined with text-based video retrieval techniques, video-sentence pairs may be obtained with minimal human intervention. The proposed model is simple and clean, and can be trained end-to-end in a single stage. We validate our model on ActivityNet Captions and DiDeMo. The results show that our model achieves state-of-the-art performance among weakly supervised approaches and performs comparably to some supervised approaches.
2 Related Work
Temporal Action Localization in long videos is widely studied in both offline and online scenarios. In the offline setting, temporal action detectors Shou et al. (2016); Buch et al. (2017); Gao et al. (2017c); Chao et al. (2018) predict the start and end times of actions after observing the whole video, while online approaches De Geest et al. (2016); Gao et al. (2017b); Shou et al. (2018b); Xu et al. (2018); Gao et al. (2019) label action classes in a per-frame manner without accessing future information. The goal of temporal action detectors is to localize actions in pre-defined categories. However, activities in the wild are very complicated, and it is challenging to cover all the activities of interest with a finite set of categories.
Natural Language Localization in untrimmed videos was first introduced in Gao et al. (2017a); Hendricks et al. (2017), where, given an arbitrary text query, the methods attempt to localize the text (predict its start and end times) in a video. Hendricks et al. proposed MCN Hendricks et al. (2017), which embeds the features of visual proposals and sentence representations in the same space and ranks proposals according to their similarity with the sentence. Gao et al. proposed CTRL Gao et al. (2017a), where alignment and regression are conducted for clip candidates. Liu et al. introduced TMN Liu et al. (2018), which measures the clip-sentence alignment guided by the semantic structure of the text query. Later, Hendricks et al. proposed MLLC Hendricks et al. (2018), which explicitly reasons about temporal clips of a video. Zhang et al. proposed MAN Zhang et al. (2018), which utilizes Graph Convolutional Networks Kipf and Welling (2016) to model temporal relations among visual clips. Although these methods achieve considerable success, they need segment-level annotations for training. Duan et al. proposed WSDEC Duan et al. (2018) to handle weakly supervised dense event captioning by alternating between language localization and caption generation iteratively. WSDEC generates language localization as intermediate results and can be trained using video-level labels. Thus, we set it as a baseline, although it is not designed for NLL.
Weakly Supervised Localization has been studied extensively to use weak supervision for object detection on images and action localization in videos Oquab et al. (2015); Bilen and Vedaldi (2016); Tang et al. (2017); Gao et al. (2018); Kantorov et al. (2016); Li et al. (2016); Jie et al. (2017); Diba et al. (2017); Papadopoulos et al. (2017); Duchenne et al. (2009); Laptev et al. (2008); Bojanowski et al. (2014); Huang et al. (2016); Wang et al. (2017); Shou et al. (2018a). Some methods use class labels to train object detectors. Oquab et al. discussed that object locations may be freely obtained when training classification models Oquab et al. (2015). Bilen et al. proposed WSDDN Bilen and Vedaldi (2016), which focuses on both object recognition and localization. Their proposed two-stream architecture inspired several weakly supervised approaches Tang et al. (2017); Gao et al. (2018); Wang et al. (2017), including our method. Li et al. presented an adaptation strategy in Li et al. (2016) which uses the output of a weak detector as pseudo groundtruth to train a detector in a fully supervised way. OICR Tang et al. (2017) integrates multiple instance learning and iterative classifier refinement in a single network. Some works use other types of weak supervision to optimize detectors. In Papadopoulos et al. (2017), Papadopoulos et al. used clicks to train detectors. Gao et al. utilized object counts for weakly supervised object detection Gao et al. (2018). Instead of using temporally labeled segments, weakly supervised action detectors use weaker annotations, e.g., movie scripts Duchenne et al. (2009); Laptev et al. (2008), the order of the occurring action classes in videos Bojanowski et al. (2014); Huang et al. (2016) and video-level class labels Wang et al. (2017); Shou et al. (2018a).
3 Weakly Supervised Language Localization Networks (WSLLN)
3.1 Problem Statement
Following the setting of its strongly supervised counterpart Gao et al. (2017a); Hendricks et al. (2017), the goal of a weakly supervised language localization (WSLL) method is to localize the event that is described by a sentence query in a long, untrimmed video. Formally, given a video consisting of a sequence of image frames and a text query, the model aims to localize the temporal segment, delimited by its start and end times, which semantically aligns best with the query. The difference is that WSLL methods only utilize video-sentence pairs for training, while supervised approaches have access to the start and end times of the queries.
3.2 The Proposed Approach
Taking frame sequences as inputs, the model first generates a set of temporal proposals, where each proposal consists of temporally continuous image frames. Then, the method aligns the proposals with the input query and outputs a score for each proposal, indicating its likelihood of containing the event.
Feature Description. Given a sentence query of arbitrary length, a sentence encoder can be used to extract a text feature from the query. For a video, features are extracted from each frame. Following Hendricks et al. (2017), the visual feature of a proposal is obtained using Eq. 1, which concatenates the average-pooled frame features between the start and end times of the proposal, the average-pooled frame features of the whole video, and the normalized start and end times of the proposal.
The feature of each proposal therefore contains information about its visual pattern, the overall context and its relative position in the video.
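The construction of the proposal feature can be sketched as follows. This is only an illustration under stated assumptions: frame features are plain lists of floats, proposals are given as half-open frame-index ranges, and times are normalized by the number of frames; the function names `average_pool` and `proposal_feature` are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def average_pool(frames: Sequence[Sequence[float]], start: int, end: int) -> List[float]:
    """Average the frame features over the half-open frame range [start, end)."""
    if not 0 <= start < end <= len(frames):
        raise ValueError("expected 0 <= start < end <= number of frames")
    window = frames[start:end]
    dim = len(window[0])
    return [sum(frame[d] for frame in window) / len(window) for d in range(dim)]


def proposal_feature(frames: Sequence[Sequence[float]], start: int, end: int) -> List[float]:
    """Concatenate local pooling, whole-video pooling and normalized endpoints."""
    total = len(frames)
    local = average_pool(frames, start, end)       # visual pattern of the proposal
    context = average_pool(frames, 0, total)       # overall context of the video
    position = [start / total, end / total]        # relative position (assumed normalization)
    return local + context + position


frames = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]
feature = proposal_feature(frames, 1, 3)
```

For a video with 4 frames of 2-dimensional features, the resulting vector has 2 + 2 + 2 = 6 entries.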
Following Gao et al. (2017a), features of the sentence and a visual proposal are combined as in Eq. 2. The combined feature is used to measure how well a candidate proposal matches the input query.
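As a rough illustration of this kind of fusion, the sketch below assumes the scheme used in CTRL (Gao et al., 2017a): element-wise addition, element-wise multiplication, and a fully connected layer applied to the concatenated features, with the three results concatenated. Whether Eq. 2 matches this exactly is an assumption; the names `linear` and `fuse` and the toy weights are hypothetical.

```python
from typing import List, Sequence


def linear(x: Sequence[float], weight: Sequence[Sequence[float]], bias: Sequence[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def fuse(text_feat: Sequence[float], visual_feat: Sequence[float],
         fc_weight: Sequence[Sequence[float]], fc_bias: Sequence[float]) -> List[float]:
    """CTRL-style fusion (assumed): [add || multiply || FC(concat)]."""
    if len(text_feat) != len(visual_feat):
        raise ValueError("text and visual features must share the same dimension")
    added = [s + v for s, v in zip(text_feat, visual_feat)]
    multiplied = [s * v for s, v in zip(text_feat, visual_feat)]
    projected = linear(list(text_feat) + list(visual_feat), fc_weight, fc_bias)
    return added + multiplied + projected


text_feat = [1.0, 2.0]
visual_feat = [3.0, 4.0]
fc_weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # toy weights for illustration
fc_bias = [0.5, -0.5]
fused = fuse(text_feat, visual_feat, fc_weight, fc_bias)
```

The element-wise operations require both modalities to share one dimension, which is why the features are first projected to a common size.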
The workflow of WSLLN is illustrated in Fig. 1. Inspired by the success of the two-stream structure in weakly supervised object and action detection tasks Bilen and Vedaldi (2016); Wang et al. (2017), WSLLN consists of two branches, i.e., an alignment branch and a detection branch. The semantic consistency between the input text and each visual proposal is measured in the alignment branch. The proposals are compared and selected in the detection branch. Scores from both branches are merged to produce the final results.
Alignment Branch produces the consistency scores for the proposals of a video-sentence pair. The consistency score in Eq. 3 measures how well each proposal matches the text. The scores of different proposals are calculated independently, with the softmax function applied over the last dimension.
Detection Branch performs proposal selection. The selection score in Eq. 4 is obtained by applying the softmax function over proposals. Through softmax, the score of a proposal is affected by those of other proposals, so this operation encourages competition among segments.
Score Merging combines the outputs of the two branches by an element-wise product, which yields a merged score for each proposal. The merged scores are used as the final segment-sentence matching scores during inference.
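A minimal sketch of the two branches and their merging is given below. It assumes that each branch outputs one logit per proposal and per class, stored as a list of rows (one row per proposal); the number of classes, the element-wise merging and all function names are illustrative assumptions rather than the paper's implementation.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a 1-D sequence."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_scores(logits: Sequence[Sequence[float]]) -> List[List[float]]:
    """Softmax over the last dimension: each proposal is scored independently."""
    return [softmax(row) for row in logits]


def detection_scores(logits: Sequence[Sequence[float]]) -> List[List[float]]:
    """Softmax over proposals: proposals compete with each other per column."""
    columns = [softmax(list(col)) for col in zip(*logits)]
    return [list(row) for row in zip(*columns)]


def merge_scores(align: Sequence[Sequence[float]], detect: Sequence[Sequence[float]]) -> List[List[float]]:
    """Element-wise product of the two branch outputs."""
    return [[a * d for a, d in zip(a_row, d_row)] for a_row, d_row in zip(align, detect)]


align_logits = [[2.0, 0.0], [0.0, 0.0], [-1.0, 1.0]]   # 3 proposals, 2 classes (toy values)
detect_logits = [[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
align = alignment_scores(align_logits)
detect = detection_scores(detect_logits)
merged = merge_scores(align, detect)
```

Because the detection scores sum to one over proposals and the alignment scores are at most one, summing a column of the merged scores over proposals gives a value between 0 and 1.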
To utilize video-sentence pairs as supervision, our model is optimized as a video-sentence matching classifier. We compute the matching score of a given video-sentence pair by summing the merged scores over proposals. Then, the video-level loss is obtained in Eq. 5 by comparing this score with the video-sentence match label. Positive video-sentence pairs can be obtained directly. We generate negative ones by pairing each video with a randomly selected sentence in the training set. We ensure that the positive pairs are not included in the negative set.
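The video-level supervision can be sketched as follows. The sketch assumes that each proposal has a single merged "match" score, that the video-level score is their sum, and that the loss is a binary cross-entropy against a 0/1 match label; the exact form of Eq. 5 may differ. The helper names and the toy data are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def video_score(merged_match_scores: Sequence[float]) -> float:
    """Video-sentence matching score: sum of merged scores over proposals."""
    return sum(merged_match_scores)


def binary_cross_entropy(score: float, label: int, eps: float = 1e-7) -> float:
    """Cross-entropy between a score in (0, 1) and a 0/1 label (assumed loss form)."""
    p = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def sample_negative_pairs(positive_pairs: Sequence[Tuple[str, str]],
                          rng: random.Random) -> List[Tuple[str, str]]:
    """Pair each video with a random training sentence that is not one of its positives."""
    positives = set(positive_pairs)
    sentences = sorted({sentence for _, sentence in positive_pairs})
    negatives = []
    for video, _ in positive_pairs:
        candidates = [s for s in sentences if (video, s) not in positives]
        if candidates:  # skip videos for which every sentence is a positive
            negatives.append((video, rng.choice(candidates)))
    return negatives


positive_pairs = [("v1", "a man jumps"), ("v1", "he lands"), ("v2", "a dog runs")]
negative_pairs = sample_negative_pairs(positive_pairs, random.Random(0))
loss_positive = binary_cross_entropy(video_score([0.2, 0.3, 0.1]), label=1)
```

Filtering the candidates against the positive set is one simple way to guarantee that no positive pair ends up among the negatives.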
Results can be further refined by adding an auxiliary task in Eq. 6, whose target is the index of the segment that best matches the sentence during training. The real segment-level labels are not available, thus we generate pseudo labels from the model's own scores. This loss further encourages competition among proposals.
The overall objective is to minimize the total loss in Eq. 7, where the auxiliary refinement term is weighted by a balancing scalar. The loss function is the cross-entropy loss.
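One way to read the refinement step and the combined objective is sketched below. The assumptions are significant and should be treated as such: the pseudo label is taken to be the index of the proposal with the highest merged score, the merged scores are renormalized into a distribution over proposals before the cross-entropy is computed, and the total loss is the video-level loss plus the weighted refinement loss. All names are hypothetical.

```python
import math
from typing import Sequence


def pseudo_label(merged_scores: Sequence[float]) -> int:
    """Index of the proposal that currently matches the sentence best (assumed argmax)."""
    return max(range(len(merged_scores)), key=lambda j: merged_scores[j])


def refinement_loss(merged_scores: Sequence[float], eps: float = 1e-7) -> float:
    """Cross-entropy between renormalized merged scores and the pseudo label."""
    total = max(sum(merged_scores), eps)
    target = pseudo_label(merged_scores)
    prob = max(merged_scores[target] / total, eps)
    return -math.log(prob)


def total_loss(video_loss: float, refine_loss: float, balance: float) -> float:
    """Overall objective: video-level loss plus the weighted auxiliary loss."""
    return video_loss + balance * refine_loss


merged_scores = [0.1, 0.3, 0.2]
label = pseudo_label(merged_scores)
aux = refinement_loss(merged_scores)
objective = total_loss(video_loss=0.5, refine_loss=aux, balance=0.1)
```

Setting `balance` to zero removes the auxiliary term entirely, so the model is then trained with the video-level loss alone.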
4.1 Experimental Settings
Implementation Details. BERT Devlin et al. (2018) is used as the sentence encoder, where the feature of ‘[CLS]’ at the last layer is extracted as the sentence representation. Visual and sentence features are linearly transformed to have the same dimension. The hidden layers for both branches have 256 units. For ActivityNet Captions, we take the proposals over multiple scales of each video provided by Duan et al. (2018) and use the C3D Tran et al. (2015) features provided by Krishna et al. (2017). For DiDeMo, we use the proposals and VGG Simonyan and Zisserman (2014) features (RGB and Flow) provided in Hendricks et al. (2017).
Evaluation Metrics. Following Gao et al. (2017a); Hendricks et al. (2017), R@k,IoU=th and mIoU are used for evaluation. Proposals are ranked according to their matching scores with the input sentence. If the temporal IoU between at least one of the top-k proposals and the groundtruth is greater than or equal to th, the sentence is counted as matched. R@k,IoU=th is the percentage of matched sentences over the total number of sentences for the given k and th. mIoU is the mean IoU between the top-1 proposal and the groundtruth.
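The metrics described above can be sketched in a few lines. Segments are assumed to be `(start, end)` tuples in seconds and proposals are assumed to be already sorted by matching score; the function names are illustrative.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two temporal segments."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_k(ranked: Sequence[Sequence[Segment]], truths: Sequence[Segment],
                k: int, th: float) -> float:
    """R@k,IoU=th: percentage of sentences with a top-k proposal whose IoU >= th."""
    matched = sum(
        1 for proposals, truth in zip(ranked, truths)
        if any(temporal_iou(p, truth) >= th for p in proposals[:k])
    )
    return 100.0 * matched / len(truths)


def mean_iou(ranked: Sequence[Sequence[Segment]], truths: Sequence[Segment]) -> float:
    """mIoU: mean IoU between the top-1 proposal and the groundtruth."""
    return sum(temporal_iou(p[0], t) for p, t in zip(ranked, truths)) / len(truths)


# Two toy queries, each with two ranked proposals.
ranked: List[List[Segment]] = [[(0.0, 10.0), (5.0, 15.0)], [(20.0, 30.0), (0.0, 5.0)]]
truths: List[Segment] = [(5.0, 15.0), (0.0, 5.0)]
r1_03 = recall_at_k(ranked, truths, k=1, th=0.3)
r2_05 = recall_at_k(ranked, truths, k=2, th=0.5)
miou = mean_iou(ranked, truths)
```

In this toy case the top-1 proposal of the first query overlaps the groundtruth with IoU 1/3, so it counts as matched at th = 0.3 but not at th = 0.5.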
4.2 Experiments on ActivityNet Captions
Dataset Description. ActivityNet Captions Krishna et al. (2017) is a large-scale dataset of human activities. It contains 20k videos with 100k video-sentence pairs in total. We train our models on the training set and test them on the validation set. Although the dataset provides segment-level annotations, we only use video-sentence pairs during training.
Baselines. We compare with strongly supervised approaches, i.e., CTRL Gao et al. (2017a), ABLR Yuan et al. (2018) and WSDEC-S Duan et al. (2018), to see how much accuracy is sacrificed when using only weak labels. Originally proposed for dense captioning, WSDEC-W Duan et al. (2018) achieves state-of-the-art performance for weakly supervised language localization. Although it shows good performance, WSDEC-W involves complicated training stages and alternates between sentence localization and caption generation for several iterations.
4.2.1 Comparison Results
Comparison results are displayed in Tab. 1. WSLLN largely outperforms WSDEC-W. Compared with the strongly supervised methods, WSLLN also outperforms CTRL. Under the metric with a loose IoU requirement, our model largely outperforms all the baselines, including both strongly and weakly supervised methods, which means that our method has a great advantage when a scenario is flexible about the IoU coverage. Under another setting in Tab. 1, our model has results comparable to WSDEC-W and largely outperforms CTRL. The overall results demonstrate the good performance of WSLLN, even though there is still a big gap between weakly supervised methods and some supervised ones, i.e., ABLR and WSDEC-S. The small variation (mean±std) of WSLLN's results across 3 runs demonstrates the robustness of our method.
4.2.2 Ablation Study
Effect of the balancing scalar. We evaluate the effect of the balancing scalar (see Eq. 7) in Tab. 2. Our model performs stably over a range of values. When the scalar is set to zero, the refining module is disabled and the performance drops. When it is set to a large number, the relative contribution of the video-level matching loss is reduced and the model performance also drops.
Effect of Sentence Encoder. WSDEC-W uses GRU Cho et al. (2014) as its sentence encoder, while our method uses BERT. This may seem an unfair comparison, since BERT is in general more powerful than GRU. However, we use a pretrained BERT model without fine-tuning on our dataset, while WSDEC-W trains its GRU end-to-end, so it is unclear which setting is better. To resolve this concern, we replace BERT with GRU, following WSDEC-W. The results when the IoU threshold is set to 0.1, 0.3 and 0.5 are 74.0, 42.3 and 22.5, respectively. The mIoU is 31.8. This shows that our model with GRU achieves results comparable to those with BERT.
Effect of Two-branch Design. We create two baselines, i.e., Align-only and Detect-only, to demonstrate the effectiveness of our design. For a fair comparison, both of them are trained using only video-sentence pairs.
Align-only contains only the alignment branch. For a positive video-sentence pair, we give positive labels to all proposals. For a negative pair, all proposals receive negative labels. The loss is calculated between the proposal scores and the generated segment-level labels.
Detect-only contains only the detection branch. Loss is calculated using the highest detection score over proposals and the video-level label at each training iteration.
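The two ablation baselines can be illustrated with the sketch below. It assumes that every score lies in (0, 1) and that a binary cross-entropy is used in both cases, which the text does not specify; the function names are hypothetical.

```python
import math
from typing import Sequence


def binary_cross_entropy(score: float, label: int, eps: float = 1e-7) -> float:
    """Cross-entropy between a score in (0, 1) and a 0/1 label."""
    p = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def align_only_loss(proposal_scores: Sequence[float], pair_label: int) -> float:
    """Align-only: every proposal inherits the video-level label of its pair."""
    losses = [binary_cross_entropy(score, pair_label) for score in proposal_scores]
    return sum(losses) / len(losses)


def detect_only_loss(detection_scores: Sequence[float], pair_label: int) -> float:
    """Detect-only: only the highest detection score is compared with the label."""
    return binary_cross_entropy(max(detection_scores), pair_label)


align_loss = align_only_loss([0.5, 0.5], pair_label=1)
detect_loss = detect_only_loss([0.2, 0.8], pair_label=1)
```

Align-only therefore pushes all proposals of a positive pair toward the positive label, including those that do not show the described event, while Detect-only supervises a single proposal per iteration.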
Comparison results are displayed in Tab. 3. It shows that the two baselines underperform WSLLN by a large margin, which demonstrates the effectiveness of our design.
4.3 Experiments on DiDeMo
Dataset Description. DiDeMo was proposed in Hendricks et al. (2017) for the language localization task. It contains 10k 30-second videos with 40k annotated segment-sentence pairs. Our models are trained using video-sentence pairs in the training set and tested on the test set.
Baselines. To the best of our knowledge, no weakly supervised method has been evaluated on DiDeMo. So, we compare with some supervised methods, i.e., MCN Hendricks et al. (2017) and LOR Hu et al. (2016). MCN is a supervised NLL model. LOR is a supervised language-object retrieval model. It utilizes much more expensive (object-level) annotations for training. We follow the same setup of LOR as in Hendricks et al. (2017) to evaluate LOR for our task.
Comparison Results are shown in Tab. 4. WSLLN performs better than LOR in terms of . We also observe that the gap between our method and the supervised NLL model is much larger on DiDeMo than on ActivityNet Captions. This may be due to the fact that DiDeMo is a much smaller dataset, which is a disadvantage for weakly supervised learning.
We propose WSLLN, a simple language localization network. Unlike most existing methods, which require segment-level supervision, our method is optimized using video-sentence pairs. WSLLN is based on a two-branch architecture where one branch performs segment-sentence alignment and the other conducts segment selection. Experiments show that WSLLN achieves promising results on ActivityNet Captions and DiDeMo.
- Weakly supervised deep detection networks. Cited by: §2, §3.2.
- Weakly supervised action labeling in videos under ordering constraints. In European Conference on Computer Vision, pp. 628–643. Cited by: §2.
- SST: single-stream temporal action proposals. In CVPR, Cited by: §1, §2.
- Rethinking the faster r-cnn architecture for temporal action localization. In CVPR, Cited by: §1, §2.
- Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv:1406.1078. Cited by: §4.2.2.
- Temporal context network for activity localization in videos. In ICCV, Cited by: §1.
- Online action detection. In ECCV, Cited by: §2.
- Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §4.1.
- Weakly supervised cascaded convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5131–5139. Cited by: §2.
- Weakly supervised dense event captioning in videos. In Advances in Neural Information Processing Systems, pp. 3059–3069. Cited by: §2, §4.1, §4.2, Table 1.
- Automatic annotation of human actions in video. In ICCV, Vol. 1, pp. 3–2. Cited by: §2.
- Tall: temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5267–5275. Cited by: §1, §2, Figure 1, §3.1, §3.2, §4.1, §4.2.
- RED: reinforced encoder-decoder networks for action anticipation. In BMVC, Cited by: §2.
- TURN TAP: temporal unit regression network for temporal action proposals. ICCV. Cited by: §1, §2.
- C-wsl: count-guided weakly supervised localization. In ECCV, Cited by: §2.
- StartNet: online detection of action start in untrimmed videos. arXiv preprint arXiv:1903.09868. Cited by: §2.
- Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5803–5812. Cited by: §1, §2, §3.1, §3.2, §4.1, §4.1, §4.3, §4.3, Table 4.
- Localizing moments in video with temporal language. Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1, §2.
- Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4555–4564. Cited by: §4.3.
- Connectionist temporal modeling for weakly supervised action labeling. In European Conference on Computer Vision, pp. 137–153. Cited by: §2.
- Deep self-taught learning for weakly supervised object localization. IEEE CVPR. Cited by: §2.
- Contextlocnet: context-aware deep network models for weakly supervised localization. In European Conference on Computer Vision, pp. 350–365. Cited by: §2.
- Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2.
- Dense-captioning events in videos. In International Conference on Computer Vision (ICCV), Cited by: §4.1, §4.2.
- Learning realistic human actions from movies. In CVPR, Cited by: §2.
- Weakly supervised object localization with progressive domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
- Temporal modular networks for retrieving complex compositional activities in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 552–568. Cited by: §1, §2.
- Is object localization for free?-weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 685–694. Cited by: §2.
- Training object class detectors with click supervision. CVPR. Cited by: §2.
- Autoloc: weakly-supervised temporal action localization in untrimmed videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 154–171. Cited by: §2.
- Online action detection in untrimmed, streaming videos-modeling and evaluation. In ECCV, Cited by: §2.
- Temporal action localization in untrimmed videos via multi-stage cnns. In CVPR, Cited by: §1, §2.
- Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Cited by: §4.1.
- Multiple instance detection network with online instance classifier refinement. CVPR. Cited by: §2.
- Learning spatiotemporal features with 3d convolutional networks. In ICCV, Cited by: §4.1.
- Untrimmednets for weakly supervised action recognition and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4325–4334. Cited by: §2, §3.2.
- Temporal recurrent networks for online action detection. arXiv:1811.07391. Cited by: §2.
- To find where you talk: temporal sentence localization in video with attention based location regression. arXiv preprint arXiv:1804.07014. Cited by: §4.2.
- MAN: moment alignment network for natural language moment retrieval via iterative graph adjustment. arXiv preprint arXiv:1812.00087. Cited by: §1, §2.
- Temporal action detection with structured segment networks. In ICCV, Cited by: §1.