Weakly Supervised Dense Event Captioning in Videos, i.e. generating multiple sentence descriptions for a video in a weakly-supervised manner.
Dense event captioning aims to detect and describe all events of interest contained in a video. Despite the advances in this area, existing methods tackle this task by making use of dense temporal annotations, which are very costly to collect. This paper formulates a new problem: weakly supervised dense event captioning, which does not require temporal segment annotations for model training. Our solution is based on the one-to-one correspondence assumption: each caption describes one temporal segment, and each temporal segment has one caption. This assumption holds in current benchmark datasets and most real-world cases. We decompose the problem into a pair of dual problems, event captioning and sentence localization, and present a cycle system to train our model. Extensive experimental results are provided to demonstrate the ability of our model on both dense event captioning and sentence localization in videos.
Dramatic improvements have been made in video understanding due to the development of deep neural networks and large-scale video datasets [1, 2, 3]. Among the wide variety of applications of video understanding, the video captioning task has attracted increasing interest in recent years [4, 5, 6, 7, 8, 9, 10, 11]. In video captioning, the machine is required to describe the video content in natural language, which makes the task more fine-grained and thus more challenging than tasks that describe the video content using a few tags or labels, such as video classification and action detection [12, 13].
The current trend in video captioning is to perform Dense Event Captioning (DEC, also called dense-captioning events in videos). As one video usually contains more than one event of interest, the goal of DEC is to locate all events in the video and caption each of them. Clearly, such dense captioning enriches the information we obtain and is beneficial for more in-depth video analysis. Nevertheless, to achieve this goal, we need to collect the caption annotation for each event along with its temporal segment coordinate (i.e., the start and end times) for network training, which is resource-consuming and impractical.
In this paper, we introduce a new problem, Weakly Supervised Dense Event Captioning (WS-DEC), which aims at dense event captioning using only the caption annotations for training. (More specifically, the term “weakly” in our paper refers to the incompleteness of the supervision rather than the amount of information.) In the training dataset, only a paragraph or a set of sentences is available to describe each video; the temporal segment coordinate of each event and its correspondence to a caption sentence are not given. At test time, the model should detect all events of interest and provide a caption for each event. One obvious advantage of weak supervision is the significant reduction of the annotation cost. This benefit becomes even more valuable if we attempt to make use of videos in the wild (e.g. videos on the web) to enlarge the training set.
We solve the problem by utilizing the one-to-one correspondence assumption: each caption describes one temporal segment, and each temporal segment has one caption. We decompose the problem into a cycle of dual problems: caption generation and sentence localization. During the training phase, we perform sentence localization from the given caption annotation to obtain the associated segment, which is then fed to the caption generator to reconstruct the caption. The objective is to minimize the reconstruction error. Our cycle process repeatedly optimizes the caption generator and the sentence localizer without any ground-truth segment. During the testing phase, it is infeasible to apply the cycle process in the same way as in training, as the caption is unknown. Instead, we first perform caption generation on a set of randomly initialized candidate segments and then map the resulting captions back to the segment space. The segments output by this cycle process get closer to the ground truths if certain properties are satisfied. We thus formulate an extra training loss to encourage our model to satisfy these properties. Based on the detected segments, we are able to perform event captioning on them, and thus achieve the goal of dense event captioning.
We summarize our contributions as follows. I. We propose to solve the DEC task without the need for temporal segment annotations, and thus introduce a new problem, WS-DEC, which aims at making use of the huge amount of data on the web while reducing the cost of annotation. II. We develop a flexible and efficient method to address WS-DEC by exploring the one-to-one correspondence between the temporal segment and the event caption. III. We evaluate the performance of our approach on the widely-used benchmark ActivityNet Captions. The experimental results verify the effectiveness of our method in terms of both dense event captioning ability and sentence localization accuracy.
We briefly review recent advances in video captioning, dense event captioning and sentence localization in videos in the next few paragraphs.
Video captioning. Early researchers simply aggregate frame-level features by mean pooling and then use pipelines similar to image captioning to generate caption sentences. This mean-pooling strategy works well for short video clips, but degrades quickly as the video length increases. Recurrent Neural Networks (RNNs) along with attention mechanisms are thus employed [5, 6, 7, 8], among which S2VT exhibits more desirable efficiency and flexibility. Since a single sentence is far from enough to describe the dynamics of an untrimmed real-world video, some researchers attempt to generate multiple sentences or a paragraph to describe the given video [9, 14, 15]. Among them, one work aims at providing diverse captions corresponding to different spatial regions in a weakly supervised manner. Despite the similar weakly-supervised setting, our paper differs in that it localizes different events temporally and performs captioning for each detected event, so that descriptions are generated from meaningful events instead of bewildering visual features.
Dense Event Captioning. Dense event captioning in videos has recently received attention [10, 11]. Current works all follow the "detection and description" framework. One model resorts to the DAP method for event detection and enhances the captioning ability by applying a context-aware S2VT. Another employs a grouping schema based on the authors' previous video highlight detector to perform event detection, and the attribute-augmented LSTM (LSTM-A) for caption generation. Most recently, [19, 20] try to boost the event proposal with the generated sentence, while another work leverages bidirectional SST instead of DAP. Bidirectional attention has also been proposed for dense captioning. In contrast to these fully-supervised works, we address the task without the guidance of temporal segments during training. Specifically, instead of detecting all events using a one-to-many-mapping event detector [23, 22], we try to localize them one by one using our sentence localizer and caption generator.
Sentence localization in videos. Due to the development of deep learning, several models have been proposed to work on real-world videos [27, 28, 29]. The approaches in [27, 28] fall into the typical two-stage “scan and localize” framework. To elaborate a bit, one employs a Moment Context Network (MCN) to match candidate video clips with the sentence query, while the other proposes a Cross-modal Temporal Regression Localizer (CTRL) that uses coarsely sampled clips to reduce computation. In contrast, a different direction regresses the temporal coordinates directly from the learned video and sentence representations. In our framework, sentence localization is originally formulated as an intermediate task to enable weakly supervised training for dense event captioning. As a by-product, our model also provides an unsupervised solution to sentence localization.
We start this section by presenting the fundamental formulation of our method and then provide the details of the model architecture.
Notations. Prior to further introduction, we first provide the key notations used in this work. A video is a sequence of image frames indexed by time. We define an event of interest as a temporally continuous segment of the video and describe each event by its temporal coordinate, given by the temporal center and the width of the segment. The temporal coordinates of all events are normalized to a fixed range throughout this paper. The caption for a segment is a sentence, i.e. a sequence of words of a given length.
Formally, the conventional event captioning models [10, 11] first locate the temporal segments of the events with an event proposal module, and then generate the caption for each segment with a caption generator. Here, under our weak supervision, the segment labels are not provided and only the caption sentences (possibly several for a single video) are available.
The biggest difficulty of our task is that it is impractical to perform weakly supervised event proposal, which is by nature a one-to-many mapping problem and is too noisy for weak learning. Instead, we try a new direction that makes use of the bidirectional one-to-one mapping between caption sentence and temporal segment. Formally, we formulate a pair of dual tasks, sentence localization and event captioning. Conditioned on a target video, the dual tasks are defined as:
Sentence localization: this task localizes the segment corresponding to a given caption, i.e., it learns a mapping from a caption to a segment, with its own set of parameters;
Event captioning: inversely, this task generates the caption for a given segment, i.e., it learns a function from a segment to a caption, with its own set of parameters.
The dual problems exist simultaneously once the correspondence between caption and segment is one-to-one, which is the case in our problem because the two are tied together by their corresponding event.
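If we nest these dual functions together, then any valid caption and segment pair becomes a fixed-point solution of the nested functions: localizing a caption and then captioning the resulting segment returns the original caption, and captioning a segment and then localizing the resulting caption returns the original segment.
More interestingly, Eq. (1) derives an auto-encoder for the caption in which the segment vanishes. This gives us a way to train the parameters of both the sentence localizer and the caption generator, by formulating the loss as the distance, under a chosen loss function, between each ground-truth caption and its reconstruction.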
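Written out, with notation that is assumed here purely for illustration ($V$ for the video, $C$ for a caption, $S$ for a segment, $l_{\theta_1}$ for the sentence localizer, $g_{\theta_2}$ for the caption generator, and $\operatorname{dist}$ for the loss distance), the dual mappings, the two fixed-point relations and the caption reconstruction loss could take roughly the following form:

```latex
\[
\begin{aligned}
S &= l_{\theta_1}(V, C), \qquad C = g_{\theta_2}(V, S), \\
C &= g_{\theta_2}\bigl(V,\; l_{\theta_1}(V, C)\bigr), \\
S &= l_{\theta_1}\bigl(V,\; g_{\theta_2}(V, S)\bigr), \\
\mathcal{L}_c &= \operatorname{dist}\Bigl(C,\; g_{\theta_2}\bigl(V,\; l_{\theta_1}(V, C)\bigr)\Bigr).
\end{aligned}
\]
```

The second line is the caption auto-encoder: the segment appears only as an intermediate quantity, so the loss in the last line can be evaluated from captions alone.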
A remaining issue is that it is still infeasible to perform dense event captioning in the testing phase by directly applying either the sentence localizer or the caption generator, since both the temporal segment and the caption sentence are unknown. To tackle this issue, we introduce the concept of fixed-point iteration as follows.
Proposition 1. We define the iteration as repeatedly captioning the current segment and localizing the resulting caption to obtain the next segment (Eq. (4)). The iterates converge to the fixed-point solution, i.e. the valid segment, if the initial segment lies within a sufficiently small distance of that solution and the composed function is locally Lipschitz continuous around it with a Lipschitz constant smaller than one.
Note that the proof has been derived previously; for better readability, we include it in the supplementary material.
With fixed-point iteration, we can solve the event captioning task without any caption or segment during testing. We randomly sample a set of candidate segments for the target video as initial guesses, and then perform the iteration in Eq. (4) on these candidates. After sufficient iterations, the outputs converge to the fixed-point solutions (i.e. the valid segments). In our experiments, we use only a single round of iteration and find it sufficient to deliver promising results. With the refined segments at hand, we generate a caption for each of them with the caption generator and thus solve the dense event captioning task.
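The test-time procedure can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `toy_caption` and `toy_localize` are hypothetical stand-ins for the trained caption generator and sentence localizer, the `EVENTS` array plays the role of the unknown valid segments, and the composed map is built to be a contraction with constant 0.3 so that the convergence condition holds by construction.

```python
import numpy as np

# Hypothetical "valid" events as (center, width) pairs, normalized to [0, 1].
EVENTS = np.array([[0.2, 0.2], [0.7, 0.3]])
CONTRACTION = 0.3  # assumed Lipschitz constant of the toy composed map (< 1)


def toy_caption(segment):
    """Stand-in for the caption generator: the 'caption' is the index of the
    nearest event plus a damped offset (a real model would emit a sentence)."""
    idx = int(np.argmin(np.linalg.norm(EVENTS - segment, axis=1)))
    return idx, CONTRACTION * (segment - EVENTS[idx])


def toy_localize(caption):
    """Stand-in for the sentence localizer: map a 'caption' back to a segment."""
    idx, offset = caption
    return EVENTS[idx] + offset


def refine(segments, rounds=1):
    """Apply the caption-then-localize cycle `rounds` times to every segment."""
    segments = np.asarray(segments, dtype=float)
    for _ in range(rounds):
        segments = np.array([toy_localize(toy_caption(s)) for s in segments])
    return segments


def distance_to_nearest_event(segments):
    """Euclidean distance from each segment to its closest valid event."""
    diffs = segments[:, None, :] - EVENTS[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1)


rng = np.random.default_rng(0)
candidates = rng.uniform(0.0, 1.0, size=(5, 2))  # random initial guesses
refined = refine(candidates, rounds=1)           # one-round iteration
print(distance_to_nearest_event(candidates))
print(distance_to_nearest_event(refined))
```

In this toy setting a single round already shrinks every candidate's distance to its nearest event by the contraction factor, which mirrors why one round can be enough when the learned maps behave well near the valid segments.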
As introduced later, both the sentence localizer and the caption generator are built from stacked neural layers, which do not naturally satisfy the local Lipschitz continuity required in Proposition 1. We thus apply the idea of the denoising auto-encoder, where we generate noisy data by adding Gaussian noise to the training data and minimize the reconstruction error between the noisy data and the true data. Explicitly, we enforce temporal segments around the true data to converge to the fixed-point solutions in one round of iteration. Recall that under the weak-supervision constraint we do not have the ground-truth segments during training; we thus use the segment predicted by the sentence localizer from the ground-truth caption as the approximate segment, and minimize the distance between this approximate segment and the segment obtained by perturbing it with Gaussian noise and passing it through one round of captioning and localization (Eq. (5)).
The Gaussian smoothing in Eq. (5) does not theoretically guarantee Lipschitz continuity, but in practice it drives random proposals to converge to the positive segments, as verified by our experiments.
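In symbols, with notation assumed here for illustration ($V$ the video, $C$ a ground-truth caption, $l_{\theta_1}$ the sentence localizer, $g_{\theta_2}$ the caption generator, and $\varepsilon$ zero-mean Gaussian noise with an assumed standard deviation $\sigma$), a denoising loss of this kind could be written as:

```latex
\[
\mathcal{L}_s
  = \operatorname{dist}\Bigl(
      l_{\theta_1}(V, C),\;
      l_{\theta_1}\bigl(V,\; g_{\theta_2}\bigl(V,\; l_{\theta_1}(V, C) + \varepsilon\bigr)\bigr)
    \Bigr),
  \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2).
\]
```

The first argument is the approximate segment; the second is the result of perturbing it and running one caption-then-localize round, so minimizing the loss pulls nearby segments back toward the approximate one.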
The core of our framework, as illustrated in Fig. 1, consists of a Sentence Localizer and a Caption Generator. Any differentiable model can be used to formulate the sentence localizer and the caption generator. Here, we introduce the ones that we use. We omit the RNN-based video and sentence feature extractors and leave their details to the supplementary material. In the following paragraphs, suppose we have obtained per-step features and hidden states for each video, and likewise features and hidden states for each caption sentence, with sequence lengths given by the lengths of the video and the caption.
Sentence Localizer. Localization requires modeling the correspondence between the video and the caption. We absorb the ideas from [29, 28] and propose a cross-attention multi-modal feature fusion framework. Here, we develop a novel attention mechanism named Crossing Attention, which contains two sub-attention computations. The first computes the attention between the final hidden state of the video and the caption feature at each time step, through a learnable matrix. The other calculates the attention between the final hidden state of the caption and the video features, through its own learnable matrix.
Then, we apply the multi-modal feature fusion layer to fuse the two sub-attentions into a mixed feature, combining element-wise multiplication, a Fully-Connected (FC) layer, and column-wise concatenation.
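A minimal NumPy sketch of the two sub-attentions and their fusion is given below. It rests on several assumptions that are not taken from the paper: the attention scores are a bilinear form followed by a softmax, the attended output is the weighted sum of the features, and the fusion concatenates the element-wise product with a `tanh` FC layer applied to the concatenated attentions. All function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attend(query, features, weight):
    """Bilinear attention: score every row of `features` against `query`
    through `weight`, then return the attention-weighted sum of the rows."""
    scores = features @ weight @ query      # shape: (steps,)
    return softmax(scores) @ features       # shape: (feature_dim,)


def crossing_attention(video_feats, video_final, cap_feats, cap_final,
                       w_cap, w_vid, fc):
    """Fuse caption-side and video-side attention into one mixed feature."""
    att_cap = attend(video_final, cap_feats, w_cap)   # video state over words
    att_vid = attend(cap_final, video_feats, w_vid)   # caption state over frames
    projected = np.tanh(fc @ np.concatenate([att_cap, att_vid]))
    return np.concatenate([att_cap * att_vid, projected])


rng = np.random.default_rng(0)
dim, hidden, n_frames, n_words = 4, 3, 6, 5
mixed = crossing_attention(
    video_feats=rng.normal(size=(n_frames, dim)),
    video_final=rng.normal(size=hidden),
    cap_feats=rng.normal(size=(n_words, dim)),
    cap_final=rng.normal(size=hidden),
    w_cap=rng.normal(size=(dim, hidden)),
    w_vid=rng.normal(size=(dim, hidden)),
    fc=rng.normal(size=(dim, 2 * dim)),
)
print(mixed.shape)
```

The mixed feature is what the later anchor classifier and segment regressor would consume in this sketch.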
One can regress the temporal segment directly by adding an FC layer on the mixed feature, but this easily gets stuck in local minima if the initial output is far away from the valid segment. To allow our prediction to move between two distant locations efficiently, we first relax the regression problem to a classification task. In particular, we evenly divide the input video into multiple anchor segments under multiple scales, and train an FC layer on the mixed feature to predict the best anchor, i.e. the one that produces the highest Meteor score of the generated caption sentence. We then conduct regression around this best anchor. Formally, the predicted segment is obtained by adjusting the best anchor segment with the regression outputs of an FC layer applied to the mixed feature (Eq. (10)).
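The anchor-then-regress idea can be sketched as follows. The scale set `(1, 2, 4)`, the additive form of the regression offsets, and the clipping to the normalized timeline are assumptions chosen for illustration, and `make_anchors` and `predict_segment` are hypothetical helper names.

```python
import numpy as np


def make_anchors(scales=(1, 2, 4)):
    """For each n in `scales`, split the normalized timeline [0, 1] into n
    equal anchor segments. Returns an array of (center, width) rows."""
    anchors = []
    for n in scales:
        width = 1.0 / n
        for k in range(n):
            anchors.append(((k + 0.5) * width, width))
    return np.array(anchors)


def predict_segment(anchor_logits, offsets, anchors):
    """Pick the highest-scoring anchor, then refine it with regression
    offsets (assumed additive here) and clip to the normalized range."""
    best = int(np.argmax(anchor_logits))
    center, width = anchors[best] + np.asarray(offsets, dtype=float)
    center = float(np.clip(center, 0.0, 1.0))
    width = float(np.clip(width, 1e-3, 1.0))
    return center, width


anchors = make_anchors()
logits = np.zeros(len(anchors))
logits[1] = 5.0  # pretend the classifier prefers the first half of the video
print(predict_segment(logits, offsets=(0.05, -0.1), anchors=anchors))
```

Because the classifier can jump to any anchor in one step, the regressor only has to make a small local correction, which is the motivation given above for relaxing regression to classification first.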
Caption Generator. Given the temporal segment, we can perform captioning on the frames clipped from the video. However, such a clipping operation is non-differentiable, making it intractable for end-to-end training. Here, we perform a soft clipping by defining a continuous mask function with respect to time. The mask is built from sigmoid functions of the time, the temporal segment, and a scaling factor. When the scaling factor is large enough, the mask becomes a step function whose value is zero except in the region covered by the segment. The conventional mean-pooled feature of the clipped frames is then equal to the weighted sum of the video features by the mask after normalization.
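One way to realize such a soft clipping, offered as a sketch under stated assumptions rather than the exact mask used here, is a difference of two sigmoids placed at the segment boundaries. The sign convention, the default `scale` value, and the frame-center time stamps are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def soft_mask(times, center, width, scale=200.0):
    """Differentiable surrogate for the indicator of [center - w/2, center + w/2].
    As `scale` grows, the mask approaches a hard step function."""
    start, end = center - width / 2.0, center + width / 2.0
    return sigmoid(scale * (times - start)) - sigmoid(scale * (times - end))


def soft_clip_pool(features, center, width, scale=200.0):
    """Mask-weighted mean of per-frame features (rows of `features`)."""
    n_frames = len(features)
    times = (np.arange(n_frames) + 0.5) / n_frames  # normalized frame centers
    mask = soft_mask(times, center, width, scale)
    return mask @ features / mask.sum()


features = np.arange(10, dtype=float).reshape(10, 1)  # toy 1-D frame features
pooled = soft_clip_pool(features, center=0.5, width=0.4)
print(pooled)
```

With a large scale the result matches ordinary mean pooling over the frames inside the segment, while gradients still flow to `center` and `width` in an autodiff framework.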
With a context vector and an initial hidden state computed from the video for the given segment, an RNN is applied to generate the caption.
The captioning loss is used to minimize the distance between the ground-truth caption and our predicted caption. We apply the cross-entropy loss (also called the perplexity loss).
The reconstruction loss compares the two segments illustrated in Fig. 1, and is implemented as a norm of their difference.
As mentioned for Eq. (10), we further train the sentence localizer to predict the best anchor segment by adding a soft-max layer on the mixed feature in Eq. (9). We define a one-hot label over the anchor segments, whose entry is one for the best anchor segment and zero otherwise. The classification loss is computed between this label and the prediction output by the soft-max layer.
Taking all losses together, the final loss is a weighted combination of the three, where the weights are constant parameters.
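The three terms and their weighted combination can be sketched in NumPy as below. The choices here are assumptions for illustration only: the captioning loss is taken as the mean negative log-likelihood of the target words, the segment loss as the Euclidean norm, the anchor loss as the negative log-probability of the best anchor, and the weights `weight_s` and `weight_a` are hypothetical placeholders rather than the values used in the experiments.

```python
import numpy as np


def caption_loss(word_probs, target_ids):
    """Mean negative log-likelihood of the target words.
    word_probs: (num_words, vocab_size) predicted distributions."""
    picked = word_probs[np.arange(len(target_ids)), target_ids]
    return float(-np.log(picked).mean())


def segment_loss(seg_a, seg_b):
    """Euclidean distance between two (center, width) segments."""
    return float(np.linalg.norm(np.asarray(seg_a) - np.asarray(seg_b)))


def anchor_loss(anchor_probs, best_anchor):
    """Cross-entropy against a one-hot label on the best anchor."""
    return float(-np.log(anchor_probs[best_anchor]))


def total_loss(l_c, l_s, l_a, weight_s=0.1, weight_a=0.1):
    """Weighted sum of the three terms (weights are placeholders)."""
    return l_c + weight_s * l_s + weight_a * l_a


l_c = caption_loss(np.array([[0.5, 0.5], [0.25, 0.75]]), [0, 0])
l_s = segment_loss((0.5, 0.2), (0.4, 0.2))
l_a = anchor_loss(np.array([0.5, 0.25, 0.25]), best_anchor=0)
print(total_loss(l_c, l_s, l_a))
```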
We conduct experiments on the ActivityNet Captions dataset, which has been used as the benchmark for dense video captioning. This dataset contains 20,000 videos in total, covering a wide range of complex human activities. For each video, the temporal segment and caption sentence of each human event are annotated. On average, there are 3.65 events annotated in each video, resulting in a total of 100,000 events. We follow the protocol suggested by [10, 11] and use 50% of the videos for training, 25% for validation, and 25% for testing.
The vocabulary size for all text sentences is set to 6000. As detailed in the supplementary material, both the video and sentence encoders use GRU models for feature extraction, where the dimensions of the hidden and output layers are 512. The trade-off parameters in our loss are fixed constants.
Training. Under the weak supervision constraint, the ground-truth temporal segments are not used for training. The video itself is regarded as a special segment that covers the entire timeline. We first pre-train the caption generator by using the entire video as input and each event caption in it as output. Such a pretraining process allows us to learn a well-initialized caption generator, since the whole video content is related to the event caption even though the correlation is not precise. After the pretraining, we train our model in two stages. In the first stage, we minimize the captioning loss and the reconstruction loss. The second stage then brings in the anchor classification loss. Details about training are provided in the supplementary material and our GitHub repository.
Testing. For testing, only input videos are available. As already discussed in § 3.1, we start from a random set of segments as initial guesses. After the one-round fixed-point iteration, we obtain the refined segments. We further filter them based on an IoU criterion (more details are given in the supplementary material), and keep those with high IoU as valid proposals. We then input the filtered segments to the caption generator to obtain the event captioning sentences. It is worth mentioning that we do not use a pretrained temporal segment proposal model (e.g. SST) for the initial temporal segment generation, since such a model relies on external temporal segment data and would contradict our motivation.
The performance is measured with the commonly-used evaluation metrics: METEOR, CIDEr, Rouge-L, and Bleu@N. We compute the above metrics on the proposals whose overlap with the ground-truth segments is larger than a given tIoU (temporal Intersection over Union) threshold, and set the score to 0 otherwise. All scores are averaged over tIoU thresholds of 0.3, 0.5, 0.7 and 0.9 in our experiments. We use the official scripts (https://github.com/ranjaykrishna/densevid_eval) for the score computation.
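The thresholded averaging can be illustrated with a simplified sketch. It is not the official evaluation script: it assumes each proposal has already been paired with one ground-truth segment and one precomputed caption score, and the names `tiou` and `averaged_score` are hypothetical.

```python
def tiou(seg_a, seg_b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def averaged_score(pairs, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """pairs: list of (proposal, ground_truth, caption_score).
    A pair keeps its caption score only if its tIoU exceeds the threshold;
    otherwise it contributes 0. Scores are averaged over pairs, then over
    thresholds."""
    per_threshold = []
    for thr in thresholds:
        kept = [score if tiou(prop, gt) > thr else 0.0
                for prop, gt, score in pairs]
        per_threshold.append(sum(kept) / len(kept))
    return sum(per_threshold) / len(per_threshold)


pairs = [
    ((0.0, 10.0), (0.0, 10.0), 0.2),   # perfect overlap, tIoU = 1
    ((0.0, 10.0), (5.0, 15.0), 0.1),   # partial overlap, tIoU = 1/3
]
print(averaged_score(pairs))
```

In this toy example the second pair counts only at the 0.3 threshold, so tighter thresholds pull the average down even though the caption scores themselves do not change.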
Baselines. No previous method has been proposed for dense event captioning under weak supervision. For oracle comparisons, we still report the results of two fully-supervised methods [10, 11]. As for our method, we implement several variants to analyze the impact of each component. The first variant is the pretrained model, where we randomly sample an event segment from each video and feed it into the pretrained caption generator for captioning in the testing phase. Another variant removes the anchor classification in Eq. 10, and thus regresses the temporal coordinate globally. As a complement, we also evaluate the version that preserves the classification term but removes the regression component from Eq. 10.
Results. The event captioning results are summarized in Table 1. In general, the Meteor and Cider metrics are considered more convincing than the other scores: the Meteor score is highly correlated with human judgment and has been used for the final ranking in the ActivityNet challenge, while Cider is a newly proposed metric that takes the repetition of sentences into account. Our method reaches performance comparable to the fully-supervised methods on the Meteor score and obtains the best score on the Cider metric. Such results are encouraging, as our method is weakly supervised and no ground-truth segment is used. Comparing the variants of our method, we observe that removing either the anchor classification or the regression decreases the accuracy, which verifies the necessity of each component in our model.
As we use a set of randomly selected temporal segments to generate the caption results, the robustness of the model to this random strategy should also be evaluated. We use different numbers of temporal segments and different random seeds to generate event caption sentences, and the evaluation results are summarized in Table 3. From the table, we can see that the variance across random seeds is small. Besides, the performance increases slightly as the number of temporal segments increases. We choose the number of segments in our final experiment as a trade-off between complexity and performance.
Moreover, we display the recall of the detected events for various methods with respect to the testing segments in Figure 2. To compute the recall, we count a predicted segment as a positive sample if its overlap with the testing segment is larger than the tIoU threshold. From Fig. 2, we can see that our model is much better than the random proposal model, which verifies the power of our weakly-supervised method. Also, our final model is better than the two baselines in general.
|Model||Weakly supervised||METEOR||CIDEr||Rouge-L||Bleu@1||Bleu@2||Bleu@3||Bleu@4|
|Ours (no classification)||True||6.08||15.1||12.25||11.85||4.67||1.90||0.80|
|Ours (no regression)||True||6.11||17.66||12.40||11.98||5.45||2.69||1.44|
Illustrations. Figure 3 illustrates the event captioning results for two videos. It presents the ground-truth descriptions and the captioning sentences produced by the pretrained model and by our method. Compared with the pretrained model, which generates a single description for each video, our model is capable of generating more accurate and detailed descriptions. Compared to the ground truths, some of the descriptions are comparable in terms of both the generated sentence and the event temporal segment. However, two issues remain. One is that our model sometimes cannot capture the beginning of an event, which, in our opinion, is because we use the final hidden state of a temporal segment to generate the description, and this state does not rely much on the starting coordinate. The other is that our model tends to generate two to three descriptions most of the time, which means that it is not good at capturing all the events in a video, especially for videos with many short events.
|Model||us||R@1, IoU 0.1||R@1, IoU 0.3||R@1, IoU 0.5||mIoU|
Using the learned sentence localizer, our model can also be applied to the sentence localization task in an unsupervised way. In this section, we provide experimental results to demonstrate the effectiveness of our model on this task.
Evaluation metric. Following the works of [29, 28], we compute the "R@1, IoU" scores at given thresholds and the “mIoU” score to measure the model’s sentence localization ability. In detail, for a given sentence and video pair, the "R@1, IoU" score at a threshold is the percentage of sentences whose top-1 predicted temporal segment has a higher IoU with the ground-truth temporal segment than that threshold, while the "mIoU" is the average tIoU between all top-1 predictions and the ground-truth temporal segments. In our experiment, the threshold is set to 0.1, 0.3 and 0.5, following previous work.
Baselines. We compare our model’s sentence localization ability with the Cross-modal Temporal Regression Localizer (CTRL) and Attention Based Location Regression (ABLR). These two models currently achieve state-of-the-art performance. Besides the unsupervised model, we also implement a fully-supervised version of our model that uses ground-truth segments.
Results. Table 2 shows the results of all compared methods. First, our supervised implementation reaches performance similar to ABLR (the state of the art), which indicates the effectiveness of our model. In the unsupervised scenario, our model outperforms CTRL by a considerable margin, which shows that it can learn to locate meaningful temporal segments from the indirect losses.
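Both localization metrics follow directly from the per-sentence tIoU values. The sketch below assumes segments are given as (start, end) pairs and that exactly one top-1 prediction is available per sentence; the function names are hypothetical.

```python
def tiou(seg_a, seg_b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(predictions, ground_truths, threshold):
    """Fraction of sentences whose top-1 segment has tIoU above `threshold`."""
    hits = [tiou(p, g) > threshold for p, g in zip(predictions, ground_truths)]
    return sum(hits) / len(hits)


def mean_iou(predictions, ground_truths):
    """Average tIoU between top-1 predictions and ground-truth segments."""
    scores = [tiou(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores)


predictions = [(0.0, 10.0), (0.0, 10.0), (0.0, 4.0)]
ground_truths = [(0.0, 10.0), (5.0, 15.0), (6.0, 10.0)]
for thr in (0.1, 0.3, 0.5):
    print(thr, recall_at_1(predictions, ground_truths, thr))
print("mIoU", mean_iou(predictions, ground_truths))
```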
We raise a new task termed Weakly Supervised Dense Event Captioning (WS-DEC) and propose an efficient method to tackle it. Weak supervision is of great importance, as it eliminates the resource-consuming annotation of accurate temporal coordinates and encourages us to explore the huge amount of videos in the wild. The proposed solution not only solves the task efficiently but also provides an unsupervised method for sentence localization. Extensive experiments on both tasks verify the effectiveness of our model. For future research, one potential direction is to verify our model by performing experiments directly on web videos. Meanwhile, since weakly supervised learning is becoming an important research vein in the domain, our method based on the cycle process and fixed-point iteration could be applied to other tasks, e.g., weakly-supervised detection.
This work was supported in part by National Program on Key Basic Research Project (No. 2015CB352300), and National Natural Science Foundation of China Major Project (No. U1611461).
Large-scale video classification with convolutional neural networks. In CVPR, 2014.
Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.