Video dataset with contextual information based on movie scripts, used in the paper: "Enriching Video Captions With Contextual Text"
Understanding video content and generating captions with context is an important and challenging task. Unlike prior methods that typically attempt to generate generic video captions without context, our architecture contextualizes captioning by infusing information extracted from relevant text data. We propose an end-to-end sequence-to-sequence model that generates video captions based on visual input and mines relevant knowledge, such as names and locations, from contextual text. In contrast to previous approaches, we do not preprocess the text further, and let the model directly learn to attend over it. Guided by the visual input, the model is able to copy words from the contextual text via a pointer-generator network, allowing it to produce more specific video captions. We show competitive performance on the News Video Dataset and, through ablation studies, validate the efficacy of contextual video captioning as well as individual design choices in our model architecture.
Understanding video content is a substantial task for many vision applications, such as video indexing/navigation, human-robot interaction, describing movies for visually impaired people, or procedure generation for instructional videos. Many difficult challenges arise from the open domain and the diverse set of objects, actions, and scenes that may be present in a video, with complex interactions and fine motion details. Furthermore, the required contextual information may not be present in the concerned video section at all, and must then be extracted from other sources.
While significant progress has been made in video captioning, stemming from the release of several benchmark datasets [4, 40, 13, 33, 34] and various neural algorithmic designs, the problem is far from solved. Most, if not all, existing video captioning approaches can be divided into two sequential stages that perform visual encoding and text decoding, respectively. These stages can be coupled further by additional transformations [12, 49], where the models are limited by the input visual content or the vocabulary of a specific dataset. Some approaches consider the preceding or succeeding video clips to extract contextual relations in the visual content and generate coherent sentences in a storytelling manner. In general, these approaches focus on a domain-specific dataset that reflects not the whole real world but only a subset, missing much of the information needed to produce results comparable to those of humans. Consequently, most captions still tend to be generic, like “someone is talking to someone”, and the knowledge about who, where, and when is missing. We try to overcome this issue by providing contextual knowledge in addition to the video representation. This allows us to produce more specific captions like “Forrest places the Medal of Honor in Jenny’s hand.” instead of just “Someone holds someone’s hand.”, as illustrated in Figure 1.
To address these limitations, we propose an end-to-end differentiable neural architecture for contextual video captioning, which extracts the required contextual information from a relevant contextual text. Our model extends a sequence-to-sequence architecture by employing temporal attention in the visual encoder, and a pointer-generator network at the text decoder which allows for extraction of background information from a contextual text input to generate rich contextual captions. The contextual text input can be any text that is relevant to the video to some degree, without strict limitations: part of the script for a movie section, an article for a news video, or a user manual for a section of an instructional video.
Contributions. The contributions of this paper are three-fold. First, we propose a method for contextual video captioning which learns to attend over the context in raw text and generates out-of-vocabulary words by copying via pointing. The source code for the full framework will be publicly available at https://github.com/primle/S2VT-Pointer. Second, we augment the LSMDC dataset by pairing video sections with the corresponding parts in the movie scripts, and share this new split with the community at https://github.com/primle/LSMDC-Context. Third, we show competitive performance both with respect to the prior state of the art and to ablation variants of our model. Through ablations we validate the efficacy of contextual captioning as well as individual design choices in our model.
Our goal of contextual caption generation is related to multiple topics. We briefly review the most relevant literature below.
It has been observed that deep neural networks such as VGG, ResNet, GoogLeNet, and even automatically learned architectures, can learn suitable image features to be transferred to various vision tasks [9, 54]. Generic representations for video and text have been receiving considerable attention. Pooling and attention over frame features [47, 52, 53], neural recurrence between frames, and spatiotemporal 3D convolution are among the common video encoding techniques [10, 30, 41]. On the language side, distributed word representations [22, 27] and recent attention-based architectures [8, 29] provide effective and generalisable representations modeling sentential semantics.
Joint Reasoning of Video and Text. Popular research topics in joint reasoning of image/video and text include video captioning [48, 52, 25], retrieval of visual content [20, 1], and text grounding in images/videos [20, 11, 31, 28]. Most approaches along these lines can be classified as belonging to either (i) joint language-visual embeddings or (ii) encoder-decoder architectures. The joint vision-language embeddings facilitate image/video or caption/sentence retrieval by learning to embed images/videos and sentences into the same space [25, 51]. The encoder-decoder architectures are similar, but instead attempt to encode images into an embedding space from which a sentence can be decoded [46, 55, 12]. Most of these approaches yield generic video captions without any context due to a lack of background knowledge.
Contextual video captioning has not received much attention yet, apart from a few attempts [44, 5, 42], which might be due to the lack of suitable datasets. One of these works presents a dataset of news videos and captions that are rich in knowledge elements, and employs a Knowledge-aware Video Description Network (KaVD) that incorporates entities from topically related text documents. Similarly, we incorporate relevant text data for a given video, through the use of pointer networks, to produce richer and contextual captions. In contrast to KaVD, we propose a model which directly operates on raw contextual text data. Our model learns to attend over the relevant words based on visual input, which allows it to learn not only contextual entities and events but also the interactions between them. Further, it allows both video captioning with background knowledge and text summarization based on visual information. We also eliminate the additional preprocessing overhead of name/event discovery and linking systems.
We now present our neural architecture for contextual video captioning; an overview of our model is shown in Figure 2. The input video clip consists of a number of consecutive frames $(v_1, \dots, v_n)$, and the contextual text sequence consists of a number of consecutive words $(w_1, \dots, w_m)$. Our task is to find a function that encodes the input sequences and decodes a contextual caption as a sequence of consecutive words $(y_1, \dots, y_k)$. We rely on a sequence-to-sequence architecture to handle variable input and output lengths. A stack of two LSTM blocks, as proposed in prior work, is used for both encoding and decoding, which allows parameter sharing between the two stages. The stack consists of a bidirectional and a unidirectional LSTM, which are mainly effective in encoding and decoding, respectively. During decoding, the bottom LSTM layer additionally uses temporal attention over the hidden states of the top LSTM layer to identify the relevant frames. In addition to the visual input, a contextual text input of variable size is encoded using another bidirectional LSTM. We use a pointer-generator network to attend over the contextual text and build a visually and contextually aware vocabulary distribution. The pointer-generator network also allows us to copy context words directly into the output caption, which enables extracting specific background knowledge not available from the visual input alone.
The baseline architecture consists of two main blocks: a bidirectional LSTM block stacked on top of a unidirectional LSTM block, modeling the input frame and output word sequences, respectively. The top LSTM takes an embedded video feature vector $f_t$ at time step $t$ as input, and passes its hidden state $h^1_t$, concatenated with the embedding of the previously predicted word $y_{t-1}$ and a frame context vector $\varphi_t$, to the bottom LSTM block:

$h^1_t, c^1_t = \mathrm{LSTM}^1\big(f_t;\; h^1_{t-1}, c^1_{t-1}\big)$

$h^2_t, c^2_t = \mathrm{LSTM}^2\big([h^1_t; E y_{t-1}; \varphi_t];\; h^2_{t-1}, c^2_{t-1}\big)$

where $c^1_t$ and $c^2_t$ are the memory cells of the top and bottom LSTM, respectively.
The time axis of the stacked LSTMs can be split into an encoding and a decoding stage. During encoding, each video frame is passed through a pretrained Convolutional Neural Network (CNN) to obtain frame-level features, from which a linear embedding to a lower-dimensional space is learned. Since there is no previously predicted word input and no frame context vector during this stage, a padding vector of zeros is used for both.
The decoding stage begins after the fixed number of encoding time steps, by feeding the beginning-of-sentence (BOS) tag to the model. The BOS tag signals the model to start decoding its latent representation of the video as a sentence. Since there is no video frame input in this stage, a padding vector is passed to the top LSTM. To obtain the frame context vector $\varphi_t$ at decoding timestep $t$, temporal attention with an additive alignment score function over the hidden states of the top LSTM is applied:

$\alpha_{t,i} = \mathrm{softmax}_i\big(v_a^\top \tanh(W_a h^1_i + U_a h^2_{t-1})\big), \qquad \varphi_t = \sum_i \alpha_{t,i}\, h^1_i$
The output of the bottom LSTM is then passed to the pointer-generator network, which generates the output word $y_t$. During the encoding stage, no loss is computed and the output of the LSTM is not passed to the pointer-generator network.
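As an illustrative sketch (not the paper's actual implementation; all parameter names and shapes below are our assumptions), the additive temporal attention over the top-LSTM hidden states can be written as:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention(enc_h, dec_h, W_a, U_a, v_a):
    """Additive (Bahdanau-style) attention over encoder hidden states.
    enc_h: (T, d) top-LSTM hidden states, one per frame.
    dec_h: (d,) current bottom-LSTM (decoder) hidden state.
    Returns the frame context vector and the attention weights."""
    scores = np.tanh(enc_h @ W_a + dec_h @ U_a) @ v_a  # (T,) alignment scores
    alpha = softmax(scores)                            # attention distribution
    context = alpha @ enc_h                            # (d,) weighted frame summary
    return context, alpha

# Toy example with random parameters
rng = np.random.default_rng(0)
T, d, k = 5, 8, 4
enc_h = rng.normal(size=(T, d))
dec_h = rng.normal(size=d)
W_a, U_a, v_a = rng.normal(size=(d, k)), rng.normal(size=(d, k)), rng.normal(size=k)
context, alpha = temporal_attention(enc_h, dec_h, W_a, U_a, v_a)
```

The attention weights form a proper distribution over frames, so the context vector is a convex combination of the frame representations.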
We use a bidirectional LSTM to learn a representation of the contextual text. At each context-encoder timestep $i$, the embedded context word $w_i$ is passed to the LSTM layer, producing a sequence of context-encoder hidden states $h_i$. These hidden states are used to build a soft attention distribution over the context word representations at each decoder timestep $t$, similar to prior pointer-generator work:

$e^t_i = v^\top \tanh(W_h h_i + W_s s_t + b_{attn}), \qquad a^t = \mathrm{softmax}(e^t)$
where $v$, $W_h$, $W_s$, and $b_{attn}$ are learned parameters, and $s_t$ denotes the decoder (bottom LSTM) hidden state. To overcome the general tendency of sequence-to-sequence models to produce repetition, [43, 35] proposed a coverage model which keeps track of the attention history. At each decoder timestep $t$, we follow the same procedure by introducing a coverage vector $c^t$, which is the sum of the previous attention distributions:

$c^t = \sum_{t'=0}^{t-1} a^{t'}$
This vector informs the model about the degree of attention that the context words have received so far, and helps the model avoid attending over the same words repeatedly. The coverage vector is fed to the pointer-generator network as an additional input, and the attention score calculation from Equation 6 is modified as:

$e^t_i = v^\top \tanh(W_h h_i + W_s s_t + w_c c^t_i + b_{attn})$
where $w_c$ is a learned parameter vector of the same shape as $v$. The resulting context vector $h^*_t$, computed as

$h^*_t = \sum_i a^t_i\, h_i,$
is then concatenated with the decoder hidden state $s_t$ and passed through two fully connected linear layers to produce the vocabulary output distribution $P_{vocab}$:

$P_{vocab} = \mathrm{softmax}\big(V'\,(V\,[s_t; h^*_t] + b) + b'\big)$
where $V$, $V'$, $b$, and $b'$ are learned parameters. At each decoder timestep $t$, we additionally calculate a generation probability $p_{gen}$, as proposed in prior work, based on the context vector $h^*_t$, the decoder hidden state $s_t$, and the embedded decoder word input $x_t$:

$p_{gen} = \sigma\big(w_{h^*}^\top h^*_t + w_s^\top s_t + w_x^\top x_t + b_{ptr}\big)$
$\sigma$ is the sigmoid function, and the vectors $w_{h^*}$, $w_s$, $w_x$ and the scalar $b_{ptr}$ are learned parameters. The generation probability is used to weight the vocabulary distribution $P_{vocab}$ and the attention distribution $a^t$ at timestep $t$. For a word $w$, the final distribution is given as:

$P(w) = p_{gen}\, P_{vocab}(w) + (1 - p_{gen}) \sum_{i\,:\,w_i = w} a^t_i$
Note that if a word $w$ does not appear in the contextual text, $\sum_{i\,:\,w_i = w} a^t_i$ is zero, and similarly, if $w$ is not in the global vocabulary, $P_{vocab}(w)$ is zero.
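The coverage-aware attention and the mixing of the generation and copy distributions can be sketched as follows (a minimal NumPy illustration; parameter names, shapes, and the extended-vocabulary bookkeeping are our assumptions, not the released implementation):

```python
import numpy as np

def coverage_attention(h, s_t, coverage, W_h, W_s, w_c, b_attn, v):
    """Attention over context-encoder states h (T, d) with a coverage term.
    `coverage` (T,) is the running sum of past attention distributions,
    which discourages attending to the same words repeatedly."""
    scores = np.tanh(h @ W_h + s_t @ W_s + np.outer(coverage, w_c) + b_attn) @ v
    a = np.exp(scores - scores.max())
    a = a / a.sum()                  # attention distribution a^t
    h_star = a @ h                   # context vector h*_t
    return a, h_star, coverage + a   # updated coverage for the next step

def final_distribution(p_vocab, a, src_ids, p_gen, ext_size):
    """Mix generation and copying:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on source
    positions holding w. `src_ids` maps each contextual-text position to
    its id in an extended vocabulary (global vocab + in-context OOVs)."""
    p = np.zeros(ext_size)
    p[: len(p_vocab)] = p_gen * p_vocab
    np.add.at(p, src_ids, (1.0 - p_gen) * a)  # scatter-add the copy mass
    return p

# Toy check: an out-of-vocabulary word (extended id 3) can only receive
# probability mass through the copy mechanism.
p_vocab = np.array([0.5, 0.3, 0.2])
a = np.array([0.6, 0.4])
p = final_distribution(p_vocab, a, src_ids=np.array([1, 3]), p_gen=0.7, ext_size=4)
```

`np.add.at` is used instead of fancy-indexed assignment so that repeated source positions holding the same word accumulate their attention mass correctly.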
The loss function per decoder timestep $t$ is given as:

$L_t = -\log P(w^*_t) + \lambda \sum_i \min(a^t_i, c^t_i)$
where $w^*_t$ is the target word and $\lambda$ is a model parameter weighting the additional coverage loss, used to penalize attending over the same contextual word representations multiple times.
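The per-timestep loss combines the negative log-likelihood of the target word with the coverage penalty; a small numeric sketch (illustrative only, with an epsilon we add to guard against log of zero):

```python
import numpy as np

def step_loss(p_final, target_id, a, coverage, lam):
    """Per-timestep training loss: NLL of the target word plus the
    weighted coverage penalty sum_i min(a_i, c_i), which punishes
    attending again to already-covered context positions."""
    nll = -np.log(p_final[target_id] + 1e-12)      # epsilon guards log(0)
    cov_penalty = np.minimum(a, coverage).sum()    # overlap with history
    return nll + lam * cov_penalty

# If all current attention lands on an already half-covered position,
# the penalty adds 0.5 on top of the NLL term.
loss = step_loss(np.array([0.5, 0.5]), 0,
                 np.array([1.0, 0.0]), np.array([0.5, 0.5]), lam=1.0)
```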
As the coverage mechanism penalizes repeated attention on the contextual text but not on the global vocabulary, we introduce an additional penalization at inference time. At timestep $t$, the output probability of a word is multiplied by a penalization factor if the word already occurs in the predicted sentence.
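The inference-time repetition penalty amounts to down-weighting already-emitted words before selecting the next one (a sketch; the function name and the idea that the factor is below one are our assumptions):

```python
def penalize_repeats(probs, emitted_ids, gamma):
    """Multiply the probability of every word id that already occurs in
    the predicted sentence by a penalization factor gamma (assumed < 1;
    its exact value is a tuned hyperparameter)."""
    probs = list(probs)
    for w in set(emitted_ids):
        probs[w] *= gamma
    return probs
```

In a greedy or beam-search decoding loop this would be applied to the output distribution at each step, right before picking the next word.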
| Dataset | Domain | # Videos | # Clips | Avg. Duration | # Sentences | Vocab Size |
| News Video | News | – | 2,883 | 52.5s | 3,302 | 9,179 |
We test our approach on two datasets that provide both video and contextual text input.
To the best of our knowledge, the News Video Dataset is the only publicly available dataset containing both visual and contextual background information for video captioning. The dataset is composed of news videos from the AFP YouTube channel (https://www.youtube.com/user/AFP), with the given descriptions as ground-truth captions. The videos cover a variety of topics, such as protests, attacks, natural disasters, and political movements, from October 2015 to November 2017. Furthermore, the authors retrieved topically related news documents using the video meta-data tags. The official release comes with the video URLs only. However, upon our request, the authors kindly shared their collected news articles with us, which we use as a contextual text input in our experiments.
The Large Scale Movie Description Challenge (LSMDC) dataset is a combination of the MPII-MD and the M-VAD datasets, consisting of a large set of video clips taken from Hollywood movies, paired with audio description (AD) sentences as ground-truth captions. Sentences in the original AD were filtered and manually aligned to the corresponding portions by the authors for better precision. The released dataset comes with the original captions as well as a pre-processed version in which all character names were replaced with someone or people. The latter version is the one most used in related research and benchmarks, as the character names come from the movie context rather than the visual input.
To adapt it to our problem, we augmented the LSMDC dataset with additional contextual text from publicly available movie scripts. The scripts were downloaded from the Internet Movie Script Database (https://www.imsdb.com) and parsed in a similar way to the public code of Adrien Luxey (https://github.com/Adrien-Luxey/Da-Fonky-Movie-Script-Parser). The text extracted from the scripts was stored in a location-scene structure and later used to narrow down the contextual text input when generating a caption for a short video clip within the movie. Next, we downloaded the movie subtitles (https://subscene.com/) and built a coarse mapping between script scenes and video times using the dialogues in the scripts. Note that public movie scripts are rare and can be a draft, a final, or a shooting version; therefore, the stage directions and especially the dialogues may differ considerably from the subtitles. To overcome this issue, we built the mapping in multiple rounds and eliminated the movies and scripts without sufficient correspondence between the video and the script. In the end, we assign to each scene in the script a coarse time interval from the movie, which can be used as contextual text input.
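A much-simplified sketch of this dialogue-based alignment idea (the function name, data layout, and similarity measure are our assumptions; the actual pipeline runs multiple matching rounds and additional filtering):

```python
import difflib

def match_scene_to_time(scene_dialogue, subtitles):
    """Coarse scene -> time mapping: find the subtitle entry whose text is
    most similar to a script scene's dialogue and return its time interval.
    subtitles: list of (start_sec, end_sec, text) tuples."""
    best_ratio, best_interval = 0.0, None
    for start, end, text in subtitles:
        ratio = difflib.SequenceMatcher(
            None, scene_dialogue.lower(), text.lower()).ratio()
        if ratio > best_ratio:
            best_ratio, best_interval = ratio, (start, end)
    return best_interval, best_ratio

subs = [(0, 4, "Hello there, how are you?"),
        (120, 124, "Run, Forrest, run!")]
interval, score = match_scene_to_time("run forrest run", subs)
```

Because draft scripts and final subtitles often diverge, a fuzzy similarity ratio is more robust here than exact string matching.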
Roughly 40 movies from the LSMDC dataset with AD sentences have an available movie script in the form of a draft, a shooting, or a final version. In a first step, we analyzed how many words of the AD captions can be recovered from the provided movie script context. In a second step, we removed the movies with an average caption/context overlap of less than 33.3% to create a smaller split with better context richness. This improves the average overlap by 7%, in exchange for a smaller but higher-quality dataset. The resulting dataset contains 23 movies. As one would expect, experiments have shown that keeping the movies with almost no useful additional context is more obstructive than helpful in the training process.
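The caption/context overlap used for this filtering can be computed as the fraction of caption tokens recoverable from the script context (a sketch; tokenization details are illustrative):

```python
def caption_context_overlap(caption_tokens, context_tokens):
    """Fraction of caption tokens that also occur in the movie-script
    context. Averaging this per movie gives the overlap statistic that
    the 33.3% filtering threshold above is applied to."""
    context = set(context_tokens)
    hits = sum(1 for tok in caption_tokens if tok in context)
    return hits / max(len(caption_tokens), 1)
```

A movie whose captions share few tokens with its script provides little copyable context and mostly adds noise during training.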
A part of the LSMDC dataset is composed of movies paired with script sentences, instead of AD sentences, as ground-truth captions. We use this split as a toy set to see how well our model can recover a caption when the ground-truth caption is contained in the contextual text. We select the movies with an available movie script and filter out those with a caption/context overlap of less than 90%. The resulting dataset contains 26 movies.
The splits above cover a small percentage of the original LSMDC dataset. We denote the larger split of remaining movies as LSMDC*, which is used to pretrain the encoder-decoder network (without contextual text input) and apply transfer learning to the relatively small splits. The new split contains all video-sentence pairs from the original dataset except the test set, since the ground-truth sentences are not available for the original test set.
For these three splits, we created our own training, test, and validation sets considering the number of clips per movie as well as the movie genres. Table I shows the statistics of the datasets used.
In all experiments, text data is lower-cased and tokenized into words. For the News Video Dataset, numbers, dates, and times are replaced with special tokens. A vocabulary is built for each respective dataset and clipped according to the occurrence frequency of words. Each word is mapped to an index, and the text input to our model is represented as one-hot vectors. Further, we use a pretrained Word2Vec model [22, 23], trained on a subset of the Google News dataset (https://code.google.com/archive/p/word2vec), as a good initialization of our word embedding layer.
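The vocabulary construction with frequency clipping can be sketched as follows (the cutoff values and special tokens are illustrative choices, not the paper's exact settings):

```python
from collections import Counter

def build_vocab(token_lists, min_count=2, max_size=None,
                specials=("<pad>", "<unk>", "<bos>", "<eos>")):
    """Build a word -> index map from lower-cased token lists, keeping
    only words that occur at least `min_count` times and at most
    `max_size` words overall (both values here are assumptions)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = [w for w, c in counts.most_common(max_size) if c >= min_count]
    return {w: i for i, w in enumerate(list(specials) + kept)}

vocab = build_vocab([["a", "man", "runs"], ["a", "man", "talks"]])
```

Words clipped from the vocabulary map to `<unk>` at training time but remain reachable through the copy mechanism whenever they appear in the contextual text.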
We compute video representations differently depending on the dataset, due to the different content and style of the videos.
For each video clip, we sample one frame per second, as the video clips from the News Video Dataset are longer (up to two minutes) and short-term temporal information is less significant due to the rapid scene changes typical of news videos. All frames (RGB images) are smoothed with a Gaussian filter before down-scaling, to avoid aliasing artifacts. The preprocessed frames are fed into a VGG-16 network pretrained on the ImageNet dataset to extract frame-level features.
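The anti-aliasing step can be illustrated with a separable Gaussian blur followed by subsampling (a sketch; the sigma, kernel radius, and plain striding are our assumptions, and any standard resize routine would serve equally well):

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2*radius + 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth_and_downscale(frame, factor, sigma=1.0):
    """Blur an (H, W, C) frame with a separable Gaussian, then subsample
    every `factor`-th pixel. Blurring first removes the high frequencies
    that would otherwise alias into the down-scaled image."""
    k = gaussian_kernel1d(sigma, radius=max(1, int(3 * sigma)))
    blur = lambda v: np.convolve(v, k, mode="same")
    out = np.apply_along_axis(blur, 0, frame.astype(float))  # vertical pass
    out = np.apply_along_axis(blur, 1, out)                  # horizontal pass
    return out[::factor, ::factor]

small = smooth_and_downscale(np.ones((8, 8, 3)), factor=2)
```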
The Large Scale Movie Description Challenge published precomputed video features, which we use directly in all our experiments. Two types of features are provided: the output of a ResNet-152 pretrained on ImageNet, taken before the softmax, and the output of the I3D model pretrained on ImageNet and Kinetics. I3D makes use of multiple frames and optical flow via 3D CNNs, so a single feature vector input to our LSTM captures a segment of multiple frames. The concatenation of the two feature vectors is fed into the top LSTM of our model.
In all our experiments, the video features and text (word) inputs are embedded into a -dimensional and -dimensional space respectively. The LSTMs in the encoder-decoder network have a hidden state size of , and the LSTM block used to encode the contextual text in the pointer generator network has a hidden state size of . During training, dropout  rate of is applied on the video feature input, embedded word input, embedded context input, and all LSTM outputs. The training is performed with the Adam  optimizer using a learning rate of .
We unroll the stacked LSTMs to timesteps: for video encoding and for caption decoding. Note that the News Video Dataset contains longer reference captions than the LSMDC* dataset and mostly includes several sub-sentences. Further, we unroll the LSTM for the contextual text to a fixed number of timesteps. Articles are cropped sentence-wise at the end to fit the maximum token length. For video clips with multiple articles, we create one sample per article and train on all of them. During inference, we take the prediction of the sample/article pair with the highest probability (i.e., the most confident one). To use transfer learning in some experiments, the complete News Video vocabulary and the most frequent words of the CNN/Daily Mail dataset [16, 24] were combined and clipped. We first train the pointer-generator network on the larger CNN/Daily Mail dataset, and the sequence-to-sequence model on the News Video Dataset. Second, we combine the pretrained models and train on the News Video Dataset. For the final model, we use a nonzero coverage loss weight. At inference time, we use beam search with a repetition penalization.
We unroll the stacked LSTMs to timesteps: for video encoding and for caption decoding. Further, we unroll the LSTM for the contextual text to a fixed number of timesteps for AD-captions and for script-captions. Movie script scenes are cropped sentence-wise from the beginning and end to fit the maximum token length. The complete LSMDC* vocabulary is used for the final models trained on the LSMDC-Context splits. We first train the sequence-to-sequence model on the larger LSMDC* dataset with someone-captions. Next, we fix the weights of the top LSTM (modeling the video) while training on LSMDC-Context-AD (respectively, LSMDC-Context-Script) with name-captions. This procedure provides a good initialization for the pointer-generator network. In a last step, we release all the weights and train the full framework end-to-end. A coverage loss weight is used in the final models. At inference time, we do not use beam search (i.e., a beam width of 1), but we do apply a repetition penalization.
We use METEOR as our quantitative evaluation metric. It is based on the harmonic mean of unigram precision and recall scores, and considers how well the predicted and reference sentences are aligned. METEOR improves on the shortcomings of BLEU by making use of semantic matching, such as stemmed word matches, synonym matches, and paraphrase matches, next to exact word matches. In all experiments, we use METEOR 1.5 (http://www.cs.cmu.edu/~alavie/METEOR), as in prior work.
| Model | METEOR [%] | ROUGE-L [%] | CIDEr [%] |
We report the performance of our model on the News Video Dataset in Table II. To understand the benefits of the individual components of our model, we also present an ablation study in which blocks of the architecture are removed. Our full model performs significantly better than the video-only and the article-only models, which are missing the pointer-generator network and the video encoder, respectively. Comparing the results between KaVD and our full model is difficult, as the authors of KaVD and the News Video Dataset only published the ratios of the train, validation, and test splits, but not the exact sets. The authors also did not report a CIDEr score.
We show a qualitative result in Figure 3 to highlight the capabilities of our model, which produces a semantically correct summary of the article based on the visual input. While the article focuses on the hush money investigation, the model correctly uses this information to augment the visual caption of protesters demonstrating in a street. This can be seen in the weighting $p_{gen}$ of the attention distribution and the global vocabulary distribution: words related to the event of protesting are taken from the global vocabulary, while entities like rio de janeiro or michel temer, as well as additional information, are successfully extracted from the article.
The performance of our model on LSMDC-Context-AD is shown in Table III. The model is able to recover a portion of the character names on average. Figure 4 shows an example where the model correctly extracts the name and scene location from the movie script. The difference between the predicted caption (visually correct) and the ground-truth caption shows the difficulties of the LSMDC dataset in general. Analyzing some example predictions shows that the model occasionally substitutes someone with a wrong character name. There are several reasons for this behaviour. Firstly, the movie script context does not necessarily include the video scene, nor the character name. Secondly, the dataset is small and does not let the model learn a good context model at the pointer-generator network. In contrast to the experiments on the News Video Dataset, which are pretrained on the CNN/Daily Mail dataset, the pointer-generator network here lacks a good initialization, due to the absence of larger text corpora with similar content and style for LSMDC-Context-AD.
Table IV shows the performance on LSMDC-Context-Script. The model is able to learn the mapping between the video and the ground-truth caption, which is mostly available in the contextual text. Analyzing some example predictions reveals an issue with script-based captions and why the scores remain relatively low. In LSMDC, consecutive samples tend to have almost identical visual input, yet the reference sentences describe different levels of scene detail (e.g. lester, carolyn and jane are eating dinner by candlelight vs. red roses are bunched in a vase at the center of the table). Without awareness of the sequence of samples, a correct mapping between the script sentences and the reference sentences is ambiguous, because a reasonable system would always go for the most likely sentence.
As the ground truth captions from the LSMDC-Context splits highly depend on the respective video clip, we omit the results of the Movie-Script-only model. In contrast to the News Video Dataset, the captions do not reflect a possible summary of the text input and therefore the results are uninformative.
In this paper, we proposed an end-to-end trainable contextual video captioning method that can extract relevant information from a supplementary contextual text input. Extending a sequence-to-sequence model with a pointer-generator network, our model attends over the relevant background knowledge and copies the corresponding words from the given text input. Results on the News Video Dataset and LSMDC-Context validate the competitive performance of our model, which directly operates on raw contextual text data without the need for additional tools, unlike prior methods. Furthermore, we make the source code of our framework and LSMDC-Context publicly available for other researchers. The performance of the presented method is naturally limited by the level of correspondence between the video and the chosen contextual text. In the future, we plan to involve multiple contextual resources to extract the relevant contextual information with more confidence and precision.
Localizing moments in video with natural language. In ICCV.
Attend to you: personalized image captioning with context sequence memory networks. In CVPR.
Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Journal of Machine Learning Research 15 (1).
Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729.
Bidirectional attentive fusion with context gating for dense video captioning. In CVPR, pp. 7190–7198.
Thirty-Second AAAI Conference on Artificial Intelligence.