Source code for Semantics-Assisted Video Captioning
Given the features of a video, a recurrent neural network can be used to automatically generate a caption for the video. Existing methods for video captioning have at least three limitations. First, semantic information has been widely applied to boost the performance of video captioning models, but existing networks often fail to provide meaningful semantic features. Second, the Teacher Forcing algorithm is often utilized to optimize video captioning models, but different strategies are applied to guide word generation during training and inference, which leads to poor performance. Third, current video captioning models are prone to generate relatively short captions, which express video contents inadequately. To resolve these three problems, we make three corresponding improvements. First, we utilize both static spatial features and dynamic spatio-temporal features as input for a semantic detection network (SDN) in order to generate meaningful semantic features for videos. Second, we propose a scheduled sampling strategy that gradually transfers the training phase from a teacher-guided manner towards a more self-teaching manner. Third, the ordinary log-probability loss function is modulated by sentence length so that the inclination towards short sentences is alleviated. Our model achieves state-of-the-art results on the Youtube2Text dataset and is competitive with the state-of-the-art models on the MSR-VTT dataset.
video captioning, scheduled sampling, sentence-length-modulated loss, semantic assistance, RNN
Video captioning aims to automatically generate a concise and accurate description for a video. It requires techniques from both computer vision (CV) and natural language processing (NLP). Deep learning (DL) methods for sequence-to-sequence learning are able to learn the mapping from discrete color arrays to dense vectors, which are then used to generate natural language sequences without human intervention. These methods have produced impressive results on this task compared with the results yielded by manually crafted features.
It has gained increasing attention in video captioning that the semantic meaning of a video is critical and beneficial for an RNN to generate annotations (Pan et al., 2016; Gan et al., 2017). Keeping semantic consistency between video content and video description also helps to refine the semantic richness of a generated sentence (Gao et al., 2017). But few researchers have explored methods to obtain video semantic features, metrics to measure their quality, or the correlation between video captioning performance and the meaningfulness of semantic features.
Several training strategies have been used to optimize video captioning models, such as the Teacher Forcing algorithm and CIDEnt-RL. The Teacher Forcing algorithm is a simple and intuitive way to train an RNN. But it suffers from the discrepancy between training, which utilizes ground truth to guide word generation at each step, and inference, which samples from the model itself at each step. RL techniques have also been adopted to improve the training process of video captioning. CIDEnt-RL is one of the best RL algorithms (Pasunuru and Bansal, 2017b), but it is extremely time-consuming to calculate metrics for every batch. In addition, the improvement on different metrics is unbalanced. In other words, the improvements on other metrics are not as large as that on the specific metric optimized directly.
The commonly used loss function for video captioning is composed of the logarithms of the probabilities of the correct target words (Venugopalan et al., 2015; Donahue et al., 2015). A long sentence tends to bring a high loss to the model because each additional word reduces the joint probability by roughly at least one order of magnitude. In contrast, a short sentence with few words has a relatively low loss. Thus a video captioning model is prone to generate short sentences after being optimized by a log-likelihood loss function. Excessively short annotations may neither describe a video accurately nor express its content in rich language.
We propose to improve the video captioning task in three aspects. Firstly, we build our semantic detection network (SDN) on top of two streams: the first one is a 2D ConvNet, which is supposed to capture the static visual features of a video, and the second one is a 3D ConvNet, which is intended to extract the dynamic visual information. Consequently, the SDN is able to produce more meaningful semantic features for a video. Secondly, we take advantage of the scheduled sampling method to train our video captioning model, which searches extreme points in the RNN state space more extensively and bridges the gap between the training process and inference (Bengio et al., 2015). Thirdly, we optimize our model with a sentence-length-modulated loss function, which encourages the model to generate longer captions with more detail.
The encoder-decoder paradigm has been widely applied by researchers in image captioning since it was introduced to machine translation (Cho et al., 2014). It has become a mainstream method in both image captioning and machine translation (Vinyals et al., 2015; Mao et al., 2014). Inspired by successful attempts to employ attention in machine translation (Bahdanau et al., 2015) and object detection (Ba et al., 2015), models that are able to attend to key elements in an image are investigated for the purpose of generating high-quality image annotations. Semantic features (You et al., 2016) and object features (Anderson et al., 2018) are incorporated into the attention mechanism as heuristic information to guide selective and dynamic attendance of salient segments in images. Reinforcement learning techniques, which optimize specific metrics of a model directly, are also adopted to enhance the performance of image captioning models (Rennie et al., 2017). Graph Convolutional Networks (GCN) are introduced to cooperate with RNNs to integrate both semantic and spatial information into the image encoder in order to generate excellent representations of images (Yao et al., 2018).
Though both image captioning and video captioning are multi-modal tasks, video captioning is probably the harder of the two because a video contains not only spatial features but also temporal correlations.
Following the successful adoption of the encoder-decoder paradigm in image captioning, multi-modal features of a video are fed into a sequence-to-sequence model to generate a video description with the assistance of models pretrained on image classification (Venugopalan et al., 2015; Donahue et al., 2015). In order to alleviate the semantic inconsistency between the video content and the generated caption, visual features and semantic features of a video are mapped to a common embedding space so that semantic consistency may be achieved by minimizing the Euclidean distance between the two embedded features (Pan et al., 2016).
RNNs, especially LSTM, can be extended by integrating high-level tags or attributes of a video with its visual features through embedding and element-wise addition/multiplication operations (Gan et al., 2017). Yu et al. (2016) exploit a sentence generator, which is built upon an RNN module to model language, a multi-modal layer to integrate information from different modalities and an attention module to dynamically select salient features from the input. The output of the sentence generator is fed into a paragraph generator for describing a relatively long video with several sentences.
Following the attention mechanism introduced by Xu et al. (2015), Gao et al. (2017) capture the salient structure of a video with the help of its visual features and the context information provided by LSTM. Though bottom-up (Anderson et al., 2018) and top-down attention (Ramanishka et al., 2017) are proposed for image captioning, selectively focusing on salient regions in an image is, to some extent, similar to picking key frames in a video (Chen et al., 2018). Wang et al. (2018a) explore cross-modal attention at different granularities and capture global as well as local temporal structures encompassed in multi-modal features to assist the generation of video captions.
The traditional method to train an RNN is the Teacher Forcing algorithm (Williams and Zipser, 1989), which feeds human annotations to the RNN as input at each step to guide token generation in training and samples a token from the model itself as input during inference. The different sources of input tokens during training and inference lead to the inability of the model to generate high-quality tokens in inference, as errors may accumulate along the sequence generation.
Bengio et al. (2015) propose to switch gradually from guiding generation with true tokens to feeding sampled tokens during training, which helps the RNN model adapt to the inference scheme in advance. This approach has been applied to image captioning and speech recognition. Inspired by Huszar (2015), who shows mathematically that both the Teacher Forcing algorithm and Curriculum Learning tend to learn a biased model, Goyal et al. (2016) address the problem by adopting an adversarial domain method to align the dynamics of the RNN in training and inference.
Inspired by the successful application of RL methods in image captioning (Rennie et al., 2017), Pasunuru and Bansal (2017b) propose a modified reward, which compensates for the logical contradiction in phrase-matching metrics, as the direct optimization target in video captioning. The gradient of the non-differentiable RL loss function is computed and back-propagated by the REINFORCE algorithm (Williams, 1992). But calculating the reward for each training batch adds a non-negligible computation cost to the training process and slows down optimization. In addition, the improvements of the RL method on other metrics are not comparable with the improvement on the specific metric used as the RL reward.
We consider video captioning as a supervised task. The training set is annotated as pairs $(V, Y)$, where $V$ denotes a video and $Y$ represents the corresponding target caption. Suppose there are $n$ frames in a video and a caption consists of $m$ words; then we have (1):
$$V = [F_1, F_2, \ldots, F_n], \quad Y = [y_1, y_2, \ldots, y_m], \tag{1}$$
where each $F_k$ denotes a single frame and each $y_t$ denotes a word belonging to a fixed known dictionary.
A pretrained model is used to produce word embeddings. We then obtain a low-dimensional embedding of a caption (2):
$$\tilde{Y} = [\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_m] \in \mathbb{R}^{m \times d}, \tag{2}$$
where $d$ is the dimension of the word embedding space.
Our encoder is composed of a 3D ConvNet, a 2D ConvNet and a semantic detection network (SDN). The 3D ConvNet is utilized to produce the spatio-temporal feature $e_i$ for the $i$th video. The 2D ConvNet is supposed to find the static visual feature $r_i$ for the $i$th video. Finally, the visual spatio-temporal representation $v_i$ of the $i$th video is obtained by concatenating the two features (3):
$$v_i = [e_i; r_i]. \tag{3}$$
For semantic detection, we manually select the $K$ most common words from both the training set and the validation set as candidate tags for all the videos. The semantic detection task is treated as a multi-label classification task with $v_i$ as the representation of the $i$th video and $a_i \in \{0, 1\}^K$ as the ground truth. If the $j$th tag exists in the annotations of the $i$th video, then $a_{i,j} = 1$; otherwise, $a_{i,j} = 0$. Suppose $s_i$ is the semantic feature of the $i$th video. Then we have $s_i = \sigma(g(v_i))$, where $g(\cdot)$ is a nonlinear mapping and $\sigma(\cdot)$ is the sigmoid activation function. A relatively deep multi-layer perceptron (MLP) on top of the two-stream framework is exploited to approximate the nonlinear projection. The SDN is trained by minimizing the loss function (4):
$$\mathcal{L}_{\mathrm{SDN}} = -\frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{K} \left[ a_{i,j} \log s_{i,j} + (1 - a_{i,j}) \log (1 - s_{i,j}) \right], \tag{4}$$
where $M$ is the number of training videos.
The SDN produces a probability distribution over tags, $s_i$, to represent the semantic content of the $i$th video in the training set, the validation set or the test set.
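The multi-label setup described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the tag list, the caption, the logits standing in for the MLP output, and the choice to average the binary cross-entropy over tags are all assumptions made for the example.

```python
import math

def tag_targets(caption_words, candidate_tags):
    """Binary ground truth: 1.0 if the tag occurs among the caption words."""
    words = set(caption_words)
    return [1.0 if tag in words else 0.0 for tag in candidate_tags]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_bce(logits, targets, eps=1e-12):
    """Binary cross-entropy averaged over the tags of one video (assumed form)."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
    return total / len(targets)

tags = ["man", "dog", "play", "cook"]           # hypothetical candidate tags
words = "a man is playing with a dog".split()   # hypothetical caption
targets = tag_targets(words, tags)              # exact match only: "playing" != "play"
logits = [2.0, 1.5, -0.5, -3.0]                 # stand-in for the MLP output
loss = multilabel_bce(logits, targets)
```

The sigmoid outputs play the role of the semantic feature: one independent probability per tag rather than a softmax over tags, so several tags can be active for the same video.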
The standard RNN (Elman, 1990) is capable of learning temporal patterns along input sequences. But it suffers from the vanishing/exploding gradient problem, which results in its inability to generalize to long sequences. LSTM (Hochreiter and Schmidhuber, 1997) is a prevailing variant of RNN that alleviates the long-term dependency problem by using gates to update the cell state, but it ignores the semantic information of the input sequence. We use SCN (Gan et al., 2017), a variant of LSTM, as our decoder because it not only avoids the long-term dependency problem but also takes advantage of the semantic information of the input video. Suppose we have video feature $v$, semantic feature $s$, input vector $x_t$ at time step $t$ and hidden state $h_{t-1}$ at time step $t-1$. SCN integrates the semantic information $s$ into $v$, $x_t$ and $h_{t-1}$ respectively and obtains the semantics-related video feature $\tilde{v}_z$, the semantics-related input $\tilde{x}_{z,t}$ and the semantics-related hidden state $\tilde{h}_{z,t-1}$ (5),
where $z \in \{c, i, f, o\}$, and $c$, $i$, $f$ and $o$ denote cell state, input gate, forget gate and output gate respectively.
Then the input gate $i_t$, forget gate $f_t$ and output gate $o_t$ at time step $t$ are calculated respectively in a way similar to the standard LSTM (6),
where $\sigma$ denotes the logistic sigmoid function and $b_z$ is a bias term for each gate.
The raw cell state $\tilde{c}_t$ at the current step $t$ can be computed as (7),
where $\tanh$ denotes the hyperbolic tangent function and $b_c$ is the bias term for the cell state. The input gate $i_t$ is supposed to control the throughput of the semantics-related input, and the forget gate $f_t$ is designed to determine the preservation of the previous cell state $c_{t-1}$. Thus, we have the final cell state $c_t$ at time step $t$ (8).
The output gate $o_t$ is then utilized to control the throughput ratio of the cell state $c_t$ so that the cell output $h_t$ can be determined by (9).
The semantics-related variables depend on the semantic feature $s$, so that SCN takes the semantic information of the video into account implicitly. The forget gate is a key component in updating $c_{t-1}$ to $c_t$, which to some degree avoids the long-term dependency problem. Our SCN is slightly different from the one in (Gan et al., 2017). In our model, $\tanh$ is utilized to activate the raw cell input, which confines it within $(-1, 1)$, instead of $\sigma$ as in (Gan et al., 2017), which leads to a range of $(0, 1)$. In addition, we add a semantics-related video feature term to each recurrent step, which is absent from (Gan et al., 2017).
In the context of RNN trained with the Teacher Forcing algorithm, the logarithmic probability of a given pair of input/output/label and given model parameters can be calculated as (10).
where is the length of output.
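To make the teacher-forced log probability concrete, the sketch below sums per-step log probabilities while always conditioning on the ground-truth previous word. The bigram table is a hypothetical stand-in for the RNN (a real decoder conditions on the full history and on the video), and the `<bos>`/`<eos>` markers are assumed names.

```python
import math

# Hypothetical bigram "model": P(next word | previous word).
BIGRAM = {
    "<bos>": {"a": 0.9, "dog": 0.1},
    "a": {"dog": 0.6, "man": 0.4},
    "dog": {"runs": 0.7, "<eos>": 0.3},
    "man": {"runs": 0.5, "<eos>": 0.5},
    "runs": {"<eos>": 1.0},
}

def teacher_forced_log_prob(target):
    """Sum of log P(y_t | y_{t-1}), conditioning on the ground-truth prefix."""
    log_prob, prev = 0.0, "<bos>"
    for word in target:
        log_prob += math.log(BIGRAM[prev][word])
        prev = word  # teacher forcing: feed the true word, not a model sample
    return log_prob

lp = teacher_forced_log_prob(["a", "dog", "runs", "<eos>"])
```

Because every factor is at most 1, each extra word can only lower the sum, which is the length effect that motivates the sentence-length-modulated loss later in the paper.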
In the case of SCN, the joint logarithmic probability can be computed as follows:
where , , are the output state, the cell state and the semantic feature of the th video respectively.
To some extent, and can be viewed as the aggregation of all the previous information. We can compute them with recurrence relation (12).
where , . In inference, we need to replace with which may lead to the accumulation of prediction errors.
In order to bridge the gap between training and testing in the Teacher Forcing algorithm, we propose to train our video captioning model with scheduled sampling. Scheduled sampling transfers the training process gradually from using ground truth words for guiding to using sampled words for guiding at each recurrent step. The commonly used strategy to sample a word from the output distribution is argmax. But its search scope is limited to a relatively small part of the search space because it always selects the word with the largest probability. To enlarge the search scope, we draw a word at random from the output distribution as a part of the input for the next recurrent step. In this way, words with higher probabilities are still more likely to be chosen. The randomness of the sampling procedure enables the recurrent network to explore a relatively large scope of the network state space, and the network is less likely to be stuck in an inferior local minimum. From the perspective of training a machine learning model, the multinomial sampling strategy reduces overfitting of the network; in other words, it acts like a regularizer.
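The difference between always taking the most probable word and drawing a word at random from the output distribution can be sketched as follows. The three-word distribution and the function names are illustrative assumptions; only the standard-library `random.choices` call is used for the multinomial draw.

```python
import random

def sample_argmax(probs):
    """Index of the most probable word (deterministic)."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_multinomial(probs, rng):
    """Index drawn at random in proportion to the word probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)                 # fixed seed for reproducibility
probs = [0.5, 0.3, 0.2]                # hypothetical output distribution
greedy = [sample_argmax(probs) for _ in range(1000)]
drawn = [sample_multinomial(probs, rng) for _ in range(1000)]
```

The greedy list contains a single index, while the random draws visit every word with a frequency that follows its probability, which is the wider exploration the paragraph above refers to.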
Our method to optimize the language model consists of two parts: the outer loop schedules the sampling probability for each recurrent step (Algorithm 1), while the algorithm inside the RNN (Algorithm 2) specifies the procedure to sample from the output of the model with a given probability as a part of the input for the next step of the RNN.
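The two-part procedure can be sketched roughly as below. Algorithms 1 and 2 themselves are not reproduced in this text, so this is only an assumed shape: `linear_schedule` is a hypothetical outer-loop schedule (including its `max_prob` value), and `choose_next_input` is a hypothetical inner step that mixes ground-truth and sampled words.

```python
import random

def linear_schedule(epoch, num_epochs, max_prob=0.25):
    """Hypothetical outer loop: sampling probability grows linearly per epoch."""
    return max_prob * epoch / max(1, num_epochs - 1)

def choose_next_input(true_word, step_probs, sample_prob, rng):
    """Inner step: with probability `sample_prob`, feed a word drawn from the
    model's own output distribution; otherwise feed the ground-truth word."""
    if rng.random() < sample_prob:
        return rng.choices(range(len(step_probs)), weights=step_probs, k=1)[0]
    return true_word

rng = random.Random(0)
step_probs = [0.7, 0.2, 0.1]           # hypothetical model output at one step
inputs = [choose_next_input(2, step_probs, linear_schedule(e, 50), rng)
          for e in range(50)]
```

With a sampling probability of 0 the procedure reduces to plain teacher forcing, and with a probability of 1 every input comes from the model, so the schedule controls how far training has moved towards the inference regime.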
What is a good description for a video? A good description should be both accurate and concise. In order to achieve this goal, we design a sentence-length-modulated loss function (13) for our model:
$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{L_i^{\beta}} \sum_{t=1}^{L_i} \log P(y_{i,t} \mid y_{i,1:t-1}, v_i, s_i; \theta), \tag{13}$$
where $N$ is the batch size, $L_i$ is the length of the $i$th caption, and $\beta$ is a hyperparameter used to keep a balance between the conciseness and the accuracy of generated captions. If $\beta = 0$, this reduces to the loss function commonly used in the video captioning task. In this loss function, a long sentence has a greater loss than a short sentence. Thus, after minimizing the loss, the RNN is inclined to generate relatively short annotations, which may be incomplete in semantics or sentence structure. If $\beta = 1$, all words in generated captions are treated equally in the loss function as well as in the process of optimization, which may lead to redundant or duplicate words in the generated captions.
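A small numeric sketch may help to see how the exponent trades off the two extremes. It assumes the loss divides each caption's negative log-likelihood by the caption length raised to the hyperparameter (called `beta` here) and averages over the batch; the word probabilities are made up for illustration.

```python
import math

def length_modulated_loss(batch_word_probs, beta):
    """Mean over captions of NLL / (caption length ** beta) (assumed form)."""
    total = 0.0
    for word_probs in batch_word_probs:
        nll = -sum(math.log(p) for p in word_probs)
        total += nll / (len(word_probs) ** beta)
    return total / len(batch_word_probs)

short = [0.5] * 4   # hypothetical 4-word caption, each word with probability 0.5
long_ = [0.5] * 8   # hypothetical 8-word caption with the same per-word quality

plain_short = length_modulated_loss([short], beta=0.0)
plain_long = length_modulated_loss([long_], beta=0.0)
norm_short = length_modulated_loss([short], beta=1.0)
norm_long = length_modulated_loss([long_], beta=1.0)
```

With `beta=0.0` the longer caption is penalized twice as much although its words are equally well predicted; with `beta=1.0` both captions receive the same loss. Intermediate values interpolate between these two behaviours.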
Thus, we have the following optimization problem:
where is the size of training data and is the parameter of our model.
The overall structure of our model is visualized in Figure 1. In practice, our SDN and the visual feature extractors in the encoder component share the same 2D ConvNet and 3D ConvNet.
We evaluate our model on two popular video captioning datasets to show the performance of our approach. And then we compare our results with other existing methods.
YouTube2Text, or MSVD (Guadarrama et al., 2013; Chen and Dolan, 2011), published in 2013, contains 1970 short YouTube video clips with an average length of about 10 seconds. There are roughly 40 descriptions for each video. We follow the dataset split used in prior works (Pan et al., 2016; Yu et al., 2016; Gan et al., 2017), in which the training set contains 1200 clips, the validation set contains 100 clips and the remaining clips form the test set. We tokenize the captions from the training and validation sets and obtain around 14000 unique words. 12592 of them are utilized for prediction and the rest are represented by <unk>. We add a symbol <eos> to signal the end of a sentence.
MSR-Video to Text (MSR-VTT) (Xu et al., 2016; Pan et al., 2016) is a large-scale video benchmark first presented in 2016. In its first version, MSR-VTT provides 10k short video segments with 200k descriptions in total. Each video segment is described by about 20 independent English sentences. In its second version, published in 2017, MSR-VTT provides an additional 3k short clips as the test set, and the video clips of the first version are used as the training and validation sets. Because human annotations are lacking for the test set of the second version, we perform experiments on the first version. We tokenize the captions and obtain 14071 unique words that appear more than once in the training and validation sets of MSR-VTT 1.0. 13794 of them are indexed with integers starting at 0 and the rest are represented by <unk>. <eos>, which signifies the end of a sentence, is added to the vocabulary of MSR-VTT.
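The vocabulary construction described for both datasets can be sketched as follows. This is an assumed procedure, not the authors' preprocessing code: whitespace tokenization, the alphabetical index order, and the special token names `<unk>` and `<eos>` are illustrative choices, and the frequency threshold mirrors the "more than once" rule stated for MSR-VTT.

```python
from collections import Counter

def build_vocab(captions, min_count=2, unk="<unk>", eos="<eos>"):
    """Index words seen at least `min_count` times, then add special tokens."""
    counts = Counter(w for caption in captions for w in caption.lower().split())
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    vocab = {word: idx for idx, word in enumerate(kept)}  # indices start at 0
    for special in (unk, eos):
        vocab[special] = len(vocab)
    return vocab

def encode(caption, vocab, unk="<unk>", eos="<eos>"):
    """Map a caption to word indices, replacing rare words and appending <eos>."""
    ids = [vocab.get(w, vocab[unk]) for w in caption.lower().split()]
    return ids + [vocab[eos]]

captions = ["a dog runs", "a dog sleeps"]   # hypothetical training captions
vocab = build_vocab(captions)
ids = encode("a cat runs", vocab)
```

Words below the threshold ("runs", "sleeps") and unseen words ("cat") all map to the same out-of-vocabulary index, so the decoder's output layer only has to cover the retained words plus the two special tokens.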
Based on the widely used BLEU, METEOR, ROUGE-L and CIDEr metrics, we propose an overall score (16) to evaluate the performance of a language model:
$$\mathrm{overall} = \frac{1}{4} \left( \frac{\text{B-4}}{\text{B-4}_b} + \frac{C}{C_b} + \frac{M}{M_b} + \frac{R}{R_b} \right), \tag{16}$$
where B-4 denotes BLEU-4, C denotes CIDEr, M denotes METEOR, R represents ROUGE-L and the subscript $b$ denotes the best numeric value on the specific metric. We presume that BLEU-4, CIDEr, METEOR and ROUGE-L each reflect one particular aspect of the performance of a model. We first normalize each metric value of a model and then take the mean of the normalized values as an overall measurement for that model (16). If the result of a model on each metric is close to the best result of all models, the overall score will be close to 1. The overall score is 1 if and only if a model has the state-of-the-art performance on all metrics. If a model is much lower than the state-of-the-art result on each metric, the overall score of the model will be close to 0.
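The overall score described above amounts to normalizing each metric by the best value among all compared models and averaging the four ratios. The sketch below follows that description; the metric values are invented for illustration and are not results from the paper.

```python
def overall_score(model, best):
    """Mean of each metric divided by the best value among compared models."""
    keys = ("B-4", "C", "M", "R")
    return sum(model[k] / best[k] for k in keys) / len(keys)

# Hypothetical metric values, not taken from the paper's tables.
best = {"B-4": 60.0, "C": 100.0, "M": 40.0, "R": 80.0}
model = {"B-4": 54.0, "C": 100.0, "M": 38.0, "R": 72.0}
score = overall_score(model, best)
```

Normalizing first keeps a metric with a large numeric range, such as CIDEr, from dominating the average.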
Our visual feature consists of two parts: static visual feature and dynamic visual feature. ResNeXt (Xie et al., 2017), which is pretrained on the ImageNet ILSVRC2012 dataset, is utilized as the static visual feature extractor in the encoder of our model. ECO (Zolfaghari et al., 2018), which is pretrained on the Kinetics-400 dataset, is utilized as the dynamic visual feature extractor in the encoder of our model. More specifically, we take the 2048-dimension average pooling feature vector of the conv5/block3 output of ResNeXt as the 2D representation of videos and take the 1536-way feature of the global pool in ECO as the 3D representation of videos. We set initial learning rate as for MSVD while for MSR-VTT. In addition, we drop the learning rate by 0.316 every 20350 steps for MSR-VTT. Batch size is set to 64 and the Adam algorithm is utilized to optimize the model for both datasets. The hyperparameter $\beta_1$ is set to 0.9, $\beta_2$ is set to 0.999 and $\epsilon$ is set to for Adam algorithm. Each model is trained for 50 epochs in which the hyperparameter sample probability is set as for the th epoch. We fine-tune the hyperparameters of our model on the validation sets and select the best checkpoint for testing according to the overall score of evaluation on the validation set.
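The step decay stated for MSR-VTT (multiplying the learning rate by 0.316 every 20350 steps) can be written as a one-line schedule. The initial learning rate is not recoverable from this text, so it is left as an argument; the function name and the value used in the example are hypothetical.

```python
def step_decay_lr(initial_lr, step, decay=0.316, interval=20350):
    """Learning rate after `step` updates with a staircase decay."""
    return initial_lr * decay ** (step // interval)

# Hypothetical initial learning rate, chosen only to show the staircase shape.
lrs = [step_decay_lr(1.0, s) for s in (0, 20349, 20350, 40700)]
```

Since 0.316 is approximately the square root of 0.1, two consecutive drops reduce the learning rate by roughly a factor of ten.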
Empirically, we evaluate our method on Youtube2Text/MSVD (Guadarrama et al., 2013) and MSR-VTT (Xu et al., 2016). We report the results of our model along with many existing models in Table 1 and Table 2.
LSTM-E makes use of VGG and C3D as visual feature extractors. In LSTM-E, a joint embedding component is utilized to bridge the gap between visual information and sentence content. h-RNN is composed of a sentence generator and a paragraph generator. The sentence generator of h-RNN exploits a temporal-spatial attention mechanism to focus on key segments during generation. The paragraph generator of h-RNN captures the dependency between the outputs of the sentence generator at different time steps and provides the sentence generator with a new initial state. aLSTMs integrates LSTM with an attention mechanism to capture the salient elements in a video. Moreover, aLSTMs projects the visual feature and the generated sentence feature into a common space and keeps the semantics consistent by minimizing the Euclidean distance between the two embedded features. SCN utilizes a semantics-related variant of LSTM as decoder and exploits C3D and ResNet as encoder. MTVC shares the same model across the video captioning task, the video prediction task and the entailment generation task, so that the performance on each task benefits from the other two tasks. MTVC also utilizes an attention mechanism and ensemble learning. SibNet exploits an autoencoder for visual information and a visual-semantic joint embedding for semantic information as its encoder. The decoder of SibNet generates captions for videos with soft attention. As we can infer from Table 1, our method outperforms all the other methods on all the metrics by a large margin. Compared with the previously best results, BLEU-4, CIDEr, METEOR and ROUGE-L are improved by 13.4%, 11.5%, 5.0% and 5.5% respectively. Our model has the highest overall score (16).
Table 2 displays the evaluation results of several video captioning models on MSR-VTT. v2t_navigator, Aalto and VideoLAB are the top 3 models in the MSR-VTT 2017 challenge. MTVC and SibNet are similar to the ones trained on MSVD. CIDEnt-RL optimizes the model with an entailment-enhanced reward (CIDEnt) by a reinforcement learning technique. The CIDEr of our method is only 0.3 lower than that of CIDEnt-RL, which directly optimizes CIDEr by an RL method, and our method is better than CIDEnt-RL on the other metrics by at least 1.6%. HACA exploits a so-called hierarchically aligned cross-modal attention framework to fuse multi-modal features both spatially and temporally. Our model outperforms HACA on all metrics except for METEOR, which is lower by 2%. TAMoE takes advantage of an external corpus and composes several experts based on external knowledge to generate captions for videos. Our model achieves the state-of-the-art results on BLEU-4 and ROUGE-L and has the best result by the normalized average of the four metrics (overall score (16)).
Our model achieves new state-of-the-art results on both the MSVD dataset and the MSR-VTT dataset, which demonstrates the superiority of our method. Note that our model is trained on a single dataset only, without an attention mechanism, and it is tested without ensembling or beam search.
In this section, we will discuss the utility of the three improvements on our model.
Semantic features are the output of a multi-label classification task. Mean average precision (mAP) is often used to evaluate the results of a multi-label classification task (Tsoumakas and Katakis, 2007), and we apply it to evaluate the quality of semantic features. Table 3 and Table 4 list the performance of our model trained by scheduled multinomial sampling with different semantic features on MSVD and MSR-VTT respectively. We can clearly infer from them that a better multi-label classification result leads to a better video captioning model. Semantic features with higher mAP provide clearer potential attributes of a video for the model. Thus, the model is able to generate better video annotations by considering semantic features, spatio-temporal features and context information comprehensively.
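For reference, mean average precision over a set of score/label rows can be computed as in the sketch below. Whether the average is taken per tag or per video is not stated in this text, so the code simply averages over whatever rows it is given; the scores and labels are made-up examples.

```python
def average_precision(scores, labels):
    """Average of precision values at the rank of each positive label."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(score_rows, label_rows):
    """Mean of the average precision over all rows (e.g., one row per tag)."""
    aps = [average_precision(s, l) for s, l in zip(score_rows, label_rows)]
    return sum(aps) / len(aps)

score_rows = [[0.9, 0.8, 0.1], [0.9, 0.1]]   # hypothetical predicted probabilities
label_rows = [[1, 0, 1], [1, 0]]             # hypothetical ground truth
m_ap = mean_average_precision(score_rows, label_rows)
```

A row scores 1.0 only when every positive item is ranked above every negative one, so a higher mAP means the predicted tag probabilities order the true tags more reliably.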
Table 5 and Table 6 show the comparison among the Teacher Forcing algorithm, scheduled sampling with the argmax strategy and scheduled sampling with the multinomial strategy on MSVD and MSR-VTT respectively. Teacher Forcing utilizes human annotations to guide the generation of words during training and samples from the output word distribution of the model to direct the generation during inference. Argmax gradually transfers from teacher forcing to sampling the words with the largest probability from the model itself during training. Multinomial is close to argmax but samples words at random from the distribution of the model at each step. As we can see from Table 5 and Table 6, scheduled sampling with the multinomial strategy performs better than the teacher forcing method and scheduled sampling with the argmax strategy, especially on MSR-VTT. Our method explores the RNN state space in a larger scope and is thus likely to find a lower local minimum during training.
As demonstrated by Table 7, the average length of human annotations is larger than all the models with (13) respectively. But Table 8 displays the tendency of redundancy in captions generated by model, which deteriorates the overall quality of model-generated sentences. The average caption length of the model with is greater than the model with while smaller than the model with . The model with generates relatively long annotations for videos without suffering from redundancy or duplication of words.
We make three improvements on the video captioning task. Firstly, our SDN extracts high-quality semantic features for videos, which contributes to the success of our semantics-assisted model. Secondly, we apply a scheduled sampling training strategy. Finally, a sentence-length-modulated loss function is proposed to keep our model balanced between language redundancy and conciseness. Our method achieves results that are superior to the previous state-of-the-art results on the MSVD dataset, and the performance of our model is comparable to the state-of-the-art models on the MSR-VTT dataset. In the future, we may obtain further improvement in video captioning by integrating a spatio-temporal attention mechanism with visual-semantic features.
Author Ke Lin is employed by company Samsung. All other authors declare no competing interests.
HC designed and performed the experiments. HC, JL and XH analyzed the experimental results and wrote this article. KL and AM analyzed data and polished the article.
This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700904, in part by the National Natural Science Foundation of China under Grant 61621136008, in part by the German Research Council (DFG) under Grant TRR-169, and in part by Samsung under contract NO. 20183000089.
The authors thank Han Liu, Hallbjorn Thor Gudmunsson and Jing Wen for valuable and insightful discussions.
The Youtube2Text dataset analyzed for this study can be found in Collecting Multilingual Parallel Video Descriptions Using Mechanical Turk. The MSR-VTT dataset analyzed for this study can be found in The 1st Video to Language Challenge.
2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. 6077–6086. doi:10.1109/CVPR.2018.00636