In this paper, we describe the system for generating textual descriptions of short video clips using recurrent neural networks (RNN), which we used while participating in the Large Scale Movie Description Challenge 2015 in ICCV 2015. Our work builds on static image captioning systems with RNN based language models and extends this framework to videos utilizing both static image features and video-specific features. In addition, we study the usefulness of visual content classifiers as a source of additional information for caption generation. With experimental results we show that utilizing keyframe based features, dense trajectory video features and content classifier outputs together gives better performance than any one of them individually.
Automatic description of videos using sentences of natural language is a challenging task in computer vision. Recently, two large collections of video clips extracted from movies, together with natural language descriptions of their visual content, have been published for scientific purposes. A unified version of these data sets, namely M-VAD and MPII-MD, has been provided for the purpose of the Large Scale Movie Description (LSMDC) Challenge 2015. In this paper, we describe the system we used while participating in LSMDC 2015 (https://sites.google.com/site/describingmovies/).
Our work builds on the static image captioning system based on recurrent neural networks, proposed in  and implemented in the NeuralTalk system (https://github.com/karpathy/neuraltalk). We extend this framework to generating textual descriptions of short video clips, utilizing both static image features and video-specific features. We also study the use of visual content classifiers as a source of additional information for caption generation. The source code for our work is available online (https://github.com/aalto-cbir/neuraltalkTheano).
We propose to use a neural-network-based framework to generate textual captions for a given input video. Our pipeline consists of three distinct stages, as seen in Figure 1. The first stage is feature extraction, wherein we extract both whole-video and keyframe-based features from the input video. As the whole-video feature we use dense trajectories [23, 24]. We use three different CNN architectures, giving us a rich variety of features for the keyframe images. In many studies, including our own [9, 7], CNN-based features, and especially late fusion combinations of them, have been found to provide superior performance in many computer vision and image analysis tasks.
In our current experiments, the CNN-based features are either directly input to the LSTM, the Long Short-Term Memory recurrent neural network, or fed to a set of visual content classifiers which in turn produce 80-dimensional class membership vectors that can then be used as LSTM inputs. For training the 80 classifiers we have used the training set images of the COCO 2014 collection.
The third stage consists of an LSTM network, taking one feature set and possibly the visual content classification results as input and generating a sequence of words, i.e. a video caption, with the highest probability of being associated with the input features and thus the processed video clip. Next we will look at each of these processing stages in detail.
We imposed a minimum length of 2 seconds on the video clips when extracting the dense trajectory features. Videos with frames wider than 800 pixels were scaled down to a width of 720 pixels. We used the default trajectory length of 15 frames, giving rise to 28-dimensional displacement vectors. These we quantized into a 1000-dimensional histogram, whose codebook was created by k-means clustering of 1,000,000 trajectories randomly sampled from the training partition of the LSMDC 2015 video clips.
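The codebook construction and quantization steps can be sketched as follows. This is a scaled-down numpy illustration, not the actual pipeline: the paper clusters 1,000,000 sampled 28-dimensional displacement vectors into 1000 centers, while the sizes below are toy values.

```python
import numpy as np

def build_codebook(descriptors, k, iters=10, seed=0):
    """Toy k-means codebook builder (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign every descriptor to its nearest center (squared L2)
        dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def quantize_to_histogram(descriptors, centers):
    """Hard-assign each descriptor to its nearest codeword and count
    occurrences, yielding one bag-of-words histogram per video."""
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = dists.argmin(axis=1)
    return np.bincount(labels, minlength=len(centers)).astype(float)
```

One such histogram is computed per video clip from all of its trajectory displacement vectors.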
In addition, we also created 1000-dimensional histograms from the 96-dimensional HOG, MBHx and MBHy motion boundary descriptors and the 108-dimensional HOF descriptors of the dense trajectories. Concatenating all five histograms resulted in 5000-dimensional video features. In many cases, dense trajectory features of higher dimensionality have been found to perform better than feature vectors of this size, but we were concerned that training the LSTM network might not be successful with too high-dimensional inputs.
We also extract static image features from one keyframe selected from the center of each video clip. For feature extraction in the keyframes we use CNNs pre-trained on the ImageNet database for object classification. We use three different CNN architectures, namely the 16-layer and 19-layer VGG nets and GoogLeNet. In the case of the VGG nets, we extract the activations of the second fully-connected 4096-dimensional layer (fc7) for input images whose aspect ratio is distorted to a square. Ten regions, as suggested in , are extracted from each image, and average pooling of the region-wise features is used to generate the final feature.
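The ten-region extraction and pooling can be sketched as below. The crop size and the stand-in feature function are illustrative assumptions; in the actual system the features are fc7 activations of a pretrained VGG net.

```python
import numpy as np

def ten_crop_regions(img, crop):
    """Four corner crops, a center crop, and their horizontal flips:
    the ten-region scheme referred to in the text."""
    h, w = img.shape[:2]
    ys = [0, 0, h - crop, h - crop, (h - crop) // 2]
    xs = [0, w - crop, 0, w - crop, (w - crop) // 2]
    crops = [img[y:y + crop, x:x + crop] for y, x in zip(ys, xs)]
    crops += [c[:, ::-1] for c in crops]  # horizontal flips
    return crops

def pooled_feature(img, extract, crop=224):
    """Average-pool the per-region CNN features into one descriptor.
    `extract` stands in for the pretrained network's forward pass."""
    feats = np.stack([extract(c) for c in ten_crop_regions(img, crop)])
    return feats.mean(axis=0)
```

For example, `pooled_feature(img, vgg_fc7)` with a real fc7 extractor would return one 4096-dimensional vector per keyframe.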
For GoogLeNet we have similarly used the activations of the 5th Inception module, having a dimensionality of 1024. We augment these features with the reverse spatial pyramid pooling proposed in , with two scale levels. The second scale level consists of a grid with overlaps and horizontal flipping, resulting in a total of 26 regions. The activations of the regions are then pooled using average and maximum pooling. Finally, the activations of the different scales are concatenated, resulting in 2048-dimensional features. See  for more details.
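The exact pooling arrangement is not fully specified in this text; the sketch below assumes one plausible reading, in which the scale-one global activation (1024-d) is concatenated with a pooled scale-two vector formed by averaging the average- and max-pooled region activations, giving the stated 2048 dimensions.

```python
import numpy as np

def two_scale_feature(global_feat, region_feats):
    """global_feat: (1024,) scale-one GoogLeNet activation.
    region_feats: (26, 1024) activations of the scale-two regions.
    Pool the regions with both average and max pooling, combine,
    and concatenate with the global activation (an assumed reading)."""
    pooled = 0.5 * (region_feats.mean(axis=0) + region_feats.max(axis=0))
    return np.concatenate([global_feat, pooled])
```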
We have extracted the five CNN-based image features described above also from the images of the COCO 2014 training set and trained an SVM classifier for each of the 80 object categories specified in COCO 2014. In particular, we utilized linear SVMs with homogeneous kernel maps approximating the intersection kernel. Furthermore, we used two rounds of hard negative mining, sampling negative examples on each round.
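The paper's setup uses liblinear-style SVMs on homogeneous-kernel-map features; as a self-contained stand-in, the sketch below trains a tiny Pegasos-style linear SVM and demonstrates one round of hard negative mining. All names, sizes and hyperparameters are illustrative.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Tiny Pegasos-style linear SVM trainer (stochastic subgradient
    descent on the hinge loss with L2 regularization)."""
    rng = np.random.default_rng(seed)
    w, b, t = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)          # decaying learning rate
            w *= (1.0 - eta * lam)         # regularization shrinkage
            if y[i] * (X[i] @ w + b) < 1:  # margin violation
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b

def hard_negative_round(X_pos, neg_pool, w, b, n_hard=50):
    """One round of hard negative mining: score the negative pool with
    the current model and retrain on the highest-scoring negatives."""
    scores = neg_pool @ w + b
    hard = neg_pool[np.argsort(scores)[-n_hard:]]
    X = np.vstack([X_pos, hard])
    y = np.concatenate([np.ones(len(X_pos)), -np.ones(len(hard))])
    return train_linear_svm(X, y)
```

In the paper this loop is run twice per category, yielding the initial model plus two hard-negative-retrained models per feature type.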
For each LSMDC keyframe we thus have 15 SVM outputs (five features times the initial and two hard-negative training rounds) that we combine by arithmetic mean in a late fusion stage. The 80 fused values, one per object category, are then concatenated into a class membership vector for each keyframe image. These vectors we optionally use as inputs to the LSTM network.
To learn generative models of sentences conditioned on the input image, video and class membership features, we chose to use LSTM networks. This choice was based on two basic requirements the problem imposes. Firstly, the model needs to handle sentences of arbitrary length, which LSTMs do by design. Secondly, during training with gradient descent methods the error signal and its gradients need to propagate a long way back in time without vanishing or exploding, and again LSTMs satisfy this criterion.
The block diagram of a single LSTM cell is shown in Figure 2. It consists of a memory cell $c_t$, whose value at any timestep $t$ is influenced by the current input $x_t$, the previous output $h_{t-1}$ and the previous cell state $c_{t-1}$. The update of the memory value is controlled using the input gate and the forget gate, and the output using the output gate. The gates are implemented using sigmoidal non-linearities, keeping them completely differentiable. Via the input and forget gates, LSTM cells are able to preserve the content of the memory cell over long periods, making it easier to learn longer sequences. This process is formalized in the equations below:
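The equation block itself did not survive in this copy; reconstructed here is the standard LSTM formulation consistent with the surrounding description (with $\sigma$ the logistic sigmoid and $\odot$ elementwise multiplication):

```latex
\begin{align}
i_t &= \sigma\!\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right) \\
f_t &= \sigma\!\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right) \\
o_t &= \sigma\!\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\!\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right),
\qquad h_t = o_t \odot \tanh(c_t)
\end{align}
```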
We add a softmax layer at the output of the LSTM to generate a probability distribution over the vocabulary. At each time step, the LSTM is trained to assign the highest probability to the word that should appear next, given the current inputs and the hidden state.
In our simplest architecture, the visual features are fed into the LSTM only at the zeroth time step. We refer to this feature input as the init feature, since it initializes the hidden state of the LSTM. In the subsequent time steps, a start symbol followed by the word embeddings of each word in the reference caption is fed in as the input. In our experiments with the COCO dataset, we have found it helpful to let the LSTM access the visual features throughout the generation process. This requires adding a new input to the LSTM cell, which we refer to as the persistent feature. This input plays the same role as the regular input in equations (1)–(4), except with a different set of weights. Note that we can feed different visual features to the init and persistent inputs, thereby allowing the model to learn simultaneously from two different sources.
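A minimal numpy sketch of an LSTM cell with such a persistent input is given below. The class name, weight initialization and dimensions are illustrative assumptions, not the paper's actual implementation; the persistent input $v$ enters every gate alongside the regular input $x$, with its own weight matrix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class CaptionLSTMCell:
    """Minimal LSTM cell with an extra 'persistent' visual input v
    that is fed in at every step alongside the regular input x."""

    def __init__(self, x_dim, v_dim, h_dim, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda cols: 0.1 * rng.standard_normal((4 * h_dim, cols))
        # one stacked weight block (input/forget/output gates + candidate)
        self.Wx, self.Wv, self.Wh = init(x_dim), init(v_dim), init(h_dim)
        self.b = np.zeros(4 * h_dim)

    def step(self, x, v, h_prev, c_prev):
        z = self.Wx @ x + self.Wv @ v + self.Wh @ h_prev + self.b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # gates
        c = f * c_prev + i * np.tanh(g)                # memory update
        h = o * np.tanh(c)                             # gated output
        return h, c
```

At step zero, the init feature is passed as `x` to set the hidden state; on later steps `x` carries the start symbol and word embeddings while `v` stays fixed to the persistent feature.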
The training procedure for the LSTM generator is the same as in : we maximize the log probability assigned to the training samples by the model or, equivalently, minimize the negative log-likelihood:
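The expression itself is missing from this copy; a standard reconstruction, where $V$ denotes the visual input features, $w_1,\dots,w_T$ the caption words and $\theta$ the model parameters, is:

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p\!\left(w_t \mid w_{1:t-1}, V;\, \theta\right)
```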
To evaluate the various forms of our model, we used the LSMDC 2015 public test set as the benchmark. The evaluation is performed using the four standard metrics of the LSMDC evaluation server, namely METEOR, BLEU, ROUGE-L and CIDEr. Table 1 shows these four metrics computed for the different models. In addition to the metrics, we also show the perplexity of the model on the public test set and the average lengths of the generated sentences. Results are always provided for beam sizes 1 and 5 used in the caption generation stage.
In order to get a quick baseline, we used models trained earlier on the COCO dataset to generate captions for the LSMDC test set, with a simple rule-based translation applied to their output. This translation is done in order to better match the LSMDC vocabulary and is implemented using the simple rule:
Models 1–3 in Table 1 are such translated models trained on the COCO dataset. Model 1 coco-kf was trained on the COCO dataset using concatenated GoogLeNet-based features with a total dimensionality of 4096 as the init features. This approach matches the use of the NeuralTalk model described in  and . Model 2 coco-kf+cls was trained using the GoogLeNet features as the init features and the outputs of the 80 SVM classifiers as the persistent feature, while model 3 coco-cls+kf was trained with the roles of these two feature types reversed. In our earlier experiments, these models have shown increasingly better performance on the COCO dataset itself, but we can hardly observe such a progression in the translated results on the LSMDC dataset.
Next, we trained three models similar to the COCO models above, but now using the LSMDC 2015 captions, with features and content classification results calculated from the keyframes of the LSMDC videos. The results are presented as models 4–6 in Table 1. Here we can see the benefit of using persistent features, as model 6 cls+kf performs better than the models trained solely on keyframe features.
Finally, we trained three models using the dense trajectory video features and the keyframe-based SVM output features, presented in Table 1 as models 7–9. Again we see that using the higher-dimensional feature, here the dense trajectory feature, as the persistent input to the LSTM network gives the best performance among the group of models. The result of model 9 cls+traj can also be regarded as the best one obtained in our experiments and therefore we have used it in our final blind test data submission to the LSMDC 2015 Challenge.
As we can see from Table 1, the persistent dense trajectory video features combined with the init SVM classifier features from keyframes outperform all the other models in three out of the four metrics used. Comparing this with model 6 cls+kf shows that using video features as opposed to just keyframe features gives a better performance. It also indicates that combining the keyframe and video features is better than using just the video features.
A rather surprising finding is that using larger beam sizes in inference leads to poorer performance. This is slightly counterintuitive, but can be understood by looking at the lengths of the sentences produced by the two beam sizes. For example, model 9 cls+traj produces sentences with an average length of 5.33 words with beam size 1, while with beam size 5 the average length drops to just 3.79 words. This is because with larger beam sizes the model always picks the most likely sentence and heavily penalizes any word it is unsure of. This results in the model picking very generic sentences like “SOMEONE looks at SOMEONE” over more descriptive ones.
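The length effect can be illustrated with summed log-probabilities (the per-word probabilities below are made up for illustration): every additional word contributes a negative log term, so a short generic caption can outscore a longer, more descriptive one even when its individual words are only moderately more likely.

```python
import numpy as np

# Hypothetical per-word probabilities assigned by the language model.
generic = [0.6, 0.5, 0.7]                # e.g. "SOMEONE looks at SOMEONE"
descriptive = [0.4, 0.3, 0.5, 0.4, 0.3]  # a longer, more specific caption

def caption_score(word_probs):
    """Sentence score = sum of word log-probabilities; each extra word
    adds a negative term, penalizing longer captions."""
    return float(np.sum(np.log(word_probs)))
```

Here `caption_score(generic)` is about -1.56 versus about -4.93 for the descriptive caption, so a wider beam, which searches more thoroughly for the highest-scoring sequence, gravitates toward the short generic one.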
In this paper, we described the framework of techniques utilized in our participation in the LSMDC 2015 Challenge. We presented a technique which utilizes (1) video-based dense trajectory features, (2) keyframe-based visual features, and (3) object classifier output features, and an LSTM network to generate video descriptions therefrom. We discussed a couple of architectural variations and experimentally determined the best architecture among them for this dataset. We also experimentally verified the effect of the beam size used in the inference stage on the performance of the captioning system. The two conclusions we made are: (1) Using the classifier output features to initialize the LSTM network and video features after the initialization results in the best performance. (2) Beam size one in the sentence generation process is better than larger beam sizes.
This work has been funded by the Academy of Finland (Finnish Centre of Excellence in Computational Inference Research COIN, 251170) and Data to Intelligence (D2I) DIGILE SHOK project. The calculations were performed using computer resources within the Aalto University School of Science “Science-IT” project.