An End-to-End Baseline for Video Captioning

04/04/2019 ∙ by Silvio Olivastri, et al. ∙ Oxford Brookes University

Building correspondences across different modalities, such as video and language, has recently become critical in many visual recognition applications, such as video captioning. Inspired by machine translation, recent models tackle this task using an encoder-decoder strategy. The (video) encoder is traditionally a Convolutional Neural Network (CNN), while the decoding (for language generation) is done using a Recurrent Neural Network (RNN). Current state-of-the-art methods, however, train encoder and decoder separately. CNNs are pretrained on object and/or action recognition tasks and used to encode video-level features. The decoder is then optimised on such static features to generate the video's description. This disjoint setup is arguably sub-optimal for input (video) to output (description) mapping. In this work, we propose to optimise both encoder and decoder simultaneously in an end-to-end fashion. In a two-stage training setting, we first initialise our architecture using pre-trained encoders and decoders -- then, the entire network is trained end-to-end in a fine-tuning stage to learn the most relevant features for video caption generation. In our experiments, we use GoogLeNet and Inception-ResNet-v2 as encoders and an original Soft-Attention (SA-) LSTM as a decoder. Analogously to gains observed in other computer vision problems, we show that end-to-end training significantly improves over the traditional, disjoint training process. We evaluate our End-to-End (EtENet) Networks on the Microsoft Research Video Description (MSVD) and the MSR Video to Text (MSR-VTT) benchmark datasets, showing how EtENet achieves state-of-the-art performance across the board.




1 Introduction

Video captioning is the problem of generating textual descriptions based on video content. Some of its exciting applications include human-robot interaction, automated video content description and assisting the visually impaired by describing the content of movies to them. The task is particularly challenging because a description generation method must not only capture the objects, scenes, and activities present in the input video (i.e. address tasks such as video tagging, action recognition and object recognition), but also express, in a natural language sentence, how these objects, scenes, and activities relate to each other in space and time.

Figure 1: Gradient flow comparison between the disjoint training of the RNN decoder and the CNN encoder (A) and the end-to-end training of both encoder and decoder (B). Gradient flow is depicted by the blue arrow. In the first case (A), the CNN encoder does not update its parameters according to the captioning loss, but only as a function of the classification loss; only the decoder updates its parameters as a function of the gradient of the captioning loss. In the end-to-end case (B), the parameters of both CNN and RNN are updated according to the gradients of the captioning loss.

As shown in Wu et al. [30], two major approaches to video captioning exist: template-based language models and sequence learning-based models. The former class of methods detects words from the visual content (e.g. via object detection) and then generates the desired sentence using grammatical constraints, such as the presence of subject, verb, object triplets. Interesting studies in this sense were conducted in [7, 19, 34]. The latter class, instead, learns a probability distribution from the common space (set of feature vectors from video) of visual content, and then flexibly generates the sentence without using a specific pattern or language template. Examples of this second category of approaches are [27, 35, 29].

1.1 Encoder-decoder frameworks

Thanks to the recent development of useful deep learning architectures, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRU) [4], as well as of machine translation techniques such as [22], the dominant approach in video captioning is currently based on sequence learning using an encoder-decoder framework. In these models, the encoder represents an input video sequence as a fixed-dimension feature vector, which is then fed to the decoder to generate the output sentence one word at a time. At each time step in the decoder, the current input (the word) and the previously generated hidden states of the output sequence are used to predict the next word and the hidden state.

One of the most severe drawbacks of such models, however, is that the underlying feature space for the video content is static and does not change during the training process. More specifically, an encoder (typically a Convolutional Neural Network, CNN) is pre-trained on other datasets built for different tasks, and is then used as a feature extractor on the video captioning datasets. Although this makes some sense given the multi-task nature of the video captioning process illustrated above, the resulting disjoint training process, in which the decoder is trained on the captioning task with static features as input, is inherently suboptimal.
Recent state-of-the-art methods [35, 16, 37] try to address this issue by capturing dynamic temporal correspondences between feature vectors corresponding to different video frames. However, these works do not address the basic fact that video captioning may well require the system to learn task-specific features that will necessarily differ from those learned for, e.g., (vaguely) related action classification, video tagging or object detection tasks for which the CNN was trained.

1.2 Our proposal: end-to-end training

Our view is that a system designed for the video captioning task should instead be able to learn task-specific features. As shown in Figure 1 (A), in current state-of-the-art works the gradient of the captioning loss does not reach the encoder, whose parameters are therefore never updated while the captioning model is learned. In addition, the CNN implementing the encoding is trained using a different loss function aimed at solving a different task.

We propose to address this problem by bringing forward the end-to-end training of both encoder and decoder as a whole, as shown in Figure 1 (B). Our philosophy is inspired by the success of end-to-end trainable network architectures in image recognition [10], image segmentation [12], object detection [18] and image captioning [28]. In particular, the video generation TGANs-C model [15] is trained end-to-end by optimising three different kinds of losses: a video-level and a frame-level matching-aware loss to correct the label of real or synthetic videos/frames and to align video/frames with the most correct caption, respectively, and a temporal coherence loss to emphasise temporal consistency. In image captioning, Vinyals et al. [28] show how the training of the whole model achieves superior performance using just the negative log-likelihood loss. As an additional example, Faster R-CNN [18], a region proposal-based CNN architecture, achieved what was then state-of-the-art object detection accuracy at the PASCAL VOC challenges in both 2007 and 2012.

In our view it is simply imperative to extend end-to-end training to video captioning, given that such training provides simple inference [28] and can handle complex problems best described by multiple losses [15]. The reason such an approach was never explored for video captioning is, apparently, the amount of memory required to process video data for each batch. Also, batch sizes for video captioning can become very large (e.g. 512), making training prohibitive on a small number of GPUs.

In this paper we address this issue by accumulating gradients over multiple steps, to update parameters only after the required effective batch size is achieved. This approach is slower to train as compared to the separate training of the two components, because of the increase in the number of iterations required. A speed up in the training process can be obtained by initialising the weights of encoder and decoder from disjointly trained encoder-decoder models, leaving fine tuning to be conducted in a fully end-to-end fashion. Such a process is simple, and can be implemented in current deep learning platforms by just two lines of code.

1.3 Contributions

To the best of our knowledge, our work presents the first End-to-End trainable framework (EtENet) for video captioning. Our approach propagates the gradient from the last layer of the RNN Decoder to the first layer of the CNN encoder as illustrated in Figure 1 (B).
Performance-wise, training our network architecture in the traditional, disjoint way produces comparable results to the current state-of-the-art, whereas our end-to-end training framework delivers significant performance improvements, setting a new and simple benchmark/baseline for the field of video captioning.

Summarising, we present the first end-to-end trainable framework for video captioning which:

  • can learn to encode video captioning-specific features;

  • accumulates the gradients to limit GPU memory consumption, and is therefore able to handle the large batch sizes required to train RNN based decoders;

  • uses a two-stage training process to speed up training;

  • establishes a simple baseline for future works.

2 Related work

Inspired by the latest computer vision and machine translation techniques [22], recent efforts in the video captioning field follow the sequence learning approach. The common architecture, as mentioned, is an encoder-decoder framework [30] that uses either 2D or 3D CNNs to collect video features in a fixed-dimension vector, which is then fed to a Recurrent Neural Network to generate the desired output sentence, one word at a time.

The first notable work in this area was done by Venugopalan et al. [27]. They represented an entire video using a mean-pooled vector of all features generated from a CNN at frame level. The resulting fixed-length feature vector was fed to an LSTM for caption generation. Although state-of-the-art results were achieved at the time, the temporal structure of the video was not well modelled in this framework.
Since then, alternative views have been supported on how to improve the visual model in the encoder-decoder pipeline. The same authors [26] later proposed a different method which exploits two-layer LSTMs as both encoder and decoder. In this setting, compared to the original pipeline [27], each frame is used as input at each time step for the encoder LSTM, which takes care of encoding the temporal structure of the video into a fixed-size vector. This model, however, still leaves room for a better spatiotemporal feature representation of videos, as well as calling for improved links between the visual model and the language model.

To address this problem, 3D CNNs and attention models have since been introduced. Inspired by [33], Yao et al. [35] employ a temporal soft-attention model in which each output vector of the CNN encoder is weighted before contributing to each word’s prediction. The spatial component is extracted using the intermediate layers of a 3D CNN which is used in combination with the 2D CNN. On their part, the authors of [36] have proposed a spatiotemporal attention scheme which includes a paragraph generator and a sentence generator. The paragraph generator is designed to pick up the sentence’s ordering, whereas the sentence generator focuses on specific visual elements from the encoder.
A distinct line of research has been brought forward by Pan and his team in [14] and subsequently in [16]. The first work tries, in addition to using features from both 2D and 3D CNNs, to introduce a visual semantic embedding space for enforcing the relationship between the semantics of the entire sentence and the corresponding visual content. Multiple Instance Learning models have been used in [5] for detecting attributes to feed to a two-layer LSTM controlled by a transfer unit.

In the last two years, numerous relevant papers have been published. Similarly to [16], Zhang et al. [37] use a task-driven dynamic fusion across the LSTM to process the different data types. The model adaptively chooses different fusion patterns according to task status. Xu et al. [32] test on the MSR-VTT dataset a Multimodal Attention LSTM Network that fuses audio and video features before feeding the result to an LSTM multi-level attention mechanism. In [6], the authors create a Semantic Compositional Network that plugs into standard LSTM decoding the probabilities of tags extracted from the frames, in addition to the usual video features, merged in a fixed-dimension vector.
Chen et al. [3] show that it is possible to get good results using just 6-8 frames. An LSTM encoder takes a sequence of visual features while a GRU decoder helps generate the sentence. The main idea of this interesting work is to use reinforcement learning, while a CNN is used to discriminate whether a frame must be encoded or not. A recent work [29] improves the performance of [35] and the entire architecture of the model by inserting a reconstruction layer on top that aims to replicate the video features, starting from the hidden state of the LSTM cell.

Unlike previous work, in which the weights of the encoder part are not changed, our study focuses on end-to-end training. This strategy, which has proven successful in various fields of application, pushes the encoder towards better capturing the features which are actually crucial for caption generation. Our training process is divided into two stages: in the first stage only the decoder is trained, while in the second stage the whole model is fine-tuned.
In this work in particular we use Inception-ResNet-v2 as an encoder and a modified version of the Soft-Attention model defined by [35] as a decoder. Nevertheless, the approach is general and can be applied to other architectures.

3 Approach

In this section, we describe our end-to-end trainable encoder-decoder architecture for video captioning, an overview of which is shown in Figure 2. Section 3.1 explains the encoder, based on the Inception-ResNet-v2 [23] architecture. It is followed by the decoder (§ 3.2), which is based on our Soft-Attention LSTM (SA-LSTM) design inspired by [35]: its formulation is given in § 3.2.1, its improvements over [35] in § 3.2.2 and its initialisation details in § 3.2.3. Next, we explain the training process in § 3.3, which is further divided into three subsections. Firstly, we describe the gradient accumulation method (§ 3.3.1) used to achieve the large batch size desired for the training of the decoder. Secondly, we present the two-stage training used to speed up the training process (§ 3.3.2). Lastly, the loss function is described in § 3.3.3.

3.1 Encoder

A common strategy [27] is to use a 2D CNN pre-trained on the ImageNet dataset [20] as an encoder. Typically, feature vectors are generated before the first fully-connected layer of the neural network, for each frame of the video. This is done as a preprocessing step, and many versions of 2D CNN were brought forward over the course of the years for video captioning.
For instance, Venugopalan et al. [27, 26] would use variants of AlexNet [10], 16-layer VGG [21] and GoogLeNet [24]. Yao et al. [35] would also use GoogLeNet, in combination with a 3D CNN. Gan et al. [6], instead, preferred ResNet-152 [8], whereas Wang et al. [29] used Inception-v4 [23].
Based on the observation by Wang et al. [29] that, by considering a deeper network, we are more likely to capture the high-level semantic information about the video, in this work we decided to use Inception-ResNet-v2 [23] as the encoder, rather than other convolutional neural network architectures. The version used in our experiments is pre-trained on ImageNet, and is available in the repository (footnote 1: accessed on 11/06/2018).

Formally, given a video composed of a sequence of RGB images, our encoder is a function mapping each frame to a feature vector using the average pooling stage after the conv2d_7b layer of Inception-ResNet-v2. The output of the encoder is thus the collection of these per-frame feature vectors, which will later be used by the decoder.

3.2 Decoder

The work by Wang et al. [29] highlighted the performance that the Soft-Attention model developed by Yao et al. [35] can achieve: using Inception-v4 as encoder, SA-LSTM achieves good results compared to other, more complex systems (such as, for instance, [36]).
For that reason our work builds on the version of Yao et al. developed in Theano (footnote 2: accessed on 11/06/2018). With the advent of new frameworks such as TensorFlow and PyTorch, working with Theano as a base of our work would have been quite difficult. Thus, we decided to create our own version of the decoder proposed in [35], written using PyTorch.
Our SA-LSTM framework contemplates a number of variants to the original formulation, which are explained in detail below.

3.2.1 Formulation

At a high level, the SA-LSTM decoder takes as input the feature vectors generated by the encoder, together with the previous hidden state, the previous memory cell and the previous word. The output of the system is formed by: (i) the probability of the next word, based on the previously observed words and on the feature vectors; (ii) the current hidden state; and (iii) the current memory cell state.

The algorithm runs sequentially through the output sequence, predicting one word at a time.

Digging deeper into the SA-LSTM model, at each time step the first operation is to create a single vector from the encoder output by applying the Soft-Attention mechanism to the whole set of feature vectors. Firstly, for each feature vector a normalised score is computed from its unnormalised score at the current time step; all the weights and biases involved are trainable variables.
Secondly, a coefficient is computed to measure the importance of the final vector as a function of the previous hidden state, using additional trainable parameters and a sigmoid activation function. The final vector is then computed from the weighted feature vectors and this coefficient.

The output (6) of the Soft-Attention function is concatenated with the word embedding at the previous time step and fed to an LSTM, which updates the memory cell and the hidden state.

The word prediction is then computed from a concatenation of the vectors produced in the previous steps together with a word embedding; all weight matrices and bias vectors involved are trainable network parameters.
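As an illustration of the Soft-Attention step described in this subsection, the following minimal sketch normalises a set of per-frame scores with a softmax and returns the gated weighted sum of the feature vectors. It is a sketch under stated assumptions, not the implementation used in the paper: the unnormalised scores and the gate input are passed in directly (in the model they are produced by trainable layers from the previous hidden state), and the names `soft_attention`, `scores` and `gate_logit` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalise unnormalised scores so that they are positive and sum to one."""
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def soft_attention(
    features: Sequence[Sequence[float]],  # one feature vector per frame
    scores: Sequence[float],              # assumed given: one unnormalised score per frame
    gate_logit: float,                    # assumed given: input of the sigmoid coefficient
) -> Tuple[List[float], List[float]]:
    """Return the attention weights and the gated weighted sum of the features."""
    weights = softmax(scores)
    gate = sigmoid(gate_logit)
    dim = len(features[0])
    context = [
        gate * sum(w * f[d] for w, f in zip(weights, features))
        for d in range(dim)
    ]
    return weights, context


# Three toy 2-D "frame features" with equal scores: every frame gets weight 1/3.
weights, context = soft_attention(
    features=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    scores=[0.0, 0.0, 0.0],
    gate_logit=0.0,  # sigmoid(0) = 0.5
)
```

With equal scores the mechanism reduces to gated mean pooling; unequal scores shift the weight towards the frames that are most relevant for the word being predicted.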

3.2.2 Model improvements

We need to stress a number of differences with the original framework as formulated in [35]: (i) the word prediction step (Equation (14)) omits a weighting used in the original; (ii) the LSTM machinery (Equations (7) to (13)) is also different; and (iii) an additional term was added in Equation (5). These changes are inspired by the original code repository by Yao et al. After publication of [35], the authors released new tests including the above suggestions for model improvement.

3.2.3 Initialization details

Our framework takes blocks of 16 RGB frames as input. Each frame is processed by our version of Inception-ResNet-v2 [23] up to the average pooling stage after the conv2d_7b layer. Thus, the encoder output is composed of 16 vectors of 1536 elements each. As explained, the decoder takes as input the encoder output, the previously observed words, and the hidden and memory cell states. The first word of every prediction is the token <SOS>, while the hidden state and the memory cell are each initialised, through trainable variables, from the mean of all the vectors in the encoder output.
The desired video caption is predicted word by word until <EOS> is produced or a maximum caption length is reached (set to 30 for MSVD and to 20 for MSR-VTT). The encoder output is the same throughout each iteration. We use 512 as the dimension of the LSTM hidden layer and 486 as the word embedding dimension, while the cardinality of the word probability vector depends on the size of the vocabulary being considered (12,000 for MSVD, 200,000 for MSR-VTT).

Figure 2: Description of our architecture. All frames are processed one by one using Inception-ResNet-v2 (IRv2). The resulting per-frame vectors are then concatenated to represent their collection. At each time step a single word is predicted by SA-LSTM, based on the concatenated vector and the previous word.

3.3 Training Process

3.3.1 Accumulate to Optimize

Recurrent networks require a large batch size (e.g. 64 in [35]) to converge to good local minima. As our SA-LSTM is based on a recurrent network architecture (LSTM), it also requires a large batch size for training.

In our initial tests, when using a disjoint training setup similar to that of [35], we noticed that increasing the batch size would indeed boost performance. Unfortunately, Inception-ResNet-v2 is very expensive in terms of memory requirements, hence large batch sizes are difficult to implement. A single batch, for instance, would use 5 GB (gigabytes) of GPU memory. The machine our tests were conducted on is equipped with 4 Nvidia P100 GPUs with 16 GB of memory each, allowing a maximum batch size of 12.

To overcome this problem, our training strategy is centred on accumulating gradients until the neural network has processed 512 examples. After that, the accumulated gradients are used to update the parameters of both encoder and decoder. The pseudocode for this process is provided in Algorithm 1. The standard training process is modified into one that accumulates gradients over a fixed number of consecutive steps. As a result, the approach achieves an effective batch size equal to the per-step batch size multiplied by the number of accumulation steps.

Set the number of accumulation steps and a step counter equal to zero
Reset gradient to zero
for each batch of Examples in Training set do
   Model forward step using Examples
   Compute loss
   Normalise loss using the number of accumulation steps
   Backward step and accumulate gradients
   Increment the step counter
   if the step counter is a multiple of the number of accumulation steps then
      Update model with accumulated gradient
      Reset gradient to zero
   end if
end for
Algorithm 1 Training with accumulated gradient
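The accumulation scheme of Algorithm 1 can be sketched on a toy problem. The example below is a minimal, framework-free illustration (a one-parameter linear model with a mean squared error loss, which is an assumption made here for brevity and not the model of the paper); the function and parameter names are hypothetical. It shows that accumulating normalised gradients over several small batches gives the same update as a single large batch.

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (input x, target y) for a toy model y = w * x


def batch_gradient(w: float, batch: Sequence[Example]) -> float:
    """Gradient with respect to w of the mean squared error over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train_with_accumulation(
    data: Sequence[Example],
    micro_batch_size: int,
    accumulation_steps: int,
    learning_rate: float,
    w: float = 0.0,
) -> Tuple[float, List[float]]:
    """One pass over `data`, updating w only every `accumulation_steps` batches."""
    accumulated = 0.0          # reset gradient to zero
    applied: List[float] = []  # gradients actually used for parameter updates
    step = 0
    for start in range(0, len(data), micro_batch_size):
        micro_batch = data[start:start + micro_batch_size]
        # Normalising by the number of accumulation steps makes the sum equal
        # to the mean gradient over the effective batch.
        accumulated += batch_gradient(w, micro_batch) / accumulation_steps
        step += 1
        if step % accumulation_steps == 0:
            w -= learning_rate * accumulated  # update with accumulated gradient
            applied.append(accumulated)
            accumulated = 0.0                 # reset gradient to zero
    return w, applied


data = [(float(i), 3.0 * i) for i in range(1, 9)]  # 8 examples of y = 3x
w_accumulated, gradients = train_with_accumulation(
    data, micro_batch_size=2, accumulation_steps=4, learning_rate=0.01
)
full_batch_gradient = batch_gradient(0.0, data)  # one batch of all 8 examples
```

Here four batches of two examples produce a single update whose gradient matches `full_batch_gradient`, i.e. an effective batch size of eight, at the memory cost of a batch of two.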

3.3.2 Two-stage Training

Stochastic optimisers require many parameter update iterations to identify a good local minimum. Hence, our gradient accumulation strategy would be quite slow if naively implemented, as opposed to disjoint training, in which GPU memory requirements are much lower. To strike a balance between a closer-to-optimal but slower end-to-end training setup and a faster but less optimal disjoint training framework, we adopt a two-stage training process.

In the first stage, we freeze the weights of the pre-trained encoder and start training the decoder. As the encoder’s weights are kept constant, this is equivalent to training a decoder on pre-computed features of the encoder. As a result, memory requirements are low, and the process is fast. Once the decoder reaches a reasonable performance on the validation set, the second stage of the training process starts.

In the second stage, the whole network is trained end-to-end while freezing the batch normalisation layers of Inception-ResNet-v2. In both phases, SA-LSTM uses the real target outputs (i.e., the target words) as next input, rather than its own previous prediction.

Given the heterogeneity of the architecture, we use Adam [9] as the optimisation algorithm, with different learning rates and weight decays for encoder and decoder. For the former, the values are inspired by [24] and take into account that the batch size is 512 and each example has 16 frames. To avoid the vanishing and exploding gradient problems, typical of RNNs, we force the gradient to belong to a fixed range.

At test time a beam search [13] with size 5 is used for final caption generation, as in [29]. We observed that using a larger beam size does not further improve the performance.
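To make the test-time procedure concrete, here is a minimal sketch of beam search over a toy next-word model. It is not the decoder of the paper: `toy_model` and its probability table are invented for illustration, scores are plain sums of log-probabilities without length normalisation (an assumption), and the helper names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[float, List[str]]  # (cumulative log-probability, words so far)


def beam_search(
    next_word_probs: Callable[[List[str]], Dict[str, float]],
    beam_size: int = 5,
    max_length: int = 30,
    sos: str = "<SOS>",
    eos: str = "<EOS>",
) -> List[str]:
    """Keep the `beam_size` most probable partial captions at every step."""
    beams: List[Hypothesis] = [(0.0, [sos])]
    finished: List[Hypothesis] = []
    for _ in range(max_length):
        candidates: List[Hypothesis] = []
        for score, words in beams:
            for word, prob in next_word_probs(words).items():
                if prob > 0.0:
                    candidates.append((score + math.log(prob), words + [word]))
        if not candidates:
            break
        candidates.sort(key=lambda h: h[0], reverse=True)
        beams = []
        for hypothesis in candidates[:beam_size]:
            # Captions ending with <EOS> are complete and leave the beam.
            (finished if hypothesis[1][-1] == eos else beams).append(hypothesis)
        if not beams:
            break
    best = max(finished + beams, key=lambda h: h[0])
    return [w for w in best[1] if w not in (sos, eos)]


# Hypothetical toy model: the next-word distribution depends only on the last word.
TABLE: Dict[str, Dict[str, float]] = {
    "<SOS>": {"a": 0.6, "the": 0.4},
    "a": {"man": 0.5, "dog": 0.5},
    "the": {"man": 0.9, "dog": 0.1},
    "man": {"<EOS>": 1.0},
    "dog": {"<EOS>": 1.0},
}


def toy_model(words: List[str]) -> Dict[str, float]:
    return TABLE[words[-1]]


caption_beam = beam_search(toy_model, beam_size=5)    # explores both first words
caption_greedy = beam_search(toy_model, beam_size=1)  # equivalent to greedy decoding
```

In this toy example the greedy choice commits to the locally most probable first word and ends with a caption of probability 0.30, whereas a beam of size 5 recovers the caption with probability 0.36.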

3.3.3 Loss Function

Similarly to [35, 33], we adopted as the overall loss of the network the sum of the Negative Log-Likelihood loss and the normalised Doubly Stochastic Attention loss. The loss depends on the caption length, the number of examples, and the size of the encoder output. The Doubly Stochastic Attention component of the loss can be seen as encouraging the model to pay equal attention to every frame over the course of the caption’s generation process.
Following [35], the hyperparameter of this loss is set to a fixed value.
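As a rough illustration of how the two components combine, the sketch below computes a negative log-likelihood term plus a doubly stochastic attention penalty for a single caption. The penalty follows the unnormalised form commonly associated with [33], where each frame's attention weights summed over time are pushed towards one; the normalisation used in this work is not reproduced here, and `penalty_weight` and the function name are hypothetical.

```python
import math
from typing import Sequence


def captioning_loss(
    word_probs: Sequence[float],            # probability given to the correct word at each step
    attention: Sequence[Sequence[float]],   # attention[t][i]: weight of frame i at step t
    penalty_weight: float,                  # hypothetical weight of the attention term
) -> float:
    """Negative log-likelihood plus a doubly stochastic attention penalty."""
    nll = -sum(math.log(p) for p in word_probs)
    num_frames = len(attention[0])
    penalty = 0.0
    for i in range(num_frames):
        total_attention = sum(step[i] for step in attention)  # summed over time
        penalty += (1.0 - total_attention) ** 2
    return nll + penalty_weight * penalty


# Attention spread evenly over two frames for two steps: the penalty vanishes.
balanced = captioning_loss([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], penalty_weight=1.0)
# Attention always on the first frame: both frames contribute to the penalty.
peaked = captioning_loss([0.5, 0.5], [[1.0, 0.0], [1.0, 0.0]], penalty_weight=1.0)
```

The second call is penalised because one frame receives all the attention at every step while the other is ignored, which is exactly the behaviour the regulariser discourages.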

4 Evaluation

In this section, we describe the metrics (§ 4.1) and datasets (§ 4.2) used to evaluate our model, and the preprocessing steps applied to the input data (§ 4.3).

4.1 Metrics

While BLEU and METEOR were created for machine translation tasks, ROUGE-L’s aim is to compare a machine-generated summary with the human-generated sentence. CIDEr is the only metric created for evaluating image descriptions that uses human consensus.

BLEU. The BiLingual Evaluation Understudy, proposed by Papineni et al. [17] in 2002, is a method for comparing two strings. It is instrumental in assessing performance in machine translation tasks. The metric computes the n-gram word precision between a predicted sentence and one or more reference sentences. In the unigram case, it counts the number of candidate sentence words which occur in any reference sentence and then divides by the total number of words in the candidate sentence. For our evaluation, we use BLEU-4, which computes the four scores for 1-grams, 2-grams, 3-grams and 4-grams and combines them using the geometric mean.
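The computation just described can be sketched for a single candidate sentence as follows. This is a simplified, sentence-level illustration rather than the evaluation code used for the experiments: it uses clipped n-gram counts and the brevity penalty of the original BLEU definition, with no smoothing, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a tokenised sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: Sequence[str],
                       references: Sequence[Sequence[str]], n: int) -> float:
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    many times as it appears in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref: Counter = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate: Sequence[str],
         references: Sequence[Sequence[str]], max_n: int = 4) -> float:
    """Geometric mean of the 1..max_n precisions times the brevity penalty."""
    precisions: List[float] = [
        modified_precision(candidate, references, n) for n in range(1, max_n + 1)
    ]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Reference length closest to the candidate length (shortest on ties).
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    if len(candidate) > ref_len:
        brevity = 1.0
    else:
        brevity = math.exp(1.0 - ref_len / len(candidate))
    return brevity * geometric_mean


reference = "a man is playing a guitar".split()
perfect = bleu(reference, [reference])
repeated = modified_precision("the the the the the the the".split(),
                              ["the cat is on the mat".split()], 1)
```

The second example shows why counts are clipped: a candidate made of seven copies of "the" only gets credit for the two occurrences found in the reference.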

METEOR. In [1], Banerjee and Lavie exposed the main weaknesses of BLEU: it does not directly take recall into account and can count incorrect matches in the presence of common function words. In response, they created a metric called METEOR (Metric for Evaluation of Translation with Explicit ORdering), which computes the F-mean (combining precision and recall via a harmonic mean) weighted by a penalty factor that caters for unigram matching. Moreover, METEOR includes a vocabulary of synonyms that helps match two synonymous unigrams.

ROUGE-L. The Recall-Oriented Understudy for Gisting Evaluation measure was created by Lin [11], and focusses on automatically determining the quality of a summary. The metric counts the number of overlapping units such as n-gram, word sequences, and word pairs, between the computer-generated summary to be evaluated and the ideal summaries created by humans. The version commonly used to compare captioning is ROUGE-L, which applies the F-mean score to the longest common subsequence of unigrams between two sentences.
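A minimal sketch of ROUGE-L for one candidate and one reference is given below. It assumes pre-tokenised sentences and exposes the recall weight `beta` as a parameter whose default of 1.0 is an arbitrary choice made here, not a value taken from the paper; the function names are hypothetical.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate: Sequence[str], reference: Sequence[str],
            beta: float = 1.0) -> float:
    """F-score of LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1.0 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


score = rouge_l("a man plays guitar".split(), "a man is playing a guitar".split())
```

In this example the longest common subsequence is "a man guitar", so precision is 3/4, recall is 3/6 and the balanced F-score is 0.6.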

CIDEr. Vedantam et al. [25] published the first metric focused on evaluating image descriptions. The similarity is measured by cosine similarity. First, all the words are mapped to their stem or root forms. Then, each sentence is represented as a set of n-grams (of one to four words) weighted using TF-IDF. Finally, the cosine similarity is computed between the candidate sentence and the reference sentences for each n-gram length. Similarly to BLEU, the final score is calculated by combining the scores using uniform weights. The metric takes values in a bounded range.


4.2 Datasets

We evaluate our model and compare it with our competitors on two standard video captioning benchmarks. MSVD is one of the first video captioning datasets to include multi-category videos. MSR-VTT, on the other hand, is based on 20 categories and is of much larger scale than MSVD.

MSVD. The most popular dataset for video captioning systems evaluation is, arguably, the Microsoft Video Description Corpus (MSVD), also known as YoutubeClips [2]. The dataset contains 1970 videos, each video depicting a single activity lasting about 6 to 25 seconds. The frame rate is in most cases 30 frames per second. Each video is associated with multiple descriptions in the English language, collected via Amazon Mechanical Turk, for a total of 70,028 natural language captions. As done in previous works [27], in our experiments we split the dataset into three parts: the first 1,200 videos for training, the following 100 videos for validation, and the remaining 670 videos for testing.

MSR-VTT. The MSR Video to Text [31] dataset is a recent large-scale benchmark for video captioning. 10K video clips from 20 categories (e.g., music, people, gaming, sports, and TV shows) were collected from a commercial video search engine. Each of these videos was annotated with 20 sentences produced by 1327 Amazon Mechanical Turk workers, for a total number of captions of around 200K.
As prescribed in the original paper, we split videos by index number: 6,513 for training, 497 for validation and 2,990 for testing. The number of unique words present in the captions is close to 30K.

4.3 Preprocessing

Following the usual pre-processing of Inception-ResNet-v2, the height and width of each frame of the video are first resized to 314 pixels, and the central 299x299 crop of each frame is then used. The pixel values are normalised using a mean and a standard deviation of 0.5. Training, validation and test examples are subject to the same frame preprocessing steps. To save memory space, the images are saved as uint8, while pixel normalisation is performed at each time instant.
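The cropping and normalisation arithmetic can be sketched as follows. The example assumes that the frame has already been resized to 314x314, represents a single-channel frame as a list of rows for brevity, and assumes that uint8 values are scaled to [0, 1] before the mean and standard deviation of 0.5 are applied; these assumptions and the function names are illustrative rather than taken from the paper.

```python
from typing import List, Tuple


def center_crop_box(height: int, width: int, crop: int) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of the central crop x crop patch."""
    top = (height - crop) // 2
    left = (width - crop) // 2
    return top, left, top + crop, left + crop


def center_crop(frame: List[List[int]], crop: int) -> List[List[int]]:
    """Crop the central patch of a frame stored as a list of pixel rows."""
    top, left, bottom, right = center_crop_box(len(frame), len(frame[0]), crop)
    return [row[left:right] for row in frame[top:bottom]]


def normalise_pixel(value_uint8: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Scale a uint8 value to [0, 1] (assumption), then standardise it."""
    return (value_uint8 / 255.0 - mean) / std


frame = [[0] * 314 for _ in range(314)]  # a resized, all-black toy frame
patch = center_crop(frame, 299)          # stored as uint8-like integers
# Normalisation is applied on the fly, when the patch is fed to the network.
normalised_row = [normalise_pixel(p) for p in patch[0]]
```

Under these assumptions the crop starts 7 pixels from the top and left borders, and pixel values are mapped from [0, 255] to [-1, 1].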

Using all frames of a video is very time inefficient – as [3] shows, it is possible to create an efficient model using fewer frames. On the other hand, we do not apply any additional filtering to the frames, as we prefer to leave this task for the attention mechanism. In agreement with [29] and with our own findings, we decided to represent each video by 16 equally-spaced features.
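One simple way to pick 16 equally spaced frames is sketched below. The exact sampling rule used in the experiments is not specified in the text, so the convention chosen here (always including the first and the last frame) and the function name are assumptions.

```python
from typing import List


def sample_frame_indices(num_frames: int, num_samples: int = 16) -> List[int]:
    """Indices of `num_samples` equally spaced frames, first and last included.

    For videos shorter than `num_samples` frames some indices are repeated.
    """
    if num_frames <= 0:
        raise ValueError("the video must contain at least one frame")
    if num_samples == 1:
        return [0]
    last = num_frames - 1
    return [(i * last) // (num_samples - 1) for i in range(num_samples)]


indices_short = sample_frame_indices(31)   # every second frame of a 31-frame clip
indices_long = sample_frame_indices(100)   # 16 frames spread over 100
```

Only the selected frames need to be passed through the encoder, which keeps the memory cost per video fixed regardless of its duration.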

As for the captions, we tokenise them by converting all words to lowercase and applying the TreebankWordTokenizer class from the Natural Language Toolkit (footnote 3: accessed on 11/06/2018) to split sentences into words. We also remove tokens that contain punctuation characters only. TreebankWordTokenizer uses regular expressions to tokenise text as in the Penn Treebank (footnote 4: accessed on 11/06/2018), thus adhering to English grammar while keeping punctuation within tokens.

Both datasets (§ 4.2) were preprocessed in the same way.

5 Experiments

We conclude by discussing the experimental results generated by evaluating our end-to-end trainable framework (EtENet) on the datasets described in (§ 4.2).

5.1 State-of-the-art Comparison

Quantitatively, Tables 1 and 2 clearly show how our approach, both when using only step 1 of the training and when applying both steps, matches or outperforms all the work done previously, except when measured using the BLEU metric. In fact, as explained by Banerjee et al. [1], BLEU is a metric that has many weaknesses, e.g., the lack of explicit word-matching between translation and reference. In contrast, according to [25], CIDEr was specifically designed to evaluate automatic caption generation from visual sources. Hence, it makes more sense to stress the results under the CIDEr metric.
Indeed, our proposed EtENet outperforms all the existing state-of-the-art methods on both datasets (see Tables 1 and 2) when performance is measured by CIDEr.

The substantial difference between our model and the others assessed confirms that EtENet succeeds in achieving excellent results without requiring an overly complex structure, e.g., the addition of new layers as in RecNet (row 11, Table 1), or the adoption of new learning mechanisms such as reinforcement learning as in PickNet (row 3, Table 2). Moreover, this shows how it is possible to obtain excellent results even when using roughly half the frames used in the original approach [35] and in others, including [32, 37, 29].

Model                          B@4    M      C      R-L
LSTM-YT [27]                   33.3   29.1   -      -
S2VT [26]                      -      29.8   -      -
SA [35]                        41.9   29.6   51.7   -
LSTM-E [14]                    45.3   31.6   -      -
h-RNN [36]                     49.9   32.6   65.8   -
LSTM-TSA [16]                  52.8   33.5   74.0   -
SCN-LSTM [6]                   51.1   33.5   77.7   -
MA-LSTM [32]                   52.3   33.6   70.4   -
TDDF [37]                      45.8   33.3   73.0   69.7
PickNet [3]                    52.3   33.3   76.5   69.6
RecNet [29]                    52.3   34.1   80.3   69.8
EtENet (step 1 only)           49.1   33.6   83.5   69.5
EtENet (end-to-end)            50.0   34.3   86.6   70.2
Table 1: Comparison between our architecture (EtENet) and the state-of-the-art models on the MSVD dataset, using the following metrics: BLEU-4 (B@4), METEOR (M), CIDEr (C) and ROUGE-L (R-L).

5.2 One vs Two Steps

Another consideration is that, already after training step 1 (i.e., with a frozen encoder; second-to-last row in both Tables 1 and 2), our network matches the other state-of-the-art models while outperforming the original Soft-Attention (SA) work [35] (third row of Table 1). End-to-end training (last row in Tables 1 and 2) further improves performance across the board (i.e., across datasets and metrics), thanks to the additional fine-tuning.

Model B@4 M C R-L
MA-LSTM [32] 36.5 26.5 41.0 59.8
TDDF [37] 37.3 27.8 43.8 59.2
PickNet [3] 41.3 27.7 44.1 59.8
RecNet [29] 39.1 26.6 42.7 59.3
EtENet (step 1) 40.3 27.5 46.8 60.4
EtENet (steps 1+2) 40.5 27.7 47.6 60.6
Table 2: Comparison between our architecture and the state-of-the-art on the MSR-VTT benchmark, using the following metrics: BLEU-4, METEOR, CIDEr and ROUGE-L.
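The gains discussed in this section can be recomputed directly from the two tables. The sketch below is only an illustration: the scores are copied from Tables 1 and 2, while the labels `step1` and `step2` (for the second-to-last and last EtENet rows) and the helper names `best_prior` and `summarise` are our own assumptions, not part of the original work.

```python
# Scores copied from Tables 1 and 2, in the order B@4, M, C, R-L.
# None marks an entry that the tables leave blank ("-").
METRICS = ("B@4", "M", "C", "R-L")

MSVD_PRIOR = {
    "LSTM-YT": (33.3, 29.1, None, None),
    "S2VT": (None, 29.8, None, None),
    "SA": (41.9, 29.6, 51.7, None),
    "LSTM-E": (45.3, 31.6, None, None),
    "h-RNN": (49.9, 32.6, 65.8, None),
    "LSTM-TSA": (52.8, 33.5, 74.0, None),
    "SCN-LSTM": (51.1, 33.5, 77.7, None),
    "MA-LSTM": (52.3, 33.6, 70.4, None),
    "TDDF": (45.8, 33.3, 73.0, 69.7),
    "PickNet": (52.3, 33.3, 76.5, 69.6),
    "RecNet": (52.3, 34.1, 80.3, 69.8),
}
# Hypothetical labels: "step1" = frozen encoder, "step2" = end-to-end fine-tuning.
MSVD_ETENET = {"step1": (49.1, 33.6, 83.5, 69.5), "step2": (50.0, 34.3, 86.6, 70.2)}

MSRVTT_PRIOR = {
    "MA-LSTM": (36.5, 26.5, 41.0, 59.8),
    "TDDF": (37.3, 27.8, 43.8, 59.2),
    "PickNet": (41.3, 27.7, 44.1, 59.8),
    "RecNet": (39.1, 26.6, 42.7, 59.3),
}
MSRVTT_ETENET = {"step1": (40.3, 27.5, 46.8, 60.4), "step2": (40.5, 27.7, 47.6, 60.6)}


def best_prior(prior):
    """Best previously reported score per metric, skipping blank entries."""
    best = []
    for i in range(len(METRICS)):
        values = [row[i] for row in prior.values() if row[i] is not None]
        best.append(max(values))
    return tuple(best)


def summarise(prior, etenet):
    """Per metric: (gain of step 2 over step 1, margin of step 2 over best prior)."""
    best = best_prior(prior)
    summary = {}
    for i, name in enumerate(METRICS):
        step_gain = round(etenet["step2"][i] - etenet["step1"][i], 1)
        margin = round(etenet["step2"][i] - best[i], 1)
        summary[name] = (step_gain, margin)
    return summary


if __name__ == "__main__":
    datasets = (("MSVD", MSVD_PRIOR, MSVD_ETENET), ("MSR-VTT", MSRVTT_PRIOR, MSRVTT_ETENET))
    for title, prior, ours in datasets:
        print(title)
        for name, (step_gain, margin) in summarise(prior, ours).items():
            print(f"  {name}: step 2 vs step 1 {step_gain:+.1f}, vs best prior {margin:+.1f}")
```

With these numbers, the last EtENet row improves on the second-to-last one for every metric on both datasets, while the margin over the best earlier score is positive for CIDEr and negative for BLEU-4.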

5.3 Qualitative results

From a qualitative point of view, Figure 3 reports both positive and negative examples. In general, the increase in accuracy achieved by the two-step training setting leads, in some cases, to a visible improvement in the generated sentences.

Figure 3: Qualitative results produced by EtENet. In (A) and (C), which show a video from MSR-VTT and one from MSVD, respectively, it is possible to observe how end-to-end training can dramatically improve the quality of the resulting caption. In (B) a negative example from the MSR-VTT dataset is shown, for which our network cannot successfully identify the ground truth.

More extensive quantitative results are reported in the Supplementary Material.

6 Conclusions

In this paper, we proposed a simple end-to-end framework for video captioning. To cope with the large amount of memory required to process each batch of video data, we devised a gradient accumulation strategy. Our training procedure is split into two steps to speed up training while still allowing end-to-end optimisation. Our evaluation on standard benchmark datasets shows that our approach matches or outperforms the state of the art on the most commonly accepted metrics, with BLEU as the exception noted in § 5.1.
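To make the idea of gradient accumulation concrete, the following is a minimal sketch and not the implementation used in this work. It assumes a toy linear model y ≈ w·x + b in place of the encoder-decoder network, and the function names (`grad_sum`, `accumulated_step`, `full_batch_step`) are hypothetical. Gradients are summed over small micro-batches that fit in memory, and a single parameter update is applied once all of them have contributed.

```python
import random


def grad_sum(w, b, batch):
    """Gradient of the summed squared error of y ~ w*x + b over a batch."""
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    return gw, gb


def accumulated_step(w, b, data, micro_batch_size, lr):
    """One update using gradients accumulated over micro-batches.

    Only one micro-batch is processed at a time; the parameters are
    updated once, after every micro-batch has added its gradient.
    """
    acc_w = acc_b = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro = data[start:start + micro_batch_size]
        gw, gb = grad_sum(w, b, micro)
        acc_w += gw
        acc_b += gb
    n = len(data)  # normalise by the effective (full) batch size
    return w - lr * acc_w / n, b - lr * acc_b / n


def full_batch_step(w, b, data, lr):
    """Reference update that processes the whole batch at once."""
    gw, gb = grad_sum(w, b, data)
    n = len(data)
    return w - lr * gw / n, b - lr * gb / n


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(16)]
    data = [(x, 3.0 * x + 1.0) for x in xs]
    w_acc, b_acc = accumulated_step(0.0, 0.0, data, micro_batch_size=4, lr=0.1)
    w_full, b_full = full_batch_step(0.0, 0.0, data, lr=0.1)
    # The two updates agree up to floating-point rounding.
    print(abs(w_acc - w_full) < 1e-9, abs(b_acc - b_full) < 1e-9)
```

The design point is that peak memory depends on the micro-batch size, whereas the update itself corresponds to the larger effective batch.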

We believe we managed to set a new and straightforward benchmark/baseline for future work thanks to our simple end-to-end architecture, providing an opportunity to take research in the field forward starting from a more efficient training framework.
Our model is not exempt from drawbacks. Training a very deep neural network end-to-end requires significant computational resources. Our proposed two-stage training process is a step towards an efficient training procedure suited to the task. This, however, leaves room to further improve training speed in future studies.


  • [1] S. Banerjee and A. Lavie. Meteor: An automatic metric for MT evaluation with improved correlation with human judgements. ACL Workshop, pages 358–373, 2005.
  • [2] D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. ACL HLT, pages 190–200, 2011.
  • [3] Y. Chen, S. Wang, W. Zhang, and Q. Huang. Less is more: Picking informative frames for video captioning. ECCV, pages 358–373, 2018.
  • [4] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. EMNLP, 2014.
  • [5] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. CVPR, 2015.
  • [6] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks for visual captioning. CVPR, 2017.
  • [7] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. ICCV, pages 2712–2719, 2013.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016.
  • [9] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, pages 1–15, 1997.
  • [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. NIPS, 2012.
  • [11] C. Lin. Rouge: A package for automatic evaluation of summaries. ACL Workshop, 2004.
  • [12] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR, pages 3431–3440, 2015.
  • [13] P. Norvig. Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp, chapter 6, pages 195–200. Morgan Kaufmann, 1992.
  • [14] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. CVPR, pages 4594–4602, 2016.
  • [15] Y. Pan, Z. Qiu, T. Yao, H. Li, and T. Mei. To create what you tell: Generating videos from captions. ACM, 2017.
  • [16] Y. Pan, T. Yao, H. Li, and T. Mei. Video captioning with transferred semantic attributes. CVPR, 2017.
  • [17] K. Papineni, S. Roukos, T. Ward, and W. Zhu. Bleu: a method for automatic evaluation of machine translation. ACL, pages 311–318, 2002.
  • [18] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. NIPS, 2015.
  • [19] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele. Coherent multi-sentence video description with variable level of detail. GCPR, pages 184–195, 2014.
  • [20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. IJCV, 2015.
  • [21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
  • [22] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. NIPS, pages 3104–3104, 2014.
  • [23] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. AAAI, 2017.
  • [24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CVPR, 2015.
  • [25] R. Vedantam, C. L. Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. CVPR, 2015.
  • [26] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence – video to text. ICCV, pages 4534–4542, 2015.
  • [27] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural network. NAACL-HLT, pages 1494–1504, 2015.
  • [28] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. TPAMI, 2016.
  • [29] B. Wang, L. Ma, W. Zhang, and W. Liu. Reconstruction network for video captioning. CVPR, pages 7622–7631, 2018.
  • [30] Z. Wu, T. Yao, Y. Fu, and Y. Jiang. Deep learning for video classification and captioning. In S. Chang, editor, Frontiers of Multimedia Research, pages 3–29. ACM Books, 2018.
  • [31] J. Xu, T. Mei, T. Yao, and Y. Rui. Msr-vtt: A large video description dataset for bridging video and language. CVPR, pages 5288–5296, 2016.
  • [32] J. Xu, T. Yao, Y. Zhang, and T. Mei. Learning multimodal attention lstm networks for video captioning. ACM, 2017.
  • [33] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. ICML, 2015.
  • [34] R. Xu, C. Xiong, W. Chen, and J. J. Corso. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. AAAI, pages 2346–2352, 2015.
  • [35] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. ICCV, pages 4507–4515, 2015.
  • [36] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu. Video paragraph captioning using hierarchical recurrent neural networks. CVPR, pages 4584–4593, 2016.
  • [37] X. Zhang, K. Gao, Y. Zhang, D. Zhang, J. Li, and Q. Tian. Task-driven dynamic fusion: Reducing ambiguity in video description. CVPR, pages 3713–3721, 2017.