Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning

02/27/2019 ∙ by Nayyer Aafaq, et al. ∙ The University of Western Australia

Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful design of visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying Short Fourier Transform to CNN features of the whole video. It additionally derives high level semantics from an object detector to enrich the representation with spatial dynamics of the detected objects. The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish new state-of-the-art on MSVD and MSR-VTT datasets for METEOR and ROUGE_L metrics.




1 Introduction

Describing videos in natural language is trivial for humans; however, it is a very complex task for machines. To generate meaningful video captions, machines are required to understand objects, their interactions, the spatio-temporal order of events and other such minutiae in videos, and must also be able to articulate these details in grammatically correct and meaningful natural language sentences. The bicephalic nature of this problem has recently led researchers from Computer Vision and Natural Language Processing (NLP) to combine efforts in addressing its challenges [2, 3, 4, 29]. Incidentally, wide applications of video captioning in emerging technologies, e.g. procedure generation from instructional videos [1] and video indexing and retrieval [44, 54], have recently caused it to receive attention as a fundamental task in Computer Vision.

Early methods in video captioning and description, e.g. [25, 8], primarily aimed at generating the correct Subject, Verb and Object (a.k.a. SVO-Triplet) in the captions. More recent methods [49, 38] rely on Deep Learning to build frameworks resembling a typical neural machine translation system that can generate a single sentence [56, 32] or multiple sentences [37, 42, 58] to describe videos. The two-pronged problem of video captioning provides a default division for the deep learning methods to encode visual contents of videos using Convolutional Neural Networks (CNNs) [43, 47] and decode those into captions using language models. Recurrent Neural Networks (RNNs) [15, 13, 21] are the natural choice for the latter component of the problem.

Since semantically correct sentence generation has a longer history in the field of NLP, deep learning based captioning techniques mainly focus on language modelling [50, 33]. For visual encoding, these methods forward pass video frames through a pre-trained 2D CNN, or a video clip through a 3D CNN, and extract features from an inner layer of the network, referred to as the ‘extraction layer’. Features of frames/clips are commonly combined with mean pooling to compute the final representation of the whole video. Owing to the nascency of video captioning research, this and similar visual encoding techniques [32, 50, 17, 33] grossly under-exploit the prowess of visual representation for the captioning task. To the best of our knowledge, this paper presents the first work that concentrates on improving the visual encoding mechanism for the captioning task.

Figure 1: The $c$ clips and $f$ frames of a video are processed with 3D and 2D CNNs respectively. Neuron-wise Short Fourier Transform is applied hierarchically to the extraction layer activations of these networks (using the whole video). Relevant high-level action and object semantics are respectively derived using the intersection of vocabulary from the language model dictionary with the labels of 3D CNN and an Object Detector. The output features of the Object Detector are also used to embed spatial dynamics of the scene and plurality of the objects therein. The resulting codes are compressed with a fully-connected layer and used to learn a multi-layer GRU as a language model.

We propose a visual encoding technique to compute representations enriched with spatio-temporal dynamics of the scene, while also accounting for the high-level semantic attributes of the videos. Our visual code ($v$ in Fig. 1) fuses information from multiple sources. We process activations of 2D and 3D CNN extraction layers by hierarchically applying Short Fourier Transform [30] to them, where InceptionResNetv2 [45] and C3D [47] are used as the 2D and 3D CNNs respectively. The proposed neuron-wise activation transformation using whole videos results in encoding fine temporal dynamics of the scenes. We encode spatial dynamics by processing objects’ locations and their multiplicity information extracted from an Object Detector (YOLO [36]). The semantics attached to the output layers of the Object Detector and the 3D CNN are also exploited to embed high-level semantic attributes in our visual codes. We compress the visual codes and learn a language model using the resulting representation. With highly rich visual codes, a relatively simple two-layer Gated Recurrent Unit (GRU) network is proposed for language modeling, which already results in on-par or better performance compared to the existing sophisticated models [51, 53, 33, 17] on multiple evaluation metrics.

The main contributions of this paper are as follows. We propose a visual encoding technique that effectively encapsulates spatio-temporal dynamics of the videos and embeds relevant high-level semantic attributes in the visual codes for video captioning. The proposed visual features contain the detected object attributes, their frequency of occurrences as well as the evolution of their locations over time. We establish the effectiveness of the proposed encoding by learning a GRU-based language model and perform thorough experimentation on MSVD [10] and MSR-VTT [56] datasets. Our method achieves up to 2.64% and 2.44% gain over the state-of-the-art on METEOR and ROUGE metrics for these datasets.

2 Related Work

Classical methods in video captioning commonly use template based techniques in which Subject (S), Verb (V), and Object (O) are detected separately and then joined together in a sentence. However, the advancement of deep learning research has also carried over to modern video captioning methods. The latest approaches in this direction generally exploit deep learning for visual feature encoding as well as its decoding into meaningful captions.

In template based approaches, the first successful video captioning method was proposed by Kojima et al. [25]; it focuses on describing videos of one person performing one action only. Its heavy reliance on the correctness of a manually created activity concept hierarchy and state transition model prevented its extension to more complex videos. Hanckmann et al. [20] proposed a method to automatically describe events involving multiple actions (seven on average), performed by one or more individuals. Whilst most of the prior work was restricted to constrained domains [24, 8], Krishnamoorthy et al. [26] led the early works of describing open domain videos. [19] proposed semantic hierarchies to establish relationships between actor, action and objects. [39] used a CRF to model the relationship between visual entities and treated video description as a machine translation problem. However, the aforementioned approaches depend on predefined sentence templates and fill in the template by detecting entities with classical methods. These approaches are not sufficient for the syntactically rich sentence generation needed to describe open domain videos.

In contrast to the methods mentioned above, deep models directly generate sentences given a visual input. For example, LSTM-YT [50] feeds the visual contents of a video, obtained by average pooling all the frames, into an LSTM and produces the sentences. LSTM-E [32] explores the relevance between the visual context and sentence semantics. The initial visual features in this framework were obtained using 2D-CNN and 3D-CNN, whereas the final video representation was achieved by average pooling the features from frames / clips, neglecting the temporal dynamics of the video. TA [57] explored the temporal domain of video by introducing an attention mechanism to assign weights to the features of each frame and later fused them based on attention weights. S2VT [49] incorporated optical flow to cater for the temporal information of the video. SCN-LSTM [17] proposed a semantic compositional network that can detect the semantic concepts from mean pooled visual content of the video and fed that information into a language model to generate captions with more relevant words. LSTM-TSA [33] proposed a transfer unit that extracts semantic attributes from both images and mean pooled visual content of videos and added them as complementary information to the video representation to further improve the quality of caption generation. M-VC [53] proposed a multi-modal memory network to cater for long term visual-textual dependency and to guide the visual attention.

Even though the above methods have employed deep learning, they have used mean pooled visual features or attention based high level features from CNNs. These features have been used directly in the language model of their frameworks, or by introducing an additional unit in the standard framework. We argue that this technique under-utilizes the state of the art CNN features in video captioning frameworks. We propose features that are rich in visual content and empirically show that this enrichment of visual features alone, when combined with a standard and simple language model, can outperform existing state of the art methods. Visual features are part of every video captioning framework. Hence, instead of using high level or mean pooled features, building on top of our visual features can further enhance the performance of video captioning frameworks.

3 Proposed Approach

Let $V$ denote a video that has $f$ frames or $c$ clips. The fundamental task in automatic video captioning is to generate a textual sentence $S = \{W_1, W_2, \dots, W_w\}$ comprising $w$ words that closely matches human generated captions for the same video. Deep learning based video captioning methods typically define an energy loss function of the following form for this task:

$$\Xi(v, S) = -\sum_{t=1}^{w} \log \Pr(W_t \mid v, W_1, \dots, W_{t-1}),$$

where Pr(.) denotes the probability, and $v \in \mathbb{R}^d$ is a visual representation of $V$. By minimizing the cost defined as the Expected value of the energy over a large corpus of videos, it is hoped that the inferred model can automatically generate meaningful captions for unseen videos.
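As a minimal sketch of this kind of energy, the snippet below sums the negative log-probabilities that a model assigns to the ground-truth words of one caption. The function name `caption_energy` and the toy probabilities are illustrative assumptions; in practice each probability would come from the language model, conditioned on the visual representation and the preceding words.

```python
import math

def caption_energy(word_probs):
    """Negative log-likelihood of one caption.

    word_probs[t] is assumed to be the probability the model assigns to the
    t-th ground-truth word, given the visual code and the preceding words.
    """
    if any(p <= 0.0 or p > 1.0 for p in word_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in word_probs)

# Toy example: a three-word caption with made-up word probabilities.
energy = caption_energy([0.5, 0.25, 0.8])
```

A caption whose words all receive probability 1 has zero energy, and the energy grows as the model becomes less certain about the reference words.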

In this formulation, the visual representation $v$ is considered a training input, which makes the remainder of the problem a sequence learning task. Consequently, the existing methods in video captioning mainly focus on tailoring RNNs [15] or LSTMs [21] to generate better captions, assuming an effective visual encoding of the video $V$ to be available in the form of $v$. The representation prowess of CNNs has made them the default choice for visual encoding in the existing literature. However, due to the nascency of video captioning research, only primitive methods of using CNN features for $v$ can be found in the literature. These methods directly use 2D/3D CNN features or their concatenations for visual encoding, where the temporal dimension of the video is resolved by mean pooling [32, 33, 17].

We acknowledge the role of apt sequence modeling for video description; however, we also argue that designing specialized visual encoding techniques for captioning is equally important. Hence, we mainly focus on the operator $\mathcal{M}(\cdot)$ in the mapping $\mathcal{M}(V) \rightarrow v$ from a video $V$ to its visual representation $v$. We propose a visual encoding technique that, along with harnessing the power of CNN features, explicitly encodes spatio-temporal dynamics of the scene in the visual representation, and embeds semantic attributes in it to further help the sequence modelling phase of video description to generate semantically rich textual sentences.

3.1 Visual Encoding

For clarity, we describe the visual representation of a video $V$ as $v = [\alpha; \beta; \gamma; \eta]$, where $\alpha$ to $\eta$ are themselves column-vectors computed by the proposed technique. We explain these computations in the following.

3.1.1 Encoding Temporal Dynamics

In the context of video description, features extracted from pre-trained 2D-CNNs, e.g. VGG [43], and 3D-CNNs, e.g. C3D [47], have been shown to be useful for visual encoding of videos. The standard practice is to forward pass individual video frames through a 2D CNN and store activation values of a pre-selected extraction layer of the network. Mean pooling is then performed over those activations for all the frames to compute the visual representation. A similar procedure is adopted with a 3D CNN, with the difference that video clips are used in forward passes instead of frames.

A simple mean pooling operation over activation values is bound to fail in encoding fine-grained temporal dynamics of the video. This is true for both 2D and 3D CNNs, despite the fact that the latter models video clips. We address this shortcoming by defining transformations $\mathcal{T}_f$ and $\mathcal{T}_c$, such that $\alpha = \mathcal{T}_f(\{a^{2D}_1, \dots, a^{2D}_f\})$ and $\beta = \mathcal{T}_c(\{a^{3D}_1, \dots, a^{3D}_c\})$. Here, $a^{2D}_t$ and $a^{3D}_t$ denote the activation vectors of the extraction layers of 2D and 3D CNNs for the $t$-th video frame and video clip respectively. The aim of these transformations is to compute $\alpha$ and $\beta$ that encode temporal dynamics of the complete video with high fidelity.

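We use the extraction layer of InceptionResnetV2 [45] to compute the frame activations $a^{2D}$, and that of C3D [47] to get the clip activations $a^{3D}$ (see Section 4.3 for the layers used). The transformations are defined over the activations of those extraction layers. Below, we explain the frame-level transformation $\mathcal{T}_f$ in detail. The clip-level transformation $\mathcal{T}_c$ is similar, except that it uses activations of clips instead of frames.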

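Let $a^{i}_{j,t}$ denote the activation value of the $j$-th neuron of the network’s extraction layer for the $t$-th frame of the $i$-th training video. We leave out the superscript 2D for better readability. To perform the transform $\mathcal{T}_f$, we first define $a_1 = [a^{i}_{j,1}, a^{i}_{j,2}, \dots, a^{i}_{j,f}] \in \mathbb{R}^{f}$ and compute $\varsigma_1 \leftarrow \Psi(a_1)$, where the operator $\Psi(\cdot)$ computes the Short Fourier Transform [30] of the vector in its argument and stores the first $p$ coefficients of the transform. Then, we divide $a_1$ into two smaller vectors $a_{21} \in \mathbb{R}^{h}$ and $a_{22} \in \mathbb{R}^{f-h}$, where $h = \lfloor f/2 \rfloor$. We again apply the operator $\Psi$ to these vectors to compute $\varsigma_{21}$ and $\varsigma_{22}$ in p-dimensional space. We recursively perform the same operations on $a_{21}$ and $a_{22}$ to get the p-dimensional vectors $\varsigma_{311}$, $\varsigma_{312}$, $\varsigma_{321}$, and $\varsigma_{322}$. We combine all these vectors as $\varsigma(j) = [\varsigma_1, \varsigma_{21}, \varsigma_{22}, \varsigma_{311}, \varsigma_{312}, \varsigma_{321}, \varsigma_{322}] \in \mathbb{R}^{(p \times 7) \times 1}$. We also illustrate this operation in Fig. 2. The same operation is performed individually for each neuron of our extraction layer. We then concatenate $\varsigma(j)$, $j \in \{1, 2, \dots, m\}$, to form $\alpha \in \mathbb{R}^{(p \times 7 \times m) \times 1}$, where $m$ denotes the number of neurons in the extraction layer. As a result of performing $\mathcal{T}_f$, we have computed a representation of the video while accounting for fine temporal dynamics in the whole sequence of video frames. Consequently, $\mathcal{T}_f$ results in a much more informative representation than that obtained with mean pooling of the neuron activations.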

Figure 2: Illustration of hierarchical application of Short Fourier Transform to the activations of the $j$-th neuron of the extraction layer for the $i$-th video.

We define the clip-level transformation $\mathcal{T}_c$ in a similar manner for the set of video clip activations. This transformation results in $\beta \in \mathbb{R}^{(p \times 7 \times k) \times 1}$, where $k$ denotes the number of neurons in the extraction layer of the 3D CNN. It is worth mentioning that a 3D CNN is already trained on short video clips. Hence, its features account for the temporal dimension of the video $V$ to some extent. Nevertheless, accounting for the fine temporal details in the whole video adds to our encoding significantly (see Section 4.3). It is noteworthy that exploiting Fourier Transform in a hierarchical fashion to encode temporal dynamics has also been considered in human action recognition [52, 35]. However, this work is the first to apply Short Fourier Transform hierarchically for video captioning.
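The hierarchical transform of a single neuron's activation series can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a plain discrete Fourier transform, keeps the magnitudes of the first `p` coefficients, splits each segment at its midpoint, and zero-pads segments shorter than `p`. The names `first_coefficients` and `hierarchical_code` are hypothetical.

```python
import cmath

def first_coefficients(x, p):
    """Magnitudes of the first p DFT coefficients of x (zero-padded if len(x) < p)."""
    n = len(x)
    out = []
    for k in range(p):
        if k < n:
            s = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            out.append(abs(s))
        else:
            out.append(0.0)
    return out

def hierarchical_code(activations, p, levels=3):
    """Transform the full series, then its halves, then its quarters (1 + 2 + 4 segments)."""
    segments = [list(activations)]
    code = []
    for _ in range(levels):
        halves = []
        for seg in segments:
            code.extend(first_coefficients(seg, p))
            h = len(seg) // 2  # assumption: split each segment at its midpoint
            halves.extend([seg[:h], seg[h:]])
        segments = halves
    return code

# Two activation series with the same mean but a different temporal order.
code_a = hierarchical_code([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0], p=2)
code_b = hierarchical_code([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0], p=2)
```

Both series have the same mean, so mean pooling cannot tell them apart, whereas the two codes (each of length 7 × p) differ. Repeating this per neuron and concatenating the results gives the full temporal encoding.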

3.1.2 Encoding Semantics and Spatial Evolution

It is well-established that the later layers of CNNs are able to learn features at higher levels of abstraction due to hierarchical application of convolution operations in the earlier layers [27]. The common use of activations of e.g. fully-connected layers as visual features for captioning is also motivated by the fact that these representations are discriminative transformations of high-level video features. We take this concept further and argue that the output layers of CNNs can themselves serve as discriminative encodings of the highest abstraction level for video captioning. We describe the technique to effectively exploit these features in the paragraphs to follow. Here, we briefly emphasize that the output layer of a network contains additional information for video captioning beyond what is provided by the commonly used extraction layers of networks, because:

  1. The output labels are yet another transformation of the extraction layer features, resulting from network weights that are unaccounted for by the extraction layer.

  2. The semantics attached to the output layer are at the same level of abstraction that is encountered in video captions - a unique property of the output layers.

We use the output layers of an Object Detector (i.e. YOLO [36]) and a 3D CNN (i.e. C3D [47]) to extract semantics pertaining to the objects and actions recorded in videos. The core idea is to quantitatively embed object labels, their frequencies of occurrence, and evolution of their spatial locations in videos in the visual encoding vector. Moreover, we also aim to enrich our visual encoding with the semantics of actions performed in the video. The details of materializing this concept are presented below.

Objects Information:

Different from classifiers that only predict labels of input images/frames, object detectors can localize multiple objects in individual frames, thereby providing cues for ascertaining plurality of the same type of objects in individual frames and evolution of objects’ locations in multiple frames. Effective embedding of such high-level information in the visual encoding vector $v$ promises descriptions that can clearly differentiate between e.g. ‘people running’ and ‘person walking’ in a video.

The sequence modeling component of a video captioning system generates a textual sentence by selecting words from a large dictionary $\mathcal{D}$. An object detector provides a set $\tilde{\mathcal{L}}$ of object labels at its output. We first compute $\mathcal{L} = \mathcal{D} \cap \tilde{\mathcal{L}}$, and define $\gamma = [\zeta_1, \zeta_2, \dots, \zeta_{|\mathcal{L}|}]$, where $|\cdot|$ denotes the cardinality of a set. The vectors $\zeta_i$ in $\gamma$ are further defined with the help of $q$ frames sampled from the original video. We perform this sampling using a fixed time interval between the sampled frames of a given video. The samples are passed through the object detector and its output is utilized in computing $\zeta_i$. A vector $\zeta_i$ is defined as $\zeta_i = [\Pr(\ell_i), \mathrm{Fr}(\ell_i), \nu_i^{1}, \nu_i^{2}, \dots, \nu_i^{q-1}]$, where $\ell_i$ indicates the $i$-th element of $\mathcal{L}$ (i.e. an object name), Pr(.) and Fr(.) respectively compute the probability and frequency of occurrence of the object corresponding to $\ell_i$, and $\nu_i^{z}$ represents the velocity of the object between the frames $z$ and $z+1$ (in the sampled frames).

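We define each per-object vector $\zeta_i$ over $q$ frames, whereas the used object detector processes individual frames, which results in a probability and frequency value for each frame. We resolve this and related mismatches by using the following definitions of the components of $\zeta_i$:

  • Pr(.) : $\Pr(\ell_i)$ is a single value aggregated from the per-frame probabilities $\Pr_z(\ell_i)$, $z \in \{1, \dots, q\}$.

  • Fr(.) : $\mathrm{Fr}(\ell_i)$ is aggregated likewise from the per-frame object counts $\mathrm{Fr}_z(\ell_i)$, $z \in \{1, \dots, q\}$, and normalized by $N$, where $N$ is the allowed maximum number of the same class of objects detected in a frame. $N$ is kept fixed in experiments.

  • $\nu_i^{z} = [\delta_x^{z}, \delta_y^{z}]$ : $\delta_x^{z} = \tilde{x}^{z+1} - \tilde{x}^{z}$ and $\delta_y^{z} = \tilde{y}^{z+1} - \tilde{y}^{z}$. Here, $\tilde{x}$ and $\tilde{y}$ denote the Expected values of the $x$ and $y$ coordinates of the same type of objects in a given frame, such that the coordinates are also normalized by the respective frame dimensions.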

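We fix the number of sampled frames $q$ in our experiments, so that each per-object vector $\zeta_i$ has $2 + 2(q-1) = 2q$ coefficients, and these vectors compose the object encoding $\gamma \in \mathbb{R}^{(2q \times |\mathcal{L}|) \times 1}$, where $|\mathcal{L}|$ is the number of object labels shared by the detector and the dictionary. The indices of coefficients in $\gamma$ identify the object labels in videos (i.e. probable nouns to appear in the description). Unless an object is detected in the video, the coefficients of $\gamma$ corresponding to it are kept zero. The proposed embedding of high level semantics in $\gamma$ contains highly relevant information about objects in explicit form for the sequence learning module of a video description system.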
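The per-object vector described above can be sketched as follows. Several choices here are assumptions made for illustration only: the per-frame probabilities and counts are aggregated with a maximum, the velocity is set to zero when the object is missing from either of two consecutive sampled frames, and the function name `object_vector`, the detection tuple layout and the `max_count` value are hypothetical.

```python
def object_vector(frames, label, max_count):
    """Build [Pr, Fr, dx_1, dy_1, ..., dx_{q-1}, dy_{q-1}] for one object label.

    frames: list of q sampled frames, each a list of (label, prob, x, y)
    detections, with x and y already normalized by frame width and height.
    """
    probs, counts, centers = [], [], []
    for detections in frames:
        hits = [d for d in detections if d[0] == label]
        probs.append(max((d[1] for d in hits), default=0.0))
        counts.append(min(len(hits), max_count))
        if hits:
            # Expected (mean) location of all objects of this class in the frame.
            centers.append((sum(d[2] for d in hits) / len(hits),
                            sum(d[3] for d in hits) / len(hits)))
        else:
            centers.append(None)
    # Assumption: aggregate the per-frame values with a maximum.
    vec = [max(probs), max(counts) / max_count]
    for prev, cur in zip(centers, centers[1:]):
        if prev is None or cur is None:
            vec.extend([0.0, 0.0])  # assumption: no velocity without both frames
        else:
            vec.extend([cur[0] - prev[0], cur[1] - prev[1]])
    return vec

frames = [
    [("person", 0.9, 0.2, 0.5), ("person", 0.6, 0.4, 0.5)],
    [("person", 0.8, 0.5, 0.5)],
    [("dog", 0.7, 0.1, 0.1)],
]
person = object_vector(frames, "person", max_count=4)
```

With q = 3 sampled frames the vector has 2q = 6 entries, and a label that is never detected yields an all-zero vector, in line with the text.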

Actions Information:

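Videos generally record objects and their interaction. The latter is best described by the actions performed in the videos. We already use a 3D CNN that learns action descriptors for the videos. We tap into the output layer of that network to further embed high level action information in our visual encoding. To that end, we compute $\mathcal{A} = \mathcal{D} \cap \tilde{\mathcal{A}}$, where $\mathcal{D}$ is the language model dictionary and $\tilde{\mathcal{A}}$ is the set of labels at the output of the 3D CNN. Then, we define $\eta = [[\vartheta_1, \Pr(\ell_1)], [\vartheta_2, \Pr(\ell_2)], \dots, [\vartheta_{|\mathcal{A}|}, \Pr(\ell_{|\mathcal{A}|})]]$, where $\ell_i$ is the $i$-th element of $\mathcal{A}$ (an action label) and $\vartheta_i$ is a binary variable that is 1 only if the action is predicted by the network.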
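A minimal sketch of such an action encoding is given below. It intersects the network's action labels with the caption dictionary and emits, for each shared label, a binary flag followed by the class probability. Pairing the flag with the probability, treating an action as "predicted" when its probability reaches a threshold, and the name `action_vector` are all assumptions made for illustration.

```python
def action_vector(dictionary, action_probs, threshold=0.5):
    """Return the shared labels and a flat [flag, prob, flag, prob, ...] vector.

    dictionary: iterable of words known to the language model.
    action_probs: mapping from each 3D CNN action label to its probability.
    """
    labels = sorted(set(action_probs) & set(dictionary))  # sorted for a stable order
    vec = []
    for label in labels:
        p = action_probs[label]
        # Assumption: an action counts as predicted if its probability reaches the threshold.
        vec.extend([1.0 if p >= threshold else 0.0, p])
    return labels, vec

labels, eta = action_vector(
    {"running", "swimming", "the"},
    {"running": 0.7, "swimming": 0.1, "bowling": 0.2},
)
```

Labels that do not appear in the dictionary ("bowling" here) are dropped, since the language model could never emit them.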

We concatenate the above described vectors $\alpha$, $\beta$, $\gamma$ and $\eta$ to form our visual encoding vector $v \in \mathbb{R}^{d}$, where $d$ is the total length of the four vectors. Before passing this vector to the sequence modelling component of our method, we compress it using a fully connected layer, as shown in Fig. 1. Using an activation function and fixed weights, this layer projects $v$ to a 2K-dimensional space. The resulting projection $\upsilon$ is used by our language model.
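The concatenation and compression step can be sketched as below. The sketch assumes that "fixed weights" means randomly initialized weights that are never trained, and it uses tanh as the activation; both are assumptions, as are the helper name `make_fixed_projection` and the toy dimensions (the text projects to a 2K-dimensional space).

```python
import math
import random

def make_fixed_projection(in_dim, out_dim, seed=0):
    """Return a function applying a frozen fully connected layer followed by tanh."""
    rng = random.Random(seed)  # fixed seed: the same weights are produced every time
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def project(v):
        if len(v) != in_dim:
            raise ValueError("unexpected input dimension")
        return [math.tanh(sum(w * x for w, x in zip(row, v))) for row in weights]

    return project

# Toy stand-ins for the four parts of the visual encoding.
alpha, beta, gamma, eta = [0.1, 0.2, 0.3], [0.4, 0.5], [0.9, 0.5], [1.0, 0.7]
v = alpha + beta + gamma + eta           # concatenation, d = 9
project = make_fixed_projection(len(v), 4)
code = project(v)                        # compact code passed to the language model
```

Because the weights are never updated, the layer acts as a fixed dimensionality reduction rather than a learned one.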

3.2 Sequence Modelling

We follow the common pipeline of video description techniques that feeds the visual representation of a video to a sequence modelling component, see Fig. 1. Instead of resorting to a sophisticated language model, we develop a relatively simple model employing multiple layers of Gated Recurrent Units (GRUs) [13]. GRUs are known to be more robust to the vanishing gradient problem, an issue encountered in long captions, due to their ability to remember the relevant information and forget the rest over time. A GRU has two gates: reset $r_t$ and update $z_t$, where the update gate decides how much the unit updates its previous memory and the reset gate determines how to combine the new input with the previous memory. Concretely, our language model computes the hidden state $h_t$ of a GRU as:

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z),$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r),$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h),$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,$$

where $\odot$ denotes the Hadamard product, $\sigma$ is the sigmoid activation, $W_z, W_r, W_h$ and $U_z, U_r, U_h$ are learnable weight matrices, and $b_z, b_r, b_h$ denote the respective biases. In our approach, $h_0 = \upsilon$ for a given video, where $\upsilon$ is the compressed visual code, whereas the signal $x_t$ is the word embedding vector. In Section 4.3, we report results using two layers of GRUs, and demonstrate that our language model under the proposed straightforward sequence modelling already provides highly competitive performance due to the proposed visual encoding.
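For readers who prefer code, a single step of a standard GRU cell can be sketched as follows. This is the textbook formulation with an update gate `z` and a reset gate `r`, written with plain lists; the parameter dictionary layout and the helper names are assumptions, and the paper's TensorFlow model may differ in details such as gate conventions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, U, h, b):
    """Compute W x + U h + b row by row."""
    return [sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hi for u, hi in zip(u_row, h)) + bi
            for w_row, u_row, bi in zip(W, U, b)]

def gru_step(x, h_prev, params):
    """One GRU update: returns the new hidden state."""
    z = [sigmoid(a) for a in affine(params["Wz"], x, params["Uz"], h_prev, params["bz"])]
    r = [sigmoid(a) for a in affine(params["Wr"], x, params["Ur"], h_prev, params["br"])]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]  # reset gate scales the old memory
    cand = [math.tanh(a) for a in affine(params["Wh"], x, params["Uh"], gated, params["bh"])]
    # The update gate interpolates between the old state and the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]

def zero_params(input_dim, hidden_dim):
    """All-zero parameters, useful only for checking the arithmetic."""
    params = {}
    for gate in "zrh":
        params["W" + gate] = [[0.0] * input_dim for _ in range(hidden_dim)]
        params["U" + gate] = [[0.0] * hidden_dim for _ in range(hidden_dim)]
        params["b" + gate] = [0.0] * hidden_dim
    return params

# h_prev plays the role of the initial state (the visual code), x of a word embedding.
h_next = gru_step([0.3], [1.0, -2.0], zero_params(1, 2))
```

With all-zero parameters both gates equal 0.5 and the candidate is zero, so the step simply halves the previous state, which makes the interpolation role of the update gate easy to see.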

4 Experimental Evaluation

4.1 Datasets

We evaluate our technique using two popular benchmark datasets from the existing literature in video description, namely Microsoft Video Description (MSVD) dataset [10], and MSR-Video To Text (MSR-VTT) dataset [56]. We first give details of these datasets and their processing performed in this work, before discussing the experimental results.

MSVD Dataset [10]: This dataset is composed of 1,970 YouTube open domain videos that predominantly show only a single activity each. Generally, each clip spans 10 to 25 seconds. The dataset provides multilingual human annotated sentences as captions for the videos. We experiment with the captions in English. On average, 41 ground truth captions are associated with a single video. For benchmarking, we follow the common data split of 1,200 training samples, 100 samples for validation and 670 videos for testing [57, 53, 17].

MSR-VTT Dataset [56]: This recently introduced open domain video dataset contains a wide variety of videos for the captioning task. It consists of 7,180 videos that are transformed into 10,000 clips. The clips are grouped into 20 different categories. Following the common settings [56], we divide the 10,000 clips into 6,513 samples for training, 497 samples for validation and the remaining 2,990 clips for testing. Each video is described by 20 single sentence annotations by Amazon Mechanical Turk (AMT) workers. This is one of the largest clip-sentence pair datasets available for the video captioning task, which is the main reason for choosing this dataset for benchmarking our technique.

4.2 Dataset Processing & Evaluation Metrics

We converted the captions in both datasets to lower case, and removed all punctuation. All the sentences were then tokenized. We set the vocabulary size for MSVD to 9,450 and for MSR-VTT to 23,500. We employed “fasttext” [9] word embedding vectors of dimension 300. Embedding vectors of 1,615 words for MSVD and 2,524 words for MSR-VTT were not present in the pretrained set. Instead of using randomly initialized vectors or ignoring the out of vocabulary words entirely in the training set, we generated embedding vectors for these words using character n-grams within the word, and summing the resulting vectors to produce the final vector. We performed dataset specific fine-tuning on the pretrained word embeddings.
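The caption clean-up and the n-gram based handling of out-of-vocabulary words can be sketched as follows. The boundary markers `<` and `>` and the n-gram lengths 3 to 6 follow the usual fastText convention, but whether the authors used exactly these settings is an assumption; the lookup table `ngram_vectors` and the function names are hypothetical stand-ins for a pretrained subword model.

```python
import string

def normalize_caption(text):
    """Lower-case a caption, strip ASCII punctuation and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in boundary markers."""
    padded = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

def oov_vector(word, ngram_vectors, dim):
    """Sum the vectors of all known character n-grams of an unseen word."""
    vec = [0.0] * dim
    for gram in char_ngrams(word):
        for i, value in enumerate(ngram_vectors.get(gram, ())):
            vec[i] += value
    return vec

tokens = normalize_caption("A man is Playing, a guitar!")
# Toy 2-dimensional n-gram table; real vectors would have dimension 300.
vector = oov_vector("cat", {"<ca": [1.0, 0.0], "at>": [0.5, 2.0]}, dim=2)
```

N-grams without an entry in the table contribute nothing, so a word sharing no n-gram with the pretrained set would still map to the zero vector.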

In order to compare our technique with the existing methods, we report results on the four most popular metrics: Bilingual Evaluation Understudy (BLEU) [34], Metric for Evaluation of Translation with Explicit Ordering (METEOR) [6], Consensus based Image Description Evaluation (CIDEr_D) [48] and Recall Oriented Understudy of Gisting Evaluation (ROUGE_L) [28]. We refer to the original works for the concrete definitions of these metrics. The subscript ‘D’ in CIDEr indicates the metric variant that inhibits higher values for inappropriate captions in human judgment. Similarly, the subscript ‘L’ indicates the variant of ROUGE that is based on recall-precision scores of the longest common subsequence between the prediction and the ground truth. We used the Microsoft COCO server [11] to compute our results.
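To make the longest-common-subsequence idea behind this ROUGE variant concrete, the sketch below scores one candidate caption against one reference. It is a simplified illustration and not the evaluation server's code: the default weighting `beta=1.2` and the single-reference setting are assumptions, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(candidate, reference, beta=1.2):
    """F-score combining LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

cand = "a man is playing a guitar".split()
ref = "a man plays a guitar".split()
score = rouge_l(cand, ref, beta=1.0)
```

Here the longest common subsequence is "a man a guitar", so precision is 4/6 and recall is 4/5; with `beta=1.0` the score is their harmonic mean.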

4.3 Experiments

In our experiments reported below we use InceptionResnetV2 (IRV2) [45] as the 2D CNN, whereas C3D [47] is used as the 3D CNN. The last ‘’ layer of the former, and the ‘’ layer of the latter are considered as the extraction layers. The 2D CNN is pre-trained on the popular ImageNet dataset [40], whereas the Sports 1M dataset [23] is used for the pre-training of C3D. To process videos, we re-size the frames to match the input dimensions of these networks. For the 3D CNN, we use 16-frame clips as inputs with an 8-frame overlap. YOLO [36] is used as the object detector in all our experiments. To train our language model, we add a start and an end token to the captions to deal with the dynamic length of different sentences. We set the maximum sentence length to 30 words in the case of experiments with the MSVD dataset, and to 50 for the MSR-VTT dataset. These length limits are based on the available captions in the datasets. We truncate a sentence if its length exceeds the set limit, and zero pad in the case of shorter length.
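The clip extraction and caption length handling described above can be sketched as follows. The sketch assumes that incomplete trailing clips are dropped, that the length limit includes the start and end tokens, and that the end token is kept after truncation; the token strings and function names are hypothetical.

```python
def clip_starts(num_frames, clip_len=16, overlap=8):
    """Start indices of fixed-length clips that overlap by `overlap` frames."""
    stride = clip_len - overlap
    return list(range(0, num_frames - clip_len + 1, stride))

def pad_caption(tokens, max_len, start="<start>", end="<end>", pad="<pad>"):
    """Wrap tokens in start/end markers, truncate to max_len, then pad."""
    body = list(tokens)[:max_len - 2]  # assumption: keep room for both markers
    seq = [start] + body + [end]
    return seq + [pad] * (max_len - len(seq))

starts = clip_starts(40)                      # clips cover frames 0-15, 8-23, ...
short = pad_caption(["a", "dog", "runs"], 6)
long = pad_caption(list("abcdefgh"), 5)
```

In a real pipeline the pad token would map to index zero in the vocabulary, which matches the zero padding mentioned in the text.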

We tune the hyper-parameters of our language model on the validation set. The results below use two layers of GRUs that employ 0.5 as the dropout value. We use the RMSProp algorithm with a learning rate to train the models. A batch size of 60 is used for training in our experiments. We performed training of our models for 50 epochs. We used the sparse cross entropy loss to train our model. The training is conducted using an NVIDIA Titan XP 1080 GPU. We used the TensorFlow framework for development.

4.3.1 Results on MSVD dataset

We comprehensively benchmark our method against the current state-of-the-art in video captioning. We report the results of the existing methods and our approach in Table 1. For the existing techniques, recent best performing methods are chosen and their results are directly taken from the existing literature (the same evaluation protocol is ensured). The table columns present scores for the metrics BLEU-4 (B-4), METEOR (M), CIDEr (C) and ROUGE (R).

The last seven rows of the Table report results of different variants of our method to highlight the contribution of various components of the overall technique. GRU-MP indicates that we use our two-layer GRU model, while the common ‘Mean Pooling (MP)’ strategy is adopted to resolve the temporal dimension of videos. ‘C3D’ and ‘IRV2’ in the parentheses identify the networks used to compute the visual codes. We abbreviate the joint use of C3D and IRV2 as ‘CI’. We use ‘EVE’ to denote our Enriched Visual Encoding that applies Hierarchical Fourier Transform - indicated by the subscript ‘hft’ - on the activations of the network extraction layers. The proposed final technique, that also incorporates the high-level semantic information - indicated by the subscript ‘+sem’ - is mentioned in the last row of the Table. We also follow the same notational conventions for our method in the remaining Tables.

| Model | B-4 | M | C | R |
|---|---|---|---|---|
| FGM [46] | 13.7 | 23.9 | - | - |
| S2VT [49] | - | 29.2 | - | - |
| LSTM-YT [50] | 33.3 | 29.1 | - | - |
| Temporal-Attention (TA) [57] | 41.9 | 29.6 | 51.67 | - |
| h-RNN [58] | 49.9 | 32.6 | 65.8 | - |
| MM-VDN [55] | 37.6 | 29.0 | - | - |
| HRNE [31] | 43.8 | 33.1 | - | - |
| GRU-RCN [5] | 47.9 | 31.1 | 67.8 | - |
| LSTM-E [32] | 45.3 | 31.0 | - | - |
| SCN-LSTM [17] | 51.1 | 33.5 | 77.7 | - |
| LSTM-TSA [33] | 52.8 | 33.5 | 74.0 | - |
| TDDF [59] | 45.8 | 33.3 | 73.0 | 69.7 |
| BAE [7] | 42.5 | 32.4 | 63.5 | - |
| PickNet [12] | 46.1 | 33.1 | 76.0 | 69.2 |
| aLSTMs [18] | 50.8 | 33.3 | 74.8 | - |
| M-IC [53] | 52.8 | 33.3 | - | - |
| RecNet [51] | 52.3 | 34.1 | 80.3 | 69.8 |
| GRU-MP - (C3D) | 28.8 | 27.7 | 42.6 | 61.6 |
| GRU-MP - (IRV2) | 41.4 | 32.3 | 68.2 | 67.6 |
| GRU-MP - (CI) | 41.0 | 31.3 | 61.9 | 67.6 |
| GRU-EVE_hft - (C3D) | 40.6 | 31.0 | 55.7 | 67.4 |
| GRU-EVE_hft - (IRV2) | 45.6 | 33.7 | 74.2 | 69.8 |
| GRU-EVE_hft - (CI) | 47.8 | 34.7 | 75.8 | 71.1 |
| GRU-EVE_hft+sem - (CI) | 47.9 | 35.0 | 78.1 | 71.5 |

Table 1: Benchmarking on MSVD dataset [10] in terms of BLEU-4 (B-4), METEOR (M), CIDEr (C) and ROUGE (R). See the text for the description of proposed method GRU-EVE’s variants.

Our method achieves a strong METEOR value of 35.0, which provides a 2.64% gain over the closest competitor. Similarly, the gain over the current state-of-the-art for ROUGE is 2.44%. For the other metrics, our scores remain competitive with the best performing methods. It is emphasized that our approach derives its main strength from the visual encoding part, in contrast to a sophisticated language model, which is generally the case for the existing methods. Naturally, complex language models entail a difficult and computationally expensive training process, which is not a limitation of our approach.
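The gains over the closest competitor can be checked against Table 1 with a short calculation. The sketch assumes that the gains are relative improvements over RecNet [51], the best prior entry for both METEOR and ROUGE in that table; the function name `relative_gain` is illustrative.

```python
def relative_gain(ours, best_prior):
    """Relative improvement over the best prior score, in percent."""
    return 100.0 * (ours - best_prior) / best_prior

# Scores taken from Table 1: final GRU-EVE variant versus RecNet [51].
meteor_gain = relative_gain(35.0, 34.1)
rouge_gain = relative_gain(71.5, 69.8)
```

Rounded to two decimals, these evaluate to 2.64 and 2.44 percent respectively.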

We illustrate representative qualitative results of our method in Fig. 3. We abbreviate our final approach as ‘GRU-EVE’ in the figure for brevity. The semantic detail and the accuracy of e.g. plurality, nouns and verbs are clearly visible in the captions generated by the proposed method. The figure also reports the captions for GRU-MP-(CI) and GRU-EVE-(CI) to show the difference resulting from the hierarchical Fourier transform (hft) as compared to the Mean Pooling (MP) strategy. These captions justify the noticeable gain achieved by the proposed hft over the traditional MP in Table 1. We also observe in the table that our method categorically outperforms the mean pool based methods, i.e. LSTM-YT [50], LSTM-E [32], SCN-LSTM [17], and LSTM-TSA [33] on METEOR, CIDEr and ROUGE. Under these observations, we safely recommend the proposed hierarchical Fourier transformation as the substitute for ‘mean pooling’ in video captioning.

Figure 3: Illustration of caption generated for MSVD test set: The final approach is abbreviated as GRU-EVE for brevity. A sentence from ground truth captions is shown for reference.

In Table 2, we compare the variant of our method based on a single CNN with the best performing single CNN based existing methods. The results are directly taken from [53] for the provided METEOR metric. As can be seen, our method outperforms all these methods. In Table 3, we also compare our method on METEOR with the state-of-the-art methods that necessarily use multiple visual features to obtain the best performance. A significant gain over the closest competitor is achieved by our method in this regard.

| Model | METEOR |
|---|---|
| FGM [46] | 23.90 |
| S2VT [49] | 29.2 |
| LSTM-YT [50] | 29.07 |
| TA [57] | 29.0 |
| p-RNN [58] | 31.1 |
| HRNE [31] | 33.1 |
| BGRCN [5] | 31.70 |
| MAA [16] | 31.80 |
| RMA [22] | 31.90 |
| LSTM-E [32] | 29.5 |
| M-inv3 [53] | 32.18 |
| GRU-EVE_hft - (IRV2) | 33.7 |

Table 2: Performance comparison with single 2D-CNN based methods on MSVD dataset [10].

| Model | METEOR |
|---|---|
| SA-G-3C [57] | 29.6 |
| S2VT-RGB-Flow [49] | 29.8 |
| LSTM-E-VC [32] | 31.0 |
| p-RNN-VC [58] | 32.6 |
| M-I [53] | 33.3 |
| GRU-EVE_hft+sem - (CI) | 35.0 |

Table 3: Performance comparison on MSVD dataset [10] with the methods using multiple features. The scores of existing methods are taken from [53]. V denotes VGG19, C is C3D, I denotes Inception-V3, G is GoogleNet and I denotes InceptionResNet-V2

4.3.2 Results on MSR-VTT dataset

MSR-VTT [56] is a recently released dataset. We compare the performance of our approach on this dataset with the latest published models such as Alto [41], RUC-UVA [14], TDDF [59], PickNet [12], M-VC [53] and RecNet [51]. The results are summarized in Table 4. Our method significantly improves the state-of-the-art on this dataset on the METEOR, CIDEr and ROUGE metrics, while achieving strong results on the BLEU-4 metric. These results confirm the effectiveness of the proposed enriched visual encoding for visual captioning.

Model B-4 M C R
Alto [41] 39.8 26.9 45.7 59.8
RUC-UVA [14] 38.7 26.9 45.9 58.7
TDDF [59] 37.3 27.8 43.8 59.2
PickNet [12] 38.9 27.2 42.1 59.5
M-VC [53] 38.1 26.6 - -
RecNet [51] 39.1 26.6 42.7 59.3
GRU-EVE - (IRV2) 32.9 26.4 39.2 57.2
GRU-EVE - (CI) 36.1 27.7 45.2 59.9
GRU-EVE - (CI) 38.3 28.4 48.1 60.7
Table 4: Benchmarking on MSR-VTT dataset [56] in terms of BLEU-4 (B-4), METEOR (M), CIDEr (C) and ROUGE.
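
The margins between our best variant and the strongest existing model in Table 4 can be recomputed from the reported scores. The following Python sketch does so; the scores are copied from the table, whereas the names `existing`, `ours` and `margins` are illustrative and not part of our implementation.

```python
# Scores copied from Table 4 (MSR-VTT); None marks a score that is not reported.
METRICS = ("B-4", "M", "C", "R")

existing = {
    "Alto [41]": (39.8, 26.9, 45.7, 59.8),
    "RUC-UVA [14]": (38.7, 26.9, 45.9, 58.7),
    "TDDF [59]": (37.3, 27.8, 43.8, 59.2),
    "PickNet [12]": (38.9, 27.2, 42.1, 59.5),
    "M-VC [53]": (38.1, 26.6, None, None),
    "RecNet [51]": (39.1, 26.6, 42.7, 59.3),
}
# Last GRU-EVE row of Table 4 (the best-scoring variant).
ours = (38.3, 28.4, 48.1, 60.7)


def margins(ours, existing):
    """Return {metric: (best existing model, its score, ours minus best)}."""
    result = {}
    for i, metric in enumerate(METRICS):
        # Ignore models that do not report this metric.
        reported = {m: s[i] for m, s in existing.items() if s[i] is not None}
        best_model = max(reported, key=reported.get)
        best_score = reported[best_model]
        result[metric] = (best_model, best_score, round(ours[i] - best_score, 1))
    return result


table4_margins = margins(ours, existing)
for metric, (model, score, delta) in table4_margins.items():
    print(f"{metric}: best existing = {model} ({score}), margin = {delta:+.1f}")
```

With the scores above, the margins are positive for METEOR, CIDEr and ROUGE and negative for BLEU-4, where Alto [41] retains the highest score.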

5 Discussion

We conducted a thorough empirical evaluation of the proposed method to explore its different aspects. Below we discuss and highlight a few of these aspects.

For the settings discussed in the previous section, we generally observed semantically rich captions generated by the proposed approach. In particular, these captions captured well the plurality of objects and their motions/actions. Moreover, the captions generally described the whole videos rather than partial clips. Besides the two-layer configuration, we also tested other numbers of GRU layers, and observed that increasing the number of layers deteriorated the BLEU-4 score while improving all the remaining metrics. We retained only two GRU layers in the final method, mainly for computational gains. We also tested different GRU architectures, e.g. with state sizes of 512, 1024, 2048 and 4096. Performance improved up to 2048 states, and larger state sizes brought no further improvement. Hence, a state size of 2048 was used for the results reported in the previous section.
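
To give a rough sense of the computational cost behind these choices, the Python sketch below counts the parameters of a two-layer GRU for each tested state size. It assumes a common GRU formulation in which each of the three blocks (update gate, reset gate, candidate state) has an input matrix, a recurrent matrix and a single bias vector; some libraries use two bias vectors per block. The input dimension `INPUT_SIZE` is a hypothetical value chosen for illustration, and the word embedding and output layers are ignored.

```python
def gru_layer_params(input_size: int, state_size: int) -> int:
    """Parameter count of one GRU layer with a single bias vector per block."""
    per_block = state_size * input_size + state_size * state_size + state_size
    return 3 * per_block  # update gate, reset gate, candidate state


def two_layer_gru_params(input_size: int, state_size: int) -> int:
    """First layer reads the input; second layer reads the first layer's state."""
    return (gru_layer_params(input_size, state_size)
            + gru_layer_params(state_size, state_size))


INPUT_SIZE = 512  # hypothetical input dimension, not taken from the paper

for state_size in (512, 1024, 2048, 4096):
    total = two_layer_gru_params(INPUT_SIZE, state_size)
    print(f"state size {state_size:>4}: {total / 1e6:7.2f}M parameters")
```

Because the recurrent matrices grow quadratically with the state size, moving from 2048 to 4096 states more than triples the parameter count under these assumptions.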

Whereas all the components of the proposed technique contributed to the overall final performance, the biggest revelation of our work is the use of the hierarchical Fourier Transform to capture the temporal dynamics of videos. As compared to the ‘nearly standard’ mean pooling operation performed in the existing captioning pipelines, the proposed use of the Fourier Transform promises a significant performance gain for any method. Hence, we recommend replacing the mean pooling operation with our transformation in future techniques.
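
The intuition for why mean pooling discards temporal dynamics can be shown on a toy example. The Python sketch below is not our implementation: it uses a plain discrete Fourier transform on a single hypothetical feature dimension, keeps coefficient magnitudes only, and the values of `levels` and `num_coeffs` are assumptions chosen for illustration.

```python
import cmath


def dft_magnitudes(signal, num_coeffs):
    """Magnitudes of the first `num_coeffs` DFT coefficients of a 1-D signal."""
    n = len(signal)
    mags = []
    for k in range(num_coeffs):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags.append(abs(coeff))
    return mags


def mean_pool(signal):
    """Temporal mean pooling: collapses the sequence to one number."""
    return sum(signal) / len(signal)


def hierarchical_ft(signal, levels=3, num_coeffs=2):
    """Concatenate DFT magnitudes of the whole signal, its halves, its quarters.

    `levels` and `num_coeffs` are illustrative choices, not the paper's settings.
    """
    encoding = []
    for level in range(levels):
        parts = 2 ** level
        size = len(signal) // parts
        for p in range(parts):
            segment = signal[p * size:(p + 1) * size]
            encoding.extend(dft_magnitudes(segment, num_coeffs))
    return encoding


# Activation of one hypothetical CNN feature dimension over 8 frames.
rising = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
falling = list(reversed(rising))  # same values, opposite temporal order
flicker = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

for name, sig in (("rising", rising), ("falling", falling), ("flicker", flicker)):
    enc = [round(v, 2) for v in hierarchical_ft(sig)]
    print(f"{name:8s} mean = {mean_pool(sig):.2f}  hierarchical FT = {enc}")
```

All three sequences share the same mean, which is simply the zero-frequency coefficient divided by the sequence length. The magnitudes over the whole sequence already separate `flicker` from the other two, while `rising` and `falling` are told apart only once the transform is applied to the halves, which is the role of the hierarchy.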

6 Conclusion

We presented a novel technique for visual encoding of videos to generate semantically rich captions. Besides capitalizing on the representation power of CNNs, our method explicitly accounts for the spatio-temporal dynamics of the scene, and high-level semantic concepts encountered in the video. We apply Short Fourier Transform to 2D and 3D CNN features of the videos in a hierarchical manner, and account for the high-level semantics by processing output layer features of an Object Detector and the 3D CNN. Our enriched visual representation is used to learn a relatively simple GRU based language model that performs on par with or better than the existing video description methods on the popular MSVD and MSR-VTT datasets.

Acknowledgments This research was supported by ARC Discovery Grant DP160101458 and partially by DP190102443. We also thank NVIDIA Corporation for donating the Titan XP GPU used in our experiments.


  • [1] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and S. Lacoste-Julien. Unsupervised learning from narrated instruction videos. In IEEE CVPR, 2016.
  • [2] B. Andrei, E. Georgios, H. Daniel, M. Krystian, N. Siddharth, X. Caiming, and Z. Yibiao. A Workshop on Language and Vision at CVPR 2015.
  • [3] B. Andrei, M. Tao, N. Siddharth, Z. Quanshi, S. Nishant, L. Jiebo, and S. Rahul. A Workshop on Language and Vision at CVPR 2018.
  • [4] R. Anna, T. Atousa, R. Marcus, P. Christopher, L. Hugo, C. Aaron, and S. Bernt. The Joint Video and Language Understanding Workshop at ICCV 2015.
  • [5] N. Ballas, L. Yao, C. Pal, and A. Courville. Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432, 2015.
  • [6] S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005.
  • [7] L. Baraldi, C. Grana, and R. Cucchiara. Hierarchical boundary-aware neural encoder for video captioning. In IEEE CVPR, 2017.
  • [8] A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, S. Narayanaswamy, D. Salvi, et al. Video in sentences out. arXiv preprint arXiv:1204.2742, 2012.
  • [9] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.
  • [10] D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL: Human Language Technologies-Volume 1, pages 190–200. ACL, 2011.
  • [11] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • [12] Y. Chen, S. Wang, W. Zhang, and Q. Huang. Less is more: Picking informative frames for video captioning. arXiv preprint arXiv:1803.01457, 2018.
  • [13] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • [14] J. Dong, X. Li, W. Lan, Y. Huo, and C. G. Snoek. Early embedding and late reranking for video captioning. In Proceedings of the 2016 ACM on Multimedia Conference, pages 1082–1086. ACM, 2016.
  • [15] J. L. Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
  • [16] R. Fakoor, A.-r. Mohamed, M. Mitchell, S. B. Kang, and P. Kohli. Memory-augmented attention modelling for videos. arXiv preprint arXiv:1611.02261, 2016.
  • [17] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic Compositional Networks for visual captioning. In IEEE CVPR, 2017.
  • [18] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen. Video captioning with attention-based lstm and semantic consistency. IEEE Transactions on Multimedia, 19(9):2045–2055, 2017.
  • [19] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the IEEE international conference on computer vision, pages 2712–2719, 2013.
  • [20] P. Hanckmann, K. Schutte, and G. J. Burghouts. Automated textual descriptions for a wide range of video events with 48 human actions. In ECCV, pages 372–380, 2012.
  • [21] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [22] A. K. Jain, A. Agarwalla, K. K. Agrawal, and P. Mitra. Recurrent memory addressing for describing videos. In CVPR Workshops, 2017.
  • [23] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
  • [24] M. U. G. Khan, L. Zhang, and Y. Gotoh. Human focused video description. In IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011.
  • [25] A. Kojima, T. Tamura, and K. Fukunaga. Natural language description of human activities from video images based on concept hierarchy of actions. IJCV, 50(2):171–184, 2002.
  • [26] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. Generating natural-language video descriptions using text-mined knowledge. In AAAI, volume 1, page 2, 2013.
  • [27] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
  • [28] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL-04 workshop, volume 8. Barcelona, Spain, 2004.
  • [29] M. Margaret, M. Ishan, H. Ting-Hao, and F. Frank. Story Telling Workshop and Visual Story Telling Challenge at NAACL 2018.
  • [30] A. V. Oppenheim. Discrete-time signal processing. Pearson Education India, 1999.
  • [31] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang. Hierarchical recurrent neural encoder for video representation with application to captioning. In IEEE CVPR, pages 1029–1038, 2016.
  • [32] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4594–4602, 2016.
  • [33] Y. Pan, T. Yao, H. Li, and T. Mei. Video captioning with transferred semantic attributes. In IEEE CVPR, 2017.
  • [34] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on ACL, pages 311–318, 2002.
  • [35] H. Rahmani and A. Mian. 3d action recognition from novel viewpoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1506–1515, 2016.
  • [36] J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. arXiv preprint, 2017.
  • [37] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele. Coherent multi-sentence video description with variable level of detail. In German Conference on Pattern Recognition, 2014.
  • [38] A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele. Movie description. IJCV, 123(1):94–120, 2017.
  • [39] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pages 433–440, 2013.
  • [40] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [41] R. Shetty and J. Laaksonen. Frame-and segment-level features and candidate pool evaluation for video caption generation. In Proceedings of the 2016 ACM on Multimedia Conference, pages 1073–1076. ACM, 2016.
  • [42] A. Shin, K. Ohnishi, and T. Harada. Beyond caption to narrative: Video captioning with multiple sentences. In IEEE International Conference on Image Processing (ICIP), 2016.
  • [43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [44] J. Song, L. Gao, L. Liu, X. Zhu, and N. Sebe. Quantization-based hashing: a general framework for scalable image and video retrieval. Pattern Recognition, 75:175–187, 2018.
  • [45] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
  • [46] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In Coling, volume 2, page 9, 2014.
  • [47] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
  • [48] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In IEEE CVPR, 2015.
  • [49] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision, pages 4534–4542, 2015.
  • [50] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729, 2014.
  • [51] B. Wang, L. Ma, W. Zhang, and W. Liu. Reconstruction network for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7622–7631, 2018.
  • [52] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Learning actionlet ensemble for 3d human action recognition. IEEE transactions on pattern analysis and machine intelligence, 36(5):914–927, 2014.
  • [53] J. Wang, W. Wang, Y. Huang, L. Wang, and T. Tan. M3: Multimodal memory modelling for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7512–7520, 2018.
  • [54] J. Wang, T. Zhang, N. Sebe, H. T. Shen, et al. A survey on learning to hash. IEEE transactions on pattern analysis and machine intelligence, 40(4):769–790, 2018.
  • [55] H. Xu, S. Venugopalan, V. Ramanishka, M. Rohrbach, and K. Saenko. A multi-scale multiple instance video description network. arXiv preprint arXiv:1505.05914, 2015.
  • [56] J. Xu, T. Mei, T. Yao, and Y. Rui. Msr-vtt: A large video description dataset for bridging video and language. In IEEE CVPR, 2016.
  • [57] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In Proceedings of the IEEE international conference on computer vision, pages 4507–4515, 2015.
  • [58] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu. Video paragraph captioning using hierarchical recurrent neural networks. In IEEE CVPR, 2016.
  • [59] X. Zhang, K. Gao, Y. Zhang, D. Zhang, J. Li, and Q. Tian. Task-driven dynamic fusion: Reducing ambiguity in video description. In IEEE CVPR, 2017.