From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning

by   Jingkuan Song, et al.

Video captioning is in essence a complex natural process that is affected by various uncertainties stemming from video content, subjective judgment, etc. In this paper we build on recent progress in using the encoder-decoder framework for video captioning and address what we find to be a critical deficiency of existing methods: most decoders propagate deterministic hidden states. Such complex uncertainty cannot be modeled efficiently by deterministic models. We propose a generative approach, referred to as multi-modal stochastic RNN networks (MS-RNN), which models the uncertainty observed in the data using latent stochastic variables. MS-RNN can therefore improve the performance of video captioning and generate multiple sentences to describe a video by considering different random factors. Specifically, a multi-modal LSTM (M-LSTM) is first proposed to interact with both visual and textual features to capture a high-level representation. Then, a backward stochastic LSTM (S-LSTM) is proposed to support uncertainty propagation by introducing latent variables. Experimental results on the challenging MSVD and MSR-VTT datasets show that the proposed MS-RNN approach outperforms state-of-the-art video captioning methods.




I Introduction

With the explosive growth of online videos over the past decade, video captioning has become a hot research topic. In a nutshell, video captioning is the problem of translating a video into meaningful textual sentences describing its visual content. As such, solving this problem has the potential to help various applications, from video indexing and search to human-robot interaction.

Building on the pioneering work of Kojima et al. [1], a series of studies have been conducted to come up with a first generation of video captioning systems [2, 3, 4]. Recently, however, the development of these systems has increasingly relied on deep neural networks (DNNs), which have proven effective in both computer vision (e.g., image classification and object detection) and natural language understanding (e.g., machine translation and language modeling), forming the two technological pillars of video captioning solutions. In particular, deep Convolutional Neural Networks (CNNs) (e.g., VggNet [5] and ResNet [6]) have been widely deployed to extract representative visual features, while Recurrent Neural Networks (RNNs) (e.g., Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU)) have been deployed to translate sequential term vectors into natural language sentences. Despite the significant conceptual and computational complexity of these DNN-based models, their effectiveness has given rise to the so-called encoder-decoder scheme as a popular modern approach to video captioning. In this scheme, a CNN is typically used as the encoder and an RNN as the decoder. This approach has shown better performance than traditional video captioning methods with hand-crafted features.

Fig. 1: In real-life scenarios, a video can be described by different sentences because the describers have different intents, experiences and so on. However, a deterministic video captioning model predicts only the single sentence with the highest probability, which conflicts with the real scenario. By taking different hidden factors (e.g., intention and experience) into consideration, a trained model should be able to output different sentences. P1, P2 and P3 indicate three people.

Recent efforts towards developing and implementing an encoder-decoder scheme for video captioning have mainly focused on the following questions: 1) How to help an encoder-decoder framework bridge the gap between video and language more efficiently and effectively [9]? 2) How to facilitate video captioning using semantic information [10]? 3) How to deploy an attention mechanism to help decide what visual information to extract from video [11, 12]? 4) How to extract attributes/key concepts from sentences to enhance video captioning [13, 14, 15]? Numerous approaches have been proposed to address these questions [10, 16, 11, 12, 17].

However, the above-mentioned approaches are deterministic and do not incorporate uncertainties (i.e., both subjective judgment and model uncertainty) into the model calculations at any stage of the modeling. Firstly, video captioning is in essence a complex process that involves many factors, such as the video itself, description intents, personal characteristics and experiences. Except for the video content, these factors are inherently random and unpredictable. For example, in Fig. 1, we asked three people to describe two videos separately, and they provided different descriptions for each video. This indicates that video captioning is subjective and uncertain. Secondly, video captioning models are always abstractions of the natural video captioning process: they leave out some less important components and keep only the relevant and prominent ones, and thus model uncertainty arises. However, both uncertainties are ignored in previous work.

Therefore, in this paper we focus on dealing with the above uncertainties, with the aim of capturing the true nature of video captioning. We propose a novel approach, namely multi-modal stochastic RNN networks (MS-RNN), which models the uncertainty observed in the data using latent stochastic variables. Our method is inspired by the variational auto-encoder (VAE) [18], which uses a set of latent variables to capture latent information. Our work makes the following contributions: 1) We propose a novel end-to-end MS-RNN approach for video captioning. To our knowledge, this is the first approach to video captioning that takes uncertainty, both subjective judgment and model uncertainty, into consideration. Therefore, for each video, our model can generate multiple sentences that describe it from different aspects. 2) We propose a multi-modal Long Short-Term Memory (M-LSTM) layer, which fuses features from different information sources (i.e., visual and word) into a set of higher-level representations by adjusting the weights on each individual source, improving video captioning performance. 3) We develop a novel backward stochastic LSTM (S-LSTM) mechanism to model uncertainty in a latent process through latent variables. With S-LSTM, the uncertainty is expressed in the form of a probability distribution over latent variables. The uncertainty can be modeled by a prior distribution by making use of the consistency between the prior distribution and the posterior distribution. 4) The proposed model is evaluated on two challenging datasets, MSVD and MSR-VTT. The experimental results show that our method achieves superior performance in video captioning. Note that our model only utilizes the appearance features of videos, and no attention mechanism is incorporated.

II Related Work

II-A Recurrent Neural Networks

Recurrent Neural Networks (RNNs) [19, 20] connect their units so as to form a directed cycle. This mechanism allows them to process sequential data streams of arbitrary length, and RNNs have therefore been widely used in computational linguistics with great success. Taking a language model as an example, an RNN models a sequential data stream $x = (x_1, \dots, x_T)$ (e.g., a sentence) by decomposing the probability distribution over outputs:

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t}). \tag{1}$$

At each time step $t$, an RNN observes an element $x_t$ and updates its internal state, $h_t = f_\theta(h_{t-1}, x_t)$, where $f$ is a deterministic non-linear function and $\theta$ indicates a set of parameters. The probability distribution over $x_t$ is parametrized as $p(x_t \mid x_{<t}) = p(x_t \mid h_{t-1})$. The RNN Language Model (RNNLM) [21] parametrizes the output distribution by applying a softmax function to the previous hidden state $h_{t-1}$. To learn the model's parameters, RNNLM maximizes the log-likelihood with gradient-based optimization. However, most existing RNN models propagate deterministic hidden states.
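The chain-rule factorization can be illustrated with a minimal sketch. The toy `step` and `output_probs` functions and their fixed weights are hypothetical, chosen only to make the example runnable; they are not the RNNLM of [21].

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def step(h_prev, x_id, vocab_size):
    # Hypothetical deterministic update h_t = f(h_{t-1}, x_t): a toy tanh recurrence.
    return [math.tanh(0.5 * h + 0.1 * (x_id + 1) * (j + 1) / vocab_size)
            for j, h in enumerate(h_prev)]

def output_probs(h_prev, vocab_size):
    # Softmax over the vocabulary, computed from the previous hidden state.
    scores = [sum(h * math.sin(k + j + 1) for j, h in enumerate(h_prev))
              for k in range(vocab_size)]
    return softmax(scores)

def sequence_log_likelihood(token_ids, vocab_size, hidden_size=4):
    # log p(x_1..x_T) = sum_t log p(x_t | x_<t), with x_<t summarized by h_{t-1}.
    h = [0.0] * hidden_size
    log_p = 0.0
    for x_id in token_ids:
        probs = output_probs(h, vocab_size)
        log_p += math.log(probs[x_id])
        h = step(h, x_id, vocab_size)  # observe x_t, then update the state
    return log_p
```

Because every conditional is a proper distribution, the probabilities of all sequences of a fixed length sum to one, which is a quick sanity check on the factorization.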

II-B Visual Captioning

The visual captioning problem has been studied for many years. In 2002, a video captioning system [1] was proposed for describing human behavior. The method first detects visual information (i.e., the position of the head, the direction of the head and the positions of the hands) to determine where the person is and what gesture the person makes, and then selects an appropriate predicate, object, etc. using domain knowledge. Finally, the method applies syntactic rules to generate a whole sentence. Following this work, a series of studies used such techniques to enhance different multimedia applications [2, 3, 4]. Other works tackle the problem with probabilistic graphical models. Farhadi et al. [22] introduce a meaning space, represented as <object, action, scene> triplets in the form of a Markov Random Field (MRF), and map images and sentences to this meaning space to find the relationship between them. Rohrbach et al. [23] model the relationship between different components of the visual information with a Conditional Random Field (CRF), and then treat captioning as a machine translation problem to generate sentences.

Inspired by the recent advances in image classification using CNNs (e.g., VggNet [5], GoogLeNet [24] and ResNet [6]) and in machine translation using RNNs, there have been a few attempts [10, 25, 16, 17, 11] to address video caption generation by first adopting an efficient CNN to extract video appearance features, and then using an RNN that takes the video features and the previously predicted words to infer a new word with a softmax [26, 27, 28]. To further improve performance, more complex approaches [11, 10, 17] have been proposed from different aspects. Specifically, Yao et al. [11] adopted a spatio-temporal convolutional neural network (3-D CNN) to capture video motion information and a soft attention mechanism to select relevant frame-level features for video captioning. Pan et al. [10] incorporated the semantic relationship between sentence and visual content, while Yu et al. [17] proposed a hierarchical framework consisting of a sentence generator that describes a specific short video interval and a paragraph generator that captures the inter-sentence dependency. However, all of them treat video captioning as a deterministic problem and can generate only one output, which violates the nature of video captioning. By taking different hidden factors (e.g., intention and experience) into consideration, a trained model should be able to output different sentences. Note that the model introduced in [29] can also generate diverse sentences for image captioning, but it does so by using different LSTMs to generate different sentences (the number of LSTMs equals the number of different sentences). Their model therefore contains no uncertain factors and does not capture the uncertainty in the captioning problem.

II-C What is Uncertainty

From the management point of view, uncertainty is the lack of exact knowledge, regardless of the cause of this deficiency [30, 31, 32]. Models help us clarify our understanding of this knowledge gap, but in real life, understanding the average process is often not sufficient, and it is impossible to predict results with certainty [33]. In general, besides language uncertainty, uncertainty can be classified into six major types [33, 30]: 1) measurement error, which results from imperfections in measuring devices, observational techniques, etc.; 2) systematic error, which occurs as the result of bias in the measuring devices or the sampling process; 3) natural variation, which occurs in systems that change with respect to time, space or other variables; 4) inherent randomness, which results from a system that is irreducible to a deterministic one; 5) model uncertainty, which mainly arises because the mathematical and computer models used for predicting future events or answering questions under specific scenarios are only abstractions of the real process; and 6) subjective judgment, which occurs as a result of the interpretation of data: without sufficient data, experts' judgment will be based on observations and experience. All of these uncertainties are hidden factors affecting the results of video captioning, and we propose to model them using latent stochastic variables.

II-D Variational Auto-Encoder

As discussed above, we need a method to capture the uncertainty in the video captioning problem. But how can we model this uncertainty? The variational auto-encoder (VAE) [18] offers a good way to do so. To capture the variations in the observed variables $x$, the VAE introduces a set of latent random variables $z$ and rewrites the objective function as follows:

$$\log p(x) \geq -\mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big) + \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big], \tag{2}$$

where $\mathrm{KL}(q(z \mid x) \,\|\, p(z))$ is the Kullback-Leibler divergence between the two distributions $q(z \mid x)$ and $p(z)$, which measures the non-symmetric difference between two probability distributions, and $q(z \mid x)$ is an approximate posterior distribution, which avoids having to solve for the intractable true posterior. In [18], the VAE was used to draw digits, so it needs to decide not just which number is written, but also the angle, the stroke width and abstract stylistic properties; the model therefore uses a set of latent random variables to capture this latent information. Inspired by this, we also use latent variables with a stochastic layer to capture the uncertainty in video captioning. Unlike drawing digits, video captioning needs to generate different sentences based on the content of the video, so our objective is a conditional probability. We therefore use the loss function introduced in the conditional variational auto-encoder (CVAE) [34], which extends the VAE to handle conditional probability distributions. Krishnan et al. [35] compared different variational models, which guided us in choosing an effective one. Other works extend the VAE to RNNs [36, 37, 38] for generating speech or music signals. All these works inspired us to treat captioning as a problem under uncertainty.

Fig. 2: The end-to-end multi-modal stochastic RNN architecture for video captioning. The S-LSTM introduces latent variables to propagate uncertainty. During the training phase, the S-LSTM enforces consistency between the prior distribution and the posterior distribution. Therefore, during the test phase, we only need the learned prior distribution to support video caption generation, which is a common strategy in VAE models. The backward LSTM (B-LSTM) is used to infer the posterior distribution over the latent variables, so the B-LSTM layer is removed during the test phase.

III The Proposed Approach

In this section, we introduce our approach for video captioning, and we follow the conventional encoder-decoder framework. The encoder is based purely on neural networks to generate video descriptions, and the decoder, named Multimodal Stochastic Recurrent Neural Networks (MS-RNN) (see Fig. 2), is our major contribution. We first introduce the architecture of our proposed network, and then devise the loss function and optimization.

III-A Problem Formulation

Given a video v with $N$ frames, we extract their frame-level features, so that v can be represented as $v = \{v_1, v_2, \dots, v_N\}$, where $v_i \in \mathbb{R}^{d_v}$ and $d_v$ is the dimension of the frame-level features. For each v, we also have a textual sentence a that describes it; a consists of $T$ words and can be represented as $a = \{a_1, a_2, \dots, a_T\}$. Specifically, $a_t \in \mathbb{R}^{D}$ is a one-hot vector, where $D$ is the size of the vocabulary. Therefore, we have $v \in \mathbb{R}^{N \times d_v}$ and $a \in \mathbb{R}^{T \times D}$. Given a video, our model predicts one word at a time until it has generated a textual sentence describing the input video. In detail, at the $t$-th time step, our model uses v and the previous words $a_{<t}$ to predict the word $a_t$ with the maximal probability $p(a_t \mid a_{<t}, v)$, until the end of the sentence is reached. In addition, we use a special mark to denote the end of the sentence.
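The word-by-word prediction described above can be sketched as a greedy decoding loop. This is only an illustration: `next_word_probs` stands for any trained model that returns a distribution over the vocabulary, and `toy_model` is a hypothetical stand-in with a four-word vocabulary. Greedy selection is also a simplification of the beam search used at test time in the experiments.

```python
def greedy_decode(next_word_probs, video_feature, eos_id, max_len=20):
    # Repeatedly pick the most probable next word until the end-of-sentence mark.
    words = []
    for _ in range(max_len):
        probs = next_word_probs(video_feature, words)
        word = max(range(len(probs)), key=probs.__getitem__)
        if word == eos_id:
            break
        words.append(word)
    return words

def toy_model(video_feature, prev_words):
    # Hypothetical model: vocabulary {0: end-of-sentence, 1, 2, 3};
    # it prefers the words 1, 2, 3 in order and then the end-of-sentence mark.
    vocab_size = 4
    target = len(prev_words) + 1 if len(prev_words) < 3 else 0
    return [0.7 if k == target else 0.1 for k in range(vocab_size)]

caption_ids = greedy_decode(toy_model, video_feature=None, eos_id=0)
```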

III-B Encoder

The goal of an encoder is to compute feature vectors that are compact and representative and that capture the visual information most relevant to the decoder. Specifically, it encodes the input v into a continuous representation, which may be a variable-sized set $\{v_1, \dots, v_N\}$. Deep convolutional neural networks (CNNs) have developed rapidly and achieved great success in large-scale image recognition [6], object detection [39] and visual captioning [9]. High-level features can be extracted from the upper or intermediate layers of a deep CNN. Therefore, well-tested CNNs, such as the ResNet-152 model [6], which achieved the best performance in the ImageNet Large Scale Visual Recognition Challenge, can be used as candidate encoders for our framework. We apply the CNN to each frame to extract representative frame-level features.

To encode the sentence, because one-hot vectors $a_t$ are sparse, we follow previous works [11, 10] and process each one-hot vector with an embedding. We use a parameter matrix $W_e$ to map the one-hot vectors a to s as follows:

$$s_t = W_e a_t. \tag{3}$$

Both s and v are passed on to the next step. In addition, the end-of-sentence mark is also mapped to an embedding vector.
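A small sketch of the embedding step, assuming the parameter matrix is stored with one row per embedding dimension and one column per vocabulary word (the names `W_e`, `embed` and `embed_lookup` are illustrative). Multiplying the matrix by a one-hot vector simply selects one column, which is why implementations usually replace the product with an index lookup.

```python
def one_hot(index, size):
    # One-hot vector with a single 1.0 at the given position.
    return [1.0 if k == index else 0.0 for k in range(size)]

def embed(one_hot_vec, W_e):
    # Dense form: matrix-vector product of W_e (embedding_dim x vocab_size)
    # with the one-hot word vector.
    return [sum(w * a for w, a in zip(row, one_hot_vec)) for row in W_e]

def embed_lookup(index, W_e):
    # Equivalent sparse form: read column `index` of W_e directly.
    return [row[index] for row in W_e]

W_e = [[0.1, 0.2, 0.3],
       [0.4, 0.5, 0.6]]  # toy matrix: embedding size 2, vocabulary size 3
s_t = embed(one_hot(1, 3), W_e)
```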

III-C Decoder with MS-RNN

The MS-RNN consists of three core components, as shown in Fig. 2: a basic LSTM layer for extracting word-level features, a multi-modal LSTM layer (M-LSTM) for encoding multi-view information (visual and textual features) simultaneously and chronologically, and a backward stochastic LSTM layer (S-LSTM) for introducing latent variables.

III-C1 LSTM for Word Features

In our MS-RNN model, we use a basic LSTM layer that takes s as input and outputs word features with encoded temporal information:

$$h_t^{w} = \mathrm{LSTM}(s_t, h_{t-1}^{w}), \tag{4}$$

where $t \in \{1, \dots, T\}$. More specifically, a standard LSTM unit consists of three gates: a "forget gate" ($f_t$) that decides what information to throw away from the LSTM unit; an "input gate" ($i_t$) that decides what new information to store in the cell state; and an "output gate" ($o_t$) that controls the extent to which the value in memory is used to compute the output activation of the block. A standard LSTM can be defined as:

$$\begin{aligned}
f_t &= \sigma(W_f s_t + U_f h_{t-1}^{w} + b_f),\\
i_t &= \sigma(W_i s_t + U_i h_{t-1}^{w} + b_i),\\
o_t &= \sigma(W_o s_t + U_o h_{t-1}^{w} + b_o),\\
g_t &= \varphi(W_g s_t + U_g h_{t-1}^{w} + b_g),\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t,\\
h_t^{w} &= o_t \odot \varphi(c_t),
\end{aligned} \tag{5}$$

where $\sigma$ is the sigmoid function, $\varphi$ denotes the hyperbolic tangent function, $c_t$ is the cell state vector, $h_t^{w}$ is the output vector, $f_t$, $i_t$ and $o_t$ are sigmoid gates, $W_*$ and $U_*$ are sets of parameters, $\odot$ denotes element-wise multiplication, and $b_*$ is a set of bias values. Then, for each word $a_t$, we extract its word feature as $h_t^{w}$.
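The gate description above corresponds to the textbook LSTM update, which can be sketched in plain Python as follows. The parameter layout (a dictionary holding an input matrix, a recurrent matrix and a bias per gate) is an assumption made for this illustration, not the layout used in the experiments.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, U, b, x, h):
    # Row-wise W @ x + U @ h + b for list-of-lists matrices.
    return [sum(w * xi for w, xi in zip(W_row, x))
            + sum(u * hi for u, hi in zip(U_row, h)) + b_k
            for W_row, U_row, b_k in zip(W, U, b)]

def lstm_step(x, h_prev, c_prev, params):
    # params maps each of "f", "i", "o", "g" to a tuple (W, U, b).
    f = [sigmoid(a) for a in affine(*params["f"], x, h_prev)]    # forget gate
    i = [sigmoid(a) for a in affine(*params["i"], x, h_prev)]    # input gate
    o = [sigmoid(a) for a in affine(*params["o"], x, h_prev)]    # output gate
    g = [math.tanh(a) for a in affine(*params["g"], x, h_prev)]  # candidate values
    # New cell state: keep part of the old memory, add part of the candidate.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # Output: gated, squashed cell state.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c
```

With all parameters set to zero, every gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved at each step; this gives an easy check of the update.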

III-C2 Multimodal LSTM Layer

Next, an M-LSTM layer takes the word feature $h_t^{w}$ and a video-level feature $\bar{v}$ as inputs and fuses them into a high-level feature $h_t^{m}$:

$$h_t^{m} = \text{M-LSTM}(h_t^{w}, \bar{v}, h_{t-1}^{m}). \tag{6}$$

Here, instead of using an advanced but complex temporal or spatial attention mechanism to select a video-level feature, we use the basic mean pooling strategy to obtain a single $\bar{v}$:

$$\bar{v} = \frac{1}{N} \sum_{i=1}^{N} v_i, \tag{7}$$

where $N$ is the number of frames and $v_i$ is the feature of the $i$-th frame. The motivation is that if our model improves video captioning performance even when it uses the visual features in this basic way, the advantages of MS-RNN are evident. As shown in [11, 12], an attention mechanism could further boost the performance of video captioning.
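Mean pooling over frames amounts to averaging each feature dimension across the frame-level vectors. A minimal sketch, with a hypothetical two-frame, two-dimensional input:

```python
def mean_pool(frame_features):
    # frame_features: list of per-frame feature vectors of equal length.
    # Returns one video-level vector: the per-dimension average over frames.
    n = len(frame_features)
    return [sum(column) / n for column in zip(*frame_features)]

video_feature = mean_pool([[1.0, 2.0],
                           [3.0, 6.0]])
```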

Multimodal LSTM (M-LSTM) is a novel variant of LSTM that not only inherits the numerical stability of LSTM but also generates plausible features from multi-view sources. We choose LSTM as our basic RNN unit for the following reasons: 1) it has achieved great success in machine translation, speech recognition, and image and video captioning [40, 41, 9]; and 2) compared with basic RNN units, it is much better at handling the "long-term dependencies" problem.

Given two modalities $x_t$ and $y_t$ as inputs, and two initial vectors $h_0$ and $c_0$, an M-LSTM can be used to fuse them and extract a higher-level feature. An M-LSTM unit can be described as below:

$$\begin{aligned}
i_t &= \sigma(W_i x_t + V_i y_t + U_i h_{t-1} + b_i),\\
f_t &= \sigma(W_f x_t + V_f y_t + U_f h_{t-1} + b_f),\\
o_t &= \sigma(W_o x_t + V_o y_t + U_o h_{t-1} + b_o),\\
g_t &= \varphi(W_g x_t + V_g y_t + U_g h_{t-1} + b_g),\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t,\\
h_t &= o_t \odot \varphi(c_t).
\end{aligned} \tag{8}$$

To obtain an abstract concept from the two modalities, the M-LSTM first projects $x_t$ and $y_t$ into a common feature space; the gates then add the projections together inside an activation function. In each time step $t$, we thus extract a higher-level feature $h_t$.

Fig. 3: The stochastic cell of the S-LSTM.

III-C3 Backward Stochastic LSTM

In this subsection, we introduce our backward stochastic LSTM (S-LSTM), which takes the output of the M-LSTM and approximates the posterior distributions over the latent variables, defined as $z = \{z_1, \dots, z_T\}$, where $z_t \in \mathbb{R}^{d_z}$. The S-LSTM consists of two units: a backward LSTM unit and a stochastic unit. We define the output of the backward LSTM as $b = \{b_1, \dots, b_T\}$.

For the backward LSTM unit at time step $t$, the output is defined as:

$$b_t = \overleftarrow{\mathrm{LSTM}}([h_t^{m}; s_t], b_{t+1}), \tag{9}$$

where $h_t^{m}$ is the output of the M-LSTM at time step $t$, $s_t$ is the output of the embedding layer, and $b_{T+1}$ is initialized to the zero vector. $\overleftarrow{\mathrm{LSTM}}$ has the same form as LSTM, but it processes the sequence in the backward direction. The output of the backward LSTM at time step $t$ thus depends on the present input and on the future output $b_{t+1}$. This is because, in the stochastic unit, the posterior distribution of $z_t$, which is calculated with Eq. 15, does not depend on the past outputs and deterministic states, but only on the present and future ones. Therefore, we use the backward LSTM to extract the future information and combine it with a stochastic layer.

Fig. 3 shows the structure of the stochastic unit. To obtain $z_t$, we use the "reparameterization trick" introduced in [18]. This trick randomly samples a set of values $\epsilon$ from a standard Gaussian distribution, i.e., $\epsilon \sim \mathcal{N}(0, I)$. If we assume $z_t \sim \mathcal{N}(\mu_t, \sigma_t^2)$, we can use $z_t = \mu_t + \sigma_t \odot \epsilon$ to calculate $z_t$. Next, we need to solve the problem of how to learn $\mu_t$ and $\sigma_t$ for $z_t$.
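The reparameterization trick can be sketched in a few lines: the randomness is drawn from a standard Gaussian and then shifted and scaled, so the sample is a deterministic function of the mean and standard deviation. The function name and the use of an explicit random generator are choices made for this illustration.

```python
import random

def reparameterize(mu, sigma, rng):
    # Sample eps ~ N(0, I), then return z = mu + sigma * eps (element-wise).
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

rng = random.Random(0)
z_first = reparameterize([0.0, 1.0], [1.0, 0.5], rng)
z_second = reparameterize([0.0, 1.0], [1.0, 0.5], rng)  # a different draw
```

Calling the function twice with the same mean and standard deviation yields different latent vectors, which is what allows a stochastic decoder to produce different captions for the same video.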

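In detail, the stochastic unit takes the previous latent variable $z_{t-1}$ and the backward LSTM output $b_t$ as input to approximate the posterior mean $\mu_t^{q}$ and standard deviation $\sigma_t^{q}$ with two feed-forward networks (i.e., $\mathrm{NN}_\mu^{q}$ and $\mathrm{NN}_\sigma^{q}$). Each of them contains two fully connected layers:

$$\mu_t^{q} = \mathrm{NN}_\mu^{q}([z_{t-1}; b_t]), \qquad \sigma_t^{q} = \mathrm{NN}_\sigma^{q}([z_{t-1}; b_t]), \tag{10}$$

where $[\cdot\,;\cdot]$ is a concatenation operation. In addition, the stochastic unit also takes $z_{t-1}$ and the M-LSTM output $h_t^{m}$ to approximate the prior mean $\mu_t^{p}$ and standard deviation $\sigma_t^{p}$ with two feed-forward networks (i.e., $\mathrm{NN}_\mu^{p}$ and $\mathrm{NN}_\sigma^{p}$):

$$\mu_t^{p} = \mathrm{NN}_\mu^{p}([z_{t-1}; h_t^{m}]), \qquad \sigma_t^{p} = \mathrm{NN}_\sigma^{p}([z_{t-1}; h_t^{m}]). \tag{11}$$

For training, the posterior mean is parameterized with the help of the prior mean, a method introduced in [37] that can improve the posterior approximation, while for testing only the learned prior distribution is used. The initial latent variable $z_0$ is set to the zero vector. To output a symbol $a_t$, a probability distribution $p_t$ over the set of possible words is obtained using $h_t^{m}$ and $z_t$:

$$p_t = \mathrm{softmax}(W_y [h_t^{m}; z_t] + b_y), \tag{12}$$

where $W_y$ and $b_y$ are parameters to be learned. The output of the softmax layer can then be interpreted as a probability distribution over words.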

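III-D Loss Function

Based on variational inference and the conditional variational auto-encoder (CVAE) proposed in [34], we define the following loss function:

$$\mathcal{L} = \mathbb{E}_{q(z \mid a, v)}\big[\log p(a \mid z, v)\big] - \mathrm{KL}\big(q(z \mid a, v) \,\|\, p(z \mid v)\big), \tag{13}$$

where $\mathcal{L}$ is the evidence lower bound (ELBO) of the log-likelihood. The distribution $q(z \mid a, v)$ is an approximate posterior distribution, which aims to approximate the intractable true posterior distribution. The first term, $\mathbb{E}_{q(z \mid a, v)}[\log p(a \mid z, v)]$, is the expected log-likelihood under $q(z \mid a, v)$. This term is written as:

$$\mathbb{E}_{q(z \mid a, v)}\big[\log p(a \mid z, v)\big] = \mathbb{E}_{q(z \mid a, v)}\Big[\sum_{t=1}^{T} \log p(a_t \mid h_t^{m}, z_t)\Big], \tag{14}$$

where $h_t^{m}$ is the output of the M-LSTM at time step $t$. Here, we process the concatenated vector $[h_t^{m}; z_t]$ with a softmax layer, as in Eq. 12, to approximate $p(a_t \mid h_t^{m}, z_t)$.

The second term, namely the KL term, is the Kullback-Leibler divergence, which measures the non-symmetric difference between two probability distributions (i.e., $q(z \mid a, v)$ and $p(z \mid v)$). In our work, we choose the variational model introduced in [35] to factorize the posterior distribution. The posterior and prior distributions are factorized as below:

$$q(z \mid a, v) = \prod_{t=1}^{T} q(z_t \mid z_{t-1}, b_t), \qquad p(z \mid v) = \prod_{t=1}^{T} p(z_t \mid z_{t-1}, h_t^{m}), \tag{15}$$

where $b_t$ is the output of the backward LSTM at time step $t$. To approximate $q$ and $p$, we first use a backward LSTM layer to encode a (already encoded to s in Eq. 3) and v into $b_t$, and then use the method described in Sec. III-C3 to approximate the means and the variances of $q(z_t \mid z_{t-1}, b_t)$ and $p(z_t \mid z_{t-1}, h_t^{m})$. With Gaussian distributions of means $\mu_t^{q}, \mu_t^{p}$ and standard deviations $\sigma_t^{q}, \sigma_t^{p}$, we can use the following function to calculate the Kullback-Leibler divergence at the $t$-th time step:

$$\mathrm{KL}_t = \frac{1}{2} \sum_{j=1}^{d_z} \left[ \log \frac{(\sigma_{t,j}^{p})^2}{(\sigma_{t,j}^{q})^2} + \frac{(\sigma_{t,j}^{q})^2 + (\mu_{t,j}^{q} - \mu_{t,j}^{p})^2}{(\sigma_{t,j}^{p})^2} - 1 \right].$$

For the whole sentence generation, we calculate the global Kullback-Leibler divergence by:

$$\mathrm{KL} = \sum_{t=1}^{T} \mathrm{KL}_t.$$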


In this paper, we maximize the loss function defined above to learn all the parameters. More specifically, we use the backpropagation through time (BPTT) algorithm to compute the gradients and conduct the optimization with Adadelta.
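The KL term of this section can be sketched with the standard closed form for two diagonal Gaussians, summed over time steps. The sketch assumes that both the posterior and the prior are diagonal Gaussians described by a mean vector and a standard-deviation vector; the function names are illustrative.

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    # KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for diagonal Gaussians.
    return 0.5 * sum(
        2.0 * math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / sp ** 2 - 1.0
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )

def sequence_kl(posteriors, priors):
    # Each argument is a list with one (mu, sigma) pair per time step;
    # the global KL is the sum of the per-step divergences.
    return sum(kl_diag_gaussians(mq, sq, mp, sp)
               for (mq, sq), (mp, sp) in zip(posteriors, priors))
```

The divergence is zero when the posterior equals the prior, and it grows as the posterior mean moves away from the prior mean, which is what pushes the two distributions towards consistency during training.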


IV Experiment

We evaluate our model on two standard video captioning benchmark datasets: the widely used Microsoft Video Description (MSVD) [43] and the large-scale MSR Video-to-Text (MSR-VTT) [44].

MSVD: This dataset consists of short video clips collected from YouTube, with an average length of about 9s. In addition, this dataset contains about 80,000 clip-description pairs labeled by Amazon Mechanical Turkers (AMT). In other words, each clip has multiple sentence descriptions. In total, all the descriptions contain nearly unique vocabularies. Following previous work [10, 16, 17], we split this dataset into a training, a validation and a testing dataset with , and video clips, respectively.

MSR-VTT: This dataset was proposed by Xu et al. [44] in 2016 to provide a new large-scale video benchmark for video understanding, especially for the task of translating videos to text. This dataset contains K web video clips and K clip-sentence pairs in total. Each clip is annotated with 20 natural sentences by AMT workers. The dataset was collected from a commercial video search engine and so far covers the most comprehensive categories and diverse visual content, representing the largest dataset in terms of sentences and vocabulary. We run our experiments on the updated version with sentence quality control. This dataset is divided into three subsets: % for training, % for validating and % for testing.

Fig. 4: Demonstration of our results, which are generated by repeatedly inputting each video five times into our trained model on the MSVD dataset. Our model is able to generate different captions based on the different hidden stochastic variables.

IV-A Evaluation Metrics

To evaluate the performance of our model, we use the following four evaluation metrics: BLEU [45], METEOR [46], CIDEr [47] and ROUGE-L [48]. The Microsoft COCO evaluation server [49] has implemented these metrics, so we directly call its evaluation functions to test the performance of video captioning.

IV-B Experimental Settings

Video Appearance Feature Extraction. The experimental results obtained by Xu et al. [44] show that different pooling methods (i.e., single frame, mean pooling and soft attention) yield different performance. Both mean pooling and soft attention perform significantly better than single frame. Soft attention performs slightly better than mean pooling, with increases of 0.6% in BLEU@4 and 0.6% in METEOR, but it involves more operations. Therefore, we apply mean pooling to a set of frame-level features to generate a representative video-level feature. In addition, we follow previous work [11] and uniformly sample frames from each clip to control the duplication of video frames. Deep convolutional neural networks (CNNs) have achieved great success in image feature extraction. Therefore, in this paper we use ResNet-152 [6] and GoogLeNet [24], two state-of-the-art CNNs, to extract video frame-level features and analyze our model. The results show that ResNet features perform better (see Tab. III).

Sentence Preprocessing. For the MSVD dataset, we tokenize the sentences by first converting all words to lowercase and then using the WordPunct tokenizer from the NLTK toolbox to tokenize sentences and remove punctuation. As a result, we obtained a vocabulary with words from the MSVD training dataset. For the MSR-VTT dataset, after tokenization we obtain a size vocabulary from its training dataset. For each dataset, we use a one-hot vector (1-of-$D$ encoding, where $D$ is the vocabulary size) to represent each word.
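The preprocessing steps can be approximated with the standard library alone. The regular expression below is, to our understanding, the pattern used by NLTK's `WordPunctTokenizer`; dropping tokens that contain no letter or digit is our assumption about how punctuation is removed, and `build_vocab` is a hypothetical helper.

```python
import re

# Runs of word characters, or runs of non-word, non-space characters.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

def tokenize(sentence):
    # Lowercase, split into word and punctuation tokens, then drop
    # tokens that contain no alphanumeric character.
    tokens = _TOKEN_RE.findall(sentence.lower())
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

def build_vocab(sentences):
    # Assign each distinct token an integer index in order of first appearance.
    vocab = {}
    for sentence in sentences:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

vocab = build_vocab(["A man is slicing a cucumber.", "A girl is singing!"])
```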

Training Details. For dealing with sentences with arbitrary size, we add a begin-of-sentence tag bos to start each sentence, and an end-of-sentence tag eos to end each sentence. During training, we maximize the loss function by taking the video and its corresponding groundtruth sentence label as the inputs.

In addition, in our experiments, with an initial learning rate to avoid the gradient explosion, we set the beam search size as . Empirically, we set all the M-LSTM unit sizes to 512, all the B-LSTM unit sizes to 512, the dimension of the latent variables to 256, and the word embedding size to 512. Our objective function (Eq. 13) is optimized over all training video-sentence pairs with a mini-batch size of 64 for MSVD and 256 for MSR-VTT. We stop training when 500 epochs are reached or when the evaluation metric has not improved on the validation set for 20 epochs (a patience of 20). In addition, we multiply the KL term by a scalar that starts at 0.01 and increases linearly to 1 over the first 20 epochs.
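The KL weighting schedule can be sketched as a small function. We assume here that epochs are counted from zero and that the weight reaches 1 exactly at epoch 20; the text does not state either detail, and `kl_weight` is an illustrative name.

```python
def kl_weight(epoch, start=0.01, end=1.0, warmup_epochs=20):
    # Linear ramp from `start` at epoch 0 to `end` at `warmup_epochs`,
    # constant at `end` afterwards.
    if epoch >= warmup_epochs:
        return end
    return start + (end - start) * epoch / warmup_epochs

# Weight applied to the KL term at each of the first 25 epochs.
schedule = [kl_weight(epoch) for epoch in range(25)]
```

Starting with a small weight lets the model first learn to reconstruct captions before the KL term pulls the posterior towards the prior.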

Testing Details. During testing, our model takes the video and a begin-of-sentence tag bos as inputs to generate sentences describing the input video. After the parameters are learned, we perform the generation with beam search [50].
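Beam search keeps the few best partial sentences at every step instead of only the single best one. The following is a minimal sketch without length normalization; `next_word_probs` stands for any model returning a distribution over the next word, and `toy_next_word_probs` is a hypothetical model whose best sentence is missed by greedy decoding.

```python
import math

def beam_search(next_word_probs, bos_id, eos_id, beam_size=3, max_len=10):
    beams = [([bos_id], 0.0)]  # (word ids, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            for word_id, p in enumerate(next_word_probs(words)):
                if p > 0.0:
                    candidates.append((words + [word_id], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:beam_size]:
            # Hypotheses ending in the end-of-sentence tag are set aside.
            (finished if words[-1] == eos_id else beams).append((words, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])[0]

def toy_next_word_probs(words):
    # Hypothetical vocabulary: 0 = bos, 1 = eos, 2 = "A", 3 = "B".
    last = words[-1]
    if last == 0:
        return [0.0, 0.0, 0.6, 0.4]
    if last == 2:
        return [0.0, 0.4, 0.3, 0.3]
    return [0.0, 1.0, 0.0, 0.0]

best = beam_search(toy_next_word_probs, bos_id=0, eos_id=1, beam_size=2)
```

In this toy model the most probable first word leads to a less probable complete sentence, so a beam of size 1 (greedy decoding) returns a worse sentence than a beam of size 2.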

In addition, our model incorporates latent variables to capture the true nature of video captioning and has the potential to describe a video from different aspects. Thus, we input the test videos into our trained model five times. The performance of each run is shown in Tab. I, together with the average performance. Moreover, Fig. 4 shows some output examples.

Time B@1 B@2 B@3 B@4 M C RL
1 83.2 72.8 63.5 53.4 33.9 73.7 70.4
2 82.7 72.6 63.7 53.6 34.0 75.2 70.3
3 82.4 72.1 62.9 52.8 33.6 74.7 69.8
4 83.0 72.8 63.6 53.8 34.0 76.6 70.4
5 83.1 72.7 63.5 53.1 33.6 73.9 70.0
mean 82.9 72.6 63.5 53.3 33.8 74.8 70.2
TABLE I: Performance of our MS-RNN model obtained by repeatedly inputting the test videos into the model five times.
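The mean row of Tab. I can be re-derived from the five runs with a short script. The values are copied from the table; the script only averages the rounded entries.

```python
runs = {
    "B@1": [83.2, 82.7, 82.4, 83.0, 83.1],
    "B@2": [72.8, 72.6, 72.1, 72.8, 72.7],
    "B@3": [63.5, 63.7, 62.9, 63.6, 63.5],
    "B@4": [53.4, 53.6, 52.8, 53.8, 53.1],
    "M":   [33.9, 34.0, 33.6, 34.0, 33.6],
    "C":   [73.7, 75.2, 74.7, 76.6, 73.9],
    "RL":  [70.4, 70.3, 69.8, 70.4, 70.0],
}

# Average each metric over the five runs and round to one decimal place.
means = {metric: round(sum(values) / len(values), 1)
         for metric, values in runs.items()}
```

The recomputed means agree with the table for every metric except B@3, where averaging the rounded entries gives 63.4 instead of 63.5; this small difference would be consistent with the table averaging unrounded scores, although the text does not say so.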

IV-C Results on MSVD Dataset

In this paper, we use the probability distribution of latent variables to represent uncertainty, so our model may generate a different description each time it is run. In this subsection, we run the testing five times and report the results in Tab. I. The performance across runs is quite stable. By checking the generated sentences (see Fig. 4), we can see that our model can describe a video from various aspects, much as in real life, where people provide various sentences to describe one video according to their intents.

IV-D Component Analysis

In this paper, we design two core components, an M-LSTM layer and an S-LSTM layer, which affect the performance of our algorithm. In this subsection, we study their contributions with the following two settings:

  • Only using M-LSTM for video captioning (M).

  • Incorporating M-LSTM and S-LSTM for video captioning (M+S).

In this sub-experiment, we conduct the experiments on the MSVD dataset and use ResNet to extract frame features.

Tab. II lists the results, which demonstrate that our MS-RNN model with both M-LSTM and S-LSTM outperforms the M-LSTM-only model on all evaluation metrics, with increases of 1.3% in M, 3.3% in C and 1% in RL.

Model B@1 B@2 B@3 B@4 M C RL
M 82.7 71.8 62.7 52.2 32.5 71.5 69.2
M+S 82.9 72.6 63.5 53.3 33.8 74.8 70.2
TABLE II: Exploring MS-RNN. The top model uses only M-LSTM, while the bottom model integrates M-LSTM and S-LSTM. B, M, C, and RL are short for BLEU, METEOR, CIDEr and ROUGE-L, respectively. All values are reported as percentages (%).

In Fig.4, we show some example sentences generated by our approach, with only M-LSTM and with both M-LSTM and S-LSTM, respectively. From Fig.4, we have the following observations:

  • Both M-LSTM and M-LSTM+S-LSTM are able to generate accurate descriptions for a video. In addition, the results generated by M-LSTM+S-LSTM are generally better than those of M-LSTM, which is consistent with the results reported in Tab. II.

  • M-LSTM is deterministic and it can only generate one sentence, while M-LSTM+S-LSTM can produce different sentences.

  • In general, M-LSTM+S-LSTM can provide more specific, comprehensive and accurate descriptions than M-LSTM. For example, in the top-left example, M-LSTM generates “a women is playing a guitar”, while M-LSTM+S-LSTM provides “a girl is singing” and “a women is playing with a guitar”. In the bottom-middle example, M-LSTM provides a wrong description, “cucumber”, while M-LSTM+S-LSTM generates “vegetables” and a set of verbs: “slicing, chopping and cutting”.

  • Our MS-RNN model may produce duplicate and comprehensive results, which is consistent with the nature of video captioning.

  • The last column shows some wrong examples. For the top-right example, both methods provide wrong descriptions, “cutting a cucumber” and “slicing a carrot”. This is mainly because the MSVD dataset contains many videos about cooking and few videos about folding paper, which leads to an over-fitting problem. In addition, the description of the middle-right example is also inaccurate. This is because both of our models take only video appearance features as inputs and ignore motion features. Finally, as the bottom-right example shows, our model does not correctly identify the number of objects in some cases.
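The contrast between a deterministic decoder (one sentence per video) and a stochastic one (different sentences on different runs) can be illustrated with a toy sketch. This is not the S-LSTM itself: it assumes a Gaussian latent variable drawn with the reparameterization of [18], z = mu + sigma * eps, and simply adds it to a hand-picked hidden state before a softmax. The vocabulary, the numeric values and all function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deterministic_step(hidden):
    """A deterministic state always yields the same word distribution."""
    return softmax(hidden)


def stochastic_step(hidden, mu, sigma, rng):
    """Add a Gaussian latent z = mu + sigma * eps (eps ~ N(0, 1)) before the softmax."""
    z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    return softmax([h + zi for h, zi in zip(hidden, z)])


# Toy values, assumed for illustration only.
VOCAB = ["playing", "singing", "slicing"]
HIDDEN = [1.0, 0.8, -1.0]
MU = [0.0, 0.0, 0.0]
SIGMA = [0.5, 0.5, 0.5]

if __name__ == "__main__":
    rng = random.Random(0)
    print("deterministic:", [round(p, 3) for p in deterministic_step(HIDDEN)])
    for i in range(3):
        probs = stochastic_step(HIDDEN, MU, SIGMA, rng)
        word = VOCAB[probs.index(max(probs))]
        print(f"sample {i}: {word}", [round(p, 3) for p in probs])
```

Calling `deterministic_step` twice returns identical distributions, whereas successive calls to `stochastic_step` draw a new latent sample each time and so can rank the candidate words differently. Setting every entry of `SIGMA` to zero recovers the deterministic case.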

Model B@1 B@2 B@3 B@4 M C
LSTM-E(V)[10] 74.9 60.9 50.6 40.2 29.5 -
h-RNN(V)[17] 77.3 64.5 54.6 44.3 31.1 62.1
SA(G)[11] 79.1 63.2 51.2 40.6 29.0 -
MFA-LSTM(R)[13] 81.3 69.8 60.5 50.4 32.2 69.8
MS-RNN(G) 80.3 68.4 58.7 48.0 31.0 66.6
MS-RNN(R) 82.9 72.6 63.5 53.3 33.8 74.8
TABLE III: Comparing the quality of sentence generation with different video spatial representations on the MSVD dataset. (V), (G), and (R) stand for VGGNet, GoogLeNet, and ResNet, respectively. All values are reported as percentages (%).

IV-E Comparison Results on MSVD Dataset

In this subsection, we conduct experiments to examine how different video representations affect video captioning, and we compare our model with existing approaches. All the approaches in this sub-experiment take only one type of video representation, extracted from VGGNet (V), GoogLeNet (G) or ResNet (R). We conduct our experiments on the MSVD dataset.

Tab.III lists the experimental results. From Tab.III, we have the following observations:

  • With only appearance features, our MS-RNN (R) model achieves the best performance on all evaluation metrics. Compared with the state-of-the-art method MFA-LSTM (R), our model achieves significantly better performance, with 1.6%, 2.8%, 3%, 2.9%, 1.6% and 5% increases on B@1, B@2, B@3, B@4, M and C, respectively.

  • For the video captioning task, the ResNet-based video representation performs better than both VGGNet-based and GoogLeNet-based video features. Specifically, for our model, ResNet features perform better than GoogLeNet features. Across all experimental results, the approaches with ResNet-based features (SCN-LSTM and MFA-LSTM) perform better than the methods with GoogLeNet-based or VGGNet-based features.

  • Compared with the methods using attention mechanisms, e.g., temporal attention [11], our MS-RNN (R) achieves even better results, with 3.8%, 9.4%, 12.3%, 12.7% and 4.8% increases on B@1, B@2, B@3, B@4 and M, while using a simple mean pooling strategy. This indicates the advantages of our proposed MS-RNN.
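The increases quoted in the observations above are absolute differences in percentage points, not relative improvements. As a minimal sketch, they can be recomputed from the rows of Tab.III as follows; `TABLE_III`, `absolute_gains`, `GAIN_VS_MFA` and `GAIN_VS_SA` are names introduced here for illustration.

```python
# Scores copied from Tab. III (percent). SA(G) reports "-" for C, so that key is omitted.
TABLE_III = {
    "SA(G)": {"B@1": 79.1, "B@2": 63.2, "B@3": 51.2, "B@4": 40.6, "M": 29.0},
    "MFA-LSTM(R)": {"B@1": 81.3, "B@2": 69.8, "B@3": 60.5, "B@4": 50.4, "M": 32.2, "C": 69.8},
    "MS-RNN(R)": {"B@1": 82.9, "B@2": 72.6, "B@3": 63.5, "B@4": 53.3, "M": 33.8, "C": 74.8},
}


def absolute_gains(ours, baseline):
    """Percentage-point difference on every metric that both models report."""
    return {k: round(ours[k] - baseline[k], 1) for k in baseline if k in ours}


GAIN_VS_MFA = absolute_gains(TABLE_III["MS-RNN(R)"], TABLE_III["MFA-LSTM(R)"])
GAIN_VS_SA = absolute_gains(TABLE_III["MS-RNN(R)"], TABLE_III["SA(G)"])

if __name__ == "__main__":
    print("vs MFA-LSTM(R):", GAIN_VS_MFA)
    print("vs SA(G):      ", GAIN_VS_SA)
```

Note that the comparison with SA(G) mixes two changes at once, the decoder and the feature extractor (GoogLeNet versus ResNet), so the difference cannot be attributed to either one alone from this table.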

Model B@1 B@2 B@3 B@4 M C
LSTM-E(V+C)[10] 78.8 66.0 55.4 45.3 31.0 -
SA(G+3D)[11] 80.0 64.7 52.6 42.2 29.6 51.7
h-RNN(V+C)[17] 81.5 70.4 60.4 49.9 32.6 65.8
MFA-LSTM(R+C)[13] 82.9 72.0 62.7 52.8 33.4 68.9
SCN-LSTM(R+C) [51] - - - 51.1 33.5 77.7
MS-RNN(R) 82.9 72.6 63.5 53.3 33.8 74.8
TABLE IV: Performance comparison with methods using both appearance and motion video features. This experiment is conducted on the MSVD dataset.

We also compare our method with others that use multiple features. Specifically, in this subsection, we compare our model, which uses only appearance features, with six state-of-the-art methods: LSTM-E(V+C) [10], SA(G+3DCNN) [11], HRNE-AT(G+C) [16], h-RNN(V+C) [17], MFA-LSTM(R+C) [13] and SCN-LSTM(R+C) [51], which make use of both appearance and motion video features. Here, V, G and R are short for VGGNet, GoogLeNet and ResNet, which are used to extract appearance features. 3D and C are short for 3DCNN and C3D, which are used to generate video motion features.

The experimental results are shown in Tab.IV. Although our model only uses appearance features, it performs better than existing methods on B@2 (72.6%), B@3 (63.5%), B@4 (53.3%) and M (33.8%), and achieves comparable results on B@1 (82.9%) and C (74.8%).
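As a quick cross-check of this summary, the sketch below finds the best score per metric in Tab.IV while skipping entries reported as "-". It is only an illustration of the bookkeeping; `TABLE_IV`, `METRICS`, `best_per_metric` and `BEST` are names introduced here.

```python
# Scores copied from Tab. IV (percent); None marks an entry reported as "-".
METRICS = ["B@1", "B@2", "B@3", "B@4", "M", "C"]
TABLE_IV = {
    "LSTM-E(V+C)": [78.8, 66.0, 55.4, 45.3, 31.0, None],
    "SA(G+3D)": [80.0, 64.7, 52.6, 42.2, 29.6, 51.7],
    "h-RNN(V+C)": [81.5, 70.4, 60.4, 49.9, 32.6, 65.8],
    "MFA-LSTM(R+C)": [82.9, 72.0, 62.7, 52.8, 33.4, 68.9],
    "SCN-LSTM(R+C)": [None, None, None, 51.1, 33.5, 77.7],
    "MS-RNN(R)": [82.9, 72.6, 63.5, 53.3, 33.8, 74.8],
}


def best_per_metric(table, metrics):
    """Map each metric to (best score, sorted list of models that achieve it)."""
    result = {}
    for i, name in enumerate(metrics):
        # Keep only the models that actually report this metric.
        scores = {model: row[i] for model, row in table.items() if row[i] is not None}
        top = max(scores.values())
        result[name] = (top, sorted(m for m, s in scores.items() if s == top))
    return result


BEST = best_per_metric(TABLE_IV, METRICS)

if __name__ == "__main__":
    for name, (score, models) in BEST.items():
        print(f"{name:>3}: {score} by {', '.join(models)}")
```

On these entries MS-RNN(R) is the sole best on B@2, B@3, B@4 and M, ties with MFA-LSTM(R+C) on B@1, and trails SCN-LSTM(R+C) on C.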

Model B@4 M C RL
MP-LSTM(V)[25] 34.8 24.8 - -
MP-LSTM(C) 35.4 24.8 - -
MP-LSTM(V+C) 35.8 25.3 - -
SA-LSTM(V)[11] 35.6 25.4 - -
SA-LSTM(C) 36.1 25.7 - -
SA-LSTM(V+C) 36.6 25.9 - -
MFA-LSTM(R+C)[13] 39.2 26.9 44.6 60.1
MS-RNN(R) 39.8 26.1 40.9 59.3
TABLE V: Experimental results on the MSR-VTT dataset. SA-LSTM runs employ soft attention over the frame-level features extracted from a deep network, while MP-LSTM and our method use mean pooling over the frame-level video features.

IV-F Comparison Results on MSR-VTT Dataset

In this subsection, we compare our method with MP-LSTM [25] and SA-LSTM [11] on the MSR-VTT dataset. To obtain the appearance features, MP-LSTM and our MS-RNN use a mean pooling strategy, while SA-LSTM uses a soft attention mechanism. In theory, soft attention is more complex than mean pooling, but usually provides better visual features. The experimental results are shown in Tab.V, and we have the following observations:

  • MS-RNN achieves promising performance, with 39.8% B@4, 26.1% M, 40.9% C and 59.3% RL on the MSR-VTT dataset.

  • Overall, with the same visual input (VGG-19, C3D, or VGG-19+C3D), SA-LSTM performs better than MP-LSTM. However, SA-LSTM is based on soft attention; in other words, in theory it takes better visual features as inputs. Our MS-RNN (R) outperforms MP-LSTM (VGG-19+C3D) with increases of 4% on B@4 and 0.8% on M, and outperforms SA-LSTM (VGG-19+C3D) with an increase of 3.2% on B@4. Compared with MFA-LSTM (R+C), our model achieves comparable results on B@4, M and RL by using a single feature (R).
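The difference between the two frame-aggregation strategies discussed in this subsection can be sketched in a few lines. The code below is a simplified illustration, not the implementation of MP-LSTM, SA-LSTM or MS-RNN: the dot-product scoring is a stand-in for the learned scoring function of [11], and the feature values, the `QUERY` vector and the function names are hypothetical.

```python
import math


def mean_pool(frames):
    """Average frame-level feature vectors into one video-level vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def soft_attention_pool(frames, query):
    """Weight frames by softmax(frame . query); return (pooled vector, weights)."""
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * x for w, x in zip(weights, col)) for col in zip(*frames)]
    return pooled, weights


# Toy 2-d features for 3 frames, assumed for illustration only.
FRAMES = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical decoder state that favors the first feature dimension.
QUERY = [2.0, 0.0]

if __name__ == "__main__":
    print("mean pooling:  ", mean_pool(FRAMES))
    pooled, weights = soft_attention_pool(FRAMES, QUERY)
    print("soft attention:", pooled, "weights:", weights)
```

Mean pooling is the special case in which every frame receives the same weight: with a zero query vector, `soft_attention_pool` returns the same vector as `mean_pool`. Soft attention lets the weights change at every decoding step, which is where its extra cost and flexibility come from.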

V Conclusions and Future Work

In this paper, we propose a Multimodal Stochastic Recurrent Neural Network (MS-RNN) framework for video captioning. This work has shown how to extend the modeling capabilities of RNNs by approximating both the prior distribution and the true posterior distribution with a nonlinear latent layer (S-LSTM). In addition, MS-RNN achieves state-of-the-art performance with only mean-pooled video appearance features and is comparable with counterparts that take both video appearance and motion features. Last but not least, the proposed model can be applied to a wide range of video analysis applications.

In the future, we will integrate the state-of-the-art attention mechanism [11] with our model to further improve the video captioning performance. Moreover, the motion feature will be considered.


  • [1] A. Kojima, T. Tamura, and K. Fukunaga, “Natural language description of human activities from video images based on concept hierarchy of actions,” International Journal of Computer Vision, vol. 50, no. 2, pp. 171–184, 2002.
  • [2] M. W. Lee, A. Hakeem, N. Haering, and S. Zhu, “SAVE: A framework for semantic annotation of visual events,” in Computer Vision and Pattern Recognition, 2008, pp. 1–8.
  • [3] M. U. G. Khan, L. Zhang, and Y. Gotoh, “Human focused video description,” in International Conference on Computer Vision, 2011, pp. 1480–1487.
  • [4] P. Hanckmann, K. Schutte, and G. J. Burghouts, “Automated textual descriptions for a wide range of video events with 48 human actions,” in ECCV, 2012, pp. 372–380.
  • [5] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2014.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [7] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [8] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” CoRR, vol. abs/1412.3555, 2014.
  • [9] S. Venugopalan, M. Rohrbach, J. Donahue, R. J. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence - video to text,” in International Conference on Computer Vision, 2015, pp. 4534–4542.
  • [10] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, “Jointly modeling embedding and translation to bridge video and language,” in Computer Vision and Pattern Recognition, 2016, pp. 4594–4602.
  • [11] L. Yao, A. Torabi, K. Cho, N. Ballas, C. J. Pal, H. Larochelle, and A. C. Courville, “Describing videos by exploiting temporal structure,” in International Conference on Computer Vision, 2015, pp. 4507–4515.
  • [12] Z. Guo, L. Gao, J. Song, X. Xu, J. Shao, and H. T. Shen, “Attention-based LSTM with semantic consistency for videos captioning,” in ACM MM, 2016, pp. 357–361.
  • [13] X. Long, C. Gan, and G. de Melo, “Video captioning with multi-faceted attention,” CoRR, vol. abs/1612.00234, 2016.
  • [14] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention,” in Computer Vision and Pattern Recognition, 2016, pp. 4651–4659.
  • [15] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, “Boosting image captioning with attributes,” CoRR, vol. abs/1611.01646, 2016.
  • [16] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, “Hierarchical recurrent neural encoder for video representation with application to captioning,” in Computer Vision and Pattern Recognition, 2016, pp. 1029–1038.
  • [17] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu, “Video paragraph captioning using hierarchical recurrent neural networks,” in Computer Vision and Pattern Recognition, 2016, pp. 4584–4593.
  • [18] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014.
  • [19] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
  • [20] M. I. Jordan, “Serial order: A parallel distributed processing approach.” vol. 121, p. 64, 1986.
  • [21] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, “Recurrent neural network based language model,” in INTERSPEECH, 2010, pp. 1045–1048.
  • [22] A. Farhadi, S. M. M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. A. Forsyth, “Every picture tells a story: Generating sentences from images,” in ECCV, 2010, pp. 15–29.
  • [23] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele, “Translating video content to natural language descriptions,” in International Conference on Computer Vision, 2013, pp. 433–440.
  • [24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Computer Vision and Pattern Recognition, 2015, pp. 1–9.
  • [25] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko, “Translating videos to natural language using deep recurrent neural networks,” in NAACL HLT, 2015, pp. 1494–1504.
  • [26] C. Gan, Z. Gan, X. He, J. Gao, and L. Deng, “Stylenet: Generating attractive visual captions with styles,” in CVPR, 2017, pp. 3137–3146.
  • [27] X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing, “Recurrent topic-transition gan for visual paragraph generation,” arXiv preprint arXiv:1703.07022, 2017.
  • [28] F. Li, C. Gan, X. Liu, Y. Bian, X. Long, Y. Li, Z. Li, J. Zhou, and S. Wen, “Temporal modeling approaches for large-scale youtube-8m video understanding,” arXiv preprint arXiv:1707.04555, 2017.
  • [29] Z. Wang, F. Wu, W. Lu, J. Xiao, X. Li, Z. Zhang, and Y. Zhuang, “Diverse image captioning via grouptalk,” in IJCAI, 2016, pp. 2957–2964.
  • [30] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” CoRR, vol. abs/1703.04977, 2017.
  • [31] Y. Li and Y. Gal, “Dropout inference in bayesian neural networks with alpha-divergences,” CoRR, vol. abs/1703.02914, 2017.
  • [32] Y. Gal, R. McAllister, and C. E. Rasmussen, “Improving PILCO with Bayesian neural network dynamics models,” in Data-Efficient Machine Learning workshop, ICML, 2016.
  • [33] L. Uusitalo, A. Lehikoinen, I. Helle, and K. Myrberg, “An overview of methods to evaluate uncertainty of deterministic models in decision support,” Environmental Modelling and Software, vol. 63, pp. 24 – 31, 2015.
  • [34] K. Sohn, H. Lee, and X. Yan, “Learning structured output representation using deep conditional generative models,” in NIPS, 2015, pp. 3483–3491.
  • [35] R. G. Krishnan, U. Shalit, and D. Sontag, “Deep kalman filters,” CoRR, vol. abs/1511.05121, 2015.
  • [36] I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. C. Courville, and Y. Bengio, “A hierarchical latent variable encoder-decoder model for generating dialogues,” in AAAI, 2017, pp. 3295–3301.
  • [37] M. Fraccaro, S. K. Sønderby, U. Paquet, and O. Winther, “Sequential neural models with stochastic layers,” in NIPS, 2016, pp. 2199–2207.
  • [38] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio, “A recurrent latent variable model for sequential data,” in NIPS, 2015, pp. 2980–2988.
  • [39] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015.
  • [40] X. Chen and C. L. Zitnick, “Learning a recurrent visual representation for image caption generation,” CoRR, vol. abs/1411.5654, 2014.
  • [41] H. Fang, S. Gupta, F. N. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig, “From captions to visual concepts and back,” in Computer Vision and Pattern Recognition, 2015, pp. 1473–1482.
  • [42] M. D. Zeiler, “ADADELTA: an adaptive learning rate method,” CoRR, vol. abs/1212.5701, 2012.
  • [43] D. Chen and W. B. Dolan, “Collecting highly parallel data for paraphrase evaluation,” in ACL HLT, 2011, pp. 190–200.
  • [44] J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description dataset for bridging video and language,” in Computer Vision and Pattern Recognition, 2016, pp. 5288–5296.
  • [45] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in ACL, 2002, pp. 311–318.
  • [46] M. J. Denkowski and A. Lavie, “Meteor universal: Language specific translation evaluation for any target language,” in The Workshop on Statistical Machine Translation, 2014, pp. 376–380.
  • [47] R. Vedantam, C. L. Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
  • [48] C. Flick, “Rouge: A package for automatic evaluation of summaries,” in The Workshop on Text Summarization Branches Out, 2004, p. 10.
  • [49] X. Chen, H. Fang, T. Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick, “Microsoft coco captions: Data collection and evaluation server,” Computer Science, 2015.
  • [50] D. Furcy and S. Koenig, “Limited discrepancy beam search,” in IJCAI, 2005, pp. 125–131.
  • [51] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, “Semantic compositional networks for visual captioning,” CoRR, vol. abs/1611.08002, 2016.