I Introduction
With the explosive growth of online videos over the past decade, video captioning has become a hot research topic. In a nutshell, video captioning is the problem of translating a video into meaningful textual sentences that describe its visual content. Solving this problem has the potential to benefit various applications, from video indexing and search to human-robot interaction.
Building on the pioneering work of Kojima et al. [1], a series of studies was conducted to develop a first generation of video captioning systems [2, 3, 4]. Recently, however, the development of these systems has increasingly relied on deep neural networks (DNNs), which have proven effective in both computer vision (e.g., image classification and object detection) and natural language understanding (e.g., machine translation and language modeling), forming the two technological pillars of video captioning solutions. In particular, deep Convolutional Neural Networks (CNNs) (e.g., VggNet [5] and ResNet [6]) have been widely deployed to extract representative visual features, while Recurrent Neural Networks (RNNs) (e.g., Long Short-Term Memory (LSTM) [7] and the Gated Recurrent Unit (GRU) [8]) have been deployed to translate sequences of term vectors into natural language sentences. Despite the significant conceptual and computational complexity of these DNN-based models, their effectiveness has given rise to the so-called encoder-decoder scheme as a popular modern approach to video captioning. In this scheme, typically a CNN is used as the encoder and an RNN as the decoder. This approach has shown better performance than traditional video captioning methods based on hand-crafted features.
Recent efforts towards developing and implementing an encoder-decoder scheme for video captioning have mainly focused on the following questions: 1) How can an encoder-decoder framework more efficiently and effectively bridge the gap between video and language [9]? 2) How can video captioning be facilitated using semantic information [10]? 3) How can an attention mechanism be deployed to help decide what visual information to extract from a video [11, 12]? 4) How can attributes/key concepts be extracted from sentences to enhance video captioning [13, 14, 15]? Numerous approaches have been proposed to address these questions [10, 16, 11, 12, 17].
However, the above-mentioned approaches are deterministic and do not incorporate uncertainty (i.e., both subjective judgment and model uncertainty) at any stage of the modeling. First, video captioning is in essence a complex process involving many factors, such as the video itself, description intents, and personal characteristics and experiences. Except for the video content, these factors are inherently random and unpredictable. For example, in Fig. 1, we asked three people to describe two videos separately, and they provided different descriptions for each video. This indicates that video captioning is subjective and uncertain. Second, video captioning models are abstractions of the natural captioning process: they leave out less important components and keep only the relevant and prominent ones, which gives rise to model uncertainty. Both kinds of uncertainty are ignored in previous work.
Therefore, in this paper we focus on dealing with the above uncertainties and aim to capture the true nature of video captioning. We propose a novel approach, namely multimodal stochastic RNNs (MS-RNN), which models the uncertainty observed in the data using latent stochastic variables. Our method is inspired by the variational autoencoder (VAE) [18], which uses a set of latent variables to capture latent information. Our work makes the following contributions: 1) We propose a novel end-to-end MS-RNN approach for video captioning. To our knowledge, this is the first approach to video captioning that takes uncertainty, both subjective judgment and model uncertainty, into consideration. Therefore, for each video, our model can generate multiple sentences describing it from different aspects. 2) We propose a multimodal Long Short-Term Memory (M-LSTM) layer, which incorporates features from different information sources (i.e., visual and word) into a set of higher-level representations by adjusting the weights on each individual source, thereby improving captioning performance. 3) We develop a novel backward stochastic LSTM (S-LSTM) mechanism to model uncertainty in a latent process through latent variables. With S-LSTM, uncertainty is expressed in the form of a probability distribution over latent variables, and it can be modeled into a prior distribution by exploiting the consistency between the prior and posterior distributions. 4) The proposed model is evaluated on two challenging datasets, MSVD and MSR-VTT. The experimental results show that our method achieves superior performance in video captioning. Note that our model only utilizes the appearance features of videos, and no attention mechanism is incorporated.
II Related Work
II-A Recurrent Neural Networks
Recurrent Neural Networks (RNNs) [19, 20] contain directed cycles connecting their units. This mechanism allows them to process arbitrary sequential data streams, and RNNs have therefore been widely used in computational linguistics with great success. Taking language modeling as an example, RNNs model a sequential data stream $x = (x_1, \dots, x_T)$ (e.g., a sentence) by decomposing the probability distribution over outputs:

$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$   (1)

At each time step, an RNN observes an element $x_t$ and updates its internal state $h_t = f_{\theta}(h_{t-1}, x_t)$, where $f_{\theta}$ is a deterministic nonlinear function and $\theta$ indicates a set of parameters. The probability distribution over $x_t$ is parametrized as $p(x_t \mid x_1, \dots, x_{t-1}) = g(h_{t-1})$. The RNN Language Model (RNNLM) [21] parametrizes the output distribution by applying a softmax function to the previous hidden state $h_{t-1}$. To learn the model's parameters, RNNLM maximizes the log-likelihood using gradient descent. However, most existing RNN models propagate deterministic hidden states.
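As a minimal sketch of one step of such an RNN language model (dimensions and random, untrained weights below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 10, 8
W = rng.normal(scale=0.1, size=(hidden, vocab))   # input-to-hidden weights
U = rng.normal(scale=0.1, size=(hidden, hidden))  # hidden-to-hidden weights
V = rng.normal(scale=0.1, size=(vocab, hidden))   # hidden-to-output weights

def rnn_lm_step(x_onehot, h_prev):
    """One RNN-LM step: update the hidden state, then emit p(x_t | x_<t)."""
    h = np.tanh(W @ x_onehot + U @ h_prev)        # deterministic state update
    logits = V @ h
    p = np.exp(logits - logits.max())             # stable softmax
    return h, p / p.sum()

h = np.zeros(hidden)
x = np.eye(vocab)[3]                              # one-hot for an arbitrary token
h, p = rnn_lm_step(x, h)
```

Note that, as the text observes, the hidden state `h` here is fully deterministic given the inputs.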
II-B Visual Captioning
The visual captioning problem has been studied for many years. In 2002, a video captioning system [1] was proposed to describe human behavior: the method first detects visual information (i.e., the position of the head, the direction of the head, and the positions of the hands) to determine where the person is and what gesture the person makes, then selects an appropriate predicate, object, etc. with domain knowledge, and finally applies syntactic rules to generate a whole sentence. Following this work, a series of studies applied such techniques to enhance different multimedia applications [2, 3, 4]. Some works tackle the problem with probabilistic graphical models. Farhadi et al. [22] introduce a meaning space, represented as ⟨object, action, scene⟩ triplets in the form of a Markov Random Field (MRF), and map images and sentences into this space to find the relationship between them. Rohrbach et al. [23] model the relationship between different components of the visual information with a Conditional Random Field (CRF), and then treat captioning as a machine translation problem to generate sentences.
Inspired by the recent advances in image classification using CNNs (e.g., VggNet [5], GoogLeNet [24] and ResNet [6]) and in machine translation using RNNs, there have been several attempts [10, 25, 16, 17, 11] to address video caption generation by first adopting an efficient CNN to extract video appearance features and then utilizing an RNN that takes the video features and the previously predicted words to infer a new word through a softmax [26, 27, 28]. To further improve performance, more complex approaches [11, 10, 17] have been proposed from different aspects. Specifically, Yao et al. [11] adopted a spatio-temporal convolutional neural network (3-D CNN) for capturing video motion information and a soft attention mechanism to select relevant frame-level features for video captioning. Pan et al. [10] incorporated the semantic relationship between sentence and visual content, while Yu et al. [17] proposed a hierarchical framework consisting of a sentence generator that describes a specific short video interval and a paragraph generator that captures inter-sentence dependency. However, all of these methods treat video captioning as a deterministic problem that can generate only one output, which violates the nature of video captioning: taking different hidden factors (e.g., intention and experience) into consideration, a trained model should be able to output different sentences. Note that the model introduced in [29] can also generate diverse sentences for image captioning, but it uses different LSTMs to generate different sentences (the number of LSTMs equals the number of sentences), so it contains no uncertain factors and does not capture the uncertainty in the captioning problem.
II-C What is Uncertainty?
From the management point of view, uncertainty is the lack of exact knowledge, regardless of the cause of this deficiency [30, 31, 32]. Models provide a way to clarify our understanding of our knowledge gaps, but in real life, understanding the average behavior of a process is often not sufficient, and it is impossible to predict outcomes with certainty [33]. In general, besides language uncertainty, uncertainty can be classified into six major types [33, 30]: 1) measurement error, resulting from imperfections in measuring devices and observational techniques; 2) systematic error, which occurs as the result of bias in the measuring devices or the sampling process; 3) natural variation, which occurs in systems that change with respect to time, space, or other dimensions; 4) inherent randomness, which results from a system that is irreducible to a deterministic one; 5) model uncertainty, which mainly arises because the mathematical and computer models used for predicting future events or answering questions under specific scenarios are simplifications of reality; and 6) subjective judgment, which occurs as a result of the interpretation of data: without sufficient data, an expert's judgment will be based on observation and experience. All of these uncertainties are hidden factors affecting the results of video captioning, and we propose to model them using latent stochastic variables.

II-D Variational Autoencoder
As discussed above, we need a method to capture the uncertainty in the video captioning problem. But how can we model this uncertainty? The variational autoencoder (VAE) [18] offers a good way to solve this problem. To capture the variations in the observed variables $x$, the VAE introduces a set of latent random variables $z$ and rewrites the objective function as a lower bound on the log-likelihood:

$\log p(x) \geq \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - \mathrm{KL}(q(z \mid x) \,\|\, p(z))$   (2)

where $\mathrm{KL}(q \,\|\, p)$ is the Kullback-Leibler divergence between the two distributions $q(z \mid x)$ and $p(z)$, which measures the non-symmetric difference between two probability distributions, and $q(z \mid x)$ is an approximate posterior distribution, introduced to avoid solving the intractable true posterior. In [18], the VAE was used to generate handwritten digits, so it needs to decide not just which number is written, but also the angle, the stroke width, and abstract stylistic properties; the model therefore uses a set of latent random variables to capture this latent information. Inspired by this, we also use latent variables with a stochastic layer to capture the uncertainty in video captioning. Unlike generating digits, the video captioning task must generate different sentences conditioned on the content of the video, so our objective function is a conditional probability, and we use the loss function introduced in the conditional variational autoencoder (CVAE) [34], which extends the VAE to conditional probability distributions. Krishnan et al. [35] compared different variational models, guiding us in choosing an effective one. There are also works extending the VAE to RNNs [36, 37, 38] for generating speech or music signals. All of these works inspire us to recast the captioning problem as a problem under uncertainty.

III The Proposed Approach
In this section, we introduce our approach for video captioning, which follows the conventional encoder-decoder framework and is based purely on neural networks to generate video descriptions. The decoder, named Multimodal Stochastic Recurrent Neural Network (MS-RNN) (see Fig. 2), is our major contribution. We first introduce the architecture of our proposed network, and then devise the loss function and optimization.
III-A Problem Formulation
Given a video v with $n$ frames, we extract frame-level features so that v can be represented as $\{v_1, \dots, v_n\}$, where $v_i \in \mathbb{R}^{D_v}$ and $D_v$ is the dimension of the frame-level features. For each v, we also have a textual sentence a describing it, consisting of $m$ words, represented as $\{a_1, \dots, a_m\}$. Specifically, $a_i \in \{0,1\}^{D_a}$ is a one-hot vector, where $D_a$ is the size of the vocabulary. Given a video, our model predicts one word at a time until a complete sentence describing the input video has been generated. In detail, at the $t$-th time step, our model utilizes v and the previous words $a_{<t}$ to predict the word $a_t$ with maximal probability $p(a_t \mid v, a_{<t})$, until it reaches the end of the sentence. In addition, we use a special mark to denote the end of a sentence.
III-B Encoder
The goal of the encoder is to compute feature vectors that are compact, representative, and able to capture the visual information most relevant to the decoder. Specifically, it encodes the input v into a continuous representation, which may be a variable-sized set $\{v_1, \dots, v_n\}$. Deep convolutional neural networks (CNNs) have achieved great success in large-scale image recognition [6], object detection [39], and visual captioning [9], and high-level features can be extracted from the upper or intermediate layers of a deep CNN. Therefore, a set of well-tested CNNs, such as the ResNet-152 model [6], which achieved the best performance in the ImageNet Large Scale Visual Recognition Challenge, can be used as candidate encoders for our framework. With such a CNN architecture, we can process each frame to extract representative frame-level features.
To encode the sentence, because of the sparsity of the one-hot vectors $a_i$, we follow previous work [11, 10] and process each one-hot vector with an embedding. We set a parameter matrix $E \in \mathbb{R}^{D_s \times D_a}$ to map the one-hot vectors a to embeddings s as follows:

$s_i = E a_i$   (3)

The embeddings $s = \{s_1, \dots, s_m\}$ and the video features are input to the next step. In addition, the end-of-sentence mark is mapped in the same way.
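The embedding of Eq. (3) can be illustrated in a few lines; the shapes below are assumptions for illustration only. Multiplying a one-hot vector by the embedding matrix is equivalent to selecting one of its columns:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, embed_dim = 6, 4
E = rng.normal(size=(embed_dim, vocab_size))  # embedding parameter matrix

a = np.zeros(vocab_size)
a[2] = 1.0                                    # one-hot vector for word index 2
s = E @ a                                     # embedded word feature (Eq. (3))

assert np.allclose(s, E[:, 2])                # identical to a column lookup
```

In practice, frameworks implement this as a direct table lookup rather than a matrix product, precisely because of this equivalence.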
III-C Decoder with MS-RNN
The MS-RNN consists of three core components, as shown in Fig. 2: a basic LSTM layer for extracting word-level features, a multimodal LSTM layer (M-LSTM) for encoding multi-view information (visual and textual features) simultaneously and chronologically, and a backward stochastic LSTM layer (S-LSTM) for introducing latent variables.
III-C1 LSTM for Word Features
In our MS-RNN model, a basic LSTM layer takes the word embeddings $s_t$ as input and outputs word features $h_t$ with encoded temporal information:

$h_t = \mathrm{LSTM}(s_t, h_{t-1})$   (4)

More specifically, a standard LSTM unit consists of three gates: a "forget gate" ($f_t$) that decides what information to discard from the cell state; an "input gate" ($i_t$) that decides what new information to store in the cell state; and an "output gate" ($o_t$) that controls the extent to which the value in memory is used to compute the output activation of the block. A standard LSTM can be defined as:
$i_t = \sigma(W_i s_t + U_i h_{t-1} + b_i)$
$f_t = \sigma(W_f s_t + U_f h_{t-1} + b_f)$
$o_t = \sigma(W_o s_t + U_o h_{t-1} + b_o)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c s_t + U_c h_{t-1} + b_c)$
$h_t = o_t \odot \tanh(c_t)$   (5)

where $\sigma$ is a sigmoid function, $\tanh$ denotes the hyperbolic tangent function, $c_t$ is the cell state vector, $h_t$ is the output vector, $i_t$, $f_t$, and $o_t$ are sigmoid gates, the $W$ and $U$ matrices are sets of parameters, $\odot$ denotes element-wise multiplication, and the $b$ terms are bias values. Then, for each word $a_t$, we extract its word features as $h_t$.

III-C2 Multimodal LSTM Layer
Next, an M-LSTM layer takes the word features $h_t$ and a video-level feature $V$ as inputs and fuses them into a higher-level feature $u_t$:

$u_t = \text{M-LSTM}(h_t, V, u_{t-1})$   (6)

Here, instead of using an advanced but complex temporal or spatial attention mechanism to select a video-level feature, we use a basic mean pooling strategy to obtain $V$:

$V = \frac{1}{n} \sum_{i=1}^{n} v_i$   (7)

The motivation is that if our model can improve captioning performance even with this basic use of the visual features, the advantages of our MS-RNN are manifest. As shown in [11, 12], however, an attention mechanism could further boost performance.
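Eq. (7) in code form; the frame count and feature dimension below are illustrative assumptions (2048 matches a typical ResNet pooling layer):

```python
import numpy as np

# Mean pooling over frame-level features (Eq. (7)).
frames = np.random.default_rng(3).normal(size=(26, 2048))  # n frames x D_v
video_feature = frames.mean(axis=0)                        # video-level feature V
```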
The Multimodal LSTM (M-LSTM) is a novel variant of the LSTM: it not only inherits the numerical stability of the LSTM but also generates plausible features from multi-view sources. We choose the LSTM as our basic RNN unit for the following reasons: 1) it has achieved great success in machine translation, speech recognition, and image and video captioning [40, 41, 9]; and 2) compared with basic RNN units, it is well suited to handling the "long-term dependencies" problem.
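As a rough sketch of this multimodal fusion (not the exact formulation; dimensions and weights below are illustrative assumptions), a single M-LSTM-style gate projects each modality into a common space and sums the projections inside the activation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
d_word, d_vid, d_h = 6, 8, 4                    # illustrative dimensions
W1 = rng.normal(scale=0.1, size=(d_h, d_word))  # projects word features
W2 = rng.normal(scale=0.1, size=(d_h, d_vid))   # projects video features
U  = rng.normal(scale=0.1, size=(d_h, d_h))     # recurrent weights
b  = np.zeros(d_h)

def mlstm_gate(x_word, x_vid, u_prev):
    """One gate: both modalities are projected and summed before the sigmoid."""
    return sigmoid(W1 @ x_word + W2 @ x_vid + U @ u_prev + b)

g = mlstm_gate(rng.normal(size=d_word), rng.normal(size=d_vid), np.zeros(d_h))
```

The same pattern applies to each of the input, forget, output, and cell-update projections in the full unit.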
Given two modalities $x^{(1)}_t$ and $x^{(2)}_t$ as inputs, along with initialized vectors $u_0$ and $c_0$, an M-LSTM can fuse them and extract a higher-level feature. An M-LSTM unit can be described as below:

$i_t = \sigma(W^{(1)}_i x^{(1)}_t + W^{(2)}_i x^{(2)}_t + U_i u_{t-1} + b_i)$
$f_t = \sigma(W^{(1)}_f x^{(1)}_t + W^{(2)}_f x^{(2)}_t + U_f u_{t-1} + b_f)$
$o_t = \sigma(W^{(1)}_o x^{(1)}_t + W^{(2)}_o x^{(2)}_t + U_o u_{t-1} + b_o)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W^{(1)}_c x^{(1)}_t + W^{(2)}_c x^{(2)}_t + U_c u_{t-1} + b_c)$
$u_t = o_t \odot \tanh(c_t)$   (8)

To obtain an abstract concept from the two modalities, the M-LSTM first projects $x^{(1)}_t$ and $x^{(2)}_t$ into a common feature space, and the inside gates then add the projections together under an activation function. At each time step $t$, we thus extract a higher-level feature $u_t$.

III-C3 Backward Stochastic LSTM
In this subsection, we introduce our Backward Stochastic LSTM (S-LSTM), which takes the output of the M-LSTM and approximates the posterior distributions over the latent variables $z = \{z_1, \dots, z_m\}$, where $z_t \in \mathbb{R}^{D_z}$. The S-LSTM consists of two units: a backward LSTM unit and a stochastic unit. We denote the output of the backward LSTM as $\overleftarrow{h}_t$.
For the backward LSTM unit at time step $t$, the output is defined as:

$\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}([u_t; s_t], \overleftarrow{h}_{t+1})$   (9)

where $u_t$ is the output of the M-LSTM at time step $t$, $s_t$ is the output of the embedding layer, and the initial backward state is set to the zero vector. The form of $\overleftarrow{\mathrm{LSTM}}$ is similar to Eq. (5), but it processes the sequence in the backward direction. The output of the backward LSTM at time step $t$ thus depends on the present input and the future output $\overleftarrow{h}_{t+1}$. This is because, in the stochastic unit, the posterior distribution of $z_t$, calculated with Eq. (15), does not depend on past outputs and deterministic states, but on the present and future ones. Therefore, we propose to use the backward LSTM to extract future information and incorporate it with a stochastic layer to achieve our goal.
Fig. 3 illustrates the structure of the stochastic unit. To obtain $z_t$, we utilize the "reparameterization trick" introduced in [18]. This trick randomly samples a vector $\epsilon$ from a standard Gaussian distribution; if we assume $z_t \sim \mathcal{N}(\mu_t, \mathrm{diag}(\sigma_t^2))$, we can use $z_t = \mu_t + \sigma_t \odot \epsilon$ to calculate $z_t$. Next, we need to learn $\mu_t$ and $\sigma_t$. In detail, the stochastic unit takes $\overleftarrow{h}_t$ and $z_{t-1}$ as input to approximate the posterior mean and variance with two feed-forward networks, each of which contains two fully connected layers.
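The reparameterization trick itself is simple to state in code (the mean and standard deviation below are illustrative values, not learned parameters):

```python
import numpy as np

# z = mu + sigma * eps with eps ~ N(0, I) yields a sample from
# N(mu, diag(sigma^2)) while keeping z differentiable w.r.t. mu and sigma.
rng = np.random.default_rng(5)
mu = np.array([0.5, -1.0, 2.0])       # illustrative mean
sigma = np.array([0.1, 0.2, 0.3])     # illustrative standard deviation
z = mu + sigma * rng.standard_normal(3)

# Sanity check: many such samples match the target distribution's statistics.
samples = mu + sigma * rng.standard_normal((100_000, 3))
```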
$[\mu^q_t, (\sigma^q_t)^2] = f^q([\overleftarrow{h}_t; z_{t-1}])$   (10)

where $[\cdot\,;\cdot]$ is a concatenation operation. In addition, the stochastic unit also takes the forward state $u_t$ and $z_{t-1}$ to approximate the prior mean and variance with two further feed-forward networks:

$[\mu^p_t, (\sigma^p_t)^2] = f^p([u_t; z_{t-1}])$   (11)

For training, we set $\mu^q_t = \mu^p_t + \hat{\mu}^q_t$; this method, introduced in [37], improves the posterior approximation by making use of the prior mean. For testing, we sample $z_t$ from the prior, and we set $z_0$ to the zero vector at the beginning. To output a symbol $a_t$, a probability distribution over the set of possible words is obtained using $u_t$ and $z_t$:

$p(a_t \mid a_{<t}, z_{\le t}, v) = \mathrm{softmax}(W_p [u_t; z_t] + b_p)$   (12)

where $W_p$ and $b_p$ are parameters to be learned. The output of the softmax layer is interpreted as a probability distribution over words.

III-D Loss Function
Based on variational inference and the conditional variational autoencoder (CVAE) proposed in [34], we define the following objective:

$\mathcal{L} = \mathbb{E}_{q(z \mid a, v)}[\log p(a \mid z, v)] - \mathrm{KL}(q(z \mid a, v) \,\|\, p(z \mid v))$   (13)

where $\mathcal{L}$ is the evidence lower bound (ELBO) of the log-likelihood, and $q(z \mid a, v)$ is an approximate posterior distribution that aims to approximate the intractable true posterior. The first term is an expected log-likelihood under $q$, which can be written as:

$\mathbb{E}_{q(z \mid a, v)}\Big[\textstyle\sum_{t} \log p(a_t \mid a_{<t}, z_{\le t}, v)\Big]$   (14)

Here, we process the concatenation vector $[u_t; z_t]$ with a softmax layer, as in Eq. (12), to approximate $p(a_t \mid a_{<t}, z_{\le t}, v)$.
The second term, namely the KL term, is the Kullback-Leibler divergence, which measures the non-symmetric difference between the two probability distributions $q$ and $p$. In our work, we choose the variational model introduced in [35] to factorize the posterior distribution. The posterior and prior distributions are factorized as below:

$q(z \mid a, v) = \prod_{t} q(z_t \mid z_{t-1}, a, v)$   (15)

$p(z \mid v) = \prod_{t} p(z_t \mid z_{t-1}, v)$   (16)

To approximate $q$ and $p$, we first use a backward LSTM layer to encode a (which has been embedded into s, as mentioned in Eq. (3)) and u into $\overleftarrow{h}$, and then utilize the method from Sec. III-C3 to approximate the means and variances of $q$ and $p$. The Kullback-Leibler divergence between the two resulting diagonal Gaussians at the $t$-th time step can then be calculated in closed form:

$\mathrm{KL}_t = \frac{1}{2} \sum_{j} \Big[ \log \frac{(\sigma^p_{t,j})^2}{(\sigma^q_{t,j})^2} + \frac{(\sigma^q_{t,j})^2 + (\mu^q_{t,j} - \mu^p_{t,j})^2}{(\sigma^p_{t,j})^2} - 1 \Big]$   (17)

For the whole sentence, we calculate the global Kullback-Leibler divergence by:

$\mathrm{KL} = \sum_{t=1}^{m} \mathrm{KL}_t$   (18)
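The closed-form per-step KL between two diagonal Gaussians can be sketched as follows (the dimension index is handled by vectorization; the example values are illustrative):

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) in closed form."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

mu = np.array([0.0, 1.0])
var = np.array([1.0, 2.0])
kl_same = kl_diag_gauss(mu, var, mu, var)        # identical distributions: 0
kl_far  = kl_diag_gauss(mu + 1.0, var, mu, var)  # shifted mean: positive
```

The global KL is then simply the sum of these per-step values over all time steps.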
In this paper, we maximize the above objective to learn all the parameters. More specifically, we use the back-propagation through time (BPTT) algorithm to compute the gradients and conduct the optimization with Adadelta [42].

IV Experiment
We evaluate our model on two standard video captioning benchmark datasets: the widely used Microsoft Video Description dataset (MSVD) [43] and the large-scale MSR Video-to-Text dataset (MSR-VTT) [44].
MSVD: This dataset consists of short video clips collected from YouTube, with an average length of about 9 seconds. It contains about 80,000 clip-description pairs labeled by Amazon Mechanical Turk (AMT) workers; in other words, each clip has multiple sentence descriptions, and together the descriptions cover a large vocabulary of unique words. Following previous work [10, 16, 17], we split this dataset into training, validation, and testing sets.
MSR-VTT: This dataset was proposed by Xu et al. [44] in 2016 to provide a new large-scale video benchmark for video understanding, especially for the task of translating videos to text. It consists of web video clips, each annotated with 20 natural sentences by AMT workers. The dataset was collected from a commercial video search engine and so far covers the most comprehensive categories and diverse visual content, representing the largest dataset in terms of sentences and vocabulary. We run our experiments on the updated version with sentence quality control, which is divided into training, validation, and testing subsets.
IV-A Evaluation Metrics
To evaluate the performance of our model, we use the following four evaluation metrics: BLEU [45], METEOR [46], CIDEr [47], and ROUGE-L [48]. The Microsoft COCO evaluation server [49] implements these metrics, so we directly call its evaluation functions to test captioning performance.

IV-B Experimental Settings
Video Appearance Feature Extraction. The experimental results of Xu et al. [44] show that different pooling methods (i.e., single frame, mean pooling, and soft attention) obtain different performance: both mean pooling and soft attention perform significantly better than a single frame, and soft attention performs slightly better than mean pooling, with 0.6% BLEU@4 and 0.6% METEOR increases, but involves more operations. Therefore, we apply mean pooling to a set of frame-level features to generate a representative video-level feature. In addition, we follow previous work [11] and uniformly sample frames from each clip to control frame duplication. Since deep convolutional neural networks (CNNs) have achieved great success in image feature extraction, in this paper we use ResNet-152 [6] and GoogLeNet [24], two state-of-the-art CNNs, to extract video frame-level features and analyze our model. The results show that ResNet features perform better (see Tab. III).

Sentence Preprocessing. For the MSVD dataset, we first convert all words to lowercase and then use the WordPunct tokenizer from the NLTK toolbox to tokenize the sentences and remove punctuation, obtaining a vocabulary from the MSVD training set. For the MSR-VTT dataset, we obtain a vocabulary from its training set after the same tokenization. For each dataset, we represent each word as a one-hot vector whose dimension equals the vocabulary size.
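This preprocessing step can be sketched with a regex equivalent of NLTK's WordPunct pattern (runs of word characters vs. runs of punctuation), dropping pure-punctuation tokens afterwards; treat this as an approximation of the actual pipeline:

```python
import re

def preprocess(sentence):
    """Lowercase, tokenize on the WordPunct pattern, drop punctuation tokens."""
    tokens = re.findall(r"\w+|[^\w\s]+", sentence.lower())
    return [t for t in tokens if re.match(r"\w", t)]

tokens = preprocess("A man is slicing a cucumber.")
# -> ['a', 'man', 'is', 'slicing', 'a', 'cucumber']
```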
Training Details. To deal with sentences of arbitrary length, we add a begin-of-sentence tag bos to start each sentence and an end-of-sentence tag eos to end it. During training, we maximize the objective by taking a video and its corresponding ground-truth sentence as the inputs.
In addition, in our experiments we choose an initial learning rate small enough to avoid gradient explosion and fix the beam size used for decoding. Empirically, we set all the M-LSTM unit sizes to 512, all the backward LSTM unit sizes to 512, the dimension of the latent variables to 256, and the word embedding size to 512. Our objective function, Eq. (13), is optimized over the whole set of training video-sentence pairs with a mini-batch size of 64 for MSVD and 256 for MSR-VTT. We stop training when 500 epochs are reached or when the evaluation metric has not improved on the validation set for 20 epochs (a patience of 20). In addition, we multiply the KL term by a scalar that starts at 0.01 and increases linearly to 1 over the first 20 epochs.
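The KL-annealing schedule just described can be sketched as:

```python
# Weight on the KL term: starts at 0.01, increases linearly to 1.0 over the
# first 20 epochs, then stays at 1.0.
def kl_weight(epoch, start=0.01, end=1.0, warmup=20):
    if epoch >= warmup:
        return end
    return start + (end - start) * epoch / warmup
```

Annealing the KL weight this way is a common device to keep the model from ignoring the latent variables early in training, when the reconstruction term is still weak.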
Testing Details. During testing, our model takes a video and the begin-of-sentence tag bos as inputs to generate a sentence describing the video. After the parameters are learned, we perform generation with beam search [50].
In addition, our model incorporates latent variables to capture the true nature of video captioning, and thus has the potential to describe a video from different aspects. We therefore fed the test videos into our trained model five times; each run's performance is shown in Tab. I, and we report the average. Moreover, Fig. 4 shows some output examples.
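A generic beam-search sketch is shown below. This is not the paper's implementation: `step_probs` is a hypothetical stand-in for the decoder's softmax output, which in the real model conditions on the video and the latent variables.

```python
import math

def beam_search(step_probs, beam_size, max_len, eos):
    """Keep the top-k partial sentences by total log-probability at each step."""
    beams = [([], 0.0)]                      # (token sequence, log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:       # finished hypothesis: keep as-is
                candidates.append((seq, score))
                continue
            for tok, p in enumerate(step_probs(seq)):
                candidates.append((seq + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

# Toy decoder over a 3-word vocabulary (token 0 = eos): prefers token 1 for
# two steps, then prefers ending the sentence.
def toy(seq):
    return [0.8, 0.1, 0.1] if len(seq) >= 2 else [0.05, 0.9, 0.05]

best = beam_search(toy, beam_size=2, max_len=3, eos=0)  # -> [1, 1, 0]
```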
Time  B@1  B@2  B@3  B@4  M  C  RL 

1  83.2  72.8  63.5  53.4  33.9  73.7  70.4 
2  82.7  72.6  63.7  53.6  34.0  75.2  70.3 
3  82.4  72.1  62.9  52.8  33.6  74.7  69.8 
4  83.0  72.8  63.6  53.8  34.0  76.6  70.4 
5  83.1  72.7  63.5  53.1  33.6  73.9  70.0 
mean  82.9  72.6  63.5  53.3  33.8  74.8  70.2 
IV-C Results on MSVD Dataset
In this paper, we propose to use the probability distribution of latent variables to depict uncertainty, so each run of our model may generate different descriptions. In this subsection, we run testing five times and report the results in Tab. I. The performance of each run is quite stable and reasonable. Inspecting the generated sentences (see Fig. 4), we can see that our model describes a video from various aspects, much as in real life, where humans provide various sentences to describe one video according to their intents.
IV-D Component Analysis
In this paper, we design two core components, an M-LSTM layer and an S-LSTM layer, which affect the performance of our algorithm. In this subsection, we study their contribution under the following two settings:

Only using M-LSTM for video captioning (M).

Incorporating both M-LSTM and S-LSTM for video captioning (M+S).

In this sub-experiment, we conduct the experiments on the MSVD dataset and use ResNet to extract frame features.
Tab. II lists the results, which demonstrate that our MS-RNN model with both M-LSTM and S-LSTM outperforms M-LSTM alone on all evaluation metrics, with 1.3% M, 3.3% C, and 1.0% RL performance increases.
Model  B@1  B@2  B@3  B@4  M  C  RL 

M  82.7  71.8  62.7  52.2  32.5  71.5  69.2 
M+S  82.9  72.6  63.5  53.3  33.8  74.8  70.2 
In Fig. 4, we show some example sentences generated by our approach, with only M-LSTM and with both M-LSTM and S-LSTM, respectively. From Fig. 4, we make the following observations:

Both M-LSTM and M-LSTM+S-LSTM are able to generate accurate descriptions for a video. In addition, the results generated by M-LSTM+S-LSTM are generally better than those of M-LSTM alone, which is consistent with the results reported in Tab. II.

M-LSTM is deterministic and can only generate one sentence, while M-LSTM+S-LSTM can produce different sentences.

In general, M-LSTM+S-LSTM provides more specific, comprehensive, and accurate descriptions than M-LSTM. For example, in the top-left example, M-LSTM generates "a women is playing a guitar", while M-LSTM+S-LSTM provides "a girl is singing" and "a women is playing with a guitar". In the bottom-middle example, M-LSTM provides the wrong description "cucumber", while M-LSTM+S-LSTM generates "vegetables" together with a set of verbs, "slicing, chopping and cutting".

Our MS-RNN model may produce duplicate as well as comprehensive results, which is consistent with the nature of video captioning.

The last column shows some failure cases. In the top-right example, both methods provide wrong descriptions, "cutting a cucumber" and "slicing a carrot". This is mainly because the MSVD dataset contains many videos about cooking and few about folding paper, which leads to overfitting. The middle-right example is also inaccurate, because both of our models take only video appearance features as inputs and ignore motion features. In the bottom-right example, our model does not correctly identify the number of objects.
Model  B@1  B@2  B@3  B@4  M  C 

LSTM-E(V)[10]  74.9  60.9  50.6  40.2  29.5   
h-RNN(V)[17]  77.3  64.5  54.6  44.3  31.1  62.1 
SA(G)[11]  79.1  63.2  51.2  40.6  29.0   
MFA-LSTM(R)[13]  81.3  69.8  60.5  50.4  32.2  69.8 
MS-RNN(G)  80.3  68.4  58.7  48.0  31.0  66.6 
MS-RNN(R)  82.9  72.6  63.5  53.3  33.8  74.8 
IV-E Comparison Results on MSVD Dataset
In this subsection, we conduct experiments to examine how different video representations affect video captioning and compare our model with existing approaches. All the approaches in this sub-experiment take only one type of video representation, extracted with VggNet (V), GoogLeNet (G), or ResNet (R). We conduct our experiments on the MSVD dataset.
Tab. III lists the experimental results, from which we make the following observations:

With only appearance features, our MS-RNN (R) model achieves the best performance on all evaluation metrics. Compared with the state-of-the-art method MFA-LSTM (R), our model achieves significantly better performance, with 1.6%, 2.8%, 3.0%, 2.9%, 1.6%, and 5.0% increases on B@1, B@2, B@3, B@4, M, and C, respectively.

For the video captioning task, ResNet-based video representations perform better than both VggNet-based and GoogLeNet-based features. Specifically, for our model, ResNet features perform better than GoogLeNet features, and across the experiments the approaches with ResNet-based features (SCN-LSTM and MFA-LSTM) perform better than the methods with GoogLeNet- or VggNet-based features.

Compared with methods using attention mechanisms, e.g., temporal attention [11], our MS-RNN (R) achieves even better results, with 3.8%, 9.4%, 12.3%, 12.7%, and 4.8% increases on B@1, B@2, B@3, B@4, and M, using only a simple mean pooling strategy. This indicates the advantages of our proposed MS-RNN.
Model  B@1  B@2  B@3  B@4  M  C 

LSTM-E(V+C)[10]  78.8  66.0  55.4  45.3  31.0   
SA(G+3D)[11]  80.0  64.7  52.6  42.2  29.6  51.7 
h-RNN(V+C)[17]  81.5  70.4  60.4  49.9  32.6  65.8 
MFA-LSTM(R+C)[13]  82.9  72.0  62.7  52.8  33.4  68.9 
SCN-LSTM(R+C)[51]        51.1  33.5  77.7 
MS-RNN(R)  82.9  72.6  63.5  53.3  33.8  74.8 
We also compare our methods with the others using multiple features. Specifically, in this subsection, we compare our model using only appearance features with six stateoftheart methods: LSTME(V+C) [10], SA(G+3DCNN) [11], HRNEAT(G+C) [16], hRNN(V+C) [17], MFALSTM(R+C) [13] and SCNLSTM(R+C) [51], which make use of both appearance and motion video features. Here, V and R are short for VggNet and ResNet, which are used to extract appearance features. 3D and C are short for 3DCNN and C3D, which are used to generate video motion features.
The experimental results are shown in Tab. IV. Although our model uses only appearance features, it outperforms the existing methods on B@2 (72.6%), B@3 (63.5%), B@4 (53.3%) and M (33.8%), and achieves comparable results on B@1 (82.9%) and C (74.8%).
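For reference, the B@n columns in these tables follow the BLEU metric [45]. The official scores come from the Microsoft COCO caption evaluation server [49]; the following is only a minimal sentence-level sketch of the idea (uniform n-gram weights, clipped counts, brevity penalty), not the evaluation code used in the experiments.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU-max_n with uniform weights and brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for ref in references:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    # Brevity penalty against the closest reference length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(sum(log_precisions) / max_n)

cand = "a man is playing a guitar".split()
refs = ["a man is playing a guitar".split(), "someone plays the guitar".split()]
print(round(bleu(cand, refs), 3))  # exact match with a reference -> 1.0
```

B@4 rewards longer exact phrase matches, which is why it separates methods more sharply than B@1 in the tables above.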
Tab. V: Comparison results on the MSR-VTT dataset (%).

Model  B@4  M  C  RL
MP-LSTM (V) [25]  34.8  24.8  -  -
MP-LSTM (C)  35.4  24.8  -  -
MP-LSTM (V+C)  35.8  25.3  -  -
SA-LSTM (V) [11]  35.6  25.4  -  -
SA-LSTM (C)  36.1  25.7  -  -
SA-LSTM (V+C)  36.6  25.9  -  -
MFA-LSTM (R+C) [13]  39.2  26.9  44.6  60.1
MS-RNN (R)  39.8  26.1  40.9  59.3
IV-F Comparison Results on the MSR-VTT Dataset
In this section, we compare our method with MP-LSTM [25] and SA-LSTM [11] on the MSR-VTT dataset. To obtain appearance features, MP-LSTM and our MS-RNN rely on the mean pooling strategy, while SA-LSTM relies on a soft attention mechanism. In theory, soft attention is more complex than mean pooling but usually provides better visual features. The experimental results are shown in Tab. V, from which we make the following observations:

MS-RNN achieves promising performance, with 39.8% B@4, 26.1% M, 40.9% C and 59.3% RL on the MSR-VTT dataset.

Overall, with the same visual input (VGG19, VGG19+C3D, or C3D), SA-LSTM performs better than MP-LSTM; since SA-LSTM is based on soft attention, it in theory takes better visual features as input. Compared with MP-LSTM (VGG19+C3D), our MS-RNN (R) achieves increases of 4% on B@4 and 0.8% on M. Compared with SA-LSTM (VGG19+C3D), our MS-RNN (R) achieves a 3.2% increase on B@4. Compared with MFA-LSTM (R+C), our model achieves comparable results on B@4, M and RL while using a single feature (R).
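For contrast with mean pooling, one step of soft temporal attention in the spirit of SA-LSTM [11] can be sketched as below. All projection matrices, dimensions, and the random initialization are illustrative assumptions; in a trained model they are learned jointly with the decoder.

```python
import numpy as np

def soft_attention(frame_feats, hidden, W_f, W_h, w):
    """One decoding step of soft temporal attention over frames.

    frame_feats: (T, D) per-frame features; hidden: (H,) decoder state.
    W_f (D, A), W_h (H, A), w (A,) are learned projections (random here).
    """
    scores = np.tanh(frame_feats @ W_f + hidden @ W_h) @ w   # (T,) relevance
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                 # softmax over frames
    context = weights @ frame_feats                          # (D,) weighted sum
    return context, weights

# Illustrative sizes: 26 frames, 2048-d features, 512-d state, 256-d attention.
T, D, H, A = 26, 2048, 512, 256
rng = np.random.default_rng(0)
feats = rng.standard_normal((T, D))
ctx, alpha = soft_attention(feats, rng.standard_normal(H),
                            rng.standard_normal((D, A)) * 0.01,
                            rng.standard_normal((H, A)) * 0.01,
                            rng.standard_normal(A))
print(ctx.shape)  # (2048,); alpha sums to 1 over the frames
```

Unlike mean pooling, the context vector changes at every decoding step because the weights depend on the current decoder state, which is the extra capacity SA-LSTM pays for.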
V Conclusions and Future Work
In this paper, we proposed a Multimodal Stochastic Recurrent Neural Network (MS-RNN) framework for video captioning. This work shows how to extend the modeling capability of RNNs by approximating both the prior distribution and the true posterior distribution with a nonlinear latent layer (S-LSTM). In addition, MS-RNN achieves state-of-the-art performance with only mean-pooled video appearance features and is comparable with counterparts that take both video appearance and motion features. Last but not least, the proposed model can be applied to a wide range of video analysis applications.
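The stochastic latent layer rests on the reparameterization trick of variational inference [18]. The sketch below shows the generic mechanism only; it is not the paper's exact objective (which matches a learned prior rather than N(0, I)), and the 64-d latent size is an illustrative assumption.

```python
import numpy as np

def sample_latent(mu, log_var, rng=None):
    """Reparameterized draw z = mu + sigma * eps, with eps ~ N(0, I).

    Sampling this way keeps the draw differentiable w.r.t. mu and log_var,
    which is what allows a stochastic layer inside a trainable RNN.
    """
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(q || N(0, I)) for a diagonal Gaussian q, summed over dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

mu = np.zeros(64)
log_var = np.zeros(64)
z = sample_latent(mu, log_var, np.random.default_rng(0))
print(z.shape, kl_to_standard_normal(mu, log_var))  # (64,) 0.0
```

During training the KL term regularizes the approximate posterior toward the prior, while the reparameterized sample feeds the decoder; at test time sampling injects the uncertainty that a deterministic LSTM cannot express.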
In the future, we will integrate a state-of-the-art attention mechanism [11] with our model to further improve video captioning performance. Moreover, motion features will be considered.
References
 [1] A. Kojima, T. Tamura, and K. Fukunaga, “Natural language description of human activities from video images based on concept hierarchy of actions,” International Journal of Computer Vision, vol. 50, no. 2, pp. 171–184, 2002.
 [2] M. W. Lee, A. Hakeem, N. Haering, and S. Zhu, “SAVE: A framework for semantic annotation of visual events,” in Computer Vision and Pattern Recognition, 2008, pp. 1–8.
 [3] M. U. G. Khan, L. Zhang, and Y. Gotoh, “Human focused video description,” in International Conference on Computer Vision, 2011, pp. 1480–1487.
 [4] P. Hanckmann, K. Schutte, and G. J. Burghouts, “Automated textual descriptions for a wide range of video events with 48 human actions,” in ECCV, 2012, pp. 372–380.
 [5] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2014.
 [6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [7] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [8] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” CoRR, vol. abs/1412.3555, 2014.
 [9] S. Venugopalan, M. Rohrbach, J. Donahue, R. J. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence - video to text,” in International Conference on Computer Vision, 2015, pp. 4534–4542.
 [10] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, “Jointly modeling embedding and translation to bridge video and language,” in Computer Vision and Pattern Recognition, 2016, pp. 4594–4602.
 [11] L. Yao, A. Torabi, K. Cho, N. Ballas, C. J. Pal, H. Larochelle, and A. C. Courville, “Describing videos by exploiting temporal structure,” in International Conference on Computer Vision, 2015, pp. 4507–4515.
 [12] Z. Guo, L. Gao, J. Song, X. Xu, J. Shao, and H. T. Shen, “Attention-based LSTM with semantic consistency for videos captioning,” in ACM MM, 2016, pp. 357–361.
 [13] X. Long, C. Gan, and G. de Melo, “Video captioning with multi-faceted attention,” CoRR, vol. abs/1612.00234, 2016.
 [14] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention,” in Computer Vision and Pattern Recognition, 2016, pp. 4651–4659.
 [15] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, “Boosting image captioning with attributes,” CoRR, vol. abs/1611.01646, 2016.
 [16] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, “Hierarchical recurrent neural encoder for video representation with application to captioning,” in Computer Vision and Pattern Recognition, 2016, pp. 1029–1038.
 [17] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu, “Video paragraph captioning using hierarchical recurrent neural networks,” in Computer Vision and Pattern Recognition, 2016, pp. 4584–4593.
 [18] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in ICLR, 2014.
 [19] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
 [20] M. I. Jordan, “Serial order: A parallel distributed processing approach.” vol. 121, p. 64, 1986.
 [21] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, “Recurrent neural network based language model,” in INTERSPEECH, 2010, pp. 1045–1048.
 [22] A. Farhadi, S. M. M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. A. Forsyth, “Every picture tells a story: Generating sentences from images,” in ECCV, 2010, pp. 15–29.
 [23] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele, “Translating video content to natural language descriptions,” in International Conference on Computer Vision, 2013, pp. 433–440.
 [24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Computer Vision and Pattern Recognition, 2015, pp. 1–9.
 [25] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko, “Translating videos to natural language using deep recurrent neural networks,” in NAACL HLT, 2015, pp. 1494–1504.
 [26] C. Gan, Z. Gan, X. He, J. Gao, and L. Deng, “StyleNet: Generating attractive visual captions with styles,” in CVPR, 2017, pp. 3137–3146.
 [27] X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing, “Recurrent topic-transition GAN for visual paragraph generation,” arXiv preprint arXiv:1703.07022, 2017.
 [28] F. Li, C. Gan, X. Liu, Y. Bian, X. Long, Y. Li, Z. Li, J. Zhou, and S. Wen, “Temporal modeling approaches for large-scale YouTube-8M video understanding,” arXiv preprint arXiv:1707.04555, 2017.
 [29] Z. Wang, F. Wu, W. Lu, J. Xiao, X. Li, Z. Zhang, and Y. Zhuang, “Diverse image captioning via GroupTalk,” in IJCAI, 2016, pp. 2957–2964.
 [30] A. Kendall and Y. Gal, “What uncertainties do we need in Bayesian deep learning for computer vision?” CoRR, vol. abs/1703.04977, 2017.
 [31] Y. Li and Y. Gal, “Dropout inference in Bayesian neural networks with alpha-divergences,” CoRR, vol. abs/1703.02914, 2017.

 [32] Y. Gal, R. McAllister, and C. E. Rasmussen, “Improving PILCO with Bayesian neural network dynamics models,” in Data-Efficient Machine Learning Workshop, ICML, 2016.
 [33] L. Uusitalo, A. Lehikoinen, I. Helle, and K. Myrberg, “An overview of methods to evaluate uncertainty of deterministic models in decision support,” Environmental Modelling and Software, vol. 63, pp. 24–31, 2015.
 [34] K. Sohn, H. Lee, and X. Yan, “Learning structured output representation using deep conditional generative models,” in NIPS, 2015, pp. 3483–3491.
 [35] R. G. Krishnan, U. Shalit, and D. Sontag, “Deep kalman filters,” CoRR, vol. abs/1511.05121, 2015.
 [36] I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. C. Courville, and Y. Bengio, “A hierarchical latent variable encoder-decoder model for generating dialogues,” in AAAI, 2017, pp. 3295–3301.
 [37] M. Fraccaro, S. K. Sønderby, U. Paquet, and O. Winther, “Sequential neural models with stochastic layers,” in NIPS, 2016, pp. 2199–2207.
 [38] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio, “A recurrent latent variable model for sequential data,” in NIPS, 2015, pp. 2980–2988.
 [39] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015.
 [40] X. Chen and C. L. Zitnick, “Learning a recurrent visual representation for image caption generation,” CoRR, vol. abs/1411.5654, 2014.
 [41] H. Fang, S. Gupta, F. N. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig, “From captions to visual concepts and back,” in Computer Vision and Pattern Recognition, 2015, pp. 1473–1482.
 [42] M. D. Zeiler, “ADADELTA: an adaptive learning rate method,” CoRR, vol. abs/1212.5701, 2012.
 [43] D. Chen and W. B. Dolan, “Collecting highly parallel data for paraphrase evaluation,” in ACL HLT, 2011, pp. 190–200.
 [44] J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description dataset for bridging video and language,” in Computer Vision and Pattern Recognition, 2016, pp. 5288–5296.
 [45] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in ACL, 2002, pp. 311–318.
 [46] M. J. Denkowski and A. Lavie, “Meteor universal: Language specific translation evaluation for any target language,” in The Workshop on Statistical Machine Translation, 2014, pp. 376–380.
 [47] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.

 [48] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in The Workshop on Text Summarization Branches Out, 2004, p. 10.
 [49] X. Chen, H. Fang, T. Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick, “Microsoft COCO captions: Data collection and evaluation server,” CoRR, vol. abs/1504.00325, 2015.
 [50] D. Furcy and S. Koenig, “Limited discrepancy beam search,” in IJCAI, 2005, pp. 125–131.
 [51] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, “Semantic compositional networks for visual captioning,” CoRR, vol. abs/1611.08002, 2016.