Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning

06/03/2019 · Wei Zhang et al. · Shandong University and Columbia University

In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) in a novel encoder-decoder-reconstructor architecture, which leverages both forward (video to sentence) and backward (sentence to video) flows for video captioning. Specifically, the encoder-decoder component makes use of the forward flow to produce a sentence description based on the encoded video semantic features. Two types of reconstructors are subsequently proposed to employ the backward flow and reproduce the video features from local and global perspectives, respectively, capitalizing on the hidden state sequence generated by the decoder. Moreover, in order to make a comprehensive reconstruction of the video features, we propose to fuse the two types of reconstructors together. The generation loss yielded by the encoder-decoder component and the reconstruction loss introduced by the reconstructor are jointly cast into training the proposed RecNet in an end-to-end fashion. Furthermore, the RecNet is fine-tuned by CIDEr optimization via reinforcement learning, which significantly boosts the captioning performance. Experimental results on benchmark datasets demonstrate that the proposed reconstructor can boost the performance of video captioning consistently.


1 Introduction

Describing visual contents with natural language automatically has received increasing attention in both the computer vision and natural language processing communities. It has various practical applications, such as image and video retrieval [1, 2, 3, 4, 5], answering questions about images [6], and assisting people who suffer from vision disorders [7].

Previous work predominantly focused on describing still images with natural language [8, 9, 10, 11, 12, 13]. Recently, researchers have striven to generate sentences to describe video contents [14, 15, 16, 17, 18, 19, 20]. Compared to image captioning [21, 22], describing videos is more challenging as the amount of information (e.g., objects, scenes, actions, etc.) contained in videos is much more sophisticated than that in still images. More importantly, the temporal dynamics within video sequences need to be adequately captured for captioning, besides the spatial content modeling.

Fig. 1: The proposed RecNet with an encoder-decoder-reconstructor architecture. The encoder-decoder relies on the forward flow from video to caption (blue dotted arrow), in which the decoder generates a caption with the frame features yielded by the encoder. The reconstructor, exploiting the backward flow from caption to video (green dotted arrow), takes the hidden state sequence of the decoder as input and reproduces the visual features of the input video.

Recently, an encoder-decoder architecture has been widely adopted for video captioning [17, 23, 24, 25, 26, 27, 28, 20, 29], as shown in Fig. 1. However, the encoder-decoder architecture only relies on the forward flow (video to sentence) and does not consider the information from sentence to video, referred to as the backward flow. Usually the encoder is a convolutional neural network (CNN) capturing the image structure to yield its semantic representation. For a given video sequence, the semantic representations yielded by the CNN are further fused together to exploit the video temporal dynamics and generate the video representation. The decoder is usually a long short-term memory (LSTM) [30] or a gated recurrent unit (GRU) [31], which is popular in processing sequential data [32]. The LSTM or GRU generates the sentence fragments one by one and assembles them into one sentence. The semantic information from target sentences to source videos is never included. Actually, the backward flow can be provided by a dual learning mechanism, which has been introduced into neural machine translation (NMT) [33, 34] and image segmentation [35]. This mechanism reconstructs the source from the target once the target is obtained, and demonstrates that the backward flow from target to source improves the performance.

To exploit the backward flow, we draw on the idea of dual learning and propose an encoder-decoder-reconstructor architecture, shown in Fig. 1 and denoted as RecNet, to address video captioning. Specifically, the encoder fuses the video frame features together to exploit the video temporal dynamics and generate the video representation, based on which the decoder generates the corresponding sentence. The reconstructor, realized by one LSTM, leverages the backward flow (sentence to video); that is, it reproduces the video information based on the hidden state sequence generated by the decoder. Such an encoder-decoder-reconstructor can be viewed as a dual learning architecture, where video captioning is the primal task and reconstruction behaves as its dual task. In the dual task, a reconstruction loss measuring the difference between the reconstructed and original visual features is additionally used to train the primal task and optimize the parameters of the encoder and decoder. With such a reconstructor, the decoder is encouraged to embed more information from the input video sequence. Therefore, the relationship between the video sequence and the caption is further enhanced, and the decoder can generate more semantically correlated sentences to describe the visual contents of the video sequences, yielding significant performance improvements. As such, the proposed reconstructor, which further helps fine-tune the parameters of the encoder and decoder, is expected to bridge the semantic gap between the video and the sentence.

Moreover, the reconstructor can help mitigate the discrepancy between the training and inference processes, also referred to as the exposure bias [36], which widely exists in RNNs for the captioning task. The proposed reconstructor helps regularize the transition dynamics of the RNNs, as the dual information captured by the RecNet provides a complementary view to the encoder-decoder architecture. As such, the reconstructor alleviates the exposure bias and thereby mitigates the discrepancy between training and inference, as will be demonstrated in Sec. 4.4.4.

Besides, we intend to directly train the captioning models guided by evaluation metrics, such as BLEU and CIDEr, instead of the conventionally used cross-entropy loss. However, these evaluation metrics are discrete and non-differentiable, which makes them difficult to optimize with traditional methods. Self-critical sequence training [37] is a REINFORCE-based algorithm that is well suited to such discrete and non-differentiable objectives. In this paper, we resort to the self-critical sequence training algorithm to further boost the performance of the RecNet on the video captioning task.

To summarize, our main contributions of this work are as follows. We propose a novel reconstruction network and build an end-to-end encoder-decoder-reconstructor architecture to exploit both the forward (video to sentence) and backward (sentence to video) flows for video captioning. Two types of reconstructors are customized to recover the global and local structures of the video, respectively. Moreover, a joint model is presented to reconstruct both the global and local structures simultaneously for further improving the reconstruction of the video representation. Extensive results obtained by cross-entropy training and self-critical sequence training [37] on benchmark datasets indicate that the backward flow is well exploited by the proposed reconstructors, and considerable gains on video captioning are achieved. Besides, ablation studies show that the proposed reconstructor can help regularize the transition dynamics of RNNs, thereby mitigating the discrepancy between training and inference processes.

2 Related Work

In this section, we first introduce two types of video captioning: template-based approaches [38, 39, 40, 41, 14] and sequence learning approaches [25, 18, 19, 17, 23, 24, 42, 28, 20, 43, 29, 44], and then introduce the application of dual learning.

2.1 Template-based Approaches

Template-based methods first define some specific rules for language grammar, and then parse the sentence into several components such as subject, verb, and object. The obtained sentence fragments are associated with words detected from the visual content to produce the final description about an input video with predefined templates. For example, a concept hierarchy of actions was introduced to describe human activities in [38], while a semantic hierarchy was defined in [39] to learn the semantic relationship between different sentence fragments. In [40], the conditional random field (CRF) was adopted to model the connections between objects and activities of the visual input and generate the semantic features for description. Besides, Xu et al. proposed a unified framework consisting of a semantic language model, a deep video model, and a joint embedding model to learn the association between videos and natural sentences [14]. However, as stated in [20], the aforementioned approaches highly depend on predefined templates and are thus limited by the fixed syntactical structure, which is inflexible for sentence generation.

Fig. 2: The proposed RecNet consists of three parts: the CNN-based encoder which extracts the semantic representations of the video frames, the LSTM-based decoder which generates natural language for visual content description, and the reconstructor which exploits the backward flow from caption to visual contents to reproduce the frame representations.

2.2 Sequence Learning Approaches

Compared with the template-based methods, the sequence learning approaches aim to directly produce a sentence description about the visual input with more flexible syntactical structures. For example, in [19], a video representation was obtained by averaging the frame features extracted by a CNN, and then fed to LSTMs for sentence generation. In [28], the relevance between video context and sentence semantics was considered as a regularizer in the LSTM. However, since simple mean pooling is used, the temporal dynamics of the video sequence are not adequately addressed. Yao et al. introduced an attention mechanism to assign weights to the features of each frame and then fused them based on the attentive weights [25]. Venugopalan et al. proposed S2VT [18], which includes the temporal information with optical flow and employs LSTMs in both the encoder and decoder. To exploit both temporal and spatial information, Zhang and Tian proposed a two-stream encoder comprised of two 3D CNNs [45, 46] and one parallel fully connected layer to learn the features from the frames [42]. Besides, Pan et al. proposed a transfer unit to model the high-level semantic attributes from both images and videos, which are rendered as the complementary knowledge to video representations for boosting sentence generation [20].

More recently, reinforcement learning has shown benefits on video captioning tasks. Pasunuru and Bansal employed reinforcement learning to directly optimize the CIDEnt score (an entailment-corrected variant of CIDEr) and achieved state-of-the-art results on the MSR-VTT dataset [47]. Wang et al. proposed a hierarchical reinforcement learning framework, where a manager guides a worker to generate semantic segments about activities to produce more detailed descriptions.

2.3 Dual Learning Approaches

As far as we know, the dual learning mechanism has not been employed in video captioning, but it has been widely used in NMT [33, 34, 48]. In [33], the source sentences are reproduced from the target-side hidden states, and the accuracy of the reconstructed source provides a constraint that encourages the decoder to embed more information of the source language into the target language. In [34], dual learning is employed to train English-French translation models, and obtains significant improvements on both the English-to-French and French-to-English tasks.

In this paper, our proposed RecNet can be regarded as a sequence learning method. However, unlike the above conventional encoder-decoder models, which only depend on the forward flow from video to sentence, RecNet can also benefit from the backward flow from sentence to video. By fully considering the bidirectional flows between video and sentence, RecNet is capable of further boosting video captioning. Besides, it is worth noting that this work is an extended version of [49]. The main improvements of this version are as follows. First, this work takes one step forward and presents a new reconstruction model, RecNet_g+l, which considers both global and local structures to further improve the reconstruction of the video representation. Second, the exposure bias between the training and inference processes is studied in this work. We demonstrate that the proposed reconstructor can help regularize the transition dynamics of the decoder and mitigate the discrepancy between training and inference. Besides, more ablation studies on the reconstructors are conducted, including training with the self-critical algorithm, visualization of the hidden states of the decoder, and curves of the training losses and metrics used to examine how well the reconstructor works. Also, an additional dataset, ActivityNet [50], is included to further verify the effectiveness of the proposed reconstructor.

3 Architecture

We propose a novel RecNet with an encoder-decoder-reconstructor architecture for video captioning, which works in an end-to-end manner. The reconstructor imposes one constraint that the semantic information of one source video can be reconstructed from the hidden state sequence of the decoder. The encoder and decoder are thus encouraged to embed more semantic information about the source video. As illustrated in Fig. 2, the proposed RecNet consists of three components, specifically the encoder, the decoder, and the reconstructor:

  • Encoder. Given one video sequence, the encoder yields the semantic representation for each video frame.

  • Decoder. The decoder decodes the corresponding representations generated by the encoder into one caption describing the video content.

  • Reconstructor. The reconstructor takes the intermediate hidden state sequence of the decoder as input, and reconstructs the video global or local structure.

Moreover, our proposed reconstructor can be built on top of any classical encoder-decoder model for video captioning. In this paper, we employ the attention-based video captioning model [25] and S2VT [18] as the classical encoder-decoder models. We first briefly introduce the encoder-decoder model for video captioning, and then describe the proposed reconstructors with different architectures. A simplified sketch of how the three components fit together is given below.
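To make the overall data flow concrete, the following is a minimal PyTorch sketch of how the three components could be wired together. It is an illustrative simplification rather than the authors' implementation: the temporal attention is replaced by a mean-pooled context, the decoder is run with teacher forcing, and only a global-style reconstruction term is shown; all module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SimpleRecNet(nn.Module):
    """Toy composition of encoder features, decoder, and reconstructor."""

    def __init__(self, vocab_size, feat_dim=1536, embed_dim=468, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Decoder LSTM consumes the previous word embedding plus a video context.
        self.decoder = nn.LSTM(embed_dim + feat_dim, hidden_dim, batch_first=True)
        self.word_out = nn.Linear(hidden_dim, vocab_size)
        # Reconstructor LSTM maps decoder hidden states back to the feature space.
        self.reconstructor = nn.LSTM(hidden_dim, feat_dim, batch_first=True)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, n, feat_dim) pre-extracted CNN features; captions: (B, T).
        ctx = frame_feats.mean(dim=1, keepdim=True)             # crude global context
        emb = self.embed(captions)                              # (B, T, embed_dim)
        ctx = ctx.expand(-1, emb.size(1), -1)
        dec_h, _ = self.decoder(torch.cat([emb, ctx], dim=-1))  # forward flow
        logits = self.word_out(dec_h)                           # word distributions
        rec_h, _ = self.reconstructor(dec_h)                    # backward flow
        # Global-style reconstruction: compare mean-pooled reconstructions with
        # mean-pooled frame features via the Euclidean distance.
        rec_loss = torch.dist(rec_h.mean(dim=1), frame_feats.mean(dim=1), p=2)
        return logits, rec_loss

model = SimpleRecNet(vocab_size=1000)
logits, rec_loss = model(torch.randn(2, 28, 1536), torch.randint(0, 1000, (2, 12)))
```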

3.1 Encoder-Decoder

The aim of video captioning is to generate one sentence $S = \{s_1, s_2, \ldots, s_T\}$ to describe the content of one given video $V$. Classical encoder-decoder architectures directly model the caption generation probability word by word:

P(S \mid V; \theta) = \prod_{t=1}^{T} P(s_t \mid s_{<t}, V; \theta),   (1)

where $\theta$ denotes the parameters of the encoder-decoder model, $T$ denotes the length of the sentence, and $s_{<t}$ (i.e., $\{s_1, \ldots, s_{t-1}\}$) denotes the generated partial caption.

Encoder. To generate reliable captions, visual features need to be extracted to capture the high-level semantic information of the video. Previous methods usually rely on CNNs, such as AlexNet [19], GoogleNet [25], and VGG19 [51], to encode each video frame into a fixed-length representation carrying the high-level semantic information. By contrast, in this work, considering that a deeper network is more plausible for feature extraction, we advocate using Inception-V4 [52] as the encoder. In this way, the given video sequence is encoded as a sequential representation $V = \{v_1, v_2, \ldots, v_n\}$, where $v_i$ denotes the feature of the $i$-th frame and $n$ denotes the total number of video frames.
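As a concrete illustration of the feature extraction step, the snippet below obtains a 1,536-dimensional feature per frame from an ImageNet-pretrained Inception-V4. Using the timm library here is an assumption (the paper does not name its extraction toolchain); any Inception-V4 whose last pooling layer yields 1,536-dimensional features would serve the same purpose.

```python
import torch
import timm

# num_classes=0 removes the classifier so the model returns pooled features.
cnn = timm.create_model("inception_v4", pretrained=True, num_classes=0)
cnn.eval()

@torch.no_grad()
def extract_frame_features(frames):
    """frames: (n, 3, 299, 299) preprocessed video frames -> (n, 1536) features."""
    return cnn(frames)

feats = extract_frame_features(torch.randn(4, 3, 299, 299))
print(feats.shape)  # torch.Size([4, 1536])
```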

Decoder. The decoder aims to generate the caption word by word based on the video representation. LSTMs, with their capability of modeling long-term temporal dependencies, are used to decode the video representation into a caption word by word. To further exploit the global temporal information of videos, a temporal attention mechanism [25] is employed to encourage the decoder to select the key frames/elements for captioning.

During the captioning process, the word prediction at each time step $t$ is generally made by the LSTM:

P(s_t \mid s_{<t}, V; \theta) \propto \exp\big( f(h_t, s_{t-1}, c_t; \theta) \big),   (2)

where $f$ represents the LSTM activation function, $h_t$ is the hidden state computed in the LSTM, and $c_t$ denotes the context vector computed with the temporal attention mechanism, which is used to decode the $t$-th word. The temporal attention mechanism assigns a weight $\alpha_t^i$ to the representation $v_i$ of each frame at time step $t$ and forms the context vector as follows:

c_t = \sum_{i=1}^{n} \alpha_t^i v_i,   (3)

where $n$ denotes the number of video frames and $\sum_{i=1}^{n} \alpha_t^i = 1$. In order to obtain the attentive weight $\alpha_t^i$ at time step $t$ for the $i$-th video frame representation, we follow the traditional way in [25] and calculate the relevance score $e_t^i$ of the frame representation $v_i$ with respect to the hidden state $h_{t-1}$:

e_t^i = w^\top \tanh\big( W_a h_{t-1} + U_a v_i + b_a \big),   (4)

where $w$, $W_a$, $U_a$, and $b_a$ are the learnable parameters. The attentive weight $\alpha_t^i$ is thereby obtained by the softmax normalization:

\alpha_t^i = \frac{\exp(e_t^i)}{\sum_{j=1}^{n} \exp(e_t^j)}.   (5)

The attentive weight $\alpha_t^i$ reflects the relevance between the $i$-th frame representation and the hidden state $h_{t-1}$ at time step $t$. The larger $\alpha_t^i$ is, the more relevant the video frame representation $v_i$ is to $h_{t-1}$, which allows the decoder to focus more on the corresponding video frames when generating the word at the next time step.
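The module below is a minimal PyTorch sketch of the temporal attention in Eqs. (3)-(5): a relevance score is computed from the previous decoder state and each frame feature, softmax-normalized, and used to form the context vector. The layer names and the attention dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, hidden_dim=512, feat_dim=1536, attn_dim=256):
        super().__init__()
        self.W = nn.Linear(hidden_dim, attn_dim, bias=False)  # projects h_{t-1}
        self.U = nn.Linear(feat_dim, attn_dim, bias=True)     # projects v_i (carries the bias)
        self.w = nn.Linear(attn_dim, 1, bias=False)           # scoring vector

    def forward(self, h_prev, frame_feats):
        # h_prev: (B, hidden_dim); frame_feats: (B, n, feat_dim)
        scores = self.w(torch.tanh(self.W(h_prev).unsqueeze(1) + self.U(frame_feats)))
        alpha = torch.softmax(scores.squeeze(-1), dim=1)                  # Eq. (5)
        context = torch.bmm(alpha.unsqueeze(1), frame_feats).squeeze(1)   # Eq. (3)
        return context, alpha

attn = TemporalAttention()
context, alpha = attn(torch.randn(2, 512), torch.randn(2, 28, 1536))
```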

3.2 Reconstructor

As shown in Fig. 2, the proposed reconstructor is built on top of the encoder-decoder, which is expected to reproduce the video from the hidden state sequence of the decoder. However, due to the diversity and high dimension of the video frames, directly reconstructing the video frames seems to be intractable. Therefore, in this paper, the reconstructor aims at reproducing the sequential video frame representations generated by the encoder, with the hidden states of the decoder as input. The benefits of such a structure are two-fold. First, the proposed encoder-decoder-reconstructor architecture can be trained in an end-to-end fashion. Second, with such a reconstruction process, the decoder is encouraged to embed more information from the input video sequence. Therefore, the relationship between the video sequence and caption can be further enhanced, which is expected to improve the video captioning performance.

In practice, the reconstructor is realized by LSTMs. Two different architectures are customized to summarize the hidden states of the decoder for video feature reproduction. More specifically, one focuses on reproducing the global structure of the provided video, while the other pays more attention to the local structure by selectively attending to the hidden state sequence. Moreover, to comprehensively reconstruct the video sequence, we propose a new architecture that reconstructs both the global and local structures of the video features.

3.2.1 Reconstructing Global Structure

Fig. 3: An illustration of the proposed reconstructor that reproduces the global structure of the video sequence. The mean pooling on the left summarizes the hidden states of the decoder into the global representation of the caption. The reconstructor aims to reproduce the feature representation of the whole video (obtained by the mean pooling on the right) from the global representation of the caption together with the hidden state sequence of the decoder.

The architecture for reconstructing the global structure of the video sequence is illustrated in Fig. 3. The whole sentence is fully considered to reconstruct the video global structure. Therefore, besides the hidden state at each time step, the global representation characterizing the semantics of the whole sentence is also taken as the input at each step. Several methods, such as an LSTM or a multi-layer perceptron, can be employed to fuse the sequential hidden states of the decoder into the global representation. Inspired by [18], the mean pooling strategy is performed on the hidden states $H = \{h_1, \ldots, h_T\}$ of the decoder to yield the global representation of the caption:

\phi(H) = \frac{1}{T} \sum_{t=1}^{T} h_t,   (6)

where $\phi(\cdot)$ denotes the mean pooling process, which yields a vector representation of the same size as $h_t$. Thus, the LSTM unit of the reconstructor is further modified as:

i_t = \sigma\big( W_i h_t + U_i z_{t-1} + A_i \phi(H) + b_i \big),
f_t = \sigma\big( W_f h_t + U_f z_{t-1} + A_f \phi(H) + b_f \big),
o_t = \sigma\big( W_o h_t + U_o z_{t-1} + A_o \phi(H) + b_o \big),
g_t = \tanh\big( W_g h_t + U_g z_{t-1} + A_g \phi(H) + b_g \big),
m_t = f_t \odot m_{t-1} + i_t \odot g_t,
z_t = o_t \odot \tanh(m_t),   (7)

where $i_t$, $f_t$, $m_t$, $o_t$, and $z_t$ denote the input gate, forget gate, memory cell, output gate, and hidden state of each LSTM unit, respectively, and the $W$, $U$, $A$, and $b$ terms are the learnable parameters. $\sigma$ and $\odot$ denote the logistic sigmoid activation and the element-wise multiplication, respectively.

To reconstruct the video global structure from the hidden state sequence produced by the encoder-decoder, the global reconstruction loss is defined as:

\mathcal{L}_{rec}^{global} = \psi\big( \phi(V), \phi(Z) \big),   (8)

where $\phi(V)$ denotes the mean pooling process on the video frame features $\{v_1, \ldots, v_n\}$, yielding the ground-truth global structure of the input video sequence, and $\phi(Z)$ works on the hidden states $\{z_1, \ldots, z_T\}$ of the reconstructor, indicating the global structure recovered from the caption. The reconstruction difference is measured by $\psi(\cdot, \cdot)$, which is simply chosen as the Euclidean distance.
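A hedged PyTorch sketch of the global reconstructor and its loss follows. For simplicity it feeds the concatenation of the decoder hidden state and the mean-pooled caption representation into a standard LSTM, rather than the modified unit of Eq. (7); dimensions and layer names are assumptions.

```python
import torch
import torch.nn as nn

class GlobalReconstructor(nn.Module):
    def __init__(self, hidden_dim=512, feat_dim=1536):
        super().__init__()
        # Each step sees [h_t ; phi(H)], mimicking the inputs of Eq. (7).
        self.lstm = nn.LSTM(2 * hidden_dim, feat_dim, batch_first=True)

    def forward(self, dec_hidden, frame_feats):
        # dec_hidden: (B, T, hidden_dim); frame_feats: (B, n, feat_dim)
        phi_h = dec_hidden.mean(dim=1, keepdim=True)            # Eq. (6)
        phi_h = phi_h.expand(-1, dec_hidden.size(1), -1)
        z, _ = self.lstm(torch.cat([dec_hidden, phi_h], dim=-1))
        # Euclidean distance between the two mean-pooled (global) structures, Eq. (8).
        return (z.mean(dim=1) - frame_feats.mean(dim=1)).norm(dim=-1).mean()

rec = GlobalReconstructor()
loss = rec(torch.randn(2, 12, 512), torch.randn(2, 28, 1536))
```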

Fig. 4: An illustration of the proposed reconstructor that reproduces the local structure of the video sequence. The reconstructor works on the hidden states of the decoder by selectively adjusting the attentive weights, and reproduces the feature representation frame by frame.

3.2.2 Reconstructing Local Structure

Fig. 5: An illustration of the proposed reconstructor that reproduces both the global structure and local structure of the video sequence. The reconstructor works on the hidden states of the decoder by selectively adjusting the attentive weights, and reproduces the feature representation frame by frame. Mean pooling is conducted to summarize the reproduced feature representation sequence to yield the representation of the whole video.

The aforementioned reconstructor aims to reproduce the global representation of the whole video sequence, while neglecting the local structures of each frame. In this subsection, we propose to learn and preserve the temporal dynamics by reconstructing each video frame, as shown in Fig. 4. Differing from the global structure estimation, we intend to reproduce the feature representation of each frame from the key hidden states of the decoder selected by the attention strategy [53, 25]:

\mu_t = \sum_{i=1}^{T} \beta_t^i h_i,   (9)

where $\sum_{i=1}^{T} \beta_t^i = 1$ and $\beta_t^i$ denotes the weight computed by the attention mechanism for the hidden state $h_i$ from the decoder at time step $t$. Here $\beta_t^i$ measures the relevance of the $i$-th hidden state of the decoder given the previously reconstructed frame representations $\{z_1, \ldots, z_{t-1}\}$. Similar to Eqs. (4) and (5), to calculate the attentive weight $\beta_t^i$, the relevance score $\epsilon_t^i$ of the hidden state $h_i$ from the decoder with respect to the previous hidden state $z_{t-1}$ in the reconstructor is first computed by:

\epsilon_t^i = w^\top \tanh\big( W_b z_{t-1} + U_b h_i + b_b \big),   (10)

where $w$, $W_b$, $U_b$, and $b_b$ are the learnable parameters, and $T$ denotes the total number of hidden states from the decoder.

Such a strategy encourages the reconstructor to work on the hidden states selectively by adjusting the attentive weights $\beta_t^i$, and yields the context information $\mu_t$ at each time step as in Eq. (9). As such, the proposed reconstructor can further exploit the temporal dynamics and the word compositions across the whole caption. The LSTM unit is thereby reformulated as:

i_t = \sigma\big( W_i \mu_t + U_i z_{t-1} + b_i \big),
f_t = \sigma\big( W_f \mu_t + U_f z_{t-1} + b_f \big),
o_t = \sigma\big( W_o \mu_t + U_o z_{t-1} + b_o \big),
g_t = \tanh\big( W_g \mu_t + U_g z_{t-1} + b_g \big),
m_t = f_t \odot m_{t-1} + i_t \odot g_t,
z_t = o_t \odot \tanh(m_t).   (11)

Differing from the global structure recovery step in Eq. (7), the dynamically generated context $\mu_t$ is taken as the input instead of the hidden state $h_t$ and its mean pooling representation $\phi(H)$. Moreover, instead of directly generating the global mean representation of the whole video sequence, we propose to produce the feature representation frame by frame. The local reconstruction loss is thereby defined as:

\mathcal{L}_{rec}^{local} = \frac{1}{n} \sum_{i=1}^{n} \psi\big( v_i, z_i \big),   (12)

where $z_i$ denotes the reconstructed representation of the $i$-th frame and $\psi(\cdot, \cdot)$ is the Euclidean distance as in Eq. (8).
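The sketch below mirrors Eqs. (9)-(12) with a plain LSTM cell: for every frame to be reconstructed, an attention over the decoder hidden states forms the context, the cell produces the reconstructed feature, and the loss averages per-frame Euclidean distances. Layer names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LocalReconstructor(nn.Module):
    def __init__(self, hidden_dim=512, feat_dim=1536, attn_dim=256):
        super().__init__()
        self.W = nn.Linear(feat_dim, attn_dim, bias=False)   # projects z_{t-1}
        self.U = nn.Linear(hidden_dim, attn_dim, bias=True)  # projects decoder h_i
        self.w = nn.Linear(attn_dim, 1, bias=False)
        self.cell = nn.LSTMCell(hidden_dim, feat_dim)

    def forward(self, dec_hidden, frame_feats):
        # dec_hidden: (B, T, hidden_dim); frame_feats: (B, n, feat_dim)
        B, n, feat_dim = frame_feats.shape
        z = frame_feats.new_zeros(B, feat_dim)   # reconstructor hidden state
        c = frame_feats.new_zeros(B, feat_dim)   # reconstructor cell state
        loss = 0.0
        for t in range(n):
            scores = self.w(torch.tanh(self.W(z).unsqueeze(1) + self.U(dec_hidden)))
            beta = torch.softmax(scores.squeeze(-1), dim=1)             # Eq. (10) + softmax
            mu = torch.bmm(beta.unsqueeze(1), dec_hidden).squeeze(1)    # Eq. (9)
            z, c = self.cell(mu, (z, c))                                # Eq. (11), simplified
            loss = loss + (z - frame_feats[:, t]).norm(dim=-1).mean()   # Eq. (12)
        return loss / n

rec = LocalReconstructor()
loss = rec(torch.randn(2, 12, 512), torch.randn(2, 28, 1536))
```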

3.2.3 Reconstructing both Global and Local Structures

In this subsection, we step further and intend to reconstruct both global and local structures of the video sequence. The architecture is illustrated in Fig. 5.

Different from the aforementioned two methods, we first reconstruct the feature representation of each frame from the local information of the input video. After that, mean pooling is conducted on the reproduced frame representations to yield the global representation of the video. The global reconstruction loss and the local reconstruction loss are jointly considered as follows:

\mathcal{L}_{rec}^{g+l} = \psi\big( \phi(V), \phi(Z) \big) + \frac{1}{n} \sum_{i=1}^{n} \psi\big( v_i, z_i \big),   (13)

where the first term denotes the global reconstruction loss computed as in Eq. (8), while the second term is the local reconstruction loss computed as in Eq. (12).

Such an architecture is designed to reproduce both the global and local information. It is expected to comprehensively exploit the backward information flow and further boost the performance of video captioning.

3.3 Training

3.3.1 Training Encoder-Decoder

Traditionally, the encoder-decoder model can be jointly trained by minimizing the negative log-likelihood of the correct description sentence given the video:

\mathcal{L}_{XE}(\theta) = -\sum_{t=1}^{T} \log P(s_t \mid s_{<t}, V; \theta).   (14)
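A minimal sketch of this objective is a masked negative log-likelihood over the ground-truth words; the padding index is an assumption introduced for illustration.

```python
import torch
import torch.nn.functional as F

def caption_xe_loss(logits, targets, pad_idx=0):
    """logits: (B, T, vocab) decoder outputs; targets: (B, T) ground-truth word ids."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # -log P(s_t | ...)
    mask = (targets != pad_idx).float()                              # ignore padding
    return (nll * mask).sum() / mask.sum().clamp(min=1.0)

loss = caption_xe_loss(torch.randn(2, 12, 1000), torch.randint(1, 1000, (2, 12)))
```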

Moreover, in order to directly optimize the evaluation metrics instead of the cross-entropy loss, we regard the video contents and words as the "environment", and the encoder-decoder model as the "agent" that interacts with the environment. At each step, the agent (the LSTMs) follows the policy $p_\theta$ and takes an "action", i.e., predicts a word, followed by an update of its "state", which consists of the hidden states and cell states of the LSTMs. When a sentence is generated, the metric score is computed and regarded as the "reward" (here the CIDEr score is taken as the reward), which we intend to optimize by minimizing the negative expected reward:

\mathcal{L}_{RL}(\theta) = -\mathbb{E}_{w^s \sim p_\theta}\big[ r(w^s) \big],   (15)

where $r(w^s)$ denotes the reward of the word sequence $w^s = (w_1^s, \ldots, w_T^s)$ sampled from the model. The gradient of this non-differentiable loss can be obtained using the REINFORCE algorithm:

\nabla_\theta \mathcal{L}_{RL}(\theta) = -\mathbb{E}_{w^s \sim p_\theta}\big[ r(w^s) \nabla_\theta \log p_\theta(w^s) \big].   (16)

In practice, the expectation is typically estimated with a single sample $w^s$ drawn from $p_\theta$. Therefore, for each sample, Eq. (16) can be approximated as:

\nabla_\theta \mathcal{L}_{RL}(\theta) \approx -r(w^s) \nabla_\theta \log p_\theta(w^s).   (17)

To deal with the high variance of the gradient estimate based on a single sample, we also introduce a baseline reward $b$ into the reward-based loss to generalize the policy gradient given by REINFORCE, which reduces the variance without changing the expected gradient:

\nabla_\theta \mathcal{L}_{RL}(\theta) \approx -\big( r(w^s) - b \big) \nabla_\theta \log p_\theta(w^s),   (18)

where $b$ denotes the baseline reward.

According to the chain rule, the gradient of the loss function can be calculated as:

\nabla_\theta \mathcal{L}_{RL}(\theta) = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_{RL}(\theta)}{\partial x_t} \frac{\partial x_t}{\partial \theta},   (19)

where $x_t$ is the input to the Softmax layer in Eq. (2). The estimation of $\partial \mathcal{L}_{RL}(\theta) / \partial x_t$ with the baseline $b$ is given by [54] as follows:

\frac{\partial \mathcal{L}_{RL}(\theta)}{\partial x_t} \approx \big( r(w^s) - b \big) \big( p_\theta(w_t \mid h_t) - 1_{w_t^s} \big).   (20)

Usually the baseline is estimated by a trainable network [36], which significantly increases the computational complexity. To handle such a drawback, Rennie et al. proposed the self-critical sequence training method [37], in which the baseline is the CIDEr score of the sentence generated by the current encoder-decoder model under the inference mode. It proves to be more effective for training, as such a scheme not only brings a lower gradient variance, but also avoids learning an estimate of the expected future rewards as the baseline. Hence, in this paper, we employ the self-critical sequence training method and take the metric score of the sentence $\hat{w}$ greedily decoded by the current model with the inference algorithm as the baseline, i.e., $b = r(\hat{w})$. As such, by replacing the baseline $b$ with $r(\hat{w})$, Eq. (20) can be rewritten as:

\frac{\partial \mathcal{L}_{RL}(\theta)}{\partial x_t} \approx \big( r(w^s) - r(\hat{w}) \big) \big( p_\theta(w_t \mid h_t) - 1_{w_t^s} \big).   (21)

Hence, if a sampled sentence receives a reward $r(w^s)$ higher than the baseline $r(\hat{w})$, the gradient is negative and such a distribution is encouraged by increasing the probabilities of the corresponding words. Similarly, a sample distribution that yields a lower reward is discouraged.
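The following is a minimal sketch of the resulting self-critical update: the reward of a sampled caption is baselined by the reward of the greedily decoded caption, and the advantage weights the log-probabilities of the sampled words. The reward function (e.g., CIDEr) and the sampling/greedy decoding routines are assumed to be provided elsewhere.

```python
import torch

def self_critical_loss(sample_logprobs, sample_reward, greedy_reward, mask):
    """sample_logprobs: (B, T) log p(w_t^s); rewards: (B,); mask: (B, T) valid words."""
    advantage = (sample_reward - greedy_reward).unsqueeze(-1)   # r(w^s) - r(w_hat)
    # REINFORCE with the greedy baseline; the advantage is treated as a constant.
    return -(advantage.detach() * sample_logprobs * mask).sum() / mask.sum()

loss = self_critical_loss(
    sample_logprobs=torch.randn(2, 12),
    sample_reward=torch.tensor([0.42, 0.35]),
    greedy_reward=torch.tensor([0.40, 0.38]),
    mask=torch.ones(2, 12),
)
```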

3.3.2 Training Encoder-Decoder-Reconstructor

Formally, we train the proposed encoder-decoder-reconstructor architecture by minimizing the whole loss defined in Eq. (22), which involves both the forward (video-to-sentence) likelihood and the backward (sentence-to-video) reconstruction loss:

\mathcal{L}(\theta, \theta_{rec}) = \mathcal{L}_{ED}(\theta) + \lambda \, \mathcal{L}_{rec}(\theta, \theta_{rec}),   (22)

where $\theta_{rec}$ denotes the parameters of the reconstructor. The encoder-decoder loss $\mathcal{L}_{ED}$ can be either the traditional cross-entropy loss $\mathcal{L}_{XE}$ in Eq. (14) or the reinforcement loss $\mathcal{L}_{RL}$ in Eq. (15). The reconstruction loss $\mathcal{L}_{rec}$ can be set as the global loss in Eq. (8), the local loss in Eq. (12), or the joint global and local loss in Eq. (13). The hyper-parameter $\lambda$ is introduced to seek a trade-off between the encoder-decoder and the reconstructor.

The training of our proposed RecNet model proceeds in two stages. First, we rely on the forward likelihood to train the encoder-decoder component of the RecNet, which is terminated by the early stopping strategy. Afterward, the reconstructor and the backward reconstruction loss are introduced. We use the whole loss defined in Eq. (22) to jointly train the reconstructor and fine-tune the encoder-decoder. For the reconstructor, the reconstruction loss is calculated using the hidden state sequence generated by the LSTM units in the reconstructor as well as the video frame feature sequence.
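As a minimal illustration of the second training stage, the step below combines a cross-entropy term with the weighted reconstruction loss, assuming a model (such as the composition sketched in Section 3) that returns both the word logits and a reconstruction loss; the shifting of the caption for next-word prediction is an illustrative convention.

```python
import torch
import torch.nn.functional as F

def joint_training_step(model, optimizer, frame_feats, captions, lam=0.2):
    """One optimization step of Eq. (22) with the cross-entropy encoder-decoder loss."""
    optimizer.zero_grad()
    logits, rec_loss = model(frame_feats, captions)        # forward and backward flows
    xe_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),       # predict the next word
        captions[:, 1:].reshape(-1),
    )
    total = xe_loss + lam * rec_loss                       # Eq. (22)
    total.backward()
    optimizer.step()
    return total.item()
```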

3.4 RecNet vs. Autoencoder

The whole architecture of the RecNet can be regarded as one autoencoder. Specifically, the encoder-decoder framework acts as the "encoder" component of the autoencoder, and its aim is to generate fluent sentences describing the video contents. The reconstructor performs as the "decoder", and its aim is to ensure the semantic correlation between the generated caption and the input video sequence. Also, the reconstruction losses defined in Eqs. (8), (12), and (13) act similarly to the reconstruction error defined in the autoencoder, which further guarantees that the model can learn an effective feature representation from the input video sequence [55]. Compared to the encoder-decoder, the proposed reconstructor is able to exploit the backward flow from sentence to video, and acts as a regularizer that makes the encoder-decoder component produce captions semantically correlated to the input video sequences. Consequently, the video captioning performance can be improved further.

4 Experimental Results

In this section, we evaluate the proposed RecNet on the video captioning task over the benchmark datasets Microsoft Research Video to Text (MSR-VTT) [51], Microsoft Research Video Description Corpus (MSVD) [56], and ActivityNet Captions (ActivityNet) [50]. We compute the traditional evaluation metrics, namely BLEU-4 [57], METEOR [58], ROUGE-L [59], and CIDEr [60], with the codes released on the Microsoft COCO evaluation server (https://github.com/tylin/coco-caption) [61] and the ActivityNet evaluation server (https://github.com/ranjaykrishna/densevid_eval). In the following, we first introduce the benchmark datasets and the implementation details of the proposed method. Then, the experimental results are provided and discussed.

4.1 Datasets and Implementation Details

4.1.1 Datasets

MSR-VTT. It is the largest dataset for video captioning so far in terms of the number of video-sentence pairs and the vocabulary size. In the experiments, we use the initial version of MSR-VTT, referred to as MSR-VTT-10K, which contains 10K video clips from 20 categories. Each video clip is annotated with 20 sentences by 1,327 workers from Amazon Mechanical Turk. Therefore, the dataset results in a total of 200K clip-sentence pairs and 29,316 unique words. We use the public splits for training and testing, i.e., 6,513 for training, 497 for validation, and 2,990 for testing.

MSVD. It contains 1,970 YouTube short video clips with each one depicting a single activity in 10 seconds to 25 seconds. Each video clip has roughly 40 English descriptions. Similar to the prior work [28, 25], we take 1,200 video clips for training, 100 clips for validation and 670 clips for testing, respectively.

ActivityNet. It is a large-scale video benchmark dataset with complex human activities for high-level video understanding tasks, including temporal action proposal, temporal action localization, and dense video captioning. Specifically, there are 10,024 videos for training, 4,926 for validation, and 5,044 for testing. Each video has multiple annotated events with starting and ending timestamps as well as the corresponding captions.

4.1.2 Implementation Details

For the sentence preprocessing, we remove the punctuation in each sentence, split each sentence by blank spaces, and convert all words into lowercase. Sentences longer than 30 words are truncated, and the word embedding size is set to 468.

For the encoder, we resize all frames of each video clip to the standard size of 299 × 299 and feed them into Inception-V4 [52], which is pretrained on the ILSVRC-2012-CLS [62] classification dataset, and extract the 1,536-dimensional semantic feature of each frame from the last pooling layer. Inspired by [25], we choose 28 equally spaced features from one video, and pad them with zero vectors if the number of features is less than 28. The input dimension of the decoder is 468, the same as that of the word embedding, while the hidden layer contains 512 units. For the reconstructor, the inputs are the hidden states of the decoder and thus have a dimension of 512. To ease the reconstruction loss computation, the dimension of the hidden layer is set to 1,536, the same as that of the frame features produced by the encoder.

During training, AdaDelta [63] is employed for optimization with the cross-entropy loss, while Adam [64] with a learning rate of 1e-5 is applied under the reinforcement learning. To help the models trained with the REINFORCE algorithm converge faster, we initialize them with the pre-trained models that achieve the best CIDEr scores under cross-entropy training. The training stops when the CIDEr value on the validation set does not increase for 20 successive epochs. During the testing phase, beam search with a beam size of 5 is used for the final caption generation.
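The snippet below sketches the frame-feature preparation described above: 28 equally spaced per-frame features are selected from each clip, and shorter clips are padded with zero vectors.

```python
import torch

def sample_or_pad(feats, num_slots=28):
    """feats: (n, 1536) per-frame features of one clip -> (num_slots, 1536)."""
    n = feats.size(0)
    if n >= num_slots:
        idx = torch.linspace(0, n - 1, steps=num_slots).long()   # equally spaced indices
        return feats[idx]
    pad = feats.new_zeros(num_slots - n, feats.size(1))          # zero-vector padding
    return torch.cat([feats, pad], dim=0)

clip = sample_or_pad(torch.randn(100, 1536))
print(clip.shape)  # torch.Size([28, 1536])
```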

4.2 Performance Comparisons

4.2.1 Performance on MSR-VTT

Model BLEU-4 METEOR ROUGE-L CIDEr
MP-LSTM (AlexNet) 32.3 23.4 - -
MP-LSTM (GoogleNet) 34.6 24.6 - -
MP-LSTM (VGG19) 34.8 24.7 - -
SA-LSTM (AlexNet) 34.8 23.8 - -
SA-LSTM (GoogleNet) 35.2 25.2 - -
SA-LSTM (VGG19) 35.6 25.4 - -
SA-LSTM (Inception-V4) 36.3 25.5 58.3 39.9
RecNet_global 38.3 26.2 59.1 41.7
RecNet_local 39.1 26.6 59.3 42.7
SA-LSTM (RL) 38.3 26.3 59.5 46.1
RecNet_global (RL) 38.8 27.5 60.2 48.4
RecNet_local (RL) 39.2 27.5 60.3 48.7
TABLE I: Performance evaluation of different video captioning models on the testing set of the MSR-VTT dataset in terms of BLEU-4, METEOR, ROUGE-L, and CIDEr scores (%). The encoder-decoder framework is equipped with different CNN structures, namely AlexNet, GoogleNet, VGG19, and Inception-V4. Except for Inception-V4, the metric values of the other published models are quoted from [65], and the symbol "-" indicates that the metric is unreported.

In this section, we first test the impact of different encoder-decoder architectures on video captioning, namely SA-LSTM and MP-LSTM. Both are popular encoder-decoder models and share a similar LSTM structure, except that SA-LSTM introduces an attention mechanism to aggregate frame features, while MP-LSTM relies on mean pooling. As shown in Table I, with the same VGG19 encoder, SA-LSTM yielded 35.6 and 25.4 on the BLEU-4 and METEOR metrics respectively, while MP-LSTM only produced 34.8 and 24.7. Similar results were obtained when using AlexNet or GoogleNet as the encoder. Hence, we conclude that exploiting the temporal dynamics among frames with the attention mechanism benefits sentence generation more than mean pooling over the whole video.

Fig. 6: Effects of the trade-off parameter λ for RecNet_global and RecNet_local in terms of the BLEU-4 metric on MSR-VTT. Note that λ = 0 means that the reconstructor is switched off, and the RecNet reduces to a conventional encoder-decoder model.

Besides, we also introduced Inception-V4 as an alternative CNN for feature extraction in the encoder. It is observed that with the same encoder-decoder structure SA-LSTM, Inception-V4 yielded the best captioning performance compared to the other CNNs, namely AlexNet, GoogleNet, and VGG19. This is probably because Inception-V4 is a deeper network and better at semantic feature extraction. Hence, SA-LSTM equipped with Inception-V4 is employed as the encoder-decoder model in the proposed RecNet.

By stacking the global or local reconstructor on the encoder-decoder model SA-LSTM, we can have the proposed encoder-decoder-reconstructor architecture: RecNet. Apparently, such a structure yielded significant gains to the captioning performance in all metrics. This proved that the backward flow information introduced by the proposed reconstructor could encourage the decoder to embed more semantic information and regularize the generated caption to be more consistent with the video contents. Actually, RecNet can be viewed as a dual learning model, where video captioning is the primal task and reconstruction behaves as its dual task. In the dual task, a reconstruction loss measuring the difference between the reconstructed and original visual features is employed additionally to train the primal task and optimize the parameters of the encoder and decoder. With such a reconstructor, the decoder is encouraged to embed more information from the input video sequence. Therefore, the relationship between the video sequence and caption can be enhanced. And the decoder can generate more semantically correlated sentences to describe the visual contents of the video sequences, yielding significant performance gains. More discussions and experimental results about the proposed reconstructor will be given in Section 4.4.

Moreover, we compared RecNet with two models [24, 66] that achieved great success in the 2016 MSR-VTT Challenge. As they utilized multimodal features such as aural and speech features to boost captioning, which is beyond the focus of this work, we restrict the comparison to the C3D features only. Also, as they reported results on the validation set of MSR-VTT, for a fair comparison we report our results on the same validation set. Our proposed RecNet shows superiority over these two methods, as shown in Table II. Incorporating multimodal features into the proposed RecNet would be an interesting direction for us to pursue in the future.

Model BLEU-4 METEOR ROUGE-L CIDEr
Qin et al. [24] 36.9 27.3 58.4 41.9
Shetty et al. [66] 36.9 26.8 57.7 41.3
SA-LSTM 37.9 26.9 59.5 41.5
RecNet 38.5 27.5 59.7 42.7
TABLE II: Performance evaluation of different video captioning models with C3D features on the validation set of the MSR-VTT dataset in terms of BLEU-4, METEOR, ROUGE-L, and CIDEr scores (%).

Furthermore, we introduce reinforcement learning into the SA-LSTM model and combine it with the proposed reconstructor. The results on MSR-VTT are shown in the bottom rows of Table I. It can be observed that the REINFORCE algorithm contributes substantially to the performance improvement of the baseline model, especially on the CIDEr value, which is directly optimized by the model. Once again, we see the importance of the backward flow mined by the reconstructor: even when the network, such as SA-LSTM (RL), already achieves strong performance, collaborating with the backward information flow leads to a further improvement.

4.2.2 Performance on MSVD

Also, we tested the proposed RecNet on the MSVD dataset [56], and compared it to more benchmark encoder-decoder models, such as GRU-RCN [67], HRNE [68], h-RNN [69], LSTM-E [28], aLSTMs [26], and LSTM-LS [29]. The quantitative results are given in Table III. It is observed that RecNet_local and RecNet_global built on SA-LSTM performed the best and second best in all metrics, respectively. Besides, we introduced the reconstructor to S2VT [18] to build another encoder-decoder-reconstructor model. The results show that both the global and local reconstructors bring improvements to the original S2VT in all metrics, which again demonstrates the benefits of video captioning with bidirectional cue modeling.

Model BLEU-4 METEOR ROUGE-L CIDEr
MP-LSTM (AlexNet) [19] 33.3 29.1 - -
GRU-RCN [67] 43.3 31.6 - 68.0
HRNE [68] 43.8 33.1 - -
LSTM-E [28] 45.3 31.0 - -
LSTM-LS (VGG19) [29] 46.5 31.2 - -
h-RNN [69] 49.9 32.6 - 65.8
aLSTMs [26] 50.8 33.3 - 74.8
S2VT (Inception-V4) 39.6 31.2 67.5 66.7
SA-LSTM (Inception-V4) 45.3 31.9 64.2 76.2
RecNet_global (S2VT) 42.9 32.3 68.5 69.3
RecNet_local (S2VT) 43.7 32.7 68.6 69.8
RecNet_global (SA-LSTM) 51.1 34.0 69.4 79.7
RecNet_local (SA-LSTM) 52.3 34.1 69.8 80.3
TABLE III: Performance evaluation of different video captioning models on the MSVD dataset in terms of BLEU-4, METEOR, ROUGE-L, and CIDEr scores (%). The symbol "-" indicates that the metric is unreported.

One interesting property of the proposed reconstructor is its capability of learning from limited amounts of data, which is brought by the dual learning mechanism. The benefit of RecNet on a limited training set can be demonstrated by comparing the results on MSVD and MSR-VTT: the gains on MSVD are more evident than those on MSR-VTT, while the size of MSVD is only one third of that of MSR-VTT. Taking RecNet_local as an example, we halve the size of MSVD to further examine the proposed model on learning from limited data. As shown in Table IV, the performance in terms of CIDEr declines when the data size is reduced. However, a larger relative improvement is achieved, with a gain of 9.3% on half of MSVD vs. 5.4% on the full MSVD.

Data Size SA-LSTM (Inception-V4) RecNet_local Performance Gain
50% 72.2 78.9 +9.3%
100% 76.2 80.3 +5.4%
TABLE IV: Performance on MSVD with reduced training size in terms of CIDEr scores (%). RecNet_local is taken as the exemplary model.

4.2.3 Performance on ActivityNet

To further verify the effectiveness of the proposed RecNet, experiments on the ActivityNet dataset [50] are conducted, with the results shown in Table V. We construct the video-sentence pairs by extracting the video segments indicated by the starting and ending timestamps as well as their associated sentences. As the ground-truth captions and timestamps of the test split are unavailable, we validate our model on the validation split.

Model BLEU-4 METEOR ROUGE-L CIDEr
SA-LSTM 1.58 8.69 18.24 28.50
RecNet_global 1.67 9.23 20.07 34.78
RecNet_local 1.71 9.70 20.50 35.52
SA-LSTM (RL) 1.65 9.37 21.03 33.57
RecNet_global (RL) 1.72 10.43 23.25 37.96
RecNet_local (RL) 1.74 10.47 23.49 38.43
TABLE V: Performance evaluation of different video captioning models on the validation split of the ActivityNet dataset in terms of BLEU-4, METEOR, ROUGE-L, and CIDEr scores (%). "RL" in brackets means that the model is trained with the self-critical algorithm.

Similar to the results on MSR-VTT and MSVD, it can be observed that the proposed RecNet outperforms the base model SA-LSTM. The main reason is that the backward flow is well captured by the proposed reconstructor. The same holds when the models are trained with the self-critical sequence training strategy. Methods realized in a traditional encoder-decoder architecture mainly focus on exploiting the forward flow (video to sentence), while neglecting the backward flow (sentence to video). In contrast, the proposed RecNet, realized in a novel encoder-decoder-reconstructor architecture, leverages both the forward and backward flows. Therefore, the relationship between the video sequence and the caption can be further enhanced, which improves the video captioning performance.

4.3 Study on the Trade-off Parameter

In this section, we discuss the influence of the trade-off parameter λ in Eq. (22). The BLEU-4 values obtained with different λ are given in Fig. 6. First, it can be concluded again that adding the reconstruction loss (λ > 0) does improve the performance of video captioning in terms of BLEU-4. Second, there is a trade-off between the forward likelihood loss and the backward reconstruction loss, as a too large λ may incur noticeable deterioration in the captioning performance. Thus, λ needs to be carefully selected to balance the contributions of the encoder-decoder and the reconstructor. As shown in Fig. 6, we empirically set λ to 0.2 and 0.1 for RecNet_global and RecNet_local, respectively.

4.4 Study on the Reconstructors

In this section, we provide a deeper study of the proposed reconstructors.

4.4.1 Performance Comparisons between Different Reconstructors

The quantitative results of RecNet_global and RecNet_local on MSR-VTT are given in Table I. It can be observed that RecNet_local performs slightly better than RecNet_global. The reason mainly lies in the temporal dynamic modeling: RecNet_global employs mean pooling to reproduce the video representation and thus misses the local temporal dynamics, while RecNet_local includes the attention mechanism to exploit the local temporal dynamics for each frame reconstruction.

The performance gap on some metrics, such as METEOR and ROUGE-L, between RecNet_global and RecNet_local may not be significant. One possible reason is that the visual information of consecutive frames is very similar: as the video clips in the available datasets are short, the visual representations of the frames differ only slightly from each other, and the global and local structure information therefore tends to be similar. Another possible reason is the complicated video-sentence relationship, which may lead to similar input information for RecNet_global and RecNet_local.

4.4.2 Effects of Reconstruction Order

Another interesting point about the proposed reconstructor is that we do not always need to constrain the reconstruction order. In fact, the reconstruction order is unnecessary when reproducing the global structure of the video. We can discard it in RecNet_global, as we finally apply the mean pooling operation to the reconstructed feature sequence to acquire the video global structure (the mean frame feature), which disregards the original feature order anyway. For RecNet_local, we need the correct reconstruction order and address it in an implicit manner: the LSTM employed in RecNet_local reconstructs the video features frame by frame, and the local losses are then pooled together as in Eq. (12).

Besides, we have tested another method to enforce the correct reconstruction order for RecNet_local. Specifically, we tried replacing the attention mechanism in the reconstructor with the transposed attention matrix obtained in the decoder, which carries the ordering information of the input frame features. The results are given in Table VI. It is observed that, to a certain extent, transposing the attention matrix for feature reconstruction does help boost the baseline encoder-decoder model. However, it is inferior to the proposed RecNet_local. The reason is that simple matrix transposition cannot effectively exploit the complicated relationship between the video features and the hidden states of the decoder.

Model BLEU-4 METEOR ROUGE-L CIDEr
SA-LSTM (Inception-V4) 36.3 25.5 58.3 39.9
RecNet_local 39.1 26.6 59.3 42.7
RecNet_local (transpose) 36.7 25.7 57.9 40.4
TABLE VI: Performance evaluation of different video captioning models on the testing set of the MSR-VTT dataset in terms of BLEU-4, METEOR, ROUGE-L, and CIDEr scores (%).
Dataset Model BLEU-4 METEOR CIDEr
MSR-VTT RecNet_global 38.3 26.2 41.7
RecNet_local 39.1 26.6 42.3
RecNet_g+l 38.7 26.7 43.1
RecNet_global (RL) 38.8 27.5 48.4
RecNet_local (RL) 39.2 27.5 48.7
RecNet_g+l (RL) 39.3 27.7 49.5
MSVD RecNet_global 51.1 34.0 79.7
RecNet_local 52.3 34.1 80.3
RecNet_g+l 51.5 34.5 81.8
RecNet_global (RL) 52.3 34.1 82.9
RecNet_local (RL) 52.4 34.3 83.6
RecNet_g+l (RL) 52.9 34.8 85.9
ActivityNet RecNet_global 1.67 9.23 34.78
RecNet_local 1.71 9.70 35.52
RecNet_g+l 1.72 9.98 35.84
RecNet_global (RL) 1.72 10.43 37.96
RecNet_local (RL) 1.74 10.47 38.43
RecNet_g+l (RL) 1.75 10.61 38.88
TABLE VII: Performance evaluation of the combined architecture on MSR-VTT, MSVD, and ActivityNet in terms of BLEU-4, METEOR, and CIDEr scores (%).

4.4.3 Jointly Reconstructing the Global and Local Structures

We also propose a new model, RecNet_g+l, which simultaneously considers both the global and local structures to produce a more reliable reconstruction of the video representation. The performances of RecNet_g+l on MSR-VTT, MSVD, and ActivityNet are illustrated in Table VII. Obviously, RecNet_g+l yields consistent performance improvements over both RecNet_global and RecNet_local, especially on the METEOR and CIDEr metrics. The consistent improvement can also be observed when the RecNets are trained with the self-critical strategy. For example, on the MSR-VTT dataset, RecNet_g+l (RL) has a better CIDEr score (49.5) than RecNet_global (RL) (48.4) and RecNet_local (RL) (48.7), and the METEOR score of RecNet_g+l (RL) (27.7) also outperforms those of RecNet_global (RL) (27.5) and RecNet_local (RL) (27.5). Such improvements on different benchmark datasets show that jointly learning the global and local reconstructors helps exploit the backward flow (sentence to video) comprehensively, thereby further boosting the performance of video captioning.

4.4.4 Discussions on the Reconstructor

Moreover, we make more detailed studies on how well the reconstruction is performed. First, we record the training losses as well as the CIDEr scores during the training process (with and without the proposed reconstructor), as illustrated in Fig. 7, to examine the effects of the reconstructor.

Fig. 7: The curves of training loss and CIDEr score during the training process (with and without the proposed reconstructor). The base model denotes the traditional encoder-decoder structure for video captioning, while the RecNet stacks the reconstructor on top of the encoder-decoder model. The solid lines in red, blue, and black indicate the cross-entropy loss, the reconstruction loss, and the total training loss, respectively. The dotted line in green indicates the CIDEr score.
Fig. 8: Visualization of the distribution of the decoder hidden states in base model (a) and RecNet (b). The dots in red represent the hidden states generated in training process, and dots in green represent the hidden states generated during the inference process.
Fig. 9: Visualization of some video captioning examples on the MSR-VTT dataset with different models. Due to the page limit, only one ground-truth sentence is given as the reference. Compared to SA-LSTM, the proposed RecNet is able to yield more vivid and descriptive words highlighted in red boldface, such as “fighting”, “makeup”, “face”, and “horse”.

It can be observed that the cross-entropy loss defined by Eq. (14) and the CIDEr score consistently converge during the training of the baseline model. Relying on the early stopping strategy, the parameters of the baseline model are determined. Afterward, the reconstructor, specifically RecNet_local, is employed to reconstruct the local structure of the original video from the hidden state sequence generated by the decoder. During this process, the reconstruction loss defined by Eq. (12) is used to train the reconstructor, and the total training loss defined by Eq. (22) is used to train the encoder and decoder. It can be observed that the reconstruction loss decreases as the training proceeds, while the cross-entropy loss increases slightly in the beginning and then consistently decreases. This is reasonable, as the untrained reconstructor can hardly provide valid information for training the encoder and decoder in the first few iterations. The same observations hold for the CIDEr score. Hence, it can be concluded that the proposed reconstructor helps train the encoder and decoder and thereby improves the video captioning performance. Also, the decrease of the reconstruction loss clearly demonstrates that the reconstructor can reproduce meaningful visual information from the hidden state sequence generated by the decoder.

Second, we examine how the hidden state in the decoder changes after the reconstructor is employed. Generally, in the video captioning models trained with the cross-entropy loss, the decoder is encouraged to maximize the likelihood of each word given the previous hidden state and the ground-truth word at the previous step. Whereas for inference, the errors are accumulated along the generated sentence, as no ground-truth word is observed by the decoder before predicting a word. As such, the discrepancy between training and inference is inevitably introduced, which is also referred to as the exposure bias [36].

To examine how the hidden states in the decoder change after the reconstruction is introduced, we consider the distributions of the last hidden states of the sequences in the decoder, as in [22], since they encode the necessary information about the whole sequential input. Specifically, we visualize the distributions of the last hidden states, obtained when the LSTM in the decoder reaches the maximum step or the end-of-sentence token is predicted, in the training and inference stages, respectively. The hidden state visualizations are illustrated in Fig. 8. The hidden states are taken from the same batch of size 200, and we reduce their dimension to 2 with t-SNE [70]. It is obvious that the baseline model (Fig. 8 (a)) presents a large discrepancy between the training and inference stages, while the proposed RecNet (Fig. 8 (b)) significantly reduces this discrepancy. This is also one of the reasons that RecNet performs better than the competitor models.
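The visualization can be reproduced along the following lines: the last decoder hidden states collected under teacher forcing (training mode) and under free-running decoding (inference mode) are projected to two dimensions with t-SNE and plotted in different colors. Random tensors stand in for the real hidden states in this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

train_states = np.random.randn(200, 512)   # last hidden states, training mode
infer_states = np.random.randn(200, 512)   # last hidden states, inference mode

points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(
    np.concatenate([train_states, infer_states], axis=0)
)
plt.scatter(points[:200, 0], points[:200, 1], c="red", s=8, label="training")
plt.scatter(points[200:, 0], points[200:, 1], c="green", s=8, label="inference")
plt.legend()
plt.savefig("hidden_state_tsne.png")
```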

4.5 Qualitative Analysis

Besides, some qualitative examples are shown in Fig. 9. It can be observed that the proposed RecNets with the local and global reconstructors generally produced more accurate captions than the typical encoder-decoder model SA-LSTM. For instance, in the second example, SA-LSTM generated "a woman is talking", which missed the core subject of the video, i.e., "makeup". By contrast, the captions produced by RecNet_global and RecNet_local are "a woman is talking about makeup" and "a women is putting makeup on her face", which apparently are more accurate. RecNet_local even generates the word "face", which results in a more descriptive caption. More results can be found in the supplementary file.

5 Conclusions

In this paper, we proposed a novel RecNet in the encoder-decoder-reconstructor architecture for video captioning, which exploits the bidirectional cues between natural language description and video content. Specifically, to address the backward information from description to video, two types of reconstructors were devised to reproduce the global and local structures of the input video, respectively. An additional architecture by fusing the two types of reconstructors was also presented and compared with the models that can reproduce either the global or the local structure separately. The forward likelihood and backward reconstruction losses were jointly modeled to train the proposed network. Besides, we employed the REINFORCE algorithm to directly optimize the CIDEr score and fused the reward-based loss with the traditional loss from the reconstructor for further improving the captioning performance. The extensive experimental results on the benchmark datasets demonstrate the superiority of the proposed RecNet over the existing encoder-decoder models in terms of the typical metrics for video captioning.

Acknowledgments

This work was supported by the National Key Research and Development Plan of China under Grant 2017YFB1300205, NSFC Grant no. 61573222, and Major Research Program of Shandong Province 2018CXGC1503.

References