Non-Autoregressive Video Captioning with Iterative Refinement

11/27/2019 ∙ by Bang Yang, et al. ∙ Peking University

Existing state-of-the-art autoregressive video captioning (ARVC) methods generate captions sequentially, which leads to low inference efficiency. Moreover, the word-by-word generation process does not fit human intuition about comprehending video contents (i.e., first capturing the salient visual information and then generating a well-organized description), resulting in unsatisfactory caption diversity. To come closer to the human manner of comprehending video contents and writing captions, this paper proposes a non-autoregressive video captioning (NAVC) model with iterative refinement. We further propose to exploit external auxiliary scoring information to assist the iterative refinement process, which helps the model focus on inappropriate words more accurately. Experimental results on two mainstream benchmarks, i.e., MSVD and MSR-VTT, show that our proposed method generates more felicitous and diverse captions with a generally faster decoding speed, at the cost of up to 5% caption quality compared with the autoregressive counterpart. In particular, the proposal of using auxiliary scoring information not only improves non-autoregressive performance by a large margin, but is also beneficial for caption diversity.




Code Repositories


The PyTorch code of the AAAI2021 paper "Non-Autoregressive Coarse-to-Fine Video Captioning".


1 Introduction

Video captioning aims at automatically describing video contents with complete and natural sentences, where most recent works [34, 25, 6, 27] adopt the encoder-decoder framework [33] and achieve promising performance. In general, deep Convolutional Neural Networks (CNNs) are used to encode a video, whereas for decoding, Recurrent Neural Networks (RNNs) like LSTM [21] or GRU [8], CNNs [1, 4], and self-attention based transformers [31, 5] have been adopted to generate captions. However, these decoders generate captions in a sequential manner, i.e., they condition each output word on previously generated outputs. Such an autoregressive (AR) property results in low inference efficiency and is prone to error accumulation during inference [3]. Moreover, AR decoders favor the frequent n-grams in the training data [10] and tend to generate repeated captions [13]; captions generated by AR methods are therefore often rigid and lacking in variability. Although non-autoregressive (NA) decoding has recently been proposed in neural machine translation (NMT) [16] to generate sentences in parallel, its performance inevitably declines due to the poor approximation to the target distribution. Among the technical innovations proposed to compensate for this degradation, iterative refinement [24, 14] can effectively bridge the performance gap between NA decoding and AR decoding.

Figure 1: An illustration of different decoding methods. Specifically, for (a) autoregressive decoding, a description is generated word by word. For (b) iterative non-autoregressive decoding, a description is produced in parallel by first capturing the salient visual information (words in blue and underlined) and then correcting grammatical errors (words in red and italic), which is more consistent with human intuition about comprehending video contents.

Figure 1 illustrates the differences between AR decoding and iterative NA decoding. The word-by-word generation of AR decoding in Figure 1 (a) may not fit the human intuition about comprehending video contents. In general, given a video clip, we are likely to be attracted by the salient visual information first. Next, relying on the first-glance gist of the scene, we can produce a well-organized description by adjusting word order or replacing inappropriate words with proper ones. Thus, comprehending a video can be treated as a first-visual-then-linguistic generation process. It can be observed that the iterative NA decoding in Figure 1 (b) is more consistent with such a process, while taking fewer steps than AR decoding to generate a natural description.

To achieve the aforementioned first-visual-then-linguistic generation, we first propose a non-autoregressive video captioning model (NAVC) using a self-attention based decoder. For parallelization, the decoder is modified so that the prediction of each word can attend to the entire sequence. Besides, a masked language modeling objective [12] is employed to train the proposed NAVC. To enhance the capability of NAVC, we provide it with examples of different difficulties during training, i.e., target sequences are masked randomly according to a uniformly distributed masking ratio (ranging from zero to one). After training, NAVC is capable of not only generating sentences from totally masked sequences but also predicting any proper subset of target words with rich bi-directional contexts.

To make full use of the capability of NAVC, we propose an improved highly parallel decoding algorithm named mask-predict-interact (MPI), motivated by [14], to implement an iterative refinement procedure during inference. The original mask-predict algorithm [14] repeatedly masks out and re-predicts the subset of words that the NA model is least confident about. However, the confidence of NAVC in its words, as we will show, is not so credible. Therefore, we propose to additionally exploit external auxiliary scoring information to assist iterative refinement in accurately focusing on the inappropriate words that need to be reconsidered. To reduce the workload, we simply adopt an ARVC that has the same architecture as our proposed NAVC as an external teacher to provide complementary scoring information in this work. With such a delicately designed iterative refinement procedure, a high-quality caption can be generated in a compositional manner, which can be treated as a first-visual-then-linguistic generation process: salient visual information is more likely to be captured given few unmasked words in early iterations, whereas grammatical errors are gradually removed in later iterations.

The main contributions of this work are three-fold. (1) To our knowledge, we are the first to sufficiently study non-autoregressive decoding in video captioning and to systematically develop a non-autoregressive video captioning model (NAVC) with iterative refinement. (2) We further propose to exploit external auxiliary scoring information to assist NAVC in precisely focusing on inappropriate words during iterative refinement. (3) Extensive experiments show that with the aid of auxiliary information and iterative refinement, our proposed NAVC generates more felicitous and diverse captions and decodes faster, at the cost of as little as 5% caption quality compared with the autoregressive counterpart.

2 Background

In this section, we discuss the differences between autoregressive (AR) decoding and non-autoregressive (NA) decoding.

2.1 Autoregressive decoding

Given a video V, the encoder aims to encode sampled frames/clips into the video representation R, while the decoder attempts to decode R into a sentence Y = {y_1, ..., y_T} with length T. In AR decoding, the generation process is triggered by the [BOS] (begin-of-sentence) token and terminated after generating the [EOS] (end-of-sentence) token. Formally, the model with parameters θ factors the distribution over possible output sentences into a chain of conditional probabilities with a left-to-right causal structure:

    p(Y | V; θ) = ∏_{t=1}^{T} p(y_t | y_{<t}, V; θ).   (1)
During training, the ground-truth sentences are usually used for supervision, which is known as the teacher forcing technique [36]. Therefore, the model is trained to minimize the cross-entropy loss:

    L_CE(θ) = − ∑_{t=1}^{T} log p(y_t | y_{<t}, V; θ).   (2)
The mainstream AR decoders applied in video captioning include RNNs [27, 6, 34], CNNs [1, 4] and self-attention based transformers [31, 5]. Table 1 shows the parallelization capability of different models in the training and inference stages. It can be concluded that no AR model is parallelizable during inference.

Models | Training | Inference
AR models, RNN based | ✗ | ✗
AR models, CNN based | ✓ | ✗
AR models, self-attention based | ✓ | ✗
NA models | ✓ | ✓
Table 1: Parallelization capability of autoregressive (AR) models and non-autoregressive (NA) models in different stages.
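The left-to-right chain of conditional probabilities above can be sketched as a greedy decoding loop. The `cond_prob` table below is a hypothetical toy stand-in for a real decoder conditioned on video features; only the loop structure reflects AR decoding.

```python
import math

BOS, EOS = "<bos>", "<eos>"

def cond_prob(prefix, video_feat):
    """Toy stand-in for p(y_t | y_<t, V); a real model would run a decoder here."""
    # Hand-crafted transitions, for illustration only.
    table = {
        BOS: {"a": 0.9, EOS: 0.1},
        "a": {"man": 0.8, "dog": 0.2},
        "man": {"is": 0.9, EOS: 0.1},
        "is": {"running": 0.7, EOS: 0.3},
        "running": {EOS: 1.0},
    }
    return table[prefix[-1]]

def greedy_ar_decode(video_feat, max_len=10):
    """Autoregressive decoding: each token conditions on all previous tokens."""
    tokens, log_p = [BOS], 0.0
    for _ in range(max_len):
        dist = cond_prob(tokens, video_feat)
        word = max(dist, key=dist.get)      # greedy argmax at step t
        log_p += math.log(dist[word])       # accumulate log p(y_t | y_<t, V)
        if word == EOS:
            break
        tokens.append(word)
    return tokens[1:], log_p

caption, logp = greedy_ar_decode(video_feat=None)
```

Because each step must wait for the previous token, the loop cannot be parallelized at inference time, which is the bottleneck the NA decoder removes.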

2.2 Non-autoregressive decoding

To achieve parallelization in the inference stage, a solution is to remove the sequential dependency (i.e., the dependence of y_t on y_{<t}). Assuming the target sequence length T can be modeled with a separate conditional distribution p(T | V), NA decoding can be expressed as:

    p(Y | V; θ) = p(T | V) · ∏_{t=1}^{T} p(y_t | V; θ).   (3)

There are two main differences between NA decoding and AR decoding according to Eq. (3) and Eq. (1).

Sequential dependency. After the sequential dependency is removed, NA decoding makes the strong assumption that the individual token predictions are conditionally independent of each other. Although this modification allows for highly parallel decoding during inference, it inevitably introduces another problem and leads to performance degradation. For example, a non-autoregressive translation model might consider two or more possible translations, A and B; due to the complete conditional independence mentioned above, it could predict one token from A and another token from B. This problem is termed the "multimodality problem" [16] in non-autoregressive machine translation (NAT). To avoid confusion between the "multimodality problem" and the multimodal inputs in video captioning, we use double quotation marks whenever this problem is mentioned. To compensate for the performance degradation brought by this problem, several technical innovations have been proposed in NAT, including but not limited to:

  • Iterative refinement. Ghazvininejad et al. [14] show that iterative refinement during inference can collapse the multi-modal distribution into a sharper uni-modal distribution, thus alleviating the “multimodality problem”.

  • Knowledge distillation. Gu et al. [16] discovered that a NA model trained on distilled data generated by its AR counterpart performs better than one trained on the original ground-truth data. One possible explanation is that the distilled data is less noisy and more deterministic, eschewing the "multimodality problem".
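The "multimodality problem" described above can be reproduced with a toy example: given two equally valid target captions, position-wise independent argmax can splice tokens from both. The marginal distributions below are invented for illustration.

```python
# Two plausible captions for the same video:
#   A: "a man walks"      B: "one person strolls"
# Under conditional independence, each position is predicted separately,
# so the per-position marginals mix the two candidate captions.
marginals = [
    {"a": 0.55, "one": 0.45},
    {"man": 0.45, "person": 0.55},
    {"walks": 0.55, "strolls": 0.45},
]

# Independent argmax per position splices tokens from A and B together,
# yielding a sentence that belongs to neither target.
na_output = [max(p, key=p.get) for p in marginals]
```

Here the output mixes "a" (from A) with "person" (from B), exactly the kind of inconsistency that iterative refinement and knowledge distillation aim to suppress.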

Target sequence length. In AR decoding, the length T of a generated caption is determined dynamically by the [EOS] token. In NA decoding, however, T is modeled with a separate conditional distribution p(T | V). Prior works in NAT determine T with the following methods: (1) using a fertility model [16], (2) applying a length classifier to the encoder's outputs [24], (3) adding a special [LENGTH] token to the encoder for prediction [14], and (4) leveraging the statistics of the training data [18, 35].

3 Non-Autoregressive Video Captioning

We begin this section by presenting the architecture of our proposed non-autoregressive video captioning model (NAVC), followed by its training objective. Finally, the inference rules of NAVC are discussed, where the mask-predict-interact decoding algorithm is introduced in detail.

Figure 2: The architecture of our proposed NAVC, which consists of a CNN-based encoder, a length predictor and a self-attention based decoder.

3.1 Architecture

As shown in Figure 2, the architecture of our proposed NAVC consists of three parts, including a CNN-based encoder, a length predictor and a self-attention based decoder. Next, we will introduce them separately.

3.1.1 Encoder

In particular, for a given video V, we first sample N frames/clips. Then the representation R can be obtained by two subsequent transforms:

    R = F_IEL(F_CNN(V)),   (4)

where θ_CNN and θ_IEL are the trainable parameters within F_CNN and F_IEL, respectively. Specifically, F_CNN represents a series of convolution and pooling operations in pre-trained 2D/3D CNNs, and it produces output of dimension d_i. Then F_IEL, the input embedding layer, maps its input of dimension d_i to output of dimension d_m (the model dimension of the decoder). Within F_IEL, the shortcut connection of highway networks [28] is adopted, so F_IEL can be defined as (omitting biases for clarity):

    F_IEL(x) = g ⊙ ReLU(W_h x) + (1 − g) ⊙ (W_p x),  g = σ(W_g x),   (5)

where ⊙ is the Hadamard (element-wise) product, W_h, W_p and W_g are learnable weights (W_p projects the shortcut to dimension d_m) and σ is the sigmoid activation function. In our preliminary experiments, adding the original encoder of the transformer [31] right after the input embedding layer (IEL) does not promote performance, so we simply exclude it. In terms of integrating representations of multiple modalities (e.g., appearance and motion features), the final video representation can be obtained by concatenating them along the temporal axis.
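A minimal NumPy sketch of the highway-style input embedding layer follows. The dimensions, the ReLU transform branch, and the projected shortcut `W_p` are assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d_i, d_m, n_frames = 2048, 512, 8   # assumed feature / model dimensions

# Learnable weights of the input embedding layer (biases omitted, as in the text).
W_h = rng.normal(scale=0.02, size=(d_m, d_i))   # transform branch
W_p = rng.normal(scale=0.02, size=(d_m, d_i))   # shortcut projection (assumption)
W_g = rng.normal(scale=0.02, size=(d_m, d_i))   # element-wise gate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def input_embedding_layer(x):
    """Highway-style embedding g*h + (1-g)*x', mapping d_i -> d_m per frame."""
    h = np.maximum(0.0, x @ W_h.T)   # transform branch (ReLU assumed)
    xp = x @ W_p.T                   # project the shortcut to d_m
    g = sigmoid(x @ W_g.T)           # gate deciding transform vs. shortcut
    return g * h + (1.0 - g) * xp

cnn_feats = rng.normal(size=(n_frames, d_i))  # per-frame CNN outputs F_CNN(V)
R = input_embedding_layer(cnn_feats)          # video representation, (8, 512)
```

Multiple modalities would each pass through such a layer and be concatenated along the temporal (first) axis.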

3.1.2 Length Predictor

Referring to Eq. (3), NAVC needs to know the target sequence length T before decoding. We introduce a length predictor (LP) with parameters θ_LP to predict the length distribution φ:

    φ = Softmax(W_2 · ReLU(W_1 · MeanPool(R))),   (6)

where φ ∈ R^{T_max} and T_max denotes the pre-defined maximum sequence length, MeanPool denotes mean pooling (along the temporal axis), ReLU is the ReLU activation function, and W_1 and W_2 are learnable weights. It is noteworthy that we formulate length prediction as regression rather than classification, since the ground-truth length distribution in the training data is available. During training, we directly use the sequence length of the ground-truth sentence. As for inference, we use the predicted distribution φ and discuss it further in Section 3.3.2.
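A sketch of the length predictor head in NumPy is shown below; the hidden width and T_max are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_m, d_h, T_max = 512, 256, 30   # model dim, assumed hidden dim, assumed max length

W1 = rng.normal(scale=0.02, size=(d_h, d_m))
W2 = rng.normal(scale=0.02, size=(T_max, d_h))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_length_distribution(R):
    """Mean-pool R along time, then map through a two-layer head to T_max bins."""
    pooled = R.mean(axis=0)                  # (d_m,)
    hidden = np.maximum(0.0, W1 @ pooled)    # ReLU
    return softmax(W2 @ hidden)              # phi, a distribution over lengths

R = rng.normal(size=(8, d_m))                # toy video representation
phi = predict_length_distribution(R)
```

Because φ is a full distribution, it can be regressed toward the ground-truth length distribution during training and used to rank candidate lengths at inference.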

3.1.3 Decoder

The decoder in NAVC has two capabilities, namely generating captions in parallel and predicting any subset of target words. To achieve the former, we adopt the original decoder of the transformer [31] with only one modification: removing the self-attention mask. By doing so, our decoder becomes bi-directional, so the prediction of each token can use both left and right contexts. The latter capability, predicting any subset of target words, is essentially equivalent to doing a cloze test. Specifically, given a ground-truth sentence Y with length T, we mask some percentage of its tokens with the special [MASK] token (analogous to BERT [12]) to get the partially masked sentence Ŷ, where the masked tokens are denoted as Y_m. Then the decoder predicts each masked token y_t ∈ Y_m, conditioned on Ŷ and the video representation R:

    p(y_t | Ŷ, R; θ) = F_P(F_D(Ŷ, R)),   (7)

where F_D denotes the transform of the self-attention based decoder with trainable parameters θ_D, producing output of dimension d_m, and F_P denotes the transform of a projection layer with trainable parameters θ_P:

    F_P(x) = Softmax(W_v x),   (8)

where W_v is a learnable weight and its output dimension is the vocabulary size. Unlike the relatively small masking ratio (e.g., 0.15) in BERT [12], we adopt a uniformly distributed masking ratio β ranging from zero to one during training. Therefore, the decoder deals with a hard example when β is close to one and an easy example when β is close to zero, which improves robustness.
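The construction of one training example with a uniformly sampled masking ratio can be sketched as follows (plain Python; the tokenization is illustrative):

```python
import random

random.seed(0)
MASK = "[MASK]"

def make_training_example(ground_truth):
    """Mask a uniformly sampled fraction of tokens, from easy (~0) to hard (~1)."""
    T = len(ground_truth)
    ratio = random.uniform(0.0, 1.0)      # masking ratio ~ U(0, 1)
    n_mask = max(1, round(ratio * T))     # always mask at least one token
    masked_idx = set(random.sample(range(T), n_mask))
    inputs = [MASK if t in masked_idx else w for t, w in enumerate(ground_truth)]
    targets = {t: ground_truth[t] for t in masked_idx}  # loss only on masked slots
    return inputs, targets

inputs, targets = make_training_example("a group of people are dancing".split())
```

With ratio near one this reduces to fully non-autoregressive generation; with ratio near zero it is an easy cloze test with rich bi-directional context.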

Figure 3: An example (with target length T = 6) from the MSR-VTT dataset that illustrates how our mask-predict-interact (MPI) algorithm works and the differences between MPI and mask-predict [14]. At each iteration, tokens in red have relatively low confidence, so they are masked and re-predicted in the next iteration. Best viewed in color.

3.2 Training Objective

Instead of training the length predictor separately [24], we train NAVC end-to-end by jointly minimizing the length prediction loss L_LP and the captioning loss L_cap:

    L(θ) = L_LP + L_cap,   (9)

where θ denotes all parameters in NAVC; L_LP and L_cap tune the parameters of the length predictor and of the decoder (together with the input embedding layer), respectively. In particular, we fix the parameters of the pre-trained CNN (i.e., θ_CNN) as most previous works [6, 27] did, although fine-tuning might bring performance improvements [25]. In terms of the length prediction loss, given the predicted distribution φ and the ground-truth distribution φ*, we adopt the smooth L1 loss [15]:

    L_LP = SmoothL1(φ, φ*).   (10)

As for the captioning loss, we optimize the cross-entropy loss over every masked token y_t ∈ Y_m:

    L_cap = − ∑_{y_t ∈ Y_m} log p(y_t | Ŷ, R; θ).   (11)
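The joint objective described above, smooth L1 on the length distribution plus cross-entropy on masked tokens only, can be sketched with toy numbers; all values below are illustrative.

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth L1 between predicted and ground-truth length distributions."""
    d = np.abs(pred - target)
    return float(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum())

def masked_ce(log_probs, targets, masked_positions):
    """Cross-entropy summed over masked positions only."""
    return -sum(log_probs[t][targets[t]] for t in masked_positions)

# Toy numbers: 4 length bins, a 3-token sentence with position 1 masked.
phi_pred = np.array([0.1, 0.2, 0.6, 0.1])
phi_gt   = np.array([0.0, 0.0, 1.0, 0.0])
log_probs = [{"a": -0.1}, {"man": -0.3, "dog": -1.5}, {"runs": -0.2}]
targets = {1: "man"}

loss = smooth_l1(phi_pred, phi_gt) + masked_ce(log_probs, targets, targets.keys())
```

Note that the unmasked positions 0 and 2 contribute nothing to the loss, matching the cloze-style objective.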
3.3 Inference Rules

In order to make full use of the capability of a well-trained NAVC, we propose a decoding algorithm named mask-predict-interact (MPI) that enables the model to refine captions iteratively. Compared with the original mask-predict algorithm [14], our MPI additionally exploits external auxiliary scoring information because of the low reliability of NAVC's confidence in generated tokens, as will be shown in Section 3.3.3. To reduce the workload, we simply utilize a well-trained ARVC that has the same architecture as our proposed NAVC (but with a shifted mask in the decoder) as the external teacher in this work.

During inference, decoding starts from a totally masked sequence. The proposed MPI is then able to produce high-quality sentences within a few iterations. At each iteration, tokens with low confidence are masked and re-predicted, allowing the model to give a second thought to uncertain tokens. Moreover, with the aid of the auxiliary information provided by ARVC, inappropriate tokens (e.g., words that make the sentence less fluent) are more likely to be focused on.

3.3.1 Formal Description

Here, we define six variables: the number of iterations K, the input sequence X^(k) at iteration k (1 ≤ k ≤ K), the output sequence Y^(k), the NAVC's confidence C_nav^(k) on Y^(k), the ARVC's confidence C_ar^(k) on Y^(k), and the overall confidence C^(k) on Y^(k).

Mask. For the first iteration (k = 1), X^(1) is a totally masked sequence. For later iterations (k > 1), we apply the Select-and-Mask-Out (SMO) operation, in which n tokens of the previously generated sequence Y^(k−1) are masked based on the confidence C^(k−1), to yield X^(k):

    X^(k) = SMO(Y^(k−1), C^(k−1), n),   (12)

where masked positions are filled with the [MASK] token. We adopt a deterministic operation as in [14], i.e., we mask out the n tokens with the lowest confidence and keep the rest unchanged. We use a linearly decaying masking ratio α_k with a lower bound to decide n, thus:

    n = ⌈α_k · T⌉.   (13)
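The select-and-mask-out step can be sketched as follows; the exact decay schedule in `num_to_mask` is an assumption of this sketch.

```python
MASK = "[MASK]"

def select_and_mask_out(tokens, confidence, n):
    """Mask out the n tokens with the lowest confidence; keep the rest unchanged."""
    lowest = sorted(range(len(tokens)), key=lambda t: confidence[t])[:n]
    return [MASK if t in lowest else w for t, w in enumerate(tokens)]

def num_to_mask(T, k, K, floor=1):
    """Linearly decaying masking ratio over iterations, with a lower bound
    (the concrete schedule here is assumed for illustration)."""
    ratio = max(floor / T, (K - k) / K)
    return max(floor, round(ratio * T))

prev = ["a", "group", "of", "people", "are", "dancing"]   # Y^(k-1)
conf = [0.3, 0.9, 0.4, 0.8, 0.2, 0.95]                    # C^(k-1)
x_next = select_and_mask_out(prev, conf, num_to_mask(T=6, k=1, K=3))
```

Early iterations re-predict most tokens; as the ratio decays, later iterations only polish the few least confident ones.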

Predict. Next, we feed X^(k) to the NAVC and collect its prediction results Y^(k) and confidence C_nav^(k). Specifically, a token y_t^(k) and its confidence are updated only if x_t^(k) is a [MASK] token; unmasked tokens and their confidences are carried over from the previous iteration. Since NAVC is trained with the masked language modeling objective [12], C_nav^(k) reveals the semantic consistency of Y^(k).

Interact. After masking and predicting, the teacher (ARVC) gives a mark C_ar^(k) to the generated sentence Y^(k). Since autoregressive decoders favor the frequent n-grams appearing in the training data [10], C_ar^(k) to some extent reveals the coherence between each token y_t^(k) and the previously generated words. If only C_nav^(k) were used in the next SMO operation (Eq. (12, 13)), the procedure would reduce exactly to the mask-predict algorithm [14]. We, instead, jointly consider the confidence from both NAVC and the auxiliary teacher (ARVC) to obtain the overall confidence C^(k) on Y^(k).


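The interaction step can be sketched as below. The geometric-mean fusion of the two confidences is an assumption of this sketch, not the paper's exact combination rule; the scores are invented to mirror the "a group of people a dancing" example.

```python
import math

def combine_confidence(c_nav, c_ar):
    """One plausible fusion of the two scores (geometric mean); the exact
    combination rule is assumed here for illustration."""
    return [math.sqrt(a * b) for a, b in zip(c_nav, c_ar)]

# NAVC is semantically confident about every token, but the AR teacher
# penalizes the 5th one because "are" fits better after "a group of people".
c_nav = [0.9, 0.8, 0.9, 0.8, 0.7, 0.9]
c_ar  = [0.9, 0.9, 0.9, 0.9, 0.1, 0.9]
c = combine_confidence(c_nav, c_ar)

worst = min(range(len(c)), key=c.__getitem__)   # the token masked first next time
```

The teacher's low score drags down the overall confidence of the incoherent token, steering the next mask step toward it even though NAVC alone would not have flagged it.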
3.3.2 Deciding Target Sequence Length

Following the common practice of noisy parallel decoding [16, 35], during inference we select the top length candidates with the highest values from the predicted distribution φ (the number of candidates plays a role similar to the beam size in beam search) and decode the same example with different lengths in parallel. We then utilize the overall confidence at the last iteration and select the sequence with the highest average log-probability as our hypothesis.
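Length selection and hypothesis picking can be sketched as follows; the toy distribution and candidate captions are illustrative.

```python
import math

def top_length_candidates(phi, n_candidates):
    """Pick the n most probable target lengths from the predicted distribution."""
    return sorted(range(len(phi)), key=lambda t: phi[t], reverse=True)[:n_candidates]

def pick_hypothesis(candidates):
    """candidates: (tokens, per-token overall confidences), decoded in parallel.
    Keep the sequence with the highest average log-probability."""
    def avg_logp(item):
        _, conf = item
        return sum(math.log(c) for c in conf) / len(conf)
    return max(candidates, key=avg_logp)[0]

phi = [0.0, 0.05, 0.1, 0.3, 0.4, 0.15]                 # toy length distribution
lengths = top_length_candidates(phi, n_candidates=2)   # most probable lengths

hyp = pick_hypothesis([
    (["a", "man", "runs"], [0.9, 0.8, 0.9]),
    (["a", "man", "is", "running"], [0.9, 0.9, 0.9, 0.9]),
])
```

Averaging the log-probability normalizes for length, so longer candidates are not penalized merely for having more tokens.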

Model | K | B@4 | M | C | Latency | SpeedUp | B@4 | M | C | Latency | SpeedUp
ARVC (greedy) | - | 48.6 | 34.3 | 87.5 | 19.1 ms | 1.55 | 41.6 | 28.3 | 48.6 | 22.3 ms | 1.60
ARVC (beam search) | - | 49.7 | 34.9 | 91.8 | 29.6 ms | 1.00 | 42.4 | 29.1 | 50.8 | 35.6 ms | 1.00
Base (NAVC) | 1 | 52.4 | 33.3 | 81.3 | 6.6 ms | 4.48 | 24.2 | 21.5 | 35.1 | 7.1 ms | 5.01
Base (NAVC) | 2 | 53.1 | 33.5 | 80.6 | 11.4 ms | 2.60 | 36.0 | 25.4 | 42.2 | 12.7 ms | 2.80
Base (NAVC) | 5 | 52.7 | 34.0 | 82.0 | 26.0 ms | 1.14 | 38.9 | 26.4 | 45.0 | 26.9 ms | 1.32
Full (NAVC) | 1 | 51.7 | 34.3 | 85.4 | 9.3 ms | 3.18 | 26.2 | 22.6 | 37.4 | 9.7 ms | 3.67
Full (NAVC) | 2 | 52.8 | 35.5 | 89.4 | 16.4 ms | 1.80 | 39.6 | 27.0 | 46.3 | 16.8 ms | 2.12
Full (NAVC) | 5 | 51.4 | 35.1 | 88.0 | 38.1 ms | 0.78 | 42.5 | 28.0 | 49.4 | 39.7 ms | 0.90

Table 2: Captioning performance comparison on the MSVD (left columns) and MSR-VTT (right columns) datasets, where B@4, M and C are short for BLEU@4, METEOR and CIDEr, respectively, and K is the number of iterations. Latency is the time to decode a single sentence without minibatching, averaged over the whole test set. The decoding is implemented in PyTorch on a single NVIDIA Titan X.

3.3.3 Example

Figure 3 illustrates how our proposed MPI generates a good caption of length T = 6 in just three iterations (K = 3). Meanwhile, the differences between MPI and MP under the same number of iterations are also detailed.

How our MPI works. At the first iteration (k = 1), the input sequence X^(1) is a totally masked sequence. Then NAVC predicts Y^(1) and C_nav^(1) in a purely non-autoregressive manner, producing an ungrammatical caption that nevertheless captures the salient visual information ("a group of people"). Next, Y^(1) is fed into ARVC (the auxiliary teacher) to obtain its scoring information C_ar^(1), so that the overall confidence on Y^(1) can be calculated.

At the second iteration (k = 2), we select the 4 of the 6 tokens of the previously generated Y^(1) with the lowest confidence, mask them out to obtain X^(2) and re-predict them to generate Y^(2). We observe that Y^(2) describes the video contents correctly with no grammatical errors. Moreover, the token "are" has a higher score than "a", showing that ARVC believes "are" is more suitable than "a" after "a group of people".

At the last iteration (k = 3), we mask out the 2 of the 6 tokens of Y^(2) with the lowest confidence to get X^(3). Now that 4 of the 6 tokens are available, NAVC is able to predict the remaining tokens more precisely; in this case the sentence remains unchanged, while NAVC becomes more confident about "group" and "dancing".

The differences between MPI and MP. In terms of formulation, our MPI additionally influences the masking decision (Eq. (12)) and the candidate decision (Section 3.3.2). As for the qualitative example (the right part of Figure 3), the final sentence generated by MP is quite different from our expectations. Specifically, at the end of the first iteration, MP decides to mask out and re-predict the tokens "group", "of", "people" and "class", which imperceptibly leads the refinement in a wrong direction. At later iterations, due to the decay of the masking ratio, the sentence becomes more deterministic. Therefore, the inappropriate token "is" is hard to adjust again, especially with the immediately following token "in" rather than "people". In short, iterative refinement exhibits an inherent defect: unsuitable operations in early iterations may result in infelicitous captions. Compared with MP, MPI is more likely to focus on inappropriate words, relieving this inherent defect.

4 Experiments

4.1 Datasets and Implementation Details

We evaluate on two popular benchmark datasets from the video captioning literature, namely the Microsoft Video Description (MSVD) dataset [17] and the MSR-Video To Text (MSR-VTT) dataset [37]. Three common metrics are adopted, including BLEU [26], METEOR [2] and CIDEr [32]. All metrics are computed using the API released by the Microsoft COCO Evaluation Server [7]. More details can be found in the supplementary materials.

Model | MSVD B@4 | M | C | MSR-VTT B@4 | M | C
RecNet (I) [34] | 52.3 | 34.1 | 80.3 | 39.1 | 26.6 | 42.7
E2E (IR) [25] | 50.3 | 34.1 | 87.5 | 40.4 | 27.0 | 48.3
MGSA (IR+C) [6] | 53.4 | 35.0 | 86.7 | 42.4 | 27.6 | 47.5
MARN (R+RX) [27] | 48.6 | 35.1 | 92.2 | 40.4 | 28.1 | 47.1
ARVC (R) | 46.2 | 33.4 | 77.0 | 40.4 | 28.4 | 48.0
ARVC (RX) | 45.1 | 33.1 | 83.8 | 40.2 | 28.4 | 48.1
ARVC (R+RX) | 49.7 | 34.9 | 91.8 | 42.4 | 29.1 | 50.8
Table 3: Autoregressive captioning performance on the MSVD and MSR-VTT datasets, where B@4, M and C are short for BLEU@4, METEOR and CIDEr. Features are denoted in brackets: I (Inception-V4 [29]), IR (Inception-ResNet-V2 [29]), C (C3D [30]), R (ResNet-101 [20]), RX (ResNeXt-101 [19]).

4.2 Experiment Results

4.2.1 Autoregressive Results

We first compare our ARVC with recent state-of-the-art approaches to verify the effectiveness of the overall architecture. The experimental results on both the MSVD and MSR-VTT datasets are presented in Table 3. Specifically, ARVC (R+RX) adopts the same features as MARN [27] and achieves competitive results across all metrics on both datasets. Therefore, our proposed architecture is reasonable, i.e., the following results produced by NAVC (with the same architecture) are credible. In the remaining experiments, we use the R+RX features by default.

Figure 4: Visualization of some examples from the MSR-VTT test set with the ARVC, Base (NAVC) and Full (NAVC) models. GT is a random ground-truth sentence. For the sentences generated at the first iteration (k = 1), we underline the words (also in blue) that are consistent with the salient visual information within the presented frames. We also highlight some keywords of the final outputs in bold type.

4.2.2 Non-Autoregressive Results

We adopt the predicted length distribution (see Section 3.1.2) by default for NAVC. The performance comparison between ARVC and different NAVCs is shown in Table 2. Compared with ARVC with beam search, our full model (NAVC with the MPI decoding algorithm) maintains more than 96% of the caption quality on both datasets and remarkably obtains relative improvements of 6.2% and 1.7% on the BLEU@4 and METEOR metrics on the MSVD dataset. This shows that NAVC has the potential to approach autoregressive performance. Besides, our full model outperforms the base model with relative improvements of 6.0% and 10.9% on the METEOR and CIDEr metrics on the MSVD dataset (when K = 2) and relative improvements of 9.3%, 6.1% and 9.8% on the BLEU@4, METEOR and CIDEr metrics on the MSR-VTT dataset (when K = 5). This superior performance proves the effectiveness of the auxiliary scoring information. In terms of decoding speed, NAVC generally shows a speedup of about a factor of two over ARVC with beam search due to its highly parallel nature. Because of the time cost introduced by the external teacher, our full model decodes slower than the base model. Last but not least, we can conclude that appropriately increasing the number of iterations yields captions of higher quality, though it slows down decoding. However, NAVC converges more quickly on MSVD; we analyze this further in Section 5.4.

DD | MD | CD | B@4 | M | C
 |  |  | 38.3 | 26.4 | 44.4
✓ |  |  | 38.9 | 26.4 | 45.0
 | ✓ |  | 40.9 | 27.1 | 45.4
 |  | ✓ | 41.2 | 27.6 | 48.3
✓ | ✓ |  | 41.5 | 27.2 | 46.4
✓ |  | ✓ | 41.5 | 27.6 | 48.8
 | ✓ | ✓ | 42.1 | 27.9 | 48.7
✓ | ✓ | ✓ | 42.5 | 28.0 | 49.4
Table 4: The effects of de-duplication (DD), the masking decision (MD) and the candidate decision (CD) on the MSR-VTT dataset (K = 5).

5 Analysis

To complement the aforementioned results, we present analyses that provide some intuition as to why our proposed method works. The effects of knowledge distillation and of different numbers of length candidates are discussed in the supplementary materials due to space limitations. Their conclusions are: (1) knowledge distillation is helpful for quick convergence on the MSR-VTT dataset, but does not promote performance; (2) a proper number of length candidates (e.g., 5) can improve caption quality.

5.1 Ablation Study and Diversity Study

In this work, we also observe the "repeated translation" problem [35], which is common for NA models in the context of NMT, and thus simply adopt a de-duplication technique. The effects brought by de-duplication (DD), the masking decision (MD) and the candidate decision (CD) are shown in Table 4. These techniques can be arranged in order of importance as CD > MD > DD. The improvements brought by CD and MD indicate that the auxiliary scoring information is not only beneficial for selecting suitable candidates but also helpful for NAVC to focus on inappropriate tokens. From another perspective, this shows that higher NAVC confidence in a sentence does not necessarily reflect its quality. All these findings suggest that an internal auxiliary scoring module is worth exploring in the future to get rid of external constraints.

To analyze the diversity of generated captions, following the image captioning literature [9], we compute three metrics, namely Novel (the percentage of generated captions that have not been seen in the training data), Unique (the percentage of captions that are unique among the generated captions) and Vocab Usage (the percentage of words in the vocabulary that are used to generate captions). The results in Table 5 indicate that NAVC based on iterative refinement generates more diverse captions.
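The three diversity metrics can be computed as sketched below; the toy corpus and vocabulary are invented for illustration.

```python
def diversity_metrics(generated, training_set, vocab):
    """Novel / Unique / Vocab Usage, each as a percentage."""
    novel = sum(c not in training_set for c in generated) / len(generated)
    unique = len(set(generated)) / len(generated)
    used = {w for c in generated for w in c.split()}
    usage = len(used & vocab) / len(vocab)
    return 100 * novel, 100 * unique, 100 * usage

train = {"a man is running", "a dog barks"}
vocab = {"a", "man", "is", "running", "dog", "barks", "dancing", "people"}
gen = ["a man is running", "a dog barks", "a man is dancing", "a man is dancing"]

novel, unique, usage = diversity_metrics(gen, train, vocab)
```

Here two of the four generated captions are unseen in training (Novel = 50%), three are distinct (Unique = 75%), and seven of the eight vocabulary words appear in some output (Vocab Usage = 87.5%).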

Model Novel (%) Unique (%) Vocab Usage (%)
ARVC 20.9 30.6 3.6
Base 24.2 32.6 3.0
Full 26.7 36.0 3.5
Table 5: Diversity of different models from different aspects on the MSR-VTT dataset.

5.2 Qualitative Examples

Figure 4 visualizes four examples from the MSR-VTT test set. It can be observed that NAVC based on iterative refinement has a distinct characteristic, i.e., it captures salient visual information in early iterations and gradually corrects grammatical errors in later iterations. For example, conditioned only on the video representation (k = 1), NAVC is able to capture "minecraft", "paper" and "food", all of which are salient within the presented frames. These are exactly the pieces of information we can get at a glance. Remarkably, NAVC can recognize "wrestlers", although this word is later replaced by "men" (in our full model) since "two men" is more common than "two wrestlers". Moreover, due to the limited information, the initial sentences generated by NAVC suffer from a high repetition rate. This problem is mitigated by repeatedly refining the sentences, e.g., the repeated "a" tokens are replaced with more informative words in later iterations. Compared with the base model, our full model usually preserves more semantic information, which is attributable to the auxiliary information. Overall, the captioning process of our proposed method can be treated as a first-visual-then-linguistic generation process, which avoids generating rigid captions.

Iteration (K) | MSVD CIDEr | MSVD RepR (%) | MSR-VTT CIDEr | MSR-VTT RepR (%)
1 | 85.4 | 5.16 | 37.4 | 18.76
2 | 89.4 | 3.16 | 46.3 | 13.95
3 | 89.0 | 1.51 | 48.2 | 5.40
4 | 89.3 | 0.08 | 48.9 | 0.58
5 | 88.2 | 0.05 | 49.4 | 0.28
Table 6: The CIDEr score and repetition ratio of tokens (RepR) when decoding with different numbers of iterations K on both datasets.
Figure 5: POS tagging analysis on MSR-VTT test set.

5.3 The Necessity of Iterative Refinement

Based on the discussion in Section 5.2, we further study the effect of iterative refinement. Table 6 shows that iterative refinement drastically reduces the repetition ratio of tokens within the first few iterations. The model reaches performance saturation early on the MSVD dataset, making it hard to draw conclusions there. As for the MSR-VTT dataset, the decrease in repetition ratio correlates with the steep rise in CIDEr score, suggesting that iterative refinement is able to relieve the "multimodality problem" [16] (introduced in Section 2.2).

Moreover, we utilize the NLTK tool to analyze the POS tags of the captions generated by our full model, as shown in Figure 5. The smooth curve for nouns indicates that NAVC generates most of the nouns, i.e., captures most of the visual information, at the first iteration. Combining the declining determiner curve, the repeated token "a" shown in Figure 4 and the decrease in repetition ratio in Table 6, we conclude that NAVC also tends to generate high-frequency words (i.e., determiners) in the first few iterations, resulting in a high repetition ratio. Besides, considering the steeply rising curves for gerunds (e.g., talking and playing) and prepositions (e.g., to and on), linguistic generation is emphasized in later iterations. These findings support our claim that iterative refinement enables the model to generate captions in a first-visual-then-linguistic manner.

5.4 The Interpretation of Behavioral Differences

NAVC behaves differently on the two datasets according to Tables 2 and 6, i.e., it converges more quickly on MSVD. We suspect that NAVC confronts a more serious "multimodality problem" on the MSR-VTT dataset. To test this hypothesis, we measure the per-position vocabulary usage in the training data, which may reveal NAVC's perplexity at each position, as a proxy metric for the "multimodality problem". The results shown in Figure 6 match our hypothesis. For example, more than one quarter of the words in the vocabulary can appear at the first position (the start of the sentence) in the MSR-VTT dataset, more than twice the proportion in the MSVD dataset, making NAVC more confused when predicting tokens independently. This finding suggests that a larger K is required if the dataset is more "diverse", i.e., has a higher averaged per-position vocabulary usage.
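The per-position vocabulary usage statistic can be computed as sketched below; the toy corpus and vocabulary size are invented for illustration.

```python
def per_position_vocab_usage(captions, vocab_size):
    """Fraction of the vocabulary observed at each token position
    across the training captions."""
    max_len = max(len(c.split()) for c in captions)
    seen = [set() for _ in range(max_len)]
    for c in captions:
        for t, w in enumerate(c.split()):
            seen[t].add(w)
    return [len(s) / vocab_size for s in seen]

captions = [
    "a man is running",
    "the man is running",
    "a dog barks",
    "people are dancing",
]
vocab_size = 10   # assumed vocabulary size for this toy corpus

usage = per_position_vocab_usage(captions, vocab_size)
```

A "more diverse" dataset has higher usage at early positions (here three different words open a sentence), which is exactly what makes purely parallel prediction harder.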

Figure 6: The statistics of training data about per-position vocabulary usage, which may reveal NAVC’s perplexity on each position.

6 Conclusion

In this work, we present a first attempt at a non-autoregressive video captioning model (NAVC) with iterative refinement. With a highly parallel decoding algorithm and our proposal to exploit external auxiliary scoring information, NAVC decodes in parallel and produces high-quality captions in just a few iterations. The experimental results show that our method generates captions in a first-visual-then-linguistic manner, which is more consistent with human intuition than the sequential manner and avoids generating sentences that lack variability. In particular, the substantial improvement brought by the external auxiliary scoring information encourages us to explore an internal auxiliary scoring module in future work to get rid of external constraints.


  • [1] J. Aneja, A. Deshpande, and A. G. Schwing (2018) Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5561–5570. Cited by: §1, §2.1.
  • [2] S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72. Cited by: §4.1, §7.2.
  • [3] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179. Cited by: §1.
  • [4] J. Chen, Y. Pan, Y. Li, T. Yao, H. Chao, and T. Mei (2019) Temporal deformable convolutional encoder-decoder networks for video captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8167–8174. Cited by: §1, §2.1.
  • [5] M. Chen, Y. Li, Z. Zhang, and S. Huang (2018) TVT: two-view transformer network for video captioning. In Asian Conference on Machine Learning, pp. 847–862. Cited by: §1, §2.1.
  • [6] S. Chen and Y. Jiang (2019) Motion guided spatial attention for video captioning. Cited by: §1, §2.1, §3.2, Table 3.
  • [7] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §4.1, §7.2.
  • [8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §1.
  • [9] B. Dai, S. Fidler, and D. Lin (2018) A neural compositional paradigm for image captioning. In Advances in Neural Information Processing Systems, pp. 658–668. Cited by: §5.1.
  • [10] B. Dai, S. Fidler, R. Urtasun, and D. Lin (2017) Towards diverse and natural image descriptions via a conditional gan. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2970–2979. Cited by: §1, §3.3.1.
  • [11] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §7.2.
  • [12] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019-06) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §1, §3.1.3, §3.3.1.
  • [13] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell (2015-07) Language models for image captioning: the quirks and what works. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Beijing, China, pp. 100–105. External Links: Link, Document Cited by: §1.
  • [14] M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer (2019) Constant-time machine translation with conditional masked language models. arXiv preprint arXiv:1904.09324. Cited by: §1, §1, 1st item, §2.2, Figure 3, §3.3.1, §3.3.1, §3.3.
  • [15] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §3.2.
  • [16] J. Gu, J. Bradbury, C. Xiong, V. O.K. Li, and R. Socher (2018) Non-autoregressive neural machine translation. In International Conference on Learning Representations, External Links: Link Cited by: §1, 2nd item, §2.2, §2.2, §3.3.2, §5.3.
  • [17] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko (2013) Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the IEEE international conference on computer vision, pp. 2712–2719. Cited by: §4.1, §7.1.
  • [18] J. Guo, X. Tan, D. He, T. Qin, L. Xu, and T. Liu (2019) Non-autoregressive neural machine translation with enhanced decoder input. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3723–3730. Cited by: §2.2.
  • [19] K. Hara, H. Kataoka, and Y. Satoh (2018) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6546–6555. Cited by: Table 3, §7.2.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Table 3.
  • [21] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1.
  • [22] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §7.2.
  • [23] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §7.2.
  • [24] J. Lee, E. Mansimov, and K. Cho (2018-October-November) Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1173–1182. External Links: Link, Document Cited by: §1, §2.2, §3.2.
  • [25] L. Li and B. Gong (2019) End-to-end video captioning with multitask reinforcement learning. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 339–348. Cited by: §1, §3.2, Table 3.
  • [26] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §4.1, §7.2.
  • [27] W. Pei, J. Zhang, X. Wang, L. Ke, X. Shen, and Y. Tai (2019) Memory-attended recurrent network for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8347–8356. Cited by: §1, §2.1, §3.2, §4.2.1, Table 3.
  • [28] R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Training very deep networks. In Advances in neural information processing systems, pp. 2377–2385. Cited by: §3.1.1.
  • [29] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: Table 3, §7.2.
  • [30] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497. Cited by: Table 3.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.1, §3.1.1, §3.1.3, §7.2.
  • [32] R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) Cider: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575. Cited by: §4.1, §7.2.
  • [33] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko (2014) Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729. Cited by: §1.
  • [34] B. Wang, L. Ma, W. Zhang, and W. Liu (2018) Reconstruction network for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7622–7631. Cited by: §1, §2.1, Table 3, §7.1.
  • [35] Y. Wang, F. Tian, D. He, T. Qin, C. Zhai, and T. Liu (2019) Non-autoregressive machine translation with auxiliary regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5377–5384. Cited by: §2.2, §3.3.2, §5.1.
  • [36] R. J. Williams and D. Zipser (1989) A learning algorithm for continually running fully recurrent neural networks. Neural computation 1 (2), pp. 270–280. Cited by: §2.1.
  • [37] J. Xu, T. Mei, T. Yao, and Y. Rui (2016) Msr-vtt: a large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288–5296. Cited by: §4.1, §7.1, §7.1.
  • [38] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville (2015) Describing videos by exploiting temporal structure. In Proceedings of the IEEE international conference on computer vision, pp. 4507–4515. Cited by: §7.1.

7 Supplementary Materials

7.1 Datasets

We evaluate our method on two popular video captioning benchmarks: the Microsoft Video Description (MSVD) dataset [17] and the MSR-Video to Text (MSR-VTT) dataset [37].

MSVD Dataset. It contains 1,970 video clips with an average duration of 9 seconds. Each clip is labeled with about 40 English sentences, resulting in approximately 80,000 video-description pairs. Following the split settings in prior works [38, 34], we divide the dataset into training, validation and testing sets with 1,200, 100 and 670 videos, respectively. The resulting training set has a vocabulary size of 9,750.

MSR-VTT Dataset. It is one of the largest vision-to-language datasets, consisting of 10,000 web video clips with 20 human-annotated captions per clip. Following the official split [37], we use 6,513, 497 and 2,990 video clips for training, validation and testing, respectively. After keeping only words that appear more than twice, the resulting training set has a vocabulary size of 10,546.
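The vocabulary filtering described above (keeping only words that appear more than twice) can be sketched as follows; the special tokens and lowercasing are assumptions, not details given in the paper.

```python
from collections import Counter

def build_vocab(captions, min_count=3,
                specials=("<pad>", "<bos>", "<eos>", "<unk>")):
    """Build a word-to-index vocabulary, keeping only words that appear
    at least min_count times (min_count=3 corresponds to "more than
    twice"); rarer words would map to <unk>. The special tokens and
    lowercasing are assumptions for illustration."""
    counts = Counter(w for c in captions for w in c.lower().split())
    kept = sorted(w for w, n in counts.items() if n >= min_count)
    return {w: i for i, w in enumerate(list(specials) + kept)}

# Toy example: only "a" appears three times and survives filtering.
vocab = build_vocab(["a man runs", "a dog runs", "a dog barks"])
print(len(vocab))  # 4 special tokens + 1 kept word
```

Applying this to the MSR-VTT training captions yields the reported vocabulary of 10,546 words.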

7.2 Implementation Details

Feature Extraction. To extract image features, we sample the video at 3 fps for the MSVD dataset and at 5 fps for the MSR-VTT dataset, and cap the number of frames at 60. For motion features, we sample the videos at 25 fps and extract a feature for every 16 consecutive frames with an overlap of 8 frames for both datasets. We extract 1536-D image features with Inception-ResNet-V2 [29] pre-trained on the ImageNet dataset [11], and 2048-D motion features with ResNeXt-101 [19] pre-trained on the Kinetics dataset [22]. We additionally include the coarse category information for the MSR-VTT dataset.
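The fps-based sampling with a frame cap can be sketched as an index computation; the rounding scheme below is an assumption, since the paper does not specify one.

```python
def sample_frame_indices(num_frames, video_fps, sample_fps, max_frames=60):
    """Indices of frames sampled at `sample_fps` (3 for MSVD, 5 for
    MSR-VTT) from a video recorded at `video_fps`, capped at
    `max_frames`. Truncation to int is an assumed rounding scheme."""
    step = video_fps / sample_fps          # source frames per sampled frame
    count = int(num_frames / step)         # how many samples fit
    indices = [int(i * step) for i in range(count)]
    return indices[:max_frames]

# A 10-second clip at 25 fps (250 frames) sampled at 5 fps gives
# 50 frames; a 100-second clip would be capped at 60.
print(sample_frame_indices(250, 25, 5))
```

The selected frames are then fed to the 2D CNN, one image feature per index.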

Training Settings. To train effectively, we fix the number of sampled features for each modality. Given the extracted image features, we first divide them into equal snippets, then sample one feature from each snippet, randomly during training and uniformly during evaluation. If there are fewer extracted features than snippets, we instead use a linear mapping; motion features are handled in the same way. For the self-attention based decoder, we use a smaller set of hyperparameters than the base configuration [31]: 1 layer per decoder stack, 512 model dimensions (the same as the word embedding size), 2048 hidden dimensions, and 8 attention heads per layer. Both the positional encoding in the decoder and the category information are implemented as trainable 512-D embedding layers. For regularization, we use 0.5 dropout and weight decay. We train on batches of 64 video-sentence pairs using ADAM [23]; the learning rate decays by a factor of 0.9 per epoch from its initial value until it reaches the minimum learning rate. We train for at most 50 epochs and select the best model according to performance on the validation set.
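The snippet-based feature sampling described above can be sketched as follows; taking the first frame of each snippet as the "uniform" choice at evaluation is an assumption, and the linear-mapping fallback for short videos is omitted.

```python
import random

def sample_features(features, n, training=True):
    """Divide the N extracted features into n roughly equal snippets
    and pick one feature per snippet: a random one during training,
    a fixed one (here the snippet's first frame, an assumption) at
    evaluation. The paper's linear-mapping fallback for N < n is
    omitted for brevity."""
    N = len(features)
    if N < n:
        raise ValueError("fewer features than snippets: linear mapping needed")
    # Snippet boundaries: snippet i covers [bounds[i], bounds[i+1]).
    bounds = [round(i * N / n) for i in range(n + 1)]
    picked = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        idx = random.randrange(lo, hi) if training else lo
        picked.append(features[idx])
    return picked

# 10 extracted features reduced to 5: deterministic at evaluation time.
print(sample_features(list(range(10)), 5, training=False))
```

Random sampling at training time acts as light data augmentation, while the deterministic choice at evaluation keeps results reproducible.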

Evaluation Settings. We adopt three common metrics: BLEU [26], METEOR [2] and CIDEr [32]. All metrics are computed using the API released by the Microsoft COCO Evaluation Server [7].

7.3 The Effect of Different Numbers of Length Candidates

Figure 7 shows the effect of different numbers of length candidates on the MSR-VTT dataset, and Figure 8 shows the same for the MSVD dataset. Across the curves for different numbers of length candidates, different metrics and different datasets, we draw a unified conclusion: using more length candidates is beneficial for performance, which generally saturates once enough candidates are considered.
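The decode-then-rerank procedure behind this experiment can be sketched as follows: take the top-k lengths from the length predictor, decode one candidate caption per length in parallel, and keep the highest-scoring one. The `decode` callable and the average-log-probability scoring are illustrative assumptions.

```python
import math

def rerank_by_length(length_logprobs, decode, k=3):
    """Pick the top-k candidate lengths (length_logprobs[L] is the
    length predictor's log-probability of length L), decode one
    caption per candidate length, and return the caption with the
    highest average token log-probability. `decode` is a hypothetical
    callable mapping a length to (tokens, token_logprobs)."""
    top_lengths = sorted(range(len(length_logprobs)),
                         key=lambda L: length_logprobs[L],
                         reverse=True)[:k]
    best, best_score = None, -math.inf
    for L in top_lengths:
        tokens, logps = decode(L)
        score = sum(logps) / max(len(logps), 1)  # length-normalized
        if score > best_score:
            best, best_score = tokens, score
    return best
```

Since each candidate length is decoded independently, all k candidates can be processed in one batched forward pass, so larger k costs little extra wall-clock time.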

7.4 The Effect of Knowledge Distillation

In the context of NMT, knowledge distillation (KD) can improve translation quality for non-autoregressive models. Here, we analyze its effect on video captioning. Figure 9 shows the effect of knowledge distillation on the MSR-VTT dataset, and Figure 10 shows the same for the MSVD dataset. The results show that knowledge distillation cannot raise the upper limit of performance. However, on MSR-VTT, NAVC with KD needs only 2 iterations, rather than the 5 required without KD, to reach promising results, i.e., NAVC with KD converges faster on the larger dataset.