
Semi-Autoregressive Image Captioning

Current state-of-the-art approaches for image captioning typically adopt an autoregressive manner, i.e., generating descriptions word by word, which suffers from slow decoding and becomes a bottleneck in real-time applications. Non-autoregressive image captioning with continuous iterative refinement, which eliminates the sequential dependence within a sentence, can achieve performance comparable to its autoregressive counterparts with considerable acceleration. Nevertheless, based on a well-designed experiment, we empirically show that the number of iterations can be effectively reduced when sufficient prior knowledge is provided to the language decoder. Towards that end, we propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC), to make a better trade-off between performance and speed. The proposed SAIC model maintains the autoregressive property globally but relaxes it locally. Specifically, the SAIC model first jumpily generates an intermittent sequence in an autoregressive manner, that is, it predicts the first word of every word group in order. Then, with the help of the partially deterministic prior information and image features, the SAIC model non-autoregressively fills in all the skipped words with one iteration. Experimental results on the MS COCO benchmark demonstrate that our SAIC model outperforms the preceding non-autoregressive image captioning models while obtaining a competitive inference speedup. Code is available at



1. Introduction

Image captioning is one of the fundamental tasks in multimedia analysis and computer vision, which aims to automatically generate a natural description for the visual content of a given image. Most image captioning systems follow an encoder-decoder paradigm (Sutskever et al., 2014; Vinyals et al., 2015; Xu et al., 2015; Chen and Lawrence Zitnick, 2015; Anderson et al., 2018; Yao et al., 2018; Huang et al., 2019; Cornia et al., 2020). Among these methods, the visual encoder, e.g., a convolutional neural network (CNN), first extracts features from the input image. The descriptive sentence is then decoded according to these refined features, one word at a time, using a recurrent neural network (RNN). Despite their remarkable performance, the sequential nature of text generation, i.e., word by word, makes the decoding procedure non-parallelizable and results in high latency, which makes it challenging to deploy in some real-time production applications (Guo et al., 2020). Inspired by neural machine translation (Gu et al., 2018), one straightforward solution is non-autoregressive image captioning (NAIC) (Gu et al., 2018; Fei, 2019), which predicts the entire sentence in one shot. However, such a one-pass NAIC model usually lacks dependencies between words and struggles to produce smooth and accurate descriptions.

Figure 1. Illustration of AIC, NAIC and SAIC. AIC models generate the next word conditioned on the entire preceding sub-sentence, while NAIC models output all words in parallel in one step. Comparatively, our SAIC model considers a caption as a sequence of word groups. The first word of each group is generated one by one, while the remaining words are predicted simultaneously conditioned on both the visual representation and the jumpily generated words.

Recent studies show that extending one-pass NAIC to a constant number of passes, called iterative refinement (IR-NAIC), is a promising way to break the captioning performance dilemma (Jason et al., 2018; Gao et al., 2019a; Fei, 2020; Yang et al., 2019). Unlike one-pass NAIC, which outputs the complete description immediately, IR-NAIC takes the caption hypothesis from the previous iteration as a reference and repeatedly revises the new sentence with a masking strategy or a fusion network. The refinement process terminates upon reaching a pre-defined iteration count or when no changes appear in the new sentence. Compared with conventional autoregressive image captioning (AIC), IR-NAIC with three iteration steps runs on average 3 times faster with comparable captioning quality, as reported by (Fei, 2020).

However, in this paper, we highlight that decoding with multiple iterative refinements in NAIC is unnecessary when good partial prior knowledge is provided. To verify this statement, we carefully design an experiment to understand the relationship between segmental dependency context and iteration times. In practice, we first mask some proportion of words in the caption hypothesis generated by a trained AIC model with different strategies, i.e., Head, Tail, Random, and Group. Then the residual segments are taken as the input to the language decoder, aiming to measure the quality of the final output caption. Surprisingly, we observe that even when masking 70% of the AIC hypothesis, the remaining prior knowledge can still help a one-shot NAIC model compete with the standard IR-NAIC model. This result points to a promising direction for improving the form of iteration.
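The four masking strategies above can be sketched in a few lines; `mask_hypothesis`, the lowercase strategy names, and the `[MASK]` placeholder are illustrative choices for this sketch, not the paper's code.

```python
import math
import random

def mask_hypothesis(words, rate, strategy, group_size=3, mask="[MASK]"):
    """Mask an AIC caption hypothesis with one of the four strategies:
    Head / Tail mask the first / last n words, Random masks n arbitrary
    positions, and Group keeps only the first word of every group."""
    n = math.floor(rate * len(words))          # number of words to mask
    if strategy == "head":
        idx = set(range(n))
    elif strategy == "tail":
        idx = set(range(len(words) - n, len(words)))
    elif strategy == "random":
        idx = set(random.sample(range(len(words)), n))
    elif strategy == "group":                  # rate is fixed by the group size
        idx = {i for i in range(len(words)) if i % group_size != 0}
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [mask if i in idx else w for i, w in enumerate(words)]
```

Feeding the partially masked hypothesis back to the decoder then measures how much of the caption each kind of residual context can recover.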

Inspired by this, we propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC), which combines the advantages of both AIC and NAIC. After extracting the visual representations with the visual encoder, SAIC first utilizes an autoregressive decoder, named Outliner, to produce several partial discontinuous words of a caption. Then a non-autoregressive decoder, named Filler, predicts the previously skipped words in one iteration according to this deterministic prior knowledge. Since the Outliner and Filler share the same model architecture and parameters, SAIC does not significantly increase the number of parameters. To train the SAIC model effectively and efficiently, we further propose three training techniques: group-aware sampling, curriculum learning (Bengio et al., 2009), and hybrid knowledge distillation. Experimental results on the MS COCO dataset show that our proposed SAIC brings a consistent decoding speedup relative to the autoregressive counterpart while achieving far superior performance to the state-of-the-art IR-NAIC models.

To sum up, our main contributions are as follows:

  • Through a well-designed experiment, we demonstrate that the number of iterations in IR-NAIC can be significantly reduced when good prior knowledge is provided.

  • We propose a new two-stage semi-autoregressive framework to strike a better trade-off between caption generation speed and quality. To be specific, the Outliner first jumpily generates a series of discontinuous words in an autoregressive manner, and the Filler then pads all previously skipped words in one non-autoregressive step.

  • We introduce three strategies to improve the model training procedure: group-aware sampling, curriculum learning, and hybrid knowledge distillation. Experimentally, SAIC decodes faster than its sequential counterpart while strikingly narrowing the performance gap.

2. Background

2.1. Autoregressive Image Caption

Autoregressive generation is the mainstream approach in conventional image captioning, which decomposes the distribution of the target sentence Y = (y_1, ..., y_T) into a chain of conditional probabilities in a directional manner as:

    p(Y|X) = ∏_{t=1}^{T} p(y_t | y_{<t}, X),    (1)

where X denotes the visual features of the input image and y_{<t} denotes the generated historical sub-sentence before time step t. In particular, beam search (Wiseman and Rush, 2016; Vijayakumar et al., 2016) is commonly used as a heuristic search technique, because it maintains multiple hypotheses at each decoding step and leads to satisfactory captioning performance. However, the existence of the condition y_{<t} requires that the AIC model wait for y_{t-1} to be produced before predicting the current y_t, which rules out parallel computation along the time dimension.
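As a minimal illustration of Eq. (1), greedy autoregressive decoding is a strictly sequential loop; `step_fn` below is a hypothetical stand-in for the trained decoder, not the paper's implementation.

```python
def greedy_decode(step_fn, image_features, max_len=20, eos=3):
    """Greedy autoregressive decoding: each word is chosen conditioned
    on the whole prefix y_<t, so the time steps cannot be parallelized.
    `step_fn(prefix, feats)` returns the next token id (the argmax of
    p(y_t | y_<t, X)); `eos` terminates the sentence."""
    prefix = []
    for _ in range(max_len):
        y_t = step_fn(prefix, image_features)
        prefix.append(y_t)
        if y_t == eos:
            break
    return prefix
```

Beam search keeps several such prefixes alive in parallel, but every hypothesis still advances one word per step.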

2.2. Non-autoregressive Image Caption

Regardless of its effectiveness, sequential inference has two major drawbacks. One is that it cannot generate multiple words simultaneously, leading to inefficient use of parallel computing hardware such as GPUs (Ren et al., 2020). The other is that beam search has been found to output low-quality results when applied to large search spaces (Vijayakumar et al., 2016). The non-autoregressive method was first proposed by (Gu et al., 2018; Gao et al., 2019a) to address the above issues, allowing the image captioning model to generate all target words simultaneously. NAIC replaces y_{<t} with an independent latent variable z to remove the sequential dependencies and rewrites Equation 1 as:

    p(Y|X) = ∏_{t=1}^{T} p(y_t | z, X).    (2)

Since words are generated independently of each other during the entire decoding process, this usually results in duplicated or missing words in the obtained sentences. Subsequently, researchers developed more advanced methods to enhance the modeling of z, such as reordered latent variables (Fei, 2019) and objective function optimization (Guo et al., 2020), but there still exists a significant performance gap between AIC and NAIC.

2.3. Iterative Refinement based Non-autoregressive Image Caption

The previous one-pass NAIC can be further boosted by introducing a multi-pass refinement mechanism (Gao et al., 2019a; Fei, 2020; Gao et al., 2019b). Specifically, IR-NAIC applies a fusion function to the sentence Y^{k-1} produced in the preceding pass and predicts the new sentence Y^k by:

    p(Y^k | Y^{k-1}, X) = ∏_{i=1}^{N_k} p(y^k_{π(i)} | Y^{k-1}, X),    (3)

where N_k corresponds to the number of newly generated words in Y^k, and π(i) is the real position of the i-th refined word in Y^k. In this way, the sentence generation process of IR-NAIC can be summarized as follows: first, the NAIC model produces a coarse caption as the initial hypothesis, and then iteratively refines it until it reaches a pre-set criterion within a constant number of steps. In this work, we adopt masked prediction (Gao et al., 2019a; Jason et al., 2018) as the representative of IR-NAIC due to its excellent performance and simplicity: tokens are randomly masked during training but selected by low confidence during inference.
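A masked-prediction refinement loop can be sketched as below; the linearly decaying mask count and the `predict_fn` interface are simplifying assumptions of this sketch rather than the exact procedure of the cited works.

```python
def iterative_refine(predict_fn, tokens, confs, n_iters, mask_id=0):
    """Iterative refinement in the mask-predict style: at each pass,
    re-mask the least confident tokens and re-predict them conditioned
    on the surviving ones. `predict_fn(masked)` stands in for the NAIC
    decoder and returns (new_tokens, new_confidences)."""
    length = len(tokens)
    for k in range(n_iters, 0, -1):
        n_mask = (length * k) // (n_iters + 1)   # decays linearly to 0
        if n_mask == 0:
            break
        worst = set(sorted(range(length), key=confs.__getitem__)[:n_mask])
        masked = [mask_id if i in worst else t for i, t in enumerate(tokens)]
        tokens, confs = predict_fn(masked)
    return tokens
```

Each pass trades one decoder evaluation for the chance to repair the currently least reliable positions.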

3. Is Iterative Refinement All You Need?

Previous work has pointed out that although NAIC with multiple iterations can accelerate decoding to a significant extent, it slows down severely in some special cases (Jason et al., 2018; Fei, 2020). Correspondingly, this section starts from a theoretical computational complexity analysis of IR-NAIC; then a careful experiment is conducted to verify the assumption that a sufficiently good input context for the language decoder can help reduce the number of iterations. Here we construct the decoder input from the caption hypothesis produced by a well-trained AIC model.

3.1. Computation Complexity Analysis of IR-NAIC

For a conventional NAIC model, we assume that the computational cost of each iteration is proportional to the size of the input tensor, denoted as B × b × L × d, where B is the batch size, b is the beam size, L is the predicted target length, and d is the network dimension. In this way, the total cost of K iterations is K · B · b · L · d. For convenience, we omit B and d, which simplifies the cost to K · b · L. Likewise, the computational cost of the AIC model is b_a · L(L+1)/2, since at step t only the decoder with multi-head self-attention needs to consider the t previously generated words. Then we can define the speedup ratio as:

    α = (b_a · L(L+1)/2) / (K · b · L) = b_a (L+1) / (2 K b),    (4)

where b_a is the beam size of the AIC model. Therefore, fewer iterations and faster parallel computation are the key points for IR-NAIC.
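Under the simplified cost model above (quadratic incremental cost for AIC, K full-length passes for IR-NAIC, batch size and model dimension omitted), the speedup ratio can be computed as follows; the cost expressions follow our reconstruction and are rough proxies rather than wall-clock predictions.

```python
def speedup_ratio(L, K, b_naic, b_aic):
    """Speedup of IR-NAIC over AIC: AIC costs b_aic * (1 + 2 + ... + L)
    because decoding step t attends to t previous words, while IR-NAIC
    costs K * b_naic * L for K parallel full-sentence passes."""
    cost_aic = b_aic * L * (L + 1) / 2
    cost_ir_naic = K * b_naic * L
    return cost_aic / cost_ir_naic
```

The ratio is inversely proportional to the iteration count K, which is why reducing iterations matters.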

3.2. Preliminary Experimental Setting

Image Captioning Model.

We adopt the official captioning model implementation proposed by (Jason et al., 2018) with a denoising strategy. For convenience, regional image features extracted by Faster R-CNN (Ren et al., 2015) with a ResNet-101 backbone (He et al., 2016) are utilized to retrain the image captioner with the standard Transformer-base configuration (Vaswani et al., 2017). Our AIC model is first trained with XE loss and then fine-tuned with SCST (Rennie et al., 2017).

Decoding Strategy.

We set the beam size of the AIC model to 5 to obtain a good caption hypothesis as input to the decoder. Then we replace a certain percentage of words with the [mask] symbol and feed the processed sentence to the iterative refinement decoder. Unlike the standard iterative refinement model, which iterates several times, we perform only one fixed refinement step with a beam size of 1 and substitute every input [mask] symbol with the prediction of the language decoder to determine the final description.

Different Masking Methods.

In this work, we provide four methods to mask the caption hypothesis generated by AIC: Head, Tail, Random, and Group. Given the masking rate p_m and the caption length L, the number of masked words can be computed as n = ⌊p_m · L⌋. The Head/Tail masking method always masks the first/last n words of the sentence, while Random masks n positions of the caption at random. Group is slightly different from the above three strategies. It first divides the sentence into G groups, where G = ⌈L/g⌉ and g is the group size. Then, in each group, we keep the first word and mask the remaining g − 1 words. Thus, the actual masking rate of Group masking can be calculated as:

    p̂_m = (L − ⌈L/g⌉) / L.    (5)

To exclude experimental randomness, we run the Random masking method four times with different random seeds and report the average result.
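The actual Group masking rate, 1 − ⌈L/g⌉/L, can be checked numerically; `group_masking_rate` is an illustrative helper, not code from the paper.

```python
import math

def group_masking_rate(L, g):
    """Actual masking rate of Group masking: one word (the first) of
    each of the ceil(L/g) groups is kept; all other words are masked."""
    kept = math.ceil(L / g)
    return (L - kept) / L
```

For typical caption lengths, a group size of 3 or 4 already yields an effective masking rate around 70%, matching the regime probed in the experiment above.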

3.3. Results Discussion

Figure 2. Performance comparison of the four masking methods under different masking rates in the refinement experiments on the MS COCO test set.
b   K   BLEU-4   CIDEr
1   1   31.0     108.5
5   1   31.2     109.0
1   5   31.0     108.8
5   5   31.3     109.2
Table 1. Evaluation results of different decoding parameters under Group masking on the MS COCO test set. b denotes the beam size, K is the fixed number of iterations, and the masking rate p_m is set to 0.7. The metric values do not fluctuate significantly.

The experimental results are shown in Figure 2 and Table 1 respectively, where we can conclude that:

Figure 3. Overview of the proposed two-stage SAIC framework (group size g = 2), which consists of a visual encoder for the image content, an Outliner for the skipped caption, and a Filler for the complete caption. The two language decoders, sharing the same architecture and parameters, build the sentence in a paradigm that generates the first word of each group sequentially but the remaining words within each group in parallel.

Uniform partial prior knowledge is critical. Compared with Head and Tail masking, it is obvious that both Random and Group gain better captioning performance. We attribute this to the fact that the prior knowledge fed to the language decoder is uniformly distributed under Random and Group, while Head and Tail only provide concentrated context on one side, i.e., a prefix or a suffix. Group is superior to Random, which indicates that a balanced distribution of the deterministic word hypothesis is necessary: the Group masking method guarantees that each unmasked word can meet at least one deterministic word within a window of size g.

Small beam size and one iteration are sufficient. Compared with the standard IR-NAIC model with a beam size of 5 and multiple iterative refinements, it is interesting to find that even when only 30% of the input words to the decoder are exposed, the masking-based decoder with greedy search can achieve quite comparable performance within a single fixed iteration.

4. Approach

Based on the above analysis, we believe that the iterative refinement in IR-NAIC is unnecessary. In other words, we can obtain a high-quality caption in one shot, without going through multi-pass modification, given group-aware prior knowledge. To this end, we propose a two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC). Briefly speaking, SAIC autoregressively generates a discontinuous sequence with group size g (stage I), and then fills in the remaining words (stage II) with the same neural network in a non-autoregressive manner. Note that standard AIC can be regarded as a special case of SAIC when the group size equals 1 and stage II is skipped.

4.1. Model Architecture


Generally, our SAIC model consists of three components, a visual encoder and two language decoders: the Outliner for stage I and the Filler for stage II, as displayed in Figure 3. All components adopt the Transformer architecture (Vaswani et al., 2017), containing self-attention sublayers and feed-forward sublayers. Additional cross-attention sublayers are added to the two decoders. Furthermore, each sublayer is followed by a residual connection and a layer normalization operation. Note that the Outliner and Filler have the same network structure and share their parameters, so the number of parameters remains the same as that of standard AIC and NAIC models.

Differences from Previous Works

Our network architecture takes inspiration from prior work (Jason et al., 2018) and differs from all previous NAIC algorithms in three key respects:

(1) The major difference lies in the masking strategies of the self-attention sublayer. The Outliner masks future words discontinuously and causally, guaranteeing strictly left-to-right generation, while the Filler removes this limitation to leverage bi-directional context (Devlin et al., 2018), as illustrated by the toy examples of the group mask matrix and the no-mask matrix in Figure 3.
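The Outliner's group mask can be built as a block-causal matrix; this is a sketch under the assumption that a position may attend to its own group and all earlier groups.

```python
def group_causal_mask(length, g):
    """Group (block-causal) attention mask: entry [i][j] is 1 iff
    position i may attend to position j, i.e. j's group index does not
    exceed i's. With g = 1 this reduces to the standard causal mask of
    AIC; the Filler instead uses an all-ones (no-mask) matrix."""
    group = [i // g for i in range(length)]
    return [[1 if group[j] <= group[i] else 0 for j in range(length)]
            for i in range(length)]
```

In practice this matrix would be added (as 0 / -inf biases) to the attention logits of every decoder layer.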

(2) Our self-attention layer is additionally equipped with relative position representations (RPR) (Dai et al., 2019) to enable the language decoders to capture the sequential relationship between words easily and effectively. To be specific, the self-attention layer of the language decoders with RPR can be represented by:

    Attn(Q, K, V) = softmax((Q K^T + Q R^T) / √d_k) V,    (6)

where Q, K, and V correspond to the query, key, and value matrices, d_k is the dimension of the key, and R denotes the relative position embedding matrix, whose relative distances are clipped to a window size k, which can also be regarded as the group size g.

(3) Most previous NAIC methods need to train the captioning model with an extra independent length predictor (Fei, 2019; Jason et al., 2018; Gao et al., 2019a). However, such a length predictor is implicitly modeled in our Outliner-Filler module, because the caption length is a by-product of the autoregressive Outliner, i.e., T = g · T_o, where T_o is the length of the sequence produced by the Outliner and T is the final sequence length after the Filler. Another bonus is that we can avoid carefully tuning the weighting coefficient between the length loss and the word prediction loss for performance optimization.

4.2. Training Strategies

Directly training our proposed SAIC model is not trivial since a single SAIC model needs to learn to generate captions in both autoregressive and non-autoregressive manners. This section will introduce three training strategies in detail.

Group-Aware Sampling

Compared with conventional AIC, our Outliner shrinks the caption length from T to ⌈T/g⌉. Meanwhile, the masking method of the Outliner, referred to as group masking and illustrated in Figure 3, is deterministic: all non-first words in each group are masked. In contrast to previous random masking, the training sampling method of our SAIC model is therefore group-aware.

Competence-Aware Curriculum Learning

Jointly training the Outliner and Filler is problematic, since group-aware training samples cannot make full use of the information of all the words in a sentence. In this work, we propose to gradually transfer from jointly training {AIC, NAIC} to {Outliner, Filler} with competence-aware curriculum learning (Platanios et al., 2019). That is, the captioning model is trained from group size 1 to group size g. More concretely, given a batch of original image-text pairs, we first let the proportion of group-size-g samples in the batch be λ and construct the training samples of the AIC and NAIC models from the remaining pairs. We then gradually increase the group rate λ to introduce more group-aware learning signals for the Outliner and Filler until λ = 1. In implementation, we schedule λ as:

    λ = (t / T_total)^ω,    (9)

where t and T_total are the current and total training steps, respectively, and ω is a hyperparameter controlling the rate of change; we utilize ω = 1 to increase the curriculum difficulty linearly.

Hybrid Knowledge Distillation

In general, the NAIC model is usually trained on data generated by a teacher AIC model, i.e., knowledge distillation, due to its smoother data distribution (Kim and Rush, 2016; Guo et al., 2018). However, relying entirely on distilled data may sacrifice the diversity of the raw data. To combine the advantages of both distilled and raw data, we propose a simple yet effective approach: hybrid knowledge distillation. Specifically, hybrid knowledge distillation randomly samples the target sentence from the raw version with probability p or from its distilled version with probability 1 − p during the entire training stage.
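Hybrid knowledge distillation amounts to a per-sentence coin flip; the injectable `rng` argument is only there to make this sketch testable.

```python
import random

def sample_target(raw, distilled, p, rng=random):
    """Hybrid knowledge distillation: return the raw reference with
    probability p, otherwise the caption distilled from the AIC teacher."""
    return raw if rng.random() < p else distilled
```

Setting p = 0 recovers pure sequence-level distillation and p = 1 recovers training on raw references only.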

Complete Training Algorithm

Input: Training data including distillation targets, pretrained AIC model M, group size g, hybrid distillation rate p
Output: Semi-Autoregressive Image Captioning model M_SAIC
1 Fine-tune on the pre-trained AIC model;
2 Initialize M_SAIC with the parameters of M;
3 for each training batch do
4       Hybrid distillation;
5       for each sentence do
6            Sample the raw or distilled target with probability p;
7       end for
8       Curriculum learning;
9       Get the group-aware proportion λ by Equation 9;
10      Split the batch for training into {AIC, NAIC} and {Outliner, Filler} proportionally;
11      Joint optimization for M_SAIC;
12 end for
Algorithm 1 Semi-Autoregressive Image Captioning Training Algorithm

Algorithm 1 describes the procedure of training a SAIC model. To be specific, SAIC is first initialized by a pre-trained AIC model (Line 2). Then, for each training batch, we randomly select the raw sentence or its distilled version based on probability p (Lines 5-7). Next, according to Equation 9, we divide the batch data into two parts: a conventional {AIC, NAIC} batch and a group-aware {Outliner, Filler} batch, whose proportion is controlled by λ (Line 10). We construct four kinds of training samples from the corresponding data. Finally, we collect all training samples together and accumulate their gradients to update the SAIC model's parameters, which effectively doubles the batch size of standard training.

Model                              B@1   B@4   M     R     C      S     Latency  SpeedUp
Autoregressive Image Captioning models
NIC-v2 (Vinyals et al., 2015)      -     32.1  25.7  -     99.8   -     -        -
Up-Down (Anderson et al., 2018)    79.8  36.3  27.7  56.9  120.1  21.4  -        -
AoANet (Huang et al., 2019)        80.2  38.9  29.2  58.8  129.8  22.4  -        -
M2-T (Cornia et al., 2020)         80.8  39.1  29.2  58.6  131.2  22.6  -        -
AIC                                80.2  38.7  28.8  58.2  128.7  22.0  185ms    1.00
Non-Autoregressive Image Captioning models
MNIC (Gao et al., 2019a)           75.4  30.9  27.5  55.6  108.1  21.0  -        2.80
FNIC (Fei, 2019)                   -     36.2  27.1  55.3  115.7  20.2  -        8.15
MIR (Jason et al., 2018)           -     32.5  27.2  55.4  109.5  20.6  -        1.56
CMAL (Guo et al., 2020)            80.3  37.3  28.1  58.0  124.0  21.8  -        13.90
IBM (Fei, 2020)                    77.2  36.6  27.8  56.2  113.2  20.9  -        3.06
Semi-Autoregressive Image Captioning models
SAIC (b1 = 5, b2 = 5)              80.4  38.7  29.4  58.5  128.3  22.2  119ms    1.55
SAIC (b1 = 5, b2 = 1)              80.4  38.6  29.3  58.3  127.8  22.1  101ms    1.83
SAIC (b1 = 1, b2 = 1)              80.3  38.4  29.0  58.1  127.1  21.9  54ms     3.42
Table 2. Performance comparisons of different captioning models using different evaluation metrics on the MS COCO Karpathy test set. All values except Latency and SpeedUp are reported as percentages (%). B@1/B@4: BLEU-1/BLEU-4; M: METEOR; R: ROUGE-L; C: CIDEr-D; S: SPICE. AIC is our implementation of the autoregressive teacher model, which has the same structure as SAIC. The SpeedUp values of the NAIC models are taken from the corresponding papers; MNIC, MIR, and IBM are iterative refinement-based NAIC models (Section 2.3).

4.3. Inference Process

After encoding the image features, the Outliner starts from [bog] (beginning of group) and sequentially generates a jumpy subsequence with group size g until meeting [eos] (end of sentence). Then we construct the input of the Filler by appending g − 1 [mask] symbols after every outlined word. The final description is generated by replacing all [mask] symbols with the words predicted by the Filler in one iteration. If multiple [eos] symbols exist, we truncate the sentence at the first [eos]. Note that the beam size b_1 of the Outliner can differ from the beam size b_2 of the Filler, subject to b_2 ≤ b_1; in particular, if b_2 < b_1, we only feed the Filler the top-b_2 hypotheses. Finally, we select the caption hypothesis with the highest score as:

    Ŷ = argmax_{Y ∈ H} [ log p(Y_O | X) + log p(Y_F | Y_O, X) ],    (10)

where H denotes the set of b_2 complete candidate hypotheses and Y_O and Y_F are the words generated by the Outliner and Filler, respectively. Since group size 1 and group size g are trained jointly, the SAIC model can also behave like a standard AIC model by forcing decoding with the group size fixed to 1. In this case, only the Outliner is used to generate the entire sequence, without the help of the Filler; thus, AIC can be regarded as a special case.
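The two-stage inference procedure can be outlined as follows; `outline_fn` and `fill_fn` are hypothetical stand-ins for the shared-parameter decoder in its two modes, and the token ids chosen for [mask] and [eos] are arbitrary.

```python
MASK, EOS = 0, 3

def saic_decode(outline_fn, fill_fn, image_features, g=4, max_groups=16):
    """Two-stage SAIC inference sketch. Stage I (Outliner): greedily
    generate the first word of each group, left to right, until [eos].
    Stage II (Filler): append g - 1 [mask] slots after every outlined
    word and fill them all in a single non-autoregressive pass."""
    outline = []
    for _ in range(max_groups):                       # stage I
        word = outline_fn(outline, image_features)
        if word == EOS:
            break
        outline.append(word)
    masked = []
    for word in outline:                              # build the Filler input
        masked.extend([word] + [MASK] * (g - 1))
    filled = fill_fn(masked, image_features)          # stage II, one shot
    return filled[:filled.index(EOS)] if EOS in filled else filled
```

The sequential loop runs only ⌈T/g⌉ times instead of T, which is where the speedup over full autoregressive decoding comes from.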

Method Steps Computing Cost
Table 3. Computational complexity analysis of different caption decoding methods. c_t denotes the computation cost in autoregressive mode when producing the t-th word. Normally, the iteration count K ranges from 3 to 6 and the group size g is set to 3.

4.4. Computation Complexity Comparison

We also provide a theoretical complexity comparison of AIC, one-shot NAIC, IR-NAIC, and our proposed SAIC in Table 3. Although both the SAIC and AIC models contain a slow sequential generation process, the autoregressive part of SAIC is g times shorter than that of conventional AIC. Considering that the computational complexity of self-attention is quadratic in the sequence length, SAIC saves considerable time during inference. On the other hand, because the Outliner provides high-quality semantic context, SAIC does not need a large beam size or multiple iterations like IR-NAIC to maintain good captioning quality. Experimental results also show that our lightweight Filler effectively compensates for the extra computation cost of the Outliner and achieves stable acceleration.

                                 B@1        B@2        B@3        B@4        METEOR     ROUGE-L    CIDEr
                                 c5   c40   c5   c40   c5   c40   c5   c40   c5   c40   c5   c40   c5    c40
Up-Down (Anderson et al., 2018) 80.2 95.2 64.1 88.8 49.1 79.4 36.9 68.5 27.6 36.7 57.1 72.4 117.9 120.5
AoANet (Huang et al., 2019) 81.0 95.0 65.8 89.6 51.4 81.3 39.4 71.2 29.1 38.5 58.9 74.5 126.9 129.6
M2-T (Cornia et al., 2020) 81.6 96.0 66.4 90.8 51.8 82.7 39.7 72.8 29.4 39.0 59.2 74.8 129.3 132.1
CMAL (Guo et al., 2020) 79.8 94.3 63.8 87.2 48.8 77.2 36.8 66.1 27.9 36.4 57.6 72.0 119.3 121.2
SAIC 80.0 94.5 64.1 88.2 49.2 78.8 37.2 67.8 28.0 36.8 57.7 72.4 121.4 123.7
Table 4. Leaderboard of different image captioning models on the online MS COCO test server. Ensemble models are marked accordingly.
Data          B@4   M     R     C      S
Raw           37.5  28.1  57.3  123.3  21.3
Seq. Dist.    38.4  28.9  58.1  126.6  21.8
Hybr. Dist.   38.4  29.0  58.1  127.1  21.9
Table 5. Performance with different distillation strategies. The results show that our hybrid distillation is superior to the other two strategies.
g   B@4   M     R     C      S
2   38.6  29.1  58.3  128.4  22.0
3   38.4  28.9  58.2  127.4  21.9
4   38.4  29.0  58.1  127.1  21.9
Table 6. Evaluation results with different group sizes g. As the group size g increases, our SAIC model trades captioning quality for faster decoding.

5. Experiments

5.1. Experimental Preparation


MS COCO (Chen et al., 2015) is the standard evaluation benchmark for image captioning. To be consistent with previous works (Huang et al., 2019; Cornia et al., 2020), we adopt the Karpathy split (Karpathy and Fei-Fei, 2015), which contains 113,287 training images, each paired with 5 human-annotated sentences, and 5,000 images each for the validation and test splits. We omit words that occur fewer than 5 times, giving a vocabulary of 10,369 words. Image features are pre-extracted as in (Anderson et al., 2018).

Evaluation Metrics

Six metrics are utilized to comprehensively evaluate model performance: BLEU@N (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), ROUGE-L (Lin, 2004), CIDEr-D (Vedantam et al., 2015), SPICE (Anderson et al., 2016), and Latency (Fei, 2019; Guo et al., 2020). Concretely, SPICE focuses on semantic analysis and correlates better with human judgment, while the other metrics except Latency favor frequent n-grams and measure overall sentence fluency. Latency is computed as the time to decode a single sentence without mini-batching, averaged over the whole off-line test set.

Implementation Details

Our proposed SAIC model closely follows the network architecture and hyper-parameter settings of the Transformer-base model (Vaswani et al., 2017). Specifically, the number of stacked blocks is 6, the hidden size is 512, and the feed-forward filter size is 2048. During training, we first initialize the weights of the SAIC model with the pre-trained AIC teacher model. We then train the model for 25 epochs with an initial learning rate of 3e-5, decayed by a factor of 0.9 every 5 epochs, using the Adam optimizer (Kingma and Ba, 2014). We use a group size of g = 4 and a beam size of 1 by default. The decoding time is measured on a single NVIDIA GeForce GTX 1080 TI, as in prior work (Guo et al., 2020; Fei, 2020). All speeds are measured by running three times and reporting the average value.

5.2. Overall Performance

Tables 2 and 4 summarize the performance of the AIC and NAIC models on the off-line and on-line MS COCO test sets, respectively. According to the evaluation results, we find that IR-NAIC models, e.g., IBM (Fei, 2020), outperform one-pass NAIC models but slow down significantly. However, our small-beam variant (b_1 = 1, b_2 = 1, g = 4) can defeat the existing multiple-iteration models. Also, when the beam size increases to 5, SAIC equipped with a standard Transformer achieves a +1.2 CIDEr improvement over the previous best results. We can easily trade off performance against speed by using b_1 = 5 and b_2 = 1. It is somewhat surprising that, besides NAIC models, SAIC can even outperform AIC models trained from scratch on some metrics. We attribute this to two reasons: (1) SAIC is fine-tuned from a well-trained AIC model, making the training process easier and smoother; (2) mixing AIC and NAIC training has a better regularization effect than training either alone. Remarkably, SAIC consistently achieves good performance across beam sizes. These results show that our SAIC is more promising than conventional AIC and IR-NAIC.

5.3. Model Analysis

Effect of Hybrid Distillation

To illustrate the benefit of hybrid distillation, we compare different distillation strategies for captioning: raw data (Raw), sequence-level knowledge distillation (Seq. Dist.), and hybrid distillation (Hybr. Dist.). The results are listed in Table 5. Overall, Hybr. Dist. is superior to the other two methods across the board, which indicates that training with raw and distilled data is complementary. We also find that distilled data yields higher performance than raw data, which is consistent with previous work (Guo et al., 2020).

Effect of Group Size

We also test different group sizes g ∈ {2, 3, 4}; the results are reported in Table 6. Obviously, we can find that: (1) a larger g yields a more significant acceleration of decoding speed because fewer autoregressive steps are required; (2) as g increases, the performance of SAIC drops, e.g., CIDEr with g = 4 is 1.3 points lower than with g = 2. This illustrates that the learning difficulty of SAIC increases as less dependency information is provided.

5.4. Ablation study

SAIC -FT -RPR -CL -Hybr. Dist.
BLEU-4 38.4 38.3 38.2 38.0 37.5
CIDEr 127.1 126.8 126.5 126.0 123.3
Table 7. Evaluation results of ablation study. The results show that all introduced techniques help to improve captioning performance effectively.

We also provide a full ablation study on the MS COCO test set. As shown in Table 7, all the proposed techniques help to improve captioning performance to some extent. In particular, hybrid distillation prevents SAIC from over-fitting and leads to a +3.8 CIDEr improvement over standard distillation (-Hybr. Dist.). In addition, the other three techniques, namely initializing the SAIC model from a pre-trained AIC model (FT), using relative position representations in the decoder (RPR), and using curriculum learning (CL), each bring about 0.3 - 1.1 CIDEr improvement.

5.5. Case Study

For a more intuitive understanding, we present in Figure 4 several examples of image captions generated by the AIC, NAIC, and our SAIC (K = 4) models, which share the same model architecture, together with human-annotated ground-truth sentences (GT). As can be seen, all captioning models are generally able to reflect the content of the given image accurately. Meanwhile, incoherence problems, including repeated words and incomplete content, are severe in the sentences generated by pure NAIC but are effectively alleviated by SAIC, e.g., the two repeated "train" terms in the first sample. This again confirms that our proposed two-stage framework, consisting of the Outliner and the Filler, can guide the captioning model to reduce word prediction errors.

5.6. Human Evaluation

Following previous works (Huang et al., 2019; Yao et al., 2018), we conduct a human study to further compare our SAIC against the two other patterns, i.e., AIC and NAIC. Specifically, we randomly select 200 samples from the MS COCO testing set and recruit eight workers to compare model performance. Each time, we show a single sentence, produced by one of the approaches or taken from the human annotations, paired with the corresponding image, and ask: can you determine whether the given sentence was generated by a system or written by a person? We then compute the percentage of captions that pass this Turing test. The pass rates for Human, SAIC, AIC, and NAIC are 93.6%, 82.6%, 81.1%, and 65.0%, respectively, which shows the superiority of SAIC in providing human-like captions.
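The reported pass rates are simple proportions over all (caption, worker) judgments. A small illustrative helper (not the authors' evaluation code) that computes this statistic:

```python
def turing_pass_rate(judgments):
    """Percentage of judgments in which a worker believed the caption
    was written by a person.

    judgments: iterable of booleans, one per (caption, worker) pair;
    True means the caption passed the Turing test for that worker.
    """
    judgments = list(judgments)
    return 100.0 * sum(judgments) / len(judgments)
```

For example, 165 passes out of 200 judged captions would give a pass rate of 82.5%.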

Figure 4. Examples of captions generated by the AIC, SAIC, and NAIC models with the same Transformer architecture. GT denotes the human-annotated ground-truth captions.

6. Related Work

State-of-the-art image captioning systems mainly operate in an autoregressive manner (Bai and An, 2018), meaning that the model generates captions word by word, which is ill-suited to modern hardware optimized for parallel execution. An early pioneering work on parallel generation is (Zheng et al., 2019), which first generates the words of selected objects and then fills in the rest of the sentence with a two-pass process. Several recent works accelerate generation by introducing a NAIC framework (Gu et al., 2018; Wei et al., 2019), which produces the entire sentence simultaneously. Although NAIC models speed up decoding significantly, they suffer from word repetition and omission problems. Therefore, later image captioning work devotes more effort to mitigating these issues. Fei (2019) reorders words detected in the image with a light RNN to form better latent variables before decoding. Lee et al. (Jason et al., 2018) and Gao et al. (2019a) introduce an iterative mask refinement strategy to learn position matching information. Guo et al. (2020) address the inconsistency problem with a multi-agent learning paradigm and sentence-level optimization. The works most relevant to our proposed method are (Wang et al., 2018; Qiang et al., 2020) for neural machine translation. The biggest difference is that our model focuses on cross-modal information processing and changes one-step generation into a hierarchical form, which maintains a considerable speedup while enabling the caption decoder to access richer local history and future information to avoid errors.

7. Conclusion

Through a well-designed experiment, we first point out that, provided with sufficient prior knowledge for decoding, the number of iterations in NAIC can be dramatically reduced. Inspired by this, we propose a two-stage framework, named Semi-Autoregressive Image Captioning, to combine the advantages of AIC and NAIC. In particular, SAIC sketches the sentence outline in an autoregressive manner and then completes the remaining words non-autoregressively. To facilitate model training, we further introduce three strategies: group-aware sampling, curriculum learning, and hybrid knowledge distillation. Compared with conventional autoregressive baselines, our SAIC model achieves a better balance between captioning quality and decoding speed. Extensive experimental results show that SAIC achieves equivalent or even higher performance while decoding about 50% faster. For future work, we are interested in how to improve or correct the prior knowledge produced by the Outliner, since it correlates strongly with the final outputs.

This work was supported in part by the National Key R&D Program of China under Grant 2018AAA0102003, in part by National Natural Science Foundation of China: 62022083, 61620106009, 61836002 and 61931008, and in part by Key Research Program of Frontier Sciences, CAS: QYZDJ-SSW-SYS013.


  • P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) SPICE: semantic propositional image caption evaluation. In Proc. ECCV, pp. 382–398. Cited by: §5.1.
  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proc. IEEE CVPR, pp. 6077–6080. Cited by: §1, Table 2, Table 4, §5.1.
  • S. Bai and S. An (2018) A survey on automatic image caption generation. Neurocomputing 311, pp. 291–304. Cited by: §6.
  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proc. ICML, pp. 41–48. Cited by: §1.
  • X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §5.1.
  • X. Chen and C. Lawrence Zitnick (2015) Mind’s eye: a recurrent visual representation for image caption generation. In Proc. IEEE CVPR, pp. 2422–2431. Cited by: §1.
  • M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara (2020) Meshed-memory transformer for image captioning. In Proc. IEEE CVPR, pp. 10578–10587. Cited by: §1, Table 2, Table 4, §5.1.
  • Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: §4.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §4.1.
  • Z. Fei (2019) Fast image caption generation with position alignment. arXiv preprint arXiv:1912.06365. Cited by: §1, §2.2, §4.1, Table 2, §5.1, §6.
  • Z. Fei (2020) Iterative back modification for faster image captioning. In Proc. ACM MM, pp. 3182–3190. Cited by: §1, §2.3, §3, Table 2, §5.1, §5.2.
  • J. Gao, X. Meng, S. Wang, X. Li, S. Wang, S. Ma, and W. Gao (2019a) Masked non-autoregressive image captioning. arXiv preprint arXiv:1906.00717. Cited by: §1, §2.2, §2.3, §4.1, Table 2, §6.
  • L. Gao, K. Fan, J. Song, X. Liu, X. Xu, and H. T. Shen (2019b) Deliberate attention networks for image captioning. In Proc. AAAI, Vol. 33, pp. 8320–8327. Cited by: §2.3.
  • J. Gu, J. Bradbury, C. Xiong, V. O. K. Li, and R. Socher (2018) Non-autoregressive neural machine translation. In Proc. ICLR, Cited by: §1, §2.2, §6.
  • L. Guo, J. Liu, X. Zhu, X. He, J. Jiang, and H. Lu (2020) Non-autoregressive image captioning with counterfactuals-critical multi-agent learning. arXiv preprint arXiv:2005.04690. Cited by: §1, §2.2, Table 2, Table 4, §5.1, §5.1, §5.3, §6.
  • M. Guo, A. Haque, D. Huang, S. Yeung, and L. Fei-Fei (2018) Dynamic task prioritization for multitask learning. In Proc. ECCV, pp. 270–287. Cited by: §4.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE CVPR, pp. 770–778. Cited by: §3.2.
  • L. Huang, W. Wang, J. Chen, and X. Wei (2019) Attention on attention for image captioning. In Proc. IEEE ICCV, pp. 4634–4643. Cited by: §1, Table 2, Table 4, §5.1, §5.6.
  • L. Jason, M. Elman, G. Neubig, and C. Kyunghyun (2018) Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proc. EMNLP, pp. 1138–1149. Cited by: §1, §2.3, §3.2, §3, §4.1, §4.1, Table 2, §6.
  • A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proc. IEEE CVPR, pp. 3128–3137. Cited by: §5.1.
  • Y. Kim and A. M. Rush (2016) Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947. Cited by: §4.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.
  • A. Lavie and A. Agarwal (2007) METEOR: an automatic metric for mt evaluation with high levels of correlation with human judgments. In Proc. ACL Workshop, pp. 228–231. Cited by: §5.1.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Proc. ACL Workshops, pp. 74–81. Cited by: §5.1.
  • K. Papineni, S. Roukos, T. Ward, and W. J. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proc. ACL, pp. 311–318. Cited by: §5.1.
  • E. A. Platanios, O. Stretcu, G. Neubig, B. Poczos, and T. M. Mitchell (2019) Competence-based curriculum learning for neural machine translation. arXiv preprint arXiv:1903.09848. Cited by: §4.2.
  • W. Qiang, Y. Heng, K. Shaohui, and W. Luo (2020) Hybrid-regressive neural machine translation. preprint. Cited by: §6.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Proc. NIPS, pp. 91–99. Cited by: §3.2.
  • Y. Ren, J. Liu, X. Tan, S. Zhao, Z. Zhao, and T. Liu (2020) A study of non-autoregressive model for sequence generation. arXiv preprint arXiv:2004.10454. Cited by: §2.2.
  • S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proc. IEEE CVPR, pp. 1179–1195. Cited by: §3.2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Proc. NIPS, pp. 3104–3112. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. NIPS, pp. 5998–6008. Cited by: §3.2, §4.1, §5.1.
  • R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) Cider: consensus-based image description evaluation. In Proc. IEEE CVPR, pp. 4566–4575. Cited by: §5.1.
  • A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. Crandall, and D. Batra (2016) Diverse beam search: decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424. Cited by: §2.1, §2.2.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proc. IEEE CVPR, pp. 3156–3164. Cited by: §1, Table 2.
  • C. Wang, J. Zhang, and H. Chen (2018) Semi-autoregressive neural machine translation. In Proc. EMNLP, pp. 479–488. Cited by: §6.
  • B. Wei, M. Wang, H. Zhou, J. Lin, J. Xie, and X. Sun (2019) Imitation learning for non-autoregressive neural machine translation. arXiv preprint arXiv:1906.02041. Cited by: §6.
  • S. Wiseman and A. M. Rush (2016) Sequence-to-sequence learning as beam-search optimization. arXiv preprint arXiv:1606.02960. Cited by: §2.1.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In Proc. ICML, pp. 2048–2057. Cited by: §1.
  • B. Yang, F. Liu, C. Zhang, and Y. Zou (2019) Non-autoregressive coarse-to-fine video captioning. arXiv preprint arXiv:1911.12018. Cited by: §1.
  • T. Yao, Y. Pan, Y. Li, and T. Mei (2018) Exploring visual relationship for image captioning. In Proc. ECCV, pp. 684–699. Cited by: §1, §5.6.
  • Y. Zheng, Y. Li, and S. Wang (2019) Intention oriented image captions with guiding objects. In Proc. IEEE CVPR, pp. 8395–8404. Cited by: §6.