Semi-Autoregressive Transformer for Image Captioning

06/17/2021
by Yuanen Zhou, et al.

Current state-of-the-art image captioning models adopt autoregressive decoders, i.e., they generate each word by conditioning on previously generated words, which leads to high latency during inference. To tackle this issue, non-autoregressive image captioning models have recently been proposed to significantly accelerate inference by generating all words in parallel. However, these non-autoregressive models inevitably suffer from a large degradation in generation quality because they remove too much of the dependence between words. To make a better trade-off between speed and quality, we introduce a semi-autoregressive model for image captioning (dubbed SATIC), which keeps the autoregressive property globally but generates words in parallel locally. Because SATIC is based on the Transformer, only a few modifications are needed to implement it. Extensive experiments on the MSCOCO image captioning benchmark show that SATIC achieves a better trade-off without bells and whistles. Code is available at <https://github.com/YuanEZhou/satic>.
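
The abstract does not spell out how "autoregressive globally, parallel locally" is realized, so the following is only an illustrative sketch, not the authors' implementation. It assumes words are split into consecutive groups of a fixed size (the parameter `group_size` and both function names are hypothetical), that groups are produced one after another, and that all words inside a group are produced in the same step. Under that assumption, the usual causal attention mask is relaxed to a block-wise mask, and the number of sequential decoding steps drops from the caption length to roughly the caption length divided by the group size.

```python
def group_causal_mask(length, group_size):
    """Return a length x length boolean mask (assumed block-wise scheme).

    mask[i][j] is True when position i may attend to position j, i.e. when
    j lies in the same group as i or in an earlier group.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [
        [(j // group_size) <= (i // group_size) for j in range(length)]
        for i in range(length)
    ]


def decoding_steps(length, group_size):
    """Number of sequential steps when one whole group is emitted per step."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return -(-length // group_size)  # ceiling division


if __name__ == "__main__":
    # Print the mask for a 6-word caption decoded in groups of 2.
    for row in group_causal_mask(6, 2):
        print("".join("1" if allowed else "0" for allowed in row))
    print(decoding_steps(6, 2), "sequential steps instead of", decoding_steps(6, 1))
```

With `group_size=1` the mask reduces to the standard lower-triangular causal mask of a fully autoregressive decoder, and with `group_size` equal to the caption length every position can see every other, which corresponds to the fully parallel extreme. Intermediate values trade sequential steps against within-group dependence.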


research
10/11/2021

Semi-Autoregressive Image Captioning

Current state-of-the-art approaches for image captioning typically adopt...
research
08/05/2021

O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning

Video captioning combines video understanding and language generation. D...
research
05/10/2020

Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning

Most image captioning models are autoregressive, i.e. they generate each...
research
07/22/2022

Efficient Modeling of Future Context for Image Captioning

Existing approaches to image captioning usually generate the sentence wo...
research
11/30/2022

Uncertainty-Aware Image Captioning

It is well believed that the higher uncertainty in a word of the caption...
research
12/12/2022

Video Prediction by Efficient Transformers

Video prediction is a challenging computer vision task that has a wide r...
research
01/06/2022

Compact Bidirectional Transformer for Image Captioning

Most current image captioning models typically generate captions from le...
