Non-Autoregressive Video Captioning with Iterative Refinement

11/27/2019
by Bang Yang, et al.

Existing state-of-the-art autoregressive video captioning (ARVC) methods generate captions sequentially, which leads to low inference efficiency. Moreover, the word-by-word generation process does not match the way humans comprehend video content (i.e., first capturing the salient visual information and then generating well-organized descriptions), resulting in unsatisfactory caption diversity. To follow the human manner of comprehending video content and writing captions more closely, this paper proposes a non-autoregressive video captioning (NAVC) model with iterative refinement. We further propose to exploit external auxiliary scoring information to assist the iterative refinement process, which helps the model focus more accurately on inappropriate words. Experimental results on two mainstream benchmarks, MSVD and MSR-VTT, show that our proposed method generates more felicitous and diverse captions with generally faster decoding, at the cost of up to 5% in caption quality compared with its autoregressive counterpart. In particular, using auxiliary scoring information not only improves non-autoregressive performance by a large margin, but also benefits caption diversity.
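The abstract does not spell out the refinement procedure, so the following is only a minimal sketch of the general idea, assuming a mask-and-repredict loop: all words are predicted in parallel, the lowest-scoring ones are masked, and the masked slots are predicted again with the remaining words as context. Every name here (`iterative_refine`, `toy_predict`, `repeat_penalty`), the linear masking schedule, and the rule of multiplying model confidence by an external score are assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, List, Optional, Tuple

MASK = "<mask>"
Prediction = List[Tuple[str, float]]  # (word, confidence) per position


def iterative_refine(
    predict: Callable[[List[str]], Prediction],
    length: int,
    iterations: int,
    aux_score: Optional[Callable[[List[str]], List[float]]] = None,
) -> List[str]:
    """Predict `length` words in parallel, then re-mask and re-predict weak ones."""
    if iterations < 1:
        raise ValueError("iterations must be >= 1")
    tokens = [MASK] * length
    conf = [0.0] * length
    for step in range(iterations):
        preds = predict(tokens)  # one parallel pass over all positions
        for i, (word, p) in enumerate(preds):
            if tokens[i] == MASK:  # only masked slots are rewritten
                tokens[i], conf[i] = word, p
        if step == iterations - 1:
            break
        # Assumed schedule: revisit fewer words as refinement proceeds.
        n_mask = length * (iterations - step - 1) // iterations
        scores = list(conf)
        if aux_score is not None:
            # Assumed rule: scale model confidence by the external score.
            scores = [c * s for c, s in zip(conf, aux_score(tokens))]
        for i in sorted(range(length), key=lambda j: scores[j])[:n_mask]:
            tokens[i] = MASK
    return tokens


# --- Toy stand-ins (hypothetical; a real system would use trained models) ---
TARGET = ["a", "man", "is", "playing", "guitar"]
FIRST_PASS = [("a", 0.9), ("man", 0.8), ("is", 0.7),
              ("playing", 0.6), ("playing", 0.95)]  # overconfident repetition


def toy_predict(tokens: List[str]) -> Prediction:
    # With no context the toy decoder repeats a word; with context it is correct.
    if all(t == MASK for t in tokens):
        return list(FIRST_PASS)
    return [(w, 0.9) for w in TARGET]


def repeat_penalty(tokens: List[str]) -> List[float]:
    # Toy external scorer: a word identical to its left neighbour scores low.
    return [0.1 if i > 0 and tokens[i] == tokens[i - 1] else 1.0
            for i in range(len(tokens))]


if __name__ == "__main__":
    print(" ".join(iterative_refine(toy_predict, 5, 2)))
    print(" ".join(iterative_refine(toy_predict, 5, 2, repeat_penalty)))
```

In this toy setting, refinement driven by model confidence alone leaves the overconfident repeated word in place ("a man is playing playing"), whereas adding the external score directs the masking to that word and the second pass yields "a man is playing guitar". This only illustrates why an auxiliary scorer can help locate inappropriate words; it says nothing about the magnitude of the effect reported in the paper.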

research
06/03/2019

Masked Non-Autoregressive Image Captioning

Existing captioning models often adopt the encoder-decoder architecture,...
research
10/11/2021

Semi-Autoregressive Image Captioning

Current state-of-the-art approaches for image captioning typically adopt...
research
02/12/2021

Annotation Cleaning for the MSR-Video to Text Dataset

The video captioning task is to describe the video contents with natural...
research
07/19/2020

Length-Controllable Image Captioning

The last decade has witnessed remarkable progress in the image captionin...
research
09/15/2020

Semantically Sensible Video Captioning (SSVC)

Video captioning, i.e. the task of generating captions from video sequen...
research
08/05/2021

O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning

Video captioning combines video understanding and language generation. D...
research
09/07/2021

Sensor-Augmented Egocentric-Video Captioning with Dynamic Modal Attention

Automatically describing video, or video captioning, has been widely stu...
