Paraphrasing Is All You Need for Novel Object Captioning

09/25/2022
by Cheng-Fu Yang, et al.

Novel object captioning (NOC) aims to describe images containing objects whose ground-truth captions are never observed during training. Due to the absence of caption annotations, captioning models cannot be directly optimized via sequence-to-sequence training or CIDEr optimization. We therefore present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC that heuristically optimizes the output captions via paraphrasing. With P2C, the captioning model first learns paraphrasing from a language model pre-trained on a text-only corpus, which expands its word bank and improves linguistic fluency. To further enforce that the output captions sufficiently describe the visual content of the input image, we perform self-paraphrasing for the captioning model with fidelity and adequacy objectives. Since no ground-truth captions are available for novel object images during training, P2C leverages cross-modality (image-text) association modules to ensure that the above caption properties are properly preserved. In the experiments, we not only show that P2C achieves state-of-the-art performance on the nocaps and COCO Caption datasets, but also verify the effectiveness and flexibility of our learning framework by replacing the language and cross-modality association models used for NOC. Implementation details and code are available in the supplementary materials.
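To make the two stages concrete, below is a minimal sketch of how the paraphrasing and self-paraphrasing objectives could be expressed. This is an illustrative assumption, not the authors' released implementation: the loss forms (KL divergence for stage 1, cosine-similarity-based fidelity/adequacy terms for stage 2), the weights `w_fid`/`w_ade`, and the CLIP-style embeddings are hypothetical choices for exposition.

```python
# Illustrative sketch only (assumptions): exact loss forms and weights are not
# taken from the paper; embeddings are assumed to come from a CLIP-style
# cross-modality association model.
import torch
import torch.nn.functional as F

def stage1_paraphrase_distillation(captioner_logits, lm_logits):
    """Stage 1: align the captioner's next-word distribution with a language
    model pre-trained on text-only corpora, expanding the usable word bank."""
    log_p = F.log_softmax(captioner_logits, dim=-1)   # captioner log-probs
    q = F.softmax(lm_logits, dim=-1)                  # pre-trained LM probs
    return F.kl_div(log_p, q, reduction="batchmean")

def stage2_self_paraphrase(image_emb, caption_emb, paraphrase_emb,
                           w_fid=1.0, w_ade=1.0):
    """Stage 2: self-paraphrasing with fidelity and adequacy objectives,
    scored by image-text embeddings from a cross-modality association model.
    - fidelity: the paraphrased caption should stay semantically close to
      the original caption.
    - adequacy: the caption should describe the visual content of the image."""
    fidelity = 1 - F.cosine_similarity(caption_emb, paraphrase_emb, dim=-1).mean()
    adequacy = 1 - F.cosine_similarity(image_emb, caption_emb, dim=-1).mean()
    return w_fid * fidelity + w_ade * adequacy

# Toy shapes just to show the call pattern (batch B, vocab V, embed dim D).
B, V, D = 4, 30522, 512
loss1 = stage1_paraphrase_distillation(torch.randn(B, V), torch.randn(B, V))
loss2 = stage2_self_paraphrase(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
```

Because no ground-truth captions exist for novel objects, both terms are computed without reference captions; only the pre-trained language model and the image-text association model provide supervision signals in this sketch.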

