Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning

12/27/2022
by   Wooyoung Kang, et al.

Image captioning is one of the straightforward tasks that can take advantage of large-scale web-crawled data, which provides rich knowledge about the visual world for a captioning model. However, since web-crawled data contains image-text pairs that are aligned at different levels, the inherent noise (e.g., misaligned pairs) makes it difficult to learn a precise captioning model. While a filtering strategy can effectively remove noisy data, it reduces the amount of learnable knowledge and sometimes introduces a new problem of data deficiency. To get the best of both worlds, we propose a noise-aware learning framework, which learns rich knowledge from the whole web-crawled data while being less affected by the noise. This is achieved by the proposed quality-controllable model, which is trained using the alignment levels of the image-text pairs as an additional control signal. The alignment-conditioned training allows the model to generate high-quality, well-aligned captions by simply setting the control signal to the desired alignment level at inference time. Through in-depth analysis, we show that our controllable captioning model is effective in handling noise. In addition, on two tasks, zero-shot captioning and text-to-image retrieval using generated captions (i.e., self-retrieval), we demonstrate that our model can produce high-quality captions in terms of descriptiveness and distinctiveness. Code is available at <https://github.com/kakaobrain/noc>.
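The alignment-conditioned idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each image-text pair already carries a scalar alignment score (how that score is computed is left open here), that scores are bucketed into a hypothetical number of levels by quantile, and that the control signal is a special token prepended to the caption. The names `quantile_boundaries`, `alignment_level`, `control_token`, and the `<align_k>` token format are all illustrative assumptions.

```python
from bisect import bisect_right

NUM_LEVELS = 4  # hypothetical number of alignment levels


def quantile_boundaries(scores, num_levels):
    """Return num_levels - 1 cut points that split the scores into equal-sized buckets."""
    ordered = sorted(scores)
    n = len(ordered)
    return [ordered[(n * k) // num_levels] for k in range(1, num_levels)]


def alignment_level(score, boundaries):
    """Map a scalar alignment score to a discrete level in 1..len(boundaries) + 1."""
    return bisect_right(boundaries, score) + 1


def control_token(level):
    """Hypothetical control-token format; any reserved vocabulary entry would do."""
    return f"<align_{level}>"


def make_training_example(caption, score, boundaries):
    """Training: condition the caption on the level observed for this pair."""
    return f"{control_token(alignment_level(score, boundaries))} {caption}"


def inference_prefix(num_levels):
    """Inference: request the best-aligned level regardless of the input image."""
    return control_token(num_levels)


# Toy alignment scores for eight web-crawled pairs (made-up values).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
boundaries = quantile_boundaries(scores, NUM_LEVELS)
noisy = make_training_example("buy now, free shipping", 0.1, boundaries)
clean = make_training_example("a dog on a beach", 0.8, boundaries)
```

In this sketch no pair is discarded: noisy pairs are kept but tagged with a low level, and at inference time the decoder is always prompted with the highest level.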


