Exploiting Cross-Modal Prediction and Relation Consistency for Semi-Supervised Image Captioning

10/22/2021
by Yang Yang, et al.

The task of image captioning aims to generate captions directly from images via an automatically learned cross-modal generator. Building a well-performing generator usually requires a large number of described images, which entails enormous manual labeling effort. In real-world applications, however, a more common scenario is that we have only a limited number of described images and a large number of undescribed ones. The resulting challenge is how to effectively incorporate the undescribed images into the learning of the cross-modal generator. To solve this problem, we propose a novel image captioning method that exploits Cross-modal Prediction and Relation Consistency (CPRC), using the raw image input to constrain the generated sentence in a common semantic space. In detail, since the heterogeneous gap between modalities makes direct supervision via global embeddings difficult, CPRC instead transforms both the raw image and the corresponding generated sentence into a shared semantic space, and measures the generated sentence from two aspects: 1) Prediction consistency. CPRC uses the prediction of the raw image as a soft label to distill useful supervision for the generated sentence, rather than employing traditional pseudo labeling; 2) Relation consistency. CPRC enforces a novel relation consistency between augmented images and the corresponding generated sentences to retain important relational knowledge. As a result, CPRC supervises the generated sentence from both the informativeness and representativeness perspectives, and can reasonably use the undescribed images to learn a more effective generator in the semi-supervised scenario.
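The two consistency objectives described above can be illustrated with a short PyTorch sketch. The specific loss choices here (temperature-scaled KL divergence for soft-label distillation, cosine-similarity relation matrices aligned with an MSE loss) and all function names are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of CPRC's two consistency losses, assuming
# KL-based distillation and cosine-similarity relation matrices.
import torch
import torch.nn.functional as F

def prediction_consistency(image_logits, sentence_logits, tau=2.0):
    """Distill the raw image's prediction (used as a soft label) into
    the prediction of the generated sentence, instead of converting it
    into a hard pseudo label."""
    soft_label = F.softmax(image_logits.detach() / tau, dim=-1)
    log_pred = F.log_softmax(sentence_logits / tau, dim=-1)
    return F.kl_div(log_pred, soft_label, reduction="batchmean") * tau ** 2

def relation_matrix(features):
    """Pairwise cosine-similarity relations within a batch of embeddings
    in the shared semantic space."""
    z = F.normalize(features, dim=-1)
    return z @ z.t()

def relation_consistency(aug_image_feats, sentence_feats):
    """Align the relational structure among augmented images with the
    relational structure among the corresponding generated sentences."""
    r_img = relation_matrix(aug_image_feats.detach())
    r_txt = relation_matrix(sentence_feats)
    return F.mse_loss(r_txt, r_img)
```

Note that the image branch is detached in both losses, so gradients flow only into the sentence (generator) side; this reflects the abstract's framing of the raw image as the supervision signal for the generated sentence.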


