From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping

04/26/2023
by Junyang Wang, et al.

With the development of Vision-Language Pre-training Models (VLPMs) such as CLIP and ALIGN, the zero-shot capability of CLIP has enabled significant breakthroughs on association-based visual tasks such as image classification and image-text retrieval without any fine-tuning. However, CLIP is hard to apply to generation-based tasks because it lacks both a decoder architecture and generation-oriented pre-training objectives. Although previous works have given CLIP generation capacity through additional language models, a modality gap remains between the CLIP representations of different modalities, and CLIP cannot model the offset of this gap, which prevents concepts from transferring across modalities. To address this problem, we map images/videos into the language modality and generate captions from within the language modality. In this paper, we propose K-nearest-neighbor Cross-modality Mapping (Knight), a zero-shot method that moves from association to generation. With text-only unsupervised training, Knight achieves state-of-the-art performance among zero-shot methods for image captioning and video captioning. Our code is available at https://github.com/junyangwang0410/Knight.
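The core idea described above is to bridge the modality gap by retrieving the K nearest CLIP text embeddings for a given image (or video) embedding and aggregating them into a vector that lies in the text-embedding space, which a decoder trained only on text can then turn into a caption. The minimal sketch below illustrates that nearest-neighbor mapping step; the softmax-weighted aggregation and the text_decoder interface are illustrative assumptions, not necessarily the exact design used in the paper.

import numpy as np

def knn_cross_modal_map(image_emb, text_embs, k=5):
    # image_emb: (D,) L2-normalized CLIP image embedding
    # text_embs: (N, D) L2-normalized CLIP text embeddings of a support corpus
    sims = text_embs @ image_emb                       # cosine similarities, shape (N,)
    topk = np.argsort(-sims)[:k]                       # indices of the K most similar texts
    w = np.exp(sims[topk])
    w = w / w.sum()                                    # softmax weights over the K neighbors
    mapped = (w[:, None] * text_embs[topk]).sum(axis=0)
    return mapped / np.linalg.norm(mapped)             # project back onto the unit sphere

# Usage sketch (hypothetical decoder trained only on text embeddings -> captions):
# caption = text_decoder.generate(knn_cross_modal_map(clip_image_emb, corpus_text_embs))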


Related research:

- Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment (11/14/2022)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training (03/06/2023)
- Transferable Decoding with Visual Entities for Zero-Shot Image Captioning (07/31/2023)
- Parts of Speech-Grounded Subspaces in Vision-Language Models (05/23/2023)
- OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment (06/10/2023)
- RLIPv2: Fast Scaling of Relational Language-Image Pre-training (08/18/2023)
- BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data (03/14/2023)
