Do DALL-E and Flamingo Understand Each Other?

12/23/2022
by Hang Li, et al.

A major goal of multimodal research is to improve machine understanding of images and text. Tasks include image captioning, text-to-image generation, and vision-language representation learning. So far, research has focused on the relationships between images and text. For example, captioning models attempt to understand the semantics of images, which are then transformed into text. An important question is: which annotation best reflects a deep understanding of image content? Similarly, given a text, which image best presents the semantics of that text? In this work, we argue that the best text or caption for a given image is the one that would generate the image most similar to the original image. Likewise, the best image for a given text is the one that results in the caption best aligned with the original text. To this end, we propose a unified framework that includes both a text-to-image generative model and an image-to-text generative model. Extensive experiments validate our approach.
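
The selection rule in the abstract (pick the caption whose generated image is closest to the original, and the image whose generated caption is closest to the original text) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `toy_*` functions are hypothetical stand-ins for real text-to-image and image-to-text models, an "image" is represented as a set of object tags, and cosine similarity over bag-of-tags vectors is just one plausible similarity measure. The abstract does not specify any of these details.

```python
from math import sqrt
from typing import Callable, FrozenSet, List, Sequence

# Toy vocabulary of "objects"; a real system would use learned embeddings.
VOCAB = ("dog", "ball", "grass", "cat")
Image = FrozenSet[str]  # hypothetical image representation: a set of object tags


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical stand-ins for generative models and encoders (assumptions).
def toy_text_to_image(caption: str) -> Image:
    return frozenset(w for w in caption.lower().split() if w in VOCAB)


def toy_image_to_text(image: Image) -> str:
    return " ".join(sorted(image))


def toy_embed_image(image: Image) -> List[float]:
    return [1.0 if w in image else 0.0 for w in VOCAB]


def toy_embed_text(text: str) -> List[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def best_caption(image: Image, captions: Sequence[str],
                 text_to_image: Callable[[str], Image],
                 embed_image: Callable[[Image], List[float]]) -> str:
    """Return the caption whose generated image is most similar to `image`."""
    target = embed_image(image)
    return max(captions,
               key=lambda c: cosine(embed_image(text_to_image(c)), target))


def best_image(text: str, images: Sequence[Image],
               image_to_text: Callable[[Image], str],
               embed_text: Callable[[str], List[float]]) -> Image:
    """Return the image whose generated caption is most similar to `text`."""
    target = embed_text(text)
    return max(images,
               key=lambda im: cosine(embed_text(image_to_text(im)), target))


if __name__ == "__main__":
    photo = frozenset({"dog", "ball", "grass"})
    candidates = ["a dog", "a dog chasing a ball on grass", "a cat on grass"]
    print(best_caption(photo, candidates, toy_text_to_image, toy_embed_image))
```

Passing the models in as callables keeps the selection logic independent of any particular generator or captioner, so the toy functions could in principle be swapped for real ones.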


research
05/04/2023

Image Captioners Sometimes Tell More Than Images They See

Image captioning, a.k.a. "image-to-text," which generates descriptive te...
research
05/25/2022

Mutual Information Divergence: A Unified Metric for Multimodal Generative Models

Text-to-image generation and image captioning are recently emerged as a ...
research
07/21/2018

Equal But Not The Same: Understanding the Implicit Relationship Between Persuasive Images and Text

Images and text in advertisements interact in complex, non-literal ways....
research
05/09/2023

WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset

Webpages have been a rich resource for language and vision-language task...
research
06/29/2023

CLIPAG: Towards Generator-Free Text-to-Image Generation

Perceptually Aligned Gradients (PAG) refer to an intriguing property obs...
research
08/18/2022

Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning

Automatically discovering failures in vision models under real-world set...
research
05/05/2023

A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding

Webpages have been a rich, scalable resource for vision-language and lan...
