Mutual Information Divergence: A Unified Metric for Multimodal Generative Models

05/25/2022
by   Jin-Hwa Kim, et al.
0

Text-to-image generation and image captioning are recently emerged as a new experimental paradigm to assess machine intelligence. They predict continuous quantity accompanied by their sampling techniques in the generation, making evaluation complicated and intractable to get marginal distributions. Based on a recent trend that multimodal generative evaluations exploit a vison-and-language pre-trained model, we propose the negative Gaussian cross-mutual information using the CLIP features as a unified metric, coined by Mutual Information Divergence (MID). To validate, we extensively compare it with competing metrics using carefully-generated or human-annotated judgments in text-to-image generation and image captioning tasks. The proposed MID significantly outperforms the competitive methods by having consistency across benchmarks, sample parsimony, and robustness toward the exploited CLIP model. We look forward to seeing the underrepresented implications of the Gaussian cross-mutual information in multimodal representation learning and the future works based on this novel proposition.

READ FULL TEXT

page 16

page 18

page 21

research
12/23/2022

Do DALL-E and Flamingo Understand Each Other?

A major goal of multimodal research is to improve machine understanding ...
research
05/27/2020

TIME: Text and Image Mutual-Translation Adversarial Networks

Focusing on text-to-image (T2I) generation, we propose Text and Image Mu...
research
02/07/2022

Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

In this work, we pursue a unified paradigm for multimodal pretraining to...
research
06/02/2023

VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores

Vision-language models (VLMs) discriminatively pre-trained with contrast...
research
05/05/2023

Data Curation for Image Captioning with Text-to-Image Generative Models

Recent advances in image captioning are mainly driven by large-scale vis...
research
12/15/2022

Are Multimodal Models Robust to Image and Text Perturbations?

Multimodal image-text models have shown remarkable performance in the pa...
research
06/20/2019

Understanding, Categorizing and Predicting Semantic Image-Text Relations

Two modalities are often used to convey information in a complementary a...

Please sign up or login with your details

Forgot password? Click here to reset