JourneyDB: A Benchmark for Generative Image Understanding

07/03/2023
by   Junting Pan, et al.

While recent advances in vision-language models have revolutionized multi-modal understanding, it remains unclear whether they are capable of comprehending generated images. Compared to real data, synthetic images exhibit a higher degree of diversity in both content and style, which makes them significantly harder for models to fully apprehend. To this end, we present JourneyDB, a large-scale dataset for multi-modal visual understanding of generated images. Our curated dataset contains 4 million diverse, high-quality generated images, each paired with the text prompt used to produce it. We further design four benchmarks to quantify the performance of generated-image understanding in terms of both content and style interpretation: prompt inversion, style retrieval, image captioning, and visual question answering. Lastly, we assess the performance of current state-of-the-art multi-modal models on JourneyDB and provide an in-depth analysis of their strengths and limitations in understanding generated content. We hope the proposed dataset and benchmarks will facilitate research in the field of generative content understanding. The dataset is available at https://journeydb.github.io.
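The four benchmarks share one underlying unit of data: a generated image together with its prompt and annotations. The Python sketch below illustrates a plausible record layout and a toy scoring stub for the prompt-inversion task. The class name, field names, and the word-overlap metric are all assumptions made for illustration; they are not the dataset's actual schema or evaluation protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class JourneyDBSample:
    """One generated image with its annotations (field names are hypothetical)."""
    image_path: str                        # path to the generated image
    prompt: str                            # text prompt used to produce the image
    style_tags: List[str] = field(default_factory=list)           # style retrieval
    caption: str = ""                      # image captioning
    qa_pairs: List[Dict[str, str]] = field(default_factory=list)  # visual QA

def prompt_inversion_score(predicted: str, reference: str) -> float:
    """Toy word-overlap score between a predicted and a ground-truth prompt.

    A real evaluation would use metrics such as BLEU or CLIP similarity;
    this stand-in only illustrates the task interface.
    """
    pred = set(predicted.lower().split())
    ref = set(reference.lower().split())
    return len(pred & ref) / max(len(ref), 1)

# Usage example with a made-up record.
sample = JourneyDBSample(
    image_path="images/000001.png",
    prompt="a watercolor painting of a fox in a snowy forest",
    style_tags=["watercolor", "cool palette"],
)
print(prompt_inversion_score("a fox painted in watercolor", sample.prompt))
```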


