VisorGPT: Learning Visual Prior via Generative Pre-Training

05/23/2023
by   Jinheng Xie, et al.

Various stuff and things in visual data possess specific traits, which can be learned by deep neural networks and are implicitly represented in the model as the visual prior, e.g., object location and shape. Such a prior potentially impacts many vision tasks. For example, in conditional image synthesis, spatial conditions that fail to adhere to the prior can result in visually inaccurate synthetic results. This work aims to explicitly learn the visual prior and enable the customization of sampling. Inspired by advances in language modeling, we propose to learn Visual prior via Generative Pre-Training, dubbed VisorGPT. By discretizing visual locations of objects, e.g., bounding boxes, human pose, and instance masks, into sequences, VisorGPT can model the visual prior through likelihood maximization. Besides, prompt engineering is investigated to unify various visual locations and enable customized sampling of sequential outputs from the learned prior. Experimental results demonstrate that VisorGPT can effectively model the visual prior, which can be employed for many vision tasks, such as customizing accurate human pose for conditional image synthesis models like ControlNet. Code will be released at https://github.com/Sierkinhane/VisorGPT.
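To make the idea of discretizing visual locations into sequences concrete, the following is a minimal sketch, not the authors' implementation. It quantizes bounding-box coordinates into integer bins and serializes them as a text sequence that a language model could be trained on with next-token likelihood. The function names, the bin count of 512, and the prompt layout ("box; N instances; ...") are assumptions made for illustration only; the abstract does not specify the actual format.

```python
from typing import List, Tuple

# (category, x_min, y_min, x_max, y_max) in pixel coordinates
Box = Tuple[str, float, float, float, float]


def quantize(value: float, extent: float, num_bins: int) -> int:
    """Map a coordinate in [0, extent] to an integer bin in [0, num_bins - 1]."""
    ratio = min(max(value / extent, 0.0), 1.0)  # clamp to the image
    return min(int(ratio * num_bins), num_bins - 1)


def boxes_to_sequence(
    boxes: List[Box], width: float, height: float, num_bins: int = 512
) -> str:
    """Serialize boxes as one text sequence (hypothetical prompt layout)."""
    # Assumed prefix: annotation type and instance count act as the prompt.
    parts = [f"box; {len(boxes)} instances;"]
    for category, x0, y0, x1, y1 in boxes:
        coords = [
            quantize(x0, width, num_bins),
            quantize(y0, height, num_bins),
            quantize(x1, width, num_bins),
            quantize(y1, height, num_bins),
        ]
        parts.append(f"{category} [" + " ".join(str(c) for c in coords) + "]")
    return " ".join(parts)


if __name__ == "__main__":
    example = [("person", 0, 0, 320, 240), ("dog", 320, 240, 640, 480)]
    print(boxes_to_sequence(example, width=640, height=480))
```

Sequences of this kind can be tokenized and fed to a standard autoregressive language model. Fixing the prefix at sampling time (for example the annotation type and instance count) is one plausible way to customize what is drawn from the learned prior.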


