What Makes for Good Visual Tokenizers for Large Language Models?

05/20/2023
by Guangzhi Wang, et al.

We empirically investigate which pre-training methods build good visual tokenizers, turning Large Language Models (LLMs) into powerful Multimodal Large Language Models (MLLMs). On our benchmark, curated to evaluate MLLMs' visual semantic understanding and fine-grained perception capabilities, we compare visual tokenizers pre-trained with dominant methods (i.e., DeiT, CLIP, MAE, and DINO) and observe that: i) fully/weakly supervised models capture more semantics than self-supervised models, but the gap narrows as the pre-training dataset scales up; ii) self-supervised models are better at fine-grained perception, where patch-level supervision is particularly effective; iii) tuning the visual tokenizer erodes the semantics acquired from large-scale pre-training, which is unfavorable when the instruction-tuning dataset is relatively small. Given these findings, we review methods that attempt to unify semantics and fine-grained visual understanding, e.g., patch-level feature distillation with semantically rich targets. We obtain an intriguing insight: mask-based strategies that were once all the rage may not be suitable for obtaining good visual tokenizers. Based on this critical observation, we build a new MLLM equipped with a tailored Good Visual Tokenizer (GVT), which exhibits strong visual comprehension capability at multiple scales. In particular, without introducing extra parameters or task-specific fine-tuning, GVT achieves superior performance on visual question answering, image captioning, and other fine-grained visual understanding tasks such as object counting and multi-class identification.
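To make the distillation idea above concrete, here is a minimal PyTorch sketch of patch-level feature distillation against a semantically rich, frozen teacher (e.g., a CLIP-style vision encoder). The function name, the linear projection head, and the cosine regression objective are illustrative assumptions for exposition, not the paper's exact recipe.

import torch
import torch.nn as nn
import torch.nn.functional as F

def patch_distillation_loss(student_feats, teacher_feats, proj):
    # student_feats: (B, N, D_s) per-patch features from the student tokenizer.
    # teacher_feats: (B, N, D_t) per-patch features from a frozen teacher
    # (assumed here to be a CLIP-style encoder; an illustrative choice).
    # proj: linear head mapping D_s -> D_t so the two spaces are comparable.
    pred = F.normalize(proj(student_feats), dim=-1)
    target = F.normalize(teacher_feats.detach(), dim=-1)
    # Negative cosine similarity per patch, averaged over patches and batch.
    return (1.0 - (pred * target).sum(dim=-1)).mean()

# Usage sketch with random tensors standing in for real features.
B, N, D_s, D_t = 2, 196, 768, 512
student = torch.randn(B, N, D_s, requires_grad=True)
teacher = torch.randn(B, N, D_t)  # would come from the frozen teacher
proj = nn.Linear(D_s, D_t)
loss = patch_distillation_loss(student, teacher, proj)
loss.backward()

Supervising every patch in this way is one concrete instance of the patch-level supervision the findings above favor, while the semantically rich targets aim to preserve the semantics that purely mask-based objectives lack.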

