What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

07/05/2023
by Yan Zeng, et al.

Recent advancements in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions given images. However, the performance of these models heavily depends on design choices such as network structures, training data, and training strategies, and these choices have not been extensively discussed in the literature, making it difficult to quantify progress in this field. To address this issue, this paper presents a systematic and comprehensive study, both quantitative and qualitative, of training such models. We implement over 20 variants under controlled settings. Concretely, for network structures, we compare different LLM backbones and model designs. For training data, we investigate the impact of data and sampling strategies. For instructions, we explore the influence of diversified prompts on the instruction-following ability of the trained models. For benchmarks, we contribute, to the best of our knowledge, the first comprehensive evaluation set covering both image and video tasks, collected through crowd-sourcing. Based on our findings, we present Lynx, which achieves the most accurate multi-modal understanding while maintaining the best multi-modal generation ability among existing open-sourced GPT4-style models.
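The design space the study explores centers on how visual features are injected into an LLM backbone. As a rough illustration of the common GPT4-style recipe the abstract refers to, the sketch below projects vision-encoder features into the LLM's embedding space and prepends them to the text tokens as a soft prefix. This is a minimal sketch under assumed interfaces (a generic vision encoder, and a HuggingFace-style decoder that exposes get_input_embeddings() and accepts inputs_embeds), not the paper's actual Lynx implementation; the module names and dimensions are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class MultimodalPrefixLM(nn.Module):
    """Hypothetical sketch of a GPT4-style multimodal model:
    visual features are projected into the LLM embedding space
    and prepended to the text tokens as a soft prefix."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a ViT; often kept frozen
        self.llm = llm                        # decoder-only LLM backbone
        # Trainable projector bridging the two embedding spaces.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, images, input_ids):
        # Patch-level visual features: [batch, num_patches, vision_dim].
        with torch.no_grad():  # assumes the vision encoder stays frozen
            vis_feats = self.vision_encoder(images)
        vis_tokens = self.projector(vis_feats)                   # [B, P, llm_dim]
        txt_tokens = self.llm.get_input_embeddings()(input_ids)  # [B, T, llm_dim]
        # The LLM attends to the visual prefix while decoding text.
        inputs_embeds = torch.cat([vis_tokens, txt_tokens], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

The controlled variants the paper ablates, such as swapping the LLM backbone or changing the adapter design, roughly correspond to different choices for the llm and projector components in a skeleton like this.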


Related research:

- Otter: A Multi-Modal Model with In-Context Instruction Tuning (05/05/2023)
- LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model (04/28/2023)
- Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning (06/26/2023)
- JourneyDB: A Benchmark for Generative Image Understanding (07/03/2023)
- Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics (09/13/2023)
- Enhancing Subtask Performance of Multi-modal Large Language Model (08/31/2023)
- What Goes beyond Multi-modal Fusion in One-stage Referring Expression Comprehension: An Empirical Study (04/17/2022)
