Visual Instruction Tuning with Polite Flamingo

07/03/2023
by Delong Chen, et al.

Recent research has demonstrated that multi-task fine-tuning of multi-modal Large Language Models (LLMs) on an assortment of annotated downstream vision-language datasets significantly enhances their performance. Yet this process has a side effect, which we term the "multi-modal alignment tax": because raw annotations are overly succinct and unformatted, the model's ability to format responses appropriately (for instance, its "politeness") degrades, which reduces human preference. In this paper, we introduce Polite Flamingo, a multi-modal response rewriter that transforms raw annotations into a more appealing, "polite" format. Polite Flamingo is trained to reconstruct high-quality responses from their automatically distorted counterparts and is subsequently applied to a vast array of vision-language datasets for response rewriting. After rigorous filtering, we generate the PF-1M dataset and further validate its value by fine-tuning a multi-modal LLM with it. Combined with novel methodologies including U-shaped multi-stage tuning and multi-turn augmentation, the resulting model, Clever Flamingo, demonstrates its advantages in both multi-modal understanding and response politeness according to automated and human evaluations.
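
The reconstruction objective described above can be illustrated with a small sketch of how (distorted, original) training pairs might be assembled. The abstract does not specify how responses are distorted, so the rule-based `distort` function, the `make_pairs` helper, and the `input`/`target` field names below are hypothetical stand-ins rather than the authors' method.

```python
from typing import Dict, List


def distort(response: str) -> str:
    """Hypothetical distortion: mimic a terse raw annotation.

    Keeps only the first sentence, lowercases it, and strips trailing
    punctuation. This is an illustrative assumption, not the paper's procedure.
    """
    first_sentence = response.split(". ")[0]
    return first_sentence.lower().rstrip(".!?")


def make_pairs(responses: List[str]) -> List[Dict[str, str]]:
    """Build rewriter training pairs: distorted text in, original text out."""
    return [{"input": distort(r), "target": r} for r in responses]


if __name__ == "__main__":
    examples = ["The image shows a red bus. It is parked near a curb."]
    for pair in make_pairs(examples):
        print(pair["input"], "->", pair["target"])
```

A rewriter trained on such pairs learns the mapping from terse text back to a well-formatted response, which is the direction needed when it is later applied to raw dataset annotations.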


research
09/13/2023

Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics

Multi-modal large language models (MLLMs) are trained based on large lan...
research
06/08/2023

MIMIC-IT: Multi-Modal In-Context Instruction Tuning

High-quality instructions and responses are essential for the zero-shot ...
research
05/05/2023

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Large language models (LLMs) have demonstrated significant universal cap...
research
03/08/2022

Multi-Modal Mixup for Robust Fine-tuning

Pre-trained large-scale models provide a transferable embedding, and the...
research
08/18/2023

PUMGPT: A Large Vision-Language Model for Product Understanding

Recent developments of multi-modal large language models have demonstrat...
research
06/11/2023

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

Large language models have become a potential pathway toward achieving a...
research
08/21/2023

Multi-Modal Dataset Acquisition for Photometrically Challenging Object

This paper addresses the limitations of current datasets for 3D vision t...
