Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics

09/13/2023
by   Haoqin Tu, et al.

Multi-modal large language models (MLLMs) are trained on top of large language models (LLMs), with an enhanced capability to comprehend multi-modal inputs and generate textual responses. While they excel at multi-modal tasks, the pure NLP abilities of MLLMs are often underestimated and left untested. In this study, we look beyond multi-modal benchmarks and unveil an intriguing characteristic of MLLMs: our preliminary results suggest that visual instruction tuning, a prevailing strategy for transitioning LLMs into MLLMs, unexpectedly helps models attain both improved truthfulness and ethical alignment in the pure NLP context. For example, a visual-instruction-tuned LLaMA2 7B model surpasses the LLaMA2-chat 7B model, which was fine-tuned with over one million human annotations, on the TruthfulQA-mc and Ethics benchmarks. Further analysis reveals that the improved alignment can be attributed to the superior instruction quality inherent in visual-text data. By releasing our code at github.com/UCSC-VLAA/Sight-Beyond-Text, we aspire to foster further exploration into the intrinsic value of visual-text synergies and, more broadly, of multi-modal interactions in alignment research.
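The truthfulness comparison above rests on multiple-choice evaluation of the kind used by TruthfulQA-mc. As a rough illustration of how such an evaluation typically works (not the paper's actual harness), the sketch below scores each answer choice by its average token log-likelihood under a causal LM and picks the highest-scoring one; the model identifier and the Hugging Face `truthful_qa` dataset name are illustrative assumptions.

```python
# Minimal sketch of a TruthfulQA-mc1-style evaluation: rank answer choices
# by average token log-likelihood and count how often the argmax is the
# labeled true answer. Model/dataset identifiers are stand-ins, not the
# paper's exact setup.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # stand-in for a visual-instruction-tuned LLM
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def choice_loglikelihood(question: str, choice: str) -> float:
    """Average log-probability of the choice tokens, conditioned on the question."""
    prompt = f"Q: {question}\nA:"
    # Assumes the prompt's tokenization is a prefix of the full sequence's,
    # which holds for typical LLaMA-style tokenizers.
    prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tok(prompt + " " + choice, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits[0, :-1]   # position t predicts token t+1
    targets = full_ids[0, 1:]
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    n_prompt = prompt_ids.shape[1]
    choice_lp = logprobs[n_prompt - 1:].gather(-1, targets[n_prompt - 1:, None])
    return choice_lp.mean().item()

data = load_dataset("truthful_qa", "multiple_choice")["validation"]
correct = 0
for ex in data:
    scores = [choice_loglikelihood(ex["question"], c)
              for c in ex["mc1_targets"]["choices"]]
    pred = max(range(len(scores)), key=scores.__getitem__)
    correct += ex["mc1_targets"]["labels"][pred]  # label 1 marks the true answer
print(f"MC1 accuracy: {correct / len(data):.3f}")
```

Length-normalizing by averaging per-token log-probabilities is one common convention; harnesses that sum unnormalized log-likelihoods instead can rank long and short choices differently.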


Related research

06/15/2023 · Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
Although instruction-tuned large language models (LLMs) have exhibited r...

07/03/2023 · Visual Instruction Tuning with Polite Flamingo
Recent research has demonstrated that the multi-task fine-tuning of mult...

06/29/2023 · LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Instruction tuning unlocks the superior capability of Large Language Mod...

09/18/2023 · An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models
Visual instruction tuning has recently shown encouraging progress with o...

01/28/2019 · Multi-modal dialog for browsing large visual catalogs using exploration-exploitation paradigm in a joint embedding space
We present a multi-modal dialog system to assist online shoppers in visu...

04/24/2023 · Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model
The immense scale of the recent large language models (LLM) allows many ...

07/05/2023 · What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Recent advancements in Large Language Models (LLMs) such as GPT4 have di...
