Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

06/27/2023
by Keqin Chen, et al.

In human conversations, individuals can indicate relevant regions within a scene while addressing others, and the other person can respond by referring to specific regions in turn. This natural referential ability in dialogue remains absent in current Multimodal Large Language Models (MLLMs). To fill this gap, this paper proposes an MLLM called Shikra, which can handle spatial coordinate inputs and outputs in natural language. Its architecture consists of a vision encoder, an alignment layer, and an LLM. It is designed to be straightforward and simple, without the need for extra vocabularies, position encoders, pre-/post-detection modules, or external plug-in models. All inputs and outputs are in natural language form. Referential dialogue is a superset of various vision-language (VL) tasks: Shikra can naturally handle location-related tasks such as REC and PointQA, as well as conventional VL tasks such as Image Captioning and VQA. Experimental results showcase Shikra's promising performance. Furthermore, it enables numerous exciting applications, such as providing the coordinates of mentioned objects in chains of thought and comparing the similarities of user-pointed regions. Our code, model, and dataset are available at https://github.com/shikras/shikra.
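Because Shikra carries spatial coordinates purely as text, a prompt and a reply can embed boxes as plain numbers with no special tokens. The sketch below illustrates one way such an I/O format could look; the exact prompt wording, coordinate precision, and the helper names format_box and parse_boxes are assumptions for illustration, not the authors' implementation.

# Illustrative sketch only: coordinates written and read back as plain text,
# with no extra vocabulary, position encoder, or detection module.
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # normalized [x1, y1, x2, y2] in [0, 1]

def format_box(box: Box, precision: int = 3) -> str:
    # Render a bounding box as text, e.g. "[0.120,0.340,0.560,0.780]"
    return "[" + ",".join(f"{v:.{precision}f}" for v in box) + "]"

def parse_boxes(answer: str) -> List[Box]:
    # Recover any coordinate spans the model wrote back in its reply
    pattern = r"\[([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+)\]"
    return [tuple(float(v) for v in m) for m in re.findall(pattern, answer)]

# A user question that points at a region (REC / PointQA style input):
question = f"What is the animal inside {format_box((0.12, 0.34, 0.56, 0.78))} doing?"

# A hypothetical model reply that refers back to regions by writing coordinates in text:
answer = "The dog [0.120,0.340,0.560,0.780] is chasing a ball [0.610,0.550,0.700,0.640]."
print(parse_boxes(answer))  # [(0.12, 0.34, 0.56, 0.78), (0.61, 0.55, 0.7, 0.64)]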
