A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

05/08/2023
by Yunxin Li, et al.

Conditional inference on joint textual and visual clues is a multi-modal reasoning task in which textual clues provide prior premises or external knowledge that complement the visual content and are pivotal to deducing the correct option. Previous methods utilizing pretrained vision-language models (VLMs) have achieved impressive performance, yet they lack multimodal context reasoning capability, especially for text-modal information. To address this issue, we propose a Multi-modal Context Reasoning approach, named ModCR. Whereas VLMs perform reasoning via cross-modal semantic alignment, ModCR treats the given abstract textual semantics and objective image information as pre-context and embeds them into the language model to perform context reasoning. Unlike recent vision-aided language models used in natural language processing, ModCR incorporates multi-view semantic alignment information between language and vision by introducing a learnable alignment prefix between image and text in the pretrained language model. This makes the language model well suited to such multi-modal reasoning scenarios on joint textual and visual clues. We conduct extensive experiments on two corresponding datasets, and the results show significantly improved performance (an exact gain of 4.8% over previous strong baselines). Code Link: <https://github.com/YunxinLi/Multimodal-Context-Reasoning>.
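To make the prefix mechanism described in the abstract concrete, here is a minimal PyTorch sketch of the alignment-prefix idea: visual features are projected into the language model's embedding space and prepended, together with a learnable alignment prefix, to the text embeddings as pre-context. All names and dimensions (AlignmentPrefix, vis_proj, prefix_len, 2048/768) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class AlignmentPrefix(nn.Module):
    """Hypothetical sketch of a learnable image-text alignment prefix
    embedded into a pretrained language model's input sequence."""

    def __init__(self, vis_dim=2048, lm_dim=768, prefix_len=4):
        super().__init__()
        # Project image-region features into the LM embedding space.
        self.vis_proj = nn.Linear(vis_dim, lm_dim)
        # Learnable prefix intended to carry vision-language
        # alignment information (an assumption of this sketch).
        self.align_prefix = nn.Parameter(torch.randn(prefix_len, lm_dim))

    def forward(self, vis_feats, text_embeds):
        # vis_feats: (batch, n_regions, vis_dim)
        # text_embeds: (batch, seq_len, lm_dim) from the LM embedding layer
        vis_ctx = self.vis_proj(vis_feats)  # (batch, n_regions, lm_dim)
        prefix = self.align_prefix.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        # Concatenate [visual pre-context; alignment prefix; text] and
        # feed the result to the language model as its input sequence.
        return torch.cat([vis_ctx, prefix, text_embeds], dim=1)

# Usage example with random features:
module = AlignmentPrefix()
vis = torch.randn(2, 36, 2048)   # 36 image regions per example
txt = torch.randn(2, 20, 768)    # 20 text tokens per example
out = module(vis, txt)           # shape: (2, 36 + 4 + 20, 768)
```

In this reading, the language model receives both modalities as ordinary embedding positions, so its self-attention can reason over textual premises and visual content jointly rather than relying solely on cross-modal alignment in a VLM encoder.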


Related research

09/14/2023
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
Starting from the resurgence of deep learning, vision-language models (V...

05/12/2023
Towards Versatile and Efficient Visual Knowledge Injection into Pre-trained Language Models with Cross-Modal Adapters
Humans learn language via multi-modal knowledge. However, due to the tex...

11/30/2021
Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources
Misinformation is now a major problem due to its potential high risks to...

01/10/2022
Better Modeling the Programming World with Code Concept Graphs-augmented Multi-modal Learning
The progress made in code modeling has been tremendous in recent years t...

04/21/2022
Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing
Multi-modal data abounds in biomedicine, such as radiology images and re...

06/14/2023
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Recent research on Large Language Models (LLMs) has led to remarkable ad...

05/03/2023
A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from Linguistically Complex Text
Pretrained Vision-Language Models (VLMs) have achieved remarkable perfor...
