Evaluating the Representational Hub of Language and Vision Models

04/12/2019
by Ravi Shekhar, et al.

The multimodal models used in the emerging field at the intersection of computational linguistics and computer vision implement the bottom-up processing of the 'Hub and Spoke' architecture proposed in cognitive science to represent how the brain processes and combines multi-sensory inputs. In particular, the Hub is implemented as a neural network encoder. We investigate how this encoder is affected by various vision-and-language tasks proposed in the literature: visual question answering, visual reference resolution, and visually grounded dialogue. To measure the quality of the representations learned by the encoder, we use two kinds of analyses. First, we evaluate the encoder, pre-trained on the different vision-and-language tasks, on an existing diagnostic task designed to assess multimodal semantic understanding. Second, we carry out a battery of analyses aimed at studying how the encoder merges and exploits the two modalities.
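For concreteness, the sketch below shows what such a shared "Hub" encoder might look like in PyTorch: pre-extracted image features and an LSTM-encoded sentence are each projected and then fused into a single multimodal representation, on top of which task-specific "spokes" (e.g., a VQA answer classifier) can be trained. All layer sizes and the concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class HubEncoder(nn.Module):
    """Hypothetical shared encoder ("Hub") fusing vision and language.

    Assumptions: image features are pre-extracted (e.g., 2048-d CNN
    features) and the language input is a padded batch of word indices.
    """

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 img_feat_dim=2048, hub_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lang_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # The hub proper: merges the two modalities into one shared space.
        self.hub = nn.Sequential(
            nn.Linear(2 * hidden_dim, hub_dim),
            nn.Tanh(),
        )

    def forward(self, tokens, img_feats):
        # tokens: (batch, seq_len) word indices
        # img_feats: (batch, img_feat_dim) pre-extracted image features
        _, (h_n, _) = self.lang_rnn(self.embed(tokens))
        lang = h_n[-1]                              # (batch, hidden_dim)
        vis = torch.tanh(self.img_proj(img_feats))  # (batch, hidden_dim)
        # Concatenate and project: the shared multimodal representation
        # that downstream task heads ("spokes") would consume.
        return self.hub(torch.cat([lang, vis], dim=-1))

In this setup, the diagnostic evaluations described above would probe the output of the hub layer after pre-training on each vision-and-language task, keeping the task-specific heads out of the analysis.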

