A Visual Tour Of Current Challenges In Multimodal Language Models

10/22/2022
by Shashank Sonkar, et al.

Transformer models trained on massive text corpora have become the de facto standard for a wide range of natural language processing tasks. However, learning effective word representations for function words remains challenging. Multimodal learning, which grounds transformer models in imagery, can mitigate this challenge to some extent, but much work remains. In this study, we explore the extent to which visual grounding facilitates the acquisition of function words, using stable diffusion models that employ multimodal models for text-to-image generation. Across seven categories of function words and their numerous subcategories, we find that stable diffusion models effectively model only a small fraction of function words: a few pronoun subcategories and relatives. We hope that our findings will stimulate the development of new datasets and approaches that enable multimodal models to learn better representations of function words.


