Language Models as Zero-shot Visual Semantic Learners

07/26/2021
by   Yue Jiao, et al.

Visual Semantic Embedding (VSE) models, which map images into a rich semantic embedding space, have been a milestone in object recognition and zero-shot learning. Current approaches to VSE rely heavily on static word embedding techniques. In this work, we propose a Visual Semantic Embedding Probe (VSEP) designed to probe the semantic information of contextualized word embeddings in visual semantic understanding tasks. We show that the knowledge encoded in transformer language models can be exploited for tasks requiring visual semantic understanding. The VSEP with contextual representations can distinguish word-level object representations in complicated scenes as a compositional zero-shot learner. We further introduce a zero-shot setting with VSEPs to evaluate a model's ability to associate a novel word with a novel visual category. We find that contextual representations in language models outperform static word embeddings when the compositional chain of an object is short. We also observe that current visual semantic embedding models lack a mutual exclusivity bias, which limits their performance.
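The core mechanism the abstract describes, mapping image features into a word-embedding space and recognizing novel categories by nearest-neighbor matching against their word embeddings, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `VisualSemanticProbe` class, its dimensions, and the random stand-in features are all hypothetical, and the linear projection is left untrained here (in practice it would be fitted on seen classes against frozen language-model embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: image features (e.g. from a vision backbone)
# and word embeddings (static, or contextual from a transformer LM).
IMG_DIM, EMB_DIM = 512, 768

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

class VisualSemanticProbe:
    """A minimal linear probe: project image features into the (frozen)
    word-embedding space and classify by cosine similarity."""

    def __init__(self, img_dim, emb_dim):
        # Randomly initialized for illustration; a real probe would be
        # trained so that images land near their class embeddings.
        self.W = rng.standard_normal((img_dim, emb_dim)) / np.sqrt(img_dim)

    def project(self, img_feats):
        return l2_normalize(img_feats @ self.W)

    def zero_shot_classify(self, img_feats, class_embeddings):
        """Assign each image to the class whose word embedding is nearest
        in cosine similarity -- a novel class needs only an embedding of
        its name, not any training images."""
        v = self.project(img_feats)          # (n, emb_dim), unit norm
        c = l2_normalize(class_embeddings)   # (k, emb_dim), unit norm
        return np.argmax(v @ c.T, axis=1)    # (n,) predicted class indices

probe = VisualSemanticProbe(IMG_DIM, EMB_DIM)
images = rng.standard_normal((4, IMG_DIM))    # stand-in image features
classes = rng.standard_normal((5, EMB_DIM))   # stand-in word embeddings
preds = probe.zero_shot_classify(images, classes)
print(preds.shape)
```

Swapping the rows of `classes` between static embeddings and contextual ones extracted from sentences containing the class word is, in spirit, the comparison the paper's probe performs.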

