Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze

11/09/2020
by   Ece Takmaz, et al.
0

When speakers describe an image, they tend to look at objects before mentioning them. In this paper, we investigate such sequential cross-modal alignment by modelling the image description generation process computationally. We take as our starting point a state-of-the-art image captioning system and develop several model variants that exploit information from human gaze patterns recorded during language production. In particular, we propose the first approach to image description generation where visual processing is modelled sequentially. Our experiments and analyses confirm that better descriptions can be obtained by exploiting gaze-driven attention and shed light on human cognitive processes by comparing different ways of aligning the gaze modality with language production. We find that processing gaze data sequentially leads to descriptions that are better aligned to those produced by speakers, more diverse, and more natural-particularly when gaze is encoded with a dedicated recurrent component.

READ FULL TEXT

page 1

page 8

research
07/19/2017

Supervising Neural Attention Models for Video Captioning by Human Gaze Data

The attention mechanisms in deep neural networks are inspired by human's...
research
08/18/2016

Seeing with Humans: Gaze-Assisted Neural Image Captioning

Gaze reflects how humans process visual scenes and is therefore increasi...
research
10/18/2022

Probing Cross-modal Semantics Alignment Capability from the Textual Perspective

In recent years, vision and language pre-training (VLP) models have adva...
research
11/02/2022

CAMANet: Class Activation Map Guided Attention Network for Radiology Report Generation

Radiology report generation (RRG) has gained increasing research attenti...
research
10/15/2020

Improving Natural Language Processing Tasks with Human Gaze-Guided Neural Attention

A lack of corpora has so far limited advances in integrating human gaze ...
research
05/15/2021

Premise-based Multimodal Reasoning: A Human-like Cognitive Process

Reasoning is one of the major challenges of Human-like AI and has recent...
research
03/06/2019

A Synchronized Multi-Modal Attention-Caption Dataset and Analysis

In this work, we present a novel multi-modal dataset consisting of eye m...

Please sign up or login with your details

Forgot password? Click here to reset