Object Referring in Visual Scene with Spoken Language

11/10/2017
by Arun Balajee Vasudevan, et al.

Object referring has important applications, especially for human-machine interaction. While it has received great attention, the task is mainly tackled with written language (text) as input rather than spoken language (speech), which is more natural. This paper investigates Object Referring with Spoken Language (ORSpoken) by presenting two datasets and one novel approach. Objects are annotated with their locations in images, text descriptions, and speech descriptions, making the datasets ideal for multi-modality learning. The approach is developed by carefully breaking down the ORSpoken problem into three sub-problems and introducing task-specific vision-language interactions at the corresponding levels. Experiments show that our method outperforms competing methods consistently and significantly. The approach is also evaluated in the presence of audio noise, demonstrating the efficacy of the proposed vision-language interaction methods in counteracting background noise.
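To make the three-stage decomposition concrete, the following is a minimal, hypothetical Python sketch of an ORSpoken-style pipeline. The stage names (speech transcription, description encoding, object grounding) and all function and class names are illustrative assumptions for exposition, not the authors' released code or their exact sub-problem definitions.

```python
# Hypothetical ORSpoken-style pipeline sketch. The three stages and all
# names below are assumptions for illustration, not the paper's code.

from dataclasses import dataclass
from typing import List


@dataclass
class BoundingBox:
    x: float
    y: float
    w: float
    h: float
    score: float  # language-vision matching score for this candidate


def transcribe(speech_waveform: List[float]) -> str:
    """Assumed sub-problem 1: speech recognition turning the spoken
    description into text. A real system would run an ASR model here."""
    return "the red car parked on the left"  # placeholder transcript


def encode_description(text: str) -> List[float]:
    """Assumed sub-problem 2: embed the (possibly noisy) transcript into a
    language feature vector, e.g. with a recurrent text encoder."""
    return [float(len(tok)) for tok in text.split()]  # toy embedding


def ground_object(language_feat: List[float],
                  candidates: List[BoundingBox]) -> BoundingBox:
    """Assumed sub-problem 3: score candidate object regions against the
    language feature and return the best-matching box."""
    return max(candidates, key=lambda b: b.score)  # toy scoring rule


def refer_object(speech_waveform: List[float],
                 candidates: List[BoundingBox]) -> BoundingBox:
    """End-to-end: spoken description plus candidate regions -> referred box."""
    text = transcribe(speech_waveform)
    feat = encode_description(text)
    return ground_object(feat, candidates)


if __name__ == "__main__":
    boxes = [BoundingBox(10, 20, 50, 40, 0.3), BoundingBox(80, 25, 60, 45, 0.9)]
    print(refer_object([0.0] * 16000, boxes))
```

In this reading, vision-language interactions could be injected at each level, for example by conditioning the grounding scores on both the transcript embedding and region features; the paper's specific interaction modules are not reproduced here.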


