Can Visual Context Improve Automatic Speech Recognition for an Embodied Agent?

10/21/2022
by   Pradip Pramanick, et al.

The use of automatic speech recognition (ASR) systems is becoming omnipresent, ranging from personal assistants to chatbots and home and industrial automation systems. Modern robots are also equipped with ASR capabilities for interacting with humans, as speech is the most natural interaction modality. However, ASR in robots faces additional challenges compared to a personal assistant. Being an embodied agent, a robot must recognize the physical entities around it and therefore reliably recognize speech that contains descriptions of such entities. Current ASR systems are often unable to do so because of limitations in ASR training, such as generic datasets and open-vocabulary modeling. In addition, adverse conditions during inference, such as noise, accents, and far-field speech, make the transcription inaccurate. In this work, we present a method to incorporate a robot's visual information into an ASR system and improve the recognition of a spoken utterance containing a visible entity. Specifically, we propose a new decoder biasing technique that incorporates the visual context while ensuring that the ASR output does not degrade when the context is incorrect. We achieve a 59 unmodified ASR system.
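
To make the idea of biasing recognition toward visible entities concrete, here is a minimal Python sketch. It is not the authors' decoder biasing technique, which the abstract does not specify; it swaps in a much simpler stand-in, rescoring an n-best list after decoding. The function name `rescore_with_visual_context`, the `bonus` value, and the example hypotheses and scores are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple


def rescore_with_visual_context(
    hypotheses: List[Tuple[str, float]],
    visible_entities: Iterable[str],
    bonus: float = 2.0,
) -> List[Tuple[str, float]]:
    """Re-rank ASR hypotheses, rewarding words that name visible entities.

    hypotheses: (transcript, log-probability) pairs from any ASR system.
    visible_entities: object labels produced by the robot's vision module.
    bonus: assumed log-score boost per matching word (hypothetical value).
    """
    context = {word.lower() for word in visible_entities}
    rescored = []
    for text, log_prob in hypotheses:
        # Count tokens that match an entity the robot can currently see.
        matches = sum(1 for token in text.lower().split() if token in context)
        rescored.append((text, log_prob + bonus * matches))
    # Highest score first; sorted() is stable, so ties keep the ASR order.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Illustrative n-best list: the ASR system slightly prefers the wrong word.
n_best = [("bring me the pin", -4.1), ("bring me the pen", -4.6)]

# A pen is visible, so the matching hypothesis is promoted.
best_with_context = rescore_with_visual_context(n_best, ["pen", "cup"])[0][0]

# Irrelevant context matches nothing, so the original ranking is kept.
best_wrong_context = rescore_with_visual_context(n_best, ["chair"])[0][0]
```

In this sketch, context that matches no hypothesis leaves the ranking untouched, which is a simple way to see why a biasing scheme need not degrade the output when the visual context is wrong. A bias applied inside the decoder, as the abstract proposes, acts during the search rather than after it.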
