Multimodal Speech Recognition for Language-Guided Embodied Agents

02/27/2023
by Allen Chang, et al.

Benchmarks for language-guided embodied agents typically assume text-based instructions, but deployed agents will encounter spoken instructions. While Automatic Speech Recognition (ASR) models can bridge the input gap, erroneous ASR transcripts can hurt the agents' ability to complete tasks. In this work, we propose training a multimodal ASR model to reduce errors in transcribing spoken instructions by considering the accompanying visual context. We train our model on a dataset of spoken instructions, synthesized from the ALFRED task completion dataset, where we simulate acoustic noise by systematically masking spoken words. We find that utilizing visual observations facilitates masked word recovery, with multimodal ASR models recovering up to 30% more masked words than unimodal baselines. We also find that a text-trained embodied agent successfully completes tasks more often by following transcribed instructions from multimodal ASR models.
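
As a rough illustration of simulating acoustic noise by masking spoken words, the sketch below silences the audio span of a randomly chosen subset of words. It assumes word-level time alignments are available and that a masked word is replaced with silence; the function name `mask_words`, its parameters, and the random selection rule are hypothetical and are not taken from the paper, which may mask words differently.

```python
import random


def mask_words(samples, alignments, sample_rate, mask_fraction, seed=0):
    """Silence the audio spans of a random subset of words.

    samples:       list of float audio samples (mono)
    alignments:    list of (word, start_sec, end_sec) tuples (assumed available)
    sample_rate:   samples per second
    mask_fraction: fraction of words to mask, between 0 and 1
    Returns (masked_samples, masked_word_indices).
    """
    rng = random.Random(seed)
    n_mask = round(len(alignments) * mask_fraction)
    # Pick which words to mask; sorting only makes the result easier to read.
    chosen = sorted(rng.sample(range(len(alignments)), n_mask))

    masked = list(samples)  # copy so the clean audio is left untouched
    for i in chosen:
        _, start, end = alignments[i]
        lo = max(0, int(start * sample_rate))
        hi = min(len(masked), int(end * sample_rate))
        for j in range(lo, hi):
            masked[j] = 0.0  # assumption: a masked word becomes silence
    return masked, chosen


# Toy example: four one-second words at 10 samples per second.
audio = [1.0] * 40
words = [("put", 0.0, 1.0), ("the", 1.0, 2.0), ("mug", 2.0, 3.0), ("down", 3.0, 4.0)]
noisy, masked_idx = mask_words(audio, words, sample_rate=10, mask_fraction=0.5)
```

The returned indices identify which words a model would need to recover, so the same masked audio can be given to a unimodal and a multimodal ASR model for comparison.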

research
02/13/2020

Looking Enhances Listening: Recovering Missing Speech Using Images

Speech is understood better by using visual context; for this reason, th...
research
09/18/2023

Instruction-Following Speech Recognition

Conventional end-to-end Automatic Speech Recognition (ASR) models primar...
research
07/14/2023

SGGNet^2: Speech-Scene Graph Grounding Network for Speech-guided Navigation

The spoken language serves as an accessible and efficient interface, ena...
research
06/11/2023

Impact of Experiencing Misrecognition by Teachable Agents on Learning and Rapport

While speech-enabled teachable agents have some advantages over typing-b...
research
10/14/2021

Identifying Introductions in Podcast Episodes from Automatically Generated Transcripts

As the volume of long-form spoken-word content such as podcasts explodes...
research
05/19/2022

Content-Context Factorized Representations for Automated Speech Recognition

Deep neural networks have largely demonstrated their ability to perform ...
research
08/17/2022

Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides

Lecture slide presentations, a sequence of pages that contain text and f...
