ViLaS: Integrating Vision and Language into Automatic Speech Recognition

05/31/2023
by Minglun Han, et al.

Employing additional multimodal information to improve automatic speech recognition (ASR) performance has proven effective in previous work. However, most of these works focus only on visual cues from human lip motion. In fact, context-dependent visual and linguistic cues can also improve ASR performance in many scenarios. In this paper, we first propose a multimodal ASR model (ViLaS) that can integrate visual and linguistic cues, simultaneously or separately, to help recognize the input speech, and we introduce a training strategy that improves performance in modal-incomplete test scenarios. We then create a multimodal ASR dataset (VSDial) with visual and linguistic cues to explore the effects of integrating vision and language. Finally, we report empirical results on the public Flickr8K dataset and the self-constructed VSDial dataset, investigate cross-modal fusion schemes, and analyze fine-grained cross-modal alignment on VSDial.
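
To make the fusion idea concrete, below is a minimal PyTorch sketch of one plausible cross-attention fusion scheme: the acoustic stream attends to visual and textual cue sequences when they are available and passes through unchanged when they are not (the modal-incomplete case). All module names, dimensions, and the fusion design itself are illustrative assumptions, not the actual ViLaS implementation described in the paper.

    # A minimal sketch of cross-modal fusion for ASR, assuming a generic
    # cross-attention design; names and dimensions are illustrative and
    # do not reproduce the actual ViLaS architecture.
    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        """Fuse acoustic features with optional visual and linguistic cues
        via cross-attention, falling back to the acoustic stream alone
        when a cue is absent (the modal-incomplete case)."""

        def __init__(self, d_model: int = 256, n_heads: int = 4):
            super().__init__()
            self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, speech, visual=None, text=None):
            # speech: (batch, T_s, d_model) acoustic encoder output
            fused = speech
            if visual is not None:  # (batch, T_v, d_model) image-region features
                attn_out, _ = self.visual_attn(fused, visual, visual)
                fused = fused + attn_out
            if text is not None:    # (batch, T_t, d_model) context-token embeddings
                attn_out, _ = self.text_attn(fused, text, text)
                fused = fused + attn_out
            return self.norm(fused)

    # Usage: all modalities present, or any subset at test time.
    fusion = CrossModalFusion()
    speech = torch.randn(2, 100, 256)
    visual = torch.randn(2, 36, 256)
    text = torch.randn(2, 20, 256)
    out_full = fusion(speech, visual, text)      # audio + vision + language
    out_audio_only = fusion(speech)              # modal-incomplete case
    print(out_full.shape, out_audio_only.shape)  # both (2, 100, 256)

Treating each cue as an optional cross-attention key/value stream is one simple way to let the same model run with any subset of modalities at test time, which mirrors the modal-incomplete scenarios that the paper's training strategy targets.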

Related research

Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition (07/04/2021)
We propose cross-modal transformer-based neural correction models that...

Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation (01/30/2023)
Large-scale pre-trained language models (PLMs) with powerful language mo...

ASR is all you need: cross-modal distillation for lip reading (11/28/2019)
The goal of this work is to train strong models for visual speech recogn...

A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions (10/07/2019)
Instructional videos get high-traffic on video sharing platforms, and pr...

Cascaded Cross-Modal Transformer for Request and Complaint Detection (07/27/2023)
We propose a novel cascaded cross-modal transformer (CCMT) that combines...

Transcription free filler word detection with Neural semi-CRFs (03/11/2023)
Non-linguistic filler words, such as "uh" or "um", are prevalent in spon...

Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR (07/03/2022)
Leveraging context information is an intuitive idea to improve performan...
