Listen, Look and Deliberate: Visual context-aware speech recognition using pre-trained text-video representations

11/08/2020
by Shahram Ghorbani, et al.

In this study, we address the problem of leveraging visual signals to improve Automatic Speech Recognition (ASR), a task known as visual context-aware ASR (VC-ASR). We explore novel VC-ASR approaches that use video and text representations extracted by a self-supervised pre-trained text-video embedding model. First, we propose a multi-stream attention architecture to leverage signals from both the audio and video modalities. This architecture consists of separate encoders for the two modalities and a single decoder that attends over them. We show that this architecture is better than fusing the modalities at the signal level. We also explore leveraging the visual information in a second-pass model, which has been referred to as a "deliberation model". The deliberation model accepts audio representations and text hypotheses from the first-pass ASR and combines them with a visual stream for improved visual context-aware recognition. The proposed deliberation scheme can work on top of any well-trained ASR and also enables us to leverage the pre-trained text model to ground the hypotheses with the visual features. Our experiments on the HOW2 dataset show that the multi-stream and deliberation architectures are very effective at the VC-ASR task. We evaluate the proposed models in two scenarios: a clean audio stream, and distorted audio in which we mask out some specific words. The deliberation model outperforms the multi-stream model and achieves a relative WER improvement of 6% for the clean and masked data, respectively, compared to an audio-only model. The deliberation model also improves recovery of the masked words by 59% relative.
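
The multi-stream idea (separate encoders per modality, one decoder that attends over both) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say which attention function is used or how the two context vectors are combined, so scaled dot-product attention and simple concatenation are assumptions here, and the names `attend` and `multi_stream_context`, the random "encoder outputs", and all dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, states):
    """Scaled dot-product attention of one decoder query over one encoder stream.

    query:  (d,)   current decoder state
    states: (T, d) encoder outputs for a single modality
    """
    scores = states @ query / np.sqrt(query.shape[-1])  # (T,)
    weights = softmax(scores)                           # (T,), sums to 1
    return weights @ states, weights                    # context (d,), weights (T,)

def multi_stream_context(query, audio_states, video_states):
    # Assumption: each stream gets its own attention, and the two context
    # vectors are concatenated before being fed back to the decoder.
    audio_ctx, audio_w = attend(query, audio_states)
    video_ctx, video_w = attend(query, video_states)
    return np.concatenate([audio_ctx, video_ctx]), audio_w, video_w

# Stand-ins for encoder outputs: 50 audio frames and 5 video segments, dim 8.
rng = np.random.default_rng(0)
d = 8
audio_states = rng.normal(size=(50, d))
video_states = rng.normal(size=(5, d))
query = rng.normal(size=d)

context, audio_w, video_w = multi_stream_context(query, audio_states, video_states)
```

Because each stream keeps its own attention distribution, the two modalities may have different sequence lengths, which signal-level fusion would not allow without first aligning them.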
