Audio-Visual Speech Recognition With A Hybrid CTC/Attention Architecture

09/28/2018
by Stavros Petridis, et al.

Recent works in speech recognition rely either on connectionist temporal classification (CTC) or on sequence-to-sequence models for character-level recognition. CTC assumes conditional independence of individual characters, whereas attention-based models can provide nonsequential alignments. A CTC loss can therefore be combined with an attention-based model to force monotonic alignments while removing the conditional independence assumption. In this paper, we use the recently proposed hybrid CTC/attention architecture for audio-visual recognition of speech in-the-wild. To the best of our knowledge, this is the first time that such a hybrid architecture is used for audio-visual recognition of speech. We use the LRS2 database and show that the proposed audio-visual model leads to a 1.3% absolute decrease in word error rate over the audio-only model and achieves the new state-of-the-art performance on the LRS2 database (7% word error rate). We also observe that the audio-visual model significantly outperforms the audio-based model (up to 32.9% absolute improvement in word error rate) for several different types of noise as the signal-to-noise ratio decreases.
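
The idea of combining a CTC loss with an attention-based loss can be illustrated with a small sketch. The abstract does not give the objective, so the code below assumes the commonly used interpolation form, loss = lam * L_CTC + (1 - lam) * L_att. The weight `lam`, its default value, and all function names are hypothetical choices for illustration, not the authors' implementation. The CTC term is computed with the standard forward algorithm over per-frame log-probabilities, and the attention term is a plain per-step cross-entropy.

```python
import math


def log_sum_exp(*xs):
    """Numerically stable log(sum(exp(x))) for a few log-domain values."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC loss via the forward algorithm.

    log_probs[t][k] is the log-probability of symbol k at frame t.
    Frames are treated as conditionally independent given the input,
    which is the assumption the abstract refers to.
    """
    # Extended target: a blank before, between, and after the labels.
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S = len(ext)

    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [-math.inf] * S
        for s in range(S):
            cands = [alpha[s]]
            if s >= 1:
                cands.append(alpha[s - 1])
            # Skipping over a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = log_sum_exp(*cands) + log_probs[t][ext[s]]
        alpha = new

    total = log_sum_exp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total


def attention_neg_log_likelihood(step_log_probs, target):
    """Cross-entropy of a decoder that emits one distribution per output step."""
    return -sum(step_log_probs[i][c] for i, c in enumerate(target))


def hybrid_loss(ctc_log_probs, att_log_probs, target, lam=0.2):
    """Assumed interpolation of the two losses; lam is a hypothetical weight."""
    l_ctc = ctc_neg_log_likelihood(ctc_log_probs, target)
    l_att = attention_neg_log_likelihood(att_log_probs, target)
    return lam * l_ctc + (1.0 - lam) * l_att
```

With `lam = 1` the objective reduces to pure CTC and with `lam = 0` to pure attention. Intermediate values let the CTC term push the model toward monotonic alignments while the attention decoder still conditions each character on the previous ones.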

