Deep Learning for Lip Reading using Audio-Visual Information for Urdu Language

02/15/2018
by M. Faisal, et al.

Human lip-reading is a challenging task: it requires not only knowledge of the underlying language but also visual cues to predict the spoken words. Human experts need a certain level of experience with, and understanding of, visual expressions to decode speech. Nowadays, deep learning makes it possible to translate lip sequences into meaningful words, and speech recognition in noisy environments can be improved with visual information [1]. To demonstrate this, in this project we trained two deep-learning models for lip-reading: one for video sequences, built from spatiotemporal convolutional layers, a bidirectional gated recurrent network, and the Connectionist Temporal Classification (CTC) loss; and one for audio, which feeds MFCC features into a layer of LSTM cells and outputs the sequence. We also collected a small audio-visual dataset to train and test our models. Our goal is to integrate the two models to improve speech recognition in noisy environments.
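The two branches described above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: the layer sizes, the 40-character vocabulary, the 13 MFCC coefficients, and the 32x32 mouth-crop input are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class VideoBranch(nn.Module):
    """Spatiotemporal conv front-end -> bidirectional GRU -> per-frame logits."""
    def __init__(self, hidden=96, n_chars=40):
        super().__init__()
        self.conv = nn.Sequential(
            # 3D convolution over (time, height, width) of the mouth-crop clip
            nn.Conv3d(1, 16, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
        )
        self.gru = nn.GRU(16 * 8 * 8, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_chars + 1)  # +1 for the CTC blank symbol

    def forward(self, x):                 # x: (batch, 1, time, 32, 32)
        f = self.conv(x)                  # (batch, 16, time, 8, 8)
        f = f.permute(0, 2, 1, 3, 4).flatten(2)  # (batch, time, 16*8*8)
        out, _ = self.gru(f)
        return self.fc(out).log_softmax(-1)      # (batch, time, n_chars+1)

class AudioBranch(nn.Module):
    """MFCC frames -> single LSTM layer -> per-frame logits (as in the abstract)."""
    def __init__(self, n_mfcc=13, hidden=128, n_chars=40):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_chars + 1)

    def forward(self, x):                 # x: (batch, time, n_mfcc)
        out, _ = self.lstm(x)
        return self.fc(out).log_softmax(-1)

# Both branches emit frame-wise log-probabilities suitable for CTC training.
video_lp = VideoBranch()(torch.randn(2, 1, 50, 32, 32))  # (2, 50, 41)
audio_lp = AudioBranch()(torch.randn(2, 50, 13))         # (2, 50, 41)

ctc = nn.CTCLoss(blank=40)               # blank index matches n_chars above
targets = torch.randint(0, 40, (2, 10))  # dummy label sequences of length 10
loss = ctc(video_lp.transpose(0, 1),     # CTC expects (time, batch, classes)
           targets,
           torch.full((2,), 50),         # input (frame) lengths
           torch.full((2,), 10))         # target lengths
```

CTC lets both branches be trained without frame-level alignments between video/audio frames and characters, which is why the same loss can serve the two modalities.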

Related research

Deep Audio-Visual Speech Recognition (09/06/2018)
The goal of this work is to recognise phrases and sentences being spoken...

Resource aware design of a deep convolutional-recurrent neural network for speech recognition through audio-visual sensor fusion (03/13/2018)
Today's Automatic Speech Recognition systems only rely on acoustic signa...

A Convolutional Neural Network Based Approach to Recognize Bangla Spoken Digits from Speech Signal (11/12/2021)
Speech recognition is a technique that converts human speech signals int...

Lip Reading Sentences in the Wild (11/16/2016)
The goal of this work is to recognise phrases and sentences being spoken...

MobiVSR: A Visual Speech Recognition Solution for Mobile Devices (05/10/2019)
Visual speech recognition (VSR) is the task of recognizing spoken langua...

Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping (08/11/2023)
Visual Speech Recognition (VSR) differs from the common perception tasks...

LIPSFUS: A neuromorphic dataset for audio-visual sensory fusion of lip reading (03/28/2023)
This paper presents a sensory fusion neuromorphic dataset collected with...
