LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models

06/25/2019
by Dilip Kumar Margam, et al.

In recent years, deep-learning-based machine lipreading has gained prominence. Several architectures, such as LipNet and LCANet, have been proposed that perform far better than traditional lipreading DNN-HMM hybrid systems trained on DCT features. In this work, we propose a simpler architecture: a 3D-2D-CNN-BLSTM network with a bottleneck layer. We also present an analysis of two different lipreading approaches built on this architecture. In the first approach, the 3D-2D-CNN-BLSTM network is trained with CTC loss on character labels (ch-CTC); a BLSTM-HMM model is then trained on bottleneck lip features (extracted from the 3D-2D-CNN-BLSTM ch-CTC network) in a traditional ASR training pipeline. In the second approach, the same 3D-2D-CNN-BLSTM network is trained with CTC loss on word labels (w-CTC). The first approach shows that bottleneck features outperform DCT features. Using the second approach on the Grid corpus's seen-speaker test set, we report 1.3% WER, a 55% relative improvement over LCANet. On the unseen-speaker test set we report 8.6% WER, a 24.5% relative improvement over LipNet. We also verify the method on a second dataset of 81 speakers that we collected. Finally, we discuss the effect of feature duplication on BLSTM-HMM model performance.
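The practical difference between the two CTC labeling schemes can be illustrated with a minimal sketch. The function names and the vocabulary-building details below are our own illustration, not code from the paper; the example transcripts merely follow the Grid corpus command structure. In both schemes, index 0 is conventionally reserved for the CTC blank symbol.

```python
# Illustrative sketch (not the authors' code): mapping transcripts to CTC
# target sequences under character-level (ch-CTC) vs word-level (w-CTC)
# labeling. Index 0 is reserved for the CTC blank symbol in both cases.

def char_ctc_targets(transcripts):
    """ch-CTC: each target is a sequence of character indices."""
    chars = sorted({c for t in transcripts for c in t})
    idx = {c: i + 1 for i, c in enumerate(chars)}  # 0 = blank
    return [[idx[c] for c in t] for t in transcripts], idx

def word_ctc_targets(transcripts):
    """w-CTC: each target is a sequence of word indices (one label per word)."""
    words = sorted({w for t in transcripts for w in t.split()})
    idx = {w: i + 1 for i, w in enumerate(words)}  # 0 = blank
    return [[idx[w] for w in t.split()] for t in transcripts], idx

# Grid-style command sentences (verb colour preposition letter digit adverb)
transcripts = ["bin blue at f two now", "place red by g nine soon"]
ch_targets, ch_vocab = char_ctc_targets(transcripts)
w_targets, w_vocab = word_ctc_targets(transcripts)

# w-CTC targets are much shorter than ch-CTC targets for the same utterance,
# since one label covers a whole word rather than a single character.
print(len(ch_targets[0]), len(w_targets[0]))  # → 21 6
```

The shorter word-level target sequences change what the network must align against the video frames: with w-CTC the model emits one label per spoken word, whereas with ch-CTC it must resolve individual characters, which is one plausible reason the two schemes behave differently in training.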


