Is Lip Region-of-Interest Sufficient for Lipreading?

05/28/2022
by   Jing-Xuan Zhang, et al.

The lip region-of-interest (ROI) is conventionally used as the visual input for lipreading. Few works have adopted the entire face, because the lip-excluded parts of the face are usually considered redundant and irrelevant to visual speech recognition. However, faces contain much more detailed information than lips, such as the speaker's head pose, emotion, and identity. We argue that such information can benefit visual speech recognition if a powerful feature extractor employing the entire face is trained. In this work, we propose to adopt the entire face for lipreading with self-supervised learning. AV-HuBERT, an audio-visual multi-modal self-supervised learning framework, was adopted in our experiments. Our experimental results showed that adopting the entire face achieved a 16% relative word error rate (WER) reduction on the lipreading task, compared with the baseline method using the lip ROI as visual input. Without self-supervised pretraining, the model with face input achieved a higher WER than the one with lip input when training data was limited (30 hours), but a slightly lower WER when a large amount of training data (433 hours) was used.
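The difference between the two input conditions comes down to the preprocessing crop. As a minimal sketch (not the paper's actual pipeline; the crop sizes and landmark coordinates here are illustrative assumptions), a lip-ROI system crops a small window around a mouth landmark, while the proposed approach crops the whole face:

```python
import numpy as np

def crop_roi(frame, center, size):
    """Crop a square region of `size` pixels centered at `center` (x, y),
    clamped so the crop stays inside the frame."""
    h, w = frame.shape[:2]
    half = size // 2
    cx, cy = center
    x0 = max(0, min(w - size, cx - half))
    y0 = max(0, min(h - size, cy - half))
    return frame[y0:y0 + size, x0:x0 + size]

# Simulated 240x240 grayscale video frame; in practice the landmarks
# would come from a face/landmark detector (hypothetical values here).
frame = np.zeros((240, 240), dtype=np.uint8)
mouth_center = (120, 170)  # assumed mouth-center landmark (x, y)
face_center = (120, 120)   # assumed face-center landmark (x, y)

lip_roi = crop_roi(frame, mouth_center, 96)   # conventional lip-ROI input
face_roi = crop_roi(frame, face_center, 224)  # entire-face input

print(lip_roi.shape, face_roi.shape)  # (96, 96) (224, 224)
```

The face crop retains head pose, expression, and identity cues that the tight lip window discards, which is exactly the extra information the abstract argues a self-supervised feature extractor can exploit.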

