LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers

02/04/2023
by Feng Xue, et al.

Lipreading refers to understanding and translating the speech of a speaker in a video into natural language. State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers who appear in both the training and inference sets. However, generalizing these methods to unseen speakers incurs catastrophic performance degradation, owing to the limited number of speakers in the training set and the evident visual variations in lip shape and color across speakers. Relying merely on the visible changes of the lips therefore tends to cause overfitting. To address this problem, we propose to use multi-modal features across the visual and landmark domains, which describe lip motion irrespective of speaker identity. We then develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer. Specifically, LipFormer consists of a lip motion stream, a facial landmark stream, and a cross-modal fusion module. The embeddings of the two streams are produced by self-attention and fed to a cross-attention module that aligns visual features with landmarks. Finally, the fused features are decoded into text by a cascaded seq2seq model. Experiments demonstrate that our method effectively enhances model generalization to unseen speakers.
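The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of the described cross-modal fusion step: per-stream self-attention followed by cross-attention that aligns the visual stream with the landmark stream. The embedding dimension, head count, residual connection, and all names (CrossModalFusion, d_model, etc.) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    # Fuses the lip-motion (visual) and facial-landmark streams, as
    # described in the abstract. All hyperparameters are assumed values.
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # Per-stream self-attention producing each modality's embeddings.
        self.visual_self = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.landmark_self = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Cross-attention: visual queries attend to landmark keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual, landmark):
        # visual:   (batch, frames, d_model) lip-region embeddings
        # landmark: (batch, frames, d_model) landmark embeddings
        v = self.visual_self(visual)
        l = self.landmark_self(landmark)
        fused, _ = self.cross_attn(query=v, key=l, value=l)
        return self.norm(v + fused)  # residual connection (an assumption)

fusion = CrossModalFusion()
visual = torch.randn(2, 75, 256)    # e.g., 75 video frames per clip
landmark = torch.randn(2, 75, 256)
fused = fusion(visual, landmark)    # (2, 75, 256)

In the paper's pipeline, the fused features would then be decoded into a sentence by the cascaded seq2seq model; that decoder is omitted from this sketch.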


Related research

08/23/2022 · Towards Accurate Facial Landmark Detection via Cascaded Transformers
Accurate facial landmarks are essential prerequisites for many tasks rel...

05/14/2020 · FaceFilter: Audio-visual speech separation using still images
The objective of this paper is to separate a target speaker's speech fro...

01/08/2021 · VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency
We introduce a new approach for audio-visual speech separation. Given a ...

06/25/2023 · AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction
Visual information can serve as an effective cue for target speaker extr...

11/06/2018 · Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments
In this paper, we address the problem of enhancing the speech of a speak...

10/13/2020 · Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition
Lip motion reflects behavior characteristics of speakers, and thus can b...

07/01/2019 · Synchronising audio and ultrasound by learning cross-modal embeddings
Audiovisual synchronisation is the task of determining the time offset b...
