Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition

by Jianrong Wang, et al.

Lip motion reflects behavioral characteristics of speakers and can therefore be used as a new kind of biometric in speaker recognition. In the literature, many works have used two-dimensional (2D) lip images to recognize speakers in a text-dependent context. However, 2D lip images are sensitive to variations in face orientation. To address this, we present a novel end-to-end 3D Lip Motion Network (3LMNet) that utilizes sentence-level 3D lip motion (S3DLM) to recognize speakers in both text-independent and text-dependent contexts. A new regional feedback module (RFM) is proposed to obtain attention over different lip regions. In addition, prior knowledge of lip motion is investigated to complement the RFM, where landmark-level and frame-level features are merged to form a better feature representation. Moreover, we present two pre-processing methods, i.e., coordinate transformation and face posture correction, for the LSD-AV dataset, which contains 68 speakers and 146 sentences per speaker. Evaluation results on this dataset demonstrate that the proposed 3LMNet is superior to the baseline models, i.e., LSTM, VGG-16, and ResNet-34, and outperforms the state of the art using 2D lip images as well as the 3D face. The code of this work is released at Motion-Network-for-Text-Independent-Speaker-Recognition.
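The abstract does not detail how the coordinate transformation and face posture correction are implemented; a minimal sketch of one plausible approach is a PCA-based rigid alignment of the 3D lip landmarks, which centers each frame's point cloud and rotates it to a canonical pose. The function name and the choice of SVD-based alignment here are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def correct_posture(landmarks):
    """Rigidly align 3D lip landmarks to a canonical pose (illustrative sketch).

    landmarks: (N, 3) array of 3D lip points for one frame.
    Returns the points centered at the origin and rotated so that
    their principal axes align with the coordinate axes.
    """
    # Coordinate transformation: move the origin to the lip centroid.
    centered = landmarks - landmarks.mean(axis=0)
    # SVD of the centered cloud yields the principal directions;
    # rotating by V^T removes head roll/pitch/yaw up to axis sign,
    # which serves as a simple posture correction.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt.T
```

Applied per frame, this would give every sentence-level 3D lip sequence a consistent position and orientation before it is fed to the network, which is the stated goal of the two pre-processing steps.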






