Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition

06/25/2021
by   Jianrong Wang, et al.

Cued Speech (CS) is a visual communication system for deaf and hearing-impaired people. It combines lip movements with hand cues to provide a complete phonetic repertoire. Current deep-learning-based methods for automatic CS recognition share a common problem: data scarcity. Until now, only two public single-speaker datasets have been available, for French (238 sentences) and British English (97 sentences). In this work, we propose a cross-modal knowledge distillation method with a teacher-student structure, which transfers audio speech information to CS to overcome the limited-data problem. First, we pretrain a teacher model for CS recognition on a large amount of open-source audio speech data, and simultaneously pretrain the feature extractors for lips and hands using CS data. Then, we distill the knowledge from the teacher model to the student model with frame-level and sequence-level distillation strategies. Importantly, for frame-level distillation, we exploit multi-task learning to weigh the losses automatically and obtain the balance coefficient. In addition, we establish the first five-speaker British English CS dataset. The proposed method is evaluated on the French and British English CS datasets and outperforms the state of the art (SOTA) in CS recognition by a large margin.
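
The frame-level distillation with automatically weighted losses described in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss definitions or the weighting scheme, so everything below is an assumption: the frame-level distillation term is taken to be a per-frame KL divergence between temperature-softened teacher and student posteriors, the task term is a per-frame cross-entropy, and the automatic weighting is the uncertainty-style form sum(exp(-s_i) * L_i + s_i) that is often used in multi-task learning, where each s_i would be a learnable parameter. All function names are hypothetical, and the teacher and student outputs are assumed to be frame-aligned.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one frame of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def frame_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean per-frame KL(teacher || student) on softened posteriors.

    Assumption: both inputs are lists of frames with the same length
    and the same number of classes per frame.
    """
    total = 0.0
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        p = softmax(t_frame, temperature)
        q = softmax(s_frame, temperature)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_logits)


def frame_ce_loss(student_logits, labels):
    """Mean per-frame cross-entropy against integer class labels."""
    total = 0.0
    for s_frame, label in zip(student_logits, labels):
        total -= math.log(softmax(s_frame)[label])
    return total / len(labels)


def weighted_total(losses, log_vars):
    """Uncertainty-style weighting (an assumed scheme, not the paper's).

    Each loss L_i is scaled by exp(-s_i) and regularized by + s_i, so
    the balance coefficients can be learned instead of hand-tuned.
    """
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


# Toy example: 2 frames, 3 phoneme classes (made-up numbers).
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student = [[1.0, 0.2, -0.5], [0.0, 1.0, 0.8]]
labels = [0, 1]

kd = frame_kd_loss(teacher, student)
ce = frame_ce_loss(student, labels)
total = weighted_total([kd, ce], log_vars=[0.0, 0.0])
```

With both `log_vars` set to zero the two losses are simply summed; in training, an optimizer would update them jointly with the student model's parameters.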


06/26/2021

An Attention Self-supervised Contrastive Learning based Three-stage Model for Hand Shape Feature Representation in Cued Speech

Cued Speech (CS) is a communication system for deaf people or hearing im...
04/07/2019

Long-Term Vehicle Localization by Recursive Knowledge Distillation

Most of the current state-of-the-art frameworks for cross-season visual ...
10/20/2021

Knowledge distillation from language model to acoustic model: a hierarchical multi-task learning approach

The remarkable performance of the pre-trained language model (LM) using ...
11/28/2019

ASR is all you need: cross-modal distillation for lip reading

The goal of this work is to train strong models for visual speech recogn...
01/03/2020

Re-synchronization using the Hand Preceding Model for Multi-modal Fusion in Automatic Continuous Cued Speech Recognition

Cued Speech (CS) is an augmented lip reading complemented by hand coding...
01/03/2020

A New Re-synchronization Method based Multi-modal Fusion for Automatic Continuous Cued Speech Recognition

Cued Speech (CS) is an augmented lip reading complemented by hand coding...
01/03/2020

A Pilot Study on Mandarin Chinese Cued Speech

Cued Speech (CS) is a communication system developed for deaf people, wh...