SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided Adaptive Memory

08/31/2021
by   Zhijie Lin, et al.

Lip reading, which aims to recognize spoken sentences from a video of lip movements without relying on the audio stream, has attracted great interest due to its many application scenarios. Although prior works on lip reading have achieved salient results, they are all trained in a non-simultaneous manner, where predictions are generated only with access to the full video. To break through this constraint, we study the task of simultaneous lip reading and devise SimulLR, a simultaneous lip reading transducer with attention-guided adaptive memory, from three aspects: (1) To address the challenge of monotonic alignments while considering the syntactic structure of the generated sentences under the simultaneous setting, we build a transducer-based model and design several effective training strategies, including CTC pre-training, model warm-up, and curriculum learning, to promote the training of the lip reading transducer. (2) To learn better spatio-temporal representations for the simultaneous encoder, we construct a truncated 3D convolution and a time-restricted self-attention layer to perform frame-to-frame interaction within a video segment containing a fixed number of frames. (3) History information is always limited by storage constraints in real-time scenarios, especially for massive video data. We therefore devise a novel attention-guided adaptive memory that organizes the semantic information of history segments and enhances the visual representations with acceptable computation-aware latency. Experiments show that SimulLR achieves a 9.10× translation speedup compared with state-of-the-art non-simultaneous methods while obtaining competitive accuracy, which indicates the effectiveness of the proposed methods.
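The segment-level encoding in (2) and the bounded history memory in (3) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: it only shows the general idea of restricting attention for each incoming segment to that segment's frames plus a small, fixed-size memory of compressed summaries of earlier segments. All names and hyperparameters here (d_model, segment_len, mem_slots, the mean-pooling memory update) are illustrative assumptions; the paper's attention-guided adaptive update rule is more elaborate than the simple summary shown below.

```python
# Minimal sketch (illustrative, not the SimulLR code): time-restricted
# self-attention over one video segment, with read access to a bounded
# memory of past-segment summaries.
import torch
import torch.nn as nn


class TimeRestrictedSelfAttention(nn.Module):
    """Attention limited to the current segment plus a small history memory."""

    def __init__(self, d_model=256, n_heads=4, segment_len=16, mem_slots=8):
        super().__init__()
        self.segment_len = segment_len
        self.mem_slots = mem_slots
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Projects a pooled segment representation into one memory slot.
        self.mem_proj = nn.Linear(d_model, d_model)

    def forward(self, frames, memory=None):
        """
        frames: (B, segment_len, d_model) features of the current segment
        memory: (B, M, d_model) summaries of earlier segments, or None
        returns: (updated frame features, updated memory)
        """
        # Keys/values = history memory slots + current segment frames.
        kv = frames if memory is None else torch.cat([memory, frames], dim=1)
        out, _ = self.attn(query=frames, key=kv, value=kv)

        # Compress the attended segment into a single memory slot
        # (mean pooling here as a stand-in for the attention-guided update).
        new_slot = self.mem_proj(out.mean(dim=1, keepdim=True))
        memory = new_slot if memory is None else torch.cat([memory, new_slot], dim=1)
        # Keep the memory bounded to respect real-time storage constraints.
        memory = memory[:, -self.mem_slots:, :]
        return out, memory


if __name__ == "__main__":
    layer = TimeRestrictedSelfAttention()
    mem = None
    for _ in range(3):  # process three incoming segments online
        seg = torch.randn(2, 16, 256)
        seg, mem = layer(seg, mem)
    print(seg.shape, mem.shape)  # (2, 16, 256) and (2, 3, 256)
```

Because the memory holds at most a fixed number of slots, each incoming segment is processed with constant cost regardless of how long the video has been streaming, which is the property that keeps the computation-aware latency acceptable in the simultaneous setting.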
