"Notic My Speech" – Blending Speech Patterns With Multimedia

06/12/2020
by   Dhruva Sahrawat, et al.

Speech as a natural signal is composed of three parts: visemes (the visual part of speech), phonemes (the spoken part), and language (the imposed structure). However, video, as a medium for delivering speech and as a multimedia construct, has largely ignored the cognitive aspects of speech delivery. For example, video applications such as transcoding and compression have so far ignored how speech is delivered and heard. To close the gap between speech understanding and multimedia video applications, in this paper we present initial experiments that model the perception of visual speech and demonstrate its use for video compression. In the visual speech recognition domain, on the other hand, existing studies have mostly treated recognition as a classification problem, ignoring the correlations between views, phonemes, visemes, and speech perception. This yields solutions that are far removed from how human perception works. To bridge this gap, we propose a view-temporal attention mechanism that models both view dependence and visemic importance in speech recognition and understanding. We conduct experiments on three public visual speech recognition datasets. The experimental results show that our proposed method outperforms existing work by 4.99% and that there is a strong correlation between our model's understanding of multi-view speech and human perception. This characteristic benefits downstream applications such as video compression and streaming, where a significant number of less important frames can be compressed or eliminated while maximally preserving human speech understanding and a good user experience.
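To make the idea concrete, the following is a minimal NumPy sketch of a view-temporal attention step and of how the resulting frame-importance scores could drive frame selection for compression. All names (`view_temporal_attention`, `frames_to_keep`), the query-vector formulation, and the shapes are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def view_temporal_attention(features, query):
    """Toy view-temporal attention.

    features: (V, T, D) frame embeddings for V camera views, T frames.
    query:    (D,) a learned query vector (assumed given here).
    Returns a fused (D,) representation and per-(view, frame) importance.
    """
    # View attention: score each view by its mean frame embedding.
    view_weights = softmax(features.mean(axis=1) @ query)          # (V,)
    # Temporal attention: score every frame within each view.
    temporal_weights = softmax(features @ query, axis=1)           # (V, T)
    # Combined importance of each (view, frame) pair; sums to 1 overall.
    frame_importance = view_weights[:, None] * temporal_weights    # (V, T)
    # Fused representation: importance-weighted sum of frame embeddings.
    fused = (frame_importance[..., None] * features).sum(axis=(0, 1))
    return fused, frame_importance

def frames_to_keep(frame_importance, budget):
    """Indices of the `budget` most important frames, as (views, times).

    Less important frames could then be dropped or compressed harder.
    """
    order = np.argsort(frame_importance.ravel())[::-1][:budget]
    return np.unravel_index(order, frame_importance.shape)
```

Under this sketch, a streaming pipeline would keep the frames returned by `frames_to_keep` at high quality and aggressively compress the rest, which mirrors the abstract's claim that many low-importance frames can be reduced while preserving speech understanding.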


