Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed

07/02/2018
by   Yaman Kumar, et al.

Speechreading, or lipreading, is the technique of inferring phonetic content from a speaker's visual cues, such as the movement of the lips, face, teeth, and tongue. It has a wide range of multimedia applications, for example in surveillance, Internet telephony, and as an aid to people with hearing impairments. However, most work in speechreading has been limited to generating text from silent videos. Recent research has begun to generate (audio) speech from silent video sequences, but no work so far has dealt with divergent views and poses of a speaker. Thus, even when multiple camera feeds of a speaker are available, they have not been exploited to handle different poses. To this end, this paper presents the first multi-view speechreading and reconstruction system. The work pushes the boundaries of multimedia research by putting forth a model that leverages silent video feeds from multiple cameras recording the same subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of exploiting multiple camera views in building an efficient speechreading and reconstruction system, and further indicate the optimal placement of cameras for maximum speech intelligibility. Finally, the paper lays out various innovative applications of the proposed system, highlighting its potential impact not just in the security arena but in many other multimedia analytics problems.
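The abstract does not specify the model architecture, but the core idea of combining several camera views of the same utterance into one speech estimate can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration: the random-projection "encoders" stand in for trained per-view visual networks, the confidence-weighted fusion is one plausible way to combine views, and the 80-dimensional output mimics mel-spectrogram frames that a vocoder could turn into audio.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_view_features(frames):
    """Stand-in for a per-view visual encoder (e.g. a CNN over lip crops).
    frames: (T, H, W) grayscale video -> (T, D) feature sequence."""
    T = frames.shape[0]
    flat = frames.reshape(T, -1)
    W = rng.standard_normal((flat.shape[1], 64)) * 0.01  # random projection stub
    return np.tanh(flat @ W)

def fuse_views(view_feats):
    """Confidence-weighted average of per-view features.
    view_feats: list of (T, D) arrays, one per camera view."""
    stacked = np.stack(view_feats)                # (V, T, D)
    scores = stacked.mean(axis=2, keepdims=True)  # crude per-view salience
    weights = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return (weights * stacked).sum(axis=0)        # (T, D)

def decode_speech(fused):
    """Stand-in decoder mapping fused features to spectrogram-like frames."""
    W = rng.standard_normal((fused.shape[1], 80)) * 0.01
    return fused @ W  # (T, 80) mel-style frames

# Three synthetic camera views of the same 25-frame utterance.
views = [rng.random((25, 32, 32)) for _ in range(3)]
spec = decode_speech(fuse_views([extract_view_features(v) for v in views]))
print(spec.shape)  # (25, 80)
```

In a real system the fusion weights would be learned, which is also where the paper's question of optimal camera placement enters: views that carry more articulatory information should receive higher weight.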

Related research

- Speech Reconstitution using Multi-view Silent Videos (07/02/2018)
  Speechreading broadly involves looking, perceiving, and interpreting spo...
- Lipper: Synthesizing Thy Speech using Multi-View Lipreading (06/28/2019)
  Lipreading has a lot of potential applications such as in the domain of ...
- "Notic My Speech" – Blending Speech Patterns With Multimedia (06/12/2020)
  Speech as a natural signal is composed of three parts - visemes (visual ...
- HLT-NUS Submission for NIST 2019 Multimedia Speaker Recognition Evaluation (10/08/2020)
  This work describes the speaker verification system developed by Human L...
- Volumetric performance capture from minimal camera viewpoints (07/05/2018)
  We present a convolutional autoencoder that enables high fidelity volume...
- Lip2AudSpec: Speech reconstruction from silent lip movements video (10/26/2017)
  In this study, we propose a deep neural network for reconstructing intel...
- Visual Passwords Using Automatic Lip Reading (09/02/2014)
  This paper presents a visual passwords system to increase security. The ...
