Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation

03/25/2021
by Jiyoung Lee, et al.

In this paper, we address the problem of separating individual speech signals from videos using audio-visual neural processing. Most conventional approaches rely on frame-wise matching criteria to extract shared information between co-occurring audio and video, so their performance depends heavily on the accuracy of audio-visual synchronization and the quality of their representations. To overcome the frame discontinuity between the two modalities caused by transmission delay mismatch or jitter, we propose a cross-modal affinity network (CaffNet) that learns a global correspondence as well as locally-varying affinities between the audio and visual streams. Because the global term provides stability over the temporal sequence at the utterance level, it resolves the label permutation problem, which is characterized by inconsistent speaker assignments. By extending the proposed cross-modal affinity to a complex-valued network, we further improve separation performance in the complex spectral domain. Experimental results verify that the proposed methods outperform conventional ones on various datasets, demonstrating their advantages in real-world scenarios.
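The abstract does not specify the architecture in detail, so the following is only a minimal NumPy sketch of the general idea it describes: scoring every audio frame against every visual frame to obtain locally-varying affinities, summarizing them into a single utterance-level correspondence term, and using the soft alignment to warp visual features onto the audio time axis when the two streams are not frame-synchronized. The function name, temperature parameter, feature shapes, and the particular choice of global score are hypothetical, not taken from the paper.

```python
# A minimal sketch (not the authors' code) of a cross-modal affinity:
# pairwise audio-visual similarities plus an utterance-level global score.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_affinity(audio_feats, visual_feats, temperature=0.1):
    """audio_feats: (T_a, D) audio frame embeddings.
    visual_feats: (T_v, D) visual frame embeddings; T_v may differ from
    T_a, e.g. under transmission delay mismatch or jitter."""
    # L2-normalize so affinities are cosine similarities in [-1, 1].
    a = audio_feats / (np.linalg.norm(audio_feats, axis=1, keepdims=True) + 1e-8)
    v = visual_feats / (np.linalg.norm(visual_feats, axis=1, keepdims=True) + 1e-8)

    # Locally-varying affinity: (T_a, T_v) similarity for every frame pair.
    local_affinity = a @ v.T

    # Global correspondence: one utterance-level score; here, the mean of
    # each audio frame's best visual match (one plausible choice).
    global_score = local_affinity.max(axis=1).mean()

    # Soft alignment: per audio frame, a distribution over visual frames,
    # letting visual features be resampled onto the audio time axis.
    alignment = softmax(local_affinity / temperature, axis=1)  # (T_a, T_v)
    aligned_visual = alignment @ v                             # (T_a, D)

    return local_affinity, global_score, aligned_visual

# Toy usage: 100 audio frames vs. 25 slower, unsynchronized visual frames.
rng = np.random.default_rng(0)
audio = rng.standard_normal((100, 64))
video = rng.standard_normal((25, 64))
aff, g, warped = cross_modal_affinity(audio, video)
print(aff.shape, round(float(g), 3), warped.shape)  # (100, 25) ... (100, 64)
```

In this reading, the soft alignment is what lets separation tolerate desynchronized streams, while the single global score gives a stable utterance-level signal of which face goes with which voice.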

Related research:

- Audio-Visual Speech Separation Using Cross-Modal Correspondence Loss (03/02/2021): We present an audio-visual speech separation learning method that consid...
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency (01/08/2021): We introduce a new approach for audio-visual speech separation. Given a ...
- Audio-visual speech separation based on joint feature representation with cross-modal attention (03/05/2022): Multi-modal based speech separation has exhibited a specific advantage o...
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder (08/14/2023): In recent research, slight performance improvement is observed from auto...
- CrossA11y: Identifying Video Accessibility Issues via Cross-modal Grounding (08/23/2022): Authors make their videos visually accessible by adding audio descriptio...
- Rule-embedded network for audio-visual voice activity detection in live musical video streams (10/27/2020): Detecting anchor's voice in live musical streams is an important preproc...
- A Fractal Approach to Characterize Emotions in Audio and Visual Domain: A Study on Cross-Modal Interaction (02/11/2021): It is already known that both auditory and visual stimulus is able to co...
