Perfect match: Improved cross-modal embeddings for audio-visual synchronisation

09/21/2018
by Soo-Whan Chung, et al.

This paper proposes a new strategy for learning powerful cross-modal embeddings for audio-to-video synchronization. We frame the problem as one of cross-modal retrieval, where the objective is to find the most relevant audio segment given a short video clip. The method builds on recent advances in learning representations from cross-modal self-supervision. The main contributions of this paper are as follows: (1) we propose a new learning strategy in which the embeddings are learnt via a multi-way matching problem, as opposed to the binary classification (matching or non-matching) problem used in recent papers; (2) we demonstrate that the performance of this method far exceeds the existing baselines on the synchronization task; (3) we use the embeddings learnt through self-supervision for visual speech recognition, and show that their performance matches that of representations learnt end-to-end in a fully-supervised manner.
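The contrast between multi-way matching and binary classification can be illustrated with a small sketch. This is not the authors' implementation: the function names (`squared_distance`, `multiway_matching_loss`, `retrieve`) are hypothetical, and using the negative squared Euclidean distance as the matching score is an assumption made for illustration. The idea shown is that one video embedding is scored against N candidate audio embeddings at once, and a softmax cross-entropy loss rewards the true match relative to all other candidates.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def multiway_matching_loss(video_emb, audio_embs, target_index):
    """Cross-entropy over N candidate audio segments for one video clip.

    Assumption for this sketch: the score of a candidate is the negative
    squared distance to the video embedding, so closer means more likely.
    """
    scores = [-squared_distance(video_emb, a) for a in audio_embs]
    # Numerically stable log-sum-exp over the N candidates.
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    # Negative log-probability assigned to the true candidate.
    return log_norm - scores[target_index]


def retrieve(video_emb, audio_embs):
    """Return the index of the candidate closest to the video embedding."""
    dists = [squared_distance(video_emb, a) for a in audio_embs]
    return min(range(len(dists)), key=dists.__getitem__)


if __name__ == "__main__":
    # Toy 2-D embeddings, invented purely for illustration.
    video = [1.0, 0.0]
    candidates = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
    print("retrieved index:", retrieve(video, candidates))
    print("loss if candidate 1 is the match:",
          multiway_matching_loss(video, candidates, 1))
    print("loss if candidate 0 is the match:",
          multiway_matching_loss(video, candidates, 0))
```

A binary formulation would instead judge each (video, audio) pair in isolation as matching or non-matching. In the multi-way sketch the loss depends on how the true candidate ranks against the others, which mirrors the retrieval objective described in the abstract.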


