Deep Co-Clustering for Unsupervised Audiovisual Learning

07/09/2018
by Di Hu, et al.

Birds twitter as they fly past, passing cars rumble, people speak face-to-face: such natural audiovisual correspondences offer rich cues for exploring and understanding the outside world. However, when multiple objects and sounds are mixed together, efficient matching becomes intractable in unconstrained environments. To address this problem, we propose to thoroughly disentangle the audio and visual components and perform elaborate correspondence learning among them. Concretely, we propose a novel unsupervised audiovisual learning model, named Deep Co-Clustering (DCC), that synchronously performs sets of clusterings over multimodal vectors of convolutional maps in different shared spaces, thereby capturing multiple audiovisual correspondences. This integrated multimodal clustering network can be trained effectively with a max-margin loss in an end-to-end fashion. Extensive experiments on feature evaluation and audiovisual tasks demonstrate that DCC learns effective unimodal representations, with which classifiers can even outperform humans. Further, DCC shows noticeable performance in sound localization, multisource detection, and audiovisual understanding.
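The max-margin training objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's exact formulation, only a common hinge-style ranking loss under the assumption that matched audio/visual embedding pairs should score higher than mismatched pairs by a fixed margin; the function name and interface are hypothetical.

```python
import numpy as np

def max_margin_loss(audio_emb, visual_emb, margin=1.0):
    """Hinge-style max-margin loss over a batch of paired embeddings.

    audio_emb, visual_emb: (N, D) arrays; row i of each forms a
    matched audio/visual pair. Matched similarities (the diagonal of
    the pairwise similarity matrix) must exceed mismatched ones by
    at least `margin`, in both retrieval directions.
    """
    # Cosine similarity via dot products of L2-normalized embeddings.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    sim = a @ v.T                # sim[i, j]: audio i vs. visual j
    pos = np.diag(sim)           # matched-pair similarities

    # Hinge terms for every mismatched pair, in both directions.
    cost_av = np.maximum(0.0, margin + sim - pos[:, None])  # audio -> visual
    cost_va = np.maximum(0.0, margin + sim - pos[None, :])  # visual -> audio
    np.fill_diagonal(cost_av, 0.0)   # matched pairs incur no cost
    np.fill_diagonal(cost_va, 0.0)
    return (cost_av.sum() + cost_va.sum()) / len(pos)
```

When matched pairs are already separated from mismatched ones by the margin, the loss is zero; in a learned system the gradient of this quantity would pull matched audio and visual embeddings together in the shared space and push mismatched ones apart.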


