Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation

09/20/2023
by Heeseung Yun, et al.

Sound can convey significant information for spatial reasoning in our daily lives. To endow deep networks with such ability, we address the challenge of dense indoor prediction with sound in both 2D and 3D via cross-modal knowledge distillation. In this work, we propose a Spatial Alignment via Matching (SAM) distillation framework that elicits local correspondence between the two modalities in vision-to-audio knowledge transfer. SAM integrates audio features with visually coherent learnable spatial embeddings to resolve inconsistencies in multiple layers of a student model. Our approach does not rely on a specific input representation, allowing for flexibility in the input shapes or dimensions without performance degradation. With a newly curated benchmark named Dense Auditory Prediction of Surroundings (DAPS), we are the first to tackle dense indoor prediction of omnidirectional surroundings in both 2D and 3D with audio observations. Specifically, for audio-based depth estimation, semantic segmentation, and challenging 3D scene reconstruction, the proposed distillation framework consistently achieves state-of-the-art performance across various metrics and backbone architectures.
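The idea of matching audio features to a visual teacher's spatial layout can be illustrated with a small sketch. This is not the authors' SAM implementation: it is a minimal, forward-pass-only illustration that assumes one embedding per teacher location, a dot-product attention step for the matching, and a mean-squared-error distillation loss. The names `align_student_features`, `distillation_loss`, and all shapes are hypothetical, and the embeddings are random here, whereas in a real setup they would be trained jointly with the student.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def align_student_features(audio_feat, spatial_emb):
    """Re-express audio (student) features on the teacher's spatial grid.

    audio_feat:  (N_a, C) flattened student feature locations
    spatial_emb: (N_v, C) one embedding per teacher location
                 (assumed learnable; random in this sketch)
    returns:     (N_v, C) student features arranged like the teacher's
    """
    c = audio_feat.shape[1]
    # Each teacher location scores every audio location (scaled dot product).
    scores = spatial_emb @ audio_feat.T / np.sqrt(c)  # (N_v, N_a)
    weights = softmax(scores, axis=1)                 # each row sums to 1
    return weights @ audio_feat                       # (N_v, C)


def distillation_loss(aligned, teacher_feat):
    """Mean-squared error between aligned student and teacher features."""
    return float(np.mean((aligned - teacher_feat) ** 2))


rng = np.random.default_rng(0)
C = 16
# Hypothetical shapes: the audio map and the visual map need not match.
audio_feat = rng.normal(size=(6 * 10, C))    # e.g. a 6x10 spectrogram feature map
teacher_feat = rng.normal(size=(8 * 16, C))  # e.g. an 8x16 panorama feature map
spatial_emb = rng.normal(size=(8 * 16, C))   # assumed learnable embeddings

aligned = align_student_features(audio_feat, spatial_emb)
loss = distillation_loss(aligned, teacher_feat)
print(aligned.shape, loss)
```

Because the matching step maps any number of audio locations onto the teacher's grid, the sketch accepts audio feature maps of different shapes without changing the loss, which mirrors the flexibility in input shapes described above.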


research
11/25/2022

XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning

We present XKD, a novel self-supervised framework to learn meaningful re...
research
03/09/2020

Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

Humans can robustly recognize and localize objects by integrating visual...
research
01/18/2022

Cross-modal Contrastive Distillation for Instructional Activity Anticipation

In this study, we aim to predict the plausible future action steps given...
research
03/03/2023

X^3KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection

Recent advances in 3D object detection (3DOD) have obtained remarkably s...
research
03/14/2023

Feature-Rich Audio Model Inversion for Data-Free Knowledge Distillation Towards General Sound Classification

Data-Free Knowledge Distillation (DFKD) has recently attracted growing a...
research
11/23/2020

HoHoNet: 360 Indoor Holistic Understanding with Latent Horizontal Features

We present HoHoNet, a versatile and efficient framework for holistic und...
research
05/18/2022

Seeing Sounds, Hearing Shapes: a gamified study to evaluate sound-sketches

Sound-shape associations, a subset of cross-modal associations between t...
