
Learning Visual Styles from Audio-Visual Associations

by Tingle Li, et al.

From the patter of rain to the crunch of snow, the sounds we hear often convey the visual textures that appear within a scene. In this paper, we present a method for learning visual styles from unlabeled audio-visual data. Our model learns to manipulate the texture of a scene to match a sound, a problem we term audio-driven image stylization. Given a dataset of paired audio-visual data, we learn to modify input images such that, after manipulation, they are more likely to co-occur with a given input sound. In quantitative and qualitative evaluations, our sound-based model outperforms label-based approaches. We also show that audio can be an intuitive representation for manipulating images, as adjusting a sound's volume or mixing two sounds together results in predictable changes to visual style. Project webpage:
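The claim that volume and mixing act predictably on visual style can be illustrated with a toy sketch. This is a hypothetical illustration, not the paper's architecture: it assumes a linear audio encoder, under which scaling a waveform's volume scales its style embedding and mixing two sounds linearly interpolates their embeddings. The `encode_audio` and `stylize` functions and their shapes are assumptions for illustration only.

```python
import numpy as np

# Toy sketch (assumption, not the paper's model): a linear "audio
# encoder" maps a waveform to a style embedding, and a "stylizer"
# nudges an image toward that embedding. Linearity of the encoder
# makes volume scaling and sound mixing act predictably on style.

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 64))  # toy encoder weights (assumed shapes)

def encode_audio(waveform: np.ndarray) -> np.ndarray:
    """Toy audio encoder: a linear projection of the waveform."""
    return W @ waveform

def stylize(image: np.ndarray, style: np.ndarray, strength: float = 0.1) -> np.ndarray:
    """Toy stylizer: shift the image by the mean of the style embedding."""
    return image + strength * style.mean()

rain = rng.standard_normal(64)  # stand-in "rain" waveform
snow = rng.standard_normal(64)  # stand-in "snow" waveform

# Mixing two sounds mixes their style embeddings linearly.
emb_mixed = encode_audio(0.5 * rain + 0.5 * snow)
emb_interp = 0.5 * encode_audio(rain) + 0.5 * encode_audio(snow)
assert np.allclose(emb_mixed, emb_interp)

# Doubling the volume doubles the embedding, i.e. a stronger style.
assert np.allclose(encode_audio(2.0 * rain), 2.0 * encode_audio(rain))

# Applying a style leaves the image shape unchanged.
styled = stylize(np.zeros((4, 4)), encode_audio(rain))
assert styled.shape == (4, 4)
```

In the real model the encoder is a deep network, so these relationships hold only approximately, but the sketch captures why audio is a convenient interpolation space for style.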

