Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model

06/09/2023
by   Yida Chen, et al.
0

Latent diffusion models (LDMs) exhibit an impressive ability to produce realistic images, yet the inner workings of these models remain mysterious. Even when trained purely on images without explicit depth information, they typically output coherent pictures of 3D scenes. In this work, we investigate a basic interpretability question: does an LDM create and use an internal representation of simple scene geometry? Using linear probes, we find evidence that the internal activations of the LDM encode linear representations of both 3D depth data and a salient-object / background distinction. These representations appear surprisingly early in the denoising process-well before a human can easily make sense of the noisy images. Intervention experiments further indicate these representations play a causal role in image synthesis, and may be used for simple high-level editing of an LDM's output.

READ FULL TEXT

page 2

page 5

page 6

page 8

page 9

page 14

page 15

page 16

research
10/24/2022

Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task

Language models show a surprising range of capabilities, but the source ...
research
03/24/2023

DiffuScene: Scene Graph Denoising Diffusion Probabilistic Model for Generative Indoor Scene Synthesis

We present DiffuScene for indoor 3D scene synthesis based on a novel sce...
research
12/06/2021

Label-Efficient Semantic Segmentation with Diffusion Models

Denoising diffusion probabilistic models have recently received much res...
research
05/29/2023

Generating Driving Scenes with Diffusion

In this paper we describe a learned method of traffic scene generation d...
research
06/06/2023

Towards Visual Foundational Models of Physical Scenes

We describe a first step towards learning general-purpose visual represe...
research
07/20/2023

OBJECT 3DIT: Language-guided 3D-aware Image Editing

Existing image editing tools, while powerful, typically disregard the un...
research
11/19/2014

Visual Noise from Natural Scene Statistics Reveals Human Scene Category Representations

Our perceptions are guided both by the bottom-up information entering ou...

Please sign up or login with your details

Forgot password? Click here to reset