Pix2Shape – Towards Unsupervised Learning of 3D Scenes from Images using a View-based Representation

by Sai Rajeswar, et al.

We infer and generate three-dimensional (3D) scene information from a single input image, without supervision. This problem is under-explored, with most prior work relying on supervision from, e.g., 3D ground truth, multiple images of a scene, image silhouettes, or key-points. We propose Pix2Shape, an approach that solves this problem with four components: (i) an encoder that infers a latent 3D representation from an image; (ii) a decoder that generates an explicit 2.5D, surfel-based reconstruction of the scene from the latent code; (iii) a differentiable renderer that synthesizes a 2D image from the surfel representation; and (iv) a critic network trained to discriminate between images generated by the decoder-renderer and images from the training distribution. Pix2Shape can generate complex 3D scenes whose detail scales with the view-dependent on-screen resolution, unlike representations with fixed world-space resolution such as voxels or meshes. We show that Pix2Shape learns a consistent scene representation in its encoded latent space, and that the decoder can be applied to this latent representation to synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with experiments on the ShapeNet dataset as well as on 3D-IQTT, a novel benchmark we developed to evaluate models by their ability to perform 3D spatial reasoning. Qualitative and quantitative evaluations demonstrate Pix2Shape's ability to solve scene reconstruction, generation, and understanding tasks.
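To make the 2.5D surfel representation concrete, the sketch below models a surfel as a 3D point with a surface normal and albedo, pinhole-projects it to a pixel, and applies Lambertian shading. This is a minimal illustration of the general idea only; the `Surfel` fields, `project`, and `shade` functions are illustrative assumptions, not the paper's differentiable renderer.

```python
from dataclasses import dataclass

@dataclass
class Surfel:
    """A surface element (surfel): one view-dependent scene sample."""
    position: tuple  # (x, y, z) in camera space, with z > 0 in front of the camera
    normal: tuple    # unit surface normal
    albedo: float    # diffuse reflectance in [0, 1]

def project(surfel, focal=1.0, width=64, height=64):
    """Pinhole-project a surfel's 3D position onto integer pixel coordinates."""
    x, y, z = surfel.position
    u = int(width / 2 + focal * (x / z) * width / 2)
    v = int(height / 2 + focal * (y / z) * height / 2)
    return u, v

def shade(surfel, light=(0.0, 0.0, -1.0)):
    """Lambertian shading: albedo * max(0, n . l), with l pointing toward the light."""
    dot = sum(n * l for n, l in zip(surfel.normal, light))
    return surfel.albedo * max(0.0, dot)

# A camera-facing surfel 2 units in front of the camera, lit from the camera:
s = Surfel(position=(0.0, 0.0, 2.0), normal=(0.0, 0.0, -1.0), albedo=0.8)
print(project(s))  # center pixel of a 64x64 image: (32, 32)
print(shade(s))    # fully lit: 0.8
```

Because each surfel is generated per output pixel, the representation's detail grows with on-screen resolution rather than with a fixed world-space grid, which is the property the abstract contrasts against voxels and meshes.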



