CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images

08/23/2023
by Sookwan Han, et al.

We present a method for teaching machines to understand and model the underlying spatial common sense of diverse human-object interactions in 3D in a self-supervised way. This is a challenging task: there exist specific manifolds of interactions that can be considered human-like and natural, yet human pose and object geometry can vary widely even for similar interactions. Such diversity makes annotating 3D interactions difficult and hard to scale, which limits the potential for supervised reasoning. One way of learning the 3D spatial relationship between humans and objects during interaction is to observe multiple 2D images, captured from different viewpoints, of humans interacting with the same type of object. The core idea of our method is to leverage a generative model that produces high-quality 2D images from arbitrary text prompts as an "unbounded" data generator with effective controllability and view diversity. Although the synthesized images are imperfect compared to real images, we demonstrate that they are sufficient to learn 3D human-object spatial relations. We present multiple strategies for leveraging the synthesized images, including (1) the first method to leverage a generative image model for 3D human-object spatial relation learning; (2) a framework to reason about 3D spatial relations from inconsistent 2D cues in a self-supervised manner via 3D occupancy reasoning with pose canonicalization; (3) semantic clustering to disambiguate different types of interactions with the same object type; and (4) a novel metric to assess the quality of 3D spatial learning of interaction. Project Page: https://jellyheadandrew.github.io/projects/chorus
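To make the core idea of occupancy reasoning with pose canonicalization concrete, the sketch below is a minimal, hypothetical illustration (not the authors' implementation): object points observed in different camera frames are mapped into a shared human-canonical frame using each view's estimated human root transform, then voted into a voxel occupancy grid. All function names and parameters here are illustrative assumptions.

```python
import numpy as np

def canonicalize(points, R_human, t_human):
    """Map 3D points from the camera frame into the human-canonical frame
    by inverting the human root rotation/translation (a hypothetical
    simplification of pose canonicalization)."""
    return (points - t_human) @ R_human

def accumulate_occupancy(point_sets, grid_size=16, extent=2.0):
    """Vote canonicalized object points into a voxel grid and return the
    per-voxel occupancy probability across views."""
    grid = np.zeros((grid_size,) * 3)
    for pts in point_sets:
        # Map points in [-extent, extent]^3 to voxel indices.
        idx = np.floor((pts + extent) / (2 * extent) * grid_size).astype(int)
        idx = idx[((idx >= 0) & (idx < grid_size)).all(axis=1)]
        view_grid = np.zeros_like(grid)
        view_grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0  # binary vote per view
        grid += view_grid
    return grid / len(point_sets)

# Toy example: the same object point, rendered in two "views" under
# different human root transforms, should land in the same canonical voxel.
obj_canonical = np.array([[0.5, 0.0, 0.5]])
rng = np.random.default_rng(0)
views = []
for _ in range(2):
    theta = rng.uniform(0, 2 * np.pi)
    R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
    t = rng.uniform(-1, 1, size=3)
    observed = obj_canonical @ R.T + t   # point in this view's camera frame
    views.append(canonicalize(observed, R, t))

occupancy = accumulate_occupancy(views)
print(occupancy.max())  # → 1.0: both views agree in the canonical frame
```

The aggregation step is what lets inconsistent 2D cues average out: each synthesized view contributes one vote, and only spatial relations that are consistent across many canonicalized views retain high occupancy probability.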


research
08/19/2021

D3D-HOI: Dynamic 3D Human-Object Interactions from Videos

We introduce D3D-HOI: a dataset of monocular videos with ground truth an...
research
09/06/2022

Reconstructing Action-Conditioned Human-Object Interactions Using Commonsense Knowledge Priors

We present a method for inferring diverse 3D models of human-object inte...
research
05/27/2023

Self-Supervised Learning of Action Affordances as Interaction Modes

When humans perform a task with an articulated object, they interact wit...
research
06/03/2019

Grounded Human-Object Interaction Hotspots from Video (Extended Abstract)

Learning how to interact with objects is an important step towards embod...
research
07/30/2020

Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild

We present a method that infers spatial arrangements and shapes of human...
research
06/01/2023

Affinity-based Attention in Self-supervised Transformers Predicts Dynamics of Object Grouping in Humans

The spreading of attention has been proposed as a mechanism for how huma...
research
11/29/2017

Structured learning and detailed interpretation of minimal object images

We model the process of human full interpretation of object images, name...
