Egocentric Scene Understanding via Multimodal Spatial Rectifier

07/14/2022
by Tien Do, et al.

In this paper, we study the problem of egocentric scene understanding, i.e., predicting depths and surface normals from an egocentric image. Egocentric scene understanding poses unprecedented challenges: (1) due to large head movements, the images are taken from non-canonical viewpoints (i.e., tilted images) where existing models of geometry prediction do not apply; (2) dynamic foreground objects, including hands, constitute a large proportion of visual scenes. These challenges limit the performance of existing models learned from large indoor datasets, such as ScanNet and NYUv2, which comprise predominantly upright images of static scenes. We present a multimodal spatial rectifier that stabilizes the egocentric images to a set of reference directions, which allows learning a coherent visual representation. Unlike a unimodal spatial rectifier, which often produces excessive perspective warp for egocentric images, the multimodal spatial rectifier learns from multiple directions, which can minimize the impact of the perspective warp. To learn visual representations of the dynamic foreground objects, we present a new dataset called EDINA (Egocentric Depth on everyday INdoor Activities) that comprises more than 500K synchronized RGBD frames and gravity directions. Equipped with the multimodal spatial rectifier and the EDINA dataset, our proposed method for single-view depth and surface normal estimation significantly outperforms the baselines not only on our EDINA dataset, but also on other popular egocentric datasets, such as First Person Hand Action (FPHA) and EPIC-KITCHENS.
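The geometric idea behind stabilizing a tilted image to one of several reference directions can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the gravity direction and the camera intrinsics `K` are already known, picks the reference direction with the smallest angle to gravity, and builds the pure-rotation homography `K R K^{-1}` that would warp the image accordingly. The function names, the example intrinsics, and the two reference directions are all hypothetical; in the paper the rectifier is learned.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def rotation_between(a, b):
    """Smallest rotation R such that R @ unit(a) == unit(b) (Rodrigues formula)."""
    a, b = unit(a), unit(b)
    v = np.cross(a, b)          # rotation axis scaled by sin(angle)
    c = float(np.dot(a, b))     # cos(angle)
    s2 = float(np.dot(v, v))    # sin(angle) squared
    if s2 < 1e-16:
        if c > 0:
            return np.eye(3)    # already aligned
        # Antiparallel: rotate 180 degrees about any axis perpendicular to a.
        helper = np.array([1.0, 0.0, 0.0]) if abs(a[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
        n = unit(np.cross(a, helper))
        return 2.0 * np.outer(n, n) - np.eye(3)
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx * ((1.0 - c) / s2)

def nearest_reference(gravity, references):
    """Index of the reference direction with the smallest angle to gravity."""
    g = unit(gravity)
    refs = np.array([unit(r) for r in references])
    return int(np.argmax(refs @ g))

def rectifying_homography(K, gravity, reference):
    """Homography for a pure camera rotation that maps gravity onto reference."""
    R = rotation_between(gravity, reference)
    return K @ R @ np.linalg.inv(K)

# Hypothetical intrinsics, gravity reading, and reference directions (camera frame).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
gravity = np.array([0.3, 0.9, 0.3])
references = [np.array([0.0, 1.0, 0.0]),   # upright view
              np.array([0.0, 0.0, 1.0])]   # looking straight down

idx = nearest_reference(gravity, references)
H = rectifying_homography(K, gravity, references[idx])
```

With a single reference direction, a strongly tilted view needs a large rotation and therefore a strong perspective warp; choosing the nearest of several references keeps the rotation angle, and hence the warp, smaller. The resulting `H` could be passed to an image-warping routine to produce the rectified image.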


