Reconstructing and grounding narrated instructional videos in 3D

09/09/2021
by Dimitri Zhukov, et al.

Narrated instructional videos often show and describe manipulations of similar objects, e.g., repairing a particular model of a car or laptop. In this work, we aim to reconstruct such objects and to localize the associated narrations in 3D. In contrast to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may vary widely in appearance because of differing conditions and versions of the same product. Narrations may also vary widely in their natural language expressions. We address these challenges with three contributions. First, we propose an approach for correspondence estimation that combines learnt local features and dense flow. Second, we design a two-step divide-and-conquer reconstruction approach in which the initial 3D reconstructions of individual videos are combined into a 3D alignment graph. Finally, we propose an unsupervised approach to ground natural language in the obtained 3D reconstructions. We demonstrate the effectiveness of our approach in the domain of car maintenance. Given raw instructional videos and no manual supervision, our method successfully reconstructs engines of different car models and associates textual descriptions with the corresponding objects in 3D.
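
The idea of combining per-video reconstructions through an alignment graph can be illustrated with a small sketch. The abstract does not specify how the graph is built or used, so everything below is an assumption made for illustration: nodes are per-video reconstructions, each edge stores a rigid 4x4 transform between two reconstructions, and a breadth-first traversal chains those transforms to bring every reconstruction into one reference frame. The names `align_to_reference`, `invert_rigid`, and `apply_transform` are hypothetical and are not taken from the paper.

```python
from collections import deque

IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(t):
    """Invert a rigid transform [R | t] (assumption: no scale component)."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]  # R transposed
    trans = [t[i][3] for i in range(3)]
    new_t = [-sum(rt[i][k] * trans[k] for k in range(3)) for i in range(3)]
    return [rt[0] + [new_t[0]],
            rt[1] + [new_t[1]],
            rt[2] + [new_t[2]],
            [0.0, 0.0, 0.0, 1.0]]

def apply_transform(t, point):
    """Apply a 4x4 transform to a 3D point given as (x, y, z)."""
    x, y, z = point
    return tuple(t[i][0] * x + t[i][1] * y + t[i][2] * z + t[i][3]
                 for i in range(3))

def align_to_reference(edges, reference):
    """Chain pairwise transforms so every node maps into the reference frame.

    edges[(a, b)] is a transform taking points in frame b into frame a.
    Returns {node: transform from that node's frame into the reference frame}.
    Nodes not connected to the reference are left out of the result.
    """
    neighbours = {}
    for (a, b), t in edges.items():
        neighbours.setdefault(a, []).append((b, t))                # b -> a
        neighbours.setdefault(b, []).append((a, invert_rigid(t)))  # a -> b
    to_ref = {reference: IDENTITY}
    queue = deque([reference])
    while queue:
        node = queue.popleft()
        for other, other_to_node in neighbours.get(node, []):
            if other not in to_ref:
                to_ref[other] = matmul(to_ref[node], other_to_node)
                queue.append(other)
    return to_ref

def translation(tx, ty, tz):
    """Build a pure-translation transform (used only for the toy example)."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

# Toy graph: three hypothetical videos with two pairwise alignments.
edges = {("A", "B"): translation(1.0, 0.0, 0.0),
         ("C", "B"): translation(0.0, 2.0, 0.0)}
poses = align_to_reference(edges, "A")
origin_of_c_in_a = apply_transform(poses["C"], (0.0, 0.0, 0.0))
```

In this toy graph, video C has no direct alignment to the reference A, yet it is still placed in A's frame by going through B. A real pipeline would likely need similarity transforms (reconstructions from video are defined only up to scale) and some handling of inconsistent or unreliable edges, neither of which this sketch attempts.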

research
05/24/2021

Reconstructing Small 3D Objects in front of a Textured Background

We present a technique for a complete 3D reconstruction of small objects...
research
09/29/2022

EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual and Language Learning

3D visual grounding aims to find the objects within point clouds mention...
research
05/30/2019

Grounding Language Attributes to Objects using Bayesian Eigenobjects

We develop a system to disambiguate objects based on simple physical des...
research
03/25/2023

SUDS: Scalable Urban Dynamic Scenes

We extend neural radiance fields (NeRFs) to dynamic large-scale urban sc...
research
06/06/2023

Learning to Ground Instructional Articles in Videos through Narrations

In this paper we present an approach for localizing steps of procedural ...
research
12/15/2014

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Solving the visual symbol grounding problem has long been a goal of arti...
research
06/14/2023

Toward Grounded Social Reasoning

Consider a robot tasked with tidying a desk with a meticulously construc...
