Affordance Grounding from Demonstration Video to Target Image

03/26/2023
by Joya Chen, et al.

Humans excel at learning from expert demonstrations and solving their own problems. To equip intelligent robots and assistants, such as AR glasses, with this ability, it is essential to ground human hand interactions (i.e., affordances) from demonstration videos and apply them to a target image, such as a user's AR glasses view. This video-to-image affordance grounding task is challenging because (1) it requires predicting fine-grained affordances, and (2) the limited training data inadequately covers video-image discrepancies, which hurts grounding. To address these challenges, we propose the Affordance Transformer (Afformer), whose fine-grained transformer-based decoder gradually refines affordance grounding. Moreover, we introduce Mask Affordance Hand (MaskAHand), a self-supervised pre-training technique that synthesizes video-image data and simulates context changes, improving affordance grounding across video-image discrepancies. Afformer with MaskAHand pre-training achieves state-of-the-art performance on multiple benchmarks, including a substantial 37% improvement. Code is available at https://github.com/showlab/afformer.
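The coarse-to-fine decoding idea can be sketched in a few lines. The following is a hypothetical illustration, not the authors' implementation: all class names, dimensions, and the choice of cross-attention at each scale are assumptions. The sketch refines a coarse target-image feature grid over progressively finer resolutions, letting heatmap tokens attend to demonstration-video tokens at every scale before the final per-pixel affordance prediction.

```python
# Illustrative sketch of a coarse-to-fine affordance decoder (NOT the official
# Afformer code). Layer choices and sizes are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoarseToFineDecoder(nn.Module):
    """Gradually refines an affordance heatmap over finer and finer grids.

    At each scale, the target-image tokens cross-attend to pooled
    demonstration-video tokens, then the feature map is upsampled
    for the next, finer scale.
    """

    def __init__(self, dim: int = 64, heads: int = 4, scales=(8, 16, 32)):
        super().__init__()
        self.scales = scales
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in scales
        )
        # 1x1 conv produces a single affordance logit per pixel.
        self.head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, image_feat: torch.Tensor, video_feat: torch.Tensor):
        # image_feat: (B, dim, H0, W0) coarse target-image features
        # video_feat: (B, T, dim)      demonstration-video tokens
        x = image_feat
        for scale, attn in zip(self.scales, self.attn):
            x = F.interpolate(x, size=(scale, scale),
                              mode="bilinear", align_corners=False)
            b, c, h, w = x.shape
            tokens = x.flatten(2).transpose(1, 2)        # (B, H*W, dim)
            refined, _ = attn(tokens, video_feat, video_feat)
            x = (tokens + refined).transpose(1, 2).reshape(b, c, h, w)
        return torch.sigmoid(self.head(x))               # (B, 1, 32, 32)


decoder = CoarseToFineDecoder()
image_feat = torch.randn(2, 64, 8, 8)   # coarse image features
video_feat = torch.randn(2, 10, 64)     # 10 pooled video tokens
heatmap = decoder(image_feat, video_feat)
```

Decoding in several small steps (8 → 16 → 32 here) rather than one large upsampling is what "gradually refines" suggests; the real model's scales, attention design, and feature extractors would differ.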


