Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics

02/01/2022
by   Hangjie Yuan, et al.

Human-Object Interaction (HOI) detection is an essential task for understanding human-centric images from a fine-grained perspective. Although end-to-end HOI detection models thrive, their paradigm of parallel human/object detection and verb class prediction loses a merit of two-stage methods: the object-guided hierarchy. The object in an HOI triplet gives direct clues to the verb to be predicted. In this paper, we aim to boost end-to-end models with object-guided statistical priors. Specifically, we propose to utilize a Verb Semantic Model (VSM) and use semantic aggregation to profit from this object-guided hierarchy. A Similarity KL (SKL) loss is proposed to optimize the VSM so that it aligns with the HOI dataset's priors. To overcome the static semantic embedding problem, we propose to generate cross-modality-aware visual and semantic features by Cross-Modal Calibration (CMC). Combined, the above modules compose the Object-guided Cross-modal Calibration Network (OCN). Experiments conducted on two popular HOI detection benchmarks demonstrate the significance of incorporating statistical prior knowledge and yield state-of-the-art performance. More detailed analysis indicates that the proposed modules serve as a stronger verb predictor and a superior method of utilizing prior knowledge. The code is available at <https://github.com/JacobYuan7/OCN-HOI-Benchmark>.
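
To make the idea of an object-guided statistical prior more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the prior is the empirical distribution of verbs given an object, estimated from co-occurrence counts, and that a KL term pulls a similarity-based verb distribution toward that prior. The toy annotations, the two-dimensional embeddings, and every function name (`verb_prior_given_object`, `similarity_distribution`, `kl_divergence`) are hypothetical and chosen only for illustration; the actual VSM and SKL loss are defined in the full paper and the linked repository.

```python
import math
from collections import Counter, defaultdict

def verb_prior_given_object(pairs, verbs, eps=1e-6):
    """Estimate P(verb | object) from (verb, object) co-occurrence counts.

    eps is a small smoothing constant (an assumption of this sketch) so that
    unseen verb/object pairs keep a nonzero probability.
    """
    counts = defaultdict(Counter)
    for verb, obj in pairs:
        counts[obj][verb] += 1
    prior = {}
    for obj, c in counts.items():
        total = sum(c.values()) + eps * len(verbs)
        prior[obj] = [(c[v] + eps) / total for v in verbs]
    return prior

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy data: hypothetical annotations and embeddings, for illustration only.
VERBS = ["ride", "hold", "eat"]
PAIRS = [("ride", "bicycle"), ("ride", "bicycle"), ("hold", "bicycle"),
         ("eat", "apple"), ("hold", "apple")]
VERB_EMB = {"ride": [1.0, 0.0], "hold": [0.5, 0.5], "eat": [0.0, 1.0]}
OBJ_EMB = {"bicycle": [0.9, 0.1], "apple": [0.1, 0.9]}

def similarity_distribution(obj):
    """Verb distribution implied by embedding similarity to the object."""
    return softmax([cosine(VERB_EMB[v], OBJ_EMB[obj]) for v in VERBS])

prior = verb_prior_given_object(PAIRS, VERBS)

# Average KL between the dataset prior and the similarity-based distribution;
# minimizing a term like this would pull the embeddings toward the prior.
skl_like_loss = sum(
    kl_divergence(prior[o], similarity_distribution(o)) for o in prior
) / len(prior)
```

In this toy setup the prior for `bicycle` favors `ride` over `eat`, which is the kind of object-to-verb clue the abstract refers to. In a trainable model the embeddings would be learnable parameters and the loss would be computed with an automatic differentiation framework rather than plain Python.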


