RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection

09/05/2022
by Hangjie Yuan, et al.

The task of Human-Object Interaction (HOI) detection targets fine-grained visual parsing of humans interacting with their environment, enabling a broad range of applications. Prior work has demonstrated the benefits of effective architecture design and integration of relevant cues for more accurate HOI detection. However, the design of an appropriate pre-training strategy for this task remains underexplored by existing approaches. To address this gap, we propose Relational Language-Image Pre-training (RLIP), a strategy for contrastive pre-training that leverages both entity and relation descriptions. To make effective use of such pre-training, we make three technical contributions: (1) a new Parallel entity detection and Sequential relation inference (ParSe) architecture that enables the use of both entity and relation descriptions during holistically optimized pre-training; (2) a synthetic data generation framework, Label Sequence Extension, that expands the scale of language data available within each minibatch; (3) mechanisms to account for ambiguity, Relation Quality Labels and Relation Pseudo-Labels, to mitigate the influence of ambiguous/noisy samples in the pre-training data. Through extensive experiments, we demonstrate the benefits of these contributions, collectively termed RLIP-ParSe, for improved zero-shot, few-shot and fine-tuning HOI detection performance as well as increased robustness to learning from noisy annotations. Code will be available at <https://github.com/JacobYuan7/RLIP>.
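The idea behind Label Sequence Extension, expanding the language data available within each minibatch, can be illustrated with a small sketch. The abstract does not specify the sampling procedure, so everything here is an assumption made for illustration: the function name `extend_label_sequence`, the `target_len` parameter, uniform sampling of out-of-batch labels, and the toy relation vocabulary are hypothetical and do not come from the released code.

```python
import random


def extend_label_sequence(in_batch_labels, vocabulary, target_len, seed=None):
    """Pad the in-batch label set with sampled out-of-batch labels.

    Assumption: extra labels are drawn uniformly without replacement from
    the part of the vocabulary that no annotation in the minibatch uses.
    """
    rng = random.Random(seed)
    # Keep in-batch labels first, de-duplicated in first-seen order.
    sequence = list(dict.fromkeys(in_batch_labels))
    present = set(sequence)
    # Candidates are vocabulary labels absent from this minibatch.
    candidates = [lab for lab in dict.fromkeys(vocabulary) if lab not in present]
    # Never sample more labels than are available or needed.
    n_extra = max(0, min(target_len - len(sequence), len(candidates)))
    sequence.extend(rng.sample(candidates, n_extra))
    return sequence


# Toy relation vocabulary (hypothetical, for illustration only).
vocab = ["ride", "hold", "kick", "carry", "feed", "wash"]
relations_in_batch = ["ride", "hold", "ride"]
extended = extend_label_sequence(relations_in_batch, vocab, target_len=5, seed=0)
print(extended)
```

In a contrastive setup, the sampled labels would act as additional negative text candidates for each visual prediction, so a longer sequence gives each minibatch more language supervision. The same routine could be applied separately to entity labels and relation labels.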
