HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models

03/28/2023
by Shan Ning, et al.

Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions. Recently, Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction priors for HOI detectors via knowledge distillation. However, such approaches often rely on large-scale training data and suffer from inferior performance in few/zero-shot scenarios. In this paper, we propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization. In detail, we first introduce a novel interaction decoder to extract informative regions in the visual feature map of CLIP via a cross-attention mechanism, which is then fused with the detection backbone by a knowledge integration block for more accurate human-object pair detection. In addition, prior knowledge in the CLIP text encoder is leveraged to generate a classifier by embedding HOI descriptions. To distinguish fine-grained interactions, we build a verb classifier from training data via visual semantic arithmetic and a lightweight verb representation adapter. Furthermore, we propose a training-free enhancement to exploit global HOI predictions from CLIP. Extensive experiments demonstrate that our method outperforms the state of the art by a large margin in various settings, e.g., +4.04 mAP on HICO-Det. The source code is available at https://github.com/Artanic30/HOICLIP.
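To illustrate the two classifier ideas mentioned in the abstract, the sketch below builds an HOI classifier from CLIP text embeddings of interaction descriptions and derives a verb prototype by visual semantic arithmetic. It is a minimal sketch, not the authors' exact implementation: the prompt template, the example label set, and the helper names (classify_interactions, verb_prototype) are assumptions for illustration.

```python
# Illustrative sketch (PyTorch + OpenAI CLIP): text-embedding HOI classifier and
# a verb prototype via visual semantic arithmetic. Prompts, labels, and helper
# names are assumptions, not the exact HOICLIP implementation.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Example HOI descriptions; the real label set comes from HICO-Det / V-COCO.
hoi_prompts = [
    "a photo of a person riding a bicycle",
    "a photo of a person holding an umbrella",
    "a photo of a person feeding a horse",
]

with torch.no_grad():
    tokens = clip.tokenize(hoi_prompts).to(device)
    hoi_weights = model.encode_text(tokens).float()               # (num_hoi, d)
    hoi_weights = hoi_weights / hoi_weights.norm(dim=-1, keepdim=True)

def classify_interactions(interaction_feats: torch.Tensor,
                          temperature: float = 0.01) -> torch.Tensor:
    """Score interaction-decoder outputs against the text-derived classifier.

    interaction_feats: (num_queries, d) features assumed to lie in CLIP's
    joint embedding space.
    """
    feats = interaction_feats / interaction_feats.norm(dim=-1, keepdim=True)
    return feats @ hoi_weights.t() / temperature                  # (num_queries, num_hoi)

def verb_prototype(hoi_region_feats: torch.Tensor,
                   object_region_feats: torch.Tensor) -> torch.Tensor:
    """Visual semantic arithmetic (illustrative): obtain a verb direction by
    subtracting object-region CLIP features from interaction-region features
    and averaging over training examples of that verb."""
    diff = (hoi_region_feats - object_region_feats).mean(dim=0)
    return diff / diff.norm()
```

Under this formulation, a new HOI category can be scored simply by embedding its description and appending the result to the classifier weights, which is one reason CLIP-derived classifiers are attractive in few/zero-shot settings.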

Related research

09/05/2022 · RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
03/26/2022 · GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection
08/21/2023 · Turning a CLIP Model into a Scene Text Spotter
03/03/2023 · Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners
09/20/2022 · DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection
09/10/2023 · Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels
11/20/2018 · Transferable Interactiveness Prior for Human-Object Interaction Detection
