Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels

09/10/2023
by Bo Wan, et al.

In this paper, we investigate the task of zero-shot human-object interaction (HOI) detection, a novel paradigm for identifying HOIs without the need for task-specific annotations. To address this challenging task, we employ CLIP, a large-scale pre-trained vision-language model (VLM), for knowledge distillation at multiple levels. Specifically, we design a multi-branch neural network that leverages CLIP to learn HOI representations at three levels: global images, local union regions encompassing human-object pairs, and individual instances of humans or objects. To train our model, we use CLIP to generate HOI scores for both global images and local union regions, and these scores serve as supervision signals. Extensive experiments demonstrate the effectiveness of our multi-level CLIP knowledge integration strategy. Notably, the model achieves strong performance, comparable even to that of some fully-supervised and weakly-supervised methods on the public HICO-DET benchmark.
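
As a rough illustration of the supervision idea in the abstract, the sketch below builds a union region from a human box and an object box, turns similarities between a visual embedding and HOI text-prompt embeddings into soft scores, and compares a student prediction against those scores. The abstract does not say how the scores or the loss are computed, so the temperature-scaled softmax, the cross-entropy loss, and all function names (`union_box`, `hoi_scores`, `distill_loss`) are assumptions made for illustration, not the authors' method. The embeddings are toy vectors standing in for CLIP image and text encoder outputs.

```python
import math


def union_box(human_box, object_box):
    """Smallest (x1, y1, x2, y2) box enclosing both the human and the object box."""
    hx1, hy1, hx2, hy2 = human_box
    ox1, oy1, ox2, oy2 = object_box
    return (min(hx1, ox1), min(hy1, oy1), max(hx2, ox2), max(hy2, oy2))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hoi_scores(visual_emb, text_embs, temperature=0.01):
    """Soft HOI scores: softmax over temperature-scaled cosine similarities.

    Assumption: visual_emb stands in for a CLIP embedding of a whole image or
    a cropped union region, and text_embs for embeddings of HOI prompts.
    """
    logits = [cosine(visual_emb, t) / temperature for t in text_embs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(teacher_scores, student_scores, eps=1e-12):
    """Cross-entropy of student probabilities against teacher soft targets (assumed loss)."""
    return -sum(t * math.log(s + eps) for t, s in zip(teacher_scores, student_scores))


# Toy usage: two hypothetical HOI prompts, the first aligned with the region embedding.
region = union_box((10, 20, 50, 80), (40, 10, 90, 60))
teacher = hoi_scores([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
loss = distill_loss(teacher, [0.9, 0.1])
```

In a real pipeline the union region would be cropped from the image and encoded before scoring; the toy vectors above skip that step.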

research
03/20/2022

Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation

Open-vocabulary object detection aims to detect novel object categories ...
research
04/28/2021

Zero-Shot Detection via Vision and Language Knowledge Distillation

Zero-shot image classification has made promising progress by training t...
research
05/24/2023

Large Language Model Distillation Doesn't Need a Teacher

Knowledge distillation trains a smaller student model to match the outpu...
research
08/26/2022

Disentangle and Remerge: Interventional Knowledge Distillation for Few-Shot Object Detection from A Conditional Causal Perspective

Few-shot learning models learn representations with limited human annota...
research
03/28/2023

HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models

Human-Object Interaction (HOI) detection aims to localize human-object p...
research
03/21/2023

Efficient Feature Distillation for Zero-shot Detection

The large-scale vision-language models (e.g., CLIP) are leveraged by dif...
research
03/02/2023

Weakly-supervised HOI Detection via Prior-guided Bi-level Representation Learning

Human object interaction (HOI) detection plays a crucial role in human-c...
