E-CLIP: Towards Label-efficient Event-based Open-world Understanding by CLIP

08/06/2023
by   Jiazhou Zhou, et al.

Contrastive Language-Image Pre-training (CLIP) has recently shown promising open-world and few-shot performance on 2D image-based recognition tasks. However, CLIP's transferability to novel event camera data remains under-explored. In particular, due to the modality gap with image-text data and the lack of large-scale event datasets, achieving this goal is non-trivial and thus requires significant research innovation. In this paper, we propose E-CLIP, a novel and effective framework that unleashes the potential of CLIP for event-based recognition and compensates for the lack of large-scale event-based datasets. Our work addresses two crucial challenges: 1) how to generalize CLIP's visual encoder to event data while fully leveraging events' unique properties, e.g., sparsity and high temporal resolution; 2) how to effectively align the multi-modal embeddings, i.e., image, text, and events. To this end, we first introduce a novel event encoder that subtly models the temporal information from events and meanwhile generates event prompts to promote modality bridging. We then design a text encoder that generates content prompts and utilizes hybrid text prompts to enhance E-CLIP's generalization ability across diverse datasets. With the proposed event encoder, text encoder, and the original image encoder, a novel Hierarchical Triple Contrastive Alignment (HTCA) module is introduced to jointly optimize the correlation and enable efficient knowledge transfer among the three modalities. We conduct extensive experiments on two recognition benchmarks, and the results demonstrate that our E-CLIP outperforms existing methods by a large margin of +3.94% under the fine-tuning and few-shot settings. Moreover, our E-CLIP can be flexibly extended to the event retrieval task using either text or image queries, showing plausible performance.
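The abstract describes aligning three modalities (image, text, and events) with a Hierarchical Triple Contrastive Alignment module. Below is a minimal, hedged sketch of how such a triple alignment objective could look, using pairwise symmetric InfoNCE losses over pre-computed embeddings. This is not the paper's actual HTCA implementation; the function names, batch shapes, and temperature value are illustrative assumptions.

```python
# Minimal sketch of a triple contrastive alignment objective over paired
# image / event / text embeddings. NOT the paper's exact HTCA module;
# it only illustrates joint pairwise alignment of the three modalities.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of L2-normalized embeddings."""
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def triple_contrastive_loss(img_emb, evt_emb, txt_emb, temperature: float = 0.07):
    """Sum of the three pairwise losses; all inputs are (B, D) normalized embeddings."""
    return (info_nce(evt_emb, txt_emb, temperature) +     # event  <-> text
            info_nce(evt_emb, img_emb, temperature) +     # event  <-> image
            info_nce(img_emb, txt_emb, temperature))      # image  <-> text


if __name__ == "__main__":
    # Toy usage with random embeddings standing in for encoder outputs.
    B, D = 8, 512
    img = F.normalize(torch.randn(B, D), dim=-1)
    evt = F.normalize(torch.randn(B, D), dim=-1)
    txt = F.normalize(torch.randn(B, D), dim=-1)
    print(triple_contrastive_loss(img, evt, txt).item())
```

In practice the three embeddings would come from the image, event, and text encoders described above; summing the pairwise losses is one simple way to optimize all cross-modal correlations jointly.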


