Relational Language-Image Pre-training (RLIP) aims to align vision repre...
Foundation models are pre-trained on massive data and transferred to dow...
Many recent studies leverage the pre-trained CLIP for text-video cross-m...
The task of Human-Object Interaction (HOI) detection targets fine-graine...
Noisy data is prevalent in both the training and testin...
Temporal action localization aims to localize starting and ending time w...
The Weakly-Supervised Temporal Action Localization (WS-TAL) task aims to rec...
This technical report presents our solution for temporal action detectio...
This paper presents our solution to the AVA-Kinetics Crossover Challenge...
This technical report analyzes an egocentric video action detection meth...
With the recent surge in research on vision transformers, they have ...