This paper introduces InternVid, a large-scale video-centric multimodal ...
In this study, we initiate an exploration into video understanding by in...
Video Foundation Models (VFMs) have received limited exploration due to ...
Foundation models have recently shown excellent performance on a var...
Learning discriminative spatiotemporal representation is the key problem...
In this report, we present our champion solutions to five tracks at Ego4...
Human activity understanding is of widespread interest in artificial int...
Convolutional video models have an order of magnitude larger computation...
Human attention mechanisms often work in a top-down manner, yet it is no...
Human-Object Interaction (HOI) consists of human, object and implicit in...
Multi-object tracking is a fundamental vision problem that has been stud...