This paper introduces InternVid, a large-scale video-centric multimodal
...
We present an interactive visual framework named InternGPT, or iGPT for
...
We consider the problem of generating musical soundtracks in sync with
r...
Foundation models have recently shown excellent performance on a var...
In this report, we present our champion solutions to five tracks at Ego4...
Weakly-supervised audio-visual violence detection aims to distinguish
sn...
Although audio-visual representation has been proven to be applicable in...
Recognizing and localizing events in videos is a fundamental task for vi...
Audio-visual event localization aims to localize an event that is both
a...