While recent advancements in vision-language models have revolutionized
...
Video Question Answering (VideoQA) has been significantly advanced from ...
With the exponential growth of video data, there is an urgent need for
a...
Driven by large-data pre-training, Segment Anything Model (SAM) has been...
The foundation models have recently shown excellent performance on a var...
In this report, we present our champion solutions to five tracks at Ego4...
Capitalizing on large pre-trained models for various downstream tasks of...
Self-attention based models such as vision transformers (ViTs) have emer...
This technical report introduces our winning solution to the spatio-temp...
Localizing persons and recognizing their actions from videos is a challe...
This paper proposes the novel task of video generation conditioned on a
...
Imagining multiple consecutive frames given one single snapshot is
chall...
The goal of Online Action Detection (OAD) is to detect action in a timel...
We introduce SalGAN, a deep convolutional neural network for visual sali...
The prediction of salient areas in images has been traditionally address...
The prediction of saliency areas in images has been traditionally addres...