We propose a novel multimodal video benchmark - the Perception Test - to...
Large-scale visual language models are widely used as pre-trained models...
In this work, we introduce Vid2Seq, a multi-modal single-stage dense eve...
Attention-based models are appealing for multimodal processing because i...
We aim to learn to temporally localize object state changes and the corr...
Video question answering (VideoQA) is a complex task that requires diver...
Recent methods for visual question answering rely on large-scale annotat...
Building models that can be rapidly adapted to numerous tasks using only...
We consider the problem of localizing a spatio-temporal tube in a video ...
Human actions often induce changes of object states such as "cutting an ...
Our objective is language-based search of large-scale image and video da...
Modern approaches to visual question answering require large annotated d...
This paper introduces a manually annotated video dataset of unusual acti...
Annotating videos is cumbersome, expensive and not scalable. Yet, many s...
Learning text-video embeddings usually requires a dataset of video clips...
Joint understanding of video and language is an active research area wit...
Discriminative clustering has been successfully applied to a number of w...
Common video representations often deploy an average or maximum pooling ...