Online resources such as WikiHow compile a wide range of scripts for
per...
Causal Video Question Answering (CVidQA) queries not only association or...
Vision Transformers (ViTs) emerge to achieve impressive performance on m...
Understanding event relationships in videos requires a model to understa...
Video event extraction aims to detect salient events from a video and
id...
Given a long untrimmed video and natural language queries, video groundi...
Recent advances in pre-training vision-language models like CLIP have sh...
Understanding how events described or shown in multimedia content relate...
Multi-channel video-language retrieval require models to understand
info...
The goal of this work is to build flexible video-language models that ca...
In this paper we consider the problem of classifying fine-grained, multi...
Vision-language (V+L) pretraining models have achieved great success in
...
Recently, there has been an increasing interest in building question
ans...
Visual and textual modalities contribute complementary information about...
In this paper, we address the problem of referring expression comprehens...
We present Vx2Text, a framework for text generation from multimodal
inpu...
Two-stream networks have achieved great success in video recognition. A
...
Two-stream networks have achieved great success in video recognition. A
...
Recently, Weakly-supervised Temporal Action Localization (WTAL) has been...
As the basic building block of Convolutional Neural Networks (CNNs), the...
We propose an unsupervised hashing method which aims to produce binary c...
Motion has shown to be useful for video understanding, where motion is
t...