Causal Video Question Answering (CVidQA) queries not only association or...
Generalized few-shot object detection aims to achieve precise detection ...
We present a new paradigm for fine-tuning large-scale vision-language pre...
Understanding event relationships in videos requires a model to understa...
Knowledge distillation (KD) is essentially a process of transferring a t...
Given a long untrimmed video and natural language queries, video groundi...
Given an image and a reference caption, the image caption editing task a...
Semi-Supervised Learning (SSL) is fundamentally a missing label problem,...
Understanding how events described or shown in multimedia content relate...
Thanks to the large pre-trained vision-language models (VLMs) like CLIP,...
We address the overlooked unbiasedness in existing long-tailed classific...
Today's VidSGG models are all proposal-based methods, i.e., they first g...
Question answering (QA) models are well-known to exploit data bias, e.g....
Today's VQA models still tend to capture superficial linguistic correlat...
Deep neural network-based question answering (QA) models are neither rob...
Today's scene graph generation (SGG) task is still far from practical, m...
This paper is a winner report from team MReaL-BDAI for Visual Dialog Cha...
Video action recognition, which is topical in computer vision and video ...
We focus on grounding (i.e., localizing or linking) referring expression...
Visual dialog is a challenging vision-language task, which requires the ...
We focus on grounding (i.e., localizing or linking) referring expression...
Large-scale image annotation is a challenging task in image content anal...