Deep learning has achieved great success in video recognition, yet still...
We introduce a novel visual question answering (VQA) task in the context...
Food image-to-recipe aims to learn an embedded space linking the rich se...
Online media data, in the forms of images and videos, are becoming mains...
Deep transfer learning has been widely used for knowledge transmission i...
Semi-supervised action recognition is a challenging but critical task du...
Given sufficient training data on the source domain, cross-domain few-sh...
Recently, Cross-Domain Few-Shot Learning (CD-FSL) which aims at addressi...
Current video generation models usually convert signals indicating appea...
Video question answering (VideoQA) is an essential task in vision-langua...
Although leveraging the transferability of adversarial examples can ...
Fusing LiDAR and camera information is essential for achieving accurate ...
Multi-modal pre-training and knowledge discovery are two important resea...
Cross-modal recipe retrieval has attracted research attention in recent ...
Video moment retrieval aims at finding the start and end timestamps of a...
Recent advances in image editing techniques have posed serious challenge...
Previous few-shot learning (FSL) works are mostly limited to natural ima...
3D dense captioning is a recently-proposed novel task, where point cloud...
Recently, one-stage visual grounders have attracted considerable attention due to the c...
Most existing vision-language pre-training methods focus on understandin...
Recent studies have shown that adversarial examples hand-crafted on one ...
Recent research has demonstrated that Deep Neural Networks (DNNs) are vu...
The task of cross-modal retrieval between texts and videos aims to under...
Although deep-learning-based video recognition models have achieved rema...
Referring Image Segmentation (RIS) aims at segmenting the target object ...
Given a text description, Temporal Language Grounding (TLG) aims to loca...
Vision transformers (ViTs) have demonstrated impressive performance on a...
Controllable person image generation aims to produce realistic human ima...
Label distributions in the real world are often long-tailed and imbalan...
The widespread dissemination of forged images generated by Deepfake tech...
In recent years, the abuse of a face swap technique called deepfake Deep...
Automatic colorectal polyp detection in colonoscopy video is a fundament...
Understanding food recipes requires anticipating the implicit causal effe...
In this paper we address the problem of unsupervised gaze correction in ...
The dominant speech separation models are based on complex recurrent or ...
Gaze redirection aims at manipulating a given eye gaze to a desirable di...
Deep neural networks (DNNs) are vulnerable to backdoor attacks which can...
We study the problem of attacking video recognition models in the black-...
Gaze correction aims to redirect the person's gaze into the camera by ma...