Recent advances in text-to-image generation have witnessed the rise of d...
In this paper, we propose a novel deep architecture tailored for 3D poin...
Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone ...
Prior works have proposed several strategies to reduce the computational...
Comprehending the rich semantics in an image and ordering them in lingui...
This paper presents an overview and comparative analysis of our systems ...
Vision-language pre-training has been an emerging and fast-developing re...
BERT-type structures have revolutionized vision-language pre-tra...
With the rise and development of deep learning over the past decade, the...
The Transformer with self-attention has revolutionized natura...
Despite having impressive vision-language (VL) pretraining with BERT-bas...
In this work, we present Auto-captions on GIF, which is a new large-scal...
Unsupervised domain adaptation has received significant attention in rec...
Recent progress on fine-grained visual recognition and visual question a...
This notebook paper presents an overview and comparative analysis of our...
It is widely believed that parsing an image into constituent visual...
The problem of distance metric learning is mostly considered from the pe...
This notebook paper presents an overview and comparative analysis of our...
It is widely believed that video captioning is a fundamental but challengi...
Image captioning has received significant attention with remarkable impr...
In this paper, we introduce a new idea for unsupervised domain adaptatio...
It is widely believed that modeling relationships between objects w...
Automatically describing a video with natural language is regarded as a ...
Image captioning often requires a large set of training image-sentence p...
Automatically describing an image in natural language has been an em...