Recent developments in text-conditioned image generative models have
rev...
Conventional detectors suffer from performance degradation when dealing ...
While large-scale pre-trained text-to-image models can synthesize divers...
Weakly-supervised temporal action localization aims to localize and reco...
The answering quality of an aligned large language model (LLM) can be
dr...
Existing autoregressive models follow the two-stage generation paradigm ...
Existing vector quantization (VQ) based autoregressive models follow a
t...
Vision model have gained increasing attention due to their simplicity an...
In-Context Learning (ICL), which formulates target tasks as prompt compl...
Scene text editing (STE) aims to replace text with the desired one while...
Deep metric learning aims to learn an embedding space, where semanticall...
Scene text spotting is of great importance to the computer vision commun...
Chinese spelling check (CSC) is a fundamental NLP task that detects and
...
There still remains an extreme performance gap between Vision Transforme...
Text-to-image generation aims at generating realistic images which are
s...
Human Video Motion Transfer (HVMT) aims to, given an image of a source
p...
Recognizing human actions from point cloud videos has attracted tremendo...
Knowledge Graphs (KGs) are becoming increasingly essential infrastructur...
Rumor detection has become an emerging and active research field in rece...
Action recognition from videos, i.e., classifying a video into one of th...
In this paper, we abandon the dominant complex language model and rethin...
Real-world recommender system needs to be regularly retrained to keep wi...
Existing scene text removal methods mainly train an elaborate network wi...
Cross-modal correlation provides an inherent supervision for video
unsup...
Recommender system usually faces popularity bias issues: from the data
p...
Linguistic knowledge is of great benefit to scene text recognition. Howe...
Person Re-identification (ReID) has achieved significant improvement due...
Most Visual Question Answering (VQA) models suffer from the language pri...
The reliability of current virtual reality (VR) delivery is low due to t...
Recent studies on Graph Convolutional Networks (GCNs) reveal that the in...
The depth images denoising are increasingly becoming the hot research to...
Transductive Zero-shot learning (ZSL) targets to recognize the unseen
ca...
Practical recommender systems need be periodically retrained to refresh ...
Scene text detection has witnessed rapid development in recent years.
Ho...
Image-text matching has received growing interest since it bridges visio...
Bilinear pooling achieves great success in fine-grained visual recogniti...
Recent methods focus on learning a unified semantic-aligned visual
repre...
Graph Neural Network (GNN) is a powerful model to learn representations ...
Graph Neural Network (GNN) is a powerful model to learn representations ...
Graph Convolution Network (GCN) has become new state-of-the-art for
coll...
The latest advance in recommendation shows that better user and item
rep...
Unpaired image-to-image translation problem aims to model the mapping fr...
Knowledge graph embedding, which aims to represent entities and relation...
Learning semantic correspondence between image and text is significant a...
Convolutional Neural Networks (CNN) have been regarded as a capable clas...
Zero-Shot Learning (ZSL) seeks to recognize a sample from either seen or...
Scene parsing is challenging as it aims to assign one of the semantic
ca...
Crowd counting has been widely studied by computer vision community in r...
With the maturity of visual detection techniques, we are more ambitious ...
Existing item-based collaborative filtering (ICF) methods leverage only ...