Referring image segmentation aims to segment an object mentioned in natu...
Due to the limited scale and quality of video-text training corpora, most...
Vision and text have been fully explored in contemporary video-text foun...
Building general-purpose models that can perceive diverse real-world mod...
Large-scale image-text contrastive pre-training models, such as CLIP, ha...
Large pre-trained multimodal models have demonstrated significant succes...
In this paper, we propose a Vision-Audio-Language Omni-peRception pretra...
As a combination of visual and audio signals, video is inherently multi-...
The new generation of 4D high-resolution imaging radar provides not only...
In this paper, we consider the image captioning task from a new sequence...
Depth information matters in the RGB-D semantic segmentation task for provid...