Diffusion models have gained significant attention in the realm of image...
Recent knowledge enhanced pre-trained language models have shown remarka...
In this paper, we propose a large-scale language pre-training for text
G...
Multi-hop reasoning requires aggregating multiple documents to answer a
...
Multi-modal pre-training and knowledge discovery are two important resea...
In this paper, we propose a Multi-stage Vision-language Pre-TRaining (MV...
Matching model is essential for Image-Text Retrieval framework. Existing...
We study the problem of coarse-grained response selection in retrieval-b...
Existing research for image text retrieval mainly relies on sentence-lev...
Existing research for image captioning usually represents an image using...
Transformer is an attention-based neural network, which consists of two
...
In this paper, we focus on the problem of unsupervised image-sentence
ma...
Commonsense generation aims at generating plausible everyday scenario
de...