Generalist models, which are capable of performing diverse multi-modal t...
The tremendous success of CLIP (Radford et al., 2021) has promoted the r...
In this work, we pursue a unified paradigm for multimodal pretraining to...
Mixture-of-Experts (MoE) models can achieve promising results with outra...
Conditional image synthesis aims to create an image according to some mu...
Despite the achievements of large-scale multimodal pre-training approach...
In this work, we construct the largest dataset for multimodal pretrainin...
In this paper, we propose the TBCNN-pair model to recognize entailment a...