Large-scale image-text contrastive pre-training models, such as CLIP, ha...
Contrastive Masked Autoencoder (CMAE), as a new self-supervised framewor...
Masked image modeling (MIM) has achieved promising results on various vi...
It is imperative to democratize robotic process automation (RPA), as RPA...
We study joint learning of Convolutional Neural Network (CNN) and Transf...
We propose Pixel-BERT to align image pixels with text by deep multi-moda...
We propose to boost VQA by leveraging more powerful feature extractors b...