Vision and text have been fully explored in contemporary video-text
foun...
Building general-purpose models that can perceive diverse real-world
mod...
Multimodal representation learning has shown promising improvements on
v...
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for
cross...