Audio-visual learning has been a major pillar of multi-modal machine lea...
Text-to-Image diffusion models have made tremendous progress over the pa...
Diffusion models (DMs) have shown great potential for high-quality image...
We present GLIPv2, a grounded VL understanding model, that serves both l...
Cross-modal encoders for vision-language (VL) tasks are often pretrained...
Contrastive language-image pretraining (CLIP) links vision and language ...
Most existing video-and-language (VidL) research focuses on a single dat...
Large-scale transformer-based pre-training has recently revolutionized v...
Multimodal pre-training has propelled great advancement in vision-and-la...
Transformer has become ubiquitous in the deep learning field. One of the...
We present VILLA, the first known effort on large-scale adversarial trai...
Recent Transformer-based large-scale pre-trained models have revolutioni...
We present HERO, a Hierarchical EncodeR for Omni-representation learning...
Large-scale pre-trained language model, such as BERT, has recently achie...
We present a large, tunable neural conversational response generation mo...
Joint image-text embedding is the bedrock for most Vision-and-Language (...
Multi-hop reading comprehension requires the model to explore and connec...
Inspired by how humans summarize long documents, we propose an accurate ...