Audio-visual representation learning aims to develop systems with human-...
Modern hierarchical vision transformers have added several vision-specif...
The recent breakthroughs in natural language processing for model pretra...
There has been a longstanding belief that generation can facilitate a tr...
Large vision-language models are generally applicable to many downstream...
We present Masked Audio-Video Learners (MAViL) to train audio-visual
rep...
This paper studies a simple extension of image-based Masked Autoencoders...
After its sweeping success in vision and language tasks, pure attention-...
As audio-visual systems are being deployed for safety-critical tasks suc...
We present VideoCLIP, a contrastive approach to pre-train a unified mode...
We present a simplified, task-agnostic multi-modal pre-training approach...
This paper studies zero-shot cross-lingual transfer of vision-language
m...
As audio/visual classification models are widely deployed for sensitive ...
The dominant paradigm for learning video-text representations – noise
co...
Active learning (AL) attempts to maximize the performance gain of the mo...
Deep learning has made major breakthroughs and progress in many fields. ...
Unsupervised machine translation (MT) has recently achieved impressive
r...
With the aim of promoting and understanding the multilingual version of ...
Node embeddings have become an ubiquitous technique for representing gra...
We report on CMU Informedia Lab's system used in Google's YouTube 8 Mill...