The recent breakthroughs in natural language processing for model pretra...
We introduce Alexa Arena, a user-centric simulation platform for Embodie...
We present Masked Audio-Video Learners (MAViL) to train audio-visual rep...
We propose a multimodal (vision-and-language) benchmark for cooperative ...
The task of conducting visually grounded dialog involves learning goal-o...
The task of visually grounded dialog involves learning goal-oriented coo...