Driven by improved architectures and better representation learning fram...
We present Fast Language-Image Pre-training (FLIP), a simple and more ef...
Masked Autoencoding (MAE) has emerged as an effective approach for pre-t...
State-of-the-art vision and vision-and-language models rely on large-sca...
We propose UniT, a Unified Transformer model to simultaneously learn the...
We present Worldsheet, a method for novel view synthesis using just a si...
Image descriptions can help visually impaired people to quickly understa...
Many visual scenes contain text that carries crucial information, and it...
Vision-and-Language Navigation (VLN) requires grounding instructions, su...
Solving grounded language tasks often requires reasoning about relations...
Existing visual explanation generating agents learn to fluently justify ...
In complex inferential tasks like question answering, machine learning m...
Natural language explanations of deep neural network decisions provide a...
Navigation guided by natural language instructions presents a challengin...
Existing methods for object instance segmentation require all training i...
Existing models which generate textual explanations enforce task relevan...
People often refer to entities in an image in terms of their relationshi...
Image segmentation from referring expressions is a joint vision and lang...
In this paper we approach the novel problem of segmenting an image based...
In this paper, we address the task of natural language object retrieval,...
Grounding (i.e. localizing) arbitrary, free-form textual phrases in visu...
Large scale object detection with thousands of classes introduces the pr...
A major challenge in scaling object detection is the difficulty of obtai...