We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for...
We introduce OpenFlamingo, a family of autoregressive vision-language mo...
Surprising videos, e.g., funny clips, creative performances, or visual i...
Performant vision-language (VL) models like CLIP represent captions usin...
Weird, unusual, and uncanny images pique the curiosity of observers beca...
The common practice for training commonsense models has gone from-human-...
As humans, we understand events in the visual world contextually, perfor...
Image captioning has conventionally relied on reference-based automatic ...
Images can give us insights into the contextual meanings of words, but c...
Modeling expressive cross-modal interactions seems crucial in multimodal...
Pretraining from unlabelled web videos has quickly become the de facto m...
Instructional videos get high traffic on video sharing platforms, and pr...
Images and text co-occur everywhere on the web, but explicit links betwe...
Controversial posts are those that split the preferences of a community,...
Multimodal machine learning algorithms aim to learn visual-textual corre...
The content of today's social media is becoming richer and richer, incr...
We examine the possibility that recent promising results in automatic ca...