We introduce OpenFlamingo, a family of autoregressive vision-language mo...
Massive web datasets play a key role in the success of large vision-lang...
Recent work in NLP has shown promising results in training models on lar...
Large multimodal datasets have been instrumental in recent breakthroughs...
Scaling up neural networks has led to remarkable performance across a wi...
Changing how pre-trained models behave – e.g., improving their performan...
Vision models often fail systematically on groups of data that share com...
We conduct a large empirical evaluation to investigate the landscape of ...
Open-vocabulary models like CLIP achieve high accuracy across many image...
Web-crawled datasets have enabled remarkable generalization capabilities...
Contrastively trained image-text models such as CLIP, ALIGN, and BASIC h...
Households across the world contain arbitrary objects: from mate gourds ...
The conventional recipe for maximizing model accuracy is to (1) train mu...
Large pre-trained models such as CLIP offer consistent accuracy across a...
As language models are trained on ever more text, researchers are turnin...
In the past few years, we have witnessed remarkable breakthroughs in sel...
Transformers have outperformed recurrent neural networks (RNNs) in natur...
Vision, as a central component of human perception, plays a fundamental ...
Standard test sets for supervised learning evaluate in-distribution gene...
Fine-tuning pretrained contextual word embedding models to supervised do...
Systems that can associate images with their spoken audio captions are a...