The Robustness Limits of SoTA Vision Models to Natural Variation

by Mark Ibrahim, et al.

Recent state-of-the-art vision models have introduced new architectures, learning paradigms, and larger pretraining data, leading to impressive performance on tasks such as classification. While previous generations of vision models were shown to lack robustness to factors such as pose, it is unclear to what extent this next generation of models is more robust. To study this question, we develop a dataset of more than 7 million images with controlled changes in pose, position, background, lighting, and size. We study not only how robust recent state-of-the-art models are, but also the extent to which models can generalize to variation in these factors when it is present during training. We consider a catalog of recent vision models, including vision transformers (ViT), self-supervised models such as masked autoencoders (MAE), and models trained on larger datasets such as CLIP. We find that, out of the box, even today's best models are not robust to common changes in pose, size, and background. When some samples are varied during training, we find that models require a significant amount of diversity before they generalize, though robustness does eventually improve. When diversity is seen only for some classes, however, models do not generalize to other classes unless those classes are very similar to the ones seen varying during training. We hope our work will shed further light on the blind spots of SoTA models and spur the development of more robust vision models.
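The evaluation the abstract describes amounts to measuring accuracy as a function of each controlled factor and comparing against a canonical setting. A minimal sketch of that tabulation, with hypothetical names (`factor_robustness`, the toy `records`) that are not from the paper:

```python
from collections import defaultdict

def factor_robustness(records):
    """Accuracy per factor value, plus the drop relative to the best value.

    `records` is a list of (factor_value, correct) pairs, e.g. produced by
    running a frozen classifier over images that vary only in pose.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for value, correct in records:
        totals[value] += 1
        hits[value] += int(correct)
    acc = {v: hits[v] / totals[v] for v in totals}
    best = max(acc.values())
    return {v: {"accuracy": a, "drop": best - a} for v, a in acc.items()}

# Toy example: a model accurate at the canonical pose but not when rotated.
records = ([("canonical", True)] * 9 + [("canonical", False)]
           + [("rotated", True)] * 4 + [("rotated", False)] * 6)
print(factor_robustness(records))
# canonical: accuracy 0.9, drop 0.0; rotated: accuracy 0.4, drop 0.5
```

Aggregating the `drop` values per factor (pose, background, lighting, and so on) gives a per-factor robustness profile of the kind the paper's findings summarize.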



