Does Progress On Object Recognition Benchmarks Improve Real-World Generalization?

07/24/2023
by Megan Richards, et al.

For more than a decade, researchers have measured progress in object recognition on ImageNet-based generalization benchmarks such as ImageNet-A, -C, and -R. Recent advances in foundation models, trained on orders of magnitude more data, have begun to saturate these standard benchmarks, yet the models remain brittle in practice. This suggests that standard benchmarks, which tend to focus on predefined or synthetic changes, may not suffice for measuring real-world generalization. Consequently, we propose studying generalization across geography as a more realistic measure of progress, using two datasets of objects from households across the globe. We conduct an extensive empirical evaluation of progress across nearly 100 vision models, up to the most recent foundation models. We first identify a progress gap between standard benchmarks and real-world, geographical shifts: progress on ImageNet yields up to 2.5x more progress on standard generalization benchmarks than on real-world distribution shifts. Second, we study model generalization across geographies by measuring disparities in performance across regions, a more fine-grained measure of real-world generalization. We observe that all models, even foundation CLIP models, exhibit large geographic disparities, with differences of 7-20 points in accuracy between regions. Counter to modern intuition, we find that progress on standard benchmarks fails to improve geographic disparities and often exacerbates them: the geographic disparity between the least performant models and today's best models has more than tripled. Our results suggest that scaling alone is insufficient for consistent robustness to real-world distribution shifts. Finally, in early experiments we show how simple last-layer retraining on more representative, curated data can complement scaling as a promising direction for future work, reducing geographic disparity on both benchmarks by over two-thirds.
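To make the per-region disparity measure concrete, the sketch below (not the authors' released code; the function names, data layout, and toy example are illustrative assumptions) computes a model's accuracy separately for each region and reports the gap between the best- and worst-performing regions.

```python
# Minimal sketch, assuming an evaluation run is available as three parallel
# lists: predicted labels, ground-truth labels, and the region of each image.
from collections import defaultdict

def per_region_accuracy(predictions, labels, regions):
    """Return {region: accuracy} computed over parallel lists."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, region in zip(predictions, labels, regions):
        total[region] += 1
        correct[region] += int(pred == label)
    return {r: correct[r] / total[r] for r in total}

def geographic_disparity(predictions, labels, regions):
    """Gap (in accuracy) between the best and worst region."""
    acc = per_region_accuracy(predictions, labels, regions)
    return max(acc.values()) - min(acc.values())

# Toy example: the model is perfect on region "A" but misses one of three
# images from region "B", giving a disparity of 1.0 - 2/3 = 0.33.
preds   = ["cup", "soap", "cup", "soap", "cup", "soap"]
labels  = ["cup", "soap", "cup", "cup",  "cup", "soap"]
regions = ["A",   "A",    "A",   "B",    "B",   "B"]
print(geographic_disparity(preds, labels, regions))
```

Under this reading, a disparity of 0.07-0.20 (7-20 accuracy points) means the same model is substantially less reliable on household objects from some regions than from others.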

