Combined Scaling for Zero-shot Transfer Learning

11/19/2021
by Hieu Pham, et al.

We present a combined scaling method, called BASIC, that achieves 85.7% zero-shot accuracy on the ImageNet ILSVRC-2012 validation set, surpassing the best published zero-shot models, CLIP and ALIGN, by 9.3%. Our model also shows significant improvements in robustness benchmarks. For instance, on 5 test sets with natural distribution shifts, such as ImageNet-{A, R, V2, Sketch} and ObjectNet, our model achieves 83.7% average accuracy, only a small drop from its original ImageNet accuracy. To achieve these results, we scale up the contrastive learning framework of CLIP and ALIGN in three dimensions: data size, model size, and batch size. Our dataset has 6.6B noisy image-text pairs, which is 4x larger than ALIGN's and 16x larger than CLIP's. Our largest model has 3B weights, which is 3.75x larger in parameters and 8x larger in FLOPs than ALIGN and CLIP. Our batch size is 65536, which is 2x larger than CLIP's and 4x larger than ALIGN's. The main challenge with scaling is the limited memory of accelerators such as GPUs and TPUs. We therefore propose a simple method of online gradient caching to overcome this limit.
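The memory bottleneck comes from the contrastive loss over a very large batch: the similarity matrix itself is cheap, but keeping encoder activations for 65536 examples is not. The abstract does not spell out the gradient-caching procedure, so the sketch below is only a minimal single-device illustration of the general idea in PyTorch, not the paper's implementation. It assumes a symmetric CLIP-style InfoNCE loss with a fixed temperature and a deterministic forward pass; the names and values (grad_cached_step, image_encoder, text_encoder, chunk_size) are placeholders.

```python
# Hypothetical sketch of gradient caching for a large-batch contrastive step.
# Not the BASIC implementation; names and hyperparameters are placeholders.

import torch
import torch.nn.functional as F


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric CLIP-style InfoNCE loss over all image-text pairs in the batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def grad_cached_step(image_encoder, text_encoder, images, texts, optimizer, chunk_size=256):
    """One optimizer step on a batch too large to backpropagate through at once."""
    img_chunks = images.split(chunk_size)
    txt_chunks = texts.split(chunk_size)

    # Pass 1: embed every chunk without building the autograd graph,
    # so only the small embedding tensors are kept in memory.
    with torch.no_grad():
        img_emb = torch.cat([image_encoder(c) for c in img_chunks])
        txt_emb = torch.cat([text_encoder(c) for c in txt_chunks])

    # Full-batch loss on the cached embeddings; backprop stops at the
    # embeddings and yields their gradients (the "gradient cache").
    img_emb.requires_grad_(True)
    txt_emb.requires_grad_(True)
    loss = contrastive_loss(img_emb, txt_emb)
    loss.backward()
    img_grads = img_emb.grad.split(chunk_size)
    txt_grads = txt_emb.grad.split(chunk_size)

    # Pass 2: re-encode each chunk with autograd enabled and push the cached
    # embedding gradients into the encoder weights, one chunk at a time.
    # (Assumes a deterministic forward pass, e.g. dropout disabled, so the
    # recomputed embeddings match pass 1.)
    optimizer.zero_grad()
    for c, g in zip(img_chunks, img_grads):
        image_encoder(c).backward(gradient=g)
    for c, g in zip(txt_chunks, txt_grads):
        text_encoder(c).backward(gradient=g)

    optimizer.step()
    return loss.detach()
```

The saving comes from only ever holding the activation graph for one chunk at a time: the first pass stores just the embeddings, and the second pass re-encodes each chunk and backpropagates the cached embedding gradients, so the loss is still exactly the full-batch contrastive loss rather than an average over smaller batches.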

06/27/2023

CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy

The recent work CLIPA presents an inverse scaling law for CLIP training ...

03/27/2023

Sigmoid Loss for Language Image Pre-Training

We propose a simple pairwise sigmoid loss for image-text pre-training. U...

01/18/2022

ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization

We propose a multitask pretraining approach ZeroPrompt for zero-shot gen...

03/31/2023

DIME-FM: DIstilling Multimodal and Efficient Foundation Models

Large Vision-Language Foundation Models (VLFM), such as CLIP, ALIGN and ...

12/19/2022

The case for 4-bit precision: k-bit Inference Scaling Laws

Quantization methods reduce the number of bits required to represent eac...

03/27/2023

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Contrastive language-image pre-training, CLIP for short, has gained incr...

05/03/2022

Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)

Contrastively trained image-text models such as CLIP, ALIGN, and BASIC h...
