TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

09/21/2023
by Kan Wu, et al.

In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models. The method introduces two core techniques: affinity mimicking and weight inheritance. Affinity mimicking explores the interaction between modalities during distillation, enabling student models to mimic teachers' behavior of learning cross-modal feature alignment in a visual-linguistic affinity space. Weight inheritance transmits the pre-trained weights from the teacher models to their student counterparts to improve distillation efficiency. Moreover, we extend the method into a multi-stage progressive distillation to mitigate the loss of informative weights during extreme compression. Comprehensive experiments demonstrate the efficacy of TinyCLIP, showing that it can reduce the size of the pre-trained CLIP ViT-B/32 by 50% while maintaining comparable zero-shot performance. While aiming for comparable performance, distillation with weight inheritance can speed up the training by 1.4-7.8× compared to training from scratch. Moreover, our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet, surpassing the original CLIP ViT-B/16 by 3.5% while utilizing only 8.9% of the parameters. Finally, we demonstrate the good transferability of TinyCLIP in various downstream tasks. Code and models will be open-sourced at https://aka.ms/tinyclip.
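To make the two techniques concrete, here is a minimal PyTorch-style sketch, an illustration under stated assumptions rather than the authors' released implementation. affinity_mimicking_loss assumes hypothetical teacher and student encoders that produce L2-normalized image and text embeddings for the same batch, and inherit_linear shows one way weight inheritance could initialize a smaller student layer from a slice of a teacher layer's pre-trained weights; the function names, index arguments, and temperature value are illustrative assumptions, not taken from the paper.

import torch
import torch.nn.functional as F

def affinity_mimicking_loss(t_img, t_txt, s_img, s_txt, tau=0.01):
    # Hedged sketch: KL divergence between teacher and student image-text
    # affinity distributions, in both directions (image-to-text and
    # text-to-image). Inputs are (batch, dim) L2-normalized embeddings;
    # tau is an assumed temperature, not a value from the paper.
    t_i2t = (t_img @ t_txt.t()) / tau   # teacher image-to-text affinities
    s_i2t = (s_img @ s_txt.t()) / tau   # student image-to-text affinities
    loss_i2t = F.kl_div(F.log_softmax(s_i2t, dim=-1),
                        F.softmax(t_i2t, dim=-1), reduction="batchmean")
    loss_t2i = F.kl_div(F.log_softmax(s_i2t.t(), dim=-1),
                        F.softmax(t_i2t.t(), dim=-1), reduction="batchmean")
    return 0.5 * (loss_i2t + loss_t2i)

def inherit_linear(teacher_linear, out_idx, in_idx):
    # Hedged sketch of weight inheritance: build a narrower nn.Linear and
    # copy the selected rows/columns of the teacher's pre-trained weights
    # as the student's initialization.
    student = torch.nn.Linear(len(in_idx), len(out_idx))
    with torch.no_grad():
        student.weight.copy_(teacher_linear.weight[out_idx][:, in_idx])
        student.bias.copy_(teacher_linear.bias[out_idx])
    return student

In a full training loop, the affinity loss would be combined with the usual contrastive objective, and inherited layers would serve only as the student's starting point before distillation.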

Related research

- CLIP-KD: An Empirical Study of Distilling CLIP Models (07/24/2023)
- Weight Squeezing: Reparameterization for Compression and Fast Inference (10/14/2020)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections (05/24/2022)
- MiniViT: Compressing Vision Transformers with Weight Multiplexing (04/14/2022)
- ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval (05/28/2023)
- Feature Affinity Assisted Knowledge Distillation and Quantization of Deep Neural Networks on Label-Free Data (02/10/2023)
