Distribution Normalization: An "Effortless" Test-Time Augmentation for Contrastively Learned Visual-language Models

02/22/2023
by   Yifei Zhou, et al.

Advances in visual-language contrastive learning have made it possible to carry out many downstream applications efficiently and accurately by simply taking the dot product between image and text representations. CLIP, one of the most representative recent approaches, has quickly garnered widespread adoption due to its effectiveness. CLIP is trained with an InfoNCE loss that takes into account both positive and negative samples to help learn a much more robust representation space. This paper, however, reveals that the common downstream practice of taking a dot product is only a zeroth-order approximation of the optimization goal, resulting in a loss of information at test time. Intuitively, since the model has been optimized with the InfoNCE loss, test-time procedures should ideally be aligned with it. The question is how one can recover any semblance of negative-sample information during inference. We propose Distribution Normalization (DN), which approximates the mean representation of a batch of test samples and uses that mean to represent what would be analogous to negative samples in the InfoNCE loss. DN requires no retraining or fine-tuning and can be effortlessly applied during inference. Extensive experiments on a wide variety of downstream tasks exhibit a clear advantage of DN over the dot product.
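The idea in the abstract can be illustrated with a minimal sketch. It assumes that DN amounts to subtracting a scaled estimate of each modality's mean embedding before taking the dot product. The abstract does not give the exact formula, so the `scale` value of 0.5 and the helper names `mean_vector` and `dn_scores` are illustrative assumptions rather than the authors' implementation. Plain Python lists are used so the sketch runs without dependencies; an array library would be the natural choice in practice.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Estimate the mean embedding from a batch of test samples."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, the usual CLIP-style similarity."""
    return sum(x * y for x, y in zip(a, b))


def dn_scores(
    image_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
    image_mean: Sequence[float],
    text_mean: Sequence[float],
    scale: float = 0.5,  # assumed value, not stated in the abstract
) -> List[List[float]]:
    """Image-by-text similarity matrix after shifting each modality by its scaled mean.

    With scale=0.0 this reduces to the ordinary dot-product similarity.
    """
    imgs = [[x - scale * m for x, m in zip(v, image_mean)] for v in image_embs]
    txts = [[x - scale * m for x, m in zip(v, text_mean)] for v in text_embs]
    return [[dot(i, t) for t in txts] for i in imgs]


# Toy two-dimensional embeddings standing in for encoder outputs.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[1.0, 0.0], [0.0, 1.0]]

scores = dn_scores(
    image_embs,
    text_embs,
    mean_vector(image_embs),
    mean_vector(text_embs),
)
```

Because the means are estimated from unlabeled test samples alone, a sketch like this needs no gradient updates, which is consistent with the claim that DN requires no retraining or fine-tuning.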

