A Simple Long-Tailed Recognition Baseline via Vision-Language Model

by Teli Ma, et al.

The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems. Existing approaches either perform class re-balancing strategies or directly improve network modules to address the problem. However, they still train models with a finite set of predefined labels, limiting their supervision information and restricting their transferability to novel instances. Recent advances in large-scale contrastive visual-language pretraining shed light on a new pathway for visual recognition. With open-vocabulary supervisions, pretrained contrastive vision-language models learn powerful multimodal representations that are promising to handle data deficiency and unseen concepts. By calculating the semantic similarity between visual and text inputs, visual recognition is converted to a vision-language matching problem. Inspired by this, we propose BALLAD to leverage contrastive vision-language models for long-tailed recognition. We first continue pretraining the vision-language backbone through contrastive learning on a specific long-tailed target dataset. Afterward, we freeze the backbone and further employ an additional adapter layer to enhance the representations of tail classes on balanced training samples built with re-sampling strategies. Extensive experiments have been conducted on three popular long-tailed recognition benchmarks. As a result, our simple and effective approach sets the new state-of-the-art performances and outperforms competitive baselines with a large margin. Code is released at https://github.com/gaopengcuhk/BALLAD.




1 Introduction

During past years, visual recognition tasks, such as image classification [simonyan15, he2016deep, xie2017aggregated], object detection [ren2015faster, lin2017feature], semantic segmentation [long2015fully, chen2017deeplab, zhao2017pyramid], and instance segmentation [he2017mask, hu2018learning, liu2018path] have been significantly improved. The performance gains can be largely attributed to the availability of large-scale high-quality datasets [deng2009imagenet, lin2014microsoft, krishna2017visual]. However, the problem of data imbalance has inevitably emerged since real-world data often abide by a long-tailed distribution (e.g., Pareto distribution[pareto1964cours] or Zipf’s law [zipf1949human]). In other words, a few head classes dominate the majority of training examples, whereas many rare or fine-grained classes only have limited relevant data points.

To alleviate the issue, previous efforts either carefully create more balanced datasets (e.g., ImageNet [deng2009imagenet], MSCOCO [lin2014microsoft], and Kinetics-400 [kay2017kinetics]) with human labor or develop more robust algorithms to handle data imbalance. However, since the former is notoriously laborious and expensive, many researchers have devoted themselves to the latter. Formally, long-tailed recognition (LTR) is a research field seeking robust models that 1) are resistant to significantly imbalanced class distributions; 2) can deal with few-shot learning of tail classes. Many methods [zhang2021deep] have been proposed for solving LTR problems. According to their core technical contributions, they can be divided into two categories. Methods in the first category focus on class re-balancing strategies [menon2021longtail, hong2021disentangling, zhang2021distribution, kang2019few] such as data re-sampling, loss re-weighting, and logit adjustment. The second category focuses on improving network modules [cui2021parametric, Kang2020Decoupling, zhang2021test, Samuel_2021_ICCV, tang2020long, cui2021reslt, zhou2020bbn] by classifier design, decoupled training, and representation learning. While these methods have achieved significant progress, the performance of LTR remains unsatisfactory. When delving deeper into how existing imbalanced datasets are used, we observe that almost all previous efforts design models that rely entirely on the visual modality. That is to say, they ignore the semantic features of the raw label text, which may be a promising way to exert additional supervision on inadequate data sources. Therefore, this paper explores whether the language modality can provide effective and complementary information for this task. At the same time, it could broaden generalization to few-shot categories and zero-shot novel instances.

Figure 2: Overview of our BALLAD framework. In Phase A, we keep pretraining the text and image branches of the vision-language backbone on long-tailed data. After Phase A, head classes typically achieve good enough classification performance, whereas tail classes are still far from perfect. During Phase B, a linear adapter is trained on balanced training samples on top of the frozen vision-language backbone. As a result, tail classes enjoy a performance boost while head classes slightly increase or maintain their original classification accuracy. Icons in the figure distinguish components trained with parameter updates from components whose parameters are frozen.

Recently, contrastive vision-language models such as CLIP [radford2021learning] and ALIGN [jia2021scaling] brought a breath of fresh air into the vision community. They learn to align vision and language representations with a contrastive loss given large-scale noisy image-text pairs collected from the web. The powerful visual-language representations obtained from pretraining significantly improve the zero-shot classification performance in open-vocabulary settings without any additional annotations. Motivated by the success of contrastive vision-language models and by our curiosity about the effect of language mentioned above, we directly test CLIP on LTR datasets under its zero-shot setting. Surprisingly, the results are balanced across the many-shot (59.4%), medium-shot (57.5%), and low-shot (57.6%) subsets of ImageNet-LT [liu2019large] (see Table 5), and the overall performance (58.2%) is comparable to the state of the art [cui2021parametric]. This shows the great potential of a multimodal solution for LTR. To further improve the performance while keeping the capability of dealing with data imbalance, an intuitive way is to finetune the vision-language models on LTR datasets. However, we find that it brings only a slight gain. Therefore, the core task of our work becomes how to design an effective recipe for training vision-language models under a long-tailed distribution.

Specifically, in this paper, we design a simple framework based on contrastive vision-language models for LTR. The training procedure of the framework is broken into two phases from the perspective of distribution skewness: A) utilizing abundant annotations from LTR datasets; B) tackling few-shot learning of tail classes on balanced data built with re-sampling strategies. In Phase A, we continue pretraining the CLIP backbone on a specific LTR dataset through contrastive learning. This enables our framework to fully exploit available training examples and update visual-language representations on a new domain. To further facilitate the few-shot learning of tail classes, during Phase B, we freeze the CLIP backbone and employ an auxiliary linear adapter for finetuning on re-balanced training samples. The adapter dynamically combines fixed Phase-A and finetuned Phase-B features via a residual connection to refine the visual representations of tail classes. Compared with finetuning the whole CLIP backbone directly, the linear adapter reduces the number of learnable parameters and thus prevents the potential overfitting of few-shot setups. According to Figure 1, our framework clearly achieves better performance than state-of-the-art LTR approaches. The improvements are especially significant for few-shot and medium-shot classes, demonstrating our approach's great capability of handling class imbalance. Since our framework addresses data imbalance via a linear adapter, we name it BALLAD (BALanced Linear ADapter), which implies the harmony of head and tail classes. Our contributions are threefold:

  • We point out the shortcomings of training with fixed class labels and propose to leverage language modality via contrastive vision-language backbone to facilitate long-tailed recognition.

  • We develop the BALLAD framework consisting of two phases to handle head and tail classes successively. Specifically, we keep training the visual and language branches of the pretrained vision-language model simultaneously at the first stage. Then we adopt a linear adapter to tackle tail classes with vision-language parameters frozen.

  • We conduct extensive experiments to demonstrate the effectiveness of BALLAD. Our simple baseline achieves new state-of-the-art performance on all benchmarks, improving overall accuracy over the previous state of the art by up to 16.5 points on ImageNet-LT (Table 1).

2 Related Work

Contrastive Vision-Language Model.  Contrastive representation learning has been widely adopted to fulfill self-supervised pretraining in various AI domains[chen2020simple, he2020momentum, caron2020unsupervised, caron2021emerging, oord2018representation, gao2021simcse]. Recently, the intersection of vision and language [antol2015vqa, nguyen2018improved, yu2019deep, gao2019dynamic, kim2018bilinear, chen2019uniter, shi2020contrastive] also experienced a revolution sparked by contrastive representation learning. Contrastive vision-language models like CLIP [radford2021learning] and ALIGN [jia2021scaling] demonstrate promising zero-shot performances on various visual search and recognition tasks. Learning directly from natural language supervisions that contain rich visual concepts, they are very flexible and robust to distribution variations across different domains. The success of CLIP and ALIGN has enlightened many downstream vision-language tasks. For instance, DeCLIP [li2021supervision] proposes to utilize self-, multi-view, and nearest-neighbor supervisions among the image-text pairs for data efficient pretraining of CLIP. On visual classification tasks, CLIP-Adapter [gao2021clip] argues that fine-tuning contrastive vision-language models with linear adapters is a better alternative to prompt tuning. For video related tasks, VideoCLIP [xu2021videoclip] performs contrastive pretraining with video-text pairs for zero-shot video-text understanding. ActionCLIP [wang2021actionclip] presents a new “pretrain, prompt and fine-tune” paradigm leveraging pretrained vision-language models for zero-shot/few-shot action recognition. CLIP-It [narasimhan2021clip] designs a language-guided multimodal transformer based on CLIP to address query-focused video summarization. Moreover, CLIPort [shridhar2021cliport] combines CLIP with Transporter [zeng2020transporter] to endow a robot with the ability of semantic understanding and spatial perception. 
In this paper, we demonstrate that contrastive vision-language models can also facilitate visual recognition under long-tailed class distribution setups if properly trained.

Long-Tailed Recognition.  Long-tailed recognition [zhang2021deep] is a practical and challenging problem in vision domain. General visual models will suffer from severe performance degradation under such imbalanced class distributions. A great number of approaches [huang2016learning, ouyang2016factors, zhang2017range, dong2017class, yin2019feature, menon2021longtail, cui2021parametric, wang2021longtailed, Samuel_2021_ICCV, cui2021reslt] have been proposed to address LTR from different perspectives. An intuitive solution is to directly re-balance the number of training samples across all classes [Kang2020Decoupling, zhou2020bbn]. However, naively adjusting the skewness of training samples may lead to the overfitting of tail classes. Better alternatives include loss re-weighting [lin2017focal, kang2019few, hong2021disentangling] and logit adjustment [menon2021longtail, zhang2021distribution] based on label frequencies. Though efficacious for long-tailed distribution, above methods all sacrifice the performance of head classes at varying levels. To address the limitations, researchers turn to explore new network architectures and training paradigms. Typically, long-tail recognition models contain two key components – feature extractor and classifier. For each component, there are corresponding approaches by either designing better classifier [liu2020deep, tang2020long, wu2020solving] or learning reliable representations [liu2019large, zhu2020inflated]. In terms of new training frameworks, existing efforts seek to divide a one-stage training paradigm into two stages. For example, decoupled training approaches [Kang2020Decoupling, kang2021exploring] conduct representation learning and classifier training in a separate manner. Furthermore, ensemble learning schemes [zhou2020bbn, zhang2021test] first learn multiple experts with different data sub-groups and then merge their complementary knowledge to handle LTR. 
In contrast, our BALLAD first utilizes abundant long-tailed data to refine visual-language representations on a new target domain. Then we apply a lightweight linear adapter to encourage fine-grained representation learning from balanced samples. The two phases successively handle the learning of head and tail classes and ensure a better balanced performance across all classes.

3 Our Method

In this section, we first briefly revisit how contrastive vision-language models leverage contrastive objectives to achieve efficient and scalable multimodal representation learning. We then formally present the BALLAD framework and discuss the advantages of the proposed two-stage representation learning for long-tailed class distributions.

3.1 Contrastive Vision-Language Model

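Contrastive vision-language models such as CLIP [radford2021learning] and ALIGN [jia2021scaling] typically follow a dual-encoder architecture with a language encoder $\mathcal{L}_{enc}$ and a visual encoder $\mathcal{V}_{enc}$. Given an input image $I$, $\mathcal{V}_{enc}$ is adopted to extract the visual feature for $I$: $f_v = \mathcal{V}_{enc}(I)$. Likewise, $\mathcal{L}_{enc}$ is applied to encode an input text sequence $T$ into its corresponding text feature: $f_l = \mathcal{L}_{enc}(T)$. After extracting the feature for each modality, two transformation matrices $W_v$ and $W_l$ are employed to project the original visual and text features into a shared embedding space:

$$v = \frac{W_v^\top f_v}{\lVert W_v^\top f_v \rVert}, \qquad u = \frac{W_l^\top f_l}{\lVert W_l^\top f_l \rVert}, \tag{1}$$

where $v$ and $u$ are both $d$-dimensional normalized vectors in the joint multimodal space. During pretraining, contrastive vision-language models learn to align image-text pairs inside a batch. The overall training objective consists of matching losses from two different directions, i.e., $\mathcal{L}_{v \to l}$ for text retrieval and $\mathcal{L}_{l \to v}$ for image retrieval. They both maximize the scores of matched pairs while minimizing those of unmatched ones:

$$\mathcal{L}_{v \to l} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(v_i^\top u_i / \tau)}{\sum_{j=1}^{N} \exp(v_i^\top u_j / \tau)}, \tag{2}$$

$$\mathcal{L}_{l \to v} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(u_i^\top v_i / \tau)}{\sum_{j=1}^{N} \exp(u_i^\top v_j / \tau)}, \tag{3}$$

where $\tau$ denotes the temperature hyperparameter and $N$ represents the number of image-text pairs in the batch.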
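The two matching losses can be illustrated with a small, framework-free sketch. It assumes the embeddings have already been projected into the shared space, that the i-th image and i-th text form a matched pair, and that plain Python lists stand in for tensors; the function names are hypothetical and this is not the released BALLAD code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(scores, target):
    """Return log(softmax(scores)[target]) using the log-sum-exp trick."""
    top = max(scores)
    log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[target] - log_denominator


def contrastive_losses(image_embs, text_embs, tau):
    """Return (image-to-text loss, text-to-image loss) for one batch.

    Assumption: image_embs[i] and text_embs[i] are a matched pair.
    """
    v = [l2_normalize(e) for e in image_embs]
    u = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # sim[i][j] = cosine similarity of image i and text j, scaled by 1 / tau.
    sim = [[sum(a * b for a, b in zip(v[i], u[j])) / tau for j in range(n)]
           for i in range(n)]
    # Image-to-text: each image must pick its own text among all texts.
    loss_v2l = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-image: each text must pick its own image among all images.
    loss_l2v = -sum(log_softmax_at([sim[j][i] for j in range(n)], i)
                    for i in range(n)) / n
    return loss_v2l, loss_l2v
```

Both losses shrink toward zero when every image is most similar to its own text, and grow when the pairing is shuffled.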

Trained with large-scale image-text pairs under the open-vocabulary settings, contrastive vision-language models achieve powerful multimodal representations and naturally possess the capability of zero-shot visual recognition. A collection of text descriptions following templates like "a photo of a {CLASS}" is created for candidate classes in target datasets to perform zero-shot prediction. If we represent the normalized test image feature as $v$ and all normalized text description features as $\{u_1, \dots, u_K\}$, we can compute the class probability of the test image as follows:

$$p_i = \frac{\exp(v^\top u_i / \tau)}{\sum_{j=1}^{K} \exp(v^\top u_j / \tau)}, \tag{4}$$

where $p_i$ represents the probability for class $i$, $K$ stands for the total number of candidate classes, and $\tau$ is the temperature. Finally, the text label with the highest probability is selected as the prediction.
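A minimal sketch of this zero-shot prediction step follows. It assumes the image and text features have already been produced by the two encoders (no real encoder is called here), and the helper names `build_prompts`, `zero_shot_probs`, and `predict` are hypothetical.

```python
import math


def build_prompts(class_names, template="a photo of a {}"):
    """Create one text description per candidate class."""
    return [template.format(name) for name in class_names]


def zero_shot_probs(image_feature, text_features, tau):
    """Softmax over temperature-scaled cosine similarities."""
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    v = normalize(image_feature)
    logits = [sum(a * b for a, b in zip(v, normalize(u))) / tau
              for u in text_features]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(image_feature, text_features, class_names, tau):
    """Return the class name with the highest probability, plus all probabilities."""
    probs = zero_shot_probs(image_feature, text_features, tau)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs
```

In practice the strings from `build_prompts` would be passed through the text encoder once, and the resulting features reused for every test image.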

3.2 Balanced Linear Adapter

As stated in Section 1, contrastive vision-language models obtain balanced performance for head and tail classes, whereas traditional approaches like PaCo [cui2021parametric] suffer from lower performance of tail classes owing to the deficiency of training samples. Inspired by the zero-shot ability of contrastive vision-language models, we choose CLIP as our backbone for long-tailed recognition. The observation in Section 4.3.2 also encourages us to decouple the training of long-tailed data into two phases. To be specific, the first phase (Phase A) fully utilizes available training data and ensures the performance for classes with abundant examples, then the second phase (Phase B) focuses on improving the few-shot learning of tail classes. Note that LWS [Kang2020Decoupling] also adopts a decoupled training framework. However, LWS decouples the training of representation and classifier into two stages. In contrast, our two phases are for long-tailed and balanced training samples respectively and both phases conduct representation refinement with contrastive loss.

Phase A.  Recently, Gururangan et al. [gururangan2020don] showed that continued domain-adaptive and task-adaptive pretraining can largely improve performance on target NLP tasks. Similarly, for our Phase A, we find that continuing the pretraining of the contrastive vision-language backbone on the long-tailed target dataset also benefits the learning of classes with abundant examples. In this way, Phase A can make full use of available training data regardless of its skewness. Since we focus on classifying input images into text labels, the pretraining of Phase A directly follows the loss defined in Equation (2). As shown in Figure 2, Phase A updates the representations of both text and image encoders on a new domain. After Phase A, head classes typically achieve good performance while tail classes still require another stage of balanced training.

Phase B.  Tail classes are short of training examples and fall under few-shot settings. Directly training the whole vision-language backbone may easily overfit to them and lead to performance degradation. Inspired by parameter-efficient adapter modules [houlsby2019parameter], we freeze the vision-language backbone obtained from Phase A and utilize an additional linear adapter layer to help our model refine its visual-language representation on those infrequent classes. As shown in Figure 2, the text features remain the same as in Phase A. The only difference lies in the image features. If we assume the original image feature to be $f$, and the weight matrix and bias of the linear adapter to be $W$ and $b$, then we can represent the refined image feature as

$$f^{\star} = \lambda \cdot \mathrm{ReLU}(W^\top f + b) + (1 - \lambda) \cdot f, \tag{5}$$

where $\lambda$ indicates the residual factor to dynamically combine Phase-B fine-tuned image features with the original image features of Phase A.
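The residual combination can be sketched as follows. The sketch assumes a square weight matrix (so the adapted feature keeps the input dimension) and a ReLU applied to the linear map; both are assumptions of this illustration, and `adapter_forward` is a hypothetical name rather than a function from the released code.

```python
def adapter_forward(f, weight, bias, lam):
    """Blend an adapted feature with the original one.

    Computes lam * ReLU(W^T f + b) + (1 - lam) * f, where `weight` is a
    d x d list of rows, `bias` has length d, and `lam` is the residual factor.
    """
    d = len(f)
    # (W^T f)_j = sum_i W[i][j] * f[i]
    projected = [sum(weight[i][j] * f[i] for i in range(d)) + bias[j]
                 for j in range(d)]
    activated = [max(0.0, z) for z in projected]
    return [lam * a + (1.0 - lam) * x for a, x in zip(activated, f)]
```

With `lam = 0` the function returns the Phase-A feature unchanged, which shows why a small residual factor keeps the refined feature close to the frozen backbone's output.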

To prevent Phase-B training from being biased towards head classes, we also adopt the class-balanced sampling strategy [Kang2020Decoupling] to construct a balanced group of training samples. Suppose there are $K$ classes that constitute a total of $n$ training samples in the target dataset. We can represent the number of training samples for class $j$ as $n_j$ and thus have $n = \sum_{j=1}^{K} n_j$. If we assume these classes are already sorted in decreasing order, then a long-tailed distribution implies $n_i \geq n_j$ if $i < j$, and $n_1 \gg n_K$. For class-balanced sampling, we define the probability of sampling each data point from class $j$ to be $p_j = \frac{1}{K}$. In other words, to construct a balanced group of training samples, we first uniformly choose a class out of the $K$ candidates and then sample one data point from the selected class. Finally, we perform Phase B finetuning on the balanced training data with the same image-to-text contrastive loss as in Phase A (Equation (2)).
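The two-step sampling procedure described above (pick a class uniformly, then pick one of its samples) can be sketched with the standard library alone. The function name and the fixed seed are choices made for this illustration.

```python
import random
from collections import defaultdict


def class_balanced_sample(labels, num_draws, seed=0):
    """Draw dataset indices so that every class is equally likely.

    `labels[i]` is the class of training sample i. Sampling is with
    replacement, so tail-class samples are repeated many times.
    """
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    classes = sorted(indices_by_class)

    draws = []
    for _ in range(num_draws):
        chosen_class = rng.choice(classes)                     # probability 1 / K per class
        draws.append(rng.choice(indices_by_class[chosen_class]))  # uniform within the class
    return draws
```

For example, with 1,000 samples of one class and 5 of another, roughly half of the drawn indices point at the 5 tail samples, which is why the adapter is kept small to limit overfitting on them.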

4 Experiments

4.1 Experiment Setup

Datasets.  We conduct our experiments on three long-tailed benchmark datasets, namely ImageNet-LT [liu2019large], Places-LT [liu2019large], and CIFAR100-LT [cao2019learning]. ImageNet-LT and Places-LT were first introduced in [liu2019large] for long-tailed recognition research. ImageNet-LT is a long-tailed dataset with 1,000 categories sampled from the original ImageNet [deng2009imagenet] following the Pareto distribution with a power value of $\alpha = 6$. There are 115.8K images in the training split, with at most 1,280 and at least 5 images per class. The testing split remains the same as in the original ImageNet [deng2009imagenet], where samples per class are balanced. Places-LT is a long-tailed version of the original Places2 Database [zhou2017places]. The training split of Places-LT contains 184.5K images from 365 categories, with at most 4,980 and at least 5 images per class. The testing split is also balanced, with 100 images per class. CIFAR100-LT [cao2019learning] is created with the long-tailed imbalance technique [cui2019class], which reduces training examples per class based on an exponential decay function. In this paper, we directly use the version from [wang2021longtailed] with an imbalance ratio of 100. The training split contains 50K images from 100 categories, while the testing split has a uniform 100 images for each class.
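To make the exponential-decay idea concrete, the sketch below generates per-class training counts that fall geometrically from the largest class to the smallest. The exact formula, the rounding, and the example values are assumptions for illustration only; they are not taken from [cui2019class] or from the dataset files used in this paper.

```python
def long_tailed_counts(n_max, num_classes, imbalance_ratio):
    """Per-class sample counts decaying exponentially with the class index.

    Class 0 keeps n_max samples and the last class keeps about
    n_max / imbalance_ratio samples (assumed decay form, for illustration).
    """
    counts = []
    for class_index in range(num_classes):
        fraction = class_index / (num_classes - 1)
        counts.append(int(n_max * (1.0 / imbalance_ratio) ** fraction))
    return counts


# Hypothetical example: 100 classes, 500 images in the largest class.
counts = long_tailed_counts(n_max=500, num_classes=100, imbalance_ratio=100)
```

The resulting list is non-increasing, so sorting classes by index matches the decreasing-order convention used for class-balanced sampling.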

Implementation Details.  We use CLIP as the contrastive vision-language backbone in all experiments. For the visual branch of CLIP, we vary among ResNet-50, ResNet-101, ViT-B/16, and ResNet-50×16, which has 16× the computation cost of ResNet-50 following the style of EfficientNet as introduced in [radford2021learning]. ResNet-50 is used for all ablation studies unless otherwise specified. We use SGD as the optimizer for all experiments with a momentum of . The batch size is set to . We adopt a cosine learning rate schedule to decay learning rates. The initial learning rate of CLIP finetuning is set to for both the visual and language encoders, while the learning rate of the linear adapter is set to at the start. For data pre-processing, images are resized to except for the ResNet-50×16 visual backbone, which utilizes an image size of . Cropping and random horizontal flipping are also adopted to augment the original images for robustness. For BALLAD, we train Phase A for 50 epochs and Phase B for 10 epochs by default unless specified. For the hyperparameters, we set the residual factor to and the temperature to . The feature dimensions of ResNet-50, ResNet-101, ResNet-50×16 and ViT-B/16 are respectively.
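A cosine learning rate schedule of the kind mentioned above can be written in a few lines. The sketch assumes decay to a floor of zero with no warmup and uses a placeholder initial learning rate; none of these specifics are stated in the text.

```python
import math


def cosine_lr(initial_lr, step, total_steps, min_lr=0.0):
    """Learning rate after `step` of `total_steps` under cosine decay."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder values: one learning rate per epoch for a 50-epoch phase.
schedule = [cosine_lr(initial_lr=1.0, step=epoch, total_steps=50) for epoch in range(50)]
```

The rate starts at `initial_lr`, passes through half of it at the midpoint, and approaches `min_lr` at the end of the phase.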

Evaluation Metrics.  We evaluate the models for long-tailed recognition on the balanced test splits and report the commonly used top-1 classification accuracy over all classes. Following [Kang2020Decoupling], we divide these classes into three subsets: many-shot, medium-shot, and few-shot categories. Specifically, a category is assigned to a subset according to its number of training instances, namely more than 100 images, 20-100 images, and fewer than 20 images, respectively.
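The subset-wise metric can be sketched as below, using the thresholds stated above (more than 100, 20 to 100, fewer than 20 training images). The function names and the dictionary-based inputs are choices made for this illustration.

```python
def subset_of(train_count):
    """Map a class's training-set size to its evaluation subset."""
    if train_count > 100:
        return "many"
    if train_count >= 20:
        return "medium"
    return "few"


def accuracy_by_subset(train_counts, labels, predictions):
    """Top-1 accuracy (in percent) per subset and over all test samples.

    `train_counts[c]` is the number of training images of class c;
    `labels` and `predictions` are parallel lists over the test split.
    Subsets without test samples are reported as None.
    """
    correct = {"many": 0, "medium": 0, "few": 0, "all": 0}
    total = {"many": 0, "medium": 0, "few": 0, "all": 0}
    for label, prediction in zip(labels, predictions):
        for key in (subset_of(train_counts[label]), "all"):
            total[key] += 1
            correct[key] += int(label == prediction)
    return {key: (100.0 * correct[key] / total[key] if total[key] else None)
            for key in total}
```

Because the test splits are balanced, the "all" value equals the mean of per-class accuracies, while the three subset values reveal how head and tail classes trade off.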

| Method | Backbone | #Epochs | Many | Medium | Few | All |
|---|---|---|---|---|---|---|
| τ-normalized [Kang2020Decoupling] | ResNet-50 | 90 | 56.6 | 44.2 | 27.4 | 46.7 |
| | ResNet-101 | 90 | 59.4 | 47.0 | 30.6 | 49.6 |
| | ResNet-152 | 90 | 59.6 | 47.5 | 32.2 | 50.1 |
| | ResNeXt-50 | 90 | 59.1 | 46.9 | 30.7 | 49.4 |
| | ResNeXt-101 | 90 | 59.1 | 47.0 | 31.7 | 49.6 |
| | ResNeXt-152 | 90 | 62.2 | 50.1 | 35.8 | 52.8 |
| LWS [Kang2020Decoupling] | ResNet-50 | 90 | 57.1 | 45.2 | 29.3 | 47.7 |
| | ResNet-101 | 90 | 60.1 | 47.6 | 31.2 | 50.2 |
| | ResNet-152 | 90 | 60.6 | 47.8 | 31.4 | 50.5 |
| | ResNeXt-50 | 90 | 60.2 | 47.2 | 30.3 | 49.9 |
| | ResNeXt-101 | 90 | 60.5 | 47.2 | 31.2 | 50.1 |
| | ResNeXt-152 | 90 | 63.5 | 50.4 | 34.2 | 53.3 |
| ResLT [cui2021reslt] | ResNeXt-50 | 180 | 63.0 | 50.5 | 35.5 | 52.9 |
| | ResNeXt-101 | 180 | 63.3 | 53.3 | 40.3 | 55.1 |
| Balanced Softmax [ren2020balanced] | ResNet-50 | 400 | 66.7 | 52.9 | 33.0 | 55.0 |
| | ResNeXt-50 | 400 | 67.7 | 53.8 | 34.2 | 56.2 |
| | ResNeXt-101 | 400 | 69.2 | 55.8 | 36.3 | 58.0 |
| RIDE [wang2020long] | ResNet-50 | 100 | 66.2 | 52.3 | 36.5 | 55.4 |
| | ResNeXt-50 | 100 | 68.2 | 53.8 | 36.0 | 56.8 |
| PaCo [cui2021parametric] | ResNet-50 | 400 | 65.0 | 55.7 | 38.2 | 57.0 |
| | ResNeXt-50 | 400 | 67.5 | 56.9 | 36.7 | 58.2 |
| | ResNeXt-101 | 400 | 68.2 | 58.7 | 41.0 | 60.0 |
| BALLAD | ResNet-50 | 50+10 | 71.0 | 66.3 | 59.5 | 67.2 (+7.2) |
| | ResNet-101 | 50+10 | 74.7 | 69.1 | 63.3 | 70.5 (+10.5) |
| | ViT-B/16 | 50+10 | 79.1 | 74.5 | 69.8 | 75.7 (+15.7) |
| | ResNet-50×16 | 50+10 | 81.1 | 75.6 | 67.0 | 76.5 (+16.5) |

Table 1: Long-tailed recognition accuracy on ImageNet-LT for different methods and backbones. The numbers in parentheses represent the improvement of overall accuracy compared with the state-of-the-art performance (PaCo with ResNeXt-101, 60.0% overall accuracy). RIDE uses 4 experts.

4.2 Performance Comparison

In this section, we compare the performance of BALLAD with long-tailed recognition approaches that report state-of-the-art results on three benchmark datasets, i.e., ImageNet-LT, Places-LT, and CIFAR100-LT.

ImageNet-LT.  Table 1 shows the long-tailed recognition results on ImageNet-LT. We present BALLAD variants with ResNet-50, ResNet-101, ResNet-50×16, and ViT-B/16 as the visual backbone. From the table, we observe that with only 60 epochs (50 epochs for Phase A and 10 epochs for Phase B), our smallest BALLAD variant with the ResNet-50 visual backbone surpasses the largest model of the state-of-the-art PaCo [cui2021parametric] using ResNeXt-101 by 7.2 points. As the size of the visual backbone gradually increases, the performance of BALLAD also improves. It is worth noting that BALLAD with ResNet-50×16 achieves an accuracy of 76.5%, which outperforms other state-of-the-art models by a large margin.

| Method | Backbone | #Pretrain | Many | Medium | Few | All |
|---|---|---|---|---|---|---|
| OLTR [liu2019large] | ResNet-152 | Y | 44.7 | 37.0 | 25.3 | 35.9 |
| cRT [Kang2020Decoupling] | ResNet-152 | Y | 42.0 | 37.6 | 24.9 | 36.7 |
| τ-normalized [Kang2020Decoupling] | ResNet-152 | Y | 37.8 | 40.7 | 31.8 | 37.9 |
| LWS [Kang2020Decoupling] | ResNet-152 | Y | 40.6 | 39.1 | 28.6 | 37.6 |
| Balanced Softmax [ren2020balanced] | ResNet-152 | Y | 42.0 | 39.3 | 30.5 | 38.6 |
| ResLT [cui2021reslt] | ResNet-152 | Y | 39.8 | 43.6 | 31.4 | 39.8 |
| PaCo [cui2021parametric] | ResNet-152 | Y | 37.5 | 47.2 | 33.9 | 41.2 |
| PaCo [cui2021parametric] | ResNet-152 | Y | 36.1 | 47.9 | 35.3 | 41.2 |
| BALLAD | ResNet-50 | N | 46.7 | 48.0 | 42.7 | 46.5 (+5.3) |
| | ResNet-101 | N | 48.0 | 48.6 | 46.0 | 47.9 (+6.7) |
| | ViT-B/16 | N | 49.3 | 50.2 | 48.4 | 49.5 (+8.3) |
| | ResNet-50×16 | N | 49.4 | 50.5 | 46.6 | 49.3 (+8.1) |

Table 2: Long-tailed recognition accuracy on Places-LT for different methods. The numbers in parentheses represent the improvement of overall accuracy compared with the state-of-the-art performance (PaCo with 41.2% overall accuracy). #Pretrain: whether the visual backbone is pretrained on the full ImageNet-2012 dataset (Y) or not (N). The two PaCo rows differ in whether RandAugment [cubuk2020randaugment] is used during training.

Places-LT.  We further evaluate BALLAD on the Places-LT dataset and report the results in Table 2. Previous approaches commonly pretrain their models on the full ImageNet-2012 dataset first to enrich the visual representation before finetuning on Places-LT. However, BALLAD can directly perform training on Places-LT thanks to the additional language representation of contrastive vision-language models. As shown in Table 2, BALLAD with the ResNet-50 visual backbone achieves 46.5% accuracy over all categories, which beats the state-of-the-art model PaCo with ResNet-152 by 5.3 points. This shows that BALLAD not only achieves better performance with a smaller visual backbone but also saves a great amount of training time by skipping the ImageNet pretraining.

| Method | Backbone | Many | Medium | Few | All |
|---|---|---|---|---|---|
| OLTR [liu2019large] | ResNet-32 | 61.8 | 41.4 | 17.6 | 41.2 |
| LDAM+DRW [cao2019learning] | ResNet-32 | 61.5 | 41.7 | 20.2 | 42.0 |
| τ-normalized [Kang2020Decoupling] | ResNet-32 | 65.7 | 43.6 | 17.3 | 43.2 |
| cRT [Kang2020Decoupling] | ResNet-32 | 64.0 | 44.8 | 18.1 | 43.3 |
| RIDE [wang2020long] | ResNet-32 | 69.3 | 49.3 | 26.0 | 49.1 |
| TADE [zhang2021test] | ResNet-32 | 65.4 | 49.3 | 29.3 | 49.8 |
| BALLAD | ResNet-50 | 62.4 | 52.3 | 38.2 | 51.6 (+1.8) |
| | ResNet-101 | 69.5 | 59.3 | 47.1 | 59.2 (+9.4) |
| | ViT-B/16 | 84.9 | 79.7 | 67.3 | 77.8 (+28.0) |
| | ResNet-50×16 | 74.6 | 62.8 | 52.0 | 63.7 (+13.9) |

Table 3: Long-tailed recognition performance comparison on CIFAR100-LT with an imbalance ratio of 100. The numbers in parentheses represent the improvement of overall accuracy compared with the state-of-the-art performance (TADE with 49.8% overall accuracy).

CIFAR100-LT.  We also evaluate the models on CIFAR100-LT and show their performances in Table 3. As reported in the table, BALLAD outperforms the state-of-the-art expert-based ensemble methods RIDE [wang2021longtailed] and TADE [zhang2021test] with every visual backbone: the margins are 2.5 and 1.8 points with ResNet-50, and 28.7 and 28.0 points with ViT-B/16, respectively.

4.3 Ablation Studies

In this section, we conduct extensive ablation studies to validate the design choices of BALLAD from three aspects. We first explore how to best utilize the vision-language backbone for finetuning. We then show the effectiveness of the linear adapter and how to use it for better performance. Finally, we demonstrate where and how to conduct data balancing.

4.3.1 Vision-Language Models

We conduct ablations to demonstrate the effectiveness of vision-language backbones as introduced in Section 3.

The Effectiveness of Pretrained Weights.  In Table 4, we validate the effectiveness of pretrained CLIP weights compared with randomly initialized visual and language weights. All four ablations are conducted on Phase A without data balancing for 50 epochs. The large gaps between random and pretrained CLIP initialization demonstrate the advantage of utilizing pretrained contrastive vision-language models. Besides, we find that the visual encoder has much more influence on performance than the text encoder, as a randomly initialized vision encoder drops the accuracy close to zero. Note that the poor performance of random initialization is primarily attributed to the short training period; pretrained vision-language weights greatly speed up convergence.

| Vision | Language | Many | Medium | Few | All |
|---|---|---|---|---|---|
| random | random | 0.3 | 0.0 | 0.0 | 0.1 |
| random | CLIP | 0.3 | 0.0 | 0.0 | 0.1 |
| CLIP | random | 36.8 | 2.9 | 0.0 | 15.6 |
| CLIP | CLIP | 75.5 | 56.3 | 41.0 | 61.6 |

Table 4: Ablations of pretrained vision-language weights on the ImageNet-LT dataset. CLIP means using pre-trained weights as initialization and random represents random initialization.

Vision Language Many Medium Few All
59.4 57.5 57.6 58.2
70.4 65.4 58.0 66.3
70.6 65.4 55.9 66.1
71.3 65.4 54.1 66.1
Table 5: Different methods of finetuning CLIP on ImageNet-LT. ”” means finetuning and ”” means freezing the parameters of model. Vision and Language denotes the visual and text encoders of CLIP respectively. All models are finetuned for epochs.

Finetune the Vision-Language Model.  To empirically discover how to utilize contrastive vision-language models, we probe the finetuning process by freezing the pre-trained image encoder and text encoder respectively. When both encoders are frozen, the model directly performs zero-shot predictions. From Table 5, we can easily find the following pattern: as more components of CLIP are finetuned, accuracy improves more for many-shot categories but drops more in the few-shot division. We hypothesize that this is because the many-shot classes dominate the visual feature space during finetuning. For Phase A, it is necessary to adapt CLIP to the specific long-tailed dataset as much as possible, so we choose to finetune both the vision and language branches of CLIP.

Visual Backbones.  We try CLIP with different visual backbones to explore their influence on the final performance of BALLAD. We report the Phase A results of different backbones in Figure 3 on both the ImageNet-LT and Places-LT benchmarks. When the visual backbone becomes deeper and larger, the finetuned performance also gradually improves for all, many-shot, and medium-shot categories. Surprisingly, the Vision Transformer structure [dosovitskiy2021an] achieves the best accuracy on the few-shot subset, probably owing to the ability of the multi-head self-attention mechanism to capture minor features.

(a) ImageNet-LT
(b) Places-LT.
Figure 3: Comparisons between several visual backbones for ImageNet-LT (left) and Places-LT (right).

4.3.2 Linear Adapter

We validate the effectiveness of adopting a linear adapter in Phase B and explore the key factors that determine its performance.

The Effectiveness of Linear Adapter.  We design ablations to demonstrate the influence of the linear adapter. First, we freeze the parameters of CLIP and finetune the linear adapter for epochs to mix the original zero-shot visual embedding with the corresponding finetuned visual features via a residual connection. As illustrated in Figure 4, the simple -epoch training improves the performance from to even without data balancing. Moreover, we finetune both the visual and language encoders of CLIP for epochs and then finetune the linear adapter for another epochs with the CLIP parameters fixed. Compared with an equal -epoch training scheme that finetunes the visual and language encoders of CLIP, an extra -epoch finetuning of the linear adapter on top of -epoch finetuning of the CLIP backbone can further boost the top-1 accuracy from to .

Figure 4: Ablations on the effectiveness of the linear adapter and of decoupled finetuning. Z-CLIP: zero-shot CLIP model; Z-CLIP+A: adapter finetuned on top of zero-shot CLIP; CLIP: CLIP finetuned directly; Joint-CLIP+A: CLIP and adapter finetuned jointly; De-CLIP+A: the BALLAD scheme, which decouples CLIP and adapter finetuning into two phases.
V-Adapter L-Adapter Many Medium Few All
71.0 66.3 59.5 67.2
71.0 66.2 59.0 67.0
70.6 66.2 58.4 66.8
Table 6: Variants of the linear adapter. V-Adapter and L-Adapter represent using a linear adapter layer to adapt the visual and language encoders, respectively. All models are trained on ImageNet-LT for epochs.

Should the Finetuning of CLIP and Linear Adapter be Decoupled?  As mentioned in Section 3 and illustrated in Figure 2, we decouple the training process into two phases: in Phase A, we train both the vision and language encoders of CLIP starting from pre-trained weights; in Phase B, we freeze the parameters of the visual and language encoders and finetune only the linear adapter. An alternative scheme is to train CLIP and the linear adapter jointly rather than decoupling the two processes. According to Figure 4, joint training of CLIP and the linear adapter (Joint-CLIP+A) leads to a accuracy drop compared with directly finetuning CLIP without an adapter (CLIP). In contrast, decoupled training of CLIP and the linear adapter (De-CLIP+A) largely boosts the accuracy from to , and the gain comes mainly from tail classes, where it is up to . We visualize the joint and decoupled training schemes using t-SNE [van2008visualizing] and present the results in the supplementary material. Compared with joint training, decoupled training better separates the tail-class feature embeddings from those of head classes. This demonstrates that the proposed decoupled training of the vision-language model and adapter is effective for handling long-tailed distributions.

Variants of Linear Adapter.  Since CLIP has dual encoders, the auxiliary linear adapter could be added to either or both of the two branches. As reported in Table 6, we try the linear adapter on the visual and language encoders respectively. Applying the linear adapter to the visual branch of CLIP achieves the best overall accuracy in the table (67.2), so we adopt this variant.

Figure 5: The top-1 accuracy on ImageNet-LT with different values of residual factor in Phase B of BALLAD.

Hyperparameters of the Linear Adapter.  Moreover, we explore the influence of the linear adapter’s residual factor, which determines the importance of the new knowledge obtained from finetuning the linear adapter. Note that when the residual factor equals , the classification is fully determined by the adapted image features. We explore different values of the residual factor from to and conduct the ablations of Phase B finetuning on ImageNet-LT. As shown in Figure 5, the best performance of the linear adapter is obtained when the residual factor is around , with a top-1 accuracy of on ImageNet-LT. The empirical results reveal that the knowledge of the finetuned CLIP is already good enough to handle most cases, and that a slight, balanced adaptation further improves the performance.
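The residual mixing discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released BALLAD code: the name `lam` for the residual factor, the ReLU after the linear map, and the convention that `lam = 1` keeps only the adapted features while `lam = 0` keeps only the original ones are all choices made for this example. Plain Python lists are used to keep the sketch free of dependencies.

```python
def linear(weight, x):
    """Multiply a weight matrix (a list of rows) by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]


def adapt(feature, weight, lam):
    """Blend adapted and original features: lam * relu(W f) + (1 - lam) * f.

    `lam` is a hypothetical name for the residual factor, and the ReLU is an
    assumption of this sketch. The weight matrix must be square so that the
    adapted and original features have the same dimension.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    adapted = relu(linear(weight, feature))
    return [lam * a + (1.0 - lam) * f for a, f in zip(adapted, feature)]


# Toy 3-dimensional visual feature and adapter weight (illustrative values only).
feature = [1.0, -2.0, 0.5]
weight = [[0.0, 1.0, 0.0],
          [1.0, 0.0, 0.0],
          [0.0, 0.0, 2.0]]

print(adapt(feature, weight, 0.0))  # original features only
print(adapt(feature, weight, 0.5))  # equal mix of adapted and original
print(adapt(feature, weight, 1.0))  # adapted features only
```

A small `lam` keeps the prediction close to that of the finetuned backbone, which matches the observation that a slight adaptation is sufficient.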

4.3.3 Balancing Methods

Balancing methods can alleviate the severe performance degradation caused by class imbalance. In this section, we explore different balancing methods for BALLAD to answer two questions: 1) where to apply balancing, and 2) which balancing method to apply.

Where to balance.  Here, we compare balancing the long-tailed data distribution in either or both of the two phases. The experiments are performed on the ImageNet-LT and Places-LT datasets using CLIP with a ResNet-50 backbone. As mentioned earlier, many-shot categories dominate the feature space of a long-tailed distribution. The many-shot performance drops on both datasets, reported in Table 7, suggest that balancing during Phase A tends to sacrifice many-shot representations. Since Phase A is mainly designed for updating representations on a new domain, we abandon Phase-A data balancing. When balancing strategies are applied to Phase B alone, BALLAD achieves a more balanced performance across shots and improves the overall top-1 accuracy, thanks to the rich features learned in Phase A.

Dataset      Balance             Many  Medium  Few   All
             Phase A  Phase B
ImageNet-LT  ✗        ✗          77.3  57.4    39.0  62.6
             ✓        ✗          76.6  58.4    42.7  63.3
             ✓        ✓          70.7  66.2    58.5  66.9
             ✗        ✓          71.0  66.3    59.5  67.2
Places-LT    ✗        ✗          52.7  32.9    23.4  38.2
             ✓        ✗          51.3  33.2    25.5  38.2
             ✓        ✓          44.6  46.7    44.1  45.5
             ✗        ✓          46.7  48.0    42.7  46.5
Table 7: Ablation on where to apply balancing strategies. On both ImageNet-LT and Places-LT, balancing only in Phase B makes BALLAD perform best.
Balance Methods Many Medium Few All
Class-balanced 71.0 66.3 59.5 67.2
Square-root 75.2 62.8 50.9 66.0
Mix-balanced 72.6 64.9 59.1 67.1
Table 8: Comparison of different balanced sampling strategies on ImageNet-LT.

How to balance.  Furthermore, we explore different sampling strategies for Phase B, including class-balanced sampling, square-root sampling, and mix-balanced sampling. Class-balanced sampling draws every category of the original dataset with equal probability, in contrast to the natural instance-balanced sampling, which selects instances regardless of class. The process can be decoupled into two steps: first select a class uniformly from the list of categories, then randomly sample a data point from the selected class. Square-root sampling [mahajan2018exploring] first computes the square root of the number of samples in each class, then re-normalizes and samples according to the resulting distribution. Mix-balanced sampling combines instance-balanced and class-balanced sampling, taking advantage of both strategies to avoid overfitting at early epochs and underfitting at late epochs. Motivated by [Kang2020Decoupling], we adopt a soft version of mix-balanced sampling that dynamically interpolates between instance-balanced sampling and class-balanced sampling as learning progresses. As shown in Table 8, class-balanced sampling benefits medium-shot and few-shot categories the most. Thus we adopt class-balanced sampling as the balancing method of BALLAD.
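The sampling strategies compared here can all be written as per-class probabilities proportional to a power of the class size. The sketch below is illustrative only: the function names, the toy class sizes, and the linear interpolation schedule for mix-balanced sampling are assumptions of this example, since the text only says that the soft version interpolates between the two strategies as learning progresses.

```python
import random


def class_probs(counts, power):
    """Probability of drawing each class: n_j**power / sum_i n_i**power.

    power=1.0 gives instance-balanced, 0.5 square-root, and 0.0 class-balanced
    sampling. All counts are expected to be positive.
    """
    weights = [n ** power for n in counts]
    total = sum(weights)
    return [w / total for w in weights]


def mix_probs(counts, epoch, num_epochs):
    """Move from instance-balanced to class-balanced over training.

    The linear schedule is an assumption made for this sketch.
    """
    t = epoch / max(1, num_epochs - 1)
    inst = class_probs(counts, 1.0)
    bal = class_probs(counts, 0.0)
    return [(1.0 - t) * p + t * q for p, q in zip(inst, bal)]


def sample(data_by_class, probs, rng):
    """Two-step draw: pick a class by `probs`, then a uniform item from it."""
    cls = rng.choices(range(len(data_by_class)), weights=probs, k=1)[0]
    return cls, rng.choice(data_by_class[cls])


# Toy head, medium, and tail class sizes (illustrative values only).
counts = [900, 100, 25]
data_by_class = [list(range(n)) for n in counts]
rng = random.Random(0)

print("instance-balanced:", class_probs(counts, 1.0))
print("square-root:      ", class_probs(counts, 0.5))
print("class-balanced:   ", class_probs(counts, 0.0))
print("mix, mid-training:", mix_probs(counts, epoch=5, num_epochs=11))
print("one class-balanced draw:", sample(data_by_class, class_probs(counts, 0.0), rng))
```

With these toy sizes, the tail class is drawn about 2% of the time under instance-balanced sampling, about 11% under square-root sampling, and one third of the time under class-balanced sampling.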

5 Conclusion

In this paper, we proposed BALLAD, which tackles long-tailed recognition by leveraging contrastive vision-language models. We decouple BALLAD into two phases, trained with long-tailed and balanced samples respectively. We first continue pretraining with a contrastive loss to fully utilize the abundant data and update the visual-language representation on the specific domain. After that, we employ an auxiliary linear adapter to refine the visual representation of tail classes. We hope our simple BALLAD baseline will stimulate more future research on vision-language models for long-tailed recognition.


Appendix A Algorithm

We provide the pseudocode for training BALLAD in Algorithm 1. In Phase A, we continue training the vision-language model, updating the parameters of both the visual and language encoders. Afterward, in Phase B, we train a single linear adapter with the vision-language model frozen to adapt visual features in a balanced way.

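Input: training samples, visual and language encoders, linear adapter
Initialize the visual and language encoders with web-data pretrained parameters
for each epoch do (Phase A)
     for each minibatch do
         Project the minibatch into the embedding space as in Eq. (1)
         Update the visual and language encoders
     end for
end for
Initialize the linear adapter randomly (Phase B)
Freeze the visual and language encoders
for each epoch do
     for each minibatch do
         Project the minibatch into the embedding space as in Eq. (1)
         Update the linear adapter
     end for
end for
Algorithm 1: Two-phase training of BALLAD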
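The control flow of the two phases can be sketched as a simple per-epoch plan. The sketch does no actual training. The function name, the record fields, the module names, and the epoch counts are hypothetical and chosen only for this example, while the choice of trainable modules and sampling strategy per phase follows the description in the text.

```python
def two_phase_plan(epochs_a, epochs_b):
    """Yield one record per epoch describing what is trained and how data is sampled."""
    for epoch in range(epochs_a):
        # Phase A: both encoders are updated on the natural long-tailed data.
        yield {
            "phase": "A",
            "epoch": epoch,
            "trainable": ("visual_encoder", "language_encoder"),
            "sampling": "instance-balanced",
        }
    for epoch in range(epochs_b):
        # Phase B: the encoders are frozen and only the adapter is updated
        # on class-balanced samples.
        yield {
            "phase": "B",
            "epoch": epoch,
            "trainable": ("linear_adapter",),
            "sampling": "class-balanced",
        }


# Epoch counts here are placeholders, not the values used in the experiments.
for step in two_phase_plan(epochs_a=2, epochs_b=2):
    print(step["phase"], step["epoch"], step["trainable"], step["sampling"])
```

In a real training loop, each record would decide which parameters receive gradient updates and which sampler feeds the data loader for that epoch.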

Appendix B Zero-shot Performance

The zero-shot long-tailed recognition performance on three benchmark datasets is presented in Table 9. The numbers in parentheses show how much BALLAD improves over the zero-shot CLIP results. They illustrate that the training scheme of BALLAD is effective, raising the accuracy of the initial vision-language model by a large margin (up to 12.2 points).

Visual Backbone  ImageNet-LT               Places-LT                 CIFAR100-LT
                 zero-shot  BALLAD         zero-shot  BALLAD         zero-shot  BALLAD
ResNet-50        58.2       67.2 (+9.0)    35.3       46.5 (+11.2)   40.2       51.6 (+11.4)
ResNet-101       61.2       70.5 (+9.3)    36.2       47.9 (+11.7)   47.8       59.2 (+11.4)
ViT-B/16         66.7       75.7 (+9.0)    37.8       49.5 (+11.7)   66.4       77.8 (+11.4)
ResNet-50        69.0       76.5 (+7.5)    37.1       49.3 (+12.2)   52.9       63.7 (+10.8)
Table 9: Top-1 accuracy of zero-shot CLIP and BALLAD-training.

Appendix C Text Prompting

Text Prompts Many Medium Few All
Single prompt 71.0 66.3 59.5 67.2
Random single prompt 70.5 66.1 59.8 66.9
Ensemble prompts 70.5 66.0 59.5 66.8
Table 10: Comparison of different text prompting strategies on ImageNet-LT.

Prompt engineering was initially proposed for knowledge probing in large pretrained language models [petroni2019language, shin2020eliciting, li2021prefix, jiang2020can]. Prompting adds extra instructions to task inputs so that a pretrained language model generates specific outputs. In this paper, we use manually designed prompts following CLIP [radford2021learning]. Specifically, the prompt template "a photo of a {CLASS}" is adopted for all experiments reported in the main manuscript. However, CLIP [radford2021learning] claims that ensembling several classifiers built from different hand-crafted prompts, such as the following, can improve performance on zero-shot tasks.

  • itap of a {CLASS}.

  • a bad photo of the {CLASS}.

  • a origami {CLASS}.

  • a photo of the large {CLASS}.

  • a {CLASS} in the video game.

  • art of the {CLASS}.

  • a photo of the small {CLASS}.

Therefore, we perform ablations on ensembling the multiple prompts and on randomly choosing one of them for language model training. The ablations are conducted on the ImageNet-LT benchmark [liu2019large] with a ResNet-50 visual backbone, and the results are reported in Table 10. Surprisingly, prompt ensembling reduces overall accuracy from 67.2 to 66.8 rather than raising it. Randomly choosing a template from the seven prompts above also lowers overall accuracy, by 0.3 (to 66.9). We hypothesize that prompt templates from multiple views may confuse the finetuning of the pretrained language model, which differs from the zero-shot setting. Thus, we use the single template "a photo of a {CLASS}" in all our ablations.
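The difference between a single prompt and an ensemble of prompts can be sketched as below. The `toy_text_encoder` is a hypothetical stand-in that returns a hash-seeded random unit vector, not CLIP's text encoder. Averaging the normalized prompt embeddings and re-normalizing is one common way to ensemble prompts, and it is assumed here rather than taken from the text.

```python
import hashlib
import math
import random

# The seven templates listed above and the single template used in the main experiments.
TEMPLATES = [
    "itap of a {CLASS}.",
    "a bad photo of the {CLASS}.",
    "a origami {CLASS}.",
    "a photo of the large {CLASS}.",
    "a {CLASS} in the video game.",
    "art of the {CLASS}.",
    "a photo of the small {CLASS}.",
]
SINGLE_TEMPLATE = "a photo of a {CLASS}"


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def toy_text_encoder(text, dim=8):
    """Hypothetical stand-in for a text encoder: a deterministic pseudo-random unit vector."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])


def class_embedding(class_name, templates, encoder=toy_text_encoder):
    """Embed each filled-in template, average the embeddings, and re-normalize."""
    prompts = [t.replace("{CLASS}", class_name) for t in templates]
    embeddings = [encoder(p) for p in prompts]
    mean = [sum(column) / len(embeddings) for column in zip(*embeddings)]
    return normalize(mean)


single = class_embedding("kingsnake", [SINGLE_TEMPLATE])
ensemble = class_embedding("kingsnake", TEMPLATES)
random_single = class_embedding("kingsnake", [random.Random(0).choice(TEMPLATES)])
print(len(single), len(ensemble), len(random_single))
```

Each variant yields one text embedding per class, so the choice of prompting strategy changes only the text side of the vision-language matching.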

Appendix D Visualization

As discussed in Sec. 4.3.2 of the main manuscript, decoupled training of the vision-language model and linear adapter largely boosts performance, especially for few-shot categories. We visualize the classification space of several few-shot categories using t-SNE [van2008visualizing] in Fig. 6. Sub-figure (a) clearly shows that decoupled training achieves much more distinct separation boundaries among classes, especially for easily confused ones such as kingsnake (purple), water snake (brown), and sea snake (pink).

(a) Decoupled training.
(b) Joint training.
(c) Legends
Figure 6: Comparison of decoupled and joint training of the vision-language model and linear adapter.