Debiased Learning from Naturally Imbalanced Pseudo-Labels for Zero-Shot and Semi-Supervised Learning

by   Xudong Wang, et al.

This work studies the bias issue of pseudo-labeling, a natural phenomenon that occurs widely but is often overlooked by prior research. Pseudo-labels are generated when a classifier trained on source data is transferred to unlabeled target data. We observe heavily long-tailed pseudo-labels when the semi-supervised learning model FixMatch predicts labels on the unlabeled set, even though the unlabeled data is curated to be balanced. Without intervention, the training model inherits the bias from the pseudo-labels and ends up being sub-optimal. To eliminate the model bias, we propose a simple yet effective method, DebiasMatch, comprising an adaptive debiasing module and an adaptive marginal loss. The strength of debiasing and the size of margins are adjusted automatically by making use of an online updated queue. Benchmarked on ImageNet-1K, DebiasMatch significantly outperforms the previous state of the art by more than 26% on semi-supervised learning (0.2% labeled data) and by 8.7% on zero-shot transfer learning.



1 Introduction

Real-world observations naturally come with a long-tailed distribution, and so do computer vision datasets [van2018inaturalist, gupta2019lvis] if they are not explicitly curated by humans. Imbalanced learning approaches [cao2019learning, kang2019decoupling, wang2021long] attempt to address this data bias in the dataset, preventing the model from being dominated by the head classes. Developing visual recognition systems capable of counteracting biases may have significant social impacts for humanity [mehrabi2021survey].

While prior research on long-tailed recognition focuses on bias introduced by the human data collection process, we discover another widespread source of data bias: the pseudo-labels produced by machine learning models. Pseudo-labels are generated by an existing classifier on an unlabeled corpus of images, and the pseudo-labels with high confidence are fed back to supervise the classifier as part of the training data. Pseudo-labeling is a well-established technique for semi-supervised learning [sohn2020fixmatch, liu2019deep], domain adaptation [na2021fixbi, kang2019contrastive], and transfer learning [arnold2007comparative] in general.

In this paper, we take a look at the bias issue arising from the pseudo-labeling process. We analyze the distribution of the pseudo-labels on semi-supervised learning and zero-shot transfer learning in Fig. 1. For zero-shot transfer learning, where the data comes from different source and target domains, we observe that a pretrained CLIP model [radford2021learning] produces highly imbalanced predictions on the curated and balanced ImageNet-1K dataset, even though the training set of CLIP is approximately balanced. Unfortunately, this phenomenon also exists even when the source and target data share the same domain in a semi-supervised setting. We observe that a semi-supervised learning method FixMatch [sohn2020fixmatch] also generates highly biased pseudo-labels although both the labeled set and the unlabeled set are manually curated.

To eliminate the effects of learning on biased pseudo-labels, we propose a simple yet effective method, DebiasMatch, which uses no prior knowledge about the true marginal class distribution. DebiasMatch comprises an adaptive debiasing module and an adaptive marginal loss. To dynamically remove the biases (the counterfactual) learned by a teacher model, we incorporate causality and produce debiased predictions through counterfactual reasoning. The dynamic property is supported by a non-parametric online dictionary, which is further used to enforce a dynamic class-specific margin through the adaptive marginal loss. The online dictionary automatically adjusts the strength of the debiasing mechanism as training progresses. In this way, the cause of biased pseudo-labels can be greatly counteracted.

We benchmark our approach on ImageNet-1K, CIFAR10, CIFAR10-LT, etc. DebiasMatch significantly outperforms the previous state of the art by more than 26% on ImageNet for semi-supervised learning with 0.2% labeled data, and by 8.7% on ImageNet for zero-shot transfer learning. Furthermore, DebiasMatch is a universal add-on to various pseudo-labeling methods and exhibits stronger robustness to domain shift.

The contributions of this paper can be summarized as:

  • We systematically investigate and confirm the existence of bias in pseudo-labeling methods.

  • We propose a simple yet effective method to successfully eliminate biases in pseudo-labels without leveraging any prior knowledge of the data distribution.

  • A new pipeline is developed to better exploit the knowledge learned from the visual+language pre-trained model (i.e. CLIP), which can be leveraged for ZSL/SSL tasks.

  • DebiasMatch achieves the state-of-the-art results for semi-supervised learning and zero-shot transfer learning, also being a universal add-on for other pseudo-labeling models.

2 Related Work

Semi-Supervised Learning (SSL) improves learning performance when labels are limited or expensive to obtain by exploring the latent patterns of unlabeled data. Consistency-based regularization methods [sajjadi2016regularization, tarvainen2017mean, miyato2018virtual, xie2020unsupervised] leverage the idea that after applying perturbations or adding adversarial noise to unlabeled data, the semantic distribution of the classifier output remains unchanged. Pseudo-labeling (or self-training) [lee2013pseudo, berthelot2019mixmatch, berthelot2019remixmatch, sohn2020fixmatch, xie2020self, li2021comatch] produces pseudo-labels for unlabeled data by keeping the samples on which the model’s confidence scores are above a threshold. The model is then trained on these pseudo-labeled and artificially-labeled data. For transfer learning [chen2020big, assran2021semi], the model is trained on unlabeled data with contrastive learning, followed by a supervised learning stage. The recently proposed DC-SSL [wang2021data] focuses on improving SSL performance by optimizing data selection from large-scale unlabeled data for annotation.

CReST [wei2021crest] resolves the long-tailed SSL problem via sampling pseudo-labeled unlabeled data with a class-rebalanced sampler to expand the labeled set, whereas our focus is on pseudo-labeling. In addition, CReST is limited to long-tailed data, while we are not limited to any class distribution, and can be applied to balanced, imbalanced, or even mixed distributions (imbalanced labeled-data and balanced unlabeled-data). DebiasMatch is orthogonal to CReST.

Although previous literature has achieved tremendous success in SSL, the implicitly biased pseudo-labeling issue in SSL is often ignored and has not been thoroughly analyzed, which, however, has great impact on the learning efficiency. The focus of this work is on proposing a simple yet effective debiasing module to eliminate this critical issue.

Zero-shot Classification refers to the problem setting where a zero-shot model classifies images from novel classes, which the model has not seen during training, into correct categories [romera2015embarrassingly, pennington2014glove, wang2019survey]. Several training strategies have been considered from a variety of viewpoints: 1) hand-engineered attributes [farhadi2009describing, lampert2013attribute]; 2) pretrained embeddings that incorporate prior knowledge in the form of semantic descriptions of classes [frome2013devise, socher2013zero]; 3) modeling relations between seen and unseen classes with knowledge graphs [kampffmeyer2019rethinking, nayak2020zero]; 4) learning generic visual concepts with vision-language models, allowing zero-shot transfer of the model to a variety of downstream classification tasks [brown2020language, radford2021learning]. [brown2020language] achieves current SOTA results on numerous zero-shot benchmarks.

Long-Tailed Recognition (LTR) aims to learn accurate “few-shot” models for classes with a few instances, without sacrificing the performance on “many-shot” classes, for which many instances are available. Existing approaches fall into three groups: 1) class re-balancing/re-weighting gives more importance to tail instances [cao2019learning, kang2019decoupling]; 2) multi-expert frameworks optimize multiple diversified experts on long-tailed data [wang2021long, zhou2020bbn]; 3) post-hoc adjustment approaches modify a trained model’s predictions according to prior knowledge of the class distribution [menon2021long] or pursue the direct causal effect by removing the paradoxical effects of the momentum [tang2020long].

In stark contrast to previous works on unbiased LTR [tang2020long, menon2021long], which either require prior knowledge of the class distribution or are applied post-hoc to a trained model, the proposed debiasing module does not require any prior knowledge and focuses on the biased pseudo-labels that arise when training on balanced source/target data.

Causal Inference is the undertaking of deriving counterfactual conclusions using only factual premises, in which the interventions among the variables are represented by causal graphical models [pearl2013direct, pearl2009causal, greenland1999confounding, rubin2019essential, rubin2005causal]. Causal inference [pearl2009causal] has been widely studied and applied in various tasks for the purpose of removing selection bias which is pervasive in almost all empirical studies [bareinboim2012controlling], eliminating the confounding effect using causal intervention [zhang2020causal], disentangling the desired direct effects with counterfactual reasoning [besserve2019counterfactuals], etc.

3 Analysis of Imbalanced Pseudo-labeling

In stark contrast to previous works, which concentrate on the biases caused by training on imbalanced data, our focus is on pseudo-labeling biases, which arise even when training on balanced data. In this section, we analyze this often-neglected issue hidden behind the tremendous success of FixMatch [sohn2020fixmatch] on SSL and CLIP [radford2021learning] on ZSL. Both require “pseudo-labeling” to transfer knowledge learned on source data to target data, as in Fig. 1, and both potentially suffer from the bias in “pseudo-labels”.

3.1 Biases in FixMatch

Whether such bias exists has been briefly answered in Section 1; here, we address it at length. Fig. 2 visualizes the probability distributions of FixMatch and our DebiasMatch, averaged over all unlabeled data at various training epochs. For FixMatch, surprisingly, even when labeled and unlabeled data are both curated (class-balanced), the pseudo-labels are still highly class-imbalanced, most notably at the early training stage. As training progresses, this situation persists: by the 50th and 100th epochs the imbalance has been reduced only slightly, if it has not deteriorated.

Through learning on implicitly imbalanced pseudo-labels, the bias is inherited by the student model, which in turn reinforces the teacher model’s prejudices. Once some confusing samples are wrongly pseudo-labeled, the mistake is almost impossible to correct; on the contrary, it may mislead the model and further amplify the existing bias, producing more wrong predictions. Without intervention, the model gets trapped in irreparable biases. Therefore, FixMatch is still producing biased predictions at later stages.

Figure 2: FixMatch’s pseudo-labels are highly imbalanced at various training stages although the unlabeled and labeled data it is trained on are class-balanced. The confidence score distributions of FixMatch and our DebiasMatch are averaged over all unlabeled data. The class index is sorted by the averaged confidence score. Experimented on CIFAR10 with 4 labeled instances per class. Panels: (a) Epoch 1, (b) Epoch 50, (c) Epoch 100.

Figure 3: The accuracy and recall of pseudo-labels produced by CLIP are not proportional to the number of predictions. The pseudo-labels produced by zero-shot prediction on 1.3M unlabeled ImageNet-1K instances with a pre-trained CLIP model are visualized in terms of the number of predictions, precision and recall per class. The class indexes are sorted by the number of predictions, consistently across the three plots.

Figure 4: CLIP’s zero-shot predictions are highly biased when experimenting on various benchmarks, including CIFAR10, CIFAR100, MNIST, Food101 and EuroSAT. The class index is sorted by the number of predictions per class on the training set of each benchmark. Panels: (a) CIFAR10, (b) CIFAR100, (d) Food101, (e) EuroSAT.

On the contrary, as in Fig. 2, although DebiasMatch is also troubled by the imbalanced pseudo-labels at the beginning, this situation can be greatly alleviated, and, eventually, we obtain an almost balanced distribution through dynamically debiasing the model based on the level of bias.

3.2 Biases in CLIP

As a leading method on various zero-shot learning benchmarks, CLIP is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million image-text pairs [radford2021learning], which is manually curated to be approximately query-balanced. However, such a powerful model is in fact highly biased on ImageNet, an issue hidden behind the tremendous success CLIP has achieved in terms of overall zero-shot prediction accuracy.

Beyond the imbalance, the more troublesome issue is that, as illustrated in Fig. 3, the precision and recall of many high-frequency classes are in fact much lower than those of quite a lot of medium-/few-shot classes. Therefore, thresholding the CLIP predictions based on the confidence score is necessary. But this leads to an even more imbalanced distribution (more analysis on this in the appendix). Empirically, we select a threshold of 0.95 for a better trade-off between imbalance ratio and precision/recall. Instances with a confidence score above this threshold are taken as the “pseudo-labels” of unlabeled data in target datasets to conduct experiments on zero-shot transfer learning.
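The per-class quantities discussed here (number of predictions, precision, recall) can be computed as in the following minimal sketch. The function name `per_class_stats` and the toy labels are illustrative assumptions, not part of the original analysis; the toy data is balanced in its ground truth but biased toward class 0 in its predictions.

```python
from collections import Counter

def per_class_stats(predictions, targets, num_classes):
    """Return (num_predictions, precision, recall) for every class."""
    pred_count = Counter(predictions)
    true_count = Counter(targets)
    correct = Counter(p for p, t in zip(predictions, targets) if p == t)
    stats = []
    for c in range(num_classes):
        precision = correct[c] / pred_count[c] if pred_count[c] else 0.0
        recall = correct[c] / true_count[c] if true_count[c] else 0.0
        stats.append((pred_count[c], precision, recall))
    return stats

# Toy example: three instances per class, predictions skewed toward class 0.
targets = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predictions = [0, 0, 0, 0, 0, 1, 0, 2, 2]
stats = per_class_stats(predictions, targets, num_classes=3)
for c, (count, precision, recall) in enumerate(stats):
    print(f"class {c}: predictions={count} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the most frequently predicted class also has the lowest precision, which mirrors the observation that prediction frequency is not a reliable proxy for pseudo-label quality.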

CLIP’s zero-shot predictions are also highly biased when experimenting on many other benchmarks, including EuroSAT [helber2019eurosat], MNIST [lecun1998mnist], CIFAR10 [krizhevsky2009learning], CIFAR100 [krizhevsky2009learning] and Food101 [bossard14] as shown in Fig. 4.

Figure 5: The blame for pseudo-label biases can be partially placed on inter-class confusion; for example, FixMatch often misclassifies “ship” as “plane”. Confusion matrices of FixMatch’s and our DebiasMatch’s pseudo-labels are visualized. Panels: (a) FixMatch, (b) DebiasMatch.

3.3 Inter-Class Correlations

To take a deeper look at the biased pseudo-labels and their causes, we analyze the inter-class correlation. We find that the blame for pseudo-label biases can be partially placed on inter-class confusion. We plot the confusion matrix of FixMatch’s pseudo-labels in Fig. 5(a), which shows that many instances in some categories tend to be misclassified into one or two specific negative classes; for example, “ship” and “bird” are often misclassified as “plane” and “frog” respectively. This issue of FixMatch is successfully addressed by the proposed DebiasMatch, as in Fig. 5(b), which will be introduced in the next section.

To better understand whether the strong inter-class correlations observed in FixMatch’s predictions on CIFAR10 can also be observed in CLIP’s zero-shot predictions on ImageNet, we first compute one image centroid for each class by taking the mean of the normalized image features, extracted by the image encoder of a pre-trained CLIP model, that belong to this class. The cosine similarities between the image centroids of classes with top-10/least-10 prediction frequency and their closest “confusing” classes are visualized in Fig. 6. The prediction confusions indicate image similarities at the class level. Fig. 6 shows that the low-frequency classes of ImageNet, with the least-10 number of CLIP predictions per class, usually have strong inter-class correlations, while most high-frequency classes have relatively smaller inter-class correlations.
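The centroid analysis can be sketched as follows, assuming image features have already been extracted. The helper names (`class_centroids`, `closest_classes`) and the 2-D toy features are hypothetical and stand in for real CLIP embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_centroids(features, labels):
    """Mean of the L2-normalized features that belong to each class."""
    sums, counts = {}, {}
    for feat, label in zip(features, labels):
        unit = normalize(feat)
        if label not in sums:
            sums[label] = [0.0] * len(unit)
            counts[label] = 0
        sums[label] = [s + u for s, u in zip(sums[label], unit)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def closest_classes(centroids, target, k=1):
    """The k classes whose centroids are most similar to the target's centroid."""
    sims = [(cosine(centroids[target], c), name)
            for name, c in centroids.items() if name != target]
    return [name for _, name in sorted(sims, reverse=True)[:k]]

# Toy 2-D features, made up for illustration only.
features = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.4], [0.7, 0.5], [0.0, 1.0], [0.1, 0.9]]
labels = ["ship", "ship", "plane", "plane", "frog", "frog"]
centroids = class_centroids(features, labels)
print(closest_classes(centroids, "ship"))
```

With real features, `closest_classes(centroids, c, k=9)` would give the 9 closest “negative” classes compared in Fig. 6.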

Figure 6: The low-frequency classes of ImageNet, with the least-10 number of CLIP predictions per class, usually have strong inter-class correlations, while the high-frequency classes are the opposite. We compare the cosine similarity between each class’s image embedding centroid and the embedding centroids of its 9 closest “negative” classes. Panels: (a) Top-10 classes, (b) Least-10 classes. (Better viewed zoomed in.)

4 DebiasMatch

4.1 Background

CLIP for zero-shot learning. At pre-training time, CLIP [radford2021learning] tries to leverage natural language supervision for image representation learning by maximizing the similarity between paired captions and visual images, meanwhile minimizing the similarity between texts and unpaired visual images. An image encoder and a text encoder are jointly optimized with aforementioned training objectives.

To produce pseudo-labels for unlabeled data, natural language prompting is used to enable zero-shot transfer to target datasets: CLIP uses the names or descriptions of the target dataset’s classes as the set of potential text pairings (e.g. “a photo of a dog”) and predicts the most probable class according to the cosine similarity of image-text pairs. Specifically, the feature embedding of the image and the feature embeddings of the set of possible texts are first computed by their respective encoders. The cosine similarities of these embeddings are then evaluated and normalized into a probability distribution via a softmax function.
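The prediction step can be sketched as below, assuming the image and text embeddings are already available. The function `zero_shot_predict`, the toy embeddings, and the scaling factor `logit_scale` are illustrative assumptions: the text above does not specify a temperature, and real embeddings would come from CLIP’s encoders.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def zero_shot_predict(image_emb, text_embs, logit_scale=100.0):
    """Return (predicted class index, class probabilities).

    logit_scale is an assumed temperature applied before the softmax.
    """
    sims = [logit_scale * cosine(image_emb, t) for t in text_embs]
    probs = softmax(sims)
    return probs.index(max(probs)), probs

# Hypothetical embeddings: one image and three class prompts.
image_emb = [0.9, 0.1, 0.0]
text_embs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
pred, probs = zero_shot_predict(image_emb, text_embs)
print(pred, [round(p, 3) for p in probs])
```

The maximum of `probs` is the confidence score that is later thresholded to decide whether the prediction is kept as a pseudo-label.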

FixMatch for semi-supervised learning. The core technique of the state-of-the-art SSL approach FixMatch [sohn2020fixmatch] is pseudo-labeling [lee2013pseudo, sohn2020fixmatch], which selects unlabeled samples with high confidence as training targets. Its impact can be viewed as a form of entropy minimization, which reduces the density of data points at the decision boundaries [sajjadi2016regularization, lee2013pseudo].

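Suppose we have a labeled dataset $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^{N_l}$ with $N_l$ labeled instances, and an unlabeled dataset $\mathcal{U} = \{u_i\}_{i=1}^{N_u}$ with $N_u$ instances. $x_i$ is the input instance and $y_i$ is a discrete annotated target with $C$ classes. $\mathcal{X}$ and $\mathcal{U}$ share the same semantic labels. The optimization objective of FixMatch consists of two terms, i.e., the supervised loss $\mathcal{L}_s$ applied to labeled data and an unsupervised loss $\mathcal{L}_u$ applied to unlabeled data. $\mathcal{L}_s$ is simply the vanilla cross-entropy loss between the model predictions and the ground truth: $\mathcal{L}_s = \frac{1}{B}\sum_{i=1}^{B} \mathcal{H}\big(y_i, p(\alpha(x_i))\big)$, where $\mathcal{H}$ denotes the cross-entropy, $p(\cdot)$ the predicted class distribution, $\alpha$ the weak augmentation following a standard flip-and-shift augmentation strategy, and $B$ the number of labeled instances in the minibatch.

For the unlabeled samples in the minibatch, the pseudo-labels are generated from the weakly-augmented unlabeled samples and are used to supervise the model prediction on the strongly-augmented unlabeled samples. FixMatch retains the samples whose largest output probability exceeds a confidence threshold $\tau$. Instances whose probability falls below the confidence threshold are regarded as unreliable samples and discarded. Formally, $\mathcal{L}_u$ can be formulated as:

$$\mathcal{L}_u = \frac{1}{\mu B}\sum_{i=1}^{\mu B} \mathbb{1}\big[\max(p_i) \geq \tau\big]\, \mathcal{H}\big(\hat{y}_i, p(\beta(u_i))\big), \qquad p_i = \mathrm{softmax}\big(f(\alpha(u_i))\big), \quad \hat{y}_i = \arg\max(p_i) \tag{1}$$

where $f(\cdot)$ denotes the logits of the model, $\beta$ is the strong augmentation like RandAugment [cubuk2020randaugment], and $\mu$ is the ratio that determines the relative size of labeled and unlabeled samples in the minibatch.

The pseudo-labeled data and labeled data are then jointly optimized in the same minibatch with $\mathcal{L} = \mathcal{L}_s + \lambda_u \mathcal{L}_u$, where $\lambda_u$ is a scalar hyperparameter.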
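As a minimal sketch of the unsupervised term described above: the weakly-augmented view yields a hard pseudo-label, low-confidence samples are masked out, and the strongly-augmented view is trained with cross-entropy. The function name, the toy logits, and the default threshold of 0.95 are assumptions made for illustration; the loss is averaged over the whole unlabeled minibatch, masked samples included.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fixmatch_unsup_loss(weak_logits, strong_logits, threshold=0.95):
    """Masked cross-entropy between hard pseudo-labels and strong-view predictions."""
    total = 0.0
    for weak, strong in zip(weak_logits, strong_logits):
        p_weak = softmax(weak)
        confidence = max(p_weak)
        if confidence < threshold:
            continue  # unreliable sample: contributes zero loss
        pseudo_label = p_weak.index(confidence)
        p_strong = softmax(strong)
        total += -math.log(p_strong[pseudo_label])
    return total / len(weak_logits)

# Toy minibatch: the first sample is confident, the second is discarded.
weak_logits = [[10.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
strong_logits = [[2.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
loss = fixmatch_unsup_loss(weak_logits, strong_logits)
print(round(loss, 4))
```

Because only confident samples pass the mask, any class imbalance among the confident pseudo-labels is fed directly back into training.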

Figure 7: Diagram of the proposed Adaptive Debiasing module and Adaptive Marginal Loss, added to the top of FixMatch.

4.2 Adaptive Debiasing

Our DebiasMatch approach aims at dynamically alleviating the influence of biased pseudo-labels on a student model, without leveraging any prior knowledge of the marginal class distribution, even when exposed to source and target data that follow different distributions. An adaptive debiasing module with counterfactual reasoning and an adaptive marginal loss are proposed to fulfill this goal, as described next.

Adaptive Debias w. Counterfactual Reasoning. In order to dynamically mitigate the impact of unwanted bias (the counterfactual), we incorporate causality to produce debiased predictions through counterfactual reasoning [holland1986statistics, pearl2009causal, pearl2009causality, pearl2013direct, pearl2018book].

Figure 8: Causal graph of debiasing with counterfactual reasoning.

Given the proposed causal graph in Fig. 8, we can delineate our goal for generating debiased predictions: the pursuit of the direct causal effect along with counterfactual reasoning, defined as the Controlled Direct Effect (CDE) [pearl2013direct, richiardi2013mediation, pearl2018book, greenland1999confounding, tang2020long]:

$$\mathrm{CDE}(Y) = \big[Y \mid do(X = x),\, do(M = m)\big] - \big[Y \mid do(X = x_0),\, do(M = m)\big] \tag{2}$$

i.e., the contrast between the counterfactual outcome if the individual were exposed at $X = x$ (with $do(\cdot)$ notation) and the counterfactual outcome if the same individual were exposed at $X = x_0$, with the mediator set to a fixed level $M = m$. CDE [pearl2013direct, greenland1999confounding] disentangles the model bias in a counterfactual world, where the model bias is considered as the indirect effect on $Y$ when $X = x_0$ but $M$ retains the value $m$ it takes when $X = x$.

However, measuring the counterfactual outcome by visiting all samples at every iteration is computationally very expensive; we therefore use the Approximated Controlled Direct Effect (ACDE) instead. ACDE assumes that the model bias does not change drastically, so the moving average of counterfactual outcomes can serve as an approximation to the counterfactual outcome. The debiased logits $\tilde{f}_i$ after counterfactual reasoning, which are later used to perform pseudo-labeling, i.e., replace $f(\alpha(u_i))$ in Eqn. 1 with $\tilde{f}_i$, can be simply calculated as:

$$\tilde{f}_i = f(\alpha(u_i)) - \lambda \log \hat{p}, \qquad \hat{p} = \frac{1}{Q}\sum_{j=1}^{Q} \bar{p}_j \tag{3}$$

where $\hat{p}$ measures the cumulative probabilities of the model for all classes, $Q$ is the queue size, $\bar{p}_j$ is the $j$-th entry of the queue, $f(\alpha(u_i))$ refers to the logits of the weakly-augmented unlabeled instance $u_i$, $p_i = \mathrm{softmax}(f(\alpha(u_i)))$ is the probability distribution obtained via a softmax function, and $\lambda$ denotes the debias factor, which controls the strength of the indirect effect.

Eqn. 3 can be associated with re-weighting methods in LTR, except that it is dynamically adaptive. After applying the softmax function to the debiased logits, the probability of class $k$ satisfies $\mathrm{softmax}(\tilde{f}_i)_k \propto e^{f_k(\alpha(u_i))} / \hat{p}_k^{\lambda}$. Therefore, a class with a larger averaged confidence score is re-weighted by a smaller scalar.

During training, as in Fig. 7, we maintain a dynamic dictionary as a queue of prototypes: the averaged probabilities $\bar{p}$ of the current mini-batch are enqueued, and the oldest are dequeued. All values in the queue are initialized to the uniform value $1/C$, with $C$ the number of classes, and the default size $Q$ is the number of iterations per epoch. We found that performance is largely insensitive to the queue size: using a quarter of the default size decreases performance only slightly (by less than 0.1%). Since the scale of logits is unstable, most notably at the early training stage, we use the probability distribution $p_i$ rather than the logit vector in the second term of Eqn. 3. A log function is applied to rescale $\hat{p}$ to match the magnitude of the logits. For simplicity, $\lambda$ is set to 0.5 for all experiments. If the debias factor is too large, it is hard for the model to fit the data, while too small a factor can barely eliminate the biases and ultimately impairs generalization.
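A minimal sketch of the queue-based debiasing step is given below, under stated assumptions: each queue entry is the mean probability vector of one minibatch, the queue starts from a uniform distribution, and the debiased logits subtract the debias factor times the log of the queue average. The class `AdaptiveDebias` and its method names are hypothetical, not the authors’ implementation.

```python
import math
from collections import deque

class AdaptiveDebias:
    """Tracks averaged predictions in a fixed-size queue and debiases logits."""

    def __init__(self, num_classes, queue_size, debias_factor=0.5):
        uniform = [1.0 / num_classes] * num_classes  # assumed initial value
        self.queue = deque([uniform] * queue_size, maxlen=queue_size)
        self.debias_factor = debias_factor

    def update(self, batch_probs):
        """Enqueue the mean probability vector of the current minibatch."""
        n = len(batch_probs)
        mean = [sum(col) / n for col in zip(*batch_probs)]
        self.queue.append(mean)  # maxlen drops the oldest entry automatically

    def average(self):
        """Per-class probability averaged over the queue."""
        q = len(self.queue)
        return [sum(col) / q for col in zip(*self.queue)]

    def debias(self, logits):
        """Subtract debias_factor * log(averaged probability) from each logit."""
        p_hat = self.average()
        return [f - self.debias_factor * math.log(p) for f, p in zip(logits, p_hat)]

debiaser = AdaptiveDebias(num_classes=3, queue_size=2)
before = debiaser.debias([1.0, 2.0, 3.0])  # uniform queue: constant shift only
debiaser.update([[0.8, 0.1, 0.1], [0.6, 0.2, 0.2]])  # model favors class 0
after = debiaser.debias([0.0, 0.0, 0.0])
print([round(v, 3) for v in after])
```

With a uniform queue the adjustment is the same constant for every class and leaves the softmax unchanged; once class 0 dominates the queue, its logit is raised less than the others, which counteracts the bias.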

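Adaptive Marginal Loss. As mentioned in Section 3, the biases in pseudo-labels may be partially caused by inter-class confusion. Motivated by this, we apply an adaptive marginal loss that demands a larger margin between hardly biased and highly biased classes, so that scores for dominant classes, towards which the model is highly biased, do not overwhelm the other categories. In addition, by enforcing a dynamic class-specific margin, inter-class confusion can be greatly counteracted, which is further empirically evidenced in Fig. 5. $\mathcal{L}_{\text{AML}}$ can be formulated as:

$$\mathcal{L}_{\text{AML}} = -\log \frac{e^{\,z_{\hat{y}_i} - \Delta_{\hat{y}_i}}}{\sum_{k=1}^{C} e^{\,z_k - \Delta_k}} \tag{4}$$

where $\Delta_j = \lambda \log\big(1/\hat{p}_j\big)$ for $j \in \{1, \dots, C\}$, $z = f(\beta(u_i))$, $\hat{y}_i$ is the pseudo-label and $\hat{p}$ is the queue-averaged probability of Eqn. 3. We use $\mathcal{L}_{\text{AML}}$ to replace $\mathcal{H}$ in Eqn. 1.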

The adaptive marginal loss can be understood as a soft version of LA with dynamically adapted margins, without leveraging any prior knowledge of the true marginal class distribution. More discussion of the distinctions and connections with LA and other methods can be found in Section 4.3.

We then get the final unsupervised loss by updating Eqn. 1 with Eqn. 3 and Eqn. 4, which can be easily reproduced with a few lines of code. The supervised loss is unchanged.
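One way to read the adaptive marginal loss as a “soft version of LA” is sketched below. This is an assumption about the functional form, not a confirmed implementation: each class margin is taken to be the debias factor times log(1 / averaged probability), so the adjusted logit is the raw logit plus the debias factor times the log of the averaged probability. The function name and toy values are hypothetical.

```python
import math

def adaptive_marginal_loss(strong_logits, pseudo_label, p_hat, debias_factor=0.5):
    """Cross-entropy on margin-adjusted logits.

    Assumed form: margin_k = debias_factor * log(1 / p_hat[k]), so that
    adjusted_k = logit_k - margin_k = logit_k + debias_factor * log(p_hat[k]).
    """
    adjusted = [z + debias_factor * math.log(p) for z, p in zip(strong_logits, p_hat)]
    m = max(adjusted)
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[pseudo_label]

logits = [1.0, 1.0]
uniform_loss = adaptive_marginal_loss(logits, pseudo_label=1, p_hat=[0.5, 0.5])
biased_loss = adaptive_marginal_loss(logits, pseudo_label=1, p_hat=[0.9, 0.1])
print(round(uniform_loss, 4), round(biased_loss, 4))
```

Under this reading, a uniform averaged distribution reduces the loss to plain cross-entropy, while a pseudo-label for a rarely predicted class incurs a larger loss until its logit clears the dominant class by a wider margin.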

4.3 Distinctions and Connections with DA, LA, LDAM and Causal Norm.

Table 1: Comparisons with previous works concentrating on resolving training data distribution issues, including LA [menon2021long], LDAM [cao2019learning], DA [berthelot2019remixmatch], Causal Norm [tang2020long] and our DebiasMatch, in key properties. Desired (undesired) properties are in green (red). The properties compared are:

  • Improves representation learning at training time

  • No prior knowledge on true marginal class distribution

  • Adaptive as the training progresses

  • Source and target data can come from varying distributions

  • Applicable to both imbalanced and balanced data

Table 1 summarizes the distinctions and connections with other related methods in key properties; Table 2 and Table 3 compare experimental results.

The use of a dictionary queue and counterfactual reasoning is crucial to the success of DebiasMatch: it frees our training objective from needing the true marginal class distribution as prior knowledge.

Furthermore, since more training samples per class do not necessarily lead to a higher model bias towards it, dynamically adjusting the margin of each class, rather than measuring margins based on the number of samples per class as in LA and LDAM, could better respect the degree of bias of each class. Unlike previous works, e.g., LA/LDAM and Causal Norm, that use fixed margins or adjustments, we argue that the degree of bias of each class should never be a fixed value, but is in a process of dynamic change. The cause of bias cannot be attributed to the data alone; it is the product of the interaction between model and data.

For imbalanced benchmarks, as illustrated in Table 2, integrating SOTA supervised long-tailed recognition method LA [menon2021long] into FixMatch [sohn2020fixmatch] could improve the performance on CIFAR10-LT SSL, however, empirically, the performance of FixMatch w. LA lags behind FixMatch with our debiasing modules. For balanced SSL benchmarks, most existing long-tailed methods that rely on true marginal class distribution are no longer applicable without major changes, since the adjustment or re-weighting vector is calculated based on the true class-distribution, i.e., balanced class distribution leads to identical treatment to all classes.
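The last point can be checked with a small sketch of post-hoc logit adjustment, in which each logit is shifted by the log of the class prior. The function `logit_adjust`, the scaling parameter `tau` and the toy values are assumptions for illustration. With a balanced prior the shift is the same constant for every class, so the softmax output does not change.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_adjust(logits, class_prior, tau=1.0):
    """Post-hoc adjustment: subtract tau * log(prior) from each logit."""
    return [f - tau * math.log(pi) for f, pi in zip(logits, class_prior)]

logits = [2.0, 1.5, 0.0]
balanced = logit_adjust(logits, [1 / 3, 1 / 3, 1 / 3])
long_tailed = logit_adjust(logits, [0.8, 0.15, 0.05])

plain_probs = softmax(logits)
balanced_probs = softmax(balanced)
long_tailed_probs = softmax(long_tailed)
print([round(p, 3) for p in balanced_probs])
print([round(p, 3) for p in long_tailed_probs])
```

Only the long-tailed prior changes the prediction here, which is why adjustments derived from the true class distribution have no effect on balanced benchmarks.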

Another often adopted method, distribution alignment (DA) [berthelot2019remixmatch], aims to encourage the marginal distribution of the model’s predictions to match the true marginal class distribution. However, DA is limited to scenarios where either the true marginal class distribution is available or source and target data are collected from the same distribution, which is too idealized for the real world. Furthermore, DA performs worse than our debiasing modules on all researched benchmarks.
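For comparison, distribution alignment as described in ReMixMatch can be sketched as follows: each prediction is scaled by the ratio of a target class distribution to a running average of the model’s predictions and then renormalized. The function name and toy values are illustrative; note that the target distribution must be supplied, which is exactly the prior knowledge DebiasMatch avoids.

```python
def distribution_alignment(probs, running_mean, target_dist):
    """Scale probs by target/running_mean per class, then renormalize."""
    scaled = [p * t / r for p, t, r in zip(probs, target_dist, running_mean)]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.7, 0.2, 0.1]
running_mean = [0.6, 0.3, 0.1]   # model over-predicts class 0
target = [1 / 3, 1 / 3, 1 / 3]   # requires knowing the true marginal
aligned = distribution_alignment(probs, running_mean, target)
print([round(p, 3) for p in aligned])
```

When the running average already matches the target, the prediction is returned unchanged.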

| Method | CIFAR10-LT γ=100, 1244 labels (10%) | CIFAR10-LT γ=100, 3726 labels (30%) | CIFAR10-LT γ=200, 1125 labels (10%) | CIFAR10-LT γ=200, 3365 labels (30%) | CIFAR10, 40 labels (0.08%) | CIFAR10, 80 labels (0.16%) | CIFAR10, 250 labels (2%) |
|---|---|---|---|---|---|---|---|
| UDA [xie2019unsupervised] | - | - | - | - | 71.0 ± 6.0 | - | 91.2 ± 1.1 |
| MixMatch [berthelot2019mixmatch] | 60.4 ± 2.2 | - | 54.5 ± 1.9 | - | 51.9 ± 11.8 | 80.8 ± 1.3 | 89.0 ± 0.9 |
| CReST w. DA [wei2021crest] | 75.9 ± 0.6 | 77.6 ± 0.9 | 64.1 ± 0.22 | 67.7 ± 0.8 | - | - | - |
| CReST+ w. DA [wei2021crest] | 78.1 ± 0.8 | 79.2 ± 0.2 | 67.7 ± 1.4 | 70.5 ± 0.6 | - | - | - |
| CoMatch w. SimCLR [li2021comatch, chen2020simple] | - | - | - | - | 92.6 ± 1.0 | 94.0 ± 0.3 | 95.1 ± 0.3 |
| FixMatch [sohn2020fixmatch] | 67.3 ± 1.2 | 73.1 ± 0.6 | 59.7 ± 0.6 | 67.7 ± 0.8 | 86.1 ± 3.5 | 92.1 ± 0.9 | 94.9 ± 0.7 |
| FixMatch w. DA w. LA [wei2021crest, sohn2020fixmatch, berthelot2019remixmatch, menon2021long] | 70.4 ± 2.9 | - | 62.4 ± 1.2 | - | - | - | - |
| FixMatch w. DA w. SimCLR [sohn2020fixmatch, berthelot2019remixmatch, chen2020simple] | - | - | - | - | 89.7 ± 4.6 | 93.3 ± 0.5 | 94.9 ± 0.7 |
| DebiasMatch (w. FixMatch) | 79.2 ± 1.0 | 80.6 ± 0.5 | 71.4 ± 2.0 | 74.1 ± 0.6 | 94.6 ± 1.3 | 95.2 ± 0.1 | 95.4 ± 0.1 |
| gains over the best FixMatch variant | +8.8 | +7.5 | +9.0 | +6.4 | +4.9 | +1.9 | +0.5 |

Table 2: Without any prior knowledge of the marginal class distribution of unlabeled/labeled data, the performance of DebiasMatch on both CIFAR and CIFAR-LT SSL benchmarks surpasses the previous state of the art, which is either designed for balanced data or meticulously tuned for long-tailed data. DebiasMatch is evaluated with the same set of hyper-parameters across all benchmarks. Best reported results of counterpart methods are copied from [li2021comatch], [sohn2020fixmatch] or [wei2021crest]. γ refers to the imbalance ratio of the training data. We report results averaged over 5 different folds.

4.4 DebiasMatch for T-ZSL and SSL

For transductive zero-shot learning (T-ZSL), we develop a new framework based on FixMatch and CLIP. It is designed to better exploit the semantic knowledge learned by a vision-language pre-trained model, i.e. CLIP, and to alleviate the domain shift problem when transferring this knowledge to downstream zero-shot transfer learning tasks.

Specifically, we again make use of the pseudo-labeling idea by leveraging the one-hot labels (i.e., the argmax of the model’s output) and retaining pseudo-labels whose largest class probability falls above a confidence threshold (0.95 by default, as selected in Section 3.2). These instances, along with their pseudo-labels, are then treated as “labeled data” in SSL.

We can then follow the original FixMatch pipeline to jointly optimize “labeled” and “unlabeled” data. To make a fair comparison with previous works and simplify the overall system, all other training recipes and settings are consistent with the original FixMatch + EMAN settings, including the model initialization. A diagram is in the appendix.

Because CLIP is highly biased, the vanilla FixMatch + CLIP framework under-performs the original CLIP zero-shot learning, which confirms our earlier hypothesis that learning from a biased model may further amplify existing bias and produce more wrong predictions. Therefore, we update the unsupervised loss with our Adaptive Marginal Loss, to alleviate the inter-class confusion, and with Adaptive Debias, to produce debiased pseudo-labels, as in Section 4.2.

For semi-supervised learning, the proposed DebiasMatch can be integrated into FixMatch, as in Fig. 7, by adopting the adaptive debiasing module and the adaptive marginal loss. To further boost SSL performance and exploit the power of a vision-language pre-trained model, we can also integrate CLIP into FixMatch/DebiasMatch during training by pseudo-labeling the discarded unlabeled instances with CLIP. Because instances on which CLIP is not confident may be noisy, only unlabeled instances with a CLIP confidence score greater than a threshold (set to 0.5 by default) are pseudo-labeled by CLIP. We can compute CLIP’s predictions on all training data once and store them in a dictionary, without re-predicting at every iteration. Therefore, the computational overhead introduced by the CLIP model is negligible. We only leverage CLIP on large-scale datasets, since using CLIP on low-resolution datasets such as CIFAR10 yields only marginal gains, partly due to the lack of scale-based data augmentation in CLIP [radford2021learning].
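The fallback logic can be sketched as below, assuming CLIP’s predictions were computed once and cached per instance index. The function `select_pseudo_label`, the cache layout `(label, confidence)` and the model threshold of 0.95 are hypothetical choices for illustration; only the CLIP threshold of 0.5 comes from the text.

```python
def select_pseudo_label(model_probs, clip_cache, index,
                        model_threshold=0.95, clip_threshold=0.5):
    """Return (label, source) for one unlabeled instance.

    clip_cache maps an instance index to a precomputed (label, confidence)
    pair, so CLIP is never re-run during training.
    """
    confidence = max(model_probs)
    if confidence >= model_threshold:
        return model_probs.index(confidence), "model"
    clip_label, clip_conf = clip_cache[index]
    if clip_conf > clip_threshold:
        return clip_label, "clip"
    return None, "discarded"

# Hypothetical cached CLIP predictions for three unlabeled instances.
clip_cache = {0: (2, 0.90), 1: (1, 0.70), 2: (0, 0.30)}
results = [
    select_pseudo_label([0.97, 0.02, 0.01], clip_cache, 0),  # model is confident
    select_pseudo_label([0.50, 0.30, 0.20], clip_cache, 1),  # falls back to CLIP
    select_pseudo_label([0.40, 0.35, 0.25], clip_cache, 2),  # neither is confident
]
print(results)
```

Since the cache is a plain lookup, the per-iteration cost of the CLIP branch is a dictionary access.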

Method B.S. #epochs Pre-train 1% 0.2%
top-1 top-5 top-1 top-5
FixMatch w. DA [sohn2020fixmatch, berthelot2019remixmatch] 4096 400 53.4 74.4 - -
FixMatch w. DA [sohn2020fixmatch, berthelot2019remixmatch] 4096 400 59.9 79.8 - -
FixMatch w. EMAN [sohn2020fixmatch, cai2021exponential] 384 50 60.9 82.5 43.6 64.6
DebiasMatch (w. FixMatch) 384 50 63.1 (+2.2) 83.6 (+1.1) 47.9 (+3.7) 69.6 (+5.0)
DebiasMatch (multi-views) 768 50 65.4 (+4.5) 85.2 (+2.7) 51.6 (+8.0) 73.0 (+8.4)
DebiasMatch (multi-views) 768 200 66.5 (+5.6) 85.6 (+3.1) 52.3 (+8.7) 73.5 (+8.9)
DebiasMatch (multi-views) 1536 300 67.1 (+6.2) 85.8 (+3.3) - -
DebiasMatch w. CLIP [radford2021learning] 384 50 69.1 (+8.2) 89.1 (+6.6) 68.2 (+24.6) 88.2 (+23.6)
DebiasMatch w. CLIP (multi-views) [radford2021learning] 768 50 70.9 (+10.0) 89.3 (+6.8) 69.6 (+26.0) 88.4 (+23.8)
CLIP (few-shot) [radford2021learning, zhou2021learning] 256 50 53.4 - 40.0 -
BYOL [grill2020bootstrap] 4096 50 53.2 78.4 - -
SwAV [caron2020unsupervised] 4096 50 53.9 78.5 - -
SimCLRv2 (+ Self-distillation) [chen2020big] 4096 400 60.0 79.8 - -
PAWS (multi-crops) [assran2021semi] 4096 50 66.5 - - -
CoMatch (multi-views) [li2021comatch] 1440 400 67.1 87.1 - -
Table 3: Semi-supervised learning on ImageNet-1K with various percentages of labeled samples. Results on DebiasMatch w. FixMatch are produced using the same data, codebase and training recipe as the baseline method. DA refers to distribution alignment [berthelot2019remixmatch]. All results are produced with a ResNet-50 backbone. : all models are unsupervised pre-trained for 800 epochs, except for PAWS [assran2021semi], which is pre-trained for 300 epochs with pseudo-labels generated non-parametrically. : reproduced with official codes. DebiasMatch wins on its merit of simplicity. : training Linear Probe CLIP with partially labeled data as in [radford2021learning], reported by [zhou2021learning].

5 Experiment

5.1 Semi-supervised Learning

Dataset. We perform comprehensive evaluations of DebiasMatch on multiple SSL benchmarks, including CIFAR10 [krizhevsky2009learning], long-tailed CIFAR10 (CIFAR10-LT) [wei2021crest], and ImageNet-1K [ILSVRC15], with varying amounts of labeled data. For the balanced benchmarks, since performance is almost the same as fully supervised learning when more than 2% of the data is labeled, we focus on the extremely low-shot settings, i.e., 0.08%/0.16%/2% on CIFAR10 and 1%/0.2% on ImageNet-1K. For the imbalanced benchmarks, we follow the settings in [wei2021crest] and evaluate DebiasMatch on CIFAR10-LT under various pre-defined imbalance ratios and percentages of labeled data, including 10% and 30%. More details on the datasets are in the appendix.

Setup. For all experiments on both the CIFAR10 and long-tailed CIFAR10 datasets, to be consistent with previous works [sohn2020fixmatch, wei2021crest], we use a WideResNet-28-2 backbone [he2016deep, zagoruyko2016wide]. To show the insensitivity of DebiasMatch to data distribution and hyper-parameter tuning, DebiasMatch uses the same set of hyper-parameters as FixMatch [sohn2020fixmatch], except that we train DebiasMatch for about half of the original training epochs. The only hyper-parameter of DebiasMatch is also kept fixed across experiments.

For experiments on ImageNet-1K, we use ResNet50 as the backbone network and follow the training recipes introduced in FixMatch w. EMAN [cai2021exponential], which is also the default baseline for all experiments on ImageNet-1K. The model is initialized with MoCo v2 + EMAN as in [cai2021exponential]. For the setting with multiple views, we apply two strong augmentations and two weak augmentations to each unlabeled sample in the mini-batch. Each strongly-augmented instance is paired with one weakly-augmented instance, and we jointly optimize the two pairs via pseudo-labeling as in the original setting of Fig. 7. Multiple views can speed up convergence and stabilize the training process, at the cost of about 35% more training time.
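The multi-view pairing can be illustrated with a plain FixMatch-style consistency loss averaged over the two (weak, strong) pairs. This sketch deliberately omits DebiasMatch's debiasing module and marginal loss; the function names, the toy probabilities, and the threshold value are assumptions for illustration only.

```python
import math
from typing import Sequence, Tuple

View = Sequence[float]  # predicted class probabilities for one augmented view


def pair_loss(weak_probs: View, strong_probs: View, threshold: float) -> float:
    """Cross-entropy of the strong view against the weak view's pseudo-label.

    The pair contributes zero loss when the weak view is not confident.
    """
    label = max(range(len(weak_probs)), key=lambda c: weak_probs[c])
    if weak_probs[label] <= threshold:
        return 0.0
    return -math.log(strong_probs[label])


def multi_view_loss(pairs: Sequence[Tuple[View, View]], threshold: float) -> float:
    """Average the per-pair loss over all (weak, strong) pairs of one sample."""
    return sum(pair_loss(w, s, threshold) for w, s in pairs) / len(pairs)


# One unlabeled sample with two (weak, strong) pairs; values are illustrative.
pairs = [
    ([0.96, 0.04], [0.50, 0.50]),  # confident weak view: contributes -log(0.5)
    ([0.60, 0.40], [0.90, 0.10]),  # unconfident weak view: contributes 0
]
loss = multi_view_loss(pairs, threshold=0.95)
```

Each pair is treated independently, so adding views increases compute roughly in proportion to the number of extra forward passes.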

DebiasMatch is simple yet effective. Table 2 and Table 3 show that DebiasMatch delivers state-of-the-art performance on all experimented benchmarks, outperforming current SOTAs by a large margin. Without using CLIP, DebiasMatch can outperform CoMatch on CIFAR, and is comparable to CoMatch on ImageNet-1K. DebiasMatch wins on its merit of simplicity. Leveraging the power of CLIP could significantly improve the performance of DebiasMatch, surpassing CoMatch by about 4% on ImageNet-1K SSL.

Labeled: LT; 10% labeled
Method | Unlabeled: LT | Unlabeled: Balanced
FixMatch [sohn2020fixmatch] | 62.3 ± 1.6 | 72.1 ± 2.3
DebiasMatch | 71.4 ± 2.0 (+9.1) | 83.5 ± 2.4 (+11.4)
Table 4: DebiasMatch consistently improves the performance of SSL when the unlabeled data is either distributed the same as the labeled data, i.e., long-tailed, or differently from the labeled data, i.e., balanced across semantics. We report results averaged over 5 folds.
Variant | FixMatch | MixMatch | UDA
Baseline | 89.7 ± 4.6 | 47.5 ± 11.5 | 29.1 ± 5.9
+ DebiasMatch | 94.6 ± 1.3 | 61.7 ± 6.1 | 43.2 ± 5.2
Table 5: DebiasMatch is a universal add-on. Top-1 accuracies of various SSL methods on CIFAR10, averaged over 5 folds, are compared. 4 instances per class are labeled.

DebiasMatch is agnostic to source/target data distribution. Table 2 shows that, for both CIFAR and long-tailed CIFAR SSL benchmarks, using a unified framework and the same set of hyper-parameters, DebiasMatch can surpass previous state-of-the-art methods, which are either designed for balanced data or meticulously tuned for long-tailed data. Furthermore, Table 4 illustrates that when labeled and unlabeled data follow different distributions, DebiasMatch produces an even greater gain (11.4%) over the baseline. These experiments are conducted on CIFAR10 SSL.

The less labeled data is available, the larger the gains observed in Table 2 and Table 3. The top-1 accuracy on CIFAR10 SSL with only 0.08% labeled data is greatly improved, almost eliminating the gap between fully-supervised and semi-supervised learning.

DebiasMatch is also a universal add-on as illustrated in Table 5. Various SSL methods, including MixMatch and UDA, can be enhanced by incorporating DebiasMatch to achieve consistent performance improvements.

Method #param Accuracy (%)
top-1 top-5
ConSE [norouzi2014zero] - 1.3 3.8
EXEM [changpinyo2017predicting] - 1.8 5.3
DGP [kampffmeyer2019rethinking] - 3.0 9.3
ZSL-KG [nayak2020zero] - 3.0 9.9
Visual N-Grams - 11.5 -
CLIP (prompt ensemble) [radford2021learning] 26M 59.6 -
(ours) CLIP + FixMatch 26M 55.7 80.6
(ours) CLIP + DebiasMatch 26M 68.3 (+8.7) 88.9 (+8.3)
CLIP (ViT-B/32) [radford2021learning] 398M 63.2 -
CLIP (ResNet50x4) [radford2021learning] 375M 65.8 -
CLIP (16-shot) [radford2021learning, zhou2021learning] 26M 53.4 -
CLIP + CoOp [zhou2021learning] 26M 60.9 -
Table 6: DebiasMatch delivers state-of-the-art zero-shot learning results on ImageNet-1K. The time cost of zero-shot training FixMatch + CLIP or DebiasMatch + CLIP for 100 epochs, without using any human annotations, is less than 0.01% of CLIP’s overall training time and is therefore negligible. CoOp and CLIP (16-shot) need to be fine-tuned with about 1.5% annotated data.
CLIP [radford2021learning] 59.6 75.6 41.6 66.6 41.1 81.1
CLIP [radford2021learning] 59.7 72.3 40.9 61.7 43.5 80.5
DebiasMatch 68.3 91.5 63.4 83.7 69.8 85.1
Table 7: DebiasMatch exhibits stronger robustness to domain shift when conducting zero-shot learning on various datasets. : reproduced with official codes. Experimented with ResNet-50 as a backbone network.

5.2 Transductive Zero-Shot Learning

Dataset. We evaluate the effectiveness of DebiasMatch in T-ZSL on ImageNet-1K [ILSVRC15]. EuroSAT [helber2019eurosat], MNIST [lecun1998mnist], CIFAR10 [krizhevsky2009learning], CIFAR100 [krizhevsky2009learning] and Food101 [bossard14] are also evaluated to show the robustness to domain shift.

Setup. T-ZSL assumes that unseen classes and their associated images are known during training, but not their correspondence. Following this setting, we do not use any semantic labels of the target data. We train CLIP + DebiasMatch and CLIP + FixMatch in the same way as FixMatch/DebiasMatch, except that the labeled data are “labeled” by CLIP rather than by a human annotator. Specifically, all unlabeled instances whose CLIP confidence score is greater than the threshold are pseudo-labeled by CLIP and treated as “labeled” data. A ResNet50 backbone and a threshold of 0.95 are used for all datasets. The same default hyper-parameters and training recipes as in FixMatch + EMAN are used for fair comparison. More details are in the appendix.

DebiasMatch delivers SOTA results on zero-shot transfer learning, even surpassing few-shot CLIP [radford2021learning] and CoOp [zhou2021learning], which are fine-tuned on partially human-labeled data. Moreover, DebiasMatch with a ResNet50 backbone can significantly outperform CLIP with 15× larger backbones, as shown in Table 6.

DebiasMatch is also more robust to domain shift. Without accessing any semantic labels, DebiasMatch exhibits stronger robustness to distribution shift than zero-shot CLIP, as shown in Table 7. DebiasMatch also obtains greater gains (more than 20%) on datasets with larger domain shifts; e.g., a 26.3% gain is obtained on the satellite image dataset EuroSAT [helber2019eurosat]. Fig. 4 and Table 7 show that DebiasMatch obtains larger gains on datasets with higher imbalance ratios, such as CIFAR100 and EuroSAT.

6 Summary

In this paper, we study the often-neglected issue of biased pseudo-labeling. We propose a simple yet effective method, DebiasMatch, that dynamically alleviates the influence of biased pseudo-labels on a student model, without leveraging any prior knowledge of the true data distribution. As a universal add-on, DebiasMatch delivers significantly better performance than previous state-of-the-art methods on both semi-supervised learning and transductive zero-shot learning tasks, and exhibits stronger robustness to domain shifts. We are committed to public code release and reproducible results.


7 Appendix

7.1 Details on Datasets and Implementations

We conduct experiments on several benchmarks to demonstrate the effectiveness and universality of DebiasMatch. Here we provide more details on the datasets and implementations for each benchmark:

CIFAR10 [krizhevsky2009learning]: The original version of CIFAR10 contains 50,000 training images and 10,000 validation images across 10 categories. For semi-supervised learning on CIFAR10, we conduct experiments with a varying number of labeled examples, from 40 to 250, following standard practice in previous works [berthelot2019mixmatch, berthelot2019remixmatch, sohn2020fixmatch, li2021comatch]. The results of each previous method reported in the paper are copied directly from the best reported results in MixMatch [berthelot2019mixmatch], ReMixMatch [berthelot2019remixmatch], FixMatch [sohn2020fixmatch], CoMatch [li2021comatch], etc.

We keep all hyper-parameters the same as FixMatch, except for the number of training steps. We use WideResNet-28-2 [he2016deep, zagoruyko2016wide] with 1.5M parameters as the backbone network for CIFAR10. The SGD optimizer with a Nesterov momentum of 0.9 is used for optimization. The learning rate is initialized as 0.03 and decayed with a cosine learning rate scheduler [loshchilov2016sgdr], which sets the learning rate at each training step to a cosine-decayed multiple of the initial learning rate. The total number of training steps corresponds to about 512 epochs, which is half the original number of FixMatch training steps. The model is trained with a mini-batch size of 512, which contains 64 labeled samples and 448 unlabeled samples, on one V100 GPU. As in previous works, an exponential moving average of model parameters is used to produce the final performance. The weight decay is set to 0.0005 for CIFAR10. Unless otherwise stated, the only independent hyperparameter of DebiasMatch is fixed in all experiments. Each method is tested under 5 different folds and we report the mean and the standard deviation of accuracy on the test set.
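The exact decay factor of the cosine schedule did not survive in the text above. The sketch below assumes the schedule used in the FixMatch paper, in which the learning rate at step k of K total steps is the initial rate multiplied by cos(7πk / (16K)). Treat that factor, and the function name, as assumptions rather than the confirmed DebiasMatch setting.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 0.03) -> float:
    """Cosine-decayed learning rate (FixMatch-style factor, assumed here).

    The rate starts at `base_lr` and decays to base_lr * cos(7*pi/16)
    at the final step, so it never reaches zero.
    """
    return base_lr * math.cos(7.0 * math.pi * step / (16.0 * total_steps))


# Mini-batch composition stated in the text: 64 labeled + 448 unlabeled.
LABELED_PER_BATCH = 64
UNLABELED_PER_BATCH = 448
BATCH_SIZE = LABELED_PER_BATCH + UNLABELED_PER_BATCH

# Learning rate at the start, midpoint, and end of a toy 1000-step run.
schedule = [cosine_lr(k, 1000) for k in (0, 500, 1000)]
```

Halving the total number of steps, as done here relative to FixMatch, compresses the same decay curve into a shorter run.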

CIFAR10-LT [krizhevsky2009learning, liu2019large, wei2021crest]: The long-tailed version of CIFAR10 follows an exponential decay in sample sizes across different categories. CIFAR10-LT is constructed by sampling a subset of CIFAR10 following a Pareto distribution with a pre-defined power value. Then, we select 10% or 30% of all CIFAR10-LT instances to construct the labeled set of the SSL benchmark, and the rest are treated as the unlabeled set. Each algorithm is tested under 5 different folds of labeled data and we report the mean and the standard deviation of accuracy on the test set. As in previous works, an exponential moving average of model parameters is used to produce the final performance.
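To make the exponential decay in class sizes concrete, the sketch below uses a convention that is common in long-tailed benchmarks: class i keeps n_max · γ^(−i/(C−1)) samples, so the largest class is γ times the size of the smallest. This exponential profile, the function name, and γ = 100 are assumptions for illustration; the exact sampling procedure and power value used in the paper are not recoverable from the text.

```python
from typing import List


def long_tailed_class_counts(
    n_max: int, num_classes: int, imbalance_ratio: float
) -> List[int]:
    """Per-class sample counts that decay exponentially from n_max.

    The ratio between the first (largest) and last (smallest) class
    equals `imbalance_ratio`.
    """
    counts = []
    for i in range(num_classes):
        frac = i / (num_classes - 1)  # 0 for the head class, 1 for the tail class
        counts.append(round(n_max * imbalance_ratio ** (-frac)))
    return counts


# CIFAR10 has 5,000 training images per class; gamma = 100 is illustrative.
counts = long_tailed_class_counts(n_max=5000, num_classes=10, imbalance_ratio=100.0)

# Size of the labeled split when 10% of all long-tailed instances are labeled.
num_labeled_10pct = round(0.10 * sum(counts))
```

A labeled subset drawn uniformly from such a set inherits the same long-tailed shape, which is the scenario evaluated in Table 4.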

To demonstrate the universality of DebiasMatch and its insensitivity to data distribution, we use the same hyperparameters and training recipes as for CIFAR10, and do not specifically adjust any hyperparameters when conducting experiments on the long-tailed SSL benchmarks.

ImageNet-1K [ILSVRC15]: ImageNet-1K is a curated dataset with approximately class-balanced data distribution, containing about 1.3M images for training and 50K images for validation.

For semi-supervised learning, we experiment on ImageNet-1K with varying amounts of labeled data, i.e., 0.2% and 1%. The FixMatch model is trained with a batch size of 64 (320) for labeled (unlabeled) images and an initial learning rate of 0.03. Following [cai2021exponential], we replace batch normalization (BN) layers with exponential moving average normalization (EMAN) layers in the teacher model. EMAN updates its statistics by an exponential moving average of the BN statistics of the student model. ResNet-50 is used as the default network and the default hyperparameters in the corresponding papers [cai2021exponential, sohn2020fixmatch] are applied. The model is initialized with the MoCo v2 + EMAN pre-trained model as in [cai2021exponential]. To make fair comparisons, we report results of FixMatch with EMAN as the baseline model, and all hyper-parameters of FixMatch with EMAN are left untouched unless noted otherwise.
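The EMAN update described above amounts to applying the same exponential moving average to the teacher's normalization statistics that is applied to its weights. The following is a minimal sketch of that update on plain lists; the momentum value and the variable names are hypothetical, and a real implementation would operate on the framework's BN buffers.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float], momentum: float) -> List[float]:
    """Move each teacher value toward the student value by (1 - momentum)."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Hypothetical per-channel statistics of one normalization layer.
teacher_mean, teacher_var = [1.0, 2.0], [1.0, 1.0]
student_mean, student_var = [0.0, 4.0], [3.0, 1.0]

# In EMAN the teacher does not compute batch statistics of its own;
# it tracks the student's BN statistics with the same EMA rule.
teacher_mean = ema_update(teacher_mean, student_mean, momentum=0.9)
teacher_var = ema_update(teacher_var, student_var, momentum=0.9)
```

With a momentum close to 1, the teacher statistics change slowly, which keeps the pseudo-label generator stable between iterations.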

Figure 9: Overall framework of transductive zero-shot learning with CLIP + DebiasMatch. CLIP + FixMatch can be realized by removing the debiasing module and replacing the marginal loss with a cross-entropy loss.

For zero-shot learning, no manual annotation is used in the training process. We train CLIP + DebiasMatch and CLIP + FixMatch following the same hyperparameters and training recipes as FixMatch with EMAN, except that the labeled data are “labeled” by CLIP rather than by a human annotator. Specifically, all unlabeled instances whose CLIP confidence score is greater than the threshold are pseudo-labeled by CLIP (with a ResNet50 backbone) and treated as “labeled” data. A threshold of 0.95 is used. The framework of transductive zero-shot learning with DebiasMatch is illustrated in Fig. 9.

For experiments on other benchmarks of zero-shot learning, including EuroSAT [helber2019eurosat], MNIST [lecun1998mnist], CIFAR10 [krizhevsky2009learning], CIFAR100 [krizhevsky2009learning] and Food101 [bossard14], we follow the training recipe of ImageNet-1K.

7.2 Ablation Study

In this section, we conduct additional ablation studies on the influence of the two components of DebiasMatch (Table 8) for SSL, DebiasMatch’s only hyperparameter (Table 9) for SSL, and CLIP’s confidence score threshold (Table 10) for T-ZSL.

Debiasing Marginal Loss CIFAR10 CIFAR10-LT
86.1 73.5
93.3 79.6
94.6 80.6
Table 8: Ablation study on the contribution of each component of DebiasMatch. Experiments are conducted on CIFAR10 and CIFAR10-LT SSL, in which 4 out of 5,000 samples per class are labeled for CIFAR10 and 30% of instances are labeled for CIFAR10-LT. Results averaged over 5 different folds are reported.

As shown in Table 8, the two components of DebiasMatch lead to significant improvements on both the CIFAR10 and CIFAR10-LT SSL benchmarks. The improvement obtained by introducing the marginal loss is somewhat smaller on the imbalanced benchmark than on the balanced benchmark.

Debias factor | 0.0 | 0.25 | 0.5 | 0.75 | 1.0 | 2.0
DebiasMatch | 73.5 | 79.5 | 80.6 | 80.5 | 80.5 | 77.7
Table 9: Ablation study on CIFAR10-LT semi-supervised learning with DebiasMatch under various weights of the debiasing module and marginal loss. 30% of samples are labeled. The model is identical to FixMatch when the debias factor is 0. Results averaged over 5 different folds are reported.

Table 9 illustrates the influence of the debias factor. When the debias factor is set to 0, DebiasMatch is identical to FixMatch. Adding the debiasing module and marginal loss improves the performance on CIFAR10-LT by more than 7% at the optimal choice of 0.5, which is marginally better than the default value of 1.0. However, there is a trade-off. If the debias factor is too large, it is hard for the model to fit the data, while too small a factor can barely eliminate the biases and ultimately impairs the generalization ability.
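The figures quoted above can be checked directly against Table 9. The snippet below only re-derives the gains from the reported accuracies; the dictionary and variable names are chosen for this example.

```python
# Accuracy (%) on CIFAR10-LT from Table 9, keyed by debias factor.
table9 = {0.0: 73.5, 0.25: 79.5, 0.5: 80.6, 0.75: 80.5, 1.0: 80.5, 2.0: 77.7}

# A debias factor of 0 corresponds to plain FixMatch, so it is the baseline.
baseline = table9[0.0]
gains = {factor: round(acc - baseline, 1) for factor, acc in table9.items()}

# Factor with the highest reported accuracy.
best_factor = max(table9, key=table9.get)
```

The accuracy is nearly flat between 0.5 and 1.0 and drops on either side, which is the trade-off described in the text.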

Figure 10 (panels: (a) threshold 0.0, (b) threshold 0.4, (c) threshold 0.95): A higher imbalance ratio is obtained when filtering CLIP’s zero-shot predictions with a larger threshold, analyzed on CLIP’s zero-shot predictions on 1.3M almost class-balanced ImageNet training samples. The per-class number of predictions (row 1), precision (row 2) and recall (row 3) of samples passing various confidence score thresholds are visualized. Zero-shot predictions are produced with an ensemble of 80 prompts and a ResNet50 backbone, using official codes.
Threshold | 0.2 | 0.4 | 0.6 | 0.8 | 0.9 | 0.95
DebiasMatch + CLIP | 55.9 | 63.2 | 66.2 | 67.1 | 67.7 | 67.7
Table 10: Ablation study on ImageNet-1K zero-shot learning with DebiasMatch + CLIP [radford2021learning] under various thresholds.

As illustrated in the main paper, the CLIP predictions are class-imbalanced. A natural question is therefore whether we can obtain more balanced predictions by filtering instances with a confidence threshold. Unfortunately, we cannot: on the contrary, a higher imbalance ratio is observed when filtering predictions with a larger threshold, as in Fig. 10. Furthermore, when filtering instances with a threshold of 0.95, more than 60 categories get zero predictions.

The dilemma is that a smaller threshold yields a smaller imbalance ratio, which is a desired property, but it also leads to lower precision, which introduces many outliers and misclassified samples. Therefore, a module that eliminates the biases captured by the CLIP model during pre-training on source data is needed to achieve good performance on target data.
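The effect of the threshold on class balance can be measured with a few lines of code. This sketch assumes the imbalance ratio is the largest per-class count divided by the smallest non-zero count, and reports empty classes separately; that definition, the function names, and the toy predictions are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Prediction = Tuple[int, float]  # (predicted class, confidence score)


def class_counts_after_threshold(
    preds: Sequence[Prediction], num_classes: int, threshold: float
) -> List[int]:
    """Count predictions per class among those passing the threshold."""
    counts = [0] * num_classes
    for label, conf in preds:
        if conf > threshold:
            counts[label] += 1
    return counts


def imbalance_summary(counts: Sequence[int]) -> Tuple[float, int]:
    """Return (max/min ratio over non-empty classes, number of empty classes)."""
    nonzero = [c for c in counts if c > 0]
    empty = len(counts) - len(nonzero)
    ratio = max(nonzero) / min(nonzero) if nonzero else float("nan")
    return ratio, empty


# Toy zero-shot predictions over 3 classes (illustrative values only).
preds = [
    (0, 0.99), (0, 0.97), (0, 0.96), (0, 0.60),
    (1, 0.96), (1, 0.70), (1, 0.50),
    (2, 0.80), (2, 0.45),
]
low = imbalance_summary(class_counts_after_threshold(preds, 3, 0.0))
high = imbalance_summary(class_counts_after_threshold(preds, 3, 0.95))
```

In the toy data, raising the threshold both increases the ratio among surviving classes and leaves one class with no predictions at all, mirroring the trend reported for Fig. 10.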

Table 10 shows that a threshold of 0.95 achieves the best performance on the ImageNet zero-shot learning task (tied with 0.9), which indicates that the high precision of the labeled data, obtained by using a high threshold, is essential for better performance on target data. At the same time, the proposed DebiasMatch greatly alleviates the higher imbalance ratio caused by a larger threshold, eventually obtaining performance gains of more than 10%.