Class-Imbalanced Semi-Supervised Learning

02/17/2020, by Minsung Hyun, et al.

Semi-Supervised Learning (SSL) has achieved great success in overcoming the difficulty of labeling and making full use of unlabeled data. However, SSL typically assumes that the numbers of samples in different classes are balanced, and many SSL algorithms show lower performance on datasets with imbalanced class distributions. In this paper, we introduce the task of class-imbalanced semi-supervised learning (CISSL), which refers to semi-supervised learning with class-imbalanced data. In doing so, we consider class imbalance in both the labeled and unlabeled sets. First, we analyze existing SSL methods in imbalanced environments and examine how class imbalance affects them. Then we propose Suppressed Consistency Loss (SCL), a regularization method robust to class imbalance. Our method shows better performance than conventional methods in the CISSL environment. In particular, the more severe the class imbalance and the smaller the size of the labeled data, the better our method performs.


1 Introduction

A large dataset with well-refined annotations is essential to the success of deep learning, and every time we encounter a new problem, we must annotate a whole dataset, which costs a lot of time and effort (russakovsky2015best; bearman2016s). To alleviate this annotation burden, many researchers have studied semi-supervised learning (SSL), which improves the performance of models by utilizing the information contained in unlabeled data (chapelle2009semi; verma2019interpolation; berthelot2019mixmatch).

However, SSL rests on a couple of main assumptions and shows excellent performance only in these limited settings. The first assumption is that the unlabeled data is in-distribution, i.e., the class types of the unlabeled data are the same as those of the labeled data (oliver2018realistic). The second is the assumption of a balanced class distribution, i.e., that each class has almost the same number of samples (li2011semi; stanescu2014semi). In this paper, we study the second assumption.

Figure 1: Toy examples. We experimented on the Two moons and Four spins datasets in CISSL settings for four algorithms: supervised learning, the Π-model (laine2016temporal), Mean Teacher (tarvainen2017mean), and SCL (ours). Panels (a)-(d) show Two moons and panels (e)-(h) show Four spins, each for supervised learning, the Π-model, Mean Teacher, and SCL (ours), respectively. The color represents the probability of the class with the highest confidence.

The class distribution of real-world data is not curated and is known to have long tails (kendall1946advanced). However, many studies have developed models based on well-refined, balanced data such as CIFAR (krizhevsky2009learning), SVHN (netzer2011reading), and ImageNet ILSVRC 2012 (deng2009imagenet). Training a model on imbalanced datasets causes performance degradation. Class-imbalanced learning (CIL) addresses this problem with various methods at the data level, the algorithm level, and hybrids of the two (krawczyk2016learning; johnson2019survey). However, to the best of our knowledge, studies on CIL have relied entirely on labeled datasets for training and have not considered the use of unlabeled data.

In this paper, we define a task, class-imbalanced semi-supervised learning (CISSL), and propose a suitable algorithm for it. By assuming class imbalance in both labeled and unlabeled data, CISSL relaxes the assumption of balanced class distribution in SSL. Also, it can be considered as a task of adding unlabeled data to CIL.

We analyzed existing SSL methods in the CISSL setting through toy examples. First, we found that class imbalance in CISSL disrupts the learning of existing SSL methods based on the 'cluster assumption', which asserts that each class has its own cluster in the latent space (chapelle2009semi). According to this assumption, the decision boundary traverses the low-density area of the latent space. With class imbalance, however, the decision boundary may be incorrectly formed and pass through the high-density area of the minor class, which degrades the SSL methods.

In Fig. 1(b) and 1(f), we can see that the decision boundary is skewed toward the minority class for the Π-model (laine2016temporal), a representative algorithm of consistency-regularization-based SSL, compared to that of supervised learning (Fig. 1(a), 1(e)).

Second, we observed that Mean Teacher (MT) (tarvainen2017mean) is more robust than the Π-model in CISSL settings. In Fig. 1(c) and 1(g), even though there is a class imbalance, MT maintains a relatively stable decision boundary. We show later that MT is more stable because it uses a conservative target for consistency regularization.

Based on these observations, we propose a regularization method using 'suppressed consistency loss' (SCL) for better performance in CISSL settings. SCL prevents the decision boundary in a minor class region from being smoothed too much in the wrong direction, as shown in Fig. 1(d) and 1(h). In Section 4, we discuss the role of SCL in more detail.

We also propose standard experimental settings for CISSL. We followed the SSL experiment settings, but, to be more realistic, we considered class imbalance in both labeled and unlabeled data. In this setting, we compared existing SSL and CIL methods to ours and found that our method with SCL performs better than the others. Furthermore, we applied SCL to the object detection problem and improved the performance of an existing SSL algorithm for object detection.

Our main contributions can be summarized as follows:
We defined the task of class-imbalanced semi-supervised learning, reflecting a more realistic situation, and suggested standard experimental settings.
We analyzed how the existing SSL methods work in CISSL settings through mathematical and experimental results.
We proposed Suppressed Consistency Loss, which works robustly for problems with class imbalance, and experimentally showed that our method improves performance.

2 Related Work

2.1 Semi-Supervised Learning

Semi-supervised learning is a learning method that tries to improve the performance of supervised learning, which is based only on labeled data ($\mathcal{D}_L$), by additionally using unlabeled data ($\mathcal{D}_U$). SSL approaches include methods based on self-training and generative models (lee2013pseudo; zhai2019s4l; goodfellow2014generative; radford2015unsupervised; dumoulin2016adversarially; lecouat2018manifold). In addition, consistency regularization, which pushes the decision boundary to low-density areas using unlabeled data, has shown good performance in semi-supervised learning (bachman2014learning; sajjadi2016regularization; laine2016temporal; verma2019interpolation). The objective function is composed of a supervised loss $L_{\mathrm{sup}}$ for $\mathcal{D}_L$ and a consistency regularization loss $L_{\mathrm{con}}$ for $\mathcal{D}_U$. As is typical in semi-supervised learning (laine2016temporal; oliver2018realistic), a ramp-up scheduling function $w(t)$ is used for stable training:

$L = L_{\mathrm{sup}} + w(t)\,L_{\mathrm{con}}$  (1)
$L_{\mathrm{con}} = \mathbb{E}_{x \in \mathcal{D}_U}\big[\,d\big(f(x+\xi;\theta),\ f(x+\xi';\theta')\big)\big]$  (2)

where $d(\cdot,\cdot)$ is a distance metric such as the $L_2$ distance or the KL-divergence, $\xi$ and $\xi'$ are perturbations to the input data, and $\theta$ and $\theta'$ are the parameters of the model and the target model, respectively. For a $K$-class classification problem, $f(x;\theta)$ is the output logit (class probability) for the input $x$. The Π-model (laine2016temporal) and Mean Teacher (MT) (tarvainen2017mean) are the representative algorithms using consistency regularization. The Π-model uses $\theta' = \theta$ as the target, and MT updates $\theta'$ with an EMA (exponential moving average) as follows:

$\theta'_t = \alpha\,\theta'_{t-1} + (1-\alpha)\,\theta_t$  (3)

From (3), MT can be considered as a temporal ensemble model in the parameter space.
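To make the two target choices concrete, the following PyTorch-style sketch shows one way the consistency term (2) and the EMA update (3) could be implemented. The function names, the Gaussian input perturbation, and the softmax-MSE distance are illustrative assumptions, not the paper's exact implementation.

    import torch
    import torch.nn.functional as F

    def consistency_loss(model, target_model, x_unlabeled, noise_std=0.15):
        """Consistency between two perturbed forward passes, cf. Eq. (2)."""
        x1 = x_unlabeled + noise_std * torch.randn_like(x_unlabeled)
        x2 = x_unlabeled + noise_std * torch.randn_like(x_unlabeled)
        student = model(x1)
        with torch.no_grad():                 # the target is not back-propagated through
            target = target_model(x2)
        return F.mse_loss(torch.softmax(student, dim=1),
                          torch.softmax(target, dim=1))

    def ema_update(target_model, model, alpha=0.95):
        """Mean Teacher target update, cf. Eq. (3)."""
        for p_t, p in zip(target_model.parameters(), model.parameters()):
            p_t.data.mul_(alpha).add_(p.data, alpha=1 - alpha)

    # Pi-model: pass the model itself as target_model (theta' = theta).
    # Mean Teacher: keep a separate copy and call ema_update() after every optimizer step.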

Beyond these, there are methods that optimize the direction of the perturbation (miyato2018virtual), regularize through graphs of mini-batch samples (luo2018smooth), and perturb inputs with mixup (zhang2017mixup; verma2019interpolation). In addition, consistency-based semi-supervised learning for object detection (CSD) applies SSL to object detection by devising classification and localization consistency losses (jeong2019consistency).

2.2 Class Imbalanced Learning

Class-imbalanced learning is a way to alleviate the performance degradation caused by class imbalance. buda2018systematic defined the class imbalance factor as the ratio between the numbers of samples of the most frequent and the least frequent classes. We refer to these as the major class and the minor class, respectively.

So far, there has been a variety of research on class imbalance problems (johnson2019survey). Data-level methods approach the problem by over-sampling minor classes or under-sampling major classes (masko2015impact; lee2016plankton; pouyanfar2018dynamic; buda2018systematic); these methods lengthen model training due to re-sampling. Algorithm-level methods re-weight the loss or propose a new loss without touching the sampling scheme (wang2016training; lin2017focal; wang2018predicting; khan2017cost; zhang2016training; wang2017learning; cui2019class; cao2019learning), and can be applied easily without affecting training time. There are also hybrids of both approaches (huang2016learning; ando2017deep; dong2019imbalanced).

In this paper, we applied three algorithm-level methods to the CISSL environment and compared their performance to the cross-entropy loss (CE), as sketched below:
(i) Normalized inverse-frequency weights, which weight the loss inversely proportional to the class frequency (IN) (cao2019learning).
(ii) Focal loss, which modulates the loss by putting less weight on samples that the model classifies easily (lin2017focal).
(iii) Class-balanced loss, which re-weights the loss in inverse proportion to the effective number of samples per class (CB) (cui2019class).
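As a rough illustration of how these three schemes assign weights, the sketch below computes per-class weights for IN and CB and the per-sample modulating factor of the focal loss; the normalization and the focal exponent (gamma = 2) are common defaults, not values taken from this paper.

    import numpy as np

    def inverse_frequency_weights(class_counts):
        """IN: per-class weights inversely proportional to class frequency, normalized."""
        w = 1.0 / np.asarray(class_counts, dtype=float)
        return w * len(class_counts) / w.sum()

    def class_balanced_weights(class_counts, beta=0.999):
        """CB: weights based on the effective number of samples per class (cui2019class)."""
        effective_num = 1.0 - np.power(beta, np.asarray(class_counts, dtype=float))
        w = (1.0 - beta) / effective_num
        return w * len(class_counts) / w.sum()

    def focal_factor(p_correct, gamma=2.0):
        """Focal loss modulating factor: down-weights samples the model classifies easily."""
        return (1.0 - p_correct) ** gamma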

Dataset / Class Type: Supervised | Π-model | Mean Teacher | MT+SCL (Ours)
Two moons / All: 25.06 ± 12.43 | 41.57 ± 8.82 | 34.99 ± 9.98 | 24.39 ± 15.14
Two moons / Major: 0.95 ± 1.24 | 0.00 ± 0.00 | 0.01 ± 0.03 | 0.06 ± 0.07
Two moons / Minor: 49.17 ± 24.74 | 83.14 ± 17.64 | 69.96 ± 19.98 | 48.01 ± 31.04
Four spins / All: 19.70 ± 6.70 | 17.79 ± 8.39 | 14.99 ± 8.46 | 10.91 ± 8.94
Four spins / Major: 7.83 ± 5.43 | 4.75 ± 3.74 | 4.76 ± 3.30 | 6.28 ± 3.26
Four spins / Minor: 49.39 ± 25.61 | 52.68 ± 31.17 | 43.29 ± 31.53 | 27.68 ± 36.48

Table 1:

Mean and standard deviation of validation error rates (%) for all, major, and minor classes in toy examples. We conducted 5 runs with different random seeds for class imbalance distribution.

3 Analysis of SSL under Class Imbalance

In this section, we look into the topography of the decision boundary to see how SSL algorithms work in the class-imbalanced environment. First, we compare supervised learning with SSL's representative algorithms, the Π-model (laine2016temporal) and Mean Teacher (tarvainen2017mean), via toy examples. Then we analyze why MT performs better in CISSL through a mathematical approach.

Figure 2: Suppressed Consistency Loss (SCL). Due to the imbalance in the data, the decision boundary tends to skew into the area of the minor class under consistency regularization. SCL weights the consistency loss inversely to the number of class samples and pushes the decision boundary toward low-density areas.
Figure 3: Types of unlabeled data imbalance. (a) Imbalance of the labeled data. (b) Uniform case: the number of samples is the same in all classes ($\gamma_u = 1$). (c) Half case: the imbalance factor of the unlabeled data is half that of the labeled data ($\gamma_u = \gamma_l / 2$). (d) Same case: the labeled and unlabeled imbalance factors are equal ($\gamma_u = \gamma_l$).

3.1 Toy examples

We trained each algorithm for 5,000 iterations on the two moons and four spins datasets with an imbalance factor of 5 for both the labeled and the unlabeled data. (The number of samples of class $c$ is set to $N_c = N_1 \cdot \gamma^{-(r_c - 1)/(C - 1)}$, where the rank $r_c$ of the major class is 1, and $\gamma$ and $C$ are the imbalance ratio and the number of classes.)
Fig. 1 represents the probability of the class with the highest confidence at each location. The region with relatively low probability, drawn closer to the dark red color, is the decision boundary in the figure.
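For reference, the per-class counts implied by this profile can be computed as in the short sketch below (illustrative code, not the authors'); with an imbalance factor of 5 it reproduces the {2, 10} and {1, 2, 3, 5} labeled splits described in Appendix A.

    def imbalanced_class_counts(n_major, imbalance_factor, num_classes):
        """Per-class counts N_c = N_1 * gamma^(-(r_c - 1) / (C - 1)), rank 1 = major class."""
        gamma, C = imbalance_factor, num_classes
        return [round(n_major * gamma ** (-(rank - 1) / (C - 1)))
                for rank in range(1, C + 1)]

    print(imbalanced_class_counts(10, 5, 2))  # -> [10, 2]       (two moons labeled split)
    print(imbalanced_class_counts(5, 5, 4))   # -> [5, 3, 2, 1]  (four spins labeled split)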

In Fig. 1(a) and 1(e), the decision boundary of supervised learning is very steep, and there are very high confidence areas far away from the decision boundary. With the SSL methods, unlabeled data smooth the decision boundary through consistency regularization (chapelle2009semi). In particular, the decision boundary smoothing is larger in the minor class area. We also found that the learning patterns of the Π-model and MT are different. Table 1 shows the validation error rates for the toy examples. Performance degradation is evident in the minor class. MT shows relatively better performance than the Π-model, although it is still inferior to supervised learning on two moons. Our method, which applies SCL to MT, achieves the best performance on both the two moons and four spins datasets.

3.2 Π-Model vs. Mean Teacher

We now analyze the results of Section 3.1. When consistency regularization is applied to supervised learning in Fig. 1(a) and 1(e), the influence of the samples around the decision boundary is considerable compared to that of the samples far away from the boundary, because, from (2), the model output does not change even if a small perturbation is added to the input in the region far from the decision boundary. As a result, consistency regularization smooths the decision boundary, as shown in Fig. 1(b) and 1(f).

According to the cluster assumption (chapelle2009semi), the decision boundary lies in the low-density area and far from the high-density area. However, in a problem with severe class imbalance, the decision boundary may penetrate a globally sparse but relatively high-density area of a minor class, as shown in the blue square in Fig. 2. By consistency regularization, decision boundary smoothing occurs in this area, and many samples in the minor class are misclassified.

Therefore, conventional consistency-regularization-based methods are generally expected to degrade the performance for the minor class. But we found that the severity of this phenomenon differs depending on the SSL algorithm. In Table 1, MT consistently performed better than the Π-model, especially for the minor class.

First, we analyze the behavior of MT in CISSL with the plain SGD optimizer. Consider the model parameter $\theta$, the learning rate $\eta$, and the objective function $J(\theta)$; the update rule of the SGD optimizer is:

$\theta_t = \theta_{t-1} - \eta\,\nabla_\theta J(\theta_{t-1})$  (4)

For an EMA decay factor $\alpha$ of MT, the current ($\theta_t$) and the target ($\theta'_t$) model parameters at the $t$-th iteration are

$\theta_t = \theta_0 - \eta \sum_{i=1}^{t} \nabla_\theta J(\theta_{i-1})$  (5)
$\theta'_t = \alpha^t \theta_0 + (1-\alpha)\sum_{i=1}^{t}\alpha^{t-i}\,\theta_i = \theta_0 - \eta(1-\alpha)\sum_{i=1}^{t}\alpha^{t-i}\sum_{j=1}^{i}\nabla_\theta J(\theta_{j-1})$  (6)

Comparing (5) and (6), we can see that $\theta'$, the target for the consistency loss in MT, is updated more slowly than the model parameter $\theta$ because of the EMA decay factor $\alpha$. On the other hand, in the Π-model, because $\theta' = \theta$, the target is updated faster than that of MT. As described in the supplementary material, we obtain the same result of a slower target update in MT for the SGD-with-momentum case used in our experiments.

Now we will check why MT performs better than the Π-model in the CISSL environment. Assume $\theta$ and $\theta'$ initially have the same value. In this case, the consistency losses of the Π-model and MT are

$J_{\Pi}(\theta) = d\big(f(x+\xi;\theta),\ f(x+\xi';\theta)\big), \qquad J_{MT}(\theta) = d\big(f(x+\xi;\theta),\ f(x+\xi';\theta')\big)$  (7)

If we use the $L_2$ distance for $d$ for simplicity, their derivatives become

$\nabla_\theta J_{\Pi} = 2\,\nabla_\theta f(x+\xi;\theta)^{\top}\big(f(x+\xi;\theta) - f(x+\xi';\theta)\big)$  (8)
$\nabla_\theta J_{MT} = 2\,\nabla_\theta f(x+\xi;\theta)^{\top}\big(f(x+\xi;\theta) - f(x+\xi';\theta')\big)$  (9)

Note that the target parameters in (7) are not included in the gradient calculation. Using the Taylor series expansion of $f(x+\xi';\theta')$ around $\theta$ and subtracting (8) from (9), we obtain

$\nabla_\theta J_{MT} - \nabla_\theta J_{\Pi} = 2\,\nabla_\theta f(x+\xi;\theta)^{\top}\big(f(x+\xi';\theta) - f(x+\xi';\theta')\big) \approx 2\,\nabla_\theta f(x+\xi;\theta)^{\top}\nabla_\theta f(x+\xi';\theta)\,(\theta - \theta')$  (10)

In the last line of (10), we assumed the gradients to be constant in a small area around $\theta$. When the sample is far away from the decision boundary, $f(x+\xi;\theta) \approx f(x+\xi';\theta')$ and MT and the Π-model behave the same; in the area near the decision boundary, however, the difference in (10) becomes proportional to $(\theta - \theta')$, and in the gradient descent step, compared to the Π-model, the negative gradient of MT (through $L_{\mathrm{con}}$ in (1)) prohibits $\theta$ from moving away from the target $\theta'$. In the CISSL environment, while the Π-model pushes the boundary towards the minor class, MT mitigates this by retaining the old target boundary like an ensemble of models.

In summary, the performance difference between the Π-model and MT in CISSL is due to the different targets of consistency regularization. The Π-model uses the current model ($\theta$) as the target. Therefore, it smooths the decision boundary regardless of whether the boundary passes through the high-density area of the minor class. Because the target is the same as the parameter, this smoothing degrades the model as the parameter updates are repeated. MT, on the other hand, targets a more conservative model ($\theta'$) than the current one. Note that since the target of MT is different from the current model, even if we reduce the learning rate of the model, it would still behave differently from MT. The conservative target has an ensemble effect with consistency regularization, so smoothing does not cause severe performance degradation.

Besides, we can explain why MT performs better than the Π-model in terms of batch sampling. In a mini-batch, minor class samples are drawn at a relatively low frequency. For this reason, the Π-model frequently updates the model without any minor class sample during consistency regularization, which distorts the decision boundary. On the other hand, since the target of MT is computed by EMA, it retains information about the minor class samples even when the current mini-batch contains none. Thus, MT learns with a more stable target than the Π-model.

4 Suppressed Consistency Loss

In Section 3, we found that the main performance degradation of SSL models in CISSL is due to consistency regularization in minor classes. With the intuition that we should suppress the consistency regularization of minor classes in CISSL, we propose a new loss term, suppressed consistency loss (SCL), as follows:

$L_{\mathrm{SCL}} = g(N_{\hat{c}})\;d\big(f(x+\xi;\theta),\ f(x+\xi';\theta')\big)$  (11)

Here, $g(\cdot)$ can be any function inversely proportional to $N_{\hat{c}}$, and we set it as

$g(N_{\hat{c}}) = \beta^{\,1 - N_{\hat{c}}/N_{\max}}$  (12)

where $\beta \in (0,1)$ is a suppression coefficient, $N_{\hat{c}}$ is the number of training samples of the class $\hat{c}$ predicted by the model, and $N_{\max}$ is the number of samples of the most frequent class. SCL weights the consistency loss exponentially inversely proportional to the number of samples in a class. In (11), $g$ is 1 for the most frequent class, where SCL works the same as the conventional consistency loss; for the least frequent class, the influence of the consistency loss is suppressed. In (12), the exponential decay is used to accommodate very high imbalance factors in our model. However, when the imbalance factor is not so high, a simple linear decay can also be used.
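A minimal PyTorch-style sketch of (11)-(12) is given below. It is an illustration under our assumptions, not the authors' released code; in particular, taking the predicted class from the teacher output and using a softmax-MSE distance are choices we make for concreteness.

    import torch
    import torch.nn.functional as F

    def suppressed_consistency_loss(student_logits, teacher_logits, class_counts, beta=0.5):
        """Sketch of Eqs. (11)-(12): weight the per-sample consistency loss by
        g(N_c) = beta ** (1 - N_c / N_max), where c is the predicted class."""
        counts = torch.as_tensor(class_counts, dtype=torch.float,
                                 device=student_logits.device)
        n_max = counts.max()
        # class predicted for each unlabeled sample (assumption: use the teacher output)
        pred_class = teacher_logits.argmax(dim=1)
        g = beta ** (1.0 - counts[pred_class] / n_max)   # 1 for the major class, ~beta for the rarest
        per_sample = F.mse_loss(torch.softmax(student_logits, dim=1),
                                torch.softmax(teacher_logits, dim=1),
                                reduction='none').mean(dim=1)
        return (g * per_sample).mean()                   # teacher_logits assumed detached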

Fig. 2 illustrates the effect of consistency regularization with SCL. When training with SCL, the decision boundary is smoothed weakly for the minor class and strongly for the major class. If the model is inaccurate, especially for the minor class, the boundary may pass through the high-density area. Then SCL limits the smoothing of the decision boundary towards the minor class cluster. On the other hand, when the model mispredicts actual minor class samples as the major class in the high-density area of the minor class, the decision boundary is smoothed with a higher weight. Consequently, SCL pushes the decision boundary to the low-density areas of the minor class and prevents performance degradation, as shown in Fig. 2.

Unlabeled Imbalance Type Uniform (γ_u = 1) Half (γ_u = γ_l/2) Same (γ_u = γ_l)
Imbalance factor (γ_l) 10 20 50 100 10 20 50 100 10 20 50 100
Supervised 23.03 27.49 33.15 36.71 23.03 27.49 33.15 36.71 23.03 27.49 33.15 36.71
Π-Model (laine2016temporal) 21.10 25.74 33.91 39.36 22.69 27.72 33.96 38.84 23.49 28.18 34.22 38.05
MT (tarvainen2017mean) 16.45 19.25 23.45 29.06 19.48 23.30 30.06 35.37 20.50 24.67 31.77 35.91
VAT + em (miyato2018virtual) 17.93 20.18 30.43 36.57 20.17 24.50 32.54 36.77 21.45 25.83 33.13 37.67
VAT + em + SNTG (luo2018smooth) 18.15 20.39 29.77 36.34 20.41 24.64 32.56 38.48 21.87 26.49 33.36 38.48
Pseudo-Label (lee2013pseudo) 19.33 24.34 34.18 39.59 21.23 26.78 34.12 39.72 22.73 27.50 34.91 38.69
ICT (verma2019interpolation) 18.01 20.52 30.18 38.33 19.53 23.90 31.09 37.36 19.96 25.63 33.56 36.85
MT+SCL (ours) 15.65 16.99 19.95 22.62 17.36 21.74 28.20 33.09 18.69 22.98 29.76 34.22
(a) CIFAR10

(b) SVHN
Unlabeled Imbalance Type Uniform (γ_u = 1) Half (γ_u = γ_l/2) Same (γ_u = γ_l)
Imbalance factor (γ_l) 10 20 50 100 10 20 50 100 10 20 50 100
Supervised 18.49 21.92 30.03 35.89 18.49 21.92 30.03 35.89 18.49 21.92 30.03 35.89
Π-Model (laine2016temporal) 11.74 13.42 21.63 28.59 12.96 16.70 24.02 33.73 13.46 17.13 26.53 33.71
MT (tarvainen2017mean) 6.52 6.75 7.60 8.94 7.25 8.85 12.19 17.23 8.62 9.29 15.16 21.01
VAT + em (miyato2018virtual) 6.81 7.70 13.84 29.15 8.99 11.59 18.95 30.44 10.39 13.62 21.49 32.39
VAT + em + SNTG (luo2018smooth) 93.30 93.30 14.88 93.30 93.30 93.30 20.60 93.30 93.30 93.30 23.52 93.30
Pseudo-Label (lee2013pseudo) 10.15 9.97 16.00 32.79 11.59 13.97 24.40 33.70 12.34 15.93 25.66 33.53
ICT (verma2019interpolation) 27.82 37.75 58.20 67.02 22.38 38.12 48.88 58.99 24.53 37.25 49.85 56.97
MT+SCL (ours) 6.52 7.11 7.70 8.56 7.54 9.29 11.46 18.63 8.22 10.04 15.48 20.39

Table 2: Test error rates (%) from experiments with 4k labeled samples in CIFAR10 and 1k labeled samples in SVHN, with imbalance factors {10, 20, 50, 100} under 3 different unlabeled imbalance types. VAT+em refers to Virtual Adversarial Training with Entropy Minimization. To improve legibility, the standard deviations are listed in the supplementary material. (Bold/Red/Blue: supervised, best and second best results for each column.)

5 Experiments

5.1 Datasets and implementation details

We conducted experiments using the CIFAR10 (krizhevsky2009learning) and SVHN (netzer2011reading) datasets in our proposed environments and followed the common practice in SSL and CIL (oliver2018realistic; johnson2019survey). We divided the training dataset into three parts: a labeled set, an unlabeled set, and a validation set. The labeled data is configured to have a class imbalance according to the CIL environment. We experimented with various numbers of labeled samples and imbalance factors. We considered three types of class imbalance in the unlabeled data: Same ($\gamma_u = \gamma_l$, where $\gamma_l$ and $\gamma_u$ are the imbalance factors of the labeled and unlabeled datasets), Uniform (uniform distribution, $\gamma_u = 1$), and Half ($\gamma_u = \gamma_l / 2$). The size of the unlabeled dataset changes depending on the unlabeled data imbalance type because of the limited size of the dataset used. For fair experiments, we set the size of the unlabeled set based on the Same case, which uses the lowest number of unlabeled samples. Fig. 3 shows the three imbalance types with imbalance factor 10. The validation data is constructed as in (oliver2018realistic).

In all experiments, we used the Wide-Resnet-28-2 model (zagoruyko2016wide). It has enough capacity to show the performance improvement of SSL objectively (oliver2018realistic), and it is used in recent SSL methods (berthelot2019mixmatch; verma2019interpolation). We adopted the optimizer and learning rate from (verma2019interpolation), and the other hyper-parameters are set under a similar setting to (oliver2018realistic) (https://github.com/brain-research/realistic-ssl-evaluation). In our experiments, we used a third-party implementation (https://github.com/perrying/realistic-ssl-evaluation-pytorch). All test error rates are from five independent runs with different random seeds. Experiments with different random seeds shuffle the frequency ranking of each class while the imbalance factor is kept constant, and thus cover a variety of cases.

Unlabeled Imbalance Type Uniform (γ_u = 1) Half (γ_u = γ_l/2) Same (γ_u = γ_l)
Re-weighting Method CE IN Focal CB CE IN Focal CB CE IN Focal CB
Supervised 36.71 35.73 36.80 37.19 36.71 35.73 36.80 37.19 36.71 35.73 36.80 37.19
Π-Model (laine2016temporal) 39.36 36.90 39.89 39.20 38.84 37.82 38.28 37.51 38.05 37.18 38.10 37.34
MT (tarvainen2017mean) 29.06 24.00 30.73 29.50 35.37 33.08 34.45 35.04 35.91 34.01 35.65 35.17
VAT + em (miyato2018virtual) 36.57 31.34 37.51 36.78 36.77 36.20 37.62 38.13 37.67 36.91 37.88 37.64
VAT + em + SNTG (luo2018smooth) 36.34 33.03 37.78 36.26 38.48 35.90 38.01 37.44 38.48 36.99 37.71 37.53
Pseudo-Label (lee2013pseudo) 39.59 30.62 37.90 39.38 39.72 37.36 38.77 39.12 38.69 36.84 38.92 38.52
MT+SCL (ours) 22.62 21.59 23.44 22.93 33.09 31.63 34.09 33.14 34.22 32.09 33.93 34.66
(a) CIFAR10

(b) SVHN
Unlabeled Imbalance Type Uniform (γ_u = 1) Half (γ_u = γ_l/2) Same (γ_u = γ_l)
Re-weighting Method CE IN Focal CB CE IN Focal CB CE IN Focal CB
Supervised 35.89 34.60 35.45 35.30 35.89 34.60 35.45 35.30 35.89 34.60 35.45 35.30
Π-Model (laine2016temporal) 28.59 26.72 30.15 27.99 33.73 29.60 31.67 31.12 33.71 31.70 31.70 33.17
MT (tarvainen2017mean) 8.94 6.82 8.66 7.86 17.23 17.02 16.20 16.68 21.01 20.80 20.01 21.77
VAT + em (miyato2018virtual) 29.15 20.26 28.09 29.37 30.44 27.44 28.62 29.65 32.39 29.18 30.62 30.93
VAT + em + SNTG (luo2018smooth) 93.30 93.30 93.30 93.30 93.30 93.30 93.30 93.30 93.30 93.30 93.30 93.30
Pseudo-Label (lee2013pseudo) 32.79 13.48 35.07 34.38 33.70 31.83 32.79 32.83 33.53 31.62 33.63 34.55
MT+SCL (ours) 8.56 8.48 7.74 9.02 18.63 18.59 16.34 16.44 20.39 20.51 20.95 21.06
Table 3: Test error rates (%) from experiments with different re-weighting methods in CIFAR10 and SVHN. We compared inverse and normalization (IN), focal loss (FOCAL), and class-balanced loss (CB) to conventional cross-entropy loss (CE).
(Red: best results for each row with same unlabeled data imbalance.)
Unlabeled Imbalance Type Uniform (γ_u = 1) Half (γ_u = γ_l/2) Same (γ_u = γ_l)
# labeled data 1000 2000 4000 1000 2000 4000 1000 2000 4000
Supervised 54.24 45.81 36.71 54.24 45.81 36.71 54.24 45.81 36.71
Π-Model (laine2016temporal) 56.82 48.55 39.36 55.99 47.74 38.84 55.42 46.83 38.05
MT (tarvainen2017mean) 51.74 38.94 29.06 51.61 42.47 35.37 52.58 44.11 35.91
VAT + em (miyato2018virtual) 53.68 48.47 36.57 53.60 45.20 36.77 53.62 44.77 37.67
VAT + em + SNTG (luo2018smooth) 54.53 48.23 36.34 55.59 45.37 38.48 55.55 45.99 38.48
Pseudo-Label (lee2013pseudo) 58.19 50.01 39.59 57.05 49.42 39.72 56.68 48.45 38.69
ICT (verma2019interpolation) 57.10 48.25 38.33 56.02 47.60 37.36 55.10 47.19 36.85
MT+SCL (ours) 42.84 28.69 22.62 45.72 39.97 33.09 48.00 40.69 34.22
(a) CIFAR10

(b) SVHN
Unlabeled Imbalance Type Uniform (γ_u = 1) Half (γ_u = γ_l/2) Same (γ_u = γ_l)
# labeled data 250 500 1000 250 500 1000 250 500 1000
Supervised 61.31 47.98 35.89 61.31 47.98 35.89 61.31 47.98 35.89
Π-Model (laine2016temporal) 54.51 39.49 28.59 54.14 42.20 33.73 54.10 43.89 33.71
MT (tarvainen2017mean) 38.32 18.14 8.94 41.72 23.33 17.23 42.42 28.86 21.01
VAT + em (miyato2018virtual) 64.67 44.04 29.15 58.01 41.15 30.44 55.03 42.44 32.39
VAT + em + SNTG (luo2018smooth) 65.02 93.30 93.30 57.94 93.30 93.30 54.19 93.30 93.30
Pseudo-Label (lee2013pseudo) 63.16 49.78 32.79 54.79 44.32 33.70 56.83 43.71 33.53
ICT (verma2019interpolation) 86.54 77.64 67.02 84.22 72.21 58.99 85.15 71.19 56.97
MT+SCL (ours) 26.25 15.31 8.56 33.44 22.26 18.63 35.32 27.13 20.39
Table 4: Test error rates (%) from experiments with imbalance factor 100 and the number of labeled data {1k, 2k, 4k} under 3 different unlabeled imbalance types in CIFAR10 and the number of labeled data {250, 500, 1k} under 3 different unlabeled imbalance types in SVHN. Details are the same as Table 2.

5.2 Baselines to CISSL

We conducted experiments on how existing methods in the fields of SSL and CIL perform in our defined CISSL environment and used them as the baselines for our research. We experimented with 4k labeled samples for CIFAR10 and 1k for SVHN, both with imbalance factor 100.

A. Comparison of Semi-supervised Learning Methods
The columns with imbalance factor 100 in Table 2(a) show the results of applying the SSL methods to the CISSL problem in CIFAR10. Except for MT, almost all SSL methods are inferior to supervised learning. Even when the unlabeled data imbalance is mitigated to the Uniform case, there is no improvement in the performance of SSL methods other than MT.

The columns with imbalance factor 100 in Table 2(b) show the same experiment for SVHN. Most SSL methods perform better when the unlabeled data imbalance is lower, i.e., in the Uniform case rather than the Same case. Notably, ICT showed a performance degradation of over 20%p compared to supervised learning, and SNTG even failed to train a model.

From these experimental results and the analysis in Section 3, we used MT, which performed best in all experiments, as our baseline.

B. Comparison of Class Imbalanced Learning Methods
We carried out ablation experiments comparing three types of CIL re-weighting against the cross-entropy loss (CE): Inverse and Normalization (IN), Focal loss, and Class-Balanced (CB) loss. We applied these CIL methods only to the supervised loss $L_{\mathrm{sup}}$ in (1) and did not apply them to the unlabeled data, because the class labels of the unlabeled data are unknown. In this experiment, we excluded ICT because CIL methods cannot be applied to its mixup-based supervised loss.

Table 3(a) shows the results for CIFAR10 with imbalance factor 100 and a 4k labeled dataset. First of all, not all CIL methods always improve performance over CE: as the unlabeled data imbalance and the SSL method change, their relative performance with respect to CE differs. In this table, IN shows the best performance in all cases except the Half case of the Π-model.

Table 3(b) shows the results for SVHN with imbalance factor 100 and a 1k labeled dataset. Unlike the previous CIFAR10 results, IN does not always dominate; the best algorithm differs according to the unlabeled data imbalance type for MT and our method. Since we do not know the unlabeled data imbalance beforehand, choosing a specific CIL algorithm does not guarantee a performance boost, so we used the most common cross-entropy loss as our baseline. In addition, SNTG failed to learn, as in Table 2(b).

5.3 Unlabeled data Imbalance

A. Comparison of Imbalance Factor
We experimented with changing the imbalance factor while keeping the number of labeled samples fixed. We experimented on CIFAR10 and SVHN with imbalance factors {10, 20, 50, 100}. The results are shown in Table 2(a) and 2(b), respectively.

In Table 2(a), the higher the imbalance factor, the lower the performance. Supervised learning with imbalance factor 100 reaches a 36.71% error rate, 13.68%p higher than supervised learning with imbalance factor 10. For small imbalance factors, SSL algorithms generally improve performance even though the unlabeled data has the same imbalance as the labeled data. As the imbalance factor increases, on the other hand, some SSL algorithms fall below supervised learning. Mean Teacher is the only SSL algorithm that improves performance with imbalance factor 100 in the Same case. This means that general SSL algorithms do not account for imbalance in the unlabeled data. The proposed SCL, however, robustly improves the performance in various imbalance settings. Notably, it shows remarkable improvement in the Uniform case compared to the other SSL algorithms.

Table 2(b) shows similar results. However, there is no big performance difference between MT and our method, because SVHN is easier to classify than CIFAR10. For SVHN, SNTG and ICT perform worse than supervised learning; their model training appears to fail. We discuss this phenomenon in Section 6.

B. Comparison of The Number of Labeled Samples
We experimented with keeping the imbalance factor fixed while changing the number of labeled samples. We set the number of labeled samples to {1k, 2k, 4k} in CIFAR10 and {250, 500, 1k} in SVHN. The results for CIFAR10 and SVHN are shown in Table 4(a) and 4(b), respectively.

In Table 4(a), the smaller the labeled set, the lower the performance. In particular, when the size of the labeled data is 1k, most of the algorithms are weaker than supervised learning, while our method still improves performance. This result indicates that consistency regularization is not effective when the baseline classifier performs poorly.

Table 4(b) shows a similar tendency between the size of the labeled data and performance. As in Section 5.3.A, SNTG and ICT again perform worse than supervised learning.

Algorithm: Supervised | CSD (cls) | CSD (cls+loc) | CSD+SCL (Ours, cls) | CSD+SCL (Ours, cls+loc)
mAP: 70.2 | 71.7 | 72.3 | 72.07 ± 0.15 | 72.60 ± 0.10

Table 5: Detection results for PASCAL VOC2007 testset. cls and loc are the consistency loss for classification and localization, respectively. We trained SSD300 on VOC07(L)+VOC12(U). Our result is from three independent trials.

5.4 Object detection

We followed the CSD (jeong2019consistency) experiment settings and used the SSD300 model (liu2016ssd). We used the PASCAL VOC2007 trainval dataset as the labeled data and the PASCAL VOC2012 trainval dataset as the unlabeled data, and evaluated on the PASCAL VOC2007 test set. In this experiment, the imbalance factor of the labeled dataset is about 20. We applied our algorithm only to the classification consistency loss of CSD. The details are in the supplementary material.

In Table 5, supervised learning using VOC2007 shows 70.2 mAP. CSD with only the classification consistency loss is 1.5%p higher than the supervised baseline, and the full CSD shows a 2.1%p improvement. When SCL is applied to CSD, our method shows an additional improvement.

6 Discussion

The reason the existing SSL methods did not perform well in the CISSL environment is that they do not consider data imbalance. This fact has several implications. First, for deep learning to become practical, we need to work on harsher benchmarks. We experimented on datasets that relax the equal class distribution assumption of SSL, and our method yielded meaningful results. Second, we should avoid developing domain-specific algorithms that work well only under certain conditions. SNTG (luo2018smooth) and ICT (verma2019interpolation) are very good algorithms in existing SSL settings; in our experiments, however, neither was robust to class imbalance. Finally, we need to focus not only on the performance improvement of a model but also on its causes. An in-depth analysis of the causes of the observed phenomena provides intuition about the direction of future research. Concerning this, we discussed aspects of learning in the CISSL environment in Section 3.

7 Conclusion

In this paper, we proposed Class-Imbalanced Semi-Supervised Learning, which goes one step beyond the limitations of SSL. We theoretically analyzed how the existing SSL methods work in CISSL. Based on the intuition obtained from this analysis, we proposed Suppressed Consistency Loss, which works robustly in CISSL. Our experiments show that our method works well in the CISSL environment compared to the existing SSL and CIL methods, and demonstrate its feasibility for object detection. However, our research has focused on relatively small datasets; applying CISSL to larger datasets is left as future work.

References

A Toy Examples Details

We generated the two moons and four spins datasets. We split the training set into labeled and unlabeled data with imbalance factor 5; the class distribution of the unlabeled data follows the Same case. The size of the labeled data is 12 ({2, 10} samples per class) in two moons and 11 ({1, 2, 3, 5} samples per class) in four spins. The size of the unlabeled data is 3,000 in two moons and 2,658 in four spins. Both datasets have 6,000 validation samples. We trained each algorithm for 5,000 iterations. The model is a 3-layer network; the optimizer is SGD with momentum 0.9, and the learning rate is 0.1, decayed by a factor of 0.2 at 4,000 iterations.

In this experiment, we set the function of the suppressed consistency loss to the simple linear decay $g(N_c) = N_c / N_{\max}$ for simplicity, where $N_c$ is the number of training samples of the class predicted by the model and $N_{\max}$ is the number of samples of the most frequent class.

B Π-Model vs. Mean Teacher Details

B.1 SGD Case

Consider the model parameter $\theta$, the learning rate $\eta$, and the objective function $J(\theta)$; the update rule of the SGD optimizer is:

$\theta_t = \theta_{t-1} - \eta\,\nabla_\theta J(\theta_{t-1})$  (13)

For an EMA decay factor $\alpha$ of MT, the current ($\theta_t$) and the target ($\theta'_t$) model parameters at the $t$-th iteration are

$\theta_t = \theta_0 - \eta \sum_{i=1}^{t} \nabla_\theta J(\theta_{i-1}), \qquad \theta'_t = \alpha^t \theta_0 + (1-\alpha)\sum_{i=1}^{t}\alpha^{t-i}\,\theta_i$  (14)

B.2 SGD with Momentum Case

Let $v$ be the momentum buffer of the SGD optimizer and $\mu$ the decay factor of the momentum; the other parameters are the same as in Section B.1. The update rules are

$v_t = \mu\, v_{t-1} + \nabla_\theta J(\theta_{t-1})$  (15)
$\theta_t = \theta_{t-1} - \eta\, v_t$  (16)

The current model parameter ($\theta_t$) at the $t$-th iteration is

$\theta_t = \theta_0 - \eta \sum_{i=1}^{t} v_i = \theta_0 - \frac{\eta}{1-\mu} \sum_{j=1}^{t} \big(1 - \mu^{t-j+1}\big)\,\nabla_\theta J(\theta_{j-1})$  (17)

And then the target model ($\theta'_t$) at the $t$-th iteration is

$\theta'_t = \alpha^t \theta_0 + (1-\alpha)\sum_{i=1}^{t}\alpha^{t-i}\,\theta_i = \theta_0 - \frac{\eta(1-\alpha)}{1-\mu} \sum_{j=1}^{t} \sum_{i=j}^{t} \alpha^{t-i}\big(1 - \mu^{i-j+1}\big)\,\nabla_\theta J(\theta_{j-1})$  (18)

The difference of the coefficients of $\nabla_\theta J(\theta_{j-1})$ between (17) and (18) is

$\frac{\eta}{1-\mu} \Big[ \big(1 - \mu^{t-j+1}\big) - (1-\alpha)\sum_{i=j}^{t}\alpha^{t-i}\big(1 - \mu^{i-j+1}\big) \Big]$  (19)

Fig. 4 plots the difference between the first term (from the current model $\theta$) and the second term (from the target model $\theta'$) in (19) when the momentum decay $\mu$ is 0.9 and the EMA decay $\alpha$ is 0.95, as in our experiments. We can see that the difference between the two terms is always greater than or equal to zero. Therefore, $\theta'$ is a more conservative target than $\theta$ under the SGD with momentum optimizer as well.
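As a quick sanity check of this claim (a toy simulation, not the paper's code), the snippet below runs SGD with the momentum form of (15)-(16) on a constant-gradient objective with μ = 0.9 and α = 0.95 and tracks the EMA target; the target consistently lags the current parameter.

    # Toy check: the EMA target trails the current parameter under momentum SGD.
    mu, alpha, lr = 0.9, 0.95, 0.1
    theta, target, velocity = 0.0, 0.0, 0.0      # start from the same value
    for step in range(1, 21):
        grad = 1.0                                        # constant gradient (illustrative objective)
        velocity = mu * velocity + grad                   # Eq. (15)
        theta -= lr * velocity                            # Eq. (16)
        target = alpha * target + (1 - alpha) * theta     # EMA target update, cf. Eq. (3)
        if step % 5 == 0:
            print(f"step {step:2d}: theta = {theta:8.3f}, target = {target:8.3f}, lag = {target - theta:6.3f}")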

Figure 4: The difference between the first term (from the current model $\theta$) and the second term (from the target model $\theta'$) in (19) at each iteration, when the momentum decay $\mu$ is 0.9 and the EMA decay $\alpha$ is 0.95. The trend from iteration 0 to 499,000 is almost the same as in the early iterations shown in this figure.

C Experiment Settings

C.1 Dataset Details

We followed standard settings for CIFAR10 and SVHN. For CIFAR10, there are 50,000 training images and 10,000 test images. We split the training set into a 45,000 train set and a 5,000 validation set for experiments. The validation set consists of the same size per class. We applied global contrast normalization and ZCA normalization. For data augmentation, we used random horizontal flipping, random cropping by padding 2 pixels each side of the image, and added Gaussian noise with standard deviation 0.15 to each pixel.

For SVHN, there are 73,257 training images and 26,032 test images. We split the training set into a 65,931 train set and a 7,326 validation set for experiments. The validation set consists of the same size per class. We applied global contrast normalization and ZCA normalization. For data augmentation, we used random cropping by padding 2 pixels on each side of the image only.

In our main experiments, we split the training set into the labeled set and the unlabeled set. The size of the unlabeled set changes depending on the unlabeled data imbalance type because of the limited size of the training dataset. For fair experiments, we set the size of the unlabeled set based on the Same case, which uses the lowest number of unlabeled samples. The sizes of the unlabeled data are described in Table 6(a) and 6(b).

Imbalance Factor / # of labeled data: 1k 2k 4k
100 10166 9166 7166
50 - - 8596
20 - - 11322
10 - - 14389
(a) CIFAR10
Imbalance Factor / # of labeled data: 250 500 1k
100 16109 15858 15360
50 - - 17455
20 - - 21449
10 - - 25943
(b) SVHN
Table 6: Number of unlabeled data in CIFAR10 and SVHN according to imbalance factor and number of labeled data.

C.2 Implementation Details

In all experiments, we use the Wide-Resnet-28-2 model (zagoruyko2016wide). Following the settings from verma2019interpolation, we set SGD with Nesterov momentum as our optimizer and adopted the cosine annealing technique (loshchilov2016sgdr). Detailed hyperparameters for the experiments are described in Table 7.

Shared: Training iterations 500k; Consistency ramp-up iterations 200k; Initial learning rate 0.1; Cosine learning rate ramp-down iterations 600k; Weight decay; Momentum 0.9
Π-Model: Max consistency coefficient 20
Mean Teacher: Max consistency coefficient 8; Exponential moving average decay factor 0.95
VAT + em: Max consistency coefficient 0.3; VAT ε (CIFAR10) 6.0; VAT ε (SVHN) 1.0; Entropy penalty multiplier 0.06
VAT + em + SNTG: (as for VAT + em)
Pseudo-Label: Max consistency coefficient 1.0; Pseudo-label threshold 0.95
ICT: Max consistency coefficient 100; Exponential moving average decay factor 0.999; ICT α (mixup) 1.0
Suppressed Consistency Loss (Ours): Suppression coefficient (β) 0.5

Table 7: Hyperparameters for shared environment and each SSL algorithms and our method used in the experiments.

D Detailed Experiment Results

We omitted the standard deviations from the tables in the main paper for readability; Tables 8, 9, and 10 report them. Since we used five different seeds in each experiment, the class frequency ranking varies from seed to seed, which changes the baseline performance. As a result, the standard deviation in our experiments is larger than that from random weight initialization alone.

Unlabeled Imbalance Type Uniform (γ_u = 1) Half (γ_u = γ_l/2) Same (γ_u = γ_l)
Imbalance factor (γ_l) 10 20 50 100 10 20 50 100 10 20 50 100
Supervised 23.03 1.65 27.49 1.87 33.15 2.83 36.71 2.79 23.03 1.65 27.49 1.87 33.15 2.83 36.71 2.79 23.03 1.65 27.49 1.87 33.15 2.83 36.71 2.79
Π-Model (laine2016temporal) 21.1 1.93 25.74 3.82 33.91 3.49 39.36 4.47 22.69 1.99 27.72 4.17 33.96 3.19 38.84 4.17 23.49 2.69 28.18 3.31 34.22 3.19 38.05 3.19
MT (tarvainen2017mean) 16.45 1.24 19.25 1.99 23.45 3.30 29.06 5.13 19.48 1.96 23.30 2.85 30.06 3.92 35.37 3.52 20.50 2.58 24.67 2.60 31.77 3.79 35.91 3.70
VAT + em (miyato2018virtual) 17.93 2.12 20.18 3.18 30.43 6.18 36.57 7.20 20.17 2.49 24.50 2.88 32.54 4.61 36.77 3.75 21.45 1.88 25.83 3.21 33.13 3.67 37.67 2.20
VAT + em + SNTG (luo2018smooth) 18.15 2.25 20.39 2.46 29.77 6.71 36.34 6.54 20.41 2.47 24.64 2.79 32.56 4.05 38.48 3.87 21.87 2.65 26.49 3.07 33.36 3.86 38.48 2.96
Pseudo-Label (lee2013pseudo) 19.33 1.36 24.34 4.06 34.18 4.23 39.59 5.70 21.23 2.52 26.78 3.41 34.12 4.51 39.72 4.20 22.73 2.74 27.50 3.39 34.91 2.57 38.69 4.28
ICT (verma2019interpolation) 18.01 1.28 20.52 1.91 30.18 2.63 38.33 4.72 19.53 1.41 23.90 2.07 31.09 3.35 37.36 2.02 19.96 1.05 25.63 1.91 33.56 3.14 36.85 3.44
MT+SCL (ours) 15.65 0.69 16.99 1.31 19.95 2.36 22.62 3.54 17.36 1.17 21.74 2.15 28.20 3.09 33.09 3.63 18.69 2.09 22.98 2.33 29.76 2.40 34.22 3.50
(a) CIFAR10

(b) SVHN
Unlabeled Imbalance Type Uniform (γ_u = 1) Half (γ_u = γ_l/2) Same (γ_u = γ_l)
Imbalance factor (γ_l) 10 20 50 100 10 20 50 100 10 20 50 100
Supervised 18.49 1.90 21.92 2.28 30.03 3.83 35.89 6.39 18.49 1.90 21.92 2.28 30.03 3.83 35.89 6.39 18.49 1.90 21.92 2.28 30.03 3.83 35.89 6.39
Π-Model (laine2016temporal) 11.74 1.80 13.42 2.14 21.63 4.58 28.59 7.90 12.96 1.26 16.70 4.01 24.02 3.97 33.73 7.52 13.46 2.13 17.13 2.61 26.53 3.43 33.71 8.17
MT (tarvainen2017mean) 6.52 0.55 6.75 0.49 7.60 1.85 8.94 2.12 7.25 0.38 8.85 1.10 12.19 1.68 17.23 2.44 8.62 1.29 9.29 1.41 15.16 3.54 21.01 4.14
VAT + em (miyato2018virtual) 6.81 0.30 7.70 0.87 13.84 6.17 29.15 4.80 8.99 1.21 11.59 1.85 18.95 4.49 30.44 6.95 10.39 0.96 13.62 2 21.49 5.27 32.39 8.25
VAT + em + SNTG (luo2018smooth) 93.30 0.00 93.30 0.00 14.88 5.38 93.30 0.00 93.30 0.00 93.30 0.00 20.60 5.73 93.30 0.00 93.30 0.00 93.30 0.00 23.52 7.34 93.30 0.00
Pseudo-Label (lee2013pseudo) 10.15 0.87 9.97 1.45 16.00 4.34 32.79 7.62 11.59 1.96 13.97 2.11 24.40 4.46 33.70 6.89 12.34 1.79 15.93 2.43 25.66 5.95 33.53 8.08
ICT (verma2019interpolation) 27.82 5.12 37.75 7.50 58.20 9.38 67.02 12.66 22.38 7.89 38.12 6.57 48.88 8.33 58.99 7.35 24.53 12.62 37.25 8.22 49.85 7.74 56.97 10.28
MT+SCL (ours) 6.52 0.53 7.11 0.30 7.70 0.73 8.56 0.86 7.54 0.50 9.29 1.48 11.46 1.21 18.63 3.97 8.22 0.89 10.04 0.82 15.48 2.29 20.39 4.10
Table 8: Test error rates (%) and standard deviations from experiments with 4k labeled samples in CIFAR10 and 1k labeled samples in SVHN, with imbalance factors {10, 20, 50, 100} under 3 different unlabeled imbalance types. VAT+em refers to Virtual Adversarial Training with Entropy Minimization.
Unlabeled Imbalance Type Uniform (γ_u = 1) Half (γ_u = γ_l/2) Same (γ_u = γ_l)
# labeled data 1000 2000 4000 1000 2000 4000 1000 2000 4000
Supervised 54.24 2.08 45.81 3.00 36.71 2.79 54.24 2.08 45.81 3 36.71 2.79 54.24 2.08 45.81 3 36.71 2.79
Π-Model (laine2016temporal) 56.82 3.63 48.55 4.26 39.36 4.47 55.99 2.79 47.74 3.82 38.84 4.17 55.42 1.47 46.83 3.29 38.05 3.19
MT (tarvainen2017mean) 51.74 5.33 38.94 7.67 29.06 5.13 51.61 4.58 42.47 5.66 35.37 3.52 52.58 3.23 44.11 4.16 35.91 3.7
VAT + em (miyato2018virtual) 53.68 4.21 48.47 3.66 36.57 7.20 53.60 3.18 45.20 4.84 36.77 3.75 53.62 3.14 44.77 2.82 37.67 2.2
VAT + em + SNTG (luo2018smooth) 54.53 3.09 48.23 3.50 36.34 6.54 55.59 3.54 45.37 3.25 38.48 3.87 55.55 2.47 45.99 4.29 38.48 2.96
Pseudo-Label (lee2013pseudo) 58.19 1.73 50.01 2.78 39.59 5.70 57.05 2.86 49.42 2.42 39.72 4.20 56.68 3.00 48.45 3 38.69 4.28
ICT (verma2019interpolation) 57.10 4.56 48.25 1.53 38.33 4.72 56.02 3.37 47.60 2.39 37.36 2.02 55.10 2.68 47.19 1.57 36.85 3.44
MT+SCL (ours) 42.84 2.88 28.69 4.55 22.62 3.54 45.72 2.62 39.97 2.58 33.09 3.63 48.00 3.41 40.69 3.41 34.22 3.50
(a) CIFAR10

(b) SVHN
Unlabeled Imbalance Type Uniform (γ_u = 1) Half (γ_u = γ_l/2) Same (γ_u = γ_l)
# labeled data 250 500 1000 250 500 1000 250 500 1000
Supervised 61.31 8.05 47.98 7.08 35.89 6.39 61.31 8.05 47.98 7.08 35.89 6.39 61.31 8.05 47.98 7.08 35.89 6.39
Π-Model (laine2016temporal) 54.51 8.43 39.49 9.10 28.59 7.90 54.14 9.11 42.20 7.73 33.73 7.52 54.10 10.07 43.89 9.68 33.71 8.17
MT (tarvainen2017mean) 38.32 11.69 18.14 10.47 8.94 2.12 41.72 9.34 23.33 10.78 17.23 2.44 42.42 9.74 28.86 10.57 21.01 4.14
VAT + em (miyato2018virtual) 64.67 6.41 44.04 8.88 29.15 4.80 58.01 10.44 41.15 10.23 30.44 6.95 55.03 8.85 42.44 8.04 32.39 8.25
VAT + em + SNTG (luo2018smooth) 65.02 5.23 93.30 0.00 93.30 0.00 57.94 8.87 93.30 0.00 93.30 0.00 54.19 9.43 93.30 0.00 93.30 0.00
Pseudo-Label (lee2013pseudo) 63.16 6.60 49.78 7.92 32.79 7.62 54.79 10.42 44.32 7.29 33.70 6.89 56.83 8.81 43.71 5.76 33.53 8.08
ICT (verma2019interpolation) 86.54 5.27 77.64 1.94 67.02 12.66 84.22 7.46 72.21 9.43 58.99 7.35 85.15 5.89 71.19 7.68 56.97 10.28
MT+SCL (ours) 26.25 12.84 15.31 6.81 8.56 0.86 33.44 10.81 22.26 6.22 18.63 3.97 35.32 10.59 27.13 10.58 20.39 4.10
Table 9: Test error rates (%) and standard deviation from experiments with imbalance factor 100 and the number of labeled data {1k, 2k, 4k} in CIFAR10, and the number of labeled data {250, 500, 1k} in SVHN under 3 different unlabeled imbalance types.
Unlabeled Imbalance Type Uniform (γ_u = 1) Half (γ_u = γ_l/2) Same (γ_u = γ_l)
Re-weighting Method CE IN Focal CB CE IN Focal CB CE IN Focal CB
Supervised 36.71 2.79 35.73 2.39 36.8 2.41 37.19 2.88 36.71 2.79 35.73 2.39 36.80 2.41 37.19 2.88 36.71 2.79 35.73 2.39 36.8 2.41 37.19 2.88
Π-Model (laine2016temporal) 39.36 4.47 36.90 3.56 39.89 4.23 39.20 4.38 38.84 4.17 37.82 1.55 38.28 3.37 37.51 1.43 38.05 3.19 37.18 2.12 38.1 3.37 37.34 2.48
MT (tarvainen2017mean) 29.06 5.13 24 3.17 30.73 6.2 29.5 5.69 35.37 3.52 33.08 2.78 34.45 3.89 35.04 3.42 35.91 3.70 34.01 2.85 35.65 2.64 35.17 3.77
VAT + em (miyato2018virtual) 36.57 7.20 31.34 5.01 37.51 7.56 36.78 8.40 36.77 3.75 36.20 1.93 37.62 3.94 38.13 4.63 37.67 2.20 36.91 2.33 37.88 3.66 37.64 2.74
VAT + em + SNTG (luo2018smooth) 36.34 6.54 33.03 4.78 37.78 6.94 36.26 6.78 38.48 3.87 35.90 2.78 38.01 4.85 37.44 4 38.48 2.96 36.99 2.77 37.71 4.32 37.53 3.52
Pseudo-Label (lee2013pseudo) 39.59 5.70 30.62 3.62 37.90 6.87 39.38 4.79 39.72 4.20 37.36 3.14 38.77 4.23 39.12 3.19 38.69 4.28 36.84 3.31 38.92 3.31 38.52 3.13
MT+SCL (ours) 22.62 3.54 21.59 3.05 23.44 3.24 22.93 3.53 33.09 3.63 31.63 2.31 34.09 3.22 33.14 3.43 34.22 3.50 32.09 2.16 33.93 3.27 34.66 4.37
(a) CIFAR10

(b) SVHN
Unlabeled Imbalance Type Uniform (γ_u = 1) Half (γ_u = γ_l/2) Same (γ_u = γ_l)
Re-weighting Method CE IN Focal CB CE IN Focal CB CE IN Focal CB
Supervised 35.89 6.39 34.60 6.51 35.45 6.20 35.3 7.08 35.89 6.39 34.60 6.51 35.45 6.20 35.30 7.08 35.89 6.39 34.60 6.51 35.45 6.20 35.30 7.08
Π-Model (laine2016temporal) 28.59 7.90 26.72 6.12 30.15 8.13 27.99 5.95 33.73 7.52 29.60 7.33 31.67 6.43 31.12 7.72 33.71 8.17 31.7 6.94 31.70 5.08 33.17 5.35
MT (tarvainen2017mean) 8.94 2.12 6.82 0.34 8.66 2.11 7.86 1.82 17.23 2.44 17.02 3.93 16.20 3.70 16.68 2.24 21.01 4.14 20.80 5.43 20.01 4.41 21.77 4.45
VAT + em (miyato2018virtual) 29.15 4.80 20.26 7.97 28.09 5.46 29.37 4.76 30.44 6.95 27.44 7.63 28.62 8.11 29.65 6.90 32.39 8.25 29.18 7.45 30.62 8.23 30.93 6.51
VAT + em + SNTG (luo2018smooth) 93.30 0.00 93.30 0.00 93.30 0.00 93.30 0.00 93.30 0.00 93.30 0.00 93.30 0.00 93.30 0.00 93.30 0.00 93.30 0.00 93.30 0.00 93.30 0.00
Pseudo-Label (lee2013pseudo) 32.79 7.62 13.48 2.45 35.07 10.85 34.38 9.48 33.70 6.89 31.83 4.98 32.79 6.72 32.83 8.27 33.53 8.08 31.62 6.06 33.63 6.24 34.55 6.58
MT+SCL (ours) 8.56 0.86 8.48 1.47 7.74 0.58 9.02 1.34 18.63 3.97 18.59 3.71 16.34 2.62 16.44 2.47 20.39 4.10 20.51 5.43 20.95 4.48 21.06 4.78
Table 10: Test error rates (%) and standard deviation from experiments with different re-weighting methods in CIFAR10 and SVHN. We compared inverse and normalization (IN), focal loss (FOCAL), and class-balanced loss (CB) to conventional cross-entropy loss (CE).

E Object Detection Experiment Settings

E.1 Dataset Details

We used the PASCAL VOC2007 trainval dataset as the labeled data and the PASCAL VOC2012 trainval dataset as the unlabeled data. Fig. 5 shows the class distributions of the PASCAL VOC data. The imbalance factor of the labeled data is 22, and the imbalance factor of the unlabeled data is 15. The order of class frequencies also differs between the two sets, which makes the object detection task more difficult and closer to a real-world setting.

Figure 5: Distributions for (a) the labeled dataset (VOC2007) and (b) the unlabeled dataset (VOC2012).

E.2 Implementation Details

We followed the CSD (jeong2019consistency) experiment settings (https://github.com/soo89/CSD-SSD) and used the SSD300 model (liu2016ssd). All hyperparameters, such as the consistency coefficient, training iterations, schedule function, and background elimination, are the same. We set the suppression coefficient β to the value that showed better performance.