
BASIL: Balanced Active Semi-supervised Learning for Class Imbalanced Datasets

Current semi-supervised learning (SSL) methods assume a balance between the number of data points available for each class in both the labeled and the unlabeled data sets. However, most real-world datasets naturally exhibit class imbalance. It is known that training models on such imbalanced datasets leads to biased models, which in turn make biased predictions towards the more frequent classes. This issue is further pronounced in SSL methods, as they use this biased model to obtain pseudo-labels (on the unlabeled data) during training. In this paper, we tackle this problem by attempting to select a balanced labeled dataset for SSL that would result in an unbiased model. Unfortunately, acquiring a balanced labeled dataset from a class imbalanced distribution in one shot is challenging. We propose BASIL (Balanced Active Semi-supervIsed Learning), a novel algorithm that optimizes the submodular mutual information (SMI) functions in a per-class fashion to gradually select a balanced dataset in an active learning loop. Importantly, our technique can be efficiently used to improve the performance of any SSL method. Our experiments on the Path-MNIST and Organ-MNIST medical datasets for a wide array of SSL methods show the effectiveness of Basil. Furthermore, we observe that Basil outperforms state-of-the-art diversity and uncertainty based active learning methods, since the SMI functions select a more balanced dataset.





1 Introduction

Deep neural networks (DNNs) have proven successful on a variety of machine learning tasks. However, they are mostly fueled by large amounts of data, and they can be trained using multiple objective functions that require labeled or unlabeled data. As a result, we face multiple data-related challenges in training DNNs.

Firstly, most real-world datasets have a natural class imbalance. For instance, in the medical imaging domain, cancerous images are rare in comparison to non-cancerous images. Secondly, obtaining labeled data is notoriously time-consuming and expensive. This issue is highly pronounced in the biomedical domain, where annotators are experts such as doctors and radiologists who need to be well compensated. Thirdly, in many scenarios, even procuring unlabeled data is challenging. For example, acquiring even a few samples of medical data for a new disease is difficult and involves several privacy constraints. Hence, it is critical to use the unlabeled data even when only a few labeled data points are available.

The data-related issues introduced above are well known, and the community has devised several techniques to tackle them separately. For mitigating labeling costs, active learning (AL) [2, 13, 11, 12, 23, 24] is an established paradigm that samples uncertain or diverse data points from an unlabeled set; the goal is to acquire the subset that yields the largest improvement in model performance. Another technique, semi-supervised learning (SSL) [16, 20, 15, 3, 26, 28], leverages the unlabeled data when only a small amount of labeled data is available. Lastly, several subset selection techniques that tackle class imbalance [12, 14, 13] have also been proposed. Evidently, these techniques revolve around the idea of obtaining the best possible model at minimum cost. However, each of these techniques individually suffers from limitations that the others do not. For example, existing AL and SSL techniques are known to suffer from class imbalance, thereby learning biased models. Also, existing AL and subset selection methods do not leverage the remaining unlabeled data, and simply discard it. To bridge these gaps, we propose Basil, a unified framework that actively samples data points per class to create a balanced labeled set, followed by SSL to make the most of the remaining unlabeled data.

1.1 Related work

Active Learning (AL). Uncertainty based methods aim to select the most uncertain data points for labeling. The most common technique, Entropy [24], selects the data points with maximum predictive entropy. The main drawback of uncertainty based methods is that they lack diversity within the acquired subset. To mitigate this, a number of approaches incorporate diversity. A recent approach called Badge [2] represents data points by their last linear layer gradients and runs K-means++ [1] to obtain centers that have a high gradient magnitude. The centers being representative and having high gradient magnitude ensures uncertainty and diversity at the same time. However, for batch active learning, this diversity and uncertainty are enforced only within a batch and not across batches. Another method, BatchBald [11], requires a large number of Monte Carlo dropout samples to obtain significant mutual information, which limits its applicability in medical domains where data is scarce. Recently, [12] proposed the use of submodular information measures for active learning in realistic scenarios, while [13] used them to find rare objects in an autonomous-driving object detection problem. However, these works focus on acquiring data points only from the rare classes or slices. Our proposed method maximizes the per-class mutual information, thereby selecting data points for each class to obtain a balanced labeled set, which is critical for training unbiased models.
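For reference, the entropy-based uncertainty baseline can be sketched in a few lines; the function and variable names here are illustrative, not taken from any of the cited codebases:

```python
import numpy as np

def entropy_acquisition(probs: np.ndarray, budget: int) -> np.ndarray:
    """Select the `budget` unlabeled points with maximum predictive entropy.

    probs: (n_points, n_classes) softmax outputs of the current model.
    """
    eps = 1e-12  # avoid log(0)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    # Highest-entropy (most uncertain) points first
    return np.argsort(-entropy)[:budget]

probs = np.array([
    [0.98, 0.01, 0.01],   # confident -> low entropy, picked last
    [0.34, 0.33, 0.33],   # near-uniform -> highest entropy
    [0.60, 0.30, 0.10],
])
selected = entropy_acquisition(probs, budget=2)
```

Note that nothing in this criterion encourages class balance, which is exactly the gap Basil targets.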

Semi-supervised Learning (SSL). The goal of SSL methods is to leverage unlabeled data alongside the labeled data to obtain a better representation of the dataset than supervised learning [21]. The most basic SSL method, pseudo-labeling [16], uses model predictions as target labels, acting as a regularizer via a standard supervised loss function on the unlabeled dataset. Some SSL methods, such as Π-Model [15, 22] and Mean Teacher [26], use consistency regularization via data augmentation and dropout techniques. Mean Teacher obtains a more stable target output by using an exponential moving average of parameters across previous epochs. Virtual Adversarial Training (VAT) [20] uses an effective regularization technique that involves slight perturbations chosen so that the predictions on the unlabeled samples are affected the most. ICT [28] encourages the prediction at an interpolation of unlabeled points to be consistent with the interpolation of the predictions at those points. More recent techniques like FixMatch [25], MixMatch [3], and UDA [29] use data augmentations like flips, rotations, and crops to predict pseudo-labels. All the above methods depend on the model trained using a small labeled set. Hence, they are susceptible to using a biased model if the labeled set is randomly sampled from an unlabeled set with class imbalance. In this paper, we study the effect of selecting a balanced seed set using Basil for a wide array of SSL techniques.
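To make the pseudo-labeling idea concrete, here is a minimal confidence-thresholded pseudo-labeling step in the PL/FixMatch style; the function is our own sketch (the 0.95 cutoff matches the PL hyperparameter listed in the appendix), not the authors' implementation:

```python
import numpy as np

def pseudo_label_targets(probs: np.ndarray, threshold: float = 0.95):
    """Turn model predictions on unlabeled data into training targets.

    Keeps only predictions whose max class probability clears `threshold`;
    returns the pseudo-labels and the indices of the retained points.
    """
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    mask = confidence >= threshold
    return labels[mask], np.flatnonzero(mask)

probs = np.array([
    [0.97, 0.02, 0.01],  # confident -> kept, pseudo-label 0
    [0.50, 0.30, 0.20],  # too uncertain -> discarded
])
labels, kept = pseudo_label_targets(probs)
```

If the underlying model is biased towards frequent classes, the retained pseudo-labels inherit that bias, which is why a balanced seed set matters.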

1.2 Our contributions

We summarize our contributions as follows: 1) We emphasize the need for jointly addressing multiple real-world data-related problems, namely class imbalance, expensive labeling costs, and leveraging unlabeled data; in particular, we show that these problems co-exist in the medical domain. 2) We propose Basil, a novel algorithm that can tackle these problems in an end-to-end manner. Concretely, we acquire a balanced subset of the unlabeled data by maximizing per-class instantiations of submodular mutual information functions in an active learning loop, followed by semi-supervised learning (see Fig. 1). 3) Basil can leverage any SSL method and yield improved performance over the vanilla SSL approach. 4) We evaluate the effectiveness of Basil on two diverse modalities of medical data, namely histopathology (Path-MNIST [9]) and abdominal CT (Organ-MNIST [10]). 5) We conduct rigorous experiments on 6 AL strategies and 8 SSL techniques, and show that balanced labeled set selection using Basil outperforms existing AL methods and obtains larger gains in performance across SSL techniques (see Tab. 1 and Tab. 2).

2 Preliminaries

Submodular Functions: We let $\mathcal{V}$ denote the ground set of $n$ data points and $f: 2^{\mathcal{V}} \rightarrow \mathbb{R}$ a set function. The function $f$ is submodular [4] if it satisfies the diminishing marginal returns property, namely $f(\mathcal{A} \cup \{x\}) - f(\mathcal{A}) \geq f(\mathcal{B} \cup \{x\}) - f(\mathcal{B})$ for all $\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{V}$ and $x \in \mathcal{V} \setminus \mathcal{B}$. Facility location, graph cut, log determinants, etc. are some examples [8].

Submodular Mutual Information (Smi): Given two sets of items $\mathcal{A}, \mathcal{Q} \subseteq \mathcal{V}$, the submodular mutual information (MI) [5, 7] is defined as $I_f(\mathcal{A}; \mathcal{Q}) = f(\mathcal{A}) + f(\mathcal{Q}) - f(\mathcal{A} \cup \mathcal{Q})$. Intuitively, this measures the similarity between $\mathcal{A}$ and $\mathcal{Q}$, and we refer to $\mathcal{Q}$ as the query set.
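The SMI definition above can be evaluated directly once a submodular $f$ is fixed. The following toy example (our own, with an illustrative similarity matrix) uses the facility location function and shows that the MI is large when $\mathcal{A}$ resembles $\mathcal{Q}$:

```python
import numpy as np

def facility_location(S: np.ndarray, subset) -> float:
    """f(A) = sum over the ground set of the best similarity to A; f(empty) = 0."""
    if len(subset) == 0:
        return 0.0
    return float(S[:, subset].max(axis=1).sum())

def smi(S: np.ndarray, A, Q) -> float:
    """Submodular mutual information I_f(A; Q) = f(A) + f(Q) - f(A u Q)."""
    union = sorted(set(A) | set(Q))
    return facility_location(S, A) + facility_location(S, Q) - facility_location(S, union)

# Toy symmetric similarity matrix over 3 points; points 0 and 1 are near-duplicates.
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

mi_close = smi(S, [0], [1])   # A similar to the query -> high MI
mi_far   = smi(S, [0], [2])   # A dissimilar to the query -> low MI
```

Because points 0 and 1 are near-duplicates, covering one nearly covers the other, so their union adds little value and the MI is high.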

Kothawade et al. [14] extend Smi to handle the case when the query set can come from an auxiliary set $\mathcal{V}'$ different from the ground set $\mathcal{V}$. In the context of imbalanced medical image classification, $\mathcal{V}$ is the source set of images and the query set $\mathcal{Q} \subseteq \mathcal{V}'$ is the target set containing the rare-class images. To find an optimal subset given a query set $\mathcal{Q}$, we can define $g_{\mathcal{Q}}(\mathcal{A}) = I_f(\mathcal{A}; \mathcal{Q})$ for $\mathcal{A} \subseteq \mathcal{V}$, and maximize the same.

2.1 Examples of Smi functions

For balanced subset selection via Basil, we use the recently introduced Smi functions in [7, 5] and their extensions introduced in [14] as acquisition functions. Note that we only use the subset of functions presented in [14] that are the most scalable for per-class selection of data points. For any two data points $i$ and $j$, let $s_{ij}$ denote the similarity between them.

Graph Cut MI (Gcmi): The Smi instantiation of graph-cut (Gcmi) is defined as: $I_f(\mathcal{A}; \mathcal{Q}) = 2\lambda \sum_{i \in \mathcal{A}} \sum_{j \in \mathcal{Q}} s_{ij}$. Since maximizing Gcmi maximizes the joint pairwise sum with the query set, it will lead to a summary similar to the query set $\mathcal{Q}$. In fact, specific instantiations of Gcmi have been intuitively used for query-focused summarization for videos [27] and documents [18, 17].

Facility Location MI (Flmi): We consider two variants of Flmi. In the first variant of facility location, which is defined over $\mathcal{V}$ (Flvmi), the Smi instantiation can be defined as: $I_f(\mathcal{A}; \mathcal{Q}) = \sum_{i \in \mathcal{V}} \min(\max_{j \in \mathcal{A}} s_{ij}, \max_{j \in \mathcal{Q}} s_{ij})$. The first term in the $\min(\cdot)$ of Flvmi models diversity, and the second term models query relevance.

For the second variant, which is defined over $\mathcal{V} \cup \mathcal{Q}$ (Flqmi), the Smi instantiation can be defined as: $I_f(\mathcal{A}; \mathcal{Q}) = \sum_{i \in \mathcal{Q}} \max_{j \in \mathcal{A}} s_{ij} + \sum_{i \in \mathcal{A}} \max_{j \in \mathcal{Q}} s_{ij}$. Flqmi is very intuitive for query relevance as well. It measures the representation of data points that are the most relevant to the query set and vice versa. It can also be thought of as a bidirectional representation score.
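A minimal vectorized sketch of the three instantiations above (the names and the default λ are ours; these mirror the closed forms, not the authors' implementation):

```python
import numpy as np

def gcmi(S, A, Q, lam=1.0):
    """Graph-cut MI: 2*lam * sum over i in A, j in Q of s_ij (pure query relevance)."""
    return 2.0 * lam * float(S[np.ix_(A, Q)].sum())

def flvmi(S, A, Q):
    """Facility location MI over V: sum_i min(max_{j in A} s_ij, max_{j in Q} s_ij)."""
    return float(np.minimum(S[:, A].max(axis=1), S[:, Q].max(axis=1)).sum())

def flqmi(S, A, Q):
    """Facility location MI over V u Q: a bidirectional representation score."""
    return float(S[np.ix_(Q, A)].max(axis=1).sum() + S[np.ix_(A, Q)].max(axis=1).sum())

# Same toy kernel as before: points 0 and 1 are near-duplicates, point 2 is far away.
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
g, v, q = gcmi(S, [0], [1]), flvmi(S, [0], [1]), flqmi(S, [0], [1])
```

Note that only Flvmi touches rows outside $\mathcal{A} \cup \mathcal{Q}$, which is how it accounts for diversity over the whole ground set.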

Figure 1: Architecture of Basil. We select a balanced labeled set by maximizing the per-class submodular mutual information in an active learning loop. The balanced labeled set is used to train an unbiased model which better leverages the remaining unlabeled data using semi-supervised learning.

3 Basil: Our Active Semi-supervised Learning framework

In this section, we present Basil, a unified framework that jointly tackles data problems of class imbalance, high labeling costs and leveraging unlabeled data. We do so by gradually acquiring balanced subsets in an active learning loop, followed by training the final model using the balanced labeled set and the remaining unlabeled set (see Fig. 1).

Require: Labeled set of data points: $\mathcal{L}$, large unlabeled dataset: $\mathcal{U}$, loss function defined over $\mathcal{L}$: $L_s$, loss function defined over $\mathcal{U}$: $L_u$, batch size: $B$, number of selection rounds: $N$, number of classes: $C$
1:  for selection round $i = 1$ to $N$ do
2:     Train model $\mathcal{M}$ with loss $L_s$ on the current labeled set $\mathcal{L}$
3:     $\nabla_{\mathcal{U}} \leftarrow$ gradients of $L_s$ over $\mathcal{U}$ {Compute gradients of $\mathcal{U}$ using hypothesized labels}
4:     $\nabla_{\mathcal{L}} \leftarrow$ gradients of $L_s$ over $\mathcal{L}$ {Compute gradients of $\mathcal{L}$ using true labels}
5:     $X \leftarrow$ Cosine_Similarity($\nabla_{\mathcal{U}}, \nabla_{\mathcal{L}}$)
6:     for class $c = 1$ to $C$ do
7:        Instantiate a Smi function $I_f$ based on $X_c$. {$X_c$ contains rows from $X$ corresponding to class $c$}
8:        $\mathcal{A}_c \leftarrow \operatorname{argmax}_{\mathcal{A} \subseteq \mathcal{U}, |\mathcal{A}| \leq B/C} I_f(\mathcal{A}; \mathcal{Q}_c)$ {$\mathcal{Q}_c$ contains data points from class $c$}
9:     end for
10:    Get labels for batch $\mathcal{A} = \cup_{c=1}^{C} \mathcal{A}_c$ and update $\mathcal{L} \leftarrow \mathcal{L} \cup \mathcal{A}$, $\mathcal{U} \leftarrow \mathcal{U} \setminus \mathcal{A}$
11: end for
12: Train final model $\mathcal{M}$ with loss $L_s + L_u$ {Train model on balanced $\mathcal{L}$ and remaining $\mathcal{U}$ using SSL}
13: Return trained model $\mathcal{M}$ and parameters $\theta$.
Algorithm 1 Basil: Balanced Active Semi-supervised Learning
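The per-class selection step of Algorithm 1 can be sketched with a naive greedy maximization of the Flqmi objective under an equal per-class budget; all names here are illustrative, and the actual implementation may use a different optimizer:

```python
import numpy as np

def greedy_flqmi_per_class(K, class_queries, budget):
    """Per-class balanced selection (sketch).

    K: (n_unlabeled, n_labeled) similarity kernel. class_queries maps each
    class c to the labeled-set columns forming its query set Q_c. Each class
    greedily gets an equal share budget // C of the selections.
    """
    n = K.shape[0]
    per_class = budget // len(class_queries)
    chosen = []
    for c, q_cols in class_queries.items():
        Sq = K[:, q_cols]                   # similarities to the class-c queries
        covered = np.zeros(len(q_cols))     # best coverage of each query so far
        avail = [i for i in range(n) if i not in chosen]
        for _ in range(per_class):
            # Flqmi marginal gain: extra query coverage + the point's own relevance
            gains = {x: (np.maximum(covered, Sq[x]).sum() - covered.sum()
                         + Sq[x].max())
                     for x in avail}
            x_star = max(gains, key=gains.get)
            avail.remove(x_star)
            chosen.append(x_star)
            covered = np.maximum(covered, Sq[x_star])
    return chosen

K = np.array([[0.9, 0.1],    # gradient-similar to the class-0 query
              [0.8, 0.0],
              [0.1, 0.9],    # gradient-similar to the class-1 query
              [0.0, 0.7]])
picked = greedy_flqmi_per_class(K, {0: [0], 1: [1]}, budget=2)
```

With one slot per class, the selection picks one point matching each query, which is exactly the balancing behavior the algorithm relies on.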

The main idea in Basil is to maximize per-class instantiations of submodular mutual information (SMI) functions to obtain a balanced labeled set. Concretely, we formulate an SMI function using a query set $\mathcal{Q}_c$ that is a subset of the current labeled set $\mathcal{L}$ with data points from class $c$. The SMI functions are instantiated using a similarity kernel $X$, where $X_{ij}$ is the pairwise similarity between data points $i$ and $j$, represented by gradients computed using $L_s$. Specifically, we define $X_{ij} = \cos(\nabla_\theta L_s(x_i, \hat{y}_i), \nabla_\theta L_s(x_j, y_j))$, where $L_s(x_i, \hat{y}_i)$ is the labeled loss on the $i$th data point. Note that we use hypothesized labels, i.e., the label with maximum class probability, for computing gradients on the unlabeled set. Next, we optimize the SMI function for class $c$ using a greedy strategy [19] with a constraint that the budget $B$ is equally divided across all classes:

$\mathcal{A}_c \leftarrow \operatorname{argmax}_{\mathcal{A} \subseteq \mathcal{U}, |\mathcal{A}| \leq B/C} I_f(\mathcal{A}; \mathcal{Q}_c)$
Hence, Basil gradually builds a balanced labeled set by accumulating smaller balanced sets in every AL round. Finally, we can use any SSL technique to train the model with the balanced labeled set $\mathcal{L}$ and the remaining unlabeled set $\mathcal{U}$. We summarize our method in Algo. 1, illustrate the architecture in Fig. 1, and discuss its scalability in Appendix 0.C.
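The gradient-based similarity kernel can be sketched as follows, assuming a linear last layer and Badge-style gradient embeddings; the helper names are ours, and the layer choice is an assumption rather than a detail confirmed by the paper:

```python
import numpy as np

def hypothesized_gradient_embedding(probs, feats):
    """Last-layer cross-entropy gradient using the hypothesized (argmax) label.

    For a linear classifier, dL/dW for sample i is (p_i - onehot(y_hat_i))
    outer h_i; we flatten that matrix into one embedding per data point.
    probs: (n, C) softmax outputs; feats: (n, d) penultimate-layer features.
    """
    n, c = probs.shape
    onehot = np.eye(c)[probs.argmax(axis=1)]
    residual = probs - onehot                      # (n, C)
    return np.einsum('nc,nd->ncd', residual, feats).reshape(n, -1)

def cosine_kernel(G_u, G_l):
    """Pairwise cosine similarity between unlabeled and labeled gradient embeddings."""
    Gu = G_u / (np.linalg.norm(G_u, axis=1, keepdims=True) + 1e-12)
    Gl = G_l / (np.linalg.norm(G_l, axis=1, keepdims=True) + 1e-12)
    return Gu @ Gl.T

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
feats = np.array([[1.0, 2.0],
                  [0.5, -1.0]])
G = hypothesized_gradient_embedding(probs, feats)
K = cosine_kernel(G, G)   # entries lie in [-1, 1]; diagonal is 1
```

For unlabeled points, `probs.argmax` plays the role of the hypothesized label; for labeled points, the one-hot true label would be used instead.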

4 Experiments

In this section, we evaluate the effectiveness of Basil on two modalities of medical data, viz., histopathology (see Sec. 4.1) using the Path-MNIST dataset [9, 30] and abdominal CT (see Sec. 4.2) using the Organ-MNIST dataset [10, 30]. For evaluation, we compare the test accuracy of the model obtained after training via a semi-supervised learning algorithm on a combination of the labeled set (selected using active learning) and the unlabeled set. We also compare the imbalance ratio (IR) of all AL methods in Fig. 2. The IR is computed as follows: $\mathrm{IR}(\mathcal{L}) = \frac{\sum_{c \in F} |\mathcal{L}_c| / |F|}{\sum_{c \in R} |\mathcal{L}_c| / |R|}$, where $R$ contains class indices of the rare classes, $F$ contains class indices of the remaining frequent classes, and $\mathcal{L}_c$ is the set of labeled points from class $c$. Note that $\mathrm{IR}(\mathcal{L}) = 1$ when $\mathcal{L}$ is perfectly balanced.
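A small helper makes the IR computation concrete; since the exact normalization is reconstructed from context (mean per-class count over frequent classes divided by the mean over rare classes), treat this as a sketch:

```python
from collections import Counter

def imbalance_ratio(labels, rare_classes, frequent_classes):
    """IR(L): mean per-class count over frequent classes divided by the mean
    over rare classes. Equals 1 for a perfectly balanced labeled set; higher
    values mean the rare classes are under-represented."""
    counts = Counter(labels)
    freq_mean = sum(counts[c] for c in frequent_classes) / len(frequent_classes)
    rare_mean = sum(counts[c] for c in rare_classes) / len(rare_classes)
    return freq_mean / rare_mean

balanced = imbalance_ratio([0] * 5 + [1] * 5 + [2] * 5,
                           rare_classes=[2], frequent_classes=[0, 1])
imbalanced = imbalance_ratio([0] * 10 + [1] * 10 + [2] * 2,
                             rare_classes=[2], frequent_classes=[0, 1])
```

In Fig. 2, a labeled set selected by the SMI functions drives this ratio closer to 1 than the baselines do.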

Our results show that selecting a balanced labeled set using Basil (see Fig. 2) outperforms existing AL baselines (supervised row in Tab. 1, 2). Importantly, we observe that, independent of the SSL algorithm used, using a balanced labeled set selected using Basil helps leverage the remaining imbalanced unlabeled set better.

Baseline AL methods and SSL techniques. We compare the performance of Basil against an uncertainty based AL method (Entropy), a diversity based AL method (Badge), and random sampling (Random). We discuss the details of all baselines in Sec. 1.1. We evaluate the subset selected by Basil and the AL baselines on a wide array of SSL algorithms, namely Pseudo-Label (PL), Mean Teacher (MT), Π-Model, MixMatch (MM), ICT, VAT, and VAT + Entropy Minimization (EM), on two medical imaging datasets described in Sec. 4.1 and Sec. 4.2.

Experimental setup:

We use the same training procedure and hyperparameters for all AL methods to ensure a fair comparison. For the first AL round, we randomly sample a batch of $B$ data points for labeling from the unlabeled set $\mathcal{U}$. We do so in order to obtain meaningful model parameters for the AL acquisition functions in the next round of AL. For all experiments, we train a Wide ResNet (WRN) [6] model using an Adam optimizer with an initial learning rate of 3e-4. For each AL round, the weights are reinitialized using Xavier initialization and the model is trained for 100K iterations. After obtaining the labeled set using AL, we reinitialize the WRN model and train it using SSL for 500K iterations. We run each experiment on a V100 GPU and report error bars (std. deviation). We discuss dataset splits and hyperparameters below and provide more details in Appendix 0.B.

4.1 Analysis on Histopathology data

Path-MNIST Dataset: Path-MNIST [9] is a dataset based on a prior study for predicting survival from colorectal cancer histology slides. It includes a training dataset of 100,000 non-overlapping image patches from hematoxylin and eosin stained histological images, and a test dataset of 7,180 image patches from a different clinical center. These patches are categorized into 9 types of tissues, resulting in a multi-class classification task. For our experiments, we use a pre-processed version of Path-MNIST [30], where the original images are resized to 3 × 28 × 28. We consider a subset of 26K data points to create the initial imbalanced unlabeled set $\mathcal{U}$. The unlabeled set is imbalanced by randomly choosing 4 rare classes and randomly selecting 250 data points for each of them. For the remaining 5 classes, we randomly select 5000 data points for each class.

SSL \ AL     | Random     | Badge      | Entropy    | Flqmi       | Gcmi        | Flvmi
Supervised   | 69.37±1.33 | 76.56±0.40 | 70.54±1.76 | 78.51±0.58  | 78.76±1.84* | 77.68±0.58
PL [16]      | 69.38±2.19 | 80.11±1.68 | 66.16±1.80 | 78.66±2.23  | 80.33±1.60  | 81.24±0.91*
ICT [28]     | 75.45±0.28 | 80.25±0.17 | 74.91±0.78 | 82.29±0.96  | 82.58±0.13* | 82.08±0.66
Π-Model [15] | 64.59±0.19 | 78.63±1.03 | 67.29±0.73 | 80.62±1.72  | 80.59±1.19  | 80.77±0.71*
MT [26]      | 70.98±1.76 | 79.62±0.85 | 74.19±1.56 | 82.14±0.91  | 82.43±1.44  | 83.36±0.52*
MM [3]       | 67.13±1.78 | 67.77±1.43 | 67.01±0.44 | 73.78±1.22  | 76.08±0.09* | 73.42±1.54
VAT [20]     | 80.27±1.95 | 83.05±1.72 | 77.49±1.80 | 85.72±1.83* | 83.70±0.37  | 84.38±0.81
VAT+EM [20]  | 81.48±1.10 | 84.11±0.06 | 82.60±0.75 | 85.46±1.54* | 84.10±1.94  | 85.22±1.51
Table 1: Active SSL on Path-MNIST. The best test accuracy in each row (marked with *) shows the best performing AL acquisition function for a particular SSL method. We observe that Flqmi, which models query-relevance and representation, obtains the highest gains for VAT and VAT+EM, whereas Flvmi acquires the best labeled set for PL, Π-Model, and MT.
Figure 2: Comparison of Imbalance Ratio (IR) on Path-MNIST (left) and Organ-MNIST (right) datasets (lower is better). We observe that the SMI functions select more data points from the rare classes, thereby achieving a lower IR than both the best performing baseline and Random.

Results: We present results for Active SSL on Path-MNIST in Tab. 1. We observe that the SMI based AL acquisition functions outperform existing AL methods in the supervised setting and across all semi-supervised learning methods. This is because per-class selection using SMI functions in Basil results in a more balanced labeled set (see Fig. 2). This reinforces the need for a framework like Basil for training models in a supervised or semi-supervised manner under class imbalance. Interestingly, we observe that the choice of SMI function depends on the modality of medical data and the SSL method. For histopathology, we see that PL, Π-Model, and MT perform best when an acquisition function like Flvmi, which balances query-relevance and diversity, is used. For ICT and MM, Gcmi, which models query relevance only, shows the best results, whereas on VAT and VAT+EM, Flqmi, which models query-relevance and representation, shows the best results.

4.2 Analysis on Abdominal CT data

Organ-MNIST Dataset: Organ-MNIST [10] is a dataset based on 3D computed tomography (CT) images from the Liver Tumor Segmentation Benchmark (LiTS). For our experiments, we use a pre-processed version of Organ-MNIST [30], where images are cropped using bounding-box annotations of 11 body organs. The images are resized to 1 × 28 × 28 to perform multi-class classification of the 11 body organs. We consider a subset of 21.6K data points to create the initial imbalanced unlabeled set $\mathcal{U}$. The unlabeled set is imbalanced by randomly choosing 4 rare classes and randomly selecting 150 data points for each of them. For the remaining 7 classes, we randomly select 3000 data points for each class.

Results: We present results for Active SSL on Organ-MNIST in Tab. 2. Similar to our results on Path-MNIST, we observe that the SMI based AL acquisition functions outperform existing AL methods across all supervised and SSL settings. We observe that the facility location based SMI functions dominate for the abdominal CT modality. In particular, Flqmi, which models query-relevance and representation, outperforms the other baselines and SMI functions for all SSL methods except VAT and VAT+EM, where Flvmi performs slightly better than Flqmi.

SSL \ AL     | Random     | Badge      | Entropy    | Flqmi       | Gcmi       | Flvmi
Supervised   | 60.63±0.73 | 64.32±1.29 | 61.03±1.36 | 65.58±0.28* | 62.15±1.63 | 64.41±0.60
PL [16]      | 62.83±1.79 | 64.43±1.47 | 64.27±0.16 | 67.68±1.43* | 64.48±1.30 | 65.56±0.35
ICT [28]     | 61.10±0.57 | 60.29±1.52 | 64.01±1.27 | 65.32±0.86* | 60.33±1.37 | 62.22±1.24
Π-Model [15] | 64.61±1.32 | 65.61±1.73 | 62.67±1.83 | 65.94±1.03* | 61.85±1.62 | 65.43±0.94
MT [26]      | 64.64±0.63 | 66.49±1.18 | 65.53±0.34 | 66.89±0.01* | 60.81±0.63 | 64.81±0.54
MM [3]       | 53.62±1.28 | 50.34±1.40 | 51.35±1.51 | 58.57±0.51* | 55.86±0.73 | 56.08±1.07
VAT [20]     | 71.82±1.98 | 70.48±0.53 | 72.17±0.97 | 75.25±0.73  | 72.51±1.62 | 76.17±0.10*
VAT+EM [20]  | 72.67±0.28 | 73.52±0.22 | 71.95±0.43 | 75.57±1.12  | 73.17±1.73 | 76.58±0.69*
Table 2: Active SSL on Organ-MNIST. The best test accuracy in each row (marked with *) shows the best performing AL method for each SSL method. We see that Flqmi, which models query-relevance and representation, obtains the highest gains for most SSL methods, except VAT and VAT+EM, where Flvmi consistently performs better.

5 Conclusion

We demonstrate the effectiveness of a unifying algorithm like Basil for selecting a balanced labeled set in scenarios with class imbalanced data. Through rigorous experiments on diverse modalities of medical data, we show that Basil selects a more balanced labeled set than other AL acquisition functions, resulting in relatively unbiased models and better performance for both supervised and semi-supervised learning.


  • [1] D. Arthur and S. Vassilvitskii (2007) K-means++: the advantages of careful seeding. In SODA '07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA, USA, pp. 1027–1035.
  • [2] J. T. Ash, C. Zhang, A. Krishnamurthy, J. Langford, and A. Agarwal (2020) Deep batch active learning by diverse, uncertain gradient lower bounds. In ICLR.
  • [3] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. Raffel (2019) MixMatch: a holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249.
  • [4] S. Fujishige (2005) Submodular functions and optimization. Elsevier.
  • [5] A. Gupta and R. Levin (2020) The online submodular cover problem. In ACM-SIAM Symposium on Discrete Algorithms.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [7] R. Iyer, N. Khargoankar, J. Bilmes, and H. Asnani (2020) Submodular combinatorial information measures with applications in machine learning. arXiv preprint arXiv:2006.15412.
  • [8] R. K. Iyer (2015) Submodular optimization and machine learning: theoretical results, unifying and scalable algorithms, and applications. Ph.D. thesis.
  • [9] J. N. Kather, J. Krisam, P. Charoentong, T. Luedde, E. Herpel, C. Weis, T. Gaiser, A. Marx, N. A. Valous, D. Ferber, et al. (2019) Predicting survival from colorectal cancer histology slides using deep learning: a retrospective multicenter study. PLoS Medicine 16 (1), pp. e1002730.
  • [10] D. S. Kermany, M. Goldbaum, W. Cai, C. C. Valentim, H. Liang, S. L. Baxter, A. McKeown, G. Yang, X. Wu, F. Yan, et al. (2018) Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172 (5), pp. 1122–1131.
  • [11] A. Kirsch, J. Van Amersfoort, and Y. Gal (2019) BatchBALD: efficient and diverse batch acquisition for deep Bayesian active learning. arXiv preprint arXiv:1906.08158.
  • [12] S. Kothawade, N. Beck, K. Killamsetty, and R. Iyer (2021) SIMILAR: submodular information measures based active learning in realistic scenarios. Advances in Neural Information Processing Systems 34.
  • [13] S. Kothawade, S. Ghosh, S. Shekhar, Y. Xiang, and R. Iyer (2021) TALISMAN: targeted active learning for object detection with rare classes and slices using submodular mutual information. arXiv preprint arXiv:2112.00166.
  • [14] S. Kothawade, V. Kaushal, G. Ramakrishnan, J. Bilmes, and R. Iyer (2021) PRISM: a rich class of parameterized submodular information measures for guided subset selection. arXiv preprint arXiv:2103.00128.
  • [15] S. Laine and T. Aila (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
  • [16] D. Lee et al. (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3.
  • [17] J. Li, L. Li, and T. Li (2012) Multi-document summarization via submodularity. Applied Intelligence 37 (3), pp. 420–430.
  • [18] H. Lin (2012) Submodularity in natural language processing: algorithms and applications. Ph.D. thesis.
  • [19] B. Mirzasoleiman, A. Badanidiyuru, A. Karbasi, J. Vondrák, and A. Krause (2015) Lazier than lazy greedy. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29.
  • [20] T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 1979–1993.
  • [21] A. Oliver, A. Odena, C. Raffel, E. D. Cubuk, and I. J. Goodfellow (2018) Realistic evaluation of deep semi-supervised learning algorithms. arXiv preprint arXiv:1804.09170.
  • [22] M. Sajjadi, M. Javanmardi, and T. Tasdizen (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Advances in Neural Information Processing Systems 29, pp. 1163–1171.
  • [23] O. Sener and S. Savarese (2018) Active learning for convolutional neural networks: a core-set approach. In International Conference on Learning Representations.
  • [24] B. Settles (2009) Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
  • [25] K. Sohn, D. Berthelot, C. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel (2020) FixMatch: simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.
  • [26] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780.
  • [27] A. B. Vasudevan, M. Gygli, A. Volokitin, and L. Van Gool (2017) Query-adaptive video summarization via quality-aware relevance estimation. In Proceedings of the 25th ACM International Conference on Multimedia, pp. 582–590.
  • [28] V. Verma, K. Kawaguchi, A. Lamb, J. Kannala, Y. Bengio, and D. Lopez-Paz (2019) Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825.
  • [29] Q. Xie, Z. Dai, E. Hovy, M. Luong, and Q. V. Le (2019) Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848.
  • [30] J. Yang, R. Shi, D. Wei, Z. Liu, L. Zhao, B. Ke, H. Pfister, and B. Ni (2021) MedMNIST v2: a large-scale lightweight benchmark for 2D and 3D biomedical image classification. arXiv preprint arXiv:2008.

Supplementary Material

Appendix 0.A Summary of Notations

Topic | Notation | Explanation
Basil (Sec. 3) | $\mathcal{U}$ | Unlabeled set of instances
 | $\mathcal{A}$ | A subset of $\mathcal{U}$
 | $s_{ij}$ | Similarity between any two data points $i$ and $j$
 | $f$ | A submodular function
 | $\mathcal{L}$ | Labeled set of data points
 | $\mathcal{Q}$ | Query set
 | $\mathcal{Q}_c$ | Query set containing data points from class $c$ for per-class SMI selection, $c \in \{1, \ldots, C\}$
 | $\mathcal{M}$ | Deep model
 | $B$ | Active learning selection budget
 | $L_s$ | Labeled loss function used to train model $\mathcal{M}$ and compute gradients
 | $L_u$ | Unlabeled loss function used for semi-supervised learning
 | $X$ | Pairwise similarity matrix computed using gradients
 | $\nabla_{\mathcal{A}}$ | Gradients of some subset $\mathcal{A}$
Table 3: Summary of notations used throughout this paper

Appendix 0.B Details of Datasets and Experimental setting

0.b.1 Datasets

In this section, we describe the details of the Path-MNIST (histopathology) and Organ-MNIST (abdominal CT) datasets.

0.b.1.1 Path-MNIST

A dataset based on a prior study for predicting survival from colorectal cancer histology slides, which provides 100,000 non-overlapping image patches from hematoxylin and eosin stained histological images, and a test dataset of 7,180 image patches from a different clinical center. 9 types of tissues are involved, resulting in a multi-class classification task. We resize the source images from 3 × 224 × 224 to 3 × 28 × 28.

0: adipose, 1: background, 2: debris, 3: lymphocytes, 4: mucus, 5: smooth muscle, 6: normal colon mucosa, 7: cancer-associated stroma, 8: colorectal adenocarcinoma epithelium

Figure 3: Types of tissues and their labels in Path-MNIST

Out of the 100K training images, we take 5K images each from classes 0, 1, 4, 6, and 8, and 250 images each from classes 2, 3, 5, and 7, forming an unlabeled train dataset of 26K images. For the validation set, we take 10 samples from each class, forming a validation set of size 90. For the test set, we use the default one, which consists of 7,180 samples. For Path-MNIST, the active learning selection budget $B$ is 900.

0.b.1.2 Organ-MNIST

A dataset based on 3D computed tomography (CT) images from the Liver Tumor Segmentation Benchmark (LiTS). Hounsfield units (HU) of the 3D images are transformed into grey scale with an abdominal window; we then crop 2D images from the center slices of the 3D bounding boxes in axial views (planes). The images are resized to 1 × 28 × 28 to perform multi-class classification of 11 body organs.

0: bladder, 1: femur-left, 2: femur-right, 3: heart, 4: kidney-left, 5: kidney-right, 6: liver, 7: lung-left, 8: lung-right, 9: pancreas, 10: spleen

Figure 4: Organs and their labels in OrganA-MNIST

There are a total of 34,581 training samples, out of which we pick 3,000 samples each from classes 4–10 and 150 samples each from classes 0–3, forming an unlabeled train dataset of 21.6K images. For the validation set, we select 10 points from each class, forming a validation set of size 110. The test set consists of around 17K images. For Organ-MNIST, the active learning selection budget $B$ is 990.

0.b.2 Experimental Setting

Given an unlabeled dataset $\mathcal{U}$ and an active learning selection budget $B$, we select a batch of points in each round of active learning. In the first round, we select points randomly from the unlabeled dataset and label them. In subsequent rounds, we train the Wide-ResNet model using an Adam optimizer with an initial learning rate of 3e-4 (the same across the different active learning methods for a fair comparison) for 100K iterations on the labeled set formed so far, extract the gradients of the remaining unlabeled dataset from the model, and use these gradients for per-class selection with the SMI functions to obtain a balanced dataset.
After forming a labeled set of size $B$, we perform semi-supervised learning using the labeled set $\mathcal{L}$ and the unlabeled dataset $\mathcal{U}$. We evaluate our selected balanced datasets on many SSL algorithms, namely Pseudo-Label, ICT, Π-Model, Mean Teacher, MixMatch, VAT, and VAT+EM. For each of these methods, the loss consists of two components: a supervised loss (corresponding to the labeled dataset) and an SSL loss (corresponding to the unlabeled dataset). The supervised loss is common to all, whereas the SSL loss depends on the algorithm being used. We use the same parameters for a given SSL method across the different active learning selections for a fair comparison.

0.b.2.1 Parameters used for each SSL method

  • Supervised: "lr":

  • PL: "threshold": 0.95, "lr": , "consistency-coefficient": 1

  • ICT: "ema_factor": 0.999, "lr": , "consistency-coefficient": 100, "alpha": 0.1

  • Π-Model: "lr": , "consistency-coefficient": 20.0

  • MT: "ema_factor": 0.95, "lr": , "consistency-coefficient": 8

  • MM: "lr": , "consistency-coefficient": 100, "alpha": 0.75, "T": 0.5, "K": 2

  • VAT: "xi": , "lr": , "consistency-coefficient": 0.3, "eps": 6

  • VAT+EM: "xi": , "lr": , "consistency-coefficient": 0.3, "eps": 6, "em": 0.06

Details on the computation of Imbalance Ratio: The IR is computed as $\mathrm{IR}(\mathcal{L}) = \frac{\sum_{c \in F} |\mathcal{L}_c| / |F|}{\sum_{c \in R} |\mathcal{L}_c| / |R|}$, where $R$ contains the class indices of the rare classes and $F$ contains the class indices of the remaining frequent classes. For example, for a dataset with 5 total classes, assume it has 2 rare classes; then $|R| = 2$. The remaining classes are frequent classes, and $|F| = 3$. Further, $\cup_{c \in R} \mathcal{L}_c$ denotes the set of labeled data points that belong to the rare classes, and similarly, $\cup_{c \in F} \mathcal{L}_c$ denotes the set of labeled data points that belong to the frequent classes. Note that $\mathrm{IR}(\mathcal{L}) = 1$ when $\mathcal{L}$ is perfectly balanced.

Appendix 0.C Scalability of Basil

Below, we provide a detailed analysis of the complexity of creating and optimizing the different SMI functions. Denote $|\mathcal{X}|$ as the size of a set $\mathcal{X}$, and let $n = |\mathcal{U}|$ (the ground set size, which is the size of the unlabeled set in this case), $b$ the selection budget, and $|\mathcal{Q}|$ the query set size.

  • Facility Location: We start with Flvmi. The complexity of creating the kernel matrix is $O(n^2)$. The complexity of optimizing it is $O(n)$ (using memoization, and ignoring log-factors) with the stochastic greedy algorithm [19], and $O(nb)$ with the naive greedy algorithm. The overall complexity is $O(n^2)$. For Flqmi, the cost of creating the kernel matrix is $O(n|\mathcal{Q}|)$, and the cost of optimization is also $O(n|\mathcal{Q}|)$ (with naive greedy, it is $O(n|\mathcal{Q}|b)$).

  • Graph-Cut: For Gcmi, we require an $O(n|\mathcal{Q}|)$ kernel matrix, and the complexity of the stochastic greedy algorithm is also $O(n|\mathcal{Q}|)$.

We end with a few comments. First, most of the complexity analysis above assumes the stochastic greedy algorithm [19]. If we use the naive or lazy greedy algorithm, the worst-case complexity is a factor $b$ larger. Secondly, we ignore log-factors in the complexity of stochastic greedy, since the complexity is actually $O(n \log(1/\epsilon))$, which achieves a $1 - 1/e - \epsilon$ approximation.
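A generic sketch of the stochastic greedy algorithm referenced above; the `gain` callback abstracts the marginal gain of whichever SMI function is being maximized, and all names are ours:

```python
import numpy as np

def stochastic_greedy(gain, n, budget, eps=0.01, seed=0):
    """Stochastic greedy maximization (Mirzasoleiman et al. [19]).

    Each step evaluates marginal gains only on a random sample of size
    (n / budget) * ln(1 / eps), yielding a (1 - 1/e - eps) approximation
    in O(n log(1/eps)) gain evaluations. gain(x, selected) must return the
    marginal gain of adding element x to the current selection.
    """
    rng = np.random.default_rng(seed)
    selected = []
    sample_size = int(np.ceil((n / budget) * np.log(1.0 / eps)))
    for _ in range(budget):
        remaining = np.array([i for i in range(n) if i not in selected])
        sample = rng.choice(remaining,
                            size=min(sample_size, len(remaining)),
                            replace=False)
        x_star = max(sample, key=lambda x: gain(x, selected))
        selected.append(int(x_star))
    return selected

# Modular toy objective: the gain of x is a fixed value, so greedy picks top values.
values = [5.0, 1.0, 3.0, 2.0]
picked = stochastic_greedy(lambda x, sel: values[x], n=4, budget=2)
```

For small $n$ the sample covers all remaining elements and the method reduces to naive greedy; the savings appear when $n \gg b$.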