Large training datasets are an essential ingredient in the success of deep learning (Li et al., 2021, 2022a; Zhang et al., 2021a, b) and its applications (e.g., image classification (Krizhevsky et al., 2017) and text classification (Chen et al., 2017; de Araujo et al., 2020)). However, in real-world systems, acquiring high-quality labeled data is challenging and often cost-prohibitive (e.g., legal text annotation (Xu et al., 2020) is expensive because domain experts are scarce). Therefore, training deep models with sparse (Li et al., 2022b) and ever-growing training data, with or without human involvement, is a vital task to explore.
To address this problem, prior studies have pursued two directions: 1) active learning (AL) (Gal et al., 2017; Mayer and Timofte, 2020; Yoo and Kweon, 2019), which aims to select the most informative samples for human annotation at the least labeling cost (Figure 1); and 2) semi-supervised learning (SSL) (Berthelot et al., 2019; Kipf and Welling, 2016; Chen et al., 2020), which likewise targets settings where labeled data are sparse and unlabeled data are abundant. Along this line of research, combining SSL and AL (i.e., SSL-AL) has the potential to further improve model performance. However, recent SSL-AL investigations are limited by two drawbacks: 1) most current SSL-AL methods (Zhang et al., 2020; Sinha et al., 2019; Kim et al., 2021) with a VAE-GAN structure suffer from the mismatch problem (Figure 1), as the learned sample representations do not facilitate (or even harm) classification; 2) current methods (Zhang et al., 2020; Sinha et al., 2019; Kim et al., 2021) employ SSL and AL as two isolated modules, whereas considerable potential can be unlocked by enabling interaction between SSL and AL (Figure 1).
Indeed, AL and SSL are complementary when they share the same goal (i.e., utilizing unlabeled samples), and they can work collaboratively to achieve mutual enhancement: 1) (SSL→AL) SSL can provide enhanced embeddings that help AL exploit the underlying distribution of unlabeled samples and evaluate their attributes; 2) (AL→SSL) AL can exclude some complex examples from the unlabeled pool and help SSL reduce the uncertainty of the model’s predictions. Exploiting the relatedness among learning methods can be regarded as a form of inductive transfer, e.g., inductive bias (Baxter, 2000), that steers the combined learning model toward the correct hypothesis and improves its performance. SSL-AL reciprocal optimization is a novel but critical problem for human-in-the-loop AI.
In this work, we use ‘sample inconsistency’ as the bridge that enables interaction between SSL and AL, inspired by the fact that sample inconsistency is an effective signal for both SSL (Berthelot et al., 2019) and AL (Gao et al., 2020). Robustness against sample inconsistency (e.g., the translation invariance of CNNs) provides essential information for enhancing a model’s generalization ability. Furthermore, a prior study (Miyato et al., 2018) has shown that not only human-perceivable but also human-unperceivable sample variance can have a significant impact on a model’s predictions.
To address these challenges, we propose a novel Inconsistency-based virtual aDvErsarial Active Learning (IDEAL) framework to enhance human-in-the-loop AI through effective SSL-AL fusion. IDEAL comprises three modules. First, the SSL label propagator propagates label information from the sparse training data to the unlabeled samples, using a mixture of coarse-grained (human-perceivable) and fine-grained (human-unperceivable) augmentations; it yields a task model with high consistency and low entropy. Then, the virtual inconsistency ranker ranks all unlabeled samples by their coarse-grained and fine-grained inconsistency and selects the top samples (i.e., those with poor consistency) as rough annotation candidates. Finally, the density-aware uncertainty re-ranker further filters the selected samples (favoring high entropy) and provides the final samples for human annotation.
To validate the performance of IDEAL, we provide a comprehensive evaluation on four benchmark datasets and two real-world datasets. Experiments demonstrate that IDEAL outperforms state-of-the-art approaches and yields noticeable economic benefits.
Our major contributions can be summarized as follows:
We propose a novel Inconsistency-based virtual aDvErsarial Active Learning (IDEAL) framework where SSL and AL form a closed-loop structure for reciprocal optimization and enhancement.
We develop novel coarse-grained and fine-grained data augmentation strategies and couple them to both SSL and AL. Specifically, we leverage the virtual adversarial technique to explore the continuous local distribution of unlabeled samples and to measure the inconsistency of samples.
We evaluate IDEAL over diverse text and image datasets. The results validate that IDEAL is superior to previous approaches and cost-effective in various industrial applications.
2. Related Work
In practice, it is easy to acquire abundant unlabeled samples. Thus, pool-based AL (Zhang et al., 2020; Yoo and Kweon, 2019; Sener and Savarese, 2018; Sinha et al., 2019) is more popular than the other two scenarios: stream-based (Dagan and Engelson, 1995) and membership query synthesis (Tran et al., 2019).
Pool-based active learning. Uncertainty-based sampling and distribution-based sampling are common methods in the pool-based scenario. Our method considers both uncertainty and distribution. Among uncertainty-based methods, Kapoor et al. (2007) use Gaussian processes to evaluate uncertainty, while Ebrahimi et al. (2020) use a Bayesian network. MC dropout (Gal et al., 2017) is used as an approximation of the Bayesian network. However, these methods have a high computational cost and cannot handle large-scale datasets. In the era of deep-learning-based AL, Zhang et al. (2017) and Yoo and Kweon (2019) adopt similar metrics to choose informative samples: the former selects samples that lead to the greatest change in the model’s gradient during training, while the latter selects samples with the largest training loss. However, the task model’s loss and gradient information are unstable in the early training stage, which may degrade the quality of the selected samples.
The distribution-based method (Sener and Savarese, 2018) selects a subset of samples whose feature distribution covers the entire feature space as much as possible. However, subset selection (an NP-hard problem) becomes computationally infeasible for large-scale datasets or high-dimensional input data, so subset-selection algorithms are inefficient in these settings.
Semi-supervised active learning. Recently, Sinha et al. (2019), Kim et al. (2021), and Zhang et al. (2020) have leveraged a VAE-GAN structure to learn representations of both labeled and unlabeled samples in latent space. However, these methods misuse the relation between labeling states and class labels. Consequently, the learning process may harm the semantic distribution of samples in latent space: annotation information (labeling states) and the feature distribution (class labels) are orthogonal, while the feature distribution is highly correlated with the semantic information of different classes. Compared to these VAE-GAN-based methods, our method uses inconsistency at different granularities to combine AL and SSL. Other works (Song et al., 2019; Gao et al., 2020) combine AL and SSL based on prediction consistency over a set of data augmentations. However, these methods estimate inconsistency using only a limited number of human-perceivable data augmentations. In contrast, our method further leverages human-unperceivable inconsistency to obtain richer information about unlabeled samples for model estimation and optimization. In addition, we consider the feature distribution of samples.
REVIVAL (Guo et al., 2021) suffers from high computational and memory costs caused by its critical step (building a KNN graph in every selection cycle). Consequently, deploying it in real-world scenarios to handle large-scale datasets faces enormous challenges. In addition, it does not integrate adversarial perturbation generation into the training process, so its selection criterion (adversarial perturbation) cannot adapt dynamically to the model’s optimization. Our method is free of these two deficiencies. First, we replace graph-based SSL with the Mixup technique to handle large-scale datasets. Second, we couple the selection criterion (fine-grained inconsistency) to model optimization for optimal perturbation generation.
Unlike our work, all of these methods are designed specifically for image classification and cannot be directly transferred to other media types (e.g., text).
In this section, we formulate the proposed Inconsistency-based virtual aDvErsarial Active Learning (IDEAL) algorithm. We first provide a brief overview of the whole AL framework, and then introduce the three main components of IDEAL: an SSL label propagator (section 3.2), a virtual inconsistency ranker (section 3.3), and a density-aware uncertainty re-ranker (section 3.4).
In this section, we formally describe the pool-based AL loop with our proposed IDEAL, illustrated in Figure 2. We maintain a labeled pool and an unlabeled pool, each with its own sample count. (1) IDEAL first feeds the samples in the pools into the SSL label propagator, which propagates label information from labeled samples to unlabeled samples by mixing up samples. (2) Based on the inconsistency signals provided by the SSL label propagator, the virtual inconsistency ranker calculates the total inconsistency (coarse-grained plus fine-grained) of each unlabeled sample in the pool and selects the top-ranked samples with the largest inconsistency as initial annotation candidates. (3) The density-aware uncertainty re-ranker then selects, from these initial candidates, a budget-sized subset with the largest density-aware uncertainty for human annotation; removing these unsmoothed and unstable samples from the unlabeled pool lets SSL perform better label propagation. The annotated samples move from the unlabeled pool to the labeled pool, and the two pool sizes are updated accordingly. The loop repeats until the annotation budget is exhausted.
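The three steps of the loop can be sketched in a few lines of Python. This is only an illustrative skeleton, not the authors' implementation: the function name `run_al_loop` and the callables `train`, `inconsistency`, `uncertainty`, and `oracle` are hypothetical placeholders for the SSL label propagator, the virtual inconsistency ranker, the density-aware uncertainty re-ranker, and the human annotator.

```python
def run_al_loop(labeled, unlabeled, train, inconsistency, uncertainty,
                oracle, num_candidates, batch_size, total_budget):
    """Pool-based AL loop: train, rank by inconsistency, re-rank, annotate.

    All callables are hypothetical stand-ins for the paper's modules.
    """
    labeled, unlabeled = dict(labeled), list(unlabeled)
    spent = 0
    # Step 1: (semi-supervised) training on both pools.
    model = train(labeled, unlabeled)
    while spent < total_budget and unlabeled:
        # Step 2: keep the candidates with the largest total inconsistency.
        ranked = sorted(unlabeled, key=lambda x: inconsistency(model, x),
                        reverse=True)
        candidates = ranked[:num_candidates]
        # Step 3: re-rank the candidates and pick a budget-sized batch.
        take = min(batch_size, total_budget - spent)
        chosen = sorted(candidates,
                        key=lambda x: uncertainty(model, x, candidates),
                        reverse=True)[:take]
        for x in chosen:
            labeled[x] = oracle(x)      # human annotation
            unlabeled.remove(x)         # move out of the unlabeled pool
        spent += len(chosen)
        # Retrain on the updated pools before the next cycle.
        model = train(labeled, unlabeled)
    return model, labeled, unlabeled
```

Passing the scoring functions as arguments keeps the skeleton independent of any particular model or similarity measure.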
3.2. SSL Label Propagator
We leverage the SSL label propagator to propagate label information from labeled samples to unlabeled samples by smoothing the local inconsistency distribution of unlabeled samples, and to obtain enhanced embeddings for the subsequent AL step.
Formally, given the unlabeled dataset and the labeled dataset, we first generate K augmented samples for each unlabeled sample through coarse-grained augmentations. Specifically, the coarse-grained augmentation is a family of data augmentation functions whose outputs differ from the original input by more than a threshold that separates human-perceivable from human-unperceivable differences. Inconsistency estimated under such augmentations is called coarse-grained; inconsistency estimated under changes below the threshold is called fine-grained. For text data, we adopt sentence-level transformations that rewrite sentences with similar meanings (back translation) as the coarse-grained augmentation strategy. For images, we use image-level transformations (e.g., rotation and clipping of the whole image) to obtain coarse-grained augmented samples.
In addition to the above discrete coarse-grained data transformations, we introduce pixel-level or embedding-level fine-grained augmentations, adding local perturbation to inputs’ representations in feature space (described in section 3.3), to force the model to explore continuous local distribution.
After applying augmentation strategies at these two granularities, we obtain the final augmented samples for every sample in the unlabeled set.
Then, we feed each unlabeled sample and its augmented samples into the task model to obtain their predicted labels. We generate a shared guessed label for the original sample and its augmentations by taking a weighted average of all their predicted labels, where the weights are the contribution ratios of the different augmented samples to the final guessed label. The weighted average encourages the task model to output consistent predictions for the different augmented samples. For text classification, we set the weights of the original samples and their augmented samples according to the translation quality of the different languages. For images, we set all weights to 1.
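The weighted average described above can be illustrated with a minimal sketch. The function name `guessed_label` is hypothetical, and normalizing by the sum of the weights is an assumption (it keeps the result a probability distribution); any further post-processing of the guessed label is not shown.

```python
def guessed_label(predictions, weights):
    """Weighted average of class-probability vectors, one per view.

    predictions[0] is the prediction for the original sample and the rest
    are predictions for its augmentations. Dividing by the weight sum is
    an assumption made here so that the output sums to 1.
    """
    if len(predictions) != len(weights):
        raise ValueError("one weight per prediction is required")
    total = sum(weights)
    num_classes = len(predictions[0])
    return [
        sum(w * p[c] for w, p in zip(weights, predictions)) / total
        for c in range(num_classes)
    ]


# Original sample plus two back-translations, weighted (1, 1, 0.5).
label = guessed_label([[0.8, 0.2], [0.6, 0.4], [0.2, 0.8]], [1.0, 1.0, 0.5])
```

With these inputs the lower-weighted third view pulls the guessed label less than the other two.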
Next, we obtain the labeled set, the unlabeled set, and the augmented set (an unlabeled sample and its augmentations share the same guessed label). We build training samples by randomly mixing the hidden states of two batches drawn from these three sets, together with their respective labels (or guessed labels).
This yields three types of mixed samples: (1) two batches of labeled samples; (2) a batch of labeled and a batch of unlabeled samples; and (3) two batches of unlabeled samples. In each case, the mixed hidden state and the mixed label are convex combinations of the hidden states and labels of the two inputs, governed by a shared weight factor. For text tasks, we mix the hidden states of a middle layer of the task model. For image tasks, we mix the raw input features directly. The weight factor is sampled from a Beta distribution controlled by a single hyper-parameter.
After the mixing process, the mixed feature aggregates the features of both labeled and unlabeled samples. Similarly, the mixed label aggregates supervised label information from labeled samples and distribution information from unlabeled samples.
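The mixing step can be sketched as follows. The helper names are hypothetical, and the `max(lam, 1 - lam)` step is an assumption borrowed from MixMatch-style Mixup (it keeps the mixed sample closer to its first input); the exact form used by the authors may differ.

```python
import random


def sample_mix_weight(alpha, rng=random):
    """Draw the Mixup weight from Beta(alpha, alpha).

    Taking max(lam, 1 - lam) is an assumption following MixMatch-style
    Mixup, not something confirmed by the surrounding text.
    """
    lam = rng.betavariate(alpha, alpha)
    return max(lam, 1.0 - lam)


def mixup(h_a, y_a, h_b, y_b, lam):
    """Convex combination of two hidden states and of their (guessed) labels."""
    mixed_h = [lam * a + (1.0 - lam) * b for a, b in zip(h_a, h_b)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return mixed_h, mixed_y


lam = sample_mix_weight(0.75, random.Random(0))
mixed_h, mixed_y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam)
```

The same `mixup` call covers all three pairings (labeled with labeled, labeled with unlabeled, unlabeled with unlabeled), since guessed labels are treated like ordinary label vectors.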
Finally, the loss function combines two terms: a cross-entropy loss, minimized using the supervised label information, and a consistency loss (e.g., KL-divergence or L2 norm), which constrains the augmented samples to receive the same label as their original unlabeled samples.
Trained with mixed samples, the label propagator propagates rich label information from labeled samples to unlabeled samples. The propagator smooths the local distribution of unlabeled samples and makes the task model insensitive to both coarse-grained data transformation and fine-grained perturbation, providing enhanced embeddings for AL. Mixup mines and optimizes both the local and the global sample distribution for SSL and AL. Apart from the local distribution mining based on the smoothing of point-wise augmentation, Mixup explores convex combinations of pairs of labeled, unlabeled, and augmented samples to mine the global distribution. Since enumerating all convex combinations of sample pairs to model the whole picture precisely is intractable, we use random sampling to estimate the distribution and optimize the corresponding objectives to learn the desired distribution. As a result, transformed and untransformed samples of similar classes lie close together.
3.3. Virtual Inconsistency Ranker
The SSL label propagator above smooths the local distribution around unlabeled samples, which means that the task model’s predictions for unlabeled samples tend to be consistent before and after perturbation. Our goal is to select, from the unlabeled pool, the samples with poor consistency for further annotation. Because these non-smooth samples have not acquired enough label information, the task model has only a vague understanding of their local distribution. In this sense, selecting non-smooth samples for annotation is more valuable, as it helps the task model smooth unlabeled samples better and output more stable predictions.
To select these non-smooth samples, the virtual inconsistency ranker estimates the inconsistency of the task model’s predictions on unlabeled samples under augmentation strategies at two granularities. Specifically, for each unlabeled sample in the same unlabeled pool as in section 3.2 (in this section, we use a different symbol for it to distinguish the selection stage from the SSL training stage): (1) we first obtain its coarse-grained augmentation set using the aforementioned augmentations, and feed the sample and its augmentations into the task model to get their predictions. (2) We then feed the sample and its augmentations into the ranker to get an adversarial perturbation for each augmented sample. (3) After that, we feed each fine-grained augmented sample (the augmented sample plus its perturbation) into the task model again to get the fine-grained augmented predictions. Because the input space of text is discrete, we apply the fine-grained augmentations to the text embedding vectors extracted from the middle layer of the task model.
The adversarial perturbation is the norm-bounded perturbation that most changes the posterior probability of the task model, as in virtual adversarial training (Miyato et al., 2018). Under perturbations of the same norm, the adversarial samples of unlabeled samples with unstable predictions are more likely to change their original label and receive predictions of other classes. Therefore, the ranker computes the KL-divergence between the posterior probabilities of samples and their adversarial samples to measure the inconsistency of unlabeled samples.
Since computing the adversarial perturbation exactly is intractable for many neural networks, Miyato et al. (2018) proposed approximating it with a second-order Taylor expansion and solving for it via the power iteration method. Specifically, starting from a randomly sampled unit vector, the perturbation direction is updated to the normalized gradient (i.e., the unit vector of the gradient) of the divergence between the label of the coarse-grained augmented samples and the fine-grained augmented prediction. This gradient can be computed with one set of backpropagation for neural networks.
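The power-iteration approximation can be made concrete on a toy model. The sketch below is an assumption-laden illustration, not the authors' code: it replaces the neural task model with a linear-softmax classifier (weights `W`), for which the gradient of the KL-divergence with respect to the perturbation has the closed form `W^T (q - p)`, so no autodiff library is needed. The names `xi` (small probing step), `eps` (perturbation norm), and all function names are hypothetical.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(W, x):
    """Toy task model: softmax over the linear logits W @ x."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in W])


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def virtual_adversarial_perturbation(W, x, eps, xi=1e-3, iters=1, rng=random):
    """Approximate the KL-maximizing perturbation of norm eps."""
    p = predict(W, x)                        # treated as a fixed target
    d = unit([rng.gauss(0, 1) for _ in x])   # random unit start direction
    for _ in range(iters):
        q = predict(W, [a + xi * b for a, b in zip(x, d)])
        # For this linear-softmax model, grad_r KL(p || q) = W^T (q - p).
        grad = [sum(W[c][j] * (q[c] - p[c]) for c in range(len(W)))
                for j in range(len(x))]
        d = unit(grad)                       # power-iteration update
    return [eps * c for c in d]


def fine_grained_inconsistency(W, x, r):
    """KL-divergence between predictions before and after perturbation."""
    return kl(predict(W, x), predict(W, [a + b for a, b in zip(x, r)]))
```

With a real network, the closed-form gradient line would be replaced by one backward pass through the model; the rest of the procedure is unchanged.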
Once the adversarial perturbation is solved, we can estimate the total inconsistency (Gao et al., 2020) of unlabeled samples under coarse-grained and fine-grained augmentations. Coarse-grained inconsistency is based on the variance of the predicted class probabilities across a sample and its coarse-grained augmentations, computed over all classes. Fine-grained inconsistency is based on the KL-divergence between the predictions before and after adding the adversarial perturbation. To combine these two selection criteria, we normalize them by transforming each into a percentile: the percentile of a sample under a criterion is the percentage of samples in the unlabeled dataset whose value under that criterion is smaller. The total inconsistency of each unlabeled sample is then a weighted combination of its two percentiles, where a weight coefficient balances the two criteria. Finally, we select the top-ranked samples with the largest total inconsistency from the unlabeled samples as the initial annotation candidates.
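The percentile-based combination can be sketched as below. All names are hypothetical; summing the per-class variance over classes and using `beta * coarse + (1 - beta) * fine` as the weighting are assumptions consistent with the description (and with the 0.4 / 0.6 split reported in Section 5.1), not a confirmed transcription of the original formulas.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def coarse_inconsistency(preds):
    """Sum over classes of the variance of that class's probability
    across the views of one sample (assumed form)."""
    num_classes = len(preds[0])
    return sum(variance([p[c] for p in preds]) for c in range(num_classes))


def percentile_ranks(scores):
    """Percentage of samples whose score is strictly smaller."""
    n = len(scores)
    return [100.0 * sum(other < s for other in scores) / n for s in scores]


def total_inconsistency(coarse, fine, beta):
    """Weighted sum of the two percentile ranks (assumed weighting)."""
    pc, pf = percentile_ranks(coarse), percentile_ranks(fine)
    return [beta * a + (1.0 - beta) * b for a, b in zip(pc, pf)]


def top_indices(scores, k):
    """Indices of the k largest scores, largest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


scores = total_inconsistency([0.1, 0.4, 0.2, 0.3], [0.9, 0.1, 0.5, 0.3], beta=0.4)
candidates = top_indices(scores, 2)
```

Converting to percentiles puts the variance-based and KL-based criteria on the same 0 to 100 scale before they are mixed, so neither dominates because of its units.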
3.4. Density-aware Uncertainty Re-ranker
Based on the goal described in section 3.3, the virtual inconsistency ranker roughly selects annotation candidates with large inconsistency, which can be regarded as an initial recall set of the final potential samples. However, there is still a gap between the goal of the virtual inconsistency ranker (i.e., selecting samples with large inconsistency) and our final AL goal (i.e., selecting samples with high uncertainty). We therefore propose a density-aware uncertainty re-ranker that ranks the annotation candidates further, guaranteeing that the selected samples have both high uncertainty and large inconsistency. In this way, the top samples in the candidate set can bring a large reduction in model uncertainty.
In practice, we use the entropy of the labels predicted by the task model to estimate the uncertainty of samples, and we select the final budget-sized batch from the initial annotation candidates. The entropy of each candidate is computed over its predicted class distribution, i.e., the probabilities that the candidate belongs to each class.
However, the entropy-based measure only estimates the uncertainty of each sample in isolation and fails to take the distributional relations among samples into account. As a result, it risks selecting outliers or unrepresentative samples in the distribution space. To alleviate this issue, we re-weight the uncertainty metric with a representativeness factor that explicitly considers the distribution structure of samples. The resulting density-aware uncertainty combines the entropy of each candidate with a second term: the similarity of that sample to the other samples in the candidate set, normalized by the size of the set. The similarity term can be computed with cosine similarity, Euclidean distance, Spearman’s rank correlation, or other metrics. Because cosine similarity is commonly used for text embeddings and is also effective for images, we use cosine similarity in this paper. This density-aware indicator helps us select samples with both high uncertainty and spatial representativeness.
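A minimal sketch of the density-aware uncertainty follows. The function names are hypothetical, and two details are assumptions: the entropy is multiplied by the similarity term, and the average similarity is taken over all candidates (including the sample itself) and divided by the candidate-set size.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def density_aware_uncertainty(probs, embeddings):
    """Entropy re-weighted by the average cosine similarity to the
    candidate set (assumed multiplicative form)."""
    m = len(probs)
    scores = []
    for i in range(m):
        density = sum(cosine(embeddings[i], embeddings[j]) for j in range(m)) / m
        scores.append(entropy(probs[i]) * density)
    return scores


probs = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
scores = density_aware_uncertainty(probs, embeddings)
```

In this toy input the first two candidates are equally uncertain, but the second is an outlier in embedding space, so the first receives the higher density-aware score.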
Accordingly, SSL and AL can work collaboratively for reciprocal enhancement in this framework. On the one hand, the SSL label propagator propagates the label information of labeled samples and smooths the local distribution around unlabeled samples through a mixture of labeled and unlabeled samples. SSL can then provide a consistency signal to help AL decide the annotation candidates. On the other hand, the virtual inconsistency ranker and the density-aware uncertainty re-ranker enable humans to annotate the samples with the largest inconsistency plus the highest uncertainty weighted by representativeness. In turn, AL helps the SSL label propagator exclude unstable and unsmoothed augmentation samples for better label propagation. We detail IDEAL in Algorithm 1 in Appendix A.1.
To validate the proposed method, we conduct a comprehensive evaluation in both the text and image domains, with four benchmark datasets and two new real-world datasets. We initiate each task with a small (randomly sampled) labeled training set, and the remaining unlabeled data are used as candidates for the AL/SSL components of IDEAL and the baselines. Once labeled by a human, samples are added to the training set. In each cycle, we train the task model with the newly updated training set. For a fair comparison, all baselines share the same initial training set and model parameters. All the figures in this study depict the results over five experiment trials. [Footnote 1: To help other researchers reproduce the experimental outcomes, we will make all the data and code available via https://github.com/AmazingJeff/IDEAL.]
We choose two text classification benchmarks, AG News (Zhang et al., 2015) and IMDB (Maas et al., 2011), and two image classification benchmarks, CIFAR-10 (Krizhevsky et al., 2009) and CIFAR-100 (Krizhevsky et al., 2009). Furthermore, we collect two industrial text classification datasets. The Legal Text dataset is collected from Chinese court trial cases and is used to classify and retrieve similar cases. We define 12 fact labels (elements of judgment basis) for each case. The Bidding dataset is used to help users filter procurement methods (different companies prefer different procurement methods) and locate the most suitable bidding opportunities. We define 22 labels of purchase and sale for each announcement document. Unlike the benchmarks, the Legal Text and Bidding datasets come from industrial applications, and their annotation requires experts with domain knowledge. The average annotation price per sample for these two datasets is $0.91 and $0.92, respectively.
For text classification tasks, we employ four baselines for comparison: (1) EGL (Zhang et al., 2017); (2) Core-set (Sener and Savarese, 2018); (3) MC-dropout (Siddhant and Lipton, 2018); (4) Random sampling (Figueroa et al., 2012). For supervised image classification tasks, we adopt seven baselines including: (1) Core-set (Sener and Savarese, 2018); (2) Random sampling (Figueroa et al., 2012); (3) ICAL (Gao et al., 2020); (4) SRAAL (Zhang et al., 2020); (5) VAAL (Sinha et al., 2019); (6) LLAL (Yoo and Kweon, 2019); (7) REVIVAL (Guo et al., 2021). Since some baselines under supervised learning can not be applied or transferred to SSL scenarios directly, we use different baselines in the SSL scenario. The baseline details can be found in Appendix A.3.
4.3. Evaluation Settings
We adopt Wide ResNet-28 (Oliver et al., 2018) as the backbone of the task models for image classification, and a BERT model with a two-layer MLP as the backbone for text tasks. The initial training set is drawn by random sampling and is uniformly distributed over classes. Under the supervised learning scenario, we train the task model on the labeled data only and therefore leave out the operations of the SSL propagator. Under the SSL scenario, the SSL hyper-parameters have been well explored in MixText (Chen et al., 2020) and MixMatch (Berthelot et al., 2019), which found in practice that the hyper-parameters can be fixed and need not be tuned on a per-experiment or per-dataset basis. We therefore follow these empirical settings. For text, we set K to 2 (back translation from German and Russian). Based on translation quality, the corresponding weight parameters for the original samples and their two augmented samples are (1, 1, 0.5). On the real-world datasets, we set the back-translation (English and German) weight parameters to (1, 0.8, 0.2). For images, we set K to 5 (5 augmentations obtained by horizontal flipping and random cropping) and all weight parameters to 1. We set the Beta distribution hyper-parameter to 16 and 0.75 for text and image datasets, respectively. Further, following Miyato et al. (2018), we tune the perturbation-norm hyper-parameter and obtain 1e-2 and 10 for text and image data, respectively.
4.4. Performance Analysis
Performance for image classification. Figures 3(a) and 3(b) depict the supervised learning performance of IDEAL on the image benchmarks. For CIFAR-10 (Figure 3(a)), IDEAL clearly outperforms state-of-the-art methods in all selection cycles. In Figure 3(b), IDEAL beats all baselines by a margin of up to 1.37%. Figures 3(c) and 3(d) show that, in the SSL setting, IDEAL is superior to the baselines throughout the entire sample selection process. Compared to AL, SSL brings essential improvements early in the selection process, so we use different selection batch sizes to better visualize the performance increase under the SSL scenario. In Figure 3(c), the accuracy of methods under the SSL scenario (with 150 labeled samples) is higher than that under the supervised learning scenario (with 4000 labeled samples). With 2000 labeled samples, IDEAL reaches an accuracy for which the previous SOTA requires more than 3500, reducing the annotation cost by at least 42.8%. Compared to mere SSL (random selection), IDEAL greatly reduces the number of annotations needed to reach the same accuracy (4000→1000). In Figure 3(d), when using 12500 labeled samples, semi-supervised IDEAL achieves higher accuracy than supervised IDEAL.
Performance for text classification. Figures 4(a) and 4(b) show the results of supervised IDEAL for text classification. IDEAL outperforms all baselines throughout the whole sample selection stage. Even on the ‘BERT scale’, IDEAL leads to a decent performance improvement, which validates the effectiveness of the proposed method. Figures 4(c) and 4(d) show the results of semi-supervised IDEAL for the text classification task, where Random denotes the text SSL method MixText (without AL). From the figures, we can see that IDEAL consistently achieves the best performance across different numbers of labeled samples on both datasets. The graph propagator limits REVIVAL’s performance because it fails to capture complex semantic relations among text samples and to build a high-quality relation graph. In contrast, IDEAL leverages the Mixup technique to propagate label information for subsequent inconsistency estimation. Further, IDEAL not only measures the prediction inconsistency of augmented samples but also estimates the robustness of sample embeddings to adversarial perturbation. Thus, compared to ICAL and REVIVAL, IDEAL significantly improves on the naive SSL method (Random) by excluding unsmoothed and unstable samples.
| Methods | Dataset | Setting | N | $/L | Cost $ | Accuracy % |
|---|---|---|---|---|---|---|
| Random (Figueroa et al., 2012) | Legal | SSL | 9,000 | 0.91 | 8,190 | 87.71 |
| ICAL (Gao et al., 2020) | Legal | SSL-AL | 7,000 | 0.91 | 6,370 | 87.77 |
| IDEAL | Legal | SSL-AL | 5,000 | 0.91 | 4,550 | 88.36 |
| Random (Figueroa et al., 2012) | Bidding | SSL | 2,500 | 0.92 | 2,300 | 85.35 |
| IDEAL | Bidding | SSL-AL | 1,000 | 0.92 | 920 | 86.25 |
| ICAL (Gao et al., 2020) | Bidding | SSL-AL | 2,500 | 0.92 | 2,300 | 87.60 |
| IDEAL | Bidding | SSL-AL | 2,000 | 0.92 | 1,840 | 87.70 |
Cost saving analysis of industrial application. To assess IDEAL’s potential for industrial applications, we perform an in-depth performance and cost-saving analysis on two real-world industrial text datasets. As shown in Figures 5(a), 5(c) and Figures 5(b), 5(d), IDEAL maintains consistent superiority over other SOTAs in all experiment settings. In real scenarios, AL is used to bring a system’s performance up to a usable standard with as few annotations as possible. From this perspective, a more practical way to measure AL’s performance is to compare the annotation cost (labeling cost per label ($/L) × number of labels (N)) required to reach a specific system performance. Table 1 summarizes the statistics of annotation costs: 1) Legal dataset. IDEAL reaches 88.36% accuracy with 5,000 labeled samples, reducing annotations by 44.44% (IDEAL vs Random sampling, saving $3,640) and 28.57% (IDEAL vs ICAL, saving $1,820), respectively. 2) Bidding dataset. IDEAL achieves 86.25% and 87.70% accuracy with 1,000 and 2,000 labeled samples, respectively. While performing better, IDEAL reduces the annotation cost by 60.0% (IDEAL vs Random sampling, saving $1,380) and 20.0% (IDEAL vs ICAL, saving $460). In sum, IDEAL is promising for significantly reducing the annotation cost of various industrial AI efforts. Notably, the analysis also shows that the data magnitude and the cost-cutting scale are positively correlated. In other words, as the data required by industrial tasks grows, IDEAL can exploit the unlabeled data more deeply, and the annotation cost saving can rise in lockstep with the data scale. For instance, the number of unlabeled samples in a legal fact-finding dataset exceeds 10 million; IDEAL can expedite the data annotation process and save a few hundred thousand dollars in the best-case scenario.
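The savings quoted above follow directly from the values in Table 1. The short check below reproduces the arithmetic; the function name `annotation_saving` is a hypothetical helper introduced only for this illustration.

```python
def annotation_saving(price_per_label, n_baseline, n_ours):
    """Return (money saved, percentage of annotations saved) when a method
    needs n_ours labels instead of n_baseline at the same price per label."""
    saved_labels = n_baseline - n_ours
    money_saved = price_per_label * saved_labels
    percent_saved = 100.0 * saved_labels / n_baseline
    return money_saved, percent_saved


# Legal dataset ($0.91 per label).
legal_vs_random = annotation_saving(0.91, 9000, 5000)    # about (3640, 44.44)
legal_vs_ical = annotation_saving(0.91, 7000, 5000)      # about (1820, 28.57)

# Bidding dataset ($0.92 per label).
bidding_vs_random = annotation_saving(0.92, 2500, 1000)  # about (1380, 60.0)
bidding_vs_ical = annotation_saving(0.92, 2500, 2000)    # about (460, 20.0)
```

Because the price per label is the same for every method on a given dataset, the percentage saved in cost equals the percentage saved in annotations.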
5. Model Analysis
To further validate the proposed IDEAL’s effectiveness, we select the most commonly used dataset (CIFAR-100) and perform an in-depth analysis of our method on the dataset.
5.1. Robustness to Hyperparameters
Hyperparameter: number of initial candidates. We analyze the robustness of the proposed method to the number of initial annotation candidates kept by the virtual inconsistency ranker, on the CIFAR-100 dataset under supervised learning. The results are shown in Figure 6(a). In each selection stage, the primary selection is made by inconsistency, and the further selection is made by entropy. Their impacts on performance vary with the number of initial candidates. The size of the final selection is not a hyperparameter, as it equals the budget size (2500) in each cycle. When the number of initial candidates equals the budget, only inconsistency affects the performance. When it equals the number of all unlabeled samples (equivalent to no primary selection), only entropy affects the performance. In between, inconsistency and entropy work together. Over the lower part of the tested range, the impact of inconsistency and entropy on performance is relatively stable (IDEAL is robust to this hyperparameter). Beyond that range, the impact of inconsistency begins to decrease and, correspondingly, the effect of entropy begins to increase. We prefer a smaller number of initial candidates to save computational overhead while preserving accuracy, and therefore set it to 6500. We obtain similar results for other datasets.
Hyperparameter: inconsistency weight coefficient. We conduct an exploratory analysis to find a proper weight coefficient for balancing coarse-grained and fine-grained inconsistency. On the one hand, Figure 6(b) shows that IDEAL obtains the best performance (accuracy of 64.74%) when the coefficient is 0.4, i.e., when the weights of coarse-grained and fine-grained inconsistency are 0.4 and 0.6, respectively. These results suggest that fine-grained inconsistency deserves a higher weight when selecting informative AL samples. This is reasonable because the fine-grained perturbation (pixel-level or embedding-level) guides the model to explore the continuous local distribution deeply and to select the non-smooth samples. On the other hand, using either coarse-grained or fine-grained inconsistency as the dominant guidance (a coefficient of 0.1 or 0.9) yields relatively unsatisfactory results. These findings indicate the importance of balancing coarse-grained and fine-grained inconsistency. In other words, the two granularities of inconsistency in IDEAL work together to pick informative unlabeled samples as AL annotation candidates for superior SSL in the following learning cycle.
5.2. Ablation Study
We present ablation studies on the CIFAR-100 dataset to evaluate the contribution of the critical modules of IDEAL; the results are given in Table 2. For convenience, we use Ranker and Re-ranker to denote the virtual inconsistency ranker and the density-aware uncertainty re-ranker, respectively. Without the ranker, the model accuracy drops significantly. Within the ranker, coarse-grained inconsistency estimation helps SSL exclude unstable samples (not robust to augmentation methods), while fine-grained inconsistency estimation helps SSL exclude unsmoothed samples (sensitive to adversarial perturbation). Besides, the continuous exploration of the local distribution is more beneficial than limited and discrete data augmentations (in Table 2, removing fine-grained inconsistency lowers accuracy more than removing coarse-grained inconsistency). Moreover, the re-ranker further constrains the entropy of the initially selected samples and takes the samples’ feature distribution into account. With our proposed density-aware uncertainty, the re-ranker further boosts model performance.
|Methods||Number of labeled samples|
|- density-aware module||68.24 ± 0.21||71.93 ± 0.19|
|- coarse-grained inconsistency||67.71 ± 0.18||71.38 ± 0.19|
|- fine-grained inconsistency||67.34 ± 0.20||71.09 ± 0.19|
Annotated data is the backbone of many ML algorithms, and low-resource learning, with or without human engagement, has become a vital task. In this study, we propose a novel inconsistency-based virtual adversarial active learning (IDEAL) framework in which SSL and AL form a closed-loop structure for reciprocal enhancement. IDEAL locates the most informative unlabeled samples for human annotation based on coarse-grained and fine-grained inconsistency estimation and density-aware uncertainty. To validate the effectiveness and practical advantages of IDEAL, we conducted experiments over several benchmarks and real-world datasets. Experimental results across text and image domains show that IDEAL achieves significant improvements over existing SOTA methods. In the future, we will investigate more sophisticated models to enable enhanced collaboration between humans and algorithms, e.g., using explainable SSL to maximize human annotators’ contribution.
Acknowledgements. This work has been supported in part by the National Key Research and Development Program of China (2018AAA0101900), Zhejiang NSF (LR21F020004), Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies, Alibaba Group through the Alibaba Research Intern Program, the Key Research and Development Program of Zhejiang Province, China (No. 2021C01013), the Key Research and Development Plan of Zhejiang Province (Grant No. 2021C03140), and the Chinese Knowledge Center of Engineering Science and Technology (CKCEST). The author Guo would also like to thank his girlfriend, Miss Earth (Xiaolin Li), for her considerable support in plotting figures.
- A model of inductive bias learning. Journal of Artificial Intelligence Research 12, pp. 149–198. Cited by: §1.
- Mixmatch: a holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 5049–5059. Cited by: §1, §1, §4.3.
- Importance of semantic representation: dataless classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 2, pp. 830–835. Cited by: §A.4.
- MixText: linguistically-informed interpolation of hidden space for semi-supervised text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2147–2157. Cited by: §1, §4.3.
- Recurrent attention network on memory for aspect sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 452–461. Cited by: §1.
- Committee-based sampling for training probabilistic classifiers. In Machine Learning Proceedings 1995, pp. 150–157. Cited by: §2.
- VICTOR: a dataset for brazilian legal documents classification. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 1449–1458. Cited by: §1.
- Uncertainty-guided continual learning with bayesian neural networks. In International Conference on Learning Representations, Cited by: §2.
- Active learning for clinical text classification: is it better than random sampling?. Journal of the American Medical Informatics Association 19 (5), pp. 809–816. Cited by: 4th item, §4.2, Table 1.
- Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 1183–1192. Cited by: §1, §2.
- Consistency-based semi-supervised active learning: towards minimizing labeling cost. In European Conference on Computer Vision, pp. 510–526. Cited by: 7th item, §1, §2, §3.3, §4.2, Table 1.
- Semi-supervised active learning for semi-supervised models: exploit adversarial examples with graph-based virtual labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2896–2905. Cited by: 9th item, §A.4, §2, §4.2.
- Active learning with gaussian processes for object categorization. In 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8. Cited by: §2.
- Task-aware variational adversarial active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8166–8175. Cited by: §1, §2.
- Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1.
- Learning multiple layers of features from tiny images. Cited by: §4.1.
- Imagenet classification with deep convolutional neural networks. Communications of the ACM 60 (6), pp. 84–90. Cited by: §1.
- Adaptive hierarchical graph reasoning with semantic coherence for video-and-language inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1867–1877. Cited by: §1.
- Compositional temporal grounding with structured variational cross-graph correspondence learning. arXiv preprint arXiv:2203.13049. Cited by: §1.
- End-to-end modeling via information tree for one-shot natural language spatial video grounding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8707–8717. Cited by: §1.
- Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pp. 142–150. Cited by: §4.1.
- Adversarial sampling for active learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3071–3079. Cited by: §1.
- Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence 41 (8), pp. 1979–1993. Cited by: §1, §3.3, §4.3.
- Realistic evaluation of deep semi-supervised learning algorithms. Advances in neural information processing systems 31. Cited by: §4.3.
- Active learning for convolutional neural networks: a core-set approach. In International Conference on Learning Representations, Cited by: 2nd item, §2, §2, §4.2.
- Deep bayesian active learning for natural language processing: results of a large-scale empirical study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2904–2909. Cited by: 3rd item, §4.2.
- Variational adversarial active learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5972–5981. Cited by: 5th item, §1, §2, §2, §4.2.
- Combining mixmatch and active learning for better accuracy with fewer labels. arXiv preprint arXiv:1912.00594. Cited by: §2.
- Bayesian generative active deep learning. In International Conference on Machine Learning, pp. 6295–6304. Cited by: §2.
- Distinguish confusing law articles for legal judgment prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3086–3095. Cited by: §1.
- Learning loss for active learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 93–102. Cited by: 8th item, §1, §2, §2, §4.2.
- State-relabeling adversarial active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8756–8765. Cited by: 6th item, §1, §2, §2, §4.2.
- MAGIC: multimodal relational graph adversarial inference for diverse and unpaired text-based image captioning. arXiv preprint arXiv:2112.06558. Cited by: §1.
- Consensus graph representation learning for better grounded image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 3394–3402. Cited by: §1.
- Character-level convolutional networks for text classification. Advances in neural information processing systems 28, pp. 649–657. Cited by: §4.1.
- Active discriminative text representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31. Cited by: 1st item, §2, §4.2.
Appendix A
A.1. Algorithm Details
A.2. Additional Details on the Dataset
Table 3 summarizes the statistics of datasets used in experiments. The legal dataset consists of the fact-finding portion of the public adjudication documents. The case type of the dataset is private lending disputes (PLD). We collect the bidding dataset from public bid notifications to classify enterprises’ purchase and sale documents. We leverage the bidding dataset to help customers search for desired bidding information.
A.3. Additional Details on the Baselines
The baselines used in our experiments can be summarized as follows:
- EGL (Zhang et al., 2017) selects samples with the largest expected gradient change, as they are expected to have a large impact on the model. The expectation is computed over the posterior distribution of labels for the sample according to the trained model.
- Core-set (Sener and Savarese, 2018) is a distribution-based sampling algorithm that selects a subset to cover the whole set’s distribution. Sener and Savarese (2018) illustrate the relation between the subset and the whole set intuitively from a geometric perspective, and the NP-hard subset selection problem can be solved efficiently by a greedy algorithm.
- MC-dropout (Siddhant and Lipton, 2018) selects samples with the largest uncertainty, which is calculated by Monte Carlo Dropout over multiple inference passes. It uses the max-entropy acquisition function.
- Random (Figueroa et al., 2012) sampling selects samples at random and often serves as the lower bound for active learning algorithms.
- VAAL (Sinha et al., 2019) learns representations of both labeled and unlabeled data in latent space. Afterwards, the VAE-GAN module uses the extracted representations to estimate the label state information and picks the most informative samples through a min-max adversarial training process.
- SRAAL (Zhang et al., 2020), compared to VAAL, explores the label state information more deeply and maps discrete label states into a continuous variable with a state relabeling method.
- ICAL (Gao et al., 2020) chooses samples with the highest inconsistency of predictions over a set of data augmentations.
- LLAL (Yoo and Kweon, 2019) annotates samples with the largest training loss.
- REVIVAL (Guo et al., 2021) utilizes a graph to boost AL with an effective prior by inferring and passing samples’ relationships through the graph. It annotates samples near the boundary of clusters in the graph, which could refine the graph faster as a better prior for SSL.
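To make the greedy subset selection mentioned for the Core-set baseline more concrete, the following is a minimal k-center greedy sketch over feature vectors. It is an illustration of the general greedy idea, not the baseline's exact implementation; the function name `k_center_greedy`, the use of Euclidean distance, and the toy features are assumptions.

```python
import numpy as np

def k_center_greedy(features, labeled_idx, budget):
    """Greedily pick `budget` points, each time taking the point that is
    farthest from its nearest already-chosen (or labeled) center."""
    features = np.asarray(features, dtype=float)
    labeled_idx = list(labeled_idx)
    n = features.shape[0]
    if not labeled_idx:
        raise ValueError("need at least one labeled sample as an initial center")
    if budget > n - len(labeled_idx):
        raise ValueError("budget exceeds the number of unlabeled samples")
    # Distance from every point to its nearest current center.
    diffs = features[:, None, :] - features[labeled_idx][None, :, :]
    min_dist = np.linalg.norm(diffs, axis=2).min(axis=1)
    selected = []
    for _ in range(budget):
        idx = int(np.argmax(min_dist))  # farthest point from all centers
        selected.append(idx)
        # The new center may now be the nearest one for some points.
        new_dist = np.linalg.norm(features - features[idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected

# Toy 1-D features: sample 0 is labeled; two far-apart regions remain uncovered.
feats = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
picked = k_center_greedy(feats, labeled_idx=[0], budget=2)
print(picked)
```

In the toy example the first pick lands in the distant region and the second pick returns to the farthest remaining point near the labeled sample, which is the covering behavior the baseline description refers to.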
A.4. Computational Costs Analysis
In this subsection, we compare the computational cost of IDEAL with that of its contemporary hierarchical sampling method REVIVAL (Guo et al., 2021). The experiments are conducted on a Linux server equipped with an Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz, 512GB RAM, and two NVIDIA Titan V GPUs. We use the Yahoo! Answers dataset (1,400,000 training samples) (Chang et al., 2008) to exhaustively test the computational cost of the methods by changing the size of the unlabeled pool. We fine-tune the hyper-parameters of both methods to reach their peak performance for a fair comparison. The results (averaged over five experiment trials) for actual running time and additional memory cost are shown in Table 4. We observe that REVIVAL consumes substantial computing resources, in both running time and additional memory, to build a KNN graph. In contrast, IDEAL handles a huge unlabeled pool faster and with less memory. Thus, IDEAL is more advantageous when facing large-scale, high-dimensional datasets and when deployed in practical industry scenarios.
|Data size||IDEAL||REVIVAL|
| ||Running time (s)||Memory cost (MB)||Running time (s)||Memory cost (MB)|