SoQal: Selective Oracle Questioning in Active Learning

04/22/2020 ∙ by Dani Kiyasseh, et al. ∙ University of Oxford

Large sets of unlabelled data within the healthcare domain remain underutilized. Active learning offers a way to exploit these datasets by iteratively requesting an oracle (e.g. a medical professional) to label instances. This process, which can be costly and time-consuming, is overly dependent upon an oracle. To alleviate this burden, we propose SoQal, a questioning strategy that dynamically determines when a label should be requested from an oracle. We perform experiments on five publicly available datasets and illustrate SoQal's superiority relative to baseline approaches, including its ability to reduce oracle label requests by up to 35%. SoQal also performs competitively in the presence of label noise: a scenario that simulates clinicians' uncertain diagnoses when faced with difficult classification tasks.



1 Introduction

The success of modern-day deep learning algorithms in the medical domain has been contingent upon the availability of large, labelled datasets (Poplin et al., 2018; Tomašev et al., 2019; Attia et al., 2019). The curation of such datasets, however, is a challenge due to the time-consuming nature of and high costs associated with labelling. This is particularly the case in the medical domain, where the input of expert medical professionals is required. One way of overcoming this challenge and exploiting large, unlabelled datasets is via active learning (AL) (Settles, 2009). In this setting, a learner is tasked with iteratively acquiring a subset of unlabelled instances and asking an oracle to label it, before adding it to the set of labelled instances. By presenting the most informative instances to the oracle, AL aims to improve the performance of algorithms while minimizing the burden of labelling on the oracle.

Although shown to be data-efficient, current AL approaches are overly reliant on the presence of an oracle: an oracle is always assumed to be available. Such over-reliance is detrimental for two reasons. Firstly, it limits the applicability of AL algorithms to scenarios where an oracle is either unavailable or ill-trained for the task at hand. This is prevalent, for instance, in low-resource healthcare settings where there is a shortage of qualified medical professionals. Secondly, over-reliance can still inundate experts with a significant number of label requests, the very burden AL aims to minimize. This is particularly consequential for expert medical professionals who have limited bandwidth and who are increasingly suffering from 'burnout' (West et al., 2016; Shanafelt et al., 2017).

Decreasing the dependence of AL algorithms on the oracle and further alleviating their associated labelling burden can significantly improve the manner in which a physician is involved while also overcoming limitations inherent in the oracle. Although Kiyasseh et al. (2020) suggest performing oracle-free active learning, they only consider the extreme scenarios where an oracle is either unavailable or always available. We hypothesize that an algorithm capable of finding a middle-ground in terms of an oracle strategy could lead to less dependence on an oracle while not compromising, or potentially improving, performance. To design such an algorithm, we take inspiration from work in selective classification (Chow, 1970; El-Yaniv and Wiener, 2010) where algorithms learn to abstain from making a prediction.

Our Contributions. In this paper, we challenge the traditional assumptions of active learning, namely the availability of noise-free oracles, and propose a dynamic strategy to deal with them:

  1. Selective Oracle Questioning (SoQal): a dynamic strategy that learns when to request a label from an oracle during active learning.

  2. A novel objective function that helps a network predict the zero-one classification loss incurred on the main task. We use this prediction to control the dependence of the network on an oracle.

2 Related Work

Active learning and healthcare have been relatively under-explored. A recent review of active learning methodologies can be found in Settles (2009). In the healthcare domain, Gong et al. (2019) propose to acquire instances from an electronic health record (EHR) database using a Bayesian deep latent Gaussian model to improve mortality prediction. Smailagic et al. (2018, 2019) introduce MedAL, a method that actively acquires unannotated medical images by measuring their distance in a latent space to images in the training set. Such similarity metrics, however, are sensitive to the original amount of labelled training data. The work of Wang et al. (2019) is similar to ours in that they focus on the electrocardiogram. Gal et al. (2017) adopt BALD (Houlsby et al., 2011) in the context of Monte Carlo Dropout to acquire datapoints that maximize the Jensen-Shannon divergence (JSD) across MC samples. There have been several attempts at learning from multiple or imperfect labelers (Dekel et al., 2012; Zhang and Chaudhuri, 2015; Sinha et al., 2019). Urner et al. (2012) propose choosing the oracle that should label a particular instance. Unlike our approach, they do not explore independence from an oracle. Yan et al. (2016) do consider abstention in an AL setting, yet it is performed by the labeler. Instead, our approach places the decision of abstention under the control of the learner. To the best of our knowledge, previous work, in contrast to ours, has assumed the existence of an oracle and has not explored a dynamic oracle selection strategy.

Selective classification and healthcare fit well with one another given the high-stakes scenarios present in the latter. Early work in selective classification by Chow (1970) introduces the risk-coverage trade-off whereby the empirical risk of a model is inversely related to its rate of abstentions. El-Yaniv and Wiener (2010) define perfect learning as an empirical risk of zero with non-zero coverage of instances and propose the Consistent Selective Strategy (CSS) to achieve it. Wiener and El-Yaniv (2011) use a support vector machine (SVM) to rank and reject instances based on the degree of disagreement between hypotheses. In some frameworks, these happen to be the same instances that active learning treats as most informative. More recently, Cortes et al. (2016) outline an objective function that penalizes inappropriate abstentions alongside the rate at which they are performed. Instead of using uncertainty-based heuristics, such as the entropy of the posterior predictive distribution or the Softmax-Response (Geifman and El-Yaniv, 2017), Liu et al. (2019) exploit portfolio theory and propose the gambler's loss in order to learn a selection function that determines whether instances are rejected. However, this approach requires a significant amount of hyperparameter tuning. Most similar to our work is SelectiveNet (Geifman and El-Yaniv, 2019), where a multi-head neural architecture is used in conjunction with an empirical selective risk (ESR) objective function and a percentile threshold. In contrast, our work proposes a different objective function and thresholding mechanism, and specifically considers oracle selection. Moreover, ESR assumes that the labels associated with instances are known; we extend selective classification to the setting where labels are unknown.

3 Methods

3.1 Active Learning

In this work, we consider a learner: a neural network that maps inputs to posterior class probabilities, where C is the number of classes. After training on a pool of labelled data for a fixed number of epochs, the learner is tasked with querying the unlabelled pool of data and acquiring the top fraction of instances that it deems to be most informative.
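The pool-based loop described above can be sketched as follows. All names, signatures, and the `frac` parameter are illustrative placeholders, not the paper's implementation:

```python
import numpy as np

def active_learning_loop(train, acquire, oracle, labelled, unlabelled,
                         n_rounds, frac):
    """Pool-based AL loop sketch. train(labelled) -> model;
    acquire(model, pool) -> per-instance informativeness scores;
    oracle(x) -> label. All names and signatures are illustrative."""
    for _ in range(n_rounds):
        model = train(labelled)
        scores = acquire(model, unlabelled)
        b = max(1, int(frac * len(unlabelled)))
        top = np.argsort(scores)[::-1][:b]       # most informative first
        for i in sorted(top, reverse=True):      # pop higher indices first
            x = unlabelled.pop(int(i))
            labelled.append((x, oracle(x)))      # oracle provides the label
    return train(labelled)
```

In each round, the model is retrained, the remaining pool is scored, and the top fraction of instances is moved from the unlabelled pool to the labelled set with oracle-provided labels.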

The degree of informativeness of an instance is determined by an acquisition function, such as that found in Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011) or ALPS (Kiyasseh et al., 2020). Such approaches, when used in conjunction with Monte Carlo Dropout (MCD) (Gal and Ghahramani, 2016), identify instances that lie in the region of classification uncertainty: the region in which hypotheses disagree the most about instances. One forward pass of MCD outputs a softmax posterior distribution under parameters sampled from the MC distribution. To obtain an accurate approximation of the hypothesis space, this is repeated T times, resulting in T posterior distributions for each instance.


The resulting acquisition score is

$$a(x) = \mathrm{JSD}\left(p_1, \ldots, p_T\right) = H\!\left(\frac{1}{T}\sum_{t=1}^{T} p(y \mid x, \omega_t)\right) - \frac{1}{T}\sum_{t=1}^{T} H\!\left(p(y \mid x, \omega_t)\right)$$

where JSD is the Jensen-Shannon Divergence, H represents the entropy function, and $\omega_t$ are the parameters of the t-th MC sample. Once instances are acquired, they are provided to an oracle, who is assumed to be available and noise-free, for labelling before being added to the pool of labelled instances. This process is repeated until the performance of an algorithm is considered to be sufficient.
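As a concrete sketch, the JSD-based score can be computed from the T stochastic softmax outputs as the entropy of the mean posterior minus the mean of the per-sample entropies (the standard BALD mutual-information estimate; the array-shape convention is our own):

```python
import numpy as np

def bald_scores(mc_probs):
    """JSD across T MC Dropout samples: H(mean posterior) minus the mean
    of the per-sample entropies. mc_probs has shape (T, N, C)."""
    eps = 1e-12
    mean_p = mc_probs.mean(axis=0)                                    # (N, C)
    h_of_mean = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)
    mean_of_h = -(mc_probs * np.log(mc_probs + eps)).sum(axis=-1).mean(axis=0)
    return h_of_mean - mean_of_h                                      # (N,)

def acquire_top_b(mc_probs, b):
    """Indices of the b instances the MC samples disagree about most."""
    return np.argsort(bald_scores(mc_probs))[::-1][:b]
```

An instance whose T posteriors agree receives a score near zero, while an instance on which the sampled hypotheses disagree receives a high score and is acquired first.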

3.2 Selective Oracle Questioning

Traditionally, in AL, requesting a label from an oracle automatically follows the act of selecting an unlabelled instance. We challenge this convention and treat these two processes as independent of one another. This section describes how to choose whether or not to request a label after an unlabelled instance has been chosen.

Architecture. We assume the existence of a prediction network, which for each instance generates posterior class probabilities, and an oracle selection network, which maps that same instance to a scalar, as shown in Fig. 1.

Figure 1: Selective Oracle Questioning Framework.

Objective Function. We interpret the selection network's scalar output as approximating the probability that an oracle is requested for a label. Ideally, a network should only rely on an oracle when it cannot classify an instance correctly itself. Therefore, high values of this output should be associated with incorrect network predictions and, conversely, low values with correct network predictions. We encourage this behaviour by assigning the zero-one loss, e, of the prediction network as the ground-truth target of the selection network.

We note that, for each instance, this ground truth label will inevitably shift during training as the network becomes more adept at classifying it. Early in training, the ratio of misclassified to correctly classified instances will be high. Late in training, the opposite is true. If such ratios are left unaccounted for, with e being used as the ground truth label, the majority of the selection network's outputs will be high early during training and low near the end. Distinguishing between individual instances based solely on this output would therefore be difficult, rendering it an unreliable signal for oracle selection. This scenario is equivalent to that of class imbalance. We describe how to mitigate this effect below.

Our objective function for a mini-batch of size B thus consists of two terms: 1) a cross-entropy class prediction loss for the main task, and 2) a weighted binary cross-entropy loss for the oracle selection network:

$$\mathcal{L} = \frac{1}{B}\sum_{i=1}^{B}\left[-\log p(c_i \mid x_i) - \lambda\, e_i \log s_i - (1 - e_i)\log(1 - s_i)\right]$$

where c is the target class and $s_i$ is the selection network's output. To offset the aforementioned class imbalance, we introduce a dynamic hyperparameter, $\lambda = \frac{\sum_i \delta(\hat{y}_i, c_i)}{\sum_i \left(1 - \delta(\hat{y}_i, c_i)\right)}$, which changes according to the ratio of correctly classified to misclassified instances within a mini-batch, where $\delta$ is the Kronecker delta function. As training progresses and misclassifications become rarer, $\lambda$ increases accordingly.
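The published form of the objective is not fully recoverable from the extracted text, so the following is an illustrative NumPy version of the two-term loss, with the dynamic weight computed from the within-batch ratio of correct to incorrect predictions:

```python
import numpy as np

def joint_loss(probs, targets, s_out):
    """Two-term objective sketch: cross-entropy on the main task plus a
    weighted binary cross-entropy pushing the selection output s_out
    towards the zero-one loss e, with a dynamic weight counteracting the
    within-batch imbalance between e = 1 and e = 0 (illustrative form).
    probs: (B, C) softmax outputs; targets: (B,); s_out: (B,) in (0, 1)."""
    eps = 1e-12
    B = len(targets)
    ce = -np.log(probs[np.arange(B), targets] + eps).mean()

    e = (probs.argmax(axis=-1) != targets).astype(float)   # zero-one loss
    n_wrong = max(e.sum(), 1.0)
    lam = (B - e.sum()) / n_wrong     # ratio of correct to misclassified
    bce = -(lam * e * np.log(s_out + eps)
            + (1 - e) * np.log(1 - s_out + eps)).mean()
    return ce + bce
```

A selection output that is high on misclassified instances and low on correctly classified ones yields a lower loss than the opposite assignment, which is the behaviour the objective is designed to encourage.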

Thresholding. As we are dealing with unlabelled instances, we are interested in exploiting the selection network's output as a proxy for whether an instance is correctly classified (e = 0) or not (e = 1). The separability of these two states determines the reliability of such a proxy. In Fig. 2(b), we illustrate the distribution of the output values that correspond to e = 0 and e = 1 on the labelled training data.

(a) Early in Training
(b) Late in Training
(c) Hellinger Distance
Figure 2: Density of the outputs of the oracle selection network conditioned on the zero-one classification error (a) early in training and (b) late in training. (c) Hellinger distance, $\mathcal{H}$, between the distributions of the outputs of the selection function during training. Delegation of oracle questioning to the network occurs when $\mathcal{H} > S$. Notice the improved separability of the two distributions as a result of the training procedure.

At the end of each training epoch, the values in Fig. 2(b) were fit to two unimodal Gaussian distributions. This generates $\mathcal{N}(\mu_0, \sigma_0^2)$ and $\mathcal{N}(\mu_1, \sigma_1^2)$ for e = 0 and e = 1, respectively. We quantify the separability of these two distributions using the Hellinger distance, $\mathcal{H}$:

$$\mathcal{H}^2 = 1 - \sqrt{\frac{2\sigma_0\sigma_1}{\sigma_0^2 + \sigma_1^2}}\,\exp\!\left(-\frac{\left(\mu_0 - \mu_1\right)^2}{4\left(\sigma_0^2 + \sigma_1^2\right)}\right)$$


If, at a particular acquisition epoch, $\mathcal{H}$ does not exceed some threshold S, then the selection network cannot be relied upon and an oracle is always requested for a label. The value of S can be altered depending on the degree of trust one has in the network and labeller. When $\mathcal{H} > S$, the two fitted Gaussians are evaluated at the selection network's output, s, for each acquired unlabelled instance. We outline the probability of asking an oracle, p(A), in Eq. 4:

$$p(A) = \frac{\pi_1\,\mathcal{N}\!\left(s;\, \mu_1, \sigma_1^2\right)}{\pi_0\,\mathcal{N}\!\left(s;\, \mu_0, \sigma_0^2\right) + \pi_1\,\mathcal{N}\!\left(s;\, \mu_1, \sigma_1^2\right)} \qquad (4)$$

where $\pi_0$ and $\pi_1$ are the prior probabilities of the two states. Algorithms LABEL:algo:al and LABEL:algo:soqal in Appendix B illustrate the entire active learning procedure.


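A minimal sketch of this machinery, assuming unimodal Gaussian fits, the standard closed-form Hellinger distance between two univariate Gaussians, and an illustrative Bayes-posterior form for the probability of asking the oracle (the paper's exact decision rule is not recoverable from the extracted text):

```python
import numpy as np

def fit_gaussians(s_vals, e_vals):
    """Fit one Gaussian to the selection outputs of correctly classified
    training instances (e = 0) and one to those misclassified (e = 1)."""
    return {e: (s_vals[e_vals == e].mean(),
                max(s_vals[e_vals == e].std(), 1e-6))
            for e in (0, 1)}

def hellinger(params):
    """Closed-form Hellinger distance between two univariate Gaussians."""
    (m0, s0), (m1, s1) = params[0], params[1]
    bc = np.sqrt(2 * s0 * s1 / (s0**2 + s1**2)) \
         * np.exp(-0.25 * (m0 - m1)**2 / (s0**2 + s1**2))
    return float(np.sqrt(1 - bc))

def p_ask(s, params, prior1=0.5):
    """Probability of requesting an oracle label for an instance with
    selection output s: posterior that the instance would be misclassified
    (an illustrative Bayes-posterior form)."""
    def pdf(x, m, sd):
        return np.exp(-0.5 * ((x - m) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    l0 = (1 - prior1) * pdf(s, *params[0])
    l1 = prior1 * pdf(s, *params[1])
    return l1 / (l0 + l1)
```

When the two fitted densities barely overlap, the Hellinger distance approaches 1 and the selection output becomes a reliable proxy; when they coincide, the distance is 0 and the oracle should always be consulted.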
3.3 Chernoff Bound on Error Rate of Selection Network

Given that the selection network is tasked with making a binary decision, we can obtain a theoretical upper bound on its probability of making an error (via the overlap of density functions in Fig. 2). An error in this context can be interpreted as stubbornness, where the network does not ask for help when it should have, and over-reliance, where the network asks for help when it should not have. The Chernoff upper bound on the error rate is as follows. The full derivation can be found in Appendix C.


$$P(\text{error}) \le \pi_0^{\beta^*}\, \pi_1^{1-\beta^*}\, e^{-k(\beta^*)}, \qquad k(\beta) = \frac{\beta(1-\beta)\left(\mu_1 - \mu_0\right)^2}{2\left(\beta\sigma_1^2 + (1-\beta)\sigma_0^2\right)} + \frac{1}{2}\ln\frac{\beta\sigma_1^2 + (1-\beta)\sigma_0^2}{\sigma_0^{2(1-\beta)}\,\sigma_1^{2\beta}}$$

where $\pi_0$ and $\pi_1$ represent the prior probabilities of each class corresponding to the zero-one loss, and $(\mu_0, \sigma_0^2)$ and $(\mu_1, \sigma_1^2)$ parameterize the fitted Gaussians. $\beta^* \in [0, 1]$ is obtained by minimizing the exponent term.
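The bound can be evaluated numerically with a grid search over β. This sketch uses the standard two-Gaussian Chernoff exponent for univariate class-conditional densities; the grid search stands in for whatever optimizer the paper used:

```python
import numpy as np

def chernoff_bound(prior0, m0, v0, m1, v1):
    """Chernoff upper bound on the two-class error rate for univariate
    Gaussian class-conditional densities with means m0, m1 and variances
    v0, v1:  P(error) <= prior0^b * prior1^(1-b) * exp(-k(b)),
    minimized over b in (0, 1) by grid search."""
    prior1 = 1.0 - prior0
    b = np.linspace(1e-3, 1 - 1e-3, 999)
    vmix = b * v1 + (1 - b) * v0                      # mixed variance
    k = (b * (1 - b) * (m1 - m0) ** 2 / (2 * vmix)
         + 0.5 * np.log(vmix / (v0 ** (1 - b) * v1 ** b)))
    return float((prior0 ** b * prior1 ** (1 - b) * np.exp(-k)).min())
```

For identical class-conditional densities with equal priors, the bound is the trivial 0.5; for well-separated densities it decays exponentially in the squared mean gap.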

4 Experimental Design

4.1 Datasets

Experiments were implemented in PyTorch (Paszke et al., 2019) and were conducted on five publicly available datasets. These datasets consist of physiological time-series data such as the photoplethysmogram (PPG) and the electrocardiogram (ECG) alongside available cardiac arrhythmia labels. We use $\mathcal{D}_1$ = PhysioNet 2015 PPG and $\mathcal{D}_2$ = PhysioNet 2015 ECG (Clifford et al., 2015) (5-way), $\mathcal{D}_3$ = PhysioNet 2017 ECG (Clifford et al., 2017) (4-way), $\mathcal{D}_4$ = Cardiology ECG (Hannun et al., 2019) (12-way), and $\mathcal{D}_5$ = PTB ECG (Bousseljot et al., 1995) (2-way).

4.2 Baselines

We experiment with baselines that exhibit varying degrees of oracle dependence. No Oracle was explored by Kiyasseh et al. (2020), where 0% of labels are oracle-based and are instead based on network predictions. Epsilon Greedy is a stochastic strategy from the reinforcement learning literature (Watkins, 1989) where the degree of network exploration, performed with probability ε, is decayed exponentially throughout training. In our case, we exponentially decay the reliance of the network on an oracle as a function of the number of acquisition epochs. Entropy Response assumes that high-entropy predictions generated by a network are indicative of instances of which the network is unsure. Therefore, we introduce an entropy threshold such that, if it is exceeded, an oracle is requested to label the chosen instance. The most dependent baseline is 100% Oracle, a traditionally-employed strategy in AL where 100% of labels are oracle-based.
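The two intermediate baselines can be sketched as follows. The decay constant and threshold are illustrative hyperparameters, not values from the paper:

```python
import math
import random

def epsilon_greedy_ask(acq_epoch, decay=0.5):
    """Epsilon Greedy baseline: the probability of relying on the oracle
    decays exponentially with the acquisition epoch (illustrative rate)."""
    eps = math.exp(-decay * acq_epoch)
    return random.random() < eps

def entropy_response_ask(probs, threshold):
    """Entropy Response baseline: ask the oracle only when the entropy of
    the network's posterior exceeds the threshold."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h > threshold
```

Epsilon Greedy always defers to the oracle at the first acquisition epoch and almost never at late ones, regardless of the instance, whereas Entropy Response conditions the request on the network's uncertainty about the specific instance.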

We do not compare our methods to Softmax Response (Geifman and El-Yaniv, 2017) and SelectiveNet (Geifman and El-Yaniv, 2019), despite their strong performance for selective classification, as they do not trivially extend to the setting in which labels are unavailable.

4.3 Hyperparameters

Active Learning. For all experiments, we follow the hyperparameter choices in Kiyasseh et al. (2020). Namely, we chose the number of MC samples T = 20. Acquisitions of unlabelled instances were performed at pre-defined epochs during training. Moreover, the number of instances acquired during each acquisition epoch is a fixed fraction of the remaining unlabelled instances. Lastly, we chose the temporal period = 1 for all experiments involving temporal variants of acquisition functions, as described later.

Selective Oracle Questioning. In order for selective oracle questioning to be delegated to the network, we must have $\mathcal{H} > S$. Given that the Hellinger distance was observed to have an increasing trend, as seen in Fig. 2(c), we chose a value of S that balances the reliability of the selection function against independence from an oracle. We also explore the sensitivity of SoQal to this choice of S.

5 Experiments

5.1 Selective Oracle Questioning with Noise-Free Oracle

The ability of a learner to appropriately determine when to request labels from an oracle can significantly alleviate the associated labelling burden. In this section, we evaluate this ability amongst the various oracle selection strategies. In Fig. 3, we illustrate the validation AUC of SoQal during training compared to that of the proposed baselines. We show that the 100% Oracle strategy outperforms the remaining methods, as seen in Figs. 3(a) and 3(b). We expect this behaviour as labels from a noise-free oracle are likely to be accurate. Conversely, the No Oracle strategy struggles in both settings. This can be explained by the idea that complete independence from an oracle, whereby labels are network-generated, is likely to lead to noisy labels and thus hinder performance. Based on these findings, it is clear that a dynamic oracle questioning strategy can offer a balance.

Figure 3: Mean validation AUC as a function of oracle selection strategies on (a) using BALDMCP and (b) using BALDMCD. Results are averaged across 5 seeds.

We illustrate, in Table 1, the test AUC of the oracle questioning strategies on all datasets. Across $\mathcal{D}_1$-$\mathcal{D}_3$, we show that SoQal consistently outperforms its counterparts. For instance, while using BALDMCD on $\mathcal{D}_1$, SoQal achieves a notably higher AUC than both Epsilon Greedy and Entropy Response. These findings suggest that SoQal is better equipped to know when, and for which instance, a label should be requested from an oracle. However, we observe that SoQal performs on par with and relatively worse than the remaining methods on $\mathcal{D}_4$ and $\mathcal{D}_5$, respectively. We hypothesize that the former result is due to the cold-start problem (Konyushkova et al., 2017), whereby AL algorithms fail to learn due to few available labelled training data. We support this claim with experiments in Appendix I. As for the worse performance on $\mathcal{D}_5$, we believe this is due to the high degree of independence endowed upon the learner given the choice of S. Increasing the value of S will cede control to the oracle and thus improve performance, an effect we quantify in Section 5.4.

Dataset | Ac. Function | No Oracle | Entropy Response | Epsilon Greedy | SoQal (ours) | 100% Oracle | No AL
$\mathcal{D}_1$ | | 0.465 ± 0.017 | 0.496 ± 0.039 | 0.491 ± 0.028 | 0.621 ± 0.021 | 0.653 ± 0.013 | 0.577 ± 0.014
 | | 0.464 ± 0.023 | 0.517 ± 0.043 | 0.501 ± 0.043 | 0.645 ± 0.015 | 0.676 ± 0.020 |
 | | 0.500 ± 0.023 | 0.548 ± 0.034 | 0.548 ± 0.042 | 0.598 ± 0.055 | 0.634 ± 0.030 |
 | Temporal | 0.496 ± 0.024 | 0.536 ± 0.040 | 0.521 ± 0.059 | 0.646 ± 0.067 | 0.659 ± 0.033 |
$\mathcal{D}_2$ | | 0.573 ± 0.063 | 0.584 ± 0.041 | 0.609 ± 0.071 | 0.707 ± 0.038 | 0.713 ± 0.053 | 0.679 ± 0.040
 | | 0.589 ± 0.045 | 0.638 ± 0.043 | 0.637 ± 0.044 | 0.677 ± 0.042 | 0.735 ± 0.028 |
 | | 0.602 ± 0.044 | 0.582 ± 0.017 | 0.643 ± 0.033 | 0.677 ± 0.024 | 0.722 ± 0.018 |
 | Temporal | 0.575 ± 0.017 | 0.612 ± 0.050 | 0.605 ± 0.019 | 0.648 ± 0.057 | 0.735 ± 0.011 |
$\mathcal{D}_3$ | | 0.581 ± 0.014 | 0.588 ± 0.013 | 0.673 ± 0.015 | 0.721 ± 0.025 | 0.802 ± 0.008 | 0.716 ± 0.012
 | | 0.623 ± 0.020 | 0.676 ± 0.058 | 0.665 ± 0.028 | 0.720 ± 0.044 | 0.798 ± 0.007 |
 | | 0.631 ± 0.010 | 0.629 ± 0.004 | 0.643 ± 0.041 | 0.731 ± 0.033 | 0.787 ± 0.008 |
 | Temporal | 0.600 ± 0.005 | 0.630 ± 0.014 | 0.654 ± 0.019 | 0.730 ± 0.024 | 0.794 ± 0.002 |
$\mathcal{D}_4$ | | 0.486 ± 0.011 | 0.489 ± 0.030 | 0.474 ± 0.037 | 0.468 ± 0.021 | 0.585 ± 0.011 | 0.486 ± 0.023
 | | 0.493 ± 0.030 | 0.504 ± 0.026 | 0.492 ± 0.024 | 0.499 ± 0.029 | 0.605 ± 0.024 |
 | | 0.505 ± 0.032 | 0.504 ± 0.039 | 0.473 ± 0.010 | 0.495 ± 0.012 | 0.588 ± 0.033 |
 | Temporal | 0.511 ± 0.030 | 0.496 ± 0.023 | 0.496 ± 0.023 | 0.503 ± 0.010 | 0.532 ± 0.027 |
$\mathcal{D}_5$ | | 0.717 ± 0.006 | 0.715 ± 0.005 | 0.718 ± 0.006 | 0.661 ± 0.105 | 0.937 ± 0.004 | 0.710 ± 0.097
 | | 0.719 ± 0.009 | 0.678 ± 0.074 | 0.774 ± 0.047 | 0.453 ± 0.136 | 0.705 ± 0.013 |
 | | 0.679 ± 0.056 | 0.664 ± 0.064 | 0.726 ± 0.019 | 0.638 ± 0.145 | 0.900 ± 0.036 |
 | Temporal | 0.720 ± 0.010 | 0.689 ± 0.061 | 0.741 ± 0.028 | 0.571 ± 0.161 | 0.708 ± 0.002 |
Table 1: Mean test AUC of oracle questioning strategies in the presence of a noise-free oracle. Results are shown for datasets $\mathcal{D}_1$-$\mathcal{D}_5$ and all acquisition functions. Mean and standard deviation values are shown across five seeds. 'No AL' is the strategy that does not employ active learning.

5.2 Selective Oracle Questioning with Noisy Oracle

In healthcare, physicians may be ill-trained, fatigued, or unable to diagnose a case due to its difficulty. We simulate these scenarios by introducing two types of label noise. We stochastically flip each label 1) to any other label randomly (Random), or 2) to its nearest neighbour from a different class in a compressed subspace (Nearest Neighbour). Whereas the first form of noise is extreme, the latter is more realistic as it may represent uncertainty in physician diagnoses. To simulate various magnitudes of noise, we vary the probability of introducing noise from 5% to 80%. In Fig. 4, we illustrate the effect of label noise on the test AUC of oracle questioning strategies. The remaining results can be found in Appendix H.
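Both noise models can be sketched as follows. The compressed subspace is represented here by a generic feature matrix, and the implementation details are our own:

```python
import numpy as np

def random_flip(labels, n_classes, p, rng):
    """Random noise: with probability p, replace each label with a
    different class chosen uniformly at random."""
    out = labels.copy()
    for i in range(len(labels)):
        if rng.random() < p:
            out[i] = rng.choice([c for c in range(n_classes)
                                 if c != labels[i]])
    return out

def nearest_neighbour_flip(labels, feats, p, rng):
    """Nearest Neighbour noise: with probability p, replace each label with
    that of the nearest instance (in feature space) from a different class."""
    out = labels.copy()
    for i in range(len(labels)):
        if rng.random() < p:
            d = np.linalg.norm(feats - feats[i], axis=1)
            d[labels == labels[i]] = np.inf   # exclude same-class instances
            out[i] = labels[int(d.argmin())]
    return out
```

The nearest-neighbour variant confuses an instance only with the class it most resembles in feature space, which is why it is a gentler, more clinically plausible corruption than uniform flipping.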

Figure 4: Average AUC of the oracle questioning strategies in the absence and presence of various magnitudes of label noise. With up to 80% random or nearest neighbour label noise, SoQal still outperforms its counterpart methods that are trained without label noise.

In Fig. 4, we show that SoQal outperforms the remaining strategies across all noise types and levels (except with 40% random noise). For instance, with 5% random noise, SoQal achieves a higher AUC than both Epsilon Greedy and Entropy Response. Secondly, SoQal is better able to deal with label noise than its counterparts: SoQal with 80% random noise achieves a higher AUC than Epsilon Greedy and Entropy Response achieve with no noise at all. This effect, which is even more pronounced when dealing with nearest neighbour noise, indicates the utility of SoQal in the presence of a noisy oracle. Finally, we observe that the introduction of label noise occasionally improves performance. This can be seen by the increase in SoQal's AUC from 0.64 (no noise) to 0.66 (5% random noise). We hypothesize that this is due to inherent label noise in the public datasets; by introducing further noise, we may be nudging these labels towards their ground-truth values.

5.3 Degree of Dependence of SoQal on Oracle

It could be argued that the superiority of SoQal is simply due to a naively high dependence on the oracle. In this section, we quantify SoQal's dependence on an oracle using the oracle ask-rate: the proportion of all instance acquisitions whose labels are requested from an oracle. In Fig. 5(a), we illustrate this oracle ask-rate for different label noise scenarios.

(a) Oracle ask-rate for different label noise scenarios
(b) Correlation between oracle ask-rate and generalization performance
Figure 5: (a) SoQal's oracle ask-rate and (b) correlation between oracle ask-rate and average test AUC. Results are averaged across five seeds and all datasets, and are shown for each acquisition function and label noise scenario.

In Fig. 5(a), we show that the oracle ask-rate varies based on the acquisition function used. For instance, at 20% random noise, one acquisition function requests labels 65% of the time whereas the remaining acquisition functions do so approximately 77% of the time. We hypothesize that this variability in the oracle ask-rate is due to the variability in the difficulty of the instances acquired by the acquisition functions. In other words, decreased dependence could be indicative of the acquisition of instances that are relatively farther away from the hyperplane. Such instances are easier to classify and thus require less oracle guidance.

In the presence of label noise, decreased dependence is indeed associated with improved generalization performance. This claim is supported by the negative correlation between the oracle ask-rate and the test AUC observed in Fig. 5(b). In other words, networks are requesting labels less often and performing better. Such findings reaffirm the conclusion that SoQal knows when to request a label from an oracle.

5.4 Controlling Oracle Dependence via Hellinger Threshold, S

When $\mathcal{H} \le S$, all label requests are sent to the oracle. Therefore, the value of S should control the oracle ask-rate and thus performance. We illustrate the performance of SoQal for a range of values of S in Table 2. We confirm the expected positive relationship between S and the oracle ask-rate: as S increases from 0.100 to 0.400, the oracle ask-rate increases from 86% to 100%. Moreover, for this particular dataset and acquisition function, S = 0.200 is the optimal value, as it achieves an AUC of 0.768 with an oracle ask-rate below 100%. This finding reaffirms our previous hypothesis that the original labels in the dataset may be noisy. Therefore, not requesting these particular labels from the oracle is advantageous.

Threshold, S | 0.100 | 0.125 | 0.150 | 0.175 | 0.200 | 0.300 | 0.400
Average Oracle Ask-Rate (%) | 86 | 85 | 89 | 90 | 94 | 100 | 100
AUC | 0.716 | 0.744 | 0.721 | 0.753 | 0.768 | 0.743 | 0.755
Table 2: Mean test AUC of SoQal and oracle ask-rate in response to various threshold values, S. Results are shown for BALDMCD across five seeds. Experiments are performed with a noise-free oracle.

6 Discussion and Future Work

In this work, we proposed a dynamic oracle questioning strategy, SoQal, in the context of active learning and healthcare. We showed that while striking a balance between independence from and over-reliance on an oracle, SoQal outperforms strong baseline methods. Furthermore, in the presence of noisy oracles which represent ill-trained or fatigued physicians, SoQal decreases its dependence by 35% and continues to outperform its counterparts. Indeed, we showed that this decreased dependence was appropriate and was correlated with improved generalization performance. We now mention several exciting avenues worth exploring.

Incorporating Prior Information. As it stands, and in the absence of a priori information, the default mode for SoQal is deferral to an oracle. If, however, relevant a priori information is available (such as the extent of the noise inherent in the oracle’s labels), then either the default mode or the Hellinger threshold, , can be altered accordingly. The latter can also be changed during training if a dynamic label noise detector is present.

Incorporating Multiple Oracles. In this work, we explored a dynamic oracle questioning strategy in the presence of a single oracle. Realistic clinical scenarios may include multiple experts of various levels of competency. Therefore, an interesting line of research could focus on a strategy that dynamically determines the ideal expert for the instance at hand.

7 Broader Impact

The exploration of less burdensome active learning algorithms in the context of healthcare can alleviate the exigent burden placed on medical practitioners. This is particularly acute in environments where burnout of physicians and nurses is increasingly observed due to a lack of electronic health record usability (Melnick et al., 2020) and increased expectations (Ferket, 2020). The decreased dependence of SoQal on an expert, at appropriate times, could prevent the disruption of clinical workflows and thus put patient care at centre stage. On the other hand, inappropriate independence of an algorithm from an expert can lead to incorrect learning signals that are reinforced throughout the algorithm’s lifetime. Consequently, a high degree of misdiagnoses can occur, thus negatively impacting clinical decision making and patient outcomes. SoQal attempts to balance this independence against diagnostic accuracy.

As for low-resource clinical settings where physicians are either ill-trained or unavailable and labelled data is scarce, SoQal offers a scalable approach for operating in such environments. It labels data iteratively without over-dependence on potentially unreliable ‘experts’. This, in turn, generates large, labelled datasets that can be successfully leveraged by data-hungry deep learning algorithms.


  • Z. I. Attia, S. Kapa, F. Lopez-Jimenez, P. M. McKie, D. J. Ladewig, G. Satam, P. A. Pellikka, M. Enriquez-Sarano, P. A. Noseworthy, T. M. Munger, et al. (2019) Screening for cardiac contractile dysfunction using an artificial intelligence–enabled electrocardiogram. Nature Medicine 25 (1), pp. 70–74. Cited by: §1.
  • R. Bousseljot, D. Kreiseler, and A. Schnabel (1995) Nutzung der EKG-Signaldatenbank CARDIODAT der PTB über das Internet. Biomedizinische Technik/Biomedical Engineering 40 (s1), pp. 317–318. Cited by: §4.1.
  • C. Chow (1970) On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory 16 (1), pp. 41–46. Cited by: §1, §2.
  • G. D. Clifford, C. Liu, B. Moody, H. L. Li-wei, I. Silva, Q. Li, A. Johnson, and R. G. Mark (2017) AF classification from a short single lead ECG recording: the physionet/computing in cardiology challenge 2017. In 2017 Computing in Cardiology, pp. 1–4. Cited by: §4.1.
  • G. D. Clifford, I. Silva, B. Moody, Q. Li, D. Kella, A. Shahin, T. Kooistra, D. Perry, and R. G. Mark (2015) The physionet/computing in cardiology challenge 2015: reducing false arrhythmia alarms in the icu. In 2015 Computing in Cardiology Conference, pp. 273–276. Cited by: §4.1.
  • C. Cortes, G. DeSalvo, and M. Mohri (2016) Learning with rejection. In International Conference on Algorithmic Learning Theory, pp. 67–82. Cited by: §2.
  • O. Dekel, C. Gentile, and K. Sridharan (2012) Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research 13 (Sep), pp. 2655–2697. Cited by: §2.
  • R. El-Yaniv and Y. Wiener (2010) On the foundations of noise-free selective classification. Journal of Machine Learning Research 11 (May), pp. 1605–1641. Cited by: §1, §2.
  • K. Ferket (2020) Burnout in nurses across practice domains: implications and correlations to physician burnout. The Resilient Healthcare Organization: How to Reduce Physician and Healthcare Worker Burnout. Cited by: §7.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059. Cited by: §3.1.
  • Y. Gal, R. Islam, and Z. Ghahramani (2017) Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1183–1192. Cited by: §2.
  • Y. Geifman and R. El-Yaniv (2017) Selective classification for deep neural networks. In Advances in neural information processing systems, pp. 4878–4887. Cited by: §2, §4.2.
  • Y. Geifman and R. El-Yaniv (2019) Selectivenet: a deep neural network with an integrated reject option. arXiv preprint arXiv:1901.09192. Cited by: §2, §4.2.
  • W. Gong, S. Tschiatschek, R. Turner, S. Nowozin, and J. M. Hernández-Lobato (2019) Icebreaker: element-wise active information acquisition with bayesian deep latent gaussian model. arXiv preprint arXiv:1908.04537. Cited by: §2.
  • A. Y. Hannun, P. Rajpurkar, M. Haghpanahi, G. H. Tison, C. Bourn, M. P. Turakhia, and A. Y. Ng (2019) Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine 25 (1), pp. 65. Cited by: §4.1.
  • N. Houlsby, F. Huszár, Z. Ghahramani, and M. Lengyel (2011) Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745. Cited by: §2, §3.1.
  • D. Kiyasseh, T. Zhu, and D. A. Clifton (2020) ALPS: Active Learning via Perturbations. arXiv preprint arXiv:2004.09557. Cited by: §1, §3.1, §4.2, §4.3.
  • K. Konyushkova, R. Sznitman, and P. Fua (2017) Learning active learning from data. In Advances in Neural Information Processing Systems, pp. 4225–4235. Cited by: §5.1.
  • Z. Liu, Z. Wang, P. P. Liang, R. R. Salakhutdinov, L. Morency, and M. Ueda (2019) Deep gamblers: learning to abstain with portfolio theory. In Advances in Neural Information Processing Systems, pp. 10622–10632. Cited by: §2.
  • E. R. Melnick, L. N. Dyrbye, C. A. Sinsky, M. Trockel, C. P. West, L. Nedelec, M. A. Tutty, and T. Shanafelt (2020) The association between perceived electronic health record usability and professional burnout among us physicians. In Mayo Clinic Proceedings, Vol. 95, pp. 476–487. Cited by: §7.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §4.1.
  • R. Poplin, A. V. Varadarajan, K. Blumer, Y. Liu, M. V. McConnell, G. S. Corrado, L. Peng, and D. R. Webster (2018) Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nature Biomedical Engineering 2 (3), pp. 158. Cited by: §1.
  • B. Settles (2009) Active learning literature survey. Technical report University of Wisconsin-Madison, Department of Computer Sciences. Cited by: §1, §2.
  • T. D. Shanafelt, L. N. Dyrbye, and C. P. West (2017) Addressing physician burnout: the way forward. Jama 317 (9), pp. 901–902. Cited by: §1.
  • S. Sinha, S. Ebrahimi, and T. Darrell (2019) Variational adversarial active learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5972–5981. Cited by: §2.
  • A. Smailagic, P. Costa, A. Gaudio, K. Khandelwal, M. Mirshekari, J. Fagert, D. Walawalkar, S. Xu, A. Galdran, P. Zhang, et al. (2019) O-medal: online active deep learning for medical image analysis. arXiv preprint arXiv:1908.10508. Cited by: §2.
  • A. Smailagic, P. Costa, H. Y. Noh, D. Walawalkar, K. Khandelwal, A. Galdran, M. Mirshekari, J. Fagert, S. Xu, P. Zhang, et al. (2018) MedAL: accurate and robust deep active learning for medical image analysis. In IEEE International Conference on Machine Learning and Applications, pp. 481–488. Cited by: §2.
  • N. Tomašev, X. Glorot, J. W. Rae, M. Zielinski, H. Askham, A. Saraiva, A. Mottram, C. Meyer, S. Ravuri, I. Protsyuk, et al. (2019) A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 572 (7767), pp. 116–119. Cited by: §1.
  • R. Urner, S. B. David, and O. Shamir (2012) Learning from weak teachers. In Artificial intelligence and statistics, pp. 1252–1260. Cited by: §2.
  • G. Wang, C. Zhang, Y. Liu, H. Yang, D. Fu, H. Wang, and P. Zhang (2019) A global and updatable ECG beat classification system based on recurrent neural networks and active learning. Information Sciences 501, pp. 523–542. Cited by: §2.
  • C. J. C. H. Watkins (1989) Learning from delayed rewards. Ph.D. thesis, King's College, Cambridge. Cited by: §4.2.
  • C. P. West, L. N. Dyrbye, P. J. Erwin, and T. D. Shanafelt (2016) Interventions to prevent and reduce physician burnout: a systematic review and meta-analysis. The Lancet 388 (10057), pp. 2272–2281. Cited by: §1.
  • Y. Wiener and R. El-Yaniv (2011) Agnostic selective classification. In Advances in Neural Information Processing Systems, pp. 1665–1673. Cited by: §2.
  • S. Yan, K. Chaudhuri, and T. Javidi (2016) Active learning from imperfect labelers. In Advances in Neural Information Processing Systems, pp. 2128–2136. Cited by: §2.
  • C. Zhang and K. Chaudhuri (2015) Active learning from weak and strong labelers. In Advances in Neural Information Processing Systems, pp. 703–711. Cited by: §2.