Exploratory Machine Learning with Unknown Unknowns

02/05/2020 ∙ by Yu-Jie Zhang, et al. ∙ Nanjing University

In conventional supervised learning, a training dataset is given with ground-truth labels from a known label set, and the learned model will classify unseen instances to the known labels. In this paper, we study a new problem setting in which there are unknown classes in the training dataset misperceived as other labels, and thus their existence appears unknown from the given supervision. We attribute the unknown unknowns to the fact that the training dataset is badly advised by the incompletely perceived label space due to the insufficient feature information. To this end, we propose the exploratory machine learning, which examines and investigates the training dataset by actively augmenting the feature space to discover potentially unknown labels. Our approach consists of three ingredients including rejection model, feature acquisition, and model cascade. The effectiveness is validated on both synthetic and real datasets.




1 Introduction

In this paper, we study the task in which there are unknown labels in the training dataset, namely some training instances belonging to a certain class are wrongly perceived as others, and thus appear unknown to the learned model. This is often the case when the label space is misspecified due to insufficient feature information. Consider the task of medical diagnosis, where we need to train a model for community healthcare centers based on their patient records, to help diagnose the cause of a patient's cough and dyspnea. As shown in Figure 1, there are actually three causes: two common ones (asthma and pneumonia), as well as an unusual one (lung cancer) whose diagnosis crucially relies on a computerized tomography (CT) scan device, which is too expensive for such centers to purchase. Thus, the community healthcare centers are unlikely to diagnose patients with dyspnea as having cancer. As a result, the class of “lung cancer” becomes invisible in the collected training dataset, and the learned model will be unaware of this unobserved class.

Figure 1: An illustration for the emerging unknown unknowns in the task of medical diagnosis, where the patients with lung cancer are misdiagnosed as asthma or pneumonia due to the lack of CT scan devices, and thus appear as unknown to the learned model.

Similar phenomena occur in many other applications. For instance, the trace of a new type of aircraft was mislabeled as an old type until the performance of the aircraft detector was found to be poor (i.e., the capability of the collected signals was inadequate) and the officer suspected that there were new types of aircraft previously unknown. When feature information is insufficient, there is a high risk that the learner misperceives some classes existing in the training data as others, leading to the existence of unknown classes. Notably, unknown classes are sometimes of more interest, as in the above two examples. Therefore, it is crucial for the learned model to discover unknown classes and classify known classes well simultaneously.

However, conventional supervised learning (SL), where a predictive model is trained on a given labeled dataset and then deployed to classify unseen instances into known labels, crucially relies on a high-quality training dataset. Thus, when the aforementioned unknown unknowns emerge in the training data, SL cannot obtain a satisfactory model.

2 ExML: A New Learning Framework

The problem we are concerned with is essentially a class of unknown unknowns. In fact, how to deal with unknown unknowns is the fundamental question of robust artificial intelligence (TGD:robust-AI), and many studies are devoted to addressing various aspects including distribution changes (journals/tkde/PanY10), open category learning (journals/pami/ScheirerRSB13), high-confidence false predictions (journals/jqid/AttenbergPF15), etc. Different from them, we study a new problem setting ignored previously, that is, the training dataset is badly advised by the incompletely perceived label space due to insufficient feature information.

The problem turns out to be quite challenging, as feature space and label space are entangled and both of them are unreliable. Notably, it is infeasible to merely pick out instances with low predictive confidence as unknown classes, because we can hardly distinguish two types of low-confidence instances: (i) instances from unknown classes owing to the incomplete label space; (ii) instances from known classes because of insufficient feature information. This characteristic reflects the intrinsic hardness of learning with unknown unknowns due to the feature deficiency, and therefore it is necessary to ask for external feature information.

2.1 Exploratory Machine Learning

To handle the unknown unknowns caused by feature deficiency, we bring a human into the learning loop to actively augment the feature space. The idea is that when a learned model still performs poorly even when fed with much more data, the learner will suspect the existence of unknown classes and subsequently seek several candidate features to augment. Figure 2 shows a straightforward example in which the learner receives a dataset and observes two classes with poor separability, resulting in a noticeable low-confidence region. After a proper feature augmentation, the learner realizes that there exists an additional class that was previously unknown due to insufficient feature information.

Figure 2: An example illustrates that an informative feature can substantially improve the separability of low-confidence samples and make the unknown label distinguishable.

Inspired by the above example, we introduce a new learning framework called exploratory machine learning (ExML), which explores more feature information to deal with the unknown unknowns caused by feature deficiency. The term exploratory learning originates in the area of education, where it is defined as an approach to teaching and learning that encourages learners to examine and investigate new material with the purpose of discovering relationships between existing background knowledge and unfamiliar content and concepts (1993:exploratory; spector2014handbook). In the context of machine learning, our proposed framework encourages learners to examine and investigate the training dataset via exploring new feature information, with the purpose of discovering potentially unknown classes and classifying known classes. Figure 3 compares the proposed ExML to conventional supervised learning (SL). SL views the training dataset as an observable representation of the environment and exploits it to train a model to predict the label. By contrast, ExML treats the training dataset as operational: learners can examine and investigate the dataset by exploring more feature information, and thereby discover unknown unknowns caused by feature deficiency.

Figure 3: Comparison of two learning frameworks. Supervised learning exploits observable dataset for prediction. Exploratory machine learning explores more features based on operational dataset for both predicting and discovering unknown classes.

We further develop an approach to implement ExML, consisting of three ingredients: rejection model, feature acquisition, and model cascade. The rejection model identifies suspicious instances that potentially belong to unknown classes. Feature acquisition guides which feature should be explored, and then retrains the model on the augmented feature space. Model cascade refines the selection of suspicious instances by cascading rejection models and feature acquisition modules. We present empirical evaluations on synthetic data to illustrate the idea, and provide experimental studies on real datasets to validate the effectiveness of our approach further.

2.2 Problem Formulation

We formalize the problem setting, both training and testing.

Training Dataset. The learner receives a training dataset $\hat{D}_{tr} = \{(\hat{x}_i, \hat{y}_i)\}_{i=1}^{m}$, where $\hat{x}_i \in \hat{\mathcal{X}} \subseteq \mathbb{R}^{d}$ is from the observed feature space, and $\hat{y}_i \in \hat{\mathcal{Y}}$ is from the incomplete label space with $N$ known classes. We consider the binary case (i.e., $N = 2$) for simplicity. Note that there exist some training samples that are actually from unknown classes yet wrongly labeled as known classes due to the feature deficiency.

Candidate Features and Cost Budget. Besides the training dataset, the learner can access a set of candidate features $\mathcal{C} = \{c_1, \ldots, c_K\}$, whose values are unknown before acquisition. For the example of medical diagnosis (Figure 1), a feature refers to the signals returned from the CT scan device, which are only available after patients have taken the examination.

A certain cost will be incurred to acquire any candidate feature for any sample. The learner aims to identify the top $k$ informative features from the pool under the given budget $B$. For simplicity, the cost of each acquisition is set as $1$ uniformly and the learner desires to find the best feature, i.e., $k = 1$.

Testing Stage. Suppose the learner identifies the best feature as $c_i$; she will then augment the testing sample with this feature. Thus, the sample finally lies in the augmented feature space $\mathcal{X}_i = \hat{\mathcal{X}} \times \mathcal{X}^i \subseteq \mathbb{R}^{d+1}$, where $\mathcal{X}^i$ is the feature space of $c_i$. The learned model is required to predict the label of the augmented testing sample, either classifying it into one of the known classes or identifying it as belonging to the unknown classes (abbrev. uc).

3 A Practical Approach

Due to the feature deficiency, the learner might even be unaware of the existence of unknown classes based on the observed training data. To address this difficulty, our intuition is that the learner will suspect the existence of unknown classes when the learned model performs badly. Therefore, we assume that instances with high predictive confidence are safe, i.e., they will be correctly predicted as one of the known classes.

We justify the necessity of the assumption. Actually, there are some previous works studying the problem of high-confidence false predictions (journals/jqid/AttenbergPF15; conf/aaai/LakkarajuKCH17), under the condition that the feature information is sufficient and learner can query an oracle for ground-truth labels. However, when features are insufficient as in our scenario, the problem would not be tractable unless there is an oracle able to provide exact labels based on the insufficient feature representation, which turns out to be a rather strong assumption that does not hold in reality generally.

Thereby, following the methodology of ExML (examining the training dataset via exploring new feature information), we design a novel approach, which consists of three components: rejection model, feature acquisition, and model cascade. Figure 4 illustrates main procedures, and we will describe details of each component subsequently.

3.1 Rejection Model

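As shown in Figure 4(a), at the beginning, the learner needs to train an initial model on the original dataset that is capable of identifying low-confidence instances as suspicious. We realize this goal by the learning with rejection technique (ALT16:reject-theory), where the learned model abstains from predicting instances whose maximum conditional probability is lower than a given threshold $1 - \theta$, where $0 < \theta < 0.5$.

Specifically, we learn a function pair $f = (h, g)$, where $h: \hat{\mathcal{X}} \to \mathbb{R}$ is the predictive function for known classes and $g: \hat{\mathcal{X}} \to \mathbb{R}$ is the gate function to reject the unknown class. We predict the sample $\hat{x}$ as the unknown class if $g(\hat{x}) < 0$, and otherwise as the class $\mathrm{sign}(h(\hat{x}))$. The pair $f = (h, g)$ can be learned by expected risk minimization,

$$\min_{f} \; \mathbb{E}_{(\hat{x}, \hat{y}) \sim \hat{\mathcal{D}}} \big[ \ell_{0/1}(f, \hat{x}, \hat{y}; \theta) \big], \tag{1}$$

where $\ell_{0/1}(f, \hat{x}, \hat{y}; \theta) = \mathbb{1}[\hat{y} h(\hat{x}) < 0] \cdot \mathbb{1}[g(\hat{x}) > 0] + \theta \cdot \mathbb{1}[g(\hat{x}) \le 0]$ is the 0-1 loss of the rejection model parameterized by the threshold $\theta$, and $\hat{\mathcal{D}}$ is the data distribution over $\hat{\mathcal{X}} \times \hat{\mathcal{Y}}$.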
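The decision rule and the 0-1 loss of a rejection model can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions: the gate rejects when its value is non-positive, the sign of the predictive function gives the known-class label in {-1, +1}, and abstaining costs `theta` while an accepted mistake costs 1. The names `UNKNOWN`, `predict_with_rejection`, and `rejection_zero_one_loss` are hypothetical and do not come from the paper.

```python
UNKNOWN = 0  # hypothetical marker for "rejected / possibly unknown class"


def predict_with_rejection(h_value: float, g_value: float) -> int:
    """Return UNKNOWN when the gate rejects (g <= 0), else the sign of h."""
    if g_value <= 0:
        return UNKNOWN
    return 1 if h_value >= 0 else -1


def rejection_zero_one_loss(h_value: float, g_value: float, y: int, theta: float) -> float:
    """Cost theta for abstaining, 1 for an accepted mistake, 0 otherwise."""
    if g_value <= 0:
        return theta
    return 1.0 if y * h_value < 0 else 0.0


if __name__ == "__main__":
    # A confident correct prediction, a confident mistake, and a rejection.
    for h, g, y in [(0.7, 0.5, 1), (-0.7, 0.5, 1), (0.7, -0.2, 1)]:
        print(predict_with_rejection(h, g), rejection_zero_one_loss(h, g, y, theta=0.3))
```

Because the rejection cost is below 0.5, abstaining is cheaper than guessing on instances where the model would be close to chance, which is what makes the rejected region a natural set of "suspicious" instances.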

Figure 4: (a) Overall procedure; (b) best feature acquisition. The left figure shows the overall procedure of ExML, which begins with an initial model (blue part), followed by exploring the best candidate feature (green part). Afterwards, a model is retrained based on the augmented dataset, and finally cascaded with the initial model to discover the unknown class (red part). The right figure describes the procedure of best feature acquisition.

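Due to the non-convexity arising from the indicator function, the calibrated surrogate loss function $\ell_{surr}$ is introduced by ALT16:reject-theory and, for a pair $f = (h, g)$ of predictive function $h$ and gate function $g$ with rejection cost $\theta$, defined as

$$\ell_{surr}(f, \hat{x}, \hat{y}; \theta) = \max\Big\{ 1 + \tfrac{1}{2}\big(g(\hat{x}) - \hat{y} h(\hat{x})\big),\; \theta\Big(1 - \frac{g(\hat{x})}{1 - 2\theta}\Big),\; 0 \Big\}$$

to approximate the original loss. Hence, we turn to the empirical risk minimization (ERM) over the surrogate loss,

$$\min_{f \in \mathbb{H} \times \mathbb{H}} \; \frac{1}{m} \sum_{i=1}^{m} \ell_{surr}(f, \hat{x}_i, \hat{y}_i; \theta) + C_h \|h\|_{\mathbb{H}}^2 + C_g \|g\|_{\mathbb{H}}^2, \tag{2}$$

where $C_h$ and $C_g$ are regularization parameters, and $\mathbb{H}$ is the reproducing kernel Hilbert space (RKHS) induced by a kernel function $\kappa: \hat{\mathcal{X}} \times \hat{\mathcal{X}} \to \mathbb{R}$. According to the representer theorem (books/lib/ScholkopfS02), the optimal solution of (2) is in the form of $h(\hat{x}) = \sum_{i=1}^{m} u_i \kappa(\hat{x}, \hat{x}_i)$ and $g(\hat{x}) = \sum_{i=1}^{m} w_i \kappa(\hat{x}, \hat{x}_i)$, where $u_i$ and $w_i$ are coefficients to learn. Thus, (2) can be converted into a quadratic programming problem and solved efficiently. Readers can refer to the seminal work of ALT16:reject-theory for more details.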
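To make the role of the surrogate concrete, the sketch below evaluates a max-hinge style surrogate next to the 0-1 rejection loss and checks numerically that the surrogate never falls below it. The exact surrogate form used here, max{1 + (g - y h)/2, theta (1 - g/(1 - 2 theta)), 0}, is an assumption based on the max-hinge loss of the learning-with-rejection literature; the function names are hypothetical.

```python
import random


def zero_one_rejection_loss(h: float, g: float, y: int, theta: float) -> float:
    """Cost theta when the gate rejects (g <= 0), 1 for an accepted mistake."""
    if g <= 0:
        return theta
    return 1.0 if y * h < 0 else 0.0


def max_hinge_surrogate(h: float, g: float, y: int, theta: float) -> float:
    """Convex surrogate (assumed form): max of two affine terms and zero."""
    accept_term = 1.0 + 0.5 * (g - y * h)
    reject_term = theta * (1.0 - g / (1.0 - 2.0 * theta))
    return max(accept_term, reject_term, 0.0)


def surrogate_upper_bounds_loss(trials: int = 10_000, theta: float = 0.3, seed: int = 0) -> bool:
    """Randomly probe (h, g, y) and check surrogate >= 0-1 rejection loss."""
    rng = random.Random(seed)
    for _ in range(trials):
        h, g = rng.uniform(-3, 3), rng.uniform(-3, 3)
        y = rng.choice([-1, 1])
        if max_hinge_surrogate(h, g, y, theta) < zero_one_rejection_loss(h, g, y, theta):
            return False
    return True


if __name__ == "__main__":
    print(surrogate_upper_bounds_loss())
```

Since the surrogate is a maximum of affine functions of (h, g), it is convex in the pair, which is why the regularized ERM problem can be handed to a standard quadratic programming solver.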

3.2 Feature Acquisition

If the initial model is unqualified (for instance, it rejects too many samples to achieve the desired accuracy), the learner will suspect the existence of unknown classes and explore new features to augment. In our setting, the learner needs to select the best feature from $K$ candidates and retrain a model based on the augmented data, as demonstrated in Figure 4(b). Consequently, the central questions concern the following two aspects:

  • how to measure the quality of candidate features?

  • how to allocate the budget to identify the best feature and then use the augmented data to retrain the model?

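Feature quality measure. Denote by $\mathcal{D}_i$ the data distribution over $\mathcal{X}_i \times \hat{\mathcal{Y}}$, where $\mathcal{X}_i$ is the augmented feature space of the $i$-th candidate feature, $i \in [K]$. The crucial idea is to use the Bayes risk of the rejection model on $\mathcal{D}_i$ as the measure

$$R_i^* = R_i(f_i^*) = \min_{f} \; \mathbb{E}_{(x, \hat{y}) \sim \mathcal{D}_i} \big[ \ell_{0/1}(f, x, \hat{y}; \theta) \big], \tag{3}$$

where $R_i(f)$ is the expected 0-1 risk of function $f$ over $\mathcal{D}_i$, and $f_i^*$ minimizes $R_i(f)$ over all measurable functions. The better the feature is, the lower the Bayes risk should be.

The Bayes risk essentially reflects the minimal error that any rejection model can attain on the given features, whose value should be smaller if the feature is more informative.

Due to the inaccessibility of the underlying distribution $\mathcal{D}_i$, we approximate the Bayes risk by its empirical version over the augmented data $D_i = \{(x_j, \hat{y}_j)\}_{j=1}^{n_i}$, defined as

$$\hat{R}_{D_i} = \hat{R}_i(\hat{f}_i) = \frac{1}{n_i} \sum_{j=1}^{n_i} \ell_{0/1}(\hat{f}_i, x_j, \hat{y}_j; \theta), \tag{4}$$

where $x_j \in \mathcal{X}_i$, and $\hat{f}_i$ is the rejection model learned by ERM over the surrogate loss (2) on the augmented dataset $D_i$.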

Based on the feature quality measure (3) and its empirical version (4), we now introduce the budget allocation strategy to identify the best candidate feature.

Budget allocation strategy. Without loss of generality, suppose all the candidate features are sorted according to their quality, namely $R_1^* \le R_2^* \le \cdots \le R_K^*$. Our goal is to identify the best feature within the limited budget, and meanwhile the model retrained on augmented data should have good generalization ability.

To this end, we propose the uniform allocation strategy as follows, under the guidance of the proposed criterion (3).

Uniform Allocation. For each candidate feature $c_i$, $i \in [K]$, the learner allocates $B/K$ budget and obtains an augmented dataset $D_i$. We can thus compute the empirical feature measure $\hat{R}_{D_i}$ by (4), and select the feature with the smallest risk.

The above strategy is simple yet effective. Actually, the approach can identify the best feature with high probability, as will be rigorously demonstrated in Theorem 4.
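The uniform allocation strategy can be sketched as follows. The callback `empirical_risk(i, n)` is a hypothetical stand-in for "acquire feature i on n samples, retrain the rejection model, and return its empirical risk"; the even split of the budget across candidates is an assumption of this sketch.

```python
from typing import Callable, Tuple


def uniform_allocation(
    num_features: int,
    budget: int,
    empirical_risk: Callable[[int, int], float],
) -> Tuple[int, int]:
    """Split the budget evenly, score every candidate, return (best index, samples used)."""
    per_feature = budget // num_features
    if per_feature < 1:
        raise ValueError("budget is too small to query every candidate feature")
    risks = [empirical_risk(i, per_feature) for i in range(num_features)]
    best = min(range(num_features), key=risks.__getitem__)
    return best, per_feature


if __name__ == "__main__":
    # Toy risk: a fixed "true" risk plus an estimation error shrinking with n.
    true_risk = [0.4, 0.1, 0.3]
    print(uniform_allocation(3, 300, lambda i, n: true_risk[i] + 1.0 / n))
```

The design is deliberately simple: every candidate receives the same number of samples, so the empirical risks are directly comparable, at the price of spending as much on clearly poor features as on promising ones.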

Median Elimination. Besides, we propose another variant inspired by the bandit theory (2019:bandit-book) to further improve the budget allocation efficiency. Specifically, we adopt the technique of median elimination (JMLR'06:Elimination), which removes the poorer half of the candidate features after every iteration and thus avoids spending too much budget on poor features. Concretely, we maintain an active candidate feature pool initialized as $\mathcal{C}$ at the beginning and proceed to eliminate half of it in each of $T = \lceil \log_2 K \rceil$ episodes. In each episode, $B/T$ budget is allocated uniformly among all active candidate features, and the augmented datasets are updated to accommodate the selected instances. By calculating the empirical feature measure on the augmented dataset, the poor half of the active candidates can be eliminated. The whole algorithm is summarized in Algorithm 1. As shown in Figure 4(b), since poor features are eliminated earlier, the budget left for the selected feature is improved from $B/K$ to $B/\log K$ in this way, which ensures better generalization ability of the learned model. Meanwhile, as shown in the bandits literature (JMLR'06:Elimination), median elimination can also explore the best candidate feature more efficiently than uniform allocation. We currently focus on identifying the best feature, and our framework can be further extended to the scenario of identifying the top $k$ features ($k > 1$) by introducing more sophisticated techniques in the bandit theory (ICML'12:BAI-top-k; AISTAS'17:BAI-top-k-instance).

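Algorithm 1 Median Elimination for Feature Acquisition
Input: Feature acquisition budget $B$, original dataset $\hat{D}_{tr}$, candidate feature pool $\mathcal{C} = \{c_1, \ldots, c_K\}$, threshold $\theta$.
Output: Selected feature $c_{i_s}$ and corresponding model $\hat{f}_{i_s}$.
Initialization: dataset $D_i = \emptyset$ for each feature $c_i \in \mathcal{C}$, the set of active features $\mathcal{C}_1 = \mathcal{C}$, $T = \lceil \log_2 K \rceil$.
for $t = 1, \ldots, T$ do
   Randomly select $n_t = \lfloor B / (T |\mathcal{C}_t|) \rfloor$ samples from $\hat{D}_{tr}$ and query their values of the active candidate features $c_i \in \mathcal{C}_t$;
   Update $D_i$ to accommodate the selected samples and train a model $\hat{f}_i$ on $D_i$ by ERM (2), for all active candidate features $c_i \in \mathcal{C}_t$;
   Compute $\hat{R}_{D_i}$ according to (4), for all $c_i \in \mathcal{C}_t$;
   Update $\mathcal{C}_{t+1}$ as the half of the features in $\mathcal{C}_t$ with lower $\hat{R}_{D_i}$;
end for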
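The elimination loop can be sketched in Python as below. It is an illustrative sketch rather than the authors' implementation: the number of episodes is assumed to be the ceiling of log2 of the number of candidates, each episode is assumed to receive an equal share of the budget, and `empirical_risk(i, n)` is a hypothetical callback that returns the empirical risk of the model retrained with feature `i` on `n` acquired samples.

```python
import math
from typing import Callable, Tuple


def median_elimination(
    num_features: int,
    budget: int,
    empirical_risk: Callable[[int, int], float],
) -> Tuple[int, int]:
    """Halve the active pool each episode; return (selected index, samples it received)."""
    active = list(range(num_features))
    episodes = max(1, math.ceil(math.log2(num_features)))
    samples = {i: 0 for i in active}
    for _ in range(episodes):
        # Spend this episode's share of the budget evenly on the active features.
        per_feature = (budget // episodes) // len(active)
        for i in active:
            samples[i] += per_feature
        # Keep the better half (lower empirical risk) of the active features.
        ranked = sorted(active, key=lambda i: empirical_risk(i, samples[i]))
        active = ranked[: max(1, math.ceil(len(ranked) / 2))]
    return active[0], samples[active[0]]


if __name__ == "__main__":
    true_risk = [0.4, 0.1, 0.3, 0.2]
    print(median_elimination(4, 400, lambda i, n: true_risk[i] + 1.0 / n))
```

In the toy run with four candidates and a budget of 400, the surviving feature ends up with 150 acquired samples, compared with 100 under an even split, which illustrates why the retrained model for the selected feature sees more data.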

3.3 Model Cascade

After feature acquisition, the learner will retrain a model on the augmented data. Considering that the augmented model might not always be better than the initial model, particularly when the budget is not enough or the candidate features are not very informative, we propose the model cascade mechanism to cascade the augmented model with the initial one. Concretely, high-confidence predictions are accepted by the initial model, and the remaining suspicious instances are passed to the next layer for feature acquisition. Augmented samples with high confidence are then accepted by the augmented model, and the remaining suspicious ones continue to the next layer for further refinement.

Essentially, our approach can be regarded as layer-by-layer processing for identifying unknown classes, and the process stops once a human confirms that the remaining suspicious instances are indeed from certain unknown classes. For simplicity, we only implement a two-layer architecture.
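A two-layer cascade of this kind can be sketched as follows. The sketch assumes each model is a callable returning a pair (h, g) of predictive and gate scores, that a positive gate score means "accept", and that `acquire_feature` is a hypothetical function returning the value of the selected candidate feature for a sample; none of these names come from the paper.

```python
from typing import Callable, Optional, Sequence, Tuple, Union

UNKNOWN = "uc"  # label reported for instances rejected by both layers
Scores = Tuple[float, float]  # (predictive score h, gate score g)


def accept_or_none(scores: Scores) -> Optional[int]:
    """Return +1/-1 if the gate accepts (g > 0), otherwise None."""
    h, g = scores
    if g <= 0:
        return None
    return 1 if h >= 0 else -1


def cascade_predict(
    x_observed: Sequence[float],
    initial_model: Callable[[Sequence[float]], Scores],
    augmented_model: Callable[[Sequence[float]], Scores],
    acquire_feature: Callable[[Sequence[float]], float],
) -> Union[int, str]:
    """Layer 1 on observed features; layer 2 only for samples layer 1 rejects."""
    decision = accept_or_none(initial_model(x_observed))
    if decision is not None:
        return decision
    # Only suspicious samples pay the cost of acquiring the extra feature.
    x_augmented = tuple(x_observed) + (acquire_feature(x_observed),)
    decision = accept_or_none(augmented_model(x_augmented))
    return decision if decision is not None else UNKNOWN
```

One consequence of this structure is that the extra feature is only queried for samples the initial model is unsure about, so confident predictions keep their original behaviour even when the augmented model is weak.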

4 Theoretical Analysis

This section presents theoretical results. We first investigate the attainable excess risk of supervised learning, supposing that the best feature were known in advance. Then, we provide the result of ExML to demonstrate the effectiveness of our proposed criterion and budget allocation strategies.

For each candidate feature , we denote the corresponding hypothesis space as , where and are induced feature mapping and RKHS of kernel in the augmented feature space.

Supervised learning with known best feature. Suppose the best feature is known in advance; we can then obtain samples augmented with this particular feature. Let be the model learned by supervised learning via minimizing (2). From the learning theory literature (MLSS'03:Bousquet; book'2018:foundation), for any , with probability at least , the excess risk is bounded by


where is the approximation error measuring how well hypothesis spaces , approach the target, in terms of the expected surrogate risk . The constant factor is .

Thus, when the best feature is known in advance and given the budget , the excess risk of supervised learning converges to the inevitable approximation error in the rate of .

Exploratory learning with unknown best feature. In reality, the best feature is unfortunately unknown ahead of time and candidate features are available. We show that by means of ExML, the excess risk also converges, in the rate of , yet without requiring to know the best feature. Let be the identified feature and be the augmented model returned by ExML with uniform allocation on kernel based hypotheses. Then, with probability at least , we have


where and . The constant and .

Comparing the excess risk bounds (5) and (6), we can observe that ExML exhibits a convergence rate similar to that of SL with known best feature, yet without requiring knowledge of the best feature. An extra times factor is paid for best feature identification. We note that under certain mild technical assumptions, the dependence can be further improved to by median elimination (JMLR'06:Elimination), as poor candidate features are removed in earlier episodes.

Proof of Theorem 4.

The excess risk of the learned model can be decomposed into three parts,

where is the generalization error of the learned model and is the difference between empirical criterion of the selected feature and that of the best feature, where refers to the model trained on the best feature with a budget. Besides, captures the excess risk of relative to the Bayes risk. Notice that since the empirical criterion of the selected feature is the lowest among all candidates. Thus, to prove the theorem, it is sufficient to bound and .

We bound based on the following lemma on the generalization error of the rejection model, which can be regarded as a two-side counterpart of Theorem 1 of ALT16:reject-theory. Let and be the kernel-based hypotheses . Then for any , with probability of over the draw of a sample of size from , the following holds for all :


where and is the kernel function associated with . Therefore, the generalization error of the can be directly bounded by,


with probability at least .

Then we proceed to analyze the excess risk of trained on the best feature with surrogate loss , where can be decomposed into two parts as

Based on Lemma 4, is bounded by,


with probability at least .

Before presenting the analysis on , we introduce the results on consistency and the generalization error of the surrogate loss function. First, we show that the excess risk with respect to the 0/1 loss for any function is bounded by the excess risk with respect to the surrogate loss .

[Theorem 3 of ALT16:reject-theory] Let denote the expected risk in terms of the surrogate loss of a pair . Then, the excess error of is upper bounded by its surrogate excess error as follows,

where and .

Besides, the generalization error of the surrogate loss is as follows, Let and be the kernel-based hypotheses, defined as . Then for any , with probability of over the draw of a sample of size from , the following holds for all :


where and is the kernel function associated with . Lemma 4 can be obtained by the standard analysis of generalization error based on the Rademacher complexity (book'2018:foundation).

Based on the consistency of the surrogate loss function shown in Lemma 4, we have


Denote , the first term of (12) can be further bounded by the generalization error bound of the surrogate loss function and the optimality of for the ERM problem (2) in hypotheses (for simplicity, we analyze the version of the ERM problem with a constraint on the RKHS norm rather than the Lagrangian form in (2), since their regularization paths are identical) as,

Thus, the following inequality


holds with probability at least .

Taking , and , we complete the proof by combining (8), (9), and (12), obtaining that

holds with probability at least . ∎

5 Related Work

In this section, we discuss some related topics.

Open Category Learning is also named as learning with new classes, which focuses on handling unknown classes appearing only in the testing phase (journals/pami/ScheirerRSB13; journals/pami/ScheirerJB14; conf/aaai/DaYZ14; journals/ml/Mendes-JuniorSW17; conf/icml/LiuGDFH18; arXiv'19:LAC). Although these studies also care about the unknown classes detection, they differ from us significantly and thus cannot apply to our situation: first, they do not consider the feature deficiency; second, there exist unknown classes in training data in our setting, while theirs only appear in the testing.

High-Confidence False Predictions appear due to the model’s unawareness of such kinds of mistakes (journals/jqid/AttenbergPF15; conf/aaai/LakkarajuKCH17; conf/aaai/BansalW18). To handle such unknown unknowns, existing studies typically ask an external human expert to help identify high-confidence false predictions and then retrain the model with this guidance. Although these works also consider unknown unknowns and resort to external human knowledge, their setting and methodology differ from ours: our unknown unknowns are caused by feature deficiency, so the learner needs to augment features rather than query labels.

Active Learning aims to achieve greater accuracy with fewer labels by asking a human expert to label queried unlabeled data (book'12:active-learning). Besides, there are some works querying features (conf/icdm/MelvillePM05; journals/ml/DhurandharS15; conf/kdd/HuangXXSNC18), which try to improve learning with missing features via as few queries of entry values (a feature of an instance) as possible. Unlike their setting, we augment new features to help identify the unknown classes rather than querying missing values of the given features to improve the classification of known classes.

Learning with Rejection gives the classifier an option to reject an instance instead of providing a low-confidence prediction (journals/tit/Chow70). Plenty of works are proposed to design effective algorithms (journals/jmlr/YuanW10; NIPS16:boosting-reject; conf/nips/WangQ18; conf/nips/ShimHY18) and establish theoretical foundations (Stats06:reject-theory; journals/jmlr/BartlettW08; ALT16:reject-theory; arXiv'19:bousquet-rejection). As aforementioned, methods of rejection cannot be directly applied in exploratory machine learning since it will result in inaccurate rejections of instances from known classes, and meanwhile, it cannot exploit new features.

6 Experiments

We conduct experiments on both synthetic and real datasets to evaluate exploratory learning in the following aspects:

  • advantage of exploratory learning over conventional supervised learning in handling unknown unknowns;

  • effectiveness of the proposed ExML approach, and the usefulness of feature acquisition and model cascade.

We choose the kernel support vector machine as the base model with a Gaussian kernel, and the bandwidth is set as . Besides, parameters , are set as . We select the best rejection threshold of the augmented model from the pool for each algorithm, and the rejection threshold of the initial model is selected by cross validation to ensure 95% accuracy on high-confidence predictions. The feature acquisition budget is set as $B = b \cdot mK$, where $m$ is the number of training samples, $K$ is the number of candidate features, and $b$ is the budget ratio.

6.1 Synthetic Data for Illustration

We first illustrate the advantage of exploratory machine learning by comparing ExML with SL on the synthetic data.

Figure 5: Visualization of synthetic data: (a) shows the ground-truth distribution; (b) presents training data (only the first two dims are observable); (c) plots t-SNE of candidate features with various qualities (the larger the angle is, the better the feature is).

Figure 6: (a), (b) show the performance comparison of SL and ExML; (c) reports the budget allocation of ExML with median elimination.

Setting. Following the illustrative example in Figure 1, we generate data with 3-dim features and 3 classes, where each class has samples. Figure 5(a) presents the ground-truth distribution. In detail, instances from each class are generated from a 3-dim Gaussian distribution. The means and variances are and for class 1, and for class 2, as well as and for class 3, where is a $3 \times 3$ identity matrix. We fix and set .

However, as shown in Figure 5(b), the third-dim feature is unobservable in the training data, resulting in an unknown class (uc) located in the intersection area of the known classes (kc1 and kc2). Samples from uc are randomly mislabeled as kc1 or kc2. Besides, to generate candidate features of various qualities, we project the third-dim feature onto different directions located in the XZ-plane, whose angles to the horizon vary from to , the larger the better. We plot four of them in Figure 5(c) via t-SNE. The budget ratio is set as .
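The construction of the synthetic data can be sketched as below. All numeric values (class means, standard deviation, sample size, angles) are hypothetical placeholders, since the original values are not reproduced here, and the candidate feature at a given angle is assumed to be the projection of the (x, z) coordinates onto a direction in the XZ-plane at that angle to the horizontal axis.

```python
import math
import random


def make_synthetic(n_per_class=100, sigma=0.5, angles_deg=(10, 40, 70, 90), seed=0):
    """Three 3-dim Gaussian classes; the third dim is hidden from training data.

    Means, sigma, and angles are hypothetical placeholders, not the paper's values.
    """
    rng = random.Random(seed)
    means = {"kc1": (-1.0, 0.0, 0.0), "kc2": (1.0, 0.0, 0.0), "uc": (0.0, 0.0, 2.0)}
    observed, labels, truth, hidden = [], [], [], []
    for name, mu in means.items():
        for _ in range(n_per_class):
            x, y, z = (rng.gauss(m, sigma) for m in mu)
            observed.append((x, y))   # only the first two dims are observable
            hidden.append((x, z))     # kept aside to build candidate features
            truth.append(name)
            if name == "uc":
                labels.append(rng.choice([-1, 1]))  # uc is mislabeled at random
            else:
                labels.append(-1 if name == "kc1" else 1)
    # Candidate feature at angle a: projection onto a direction in the XZ-plane.
    candidates = {
        a: [math.cos(math.radians(a)) * x + math.sin(math.radians(a)) * z for x, z in hidden]
        for a in angles_deg
    }
    return observed, labels, truth, candidates, hidden
```

Under this assumed projection, a candidate at 90 degrees recovers the hidden third dimension exactly, while a candidate near 0 degrees mostly duplicates the first observed dimension, which matches the "larger angle, better feature" ordering.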

Results. We first conduct SL to train a rejection model based on the 2-dim training data, and then perform ExML to actively augment the feature within the budget to discover unknown unknowns. Figures 6(a) and 6(b) plot the results, demonstrating a substantial advantage of ExML over SL in identifying the unknown class and predicting known classes.

Furthermore, Figure 6(c) reports the budget allocation of each candidate feature over times repetition. We can see that the allocation evidently concentrates on the more informative features (with larger angles), validating the effectiveness of the strategy for best feature identification.

6.2 Benchmark Data for Evaluation

We further evaluate on the UCI dataset Mfeat (journals/kybernetika/BreukelenDTH98), a multi-view dataset containing 2000 instances uniformly distributed over 10 handwritten digits. Six views of features are extracted by various methods for representing the raw data and are of different qualities. We list the detailed description of these features in Table 1 and sort them in descending order of their quality: Fac > Pix > Kar > Zer > Fou > Mor.

Feature  Description                     Budget  SL             ExML_aug^ME    ExML_csd^UA    ExML_csd^ME    Recall
Fac      Profile correlations            10%     93.39 ± 1.66   71.80 ± 9.55   92.39 ± 2.79   92.40 ± 2.78   48%
                                         20%     93.39 ± 1.66   82.26 ± 7.52   91.95 ± 3.32   92.00 ± 3.27   46%
                                         30%     93.39 ± 1.66   89.29 ± 4.72   92.20 ± 3.33   92.50 ± 2.86   44%
Pix      Pixel averages in 2 × 3 windows 10%     92.19 ± 2.47   70.53 ± 8.27   90.54 ± 6.27   90.55 ± 6.31   58%
                                         20%     92.19 ± 2.47   81.70 ± 7.16   90.84 ± 6.17   90.87 ± 6.09   54%
                                         30%     92.19 ± 2.47   88.67 ± 4.14   90.45 ± 5.74   91.82 ± 4.26   68%
Kar      Karhunen-Loève coefficients     10%     86.87 ± 3.43   70.25 ± 10.2   85.55 ± 4.94   85.90 ± 4.85   56%
                                         20%     86.87 ± 3.43   81.46 ± 6.88   85.21 ± 5.46   86.49 ± 4.81   54%
                                         30%     86.87 ± 3.43   86.01 ± 5.41   86.52 ± 4.71   88.18 ± 3.57   56%
Zer      Zernike moments                 10%     73.82 ± 8.82   69.61 ± 10.7   72.96 ± 10.4   76.17 ± 8.52   82%
                                         20%     73.82 ± 8.82   80.86 ± 8.02   77.31 ± 7.89   81.72 ± 7.33   82%
                                         30%     73.82 ± 8.82   86.07 ± 5.51   81.11 ± 6.79   86.33 ± 5.04   86%
Fou      Fourier coefficients            10%     68.73 ± 9.07   69.42 ± 9.68   68.88 ± 11.8   75.92 ± 8.81   82%
                                         20%     68.73 ± 9.07   82.11 ± 6.48   77.93 ± 8.27   85.03 ± 4.39   88%
                                         30%     68.73 ± 9.07   89.90 ± 3.69   82.45 ± 5.20   89.35 ± 3.89   92%
Mor      Morphological features          10%     57.47 ± 15.3   69.09 ± 11.3   66.58 ± 13.5   71.07 ± 11.1   80%
                                         20%     57.47 ± 15.3   79.60 ± 10.1   73.61 ± 8.86   79.74 ± 9.92   84%
                                         30%     57.47 ± 15.2   87.44 ± 7.34   78.31 ± 9.00   86.98 ± 7.07   90%
Table 1: Evaluation on the Mfeat dataset (mean ± standard deviation of accuracy, in %). Features are sorted in descending feature quality. Bold font indicates algorithms that significantly outperform the others (paired t-test at 95% significance level).

Setting. To understand ExML better, we compare algorithms in various environments, where each one of the six views of features is taken as the original feature and the rest are prepared as candidate features. We vary the budget ratio from 5% to 30%. Since Mfeat is a multi-class dataset, we randomly sample 5 configurations to convert it into the binary situation, where each of the known classes and the unknown class contains three original classes. The training dataset contains 600 instances and 1200 instances are used for testing. For each configuration, training instances are randomly sampled 10 times. The results on accuracy are reported over the 50 repetitions.
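One way to draw such a configuration is sketched below. It assumes that nine of the ten digit classes are chosen at random and split into three groups of three (two known classes and the unknown class); the function name and the dictionary keys are hypothetical, and the paper's exact sampling procedure may differ.

```python
import random


def sample_configuration(rng: random.Random) -> dict:
    """Pick 9 of the 10 digits and split them into kc1, kc2, and uc (3 digits each)."""
    digits = rng.sample(range(10), 9)
    return {
        "kc1": sorted(digits[0:3]),
        "kc2": sorted(digits[3:6]),
        "uc": sorted(digits[6:9]),
    }


if __name__ == "__main__":
    # Five configurations, one per seed, mirroring the five sampled settings.
    for seed in range(5):
        print(sample_configuration(random.Random(seed)))
```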

Contenders and Measure. Apart from the conventional supervised learning (SL), we include two variants of ExML, $\mathrm{ExML}_{\mathrm{aug}}^{\mathrm{ME}}$ and $\mathrm{ExML}_{\mathrm{csd}}^{\mathrm{UA}}$, to examine the usefulness of model cascade, where aug/csd denotes the model trained on the augmented feature space without or with the cascade structure, respectively, and UA/ME refers to the strategy of feature acquisition, uniform allocation or median elimination. Note that our approach ExML is essentially $\mathrm{ExML}_{\mathrm{csd}}^{\mathrm{ME}}$.

Besides, we introduce the recall of best feature identification to measure the effectiveness of the feature acquisition, defined as the ratio of the number of cases in which the identified feature is among the top 2 candidate features to the total number of cases. For each original feature, the top candidates are measured by the attainable accuracy of the augmented model trained on the whole training dataset with this particular feature when .
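The recall measure amounts to a simple counting rule, sketched below. The inputs are hypothetical: `identified` holds the feature selected in each run, and `rankings` holds, for each run, the candidate features ordered from best to worst by the reference accuracy.

```python
from typing import Sequence


def top2_recall(identified: Sequence[str], rankings: Sequence[Sequence[str]]) -> float:
    """Fraction of runs whose identified feature is among the top 2 reference candidates."""
    if len(identified) != len(rankings) or not identified:
        raise ValueError("need one ranking per run and at least one run")
    hits = sum(1 for feat, ranking in zip(identified, rankings) if feat in ranking[:2])
    return hits / len(identified)


if __name__ == "__main__":
    ranking = ["Fac", "Pix", "Kar", "Zer", "Fou"]
    print(top2_recall(["Fac", "Kar", "Pix", "Fou"], [ranking] * 4))
```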

Results. Table 1 reports the mean and standard deviation of the predictive accuracy, and all features are sorted in descending order by their quality. We first compare SL to (variants of) ExML. When the original features are of high quality (Kar, Pix, Fac), SL achieves favorable performance and there is no need to explore new features. However, when uninformative original features are provided, which is the case of more interest for ExML, SL degrades severely and ExML$^{\text{ME}}_{\text{aug}}$ (the single ExML model without model cascade) achieves better performance even with a limited budget. Besides, from the last column of the table, we can see that informative candidates (top 2) are selected to strengthen the poor original features, which validates the efficacy of the proposed budget allocation strategy.

Since ExML$^{\text{ME}}_{\text{aug}}$ is not guaranteed to outperform SL, particularly with a limited budget on poor candidate features, we propose the cascade structure. Indeed, our ExML approach (i.e., ExML$^{\text{ME}}_{\text{csd}}$) achieves roughly best-of-both-worlds performance, in the sense that it is basically no worse, and sometimes better, than the best of SL and ExML$^{\text{ME}}_{\text{aug}}$. It turns out that even ExML$^{\text{UA}}_{\text{csd}}$ can behave better than ExML$^{\text{ME}}_{\text{aug}}$ when the candidate features are uninformative. These results validate the effectiveness of the model cascade component.

6.3 Real Data of Activity Recognition

We additionally conduct experiments on the RealDisp dataset for the activity recognition task (Banos et al., 2012), where 9 on-body sensors are used to capture various actions of participants. Each sensor is placed on a different part of the body and provides 13-dimensional features: 3 dimensions from acceleration, 3 from the gyroscope, 3 from magnetic field orientation, and another 4 from quaternions. Hence, we have 9 × 13 = 117 features in total.

Setting. In our experiments, three types of actions (walking, running, and jogging) are included to form a dataset containing 2000 instances, of which 30% are used for training and 70% for testing. In the training data, one sensor is deployed and the class of jogging is randomly misperceived as walking or running. The learner would explore the remaining eight candidate features to discover the unknown unknowns. Thus, there are 9 partitions, and each is repeated 10 times by sampling the training instances randomly.
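The way the jogging class is hidden in the training labels can be sketched as below. The function name and the toy label list are hypothetical; only the relabeling rule (jogging becomes walking or running at random) comes from the text.

```python
import random

def hide_class(labels, hidden="jogging", visible=("walking", "running"), seed=0):
    """Replace every occurrence of the hidden class with a random visible class."""
    rng = random.Random(seed)
    return [rng.choice(visible) if y == hidden else y for y in labels]

# Toy ground-truth labels for illustration only.
true_labels = ["walking", "jogging", "running", "jogging", "walking"]
observed = hide_class(true_labels)
```

After this step the training labels contain only the two known classes, so a model trained on them has no direct evidence that a third class exists.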

Figure 7: Performance comparisons of all the contenders.
Figure 8: Illustration of budget allocation with median elimination.

Results. Figure 7 shows the mean and standard deviation of accuracy. Our approach ExML (i.e., ExML$^{\text{ME}}_{\text{csd}}$) outperforms the others, validating the efficacy of our proposal.

In addition, Figure 8 illustrates the budget allocation at a fixed budget ratio. The $i$-th row denotes the scenario in which the $i$-th sensor is the original feature, and colored patches indicate the fraction of budget allocated to each candidate feature. The number above a patch is the attainable accuracy of the model trained on the whole training dataset with that particular feature. We highlight the top two candidate features of each row in white, and use blue to indicate that a selected feature is not in the top two. The results show that ExML with median elimination selects one of the top two informative features for every original sensor except the 9-th, and even there the quality of the selected feature (91.8) does not deviate much from the best one (93.6). These results reflect the effectiveness of our feature acquisition strategy.
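To make the notion of per-candidate budget fractions concrete, here is a simplified median-elimination loop. It is a generic sketch and not the paper's exact procedure: the equal budget per round, the `pull` score oracle, the `sensor_*` names, and the quality values are all assumptions made for illustration.

```python
import random

def median_elimination(candidates, pull, budget, keep=2):
    """Repeatedly drop the worse half of the candidates.

    pull(c) returns one noisy quality score for candidate c (hypothetical oracle).
    Returns the surviving candidates and the budget fraction spent on each one.
    """
    active = list(candidates)
    pulls = {c: 0 for c in active}
    totals = {c: 0.0 for c in active}

    # Count the halving rounds needed to shrink the pool to `keep` candidates.
    rounds, n = 0, len(active)
    while n > keep:
        n = max(keep, (n + 1) // 2)
        rounds += 1

    while len(active) > keep:
        # Assumption of this sketch: every round gets an equal share of the budget.
        per_candidate = (budget // rounds) // len(active)
        if per_candidate < 1:
            raise ValueError("budget too small for this many candidates")
        for c in active:
            for _ in range(per_candidate):
                totals[c] += pull(c)
                pulls[c] += 1
        # Keep the better half, i.e. drop candidates scoring below the median.
        active.sort(key=lambda c: totals[c] / pulls[c], reverse=True)
        active = active[:max(keep, (len(active) + 1) // 2)]

    spent = sum(pulls.values()) or 1
    fractions = {c: pulls[c] / spent for c in pulls}
    return active, fractions

# Hypothetical candidate qualities, for illustration only.
rng = random.Random(0)
quality = {f"sensor_{i}": 0.60 + 0.05 * i for i in range(8)}

def noisy_score(c):
    return quality[c] + rng.gauss(0.0, 0.02)

selected, fractions = median_elimination(list(quality), noisy_score, budget=240)
```

Candidates that survive more rounds receive a larger share of the budget, which is the kind of uneven allocation a plot of budget fractions would show. Uniform allocation would instead give every candidate the same fraction.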

7 Conclusion

This paper studies the task of learning with unknown unknowns, where some instances in the training dataset belong to a certain unknown class but are wrongly perceived as known classes due to insufficient feature information. To address this issue, we propose exploratory machine learning (ExML), which encourages the learner to examine and investigate the training dataset by exploring more features to discover potentially unknown classes. Following this idea, we design an approach consisting of three procedures: rejection model, feature acquisition, and model cascade. By leveraging techniques from bandit theory, we prove the rationale and efficacy of the feature acquisition procedure. Experiments validate the effectiveness of our approach.

There remain many directions worth exploring for exploratory machine learning. The first is to further refine the description of human ability in the learning loop. We currently focus on the low-confidence unknown unknowns by requesting that humans provide candidate features, and it is interesting to further investigate the high-confidence unknown unknowns by asking humans for more information. The second is to consider a personalized cost for each candidate feature, since we usually need to pay a higher price to obtain more informative features in real-world applications. The learner should balance the trade-off between accuracy and cost budget. In addition, as mentioned earlier, we will consider top-$k$ feature acquisition if the learner can afford more devices at testing time, where feature correlations should be taken into account.