An Unbiased Risk Estimator for Learning with Augmented Classes

10/21/2019 ∙ by Yu-Jie Zhang, et al.

In this paper, we study the problem of learning with augmented classes (LAC), where new classes that do not appear in the training dataset might emerge in the testing phase. The mixture of known classes and new classes in the testing distribution makes the LAC problem quite challenging. Our discovery is that by exploiting cheap and vast unlabeled data, the testing distribution can be estimated in the training stage, which paves the way for developing algorithms with nice statistical properties. Specifically, we propose an unbiased risk estimator over the testing distribution for the LAC problem, and further develop an efficient algorithm to perform the empirical risk minimization. Both asymptotic and non-asymptotic analyses are provided as theoretical guarantees. The efficacy of the proposed algorithm is also confirmed by experiments.




1 Introduction

Recent advances in machine learning encourage its application in high-stakes scenarios, where robustness is the central requirement (Dietterich, 2017, 2019). The robustness of a learning system relies on its adaptability to open and dynamic environments, in which the distribution, the label space and the feature space of a task could change. Under such circumstances, classical supervised learning approaches might fail since they do not take the non-stationarity into consideration. Consequently, it is of great importance to design more robust and reliable algorithms to adapt to these unreliable and changing factors in environments.

This paper investigates class-incremental learning (Zhou and Chen, 2002), where new classes may emerge in the learning process. Specifically, we are concerned with the problem of learning with augmented classes (LAC) (Da et al., 2014), one of the core tasks of class-incremental learning. In the LAC problem, classes that do not appear in the training dataset might emerge in the testing stage. These new classes, if simply neglected, will seriously deteriorate learning performance due to the misclassification of instances therein. Thus, a desirable learning system should be able to identify these unknown classes and retain good generalization ability over the testing distribution.

The fundamental obstacle of the LAC problem is how to model relationships between known and new classes. In the literature, researchers propose various assumptions to model such relationships, including low-density separation (Da et al., 2014) and the open space property (Scheirer et al., 2013). A variety of algorithms have been proposed with satisfactory empirical performance. Nevertheless, their theoretical properties are generally unclear. To the best of our knowledge, there is no method that provides theoretical investigations on classifiers’ generalization ability over the testing distribution, where both known and new classes exist.

In this paper, we discover that by exploiting the cheap and vast unlabeled data, an unbiased risk estimator over the testing distribution for the LAC problem can be developed. This paves a way to design algorithms with nice statistical properties. Our intuition is that, though labeled instances from new classes are absent from labeled data, their distribution information is contained in unlabeled data, though mixed with the distribution of known classes. Therefore, we can access the distribution of new classes by separating the distribution of labeled data from that of unlabeled data, based on which an unbiased risk estimator can be established. Figure 1 illustrates our idea.

Figure 1: Approximate the distribution of new classes by distributions of labeled data and unlabeled data in the training dataset.

More precisely, we propose the class shift condition to model the connection between training and testing distributions for the LAC problem. Under such a condition, the distribution of new classes can be directly reduced to the difference between the distribution of unlabeled data and that of labeled data (from known classes). Based on the reduction, we can evaluate the risk of classifiers over testing distribution in the training stage, where minimizing its empirical estimator finally gives our Eulac algorithm, short for Exploiting Unlabeled data for Learning with Augmented Classes.

Since the empirical risk estimator is evaluated directly over the testing distribution, the Eulac algorithm enjoys several favorable properties. Theoretically, our algorithm enjoys both asymptotic (consistency) and non-asymptotic (generalization error bound) guarantees. Notably, the non-asymptotic analysis further justifies the capability of our algorithm in utilizing unlabeled data, since the generalization error becomes smaller with an increasing number of unlabeled data. Moreover, extensive experiments further validate the effectiveness of our approach. It is worth noting that our approach can perform the standard cross validation procedure to select parameters. Most geometric-based algorithms cannot, and thus rely heavily on experience to tune parameters, since their cross validation is biased due to the unavailability of the testing distribution.

The main contributions of this paper are as follows.

  • We propose the class shift condition for the LAC problem, which models the connection between training and testing distributions.

  • Based on the class shift condition, we establish an unbiased risk estimator over the testing distribution, by exploiting unlabeled data.

  • We propose the Eulac algorithm to empirically minimize the unbiased estimator. We further prove its theoretical effectiveness by both asymptotic and non-asymptotic analysis, and validate its empirical superiority by extensive experiments.

This paper is organized as follows. First, we discuss the relationship between our method and related work on the LAC problem, together with other relevant topics regarding reliable machine learning algorithms (Section 2). Then, we formally describe the LAC problem and introduce notations used in the rest of this paper (Section 3). Next, we propose the class shift condition and show how to establish an unbiased risk estimator with labeled and unlabeled training data, which finally gives our algorithm (Section 4). Afterwards, we provide the theoretical properties of our algorithm, including both asymptotic and non-asymptotic analyses (Section 5). Finally, we validate the effectiveness of our method by extensive experiments (Section 6).

2 Related Work

Class-incremental learning (C-IL) (Zhou and Chen, 2002) aims to equip the learning system with the capability of handling new classes that appear in the learning process, which is a fundamental task for robust and reliable learning in open and dynamic environments. Learning with augmented classes (LAC), the focus of this paper, is one of the core tasks of C-IL, where classes that do not appear in training data might emerge in the testing stage. The pioneering work of Da et al. (2014) proposes to exploit unlabeled data for the LAC problem, and the authors design a novel algorithm by tuning the decision boundary to pass through low-density regions. While we share the same problem setup, our approach differs from theirs in various aspects. We propose the class shift condition to model connections between known and new classes, whereas theirs uses a geometric assumption instead. Moreover, our approach enjoys stronger theoretical guarantees and superior empirical performance.

Another related topic is open set recognition (Scheirer et al., 2013), which is an alternative terminology mainly used in the pattern recognition community. Scheirer et al. (2013) introduce the concept of “open space risk” to penalize predictions outside the support of training data, based on which many approaches are proposed (Scheirer et al., 2013, 2014). Besides, there are also works based on the nearest neighbor method (Mendes-Junior et al., 2017) or the extreme value theory (Rudd et al., 2018). Although these algorithms achieve satisfactory empirical behavior, they generally lack theoretical guarantees.

Two exceptions are the works of Scott and Blanchard (2009) and Liu et al. (2018). The authors focus on the Neyman-Pearson classification problem, where false positive predictions on known classes are minimized with the constraint of a desired novelty detection ratio, or vice versa. Liu et al. (2018) design a meta-algorithm that takes an existing novelty detection algorithm as a subroutine to recognize new classes. Although the meta-algorithm enjoys PAC-style guarantees, statistical properties of the subroutines are unclear. Moreover, these works mainly contribute to the analysis of novelty detection, while the predictive error over the testing distribution, or the generalization ability, is not investigated.

Apart from the batch setting, researchers also consider the scenario of emerging new classes in streaming data (Fink et al., 2006; Muhlbaier et al., 2009), where a few labeled instances from new classes are available and the learner is required to update the model incrementally to adapt to emerging classes. Subsequently, Mu et al. (2017) and Cai et al. (2019) study the problem of streaming classification with emerging new classes, which is more challenging since the learner is required to detect new classes, update the model, and predict with unlabeled streaming data.

Learning with rejection is also concerned with the reliability of predictions, where the classifier is provided with an option to reject an instance instead of producing a low confidence prediction (Chow, 1970). Plenty of works have been proposed with effective implementations (Yuan and Wegkamp, 2010; Cortes et al., 2016b; Wang and Qiao, 2018; Geifman and El-Yaniv, 2019) or theoretical foundations (Herbei and Wegkamp, 2006; Bartlett and Wegkamp, 2008; Cortes et al., 2016a). However, algorithms for learning with rejection are not applicable to the LAC problem, since new classes do not necessarily lie in the low confidence region.

3 Problem Statement

In this section, we provide formal descriptions of the learning with augmented classes (LAC) problem and introduce notations used throughout the paper.

LAC Problem.

In the training stage, the learner receives a labeled dataset $D_L = \{(x_i, y_i)\}_{i=1}^{n_l}$, sampled from the training distribution $P_{tr}$ over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ denotes the feature space and $\mathcal{Y} = \{1, \dots, K\}$ is the label space of known classes.

In the testing stage, the learner is required to predict instances from the testing distribution $P_{te}$, where new classes might emerge. Since the specific partition of new classes is unobserved, the learner would predict all of them as a single new class $nc$. So the testing distribution is defined over $\mathcal{X} \times \mathcal{Y}'$, where $\mathcal{Y}' = \{1, \dots, K, nc\}$ is the augmented label space.

The goal of the learner is to train a classifier $f: \mathcal{X} \to \mathcal{Y}'$ minimizing the following expected risk with respect to 0-1 loss in order to retain a good generalization ability over the testing distribution,

$$R(f) = \mathbb{E}_{(x, y) \sim P_{te}} [\mathbb{1}(f(x) \neq y)],$$

where $\mathbb{1}(\cdot)$ denotes the indicator function.

In our setup, the learner can additionally receive an unlabeled dataset $D_U = \{x_i\}_{i=1}^{n_u}$ sampled from the testing distribution apart from labeled data.

We use the uppercase $P$ to denote distributions, where training, testing and new classes distributions are denoted by $P_{tr}$, $P_{te}$ and $P_{new}$, respectively. Besides, the lowercase $p$ is the density function, where the joint, conditional and marginal density functions are indicated by the subscripts $XY$, $X|Y$ ($Y|X$) and $X$ ($Y$), respectively. For instance, we denote by $p^{te}_X(x)$ the marginal density function of the testing distribution over feature space.

Although the original training distribution $P_{tr}$ is defined over label space $\mathcal{Y}$, the learning is implemented on the augmented label space $\mathcal{Y}'$. Thus, we redefine all the distributions over the space $\mathcal{X} \times \mathcal{Y}'$, where $p^{tr}_{XY}(x, y) = 0$ holds for all $x \in \mathcal{X}$ and $y = nc$. Meanwhile, $p^{new}_{XY}(x, y) = 0$ holds for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$.

4 Our Proposal

In this section, we propose an unbiased risk estimator over the testing distribution, where the class shift condition is introduced first to bridge training and testing distributions. Based on this condition, we then study an ideal situation where the testing distribution is accessible. Next, we show how to approximate the ideal situation with empirical training data, which finally gives our Eulac algorithm.

4.1 Problem Refinement

The LAC problem studies the situation where new classes appear in the testing stage. Although not explicitly stated, previous works (Scheirer et al., 2013; Da et al., 2014; Mendes-Junior et al., 2017) actually rely on the essential assumption that the distribution of known classes remains unchanged with the augmentation of new classes. Following the same spirit, we introduce the following class shift condition to rigorously depict the connection between training and testing distributions of the LAC problem.

Definition 1 (Class Shift Condition).

We claim the training distribution $P_{tr}$, the testing distribution $P_{te}$, and the distribution of new classes $P_{new}$ are under the class shift condition when

$$P_{te} = \theta \cdot P_{tr} + (1 - \theta) \cdot P_{new},$$

where $\theta \in [0, 1]$ is a certain mixture proportion.

The class shift condition essentially states that the testing distribution can be regarded as a mixture of those of known and new classes with a certain proportion $\theta$, where $P_{tr}$ is actually the distribution of known classes.

4.2 A Practical Unbiased Risk Estimator

We now turn to develop the unbiased risk estimator for the LAC problem, where the major obstacle is how to approximate the testing distribution with training data. Instead of imposing assumptions on new classes, we find that the information contained in vast and cheap unlabeled data collected from the environments is actually a good advisor. Before describing the way to estimate the testing distribution via unlabeled data, we first consider an ideal situation where the testing distribution is available.

An Ideal Case.

When the testing distribution is available, the LAC problem degenerates to a standard multi-class classification problem, which can be addressed by many established algorithms. Among those approaches, we adopt the one-versus-rest (OVR) strategy, which enjoys sound theoretical foundations (Zhang, 2004) and nice practical performance (Rifkin and Klautau, 2004). The risk minimization problem for the OVR strategy is formulated as,

$$\min_{f_1, \dots, f_K, f_{nc}} R_\psi(f_1, \dots, f_K, f_{nc}) = \mathbb{E}_{(x, y) \sim P_{te}} \Big[ \psi(f_y(x)) + \sum_{k \in \mathcal{Y}', k \neq y} \psi(-f_k(x)) \Big], \qquad (3)$$

where $f_k$ is the classifier trained for the $k$-th class, $k = 1, \dots, K$; and $f_{nc}$ denotes the classifier for the new class. For simplicity, we substitute $R_\psi(f_1, \dots, f_K, f_{nc})$ with $R_\psi$ in the formulation.

$\psi$ is a certain binary surrogate loss function such as hinge loss, exponential loss, etc. After obtaining classifiers by minimizing the empirical version of (3), the learner can predict an instance $x$ to the index of the classifier with maximum output, namely, $\hat{y} = \arg\max_{k \in \mathcal{Y}'} f_k(x)$.

Approximating Testing Distribution.

Unfortunately, the testing distribution is unavailable in the training stage, due to the existence of new classes. Thus, the risk minimization problem (3) is far from practical. We now proceed to reduce the OVR risk to a more operational one established over distributions of labeled and unlabeled data.

First, we note that, according to the definition of class shift condition, the joint probability density of the testing distribution can be decomposed into two parts,

$$p^{te}_{XY}(x, y) = \theta \cdot p^{tr}_{XY}(x, y) + (1 - \theta) \cdot p^{new}_{XY}(x, y) = \theta \cdot p^{tr}_{XY}(x, y) + \mathbb{1}(y = nc) \cdot (1 - \theta) \cdot p^{new}_X(x),$$

where the last equality follows from the fact that $p^{new}_{XY}(x, y) = 0$ holds for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. The first part is the joint probability density function of the training distribution, which can be accessed by the labeled data. The only unknown term is the second part, namely, the marginal density function of the new class. So, to access the testing distribution, it is sufficient to estimate the distribution of new classes.

The estimation can be achieved by exploiting the labeled and unlabeled data. The basic observation is that, under the class shift condition, by summing over the label space we have

$$(1 - \theta) \cdot p^{new}_X(x) = p^{te}_X(x) - \theta \cdot p^{tr}_X(x),$$

which shows that the marginal density function of the new class can be calculated by the difference between those of testing and training distributions. Although the calculation still requires the knowledge of testing distribution, what we really demand is only its marginal distribution over the feature space, which can be estimated effectively by unlabeled data.
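The subtraction step can be checked numerically. The following is a minimal sketch, assuming a hypothetical feature space discretized into four bins with made-up probabilities; the function name `estimate_new_marginal` and all numbers are illustrative and do not come from the paper. It estimates the training and testing marginals from samples and recovers the new-class marginal as the scaled difference, (testing marginal minus $\theta$ times training marginal) divided by $1 - \theta$.

```python
import numpy as np

def estimate_new_marginal(x_labeled, x_unlabeled, theta, n_bins):
    """Estimate the new-class marginal on a discrete feature space using
    (1 - theta) * p_new(x) = p_te(x) - theta * p_tr(x)."""
    p_tr_hat = np.bincount(x_labeled, minlength=n_bins) / len(x_labeled)
    p_te_hat = np.bincount(x_unlabeled, minlength=n_bins) / len(x_unlabeled)
    return (p_te_hat - theta * p_tr_hat) / (1.0 - theta)

# Hypothetical ground truth, used only to generate samples for the check.
rng = np.random.default_rng(0)
theta = 0.7
p_tr_x = np.array([0.40, 0.30, 0.20, 0.10])    # marginal of known classes
p_new_x = np.array([0.05, 0.05, 0.30, 0.60])   # marginal of new classes
p_te_x = theta * p_tr_x + (1.0 - theta) * p_new_x  # class shift condition

x_labeled = rng.choice(4, size=200_000, p=p_tr_x)    # features of labeled data
x_unlabeled = rng.choice(4, size=200_000, p=p_te_x)  # features of unlabeled data

p_new_hat = estimate_new_marginal(x_labeled, x_unlabeled, theta, n_bins=4)
print(np.round(p_new_hat, 2))
```

With finite samples the estimate can have small negative entries, which is one reason the risk-level formulation that follows works with expectations rather than with an explicit density estimate.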

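Consequently, we can evaluate the OVR risk $R_\psi$ in the training stage through an equivalent risk $R_\psi^{LAC}$.

Proposition 1.

Under the class shift condition, the following equality holds for all measurable functions $f_1, \dots, f_K, f_{nc}$,

$$R_\psi = R_\psi^{LAC},$$

where $R_\psi^{LAC}$ is defined as,

$$R_\psi^{LAC} = \theta \cdot \mathbb{E}_{(x, y) \sim P_{tr}} \big[ \psi(f_y(x)) - \psi(-f_y(x)) + \psi(-f_{nc}(x)) - \psi(f_{nc}(x)) \big] + \mathbb{E}_{x \sim p^{te}_X(x)} \Big[ \psi(f_{nc}(x)) + \sum_{k=1}^{K} \psi(-f_k(x)) \Big].$$

Remark 1.

The risk $R_\psi^{LAC}$ can be evaluated while training since the first term is the expectation over $P_{tr}$, which can be effectively approximated by the labeled data, and the second term is established on the marginal distribution of testing data, which can be approximated by the unlabeled data.

One last problem regarding the risk $R_\psi^{LAC}$ is that even with convex binary surrogate loss functions, the corresponding risk minimization problem is non-convex, and the non-convexity comes from the terms $-\psi(-f_y(x))$ and $-\psi(f_{nc}(x))$. Inspired by the work of du Plessis et al. (2014), we eliminate these non-convex terms by carefully choosing surrogate loss functions, as stated in the following proposition.

Proposition 2.

Under the class shift condition, the following equality holds for all measurable functions $f_1, \dots, f_K$ and $f_{nc}$ when the surrogate loss function satisfies $\psi(z) - \psi(-z) = -z$ for all $z \in \mathbb{R}$,

$$R_\psi = R_{LAC},$$

where the LAC risk $R_{LAC}$ is defined as,

$$R_{LAC} = \theta \cdot \mathbb{E}_{(x, y) \sim P_{tr}} \big[ f_{nc}(x) - f_y(x) \big] + \mathbb{E}_{x \sim p^{te}_X(x)} \Big[ \psi(f_{nc}(x)) + \sum_{k=1}^{K} \psi(-f_k(x)) \Big].$$

Proposition 2 is a direct consequence of Proposition 1 with desired surrogate loss functions. Many loss functions satisfy the condition, such as logistic loss $\psi(z) = \log(1 + \exp(-z))$, square loss $\psi(z) = (1 - z)^2 / 4$ and double hinge loss $\psi(z) = \max(-z, \max(0, (1 - z)/2))$.

After acquiring the LAC risk $R_{LAC}$, the classifiers can be trained via minimizing the corresponding empirical risk estimator $\hat{R}_{LAC}$. Since the LAC risk $R_{LAC}$ is equivalent to the ideal OVR risk $R_\psi$, its empirical estimator is unbiased over the testing distribution. Such a property guarantees performance of our algorithm in the testing stage.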
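The loss condition used to remove the non-convex terms, namely that the loss at $z$ minus the loss at $-z$ equals $-z$, is easy to verify numerically. The sketch below assumes the standard forms of the logistic, square (scaled by 1/4) and double hinge losses; the helper name `satisfies_condition` is illustrative. The ordinary hinge loss is included only as a counterexample.

```python
import numpy as np

def logistic_loss(z):
    return np.logaddexp(0.0, -z)            # log(1 + exp(-z)), numerically stable

def square_loss(z):
    return (1.0 - z) ** 2 / 4.0

def double_hinge_loss(z):
    return np.maximum(-z, np.maximum(0.0, 0.5 - 0.5 * z))

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)         # does NOT satisfy the condition

def satisfies_condition(psi, zs):
    """Check psi(z) - psi(-z) == -z on a grid of margins zs."""
    return bool(np.allclose(psi(zs) - psi(-zs), -zs))

zs = np.linspace(-5.0, 5.0, 1001)
for name, psi in [("logistic", logistic_loss), ("square", square_loss),
                  ("double hinge", double_hinge_loss), ("hinge", hinge_loss)]:
    print(name, satisfies_condition(psi, zs))
```

For losses that pass the check, the difference of the two loss terms in the labeled expectation collapses to a linear function of the classifier output, which is what makes the resulting risk convex.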

4.3 Empirical Implementation

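Specifically, we consider minimizing the empirical LAC risk in a reproducing kernel Hilbert space (RKHS), formulated as

$$\min_{f_1, \dots, f_K, f_{nc} \in \mathcal{F}} \hat{R}_{LAC} + \lambda \Big( \sum_{k=1}^{K} \|f_k\|_{\mathcal{F}}^2 + \|f_{nc}\|_{\mathcal{F}}^2 \Big), \qquad (5)$$

where $\mathcal{F}$ is the RKHS associated to a PDS kernel $\kappa$ and $\|\cdot\|_{\mathcal{F}}$ is its norm. The empirical LAC risk $\hat{R}_{LAC}$ is the empirical version of $R_{LAC}$ defined by

$$\hat{R}_{LAC} = \frac{\theta}{n_l} \sum_{i=1}^{n_l} \big( f_{nc}(x_i) - f_{y_i}(x_i) \big) + \frac{1}{n_u} \sum_{j=1}^{n_u} \Big( \psi(f_{nc}(x_j)) + \sum_{k=1}^{K} \psi(-f_k(x_j)) \Big), \qquad (6)$$

where $\psi$ can be any surrogate loss function satisfying the condition in Proposition 2.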
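A minimal sketch of evaluating the empirical LAC risk from classifier outputs follows. It assumes the estimator has the form "$\theta$ times the labeled average of (new-class output minus true-class output), plus the unlabeled average of the loss of the new-class output and the losses of the negated known-class outputs", with the square loss $(1 - z)^2/4$. The function name, the 0-indexed labels and the array layout (last column holds the new-class classifier) are illustrative choices, not taken from the paper.

```python
import numpy as np

def square_loss(z):
    return (1.0 - z) ** 2 / 4.0

def empirical_lac_risk(F_l, y_l, F_u, theta, psi=square_loss):
    """Empirical LAC risk.

    F_l : (n_l, K + 1) classifier outputs on labeled data, last column is f_nc
    y_l : (n_l,) integer labels in {0, ..., K - 1}
    F_u : (n_u, K + 1) classifier outputs on unlabeled data, same layout
    """
    n_l = F_l.shape[0]
    f_true = F_l[np.arange(n_l), y_l]                  # f_{y_i}(x_i)
    labeled_term = theta * np.mean(F_l[:, -1] - f_true)
    unlabeled_term = np.mean(psi(F_u[:, -1]) + psi(-F_u[:, :-1]).sum(axis=1))
    return labeled_term + unlabeled_term

# Tiny hand-checkable example with K = 2 known classes.
F_l = np.array([[1.0, -1.0, -1.0], [-1.0, 1.0, -1.0]])
y_l = np.array([0, 1])
F_u = np.array([[1.0, -1.0, -1.0], [-1.0, -1.0, 1.0]])
risk = empirical_lac_risk(F_l, y_l, F_u, theta=0.5)
```

In this toy input every point is scored perfectly, so the labeled term (here -1) cancels the unlabeled term (here +1) and the estimate is 0. On real samples the estimate can be negative, since the labeled term enters with a sign.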

According to the representer theorem (Scholkopf and Smola, 2001), the optimal solution of the optimization problem (5) is provably in the form of

$$f_k(\cdot) = \sum_{x_i \in D_L \cup D_U} \alpha_i^k \kappa(x_i, \cdot), \qquad (7)$$

where $\alpha_i^k$ is the $i$-th coefficient of the $k$-th classifier.

Plugging (7) into (5), we get a convex optimization problem with respect to $\alpha$, which can be solved efficiently. After obtaining the classifiers $f_1, \dots, f_K, f_{nc}$, the learner can just predict as $\hat{y} = \arg\max_{k \in \mathcal{Y}'} f_k(x)$.
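For the square loss $(1 - z)^2/4$, the kernelized problem is quadratic in the coefficients, so a closed-form solve is possible. The sketch below is one way this could look under stated assumptions, not the authors' implementation: it assumes a squared-norm regularizer for each classifier, a Gaussian kernel, a known mixture proportion, and 0-indexed known classes with the new class stored last. All function names, the toy data and the small diagonal jitter are hypothetical. Setting the gradient to zero gives one linear system per classifier with a shared left-hand side.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Gram matrix with entries exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_lac_square_loss(X_l, y_l, X_u, theta, n_known, gamma=0.2, lam=1e-3):
    """Return kernel centers and coefficients, one column per classifier
    (known classes 0..n_known-1 first, new class last)."""
    X_all = np.vstack([X_l, X_u])
    n_l, n_u = len(X_l), len(X_u)
    n = n_l + n_u
    K = gaussian_kernel(X_all, X_all, gamma)
    K_l, K_u = K[:n_l], K[n_l:]
    # Shared left-hand side; the jitter is only for numerical stability.
    A = K_u.T @ K_u / (2.0 * n_u) + 2.0 * lam * K + 1e-8 * np.eye(n)
    u_vec = K_u.sum(axis=0) / (2.0 * n_u)
    B = np.empty((n, n_known + 1))
    for k in range(n_known):
        # Known class k: pushed up on its labeled points, down on unlabeled data.
        B[:, k] = (theta / n_l) * K_l[y_l == k].sum(axis=0) - u_vec
    # New class: pushed up on unlabeled data, down on all labeled points.
    B[:, n_known] = u_vec - (theta / n_l) * K_l.sum(axis=0)
    return X_all, np.linalg.solve(A, B)

def predict(X, X_all, alpha, gamma=0.2):
    """Index of the classifier with maximum output; n_known means 'new'."""
    return np.argmax(gaussian_kernel(X, X_all, gamma) @ alpha, axis=1)

def make_toy_data(rng, n, class_probs):
    """Three Gaussian blobs; label 2 plays the role of the new class."""
    centers = np.array([[-4.0, 0.0], [4.0, 0.0], [0.0, 6.0]])
    y = rng.choice(3, size=n, p=class_probs)
    return centers[y] + rng.normal(size=(n, 2)), y

rng = np.random.default_rng(0)
theta = 0.6
X_l, y_l = make_toy_data(rng, 100, [0.5, 0.5, 0.0])   # known classes only
X_u, _ = make_toy_data(rng, 300, [theta / 2, theta / 2, 1 - theta])
X_te, y_te = make_toy_data(rng, 300, [theta / 2, theta / 2, 1 - theta])
X_all, alpha = fit_lac_square_loss(X_l, y_l, X_u, theta, n_known=2)
accuracy = np.mean(predict(X_te, X_all, alpha) == y_te)
print("toy accuracy:", accuracy)
```

The kernel bandwidth and regularization values are arbitrary defaults for this toy problem; in practice they would be chosen by the cross validation procedure discussed in this section.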

Notice that the implementation of our algorithm requires the knowledge of the mixture proportion $\theta$. Plenty of works (Kawakubo et al., 2016; Ramaswamy et al., 2016) have explored how to estimate $\theta$ from the labeled and unlabeled data. We adopt the method of Ramaswamy et al. (2016). Algorithm 1 summarizes the main procedures.

The last issue regarding the implementation is the parameters selection. Since the risk estimator is established on the testing distribution directly, we can perform an unbiased cross validation procedure to select the parameters. On the contrary, the cross validation process could be biased for geometric-based algorithms since the distribution of new classes is unknown, and thus the setting of parameters heavily relies on experience for these approaches.

Algorithm 1 Eulac Algorithm
Input: labeled dataset $D_L$, unlabeled dataset $D_U$, kernel function $\kappa$, regularization parameter $\lambda$.
Output: classifiers’ coefficients $\alpha$.
1:  Estimate the mixture ratio $\theta$ of the observed classes in the testing distribution.
2:  Solve the convex optimization problem (5).

5 Theoretical Analysis

In this section, we establish both asymptotic and non-asymptotic analysis for our approach. Specifically, we first show the infinite-sample consistency of the LAC risk . Then, we investigate the generalization property via generalization error analysis. All proofs can be found in the appendix.

5.1 Infinite-sample Consistency

At first, we show that the LAC risk is infinite-sample consistent with the risk over the testing distribution with respect to 0-1 loss. Namely, by minimizing the expected risk on , we can get classifiers achieving Bayes rule over the testing distribution, which is the ultimate goal of the LAC problem. The formal statement is as follows.

Theorem 1.

Under the class shift condition, suppose the surrogate loss function is convex, bounded below, differentiable, satisfying and , then for any , there exists such that for all measurable functions ,


where and is the Bayes error over the testing distribution .

Theorem 1 follows from Proposition 2 and analysis in the seminal work of Zhang (2004), where the consistency property of OVR risk is investigated. Since the LAC risk is equivalent to the OVR risk , it is naturally infinite-sample consistent.

Many loss functions satisfy assumptions in Theorem 1 such as logistic loss and square loss . Further, a more quantitative result can be obtained for the square loss.

Theorem 2.

Under the class shift condition, when using as the surrogate loss function, for all measurable functions , we have,

Theorem 2 shows that the excess risk of is actually the upper bound of the excess risk of 0-1 loss. Thus, by minimizing the LAC risk , we can obtain classifiers performing well on the testing distribution with respect to 0-1 loss.

5.2 Finite-sample Convergence

We establish the generalization bound for the proposed algorithm in this part. Since the algorithm actually minimizes the empirical risk estimator with a regularization term of the RKHS , it is equivalent to investigate the generalization ability of classifiers in the kernel-based hypothesis set ,


where is a feature mapping associated with the PDS kernel , and is an element in RKHS . We have the following generalization error bound.

Theorem 3.

Assume that holds for all and the surrogate loss function is bounded by and is -Lipschitz continuous (common surrogate loss functions, including logistic loss, exponential loss and square loss, satisfy these conditions). Let be the kernel-based hypothesis set. Then, for any , with probability at least over the draw of labeled samples of size from and unlabeled samples of size from , the following holds for all ,

Based on Theorem 3, by the standard argument (Bousquet et al., 2003; Mohri et al., 2012), we can obtain the estimation error bound.

Theorem 4.

Under the assumptions of Theorem 3 and let be the optimal solution of the optimization problem (5) with certain , we have

where denotes and . The parameter is a constant related to in (5). For a better presentation, we use the -notation to keep the dependence on , and only.

Remark 2.

Theorem 3 and Theorem 4 show that the estimation error of the trained classifiers decreases with a growing number of labeled and unlabeled data. An important message delivered here is that our algorithm can achieve better performance by collecting more unlabeled data, which theoretically justifies its effectiveness in exploiting unlabeled data. Experiments also validate the same tendency.

5.3 Overview of Theoretical Results

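Recall that the ultimate goal of the LAC problem is to obtain classifiers that approach Bayes rule over the testing distribution, and thus we need to minimize the excess risk $R(f) - R^*$ with respect to 0-1 loss. According to the consistency guarantee presented in Section 5.1, it suffices to minimize the excess risk $R_{LAC} - R^*_{LAC}$, which can be further decomposed into the estimation error and the approximation error as follows,

$$R_{LAC}(\hat{f}) - R^*_{LAC} = \underbrace{R_{LAC}(\hat{f}) - \inf_{f \in \mathcal{F}} R_{LAC}(f)}_{\text{estimation error}} + \underbrace{\inf_{f \in \mathcal{F}} R_{LAC}(f) - R^*_{LAC}}_{\text{approximation error}},$$

where $\hat{f}$ denotes the trained classifiers, $\mathcal{F}$ the hypothesis set, $R^*$ the Bayes error over the testing distribution, and $R^*_{LAC}$ the minimum of the LAC risk over all measurable functions.

Theorem 4 demonstrates that the estimation error converges to zero with an increasing number of labeled and unlabeled data. Meanwhile, the approximation error measures how well the hypothesis set approximates the Bayes risk, which is not accessible in general (Mohri et al., 2012).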

To conclude, the consistency guarantee (in Section 5.1) and estimation error bound (in Section 5.2) theoretically justify the effectiveness of our algorithm.

6 Experiment

In this section, we conduct experiments to examine performance of the proposed Eulac algorithm from the following three aspects.

  • Comparisons on benchmark datasets: we compare various algorithms on benchmark datasets, to validate the efficacy of our approach.

  • Comparisons with an increasing number of unlabeled data: we examine empirical behavior with an increasing number of unlabeled data, to demonstrate effectiveness of our approach in utilizing unlabeled data.

  • Performance in various environments: we report performance of our algorithm in various environments, where the mixture ratio varies.

In all experiments, we randomly generate 10 class configurations for each dataset to simulate the augmentation of classes unless otherwise specified, where half of the total classes are chosen as new classes. For datasets whose class number is less than 4, we generate the maximum number of class configurations they can produce. In each class configuration, 500 instances are randomly selected as training data from known classes. The testing and unlabeled datasets each contain 1000 instances sampled from the whole dataset. The instance sampling procedure is also repeated 10 times.
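The protocol above can be sketched as follows. This is only an illustration under assumptions the text leaves open: the number of new classes is rounded down for an odd class count, the testing and unlabeled sets are drawn independently from the whole dataset without enforcing disjointness, and all new classes are merged into a single label written as -1. The function name and return layout are hypothetical.

```python
import numpy as np

def make_class_configuration(y, rng, n_train=500, n_unlabeled=1000, n_test=1000):
    """Pick half of the classes as 'new' and sample index sets for one run."""
    classes = np.unique(y)
    new_classes = rng.choice(classes, size=len(classes) // 2, replace=False)
    is_new = np.isin(y, new_classes)
    # Labeled training data come from known classes only.
    train_idx = rng.choice(np.flatnonzero(~is_new), size=n_train, replace=False)
    # Unlabeled and testing data are sampled from the whole dataset.
    unlabeled_idx = rng.choice(len(y), size=n_unlabeled, replace=False)
    test_idx = rng.choice(len(y), size=n_test, replace=False)
    # Merge every new class into the single augmented label -1.
    y_aug = np.where(is_new, -1, y)
    return train_idx, unlabeled_idx, test_idx, y_aug, new_classes

# Example on synthetic labels: 6 classes with 400 instances each.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(6), 400)
train_idx, unlabeled_idx, test_idx, y_aug, new_classes = make_class_configuration(y, rng)
```

Repeating the call with fresh random draws would give the multiple class configurations and sampling repetitions described in the text.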

6.1 Comparisons on Benchmark Datasets

In this part, we conduct experiments on benchmark datasets.


We perform the empirical studies over 9 benchmark datasets: usps, segment, satimage, optdigits, pendigits, SenseVeh, landset, mnist and shuttle. The brief statistics of these datasets are listed in Table 1.

Index  Dataset    # class  # dim  # instances per class
                                  min     max     avg
1      usps       10       256    708     1553    929
2      segment    7        19     330     330     330
3      satimage   6        36     626     1533    1073
4      optdigits  10       64     554     572     562
5      pendigits  10       16     1055    1144    1099
6      SenseVeh   3        100    12316   26423   20527
7      landset    6        73     626     1533    1073
8      mnist      10       780    6313    7877    7000
9      shuttle    7        9      10      45586   8286
Table 1: Statistics of datasets used in the experiments.


We compare Eulac with six methods, including four without exploiting unlabeled data and two utilizing them. The four algorithms are,

  • OVR-SVM is a powerful strategy for the multi-class classification problem (Rifkin and Klautau, 2004). In order to adapt OVR-SVM to the LAC problem, the algorithm predicts an instance as new when , otherwise it predicts as the classical OVR-SVM.

  • W-SVM (Scheirer et al., 2014) is an SVM-based algorithm, where both one-class SVM and binary SVM incorporating with extreme value theory (EVT) are used to predict for the new class.

  • OSNN (Mendes-Junior et al., 2017) is a nearest neighbor-based algorithm, which predicts an instance as new class if it shares similar distances with two nearest neighbors from different classes.

  • EVM (Rudd et al., 2018) is also based on the extreme value theory, and it uses non-linear radial basis functions.

Another two algorithms exploiting unlabeled data are,

  • LACU-SVM (Da et al., 2014) is an SVM based algorithm that utilizes the geometry property of unlabeled data to tune the decision boundaries of classifiers.

  • PAC-iForest (Liu et al., 2018) is an iForest (Liu et al., 2008) based method, which selects the rejection threshold by using unlabeled data to ensure a desired novelty detection ratio. We use PAC-iForest to detect new classes and SVM to classify known classes.

Parameters Setting.

For all SVM-based algorithms (OVR-SVM, W-SVM, LACU-SVM, Eulac), we use the Gaussian kernel , where the bandwidth parameter for OVR-SVM, W-SVM and Eulac is selected from the candidate sets and by the 5-fold cross validation, respectively. For W-SVM and Eulac, the regularization parameter is selected from the pool , while for OVR-SVM this parameter is set to 1. Other parameters not specified, including parameters of LACU-SVM, are set according to the corresponding papers. Finally, we use the square loss as the surrogate loss function for our algorithm.

For OSNN, we select the rejection threshold by the cross validation method proposed by the authors. When the number of known classes is 2, this parameter selection is not effective and we set . For EVM, we set and the tail size is selected by cross validation. For PAC-iForest, it is unclear how to set the alien-detection rate under the LAC setting, so we report results with . Last, we adopt the approach of Ramaswamy et al. (2016) to estimate the mixture ratio , whose value is required by our algorithm and PAC-iForest as an input parameter.

usps  75.42 ± 4.87  79.77 ± 4.97  63.14 ± 8.91  61.14 ± 6.27  69.20 ± 8.34  55.69 ± 13.3  50.27 ± 14.2  86.52 ± 2.72
segment  71.78 ± 5.12  80.82 ± 9.38  85.10 ± 5.98  82.13 ± 5.88  40.69 ± 12.5  63.64 ± 13.1  57.60 ± 17.7  86.17 ± 5.80
satimage  54.67 ± 9.80  76.29 ± 13.2  62.48 ± 11.2  72.10 ± 8.16  51.56 ± 17.3  60.76 ± 7.79  56.94 ± 11.1  81.25 ± 6.18
optdigits  80.11 ± 3.80  87.82 ± 4.64  86.97 ± 3.79  72.00 ± 8.33  80.92 ± 3.68  71.65 ± 5.46  69.54 ± 8.86  91.54 ± 2.95
pendigits  72.78 ± 5.19  87.79 ± 3.95  86.69 ± 3.39  89.94 ± 1.30  70.66 ± 6.18  73.21 ± 4.52  71.74 ± 3.59  88.41 ± 4.81
SenseVeh  48.07 ± 3.80  45.96 ± 2.32  49.91 ± 6.88  51.24 ± 3.91  51.61 ± 3.31  54.12 ± 7.19  33.63 ± 3.37  77.33 ± 2.17
landset  60.43 ± 7.65  68.91 ± 17.0  73.25 ± 9.23  76.00 ± 7.79  53.59 ± 9.88  70.50 ± 7.16  67.20 ± 6.69  85.70 ± 4.46
mnist  66.74 ± 2.76  75.38 ± 4.62  57.75 ± 10.9  58.39 ± 5.94  63.53 ± 7.58  48.31 ± 9.62  36.46 ± 10.5  80.66 ± 5.38
shuttle  37.39 ± 14.1  58.48 ± 34.5  48.21 ± 16.4  34.18 ± 13.4  29.36 ± 8.70  24.39 ± 13.5  66.49 ± 17.9
Eulac w/t/l  9/0/0  8/1/0  8/1/0  8/1/0  9/0/0  9/0/0  9/0/0  rank first 8/9
Table 2: Macro-F1 score comparisons on benchmark datasets. The best method is emphasized in bold. Besides, indicates that Eulac is significantly better than the compared methods (paired t-tests at 95% significance level) and – indicates numerical limits or errors.


Table 2 reports performance in terms of Macro-F1 score. The Eulac algorithm outperforms other contenders on most datasets, which validates the helpfulness of unlabeled data and the effectiveness of our algorithm in exploiting them. Surprisingly, W-SVM and EVM achieve better performance than LACU-SVM and PAC-iForest, which are fed with unlabeled data. This indicates that the usage of unlabeled data does not necessarily improve performance in general. Another reason might be that these geometric-based methods require parameters to be set empirically, and the default values may not be proper for all datasets. By contrast, the Eulac algorithm can perform an unbiased cross validation procedure to select proper parameters.
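For readers unfamiliar with the metric, the following is a minimal sketch of the Macro-F1 score as it is commonly defined: the unweighted mean of per-class F1 scores, with the augmented class treated as one more class. The source does not spell out its exact computation, so the function name and the convention of scoring 0 for a class with no true or predicted instances are assumptions.

```python
import numpy as np

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in labels:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)  # assumed convention
    return float(np.mean(scores))

# Label 2 stands for the augmented (new) class in this toy example.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, labels=[0, 1, 2])
```

Here the per-class F1 values are 0.5, 0.8 and 2/3, so each class contributes equally regardless of how many instances it has.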

Figure 2: Macro-F1 score comparisons of Eulac, LACU-SVM, and PAC-iForest when the number of unlabeled data increases. Panels: (a) mnist, (b) landset, (c) pendigits, (d) usps.

6.2 Comparisons with Increasing Number of Unlabeled Data

We compare the Eulac algorithm with contenders to examine its effectiveness in exploiting unlabeled data. Concretely, we vary the size of unlabeled data from 250 to 1500 with an interval of 250 on 4 datasets: mnist, landset, pendigits, and usps. Figure 2 presents results of Macro-F1 score.

The results show that the score of LACU-SVM remains unchanged or even drops on the four datasets, while performance of our algorithm improves when provided with more unlabeled data, which is in accordance with the theoretical analysis in Section 5. This again validates that our algorithm can exploit unlabeled data effectively. Notice that PAC-iForest also enjoys theoretical guarantees. Nevertheless, the guarantees are only for the novelty detection ratio, and thus the overall performance over the testing distribution is not guaranteed to improve, as validated in experiments.

6.3 Performance in Various Environments

We are curious about performance of our algorithm in various environments where the fraction of new classes increases. To this end, we conduct experiments on 9 datasets with the unknown class ratio ranging from 0, 0.2, 0.6, 0.8. The mixture ratio is supposed to be known in advance. Figures 3(a) and 3(b) show performance variation in terms of accuracy and Macro-F1.

(a) Accuracy
(b) Macro-F1
Figure 3: Performance of our approach in various environments.

We observe that our algorithm retains high performance in most cases as the mixture ratio changes, which verifies the adaptivity of our algorithm to environments with different unknown class ratios.
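One way to build test sets with a prescribed unknown class ratio is sketched below. This is an illustrative assumption about the setup rather than the authors' procedure: the function `mix_test_set`, its parameters, and the toy pools are hypothetical, and the ratio is read as the fraction of test instances drawn from new classes.

```python
import random


def mix_test_set(known_pool, new_pool, n, new_ratio, seed=0):
    """Draw n test instances so that a fraction new_ratio comes from new classes."""
    rng = random.Random(seed)
    n_new = round(n * new_ratio)
    n_known = n - n_new
    # Sample without replacement from each pool, then shuffle the mixture.
    sample = rng.sample(known_pool, n_known) + rng.sample(new_pool, n_new)
    rng.shuffle(sample)
    return sample


# Toy pools (illustrative only): each instance is tagged with its origin.
known_pool = [("known", i) for i in range(100)]
new_pool = [("new", i) for i in range(100)]

# Ratios taken from the experiment description above.
mixtures = {r: mix_test_set(known_pool, new_pool, 50, r) for r in (0, 0.2, 0.6, 0.8)}
```

Fixing the seed keeps the mixtures reproducible across ratios, which makes the resulting accuracy and Macro-F1 curves easier to compare.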

7 Conclusion

In this paper, we show that it is possible to establish an unbiased risk estimator for the LAC problem by exploiting unlabeled data. The key observation is that the distribution of new classes can be effectively approximated by the distributions of labeled and unlabeled data, which enables us to evaluate the risk of a classifier over the testing distribution in the training stage. Subsequently, we develop the Eulac algorithm to perform empirical minimization of this unbiased risk, and the approach enjoys nice theoretical properties. Notably, we provide a generalization bound over the testing distribution, which demonstrates that the performance of our approach improves with an increasing number of labeled and unlabeled data, and thus theoretically justifies its effectiveness in exploiting unlabeled data. Extensive empirical studies also validate the efficacy of the proposed approach.
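The key observation can be written compactly. The display below is a sketch in assumed notation: $p_{te}$, $p_{kc}$, and $p_{nc}$ denote the marginal densities of the testing data, the known classes, and the new classes, and $\theta$ denotes the proportion of known classes in the testing distribution. These symbols are chosen here for illustration, and the first line presumes that the testing distribution is a mixture of known and new classes.

```latex
\begin{aligned}
p_{te}(x) &= \theta\, p_{kc}(x) + (1-\theta)\, p_{nc}(x), \\
(1-\theta)\, p_{nc}(x) &= p_{te}(x) - \theta\, p_{kc}(x), \\
(1-\theta)\, \mathbb{E}_{x \sim p_{nc}}[g(x)] &= \mathbb{E}_{x \sim p_{te}}[g(x)] - \theta\, \mathbb{E}_{x \sim p_{kc}}[g(x)].
\end{aligned}
```

Under this reading, any expectation over the new classes is a difference of two expectations that can be approximated by sample averages over unlabeled data (drawn from the testing distribution) and labeled data (drawn from the known classes).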


References

  • Bartlett and Wegkamp (2008) Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, pages 1823–1840, 2008.
  • Bousquet et al. (2003) Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, Machine Learning Summer Schools 2003, pages 169–207, 2003.
  • Cai et al. (2019) Xin-Qiang Cai, Peng Zhao, Kai Ming Ting, Xin Mu, and Yuan Jiang. Nearest neighbor ensembles: An effective method for difficult problems in streaming classification with emerging new classes. In Proceedings of the 19th International Conference on Data Mining (ICDM), 2019.
  • Chow (1970) C. K. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, pages 41–46, 1970.
  • Cortes et al. (2016a) Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Learning with rejection. In Proceedings of International Conference on Algorithmic Learning Theory (ALT), pages 67–82, 2016a.
  • Cortes et al. (2016b) Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Boosting with abstention. In Advances in Neural Information Processing Systems 29 (NIPS), pages 1660–1668, 2016b.
  • Da et al. (2014) Qing Da, Yang Yu, and Zhi-Hua Zhou. Learning with augmented class by exploiting unlabeled data. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI), pages 1760–1766, 2014.
  • Dietterich (2017) Thomas G. Dietterich. Steps toward robust artificial intelligence. AI Magazine, pages 3–24, 2017.
  • Dietterich (2019) Thomas G. Dietterich. Robust artificial intelligence and robust human organizations. Frontiers of Computer Science, pages 1–3, 2019.
  • du Plessis et al. (2014) Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In Advances in Neural Information Processing Systems 27 (NIPS), pages 703–711, 2014.
  • Fink et al. (2006) Michael Fink, Shai Shalev-Shwartz, Yoram Singer, and Shimon Ullman. Online multiclass learning by interclass hypothesis sharing. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 313–320, 2006.
  • Geifman and El-Yaniv (2019) Yonatan Geifman and Ran El-Yaniv. SelectiveNet: A deep neural network with an integrated reject option. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 2151–2159, 2019.
  • Herbei and Wegkamp (2006) Radu Herbei and Marten H Wegkamp. Classification with reject option. Canadian Journal of Statistics, pages 709–721, 2006.
  • Kawakubo et al. (2016) Hideko Kawakubo, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Computationally efficient class-prior estimation under class balance change using energy distance. IEICE Transactions on Information and Systems, pages 176–186, 2016.
  • Koltchinskii (2011) Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer, 2011.
  • Liu et al. (2008) Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), pages 413–422, 2008.
  • Liu et al. (2018) Si Liu, Risheek Garrepalli, Thomas G. Dietterich, Alan Fern, and Dan Hendrycks. Open category detection with PAC guarantees. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 3175–3184, 2018.
  • Mendes-Junior et al. (2017) Pedro Ribeiro Mendes-Junior, Roberto Medeiros de Souza, Rafael de Oliveira Werneck, Bernardo V. Stein, Daniel V. Pazinato, Waldir R. de Almeida, Otávio A. B. Penatti, Ricardo da Silva Torres, and Anderson Rocha. Nearest neighbors distance ratio open-set classifier. Machine Learning, pages 359–386, 2017.
  • Mohri et al. (2012) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
  • Mu et al. (2017) Xin Mu, Kai Ming Ting, and Zhi-Hua Zhou. Classification under streaming emerging new classes: A solution using completely-random trees. IEEE Transactions on Knowledge and Data Engineering, 29(8):1605–1618, 2017.
  • Muhlbaier et al. (2009) Michael D. Muhlbaier, Apostolos Topalis, and Robi Polikar. Combining ensemble of classifiers with dynamically weighted consult-and-vote for efficient incremental learning of new classes. IEEE Transactions on Neural Networks and Learning Systems, pages 152–168, 2009.
  • Ramaswamy et al. (2016) Harish G. Ramaswamy, Clayton Scott, and Ambuj Tewari. Mixture proportion estimation via kernel embeddings of distributions. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 2052–2060, 2016.
  • Rifkin and Klautau (2004) Ryan M. Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, pages 101–141, 2004.
  • Rudd et al. (2018) Ethan Rudd, Lalit P. Jain, Walter J. Scheirer, and Terrance Boult. The extreme value machine. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • Scheirer et al. (2013) Walter J. Scheirer, Anderson Rocha, Archana Sapkota, and Terrance E. Boult. Towards open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
  • Scheirer et al. (2014) Walter J. Scheirer, Lalit P. Jain, and Terrance E. Boult. Probability models for open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 2317–2324, 2014.
  • Scholkopf and Smola (2001) Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
  • Scott and Blanchard (2009) Clayton Scott and Gilles Blanchard. Novelty detection: Unlabeled data definitely help. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 464–471, 2009.
  • Wang and Qiao (2018) Wenbo Wang and Xingye Qiao. Learning confidence sets using support vector machines. In Advances in Neural Information Processing Systems 31 (NeurIPS), pages 4934–4943, 2018.
  • Yuan and Wegkamp (2010) Ming Yuan and Marten H. Wegkamp. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, pages 111–130, 2010.
  • Zhang (2004) Tong Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, pages 1225–1251, 2004.
  • Zhou and Chen (2002) Zhi-Hua Zhou and Zhaoqian Chen. Hybrid decision tree. Knowledge-Based Systems, pages 515–528, 2002.

Appendix A Proof of Proposition 1


For simplicity, we substitute by in the proof. First, recall that the risk of OVR strategy is defined as,

According to the class shift condition, we have


Since holds for all and , we can reform the as,

Since the marginal distribution of new classes is unknown, we reduce it to the difference of the marginal distributions of training and testing data. Under the class shift condition, we have,

By summing over the label space , we obtain the marginal distribution ,

where the can be further converted to the following form,


We complete the proof by plugging (10) into (9). ∎

Appendix B Proofs of Theorem 1 and Theorem 2

Before presenting the proofs of Theorem 1 and Theorem 2, for self-containedness, we introduce results regarding the infinite-sample consistency (ISC) of the OVR strategy, originally provided by Zhang [2004].

Theorem 5 (Theorem 10 of Zhang [2004]).

Consider the OVR method, whose surrogate loss function is defined as . Assume is convex, bounded below, differentiable, and when . Then, OVR method is infinite-sample consistency (ISC) on with respect to 0-1 classification risk.

Then, we present the relationship between the risk of an ISC method and the Bayes error,

Theorem 6 (Theorem 3 of Zhang [2004]).

Let be the set of all vector Borel measurable functions, which take values in . For , let . If is ISC on with respect to 0-1 classification risk, then for any , there exists such that for all underlying Borel probability measurable , and ,


where is defined as , and is the optimal Bayes error.

For the OVR strategy, we can further obtain a more quantitative bound.

Theorem 7 (Theorem 11 of Zhang [2004]).

Under the assumptions of Theorem 5, the function is concave on . Assume that there exists a constant such that

then we know that for any ,

The proofs of Theorem 1 and Theorem 2 are consequences of Proposition 2 and the above theorems. We provide the proofs below.

Proof of Theorem 1.

According to Proposition 2, the LAC risk equals the risk of the OVR strategy . Thus, to prove the infinite-sample consistency of the LAC risk , it is sufficient to demonstrate this property for the OVR strategy over distribution , which is shown by Theorem 5 and Theorem 6. ∎

Proof of Theorem 2.

To prove Theorem 2, we first show the consistency of OVR strategy with square loss. It is easy to verify that, when taking , we have , which is concave on . As a consequence, the inequality

holds for all , with .

According to Theorem 7, the excess risk with respect to the 0-1 loss function over is bounded by that of the OVR method,

where . Then by applying the equality of the risk of OVR strategy and that of our algorithm from Proposition 2, we complete the proof. ∎
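The concavity claim used in the proof of Theorem 2 can be checked by direct computation. The sketch below assumes that the square loss takes the form $\phi(z) = (1-z)^2$ and that the function in question is $V_\phi(q) = \inf_z [q\phi(z) + (1-q)\phi(-z)]$ for $q \in [0,1]$. Both the form of $\phi$ and the symbol $V_\phi$ are assumptions made here for illustration, since the original symbols are not reproduced above.

```latex
\begin{aligned}
\frac{\partial}{\partial z}\Big[ q(1-z)^2 + (1-q)(1+z)^2 \Big] = 0
  &\;\Longrightarrow\; z^\star = 2q - 1, \\
V_\phi(q) = q(2-2q)^2 + (1-q)(2q)^2 &= 4q(1-q), \qquad V_\phi''(q) = -8 < 0, \\
2V_\phi\!\Big(\tfrac{q+q'}{2}\Big) - V_\phi(q) - V_\phi(q') &= 2(q-q')^2 .
\end{aligned}
```

Under these assumptions $V_\phi$ is strictly concave on $[0,1]$, and the last line shows that the midpoint gap of $V_\phi$ is exactly proportional to $(q-q')^2$, which is the kind of relation that a condition such as the one in Theorem 7 asks for.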

Appendix C Proof of Theorem 3


Recall that,




To obtain the generalization bound of , it is sufficient to establish the generalization bound of and .

Firstly, we study the generalization bound of . With the kernel-based hypothesis set and , according to McDiarmid’s inequality and the standard analysis for generalization bound based on Rademacher complexity [Mohri et al., 2012, Theorem 3.1], we have that


holds with probability at least , where

The Rademacher complexity of hypothesis set can be further bounded by,