Hierarchical Dirichlet Process-based Open Set Recognition

06/29/2018, by Chuanxing Geng et al.

In this paper, we propose a novel hierarchical Dirichlet process-based classification framework for open set recognition (HDP-OSR), where samples of new categories unseen in training appear during testing. Unlike existing methods, which deal with this problem from the perspective of discriminative models, we reconsider it from the perspective of generative models. We model the data of each known class in the training set as a group in a hierarchical Dirichlet process (HDP), treat the testing set as a whole in the same way, and then co-cluster all the groups under the HDP framework. Owing to the properties of the HDP, our HDP-OSR does not overly depend on the training samples and can adapt as the data changes. More precisely, HDP-OSR can automatically reserve space for unknown categories and can also discover new categories, meaning it naturally adapts to the open set recognition scenario. Furthermore, treating the testing set as a whole lets our framework take the correlations among the testing samples into account, whereas existing methods obviously ignore this information. Experimental results on a set of benchmark datasets indicate the validity of our learning framework.


1 Introduction

In real-world recognition/classification tasks, limited by various practical factors, it is usually difficult to collect training instances of all classes exhaustively when training a classifier. A more realistic scenario is open set recognition (OSR) [1], in which incomplete knowledge of the world exists at training time and unknown classes can be submitted to the algorithm during testing, requiring the classifier not only to classify the seen (known) classes accurately, but also to deal effectively with the unseen (unknown) ones.

The main challenge of OSR is that traditional classifiers, usually trained under the closed set assumption, allocate over-occupied space to the known classes, and thus misclassify instances of unknown classes unseen in training as known classes. To meet this challenge, related studies have been conducted under a number of frameworks, assumptions, and names [2, 3, 4, 5, 6, 7]. For example, Phillips et al. [2] proposed a typical framework for open set identity recognition in a study on evaluation methods for face recognition, while Li and Wechsler [3] again viewed open set face recognition from an evaluation perspective and proposed the Open Set TCM-kNN algorithm. It was Scheirer et al. [1] who first formalized the open set recognition problem and proposed a preliminary solution, the 1-vs-Set machine, which incorporates an open space risk term into the model to account for the space beyond the reasonable support of the known classes. Although the 1-vs-Set machine shrinks the region assigned to each known class by each binary support vector machine (SVM), the space occupied by each known class remains unbounded, so the open space risk still exists. As shown in Fig. 1, the 1-vs-Set machine will make misclassifications if instances of the unknown classes ?2 and ?3 appear in testing. Researchers have since made many further efforts to overcome this problem.

Fig. 1: Only known classes 1-4 are available in training, while unknown classes ?1-?6 appear during testing. ’A’ and ’B’ are the decision boundaries of class 1 obtained by 1-vs-Set machine, while ’C’ is the decision boundary of class 1 obtained by OSNN.

Scheirer et al. [8] incorporated non-linear kernels into a solution that further limits the open space risk by positively labeling only sets with finite measure, and proposed a novel Weibull-calibrated SVM (W-SVM), which combines the statistical extreme value theory (EVT) for score calibration with one-class and binary SVMs. Based on the intuition that the large set of unknown classes can be rejected, even under an assumption of incomplete class knowledge, if the positive data of each known class is modeled accurately without overfitting, Jain et al. [9] invoked EVT to model the positive training data at the decision boundary and proposed the P_I-SVM algorithm. Note that both W-SVM and P_I-SVM adopt a threshold-based classification scheme, so the threshold plays a key role. However, the thresholds in those models are usually assumed to be equal for all known classes, which is not reasonable since the distributions of the known classes in feature space are unknown. On the other hand, the authors in [8, 9] recommended setting this threshold according to the problem openness, but unfortunately the openness of the corresponding problem is usually also unknown. To overcome these deficiencies, Scherreik et al. [10] proposed the probabilistic open set SVM classifier (POS-SVM), in which a distinct reject threshold for each known class is determined empirically from the knowledge of the known classes.

Recently, Júnior et al. [11] extended the nearest neighbor classifier to the OSR scenario and proposed the OSNN classifier. Zhang et al. [12] proposed the SROSR algorithm based on sparse representation, modeling the tails of the matched and sum of non-matched reconstruction error distributions using EVT. To address the observation that existing algorithms take little to no distributional information into account when learning recognition functions and lack a strong theoretical foundation, Rudd et al. [13] formulated a theoretically grounded classifier, the Extreme Value Machine (EVM), which was further developed in [14]. Besides, researchers have also explored open set recognition based on deep neural networks [15, 16, 17, 18, 19, 20, 21, 22, 23].

In summary, all existing OSR algorithms are designed specifically for recognizing individual instances, even when these instances arrive collectively in a batch, as in image-set recognition [24]. The only decision such a recognizer can make is to reject each instance or assign it to some known class, one instance at a time, using an empirically set threshold. The threshold thus plays a key role; however, its selection is usually based on the knowledge of the known classes alone, inevitably incurring risk because no information about the unknown classes is available. As shown in Fig. 1, the decision boundary 'C' obtained by OSNN for known class 1 (the decision boundary of a class defines the region in which a possible testing sample will be classified as belonging to that class) can reject a large set of unknown classes, whereas it still makes misclassifications when unknown class ?4 appears in testing.

On the other hand, a more realistic or desirable OSR system should not rest on a reject decision but should go further, especially toward discovering the unknown classes hidden among the rejected instances. Unfortunately, existing OSR methods do not directly provide such a mechanism. Although Bendale and Boult [25] introduced the open world recognition framework, which can collect and label (e.g., by humans) the rejected instances for further use in updating the OSR model, it is in fact still a post hoc strategy requiring human intervention. Meanwhile, the authors in [21] transferred the knowledge of similarities and differences among known classes to rediscover new classes among the already-rejected samples; obviously, this is also a post hoc approach. Therefore, it is necessary to design a model specifically for this problem.

Towards this goal, in this paper we introduce a novel collective/batch decision strategy with the aim of extending existing open set recognition to new class discovery. Since the properties of the hierarchical Dirichlet process (HDP) fit our problem, we adapt the HDP with slight modification to address the OSR problem, as an initial solution towards collective decision for open set recognition. The HDP could also be replaced by other Bayesian nonparametric techniques [26], such as the hierarchical beta process [27], but that is beyond our focus here. Concretely, a collective decision-based OSR framework (CD-OSR) is proposed. Thanks to the properties of the HDP, which does not overly depend on the training data and can adapt as the data changes, our CD-OSR does not need a specific threshold and can automatically reserve space for unknown classes in testing, naturally providing the function of new class discovery. Additionally, treating the testing instances in a batch lets CD-OSR take into account correlations among the instances that existing methods obviously ignore. Note that CD-OSR can handle both batch and individual instances. Specifically, the contributions of our CD-OSR can be highlighted as follows:

  1. A novel collective/batch decision strategy is introduced for open set recognition for the first time; it can address instances in batches or even individually. Specifically, a collective decision-based OSR framework (CD-OSR) is proposed, which can solve the existing OSR problem together with simultaneous new class discovery.

  2. CD-OSR does not need a specific threshold and can automatically reserve space for unknown classes in testing, naturally providing a new class discovery function.

  3. Treating the testing instances in a batch allows CD-OSR to consider correlations among the instances that the other existing OSR methods obviously ignore.

  4. A thorough empirical evaluation of CD-OSR is reported, showing a significant improvement in classification performance and demonstrating the function of new class discovery.

The remainder of this paper is organized as follows. Section 2 reviews related work in open set recognition. Section 3 introduces the collective decision strategy for open set recognition and specifies the collective decision-based OSR framework. Experimental evaluation is given in Section 4, where the classification performance and the function of new class discovery are reported. Finally, Section 5 concludes the paper.

2 Related Work

With the formalization of OSR developed in [1], the openness of a particular problem or data universe is defined by considering the numbers of training, target, and testing classes:

$\mathrm{openness} = 1 - \sqrt{\dfrac{2 \times |C_{TR}|}{|C_{TA}| + |C_{TE}|}}$   (1)

where $|C_{TR}|$, $|C_{TA}|$, and $|C_{TE}|$ denote the numbers of classes used in training, to be recognized (target), and used in testing, respectively. Larger openness corresponds to more open problems, while the problem is completely closed when the openness equals 0. Furthermore, the OSR problem can be defined as follows: given a set of training data $V$, an open space risk $R_{\mathcal{O}}$, and an empirical risk $R_{\epsilon}$, the goal of OSR is to find a measurable recognition function $f$ by minimizing the following open set risk

$\arg\min_{f} \{ R_{\mathcal{O}}(f) + \lambda_r R_{\epsilon}(f(V)) \}$   (2)

where $\lambda_r$ is a regularization constant. Guided by this definition, a large number of OSR algorithms have been proposed. Next, we briefly review the representative approaches.
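As a quick numerical illustration, the following minimal sketch (my own, with helper and variable names that are not from the paper) computes the openness in Eq. (1) for the class counts used later in the experiments; it assumes the reconstruction of Eq. (1) given above.

```python
import math

# Openness of Eq. (1), given the numbers of training, target, and testing classes.
def openness(n_train_classes, n_target_classes, n_test_classes):
    return 1.0 - math.sqrt(2.0 * n_train_classes / (n_target_classes + n_test_classes))

print(round(openness(10, 10, 20), 4))  # 0.1835 -> the 18.35% LETTER setting in Section 4.2.3
print(round(openness(5, 5, 8), 4))     # 0.1229 -> the 12.29% USPS/PENDIGITS setting
```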

2.1 The Existing OSR Methods

2.1.1 The 1-vs-Set Machine

Using the definition of OSR, an SVM-based OSR method called the 1-vs-Set machine [1] was proposed, in which the open space risk is taken to be the ratio of the Lebesgue measure of the positively labeled open space to the overall measure of the positively labeled space. Concretely, a hyperplane 'B' (shown in Fig. 1), parallel to the separating hyperplane 'A' obtained by the SVM, is added in score space, yielding a slab in feature space. Furthermore, the open space risk of the linear kernel slab model is defined as follows:

$R_{\mathcal{O}} = \dfrac{\delta_{\Omega} - \delta_{A}}{\delta^{+}} + p_{A}\,\omega_{A} + p_{\Omega}\,\omega_{\Omega}$   (3)

where $\delta_{A}$ and $\delta_{\Omega}$ denote the marginal distances of the corresponding hyperplanes, and $\delta^{+}$ is the separation needed to account for all positive data. Additionally, user-specified parameters $p_{A}$ and $p_{\Omega}$ are given to weight the importance between the margin spaces $\omega_{A}$ and $\omega_{\Omega}$.

After training the 1-vs-Set machine, a testing instance that falls between the two hyperplanes is labeled with the appropriate class; otherwise, it is considered a non-target class or rejected, depending on which side of the slab it lies. As discussed in Section 1, the 1-vs-Set machine reduces the open space risk to some extent; however, each known class still occupies infinite space, meaning the open space risk still exists.
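For illustration only, here is a minimal sketch (my own, not the authors' implementation) of the slab-style decision described above for a linear model: a score between the two parallel hyperplanes 'A' and 'B' is labeled as the target class, and anything outside the slab is treated as non-target or rejected. The weights, bias, and slab width below are hypothetical.

```python
import numpy as np

def slab_decision(X, w, b, delta_B):
    """Label instances as target (1) if their linear score lies inside the slab
    [0, delta_B] between hyperplanes 'A' (score 0) and 'B' (score delta_B),
    otherwise as non-target / rejected (-1)."""
    scores = X @ w + b
    inside = (scores >= 0.0) & (scores <= delta_B)
    return np.where(inside, 1, -1)

# toy usage with hypothetical parameters
w, b = np.array([1.0, -0.5]), 0.2
X = np.array([[0.3, 0.1], [5.0, -4.0], [-2.0, 1.0]])
print(slab_decision(X, w, b, delta_B=1.5))  # [ 1 -1 -1]
```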

2.1.2 The W-SVM Model

To further reduce the open space risk, Scheirer et al. [8] incorporated non-linear kernels into a solution that limits the open space risk by positively labeling only sets with finite measure. They formulated a compact abating probability (CAP) model, in which the probability of class membership abates as points move from the known data towards open space. Specifically, a Weibull-calibrated SVM (W-SVM) model was proposed, which combines EVT-based score calibration with two separate SVMs. The first is a one-class SVM CAP model used as a conditioner: if the one-class SVM predicts that the posterior estimate $P_{O}(y|x)$ of an input instance $x$ is less than a threshold $\delta_{\tau}$, the instance is rejected outright; otherwise, it is passed to the second SVM. The second is a binary SVM CAP model with a fitted Weibull cumulative distribution function yielding the posterior estimate $P_{\eta}(y|x)$ of the corresponding positive class, and a reverse Weibull fit yielding the posterior estimate $P_{\psi}(y|x)$ of the corresponding negative classes. Define an indicator variable $\iota_{y}$: $\iota_{y}=1$ if $P_{O}(y|x) > \delta_{\tau}$ and $\iota_{y}=0$ otherwise; then the W-SVM model for OSR is defined as follows:

$y^{*} = \arg\max_{y \in \mathcal{Y}} P_{\eta}(y|x)\, P_{\psi}(y|x)\, \iota_{y}$, subject to $P_{\eta}(y^{*}|x)\, P_{\psi}(y^{*}|x) \geq \delta_{R}$   (4)

where $\mathcal{Y}$ denotes the set of all known classes, and $\delta_{R}$ is the threshold of the second SVM CAP model.

Additionally, the thresholds $\delta_{\tau}$ and $\delta_{R}$ are set empirically; e.g., $\delta_{\tau}$ is fixed to 0.001 as specified by the authors, while $\delta_{R}$ is recommended to be set according to the openness of the specific problem as

$\delta_{R} = 0.5 \times \mathrm{openness}$   (5)

The W-SVM effectively limits the open space risk through its threshold-based classification scheme; however, such a threshold setting, especially for $\delta_{R}$, is risky since we usually have no prior knowledge about the unknown classes.
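To make the decision rule in Eq. (4) concrete, here is a minimal sketch (my own, not the authors' code) that assumes the three posterior estimates have already been computed for each known class; the EVT/Weibull calibration itself is omitted, and all probability values and thresholds below are hypothetical.

```python
import numpy as np

def wsvm_predict(p_one_class, p_pos, p_neg, delta_tau=0.001, delta_R=0.1):
    """p_one_class[y], p_pos[y], p_neg[y]: assumed posterior estimates for instance x
    from the one-class CAP model and the Weibull / reverse-Weibull calibrated binary SVM."""
    best_y, best_score = None, -np.inf
    for y in p_pos:
        iota = 1.0 if p_one_class[y] > delta_tau else 0.0  # one-class conditioner
        score = p_pos[y] * p_neg[y] * iota
        if score > best_score:
            best_y, best_score = y, score
    return best_y if best_score >= delta_R else "unknown"

# toy usage with hypothetical posteriors for three known classes
print(wsvm_predict({0: 0.9, 1: 0.0005, 2: 0.4},
                   {0: 0.7, 1: 0.9, 2: 0.2},
                   {0: 0.8, 1: 0.9, 2: 0.3}))  # class 0 wins; class 1 is gated out
```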

2.1.3 The OSNN Model

Adapting the traditional closed-set nearest neighbor classifier to the OSR scenario, Júnior et al. [11] proposed the OSNN classifier. Let $\ell(\cdot)$ denote the class of an instance and let $\mathcal{L}$ be the set of training labels (known classes). For a testing instance $s$, OSNN first finds its nearest neighbor $t$ and then the nearest neighbor $u$ with $\ell(u) \neq \ell(t)$. One then calculates the ratio $R = d(s,t)/d(s,u)$, where $d(\cdot,\cdot)$ is the Euclidean distance between two instances in feature space. If $R$ is less than or equal to a pre-set threshold $T$ ($0 < T < 1$), $s$ is assigned the same label as $t$; otherwise, it is considered unknown, i.e.,

$\ell(s) = \ell(t)$ if $R \leq T$, and $\ell(s) = \mathrm{unknown}$ if $R > T$   (6)

Note that applying a threshold to the ratio of similarity scores works better than applying it to the similarity scores themselves, as reported in [11]. However, the selection of such a threshold is still an empirical setting, inevitably incurring risk due to the lack of available information from the unknown classes. As described in Section 1, the OSNN will make a misclassification when unknown class ?4 appears in testing. In addition, selecting only two reference instances from different classes for comparison makes the OSNN model vulnerable to outliers.
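The rule in Eq. (6) can be stated in a few lines of code. The sketch below is my own illustration of the nearest-neighbor distance ratio idea, not the authors' implementation, and the threshold value is only an example.

```python
import numpy as np

def osnn_predict(s, X_train, y_train, T=0.5):
    """Nearest-neighbor distance ratio rule: compare the nearest neighbor t with the
    nearest neighbor u belonging to a different class, and reject if the ratio is large."""
    d = np.linalg.norm(X_train - s, axis=1)
    t = np.argmin(d)                                   # nearest neighbor overall
    other = y_train != y_train[t]                      # candidates from any other class
    u = np.where(other)[0][np.argmin(d[other])]        # nearest neighbor of a different class
    R = d[t] / d[u]                                    # distance ratio
    return y_train[t] if R <= T else "unknown"

# toy usage
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
y = np.array([0, 0, 1])
print(osnn_predict(np.array([0.05, 0.02]), X, y))      # close to class 0 -> 0
print(osnn_predict(np.array([2.5, 2.6]), X, y))        # ambiguous -> "unknown"
```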

2.2 Unseen Class Discovery in Existing OSR

In fact, some researchers have also paid attention to unknown class discovery in OSR. To further extend open set recognition, the authors in [25] formalized the open world recognition problem: a recognition system should perform four tasks, namely detecting unknown classes (open set recognition), choosing which instances to label for addition to the model, labelling those instances, and updating the classifier. Ideally, all of these tasks should be automated, but in [25] the authors simply presumed supervised learning with labels obtained by human labeling. Moreover, the unknown class discovery in [25] is actually a post hoc strategy.

Besides, Shu et al. [21] focused mainly on discovering the unknown classes hiding among the rejected instances by transferring the knowledge of the similarities and differences among known classes. Correspondingly, a joint open classification framework with four components was proposed: an Open Classification Network (OCN) for open set recognition, a Pairwise Classification Network (PCN) for deciding whether two input instances come from the same class, an auto-encoder for learning representations from unlabeled instances, and a hierarchical clustering model that clusters the instances rejected by the OCN using the PCN as the distance function. When the testing instances arrive, this framework first performs open set recognition using the OCN and then collects the rejected instances for further clustering; it is therefore obviously a post hoc approach as well. In addition, the use of knowledge from the known classes is risky when the transferred knowledge differs between known and unknown classes.

3 Collective Decision for Open Set Recognition

As discussed previously, the existing OSR methods are designed specifically for recognizing individual instances, even when these instances arrive collectively in a batch. At decision time, such recognizers either reject each instance or assign it to some known class, one instance at a time, using an empirically set threshold. The threshold plays a key role, yet its selection is usually based on knowledge of the known classes only, inevitably incurring risk due to the lack of information about the unknown classes. On the other hand, a more realistic OSR system should not rest on a reject decision but should go further, especially toward discovering the unknown classes hidden among the rejected instances; regrettably, existing OSR methods pay little attention to this. Although [25, 21] have made some efforts, they remain post hoc strategies.

To overcome the limitations mentioned above, we introduce a novel collective decision strategy for the OSR problem with the aim of extending existing open set recognition to new class discovery. Specifically, a collective decision-based OSR framework (CD-OSR) is proposed by slightly modifying the HDP. Thanks to the properties of the HDP, our CD-OSR does not need a specific threshold and can automatically reserve space for unknown classes in testing, naturally providing the function of new class discovery. Moreover, treating the testing instances in a batch lets CD-OSR consider correlations among the instances that existing methods obviously ignore. Additionally, CD-OSR can handle both batch and individual instances.

Next, we give a brief review of the hierarchical Dirichlet process (HDP) [28], which is widely used in machine learning for co-clustering multiple groups of data by sharing mixture components among the groups. Note that we use the terms "group" and "class" interchangeably in the rest of the paper, and we assume that the data of each group/class comes from a mixture model with an unknown number of components. We likewise use the terms "components" and "subclasses" interchangeably.

3.1 Hierarchical Dirichlet Process

The Dirichlet process (DP) [29, 30], regarded as a distribution over distributions, is a stochastic process mainly used in clustering and density estimation problems as a nonparametric prior over the number of mixture components. The hierarchical Dirichlet process (HDP) [28] is a hierarchical extension of the DP that models each group of data as a Dirichlet process mixture model (DPM). This hierarchical structure provides an elegant way of sharing parameters, allowing the DPM models of different groups to be connected through a higher-level DP.

Let $x_{ji}$ denote the $i$-th sample in the $j$-th group, where $i = 1, \ldots, n_j$, $n_j$ denotes the number of samples in group $j$, $J$ is the total number of groups, and $\theta_{ji}$ represents the parameter of the mixture component associated with $x_{ji}$. The HDP framework is then completed as follows:

$G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H)$, $G_j \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0)$, $\theta_{ji} \mid G_j \sim G_j$, $x_{ji} \mid \theta_{ji} \sim F(\theta_{ji})$   (7)

where the global distribution $G_0$ is distributed as a Dirichlet process with concentration parameter $\gamma$ and base distribution $H$, and each group-specific distribution $G_j$ is distributed according to a DP with concentration parameter $\alpha_0$ and base distribution $G_0$. Moreover, as $\alpha_0$ increases, the number of components (or clusters) used to represent each group's data increases. Note that although increasing $\gamma$ can add clusters used to represent the data of all groups, the degree to which these clusters are shared between groups decreases at the same time [31].
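As a rough illustration of the generative structure in Eq. (7), the following sketch (my own, not the paper's code) draws grouped one-dimensional Gaussian data from a truncated HDP, using the standard finite approximation in which the group-level weights are Dirichlet-distributed around the global stick-breaking weights; all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hdp(n_groups=3, n_per_group=50, K=20, gamma=5.0, alpha0=3.0):
    # global DP approximated by K stick-breaking weights beta and atoms drawn from H
    v = rng.beta(1.0, gamma, size=K)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta = beta / beta.sum()                          # renormalize the truncated sticks
    atoms = rng.normal(0.0, 5.0, size=K)              # H = N(0, 5^2): component means
    data, labels = [], []
    for _ in range(n_groups):
        pi_j = rng.dirichlet(alpha0 * beta + 1e-6)    # group-level weights around beta
        z = rng.choice(K, size=n_per_group, p=pi_j)   # component (subclass) assignments
        x = rng.normal(atoms[z], 1.0)                 # F(theta) = N(theta, 1)
        data.append(x)
        labels.append(z)
    return data, labels, atoms

groups, assignments, means = sample_hdp()
print([np.unique(z).size for z in assignments])       # components actually used per group
```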

An intuitive understanding of the generative process defined by an HDP model can be gained through an analogy with the Chinese restaurant franchise (CRF), which extends the Chinese restaurant process by allowing multiple restaurants to share a set of dishes. In the CRF metaphor, customer $i$ in restaurant $j$ is associated with $\theta_{ji}$ and sits at table $t_{ji}$, while each table is associated with one of the random draws $\phi_k$ from $G_0$, which form the global menu of dishes. Moreover, the dish from the global menu served at table $t$ in restaurant $j$ is indicated by the variable $k_{jt}$. In addition, the concentration parameter $\gamma$ controls the prior probability of serving a new dish at a new table [32].

In this framework, the restaurants correspond to groups, the tables in each restaurant correspond to the mixture components of the DP mixture model, and the dishes on the global menu correspond to the unique set of parameters shared among the restaurants.

The conditional distributions of $\theta_{ji}$ and of the table-specific parameters $\psi_{jt}$ can be obtained by integrating out $G_j$ and $G_0$, respectively:

$\theta_{ji} \mid \theta_{j1}, \ldots, \theta_{j,i-1}, \alpha_0, G_0 \sim \sum_{t=1}^{m_j} \dfrac{n_{jt}}{i-1+\alpha_0}\,\delta_{\psi_{jt}} + \dfrac{\alpha_0}{i-1+\alpha_0}\,G_0$   (8)

where $m_j$ represents the number of tables in restaurant $j$, and $n_{jt}$ is the number of customers in restaurant $j$ sitting at table $t$. According to (8), $\theta_{ji}$ is assigned to one of the existing $\psi_{jt}$ with probability $n_{jt}/(i-1+\alpha_0)$ or to a new table drawn from $G_0$ with probability $\alpha_0/(i-1+\alpha_0)$; some conditioning variables are omitted for ease of writing. Similarly,

$\psi_{jt} \mid \psi_{11}, \psi_{12}, \ldots, \gamma, H \sim \sum_{k=1}^{K} \dfrac{m_{\cdot k}}{m_{\cdot\cdot}+\gamma}\,\delta_{\phi_k} + \dfrac{\gamma}{m_{\cdot\cdot}+\gamma}\,H$   (9)

where $m_{\cdot k}$ represents the number of tables across all restaurants serving dish $k$, and $m_{\cdot\cdot}$ denotes the total number of tables occupied by all restaurants. According to (9), $\psi_{jt}$ inherits one of the existing dishes $\phi_k$ with probability $m_{\cdot k}/(m_{\cdot\cdot}+\gamma)$ or takes a new dish drawn from $H$ with probability $\gamma/(m_{\cdot\cdot}+\gamma)$; again, some conditioning variables are omitted for ease of writing.

Inference in the CRF can be performed with a Gibbs sampling scheme [33]. Rather than dealing with the $\theta_{ji}$'s and $\psi_{jt}$'s directly, the indicator variables $t_{ji}$ and $k_{jt}$ are sampled instead. Specifically, let $\mathbf{t} = \{t_{ji}\}$, $\mathbf{k} = \{k_{jt}\}$, and $\mathbf{x} = \{x_{ji}\}$; we then have

Sampling $t_{ji}$:

$p(t_{ji} = t \mid \mathbf{t}^{-ji}, \mathbf{k}, \mathbf{x}) \propto n_{jt}^{-ji}\, f_{k_{jt}}^{-x_{ji}}(x_{ji})$ for an existing table $t$, and $\propto \alpha_0\, p(x_{ji} \mid \mathbf{t}^{-ji}, t_{ji} = t^{\mathrm{new}}, \mathbf{k})$ for a new table $t^{\mathrm{new}}$   (10)

Sampling $k_{jt}$:

$p(k_{jt} = k \mid \mathbf{t}, \mathbf{k}^{-jt}, \mathbf{x}) \propto m_{\cdot k}^{-jt}\, f_{k}^{-\mathbf{x}_{jt}}(\mathbf{x}_{jt})$ for an existing dish $k$, and $\propto \gamma\, f_{k^{\mathrm{new}}}^{-\mathbf{x}_{jt}}(\mathbf{x}_{jt})$ for a new dish $k^{\mathrm{new}}$   (11)

where $f_{k}^{-x_{ji}}(x_{ji})$ denotes the conditional likelihood of $x_{ji}$ under component $k$ given all other data currently assigned to that component, $\mathbf{x}_{jt}$ denotes all customers at table $t$ of restaurant $j$, and a superscript $-ji$ (or $-jt$) means that the corresponding customer (or table) is removed from the sets or from the calculation of the counts. In addition, the conditional distribution of the component parameters is omitted, since a conjugate pair $H$ and $F$ is chosen in this study, allowing the parameters to be integrated out analytically. For more details, we refer the reader to [28].
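To convey the table/dish mechanics behind Eqs. (8)-(11) without the likelihood terms, the following sketch (my own illustration, not the paper's sampler) simulates table and dish assignments from the Chinese restaurant franchise prior alone; what matters here is how alpha0 and gamma govern the opening of new tables and new global clusters.

```python
import numpy as np

rng = np.random.default_rng(1)

def crf_prior(n_customers_per_restaurant, alpha0=1.0, gamma=1.0):
    dish_table_counts = []     # m_{.k}: number of tables (over all restaurants) serving dish k
    assignments = []           # per restaurant: dish index of each customer
    for n_j in n_customers_per_restaurant:
        table_sizes, table_dish = [], []   # n_{jt} and k_{jt} for this restaurant
        dishes = []
        for _ in range(n_j):
            # Eq. (8): existing table with prob. prop. to n_{jt}, new table prop. to alpha0
            probs = np.array(table_sizes + [alpha0], dtype=float)
            t = rng.choice(len(probs), p=probs / probs.sum())
            if t == len(table_sizes):      # opened a new table: pick its dish via Eq. (9)
                dprobs = np.array(dish_table_counts + [gamma], dtype=float)
                k = rng.choice(len(dprobs), p=dprobs / dprobs.sum())
                if k == len(dish_table_counts):   # brand-new dish = a new global cluster
                    dish_table_counts.append(0)
                dish_table_counts[k] += 1
                table_sizes.append(0)
                table_dish.append(k)
            table_sizes[t] += 1
            dishes.append(table_dish[t])
        assignments.append(dishes)
    return assignments, len(dish_table_counts)

groups, n_dishes = crf_prior([30, 30, 30], alpha0=1.0, gamma=2.0)
print("number of global clusters (dishes):", n_dishes)
```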

3.2 CD-based Open Set Recognition

Since the properties of the hierarchical Dirichlet process described above fit our problem, we adapt the HDP with slight modification to address the OSR problem, as an initial solution towards collective decision for open set recognition. Concretely, CD-OSR works as follows.

(1) Training Phase: In our CD-OSR framework, we first divide the training set into a fitting set and a validation set (the details are given in Section 4.1.1). Next, we model the data of each known class in the fitting set as a group of the HDP using a Gaussian mixture model (GMM) with an unknown number of components, treat the whole validation set as one batch in the same way, and then co-cluster all the groups under the HDP framework. In addition, unlike the standard HDP, we also attach to each subclass a parameter denoting its proportion within its class, and a subclass is discarded if this proportion is below some constant after co-clustering. Note that the role of this constant should not be confused with the threshold used in existing OSR methods; it is used only to avoid the negative influence of outliers from the known classes. This process is then repeated several times to perform a grid search over the corresponding candidate parameter sets, thus obtaining appropriate initialization parameter values for CD-OSR.

Note that our CD-OSR does not adopt a threshold-based classification scheme and therefore does not need to optimize a threshold. Instead, we only need to obtain good initialization parameters in the training phase. Besides, these parameters do not overly depend on the training data, owing to the properties of the HDP.

Fig. 2: Each known class (here, classes 1-4), as a group in CD-OSR, is modeled by a Dirichlet process, while the testing set (whether or not it includes unknown categories) as a whole is treated in the same way; all the groups are then co-clustered under the CD-OSR framework. A testing instance is labeled as the appropriate known class or as unknown, depending on whether the subclass in which it lies is associated with a known class. The number in each circle indicates the corresponding subclass.

(2) Testing Phase: Fixing the appropriate initialization parameters obtained in training yields our CD-OSR recognition framework. Similar to the training phase, we model the data of each known class in the training set as a group of CD-OSR using a GMM with an unknown number of components, treat the whole testing set as one collective batch in the same way (this operation is purely for convenience; in fact, the batch size does not significantly affect the classification performance, as reported in detail in Section 4.2.3), and then co-cluster all the groups under the CD-OSR framework. After co-clustering, each class is represented by one or more subclasses. Fig. 2 illustrates the testing phase of CD-OSR: a testing instance is labeled as the appropriate known class or as unknown, depending on whether the subclass in which it lies is associated with a known class.

Note that the testing phase is nothing but a co-clustering process, which has the flavor of lazy learning to some extent. Furthermore, the collective/batch operation over the testing set allows CD-OSR to address instances in batches or even individually. Unlike existing methods, which infer unknown classes via an empirically set threshold, our CD-OSR needs no specific threshold and provides explicit modeling for the unknown classes appearing in testing, thus yielding the function of new class discovery, detailed in Section 4.3. This ability intuitively gives CD-OSR zero open space risk under the ideal condition that all classes, known and unknown, are mutually exclusive. Moreover, under the CD-OSR framework, each new/unknown class inherently has only one subclass, since we have no available knowledge of the unknown classes. In addition, the collective operation also makes our framework consider the correlations among the testing instances that existing methods obviously ignore.
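The final labeling step of the testing phase is simple once the co-clustering is done. The sketch below is my own illustration (not the paper's code) of that step: each testing instance inherits the known class whose group shares its subclass, and instances falling in subclasses used by no known class are marked unknown.

```python
def collective_decision(train_subclasses, test_subclasses):
    # train_subclasses: {class_label: set of subclass ids used by that class's group}
    # test_subclasses:  list of subclass ids, one per testing instance
    subclass_to_class = {}
    for cls, subs in train_subclasses.items():
        for s in subs:
            subclass_to_class[s] = cls
    return [subclass_to_class.get(s, "unknown") for s in test_subclasses]

# toy usage: subclasses 0-4 belong to known classes, subclass 5 appears only in testing
train = {"class1": {0, 1}, "class2": {2}, "class3": {3, 4}}
print(collective_decision(train, [0, 2, 5, 4, 5]))
# ['class1', 'class2', 'unknown', 'class3', 'unknown']
```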

Besides, the key to accurate prediction with CD-OSR is the sharing of subclasses between the testing set's group and the groups of the training set. However, the known classes of the training data may also share subclasses among themselves, resulting in an identifiability problem. Therefore, we usually set a larger $\gamma$ to decrease the degree to which the subclasses are shared between those classes. Intuitively, if all classes (known and unknown) are mutually exclusive, the subclasses associated with different classes will be different, making each input instance identifiable. Furthermore, we state the following proposition.

Proposition 1.

Assume that the set of potential classes, known and unknown, are mutually exclusive, let $m_{\cdot k}$, $m_{\cdot\cdot}$, $\gamma$, and $H$ be as described above, and let $K$ denote the number of subclasses associated with the known classes. Then our HDP-OSR framework models the subclasses associated with the corresponding known classes with probability $m_{\cdot k}/(m_{\cdot\cdot}+\gamma)$, $k = 1, \ldots, K$, and the unknown classes with probability $\gamma/(m_{\cdot\cdot}+\gamma)$, whilst it has zero open space risk.

Proof.

This proposition follows directly from the generative process of the HDP, in particular the conditional distribution (9). ∎

4 Experimental Evaluation

To verify the effectiveness of our CD-OSR framework, we carry out several experiments on benchmark datasets commonly used in the OSR scenario, namely LETTER [34], USPS [35], and PENDIGITS [36], which can easily be obtained from the LIBSVM data repository (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html). As an initial solution towards collective decision for open set recognition, we compare our CD-OSR with the mainstream OSR methods, including the 1-vs-Set machine, W-OSVM (the W-SVM model that uses only the one-class SVM CAP model), W-SVM, P_I-SVM, and OSNN, where W-SVM and P_I-SVM are the currently popular algorithms.

Here we mainly focus on comparing the F-measure of the methods mentioned above, since it better emphasizes the distinction between correct positive and negative classifications [8]. The F-measure is defined as the harmonic mean of Precision and Recall,

$F = 2 \times \dfrac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$,

where

$\mathrm{Precision} = \dfrac{TP}{TP + FP}$ and $\mathrm{Recall} = \dfrac{TP}{TP + FN}$,

and TP, FN, and FP respectively denote the true positives, false negatives, and false positives of the known classes. Note that although Precision and Recall are computed only for the available known classes, FN and FP also account for false unknowns and false knowns by counting the corresponding false negatives and false positives [11]. Concretely, we use the micro-F-measure [11] as the evaluation metric; the higher the micro-F-measure, the better the performance of an OSR algorithm.

For comparison, we also report the recognition accuracy of these algorithms. As a common choice for evaluating classifiers, Accuracy is usually defined as

$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$,

where TP, FP, and FN are as described above and TN denotes the true negatives of the known classes. For the recognition accuracy of open set recognition, we consider a correct response to be either a correct classification (correctly classifying the positive or negative classes) or a rejection if the testing instance comes from an unknown class. Therefore, the accuracy for OSR, denoted $A_{OS}$, can be redefined as

$A_{OS} = \dfrac{TP + TN + UC}{TP + TN + FP + FN + UC + UF}$   (12)

where $UC$ and $UF$ respectively denote the correct and false rejections of unknown classes.
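As a concrete reading of the micro-averaged metric described above (my own sketch, not the authors' evaluation code), the function below pools TP, FP, and FN over the known classes, so that predicting "unknown" for a known-class instance counts as a false negative and assigning an unknown-class instance to a known class counts as a false positive.

```python
import numpy as np

def micro_f_measure(y_true, y_pred, known_classes):
    """Micro-averaged F-measure over the known classes; labels may include 'unk'."""
    tp = fp = fn = 0
    for c in known_classes:
        tp += np.sum((y_true == c) & (y_pred == c))
        fp += np.sum((y_true != c) & (y_pred == c))
        fn += np.sum((y_true == c) & (y_pred != c))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# toy usage: classes "0" and "1" are known, "unk" marks unknown-class instances / rejections
y_true = np.array(["0", "0", "1", "unk", "unk"])
y_pred = np.array(["0", "unk", "1", "1", "unk"])
print(round(micro_f_measure(y_true, y_pred, known_classes=["0", "1"]), 3))  # 0.667
```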

In addition, the experimental setup for validating the different methods, including the experimental protocol and the parameter setting, is given in Section 4.1. Section 4.2 presents the main experimental results, while the function of new class discovery is demonstrated in Section 4.3.

4.1 Experimental setup

4.1.1 Experimental protocol

Fig. 3: Data partitioning. The dataset is first divided into training and testing sets; the training set is then further divided into a fitting set and a validation set containing a 'closed-set' simulation and an 'open-set' simulation.

As described in Section 1, selecting suitable thresholds for the corresponding OSR methods is difficult and risky due to the lack of available information about the unknown classes. To mitigate this challenge, similar to OSNN [11], a parameter optimization phase adapted to the OSR scenario is performed to find suitable parameters for the corresponding methods, especially for those using a threshold-based classification scheme, where the thresholds are obtained by trading off the F-measure between the 'Closed-Set' and 'Open-Set' simulations built from the validation set.

As shown in Fig. 3, the dataset is first divided into a training set containing only known classes and a testing set containing both known and unknown classes. Among the classes occurring in the training set, half are chosen to act as "known" classes in the simulation and the other half as "unknown". To this end, the training set is divided into a fitting set containing only the "known" classes and a validation set that contains a 'Closed-Set' simulation owning only the "known" classes and an 'Open-Set' simulation including all the classes appearing in the training set. Note that in the training phase all methods are trained on the fitting set and evaluated on the validation set. We now give the experimental protocol used for all experiments in this paper. For each experiment, we:

  1. randomly select the available classes to serve as known classes for training from the dataset;

  2. randomly choose 60% of the instances of each selected class as the training set;

  3. take the remaining 40% of the instances from step 2, together with the instances of the remaining (excluded) classes, as the testing set;

  4. randomly select half of the training classes as "known" classes for fitting, leaving the remaining classes as "unknown" classes for evaluation;

  5. randomly choose 60% of the instances of each "known" class of the training set as fitting instances;

  6. take the remaining 40% of the instances from step 5 as the 'Closed-Set' simulation, and the same 40% together with the instances of the training-set classes excluded in step 4 as the 'Open-Set' simulation;

  7. train all models on the fitting set and evaluate them on the validation set, then find suitable parameters, especially for the methods using a threshold-based classification scheme;

  8. evaluate all methods on 10 randomized training and testing sets once the parameters of the corresponding models have been determined.

Remark: although the experiments involve several sources of randomness, e.g., the Gibbs sampling in the inference process and the random division of the dataset, the experimental results in this paper are obtained by repeating multiple evaluations over the corresponding random divisions of the dataset.

4.1.2 Parameter setting

In this part, we give the details of the parameter settings for all the methods used in this paper. For the 1-vs-Set machine, we use the default settings in the code provided by the authors. For W-OSVM and W-SVM, which adopt the one-vs-rest approach, we fix the threshold $\delta_{\tau}$ of the one-class SVM CAP model at 0.001 as specified by the authors, while a grid search is performed for the threshold $\delta_{R}$. Similar to W-SVM, P_I-SVM also uses the one-vs-rest approach, and a grid search is performed for its threshold. For the related SVM hyperparameters of W-OSVM, W-SVM, and P_I-SVM, we likewise perform a grid search. The implementation code of the 1-vs-Set machine, W-OSVM, W-SVM, and P_I-SVM can be found at https://github.com/ljain2/libsvm-openset. For OSNN, only the threshold needs to be optimized, and we adopt the same strategy described in [11].

For our CD-OSR, there are two learning phases. In the training phase, our goal is to obtain appropriate initialization parameters. To this end, we model each known class in the fitting set and the validation set using a Gaussian mixture model, where each component is associated with a Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$, i.e., $\mathcal{N}(\mu, \Sigma)$. For the base distribution $H$, we use a conjugate prior, i.e., a Gaussian-Wishart distribution,

$H = \mathcal{N}\left(\mu \mid \mu_0, (\kappa_0 \Lambda)^{-1}\right)\, \mathcal{W}\left(\Lambda \mid W_0, \nu_0\right)$   (13)

where $\Lambda = \Sigma^{-1}$ is the precision matrix, $\mu_0$ is the prior mean, $\kappa_0$ is a scaling constant controlling the deviation of the mean vectors of the mixture components from the prior mean, $W_0$ is the prior covariance (scale) matrix, and $\nu_0$ is the number of degrees of freedom of the distribution. In order to confirm the validity of our learning framework, we do not take overly complicated means to select the initialization parameters of CD-OSR. Instead, we simply let $\mu_0$ equal the mean of all the instances involved, set $\kappa_0$ to 1, and select $\nu_0$ by a grid search over a candidate set. Furthermore, $W_0$ is set as

$W_0 = m \cdot \dfrac{1}{N - C} \sum_{j=1}^{C} \sum_{i=1}^{n_j} (x_{ji} - \bar{x}_j)(x_{ji} - \bar{x}_j)^{\top}$   (14)

where $m$ is a scaling constant, also obtained by a grid search over a candidate set, $C$ represents the number of known classes (there are a total of $C+1$ groups under the CD-OSR framework, where the first $C$ groups represent the known classes and the $(C+1)$-th group represents the validation set), $N$ is the total number of instances involved, $\bar{x}_j$ is the mean of class $j$, and the second term on the right-hand side of (14) denotes the common pooled covariance matrix of the known classes [37]. Moreover, the concentration parameters $\gamma$ and $\alpha_0$ are given vague gamma priors [38], chosen to ensure enough subclasses to represent each known class while reducing the sharing of subclasses between the different known classes. In addition, the constant used to discard negligible subclasses is set to a fixed small value, and the maximum number of iterations of CD-OSR is set to 30 in this paper.
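For reference, here is a small sketch (my own, not the authors' code) of the common pooled covariance matrix of the known classes mentioned above, computed from per-class data matrices; the data below are synthetic.

```python
import numpy as np

def pooled_covariance(class_data):
    """class_data: list of (n_j, d) arrays, one per known class.
    Returns the within-class scatter divided by N - C."""
    d = class_data[0].shape[1]
    scatter, n_total = np.zeros((d, d)), 0
    for Xj in class_data:
        centered = Xj - Xj.mean(axis=0)
        scatter += centered.T @ centered
        n_total += Xj.shape[0]
    return scatter / (n_total - len(class_data))

rng = np.random.default_rng(0)
classes = [rng.normal(m, 1.0, size=(40, 2)) for m in (0.0, 3.0, -2.0)]
print(pooled_covariance(classes).round(2))
```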

After the training phase, we obtain the appropriate initialization parameter values for CD-OSR. Fixing these parameters, we only need to replace the fitting set and the validation set with the training set and the testing set, respectively, and then repeat multiple rounds of the co-clustering process to obtain the final experimental evaluation.

4.2 Performance Evaluation

In this section, we mainly compare the average F-measure of our CD-OSR with that of the 1-vs-Set machine, W-OSVM, W-SVM, P_I-SVM, and OSNN in Section 4.2.1, while comparisons of recognition accuracy, as a supplement, are reported in Section 4.2.2. Note that all algorithms are evaluated according to the experimental protocol described in Section 4.1.1. Besides, we also examine the impact of the testing batch size on performance in Section 4.2.3.

4.2.1 Comparisons on F-measure

In this part, we compare the F-measure performance of the corresponding methods; the specific experimental results are reported as follows.

LETTER Dataset: The LETTER dataset has a total of 20000 instances from 26 classes, each with 16 features. To recast LETTER as an open set problem, we randomly select 10 of the available classes as known classes for training and vary the openness by adding subsets of the remaining 16 classes. Fig. 4 shows the average F-measure results on this dataset. When the openness is less than about 12%, the performance of our CD-OSR is comparable to that of W-SVM and P_I-SVM; however, it is significantly higher than that of the other five methods when the openness is larger than about 12%. Furthermore, the F-measure of CD-OSR also remains relatively stable as the openness varies.

Fig. 4: F-measure for multi-class open set recognition on the LETTER dataset. Error bars reflect the standard deviation.

Fig. 5: F-measure for multi-class open set recognition on USPS dataset, and the performance of W-OSVM is not shown due to its poor F-measure. Error bars reflect the standard deviation.
Fig. 6: F-measure for multi-class open set recognition on PENDIGITS dataset. Error bars reflect the standard deviation.

USPS Dataset: The USPS dataset has a total of 7291 instances from 10 classes, each with 256 features. In this paper, principal component analysis (PCA) is used to project the instance space onto a 39-dimensional subspace, retaining 95% of the information in the instances. Similar to the LETTER dataset, we randomly select 5 of the available classes as known classes for training and vary the openness by adding subsets of the remaining 5 classes. Fig. 5 shows the average F-measure results on this dataset. As can be seen from Fig. 5, our CD-OSR obtains much higher performance than the 1-vs-Set machine, W-SVM, and P_I-SVM as the openness increases. Although OSNN outperforms CD-OSR when the openness is larger than about 12%, its performance is much lower than that of our method when the openness is less than 12%, especially when the openness equals zero. Furthermore, compared with the other methods, the F-measure of OSNN varies most stably with openness, followed by our CD-OSR, W-SVM, and P_I-SVM. Note that the performance of W-OSVM is not shown in Fig. 5 due to its poor F-measure.

PENDIGITS Dataset: The PENDIGITS dataset has a total of 10992 instances from 10 classes, each with 16 features. Similar to the USPS dataset, we randomly select 5 of the available classes as known classes for training and vary the openness by adding subsets of the remaining 5 classes. Fig. 6 shows the average F-measure results on this dataset. As can be seen from Fig. 6, our CD-OSR obtains much higher performance than the other methods as the openness increases. At the same time, the performance of our CD-OSR remains almost unchanged as the openness varies.

Remark: From the experimental results reported above, we find that the classification performance of CD-OSR is significantly better than that of the other existing OSR algorithms. We emphasize, however, that CD-OSR currently does not make full use of the information in the known class labels. More precisely, it only uses this information to divide the training data into different groups, while the discriminative information carried by these labels is not fully utilized. Nevertheless, CD-OSR still achieves classification performance at least comparable to that of existing OSR methods, such as W-SVM and P_I-SVM, which make full use of the label information.

4.2.2 Comparisons on Recognition Accuracy

In this part, we briefly report the recognition accuracy of the algorithms used in this paper. As can be seen from Fig. 7 and Fig. 9, our CD-OSR obtains much higher recognition accuracy than the other five methods as the openness increases, while the recognition accuracy of CD-OSR also remains relatively stable as the openness varies, especially on the PENDIGITS dataset.

Fig. 7: Recognition accuracy for multi-class open set recognition on LETTER dataset. Error bars reflect the standard deviation.
Fig. 8: Recognition accuracy for multi-class open set recognition on USPS dataset, and the performance of W-OSVM is not shown due to its poor recognition accuracy. Error bars reflect the standard deviation.
Fig. 9: Recognition accuracy for multi-class open set recognition on PENDIGITS dataset. Error bars reflect the standard deviation.

For the USPS dataset, OSNN achieves the best recognition accuracy when the openness is larger than about 6%, followed by our CD-OSR, as shown in Fig. 8. However, it performs worse than our CD-OSR when the openness is less than 6%, especially when the openness equals zero. In addition, the performance of OSNN on the LETTER dataset is relatively poor compared with our CD-OSR, W-SVM, and P_I-SVM, which suggests that its performance may depend heavily on the dataset.

Remark: The $A_{OS}$ results reported here serve only as a supplement to further illustrate the effectiveness of our learning framework. In fact, $A_{OS}$ is not a commonly used evaluation metric for the OSR problem, since it sums the performance of correctly classifying the known classes and of rejecting the unknown classes, which makes it difficult to evaluate OSR models objectively. For example, when the rejection performance plays the leading role, i.e., the testing set contains a large number of instances of unknown classes and only a few instances of known classes, $A_{OS}$ can still reach a high value even though the recognizer's classification performance is actually low, and vice versa. In addition, since the OSR problem represents a new scenario, evaluation metrics specifically customized for OSR are also worth exploring.

4.2.3 The Impact of the Size of the Batch on Performance

Since our CD-OSR adopts a collective/batch decision strategy, it can address the data in batches. A natural question is whether the batch size of the testing data affects the performance of CD-OSR. To explore this question, we conduct the following experiments.

For each dataset in our experiments, we choose a medium openness: 18.35% for the LETTER dataset (10 unknown classes), 12.29% for USPS (3 unknown classes), and 12.29% for PENDIGITS (3 unknown classes). We then vary the batch size by changing the number of testing instances: specifically, we randomly select 20%, 40%, 60%, 80%, and 100% of the whole testing set of each dataset and repeat the co-clustering process 10 times to obtain the final evaluation. Figs. 10-12 show the F-measure on these datasets, where panel (a) of each figure is the boxplot over the different numbers of testing instances and panel (b) the corresponding error-bar plot, with error bars reflecting the standard deviation. From these results, we find that the batch size of the testing instances has almost no significant impact on the performance of CD-OSR. Therefore, the batch size can be set flexibly according to the needs of the task.

Fig. 10: F-measure on the LETTER dataset when openness = 18.35%. (a) shows the boxplot over the different numbers of testing instances; (b) shows the corresponding error-bar plot, where error bars reflect the standard deviation.
Fig. 11: F-measure on the USPS dataset when openness = 12.29%. (a) shows the boxplot over the different numbers of testing instances; (b) shows the corresponding error-bar plot, where error bars reflect the standard deviation.
Fig. 12: F-measure on the PENDIGITS dataset when openness = 12.29%. (a) shows the boxplot over the different numbers of testing instances; (b) shows the corresponding error-bar plot, where error bars reflect the standard deviation.

4.3 New class discovery

In this section, we demonstrate the function of new class discovery under our CD-OSR framework. Unlike existing methods, which infer unknown classes by accurately modeling the known classes, CD-OSR provides explicit modeling for the unknown classes appearing in testing and can therefore discover new classes. As mentioned above, each new class inherently has only one subclass, because the true labels of the unknown classes are unknown, making it impossible to further aggregate the newly generated subclasses. Therefore, these newly discovered classes exist only at the subclass level. Fortunately, we can still roughly estimate the number of real unknown classes based on the number of subclasses of the known classes; this estimate can then serve as a prior for other clustering algorithms, such as K-means, to further recover the real classes among the unknown subclasses. Concretely, we have

$\widehat{C}_{\mathrm{unknown}} = \dfrac{N_{\mathrm{new}}}{N_{\mathrm{known}} / C}$   (15)

where $N_{\mathrm{new}}$ denotes the number of subclasses corresponding to the unknown classes, $N_{\mathrm{known}}$ denotes the number of subclasses of the known classes, and $C$ here represents the number of known classes. Note that this is only a relatively rough estimate. A more realistic operation is to construct a candidate set around this estimated value, from which other clustering algorithms can determine a more accurate estimate.
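A tiny worked example of this estimate, under my reconstruction of Eq. (15) above (the helper name and symbols are mine, not the paper's), using the subclass counts later reported in Table 1:

```python
# Average number of subclasses per known class, then scale the new-subclass count by it.
def estimate_unknown_classes(n_new_subclasses, n_known_subclasses, n_known_classes):
    return n_new_subclasses / (n_known_subclasses / n_known_classes)

# USPS counts from Table 1: 19 known subclasses over 5 known classes,
# and 14 newly discovered subclasses in the testing batch.
print(round(estimate_unknown_classes(14, 19, 5), 2))  # roughly 3.7 candidate unknown classes
```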

Furthermore, Tables 1 and 2 report the new class discovery results on the USPS and PENDIGITS datasets under the CD-OSR framework, respectively. Each table has three columns: the first denotes the corresponding group (the known classes and the testing set), the second the number of subclasses of that group, and the third the proportions of the corresponding subclasses within their group.

  Group          Subclasses   Proportion of the corresponding subclass (%)
  Class1 ('2')   1            98.67
  Class2 ('9')   4            21.54, 8.31, 67.08, 2.15
  Class3 ('1')   4            7.54, 37.01, 3.21, 51.12
  Class4 ('6')   3            26.05, 50.30, 20.96
  Class5 ('3')   7            22.55, 9.79, 18.00, 7.74, 19.59, 2.73, 18.00
  Testing-Set    33           known subclasses (19): 55.23; new subclasses (14): 44.77

There are five classes in the training set, while the testing set contains all the classes (5 known classes and 5 unknown classes). The table gives the estimated mixture proportions and the number of subclasses in each group under the CD-OSR framework.

TABLE I: NEW CLASS DISCOVERY ON USPS DATASET

For the USPS dataset shown in Table 1, we randomly select 5 classes (the real digit classes are given in brackets) as the known classes for training, while the testing set contains all the classes (5 known and 5 unknown). According to (15), we obtain the rough estimate

$\widehat{C}_{\mathrm{unknown}} = \dfrac{14}{19/5} \approx 3.7$   (16)

which approaches the true number of unknown classes; a more accurate estimate may be obtained when the numbers of subclasses of the individual classes are relatively uniform.

Similar to the USPS dataset, we also randomly choose 5 classes as the known classes for training, while the testing set contains all the classes (5 known and 5 unknown); Table 2 reports the specific results, from which a similar conclusion to the one above can be drawn. Moreover, we can inspect the internal distribution of each known class at the subclass level, which can be seen as a by-product of our approach. For example, the distribution of the instances of class 1 ('2') is very concentrated, with almost all instances clustered in a single subclass, whereas the instances of class 5 ('3') are relatively scattered across 7 subclasses, as shown in Table 1.

  Group          Subclasses   Proportion of the corresponding subclass (%)
  Class1 ('4')   7            5.25, 5.39, 52.77, 12.68, 19.97, 1.02, 2.48
  Class2 ('2')   5            5.25, 69.83, 3.94, 5.98, 14.29
  Class3 ('1')   11           25.22, 2.92, 1.60, 5.54, 9.77, 4.08, 13.85, 33.38, 1.17, 1.17, 1.02
  Class4 ('9')   15           2.53, 7.74, 6.00, 2.05, 13.11, 18.96, 7.90, 2.69, 8.06, 7.27, 5.21, 8.06, 1.74, 4.90, 1.42
  Class5 ('6')   5            43.85, 35.96, 7.57, 11.04, 1.42
  Testing-Set    75           known subclasses (43): 50.42; new subclasses (32): 49.58

There are five classes in the training set, while the testing set contains all the classes (5 known classes and 5 unknown classes). The table gives the estimated mixture proportions and the number of subclasses in each group under the CD-OSR framework.

TABLE II: NEW CLASS DISCOVERY ON PENDIGITS DATASET

5 Conclusion

The main contribution of this paper is a collective/batch decision strategy for open set recognition, aimed at extending existing open set recognition towards new class discovery while considering the correlations among the testing instances. To achieve this goal, we adapt the HDP with slight modification to address the OSR problem, leading to an initial solution towards collective decision in OSR. We highlight that our CD-OSR does not overly depend on the training data and adapts as the data changes. More precisely, CD-OSR provides explicit modeling for the unknown classes appearing in testing, naturally yielding the function of new class discovery, even if the discovered classes exist only at the subclass level. Furthermore, unlike existing methods, which address the OSR problem from the discriminative model perspective, CD-OSR addresses it from the generative model perspective owing to its use of the HDP. Finally, the experimental results on a set of benchmark datasets indicate the validity of our learning framework.

Besides, it should be noted that the modeling of unknown classes is performed only in the testing phase of CD-OSR, while no knowledge of unknown classes is used during the training phase, which gives the method the flavor of lazy learning to some extent. Thus the co-clustering (testing) process must be repeated whenever another batch of testing data arrives, resulting in higher computational overhead; overcoming this limitation is a promising direction for future work. Furthermore, since CD-OSR currently does not make full use of the discriminative information in the known class labels, embedding this information more effectively is also worth further exploration. In addition, replacing the Gibbs sampler with scalable deterministic inference techniques is another promising direction for future work. In conclusion, CD-OSR currently serves as a conceptual proof of open set recognition with collective decision, and more effective collective decision methods for OSR are worth exploring in future work.

Acknowledgments

The authors would like to thank the support from the Key Program of NSFC under Grant No. 61732006, NSFC under Grant No. 61672281, and the Postgraduate Research & Practice Innovation Program of Jiangsu Province under Grant No. KYCX18_0306.

References

  • [1] W. J. Scheirer, A. D. R. Rocha, A. Sapkota, and T. E. Boult, “Toward open set recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1757–1772, 2013.
  • [2] P. J. Phillips, P. Grother, and R. Micheals, Evaluation Methods in Face Recognition. Springer New York, 2005.
  • [3] F. Li and H. Wechsler, “Open set face recognition using transduction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1686–1697, 2005.
  • [4] Q. Wu, C. Jia, and W. Chen, “A novel classification-rejection sphere svms for multi-class classification problems,” in Natural Computation, 2007. ICNC 2007. Third International Conference on, vol. 1, pp. 34–38, IEEE, 2007.
  • [5] Y.-C. F. Wang and D. Casasent, “A support vector hierarchical method for multi-class classification and rejection,” in Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pp. 3281–3288, IEEE, 2009.
  • [6] B. Heflin, W. Scheirer, and T. E. Boult, “Detecting and classifying scars, marks, and tattoos found in the wild,” in IEEE Fifth International Conference on Biometrics: Theory, Applications and Systems, pp. 31–38, 2012.
  • [7] D. A. Pritsos and E. Stamatatos, “Open-set classification for automated genre identification,” in European Conference on Advances in Information Retrieval, pp. 207–217, 2013.
  • [8] W. Scheirer, L. Jain, and T. Boult, “Probability models for open set recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 11, pp. 2317–2324, 2014.
  • [9] L. P. Jain, W. J. Scheirer, and T. E. Boult, “Multi-class open set recognition using probability of inclusion,” in European Conference on Computer Vision, pp. 393–409, Springer, 2014.
  • [10] M. D. Scherreik and B. D. Rigling, “Open set recognition for automatic target classification with rejection,” IEEE Transactions on Aerospace and Electronic Systems, vol. 52, no. 2, pp. 632–642, 2016.
  • [11] P. R. M. Júnior, R. M. de Souza, R. d. O. Werneck, B. V. Stein, D. V. Pazinato, W. R. de Almeida, O. A. Penatti, R. d. S. Torres, and A. Rocha, “Nearest neighbors distance ratio open-set classifier,” Machine Learning, vol. 106, no. 3, pp. 359–386, 2017.
  • [12] H. Zhang and V. M. Patel, “Sparse representation-based open set recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 8, pp. 1690–1696, 2017.
  • [13] E. M. Rudd, L. P. Jain, W. J. Scheirer, and T. E. Boult, “The extreme value machine,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 3, pp. 762–768, 2018.
  • [14] E. Vignotto and S. Engelke, “Extreme value theory for open set classification-gpd and gev classifiers,” arXiv preprint arXiv:1808.09902, 2018.
  • [15] A. Bendale and T. E. Boult, “Towards open set deep networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1563–1572, 2016.
  • [16] A. Rozsa, M. Günther, and T. E. Boult, “Adversarial robustness: Softmax versus openmax,” arXiv preprint arXiv:1708.01697, 2017.
  • [17] M. Hassen and P. K. Chan, “Learning a neural-network-based representation for open set recognition,” arXiv preprint arXiv:1802.04365, 2018.
  • [18] L. Shu, H. Xu, and B. Liu, “Doc: Deep open classification of text documents,” arXiv preprint arXiv:1709.08716, 2017.
  • [19] D. O. Cardoso, F. Franca, and J. Gama, “A bounded neural network for open set recognition,” in International Joint Conference on Neural Networks, pp. 1–7, 2015.
  • [20] D. O. Cardoso, J. Gama, and F. M. G. França, “Weightless neural networks for open set recognition,” Machine Learning, vol. 106, no. 9-10, pp. 1547–1567, 2017.
  • [21] L. Shu, H. Xu, and B. Liu, “Unseen class discovery in open-world classification,” arXiv preprint arXiv:1801.05609, 2018.
  • [22] Z. Ge, S. Demyanov, Z. Chen, and R. Garnavi, “Generative openmax for multi-class open set classification,” arXiv preprint arXiv:1707.07418, 2017.
  • [23] H. Xu, B. Liu, and P. S. Yu, “Learning to accept new classes without training,” arXiv preprint arXiv:1809.06004, 2018.
  • [24] L. Wang and S. Chen, “Joint representation classification for collective face recognition,” Pattern Recognition, vol. 63, no. 5, pp. 182–192, 2017.
  • [25] A. Bendale and T. Boult, “Towards open world recognition,” in Computer Vision and Pattern Recognition, pp. 1893–1902, 2015.
  • [26] S. J. Gershman and D. M. Blei, “A tutorial on bayesian nonparametric models,” Journal of Mathematical Psychology, vol. 56, no. 1, pp. 1–12, 2012.
  • [27] R. Thibaux and M. I. Jordan, “Hierarchical beta processes and the indian buffet process,” in Artificial Intelligence and Statistics, pp. 564–571, 2007.
  • [28] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical dirichlet processes,” Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566–1581, 2006.
  • [29] Y. W. Teh, “Dirichlet process,” in Encyclopedia of machine learning, pp. 280–287, Springer, 2011.
  • [30] S. J. Gershman and D. M. Blei, “A tutorial on bayesian nonparametric models,” Journal of Mathematical Psychology, vol. 56, no. 1, pp. 1–12, 2012.
  • [31] K. R. Canini, M. M. Shashkov, and T. L. Griffiths, “Modeling transfer learning in human categorization with the hierarchical dirichlet process,” in International Conference on Machine Learning, pp. 151–158, 2010.
  • [32] F. Akova, M. Dundar, Y. Qi, and B. Rajwa, “Self-adjusting models for semi-supervised learning in partially observed settings,” in Data Mining (ICDM), 2012 IEEE 12th International Conference on, pp. 21–30, IEEE, 2012.
  • [33] H. Ishwaran and L. F. James, “Gibbs sampling methods for stick-breaking priors,” Journal of the American Statistical Association, vol. 96, no. 453, pp. 161–173, 2001.
  • [34] P. W. Frey and D. J. Slate, “Letter recognition using holland-style adaptive classifiers,” Machine learning, vol. 6, no. 2, pp. 161–182, 1991.
  • [35] J. J. Hull, “A database for handwritten text recognition research,” IEEE Transactions on pattern analysis and machine intelligence, vol. 16, no. 5, pp. 550–554, 1994.
  • [36] M. Bilenko, S. Basu, and R. J. Mooney, “Integrating constraints and metric learning in semi-supervised clustering,” in Proceedings of the twenty-first international conference on Machine learning, p. 11, ACM, 2004.
  • [37] T. Greene and W. S. Rayens, “Partially pooled covariance matrix estimation in discriminant analysis,” Communications in Statistics-Theory and Methods, vol. 18, no. 10, pp. 3679–3702, 1989.
  • [38] M. Escobar and M. West, “Bayesian density estimation and inference using mixtures,” Journal of the American Statistical Association, vol. 90, no. 430, pp. 577–588, 1995.