Semi-supervised Wrapper Feature Selection with Imperfect Labels

11/12/2019 ∙ Vasilii Feofanov et al. ∙ Université Grenoble Alpes

In this paper, we propose a new wrapper approach for semi-supervised feature selection. A common strategy in semi-supervised learning is to augment the training set with pseudo-labeled unlabeled examples. However, the pseudo-labeling procedure is prone to error and has a high risk of disrupting the learning algorithm with additional noisy labeled training data. To overcome this, we propose to explicitly model the mislabeling error during the learning phase, with the overall aim of selecting the most relevant feature characteristics. We derive a C-bound for Bayes classifiers trained over partially labeled training sets by taking into account the mislabeling errors. The risk bound is then considered as an objective function that is minimized over the space of possible feature subsets using a genetic algorithm. In order to produce a solution that is both sparse and accurate, we propose a modification of the genetic algorithm with a crossover based on feature weights and a recursive elimination of irrelevant features. Empirical results on different data sets show the effectiveness of our framework compared to several state-of-the-art semi-supervised feature selection approaches.

1 Introduction

We consider learning problems where the labeled training set comes along with a huge number of unlabeled training examples and where the dimension of the characteristic space is large. In this case, traditional learning approaches usually suffer from excessive computation time, poor learning performance and a lack of interpretability. In practice, the original set of features may contain characteristics that are irrelevant or redundant with respect to the output, and their removal may yield better prediction performance and provide important keys for the interpretability of the results (Guyon and Elisseeff, 2003; Chandrashekar and Sahin, 2014). Depending on the availability of class labels, feature selection techniques can be supervised, unsupervised or semi-supervised. Being agnostic to the target variable, unsupervised approaches generally ignore the discriminative power of features, so their use may lead to poor performance. In contrast, supervised feature selection algorithms benefit from abundant labeled examples, so they effectively eliminate irrelevant and redundant variables. In semi-supervised feature selection (Sheikhpour et al., 2017), the aim is to exploit both the available labeled and unlabeled training observations in order to provide a solution that preserves important structures of the data and leads to high performance. Considerable progress has been made in this direction over the last few years. Filter methods (Yang et al., 2010; Zhao et al., 2008) score features following a criterion and perform selection before a learning model is constructed. Embedded techniques (Chen et al., 2017) perform model-based feature selection in order to infer the importance of features during the training process.

Finally, wrapper approaches (Kohavi and John, 1997; Ren et al., 2008) use a learner to effectively find a subset of features that are discriminatively powerful together. The underlying principle of these approaches is to search for a feature subset by optimizing the prediction performance of a given learning model. In semi-supervised learning, one way to improve prediction performance and increase data diversity is to augment the labeled training set with pseudo-labeled unlabeled examples using either self-learning or co-training approaches (Amini et al., 2009; Sheikhpour et al., 2017; Feofanov et al., 2019). In this case, pseudo-labels are iteratively assigned to unlabeled examples whose confidence score is above a certain threshold. However, fixing this threshold is to some extent the bottleneck of these approaches. In addition, the pseudo-labels may be prone to error, so the wrapper will be learnt on biased and mislabeled training data.

Another related question is how to optimize the wrapper. Due to the exponential number of possible subsets, an exhaustive search is computationally infeasible. In addition, sequential search algorithms like the one proposed by Ren et al. (2008) also become infeasible in the case of very large dimension. To overcome this problem, a common practice is to use heuristic search algorithms, for instance, a genetic algorithm (Goldberg and Holland, 1988). However, for applications of large dimension, this approach may have a large variance in its output, and the set of selected features might still be large.

In this paper, we propose a new framework for semi-supervised wrapper feature selection with an explicit modeling of mislabeling probabilities. To perform consistent self-learning, we use the recent work of Feofanov et al. (2019) to find the threshold dynamically based on the transductive guarantees of an ensemble Bayes classifier. To eliminate the bias of the pseudo-labeling, we derive a new upper bound on the risk of the Bayes classifier that is computed on imperfect labels. This bound is based on the C-bound (Lacasse et al., 2007) over the Bayes risk, which is derived by considering the mean and the variance of the prediction margin. To extend this bound, we consider the mislabeling model proposed by Chittineni (1980). Finally, we propose a modification of a genetic algorithm such that it takes into account feature weights during the optimization phase and provides a sparse solution to the feature selection task.

In the following section, we introduce the problem statement. Section 3 provides background information related to this work. Section 4 shows how to derive the C-bound in the probabilistic framework. In Section 5, we show how to take mislabeling into account in the bound. In Section 6, we describe our algorithm to select features on semi-supervised data. Section 7 presents the experimental results. Finally, the conclusion is given in Section 8.

2 Framework

We consider multiclass classification problems with an input space 𝒳 ⊆ R^d and an output space 𝒴 = {1, …, K}, K ≥ 2. We denote by X (resp. Y) an input (resp. output) random variable, and by X_F the projection of the input on a subset of features F ⊆ {1, …, d}. We assume available a set of labeled training examples Z_L = {(x_i, y_i)}_{i=1..l}, identically and independently distributed (i.i.d.) with respect to a fixed yet unknown probability distribution P(X, Y) over 𝒳 × 𝒴, and a set of unlabeled training examples X_U = {x_i}_{i=l+1..l+u} supposed to be drawn i.i.d. from the marginal distribution P(X) over the domain 𝒳.

Following Koller and Sahami (1996), we call F a Markov blanket for the output Y if Y is conditionally independent of the remaining features given X_F, i.e., P(Y | X_F, X_{{1,…,d}∖F}) = P(Y | X_F). Thus, we formulate the goal of semi-supervised feature selection as finding a minimal Markov blanket among all possible feature subsets, based on the available labeled and unlabeled data.

A solution that satisfies the Markov blanket condition does not include irrelevant variables, since they are independent of Y. In addition, when a Markov blanket is minimal, we exclude the maximum possible number of features that are conditionally independent of Y given X_F, i.e., redundant variables.

In this work, a fixed class of classifiers H, called the hypothesis space, is considered and defined without reference to the training set. Further, we focus on the Bayes classifier (also called the majority vote classifier) defined for all x ∈ 𝒳 as

B_Q(x) := argmax_{c ∈ 𝒴} E_{h∼Q} 1(h(x) = c).    (1)

We formulate the task of the learner as choosing a posterior distribution Q over H, after observing the training set, such that the true risk of the Bayes classifier is minimized:

R(B_Q) = E_{(x,y)∼P(X,Y)} 1(B_Q(x) ≠ y).

Given an observation (x, y), its margin is defined in the following way:

M(x, y) := v_Q(x, y) − max_{c ∈ 𝒴∖{y}} v_Q(x, c),

where v_Q(x, c) := E_{h∼Q} 1(h(x) = c) is the vote given by the Bayes classifier to the class membership of an example x being c. The margin measures the confidence of the prediction: if it is strictly positive for an example (x, y), then the example is correctly classified.
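
As a purely illustrative example (not part of the original formulation), the following minimal Python sketch computes the class votes v_Q(x, c) and the margin M(x, y) for a single observation, assuming a uniform posterior Q over three hypothetical hypotheses whose predictions are given explicitly.

    import numpy as np

    # Hypothetical predictions of three hypotheses h_1, h_2, h_3 for one observation x
    # in a 3-class problem, with a uniform posterior Q over the hypotheses.
    hypothesis_predictions = np.array([0, 0, 2])   # h_1(x) = 0, h_2(x) = 0, h_3(x) = 2
    n_classes, y_true = 3, 0

    # Class votes v_Q(x, c): fraction of hypotheses voting for each class c.
    votes = np.bincount(hypothesis_predictions, minlength=n_classes) / len(hypothesis_predictions)

    # Margin M(x, y): vote of the true class minus the largest competing vote.
    margin = votes[y_true] - np.delete(votes, y_true).max()

    print(votes)   # [0.667 0.    0.333]
    print(margin)  # 0.333 > 0, so x is correctly classified by the majority vote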

3 Related Work

In this section, we introduce the C-bound proposed for the supervised case and briefly describe the self-learning algorithm.

3.1 C-Bound

Lacasse et al. (2007) proposed to upper bound the risk of the Bayes classifier by taking into account the mean and the variance of the prediction margin. The multi-class version of this bound is given in the following theorem.

[Theorem 3 in Laviolette et al. (2014)] Let μ_1 := E_{(x,y)∼P(X,Y)} M(x, y) and μ_2 := E_{(x,y)∼P(X,Y)} [M(x, y)]^2 be respectively the first and the second statistical moments of the margin. Then, for all choice of Q on a hypothesis space H, and for every distribution P(X, Y) over 𝒳 × 𝒴, such that μ_1 > 0, we have:

R(B_Q) ≤ 1 − μ_1^2 / μ_2.    (2)
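
As an illustration of Equation (2), a minimal sketch (ours, not from the original paper) that evaluates the empirical C-bound from a sample of margins could look as follows; the margins are assumed to be computed as in Section 2.

    import numpy as np

    def c_bound(margins):
        """Empirical C-bound of Eq. (2): 1 - (E[M])^2 / E[M^2], valid when E[M] > 0."""
        m1 = np.mean(margins)             # first moment of the margin
        m2 = np.mean(np.square(margins))  # second moment of the margin
        if m1 <= 0:
            raise ValueError("The C-bound requires a positive mean margin.")
        return 1.0 - m1 ** 2 / m2

    # Toy margins of a majority vote on five examples (hypothetical values).
    print(c_bound(np.array([0.4, 0.1, 0.3, -0.2, 0.5])))  # 0.56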

3.2 Self-learning Based on the Transductive Bound

The idea of a self-learning algorithm (Vittaut et al., 2002; Tür et al., 2005) is to iteratively assign pseudo-labels to those training unlabeled examples whose associated class vote is above a threshold. By extending the work of Amini et al. (2009) to the multi-class classification case, Feofanov et al. (2019) proposed to obtain the threshold θ dynamically by minimizing the conditional Bayes error, defined by:

ε(θ) := R_{U∧θ}(B_Q) / P_{x∈X_U}(m_Q(x) ≥ θ),    (3)

where m_Q(x) denotes the prediction margin of x and R_{U∧θ}(B_Q) is the risk of the Bayes classifier over the unlabeled examples having a margin greater than or equal to θ. The conditional Bayes error thus represents the proportion of this risk among the unlabeled examples whose margin is above θ. An unlabeled example is pseudo-labeled if its prediction vote is higher than the corresponding threshold.

Feofanov et al. (2019) showed empirically that this approach significantly outperforms the classic self-learning algorithm, in which the threshold is fixed manually. In practice, a fixed threshold may inject a considerable amount of labeling error, so that in most cases the classic self-learning algorithm performs worse than the corresponding supervised algorithm.
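
A schematic version of such a self-learning loop (our own sketch; the threshold-selection routine of Feofanov et al. (2019), which minimizes Equation (3), is abstracted behind the hypothetical callable `find_threshold`) could look as follows.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def self_learning(X_lab, y_lab, X_unlab, find_threshold, max_iter=10):
        """Generic self-learning: pseudo-label the unlabeled examples whose prediction
        vote exceeds a dynamically chosen threshold, then retrain the majority vote."""
        X_train, y_train, remaining = X_lab, y_lab, X_unlab
        for _ in range(max_iter):
            if len(remaining) == 0:
                break
            clf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
            probas = clf.predict_proba(remaining)             # class votes of the forest
            votes = probas.max(axis=1)
            labels = clf.classes_[probas.argmax(axis=1)]
            theta = find_threshold(votes, probas)             # e.g. minimizer of Eq. (3)
            selected = votes >= theta
            if not selected.any():
                break
            X_train = np.vstack([X_train, remaining[selected]])
            y_train = np.concatenate([y_train, labels[selected]])
            remaining = remaining[~selected]
        # Final majority vote trained on the augmented (labeled + pseudo-labeled) set.
        return RandomForestClassifier(n_estimators=200).fit(X_train, y_train)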

4 Probabilistic C-Bound

In our approach, we define the risk by explicitly taking into account the posterior probability P(Y = c | X = x) as follows:

R(B_Q) = E_{x∼P(X)} Σ_{c∈𝒴} P(Y = c | X = x) 1(B_Q(x) ≠ c) =: E_{x∼P(X)} e(x),

where e(x) denotes the Bayes risk in classifying an observation x.

Let M be a random variable such that, given X = x, M is a discrete random variable that is equal to the margin M(x, c) with probability P(Y = c | X = x), c ∈ {1, …, K}. Then, M is defined by the following probability law:

P(M = m) = E_{x∼P(X)} Σ_{c∈𝒴} P(Y = c | X = x) 1(M(x, c) = m).    (4)

The random variable M as defined above is connected to the Bayes risk in the following way: R(B_Q) = P(M ≤ 0).

Proof.

One can notice that 1(B_Q(x) ≠ c) = 1(M(x, c) ≤ 0), since the margin of a class is strictly positive if and only if this class is output by the majority vote. Applying the total probability law, we obtain:

R(B_Q) = E_{x∼P(X)} Σ_{c∈𝒴} P(Y = c | X = x) 1(M(x, c) ≤ 0) = P(M ≤ 0). ∎

Let M_1 := E[M] and M_2 := E[M^2] be respectively the first and the second statistical moments of the random variable M defined by the law (4). Then, for all choice of Q on a hypothesis space H, and for all distributions P(X) over 𝒳 and P(Y | X) over 𝒴, such that M_1 > 0, we have:

R(B_Q) ≤ 1 − M_1^2 / M_2.    (5)

Proof.

To prove the theorem, we apply the Cantelli-Chebyshev inequality formulated in the following lemma: let Z be a random variable with mean μ and variance σ^2; then, for every a > 0, we have:

P(Z ≤ μ − a) ≤ σ^2 / (σ^2 + a^2).

By taking Z = M, μ = M_1, σ^2 = M_2 − M_1^2 and a = M_1, we apply the lemma and deduce:

P(M ≤ 0) ≤ (M_2 − M_1^2) / ((M_2 − M_1^2) + M_1^2) = 1 − M_1^2 / M_2.

Since R(B_Q) = P(M ≤ 0) by the proposition above, this ends the proof. ∎
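
A possible way to estimate the bound of Equation (5) in practice, assuming (as in our experiments) that the posteriors P(Y = c | X = x) are approximated by the class votes, is sketched below; the vote matrix is the only input and all names are ours.

    import numpy as np

    def probabilistic_c_bound(votes):
        """Bound of Eq. (5): the margin M(x, c) of each class c is weighted by an
        estimate of P(Y = c | X = x); here both quantities are derived from the
        vote matrix `votes` of shape (n_examples, n_classes)."""
        n_examples, n_classes = votes.shape
        margins = np.empty_like(votes, dtype=float)
        for c in range(n_classes):
            # Margin of class c at x: its vote minus the best competing vote.
            margins[:, c] = votes[:, c] - np.delete(votes, c, axis=1).max(axis=1)
        posteriors = votes / votes.sum(axis=1, keepdims=True)    # posterior estimates
        m1 = np.mean(np.sum(posteriors * margins, axis=1))       # first moment of M
        m2 = np.mean(np.sum(posteriors * margins ** 2, axis=1))  # second moment of M
        return 1.0 - m1 ** 2 / m2 if m1 > 0 else 1.0             # trivial bound if m1 <= 0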

5 Learning with Imperfect Labels

In this section, we show how to evaluate the Bayes risk in the case where we observe an imperfect output Ŷ whose distribution differs from the one of the true output Y.

Further, we assume that the Bayes classifier is optimal in terms of risk minimization, i.e., it is equivalent to the maximum a posteriori rule:

B_Q(x) = argmax_{c∈𝒴} P(Y = c | X = x).

Then, the Bayes risks for Y and Ŷ can be written respectively as:

e(x) = 1 − max_{c∈𝒴} P(Y = c | X = x)   and   ê(x) = 1 − max_{c∈𝒴} P(Ŷ = c | X = x).

Let the label imperfection be described by the following probability law:

p_{ĉ,c} := P(Ŷ = ĉ | Y = c),   ĉ, c ∈ 𝒴.

It is assumed that the true class conditional distribution remains the same when the imperfect label is given, i.e., P(X | Ŷ = ĉ, Y = c) = P(X | Y = c). [Section 3.2.1 in Chittineni (1980)] For all distributions P(X), P(Y | X) and P(Ŷ | Y), we have:

(6)
(7)
Proof.

Consider the output of the Bayes classifier learnt with the true labels of all training samples and the corresponding imperfect Bayes risk ê(x). By taking the expectation with respect to X, we derive the inequality. ∎

Note that in some semi-supervised learning approaches, this model was employed to correct the mislabeling errors induced by the learner when assigning pseudo-labels to unlabeled training examples (Krithara et al., 2008). From this result, we can obtain a C-bound in the presence of mislabeled examples. For this, we introduce a random variable M̂ defined in the same way as M, replacing Y by Ŷ. Let M̂_1 and M̂_2 be respectively the first and the second statistical moments of the random variable M̂ defined by the law (4). Then, for all distributions P(X) over 𝒳 and P(Y | X), P(Ŷ | Y) over 𝒴, under a positivity condition analogous to that of Theorem 4, we have:

(8)
Proof.

The inequality is directly obtained from Theorem 4 and Theorem 5. ∎

Consequently, given imperfect labels, we can evaluate the C-bound in this "noisy" case; then, using the mislabeling probabilities P(Ŷ | Y), we perform a correction of the bound to obtain the true C-bound. One can notice that when the mislabeling matrix is the identity, i.e., P(Ŷ = c | Y = c) = 1 for every class c, there is no mislabeling, so the regular C-bound is recovered.

6 Wrapper with Imperfect Labels

In this section, we present a new framework for wrapper feature selection that uses both labeled and unlabeled data, based on the probabilistic framework with mislabeling errors presented above.

6.1 Framework

The algorithm starts from a supervised Bayes classifier initially trained on the available labeled examples. Then, it iteratively retrains the classifier by assigning, at each iteration, pseudo-labels to the unlabeled examples whose prediction vote is above a certain threshold found by minimizing Equation (3) (Feofanov et al., 2019, Algorithm 1).

As a result, we obtain a new augmented training set that increases the diversity of training examples. However, the votes of the classifier are possibly biased and the pseudo-labeled examples contain mislabeling errors.

In this sense, we propose a wrapper strategy that performs feature selection by modelling these mislabeling errors. In other words, we search for a feature subset, in the space of possible subsets, that minimizes the C-bound with imperfect labels (Corollary 5). The bound is estimated using the augmented training set. In order to solve the optimization problem, we perform a heuristic search using a genetic algorithm.
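
In code, the fitness of one candidate subset could be evaluated along the following lines (our sketch; `bound_fn` stands for any estimator of the bound, for example the probabilistic C-bound above with the mislabeling correction of Corollary 5 applied).

    from sklearn.ensemble import RandomForestClassifier

    def subset_fitness(features, X_aug, y_aug, bound_fn):
        """Fitness of a candidate feature subset: train the majority vote classifier
        on the augmented (labeled + pseudo-labeled) set restricted to `features`
        and return the value of the risk bound that the wrapper minimizes."""
        clf = RandomForestClassifier(n_estimators=200).fit(X_aug[:, features], y_aug)
        votes = clf.predict_proba(X_aug[:, features])   # class votes on the training set
        return bound_fn(votes)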

6.2 Genetic Algorithm and Its Limitations

A genetic algorithm (Goldberg and Holland, 1988) is an evolutionary optimization algorithm inspired by the natural selection process. A fitness function is optimized by iteratively evolving a population of candidates (in our case, binary representations of possible feature subsets).

Starting from a randomly drawn population, the algorithm iteratively produces new populations, called generations, by preserving the parents, the candidates with the best fitness, and creating offspring from the parents using crossover and mutation operations (Figure 1). After a predefined number of generations the algorithm is stopped, and the candidate with the best fitness in the last population is returned. In the following, we refer to this algorithm as the classic genetic algorithm (CGA).

[Figure 1: Parent 1 = (0, 0, 1, 0, 1, 1, 0, 1), Parent 2 = (1, 1, 1, 0, 0, 1, 0, 0), Child = (1, 0, 1, 0, 1, 1, 0, 1).]
Figure 1: A simple scheme of how a new child is generated from two parents. The crossover procedure is followed by mutation.
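
For concreteness, a minimal sketch of such a CGA over binary feature masks (one-point crossover and bit-flip mutation; `fitness` is assumed to return the value of the bound to be minimized for a given mask) could look as follows.

    import numpy as np

    def classic_ga(fitness, n_features, pop_size=40, n_parents=8,
                   n_generations=20, mutation_rate=0.05, rng=None):
        """Minimal CGA: candidates are binary masks over the features; the best
        `n_parents` candidates are kept as parents and mated by one-point
        crossover, and children are mutated by independent bit flips."""
        rng = rng or np.random.default_rng(0)
        population = rng.integers(0, 2, size=(pop_size, n_features))
        for _ in range(n_generations):
            scores = np.array([fitness(mask) for mask in population])
            parents = population[np.argsort(scores)[:n_parents]]   # lower bound value = better
            children = []
            while len(children) < pop_size - n_parents:
                p1, p2 = parents[rng.choice(n_parents, 2, replace=False)]
                cut = rng.integers(1, n_features)                   # one-point crossover
                child = np.concatenate([p1[:cut], p2[cut:]])
                flip = rng.random(n_features) < mutation_rate       # bit-flip mutation
                children.append(np.where(flip, 1 - child, child))
            population = np.vstack([parents, children])
        scores = np.array([fitness(mask) for mask in population])
        return population[np.argmin(scores)]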

The CGA can be very effective for wrapper feature selection when the number of features is very large. However, it has several limitations. Firstly, the algorithm may have a large variance in its output depending on the initialization of the population. Therefore, it usually needs a large number of generations to produce a stable output.

Another problem is that during the crossover a child inherits features from its parents at random, ignoring any information such as feature importance. Because of that, the solution the algorithm outputs is generally not as sparse as it could be. To produce a sparse solution, it is common practice to limit the search space by fixing the number of selected features (Persello and Bruzzone, 2016). However, it is not clear in advance which number of features should be taken.

6.3 Feature Selection Genetic Algorithm

In this section, we describe the new method, the Feature Selection Genetic Algorithm (FSGA). The main idea of the algorithm is to take into account the importance of features during the generation of a new population. This strategy allows the algorithm to output a sparse solution that preserves discriminative power without fixing the number of features. Figure 2 illustrates a flowchart of how the algorithm works.

[Figure 2 flowchart: initialize population → compute fitness → compute feature weights → test feature relevance → if last generation, done; otherwise select parents, perform crossover, perform mutation, and form the new population.]
Figure 2: The diagram of main steps in the Feature Selection Genetic Algorithm (FSGA).

Below, we describe in detail the different steps of the algorithm.

  • Initialization: We initialize the population by randomly generating feature subsets of a fixed length, set in advance in our experiments. Each candidate is a feature subset, in contrast to the CGA, where a candidate is a binary vector over all features.

  • Fitness Computation: For each candidate, we train a supervised model and compute a score reflecting the strength of the subset.

  • Feature Weights Computation: For each candidate, we obtain feature weights from the learning model. For this, ensemble methods based on decision trees can be used.

  • Feature Relevance Test: To accelerate convergence and reduce the variance of the algorithm, we embed a test that eliminates variables irrelevant to the response. We are inspired by the work of Tuv et al. (2009), where variables are compared with copies of themselves whose values have been randomly permuted. For each feature, we compute its average weight over the population of the current generation. We then collect the features whose average weight is less than a fixed threshold. These features, as well as their randomly permuted copies, are added to the best subset of the population according to the fitness score. A new supervised model, which gives feature weights, is learnt on these features. If the difference between the weight of a candidate feature and the weight of its noisy counterpart is not significant, the feature is removed and will not be further considered by the algorithm.

  • Parent Selection: Among the population, the candidates with the best fitness are selected, preserved for the next population and used to produce new offspring.

  • Crossover: A new child is generated by mating two parents. We randomly draw the crossover point, which characterizes the proportion of features inherited from the first parent; the remaining features are inherited from the second parent. In contrast to the CGA, we inherit variables according to their weights: for each parent, its features are sorted by weight in decreasing order, the child is first filled with the features of the first parent in this order until the quota is reached, and the rest of the features are taken from the second parent under the condition that there are no repetitions (see the sketch after this list).

  • Mutation: To increase the diversity of candidates, we mutate the children in the same way as in the CGA. In addition, we allow the number of features in a subset to mutate: for each child, its length can be randomly increased, decreased or kept the same.
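
One possible implementation of the weight-based crossover described above is sketched below; the helper and its argument names are ours, the parents are arrays of feature indices, and `weights1`/`weights2` give the importance of each parent's features as returned by the fitted model.

    import numpy as np

    def weighted_crossover(parent1, parent2, weights1, weights2, rng=None):
        """FSGA-style crossover: the child first inherits the highest-weighted features
        of parent 1 up to a random quota, then completes the subset with the
        highest-weighted features of parent 2 that are not already present."""
        rng = rng or np.random.default_rng()
        child_size = len(parent1)
        quota = rng.integers(1, child_size)                 # random crossover point
        child = [parent1[i] for i in np.argsort(weights1)[::-1][:quota]]
        for i in np.argsort(weights2)[::-1]:                # fill the rest from parent 2
            if len(child) == child_size:
                break
            if parent2[i] not in child:
                child.append(parent2[i])
        return np.array(child)

    # Hypothetical usage: feature indices with their weights from the fitted ensembles.
    child = weighted_crossover(np.array([0, 3, 5, 7]), np.array([1, 3, 4, 6]),
                               np.array([.4, .1, .3, .2]), np.array([.2, .5, .2, .1]))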

Figure 3: Result of feature selection on the synthetic data set. The features are sorted in the following order: 8 informative features, 6 redundant features, 6 irrelevant ones. Each cell represents the fraction of the 20 experiments in which a feature was chosen by a feature selection method.

7 Experimental Results

We conducted a number of experiments aimed at evaluating how the consideration of the mislabeling errors of pseudo-labeled unlabeled examples can help to learn an efficient wrapper feature selection model.

7.1 Framework

To this end, we compared the proposed approach with state-of-the-art models on a series of numerical experiments described below.

We consider the Random Forest algorithm (Breiman, 2001), denoted RF, with 200 trees and a fixed maximal tree depth, as the Bayes classifier with the uniform posterior distribution. For an observation x, we evaluate the vector of class votes by averaging, over the trees, the vote given to each class by each tree. A tree computes a class vote as the fraction of training examples in a leaf belonging to the class.

To evaluate the C-bound, we approximate the probabilities P(Y = c | X = x) by the class votes, which reflect the confidence in predicting the class of x. To estimate the mislabeling probabilities P(Ŷ | Y), we use 5-fold cross-validation on the labeled training set, comparing the true labels with the predicted ones over the different validation sets.
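
This estimation step can be sketched as a row-normalized confusion matrix between the true labels and the cross-validated predictions (our illustration; the function and variable names are ours).

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import cross_val_predict

    def estimate_mislabeling_matrix(X_lab, y_lab, n_classes):
        """Estimate P(Y_hat = c_hat | Y = c) from 5-fold cross-validated predictions on
        the labeled set: row c (true class) sums to 1 over the predicted classes c_hat."""
        y_pred = cross_val_predict(RandomForestClassifier(n_estimators=200),
                                   X_lab, y_lab, cv=5)
        counts = confusion_matrix(y_lab, y_pred, labels=np.arange(n_classes))
        return counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)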

To evaluate the quality of selection, we use the self-learning algorithm (SLA) with automatic thresholding (Feofanov et al., 2019). In other words, we first find a feature subset using a feature selection method, then train the SLA on the selected features and compute its performance. In our experiments we compare the following methods:

  • a baseline, which is a fully supervised Random Forest (RF) trained using only the labeled examples and the complete set of features;

  • an embedded selection by rescaled linear square regression (RLSR) proposed by Chen et al. (2017);

  • a feature ranking by Semi_Fisher score (SFS) proposed by Yang et al. (2010);

  • a feature ranking by semi-supervised Laplacian score (SSLS) proposed by Zhao et al. (2008);

  • the approach proposed in this paper: a wrapper based on the C-bound with imperfect labels (WIL); the bound is computed using the labeled training examples and the unlabeled training examples pseudo-labeled by the SLA; the approach is optimized either by the classic genetic algorithm (WIL-CGA) or by the feature selection genetic algorithm (WIL-FSGA).

All results are averaged over 20 random (labeled/unlabeled/test) splits of the initial collection, and we report the average classification accuracy over the 20 trials on the unlabeled training set as well as on the test set.

The hyperparameters of all methods are set to their default values, as there are not enough labeled training samples to tune them correctly. Specifically, the parameter of RLSR is set to 0.1, and the number of nearest neighbours is set to 20 for SSLS and SFS. For the genetic algorithms, the number of generations is set to 20, the population size to 40 and the number of parents to 8.

7.2 Experiments on Synthetic Data

We first test the algorithms on synthetic data generated using the implementation of Pedregosa et al. (2011) of the algorithm that created the Madelon data set (Guyon, 2003). The sizes of the labeled training, unlabeled training and test sets are respectively 100, 900 and 100. We fixed the number of classes to 3 and the number of features to 20, wherein 8 features are informative, 6 are linear combinations of the latter, and 6 are noise.

For RLSR, SFS and SSLS, the number of selected features is set to 8. The feature selection results are averaged over 20 different splits. Figure 3 illustrates which features were selected by each method. Below, we report the results of feature selection with respect to the elimination of irrelevant and redundant variables.

  • Elimination of irrelevant variables: From Figure 3, it can be seen that the filter methods, SFS and SSLS, effectively eliminated the noise (variables 15-20), whereas RLSR coped with this task worst of all. Since in the CGA a new child inherits features from its parents purely at random, WIL-CGA does not discard the irrelevant variables completely and probably needs more generations to do so. Thanks to the feature relevance test, WIL-FSGA is as effective as the filter methods.

  • Elimination of redundant variables: Since the redundant variables (9-14) are linear combinations of the informative ones (1-8), they are individually useful for classification as well. Because of that, the filter methods, SFS and SSLS, tend to under-select some informative features (e.g., 2, 3, 8). This might be caused by the individual weakness of these variables compared to the "strong" variables (e.g., 4, 5, 6) and their redundant counterparts.

    In contrast to the filters, the wrapper approaches, WIL-CGA and WIL-FSGA, search for features that are jointly strong; therefore, they are less prone to missing the informative variables. Finally, it can be clearly seen that RLSR is affected by noise, so it under-selects both the informative and the redundant variables.

Data set   # lab. examples   # unlab. examples   # test examples   Dimension   # classes
Protein    97                875                 108               77          8
Isolet     140               1264                156               617         26
Fashion    99                9801                100               784         10
MNIST      99                9801                100               784         10
Coil20     130               1166                144               1024        20
PCMAC      175               1574                194               3289        2
Gisette    69                6861                70                5000        2
Table 1: Characteristics of the data sets used in our experiments, ordered by dimension.
Data set          RF                RLSR              SFS               SSLS              WIL-CGA            WIL-FSGA
Protein  ACC-U  .751±.024 (77)    .726±.024 (19)    .712±.028 (19)    .685±.028 (19)    .755±.028 (42)     .755±.031 (26)
         ACC-T  .742±.043         .721±.046         .716±.049         .673±.048         .761±.047          .750±.042
Isolet   ACC-U  .817±.014 (617)   .822±.020 (73)    .672±.022 (73)    .666±.016 (73)    .842±.012 (319)    .822±.016 (55)
         ACC-T  .817±.022         .814±.029         .659±.037         .649±.036         .849±.023          .815±.032
Fashion  ACC-U  .688±.014 (784)   .591±.016 (86)    .528±.034 (86)    .512±.031 (86)    .688±.018 (407)    .662±.022 (73)
         ACC-T  .684±.038         .590±.041         .520±.054         .504±.045         .686±.030          .658±.033
MNIST    ACC-U  .774±.016 (784)   .210±.022 (86)    .110±.002 (86)    .446±.062 (86)    .825±.021 (413)    .782±.016 (78)
         ACC-T  .776±.050         .212±.030         .109±.003         .451±.052         .832±.047          .796±.039
Coil20   ACC-U  .928±.012 (1024)  .922±.013 (102)   .810±.015 (102)   .813±.018 (102)   .941±.010 (518)    .937±.012 (58)
         ACC-T  .926±.026         .916±.025         .816±.025         .809±.023         .935±.023          .931±.025
PCMAC    ACC-U  .815±.025 (3289)  .817±.021 (222)   .726±.047 (222)   .595±.057 (222)   .811±.021 (1656)   .818±.025 (57)
         ACC-T  .829±.035         .825±.038         .727±.061         .598±.066         .825±.033          .832±.036
Gisette  ACC-U  .877±.013 (5000)  .669±.084 (293)   .877±.012 (293)   .615±.041 (293)   .879±.015 (2503)   .873±.016 (64)
         ACC-T  .865±.042         .683±.086         .874±.035         .614±.059         .881±.038          .873±.029
Table 2: The classification performance on the unlabeled and the test sets (ACC-U and ACC-T respectively) for the data sets presented in Table 1. The number of features (averaged over 20 trials) used for learning by each method is indicated in parentheses next to the ACC-U scores. The bold face is used to emphasize the highest performance rate; a dedicated symbol indicates that the performance is significantly worse compared to the best result, according to the Mann-Whitney U test (Mann and Whitney, 1947) at the p-value level of 0.01.

7.3 Experiments on Real Data Sets

In addition, we validate our approach on 7 publicly available data sets (Chang and Lin, 2011; Li et al., 2018). The associated applications are image recognition, with the MNIST, Fashion, Coil20 and Gisette data sets; text classification with the PCMAC database; bioinformatics with the Protein data set; finally, the Isolet database represents a speech recognition task.

The main characteristics of all data sets are summarized in Table 1. Since we are interested in the practical use of the algorithm, we test the algorithms under the condition that the number of labeled examples is much smaller than the number of unlabeled ones. For the MNIST and the Fashion data sets, we consider subsets of 10,000 observations. For RLSR, SFS and SSLS, we fix the number of selected features (as reported in Table 2). Table 2 summarizes the performance results and reports the number of features used for learning. These results show that:

  • The proposed approach compares well to the other methods. On the Isolet, MNIST and Coil20 data sets, the algorithm significantly improves the performance over the supervised baseline RF by using unlabeled data and reducing the original dimension.

  • For all data sets, while being among the best in terms of performance, WIL-FSGA also drastically reduces the original dimension, which is especially visible for the data sets of larger dimension (Coil20, PCMAC, Gisette).

  • Since the number of selected features has to be predefined for RLSR, SFS and SSLS, it is difficult to determine this number, especially in the semi-supervised context. This may lead to a significant drop in performance, as can be seen on the MNIST data set.

8 Conclusion

In this paper we proposed a new semi-supervised framework for wrapper feature selection. To increase the diversity of labeled data, unlabeled examples are pseudo-labeled using a self-learning algorithm. We extended the C-bound to the case where these examples are given imperfect class labels. The objective of the proposed wrapper is to minimize this bound using a genetic algorithm. To produce a sparse solution, we proposed a modification of the latter that takes into account feature weights during its evolutionary process. We provided empirical evidence of the effectiveness of our framework in comparison with a supervised baseline, two semi-supervised filter techniques as well as an embedded feature selection algorithm. The proposed modification of the genetic algorithm provides a good trade-off in tasks where both high performance and low dimensionality are required.

References

  • Amini et al. (2009) Amini, M., Usunier, N., and Laviolette, F. (2009). A transductive bound for the voted classifier with an application to semi-supervised learning. In Advances in Neural Information Processing Systems 21, pages 65–72.
  • Breiman (2001) Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
  • Chandrashekar and Sahin (2014) Chandrashekar, G. and Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28.
  • Chang and Lin (2011) Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27.
  • Chen et al. (2017) Chen, X., Yuan, G., Nie, F., and Huang, J. Z. (2017). Semi-supervised feature selection via rescaled linear regression. In IJCAI, volume 2017, pages 1525–1531.
  • Chittineni (1980) Chittineni, C. (1980). Learning with imperfectly labeled patterns. Pattern Recognition, 12(5):281–291.
  • Feofanov et al. (2019) Feofanov, V., Devijver, E., and Amini, M.-R. (2019). Transductive bounds for the multi-class majority vote classifier. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3566–3573.
  • Goldberg and Holland (1988) Goldberg, D. E. and Holland, J. H. (1988). Genetic algorithms and machine learning. Machine learning, 3(2):95–99.
  • Guyon (2003) Guyon, I. (2003). Design of experiments of the NIPS 2003 variable selection benchmark. In NIPS 2003 Workshop on Feature Extraction and Feature Selection.
  • Guyon and Elisseeff (2003) Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157–1182.
  • Kohavi and John (1997) Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial intelligence, 97(1-2):273–324.
  • Koller and Sahami (1996) Koller, D. and Sahami, M. (1996). Toward optimal feature selection. In Proceedings of the Thirteenth International Conference on International Conference on Machine Learning, ICML’96, pages 284–292, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  • Krithara et al. (2008) Krithara, A., Amini, M.-R., and Goutte, C. (2008). Semi-Supervised Document Classification with a Mislabeling Error Model. In European Conference on Information Retrieval (ECIR’08), pages 370–381. Springer.
  • Lacasse et al. (2007) Lacasse, A., Laviolette, F., Marchand, M., Germain, P., and Usunier, N. (2007). Pac-bayes bounds for the risk of the majority vote and the variance of the gibbs classifier. In Advances in Neural information processing systems, pages 769–776.
  • Laviolette et al. (2014) Laviolette, F., Morvant, E., Ralaivola, L., and Roy, J. (2014). On Generalizing the C-Bound to the Multiclass and Multi-label Settings. In NIPS 2014 Workshop on Representation and Learning Methods for Complex Outputs, Dec 2014, Montréal, Canada.
  • Li et al. (2018) Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., and Liu, H. (2018). Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6):94.
  • Mann and Whitney (1947) Mann, H. B. and Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1):50–60.
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
  • Persello and Bruzzone (2016) Persello, C. and Bruzzone, L. (2016). Kernel-based domain-invariant feature selection in hyperspectral images for transfer learning. IEEE Transactions on Geoscience and Remote Sensing, 54(5):2615–2626.
  • Ren et al. (2008) Ren, J., Qiu, Z., Fan, W., Cheng, H., and Yu, P. S. (2008). Forward semi-supervised feature selection. In Washio, T., Suzuki, E., Ting, K. M., and Inokuchi, A., editors, Advances in Knowledge Discovery and Data Mining, pages 970–976, Berlin, Heidelberg. Springer Berlin Heidelberg.
  • Sheikhpour et al. (2017) Sheikhpour, R., Sarram, M. A., Gharaghani, S., and Chahooki, M. A. Z. (2017). A survey on semi-supervised feature selection methods. Pattern Recogn., 64(C):141–158.
  • Tür et al. (2005) Tür, G., Hakkani-Tür, D. Z., and Schapire, R. E. (2005). Combining active and semi-supervised learning for spoken language understanding. Speech Communication, 45:171–186.
  • Tuv et al. (2009) Tuv, E., Borisov, A., Runger, G., and Torkkola, K. (2009). Feature selection with ensembles, artificial variables, and redundancy elimination. J. Mach. Learn. Res., 10:1341–1366.
  • Vittaut et al. (2002) Vittaut, J., Amini, M., and Gallinari, P. (2002). Learning classification with both labeled and unlabeled data. In 13th European Conference on Machine Learning (ECML’02), pages 468–479.
  • Yang et al. (2010) Yang, M., Chen, Y.-J., and Ji, G.-L. (2010). Semi_fisher score: A semi-supervised method for feature selection. In 2010 International Conference on Machine Learning and Cybernetics, volume 1, pages 527–532. IEEE.
  • Zhao et al. (2008) Zhao, J., Lu, K., and He, X. (2008). Locality sensitive semi-supervised feature selection. Neurocomputing, 71(10-12):1842–1849.
