1 Introduction
We consider learning problems where the labeled training set comes with a huge number of unlabeled training examples and where the dimension of the feature space is large. In this case, traditional learning approaches usually suffer from excessive computation time, poor learning performance, and a lack of interpretability. In practice, the original set of features may contain characteristics that are irrelevant or redundant to the output, and their removal may yield better prediction performance and provide important keys for the interpretability of the results (Guyon and Elisseeff, 2003; Chandrashekar and Sahin, 2014). Depending on the availability of class labels, feature selection techniques can be supervised, unsupervised, or semi-supervised. Being agnostic to the target variable, unsupervised approaches generally ignore the discriminative power of features, so their use may lead to poor performance. In contrast, supervised feature selection algorithms benefit from abundant labeled examples, so they effectively eliminate irrelevant and redundant variables. In semi-supervised feature selection (Sheikhpour et al., 2017), the aim is to exploit both the available labeled and unlabeled training observations in order to provide a solution that preserves important structures of the data and leads to high performance. Considerable progress has been made in this direction over the last few years. Filter methods (Yang et al., 2010; Zhao et al., 2008) score features according to a criterion and perform selection before a learning model is constructed. Embedded techniques (Chen et al., 2017) perform model-based feature selection in order to infer the importance of features during the training process.
Finally, wrapper approaches (Kohavi and John, 1997; Ren et al., 2008) use a learner to effectively find a subset of features that are discriminatively powerful together. The underlying principle of these approaches is to search for a feature subset by optimizing the prediction performance of a given learning model. In semi-supervised learning, one way to improve prediction performance and to increase data diversity is to augment the labeled training set with pseudo-labeled unlabeled examples using either self-learning or co-training approaches (Amini et al., 2009; Sheikhpour et al., 2017; Feofanov et al., 2019). In this case, pseudo-labels are iteratively assigned to the unlabeled examples whose confidence score is above a certain threshold. However, fixing this threshold is to some extent the bottleneck of these approaches. In addition, the pseudo-labels may be prone to error, so the wrapper will be learnt on biased and mislabeled training data.
Another related question is how to optimize the wrapper. Due to the exponential number of possible subsets, exhaustive search is computationally infeasible. Sequential search algorithms, like the one proposed by Ren et al. (2008), also become infeasible in the case of very large dimension. To overcome this problem, a common practice is to use a heuristic search algorithm, for instance a genetic algorithm (Goldberg and Holland, 1988). However, for applications of large dimension, this approach may have large variance in its output, and the set of selected features might still be large.
In this paper, we propose a new framework for semi-supervised wrapper feature selection with an explicit modeling of mislabeling probabilities. To perform consistent self-learning, we use the recent work of Feofanov et al. (2019) to find the threshold dynamically, based on the transductive guarantees of an ensemble Bayes classifier. To eliminate the bias of the pseudo-labeling, we derive a new upper bound for the Bayes classifier that is computed on imperfect labels. This bound builds on the bound of Lacasse et al. (2007) on the Bayes risk, which is derived by considering the mean and the variance of the prediction margin. To extend this bound, we consider the mislabeling model proposed by Chittineni (1980). Finally, we propose a modification of the genetic algorithm that takes feature weights into account during the optimization phase and provides a sparse solution to the feature selection task. In the following section, we introduce the problem statement. Section 3 provides background information related to this work. Section 4 shows how to derive the bound in the probabilistic framework. In Section 5, we show how to take mislabeling into account in the bound. In Section 6, we describe our algorithm for selecting features on semi-supervised data. Section 7 presents the experimental results. Finally, the conclusion is given in Section 8.
2 Framework
We consider multiclass classification problems with an input space and an output space , . We denote by (resp. ) an input (resp. output) random variable and the input projection on a subset of features . We assume available a set of labeled training examples, identically and independently distributed (i.i.d.) with respect to a fixed yet unknown probability distribution over , and a set of unlabeled training examples supposed to be drawn i.i.d. from the marginal distribution over the domain . Following Koller and Sahami (1996), we call a Markov blanket for the output if . Thus, we formulate the goal of semi-supervised feature selection as finding a minimal Markov blanket among all possible feature subsets, based on the available labeled and unlabeled data.
A solution that satisfies the Markov blanket condition does not include irrelevant variables, since they are independent of . In addition, when a Markov blanket is minimal, we exclude the maximum possible number of features that are conditionally independent of given , i.e., redundant variables.
In this work, a fixed class of classifiers , called the hypothesis space, is considered and defined without reference to the training set. Further, we focus on the Bayes classifier (also called the majority vote classifier) defined for all as
(1) 
We formulate the task of the learner as choosing a posterior distribution over after observing the training set, such that the true risk of the classifier is minimized:
Given an observation , its margin is defined in the following way:
where is the vote given by the Bayes classifier to the class membership of an example being . The margin measures the confidence of the prediction: if it is strictly positive for an example , then the example is correctly classified.
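The margin definition above can be sketched numerically. In this minimal helper (the function name and vote vector are illustrative, not from the paper), the margin of an example with true class y is the vote given to y minus the largest vote given to any other class:

```python
import numpy as np

def margin(votes, y):
    """Prediction margin of the majority vote (hypothetical helper):
    the vote for the true class y minus the largest vote for any
    other class.  Strictly positive iff the example is correctly
    classified by the majority vote."""
    votes = np.asarray(votes, dtype=float)
    others = np.delete(votes, y)
    return votes[y] - others.max()
```

For a vote vector `[0.6, 0.3, 0.1]`, the margin is positive when the true class is 0 and negative otherwise, matching the remark that a strictly positive margin means correct classification.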
3 Related Work
In this section, we introduce the bound proposed for the supervised case and briefly describe the self-learning algorithm.
3.1 Bound
Lacasse et al. (2007) proposed to upper-bound the risk of the Bayes classifier by taking into account the mean and the variance of the prediction margin. The multiclass version of this bound is given in the following theorem.
3.2 Self-learning Based on the Transductive Bound
The idea of a self-learning algorithm (Vittaut et al., 2002; Tür et al., 2005) is to iteratively assign pseudo-labels to the subset of unlabeled training examples that have their associated class vote above a threshold. By extending the work of Amini et al. (2009) to the multiclass classification case, Feofanov et al. (2019) proposed to obtain the threshold dynamically by minimizing the conditional Bayes error , defined by:
(3) 
where is the risk of the Bayes classifier over examples having a margin greater than or equal to . The conditional Bayes error represents the proportion of this risk over unlabeled examples having margin above . An unlabeled example is pseudo-labeled if its prediction vote is higher than the corresponding threshold.
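The dynamic thresholding can be sketched as follows. This is a simplified stand-in, not the paper's exact procedure: `err_upper` plays the role of the transductive error bound of Feofanov et al. (2019), approximated here naively as one minus the class vote, and the threshold is searched on a grid:

```python
import numpy as np

def conditional_error(votes, err_upper, theta):
    """Empirical conditional Bayes error above threshold theta:
    the (bounded) error mass among unlabeled examples whose class
    vote is at least theta, divided by the number of such examples.
    err_upper is an assumed per-example error bound standing in for
    the transductive bound of the paper."""
    mask = votes >= theta
    if not mask.any():
        return float("inf")          # no example passes the threshold
    return err_upper[mask].sum() / mask.sum()

def best_threshold(votes, err_upper, grid):
    """Dynamic thresholding sketch: pick the grid value that
    minimizes the conditional error."""
    return min(grid, key=lambda t: conditional_error(votes, err_upper, t))
```

With votes `[0.9, 0.8, 0.6, 0.5]` and the naive bound `1 - vote`, the selected threshold keeps only the two most confident examples, illustrating how minimizing the conditional error trades off pseudo-label quantity against quality.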
Feofanov et al. (2019) showed empirically that this approach significantly outperforms the classic self-learning algorithm, in which the threshold is fixed manually. Indeed, with a fixed threshold, the algorithm may inject many errors, so in most cases the classic self-learning algorithm performs worse than the corresponding supervised algorithm.
4 Probabilistic Bound
In our approach, we define the risk by explicitly taking into account the posterior probability as follows: where denotes the Bayes risk in classifying an observation .
Let be a discrete random variable that is equal to the margin with probability , . Then is defined by the following probability law: (4)
The random variable as defined above is connected to the Bayes risk in the following way:
Proof.
One can notice that :
Applying the total probability law, we obtain:
∎
Let and be respectively the first and the second statistical moments of the random variable defined by the law (4):
Then, for every choice of on a hypothesis space , and for all distributions over and over such that , we have:
(5) 
5 Learning with Imperfect Labels
In this section, we show how to evaluate the Bayes risk in the case where we have an imperfect output with a different distribution than the true output .
Further, we assume that the Bayes classifier is optimal in terms of risk minimization, i.e. it is equivalent to the maximum a posteriori rule:
Then, the Bayes risk for and can be written as:
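Under the maximum a posteriori rule above, the Bayes risk at an observation is one minus the largest class posterior, a standard fact that can be sketched directly (the helper name and posterior vector are illustrative):

```python
import numpy as np

def bayes_risk(posteriors):
    """Bayes risk at an observation x under the MAP rule:
    R(x) = 1 - max_y P(y | x)."""
    return 1.0 - np.asarray(posteriors, dtype=float).max()
```

For posteriors `[0.7, 0.2, 0.1]`, the MAP rule predicts the first class and errs with probability 0.3.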
Let the label imperfection be described by the following probability law:
It is assumed that the true class-conditional distribution remains the same when the imperfect label is given, i.e. . The following result holds (Section 3.2.1 in Chittineni (1980)): for all distributions , , , we have:
(6)  
(7) 
Proof.
Let be the output of the Bayes classifier learnt with the true labels of all training samples.
Consider :
By taking the expectation with respect to , we derive the inequality. ∎
Note that in some semi-supervised learning approaches, this model was employed to correct the mislabeling errors induced by the learner when assigning pseudo-labels to unlabeled training examples (Krithara et al., 2008). From this result, we can obtain a bound in the presence of mislabeled examples. For this, we introduce a random variable defined in a similar way as , replacing by . Let and be respectively the first and the second statistical moments of the random variable defined by the law (4):
Then, for all distributions over and , over , such that , we have:
(8) 
Consequently, given imperfect labels, we can evaluate the bound in this "noisy" case; then, using the term , we perform a correction of the bound to obtain the true bound. One can notice that when , there is no mislabeling, so the regular bound is recovered.
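The effect of the mislabeling model can be illustrated numerically. The 3-class mislabeling matrix and posterior vector below are hypothetical values chosen for illustration; pushing the true class posteriors through the mislabeling matrix yields the posteriors of the imperfect label, and the identity matrix recovers the noise-free case, as noted above:

```python
import numpy as np

# Hypothetical mislabeling matrix M[i, j] = P(y' = i | y = j);
# each column sums to one.
M = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.2],
              [0.0, 0.1, 0.8]])
p_true = np.array([0.7, 0.2, 0.1])   # assumed P(y | x)

# Posteriors of the imperfect label: P(y' | x) = sum_y P(y' | y) P(y | x).
p_noisy = M @ p_true
```

With `M` equal to the identity, `p_noisy` equals `p_true`, so the corrected bound coincides with the regular one.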
6 Wrapper with Imperfect Labels
In this section, we present a new framework for wrapper feature selection that uses both labeled and unlabeled data, based on the probabilistic framework with mislabeling errors presented above.
6.1 Framework
The algorithm starts from a supervised Bayes classifier initially trained on the available labeled examples. Then, it iteratively retrains the classifier, at each iteration assigning pseudo-labels to the unlabeled examples that have a prediction vote above a certain threshold, found by minimizing Equation (3) (Feofanov et al., 2019, Algorithm 1).
As a result, we obtain an augmented training set that increases the diversity of training examples. However, the votes of the classifier are possibly biased, and the pseudo-labeled examples contain mislabeling errors.
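This self-learning loop can be sketched as follows. The sketch uses scikit-learn's `RandomForestClassifier` and a fixed confidence threshold for simplicity; in the paper the threshold is found dynamically at each iteration, and the function name and defaults are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_learn(X_lab, y_lab, X_unl, theta=0.7, max_iter=10):
    """Self-learning sketch: iteratively retrain a classifier, moving
    into the labeled set the unlabeled examples whose top class vote
    reaches theta (fixed here; dynamic in the paper)."""
    X_lab, y_lab, X_unl = X_lab.copy(), y_lab.copy(), X_unl.copy()
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    for _ in range(max_iter):
        clf.fit(X_lab, y_lab)
        if len(X_unl) == 0:
            break
        votes = clf.predict_proba(X_unl)
        take = votes.max(axis=1) >= theta
        if not take.any():
            break
        pseudo = clf.classes_[votes[take].argmax(axis=1)]  # pseudo-labels
        X_lab = np.vstack([X_lab, X_unl[take]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unl = X_unl[~take]
    return clf
```

The returned classifier has been trained on the augmented set of labeled plus confidently pseudo-labeled examples.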
In this sense, we propose a wrapper strategy that performs feature selection by modeling these mislabeling errors. In other words, we search for a feature subset, in the space of possible subsets, that minimizes the bound with imperfect labels (Corollary 5). The bound is estimated using the augmented training set. In order to solve the optimization problem, we perform a heuristic search using a genetic algorithm.
6.2 Genetic Algorithm and Its Limitations
A genetic algorithm (Goldberg and Holland, 1988) is an evolutionary optimization algorithm inspired by the natural selection process. A fitness function is optimized by iteratively evolving a population of candidates (in our case, binary representations of possible feature subsets).
Starting from a randomly drawn population, the algorithm iteratively produces new populations, called generations, by preserving parents, the candidates with best fitness, and creating offspring from parents using the crossover and mutation operations (Figure 1). After a predefined number of generations the algorithm is stopped, and a candidate with the best fitness in the last population is returned. In the following, we call this algorithm the classic genetic algorithm (denoted CGA).
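The CGA described above can be sketched over binary feature masks. This is a minimal illustration, not the paper's implementation; the function name, population sizes, and mutation rate are assumptions:

```python
import numpy as np

def classic_ga(fitness, n_features, pop_size=20, n_parents=6,
               n_generations=30, p_mut=0.05, seed=0):
    """Classic genetic algorithm (CGA) sketch over binary feature
    masks.  fitness maps a boolean mask of length n_features to a
    score to maximize; parents survive by elitism, children come
    from single-point crossover plus bit-flip mutation."""
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, n_features)) < 0.5   # random initial masks
    for _ in range(n_generations):
        scores = np.array([fitness(c) for c in pop])
        parents = pop[np.argsort(scores)[-n_parents:]]   # keep the best
        children = []
        while n_parents + len(children) < pop_size:
            i, j = rng.choice(n_parents, size=2, replace=False)
            cut = rng.integers(1, n_features)            # crossover point
            child = np.concatenate([parents[i][:cut], parents[j][cut:]])
            child ^= rng.random(n_features) < p_mut      # mutation
            children.append(child)
        pop = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(c) for c in pop])
    return pop[scores.argmax()]
```

Note that the child inherits bits from its parents purely by position, regardless of feature importance, which is exactly the limitation addressed in the next subsection.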
The CGA can be very effective for wrapper feature selection when the number of features is very large. However, it has several limitations. First, the algorithm's output may vary considerably depending on the initialization of the population; it therefore usually needs a large number of generations to produce a stable output.
Another problem is that, during the crossover, a child inherits features from its parents at random, ignoring any information such as feature importance. Because of that, the solution output by the algorithm is generally not as sparse as it could be. To produce a sparse solution, it is common to limit the search space by fixing the number of selected features (Persello and Bruzzone, 2016). However, it is not clear how many features should be taken.
6.3 Feature Selection Genetic Algorithm
In this section, we describe the new method, the Feature Selection Genetic Algorithm (FSGA). The main idea of the algorithm is to take the importance of features into account during the generation of a new population. This strategy makes it possible to output a sparse solution that preserves discriminative power without fixing the number of features. Figure 2 illustrates a flowchart of how the algorithm works.
Below, we describe in detail the different steps of the algorithm.

Initialization: We initialize the population by randomly generating feature subsets of a fixed length. In our experiments, this length is equal to . Each candidate is a feature subset, unlike in the CGA, where the subset is binarized.

Fitness Computation: For each candidate , we train a supervised model and compute the score reflecting the strength of the subset.

Feature Weights Computation: For each candidate , we obtain weights from the learning model. For this, ensemble methods based on decision trees can be used.

Feature Relevance Test:
To accelerate convergence and reduce the variance of the algorithm, we embed a test that eliminates variables irrelevant to the response. We are inspired by the work of Tuv et al. (2009), where variables are compared with copies of themselves whose values have been randomly permuted. For each feature, we compute the average weight: where is the population of generation . We find the features whose average weights are less than a fixed threshold : . These features, as well as their copies with randomly permuted values, are added to the best subset of the population according to the fitness score. A new supervised model, which provides feature weights, is learnt on these features. If the difference between the weight of a feature (belonging to ) and the weight of its noisy counterpart is not significant, the feature is removed and is no longer considered by the algorithm.
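The core of this shadow-feature comparison can be sketched as follows. The function name and the simple "importance greater than shadow importance" decision rule are illustrative simplifications of the significance test described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def relevance_test(X, y, candidates, seed=0):
    """Shadow-feature relevance test (sketch, after Tuv et al., 2009):
    append randomly permuted copies of the candidate columns, refit a
    forest, and keep only the candidates whose importance exceeds
    that of their permuted counterpart."""
    rng = np.random.default_rng(seed)
    shadows = rng.permuted(X[:, candidates], axis=0)  # permute each column
    Z = np.hstack([X[:, candidates], shadows])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(Z, y)
    k = len(candidates)
    w, w_sh = clf.feature_importances_[:k], clf.feature_importances_[k:]
    return [f for f, keep in zip(candidates, w > w_sh) if keep]
```

A permuted copy carries the same marginal distribution as the original feature but no relation to the response, so a feature that cannot beat its own shadow is treated as irrelevant.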

Parent Selection: Among the population , candidates with best fitness are selected, preserved for the next population and used to produce new offspring.

Crossover: A new child is generated by mating two parents. We randomly draw the crossover point, which determines the proportion of features inherited from the first parent; the rest of the features are inherited from the second parent. In contrast to the CGA, we inherit variables according to their weights. For each parent, its features are sorted by weight in decreasing order. We fill the child with the features of the first parent in this order until we reach the quota. The rest of the features are taken from the second parent, under the condition that there are no repetitions.
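The weight-aware crossover can be sketched as follows (the function name and the 50% default crossover point are illustrative):

```python
def weighted_crossover(parent1, w1, parent2, w2, cut_ratio=0.5):
    """Weight-aware crossover sketch: the child first inherits the
    top-weighted features of the first parent up to the quota set by
    the crossover point, then fills up with the second parent's
    features in decreasing weight order, skipping duplicates."""
    rank = lambda feats, w: [f for f, _ in
                             sorted(zip(feats, w), key=lambda t: -t[1])]
    quota = int(round(cut_ratio * len(parent1)))
    child = rank(parent1, w1)[:quota]
    for f in rank(parent2, w2):
        if len(child) == len(parent1):
            break
        if f not in child:          # no repeated features in the child
            child.append(f)
    return child
```

Unlike the positional crossover of the CGA, the child here is guaranteed to receive each parent's highest-weighted features first.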

Mutation: To increase the diversity of candidates, we mutate children in the same way as in the CGA. In addition, we allow the number of features in the subset to mutate: for each child, its length can be randomly increased, decreased, or left unchanged.
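The mutation step, including the length mutation, can be sketched as follows (the function name, swap probability, and unit-step length change are illustrative assumptions):

```python
import random

def mutate(child, all_features, p_swap=0.1, rng=None):
    """Mutation sketch: each inherited feature may be swapped for a
    random unused one, and the subset length may randomly grow by
    one, shrink by one, or stay the same."""
    rng = rng or random.Random(0)
    child = list(child)
    pool = [f for f in all_features if f not in child]  # unused features
    for i in range(len(child)):
        if pool and rng.random() < p_swap:
            j = rng.randrange(len(pool))
            child[i], pool[j] = pool[j], child[i]       # swap with pool
    step = rng.choice((-1, 0, 1))        # mutate the subset length
    if step == 1 and pool:
        child.append(pool.pop(rng.randrange(len(pool))))
    elif step == -1 and len(child) > 1:
        child.pop(rng.randrange(len(child)))
    return child
```

The swap keeps the child and the pool disjoint, so the mutated subset never contains duplicates.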
7 Experimental Results
We conducted a number of experiments aimed at evaluating how taking into account the mislabeling errors of pseudo-labeled unlabeled examples can help to learn an efficient wrapper feature selection model.
7.1 Framework
To this end, we compared the proposed approach with state-of-the-art models in a series of numerical experiments described below.
We consider the Random Forest algorithm (Breiman, 2001), denoted RF, with 200 trees and the maximal depth of trees, as the Bayes classifier with the uniform posterior distribution. For an observation, we evaluate the vector of class votes by averaging, over the trees, the vote given to each class by each tree. A tree computes a class vote as the fraction of training examples in a leaf belonging to that class. To evaluate the bound, we approximate probabilities by the class vote reflecting the confidence in predicting . To estimate , we use 5-fold cross-validation on the labeled training set, comparing true labels with predicted ones over the different validation sets.
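The class-vote computation described above can be made explicit with scikit-learn; each tree's `predict_proba` returns the leaf class fractions, and the forest averages them over trees, which is what `RandomForestClassifier.predict_proba` computes (the dataset and its sizes here are illustrative, not the paper's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data: 3 classes, as in the paper's Bayes-classifier setup.
X, y = make_classification(n_samples=200, n_classes=3, n_informative=5,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Per-tree votes (leaf class fractions) for the first observation,
# averaged over the 200 trees to obtain the forest's class-vote vector.
per_tree = np.stack([tree.predict_proba(X[:1]) for tree in rf.estimators_])
votes = per_tree.mean(axis=0)
```

The resulting `votes` vector sums to one and coincides with the forest's own `predict_proba` output for that observation.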
To evaluate the quality of selection, we use the self-learning algorithm (SLA) with automatic thresholding (Feofanov et al., 2019). In other words, we first find a feature subset using a feature selection method, then train SLA on the selected features and compute its performance. In our experiments, we compare the following methods:

a baseline, which is a fully supervised Random Forest (RF) trained using only the labeled examples and the complete set of features;

an embedded selection by rescaled linear square regression (RLSR) proposed by Chen et al. (2017);

a feature ranking by Semi_Fisher score (SFS) proposed by Yang et al. (2010);

a feature ranking by the semi-supervised Laplacian score (SSLS) proposed by Zhao et al. (2008);

the approach proposed in this paper: a wrapper based on the bound with imperfect labels (WIL); the bound is computed using the labeled training examples and the unlabeled training examples pseudo-labeled by SLA; the approach is optimized either by the classic genetic algorithm (WIL-CGA) or by the feature selection genetic algorithm (WIL-FSGA).
We average results over 20 random (labeled/unlabeled/test) splits of the initial collection and report the average classification accuracy over the 20 trials on the unlabeled training set as well as on the test set.
The hyperparameters of all methods are set to their default values, as there are not enough labeled training samples to tune them correctly. Specifically, for RLSR is set to 0.1; the number of nearest neighbours is set to 20 for SSLS and SFS. For the genetic algorithms, the number of generations is set to 20, the population size to 40, and the number of parents to 8.
7.2 Experiments on Synthetic Data
We first test the algorithms on synthetic data generated with the implementation of Pedregosa et al. (2011) of the algorithm that created the Madelon data set (Guyon, 2003). The sizes of the labeled training, unlabeled training, and test sets are 100, 900, and 100, respectively. We fix the number of classes to 3 and the number of features to 20, wherein 8 features are informative, 6 are linear combinations of the latter, and 6 are noise.
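This setup can be reproduced with scikit-learn's Madelon-style generator; the random seed and the row-shuffling step are assumptions added for a self-contained sketch:

```python
import numpy as np
from sklearn.datasets import make_classification

# Madelon-style data (Pedregosa et al., 2011): 3 classes, 20 features
# of which 8 are informative, 6 are linear combinations of those, and
# the remaining 6 are noise.
X, y = make_classification(n_samples=1100, n_classes=3, n_features=20,
                           n_informative=8, n_redundant=6, n_repeated=0,
                           shuffle=False,   # keep informative columns first
                           random_state=0)
rows = np.random.default_rng(0).permutation(len(X))  # shuffle samples only
X, y = X[rows], y[rows]

X_lab, y_lab = X[:100], y[:100]        # 100 labeled training examples
X_unl = X[100:1000]                    # 900 unlabeled training examples
X_test, y_test = X[1000:], y[1000:]    # 100 test examples
```

With `shuffle=False`, the columns stay ordered as informative, then redundant, then noise, which makes it easy to check which group each selected feature belongs to.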
For RLSR, SFS and SSLS, the number of selected features is set to 8. The feature selection results are averaged over 20 different splits. Figure 3 illustrates which features were selected by each method. Below, we report the results of feature selection with respect to the elimination of irrelevant and redundant variables.

Elimination of irrelevant variables: From Figure 3, it can be seen that the filter methods, SFS and SSLS, effectively eliminated the noise (variables 15-20), whereas RLSR coped with this task worst of all. Because in the CGA a new child inherits features from its parents purely at random, WIL-CGA does not completely discard the irrelevant variables and probably needs more generations to do so. Thanks to the feature relevance test, WIL-FSGA is as effective as the filter methods.

Elimination of redundant variables: Since the redundant variables (9-14) are linear combinations of the informative ones (1-8), they are also individually useful for classification. Because of that, the filter methods, SFS and SSLS, tend to under-select some informative features (e.g., 2, 3, 8). This might be caused by the individual weakness of these variables compared to the "strong" variables (e.g., 4, 5, 6) and their redundant counterparts.
In contrast to the filters, the wrapper approaches, WIL-CGA and WIL-FSGA, search for features that are jointly strong and are therefore less prone to missing informative variables. Finally, it can be clearly seen that RLSR is affected by noise, so it under-selects both the informative and the redundant variables.
Table 1: Characteristics of the data sets.

Data set | # lab. examples | # unlab. examples | # test examples | Dimension | # classes
Protein  | 97  | 875  | 108 | 77   | 8
Isolet   | 140 | 1264 | 156 | 617  | 26
Fashion  | 99  | 9801 | 100 | 784  | 10
MNIST    | 99  | 9801 | 100 | 784  | 10
Coil20   | 130 | 1166 | 144 | 1024 | 20
PCMAC    | 175 | 1574 | 194 | 3289 | 2
Gisette  | 69  | 6861 | 70  | 5000 | 2
Table 2: Classification accuracy (ACC ± standard deviation) on the unlabeled training set (first row for each data set) and on the test set (second row); the number in parentheses is the number of features used for learning.

Data set | RF              | RLSR           | SFS            | SSLS           | WIL-CGA         | WIL-FSGA
Protein  | .751±.024 (77)  | .726±.024 (19) | .712±.028 (19) | .685±.028 (19) | .755±.028 (42)  | .755±.031 (26)
         | .742±.043       | .721±.046      | .716±.049      | .673±.048      | .761±.047       | .750±.042
Isolet   | .817±.014 (617) | .822±.020 (73) | .672±.022 (73) | .666±.016 (73) | .842±.012 (319) | .822±.016 (55)
         | .817±.022       | .814±.029      | .659±.037      | .649±.036      | .849±.023       | .815±.032
Fashion  | .688±.014 (784) | .591±.016 (86) | .528±.034 (86) | .512±.031 (86) | .688±.018 (407) | .662±.022 (73)
         | .684±.038       | .590±.041      | .520±.054      | .504±.045      | .686±.030       | .658±.033
MNIST    | .774±.016 (784) | .210±.022 (86) | .110±.002 (86) | .446±.062 (86) | .825±.021 (413) | .782±.016 (78)
         | .776±.050       | .212±.030      | .109±.003      | .451±.052      | .832±.047       | .796±.039
Coil20   | .928±.012 (1024)| .922±.013 (102)| .810±.015 (102)| .813±.018 (102)| .941±.010 (518) | .937±.012 (58)
         | .926±.026       | .916±.025      | .816±.025      | .809±.023      | .935±.023       | .931±.025
PCMAC    | .815±.025 (3289)| .817±.021 (222)| .726±.047 (222)| .595±.057 (222)| .811±.021 (1656)| .818±.025 (57)
         | .829±.035       | .825±.038      | .727±.061      | .598±.066      | .825±.033       | .832±.036
Gisette  | .877±.013 (5000)| .669±.084 (293)| .877±.012 (293)| .615±.041 (293)| .879±.015 (2503)| .873±.016 (64)
         | .865±.042       | .683±.086      | .874±.035      | .614±.059      | .881±.038       | .873±.029
7.3 Experiments on Real Data Sets
In addition, we validate our approach on 7 publicly available data sets (Chang and Lin, 2011; Li et al., 2018). The associated applications are image recognition, with the MNIST, Fashion, Coil20 and Gisette data sets; text classification, with the PCMAC database; bioinformatics, with the Protein data set; and speech recognition, with the Isolet database.
The main characteristics of all data sets are summarized in Table 1. Since we are interested in the practical use of the algorithm, we test the algorithms under the condition that . For the MNIST and Fashion data sets, we consider subsets of 10000 observations. For RLSR, SFS and SSLS, we fix the number of selected features to . Table 2 summarizes the performance results and reports the number of features used for learning. These results show that:

The proposed approach compares well to the other methods. On the Isolet, MNIST and Coil20 data sets, the algorithm significantly improves performance over the supervised baseline RF by using unlabeled data and reducing the original dimension.

For all data sets, while being among the best in terms of performance, WIL-FSGA also drastically reduces the original dimension, which is especially notable for the data sets of larger dimension (Coil20, PCMAC, Gisette).

Since the number of selected features must be predefined for RLSR, SFS and SSLS, it is difficult to determine this number, especially in the semi-supervised context. This may lead to a significant drop in performance, as can be seen on the MNIST data set.
8 Conclusion
In this paper, we proposed a new semi-supervised framework for wrapper feature selection. To increase the diversity of labeled data, unlabeled examples are pseudo-labeled using a self-learning algorithm. We extended the bound to the case where these examples are given imperfect class labels. The objective of the proposed wrapper is to minimize this bound using a genetic algorithm. To produce a sparse solution, we proposed a modification of the latter that takes feature weights into account during its evolutionary process. We provided empirical evidence of the effectiveness of our framework in comparison with a supervised baseline, two semi-supervised filter techniques, and an embedded feature selection algorithm. The proposed modification of the genetic algorithm provides a trade-off for tasks where both high performance and low dimension are required.
References
 Amini et al. (2009) Amini, M., Usunier, N., and Laviolette, F. (2009). A transductive bound for the voted classifier with an application to semisupervised learning. In Advances in Neural Information Processing Systems 21, pages 65–72.
 Breiman (2001) Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
 Chandrashekar and Sahin (2014) Chandrashekar, G. and Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28.

 Chang and Lin (2011) Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27.
 Chen et al. (2017) Chen, X., Yuan, G., Nie, F., and Huang, J. Z. (2017). Semi-supervised feature selection via rescaled linear regression. In IJCAI, pages 1525–1531.
 Chittineni (1980) Chittineni, C. (1980). Learning with imperfectly labeled patterns. Pattern Recognition, 12(5):281–291.

 Feofanov et al. (2019) Feofanov, V., Devijver, E., and Amini, M.-R. (2019). Transductive bounds for the multi-class majority vote classifier. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3566–3573.
 Goldberg and Holland (1988) Goldberg, D. E. and Holland, J. H. (1988). Genetic algorithms and machine learning. Machine Learning, 3(2):95–99.

 Guyon (2003) Guyon, I. (2003). Design of experiments of the NIPS 2003 variable selection benchmark. In NIPS 2003 Workshop on Feature Extraction and Feature Selection.
 Guyon and Elisseeff (2003) Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182.
 Kohavi and John (1997) Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2):273–324.
 Koller and Sahami (1996) Koller, D. and Sahami, M. (1996). Toward optimal feature selection. In Proceedings of the Thirteenth International Conference on International Conference on Machine Learning, ICML’96, pages 284–292, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
 Krithara et al. (2008) Krithara, A., Amini, M.-R., and Goutte, C. (2008). Semi-supervised document classification with a mislabeling error model. In European Conference on Information Retrieval (ECIR'08), pages 370–381. Springer.
 Lacasse et al. (2007) Lacasse, A., Laviolette, F., Marchand, M., Germain, P., and Usunier, N. (2007). PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In Advances in Neural Information Processing Systems, pages 769–776.
 Laviolette et al. (2014) Laviolette, F., Morvant, E., Ralaivola, L., and Roy, J. (2014). On generalizing the C-bound to the multiclass and multilabel settings. In NIPS 2014 Workshop on Representation and Learning Methods for Complex Outputs, Montréal, Canada.
 Li et al. (2018) Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., and Liu, H. (2018). Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6):94.
 Mann and Whitney (1947) Mann, H. B. and Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1):50–60.
 Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

 Persello and Bruzzone (2016) Persello, C. and Bruzzone, L. (2016). Kernel-based domain-invariant feature selection in hyperspectral images for transfer learning. IEEE Transactions on Geoscience and Remote Sensing, 54(5):2615–2626.
 Ren et al. (2008) Ren, J., Qiu, Z., Fan, W., Cheng, H., and Yu, P. S. (2008). Forward semi-supervised feature selection. In Advances in Knowledge Discovery and Data Mining, pages 970–976. Springer, Berlin, Heidelberg.
 Sheikhpour et al. (2017) Sheikhpour, R., Sarram, M. A., Gharaghani, S., and Chahooki, M. A. Z. (2017). A survey on semi-supervised feature selection methods. Pattern Recognition, 64(C):141–158.
 Tür et al. (2005) Tür, G., HakkaniTür, D. Z., and Schapire, R. E. (2005). Combining active and semisupervised learning for spoken language understanding. Speech Communication, 45:171–186.
 Tuv et al. (2009) Tuv, E., Borisov, A., Runger, G., and Torkkola, K. (2009). Feature selection with ensembles, artificial variables, and redundancy elimination. J. Mach. Learn. Res., 10:1341–1366.
 Vittaut et al. (2002) Vittaut, J., Amini, M., and Gallinari, P. (2002). Learning classification with both labeled and unlabeled data. In 13th European Conference on Machine Learning (ECML’02), pages 468–479.
 Yang et al. (2010) Yang, M., Chen, Y.-J., and Ji, G.-L. (2010). Semi_Fisher score: A semi-supervised method for feature selection. In 2010 International Conference on Machine Learning and Cybernetics, volume 1, pages 527–532. IEEE.
 Zhao et al. (2008) Zhao, J., Lu, K., and He, X. (2008). Locality sensitive semi-supervised feature selection. Neurocomputing, 71(10–12):1842–1849.