In many applications, obtaining fully labeled data sets to train a classifier is difficult: labeling is usually expensive, time-consuming and dependent on human expertise, whereas collecting abundant unlabeled data is much easier Cohen et al. (2004). To leverage both labeled and unlabeled samples, semi-supervised learning (SSL) has been proposed to improve generalization ability Zhu (2005); Chapelle et al. (2009).
There are four broad categories of semi-supervised learning methods, i.e. generative methods, graph-based methods, low-density separation methods and disagreement-based methods Zhou and Li (2005), which are discussed in Section 2. These methods rest on certain assumptions about the labeled and unlabeled data, and these assumptions play an important role in semi-supervised learning. However, how to make the right assumptions for real data remains an open question.
Using the t-distributed stochastic neighbor embedding (t-SNE) tool Maaten and Hinton (2008), Fig. 1 visualizes two real data sets, cjs and analcat, both downloaded from the UCI repository. On the analcat data set, the transductive support vector machine (TSVM) Bennett and Demiriz (1999) achieves better test accuracy than the co-forest model Li and Zhou (2007), whereas co-forest performs better on the cjs data set. A likely reason is that analcat has a large margin between classes, so the low-density separation assumption behind TSVM is satisfied (see Fig. 1). For cjs, the class distribution is irregular, a situation in which tree-based methods work well. TSVM and tree-based models thus appear to have complementary properties, which motivates us to build an ensemble of heterogeneous classifiers for semi-supervised learning.
In this paper, we propose a new semi-supervised method, called co-training with optimal weight (CTOW). The contributions of our method are as follows.
We combine co-training with two strong heterogeneous classifiers, namely, Xgboost and TSVM, which have complementary properties and larger diversity.
The optimization problem of the weight for each classifier is established and we provide prior information of the margin density to help compute the weight of TSVM.
The proposed method works well for both large margin data (e.g., analcat data set) and irregular data (e.g., cjs data set).
Experiments are conducted on fourteen real tabular data sets. The results show that our method improves test accuracy with less computational time.
2 Related Work
Generative methods Miller and Uyar (1997) are simple and effective in semi-supervised learning. The assumption behind generative methods is that the data actually comes from a mixture model Zhu (2005). Graph-based methods Zhu et al. (2003) use a similarity matrix to construct a graph. Their main assumption is that the labels are smooth with respect to the graph. However, the efficiency of graph-based methods depends heavily on the size of the constructed graph.
Low-density separation methods assume that the classes are well separated, such that the decision boundary lies in a low-density region and does not cut through dense regions of unlabeled data. The best-known representative is the semi-supervised support vector machine (S3VM), also called TSVM. The optimization problem of TSVM is given as follows Bennett and Demiriz (1999)
where are the parameters that specify the orientation and the offset, respectively; is the slack variable; is the pseudo label to be optimized; and are the penalty constants; the set represents the labeled data, and the set represents the unlabeled data. Since (1)-(3) is a non-convex optimization problem, many researchers strived to solve it efficiently Chapelle et al. (2008).
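For orientation, a commonly used form of the TSVM problem that matches the description above can be written as follows. The symbol names ($w$, $b$, $\xi$, $\hat{y}$, $C_1$, $C_2$, $L$, $U$) are assumptions chosen for illustration and may differ from the notation of Bennett and Demiriz (1999).

```latex
\begin{aligned}
\min_{w,\,b,\,\hat{y},\,\xi}\quad
  & \tfrac{1}{2}\lVert w\rVert^{2}
    + C_1\sum_{i\in L}\xi_i
    + C_2\sum_{j\in U}\xi_j \\
\text{s.t.}\quad
  & y_i\,(w^{\top}x_i+b)\ \ge\ 1-\xi_i,\qquad \xi_i\ge 0,\qquad i\in L,\\
  & \hat{y}_j\,(w^{\top}x_j+b)\ \ge\ 1-\xi_j,\qquad \xi_j\ge 0,\qquad
    \hat{y}_j\in\{-1,+1\},\qquad j\in U.
\end{aligned}
```

In this form the non-convexity is visible in the second constraint, where the unknown pseudo labels $\hat{y}_j$ multiply the unknown hyperplane parameters.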
Disagreement-based methods Blum and Mitchell (1998); Zhou and Li (2010b) require an ensemble of multiple learners that collaborate and teach each other to exploit unlabeled data. Co-forest Li and Zhou (2007) is one of the best-known representatives; it extends the co-training paradigm Blum and Mitchell (1998) with a random forest consisting of many trees. Each decision tree is first initialized from the training sets, then randomly selected unlabeled examples are labeled with confidence, and finally majority voting is employed to obtain the pseudo labels. However, co-forest is based on an ensemble of weak classifiers. It is more desirable to exploit a heterogeneous ensemble of strong classifiers with complementary properties to improve the performance of a co-training method.
Many recent approaches to semi-supervised learning train a neural network with a consistency loss, which forces the model to produce consistent outputs when its inputs are perturbed; examples include pseudo-labeling Lee (2013), the Ladder network Rasmus et al. (2015), the Π-model Laine and Aila (2016), mean teacher Tarvainen and Valpola (2017), VAT Miyato et al. (2018), and MixMatch Berthelot et al. (2019). The consistency assumption works well for image data and, among others, video-oriented tasks. Nevertheless, a neural network with the consistency assumption may not reflect any specific inductive bias toward tabular data.
Gradient boosting decision trees Friedman (2001) and Xgboost Chen and Guestrin (2016) are so far the two most widely used models in Kaggle competitions. Xgboost is an additive tree ensemble model: its final output for an input is obtained by aggregating the outputs of its regression trees. The model parameters are learned by minimizing a loss function that includes a regularization term. As is well known, Xgboost works well for tabular data, but it uses only the labeled data and ignores the unlabeled data.
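As a reminder of the additive model just described, the prediction and regularized objective of Xgboost can be written in the notation of Chen and Guestrin (2016). The symbols below ($K$, $f_k$, $l$, $\Omega$, $T$, $\gamma$, $\lambda$) follow that reference and are not necessarily the symbols used elsewhere in this paper.

```latex
\hat{y}_i \;=\; \sum_{k=1}^{K} f_k(x_i),
\qquad
\mathcal{L} \;=\; \sum_{i} l\bigl(\hat{y}_i,\, y_i\bigr) \;+\; \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) \;=\; \gamma\,T \;+\; \tfrac{1}{2}\,\lambda\,\lVert w\rVert^{2}
```

Here each $f_k$ is a regression tree, $T$ is its number of leaves and $w$ its vector of leaf scores. Only labeled pairs $(x_i, y_i)$ enter the loss, which is why the model cannot use unlabeled data on its own.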
Our paper centers on semi-supervised learning for tabular data. The above discussion of Xgboost inspires us to incorporate an Xgboost classifier into semi-supervised learning. In Mallapragada et al. (2008), the authors proposed a method called semi-boost, which combines a similarity matrix with boosting methods to obtain more accurate pseudo labels. However, semi-boost is computationally expensive for large data sets. Thus, our paper uses a disagreement-based method built on the ensemble of Xgboost and TSVM to surpass state-of-the-art performance in semi-supervised learning.
3 The Proposed Approach
In this paper, we consider semi-supervised classification problems. The training set consists of a labeled data set with labeled examples and unlabeled examples , with . Assume that the data has classes. We use the training set to construct a learner that classifies unseen instances. To this end, we propose a new semi-supervised learning method, CTOW, which combines Xgboost with TSVM. The reason for selecting Xgboost and TSVM as base learners is explained in Subsection 3.1. The architecture of our method is shown in Fig. 2, and the detailed techniques are presented in the following subsections.
3.1 Select Base Learners
Two criteria guide the selection of base learners for semi-supervised learning. One is the diversity of the learners, which is deemed to be a key to a good ensemble Zhou and Li (2010a). The other is the accuracy of the learners, which helps to find better pseudo-labeled data and further improve the classification accuracy.
To select base learners that maintain a large diversity, we need a criterion to measure diversity. The correlation coefficient between two classifier outputs (correct/incorrect) is a simple and efficient diversity measure Kuncheva and Whitaker (2003). It is formulated as follows

$$\rho = \frac{N^{11}N^{00} - N^{01}N^{10}}{\sqrt{(N^{11}+N^{10})\,(N^{01}+N^{00})\,(N^{11}+N^{01})\,(N^{10}+N^{00})}} \qquad (4)$$

where $N^{11}$, $N^{10}$, $N^{01}$ and $N^{00}$ are defined in Table 1.

| |Classifier one correct|Classifier one wrong|
|---|---|---|
|Classifier two correct|$N^{11}$|$N^{01}$|
|Classifier two wrong|$N^{10}$|$N^{00}$|
According to (4), we calculate the correlation coefficient between two different decision tree classifiers on different real data sets. In addition, the correlation coefficient between TSVM and a decision tree classifier is given. We also show the correlation coefficient between TSVM and Xgboost.
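The diversity measure can be sketched in a few lines. The sketch assumes the standard correlation coefficient of Kuncheva and Whitaker (2003), computed from the four joint counts of correct and incorrect predictions; the function names and the handling of a zero denominator are illustrative choices, not part of the paper.

```python
import math


def contingency_counts(correct_a, correct_b):
    """Count joint outcomes of two classifiers.

    Inputs are equal-length sequences of booleans, True meaning the
    sample was classified correctly by that classifier.
    """
    n11 = n10 = n01 = n00 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1      # both correct
        elif a:
            n10 += 1      # only the first classifier correct
        elif b:
            n01 += 1      # only the second classifier correct
        else:
            n00 += 1      # both wrong
    return n11, n10, n01, n00


def correlation_coefficient(correct_a, correct_b):
    """Correlation between two correct/incorrect output vectors."""
    n11, n10, n01, n00 = contingency_counts(correct_a, correct_b)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Undefined when a classifier is always right or always wrong;
        # returning 0.0 here is an assumption made for this sketch.
        return 0.0
    return (n11 * n00 - n01 * n10) / denom
```

A value close to 1 means the two classifiers succeed and fail on the same samples, while a smaller value indicates larger diversity.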
Fig. 3 (a) presents the diversity of two different classifiers. The smaller the correlation coefficient, the larger the diversity. In Li and Zhou (2007), the authors proposed the Co-forest method, which uses different decision trees as base learners, but we see that the correlation coefficient between two different decision tree classifiers is higher than that between two heterogeneous classifiers. In particular, and have much smaller values for the analcat and cjs data sets, which are visualized in Fig. 1; this shows that Xgboost and TSVM have complementary properties.
Fig. 3 (a) also shows is bigger than , which means the diversity between TSVM and a decision tree classifier is larger than that between TSVM and Xgboost. Fig. 3 (b), however, shows that Xgboost always performs better than a decision tree classifier. This makes sense because Xgboost is an additive tree ensemble model. Compared with Xgboost, TSVM, which is based on the large margin principle, can obtain higher accuracy on some real data sets, such as analcat. Thus, TSVM and Xgboost are selected as base learners based on a tradeoff between diversity and accuracy.
The new method, CTOW, comprises classifiers. Assume the -th classifier is TSVM, and the others are Xgboost. To make CTOW perform better on real commercial data, we choose several Xgboost classifiers rather than a single one as base learners. To further improve the accuracy and diversity of the classifiers, each Xgboost is initialized from a different training set bootstrapped from the labeled set . For TSVM, the whole training set is used. Thus, we can train the classifiers simultaneously on the different data sets, which yields the prediction probability of the unlabeled data, where , , and is the parameter for the -th classifier.
3.3 Optimally Combining Classifiers
Each classifier sends its prediction probability of the unlabeled data to the center, then the unlabeled data will be labeled by the optimal weight ensemble. Suppose is the weight of the -th classifier. The probability of the pseudo label is denoted as where is a vector with .
The optimal weight is obtained by solving the following optimization problem
where means the prediction probability of the -th unlabeled data from the -th class. Recall that the -th classifier is TSVM. We can give it a prior weight according to (7) with the following details. First, we introduce a margin density metric for TSVM as follows Sethi and Kantardzic (2017):
where is the slack variable mentioned in (1), and the numerator counts the number of training samples that fall in the margin of the TSVM. The variable is a threshold, and the goal of the function is to give a smaller weight to TSVM if ; otherwise it yields a bigger weight. In our experiment, the function is given as follows
where the threshold is specified in Section 4.1.
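The margin density and the thresholded prior weight can be illustrated with a short sketch. It assumes that a sample counts as lying in the margin when its decision value satisfies |f(x)| < 1, which for pseudo-labeled samples is the same as a positive slack. The step-shaped prior and the names `tsvm_prior_weight`, `low_weight` and `high_weight` are hypothetical, since the exact function used in the paper is not reproduced here.

```python
def margin_density(decision_values):
    """Fraction of samples whose decision value falls inside the margin.

    decision_values: signed distances f(x) = w.x + b, scaled so that the
    margin boundaries sit at -1 and +1 (the usual SVM convention).
    """
    if not decision_values:
        raise ValueError("decision_values must not be empty")
    inside = sum(1 for f in decision_values if abs(f) < 1.0)
    return inside / len(decision_values)


def tsvm_prior_weight(md, threshold, low_weight, high_weight):
    """Hypothetical step prior: a dense margin lowers the TSVM weight.

    threshold, low_weight and high_weight are placeholders chosen by the
    user; they are not the values used in the paper.
    """
    return low_weight if md > threshold else high_weight
```

For example, `margin_density([0.2, -0.5, 1.5, -2.0])` is 0.5 because two of the four samples lie strictly inside the margin.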
The objective function in (5) contains the entropy of predicted probability and a regularization term . We want to enforce the ensemble classifier to provide low-entropy predictions on the unlabeled data. In addition, a regularization term is introduced to avoid overfitting to one classifier. When , the solution of the optimization problem (5)-(7) tends to give the same weight to all classifiers. The following example shows the relationship between (5)-(7) and the majority voting rule.
Example 1. Consider a binary classification problem. There are three classifiers to predict one instance with probability
Thus, we have
If we only minimize the entropy of , then there are infinitely many optimal solutions such as or , where is an arbitrary constant. However, if we add the regularization term , then a unique solution can be obtained with , which is equivalent to the result of applying the majority voting rule.
where the set is the linear constraint, i.e., , is the learning rate, is the gradient of the objective function in (5) evaluated at , and is Euclidean projection of onto . In this paper, we use an efficient algorithm proposed by Duchi et al. (2008) to numerically perform the projection in (10).
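The weight optimization can be sketched as projected gradient descent. The sketch makes several assumptions that are not confirmed by the text: the objective is taken to be the mean Shannon entropy of the ensemble prediction plus `lam` times the squared norm of the weights, the feasible set is taken to be the probability simplex, and the prior constraint on the TSVM weight is left out. The projection uses the sort-based method described by Duchi et al. (2008); all function names are illustrative.

```python
import math


def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    theta, cumulative = 0.0, 0.0
    for k, u_k in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # threshold of the last index meeting the condition
    return [max(x - theta, 0.0) for x in v]


def ensemble_probs(probs, w):
    """probs[n][i][c]: classifier n, unlabeled sample i, class c."""
    n_samples, n_classes = len(probs[0]), len(probs[0][0])
    return [[sum(w_n * p[i][c] for w_n, p in zip(w, probs))
             for c in range(n_classes)] for i in range(n_samples)]


def gradient(probs, w, lam, eps=1e-12):
    """Gradient of mean entropy + lam * ||w||^2 with respect to w."""
    p = ensemble_probs(probs, w)
    grad = []
    for n, p_n in enumerate(probs):
        g = -sum(p_n[i][c] * (math.log(max(p[i][c], eps)) + 1.0)
                 for i in range(len(p)) for c in range(len(p[0]))) / len(p)
        grad.append(g + 2.0 * lam * w[n])
    return grad


def optimal_weights(probs, lam, lr, n_iter=500):
    """Projected gradient descent started from uniform weights."""
    w = [1.0 / len(probs)] * len(probs)
    for _ in range(n_iter):
        g = gradient(probs, w, lam)
        w = project_simplex([w_n - lr * g_n for w_n, g_n in zip(w, g)])
    return w


# Toy data: three classifiers, two unlabeled samples, two classes.
probs = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.6, 0.4], [0.4, 0.6]],
    [[0.7, 0.3], [0.3, 0.7]],
]
w_low = optimal_weights(probs, lam=0.0, lr=0.05)     # entropy term only
w_high = optimal_weights(probs, lam=50.0, lr=0.005)  # strong regularization
```

On this toy input, the unregularized run concentrates the weight on the most confident classifier, while the strongly regularized run stays close to equal weights, which is the behavior the regularization term is meant to produce. The learning rate has to shrink as `lam` grows to keep the iteration stable.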
3.4 Diversity Augmentation
In this subsection, we use the following steps to further maintain diversity among the base learners.
Firstly, we utilize bootstrap sampling to select a different subset of the unlabeled data for the -th classifier, where . Alternatively, can also be subsampled with the co-forest method proposed in Li and Zhou (2007), which can reduce the influence of misclassifying an unlabeled sample.
Secondly, inspired by the idea of the co-forest method, the probability of the unlabeled data is obtained as follows
where is the optimal solution of problem (5)-(7), and means the prediction probability of the -th unlabeled data from the -th class for the -th classifier. If is the set containing all classifiers, the formulation in (11) means that all component classifiers in other than are used to determine the most confident unlabeled examples for the -th classifier. To filter out unconfident pseudo labels, we select the unlabeled data from whose maximum probability is bigger than a threshold , i.e. .
Finally, we get the reliable pseudo label for the -th classifier as follows
We then combine the pseudo labeled data and labeled data to train the -th classifier again. The framework of the whole training process is shown in Algorithm 1. Based on the output of Algorithm 1, we can predict the test data and calculate the test accuracy.
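The selection of confident pseudo labels for one classifier can be sketched as follows. The sketch assumes that the probabilities of the other classifiers are averaged with their optimal weights renormalized to sum to one, which is one plausible reading of (11) and not a confirmed detail. The names `leave_one_out_probs`, `confident_pseudo_labels` and `theta` are illustrative.

```python
def leave_one_out_probs(probs, weights, n):
    """Weighted average of all classifiers except classifier n.

    probs[m][i][c]: probability from classifier m for unlabeled sample i
    and class c. Weights of the remaining classifiers are renormalized
    (an assumption of this sketch).
    """
    others = [m for m in range(len(probs)) if m != n]
    total = sum(weights[m] for m in others)
    if total <= 0:
        raise ValueError("remaining weights must have a positive sum")
    n_samples, n_classes = len(probs[0]), len(probs[0][0])
    return [[sum(weights[m] * probs[m][i][c] for m in others) / total
             for c in range(n_classes)] for i in range(n_samples)]


def confident_pseudo_labels(probs, weights, n, theta, candidates=None):
    """Return {sample index: pseudo label} for confident samples only.

    candidates: optional indices of the unlabeled subset drawn for
    classifier n (for example by bootstrap sampling); all samples are
    considered when it is None.
    """
    combined = leave_one_out_probs(probs, weights, n)
    indices = range(len(combined)) if candidates is None else candidates
    selected = {}
    for i in indices:
        best = max(range(len(combined[i])), key=lambda c: combined[i][c])
        if combined[i][best] > theta:  # keep only confident predictions
            selected[i] = best
    return selected
```

The returned pseudo-labeled samples would then be merged with the labeled set before classifier `n` is retrained, so that each classifier is taught by its peers and not by its own predictions.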
For comparison, we also evaluate five state-of-the-art semi-supervised algorithms: Graph-SVM (GSVM) Belkin et al. (2006), GMM Zhu (2005), Ladder network Rasmus et al. (2015), Co-forest Li and Zhou (2007), and TSVM Bennett and Demiriz (1999).
We conduct experiments on 14 data sets from the UCI machine learning repository. Additionally, we test our method on a real commercial data set that contains around 50 thousand samples with 62 features. Descriptions of the experimental data sets are given in Table 3 (see Appendix).
4.1 Implementation details
For each data set, fivefold cross validation is employed for evaluation. For each fold, we split the training data in a stratified fashion to obtain a labeled data set and an unlabeled set for a given label rate . In our simulation, we set , which means that splitting the training set will produce a set with labeled examples and a set with unlabeled examples.
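The stratified split into labeled and unlabeled parts can be sketched as below. The label rate value, the seed and the rule of keeping at least one labeled example per class are assumptions made for the sketch, and `stratified_label_split` is an illustrative name.

```python
import random
from collections import defaultdict


def stratified_label_split(labels, label_rate, seed=0):
    """Split sample indices into labeled and unlabeled parts per class.

    labels: class label of every training sample.
    label_rate: fraction of each class that keeps its label.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    labeled, unlabeled = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Keep at least one labeled example per class (assumption).
        k = max(1, round(label_rate * len(idxs)))
        labeled.extend(idxs[:k])
        unlabeled.extend(idxs[k:])
    return sorted(labeled), sorted(unlabeled)


# Example with a hypothetical label rate of 0.1 on 30 samples.
labels = [0] * 10 + [1] * 20
labeled_idx, unlabeled_idx = stratified_label_split(labels, label_rate=0.1)
```

Splitting per class keeps the class proportions of the labeled part close to those of the full training fold, which matters when only a small fraction of the data is labeled.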
The proposed method, CTOW, adaptively combines the classification results of Xgboost and TSVM. In our simulation, we use four classifiers: three Xgboost and one TSVM. In addition, we set threshold , iterations and penalty parameter .
Firstly, let us show the relationship between margin density (introduced in (8)) and the accuracy of both TSVM and Xgboost.
Fig. 4 (a) shows that when is smaller, the accuracy of TSVM tends to be higher. The bar with negative value in Fig. 4 (b) means that Xgboost performs better than TSVM, which happens when becomes larger (see cjs data). Then, can be used as the prior information for providing the weight of TSVM. Specifically, if is bigger than a threshold , we give a smaller weight to TSVM, or a larger weight otherwise. According to the above discussion and observation from Fig. 4, we set in our Algorithm 1, and use the function denoted in (7) to calculate the weight of TSVM. Based on the above setup, we run Algorithm 1 to show the results in the next subsection.
Firstly, the visualization of some data sets is presented in Fig. 1 and Fig. 5, which helps us to analyze the suitable application for these methods. Secondly, we compare the proposed method with some other semi-supervised learning methods, which shows that our method yields the best performance in many real data sets.
Table 2 shows the test accuracy of the different algorithms. From the second column, we see that GMM fails to work well on various data sets. The reason is that GMM assumes the data comes from a Gaussian mixture distribution, which is a strong assumption for real data sets. Table 2 also shows that Graph-SVM and TSVM have similar performance, because both are based on SVM. Nevertheless, Graph-SVM needs to compute the similarity matrix, which makes it very slow (see Fig 8(a)). Thus, we do not report the results of GMM and Graph-SVM when the data size is large.
In Fig. 5, the synthetic data has a large margin, so TSVM performs better than Co-forest (see Table 2). Conversely, the classes of the commercial and cjs data sets overlap and the data distribution is irregular, so the tree-based co-forest model obtains better performance than TSVM. Ladder network achieves the best performance on the hill and texture data sets: the original features have homogeneous attributes and do not represent the label very well, whereas the neural network can learn a better feature space in this case. Our method combines the advantages of Xgboost and TSVM. Table 2 shows that the proposed method achieves the best performance on many real data sets, especially large and high-dimensional ones such as the gas-drift and commercial data sets (see Table 3).
5 Conclusions and Future Works
In this paper, we propose a new method, CTOW, for semi-supervised learning, which applies the optimal ensemble of two heterogeneous classifiers, namely Xgboost and TSVM. The unlabeled data is exploited through model initialization, the optimal weight ensemble problem, and diversity augmentation. Experiments on various real data sets demonstrate that our method is superior to state-of-the-art methods.
From the simulations, we find that Xgboost and TSVM have complementary properties and larger diversity. The reason for this phenomenon should be studied in the future.
Although SSL is an old topic, there has been renewed interest in it, which is reflected in both academic and industrial research. Many SSL methods rest on certain assumptions about the labeled and unlabeled data, and these assumptions play an important role in semi-supervised learning. However, how to make the right assumptions for real data remains an open question.
Our research finds that Xgboost and TSVM have complementary properties and larger diversity. We therefore proposed a new SSL method called CTOW by applying the optimal ensemble of two heterogeneous classifiers, Xgboost and TSVM. As a result, the proposed CTOW enjoys the advantages of both Xgboost and TSVM and depends only weakly on the distribution of the training data.
Our research could also be used to provide explanations for Xgboost and TSVM in their applications, as well as to reduce the cost of labeling. The proposed method may fail when the labeled data is so scarce that both Xgboost and TSVM perform poorly.
References
- Bennett and Demiriz (1999) Semi-supervised support vector machines. In Advances in Neural Information Processing Systems, pp. 368–374.
- Berthelot et al. (2019) Mixmatch: a holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249.
- Blum and Mitchell (1998) Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100.
- Chapelle et al. (2009) Semi-supervised learning. IEEE Transactions on Neural Networks 20 (3), pp. 542–542.
- Chapelle et al. (2008) Optimization techniques for semi-supervised support vector machines. Journal of Machine Learning Research 9 (Feb), pp. 203–233.
- Chen and Guestrin (2016) Xgboost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
- Cohen et al. (2004) Semisupervised learning of classifiers: theory, algorithms, and their application to human-computer interaction. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (12), pp. 1553–1566.
- Duchi et al. (2008) Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pp. 272–279.
- Friedman (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232.
- Kuncheva and Whitaker (2003) Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning 51 (2), pp. 181–207.
- Laine and Aila (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
- Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3, pp. 2.
- Li and Zhou (2007) Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 37 (6), pp. 1088–1098.
- Maaten and Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605.
- Mallapragada et al. (2008) Semiboost: boosting for semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (11), pp. 2000–2014.
- Belkin et al. (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7, pp. 2399–2434.
- Miller and Uyar (1997) A mixture of experts classifier with learning based on both labelled and unlabelled data. In Advances in Neural Information Processing Systems, pp. 571–577.
- Miyato et al. (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 1979–1993.
- Rasmus et al. (2015) Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554.
- Sethi and Kantardzic (2017) On the reliable detection of concept drift from streaming unlabeled data. Expert Systems with Applications 82, pp. 77–99.
- Tarvainen and Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pp. 1195–1204.
- Zhou and Li (2010a) Semi-supervised learning by disagreement. Knowledge and Information Systems 24 (3), pp. 415–439.
- Zhou and Li (2005) Tri-training: exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge & Data Engineering (11), pp. 1529–1541.
- Zhou and Li (2010b) Semi-supervised learning by disagreement. Knowledge and Information Systems 24 (3), pp. 415–439.
- Zhu et al. (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning, pp. 912–919.
- Zhu (2005) Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
6.1 Data sets
Detailed information of the experimental data sets is shown in Table 3.
Visualizations of some other real data sets are presented in Fig. 6, which helps us to identify suitable applications for these methods.
6.3 Performance Analysis
Fig. 7 (a) presents the average accuracy of the different algorithms over the 14 data sets; the proposed method, CTOW, improves accuracy compared with the other methods. It is well known that no single algorithm always beats the others, but Fig. 7 (b) counts the number of times each algorithm achieves the highest test accuracy. It shows that our proposed method achieves the best performance on half of these real data sets, which include both large-margin and irregular data.
Fig. 8 (b) presents the average accuracy of the different algorithms over the 14 data sets for different label rates ; the proposed method, CTOW, improves accuracy compared with the other methods for label rate . As the label rate increases, our method performs better than Co-forest, Ladder network and TSVM.
6.4 Ablation Study
Since our method combines several semi-supervised learning methods, we present an ablation study and discuss the effect of removing some components in order to provide additional insight into the proposed method. Specifically, we measure the performance of CTOW without the prior information of margin density in (7), which is denoted as CTOW-NP. We also remove the TSVM classifier and use only co-training with Xgboost; this variant is called CTOW-NT.
Table 4 summarizes our ablation results. It shows that only using Xgboost or TSVM degrades the classifier’s performance. Meanwhile, correct prior information of the margin density can help us obtain improved performance.