1 Introduction
In many applications, obtaining a fully labeled data set to train a classifier is difficult: labeling is usually expensive, time consuming and subject to human expertise, yet collecting abundant unlabeled data is much easier Cohen et al. (2004). To leverage both labeled and unlabeled samples, semi-supervised learning (SSL) has been proposed to improve generalization ability Zhu (2005); Chapelle et al. (2009).
There are four broad categories of semi-supervised learning methods, namely generative methods, graph-based methods, low-density separation methods and disagreement-based methods Zhou and Li (2005), to be discussed in Section 2. These methods rest on certain assumptions about the labeled and unlabeled data, which play an important role in semi-supervised learning. However, how to make the right assumptions for real data remains an open question.
Using the t-distributed stochastic neighbor embedding (t-SNE) tool Maaten and Hinton (2008), Fig. 1 visualizes two real data sets, namely the cjs and analcat data sets, downloaded from the UCI repository. The test accuracy of the transductive support vector machine (TSVM) Bennett and Demiriz (1999) on the analcat data set is better than that of the coforest model Li and Zhou (2007), but coforest performs better on the cjs data. The reason may be that analcat has a large margin between classes, so that the low-density separation assumption of TSVM is satisfied (see Fig. 1). For the cjs data, the class distribution is irregular, and tree-based methods work well in this situation. Seemingly, TSVM and tree-based models have complementary properties. This motivates us to ensemble heterogeneous classifiers for semi-supervised learning.
In this paper, we propose a new semi-supervised method, called co-training with optimal weight (CTOW). The contributions of our method are as follows.

We combine co-training with two strong heterogeneous classifiers, namely Xgboost and TSVM, which have complementary properties and large diversity.

We formulate an optimization problem for the weight of each classifier and use prior information from the margin density to help compute the weight of TSVM.

The proposed method works well for both large margin data (e.g., analcat data set) and irregular data (e.g., cjs data set).
Experiments are conducted on fourteen real tabular data sets. The results show that our method improves test accuracy with less computational time.
2 Related Work
Generative methods Miller and Uyar (1997) are simple and effective in semi-supervised learning. They assume that the data actually come from a mixture model Zhu (2005). Graph-based methods Zhu et al. (2003) use a similarity matrix to construct a graph; their main assumption is that the labels are smooth with respect to the graph. However, the efficiency of graph-based methods depends heavily on the size of the constructed graph.
Low-density separation methods assume that the classes are well separated, such that the decision boundary lies in a low-density region and does not cut through dense unlabeled regions. The most famous representative is the semi-supervised support vector machine (S3VM), also called TSVM. The optimization problem of TSVM is given as follows Bennett and Demiriz (1999):

$$\min_{w,b,\xi,\hat{y}} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\xi_i + C^{*}\sum_{j=l+1}^{l+u}\xi_j \quad (1)$$

$$\text{s.t.}\quad y_i(w^{\top}x_i + b) \ge 1-\xi_i,\ \xi_i \ge 0,\quad i=1,\dots,l, \quad (2)$$

$$\hat{y}_j(w^{\top}x_j + b) \ge 1-\xi_j,\ \xi_j \ge 0,\quad j=l+1,\dots,l+u, \quad (3)$$

where $w$ and $b$ are the parameters that specify the orientation and the offset, respectively; $\xi$ is the slack variable; $\hat{y}_j \in \{-1,1\}$ is the pseudo label to be optimized; $C$ and $C^{*}$ are the penalty constants; the set $\{(x_i, y_i)\}_{i=1}^{l}$ represents the labeled data, and the set $\{x_j\}_{j=l+1}^{l+u}$ represents the unlabeled data. Since (1)-(3) is a non-convex optimization problem, many researchers have strived to solve it efficiently Chapelle et al. (2008).
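As a concrete illustration, the objective (1) can be evaluated for a fixed linear classifier by reducing constraints (2)-(3) to hinge losses: for an unlabeled point, the optimal pseudo label in (3) is the sign of the decision value, so its slack reduces to $\max(0, 1-|f(x)|)$. The sketch below uses toy data and illustrative penalty constants, not values from the paper:

```python
# Sketch: evaluating the TSVM objective (1) for a fixed (w, b).
# Data and penalty constants are illustrative placeholders.

def decision(w, b, x):
    """Linear decision function f(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def tsvm_objective(w, b, labeled, unlabeled, C=1.0, C_star=0.1):
    """0.5*||w||^2 + C * labeled hinge + C* * unlabeled symmetric hinge.

    For an unlabeled point the optimal pseudo label is sign(f(x)),
    so its slack in (3) becomes max(0, 1 - |f(x)|).
    """
    reg = 0.5 * sum(wi * wi for wi in w)
    loss_l = sum(max(0.0, 1.0 - y * decision(w, b, x)) for x, y in labeled)
    loss_u = sum(max(0.0, 1.0 - abs(decision(w, b, x))) for x in unlabeled)
    return reg + C * loss_l + C_star * loss_u

# Toy data: two well-classified labeled points, one unlabeled point
# sitting near the decision boundary (inside the margin).
labeled = [([2.0, 0.0], 1), ([-2.0, 0.0], -1)]
unlabeled = [[0.1, 0.0]]
obj = tsvm_objective([1.0, 0.0], 0.0, labeled, unlabeled)
```

The unlabeled term is what makes the problem non-convex: it rewards pushing the boundary away from unlabeled points, which is exactly the low-density separation assumption.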
Disagreement-based methods Blum and Mitchell (1998); Zhou and Li (2010b) require an ensemble of multiple learners and let them collaborate and teach each other to exploit unlabeled data. Coforest Li and Zhou (2007) is one of the most famous representative methods; it extends the co-training paradigm Blum and Mitchell (1998) with a random forest consisting of many trees. Each decision tree is first initiated from a bootstrapped training set, then unlabeled examples are selected and labeled confidently, and finally majority voting is employed to obtain the pseudo labels. However, coforest is based on an ensemble of weak classifiers. It is more desirable to exploit a heterogeneous ensemble of strong classifiers with complementary properties to improve the performance of a co-training method.
Many recent approaches to semi-supervised learning advocate training a neural network with a consistency loss, which forces the model to generate consistent outputs when its inputs are perturbed; examples include pseudo-labeling Lee (2013), the Ladder network Rasmus et al. (2015), the Π model Laine and Aila (2016), the mean teacher Tarvainen and Valpola (2017), VAT Miyato et al. (2018), and Mixmatch Berthelot et al. (2019). The consistency assumption works well for image data and video-oriented tasks. Nevertheless, a neural network with the consistency assumption may not reflect any specific inductive bias toward tabular data. So far, gradient boosting decision trees
Friedman (2001) and Xgboost Chen and Guestrin (2016) are the two most widely used models in Kaggle competitions. Xgboost is an additive tree ensemble model aggregating the outputs of trees according to $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$, where each $f_k$ is a regression tree and $\hat{y}_i$ is the final output for the input $x_i$. To learn the model parameters, it minimizes the following loss function,

$$\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),$$

where $\Omega$ is a regularization term. As is well known, Xgboost works well for tabular data, but it uses only the labeled data, not the unlabeled data. Our paper centers on semi-supervised learning for tabular data. The above discussion of Xgboost inspires us to incorporate the Xgboost classifier into semi-supervised learning. In Mallapragada et al. (2008), the authors proposed a method called semiboost, which combines a similarity matrix with boosting methods to obtain more accurate pseudo labels. However, semiboost is computationally expensive for large data sets. Thus, our paper uses a disagreement-based method, the ensemble of Xgboost and TSVM, to surpass state-of-the-art performance in semi-supervised learning.
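The additive form $\hat{y}_i = \sum_k f_k(x_i)$ can be illustrated with a toy sketch in which each "tree" is a hand-written stump standing in for a learned regression tree; real Xgboost fits these greedily against the regularized loss above, so the stumps and their outputs here are purely illustrative:

```python
# Toy sketch of the additive tree ensemble: the prediction is the sum
# of the outputs of all trees.  Each stump below is hand-written, not
# learned, and only illustrates the aggregation rule.

def stump(feature, threshold, left, right):
    """A depth-1 regression tree represented as a closure."""
    return lambda x: left if x[feature] < threshold else right

trees = [
    stump(0, 0.5, -0.3, 0.4),   # first boosting round
    stump(1, 1.0, 0.1, -0.2),   # second round refines the residual
]

def predict(trees, x):
    """Aggregate the outputs of all trees: y_hat = sum_k f_k(x)."""
    return sum(t(x) for t in trees)

score = predict(trees, [0.8, 0.2])
```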
3 The Proposed Approach
In this paper, we consider semi-supervised classification problems. The training set consists of a labeled set $L = \{(x_i, y_i)\}_{i=1}^{l}$ with $l$ labeled examples and an unlabeled set $U = \{x_j\}_{j=l+1}^{l+u}$ with $u$ unlabeled examples, where $l \ll u$. Assume that the data has $K$ classes. We attempt to utilize the training set to construct a learner that classifies unseen instances. To this end, we propose a new semi-supervised learning method, CTOW, which combines Xgboost with TSVM. The reason for selecting Xgboost and TSVM as base learners is explained in Subsection 3.1. The architecture of our method is shown in Fig. 2, and the detailed techniques are presented in the following subsections.
3.1 Select Base Learners
There are two criteria for selecting the base learners in semi-supervised learning. One is the diversity of the learners, which is deemed a key to a good ensemble Zhou and Li (2010a). The other is the accuracy of the learners, which helps us find better pseudo-labeled data and further improve the classification accuracy.
To select base learners that maintain a large diversity, we need a criterion to measure diversity. The correlation coefficient is a simple and efficient measure: it is the correlation between the correct/incorrect outputs of two classifiers Kuncheva and Whitaker (2003). It is formulated as follows:

$$\rho = \frac{N^{11}N^{00} - N^{01}N^{10}}{\sqrt{(N^{11}+N^{10})(N^{01}+N^{00})(N^{11}+N^{01})(N^{10}+N^{00})}}, \quad (4)$$

where $N^{11}$, $N^{10}$, $N^{01}$ and $N^{00}$ are defined in Table 1.
                        Classifier one correct  Classifier one wrong
Classifier two correct  N^{11}                  N^{01}
Classifier two wrong    N^{10}                  N^{00}
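The coefficient in (4) can be computed directly from the correct/incorrect output vectors of two classifiers on a common evaluation set; the sketch below uses illustrative outputs:

```python
import math

# Pairwise correlation coefficient (4) between two classifiers'
# correct/incorrect outputs.  The boolean vectors are illustrative.

def correlation(correct_a, correct_b):
    """rho from the contingency counts of Table 1.

    correct_a, correct_b: booleans (True = correct) for two
    classifiers evaluated on the same samples.
    """
    n11 = sum(a and b for a, b in zip(correct_a, correct_b))
    n00 = sum((not a) and (not b) for a, b in zip(correct_a, correct_b))
    n10 = sum(a and (not b) for a, b in zip(correct_a, correct_b))
    n01 = sum((not a) and b for a, b in zip(correct_a, correct_b))
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n01 * n10) / denom

# Uncorrelated outputs give rho = 0; identical outputs give rho = 1.
a = [True, True, False, False]
b = [True, False, True, False]
rho = correlation(a, b)
```

A smaller (or negative) $\rho$ indicates larger diversity, which is exactly the property sought when pairing heterogeneous base learners.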
According to (4), we calculate the correlation coefficient between two different decision tree classifiers on several real data sets. In addition, we give the correlation coefficient between TSVM and a decision tree classifier, and between TSVM and Xgboost.
Fig. 3 (a) presents the diversity of pairs of classifiers: the smaller the correlation coefficient, the larger the diversity. In Li and Zhou (2007), the authors proposed the coforest method, which uses different decision trees as base learners, but we see that the correlation coefficient between two different decision tree classifiers is higher than that between two heterogeneous classifiers. In particular, the correlation coefficients between the heterogeneous classifiers are much smaller for the analcat and cjs data sets, which were visualized in Fig. 1; this shows that Xgboost and TSVM have complementary properties.
Fig. 3 (a) also shows that the correlation coefficient between TSVM and Xgboost is bigger than that between TSVM and the decision tree classifier, which means the diversity between TSVM and the decision tree classifier is larger than that between TSVM and Xgboost. From Fig. 3 (b), however, Xgboost always performs better than the decision tree classifier. This makes sense because Xgboost is an additive tree ensemble model. Compared with Xgboost, TSVM, based on the large margin, can obtain higher accuracy on some real data sets, such as analcat. Thus, TSVM and Xgboost are selected as base learners as a trade-off between diversity and accuracy.
3.2 Initialization
The new method, CTOW, comprises $M$ classifiers. Assume the $M$-th classifier is TSVM and the others are Xgboost. To make CTOW perform better on real commercial data, we choose several Xgboost classifiers rather than one as base learners. To obtain accurate and diverse classifiers, each Xgboost is initiated from a different training set bootstrapped from the labeled set $L$. For TSVM, the whole training set is used. Thus, we can train the $M$ classifiers simultaneously on the different data sets, which yields the prediction probabilities $p^{m}_{jk}$ of the unlabeled data, where $m = 1, \dots, M$, $j = l+1, \dots, l+u$, $k = 1, \dots, K$, and $\theta_m$ denotes the parameters of the $m$-th classifier.

3.3 Optimally Combining Classifiers
Each classifier sends its prediction probabilities for the unlabeled data to the center; the unlabeled data are then labeled by the optimal weighted ensemble. Suppose $w_m$ is the weight of the $m$-th classifier. The probability of the pseudo label is denoted as $p_{jk} = \sum_{m=1}^{M} w_m p^{m}_{jk}$, where $w = (w_1, \dots, w_M)$ is a vector with $\sum_{m=1}^{M} w_m = 1$.
The optimal weight is obtained by solving the following optimization problem:

$$\min_{w} \ -\sum_{j=l+1}^{l+u}\sum_{k=1}^{K} p_{jk}\log p_{jk} + \lambda \|w\|^2 \quad (5)$$

$$\text{s.t.}\quad \sum_{m=1}^{M} w_m = 1,\ w_m \ge 0, \quad (6)$$

$$w_M = g(\mathrm{MD}), \quad (7)$$

where $p_{jk} = \sum_{m=1}^{M} w_m p^{m}_{jk}$ is the ensemble prediction probability of the $j$-th unlabeled example for the $k$-th class. Recall that the $M$-th classifier is TSVM. We can give it a prior weight according to (7) with the following details. First, we introduce a margin density metric for TSVM as follows Sethi and Kantardzic (2017):
$$\mathrm{MD} = \frac{|\{i : \xi_i > 0\}|}{l+u}, \quad (8)$$

where $\xi$ is the slack variable mentioned in (1), and the numerator counts the number of training samples that fall inside the margin of the TSVM. The threshold $\mathrm{MD}_0$ controls the prior: the goal of the function $g$ is to give a smaller weight to TSVM if $\mathrm{MD} > \mathrm{MD}_0$, and a bigger weight otherwise. In our experiments, the function $g$ is given as follows:

$$g(\mathrm{MD}) = \begin{cases} w_{\mathrm{high}}, & \mathrm{MD} \le \mathrm{MD}_0, \\ w_{\mathrm{low}}, & \mathrm{MD} > \mathrm{MD}_0, \end{cases} \quad (9)$$

where $w_{\mathrm{high}} > w_{\mathrm{low}}$ are constants and the threshold $\mathrm{MD}_0$ will be specified in Section 4.1.
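A minimal sketch of (8)-(9), assuming decision values $w^{\top}x+b$ are available for the training samples; the threshold and the two weight levels below are illustrative placeholders, not the paper's tuned values:

```python
# Sketch of the margin-density prior: MD is the fraction of training
# points inside the TSVM margin (|w^T x + b| <= 1 <=> positive slack),
# and g maps it to a prior weight for TSVM.  MD0, w_high and w_low are
# illustrative placeholders.

def margin_density(scores):
    """Fraction of samples whose decision value lies inside the margin."""
    inside = sum(abs(s) <= 1.0 for s in scores)
    return inside / len(scores)

def tsvm_prior_weight(md, md0=0.3, w_high=0.4, w_low=0.1):
    """Piecewise weight g(MD): small weight when the margin is dense."""
    return w_low if md > md0 else w_high

scores = [2.1, -1.8, 0.4, -0.2, 3.0]   # decision values w^T x + b
md = margin_density(scores)             # 2 of 5 points in the margin
w_tsvm = tsvm_prior_weight(md)
```

A dense margin signals that the low-density separation assumption is violated, so the ensemble down-weights TSVM before solving for the remaining weights.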
The objective function in (5) contains the entropy of the predicted probability and a regularization term $\lambda\|w\|^2$. We want the ensemble classifier to produce low-entropy predictions on the unlabeled data, while the regularization term avoids overfitting to one classifier. When $\lambda \to \infty$, the solution of the optimization problem (5)-(7) tends to give the same weight to all classifiers. The following example shows the relationship between (5)-(7) and the majority voting rule.
Example 1. Consider a binary classification problem. There are three classifiers to predict one instance with probability
Thus, we have
If we only minimize the entropy of the ensemble prediction, then there are infinitely many optimal solutions. However, if we add the regularization term $\lambda\|w\|^2$, then a unique solution can be obtained, which is equivalent to the result of applying the majority voting rule.
Next, we show that the optimization problem (5)-(7) can be solved by the projected gradient method, generating the sequence $\{w^{t}\}$ via

$$w^{t+1} = \Pi_{\Delta}\!\left(w^{t} - \eta \nabla f(w^{t})\right), \quad (10)$$

where $\Delta = \{w : \sum_{m=1}^{M} w_m = 1,\ w_m \ge 0\}$ is the linear constraint set, $\eta$ is the learning rate, $\nabla f(w^{t})$ is the gradient of the objective function in (5) evaluated at $w^{t}$, and $\Pi_{\Delta}$ is the Euclidean projection onto $\Delta$. In this paper, we use the efficient algorithm proposed by Duchi et al. (2008) to numerically perform the projection in (10).
3.4 Diversity Augmentation
In this subsection, we use the following steps to further maintain diversity between the base learners.
First, we use bootstrap sampling to select a different subset $U_m$ of the unlabeled data for the $m$-th classifier, $m = 1, \dots, M$. In fact, $U_m$ can also be subsampled as in the coforest method of Li and Zhou (2007), which reduces the influence of misclassifying an unlabeled sample.
Second, inspired by the coforest method, the probability of the unlabeled data is obtained as follows:

$$p^{-m}_{jk} = \frac{1}{\sum_{n \neq m} w^{*}_{n}} \sum_{n \neq m} w^{*}_{n}\, p^{n}_{jk}, \quad (11)$$

where $w^{*}$ is the optimal solution of problem (5)-(7) and $p^{n}_{jk}$ is the prediction probability of the $j$-th unlabeled example for the $k$-th class from the $n$-th classifier. If $E$ is the set containing all classifiers, the formulation in (11) means that all component classifiers in $E$ other than the $m$-th are used to determine the most confidently labeled unlabeled examples for the $m$-th classifier. To filter out unconfident pseudo labels, we select an unlabeled example from $U_m$ only when its maximum probability exceeds a threshold $\theta$, i.e., $\max_{k} p^{-m}_{jk} > \theta$.
Finally, we get the reliable pseudo label for the $m$-th classifier as follows:

$$\tilde{y}^{m}_{j} = \arg\max_{k} \ p^{-m}_{jk}. \quad (12)$$
We then combine the pseudo-labeled data and the labeled data to train the $m$-th classifier again. The framework of the whole training process is shown in Algorithm 1. Based on the output of Algorithm 1, we can predict the test data and calculate the test accuracy.
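The peer-voting and filtering steps (11)-(12) can be sketched as follows; the weights, predictions and confidence threshold are toy values:

```python
# Sketch of the pseudo-labeling step (11)-(12): for the m-th classifier,
# all other classifiers vote with their optimal weights (renormalized),
# and only points whose top peer probability exceeds theta get a pseudo
# label.  Weights, predictions and theta are illustrative.

def peer_probability(preds, weights, m, j):
    """Weighted class probabilities for point j from all classifiers but m."""
    total = sum(w for i, w in enumerate(weights) if i != m)
    K = len(preds[0][j])
    return [
        sum(weights[i] * preds[i][j][k]
            for i in range(len(preds)) if i != m) / total
        for k in range(K)
    ]

def pseudo_labels(preds, weights, m, theta=0.7):
    """(index, label) pairs confidently pseudo-labeled for classifier m."""
    out = []
    for j in range(len(preds[0])):
        p = peer_probability(preds, weights, m, j)
        best = max(range(len(p)), key=p.__getitem__)
        if p[best] > theta:
            out.append((j, best))
    return out

preds = [
    [[0.9, 0.1], [0.6, 0.4]],   # classifier 0
    [[0.8, 0.2], [0.3, 0.7]],   # classifier 1
    [[0.7, 0.3], [0.5, 0.5]],   # classifier 2
]
weights = [0.5, 0.3, 0.2]
labels_for_0 = pseudo_labels(preds, weights, m=0)
```

Here only the first unlabeled point is confidently labeled for classifier 0 (peer probability 0.76 for class 0); the second point's peers disagree too much (0.62) and it is filtered out.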
4 Experiments
For comparison, the performances of five state-of-the-art semi-supervised algorithms are also evaluated: GraphSVM (GSVM) Belkin et al. (2006), GMM Zhu (2005), the Ladder network Rasmus et al. (2015), coforest Li and Zhou (2007), and TSVM Bennett and Demiriz (1999).
We conduct experiments on 14 data sets from the UCI machine learning repository. Additionally, we test our method on a real commercial data set which contains around 50 thousand samples with 62 features. Descriptions of the experimental data sets are shown in Table 3 (see Appendix).

4.1 Implementation details
For each data set, five-fold cross validation is employed for evaluation. Within each fold, we split the training data in a stratified fashion into a labeled set and an unlabeled set according to a given label rate $\gamma$, i.e., a fraction $\gamma$ of the training examples keep their labels and the rest are treated as unlabeled.
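A minimal sketch of such a stratified split, assuming only a list of class labels and a label rate $\gamma$ (the value $\gamma = 0.1$ below is purely illustrative):

```python
import random
from collections import defaultdict

# Sketch of a stratified labeled/unlabeled split: within each class,
# a fraction gamma of the indices keeps its labels and the rest become
# unlabeled.  gamma = 0.1 here is only an illustrative label rate.

def stratified_split(labels, gamma, seed=0):
    """Return (labeled_idx, unlabeled_idx) with per-class label rate gamma."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    labeled, unlabeled = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_lab = max(1, round(gamma * len(idxs)))  # keep >= 1 label per class
        labeled.extend(idxs[:n_lab])
        unlabeled.extend(idxs[n_lab:])
    return sorted(labeled), sorted(unlabeled)

labels = [0] * 10 + [1] * 10
lab, unlab = stratified_split(labels, gamma=0.1)
```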
The proposed method, CTOW, adaptively combines the classification results of Xgboost and TSVM. In our simulation, we set $M = 4$: three Xgboost classifiers and one TSVM. In addition, we fix the confidence threshold $\theta$, the number of iterations, and the penalty parameter $\lambda$.
Firstly, let us show the relationship between the margin density $\mathrm{MD}$ (introduced in (8)) and the accuracy of both TSVM and Xgboost.
Fig. 4 (a) shows that when $\mathrm{MD}$ is smaller, the accuracy of TSVM tends to be higher. The bars with negative values in Fig. 4 (b) indicate that Xgboost performs better than TSVM, which happens when $\mathrm{MD}$ becomes larger (see the cjs data). Hence $\mathrm{MD}$ can be used as prior information for setting the weight of TSVM: if $\mathrm{MD}$ is bigger than a threshold $\mathrm{MD}_0$, we give a smaller weight to TSVM, and a larger weight otherwise. According to this observation from Fig. 4, we set the threshold $\mathrm{MD}_0$ in Algorithm 1 and use the function $g$ in (7) to calculate the weight of TSVM. Based on this setup, we run Algorithm 1 and report the results in the next subsection.
4.2 Performance
First, visualizations of some data sets are presented in Fig. 1 and Fig. 5, which helps us analyze the suitable applications for these methods. Second, we compare the proposed method with other semi-supervised learning methods, showing that our method yields the best performance on many real data sets.
Table 2 shows the test accuracy of the different algorithms. From the second column, we see that GMM fails to work well on various data sets. The reason is that GMM assumes the data come from a Gaussian mixture distribution, which is a strong assumption for real data sets. Comparing GraphSVM with TSVM, Table 2 shows that these two methods have similar performance, because both are based on SVM. Nevertheless, GraphSVM needs to calculate the similarity matrix, which takes too long to compute (see Fig. 8 (a)). Thus, we do not report the results of GMM and GraphSVM when the data size is large.
In Fig. 5, the synthetic data enjoys a large margin, so TSVM performs better than coforest (see Table 2). Conversely, the classes of both the commercial and cjs data sets overlap and the data distribution is irregular, so the tree-based coforest model obtains better performance than TSVM. The Ladder network achieves the best performance on the hill and texture data sets, because their original features are homogeneous and do not represent the labels very well, while a neural network can learn a better feature space in this case. Our method exploits the advantages of both Xgboost and TSVM. Table 2 shows that the proposed method achieves the best performance on many real data sets, especially for large and high-dimensional data, such as the gasgrift and commercial data sets (see Table 3).
Data  GMM  GSVM  Ladder  Coforest  TSVM  CTOW 

cjs  0.293  0.640  0.740  0.989  0.654  0.987 
hill  0.488  0.490  0.530  0.492  0.493  0.499 
segment  0.694  0.889  0.898  0.907  0.878  0.925 
wdbc  0.643  0.940  0.932  0.905  0.949  0.954 
steel  0.466  0.627  0.652  0.620  0.673  0.649 
analcat  0.206  0.975  0.982  0.876  0.992  0.993 
synthetic  0.292  0.908  0.810  0.745  0.927  0.920 
vehicle  0.657  0.596  0.635  0.631  0.649  0.625 
german  0.614  0.619  0.679  0.712  0.718  0.716 
gina  *  *  0.807  0.814  0.835  0.857 
madelon  *  *  0.536  0.538  0.518  0.543 
texture  *  *  0.973  0.877  0.952  0.953 
gasgrift  *  *  0.945  0.927  0.941  0.965 
dna  *  *  0.885  0.890  0.894  0.911 
commercial  *  *  0.832  0.816  0.861  0.901 
5 Conclusions and Future Works
In this paper, we propose a new method, CTOW, for semi-supervised learning, which applies an optimal ensemble of two heterogeneous classifiers, namely Xgboost and TSVM. The unlabeled data is exploited through model initialization, solving the optimal weight ensemble problem, and diversity augmentation. Experiments on various real data sets demonstrate that our method is superior to state-of-the-art methods.
From the simulations, we find that Xgboost and TSVM have complementary properties and large diversity. The reason behind this phenomenon should be studied in the future.
Broader Impact
Although SSL is an old topic, there has been renewed interest in it in both academic and industrial research. Many SSL methods are based on certain assumptions about the labeled and unlabeled data, which play an important role in semi-supervised learning. However, how to make the right assumptions for real data remains an open question.
Our research finds that Xgboost and TSVM have complementary properties and large diversity. We therefore proposed a new SSL method called CTOW by applying an optimal ensemble of the two heterogeneous classifiers Xgboost and TSVM. Thus, the proposed CTOW enjoys the advantages of both Xgboost and TSVM and depends only weakly on the distribution of the training data.
Our research could also be used to provide explanations for Xgboost and TSVM in their applications, as well as to reduce the cost of labeling. The proposed method may fail when labeled data are so scarce that both Xgboost and TSVM perform poorly.
References
[1] Bennett and Demiriz (1999). Semi-supervised support vector machines. In Advances in Neural Information Processing Systems, pp. 368–374.
[2] Berthelot et al. (2019). MixMatch: a holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249.
[3] Blum and Mitchell (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100.
[4] Chapelle et al. (2009). Semi-supervised learning. IEEE Transactions on Neural Networks 20 (3), pp. 542–542.
[5] Chapelle et al. (2008). Optimization techniques for semi-supervised support vector machines. Journal of Machine Learning Research 9 (Feb), pp. 203–233.
[6] Chen and Guestrin (2016). XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
[7] Cohen et al. (2004). Semi-supervised learning of classifiers: theory, algorithms, and their application to human-computer interaction. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (12), pp. 1553–1566.
[8] Duchi et al. (2008). Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pp. 272–279.
[9] Friedman (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232.
[10] Kuncheva and Whitaker (2003). Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning 51 (2), pp. 181–207.
[11] Laine and Aila (2016). Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
[12] Lee (2013). Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3, pp. 2.
[13] Li and Zhou (2007). Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 37 (6), pp. 1088–1098.
[14] Maaten and Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605.
[15] Mallapragada et al. (2008). SemiBoost: boosting for semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (11), pp. 2000–2014.
[16] Belkin et al. (2006). Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7, pp. 2399–2434.
[17] Miller and Uyar (1997). A mixture of experts classifier with learning based on both labelled and unlabelled data. In Advances in Neural Information Processing Systems, pp. 571–577.
[18] Miyato et al. (2018). Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 1979–1993.
[19] Rasmus et al. (2015). Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554.
[20] Sethi and Kantardzic (2017). On the reliable detection of concept drift from streaming unlabeled data. Expert Systems with Applications 82, pp. 77–99.
[21] Tarvainen and Valpola (2017). Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pp. 1195–1204.
[22] Zhou and Li (2010a). Semi-supervised learning by disagreement. Knowledge and Information Systems 24 (3), pp. 415–439.
[23] Zhou and Li (2005). Tri-training: exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge & Data Engineering (11), pp. 1529–1541.
[24] Zhou and Li (2010b). Semi-supervised learning by disagreement. Knowledge and Information Systems 24 (3), pp. 415–439.
[25] Zhu et al. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning, pp. 912–919.
[26] Zhu (2005). Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison, Department of Computer Sciences.
6 Appendix
6.1 Data sets
Detailed information of the experimental data sets is shown in Table 3.
Data set  instances  feature  classes 

cjs  2796  10  6 
hill  1212  101  2 
segment  2310  20  7 
wdbc  569  31  2 
steel  1941  27  7 
analcat  841  71  4 
synthetic  600  62  7 
vehicle  846  19  4 
german  1000  24  2 
gina  3468  971  2 
madelon  2600  500  2 
texture  5500  41  11 
gasgrift  13910  129  6 
dna  3186  181  3 
commercial  50000  62  5 
6.2 Visualization
The visualizations of some other real data sets are presented in Fig. 6, which helps us select the suitable applications for these methods.
6.3 Performance Analysis
Fig. 7 (a) presents the average accuracy of the different algorithms over the 14 data sets; the proposed method, CTOW, improves accuracy compared with the other methods. It is well known that no single algorithm can always beat all others, but we can count the number of times each algorithm achieves the highest test accuracy, shown in Fig. 7 (b). Our proposed method achieves the best performance on half of these real data sets, which include both large-margin and irregular data.
Fig. 8 (b) presents the average accuracy of the different algorithms over the 14 data sets under different label rates $\gamma$; the proposed method, CTOW, improves accuracy compared with the other methods at low label rates. As the label rate $\gamma$ increases, our method still performs better than coforest, the Ladder network and TSVM.
6.4 Ablation Study
Since our method combines several semi-supervised learning techniques, we present an ablation study and discuss the effect of removing some components to provide additional insight into the proposed method. Specifically, we measure the performance of CTOW without the prior information of the margin density in (7), denoted CTOW-NP, and of CTOW with TSVM removed, i.e., co-training with Xgboost only, denoted CTOW-NT.
Table 4 summarizes our ablation results. It shows that removing either component degrades the classifier's average performance. Meanwhile, correct prior information about the margin density helps us obtain improved performance.
Data  CTOW-NP  CTOW-NT  CTOW 

cjs  0.987  0.989  0.987 
hill  0.499  0.502  0.499 
segment  0.923  0.922  0.925 
wdbc  0.931  0.919  0.954 
steel  0.647  0.646  0.649 
analcat  0.961  0.924  0.993 
synthetic  0.898  0.815  0.920 
vehicle  0.653  0.657  0.625 
german  0.715  0.709  0.716 
gina  0.858  0.864  0.857 
madelon  0.562  0.556  0.543 
texture  0.931  0.915  0.953 
gasgrift  0.962  0.964  0.965 
dna  0.911  0.912  0.911 
Average  0.817  0.807  0.821 