Machine learning algorithms often require large amounts of labeled data for training. As collecting labeled examples can be expensive, semi-supervised learning has been proposed [zhu2006semi]. Among the existing semi-supervised approaches, self-training [triguero2015self], co-training [blum1998combining], and tri-training [zhou2005tri] are the most notable. However, they share one major issue: the level of label noise gradually increases during the iterative labeling process. This problem can be attributed to two factors: (1) a static labeling threshold, and (2) inappropriate stopping criteria.
Many self-labeled algorithms iteratively enlarge the labeled training set with unlabeled instances whose prediction confidence exceeds a static labeling threshold. A static labeling threshold yields good classification performance only when the proportion of correctly labeled instances remains above a constant level. However, given that noisy labels are continuously added during the semi-supervised process [triguero2015self], it is unlikely that any fixed threshold will produce optimal classifications.
Deciding when to stop the iterative labeling process is also critical for self-labeled techniques. Existing stopping criteria include setting a limit on the number of labels that the algorithm is willing to generate, or stopping the labeling process when little to no accuracy increase occurs in an iteration. The choice of stopping criterion remains an open issue, as criteria that are too conservative or too liberal may introduce many mislabeled examples into the self-labeling process.
To address these two challenges, we propose a new tri-training-based method, called tri-training with teacher-student paradigm. Specifically, in each iteration, a double-teacher-single-student teaching relation is established based on predefined teacher and student thresholds, where the teachers teach the student with proxy labels generated on the unlabeled data. Throughout the teaching process, the teacher-student relationship is continuously adjusted through adaptive teacher and student thresholds. The teacher-student relationship terminates either when no teachable instances remain or when a graduation point is reached, where the student threshold equals the teacher threshold.
We evaluate tri-training with teacher-student paradigm on the sentiment analysis task of SemEval-2016 over various labeled-unlabeled data ratios. The proposed method outperforms many strong baselines, achieving better prediction performance while consuming fewer unlabeled examples.
Assume we are given a set of unlabeled samples as well as a set of labeled samples. The proposed method starts by training three independent base classifiers on bootstrapped sample subsets, each drawn from the labeled set. The aim of the bootstrap sampling is to increase the diversity of the base classifiers trained on the labeled set. Next, for every sample in the unlabeled set, each of the three trained models predicts a label with a corresponding prediction probability.
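The bootstrap step can be illustrated with a minimal sketch. The function names, the toy data, and the choice to draw each subset at the same size as the labeled set are assumptions made for illustration; the paper does not state the subset size.

```python
import random


def bootstrap_sample(labeled, rng):
    """Draw len(labeled) examples from `labeled` with replacement."""
    return [rng.choice(labeled) for _ in labeled]


def make_bootstrap_subsets(labeled, n_models=3, seed=0):
    """Build one bootstrapped subset per base classifier (hypothetical helper)."""
    rng = random.Random(seed)
    return [bootstrap_sample(labeled, rng) for _ in range(n_models)]


# Toy labeled set of (text, label) pairs, purely illustrative.
labeled = [
    ("good film", "positive"),
    ("bad plot", "negative"),
    ("it exists", "neutral"),
    ("loved it", "positive"),
]
subsets = make_bootstrap_subsets(labeled)
for k, subset in enumerate(subsets):
    print(k, [text for text, _ in subset])
```

Because sampling is done with replacement, each subset typically repeats some examples and omits others, which is what gives the three classifiers different views of the same labeled data.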
2.1 Teacher-Student Assignment
Instead of assigning a majority-voted label, as implemented in the original tri-training [zhou2005tri], we model the learning task from a teacher-student perspective. In each iteration of our approach, two classifiers are designated as teachers if their prediction probabilities are both larger than the teacher threshold. The remaining classifier is then treated as the student if its prediction probability is less than the student threshold. An unlabeled sample is assigned a label only after it is identified as teachable. Teachable samples are defined by the function SelectTeachableSamples, as shown in Algorithm 2. The required criteria are as follows. First, the labels predicted by the two teachers must agree with each other. Second, both teachers’ prediction confidences must exceed the teacher threshold while, at the same time, the student’s confidence is below the student threshold. Using two teachers ensures that bias in any one of these models does not affect the quality of the information taught to the student. This resembles the real-life teacher-student learning process, in which only qualified teachers teach students the things that they themselves are most comfortable with. It is important to note that the teacher-student roles are rotated in each iteration, allowing each classifier to learn from the other classifiers’ experiences, as the student is further trained on the original labeled set together with the identified teachable samples.
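The selection criteria described above can be sketched as follows. This is an illustrative reading of the prose, not a reproduction of Algorithm 2: the function signature, the variable names `tau_teacher` and `tau_student`, and the toy predictions are all assumptions.

```python
def select_teachable_samples(teacher_a, teacher_b, student, tau_teacher, tau_student):
    """Return (index, proxy_label) pairs for samples the teachers may teach.

    Each of the first three arguments is a list with one (label, confidence)
    pair per unlabeled sample, in the same order.
    """
    teachable = []
    for i, ((label_a, conf_a), (label_b, conf_b), (_, conf_s)) in enumerate(
        zip(teacher_a, teacher_b, student)
    ):
        teachers_agree = label_a == label_b
        teachers_confident = conf_a > tau_teacher and conf_b > tau_teacher
        student_unsure = conf_s < tau_student
        if teachers_agree and teachers_confident and student_unsure:
            teachable.append((i, label_a))
    return teachable


# Hypothetical predictions for four unlabeled samples.
teacher_a = [("pos", 0.95), ("neg", 0.92), ("pos", 0.97), ("neu", 0.60)]
teacher_b = [("pos", 0.91), ("pos", 0.93), ("pos", 0.96), ("neu", 0.95)]
student = [("neg", 0.40), ("neg", 0.30), ("pos", 0.85), ("neu", 0.20)]

teachable = select_teachable_samples(teacher_a, teacher_b, student, 0.9, 0.5)
print(teachable)  # → [(0, 'pos')]
```

Only the first sample qualifies: the second fails on teacher agreement, the third because the student is already confident, and the fourth because one teacher is below the teacher threshold. Rotating the roles amounts to calling the function three times, each time with a different classifier in the student position.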
2.2 Adaptive Thresholds
Another novel aspect that we adopt from real-world teaching scenarios is the continuously adjusted teacher-student relationship. To be more specific, as a student learns from the teachers, it becomes more confident in the knowledge taught by the teachers. Accordingly, the student threshold increases monotonically in every iteration. On the other hand, as the student progresses through the learning process, the teachers are expected to teach it more advanced cases, i.e., cases about which the teachers are less confident. This is captured in our approach by monotonically decreasing the teacher threshold. In this work, we choose a linear rate for the adaptive process, as shown in lines 10 and 11 of Algorithm 1.
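A linear threshold schedule of this kind might look like the sketch below. The exact update in Algorithm 1 is not reproduced in this text, so the additive form, the function name, and the starting values are assumptions; only the direction of each update (teacher down, student up) comes from the description above.

```python
def update_thresholds(tau_teacher, tau_student, teacher_rate, student_rate):
    """Apply one linear step: lower the teacher threshold, raise the student threshold."""
    return tau_teacher - teacher_rate, tau_student + student_rate


# Hypothetical starting thresholds with a rate of 0.001 for both.
tau_teacher, tau_student = 0.9, 0.5
for iteration in range(3):
    tau_teacher, tau_student = update_thresholds(tau_teacher, tau_student, 0.001, 0.001)
    print(iteration, round(tau_teacher, 3), round(tau_student, 3))
```

With equal rates the gap between the two thresholds shrinks by twice the rate per iteration, so a smaller rate gives the student more iterations before the thresholds meet.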
2.3 Stopping Criteria
Existing self-labeled techniques often stop when no sample can be labeled, or when no performance improvement occurs in an iteration. The original tri-training paper introduces an error constraint that checks whether peak performance has been reached. However, the error measurement is conducted only on the labeled dataset, which assumes that the labeled set distribution is representative of the unlabeled set distribution. Tri-training may also yield only a limited number of co-labeled examples for training and terminate prematurely when dealing with large datasets [chou2016boosted].
In this work, we define our stopping criterion by comparing the student’s confidence threshold with the teachers’ threshold during each training iteration. We assume that when a student reaches the same confidence level as the teachers in a particular iteration, there is nothing left for the student to learn from the teachers. This happens in Algorithm 2 when the student threshold reaches the teacher threshold. At this point, adding new samples to the training set of the student would no longer contribute to its learning. We call this point the graduation point, and the tri-training process stops naturally once it is reached.
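The two termination conditions (no teachable samples left, or graduation) can be sketched as a simple loop. This is a simplified illustration under stated assumptions: the per-iteration count of teachable samples is supplied as a list instead of being computed by real classifiers, and the thresholds, rate, and return values are hypothetical.

```python
def run_until_stop(tau_teacher, tau_student, rate, teachable_counts):
    """Count teaching iterations until one of the stopping conditions holds.

    `teachable_counts` stands in for the number of teachable samples that
    a real selection step would find in each iteration.
    """
    iterations = 0
    for n_teachable in teachable_counts:
        if tau_student >= tau_teacher:
            return iterations, "graduated"
        if n_teachable == 0:
            return iterations, "no teachable samples"
        iterations += 1  # the student would be retrained here
        tau_teacher -= rate
        tau_student += rate
    return iterations, "input exhausted"


# Large rate chosen only to keep the example short.
print(run_until_stop(0.9, 0.55, 0.1, [5, 4, 3, 2]))  # → (2, 'graduated')
print(run_until_stop(0.9, 0.55, 0.1, [5, 0, 3, 2]))  # → (1, 'no teachable samples')
```

In the first call the thresholds move to 0.8 and 0.65, then to 0.7 and 0.75, at which point the student threshold has passed the teacher threshold and the loop stops.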
3.1 Experimental Settings
Datasets. We evaluate our model on the sentiment classification dataset of SemEval-2016 Task 4 Subtask A [nakov2016semeval]. In total, there are 6000 training sentences, including 3094 positive, 863 neutral, and 2043 negative instances. We use 2000 sentences from the dev set for validation and 20632 sentences for testing. To test the model’s generalizability, we examine it under different proportions of labeled data. We randomly select 10%, 20%, 30% and 40% of the training set as labeled samples and treat the rest as unlabeled by hiding their labels. The hidden labels are used later to check the quality of the generated proxy labels.
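The labeled/unlabeled split can be sketched as below. The function name, the fixed seed, and the toy data are assumptions for illustration; the paper only states that a given percentage is selected at random and the remaining labels are hidden.

```python
import random


def split_labeled_unlabeled(examples, label_rate, seed=0):
    """Keep labels for a random `label_rate` fraction; hide the rest.

    `examples` is a list of (text, label) pairs. The hidden labels are
    returned separately so proxy labels can be checked against them later.
    """
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    n_labeled = round(label_rate * len(examples))
    labeled = [examples[i] for i in indices[:n_labeled]]
    unlabeled_texts = [examples[i][0] for i in indices[n_labeled:]]
    hidden_labels = [examples[i][1] for i in indices[n_labeled:]]
    return labeled, unlabeled_texts, hidden_labels


# Ten toy examples standing in for the training sentences.
examples = [(f"sentence {i}", "positive" if i % 2 else "negative") for i in range(10)]
labeled, unlabeled_texts, hidden_labels = split_labeled_unlabeled(examples, 0.4)
print(len(labeled), len(unlabeled_texts), len(hidden_labels))  # → 4 6 6
```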
Baselines. Since our method improves upon the foundations laid by the typical semi-supervised methods as mentioned in the related work section (e.g. tri-training and self-training), we compare with the following baselines:
- Self-training with Naive Bayes as base classifier.
SVM STr - Self-training with SVM as base classifier.
Tri - Tri-training with SVM as base classifiers.
Tri-D - Tri-training with disagreement with SVM as base classifiers [sogaard2010simple].
Our proposed approach is tri-training with teacher-student paradigm (Tri-TS). We do not compare with co-training here because there are no clear independent views [zhou2005tri] in the sentiment analysis task. We do not use any deep learning model as a base learner in this study, as deep learning models may not perform well when labeled data is limited. We did try FastText [joulin2017bag] as a proof case, but its performance was unsatisfactory, even when combined with the proposed model.
In all the baselines, we experiment with different base classifiers and their combinations, namely Naive Bayes, SVM and neural networks. We use a linear kernel (LinearSVC) for the SVM. For the neural network (MLP), we use 50 neurons in the hidden layer with a softmax output. We use GloVe 300-dimensional word embeddings [pennington2014glove]. After text cleaning and tokenization, we average the word embeddings of the tokens in a sentence to obtain its feature vector. For both tri-training baselines, Tri and Tri-D, we obtain the best results with SVM as the base classifier. Hence, we report these for comparison with our approach.
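The feature construction can be illustrated with a small sketch. The 3-dimensional toy vectors stand in for the 300-dimensional GloVe embeddings, and two details are assumptions not stated in the paper: tokens without an embedding are skipped, and a sentence with no known tokens maps to a zero vector.

```python
def sentence_vector(tokens, embeddings, dim):
    """Average the embeddings of the tokens that have one."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return [0.0] * dim  # assumption: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy embedding table, purely illustrative.
embeddings = {
    "good": [1.0, 0.0, 2.0],
    "movie": [3.0, 2.0, 0.0],
}

print(sentence_vector(["good", "movie"], embeddings, dim=3))  # → [2.0, 1.0, 1.0]
print(sentence_vector(["good", "zzz"], embeddings, dim=3))    # → [1.0, 0.0, 2.0]
```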
Note that, as mentioned in Section 2.3, for the baselines Tri and Tri-D, we use their own respective stopping criteria during evaluation, as a comparison to our newly proposed stopping criterion.
Parameter Tuning. All parameters required by the proposed method and the baselines are tuned on the validation set. A grid search is used to determine the parameter values that maximize each model’s performance. For the proposed method, the teacher and student thresholds and their adaptive rates are tuned in this way. The best-performing adaptive rates for both thresholds are found empirically to be 0.001. For the tri-training baselines, we tried tuning the error constraint as suggested in the original paper, but it generates only a small number of proxy labels during training and terminates after very few iterations. We therefore discard the error constraint and use the threshold-based tri-training method adopted in [ruder2017knowledge] and [sogaard2010simple]. The best-performing parameters are again obtained through evaluation on the validation set.
We evaluate our approach and the baselines from three aspects: the overall model performance, the quality of the generated proxy labels, and the quantity of unlabeled data consumed. Model performance is reported using the F-score measure adopted in the SemEval competition.
Overall Performance. The methods Tri and Tri-D both use majority voting to combine the three classifiers. For a fair comparison with these methods, after training is completed, we perform majority voting on the test set to obtain the final predictions. In Table 1, we see that the proposed tri-training with teacher-student paradigm consistently outperforms the baselines across the different labeled versus unlabeled settings. The proposed method reaches a score of 0.523 using just 40% of the labeled data, whereas the upper bound, obtained by training the base SVM classifier on 100% of the training dataset, is only 0.585.
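The final voting step can be sketched as follows. The function name and the toy predictions are assumptions; the paper does not specify how a three-way disagreement is resolved, so this sketch simply takes the most common label as reported by Python's `Counter`.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per test sample."""
    final = []
    for votes in zip(*predictions_per_model):
        final.append(Counter(votes).most_common(1)[0][0])
    return final


# Hypothetical test-set predictions from the three trained classifiers.
predictions = [
    ["pos", "neg", "neu", "pos"],
    ["pos", "neg", "pos", "neg"],
    ["neg", "neg", "pos", "pos"],
]
print(majority_vote(predictions))  # → ['pos', 'neg', 'pos', 'pos']
```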
To better understand the effectiveness of the proposed teacher-student paradigm, we further examine the performance of each individual base classifier before the majority voting step. We found that under the 10% label rate, the maximum difference in score between the base classifiers and the final ensemble model was only 0.011, and this difference decreased to 0.005 when the label rate increased to 40%, which indicates good quality of the base classifiers even without the ensemble step. The same conclusion is supported by the fact that the base classifiers in Tri-TS, before the ensemble step, performed better than the base classifiers in all the other baselines.
Quality of Proxy-labels. The quality of the proxy-labels assigned to the unlabeled data in each iteration determines how well the model learns. We therefore evaluate the quality of all proxy-labels produced during the self-labeling process against the hidden ground truth to determine how effectively the algorithms teach the correct labels. Table 2 shows that the teacher models in our proposed method consistently produce high-quality proxy-labels (88.93% match with the hidden ground truth labels) for the student model to learn from. The other baselines tend to suffer from the problem of adding unreliable labels to the labeled dataset. We view this result as a confirmation of the usefulness of the adaptive threshold in producing high-quality proxy-labels on the unlabeled data.
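The quality check amounts to measuring agreement between the assigned proxy-labels and the hidden labels. The sketch below is illustrative only: the data layout (a mapping from unlabeled-sample index to proxy-label) and the toy values are assumptions.

```python
def proxy_label_accuracy(proxy_labels, hidden_labels):
    """Fraction of assigned proxy-labels that match the hidden ground truth.

    `proxy_labels` maps an unlabeled-sample index to its assigned label;
    samples that never received a proxy-label are not counted.
    """
    if not proxy_labels:
        return 0.0
    correct = sum(
        1 for index, label in proxy_labels.items() if hidden_labels[index] == label
    )
    return correct / len(proxy_labels)


# Hypothetical hidden labels and proxy-labels for a subset of samples.
hidden_labels = ["pos", "neg", "neu", "pos", "neg"]
proxy_labels = {0: "pos", 1: "neg", 3: "neg", 4: "neg"}
print(proxy_label_accuracy(proxy_labels, hidden_labels))  # → 0.75
```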
Quantity of Unlabeled Data Consumed. To evaluate the effectiveness of our stopping criterion, we calculate the quantity of unlabeled data consumed during the self-labeling process. Figure 1 plots the models’ scores against the cumulative number of samples added throughout the iterations (each data point in the plot corresponds to an iteration). We find that the proposed method consumes only 201 unlabeled instances to reach its best prediction performance, whereas both the original tri-training and tri-training with disagreement add around two to three times as many samples. From Figure 1, we can further see that many of the baseline algorithms reach the saturation point well before they stop training, i.e., the improvement in performance is marginal or the performance even decays in some cases. This supports the effectiveness of the proposed stopping criterion.
We see that our approach performs worse than the tri-training baselines in the earlier iterations. This happens because our algorithm learns easier cases at the very beginning and gradually increases the difficulty during the learning process. In contrast, the original tri-training improves quickly but also plateaus earlier, and thus does not achieve the full potential of the three base classifiers. Our approach avoids this early plateauing by adopting adaptive thresholds.
Sensitivity Analysis. We further perform a sensitivity analysis to assess the impact of the initial settings of the teacher and student thresholds on model performance. Specifically, we compare the experimental results with: (1) the initial teacher threshold varied over a range of values while the initial student threshold is held fixed; and (2) the initial student threshold varied over a range of values while the initial teacher threshold is held fixed. In both settings, the two thresholds are continuously updated with the learned adaptive rates after their initial assignment. We observe only marginal performance losses across all values. This indicates that the initial values of the teacher and student thresholds do not affect the performance much, as long as the thresholds are adaptive.
In this paper, we propose a new teacher-student paradigm for the original tri-training, with continuously adaptive thresholds and a natural stopping criterion. We show that our model outperforms all self-training and tri-training baselines, achieving higher overall performance and higher quality of generated proxy labels while consuming a smaller quantity of unlabeled data. Although we only validate the proposed method on the benchmark SemEval dataset in this paper, our ultimate goal is to use it as a solution for scenarios with limited labeled data and to tackle real-world problems where labeled data is hard to find or expensive to obtain.