1 Introduction
We consider the problem of semisupervised learning of binary classification functions. As in the supervised paradigm, the goal in semisupervised learning is to construct a classification rule that maps objects in some input space to a target outcome, such that future objects map to correct target outcomes as well as possible. In the supervised paradigm this mapping is learned using a set of training objects and their corresponding outputs. In the semisupervised scenario we are given an additional and often large set of unlabeled objects. The challenge of semisupervised learning is to incorporate this additional information to improve the classification rule.
The goal of this work is to build a semisupervised version of the least squares classifier that is robust against deterioration in performance, meaning that, at least in expectation, its performance is not worse than that of supervised least squares classification. While this may seem like an obvious requirement for any semisupervised method, current approaches to semisupervised learning do not have this property. In fact, performance can significantly degrade as more unlabeled data is added, as has been shown in Cozman2006 ; Cozman2003 , among others. This makes it difficult to apply these methods in practice, especially when there is only a small amount of labeled data with which to detect a possible reduction in performance. A useful property of any semisupervised learning procedure would therefore be that its performance does not degrade as we add more unlabeled data. Additionally, many semisupervised learning procedures are formulated as hard-to-optimize, non-convex objective functions. A more satisfactory state of affairs for semisupervised classification would therefore be methods that are easier to train and that, on average, do not lead to worse classification performance than their supervised alternatives.
We present a novel approach to semisupervised learning for the least squares classifier that we will refer to as implicitly constrained least squares classification (ICLS). ICLS leverages implicit assumptions present in the supervised least squares classifier to construct a semisupervised version. This is done by minimizing the supervised loss function subject to the constraint that the solution has to correspond to the solution of the least squares classifier for some labeling of the unlabeled objects.
As this work is specifically concerned with least squares classification, we note several reasons why this is a particularly interesting classifier to study: First of all, the least squares classifier is a discriminative classifier. Some have claimed semisupervised learning without additional assumptions is impossible for discriminative classifiers Seeger2001 ; Singh2008 . Our results show this does not strictly hold.
Secondly, the closed-form solution for the supervised least squares classifier allows us to study its theoretical properties. In particular, in the univariate setting without intercept and assuming perfect knowledge of $f_X(x)$, the distribution of the feature, we show this procedure never gives worse performance in terms of the squared loss criterion compared to the supervised least squares classifier. Moreover, using the closed-form solution we can rewrite our semisupervised approach as a quadratic programming problem, which can be solved through a simple gradient descent with boundary constraints.
Lastly, least squares classification is a useful and adaptable classification technique allowing for straightforward use of, for instance, regularization, sparsity penalties or kernelization Hastie2009 ; Poggio2003 ; Rifkin2003 ; Suykens1999 ; Tibshirani1996 . Using these formulations, it has been shown to be competitive with state-of-the-art methods based on loss functions other than the squared loss Rifkin2003 as well as computationally efficient on large datasets Bottou2010 .
This work builds on Krijthe2015 and offers a more complete exposition: we show ICLS can be formulated as a quadratic programming problem, we extend the experimental results section by including an alternative semisupervised procedure, adding additional datasets and discussing the ‘peaking’ phenomenon. Moreover, we extend the theoretical result with conditions when one is likely to see improvement of the proposed approach over the supervised classifier.
The main contributions of this paper are:

A novel convex formulation for robust semisupervised learning using squared loss (Equation 5);

A proof that this procedure never reduces performance in terms of the squared loss for the 1-dimensional case without intercept (Theorem 1);

An empirical evaluation of the properties of this classifier (Section 6).
The rest of this paper is organized as follows. Section 2 gives an overview of related work on semisupervised learning. Section 3 gives a high level overview of the method while Section 4 introduces our semisupervised version of the least squares classifier in more detail. We then derive a quadratic programming formulation and present a simple way to solve this problem through bounded gradient descent. Section 5 contains a proof of the improvement of the ICLS classifier over the supervised alternative. This proof is specific to classification with a single feature, without including an intercept in the model. For the multivariate case, we present an empirical evaluation of the proposed approach on benchmark datasets in Section 6 to study its properties. The final sections discuss the results and conclude.
2 Related Work
Many diverse approaches to semisupervised learning have been proposed Chapelle2006 ; Zhu2009 . While semisupervised techniques have shown promise in some applications, such as document classification Nigam2000 , peptide identification Kall2007 and cancer recurrence prediction Shi2011 , it has also been observed that these techniques may give performance worse than their supervised counterparts. See for instance Cozman2006 ; Cozman2003 for an analysis of this problem, and Elworthy1994 for a practical example in part-of-speech tagging. In these cases, disregarding the unlabeled data would lead to better performance.
Some authors (Goldberg2009 ; Wang2007a) have argued that agnostic semisupervised learning, which Goldberg2009 defines as semisupervised learning that is at least no worse than supervised learning, can be achieved by cross-validation on the limited labeled data. Agnostic semisupervised learning follows if we only use semisupervised methods when their estimated cross-validation error is significantly lower than that of the supervised alternatives. As the results of Goldberg2009 indicate, this criterion may be too conservative: given the small amount of labeled data, a semisupervised method will only be preferred if the difference in performance is very large. If the difference is less distinct, the supervised learner will always be preferred and we potentially ignore useful information from the unlabeled objects. Moreover, this cross-validation approach can be computationally demanding.
Self-Learning
A simple approach to semisupervised learning is offered by the self-learning procedure McLachlan1975 , also known as Yarowsky’s algorithm Abney2004 ; Yarowsky1995 or retagging Elworthy1994 . Taking any classifier, we first estimate its parameters on only the labeled data. Using this trained classifier we label the unlabeled objects and add them, or potentially only those we are most confident about, with their predicted labels to the labeled training set. The classifier parameters are re-estimated using these labeled objects to get a new classifier. One iteratively applies this procedure until the predicted labels of the unlabeled data no longer change.
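The self-learning loop described above can be sketched as follows for a least squares base classifier. This is a minimal illustration, not the implementation used in the experiments: the function names, the 0/1 label encoding, the 0.5 threshold, the iteration cap and the toy data are all assumptions made here, and the variant that adds only the most confident predictions is not shown.

```python
import numpy as np

def fit_least_squares(X, y):
    """Least squares fit; the pseudo-inverse handles a singular X^T X."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

def self_learning(X, y, X_u, threshold=0.5, max_iter=100):
    """Self-learning with hard 0/1 labels imputed for the unlabeled objects."""
    beta = fit_least_squares(X, y)                    # train on labeled data only
    X_all = np.vstack([X, X_u])
    imputed = (X_u @ beta > threshold).astype(float)  # label the unlabeled objects
    for _ in range(max_iter):
        # Re-estimate the parameters on labeled plus imputed data.
        beta = fit_least_squares(X_all, np.concatenate([y, imputed]))
        new_imputed = (X_u @ beta > threshold).astype(float)
        if np.array_equal(new_imputed, imputed):      # predicted labels stopped changing
            break
        imputed = new_imputed
    return beta, imputed

# Toy data (hypothetical): one feature plus a constant intercept column.
rng = np.random.default_rng(1)
x_lab = np.concatenate([rng.normal(-2, 1, 5), rng.normal(2, 1, 5)])
x_unl = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
X = np.column_stack([x_lab, np.ones(10)])
y = np.concatenate([np.zeros(5), np.ones(5)])
X_u = np.column_stack([x_unl, np.ones(100)])
beta_self, labels_u = self_learning(X, y, X_u)
```

The returned labels are always the predictions of the returned parameter vector, so the pair is self-consistent even if the iteration cap is reached first.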
One of the advantages of this procedure is that it can be applied to any supervised classifier. It has also shown practical success in some application domains, particularly document classification Nigam2000 ; Yarowsky1995 . Unfortunately, the process of self-training can also lead to severely decreased performance compared to the supervised solution Cozman2006 ; Cozman2003 . One can imagine that once an object is incorrectly labeled and added to the training set, its incorrect label may be reinforced, leading the solution away from the optimum. Self-learning is closely related to expectation maximization (EM) based approaches Abney2004 . Indeed, expectation maximization suffers from the same issues as self-learning Zhu2009 . In Section 6 we compare the proposed approach to self-learning for the least squares classifier.
Additional Assumptions
Some semisupervised methods leverage the unlabeled data by introducing assumptions that link properties of the features alone to properties of the label of an object given its features. Commonly used assumptions are the smoothness assumption: objects that are close in the feature space likely share the same label; the cluster assumption: objects in the same cluster share a label; and the low density assumption enforcing that the decision boundary should be in a region of low data density.
The low-density assumption is used in entropy regularization Grandvalet2005 as well as for support vector classification in the transductive support vector machine (TSVM) Joachims1999 and the closely related semisupervised SVM (S3VM) Bennett1998 ; Sindhwani2006 . In these approaches an additional term is added to the objective function to push the decision boundary away from regions of high density. Several approaches have been put forth to minimize the resulting non-convex objective function, such as the convex concave procedure Collobert2006 and difference convex programming Sindhwani2006 ; Wang2007 .
In all these approaches to semisupervised learning, a parameter controls the importance of the unlabeled points. When the parameter is correctly set, it is clear, as Wang2007a claims, that TSVM is always no worse than supervised SVM. It is, however, non-trivial to choose this parameter, given that semisupervised learning is most interesting in cases where we have limited labeled objects, making a choice using cross-validation very unstable. In practice, therefore, TSVM can also lead to performance worse than the supervised support vector machine, as we will also see in Section 6.3.
Safe Semisupervised Learning
Loog2010 ; Loog2014b attempt to guard against the possibility of deterioration in performance by not introducing additional assumptions, but instead leveraging implicit assumptions already present in the choice of the supervised classifier. These assumptions link parameter estimates that depend on labeled data to parameter estimates that rely on all data. By exploiting these links, semisupervised versions of the nearest mean classifier and the linear discriminant are derived. Because these links are unique to each classifier, the approach does not generalize directly to other classifiers. The method presented here is similar in spirit, but unlike Loog2010 ; Loog2014b , no explicit equations have to be formulated to link parameter estimates using only labeled data to parameter estimates based on all data. Moreover, our approach allows for theoretical analysis of the non-deterioration of the performance of the procedure.
Aside from the work by Loog2010 ; Loog2014b , another attempt to construct a robust semisupervised version of a supervised classifier has been made in Li2011 , which introduces the safe semisupervised support vector machine (S4VM). This method is an extension of S3VM Bennett1998 which constructs a set of low-density decision boundaries with the help of the additional unlabeled data, and chooses the decision boundary which, even in the worst case, gives the highest gain in performance over the supervised solution. If the low-density assumption holds, this procedure provably increases classification accuracy over the supervised solution. The main difference with the method considered in this paper, however, is that we make no such additional assumptions. We show that even without these assumptions, safe improvements are possible for the least squares classifier.
Semisupervised Least Squares
While least squares classification has been widely used and studied Hastie2009 ; Poggio2003 ; Suykens1999 , little work has been done on applying semisupervised learning to the least squares classifier specifically. For least squares regression, Little2002 describe an iterative method for handling missing outcomes that was formally proposed in Healy1956 . In the case of least squares regression, this method has some computational advantages over discarding the unlabeled data but its solution always coincides with the supervised solution. Shaffer1991 studied the value of knowing $E[\mathbf{X}^\top \mathbf{X}]$, where $\mathbf{X}$ is the design matrix containing the feature values for each observation. If we assume the number of unlabeled data points is large, this is similar to the semisupervised situation. It is shown that if the size of the parameters is small compared to the noise, a procedure that plugs in $E[\mathbf{X}^\top \mathbf{X}]$ as the estimate of $\mathbf{X}^\top \mathbf{X}$ has a lower variance than supervised least squares regression. As the size of the parameters increases, this effect reverses. In fact, the paper demonstrates that in this semisupervised setting no best linear unbiased estimator for the regression coefficients exists. In Section 6, we compare our approach to using this plug-in estimate by substituting the matrix $\mathbf{X}^\top \mathbf{X}$ by a version based on both labeled and unlabeled data. A similar plug-in procedure has been used by Fan2008 for linear discriminant analysis for dimensionality reduction, which is closely related to least squares classification. Here the (normalized) total scatter matrix, which plays a similar role to the $\mathbf{X}^\top \mathbf{X}$ matrix in least squares regression, is exchanged with the more accurate estimate of the total scatter based on both labeled and unlabeled data.
3 Implicitly Constrained Least Squares Classification
Given a limited set of labeled objects and a potentially large set of unlabeled objects, the goal of implicitly constrained least squares classification is to use the latter to improve the solution of the least squares classifier trained on just the labeled data. We start with a sketch of this approach, before discussing the details.
Given the supervised least squares classifier, consider the hypothesis space of all possible parameter vectors, which we will denote as $\mathcal{F}_\beta$, see Figure 1. Given a set of labeled objects, we can determine the supervised parameter vector $\hat{\boldsymbol{\beta}}_{sup}$. Suppose we also have a potentially large number of unlabeled objects. Assume that every object has a label; it is merely unknown to us. If these labels were to be revealed, it is clear how the additional objects can improve classification performance: we estimate the least squares classifier using all the data to obtain the parameter vector $\hat{\boldsymbol{\beta}}_{oracle}$. Since this estimate is based on more objects, we expect the parameter estimate to be better. These real labels are unknown, but we can still consider all possible labelings of the unlabeled objects, and estimate corresponding parameters based on these imputed labelings. In this way, we get a set of possible parameters for our classifier, which form the set denoted by $\mathcal{C}_\beta$. Clearly one of these labelings corresponds to the real, but unknown, labeling, so one of the parameter estimates in this set corresponds to the solution we would obtain using all the correct labels of both the labeled and unlabeled objects. Because these are the only possible classifiers if the true labels were revealed, we propose to look within this set for an improved semisupervised solution.
Two issues then remain: how do we choose the best parameters from this set and how do we find these without having to enumerate all possible labelings?
Looking at the first problem, we reiterate that the goal of semisupervised learning is to find a good classification rule and, therefore, the obvious way to evaluate this rule is still by the loss on the labeled training points. In other words, we choose the classifier from the parameter set $\mathcal{C}_\beta$ that minimizes the squared loss on the labeled points. We will denote this solution by $\hat{\boldsymbol{\beta}}_{semi}$. Note this approach is rather different from other approaches to semisupervised learning where the loss is adapted by including a term that depends on the unlabeled data points. In our formulation, the loss function is still the regular, supervised loss of our classification procedure.
As for the second issue, after relaxing the constraint that we need hard labels for the data points, we will see that the resulting optimization problem is, in fact, an instance of the well-studied class of quadratic programming problems, which we solve using a simple gradient descent procedure.
4 Method
4.1 Supervised Multivariate Least Squares Classification
Least squares classification Hastie2009 ; Rifkin2003 is the direct application of well-known ordinary least squares regression to a classification problem. A linear model is assumed and the parameters are estimated by minimizing the squared loss. Let $\mathbf{X}$ be an $L \times (d+1)$ design matrix with $L$ rows containing vectors of length equal to the number of features $d$ plus a constant feature to encode the intercept. Vector $\mathbf{y}$ denotes an $L \times 1$ vector of class labels. We encode one class as $0$ and the other as $1$. The multivariate version of the empirical risk function for least squares estimation is given by
$$\hat{R}(\boldsymbol{\beta}) = \frac{1}{L} \lVert \mathbf{X}\boldsymbol{\beta} - \mathbf{y} \rVert_2^2 \qquad (1)$$
The well-known closed-form solution for this problem is found by setting the derivative with respect to $\boldsymbol{\beta}$ equal to $0$ and solving for $\boldsymbol{\beta}$, giving
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \qquad (2)$$
In case $\mathbf{X}^\top \mathbf{X}$ is not invertible (for instance when $L < d+1$), a pseudo-inverse is applied. As we will see, the closed-form solution to this problem will enable us to formulate our semisupervised learning approach in terms of a standard quadratic programming problem, which is easy to optimize.
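As a small illustration of the closed-form supervised solution, the following sketch fits a least squares classifier with NumPy. The helper names, the toy data and the 0.5 decision threshold (the midpoint of an assumed 0/1 label encoding) are choices made for this example rather than details taken from the paper.

```python
import numpy as np

def add_intercept(features):
    """Append a constant column of ones to encode the intercept."""
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_least_squares(X, y):
    """Closed-form least squares; pinv covers the case where X^T X is singular."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

def predict_labels(X, beta, threshold=0.5):
    """Threshold the real-valued outputs to obtain 0/1 class predictions."""
    return (X @ beta > threshold).astype(int)

# Toy data (hypothetical): two Gaussian classes in two dimensions.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(-1.0, 1.0, (20, 2)),
                      rng.normal(1.0, 1.0, (20, 2))])
y = np.concatenate([np.zeros(20), np.ones(20)])
X = add_intercept(features)
beta_sup = fit_least_squares(X, y)
predictions = predict_labels(X, beta_sup)
```

When the design matrix has full column rank, this agrees with a standard least squares solver such as `np.linalg.lstsq`; the pseudo-inverse only changes the result in the rank-deficient case.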
4.2 Implicitly Constrained Least Squares Classification
In the semisupervised setting, apart from a design matrix $\mathbf{X}$ and target vector $\mathbf{y}$, an additional set of measurements $\mathbf{X}_u$ of size $U \times (d+1)$ without a corresponding target vector $\mathbf{y}_u$ is given. In what follows, $\mathbf{X}_e = [\mathbf{X}^\top \; \mathbf{X}_u^\top]^\top$ denotes the extended design matrix, which is simply the concatenation of the design matrices of the labeled and unlabeled objects.
In the implicitly constrained approach, we incorporate the additional information from the unlabeled objects by searching within the set of classifiers that can be obtained by all possible labelings $\mathbf{y}_u$, for the one classifier that minimizes the supervised empirical risk function in Equation (1). This set, $\mathcal{C}_\beta$, is formed by the $\boldsymbol{\beta}$s that would follow from training supervised classifiers on all (labeled and unlabeled) objects going through all possible soft labelings for the unlabeled samples, i.e., using all $\mathbf{y}_u \in [0,1]^U$. Since these supervised solutions have a closed form, this can be written as
$$\mathcal{C}_\beta = \left\{ \boldsymbol{\beta} = (\mathbf{X}_e^\top \mathbf{X}_e)^{-1} \mathbf{X}_e^\top \begin{bmatrix} \mathbf{y} \\ \mathbf{y}_u \end{bmatrix} : \mathbf{y}_u \in [0,1]^U \right\} \qquad (3)$$
The soft labeling provides both a relaxation for computational reasons as well as a strategy to deal with label uncertainty. We can interpret these fractions as a type of class posterior for the unlabeled objects. This constraint set $\mathcal{C}_\beta$, combined with the supervised loss that we want to optimize in Equation (1), gives the following definition for implicitly constrained semisupervised least squares classification:
$$\hat{\boldsymbol{\beta}}_{semi} = \operatorname*{argmin}_{\boldsymbol{\beta} \in \mathcal{C}_\beta} \hat{R}(\boldsymbol{\beta}) \qquad (4)$$
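Since $\boldsymbol{\beta}$ is fixed for a particular choice of $\mathbf{y}_u$ and has a closed-form solution, we can rewrite the minimization problem in terms of $\mathbf{y}_u$ instead of $\boldsymbol{\beta}$:
$$\operatorname*{argmin}_{\mathbf{y}_u \in [0,1]^U} \frac{1}{L} \left\lVert \mathbf{X} (\mathbf{X}_e^\top \mathbf{X}_e)^{-1} \mathbf{X}_e^\top \begin{bmatrix} \mathbf{y} \\ \mathbf{y}_u \end{bmatrix} - \mathbf{y} \right\rVert_2^2 \qquad (5)$$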
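The problem defined in Equation (5) can be written in a standard quadratic programming form. Dropping a positive scale factor and an additive constant, neither of which affects the minimizer, gives
$$\min_{\mathbf{y}_u} \; \frac{1}{2} \mathbf{y}_u^\top \mathbf{Q} \mathbf{y}_u + \mathbf{c}^\top \mathbf{y}_u \quad \text{subject to} \quad \begin{bmatrix} \mathbf{I} \\ -\mathbf{I} \end{bmatrix} \mathbf{y}_u \le \begin{bmatrix} \mathbf{1} \\ \mathbf{0} \end{bmatrix} \qquad (6)$$
where (see footnote 1)
$$\mathbf{Q} = \mathbf{X}_u (\mathbf{X}_e^\top \mathbf{X}_e)^{-1} \mathbf{X}^\top \mathbf{X} (\mathbf{X}_e^\top \mathbf{X}_e)^{-1} \mathbf{X}_u^\top$$
and
$$\mathbf{c} = \mathbf{X}_u (\mathbf{X}_e^\top \mathbf{X}_e)^{-1} \mathbf{X}^\top \mathbf{X} (\mathbf{X}_e^\top \mathbf{X}_e)^{-1} \mathbf{X}^\top \mathbf{y} - \mathbf{X}_u (\mathbf{X}_e^\top \mathbf{X}_e)^{-1} \mathbf{X}^\top \mathbf{y}$$
Here, $\mathbf{I}$ denotes the identity matrix and $\mathbf{1}$ and $\mathbf{0}$ denote column vectors of respectively ones and zeros.
Footnote 1: The published version of this paper contains a typo in this equation and the two equations that follow. We corrected this error here.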
Since the matrix Q is a product of a matrix and its transpose, it is guaranteed to be positive semidefinite. The problem is typically not positive definite because there are different labelings that will lead to one and the same minimum objective.
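The quadratic problem defined above can be solved using, for instance, an interior point method. We have found a gradient descent approach to be easier to apply. Taking the derivative of the objective in Equation (5) with respect to $\mathbf{y}_u$ and rearranging the terms we find
$$\frac{\partial \hat{R}}{\partial \mathbf{y}_u} = \frac{2}{L} \mathbf{X}_u (\mathbf{X}_e^\top \mathbf{X}_e)^{-1} \mathbf{X}^\top \left( \mathbf{X} (\mathbf{X}_e^\top \mathbf{X}_e)^{-1} \mathbf{X}_e^\top \begin{bmatrix} \mathbf{y} \\ \mathbf{y}_u \end{bmatrix} - \mathbf{y} \right)$$
Because of its convexity, this problem can be solved efficiently using a quasi-Newton approach that allows for the box bounds, such as L-BFGS-B Byrd1995 . Solving for $\mathbf{y}_u$ gives a labeling that we can use to construct the semisupervised classifier using Equation (2), applied to the extended design matrix, by considering the imputed labels as the labels for the unlabeled data.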
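The optimization described above can be sketched with SciPy's bounded L-BFGS-B solver. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `icls_fit`, the starting point of 0.5 for every soft label, the use of a pseudo-inverse and the toy data are all choices made here.

```python
import numpy as np
from scipy.optimize import minimize

def icls_fit(X, y, X_u):
    """Sketch of ICLS: optimize soft labels y_u in [0, 1], then refit on all data."""
    X_e = np.vstack([X, X_u])
    G = np.linalg.pinv(X_e.T @ X_e)          # plays the role of (X_e^T X_e)^{-1}
    n_labeled, n_unlabeled = X.shape[0], X_u.shape[0]

    def loss_and_grad(y_u):
        beta = G @ (X.T @ y + X_u.T @ y_u)   # least squares fit for this soft labeling
        residual = X @ beta - y              # error on the labeled objects only
        loss = residual @ residual / n_labeled
        grad = (2.0 / n_labeled) * (X_u @ (G @ (X.T @ residual)))
        return loss, grad

    start = np.full(n_unlabeled, 0.5)        # assumed starting point
    result = minimize(loss_and_grad, start, jac=True, method="L-BFGS-B",
                      bounds=[(0.0, 1.0)] * n_unlabeled)
    beta_semi = G @ (X.T @ y + X_u.T @ result.x)
    return beta_semi, result.x

# Toy data (hypothetical): one feature plus intercept, few labels, many unlabeled points.
rng = np.random.default_rng(2)
x_lab = np.concatenate([rng.normal(-1, 1, 4), rng.normal(1, 1, 4)])
X = np.column_stack([x_lab, np.ones(8)])
y = np.concatenate([np.zeros(4), np.ones(4)])
X_u = np.column_stack([rng.normal(0, 1.5, 200), np.ones(200)])
beta_semi, y_u = icls_fit(X, y, X_u)
beta_sup = np.linalg.pinv(X.T @ X) @ X.T @ y
```

Because the search is restricted to the constraint set, the labeled-data loss of `beta_semi` can never fall below that of the unconstrained `beta_sup`; any benefit is expected on new data, not on the labeled training loss.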
5 Theoretical Results
We will examine this procedure by considering it in a limited, yet illustrative setting. In this case we will, in fact, prove that our procedure will never give a worse least squares estimate than the supervised solution. Consider the case where we have just one feature $x$, a limited set of labeled instances and assume we know the probability density function of this feature $f_X(x)$ exactly. This last assumption is similar to having unlimited unlabeled data and is also considered, for instance, in Sokolovska2008 . We consider a linear model with no intercept: $y = x\beta$ where $y$, without loss of generality, is set as $0$ for one class and $1$ for the other. For new data points, estimates $\hat{y}$ can be used to determine the predicted label of an object by using a threshold set at, for instance, $0.5$.
The expected squared loss, or risk, for this model is given by
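$$R^*(\beta) = \sum_{y \in \{0,1\}} \int_{-\infty}^{\infty} (x\beta - y)^2 f_{X,Y}(x,y)\, dx \qquad (7)$$
where $f_{X,Y}(x,y) = P(y \mid x) f_X(x)$. We will refer to this as the joint density of $X$ and $Y$. Note, however, that this is not strictly a density, since it deals with the joint distribution over a continuous $X$ and a discrete $Y$. The optimal solution $\beta^*$ is given by the $\beta$ that minimizes this risk:
$$\beta^* = \operatorname*{argmin}_{\beta \in \mathbb{R}} R^*(\beta) \qquad (8)$$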
We will show the following result:
Theorem 1.
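Given a linear model in 1D without intercept, $y = x\beta$, and $f_X(x)$ known, the estimate obtained through implicitly constrained least squares always has an equal or lower risk than the supervised solution:
$$R^*(\hat{\beta}_{semi}) \le R^*(\hat{\beta}_{sup}) \qquad (9)$$
In particular, given a single labeled sample, if $f_X(x)$ is continuous in the feature $x$ with bounded second moment and , then
$$E\left[R^*(\hat{\beta}_{semi})\right] < E\left[R^*(\hat{\beta}_{sup})\right] \qquad (10)$$
Proof.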
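Setting the derivative of (7) with respect to $\beta$ to $0$ and rearranging we get
$$\beta = \left( \int_{-\infty}^{\infty} x^2 f_X(x)\, dx \right)^{-1} \sum_{y \in \{0,1\}} \int_{-\infty}^{\infty} x\, y\, f_{X,Y}(x,y)\, dx = \left( \int_{-\infty}^{\infty} x^2 f_X(x)\, dx \right)^{-1} \int_{-\infty}^{\infty} x\, E[y \mid x]\, f_X(x)\, dx \qquad (11)$$
In this last equation, since we assume $f_X(x)$ as given, the only unknown is the function $E[y \mid x]$, the expectation of the label $y$, given $x$. Now suppose we consider every possible labeling of the unlimited number of unlabeled objects including fractional labels, that is, every possible function $E[y \mid x]$ where $E[y \mid x] \in [0,1]$. Given this restriction on $E[y \mid x]$, the second integral in (11) becomes a reweighted version of the expectation operation over $x$. By changing the choice of $E[y \mid x]$ one can vary the value of this integral, but it will always be bounded on an interval on $\mathbb{R}$. It follows that all possible $\beta$'s also form an interval on $\mathbb{R}$, which is the constraint set $\mathcal{C}_\beta$. The optimal solution $\beta^*$ has to be in this interval, since it corresponds to a particular but unknown $E[y \mid x]$.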
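Using the set of labeled data, we can construct a supervised solution $\hat{\beta}_{sup}$ that minimizes the loss on the training set of $L$ labeled objects (see Figure 2):
$$\hat{\beta}_{sup} = \operatorname*{argmin}_{\beta \in \mathbb{R}} \sum_{i=1}^{L} (x_i \beta - y_i)^2 \qquad (12)$$
Now, either this solution falls within the constrained region, $\hat{\beta}_{sup} \in \mathcal{C}_\beta$, or not, $\hat{\beta}_{sup} \notin \mathcal{C}_\beta$, with different consequences:

If $\hat{\beta}_{sup} \in \mathcal{C}_\beta$, there is a labeling of the unlabeled points that gives us the same value for $\beta$. Therefore, the solution falls within the allowed region and there is no reason to update our estimate, so $\hat{\beta}_{semi} = \hat{\beta}_{sup}$.

Alternatively, if $\hat{\beta}_{sup} \notin \mathcal{C}_\beta$, the solution is outside of the constrained region (as shown in Figure 2): there is no possible labeling of the unlabeled data that will give the same solution as $\hat{\beta}_{sup}$. We then update the $\beta$ to be the $\hat{\beta}_{semi}$ within the constrained region that minimizes the loss on the supervised training set. As can be seen from Figure 2, this will be a point on the boundary of the interval. Note that $\hat{\beta}_{semi}$ is now closer to $\beta^*$ than $\hat{\beta}_{sup}$. Since the true loss function is convex and achieves its minimum in the optimal solution, corresponding to the true labeling, and since in one dimension the risk in (7) is a quadratic function of $\beta$ that grows with the distance from $\beta^*$, the risk of our semisupervised solution will always be equal to or lower than the risk of the supervised solution.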
Thus, the proposed update either improves the estimate of the parameter or it does not change the supervised estimate. In no case will the semisupervised solution be worse than the supervised solution, in terms of the expected squared loss. This concludes the proof of the first part of the theorem.
The last part of the theorem gives a general condition under which, in expectation, our semisupervised approach will outperform the supervised learner. Because $\hat{\beta}_{semi}$ will never be worse than $\hat{\beta}_{sup}$, to prove this we only need to show that for some observation of a labeled point with positive probability, the estimated $\hat{\beta}_{sup}$ is outside of the interval $\mathcal{C}_\beta$, in which case $R^*(\hat{\beta}_{semi}) < R^*(\hat{\beta}_{sup})$.
If we observe an object labeled with feature value , the corresponding estimate . Since the improvement in loss will only result if this estimate is not in the constrained region, we need to show that
(13) 
To do this, consider the bounds of the interval . These most extreme values are obtained whenever all negative values of are assigned label while the positive get labels , or the other way around. From (11) and writing we find the interval is given by
(14) 
Combining this with (13), we get the condition
(15) 
Since is assumed to be continuous, , and the lower bound in this equation is always smaller than , while the upper bound is always larger than . The assumption of the continuity of ensures that (15) holds whenever . The property is satisfied by many distributions of the data. The result, therefore, indicates, that in the case of labeled sample improvement is not only possible, but will occur in many cases. When we have multiple labeled examples, this effect will likely become smaller. This makes sense: the more labeled data we have to estimate the parameter, the smaller the impact of the unlabeled objects will be. ∎
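As a hedged illustration of the argument above, the interval and the single-sample condition can be worked out explicitly. The shorthand $m_2$, $m_+$ and $m_-$ is introduced here for convenience and is not taken from the original text; the sketch assumes labels in $\{0,1\}$, $m_2 > 0$ and $m_- < 0 < m_+$.

```latex
% Shorthand (assumed notation, not from the original text):
%   m_2 = \int x^2 f_X(x) dx,  m_+ = \int_0^\infty x f_X(x) dx,  m_- = \int_{-\infty}^0 x f_X(x) dx
\begin{align*}
\beta(g) &= \frac{1}{m_2}\int_{-\infty}^{\infty} x\, g(x)\, f_X(x)\, dx,
  \qquad g(x) = E[y \mid x] \in [0,1], \\
\mathcal{C}_\beta &= \left[ \frac{m_-}{m_2},\; \frac{m_+}{m_2} \right]
  \qquad \text{(extremes at } g = \mathbb{1}[x<0] \text{ and } g = \mathbb{1}[x>0]\text{)}, \\
\hat{\beta}_{sup} &= \frac{y_1}{x_1}
  \qquad \text{(single labeled point } (x_1, y_1),\ x_1 \neq 0\text{)}, \\
y_1 = 1:\quad \hat{\beta}_{sup} \notin \mathcal{C}_\beta
  &\iff \frac{m_2}{m_-} < x_1 < \frac{m_2}{m_+}, \\
y_1 = 0:\quad \hat{\beta}_{sup} &= 0 \in \mathcal{C}_\beta .
\end{align*}
% Example: for a standard normal f_X, m_2 = 1 and m_+ = -m_- = 1/\sqrt{2\pi},
% so a point labeled 1 gives a strict improvement whenever |x_1| < \sqrt{2\pi} \approx 2.51.
```

Under these assumptions, a single object labeled $1$ leads to a strict improvement unless its feature value is far from zero, which is in line with the statement that the lower bound of the condition is negative and the upper bound positive.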
6 Empirical Results
To study the properties of the proposed semisupervised approach to least squares classification, we compare how this approach fares against supervised least squares classification without the constraints.
For comparison we include two alternative semisupervised approaches and an oracle solution:
Self-Learning
Using a simple procedure proposed by McLachlan1975 , among others, the supervised least squares classifier is updated iteratively by using its class predictions on the unlabeled objects as the labels for the unlabeled objects in the next iteration. This is done until convergence.
Updated Second Moment Least Squares (USM)
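In this approach we replace the second moment matrix $\mathbf{X}^\top \mathbf{X}$ with an appropriately scaled matrix based on all data, similar to the estimator studied in Shaffer1991 :
$$\hat{\boldsymbol{\beta}}_{usm} = \left( \frac{L}{L+U} \mathbf{X}_e^\top \mathbf{X}_e \right)^{-1} \mathbf{X}^\top \mathbf{y}$$
where the design matrix and $\mathbf{y}$ are centered. This centering ensures that results do not depend on the particular encoding of the labels used. We will refer to this as updated second moment least squares (USM) classification.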
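A plug-in estimator of this kind might be sketched as follows. The centering convention (features centered with the mean of all objects, labels with the mean of the labeled objects), the explicit intercept and the name `usm_fit` are assumptions of this sketch and may differ from the setup used in the experiments.

```python
import numpy as np

def usm_fit(X, y, X_u):
    """Updated second moment sketch: swap X^T X for a scaled all-data version.

    X and X_u hold raw features without a constant column.
    """
    X_e = np.vstack([X, X_u])
    n_labeled, n_all = X.shape[0], X_e.shape[0]
    mean_x, mean_y = X_e.mean(axis=0), y.mean()
    Xc, Xec, yc = X - mean_x, X_e - mean_x, y - mean_y   # centering (assumed convention)
    second_moment = (n_labeled / n_all) * (Xec.T @ Xec)  # rescaled to the labeled sample size
    weights = np.linalg.pinv(second_moment) @ (Xc.T @ yc)
    intercept = mean_y - mean_x @ weights
    return weights, intercept

# Toy data (hypothetical): three features, 30 labeled and 100 unlabeled objects.
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=30) > 0).astype(float)
X_u = rng.normal(size=(100, 3))
weights, intercept = usm_fit(X, y, X_u)
```

With no unlabeled objects the scale factor equals one and the sketch reduces to ordinary least squares with an intercept, which gives a quick sanity check.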
Oracle
The performance of the least squares classifier if all unlabeled objects were labeled as well. This serves as the unattainable upper bound on the performance of any semisupervised learner.
A description of the datasets used for our experiments is given in Table 1. We use datasets from both the UCI repository Lichman2013 and from the benchmark datasets proposed by Chapelle2006 . While the benchmark datasets proposed in Chapelle2006 are useful, in our experience, the results on these datasets are very homogeneous because of the similarity in their dimensionality and their low Bayes errors. The UCI datasets are more diverse both in terms of the number of objects and features as well as the nature of the underlying problems. Taken together, this collection allows us to investigate the properties of our approach for a wide range of problems. All the code used to run the experiments is available from the first author’s website.
Dataset  Objects  Features  PCA99  Majority  Source 

Haberman  306  3  3  0.74  Lichman2013 
Ionosphere  351  33  30  0.64  Lichman2013 
Parkinsons  195  22  12  0.75  Lichman2013 
Diabetes  768  8  8  0.65  Lichman2013 
Sonar  208  60  43  0.53  Lichman2013 
SPECT  267  22  21  0.79  Lichman2013 
SPECTF  267  44  37  0.79  Lichman2013 
Transfusion  748  4  3  0.76  Lichman2013 
WDBC  569  30  17  0.63  Lichman2013 
Mammography  961  9  9  0.54  Lichman2013 
Digit1  1500  241  221  0.51  Chapelle2006 
USPS  1500  241  183  0.80  Chapelle2006 
COIL2  1500  241  114  0.50  Chapelle2006 
BCI  400  117  45  0.50  Chapelle2006 
g241c  1500  241  235  0.50  Chapelle2006 
g241d  1500  241  235  0.50  Chapelle2006 
6.1 Peaking Behaviour in Semisupervised Least Squares
With fewer samples than features, the supervised least squares classifier that utilizes a pseudo-inverse is known to exhibit a peaking phenomenon, as described in Opper1996 ; Raudys1998 : starting from a single observation, expected classification errors generally decrease as we add more data before errors increase again to reach a maximum approximately when the number of features is equal to the number of observations. This phenomenon can also be observed in the semisupervised setting. Figures 3 and 4 show learning curves of the methods considered here, using labeled training objects and an increasing number of unlabeled objects. Performance is evaluated on objects that were not in the labeled or unlabeled set. The Oracle classifier indicates the mean error when we do have the labels for the unlabeled objects and therefore corresponds to the peaking phenomenon in the supervised case. In the supervised case, several proposals have been made to ameliorate this peaking behaviour, such as feature selection, regularization, removing objects, injecting noise in the features, or adding redundant features Skurichina1999 . The semisupervised learners suffer from the same peaking phenomenon, except that unlike the Oracle, USM and ICLS do not fully recover from the initial increase in classification error.
We have no full explanation for the observed peaking behaviour in the semisupervised setting. Even in the supervised setting the behaviour remains elusive. The two observations we do make are: 1. that the peak occurs at the same location for both the supervised and semisupervised scenarios, which is likely due to the dependence of all methods on the inverse of $\mathbf{X}_e^\top \mathbf{X}_e$ and 2. that the subspace defined by the input data is the defining characteristic for the location of the peak.
This peaking behaviour is not the primary topic of this work and in the remainder we will restrict our attention to the case where there are enough labeled objects such that the matrix $\mathbf{X}^\top \mathbf{X}$ is invertible.
6.2 Comparison of Learning Curves
We study the behavior of the expected classification error of the ICLS procedure for different sizes of the unlabeled set. This statistic has two desired properties. First of all it should never be higher than the expected classification error of the supervised solution, which is based on only the labeled data. Secondly, the expected classification error should not increase as we add more unlabeled data. A semisupervised classifier that has both these properties can be used safely, since adding unlabeled data and continuing to add more unlabeled data will never decrease performance, on average.
Experiments were conducted as follows. For each dataset, labeled points were randomly chosen, where we make sure to sample at least 1 object from each of the two classes. Since the peaking phenomenon described in the previous section is not the main topic of this work, we avoid this situation by considering the setting in which the labeled design matrix is of full rank, which we ensure by setting the number of labeled points to $d + 5$, the dimensionality of the dataset plus five observations. For all datasets we ensure a minimum of labeled objects.
Next, we create unlabeled subsets of increasing size by randomly selecting points from the original dataset without replacement. The classifiers are trained using these subsets and the classification performance is evaluated on the remaining objects. Since the test set decreases in size as the number of unlabeled objects increases, the standard error slightly increases with the number of unlabeled objects.
The results of these experiments are shown in Figure 5. We report the mean classification error as well as the standard error of this mean. As can be seen from the tight confidence bands, this offers an accurate estimate of the expected classification error.
This procedure of sampling labeled and unlabeled points is repeated times and the average classification error (Figure 5) and squared loss (Figure 6) on the test set are determined. The latter is done to evaluate whether the approach is effective in increasing generalization performance in terms of the loss used in estimating the classifier. This is the same loss that we consider in Theorem 1. Even though in applications the ultimate goal may typically be classification performance, this allows us to study whether problems occur because of the optimization itself, or because of the link between the surrogate loss used and the classification error.
We find that, generally, the ICLS procedure has monotonically decreasing error curves as the number of unlabeled samples increases, unlike selflearning. On the Diabetes and Transfusion datasets, the performance of selflearning becomes worse than the supervised solution when more unlabeled data is added, while the ICLS classifier again exhibits a monotonic decrease of the average error rate. The USM classifier performs well on most datasets except for the Mammography dataset, where both in terms of average error rates and squared loss, performance is worse than the supervised classifier.
When we compare the error curves and the loss curves, the nonmonotonically decreasing losses for the selflearner correspond to increased errors. In general, however, similar losses for different classifiers can give rise to different behaviours in terms of error rates.
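That the squared loss and the error rate need not move together can be seen in a small constructed example. The sketch below assumes a 0/1 label encoding and a decision threshold of 0.5 on the real-valued output; both the helper names and the scores are hypothetical.

```python
def squared_loss(scores, labels):
    """Mean squared difference between real-valued outputs and 0/1 labels."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


def error_rate(scores, labels, threshold=0.5):
    """Fraction of objects misclassified after thresholding the outputs."""
    predictions = [1 if s > threshold else 0 for s in scores]
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)


labels = [1, 1, 0, 0]
# Classifier A: every object on the correct side of the threshold, but barely.
scores_a = [0.51, 0.51, 0.49, 0.49]
# Classifier B: three near-perfect outputs and one object just on the wrong side.
scores_b = [1.0, 1.0, 0.0, 0.55]

loss_a, err_a = squared_loss(scores_a, labels), error_rate(scores_a, labels)
loss_b, err_b = squared_loss(scores_b, labels), error_rate(scores_b, labels)
```

In this constructed case classifier B has the lower squared loss but the higher error rate, so a ranking of classifiers by loss does not have to agree with a ranking by error.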
6.3 Benchmark performance
We now consider the performance of these classifiers in a crossvalidation setting. The experiment is set up as follows. For each dataset, the objects are randomly divided into folds. We iteratively go through the folds using fold as validation set, and the other as the training set. From this training set, we then randomly select labeled objects, as in the previous experiment, and use the rest as unlabeled data. After predicting labels for the validation set for each fold, the classification error is then determined by comparing the predicted labels to the real labels. This is repeated times, while randomly assigning objects to folds in each iteration.
The crossvalidation procedure used here is slightly different from that described in Chapelle2006 , so that it relates more closely to the crossvalidation procedure that is usually employed in supervised learning. More specifically, our procedure ensures the validation sets are independent (nonoverlapping), such that, after going over all the folds, each object is in the validation set only once. This is different from the procedure in Chapelle2006 , where the authors ensure the labeled sets are nonoverlapping. We have not found a qualitative difference in the error rates, however, when using the procedure proposed in Chapelle2006 . The advantage of the procedure employed here is that every object gets a single predicted label, allowing for the direct comparison of predictions of different classifiers.
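A minimal sketch of this splitting scheme is given below. The function names, the fold count, and the sizes in the usage lines are hypothetical, and the check that both classes occur in the labeled set is omitted for brevity; the sketch only illustrates that the validation sets do not overlap and that the labeled objects are drawn from the training folds.

```python
import random


def cv_fold_indices(n_objects, n_folds, rng):
    """Randomly partition object indices into nonoverlapping folds."""
    indices = list(range(n_objects))
    rng.shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]


def semi_supervised_cv_splits(n_objects, n_folds, n_labeled, rng):
    """Yield (labeled, unlabeled, validation) index lists, one per fold."""
    folds = cv_fold_indices(n_objects, n_folds, rng)
    for k, validation in enumerate(folds):
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        # Labeled objects are sampled from the training folds only;
        # the rest of the training folds is treated as unlabeled.
        labeled = rng.sample(training, n_labeled)
        labeled_set = set(labeled)
        unlabeled = [i for i in training if i not in labeled_set]
        yield labeled, unlabeled, validation


splits = list(semi_supervised_cv_splits(23, 5, 6, random.Random(1)))
validated = sorted(i for _, _, validation in splits for i in validation)
```

Because the folds partition the objects, `validated` contains every index exactly once, which is what allows the predictions of different classifiers to be compared object by object.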
Table 2
Dataset      Supervised  SelfLearning  USM        ICLS       Oracle
-----------  ----------  ------------  ---------  ---------  ---------
Haberman     0.29        0.28 (33)     0.28 (42)  0.29 (24)  0.26 (11)
Ionosphere   0.29        0.24 (1)      0.22 (1)   0.19 (0)   0.13 (0)
Parkinsons   0.34        0.29 (5)      0.25 (3)   0.26 (1)   0.12 (0)
Diabetes     0.32        0.34 (83)     0.31 (31)  0.31 (7)   0.23 (0)
Sonar        0.42        0.37 (5)      0.34 (3)   0.33 (1)   0.25 (0)
SPECT        0.41        0.39 (28)     0.28 (0)   0.33 (1)   0.18 (0)
SPECTF       0.43        0.40 (14)     0.31 (0)   0.36 (2)   0.23 (0)
Transfusion  0.27        0.28 (63)     0.26 (30)  0.27 (25)  0.23 (2)
WDBC         0.27        0.18 (0)      0.20 (2)   0.13 (0)   0.04 (0)
Mammography  0.28        0.28 (28)     0.28 (54)  0.27 (14)  0.20 (0)
Digit1       0.42        0.34 (0)      0.25 (0)   0.20 (0)   0.06 (0)
USPS         0.42        0.34 (0)      0.22 (0)   0.20 (0)   0.09 (0)
COIL2        0.39        0.27 (0)      0.24 (0)   0.19 (0)   0.10 (0)
BCI          0.41        0.35 (1)      0.30 (0)   0.28 (0)   0.16 (0)
g241c        0.45        0.39 (0)      0.30 (0)   0.29 (0)   0.14 (0)
g241d        0.45        0.39 (0)      0.30 (0)   0.29 (0)   0.13 (0)
The results shown in Table 2 tell a similar story to those in the previous experiment. Most importantly for the purposes of this paper, ICLS, in general, offers solutions that give at least no higher expected classification error than the supervised procedure. On many of these datasets, the selflearning approach seems to share this property. However, if we look at the number of crossvalidation repeats in which ICLS and selflearning give a higher error than the supervised solution (the counts in parentheses in Table 2), there is a clear difference. On every dataset, selflearning gives a higher error than the supervised solution on at least as many repeats as ICLS, and on most datasets on strictly more (for example, 83 versus 7 repeats on Diabetes).
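The per-repeat comparison can be made explicit with a small helper. The sketch below is illustrative only: the function name is hypothetical, the error values are made up, and ties are assumed not to count as a deterioration.

```python
def count_worse_repeats(supervised_errors, semisupervised_errors):
    """Count repeats in which the semisupervised error exceeds the supervised error."""
    return sum(
        semi > sup
        for sup, semi in zip(supervised_errors, semisupervised_errors)
    )


# Made-up crossvalidation errors for four repeats, for illustration only.
supervised = [0.30, 0.28, 0.31, 0.29]
semisupervised = [0.27, 0.29, 0.31, 0.25]
n_worse = count_worse_repeats(supervised, semisupervised)
```

Here only the second repeat counts, since the third repeat is a tie.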
The results also show that unlabeled information is of use. Particularly on the last six datasets, ICLS and USM offer large improvements in classification accuracy over the supervised solution. The differences in performance between ICLS and selflearning can also be quite substantial, where ICLS outperforms selflearning on most of the datasets. USM performs well on many of the datasets, especially when we consider how simple and computationally efficient this procedure is.
Table 3
Dataset      Supervised  SelfLearning  TSVM        Oracle
-----------  ----------  ------------  ----------  ---------
Haberman     0.29        0.29 (34)     0.32 (92)   0.26 (8)
Ionosphere   0.17        0.18 (81)     0.17 (51)   0.11 (0)
Parkinsons   0.22        0.22 (32)     0.22 (60)   0.14 (0)
Diabetes     0.31        0.31 (40)     0.28 (7)    0.23 (0)
Sonar        0.26        0.26 (53)     0.25 (33)   0.25 (25)
SPECT        0.30        0.28 (13)     0.25 (3)    0.18 (0)
SPECTF       0.30        0.29 (28)     0.28 (29)   0.21 (0)
Transfusion  0.27        0.27 (59)     0.29 (96)   0.23 (0)
WDBC         0.06        0.06 (53)     0.05 (30)   0.03 (0)
Mammography  0.27        0.28 (60)     0.25 (3)    0.20 (0)
Digit1       0.08        0.08 (85)     0.06 (1)    0.05 (0)
USPS         0.14        0.13 (17)     0.12 (5)    0.11 (1)
COIL2        0.16        0.16 (75)     0.19 (100)  0.09 (0)
BCI          0.28        0.29 (70)     0.36 (99)   0.17 (0)
g241c        0.22        0.23 (87)     0.17 (0)    0.16 (0)
g241d        0.23        0.24 (90)     0.17 (0)    0.16 (0)
While we are interested in a semisupervised procedure that outperforms the supervised least squares classifier, for comparison we repeated the experiment for the (linear) supervised SVM, selflearning applied to the SVM and the Transductive SVM. We used the SVM and TSVM implementations of Sindhwani2006 , setting the regularization parameter to and the influence parameter of the unlabeled data to , as was also done in Sindhwani2006 . The experiment is set up in the same way as the one in Table 2. The results are shown in Table 3.
On many of the datasets, the supervised support vector classifier has a lower error than the supervised least squares classifier, due to the use of a regularization term in the SVM implementation, which we do not include in our analysis and which makes the results difficult to compare directly to the results in Table 2. Selflearning performs worse compared to the least squares setting, which may be a consequence of the supervised solution already being a decent solution on some of these datasets. The Transductive SVM offers some improvements over the supervised solution. Compared to ICLS, however, the TSVM gives worse performance than the supervised solution on many more datasets and many more repeats, the exact behaviour we attempted to avoid when constructing ICLS.
7 Discussion
From Theory to Empirical Results
The results presented in this paper are rather promising, especially in the light of the negative theoretical performance results presented in the literature Cozman2006 . The result in Theorem 1, to start with, indicates the proposed procedure is in some way robust against reduction in performance. The strong result of this theorem, stating that performance never gets worse, holds in the 1D case with unlimited unlabeled data and no intercept in the model. A slightly weaker result, that performance does not degrade on average, may still hold without these assumptions. This last statement is corroborated by the empirical results showing improvements in averaged squared errors for ICLS throughout.
The results in the previous section also indicate that such improved results hold in terms of the misclassification error, at least on this collection of datasets. These empirical observations are encouraging because we are often interested in misclassification error and not the squared loss that was considered in Theorem 1. Furthermore, the experiments were carried out in the multivariate setting with an intercept term using limited unlabeled data, rather than the unlimited unlabeled data setting considered in the theorem. This indicates that minimizing the supervised loss over the subset leads to a semisupervised learner with desirable behavior, both theoretically in terms of risk and empirically in terms of classification error.
Robustness
The method considered in this work is different from most previous work in semisupervised learning in that it is inherently robust against a decrease in performance. The robustness of the method comes from the fact that we do not accept solutions that do not work on the labeled data. The goal of semisupervised learning is to improve supervised techniques using the additional information inherent in the additional unlabeled objects. Previous approaches have done this by changing the loss function that is being optimized, in particular by introducing an extra term corresponding to assumptions about the unlabeled data. The loss function then becomes a mixture between the supervised objective and an unsupervised objective, which may lead to decreased performance as we observed in Table 3. If the goal is classification, we propose that the loss function should remain the supervised loss function. The unlabeled objects are merely used to introduce constraints on the possible solutions to this loss function, but do not change its functional form.
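To make the idea of keeping the supervised loss and using the unlabeled objects only as constraints concrete, the following sketch works it out for the univariate setting without intercept. It is an illustration under stated assumptions, not the implementation used in the experiments: labels are assumed to be encoded as 0 and 1, the imputed labels are assumed to be relaxed to soft labels in [0, 1], and the function names are hypothetical. In this setting the slope is a linear function of the soft labels, so the attainable slopes form an interval, and because the supervised squared loss is a convex quadratic in the slope, its constrained minimizer is the supervised slope clipped to that interval.

```python
def fit_supervised_1d(x, y):
    """Least squares slope without intercept: argmin_b sum((x_i * b - y_i) ** 2)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def icls_1d(x_labeled, y_labeled, x_unlabeled):
    """Sketch of implicitly constrained least squares in 1D, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x_labeled, y_labeled))
    sxx = sum(xi * xi for xi in x_labeled) + sum(xu * xu for xu in x_unlabeled)
    # Slope for soft labels q: (sxy + sum(x_u * q_u)) / sxx with q_u in [0, 1].
    # Smallest slope: q_u = 1 where x_u < 0; largest slope: q_u = 1 where x_u > 0.
    beta_min = (sxy + sum(xu for xu in x_unlabeled if xu < 0)) / sxx
    beta_max = (sxy + sum(xu for xu in x_unlabeled if xu > 0)) / sxx
    # The loss on the labeled data is minimized at the supervised slope,
    # so the constrained minimizer is its projection onto [beta_min, beta_max].
    beta_supervised = fit_supervised_1d(x_labeled, y_labeled)
    return min(max(beta_supervised, beta_min), beta_max)


x_labeled, y_labeled = [1.0, -1.0], [1, 0]
x_unlabeled = [2.0, -2.0]
beta_supervised = fit_supervised_1d(x_labeled, y_labeled)
beta_icls = icls_1d(x_labeled, y_labeled, x_unlabeled)
```

In this toy example the supervised slope is 0.5, the attainable interval is [-0.1, 0.3], and the constrained solution is therefore 0.3. In higher dimensions the same problem no longer has this simple clipping form and would have to be solved with a bound-constrained or quadratic programming solver.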
Assumptions
Most other semisupervised techniques rely on introducing useful assumptions that link information about the distribution of the features to the posterior of the classes . It has been argued that, for discriminative classifiers, semisupervised learning is impossible without these additional assumptions about the link between labeled and unlabeled objects Seeger2001 ; Singh2008 . ICLS, however, is a discriminative classifier that makes no explicit additional assumptions about this link. Any assumptions that are present follow, implicitly, from the choice of squared loss as the loss function and from the chosen hypothesis space.
In fact, additional assumptions may actually be at the root of the problem: clearly if such an additional assumption is correct, a semisupervised classifier can gain from it, but if the assumption is incorrect, degraded performance may ensue. What we leverage in our approach are the implicit assumptions that are, in a sense, intrinsic to the supervised least squares classifier.
One could argue that constraining the solutions to those that correspond to some labeling of the unlabeled objects is an assumption as well. It corresponds to a very weak assumption about the supervised classifier: that it will improve when we add additional labeled data. This is generally assumed in the supervised setting as well. The lack of additional assumptions has another advantage: no additional hyperparameter value needs to be selected that controls the importance of the unlabeled data for the results in Sections 5 and 6 to hold, as ICLS acts as a type of data dependent regularization.
Note that the solution provided by selflearning is, by construction, also in the constrained subset. The difference with ICLS is that in ICLS the choice of estimate from this subset is based on information from the labeled objects only, while selflearning also uses the imputed labels on the unlabeled objects. This may lead to selfdeception: if the imputed labels are wrong, a good fit for these wrongly imputed labels does not necessarily lead to an improved estimate. In fact, it might lead to worse choices, as shown in the results.
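The way selflearning selects its solution can be sketched for the univariate setting without intercept. The sketch assumes 0/1 labels, hard imputed labels obtained by thresholding the output at 0.5, and refitting until the imputed labels stop changing; the function names and the iteration cap are hypothetical choices for this illustration.

```python
def fit_1d(x, y):
    """Least squares slope without intercept."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def self_learning_1d(x_labeled, y_labeled, x_unlabeled, max_iter=100):
    """Iteratively impute hard labels for the unlabeled objects and refit."""
    x_all = list(x_labeled) + list(x_unlabeled)
    beta = fit_1d(x_labeled, y_labeled)  # start from the supervised solution
    imputed = None
    for _ in range(max_iter):
        new_imputed = [1 if xu * beta > 0.5 else 0 for xu in x_unlabeled]
        if new_imputed == imputed:
            break  # imputed labels are stable, so the fit no longer changes
        imputed = new_imputed
        # The refit treats the imputed labels exactly like true labels.
        beta = fit_1d(x_all, list(y_labeled) + imputed)
    return beta, imputed


beta_self, imputed = self_learning_1d([1.0, -1.0], [1, 0], [2.0, -2.0])
```

The returned slope is a least squares fit for one particular labeling of the unlabeled objects, so it lies in the constrained subset, but the labeling is chosen by the classifier's own predictions rather than by the loss on the labeled objects.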
Time Complexity
In terms of the number of features, ICLS scales in the same way as the supervised least squares solution, where the main bottleneck is the calculation of . Furthermore, the quadratic programming formulation of ICLS presented in Section 4 allows one to use the standard and constantly improving tools from convex optimization to find the ICLS estimate. Unfortunately, one has to go from a convex problem with variables in the supervised case to a constrained convex problem with variables for ICLS. For very large , this may not currently be computationally feasible. Further insight into the general nature of the semisupervised solutions that one obtains can lead to more dedicated and potentially more scalable methods to solve the quadratic programming problem we have to deal with in our approach.
Compared to ICLS, selflearning seems more favorable in terms of computational cost. Selflearning usually converges in a few iterations, where each iteration has at most the cost of one supervised least squares estimation. In our implementations, however, selflearning and ICLS had similar training times (Figure 7). USM with its simple closed form solution has much lower training times and performs surprisingly well.
Squared Loss
Generally, models used in practice do not directly minimize misclassification error. For computational reasons, convex surrogate losses, such as the one employed here, are often minimized instead. It is therefore interesting to look at the performance of a classifier in terms of these surrogate losses Loog2016a . We have chosen to restrict ourselves to a particular convex loss and attempted to ensure improvement in terms of this chosen loss function.
When we compare the average squared loss on the test set, ICLS, USM and selflearning often seem to offer similar performance. This is quite unlike the results in, for instance, Loog2010 ; Loog2014b , where the selflearner often performed much worse in terms of the loss than an approach based on constraining the solution using unlabeled data. While Loog2010 ; Loog2014b consider a generative classifier, we consider a discriminative classifier, in which case selflearning may be less susceptible to increases in the loss. Selflearning does, however, still increase the loss on some datasets, unlike ICLS.
The peaking phenomenon described in Opper1996 ; Raudys1998 is known to occur for squared loss minimization when we increase the number of labeled samples. Here we find it also occurs when we change the number of unlabeled samples. It seems that ICLS and USM are more sensitive to this problem than selflearning. As yet, we do not have any explanation for this behavior. Further improvements to the current approach may start by trying to understand this occurrence of peaking.
Other Losses
While the results presented in this work are promising for squared loss, an open question is what other classifiers could benefit from the implicitly constrained approach considered here. Using negative log likelihood as a loss function, for instance, also leads to an interesting implicitly constrained semisupervised classifier in the case of linear discriminant analysis Krijthe2014 .
In the derivation of ICLS, we made use of the closedform solution given an imputed labeling to derive a quadratic programming problem in terms of the labels. For many loss functions, closedform solutions do not exist, which prohibits a straightforward formulation of their implicitly constrained semisupervised counterparts. Without a supervised closedform solution one cannot straightaway apply techniques like gradient descent to the parameters as this typically leads to solutions that are outside of the set , even if the loss considered is differentiable.
More Constraints
In Figure 1, we illustrate that projecting onto the subset causes improvement as long as a better solution than the supervised solution is within . A smaller will give a larger improvement, since the semisupervised solution is going to be closer to . In the extreme case where only forms the subset, this clearly gives a large improvement over supervised learning. It therefore makes sense to think about reducing the size of . In the approach presented in this work, however, to ensure a better solution than the supervised solution is always within the constraint set with probability , our choice of is conservatively large. It contains elements corresponding to all labelings of the unlabeled points, even extremely unlikely ones.
By excluding unlikely labelings from the subset, the size of may shrink, while the probability that it includes remains high. For instance, one might exclude labelings with class priors that are very unlikely to occur, given the class priors that are observed in the labeled data, a strategy which is also employed in Transductive SVMs where it is necessary for it to converge to meaningful local optima. Changes to may, therefore, allow for larger improvements in terms of the risk or classification error, while introducing a small chance of deterioration in performance.
8 Conclusion
This work introduced a new semisupervised approach to least squares classification. By implicitly considering all possible labelings of the unlabeled objects and choosing the one that minimizes the loss on the labeled observations, we derived a robust classifier with a simple quadratic programming formulation. For this procedure, in the univariate setting with a linear model without intercept, we can prove it never degrades performance in terms of squared loss (Theorem 1). Experimental results indicate that in expectation this robustness also holds in terms of classification error on real datasets. Hence, semisupervised learning for least squares classification without additional assumptions can lead to improvements over supervised least squares classification both in theory and in practice.
Acknowledgement
Part of this work was funded by project P23 of the Dutch publicprivate research community COMMIT.
References
 (1) F. Cozman, I. Cohen, Risks of SemiSupervised Learning, in: O. Chapelle, B. Schölkopf, A. Zien (Eds.), SemiSupervised Learning, MIT press, 2006, Ch. 4, pp. 56–72.
 (2) F. G. Cozman, I. Cohen, M. C. Cirelo, SemiSupervised Learning of Mixture Models, in: Proceedings of the Twentieth International Conference on Machine Learning, 2003.
 (3) M. Seeger, Learning with labeled and unlabeled data, Tech. rep. (2001).
 (4) A. Singh, R. D. Nowak, X. Zhu, Unlabeled data: Now it helps, now it doesn’t, in: Advances in Neural Information Processing Systems, 2008, pp. 1513–1520.
 (5) T. Hastie, R. Tibshirani, J. H. Friedman, The Elements of Statistical Learning, 2nd Edition, Springer, 2009.
 (6) T. Poggio, S. Smale, The Mathematics of Learning: Dealing with Data, Notices of the AMS (2003) 537–544.
 (7) R. Rifkin, G. Yeo, T. Poggio, Regularized leastsquares classification, Nato Science Series Sub Series III Computer and Systems Sciences 190.
 (8) J. A. K. Suykens, J. Vandewalle, Least Squares Support Vector Machine Classifiers, Neural Processing Letters 9 (1999) 293–300.
 (9) R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B 58 (1) (1996) 267–288.
 (10) L. Bottou, Largescale machine learning with stochastic gradient descent, in: Proceedings of COMPSTAT’2010, Springer, 2010, pp. 177–186.
 (11) J. H. Krijthe, M. Loog, Implicitly Constrained SemiSupervised Least Squares Classification, in: E. Fromont, T. D. Bie, M. van Leeuwen (Eds.), 14th International Symposium on Advances in Intelligent Data Analysis XIV (Lecture Notes in Computer Science Volume 9385), Saint Étienne, France, 2015, pp. 158–169.
 (12) O. Chapelle, B. Schölkopf, A. Zien, Semisupervised learning, MIT press, 2006.
 (13) X. Zhu, A. B. Goldberg, Introduction to SemiSupervised Learning, Vol. 3, Morgan & Claypool, 2009.
 (14) K. Nigam, A. K. McCallum, S. Thrun, T. Mitchell, Text classification from labeled and unlabeled documents using EM, Machine learning 34 (2000) 1–34.
 (15) L. Käll, J. D. Canterbury, J. Weston, W. S. Noble, M. J. MacCoss, Semisupervised learning for peptide identification from shotgun proteomics datasets., Nature methods 4 (11) (2007) 923–925. doi:10.1038/nmeth1113.
 (16) M. Shi, B. Zhang, Semisupervised learning improves gene expressionbased prediction of cancer recurrence, Bioinformatics 27 (21) (2011) 3017–3023.
 (17) D. Elworthy, Does BaumWelch reestimation help taggers?, in: Proceedings of the fourth conference on Applied natural language processing, 1994, pp. 53–58. arXiv:9410012v2.
 (18) A. B. Goldberg, X. Zhu, Keepin’it real: semisupervised learning with realistic tuning, NAACL HLT 2009 Workshop on Semisupervised Learning for Natural Language Processing.
 (19) J. Wang, X. Shen, W. Pan, On Transductive Support Vector Machines, Contemporary Mathematics 443 (2007) 7–19.
 (20) G. J. McLachlan, Iterative Reclassification Procedure for Constructing an Asymptotically Optimal Rule of Allocation in Discriminant Analysis, Journal of the American Statistical Association 70 (350) (1975) 365–369.
 (21) S. Abney, Understanding the yarowsky algorithm, Computational Linguistics 30 (3) (2004) 365–395.
 (22) D. Yarowsky, Unsupervised word sense disambiguation rivaling supervised methods, Proceedings of the 33rd annual meeting on Association for Computational Linguistics (1995) 189–196. doi:10.3115/981658.981684.
 (23) Y. Grandvalet, Y. Bengio, Semisupervised learning by entropy minimization, in: L. K. Saul, Y. Weiss, L. Bottou (Eds.), Advances in Neural Information Processing Systems 17, MIT Press, Cambridge, MA, 2005, pp. 529–536.
 (24) T. Joachims, Transductive inference for text classification using support vector machines, in: Proceedings of the 16th International Conference on Machine Learning, Morgan Kaufmann Publishers, 1999, pp. 200–209.
 (25) K. P. Bennett, A. Demiriz, Semisupervised support vector machines, in: Advances in Neural Information Processing Systems 11, 1998, pp. 368–374.
 (26) V. Sindhwani, S. S. Keerthi, Large scale semisupervised linear SVMs, in: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, New York, New York, USA, 2006, p. 477.
 (27) R. Collobert, F. Sinz, J. Weston, L. Bottou, Large scale transductive SVMs, Journal of Machine Learning Research 7 (2006) 1687–1712.
 (28) J. Wang, X. Shen, Large margin Semisupervised Learning, Journal of Machine Learning Research 8 (2007) 1867–1891.
 (29) M. Loog, Constrained Parameter Estimation for SemiSupervised Learning: The Case of the Nearest Mean Classifier, in: Proceedings of the 2010 European Conference on Machine learning and Knowledge Discovery in Databases, 2010, pp. 291–304.
 (30) M. Loog, A. C. Jensen, SemiSupervised Nearest Mean Classification through a constrained LogLikelihood, IEEE Transactions on Neural Networks and Learning Systems 26 (5) (2014) 995–1006.
 (31) Y.F. Li, Z.h. Zhou, Towards making unlabeled data never hurt, in: Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 1081–1088.
 (32) R. J. A. Little, D. B. Rubin, Statistical Analysis with Missing Data, 2002.
 (33) M. Healy, M. Westmacott, Missing Values in Experiments Analysed on Automatic Computers, Journal of the Royal Statistical Society 5 (3) (1956) 203–206.
 (34) J. P. Shaffer, The GaussMarkov Theorem and Random Regressors, The American Statistician 45 (4) (1991) 269–273.
 (35) B. Fan, Z. Lei, S. Z. Li, Normalized LDA for Semisupervised Learning, in: International Conference on Automatic Face & Gesture Recognition, 2008, pp. 1–6.
 (36) R. H. Byrd, P. Lu, J. Nocedal, C. Zhu, A limited memory algorithm for bound constrained optimization, SIAM Journal on Scientific Computing 16 (5) (1995) 1190–1208.
 (37) N. Sokolovska, O. Cappé, F. Yvon, The asymptotics of semisupervised learning in discriminative probabilistic models, in: W. W. Cohen, A. McCallum, S. T. Roweis (Eds.), Proceedings of the 25th International Conference on Machine Learning, ACM Press, Helsinki, Finland, 2008, pp. 984–991.
 (38) M. Lichman, UCI Machine Learning Repository (2013). URL http://archive.ics.uci.edu/ml
 (39) M. Opper, W. Kinzel, Statistical Mechanics of Generalization, in: E. Domany, J. L. Hemmen, K. Schulten (Eds.), Models of Neural Networks III, Springer, New York, 1996, pp. 151–209.
 (40) S. Raudys, R. P. W. Duin, Expected classification error of the Fisher linear classifier with pseudoinverse covariance matrix, Pattern Recognition Letters 19 (5–6) (1998) 385–392.
 (41) M. Skurichina, R. P. W. Duin, Regularisation of Linear Classifiers by Adding Redundant Features, Pattern Analysis & Applications 2 (1) (1999) 44–52. doi:10.1007/s100440050013.
 (42) M. Loog, J. H. Krijthe, A. C. Jensen, On Measuring and Quantifying Performance: Error Rates, Surrogate Loss, and an Example in SSL, in: C. H. Chen (Ed.), Handbook of Pattern Recognition and Computer Vision, 5th Edition, World Scientific, 2016, Ch. 1.3.
 (43) J. H. Krijthe, M. Loog, Implicitly Constrained SemiSupervised Linear Discriminant Analysis, in: Proceedings of the 22nd International Conference on Pattern Recognition, Stockholm, 2014, pp. 3762–3767.