1 Introduction
In practice, one of the main difficulties faced by many machine learning tasks is the manual labeling of large amounts of data. This is especially prominent for deep learning, which usually demands a huge number of well-labeled samples. Therefore, how to train a deep network with the least amount of labeled data has become an important topic in the area. To overcome this problem, researchers proposed using a large amount of unlabeled data to extract the topology of the overall data distribution. Combined with a small amount of labeled data, this can significantly improve the generalization ability of the model, which is the so-called semi-supervised learning [5, 21, 18].
Recently, semi-supervised deep learning has made some progress. The main ideas of existing works broadly fall into two categories. One is generative-model-based algorithms, in which unlabeled samples help the generative model learn the underlying sample distribution for sample generation. Examples of this type of algorithm include CatGAN [15], BadGAN [7], variational Bayesian methods [10], etc. The other is discriminant-model-based algorithms, in which the unlabeled data provide sample distribution information to prevent model overfitting or to make the model more resistant to disturbances. Typical algorithms of this type include unsupervised loss regularization [16, 1], latent feature embedding [18, 20, 8, 14], and pseudo labels [11, 19]. Our method belongs to the second category: an unsupervised regularization term, which captures the local and global sample distribution characteristics, is added to the loss function for semi-supervised deep learning.
The proposed algorithm is based on the theory of manifold regularization, which was developed by Belkin et al. [3, 4] and then introduced into deep learning by Weston et al. [18]. Given labeled samples and their corresponding labels, manifold regularization combines the idea of manifold learning with that of semi-supervised learning: it learns the manifold structure of the data from a large amount of unlabeled data, which gives the model better generalization. Compared to the loss function in the traditional supervised learning framework, the manifold-regularization-based semi-supervised learning algorithm adds a new regularization term that penalizes the complexity of the discriminant function over the sample distribution manifold, as shown in equation (1):
(1) 
Here the supervised loss term is arbitrary, and the kernel norm (induced, for example, by a Gaussian kernel function) penalizes the model complexity in the ambient (data) space. The additional manifold regularization term penalizes model complexity along the data distribution manifold, so that the prediction outputs have the same distribution as the input data. The two regularization terms are weighted by separate coefficients. As shown in Fig. 1, after the manifold regularization term is introduced, the decision boundary tries not to destroy the manifold structure of the data distribution while keeping itself as simple as possible, so that the boundary finally passes through regions where the data is sparsely distributed.
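For reference, the manifold regularization objective of Belkin et al. [3] is commonly written in the form below. The symbols ($V$ for the supervised loss, $l$ for the number of labeled samples, $\gamma_A$ and $\gamma_I$ for the two weights) follow that standard formulation and are an assumption here; they may differ from the notation used in equation (1).

```latex
f^{*} = \arg\min_{f \in \mathcal{H}_K} \;
  \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f)
  + \gamma_A \, \| f \|_K^2
  + \gamma_I \, \| f \|_I^2
```

In this form, $\|f\|_K^2$ is the ambient-space kernel norm and $\|f\|_I^2$ is the intrinsic (manifold) penalty, which in practice is usually approximated with a graph Laplacian built from both labeled and unlabeled samples.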
However, the application of manifold regularization to semi-supervised deep learning has not been fully explored. The construction of manifold regularization only considers the local structural relationship of samples. For classification problems, we should not only preserve the positional relationship of neighboring data to ensure clustering, but also distinguish data from different manifolds and separate them in the embedded space. Therefore, in this paper, we propose a novel manifold loss term based on an improved Unsupervised Discriminant Projection (UDP) [9], which incorporates both local and nonlocal distribution information, and we conduct experiments on real-world datasets to demonstrate that it can produce better classification accuracy for semi-supervised deep learning than its counterparts.
The rest of this paper is organized as follows: the theory and the proposed algorithm are presented in Section 2, the experimental results are given in Section 3, and conclusions and discussions follow in Section 4.
2 Improved UDP Regularization Term
In this section, we first review the UDP algorithm and then introduce an improved UDP algorithm. We then propose a semi-supervised deep learning algorithm based on the improved UDP algorithm.
2.1 Basic idea of UDP
The UDP method was originally proposed by Yang et al. for dimensionality reduction of small-scale high-dimensional data [9]. As a method for multi-manifold learning, UDP considers both local and nonlocal quantities of the data distribution. The basic idea of UDP is shown in Fig. 2. Suppose that the data is distributed on two elliptical manifolds. If we only require that neighboring data remain close after being projected along a certain direction, then the direction that best preserves these local distances is optimal, but the two data clusters will then be mixed with each other and difficult to separate after projection. Therefore, while requiring neighboring data to be sufficiently close after projection, we should also optimize the direction of the projection so that different clusters are as far apart as possible. Such projected data are more conducive to clustering after dimensionality reduction.
For this reason, UDP uses the ratio of local scatter to nonlocal scatter to find a projection that draws close data closer while simultaneously making distant data even more distant from each other. The local scatter can be characterized by the mean square of the Euclidean distance between any pair of projected sample points that are neighbors. The criterion for judging neighbors can be $K$-nearest neighbors or $\varepsilon$-neighbors. Since the value of $\varepsilon$ is difficult to determine and it may generate an unconnected graph, the $K$-nearest neighbor criterion is used here to define the weighted adjacency matrix $H$ with kernel weighting:
(2) 
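As an illustration of a kernel-weighted $K$-nearest-neighbor adjacency matrix, the following is a minimal NumPy sketch. The Gaussian weight `exp(-||x_i - x_j||^2 / t)`, the symmetrization rule (a pair is local if either sample is among the other's neighbors), and the names `knn_kernel_adjacency`, `k` and `t` are assumptions made for this example and are not taken from equation (2).

```python
import numpy as np

def knn_kernel_adjacency(X, k=5, t=1.0):
    """Kernel-weighted K-nearest-neighbor adjacency matrix (assumed form).

    H[i, j] = exp(-||x_i - x_j||^2 / t) if x_j is among the k nearest
    neighbors of x_i or vice versa, and 0 otherwise. Requires k < len(X).
    """
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances.
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off

    m = X.shape[0]
    # Indices of the k nearest neighbors of each sample, excluding itself.
    d2_no_self = d2.copy()
    np.fill_diagonal(d2_no_self, np.inf)
    nn_idx = np.argsort(d2_no_self, axis=1)[:, :k]

    mask = np.zeros((m, m), dtype=bool)
    rows = np.repeat(np.arange(m), k)
    mask[rows, nn_idx.ravel()] = True
    mask = mask | mask.T  # a pair is local if either one is a neighbor

    return np.where(mask, np.exp(-d2 / t), 0.0)
```

Only the entries for neighboring pairs are nonzero, so for large datasets the same quantity could be stored as a sparse matrix instead of a dense one.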
Then, given a training set containing $M$ samples, denote the local set by $U_K$. After projecting a pair of samples onto a direction $w$, we get their images $y_i$ and $y_j$. The local scatter is defined as
(3) 
Similarly, the nonlocal scatter can be defined by the mean square of the Euclidean distance between any pair of the projected sample points that are not in any set of neighborhoods. It is defined as
(4) 
The optimal projection vector minimizes the following final objective function:
(5) 
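The ratio criterion can be illustrated with a small NumPy sketch. The normalization by $2M^2$, the choice of treating every pair with a zero adjacency weight as nonlocal, and the function names are assumptions made for this example; they follow the usual presentation of UDP rather than the exact equations (3) to (5).

```python
import numpy as np

def udp_scatters(y, H):
    """Local and nonlocal scatter of 1-D projected samples y (assumed form).

    Pairs with H[i, j] > 0 contribute to the local scatter with weight
    H[i, j]; pairs with H[i, j] == 0 contribute to the nonlocal scatter.
    """
    y = np.asarray(y, dtype=float)
    H = np.asarray(H, dtype=float)
    m = y.shape[0]
    diff2 = (y[:, None] - y[None, :]) ** 2
    j_local = np.sum(H * diff2) / (2.0 * m * m)
    j_nonlocal = np.sum((H == 0) * diff2) / (2.0 * m * m)
    return j_local, j_nonlocal

def udp_objective(X, H, w):
    """Ratio of local to nonlocal scatter after projecting X onto w."""
    y = np.asarray(X, dtype=float) @ np.asarray(w, dtype=float)
    j_local, j_nonlocal = udp_scatters(y, H)
    return j_local / j_nonlocal

# Two clusters separated along the second axis; neighbors are (0, 1) and (2, 3).
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 5.0], [1.0, 5.0]])
H_demo = np.array([[0, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
ratio_mixing = udp_objective(X_demo, H_demo, [1.0, 0.0])      # clusters overlap
ratio_separating = udp_objective(X_demo, H_demo, [0.0, 1.0])  # clusters apart
```

In this toy case the direction that separates the two clusters yields the smaller ratio, which is the behavior the criterion is meant to reward.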
2.2 An improved UDP for large scale dimension reduction
Since the original UDP method was developed for dimensionality reduction of small-scale datasets, all data outside the nearest neighbors of a sample are regarded as nonlocal data and participate in the calculation of the nonlocal scatter. However, when the scale of the training data is large, this way of calculating the nonlocal scatter brings a prohibitive computational burden, because the number of nonlocal data points of each sample grows with the size of the training set. To overcome this problem, we propose an improved UDP for large-scale dimension reduction.
Suppose there are $M$ training samples and the desired output of the $i$-th sample after dimension reduction is $y_i$. Using the Euclidean distance as a measure, and similarly to the definition of the nearest neighbor set, we define a set of distant data $D_N$. Likewise, we define a nonadjacency matrix $W$:
(6) 
Then we define the distant scatter as
(7) 
For the local scatter $J_L$, we use the same definition as in the original UDP. The objective function of the improved UDP is therefore
$$J_R(w) = \frac{J_L}{J_D} = \sum_{i=1}^{M} \frac{\sum_{j \in U_K} H_{ij} \, \| y_i - y_j \|_2^2}{\sum_{b \in D_N} W_{ib} \, \| y_i - y_b \|_2^2}$$
The improved UDP also requires that, after the mapping of the deep network, the outputs of similar data are as close as possible, while simultaneously “pushing away” the outputs of dissimilar data. Although only the data at extreme distances are used, in the process of pushing dissimilar data away from each other, the data similar to them gather around them, thus widening the distance between classes and making sparse areas of the data distribution sparser and dense areas denser.
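The idea of using only the farthest points can be sketched as follows. This is an illustrative NumPy example under stated assumptions: the distant set of each sample is taken to be its `n` farthest samples in input space, the objective is read as a sum over samples of local scatter divided by distant scatter, the small constant `eps` is added only for numerical safety, and all function names are hypothetical. The weights `H` and `W` are passed in, so binary or kernel weights can both be used.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Pairwise squared Euclidean distances between the rows of A."""
    A = np.asarray(A, dtype=float)
    sq = np.sum(A ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)

def farthest_mask(X, n):
    """Boolean mask whose row i marks the n samples farthest from x_i."""
    d2 = pairwise_sq_dists(X)
    idx = np.argsort(-d2, axis=1)[:, :n]
    mask = np.zeros_like(d2, dtype=bool)
    mask[np.repeat(np.arange(len(d2)), n), idx.ravel()] = True
    return mask

def improved_udp_objective(Y, H, W, eps=1e-12):
    """Sum over samples of (local scatter) / (distant scatter) of outputs Y."""
    d2 = pairwise_sq_dists(Y)
    local = np.sum(np.asarray(H, dtype=float) * d2, axis=1)
    distant = np.sum(np.asarray(W, dtype=float) * d2, axis=1)
    return float(np.sum(local / (distant + eps)))

# Toy data: two groups on a line; neighbors are (0, 1) and (2, 3).
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
H_demo = np.array([[0, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
W_demo = farthest_mask(X_demo, n=2)  # binary weights, an assumption

Y_good = np.array([[0.0], [0.0], [5.0], [5.0]])  # neighbors together, groups apart
Y_bad = np.array([[0.0], [5.0], [0.0], [5.0]])   # neighbors split, groups mixed
obj_good = improved_udp_objective(Y_good, H_demo, W_demo)
obj_bad = improved_udp_objective(Y_bad, H_demo, W_demo)
```

Because each row of `W` has only `n` nonzero entries, the cost of the distant term scales with `n` per sample rather than with the size of the whole dataset.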
2.3 The improved UDP based semi-supervised deep learning
Suppose we have a dataset in which the first part consists of labeled samples with their labels and the remaining data points are unlabeled samples. Let the embeddings of the samples be obtained through a deep network. Our aim is to train a deep network using both labeled and unlabeled samples, such that different classes are well separated while cluster structures are well preserved. Putting everything together, we have the following objective function
(8) 
Here the objective depends on the number of labeled samples and the number of unlabeled samples, and it combines a supervised loss function with the UDP regularization term. A hyperparameter balances the supervised loss and the unsupervised loss. We use the softmax loss as our supervised loss, but other types of loss function (e.g., mean square error) are also applicable.
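The way the two terms are combined can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the exact form of equation (8): the supervised term is taken to be the mean softmax cross-entropy over the labeled samples, the UDP term is assumed to be precomputed on the embeddings of all samples and passed in as a scalar, and the names `softmax_cross_entropy`, `semi_supervised_loss`, `udp_term` and `lam` are hypothetical.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over labeled samples."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))

def semi_supervised_loss(logits_labeled, labels, udp_term, lam):
    """Supervised loss on labeled data plus a weighted unsupervised UDP term."""
    return softmax_cross_entropy(logits_labeled, labels) + lam * udp_term

# Example: one labeled sample with uniform logits and a UDP term of 0.5.
total = semi_supervised_loss([[0.0, 0.0]], [0], udp_term=0.5, lam=2.0)
```

In an actual training loop both terms would be computed inside an automatic differentiation framework so that gradients flow through the embeddings; the sketch only shows how the terms are weighted.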
We use error backpropagation (BP) to train the network. The details of the training process are given in the following algorithm.
3 Experimental Results
3.1 Results of dimensionality reduction
First, we test the dimensionality reduction performance of the improved UDP method on two different image datasets, MNIST and ETH80 (https://github.com/KaiXuan/ETH80). We then compare the improved UDP with the original UDP, as well as with several popular dimension reduction algorithms (Isomap [2], multidimensional scaling (MDS) [6], t-SNE [13] and spectral embedding [12]), to show its performance improvement.
MNIST is a dataset consisting of grayscale images of handwritten digits. We randomly selected 5000 samples from the dataset to perform our experiments because the original UDP usually applies to small-scale datasets. ETH80 is a small-scale but more challenging dataset which consists of RGB images from 8 categories. We use all 820 samples from the “apples” and “pears” categories and convert the images from RGB to grayscale for convenience. The parameters of the baseline algorithms are set to their suggested default values, and the parameters of the improved UDP (kernel width, number of nearest neighbors $K$ and number of farthest points $N$) are set empirically. The experimental results on the two datasets are shown in Fig. 3 and Fig. 4, respectively.
From these results we can see that, after dimension reduction, the improved UDP separates different classes better than the original UDP on both datasets. This is important because, when the new UDP is adopted in semi-supervised learning through equation (8), better separation means more accurate classification. It is also worth mentioning that although the improved UDP achieves results comparable to those of the other baseline algorithms on the ETH80 dataset, its results on MNIST are much better (especially compared with MDS and Isomap) in terms of class separation.
To quantitatively measure the class separation, Table 1 shows the cluster purity given by the k-means clustering algorithm on these two datasets after dimensionality reduction. The purity is calculated based on the maximum matching degree [17] after clustering.
Table 1. Cluster purity (%) given by k-means after dimensionality reduction.
Method                    MNIST   ETH80
UDP                       81.7    77.7
Improved UDP              93.8    99.4
Isomap [2]                86.1    99.9
MDS [6]                   53.7    98.9
t-SNE [13]                93.1    100.0
Spectral Embedding [12]   98.6    100.0
Table 1 demonstrates that our improved UDP method improves the cluster purity by a large margin compared to the original UDP. It can also be seen from Fig. 3 and Fig. 4 that our improved UDP method is more appropriate for clustering than the original UDP. Furthermore, our method is more efficient than the original UDP because we do not have to compute a fully connected graph; all we need are the kernel weights of the neighbors and of the distant data. On both datasets, our method obtains results that are either much better than those of some other dimension reduction methods (MDS and Isomap on MNIST) or competitive with them.
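A purity score of the kind reported in Table 1 can be computed along the following lines. This sketch uses the common majority-vote definition of purity, which is an assumption here; the maximum matching degree of [17] may be defined differently. The function name `cluster_purity` is hypothetical.

```python
from collections import Counter

def cluster_purity(cluster_ids, true_labels):
    """Fraction of samples whose label equals the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    # Group the true labels by assigned cluster.
    members = {}
    for cluster, label in zip(cluster_ids, true_labels):
        members.setdefault(cluster, []).append(label)
    # Count, for each cluster, the size of its most frequent class.
    matched = sum(Counter(labels).most_common(1)[0][1] for labels in members.values())
    return matched / len(true_labels)

# Example: two clusters of three samples, each with one mislabeled member.
purity_demo = cluster_purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"])
```

Multiplying the returned fraction by 100 gives a percentage comparable in scale to the values in Table 1.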
3.2 Results of classification
We conduct experiments on the MNIST dataset and the Street View House Numbers (SVHN) dataset (http://ufldl.stanford.edu/housenumbers/), which consists of color images of real-world house number digits with varied appearance and is a highly challenging classification problem, to compare the proposed algorithm with supervised deep learning (SDL) and Manifold Regularization (MR) semi-supervised deep learning [18]. For MNIST, 100 labeled samples are combined with 2000 unlabeled samples to train a deep network. For SVHN, we randomly select 1000 samples from the training set as labeled data and 20000 samples as unlabeled data to train a network. For both experiments, we evaluate the trained network on the test set (of size 10000 for MNIST and 26032 for SVHN) to obtain the test accuracy. We use the Adam optimizer. The parameters are tuned manually with a simple grid search: the number of nearest neighbors $K$ and the number of farthest points $N$ are set to 10 and 50, respectively, and the kernel width is varied over a range of values.
We adopt the three embedding network structures described in [18], and the results on MNIST and SVHN are shown in Table 2. For supervised deep learning, we apply the entropy loss at the network output layer only, since middle layer embedding and the auxiliary network do not make sense in that setting. From the table we can see that MR is better for middle layer embedding, while our method is better for output layer embedding and auxiliary network embedding and achieves better classification results for most network structures. The results also suggest that it may be helpful to combine MR with UDP, using MR for the hidden layer and UDP for the output layer. We leave this to our future work. We should also point out that although the classification accuracies are somewhat lower than the state-of-the-art results, the network we employed is a traditional multilayer feedforward network and we do not use any advanced training techniques such as batch normalization or random data augmentation. In the future, we will try to train a more complex network with advanced training techniques to make thorough comparisons.
Table 2. Classification accuracy (%) on the test sets (100 labeled samples for MNIST, 1000 for SVHN).
                          MNIST                           SVHN
Embedding type            SDL     MR      Improved UDP    SDL     MR      Improved UDP
Output layer embedding    74.31   82.95   83.19           55.21   64.70   72.66
Middle layer embedding    -       83.52   83.07           -       72.10   69.35
Auxiliary neural network  -       87.55   87.79           -       62.61   71.32
4 Conclusions and Future Work
Training a deep network using a small number of labeled samples is of great practical significance, since in many real-world applications it is very difficult to collect enough labeled samples. In this paper, we modify the unsupervised discriminant projection (UDP) algorithm to make it suitable for dimension reduction of large datasets and for semi-supervised learning. The new algorithm takes both local and nonlocal manifold information into account while reducing the computational cost. Based on this, we proposed a new semi-supervised deep learning algorithm to train a deep network with a very small number of labeled samples and many unlabeled samples. The experimental results on different real-world datasets demonstrate its validity and effectiveness.
The construction of the neighbor graph is based on the Euclidean distance in data space, which may not be a proper distance measure on the data manifold. In the future, other neighbor graph construction methods, such as measures on the Riemannian manifold, will be tried. The limitation of the current method is that it attains good results for tasks that are not too complex, such as MNIST, but for more challenging classification datasets, such as CIFAR-10, for which the direct nearest neighbors may not reflect the actual similarity, the method may not perform very well. Our future work will try to use some pre-learning techniques, such as an autoencoder or a kernel method, to map the original data to a much more concise representation.
References
[1] (2014) Learning with pseudo-ensembles. Advances in Neural Information Processing Systems 4, pp. 3365–3373.
[2] (2002) The Isomap algorithm and topological stability. Science 295 (5552), pp. 7–7.
[3] (2006) Manifold regularization: a geometric framework for learning from examples. Journal of Machine Learning Research 7 (1), pp. 2399–2434.
[4] (2005) On manifold regularization. In AISTATS, pp. 1.
[5] (2009) Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks 20 (3), pp. 542–542.
[6] (2000) Multidimensional scaling. Chapman and Hall/CRC.
[7] (2017) Good semi-supervised learning that requires a bad GAN. In Advances in Neural Information Processing Systems, pp. 6510–6520.
[8] (2016) Semi-supervised deep learning by metric embedding. arXiv preprint arXiv:1611.01449.
[9] (2007) Globally maximizing, locally minimizing: unsupervised discriminant projection with applications to face and palm biometrics. IEEE Transactions on Pattern Analysis & Machine Intelligence 29 (4), pp. 650–664.
[10] (2014) Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589.
[11] (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3, pp. 2.
[12] (2003) Spectral embedding of graphs. Pattern Recognition 36 (10), pp. 2213–2230.
[13] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
[14] (2015) Semi-supervised learning with ladder networks. pp. 3546–3554.
[15] (2015) Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390.
[16] (2016) Semi-supervised phone classification using deep neural networks and stochastic graph-based entropic regularization. arXiv preprint arXiv:1612.04899.
[17] (2014) A novel graph-based k-means for nonlinear manifold clustering and representative selection. Neurocomputing 143, pp. 109–122.
[18] (2012) Deep learning via semi-supervised embedding. In International Conference on Machine Learning.
[19] (2018) Semi-supervised deep learning using pseudo labels for hyperspectral image classification. IEEE Transactions on Image Processing 27 (3), pp. 1259–1270.
[20] (2016) Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861.
[21] (2009) Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 3 (1), pp. 1–130.