1 Introduction
In recent years, deep learning has achieved great success in many areas of image research. Deep learning models need a large amount of labeled data to achieve good results, for example, training deep neural networks on millions of images Ref1; Ref2. However, collecting labeled samples is difficult for many tasks and requires considerable manpower, time, and cost. Remote sensing image classification illustrates this point: in practical applications, acquiring labeled remote sensing data requires not only field investigation but also professional interpretation, which limits the quantity of available labeled samples. In contrast, unlabeled samples are easier to obtain and more numerous, so how to use this readily available data to improve model performance is an important research issue.
Semi-supervised learning is a machine learning approach that lies between supervised learning and unsupervised learning. When only a small number of labeled samples is available, semi-supervised learning mitigates insufficient model generalization by introducing unlabeled samples. Unlabeled samples provide important information about the spatial distribution of the data, which helps the model find better decision boundaries. Semi-supervised learning can achieve results similar to those of supervised learning algorithms while using only one-tenth or less of the labeled data.
Recently, semi-supervised learning algorithms have commonly been implemented by adding a loss term for unlabeled data to the loss function. Pseudo-Label Ref3 takes the class with the maximum predicted probability as the true label of an unlabeled sample. However, Pseudo-Label does not use data augmentation, so its results are limited. Semi-supervised methods earlier than Pseudo-Label are not introduced here; overviews can be found in Ref4. In the following, we mainly introduce semi-supervised learning methods that use data augmentation.
Under the smoothness assumption on the system input and output, a robust model gives a stable and smooth prediction when the input changes (for example, under shearing or rotation). Data augmentation is the commonly used regularization method in supervised learning: it obtains realistic training data by transforming the input without changing the class semantics Ref5. Similarly, data augmentation can be applied to unlabeled samples by keeping the output consistent before and after augmentation. A teacher-student model feeds noisy samples into the student model and minimizes the prediction error between the teacher model and the student model Ref6. Subsequently, the teacher-student model was extended over training iterations to get better results. Π-Model Ref7; Ref8 updates the prediction of the teacher model by exponential moving average (EMA), and Mean Teacher Ref9 updates the parameters of the teacher model with EMA. However, the data augmentation of Π-Model and Mean Teacher is relatively simple: only random noise is added to the inputs and hidden layers. Virtual Adversarial Training (VAT) Ref10 chooses the perturbation along the direction to which the model is most sensitive. MixUp Ref11 linearly interpolates the inputs and labels of two different samples to obtain augmented samples and labels. MixMatch Ref12 uses MixUp to augment both labeled and unlabeled data, with the pseudo-label predicted by the model serving as the label of an unlabeled sample. Nevertheless, VAT and MixMatch both employ a fixed augmentation method for all datasets. UDA Ref13, recent work from Google Research, uses AutoAugment Ref14 to perform data augmentation. AutoAugment is one of the best-performing data augmentation methods. Following UDA, we also use AutoAugment to generate augmented samples. The details of AutoAugment are introduced in Section 2.4. Importantly, the semi-supervised methods introduced above do not add constraints on sample features during network training, which results in unclear classification boundaries.
In semi-supervised learning, the loss function consists of two parts: the loss of the labeled samples and the loss of the unlabeled samples. In supervised image classification, many loss functions have been proposed to obtain a better feature description. The cross-entropy loss function is simple to implement and performs well, but its feature distribution cannot reach the optimal state. By combining positive and negative samples, Triplet loss Ref15 strengthens the feature constraints and improves model generalization, but it takes a long time to train the model. L-Softmax Ref16 and AM-Softmax Ref17 change the angle between the weights of the fully connected layer and the features, which makes the intra-class distribution more compact and the inter-class distribution more dispersed. By using predefined evenly-distributed class centroids (PEDCC) and adding the MSE loss between features to AM-Softmax, PEDCC-Loss Ref18 achieves the best recognition accuracy on CIFAR-100, LFW, and other datasets. Therefore, we use PEDCC-Loss on a small number of labeled samples and extend its idea to the loss of unlabeled samples. We use the maximum mean discrepancy (MMD) loss Ref19 to measure the distance between the feature distribution of unlabeled samples extracted by the model and the feature distribution of PEDCC. As a result, the feature distribution of unlabeled samples is also pushed toward a uniform distribution. Our method makes full use of the labeled and unlabeled samples to optimize decision boundaries. The overall diagram of our method is shown in Fig. 1, and its details are described in Section 3.1.
Our main contributions are as follows:
1) PEDCC is applied to semi-supervised learning. The features of the labeled samples are constrained to the class centroids by a loss function based on PEDCC-Loss. Our experiments demonstrate the effectiveness of PEDCC on a small number of labeled samples.
2) By adopting the AutoAugment data augmentation strategy, a loss function based on MMD is used to restrict the distribution of unlabeled samples to the distribution of PEDCC, and the KL divergence loss function is used to calculate the classification loss between unlabeled samples and augmented samples. The generalization performance of the model is improved by the unlabeled samples and augmented samples.
3) The experiments show that our semi-supervised learning method achieves the best prediction accuracy with 4000 labeled samples on the CIFAR-10 dataset and 1000 labeled samples on the SVHN dataset.
2 Related work
In this section, we present the work related to the semi-supervised method we employ.
2.1 PEDCC
For deep learning algorithms, the fitting ability of the neural network can ensure that the intra-class distance is small, but it cannot ensure that the inter-class distance is large enough. If the inter-class distance is small, classification accuracy is reduced. PEDCC is based on the hypersphere charge model Ref20. Because like charges repel each other, in equilibrium the charges are evenly distributed on the hypersphere and the points are as far apart as possible. The purpose of the PEDCC algorithm is to generate $C$ center points that are evenly distributed on the $d$-dimensional hypersphere, where $d$ is the feature dimension and $C$ is the number of classification categories. Taking the center points generated by PEDCC as the clustering centers of the classes ensures that the inter-class distance is sufficiently large.
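The charge-model idea can be illustrated with a minimal sketch that spreads points over a hypersphere by simulated repulsion. This is not the published PEDCC procedure: the function name `generate_centroids`, the inverse-square force, the step size, and the iteration count are all assumptions chosen for readability.

```python
import numpy as np

def generate_centroids(num_classes, dim, steps=2000, lr=0.05, seed=0):
    """Spread num_classes unit vectors over the unit hypersphere in R^dim.

    Illustrative repulsion scheme (an assumption, not the published PEDCC
    algorithm): every point is pushed away from the others by a Coulomb-like
    force and then re-projected onto the sphere.
    """
    rng = np.random.default_rng(seed)
    points = rng.normal(size=(num_classes, dim))
    points /= np.linalg.norm(points, axis=1, keepdims=True)
    for _ in range(steps):
        diff = points[:, None, :] - points[None, :, :]      # (C, C, d) pairwise offsets
        dist = np.linalg.norm(diff, axis=2)                  # (C, C) pairwise distances
        np.fill_diagonal(dist, np.inf)                       # a point exerts no force on itself
        force = (diff / dist[:, :, None] ** 3).sum(axis=1)   # inverse-square repulsion
        # Remove the radial component so the update moves along the sphere.
        force -= (force * points).sum(axis=1, keepdims=True) * points
        points += lr * force
        points /= np.linalg.norm(points, axis=1, keepdims=True)
    return points

centroids = generate_centroids(num_classes=4, dim=3)
print(np.round(centroids @ centroids.T, 2))  # pairwise cosines between centroids
```

For four points in three dimensions the repulsion should settle near a regular tetrahedron, whose pairwise cosine is -1/3; larger settings follow the same pattern but may need more iterations.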
2.2 PEDCC-Loss
PEDCC-Loss is a loss function based on PEDCC. Fig. 2 visualizes the feature distributions learned by the neural network under different loss functions. It shows that the features learned with PEDCC-Loss have small intra-class spacing and large inter-class spacing. The formulas for PEDCC-Loss are as follows:
$$L_{AM} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)}+\sum_{j=1,\,j\neq y_i}^{C}e^{s\cos\theta_j}} \tag{1}$$
$$L_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left\|x_i - pedcc_{y_i}\right\|^2 \tag{2}$$
$$L_{PEDCC} = L_{AM} + \sqrt[n]{L_{MSE}} \tag{3}$$
where Eq. (1) is the AM-Softmax loss function, and $s$ and $m$ are adjustable hyperparameters. Eq. (2) is the MSE loss function, which calculates the distance between the features extracted by the model and the predefined class features. As shown in Eq. (3), PEDCC-Loss is obtained by adding the above two loss functions together, and $n$ is a hyperparameter which satisfies $n \ge 1$.
2.3 Maximum Mean Discrepancy (MMD)
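The mean discrepancy is obtained by choosing a continuous function $f$, calculating the mean value of the samples mapped by $f$, and computing the difference between the mean values for samples drawn from two different distributions. The goal of MMD is to find the function $f$ that maximizes this mean discrepancy. As a test statistic, MMD can be used to calculate the distance between two distributions, and therefore to determine whether two distributions are the same. In practical applications, the MMD loss of a batch of data is defined as follows:
$$\mathrm{MMD}[\mathcal{F},p,q]=\sup_{f\in\mathcal{F}}\left(\mathbb{E}_{x\sim p}[f(x)]-\mathbb{E}_{y\sim q}[f(y)]\right),\qquad \widehat{\mathrm{MMD}}^2=\frac{1}{n_x^2}\sum_{i=1}^{n_x}\sum_{j=1}^{n_x}k(x_i,x_j)-\frac{2}{n_x n_y}\sum_{i=1}^{n_x}\sum_{j=1}^{n_y}k(x_i,y_j)+\frac{1}{n_y^2}\sum_{i=1}^{n_y}\sum_{j=1}^{n_y}k(y_i,y_j) \tag{4}$$
$$k(u,v)=\exp\left(-\frac{\|u-v\|^2}{2\sigma^2}\right) \tag{5}$$
where $\mathcal{F}$ is the set of all functions $f$, $p$ and $q$ are two different distributions, and $x$, $y$ are their corresponding samples. The batch sizes of the two distributions are $n_x$ and $n_y$, and $\sigma$ is the parameter of the Gaussian kernel function.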
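A batch MMD with a Gaussian kernel can be sketched in a few lines. The sketch below assumes the common biased estimator and the kernel form $\exp(-\|u-v\|^2/(2\sigma^2))$; the function names `gaussian_kernel` and `mmd2` and the default $\sigma = 1$ are illustrative choices, not settings taken from our experiments.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel exp(-||u - v||^2 / (2 sigma^2)) for all row pairs of a and b."""
    sq_dist = (
        (a ** 2).sum(axis=1)[:, None]
        + (b ** 2).sum(axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between samples x (n_x, d) and y (n_y, d)."""
    return (
        gaussian_kernel(x, x, sigma).mean()
        - 2.0 * gaussian_kernel(x, y, sigma).mean()
        + gaussian_kernel(y, y, sigma).mean()
    )

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y_same = rng.normal(size=(200, 2))              # same distribution as x
y_shifted = rng.normal(loc=3.0, size=(200, 2))  # clearly different distribution
print(mmd2(x, y_same), mmd2(x, y_shifted))
```

Samples from the same distribution give a value close to zero, while the shifted samples give a clearly larger value, which is the property that makes MMD usable as a loss.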
2.4 AutoAugment
AutoAugment uses reinforcement learning to search for the best augmentation strategy for a given dataset and thereby improves on manually designed data augmentation methods. AutoAugment defines a search space that consists of multiple sub-policies. Each sub-policy contains two image operations, and each operation has three parts: operation type, probability, and magnitude. Due to the diversity of image operations, the entire search space has roughly $(16 \times 10 \times 11)^{10} \approx 2.9 \times 10^{32}$ possibilities. By searching in this space, AutoAugment finds the most suitable augmentation method for each dataset. For example, the augmentation strategy for CIFAR-10 mainly consists of color transformations, whereas the strategy for SVHN mainly consists of geometric transformations. In addition, the augmentation strategy for CIFAR-10 can be transferred to the CIFAR-100 dataset.
3 Method
In this section, we introduce our semi-supervised classification learning method. Our approach builds on the components presented in Section 2.
3.1 Overall framework
As shown in Fig. 1, we use labeled samples and unlabeled samples of a given dataset to train the image classification model. Our method first processes the dataset, performing data augmentation on the unlabeled samples to obtain augmented samples. Different augmentation strategies are used for different datasets. The strategies include a variety of image operations such as rotation, histogram equalization, and clipping. For a given number of classification categories, feature points that are evenly distributed over the hypersphere are generated using the PEDCC algorithm.
The labeled samples, the unlabeled samples, and the augmented samples are fed into the convolutional neural network in the same batch according to the specified numbers. The feature description vectors of the labeled samples and of the unlabeled samples are obtained at the output of the pooling layer. The category predictions for the labeled, unlabeled, and augmented samples are obtained at the final output of the network. Subsequently, the loss is calculated. For the features of the labeled samples and the features of the unlabeled samples, we add the mean square error loss and the MMD loss respectively, so that their distribution is similar to the PEDCC distribution. Based on the model smoothness assumption, we minimize the KL divergence between the network's predictions for the unlabeled samples and for the augmented samples. Meanwhile, AM-Softmax is used to constrain the difference between the predicted values of the labeled samples and the ground truth labels. Through these multiple constraints, our algorithm makes full use of the samples to obtain a clear decision boundary between different classes. The loss function is described in detail below.
3.2 Loss function
The loss function of our semi-supervised learning algorithm consists of two parts: the loss of the labeled samples and the loss of the unlabeled samples. Each part may itself combine more than one loss term.
3.2.1 Loss of labeled samples
Usually the loss of the labeled samples is obtained by calculating the error between the output and the ground truth label. However, this constraint cannot ensure that the inter-class distance is large enough and the intra-class distance is small enough. We first generate the class centroids by PEDCC as a matrix of dimension $C \times d$, where $C$ is the number of classification categories and $d$ is the dimension of the feature vector of each class. Subsequently, the weight of the fully connected layer of the convolutional neural network is fixed to these centroids, and the weights and features are normalized. Therefore, when a feature vector matches the predefined feature of a certain class, the output of the network is close to a one-hot vector.
Assuming that the number of labeled samples in a batch is $N$, we calculate the mean squared error between the feature vectors extracted by the neural network and the predefined centroids as the feature loss function:
$$L_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left\|x_i - pedcc_{y_i}\right\|^2 \tag{6}$$
where $x_i$ is the feature of the $i$-th sample in a batch, $pedcc_{y_i}$ is selected from the PEDCC centroids according to the ground truth label $y_i$, and the feature dimension of $x_i$ is $d$. In addition, AM-Softmax is used to calculate the error between the predicted value of the labeled sample and the ground truth label:
$$L_{AM} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)}+\sum_{j=1,\,j\neq y_i}^{C}e^{s\cos\theta_j}} \tag{7}$$
where $\theta_j$ is the angle between the $j$-th weight vector of the fully connected layer and the feature vector of the $i$-th sample, and $y_i$ is the real label corresponding to the $i$-th sample. $s$ and $m$ are adjustable hyperparameters. The loss of the labeled samples consists of the two losses above:
$$L_1 = \lambda_1 L_{AM} + \lambda_2 \sqrt[n]{L_{MSE}} \tag{8}$$
where $n$ is an adjustable hyperparameter that satisfies $n \ge 1$, and $\lambda_1$ and $\lambda_2$ are used to regulate the relative magnitude between losses.
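The labeled-sample loss can be sketched with NumPy as follows. This is a hedged illustration rather than our training code: it assumes the standard AM-Softmax form with scale `s` and margin `m`, a batch-mean squared distance to the class centroid, and an exponent `1/n` on the MSE term; the function name `labeled_loss` and the weights `lam1`, `lam2` are hypothetical names for this example.

```python
import numpy as np

def labeled_loss(features, labels, centroids, s=7.5, m=0.35, n=1.0, lam1=1.0, lam2=1.0):
    """AM-Softmax on cosine logits plus an MSE pull toward the class centroid."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)    # normalized features
    w = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)  # fixed, normalized FC weights
    cos = f @ w.T                                     # (N, C) cosine to every centroid
    idx = np.arange(len(labels))
    logits = s * cos
    logits[idx, labels] = s * (cos[idx, labels] - m)  # additive margin on the true class only
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_am = -log_prob[idx, labels].mean()
    loss_mse = ((f - w[labels]) ** 2).sum(axis=1).mean()
    return lam1 * loss_am + lam2 * loss_mse ** (1.0 / n)

centroids = np.eye(3)                        # toy centroids: three orthogonal classes
labels = np.array([0, 1, 2])
on_target = labeled_loss(centroids[labels], labels, centroids)
off_target = labeled_loss(centroids[[1, 2, 0]], labels, centroids)
print(on_target, off_target)
```

Features that sit on their own centroid incur almost no loss, while features sitting on another class centroid are penalized by both terms, which is the behavior the two constraints are meant to produce.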
3.2.2 Loss of unlabeled samples
In supervised learning, the generalization performance of the model is often improved by data augmentation, where the augmented sample shares the same label as the original sample. Similarly, data augmentation can be applied to the unlabeled examples in semi-supervised learning: before and after augmentation, the predicted output for an unlabeled sample should be consistent. Data augmentation should generate augmented samples that are close to the actual samples, so augmentation methods that have a large impact on the image, such as adding Gaussian noise, are not recommended. AutoAugment optimizes augmentation strategies for different datasets and learns the most effective augmentation methods for the original dataset, whereas previous augmentation methods adopt a fixed strategy for all datasets.
We use the optimal strategy of AutoAugment to perform data augmentation. Meanwhile, KL divergence is used as a loss function to constrain the consistency between the output distribution for the augmented samples and that for the original samples:
$$L_{KL} = \frac{1}{M}\sum_{i=1}^{M} D_{KL}\left(p_{\tilde\theta}(y \mid u_i)\,\|\,p_{\theta}(y \mid \hat{u}_i)\right) \tag{9}$$
where $M$ is the number of unlabeled samples in a batch. Note that $M$ is different from the number of labeled samples $N$ in Eq. (6). $p_{\tilde\theta}(y \mid u_i)$ is the model output for the unlabeled sample $u_i$, and $p_{\theta}(y \mid \hat{u}_i)$ is the output for its augmented sample $\hat{u}_i$. Following VAT, $\tilde\theta$ and $\theta$ share the same value, while $\tilde\theta$ does not participate in the backpropagation of the model parameters.
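The consistency term can be sketched as the batch mean of the KL divergence from the prediction on the original unlabeled sample to the prediction on its augmented version. The sketch below only evaluates the loss value with NumPy; the function names and the small constant `eps` are assumptions for illustration, and in an automatic-differentiation framework the original-sample branch would additionally be wrapped in a stop-gradient.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def kl_consistency(logits_orig, logits_aug, eps=1e-12):
    """Mean KL(p_orig || p_aug) over a batch of unlabeled samples.

    p_orig is treated as a fixed target; only the augmented branch would
    receive gradients in a real training loop.
    """
    p = softmax(logits_orig)
    q = softmax(logits_aug)
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=1).mean()

logits = np.array([[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]])
print(kl_consistency(logits, logits))            # identical predictions
print(kl_consistency(logits, logits[:, ::-1]))   # predictions that disagree
```

Identical predictions give a loss of zero, and the loss grows as the prediction on the augmented sample drifts away from the prediction on the original sample.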
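To further reduce the distance between the feature distribution of the unlabeled samples and the predefined class centroids, we use the MMD as the loss function:
$$L_{MMD} = \frac{1}{M^2}\sum_{i=1}^{M}\sum_{j=1}^{M}k(z_i,z_j) - \frac{2}{MC}\sum_{i=1}^{M}\sum_{j=1}^{C}k(z_i,pedcc_j) + \frac{1}{C^2}\sum_{i=1}^{C}\sum_{j=1}^{C}k(pedcc_i,pedcc_j) \tag{10}$$
where the dimension of the unlabeled feature matrix $[z_1,\dots,z_M]$ is $M \times d$ and the dimension of the PEDCC centroid matrix is $C \times d$. $k$ is the Gaussian kernel function in Eq. (5). The loss of the unlabeled samples consists of the two losses above, the KL divergence loss $L_{KL}$ of Eq. (9) and the MMD loss:
$$L_2 = \lambda_3 L_{KL} + \lambda_4 L_{MMD} \tag{11}$$
where $\lambda_3$ and $\lambda_4$ are used to regulate the relative magnitude between losses.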
3.2.3 Final loss
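In our approach, the final loss function of the network is the sum of the loss of the labeled samples and the loss of the unlabeled samples:
$$L = L_1 + L_2 = \lambda_1 L_{AM} + \lambda_2 \sqrt[n]{L_{MSE}} + \lambda_3 L_{KL} + \lambda_4 L_{MMD} \tag{12}$$
where $L_{MSE}$, $L_{AM}$, $L_{KL}$, and $L_{MMD}$ are the MSE, AM-Softmax, KL divergence, and MMD losses of Eqs. (6), (7), (9), and (10), respectively.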
3.3 Network structure
The network structure we use is WideResNet Ref21, which reduces the depth and increases the width of the network. Compared with a thin and deep ResNet, WideResNet can achieve better image classification accuracy. Table 1 lists the specific parameters of the network structure we use. Unlike previous networks, the weight of the fully connected layer is fixed to the predefined evenly-distributed class centroids, and the dimension is . is the total number of labeled and unlabeled samples in a batch, and is the feature vector dimension of each class centroid.
Group name  Output size  Block type=B(3,3) 

Conv1  
Conv2  
Conv3  
Conv4  
Avgpool  
Fully connected     
4 Experiments
4.1 Implementation details
The network structure used in our experiments is WideResNet with depth 28 and width 2. As shown in Table 2, we evaluated our semi-supervised learning algorithm on the CIFAR-10 Ref22 and SVHN Ref23 datasets. Both CIFAR-10 and SVHN are benchmark image classification datasets with 10 categories and an image resolution of $32 \times 32$.
Table 2: Datasets used in the experiments.
Dataset  CIFAR-10  SVHN
Samples  50000  73257
Labeled samples  4000  1000
Categories  10  10
Image resolution  32 × 32  32 × 32
In the experiments, we used 4000 labeled samples for CIFAR-10 and 1000 labeled samples for SVHN. Note that AutoAugment uses these labeled samples to find the optimal data augmentation strategy. Therefore, the selection of labeled samples in our experiments is consistent with AutoAugment. For each unlabeled sample, 100 augmented samples are generated by data augmentation. One batch contains 32 labeled samples, 160 unlabeled samples, and the corresponding 160 augmented samples. The learning rate follows a cosine decay schedule, with an initial value of 0.03 on CIFAR-10 and 0.05 on SVHN. Gradient descent with momentum is used as the optimizer, and the momentum is set to 0.9. All experiments are performed on a GTX 1080Ti GPU.
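For readers unfamiliar with cosine decay, the schedule can be sketched as below. The exact variant is an assumption: this sketch uses a single half-cosine from the initial learning rate down to zero with no warm-up, and `total_steps` is a hypothetical parameter, since the number of training steps is not stated here.

```python
import math

def cosine_lr(step, total_steps, base_lr):
    """Half-cosine decay from base_lr at step 0 to 0 at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Initial learning rates from the text: 0.03 for CIFAR-10, 0.05 for SVHN.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps=1000, base_lr=0.03))
```

The learning rate stays close to its initial value early in training, drops fastest around the midpoint, and flattens out again near the end.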
From Eq. (6) to Eq. (12), we can see that seven hyperparameters, $s$, $m$, $n$, $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$, are involved in our method. Among them, $s$, $m$, $n$, $\lambda_1$, and $\lambda_2$ are included in PEDCC-Loss. Therefore, referring to the previous study [16], the value of $s$ is 7.5, the value of $m$ is 0.35, and the values of $n$, $\lambda_1$, and $\lambda_2$ are all 1. Meanwhile, $\lambda_3$ and $\lambda_4$ are used to balance the relative magnitude between different losses. In the following, we discuss how to determine the values of $\lambda_3$ and $\lambda_4$. The details of the hyperparameter settings are shown in Table 3.
Table 3: Hyperparameter settings.
Hyperparameters  $s$  $m$  $n$  $\lambda_1$  $\lambda_2$  $\lambda_3$  $\lambda_4$
CIFAR-10  7.5  0.35  1  1  1  400  0.2
SVHN  7.5  0.35  1  1  1  1600  0.04
4.2 Experimental results
4.2.1 Comparison with semi-supervised learning methods
We compare our method with some representative semi-supervised learning algorithms, using the same number of labeled samples and the same network structure. The average results obtained from three replicate experiments are shown in Table 4. Specifically, we compare our method with five semi-supervised learning methods and a supervised baseline. The supervised baseline uses only the labeled samples, and the large number of unlabeled samples is not used during training. Compared with the supervised baseline, our method reduces the test error by 15.36 percentage points on the CIFAR-10 dataset and 10.41 percentage points on the SVHN dataset.
Table 4: Test error rates (%) on CIFAR-10 with 4000 labeled samples and SVHN with 1000 labeled samples.
Method  CIFAR-10 (4k)  SVHN (1k)
Supervised  20.26  12.83
Pseudo-Label  17.78  7.62
Π-Model  16.37  7.19
Mean Teacher  15.87  5.65
VAT  13.86  5.63
UDA  5.34  3.41
Ours  4.90  2.42
Pseudo-Label Ref3, Π-Model Ref7, Mean Teacher Ref9, VAT Ref10, and UDA Ref13 are representative methods in the semi-supervised field. The details of these methods are described in the Introduction. However, the loss functions used in those methods cannot make the feature distribution learned by the network reach an optimal state. Our method uses PEDCC, applying PEDCC-Loss to labeled samples and MMD to unlabeled samples to constrain the feature distribution learned by the neural network to PEDCC. The same PEDCC dimension is used on both datasets. As shown in the table, our method achieves 95.10% accuracy with 4000 labeled samples on CIFAR-10 and 97.58% accuracy with 1000 labeled samples on SVHN. Compared with the previous state-of-the-art model UDA, the improvement is 0.44 and 0.99 percentage points respectively. Our method optimizes the feature distribution learned by the network, which improves classification accuracy. It is worth mentioning that, since the original UDA experiments were run on Google TPUs, we ran the UDA source code three times in our experimental environment, and the average results are shown in the table.
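The accuracy and improvement figures quoted above follow directly from the error rates in Table 4, as the short check below shows. The helper names `accuracy` and `gain` are introduced only for this illustration.

```python
# Test error rates (%) copied from Table 4: (CIFAR-10 with 4k labels, SVHN with 1k labels).
error_rates = {
    "Supervised": (20.26, 12.83),
    "UDA": (5.34, 3.41),
    "Ours": (4.90, 2.42),
}

def accuracy(error_rate):
    """Accuracy (%) is the complement of the test error rate (%)."""
    return round(100.0 - error_rate, 2)

def gain(baseline, method="Ours"):
    """Absolute error reduction (percentage points) of method over baseline."""
    return tuple(
        round(b - o, 2) for b, o in zip(error_rates[baseline], error_rates[method])
    )

print([accuracy(e) for e in error_rates["Ours"]])  # [95.1, 97.58]
print(gain("Supervised"))                          # (15.36, 10.41)
print(gain("UDA"))                                 # (0.44, 0.99)
```

All gains are absolute differences in test error, that is, percentage points rather than relative percentages.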
4.2.2 Ablation Study
In this section, we demonstrate the validity of the loss functions we apply to the labeled and unlabeled samples based on the predefined class centroids. We performed ablation experiments on both datasets. The contribution of each loss function is revealed by adding or removing the corresponding component. Specifically, we adopted the following combinations of loss functions:
1) Cross-entropy on labeled samples and KL divergence on unlabeled samples.
2) PEDCC-Loss on labeled samples and KL divergence on unlabeled samples.
3) PEDCC-Loss on labeled samples and the sum of KL divergence and MMD on unlabeled samples.
Table 5: Test error rates (%) of the ablation experiments.
Ablation  CIFAR-10 (4k)  SVHN (1k)
1)  5.45  2.98
2)  4.95  2.62
3)  4.53  2.48
Comparing the second combination with the first shows the performance improvement brought by PEDCC-Loss. In the same way, comparing the third combination with the second shows the performance improvement brought by MMD. Note that PEDCC-Loss is the sum of the first two terms in Eq. (12). The results of the different combinations are given in Table 5. Predefining the class centroids and applying PEDCC-Loss to a small number of labeled samples is effective: compared with the cross-entropy loss, PEDCC-Loss reduces the error rate from 5.45% to 4.95% on the CIFAR-10 dataset and from 2.98% to 2.62% on the SVHN dataset. By adding the MMD constraint, the distribution of the unlabeled samples approaches the uniform distribution, and the error rates on the two datasets are further reduced by 0.42 and 0.14 percentage points respectively.
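4.2.3 Effect of $\lambda_3$ and $\lambda_4$
Table 6: Effect of $\lambda_3$ and $\lambda_4$ on the test error rate (TER) on CIFAR-10.
$\lambda_3$  $\lambda_4$  TER(%)
400  0.1  4.85
400  0.2  4.53
400  0.4  5.33
200  0.2  5.27
600  0.2  5.15
Table 7: Effect of $\lambda_3$ and $\lambda_4$ on the test error rate (TER) on SVHN.
$\lambda_3$  $\lambda_4$  TER(%)
1600  0.02  2.48
1600  0.04  2.34
1600  0.08  2.97
800  0.04  2.79
2400  0.04  2.84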
The impact of $\lambda_3$ and $\lambda_4$ on the classification results on the CIFAR-10 dataset is shown in Table 6. First, we fix the value of $\lambda_3$ to 400 and set $\lambda_4$ to 0.1, 0.2, and 0.4 respectively. The corresponding test error rates (TER) are 4.85%, 4.53%, and 5.33%. Therefore, the optimal value of $\lambda_4$ is determined as 0.2. Based on this result, we fix the value of $\lambda_4$ to 0.2 and set $\lambda_3$ to 200, 400, and 600 respectively. The corresponding test error rates are 5.27%, 4.53%, and 5.15%. Among the settings tested, the algorithm therefore obtains its best result when $\lambda_3$ is 400 and $\lambda_4$ is 0.2.
Similarly, on the SVHN dataset, the algorithm achieves its best result, a test error rate of 2.34%, when $\lambda_3$ is 1600 and $\lambda_4$ is 0.04. The impact of $\lambda_3$ and $\lambda_4$ on the SVHN dataset is shown in Table 7.
5 Conclusions
In this paper, we apply PEDCC to semi-supervised learning. Unlike other semi-supervised methods, our method adds feature constraints through the corresponding loss functions. Therefore, our method ensures that the inter-class distance is large and the intra-class distance is small, which improves classification accuracy. Since the final loss function consists of multiple terms, we experimentally determine suitable settings for its parameters. Additionally, the performance gains of PEDCC on labeled and unlabeled samples are demonstrated separately through ablation experiments. Our method achieves state-of-the-art performance on the CIFAR-10 and SVHN datasets, using 4000 and 1000 labeled samples respectively.
In principle, the effect of data augmentation is directly related to the final performance of semi-supervised learning. The data augmentation strategy we use is AutoAugment, which is one of the best-performing data augmentation algorithms currently available. Our method mainly adds feature constraints through different loss functions and does not improve on existing data augmentation strategies. Therefore, the improvement of data augmentation methods will be studied in future work.
References

(1) Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp 1097-1105
(2) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770-778
(3) Lee DH (2013) Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In: ICML Workshop on Challenges in Representation Learning, 3: 2
(4) Chapelle O, Scholkopf B, Zien A (2006) Semi-supervised learning. MIT Press
(5) Cireşan DC, Meier U, Gambardella LM, Schmidhuber J (2010) Deep, big, simple neural nets for handwritten digit recognition. Neural computation 22(12): 3207-3220
(6) Rasmus A, Berglund M, Honkala M, Valpola H, Raiko T (2015) Semi-supervised learning with ladder networks. In: Advances in neural information processing systems, pp 3546-3554
(7) Laine S, Aila T (2017) Temporal ensembling for semi-supervised learning. In: Fifth International Conference on Learning Representations
(8) Sajjadi M, Javanmardi M, Tasdizen T (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In: Advances in Neural Information Processing Systems, pp 1163-1171
(9) Tarvainen A, Valpola H (2017) Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in neural information processing systems, pp 1195-1204
(10) Miyato T, Maeda S, Koyama M, Ishii S (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence 41(8): 1979-1993
(11) Zhang H, Cisse M, Dauphin YN, Lopez-Paz D (2017) mixup: Beyond empirical risk minimization. arXiv:1710.09412
(12) Berthelot D, Carlini N, Goodfellow I, Oliver A, Papernot N, Raffel C (2019) Mixmatch: A holistic approach to semi-supervised learning. arXiv:1905.02249
(13) Xie Q, Dai Z, Hovy E, Luong MT, Le QV (2019) Unsupervised data augmentation. arXiv:1904.12848
(14) Cubuk ED, Zoph B, Mane D, Vasudevan V, Le QV (2018) Autoaugment: Learning augmentation policies from data. arXiv:1805.09501
(15) Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 815-823
(16) Liu W, Wen Y, Yu Z, Yang M (2016) Large-margin softmax loss for convolutional neural networks. In: International Conference on Machine Learning 2(3): 7
(17) Wang F, Cheng J, Liu W, Liu H (2018) Additive margin softmax for face verification. IEEE Signal Processing Letters 25(7): 926-930
(18) Zhu QY, Zhang PJ, Ye X (2019) A New Loss Function for CNN Classifier Based on Predefined Evenly-Distributed Class Centroids. arXiv:1904.06008
(19) Borgwardt KM, Gretton A, Rasch MJ, Kriegel HP, Scholkopf B, Smola AJ (2006) Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22(14): e49-e57
(20) Zhu QY, Zhang RX (2019) A Classification Supervised Auto-Encoder Based on Predefined Evenly-Distributed Class Centroids. arXiv:1902.00220
(21) Zagoruyko S, Komodakis N (2016) Wide residual networks. In: Proceedings of the British Machine Vision Conference (BMVC)
(22) Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto
(23) Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng AY (2011) Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning
(24) Oliver A, Odena A, Raffel CA, Cubuk ED, Goodfellow I (2018) Realistic evaluation of deep semi-supervised learning algorithms. In: Advances in Neural Information Processing Systems, pp 3235-3246