In recent years, deep learning has achieved great success in many areas of image research. Deep learning models need to be driven by a large amount of labeled data to achieve good results; for example, millions of images are used to train deep neural networks [1, 2]. However, for many tasks, collecting labeled samples is difficult and requires considerable manpower, time, and cost. Remote sensing image classification illustrates this point: in practical applications, acquiring labeled remote sensing data requires not only field investigation but also professional interpretation, which limits the quantity of available labeled samples. In contrast, unlabeled samples are easier to obtain and far more numerous, so how to use this readily available data to improve model performance is an important research issue.
Semi-supervised learning is a machine learning paradigm between supervised and unsupervised learning. When only a small number of labeled samples are available, semi-supervised learning alleviates the problem of insufficient model generalization by introducing unlabeled samples. Unlabeled samples provide important information about the spatial distribution of the data, which helps the model find better decision boundaries. Semi-supervised learning can achieve results similar to supervised learning algorithms with only one-tenth of the labeled data, or even less.
Recently, semi-supervised learning algorithms have commonly been implemented by adding an unlabeled-data loss term to the loss function. Pseudo-Label [3] takes the class with the maximum predicted probability as the true label of each unlabeled sample. However, Pseudo-Label does not use data augmentation, so the results it obtains are limited. Semi-supervised methods earlier than Pseudo-Label are not introduced here; related overviews can be found in [4]. In the following, we mainly introduce semi-supervised learning methods that use data augmentation.
Based on the smoothness assumption about the system input and output, a robust model gives stable and smooth predictions when the input changes slightly (such as shearing, rotation, etc.). A commonly used regularization method in supervised learning is data augmentation, which obtains lifelike training data by transforming the input without changing the class semantics [5]. Similarly, data augmentation can be applied to unlabeled samples, keeping the output consistent before and after augmentation. A teacher-student model inputs noisy samples into the student model and minimizes the prediction error between the teacher model and the student model [6]. Subsequently, the teacher-student model was extended over training iterations to get better results: the Π-Model and Temporal Ensembling [7, 8] update the predictions of the teacher model by an exponential moving average (EMA), and Mean Teacher [9] updates the parameters of the teacher model with an EMA. However, the data augmentation in the Π-Model and Mean Teacher is relatively simple: only random noise is added to the inputs and hidden layers. Virtual Adversarial Training (VAT) [10] instead perturbs the input in the direction to which the model is most sensitive. MixUp [11]
linearly interpolates the inputs and labels of two different samples to obtain augmented samples and labels. MixMatch [12] uses MixUp to augment both labeled and unlabeled data, with the pseudo-labels predicted by the model serving as the labels of unlabeled samples. Nevertheless, VAT and MixMatch both employ a fixed augmentation method for all datasets. UDA [13], a recent method from Google Research, uses AutoAugment [14] to perform data augmentation; AutoAugment is currently the best-performing data augmentation method. Following UDA, we also use AutoAugment to generate augmented samples. The details of AutoAugment are introduced in Section 2.4. Importantly, the semi-supervised methods introduced above do not constrain the sample features during network training, resulting in unclear classification boundaries.
In semi-supervised learning, the loss function consists of two parts: the loss on the labeled samples and the loss on the unlabeled samples. In supervised image classification, many loss functions have been proposed to obtain better feature descriptions. The cross-entropy loss is simple to implement and performs well, but the feature distribution it produces is not optimal. By combining positive and negative samples, the triplet loss [15] adds feature constraints and improves model generalization, but it takes a long time to train. L-Softmax [16] and AM-Softmax [17] change the angle between the weights of the fully connected layer and the features, which makes the intra-class distribution more compact and the inter-class distribution more dispersed. By using predefined evenly-distributed class centroids (PEDCC) and adding an MSE loss between features to AM-Softmax, PEDCC-Loss [18] achieves the best recognition accuracy on CIFAR100, LFW, and other datasets. Therefore, we use PEDCC-Loss on the small number of labeled samples and extend its idea to the loss on unlabeled samples. We use the maximum mean discrepancy (MMD) loss [19] to measure the distance between the feature distribution of the unlabeled samples extracted by the model and the distribution of the PEDCC, so that the unlabeled-sample feature distribution also approaches the evenly-distributed centroids. Our method makes full use of both labeled and unlabeled samples to optimize the decision boundaries. The overall diagram of our method is shown in Fig. 1, and its details are described in Section 3.1.
Our main contributions are as follows:
1) PEDCC is applied to semi-supervised learning. The features of the labeled samples are constrained toward the class centroids by a loss function based on PEDCC-Loss. Our experiments demonstrate the effectiveness of PEDCC with a small number of labeled samples.
2) Adopting the AutoAugment data augmentation strategy, a loss function based on MMD is used to constrain the distribution of unlabeled samples toward the distribution of the PEDCC, and a KL divergence loss is used as the classification consistency loss between unlabeled samples and their augmented versions. The generalization performance of the model is improved by the unlabeled and augmented samples.
3) Experiments show that our semi-supervised learning method achieves the best model prediction accuracy with 4000 labeled samples on the CIFAR10 dataset and 1000 labeled samples on the SVHN dataset.
2 Related work
In this section, we present the work related to the semi-supervised methods we employ.
2.1 Predefined Evenly-Distributed Class Centroids (PEDCC)
For deep learning algorithms, the neural network's fitting ability can keep the intra-class distance small, but cannot ensure that the inter-class distance is large enough; if the inter-class distance is small, classification accuracy suffers. PEDCC is proposed based on the hypersphere charge model [20]. Due to the mutual exclusion of charges, at equilibrium the charges are evenly distributed on the hypersphere, and the distance between points is maximal. The PEDCC algorithm generates k center points that are evenly distributed on the d-dimensional hypersphere, where d is the feature dimension and k is the number of classification categories. Taking the center points generated by PEDCC as the clustering centers of the k classes ensures that the inter-class distance is sufficiently large.
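The exact PEDCC construction is given in [20]. As a rough illustration of the charge-model idea only (an assumed simplification, not the authors' implementation), the sketch below repels k random unit vectors on the d-dimensional hypersphere with inverse-square forces until they spread apart:

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def generate_centroids(k, d, iters=2000, step=0.1, seed=0):
    """Charge-model sketch: k mutually repelling unit vectors in R^d."""
    rng = random.Random(seed)
    pts = [normalize([rng.gauss(0, 1) for _ in range(d)]) for _ in range(k)]
    for _ in range(iters):
        forces = [[0.0] * d for _ in range(k)]
        for i in range(k):
            for j in range(k):
                if i == j:
                    continue
                diff = [a - b for a, b in zip(pts[i], pts[j])]
                dist = math.sqrt(sum(x * x for x in diff)) + 1e-8
                # inverse-square repulsion, like point charges
                for t in range(d):
                    forces[i][t] += diff[t] / dist ** 3
        # move each point along its net force, then project back to the sphere
        pts = [normalize([p + step * f for p, f in zip(pts[i], forces[i])])
               for i in range(k)]
    return pts

centroids = generate_centroids(k=4, d=3)
```

For k = 4 and d = 3 the equilibrium is a regular tetrahedron, i.e. all pairwise cosine similarities approach -1/3, illustrating the "maximally dispersed centers" property the text describes.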
2.2 PEDCC-Loss
PEDCC-Loss is a loss function based on PEDCC. Fig. 2 visualizes the feature distributions finally learned by the neural network under different loss functions; the features learned with PEDCC-Loss have small intra-class spacing and large inter-class spacing. The formulas for PEDCC-Loss are as follows:

L_{AM} = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)}+\sum_{j=1,j\neq y_i}^{k}e^{s\cos\theta_j}}   (1)

L_{MSE} = \frac{1}{n}\sum_{i=1}^{n}\|f_i - pedcc_{y_i}\|^2   (2)

L_{PEDCC} = L_{AM} + L_{MSE}^{1/\lambda}   (3)

where Eq. (1) is the AM-Softmax loss function, and s and m are adjustable hyperparameters. Eq. (2) is the MSE loss function, calculating the distance between the features extracted by the model and the predefined class features. As shown in Eq. (3), PEDCC-Loss is obtained by adding the above two loss functions together, and λ is a hyperparameter which satisfies λ ≥ 1.
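A numerical sketch of the PEDCC-Loss terms for a single sample follows. The combination rule L_AM + L_MSE^{1/λ} is our reading of the text and should be checked against [18]; cos_sims[j] stands for cos θ_j, the cosine similarity between the sample feature and the j-th centroid:

```python
import math

def am_softmax_loss(cos_sims, label, s=7.5, m=0.35):
    """AM-Softmax (Eq. 1) for one sample; cos_sims[j] = cos(theta_j)."""
    logits = [s * (c - m) if j == label else s * c
              for j, c in enumerate(cos_sims)]
    mx = max(logits)  # log-sum-exp with max-shift for numerical stability
    log_z = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return log_z - logits[label]

def mse_loss(feat, centroid):
    """Mean squared error (Eq. 2) between a feature and its class centroid."""
    return sum((f - c) ** 2 for f, c in zip(feat, centroid)) / len(feat)

def pedcc_loss(cos_sims, feat, centroid, label, s=7.5, m=0.35, lam=1.0):
    """Eq. (3), assuming the L_AM + L_MSE^(1/lambda) combination."""
    return am_softmax_loss(cos_sims, label, s, m) + mse_loss(feat, centroid) ** (1.0 / lam)
```

With s = 7.5 and m = 0.35 (the values used later in Section 4), a sample whose feature aligns with its true centroid incurs a near-zero loss, while a misaligned sample is penalized heavily.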
2.3 Maximum Mean Discrepancy (MMD)
The mean discrepancy is obtained by finding a continuous function f, computing the mean of the f-mapped samples, and taking the difference between the means of samples drawn from two different distributions. The goal of MMD is to find the function f that maximizes the mean discrepancy. As a test statistic, MMD can be used to calculate the distance between two distributions and thus to determine whether two distributions are the same. In practical applications, the MMD loss of a batch of data is defined as follows:

MMD^2(X,Y) = \frac{1}{m^2}\sum_{i=1}^{m}\sum_{i'=1}^{m}\kappa(x_i,x_{i'}) - \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\kappa(x_i,y_j) + \frac{1}{n^2}\sum_{j=1}^{n}\sum_{j'=1}^{n}\kappa(y_j,y_{j'})   (4)

\kappa(x,y) = \exp\left(-\frac{\|x-y\|^2}{2\sigma^2}\right)   (5)

where F is the set of all functions f, p and q are two different distributions, X = {x_i} and Y = {y_j} are their corresponding samples, the batch sizes of the two distributions are m and n, and σ is the parameter of the Gaussian kernel function κ in Eq. (5).
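Eq. (4) with the Gaussian kernel of Eq. (5) can be computed directly. A minimal sketch, with samples as plain Python lists:

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel kappa(x, y) of Eq. (5)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Squared MMD (Eq. 4) between two batches X and Y."""
    m, n = len(X), len(Y)
    xx = sum(gaussian_kernel(a, b, sigma) for a in X for b in X) / m ** 2
    yy = sum(gaussian_kernel(a, b, sigma) for a in Y for b in Y) / n ** 2
    xy = sum(gaussian_kernel(a, b, sigma) for a in X for b in Y) / (m * n)
    return xx + yy - 2.0 * xy
```

MMD² is zero when the two batches coincide and grows as the distributions move apart, which is exactly the property used later to pull unlabeled features toward the PEDCC centroids.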
2.4 AutoAugment
Using reinforcement learning to search for the best augmentation strategy for a given dataset, AutoAugment improves on manually designed data augmentation methods. AutoAugment defines a search space consisting of multiple sub-policies. Each sub-policy contains two image operations, and each operation has three parts: operation type, probability, and magnitude. Due to the diversity of image operations, the entire search space has about 2.9 × 10^32 possibilities. By searching this space, AutoAugment finds the most suitable augmentation method for each dataset. For example, the augmentation strategy found for CIFAR10 mainly consists of color transformations, while the strategy for SVHN mainly consists of geometric transformations. In addition, the strategy found for CIFAR10 can be transferred to the CIFAR100 dataset.
3 The proposed method
In this section, we introduce our semi-supervised classification learning method. Our approach builds on the techniques presented in Section 2.
3.1 Overall framework
As shown in Fig. 1, we use the labeled samples and unlabeled samples of a given dataset to train the image classification model. Our method first processes the dataset, performing data augmentation on each unlabeled sample x_u to obtain x_a. Different augmentation strategies are used for different datasets; the strategies include a variety of image operations such as rotation, histogram equalization, clipping, and so on. For a given number of classification categories k, k feature points that are evenly distributed over the hypersphere are generated using the PEDCC algorithm.
The labeled samples x_l, the unlabeled samples x_u and the augmented samples x_a are input to the convolutional neural network in the same batch according to the specified numbers. The feature vectors f_l of the labeled samples and f_u of the unlabeled samples are obtained at the output of the pooling layer, and the category predictions p_l, p_u, and p_a are obtained at the final output of the network. Subsequently, the losses are calculated. For the features f_l of the labeled samples and the features f_u of the unlabeled samples, we add the mean squared error loss and the MMD loss respectively, so that their distributions become similar to the PEDCC distribution. Based on the smoothness assumption, we minimize the KL divergence between the network's predictions on unlabeled samples and on augmented samples. Meanwhile, AM-Softmax is used to constrain the difference between the predicted values of the labeled samples and the ground-truth labels. Through these multiple constraints, our algorithm makes full use of the samples to obtain clear decision boundaries between different classes. The loss functions are described in detail below.
3.2 Loss function
The loss function of our semi-supervised learning algorithm consists of two parts: the loss on the labeled samples and the loss on the unlabeled samples. Each part may itself contain more than one loss term.
3.2.1 Loss of labeled samples
Usually the loss on the labeled samples is obtained by calculating the error between the output and the ground-truth label. However, this constraint alone cannot ensure that the inter-class distance is large enough and the intra-class distance is small enough. We first generate the centroid matrix PEDCC, whose dimension is k × d, where k is the number of classification categories and d is the dimension of the feature vector of each class. Subsequently, the weight of the fully connected layer of the convolutional neural network is fixed to the value of PEDCC, and the weights and features are normalized. Therefore, when the feature vector matches the predefined features of a certain class, the output of the network is a one-hot vector.
Assuming that the number of labeled samples in a batch is n, we calculate the mean squared error between the feature vectors extracted by the neural network and the predefined centroids as the feature loss function:

L_{MSE} = \frac{1}{n}\sum_{i=1}^{n}\|f_i - pedcc_{y_i}\|^2   (6)

where f_i is the feature of the i-th sample in the batch, pedcc_{y_i} is selected from PEDCC according to the ground-truth label y_i, and the feature dimension of f_i is d. In addition, AM-Softmax is used to calculate the error between the predicted value of each labeled sample and its ground-truth label:

L_{AM} = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)}+\sum_{j=1,j\neq y_i}^{k}e^{s\cos\theta_j}}   (7)
where θ_j is the angle between the weights of the fully connected layer and the feature vector, y_i is the true label of the i-th sample, and s and m are adjustable hyperparameters. The loss of the labeled samples consists of the two losses above:

L_{labeled} = w_1 L_{AM} + w_2 L_{MSE}^{1/\lambda}   (8)

where λ is an adjustable hyperparameter that satisfies λ ≥ 1, and w_1 and w_2 are used to regulate the relative magnitudes of the losses.
3.2.2 Loss of unlabeled samples
In supervised learning, the generalization performance of the model is often improved by data augmentation, and an augmented sample shares the same label as its original sample. Similarly, data augmentation can be applied to the unlabeled samples in semi-supervised learning: before and after augmentation, the predicted outputs for an unlabeled sample should be consistent. Data augmentation should generate augmented samples that are close to real samples, so augmentation methods that strongly distort the image, such as adding Gaussian noise, are not recommended. While previous augmentation methods adopt a fixed strategy for all datasets, AutoAugment optimizes the augmentation strategy for each dataset and learns the most effective augmentation methods for the original data.
We use the optimal strategy found by AutoAugment to perform data augmentation. Meanwhile, KL divergence is used as a loss function to constrain the consistency between the outputs for the augmented samples and the original samples:

L_{KL} = \frac{1}{n_u}\sum_{i=1}^{n_u} KL\big(p(y \mid x_u^i; \hat{\theta}) \,\|\, p(y \mid x_a^i; \theta)\big)   (9)

where n_u is the number of unlabeled samples in a batch; note that n_u is different from the number of labeled samples n in Eq. (6). p(y | x_u^i; \hat{\theta}) is the model output for the unlabeled sample, and p(y | x_a^i; \theta) is the output for the augmented sample. Following VAT, \hat{\theta} and θ share the same value, while \hat{\theta} does not participate in the backpropagation of the model parameters.
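The consistency term can be sketched as follows. Here preds_unlabeled plays the role of the fixed target p(y | x_u; θ̂); in a deep learning framework that tensor would be detached (stop-gradient), so gradients flow only through the augmented branch:

```python
import math

def kl_div(p, q, eps=1e-8):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(preds_unlabeled, preds_augmented):
    """Average KL between predictions on original and augmented samples.

    preds_unlabeled is the fixed target: in a DL framework it would be
    detached from the graph so that only the augmented branch is trained.
    """
    n_u = len(preds_unlabeled)
    return sum(kl_div(p_u, p_a)
               for p_u, p_a in zip(preds_unlabeled, preds_augmented)) / n_u
```

When the two predictions agree the loss vanishes, so the model is only pushed when augmentation changes its output.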
To further minimize the discrepancy between the feature distribution of the unlabeled samples and the predefined class centroids, we use MMD as a loss function:

L_{MMD} = MMD^2(F_u, PEDCC)   (10)

where the dimension of F_u is n_u × d, the dimension of PEDCC is k × d, and κ is the Gaussian kernel function in Eq. (5). The loss of the unlabeled samples consists of the two losses above:

L_{unlabeled} = w_3 L_{KL} + w_4 L_{MMD}   (11)

where w_3 and w_4 are used to regulate the relative magnitudes of the losses.
3.2.3 Final loss
In our approach, the final loss function of the network is the sum of the labeled-sample loss and the unlabeled-sample loss:

L = w_1 L_{AM} + w_2 L_{MSE}^{1/\lambda} + w_3 L_{KL} + w_4 L_{MMD}   (12)
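The assembly of the final objective can be sketched as below. The default weights mirror the CIFAR10 settings reported in Section 4, and the mapping of w_3 to the KL term and w_4 to the MMD term is our reading of the text:

```python
def total_loss(l_am, l_mse, l_kl, l_mmd,
               w1=1.0, w2=1.0, lam=1.0, w3=400.0, w4=0.2):
    """Final objective: weighted labeled loss plus weighted unlabeled loss."""
    labeled = w1 * l_am + w2 * l_mse ** (1.0 / lam)   # Eq. (8)
    unlabeled = w3 * l_kl + w4 * l_mmd                # Eq. (11)
    return labeled + unlabeled
```

For example, with per-term values l_am = 1.0, l_mse = 1.0, l_kl = 0.001 and l_mmd = 0.5, the default weights give 1.0 + 1.0 + 0.4 + 0.1 = 2.5; the large w3 compensates for the typically small magnitude of the KL term.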
3.3 Network structure
The network structure we use is WideResNet [21], which reduces the depth and increases the width of residual networks. Compared to a thin and deep ResNet, WideResNet achieves better image classification accuracy. Table 1 lists the specific parameters of the network structure we use. Unlike previous networks, the weight of the fully connected layer is fixed to the predefined evenly-distributed class centroids. The feature matrix of a batch has dimension N × d, where N is the total number of labeled and unlabeled samples in the batch, and d is the feature vector dimension of each class centroid.
|Group name||Output size||Block type=B(3,3)|
|conv1||32×32||[3×3, 16]|
|conv2||32×32||[3×3, 32; 3×3, 32] × 4|
|conv3||16×16||[3×3, 64; 3×3, 64] × 4|
|conv4||8×8||[3×3, 128; 3×3, 128] × 4|
|avg-pool||1×1||[8×8]|
4 Experiments
4.1 Implementation details
The network structure used in our experiments is WideResNet with depth 28 and width 2. As shown in Table 2, we evaluate our semi-supervised learning algorithm on the CIFAR10 [22] and SVHN [23] datasets. Both CIFAR10 and SVHN are benchmark image classification datasets; each has 10 sample categories, and the image resolution is 32 × 32.
In the experiments, we use 4000 labeled samples for CIFAR10 and 1000 labeled samples for SVHN. Note that AutoAugment uses these labeled samples to find the optimal data augmentation strategy, so the selection of labeled samples in our experiments is consistent with AutoAugment. For each unlabeled sample, 100 augmented samples are generated by data augmentation. In one batch, there are 32 labeled samples, 160 unlabeled samples, and the corresponding 160 augmented samples. The learning rate is attenuated with a cosine decay schedule; the initial learning rate is set to 0.03 on CIFAR10 and 0.05 on SVHN. Gradient descent with momentum is used as the optimizer, with momentum set to 0.9. All experiments are performed on a GTX 1080Ti GPU.
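The cosine decay schedule mentioned above can be written as follows; this is the common annealing-to-zero form, and the exact variant used (e.g. warmup, final value) is an assumption:

```python
import math

def cosine_decay_lr(step, total_steps, base_lr=0.03):
    """Anneal the learning rate from base_lr at step 0 down to 0."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))
```

With base_lr = 0.03 (the CIFAR10 setting), the rate is 0.03 at the start, 0.015 halfway through training, and approaches 0 at the end.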
From Eq. (6) to Eq. (12), there are seven hyperparameters involved in our method: s, m, λ, w_1, w_2, w_3, and w_4. Among them, s, m, λ, w_1, and w_2 are included in PEDCC-Loss. Therefore, referring to the previous study [18], the value of s is 7.5, the value of m is 0.35, and the values of λ, w_1, and w_2 are all 1. Meanwhile, w_3 and w_4 are used to balance the relative magnitudes of the different losses. In the following, we discuss how to determine the values of w_3 and w_4. The details of the hyperparameter settings are shown in Table 3.
4.2 Experimental results
4.2.1 Comparison with semi-supervised learning methods
We compare our method with several representative semi-supervised learning algorithms, using the same number of labeled samples and the same network structure. The average results over three repeated runs are shown in Table 4. Specifically, we compare our method with five semi-supervised learning methods and a supervised baseline. The supervised baseline uses only the labeled samples, while the large number of unlabeled samples is not used during training. Compared with the supervised baseline, our method improves accuracy by 15.36% on the CIFAR10 dataset and 10.41% on the SVHN dataset.
Pseudo-Label [3], Π-Model [7], Mean Teacher [9], VAT [10] and UDA [13] are representative methods in the semi-supervised field; their details are described in the Introduction. However, the loss functions used in those methods cannot make the feature distribution learned by the network reach an optimal state. Our method uses PEDCC, applying PEDCC-Loss to the labeled samples and MMD to the unlabeled samples to constrain the feature distribution learned by the neural network toward the PEDCC. The dimension of the PEDCC is 10 × d on both datasets. As shown in the table, our method achieves 95.10% accuracy with 4000 labeled samples on CIFAR10 and 97.58% accuracy with 1000 labeled samples on SVHN. Compared with the previous state-of-the-art model UDA, the improvements are 0.44% and 0.99% respectively. Our method optimizes the feature distribution learned by the network, which improves classification accuracy. It is worth mentioning that since the UDA experiments were implemented on Google TPUs, we ran the UDA source code three times in our experimental environment and report the average results in the table.
4.2.2 Ablation Study
In this section, we demonstrate the validity of the loss functions we employ on the labeled and unlabeled samples based on the predefined class centroids. We perform ablation experiments on both datasets, revealing the contribution of each loss function by adding or removing the corresponding component. Specifically, we adopt the following combinations of loss functions:
1) Cross Entropy on labeled samples and KL divergence on unlabeled samples.
2) PEDCC-Loss on labeled samples and KL divergence on unlabeled samples.
3) PEDCC-Loss on labeled samples and the sum of KL divergence and MMD on unlabeled samples.
By comparing the second combination with the first, the performance improvement brought by PEDCC-Loss is demonstrated. Similarly, by comparing the third combination with the second, the performance improvement brought by MMD is revealed. Note that PEDCC-Loss is the sum of the first two terms in Eq. (12). The results of the different combinations are given in Table 5. Clearly, it is effective to predefine the class centroids and apply PEDCC-Loss to a small number of labeled samples: compared with the cross-entropy loss, the error rate with PEDCC-Loss decreases from 5.45% to 4.95% on the CIFAR10 dataset and from 2.98% to 2.62% on the SVHN dataset. By adding the MMD constraint, the distribution of the unlabeled samples approaches the evenly-distributed centroids, and the error rates on the two datasets are further reduced by 0.42% and 0.14% respectively.
4.2.3 Effect of w_3 and w_4
The impact of w_3 and w_4 on the classification results on the CIFAR10 dataset is shown in Table 6. First, we fix the value of w_3 to 400 and set w_4 to 0.1, 0.2, and 0.4 respectively; the corresponding test error rates (TER) are 4.85%, 4.53%, and 5.33%. Therefore, the optimal value of w_4 is determined as 0.2. Based on this result, we fix w_4 to 0.2 and set w_3 to 200, 400, and 600 respectively; the corresponding test error rates are 5.27%, 4.53%, and 5.15%. Thus the algorithm performs best when w_3 is 400 and w_4 is 0.2.
Similarly, on the SVHN dataset, the algorithm achieves its best result, a test error rate of 2.34%, when w_3 is 1600 and w_4 is 0.04. The impact of w_3 and w_4 on SVHN is shown in Table 7.
5 Conclusion
In this paper, we apply PEDCC to semi-supervised learning. Unlike other semi-supervised methods, our method adds feature constraints through the corresponding loss functions, ensuring that the inter-class distance is large and the intra-class distance is small, which improves classification accuracy. Since the final loss function consists of multiple terms, we experimentally determine the optimal parameter settings. Additionally, the performance gains of PEDCC on labeled and unlabeled samples are demonstrated separately through ablation experiments. Our method achieves state-of-the-art performance on the CIFAR10 and SVHN datasets, using 4000 and 1000 labeled samples respectively.
In principle, the effect of data augmentation is directly related to the final performance of semi-supervised learning. The data augmentation strategy we use is AutoAugment, currently the best-performing data augmentation algorithm. Our method mainly adds feature constraints through different loss functions, but we have not improved the existing data augmentation strategies; the improvement of data augmentation methods will be studied in the future.
References
- (1) Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp 1097-1105
- (3) Lee DH (2013) Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In: ICML Workshop on Challenges in Representation Learning, 3: 2
- (4) Chapelle O, Scholkopf B, Zien A (2006) Semi-supervised learning. MIT Press
- (5) Cireşan DC, Meier U, Gambardella LM, Schmidhuber J (2010) Deep, big, simple neural nets for handwritten digit recognition. Neural computation 22(12): 3207-3220
- (6) Rasmus A, Berglund M, Honkala M, Valpola H, Raiko T (2015) Semi-supervised learning with ladder networks. In: Advances in neural information processing systems, pp 3546-3554
- (7) Laine S, Aila T (2017) Temporal ensembling for semi-supervised learning. In: Fifth International Conference on Learning Representations.
- (8) Sajjadi M, Javanmardi M, Tasdizen T (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In: Advances in Neural Information Processing Systems, pp 1163-1171
- (9) Tarvainen A, Valpola H (2017) Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in neural information processing systems, pp 1195-1204
- (10) Miyato T, Maeda S, Koyama M, Ishii S (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence 41(8): 1979-1993
- (11) Zhang H, Cisse M, Dauphin YN, Lopez-Paz D (2017) mixup: Beyond empirical risk minimization. arXiv:1710.09412
- (12) Berthelot D, Carlini N, Goodfellow I, Oliver A, Papernot N, and Raffel C (2019) Mixmatch: A holistic approach to semi-supervised learning. arXiv:1905.02249
- (13) Xie Q, Dai Z, Hovy E, Luong MT, Le QV (2019) Unsupervised data augmentation. arXiv:1904.12848
- (14) Cubuk E D, Zoph B, Mane D, Vasudevan V, Le QV (2018) Autoaugment: Learning augmentation policies from data. arXiv:1805.09501
- (15) Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 815-823
- (16) Liu W, Wen Y, Yu Z, Yang M (2016) Large-margin softmax loss for convolutional neural networks. In: International Conference on Machine Learning 2(3): 7
- (17) Wang F, Cheng J, Liu W, Liu H (2018) Additive margin softmax for face verification. IEEE Signal Processing Letters 25(7): 926-930
- (18) Zhu QY, Zhang PJ, Ye X (2019) A New Loss Function for CNN Classifier Based on Pre-defined Evenly-Distributed Class Centroids. arXiv:1904.06008
- (19) Borgwardt KM, Gretton A, Rasch MJ, Kriegel HP, Scholkopf B, and Smola AJ (2006) Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22(14): e49-e57
- (20) Zhu QY, Zhang RX (2019) A Classification Supervised Auto-Encoder Based on Predefined Evenly-Distributed Class Centroids. arXiv:1902.00220
- (21) Zagoruyko S, Komodakis N (2016) Wide residual networks. In: Proceedings of the British Machine Vision Conference (BMVC).
- (22) Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto
- (23) Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng AY (2011) Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning
- (24) Oliver A, Odena A, Raffel CA, Cubuk ED, Goodfellow I (2018) Realistic evaluation of deep semi-supervised learning algorithms. In: Advances in Neural Information Processing Systems, pp 3235-3246