Recent years have witnessed rapid developments in deep neural networks [He2015DeepRL, Simonyan2014VeryDC, Szegedy2014GoingDW, szegedy2016rethinking] and their widespread application to a variety of tasks [Graves2014TowardsES, Jzefowicz2016ExploringTL]. Because deep neural networks are over-parameterized [Pereyra2017RegularizingNN, Zhang2016UnderstandingDL], many techniques have been invented to improve the generalization performance of deep learning by regularizing the training procedure, such as weight decay [Krizhevsky2009LearningML, Zhang2018ThreeMO], Dropout [Hinton2012ImprovingNN], stochastic pooling [Zeiler2013StochasticPF], data augmentation [Krizhevsky2012ImageNetCW], as well as using perturbed labels [Xie2016DisturbLabelRC] or soft labels [szegedy2016rethinking] (drawn from a continuous space) to replace the original hard labels (zero-one coded) of training samples. In this work, we study the problem of learning optimal soft labels for training samples subject to the deep learning process, and further propose a novel algorithm that co-learns both the “best” soft labels and the deep neural network through an end-to-end training procedure.
Researchers have widely adopted label smoothing to optimize a broad range of tasks, including but not limited to image classification [Huang2018GPipeET, Real2018RegularizedEF, Zoph2017LearningTA], speech recognition [Chorowski2016TowardsBD], and machine translation [Vaswani2017AttentionIA]. The key principle here is to regularize the deep learning procedure with certain privileged prior information [Xie2016DisturbLabelRC, lopez2016unifying] embedded in the soft labels. Label smoothing [szegedy2016rethinking] was first proposed to soften the hard labels with a set of predefined rules, so as to regularize the training objective with smoothness. In addition to using predefined mappings, learning from the soft classification outputs (e.g., logits) of a well-trained teacher neural network (often called knowledge distillation [hinton2015distilling]) can also improve the generalization performance significantly. In our research, we soften labels using the outputs of trained neural networks, so as to incorporate the privileged knowledge of a teacher network [lopez2016unifying]. More specifically, because no well-trained model is available in advance, the proposed algorithm is expected to learn optimal soft labels from the DNN outputs during the training process, i.e., under a self-distillation setting.
To achieve this goal, we propose a novel deep learning algorithm, namely COLAM (CO-Learning of deep neural networks and soft labels via Alternating Minimization). During the training procedure, COLAM alternately minimizes two learning objectives: (i) the training loss subject to the target (soft) labels, and (ii) the loss for learning the soft labels subject to the logit outputs of the learned model. Existing solutions either use the raw soft prediction results as the soft labels for self-distillation, or leverage pre-trained models as teacher networks at additional computation cost. In contrast, COLAM uses one end-to-end training procedure to learn both the soft labels (of all training samples) and the model at once. COLAM improves the generalization of deep learning by softening the labels with “privileged” information while incurring the same computation cost as vanilla training.
The contributions made in this paper are as follows. We study the technical problem of co-learning soft labels and a deep neural network during one end-to-end training process in a self-distillation setting. We design two objectives to learn the model and the soft labels respectively, where the two objective functions depend on each other. We further propose the COLAM algorithm, which achieves the goal by alternately minimizing the two objectives. Extensive experiments have been carried out using real-world image classification datasets under both supervised learning and transfer learning settings. We compare COLAM with a number of baseline algorithms that use soft labels and perturbed labels. The experimental results show that COLAM outperforms the baseline methods with higher classification accuracy (1%–2%) at comparable computation cost.
2 Related Work and Backgrounds
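Label smoothing (LS) was first introduced in [szegedy2016rethinking] to enhance the performance of the Inception model on ImageNet [Deng2009ImageNetAL]. Traditional label smoothing takes a weighted average of the ground-truth label and a uniform distribution over the classes. Formally, given a ground-truth label $y$ in a classification task with $K$ classes, we have $y^{k} \in \{0, 1\}$ and $\sum_{k=1}^{K} y^{k} = 1$. If $y^{k} = 1$, then $k$ is the correct class to which that sample belongs. The softened label $\tilde{y}$ is obtained by
$$\tilde{y}^{k} = (1 - \epsilon)\, y^{k} + \frac{\epsilon}{K},$$
where $\epsilon \in [0, 1]$ is the smoothing weight. Note that the superscripts represent the indices. The hard label is replaced by the softened label when computing the cross entropy loss between the label and the predicted probabilities. Mathematically, the cross entropy between the ground-truth targets $y$ and a predicted probability distribution $p$ is
$$H(y, p) = -\sum_{k=1}^{K} y^{k} \log p^{k}.$$
Now the soft label substitutes the ground-truth hard label in the cross entropy function, giving rise to
$$H(\tilde{y}, p) = -\sum_{k=1}^{K} \tilde{y}^{k} \log p^{k} = (1 - \epsilon)\, H(y, p) + \epsilon\, H(u, p),$$
where $u$ denotes the uniform distribution over the $K$ classes.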
The resulting loss reduces the chance that the network becomes overconfident in its predictions, thus regularizing the network. Besides following a uniform distribution to produce soft labels, label smoothing can be more dynamic. [Pereyra2017RegularizingNN] shows that, with a slight modification of the KL divergence direction, the “confidence penalty” regularizer is equivalent to label smoothing. This regularizer encourages predictions to have larger entropy and lower confidence on the most probable class. It is achieved by softening the model output.
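As a minimal sketch of uniform label smoothing and its effect on the cross entropy, the snippet below smooths a one-hot label and checks numerically that the smoothed loss splits into a hard-label term and a uniform term. The smoothing weight `eps = 0.1`, the predicted probabilities, and the function names are illustrative assumptions, not values taken from the experiments.

```python
import numpy as np

def smooth_labels(hard, eps):
    """Uniform label smoothing: (1 - eps) * one_hot + eps / K."""
    k = hard.shape[-1]
    return (1.0 - eps) * hard + eps / k

def cross_entropy(target, probs):
    """Cross entropy -sum_k target_k * log(probs_k), computed per row."""
    return -np.sum(target * np.log(probs), axis=-1)

eps = 0.1                                      # assumed smoothing weight
hard = np.array([[0.0, 1.0, 0.0, 0.0]])        # one-hot label, K = 4
probs = np.array([[0.05, 0.85, 0.05, 0.05]])   # hypothetical model output
soft = smooth_labels(hard, eps)

# The smoothed loss is a mix of the hard-label loss and a uniform-target loss.
uniform = np.full_like(hard, 1.0 / hard.shape[-1])
loss_soft = cross_entropy(soft, probs)
loss_mix = (1 - eps) * cross_entropy(hard, probs) + eps * cross_entropy(uniform, probs)
```

The uniform term penalizes predictions that assign very low probability to any class, which is how smoothing discourages overconfidence.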
Furthermore, adding noise to labels produces effects similar to label smoothing. DisturbLabel [Xie2016DisturbLabelRC] is a regularizer in the loss layer of a network. It adds noise to labels during training by randomly changing a correct label to another one-hot label. This replacement happens with a fixed probability. [Xie2016DisturbLabelRC] points out that DisturbLabel has the same expected gradient as label smoothing, because the expectation of the disturbed label equals the smoothed label. However, DisturbLabel outperforms traditional LS on many datasets, likely because the randomness in the algorithm contributes to its success. Although these regularizers bring improvements in generalization, little to no dataset knowledge is involved in producing the soft labels.
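The claimed equivalence in expectation can be checked with a short simulation. The sketch below assumes that a disturbed label is redrawn uniformly from all classes (so it may land on the original class again) with probability `alpha`; the function name and the values of `alpha` and `K` are hypothetical.

```python
import numpy as np

def disturb_labels(labels, num_classes, alpha, rng):
    """With probability alpha, redraw each label uniformly from all classes."""
    redraw = rng.random(labels.shape) < alpha
    random_classes = rng.integers(num_classes, size=labels.shape)
    return np.where(redraw, random_classes, labels)

rng = np.random.default_rng(0)
K, alpha, true_class = 4, 0.2, 1               # illustrative values
labels = np.full(200_000, true_class)
disturbed = disturb_labels(labels, K, alpha, rng)

# Empirical class frequencies of the disturbed label for one true class.
empirical = np.bincount(disturbed, minlength=K) / disturbed.size

# Uniform label smoothing with smoothing weight alpha.
smoothed = (1 - alpha) * np.eye(K)[true_class] + alpha / K
```

Under this assumption the empirical frequencies approach the smoothed label, while each individual draw remains a one-hot vector, which is where the extra randomness of DisturbLabel comes from.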
3 COLAM Algorithm Design
In this section, we first present the overall learning procedure with the design of two objectives.
3.1 Learning Procedure
Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ be the training set, which contains $N$ labelled training samples from $K$ classes. Each sample is denoted by $(x_i, y_i)$, where $x_i$ is the input and $y_i$ is its hard label. We define our target deep neural network $f_\theta$, parameterized with $\theta$, as the mapping function.
COLAM splits the overall training procedure into $S$ equal-length stages, with each stage consisting of $E$ epochs. In the first stage, COLAM uses the original hard labels of the training samples to train the deep neural network for $E$ epochs and obtains $\theta_1$. Then, COLAM computes the soft label for every sample (i.e., $\tilde{y}_i$ for sample $(x_i, y_i)$) in the training set by minimizing the Loss of Soft Label Learning.
From the second stage to the final stage of the training procedure, COLAM continues the deep learning procedure using the training samples with soft labels ($(x_i, \tilde{y}_i)$) by minimizing the Loss of Model Learning, and repeats the soft label computation at the end of each stage. By alternately minimizing the training loss and the soft label design loss, COLAM is expected to reach convergence and outputs the final parameters $\theta_S$ together with the soft labels as the results of soft label and model co-learning. In total, the model is trained for $S \times E$ epochs.
The complete algorithm is illustrated in Algorithm 1.
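The stage-wise procedure can be illustrated on a toy problem. The sketch below is not the authors' implementation: it substitutes a linear softmax classifier for the deep network, full-batch gradient descent for SGD, and uses all samples of a class as peers. The function names, the synthetic data, and the settings (4 stages, 50 epochs per stage, temperature 2) are assumptions made for illustration.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def train_epoch(W, X, targets, T, lr=0.2):
    """One full-batch gradient step on the cross entropy of a linear model."""
    probs = softmax(X @ W, T)
    grad = X.T @ (probs - targets) / (T * len(X))
    return W - lr * grad

def class_soft_labels(W, X, y, K, T):
    """Per-class soft label: softmax of the mean logits of that class."""
    logits = X @ W
    means = np.stack([logits[y == k].mean(axis=0) for k in range(K)])
    return softmax(means, T)

def colam_sketch(X, y, K, stages=4, epochs_per_stage=50, T=2.0):
    W = np.zeros((X.shape[1], K))
    targets = np.eye(K)[y]                     # first stage: hard labels
    for _ in range(stages):
        for _ in range(epochs_per_stage):      # (i) model learning
            W = train_epoch(W, X, targets, T)
        soft = class_soft_labels(W, X, y, K, T)  # (ii) soft label learning
        targets = soft[y]                      # later stages: soft labels
    return W, soft

# Synthetic three-class data: Gaussian blobs around fixed centers.
rng = np.random.default_rng(0)
K, n_per_class = 3, 100
centers = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 3.0]])
X = np.concatenate([c + rng.normal(size=(n_per_class, 2)) for c in centers])
y = np.repeat(np.arange(K), n_per_class)

W, soft = colam_sketch(X, y, K)
accuracy = np.mean((X @ W).argmax(axis=1) == y)
```

Each row of `soft` is the label shared by all samples of one class. The two inner steps correspond to the two objectives, and no separate teacher model is trained at any point.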
3.2 Loss of Deep Neural Network Training
COLAM simply uses the cross-entropy function as the loss to train deep neural networks. In the first stage, COLAM computes the DNN training loss using hard labels, and it starts leveraging the soft labels from the second stage. Given a pair of predictor $x$ and label $y$, the parameter $\theta$ and the temperature $T$ for softmax, COLAM considers the cross-entropy loss as follows:
$$\mathcal{L}(x, y; \theta, T) = -\sum_{k=1}^{K} y^{k} \log \sigma_T\big(f_\theta(x)\big)^{k}, \qquad x \in \mathbb{R}^{d},$$
where $f_\theta(x)$ denotes the logit output of the network, $K$ refers to the dimension of the label and $d$ is the dimension of the input. The softmax function is defined as follows:
$$\sigma_T(z)^{k} = \frac{\exp(z^{k}/T)}{\sum_{j=1}^{K} \exp(z^{j}/T)}.$$
Note that, for the first stage, COLAM uses $(x_i, y_i)$, referring to the training sample with its hard label. From the second stage, COLAM uses $(x_i, \tilde{y}_i)$, referring to the sample with its soft label. Please refer to lines 3–8 of Algorithm 1 for details.
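The following sketch evaluates a cross-entropy loss with a temperature-scaled softmax for both a hard and a soft target. The logits, the soft label, and the function names are hypothetical; the snippet only illustrates that the same loss function serves both stages, with the target being the only thing that changes.

```python
import numpy as np

def softmax_t(logits, T=1.0):
    """Softmax of logits / T, shifted by the maximum for numerical stability."""
    scaled = logits / T
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    e = np.exp(scaled)
    return e / e.sum(axis=-1, keepdims=True)

def model_loss(logits, target, T=1.0):
    """Cross entropy between a target and the temperature softmax of logits."""
    return -np.sum(target * np.log(softmax_t(logits, T)), axis=-1)

logits = np.array([2.0, 1.0, 0.0])     # hypothetical network output
hard = np.array([1.0, 0.0, 0.0])       # first stage: one-hot label
soft = np.array([0.7, 0.2, 0.1])       # later stages: hypothetical soft label

loss_hard = model_loss(logits, hard)
loss_soft = model_loss(logits, soft, T=2.0)
```

With a hard label the loss reduces to the negative log-probability of the correct class, whereas a soft label also rewards probability mass placed on the related classes.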
3.3 Loss of Soft Label Learning
To achieve better generalization while lowering the computational complexity, COLAM assumes that all samples of the same class share the same soft label.
Peer Samples. To learn the soft label, COLAM first retrieves the peer samples of every training sample. Given $(x_i, y_i)$, its peer sample set (denoted as $\mathcal{P}(x_i, y_i)$) is defined as $\mathcal{P}(x_i, y_i) = \{(x_j, y_j) \in \mathcal{D} : y_j = y_i\}$.
Soft Label Loss. Given the set of peer samples $\mathcal{P}_k$ for the class $k$, COLAM computes the soft label by minimizing the distance between the soft label and the soft prediction result of every sample in $\mathcal{P}_k$ (see also line 13 of Algorithm 1). COLAM defines this objective as follows:
$$s_k = \arg\min_{s} \sum_{(x_j, y_j) \in \mathcal{P}_k} \left\| s - z_j \right\|_2^2,$$
where $z_j$ refers to the soft prediction result for $x_j$ and $s$ is the learning objective of soft labels. Because the minimizer of a sum of squared distances is the mean, line 13 of Algorithm 1 is equivalent to estimating the mean of the soft prediction results over all peer samples. With $s_k$ obtained, COLAM uses softmax to further normalize the vector. Finally, COLAM uses the normalized vector $\tilde{y}_k = \mathrm{softmax}(s_k)$ to replace the hard label as the soft label for further computation. Please refer to lines 9–14 of Algorithm 1 for details.
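A small numerical check of the soft label computation is given below. It assumes that the distance is the squared Euclidean distance, so that the minimizer is the mean of the peers' outputs, and that the result is normalized with a plain softmax. The peer logits and the function names are hypothetical.

```python
import numpy as np

def soft_label_for_class(peer_logits):
    """Mean of the peers' outputs (the least-squares minimizer), then softmax."""
    s = peer_logits.mean(axis=0)
    e = np.exp(s - s.max())                    # shift for numerical stability
    return s, e / e.sum()

# Hypothetical logits of three peer samples that belong to class 0.
peer_logits = np.array([[4.0, 1.0, 0.0],
                        [3.0, 2.0, 0.0],
                        [5.0, 0.0, 0.0]])
s, soft = soft_label_for_class(peer_logits)

def objective(v):
    """Sum of squared distances from v to every peer's output."""
    return np.sum((peer_logits - v) ** 2)

# Moving away from the mean in any direction increases the objective.
worse = min(objective(s + 0.1), objective(s - 0.1))
```

Because the minimizer has a closed form, the soft label update needs only one forward pass over the peer samples and no additional gradient steps.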
|Model||# of Parameters||HL||LS||CP||DL||COLAM|
HL refers to models trained with a standard stochastic gradient descent (SGD) optimizer using hard labels. LS refers to models trained with label smoothing. CP refers to models regularized by Confidence Penalty. DL refers to models trained with DisturbLabel.
|Model||# of Parameters||HL||LS||CP||DL||COLAM|
|Model||# of Parameters||HL||LS||CP||DL||COLAM|
4.1 Tasks and Datasets
Image Classification We use CIFAR10, CIFAR100 [Krizhevsky2009LearningML] and ImageNet [Deng2009ImageNetAL] to test the performance of COLAM on the image classification task. CIFAR10 has 10 different classes, and each class contains 5,000 images in the training set. CIFAR100 contains RGB images categorized into 100 classes, with each class comprising 600 images: 500 training images and 100 testing images. ImageNet is a tree-structured image database organized according to the WordNet hierarchy. It consists of more than 20K categories and a total of 14 million images. We use the popular subset ILSVRC2012, which contains 1.3M images covering 1K categories.
Transfer Learning We use ImageNet as the source dataset and 4 different target tasks covering typical types of plants, animals, objects and textures. The 4 target datasets are (1) Flower102 [Nilsback2008AutomatedFC], which contains 102 categories of 8189 flower images, (2) Caltech-UCSD Birds-200-2011 [WahCUB_200_2011], which has 11,788 images classified into 200 categories, (3) FGVC-Aircraft [Maji2013FineGrainedVC], which comprises 10,000 images of aircraft across 100 aircraft models, and (4) Describable Textures Dataset (DTD) [Cimpoi2015DeepFB], which is a texture database consisting of 5640 images organized according to a list of 47 terms (categories).
Image Classification We train PreActResNet34 [He2016IdentityMI], WideResNet40x4 [Zagoruyko2016WideRN] and ResNeXt29_8x64d [Xie2016AggregatedRT] for 160 epochs. We train DenseNet-BC-100 [Huang2016DenselyCC] for 240 epochs because it converges more slowly. The initial learning rate is 0.1 for all architectures. The training batch size is 64. We use a standard SGD optimizer with momentum 0.9 and weight decay 0.0001. We apply standard data augmentation in the same way as the official PyTorch examples for the CIFAR10, CIFAR100 and ImageNet classification tasks. For the CIFAR datasets, we pad the input images by 4 pixels, randomly crop a sub-region of $32 \times 32$, and randomly apply a horizontal flip. For ImageNet, we randomly crop a sub-region of $224 \times 224$ and randomly apply a horizontal flip. We normalize the input data as done in common practice.
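The CIFAR augmentation described above can be sketched in NumPy as follows. The crop size of 32 (the native CIFAR resolution), the zero padding, and the flip probability of 0.5 are assumptions, and `augment_cifar` is a hypothetical helper rather than part of any library.

```python
import numpy as np

def augment_cifar(image, rng, pad=4, crop=32):
    """Zero-pad by `pad` pixels, take a random crop, then flip with prob. 0.5."""
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, padded.shape[0] - crop + 1)
    left = rng.integers(0, padded.shape[1] - crop + 1)
    out = padded[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        out = out[:, ::-1]                     # horizontal flip (width axis)
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                # stand-in for one HWC RGB image
augmented = augment_cifar(image, rng)
```

Padding before cropping lets the crop window shift by up to 4 pixels in each direction while keeping the output at the input resolution.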
Transfer Learning We use ResNet101 [He2015DeepRL] as the base model for COLAM. We train the model for 40 epochs with a training batch size of 64. An SGD optimizer is used with a momentum of 0.9. The initial learning rate is set to 0.01 and the weight decay is set to 0.0001. We use exactly the same data augmentation methods as in the ImageNet classification task. We repeat all these experiments 3 times and report the average Top-1 accuracy.
Image Classification Table 1 and Table 2 show that COLAM consistently improves the accuracy of the baseline models for the majority of the neural network architectures we tested. On CIFAR100, the improvement is generally within 1%–2% in comparison to models trained with hard labels using a vanilla SGD optimizer. On CIFAR10 and CIFAR100, we find that models with more complex architectures are not guaranteed to be better than simpler ones. For example, WideResNet40x4 with 9 million parameters outperforms PreActResNet34 with 20 million parameters. This happens regardless of the training technique used. We observe a similar improvement in Table 3 when we apply COLAM to different models on ImageNet.
These results indicate the effectiveness of COLAM, which outperforms not only the traditional label smoothing technique but also other more dynamic yet inherently equivalent forms of label smoothing, namely Confidence Penalty and DisturbLabel. One explanation is that neither CP nor DisturbLabel encourages the model to learn the structural properties of the dataset when it regularizes the model. Preserving the structural properties of the dataset is an important factor that contributes to good generalization of a model, as discussed in Sec. 5.2.
Transfer Learning We fix our test model to be ResNet101 and perform experiments on the chosen datasets. The results in Table 4 indicate that, compared with models trained using hard labels, COLAM improves the transfer learning outcomes on all four datasets, with improvements ranging from 0.46% to 1.78%. These results show that COLAM can enhance model performance on varying datasets.
In contrast, LS does not always yield an improvement. Furthermore, the improvement that LS brings is considerably smaller than that of COLAM.
5.1 Effect on Training Procedure
To further demonstrate the effectiveness of our proposed alternating optimization of the training loss and soft labels, we dive into the training process of COLAM. Experiments show how the evolution of soft labels helps learning.
We first plot the learning curve of the whole training procedure of PreActResNet34, as shown in Fig. (a). For clearer demonstration, we divide the training process into only 4 stages for COLAM training, with each stage consisting of 40 epochs. As expected, models trained with SGD using hard labels and with COLAM perform almost identically in the first stage. In the 41st epoch, however, both the training and test accuracy of COLAM rise sharply, because training starts to use the soft labels generated in the 40th epoch. Both accuracy values then drop slightly for a few epochs before returning to a slowly rising trend for the remaining epochs of the stage. A similar phenomenon appears at the beginning of the next stage, although the magnitude of the accuracy improvement is much smaller. After the first sharp rise, COLAM continuously outperforms training with hard labels on the test set by a stable gap for the remaining epochs until convergence.
We conduct additional experiments to validate the effect of the gradual improvement of soft labels through our proposed alternating minimization approach. We divide the training process into 10 stages, each of which consists of 16 epochs. By performing COLAM, we obtain 10 checkpoints of soft labels. For each checkpoint, we train a model from scratch with vanilla SGD, the only difference being that the soft labels of that checkpoint are used as supervision. The top-1 accuracy obtained with each soft label is denoted as the corresponding “expected accuracy”, which we use as a measure of the quality of a soft label. As illustrated in Fig. (b), we observe that the expected accuracy gradually increases with the evolution of the soft labels. In detail, the expected accuracy grows fast during the early period of the whole training procedure. This observation verifies that the quality of the soft labels is indeed improved by optimizing them alternately with the training loss. It is worth noting that the expected accuracy begins to surpass COLAM after about half of the training epochs. The expected accuracy shows a slight drop near the end of the training process, indicating that alternating optimization of the two objectives may suffer from over-fitting to a relatively small extent. Although COLAM improves the accuracy significantly, our expected accuracy experiments imply that better designs of soft labels may exist.
5.2 Effect on Deep Representations
The empirical characteristics of COLAM observed in our experiments also show that COLAM is able to promote training by involving internal structural knowledge. We demonstrate this advantage through both qualitative and quantitative analysis.
Qualitative Visualization Recent work [whenMuller] gives insights into why label smoothing improves model performance. They argue that the logit for class $k$ is correlated to the distance between $h$ and $w_k$, where $h$ is the penultimate layer representation and $w_k$ is the template for class $k$. As a result of their analysis, after a model is trained with LS, the penultimate layer representation should be close to the correct class template $w_k$ and equally distant from the incorrect class templates $w_j$ for all $j \neq k$.
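The relation between logits and distances that this argument relies on can be written out explicitly. As a sketch, assume that the logit of class $k$ is the inner product $h^\top w_k$ of the penultimate layer representation $h$ and the class template $w_k$; the symbols are chosen here for illustration and bias terms are ignored. Expanding the squared Euclidean distance gives:

```latex
\|h - w_k\|^2 = \|h\|^2 - 2\, h^\top w_k + \|w_k\|^2
\quad\Longrightarrow\quad
h^\top w_k = \tfrac{1}{2}\left( \|h\|^2 + \|w_k\|^2 - \|h - w_k\|^2 \right)
```

For a fixed $h$, and provided the template norms $\|w_k\|$ are roughly equal across classes, a larger logit therefore corresponds to a smaller distance between the representation and the template.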
Since COLAM preserves the structural properties of the dataset better than LS does, the penultimate layer representation is expected to be closer to the template of a class that shares greater inter-class similarity than to that of a class that does not. We verify this by projecting both the penultimate layer representations and the templates into 2D. We choose 100 samples from each of three classes in CIFAR100 for this visualization: “man”, “palm tree”, and “pine tree”. Intuitively, “palm tree” and “pine tree” should be more similar to each other than to “man”.
Fig. (a) and Fig. (b) show the cluster distributions on the training set and test set when ResNet56 is trained with hard labels using vanilla SGD. We observe that the clusters are close to their respective templates. However, they are generally spread out and scattered. The clusters of “palm tree” and “pine tree” are closer to each other than either is to “man”. This reflects the structural property within the dataset.
As seen in Fig. (c) and Fig. (d), the clusters obtained with label smoothing are tighter and easier to separate. The three clusters also tend to be equally distant from the other classes’ templates, so that the clusters are drawn inward, closer to the center of the subspace formed by the templates. However, the structural properties of the dataset are no longer preserved: the distance between “palm tree” and “pine tree” is the same as that between “palm tree” and “man”.
Next, we see in Fig. (e) and Fig. (f) that, when ResNet56 is trained with COLAM, the clusters are better separated in both the training set and the test set. Each cluster remains close to its own template, but lies further away from the center of the subspace this time. This indicates the model’s improved ability to distinguish each sample. Additionally, each cluster looks tighter than the clusters in Fig. (a) and Fig. (b). What remains unchanged is the structural property of the dataset: “palm tree” and “pine tree” are still closer to each other, and “man” is further away from these two. In comparison, LS does not maintain this structural property.
Observing all figures as a whole, we see that COLAM acts as a “neutralizer” between using hard labels and LS. COLAM both represents the structural properties of the dataset accurately (as in training with hard labels) and obtains tighter clusters that are easier to classify (as in training with LS). This “neutralizing” effect enables a model to generalize better.
Quantitative Evaluation We quantitatively evaluate three distance criteria for the same three classes to confirm the qualitative findings described above. Specifically, we find the Euclidean distance between the normalized templates, shown in Table 5. Note that we normalize all vectors when we compute such distances in order to make fair comparisons.
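A minimal sketch of this distance computation is given below, assuming that the templates are the rows of a matrix and that normalization means scaling each row to unit Euclidean norm. The function name and the example vectors are hypothetical.

```python
import numpy as np

def normalized_template_distances(templates):
    """Pairwise Euclidean distances between L2-normalized template vectors."""
    unit = templates / np.linalg.norm(templates, axis=1, keepdims=True)
    diff = unit[:, None, :] - unit[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Hypothetical 2D templates for three classes (one row per class).
templates = np.array([[1.0, 0.0],
                      [0.0, 2.0],
                      [3.0, 3.0]])
dist = normalized_template_distances(templates)
```

For unit vectors the distance equals $\sqrt{2 - 2\cos\phi}$, where $\phi$ is the angle between the two templates, so all values lie in $[0, 2]$ and two orthogonal templates are $\sqrt{2} \approx 1.414$ apart.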
As shown in Table 5, COLAM largely preserves the structural properties of the dataset by keeping the “palm tree” and “pine tree” templates closer to each other and the “man” template further away, which is consistent with the model trained with hard labels.
In contrast, LS generally enlarges the overall distance between templates, due to the requirement that each cluster should be equidistant from the incorrect class templates. Since clusters stay close to their respective templates after training, this observation implies that the structural properties of the data are missing in LS.
Fig. (a) to Fig. (c) show the distances between templates and clusters. We see that COLAM (Fig. (c)) is able to preserve the structural properties of the dataset because the distances from a template to the other clusters display the same trend as in Fig. (a). Together, the two figures also show that each template is closer to its corresponding cluster and further away from the other clusters when the model is trained with COLAM. Since Table 5 shows that the distances between the templates under these two methods are roughly the same, the difference between the average distance from a template to its own cluster and the average distance from a template to other clusters indicates how well the clusters are separated. Fig. (c) shows a larger such difference than Fig. (a) does. This explains why COLAM outperforms training with hard labels.
|Class pair||HL||LS||COLAM|
|(Man, Palm Tree)||1.433||1.460||1.438|
|(Man, Pine Tree)||1.425||1.400||1.461|
|(Palm Tree, Pine Tree)||1.264||1.369||1.227|
COLAM hyperparameter experiments on PreActResNet18.
Now we compare COLAM (Fig. (c)) with LS (Fig. (b)). Because COLAM does not force each cluster to be equidistant from the incorrect class templates and preserves the structural properties of the data, the distance from a template to its own cluster can be far smaller than in LS. This smaller distance contributes to a larger difference between the average distance from a template to its own cluster and the average distance from a template to other clusters. Hence, COLAM is able to improve model generalization to a greater extent than LS. Moreover, clusters in LS are almost equidistant from the other classes’ templates, as shown in Fig. (b), which violates the structural properties of the dataset.
Computing the average distance between samples and their centroid reveals how tight a cluster is. A smaller average value indicates a tighter cluster and vice versa. In Fig. (d) to Fig. (f), we observe that the smallest such distance is found in the model trained with COLAM.
We are also interested in the difference between the distance from a centroid to its own cluster and the distance from a centroid to other clusters. The larger this difference is, the better the clusters are separated. Fig. (f) indicates that the model trained with COLAM yields the greatest such difference. This, from a slightly different view, explains why COLAM gives rise to the highest generalization ability.
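Both centroid-based criteria can be computed with a few lines of NumPy. The sketch below uses plain Euclidean distance on hypothetical feature vectors; `tightness_and_separation` is an illustrative helper, and any feature normalization is left out for brevity.

```python
import numpy as np

def tightness_and_separation(features, labels):
    """For each class, return (own, other, gap): the mean distance from the
    class centroid to its own samples, to all other samples, and their gap."""
    stats = {}
    for k in np.unique(labels):
        centroid = features[labels == k].mean(axis=0)
        d = np.linalg.norm(features - centroid, axis=1)
        own = d[labels == k].mean()
        other = d[labels != k].mean()
        stats[int(k)] = (own, other, other - own)
    return stats

# Hypothetical 2D features of two well-separated classes.
features = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
stats = tightness_and_separation(features, labels)
```

A small `own` value indicates a tight cluster, and a large `gap` indicates that the cluster is well separated from the others.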
5.3 Choice of Hyperparameters
The two most important hyperparameters are the number of stages and the number of random peer samples. For clarity, in this experiment we use the update interval, or equivalently the number of epochs per stage, instead of the number of stages. We run a grid search over a range of update intervals and numbers of peer samples to validate different combinations of these two variables.
In Fig. (a) we see that the performance of COLAM is not very sensitive to most combinations of the hyperparameters. When the number of peer samples increases to 5 or more, model accuracy tends to be over 77.3%. Even when the number of peer samples is low, a good choice of the update interval can boost the model performance significantly. Low accuracy occurs consistently only when the update interval is large.
Another important hyperparameter is the temperature $T$, which softens the probability distribution over the incorrect classes of the soft labels. We explore different values of $T$ with PreActResNet34 on CIFAR100. Theoretically, a larger value of $T$ makes the probability distribution over the incorrect classes smoother. When $T$ gets sufficiently large, COLAM behaves like traditional LS. In contrast, the probability distribution among incorrect classes gets even sharper if $T$ is less than 1. When $T$ gets close to 0, the soft label is close to the original hard label. Empirically, we recommend setting $T$ to a value between 1 and 2.
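The limiting behaviour of the temperature can be illustrated numerically. The sketch assumes that a soft label is the temperature softmax of a class-mean logit vector; the logit values and the function names are hypothetical.

```python
import numpy as np

def soft_label(mean_logits, T):
    """Temperature softmax of a class-mean logit vector."""
    scaled = mean_logits / T
    e = np.exp(scaled - scaled.max())          # shift for numerical stability
    return e / e.sum()

def incorrect_distribution(label, correct=0):
    """Renormalized probabilities over the incorrect classes only."""
    rest = np.delete(label, correct)
    return rest / rest.sum()

mean_logits = np.array([6.0, 3.0, 1.0, 0.0])   # hypothetical, class 0 correct
for T in (0.1, 1.0, 2.0, 100.0):
    label = soft_label(mean_logits, T)
    print(T, np.round(label, 3), np.round(incorrect_distribution(label), 3))
```

At small temperatures the label collapses onto the correct class, while at large temperatures the distribution over the incorrect classes flattens toward uniform, which is the pattern that uniform label smoothing imposes by construction.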
In this paper, we have discussed the advantages of using soft labels as the target in deep learning and proposed a novel algorithm, COLAM, that alternately minimizes the training loss subject to the soft labels and the objective for learning improved soft labels. We have conducted numerous experiments to demonstrate the method’s effectiveness on a variety of tasks. We have also offered both qualitative and quantitative explanations of why COLAM is more beneficial than existing techniques for producing soft labels.