
Improving Model Training via Self-learned Label Representations

by   Xiao Yu, et al.
Columbia University

Modern neural network architectures have shown remarkable success in several large-scale classification and prediction tasks. Part of the success of these architectures is their flexibility to transform the data from the raw input representations (e.g. pixels for vision tasks, or text for natural language processing tasks) to one-hot output encoding. While much of the work has focused on studying how the input gets transformed to the one-hot encoding, very little work has examined the effectiveness of these one-hot labels. In this work, we demonstrate that more sophisticated label representations are better for classification than the usual one-hot encoding. We propose the Learning with Adaptive Labels (LwAL) algorithm, which simultaneously learns the label representation while training for the classification task. These learned labels can significantly cut down on the training time (usually by more than 50%) while often achieving better test accuracies. Our algorithm introduces negligible additional parameters and has a minimal computational overhead. Along with improved training times, our learned labels are semantically meaningful and can reveal hierarchical relationships that may be present in the data.






Introduction

Neural networks have become an essential tool for achieving high-quality classification in various application domains. Part of their appeal stems from the fact that a practitioner does not have to hand-design the input features for model training. Instead, they can simply use the raw data representation (such as using pixels instead of highly processed SIFT or HOG features for a computer vision task) and learn a mapping to the target class. The high degree of flexibility enables neural networks to learn highly non-linear maps, and thus the target output representation is also usually kept relatively simple. It is customary to encode the target labels as a one-hot encoding (for a $k$-way classification task, the one-hot encoding of the $i$-th category is simply the $i$-th standard basis vector $e_i$ in $\mathbb{R}^k$). While simple and computationally convenient, a one-hot representation is rather arbitrary. Indeed, such an encoding destroys any semantic relationships that the target categories may have. For instance, for a 3-class apparel classification task with categories, say, sandal, sneaker and shirt, the semantic similarity between sandal and sneaker (both being footwear) is clearly not captured by the one-hot encoding. An alternate label representation can allow us to capture this semantic connection, and perhaps even make the learning process easier (cf. Figure 1).
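To make the apparel example concrete, the following minimal sketch (Python with NumPy; only the three class names come from the text, the rest is illustrative) checks that every pair of distinct one-hot labels is equally far apart, so the encoding cannot express that sandal is closer to sneaker than to shirt.

```python
import numpy as np

classes = ["sandal", "sneaker", "shirt"]

# One-hot encoding: class i is the i-th standard basis vector of R^k.
one_hot = np.eye(len(classes))

# Pairwise Euclidean distances between the label vectors.
diff = one_hot[:, None, :] - one_hot[None, :, :]
pairwise = np.linalg.norm(diff, axis=-1)

for i, a in enumerate(classes):
    for j, b in enumerate(classes):
        if i < j:
            # Every pair prints the same value, sqrt(2).
            print(f"{a:>7} vs {b:<7}: {pairwise[i, j]:.3f}")
```

Any label representation that is meant to capture the footwear relationship has to break this symmetry.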

(a) One-hot Labels
(b) LwAL Learned Labels
Figure 1: A visualization of the labels and how a neural network may map the training examples for a 3-way classification task. Left: when using the one-hot label encoding; Right: when adaptively learning the label encoding.

What might be a better representation of the output labels? Since powerful word embedding models such as Word2Vec (Mikolov et al., 2013) and BERT (Devlin et al., 2019) are known to capture the semantic meaning of commonly occurring words, one can use such pre-learned representations of the labels for classification. In fact, Chen et al. (2021) explored this idea in detail. They study the effectiveness of several embeddings, including BERT (pretrained on textual data) and audio spectrograms (trained on the vocal pronunciations of the class labels), to represent the target labels and show improved performance.

An alternate approach, of course, is to explicitly learn the label representation from the data itself. This again can be done in several ways. Sun et al. (2017), for example, propose to augment the underlying neural network with specialized layers for data classification and label embedding that interact with each other during the training process. This of course adds complexity to the network, potentially increasing network size. Deng and Zhang (2021), in contrast, learn a “soft” set of labels by “smoothing” the original one-hot encodings without modifying the underlying network architecture. While the learned labels are reasonably flexible in representation, simple smoothing can miss more complex semantic relationships among labels.

In this work, we learn a robust data-dependent label representation that addresses issues that were unresolved in previous literature. We propose the Learning with Adaptive Labels (LwAL) algorithm, which simultaneously learns semantically appropriate label representations while training for the underlying classification task. LwAL is based on the insight that relationships between class labels should be inherent in data belonging to the classes. Since one can view a neural network as a function $f_\theta$ that maps the input data $x$ to a latent representation $f_\theta(x)$, we can utilize this latent data representation to get an initial estimate of the label representation $z^c$ for each class $c$. Given the initial estimate of the target labels, we can now tune the underlying neural network parameters to improve classification accuracy. This improved network can, in turn, be used to get a better data-dependent label representation in the latent space. We can thus alternate between learning the best representation of the labels in the latent space and learning the best parameters for the underlying network for classification, such that at convergence we achieve both high accuracy and an improved representation of the target labels.

Our Contributions

We propose LwAL, a simple yet powerful alternating-updates training algorithm that can learn high-quality representations of labels. Our algorithm works with any underlying network architecture, without any architecture modifications, and introduces only minimal additional parameters to tune. We show that learning the labels simultaneously with LwAL significantly cuts down on the overall training time (usually by more than 50% and sometimes up to 80%) while often achieving better test accuracies than the previous works on label learning. We further show that our learned labels are in fact semantically meaningful and can reveal hierarchical relationships that may be present in our data.

Related Work

Label representations beyond the one-hot encoding have gained interest in recent years. Here we discuss the related literature in detail.

Learning Labels Directly

Representations by label smoothing:

Label smoothing techniques aim to modify the hard one-hot class probability distribution to a softer target, which can provide broader signals to the model and hence potentially achieve better performance. Numerous smoothing-based regularization techniques such as the Max-Entropy Regularizer (MaxEntReg; Pereyra et al., 2017), the Teacher-Free Regularizer (TFReg; Yuan et al., 2019), and Learning with Retrospection (LWR; Deng and Zhang, 2021) have been proposed in the literature, all showing promising improvements. Yet they do not consider unravelling or understanding the relationships between the learned class labels. Deng and Zhang (2021), for instance, focus on learning labels generated by a temperature-controlled softmax function for better training. Such representations, by their construction, are limited to learning smooth unimodal class probability distributions and cannot capture complex multimodal class distributions that may be necessary to model semantic relationships that may be present in data.
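As a rough illustration of what a temperature-controlled softmax does to a target distribution, here is a minimal sketch. It is not the LWR procedure itself; the function names and the example logits are hypothetical, and only the temperature-scaled softmax is taken from the description above.

```python
import numpy as np

def soft_labels(logits, temperature=1.0):
    """Temperature-controlled softmax over the last axis."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

def entropy(p):
    """Shannon entropy of a probability vector (small epsilon avoids log(0))."""
    return float(-(p * np.log(p + 1e-12)).sum())

logits = np.array([4.0, 1.0, 0.5])           # hypothetical outputs for one sample
sharp = soft_labels(logits, temperature=1.0)
smooth = soft_labels(logits, temperature=5.0)

print("T=1:", np.round(sharp, 3), "entropy", round(entropy(sharp), 3))
print("T=5:", np.round(smooth, 3), "entropy", round(entropy(smooth), 3))
```

A higher temperature spreads mass over the other classes while keeping a single peak, which is the unimodality limitation noted above.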

Representations by network augmentation:

Sun et al. (2017) go beyond just label smoothing and propose a unique approach to augment the underlying neural network with specialized layers to learn sophisticated label representations during the training process. Interestingly, they show that even though their augmented network is more complex, it usually learns a good classifier at a faster rate, achieving state-of-the-art accuracies for label learning.

Static Label Representations:

Rather than learning a label representation that is tuned to a given classification task, Chen et al. (2021) take an alternate approach and use high-quality pre-trained embeddings (such as BERT or GloVe) to represent their target labels. Since no label training is involved, this approach has the advantage of using good label representations with no added complexity, but it suffers from relatively lower classification accuracies. This technique also relies on the practitioner knowing which pre-trained embedding is most suitable for the given classification task, which may not be obvious.

Other Notable Related Techniques

While not aiming to learn label representations explicitly, certain ML models yield labels beyond the traditional one-hot encoding as a side effect. The student-teacher learning paradigm (Hinton et al., 2015), for instance, aims to learn a more compact network that approximates the behavior of a given large network. In this process of distillation, the original one-hot target labels of the larger network usually get an alternate “dense” representation in the learned compact network. While interesting, learning the distilled network is time-intensive and thus not an efficient mechanism to learn label representations.

Xie et al. (2016) develop an unsupervised framework for learning to cluster data in the latent space. They use an auto-encoder architecture to learn a compact latent representation of the input data in which the data is forced to form clusters. One can thus use the centers of these learned latent clusters as a proxy for representing labels. However, the lack of direct supervision yields suboptimal partitions and hence suboptimal label encodings for classification.

Connection to Metric Learning

Metric learning aims to learn a transformation of the input space where data from the same category is mapped closer together than data from different categories (Kulis, 2012; Bellet et al., 2013). One can perhaps view learning labels as performing metric learning not on the input space, but rather on the output space. Interestingly, to the best of our knowledge, this viewpoint is not explored in existing literature and may be a fruitful avenue for future research.

Some metric learning literature does explore semantic hierarchical relationships between labels to learn more informed transformations. Notably, Verma et al. (2012) explicitly incorporate label hierarchy information to markedly improve nearest-neighbor classification accuracy. They additionally show that such a learned metric can also help in augmenting large taxonomies with new categories. Our work, in contrast, derives the label taxonomy directly from data without any prior hierarchical information.


Here we formally introduce our Learning with Adaptive Labels (LwAL) algorithm, which simultaneously learns label representations while training for the underlying classification task. We start by reviewing the standard training procedure for neural networks and introducing our notation. We then present our LwAL modifications that simultaneously learn the label encodings. Finally, we discuss additional optional variations to LwAL that can further improve performance in certain applications.

Standard Neural Network Training Procedure

Recall that given a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ of $n$ samples for a $k$-category classification task, where $x_i$ denotes the application-specific input representation and $y_i$ denotes the one-hot output representation of the $i$-th sample, the goal of a neural network $f_\theta$ (parameterized by $\theta$) is to learn a mapping from the inputs ($x_i$) to the outputs ($y_i$). This learning is usually done by finding a parameter setting $\theta^*$ that minimizes the loss between the predicted output and the desired (one-hot) output $y_i$. In particular, let $f_\theta(x_i)$ be the network encoding of the input $x_i$. First a Softmax is applied to $f_\theta(x_i)$ to obtain a probability distribution $\hat{y}_i := \mathrm{Softmax}(f_\theta(x_i))$, which encodes the affinity of $x_i$ to each of the $k$ classes. Then this induced probability distribution is compared with the ideal probability distribution $y_i$ using any distribution-comparing divergence such as the cross-entropy (CE). Thus the classification loss for the $i$-th sample becomes

$$L_i := \mathrm{CE}(y_i, \hat{y}_i) = -\sum_{c=1}^{k} y_{i,c} \log \hat{y}_{i,c}.$$

The optimal parameter setting can thus be learned by usual iterative gradient-type updates (such as SGD or Adam) on the aggregate loss over all training datapoints.
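The standard per-sample loss described above can be sketched in a few lines of NumPy. The encodings and targets below are made-up values for a 3-class problem, and the function names are illustrative rather than taken from any particular library.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(encodings, one_hot_targets):
    """Per-sample CE between Softmax(encodings) and one-hot targets."""
    return -(one_hot_targets * log_softmax(encodings)).sum(axis=-1)

# Hypothetical network encodings for two samples.
z = np.array([[2.0, 0.5, -1.0],
              [0.0, 0.0, 0.0]])
y = np.eye(3)[[0, 2]]            # true classes: 0 and 2
losses = cross_entropy(z, y)

print(losses)                    # second entry equals log(3): uniform prediction
print(losses.mean())             # aggregate loss used for gradient updates
```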

Learning with Adaptive Labels

To learn more enriched, semantically meaningful label representations, we posit that semantic relationships between classes are contained within the samples belonging to the class. Specifically, we model the label representation of a class $c$ as the vector $z^c$ that minimizes the average distance to the network encoding of the samples belonging to class $c$. This is equivalent to considering

$$z^c := \frac{1}{N_c} \sum_{i \,:\, y_i = c} f_\theta(x_i), \tag{1}$$

where $N_c$ is the number of samples belonging to class $c$ and, with slight abuse of notation, $y_i = c$ means that the $i$-th sample belongs to class $c$.
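A minimal sketch of this per-class averaging, assuming integer class indices and a matrix of network encodings (both hypothetical inputs; the function name is illustrative):

```python
import numpy as np

def class_label_representations(encodings, labels, num_classes):
    """Mean network encoding per class; row c is the label vector of class c."""
    dim = encodings.shape[1]
    reps = np.zeros((num_classes, dim))
    for c in range(num_classes):
        members = encodings[labels == c]
        if len(members) > 0:      # a class absent from the batch stays at zero
            reps[c] = members.mean(axis=0)
    return reps

# Four encoded samples, two per class.
enc = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
lab = np.array([0, 0, 1, 1])
reps = class_label_representations(enc, lab, num_classes=2)
print(reps)
```

How classes that are missing from a mini-batch should be handled is not specified in the text; leaving their rows at zero is only a placeholder choice in this sketch.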

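To bring the training in line with standard neural network updates, given this new class representation, one can define the probability that the network encoding of the $i$-th datapoint belongs to class $c$ as

$$\hat{p}_{i,c} := \frac{\exp\left(-d\left(f_\theta(x_i), z^c\right)\right)}{\sum_{c'=1}^{k} \exp\left(-d\left(f_\theta(x_i), z^{c'}\right)\right)},$$

where $d(\cdot,\cdot)$ is the distance in the network-encoded space. Therefore, the modified cross-entropy loss for the $i$-th datapoint becomes

$$L_i := \mathrm{CE}(y_i, \hat{p}_i) = -\sum_{c=1}^{k} y_{i,c} \log \hat{p}_{i,c},$$

where $\hat{p}_i = (\hat{p}_{i,1}, \dots, \hat{p}_{i,k})$ is the probability distribution that encodes the affinity of $x_i$ to each of the $k$ classes using the new label representation. One can thus train the optimal parameters of the underlying neural network the usual way.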
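The sketch below shows one way such a distance-based probability and the resulting loss could be computed. It assumes a softmax over negative squared Euclidean distances to the label vectors; that specific form, like the function names and toy values, is an assumption of this example rather than something the text confirms.

```python
import numpy as np

def label_log_probs(encodings, label_reps):
    """Log-probabilities from negative squared distances to each label vector."""
    d2 = ((encodings[:, None, :] - label_reps[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

def adaptive_label_loss(encodings, labels, label_reps):
    """Per-sample cross-entropy against the true class index."""
    logp = label_log_probs(encodings, label_reps)
    return -logp[np.arange(len(labels)), labels]

label_reps = np.array([[2.0, 0.0], [0.0, 3.0]])   # hypothetical learned labels
enc = np.array([[1.9, 0.1],                       # close to label 0
                [1.0, 1.5]])                      # equidistant from both labels
lab = np.array([0, 0])
loss = adaptive_label_loss(enc, lab, label_reps)
print(loss)   # near zero for the first sample, log(2) for the second
```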

One should note that the choice of cross-entropy as the loss function encourages the learned class representations $z^c$ to be well separated, yielding empirically better accuracies than other popular loss functions.

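One can predict the label of a test example $x$ by simply assigning it to the closest learned label in the network-encoded space. That is,

$$\hat{c}(x) := \arg\min_{c \in \{1, \dots, k\}} d\left(f_\theta(x), z^c\right),$$

where $d(\cdot,\cdot)$ is the distance in the network-encoded space.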
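Nearest-label prediction can be sketched as follows, assuming Euclidean distance in the encoded space (the distance choice, function name, and toy values are assumptions of this example):

```python
import numpy as np

def predict(encodings, label_reps):
    """Index of the nearest learned label for each encoded test point."""
    d2 = ((encodings[:, None, :] - label_reps[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

label_reps = np.array([[2.0, 0.0], [0.0, 3.0], [-2.0, -2.0]])   # hypothetical labels
test_enc = np.array([[1.5, 0.2], [-0.1, 2.5], [-1.0, -3.0]])    # encoded test points
pred = predict(test_enc, label_reps)
print(pred)
```

Squared and unsquared Euclidean distances give the same nearest label, so the square root is skipped.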

Adapting to large-scale datasets

To accommodate large-scale datasets, we use the mini-batch paradigm. Mini-batch training usually suffers from the moving-target problem (Mnih et al., 2013), that is, the label representations $z^c$ are constantly changing, leading to poor convergence. To alleviate this, we add two hyperparameters: an update frequency that controls how often the labels are updated (Deng and Zhang, 2021), and a number of initial warmup steps that promotes more initial separation between classes when learning $z^c$. See Algorithm 1 for details.

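input: dataset, neural network $f_\theta$, number of training steps per epoch, update frequency, warmup steps
repeat
     for step = 1, ..., number of training steps per epoch do
         sample a large batch
         compute the network encodings of the batch
         compute the loss with the current label representations
         if step is past the warmup steps and step is a multiple of the update frequency then
              update $z^c$ for each class $c$ as per Eq. (1)
              recompute the loss with the updated label representations
         end if
         gradient descent on the loss to update $\theta$
     end for
until convergence
Algorithm 1 LwAL Training Algorithm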
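The alternating structure of Algorithm 1 can be illustrated on a toy problem. The sketch below is not the authors' implementation: it swaps the neural network for a linear map `W`, uses the full dataset instead of mini-batches, assumes a softmax over negative squared Euclidean distances, treats the label vectors as constants during the gradient step, and keeps the labels frozen at their initial estimate during warmup. All names (`train`, `update_freq`, `warmup_steps`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: three well-separated 2-D blobs (illustrative only).
centers = np.array([[3.0, 0.0], [-1.5, 2.6], [-1.5, -2.6]])
labels = np.repeat(np.arange(3), 100)
X = centers[labels] + 0.3 * rng.standard_normal((300, 2))

def class_means(E, y, k):
    """Label update: mean encoding of each class, as in Eq. (1)."""
    return np.stack([E[y == c].mean(axis=0) for c in range(k)])

def loss_and_grad(W, X, y, Z):
    """Mean CE over softmax(-squared distance to each label); Z is held fixed."""
    E = X @ W
    d2 = ((E[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 - (-d2).max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    P = np.exp(logp)
    loss = -logp[np.arange(len(y)), y].mean()
    grad_E = 2.0 * (P @ Z - Z[y]) / len(y)       # gradient w.r.t. the encodings
    return loss, X.T @ grad_E                    # chain rule through E = X @ W

def train(X, y, k=3, dim=8, steps=200, lr=0.05, update_freq=1, warmup_steps=0):
    W = 0.1 * rng.standard_normal((X.shape[1], dim))   # linear stand-in network
    Z = class_means(X @ W, y, k)                       # initial label estimate
    history = []
    for step in range(steps):
        if step >= warmup_steps and step % update_freq == 0:
            Z = class_means(X @ W, y, k)               # label update
        loss, grad_W = loss_and_grad(W, X, y, Z)
        W -= lr * grad_W                               # network update
        history.append(loss)
    return W, Z, history

W, Z, history = train(X, labels)
E = X @ W
pred = ((E[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
accuracy = (pred == labels).mean()
print(f"first loss {history[0]:.4f}, last loss {history[-1]:.4f}, accuracy {accuracy:.2f}")
```

Raising `update_freq` or `warmup_steps` holds the label vectors fixed for longer between updates, which is the lever the text describes for the moving-target problem.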

Additional Improvements

To further improve the label quality, we draw inspiration from the push-pull based losses in the metric learning literature (Xing et al., 2002; Weinberger and Saul, 2009; Schroff et al., 2015). We add an optional “push” loss that encourages our learned labels to be well separated, thus yielding better generalization accuracies. Specifically, we penalize the angle between the network encodings of datapoints from different classes, using cosine similarity (cf. Algorithm 1).
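One plausible reading of such a push term is the average cosine similarity over all pairs of encodings that come from different classes, which is small when the classes point in different directions. The sketch below implements that reading only; the exact form and weighting of the loss are not recoverable from the text, so the function is a hypothetical illustration.

```python
import numpy as np

def push_loss(encodings, labels):
    """Mean cosine similarity over all pairs of points from different classes."""
    norms = np.linalg.norm(encodings, axis=1, keepdims=True)
    unit = encodings / np.maximum(norms, 1e-12)     # guard against zero vectors
    cos = unit @ unit.T
    different = labels[:, None] != labels[None, :]
    return float(cos[different].mean()) if different.any() else 0.0

# Classes pointing in orthogonal directions: nothing left to push apart.
orthogonal = push_loss(np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]),
                       np.array([0, 0, 1]))
# Classes pointing in the same direction: maximal penalty.
aligned = push_loss(np.array([[1.0, 0.0], [3.0, 0.0]]), np.array([0, 1]))
print(orthogonal, aligned)
```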




We have a two-fold aim for our empirical study. First, we evaluate how LwAL fares (both in terms of speed and accuracy) when compared to other popular label-learning methodologies on benchmark datasets. Second, we evaluate the effectiveness of our learned labels for revealing semantically meaningful categorical relationships in our data. (An implementation of our algorithm is available at Adaptive-Labels.)

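| Dataset | Method | Time/epoch reduced: ResNet50 | Time/epoch reduced: EfficientNetB0 | Time/epoch reduced: DenseNet121 | Best test accuracy: ResNet50 | Best test accuracy: EfficientNetB0 | Best test accuracy: DenseNet121 |
|---|---|---|---|---|---|---|---|
| MNIST | One-hot (STD) | Reference | Reference | Reference | 99.1±0.1 | 99.4±0.0 | 99.1±0.1 |
| MNIST | StaticLabel | - | - | - | N/A | N/A | N/A |
| MNIST | LWR2k | - | 70% | 50% | 99.2±0.0 | 99.4±0.1 | 99.3±0.1 |
| MNIST | LWR3k | 50% | 50% | 40% | 99.1±0.1 | 99.5±0.1 | 99.2±0.1 |
| MNIST | LWR5k | 10% | 30% | 30% | 99.2±0.1 | 99.4±0.1 | 99.2±0.1 |
| MNIST | LabelEmbed | 50% | 10% | 60% | 99.2±0.1 | 99.4±0.0 | 99.4±0.0 |
| MNIST | LwAL (Ours) | 60% | 20% | - | 99.2±0.1 | 99.3±0.1 | 99.2±0.0 |
| MNIST | LwAL10 (Ours) | 60% | - | - | 99.3±0.1 | 99.3±0.0 | 99.1±0.0 |
| MNIST | LwAL10+rpl (Ours) | 50% | - | 70% | 99.3±0.1 | 99.3±0.0 | 99.4±0.0 |
| Fashion MNIST | One-hot (STD) | Reference | Reference | Reference | 92.3±0.2 | 93.1±0.2 | 92.4±0.3 |
| Fashion MNIST | StaticLabel | 30% | - | 20% | 92.8±0.1 | 93.0±0.1 | 92.6±0.2 |
| Fashion MNIST | LWR2k | - | 20% | - | 92.3±0.2 | 93.1±0.2 | 92.4±0.3 |
| Fashion MNIST | LWR3k | - | 50% | 20% | 92.1±0.0 | 93.3±0.3 | 92.2±0.4 |
| Fashion MNIST | LWR5k | - | 40% | 30% | 92.1±0.0 | 93.4±0.1 | 92.3±0.4 |
| Fashion MNIST | LabelEmbed | 40% | 10% | 60% | 92.7±0.4 | 93.1±0.2 | 92.9±0.1 |
| Fashion MNIST | LwAL (Ours) | 50% | - | 30% | 92.9±0.1 | 93.0±0.2 | 92.4±0.0 |
| Fashion MNIST | LwAL10 (Ours) | 50% | - | 40% | 92.3±0.0 | 92.7±0.2 | 92.6±0.2 |
| Fashion MNIST | LwAL10+rpl (Ours) | 30% | - | 60% | 92.7±0.2 | 92.8±0.2 | 93.0±0.2 |
| CIFAR10 | One-hot (STD) | Reference | Reference | Reference | 73.3±0.5 | 75.9±0.4 | 78.8±0.5 |
| CIFAR10 | StaticLabel | - | - | - | 74.0±0.7 | 75.7±0.5 | 77.7±0.3 |
| CIFAR10 | LWR2k | - | - | - | 67.8±1.1 | 74.7±0.3 | 74.1±0.7 |
| CIFAR10 | LWR3k | - | - | - | 69.3±0.8 | 75.3±0.1 | 75.6±0.9 |
| CIFAR10 | LWR5k | - | 30% | - | 69.9±1.1 | 76.3±0.4 | 76.9±0.7 |
| CIFAR10 | LabelEmbed | - | 30% | 40% | 72.2±0.9 | 76.7±0.3 | 79.4±0.4 |
| CIFAR10 | LwAL (Ours) | - | 20% | - | 72.8±0.2 | 76.7±0.4 | 78.9±0.0 |
| CIFAR10 | LwAL10 (Ours) | 30% | - | 30% | 74.3±0.3 | 76.2±0.2 | 79.2±0.4 |
| CIFAR10 | LwAL10+rpl (Ours) | 60% | 50% | 50% | 76.0±0.4 | 77.9±0.5 | 80.5±0.3 |
| CIFAR100 | One-hot (STD) | Reference | Reference | Reference | 37.4±0.6 | 40.5±0.5 | 44.6±0.8 |
| CIFAR100 | StaticLabel | - | - | - | 16.8±1.1 | 5.9±0.6 | 7.8±0.3 |
| CIFAR100 | LWR2k | - | - | - | 32.9±0.5 | 38.1±0.5 | 38.7±0.4 |
| CIFAR100 | LWR3k | - | - | - | 32.7±0.1 | 38.1±0.4 | 38.6±0.6 |
| CIFAR100 | LWR5k | - | - | - | 32.9±0.4 | 38.1±0.7 | 38.6±0.6 |
| CIFAR100 | LabelEmbed | 10% | 20% | - | 37.7±0.8 | 41.0±0.5 | 44.6±0.7 |
| CIFAR100 | LwAL (Ours) | 70% | 65% | 60% | 38.8±0.4 | 43.2±0.2 | 46.8±0.3 |
| CIFAR100 | LwAL10 (Ours) | 70% | 50% | 65% | 39.3±0.2 | 41.6±0.6 | 47.5±0.4 |
| CIFAR100 | LwAL10+rpl (Ours) | 70% | 60% | 60% | 39.9±0.4 | 42.2±0.5 | 48.0±0.0 |
| FOOD101 | One-hot (STD) | Reference | Reference | Reference | 16.3±0.3 | 18.5±0.5 | 20.6±0.0 |
| FOOD101 | StaticLabel | - | - | - | 2.6±0.5 | 2.0±0.8 | 6.4±0.5 |
| FOOD101 | LWR2k | - | - | - | 13.8±0.1 | 18.0±0.3 | 17.9±0.2 |
| FOOD101 | LWR3k | - | - | - | 13.9±0.1 | 18.2±0.3 | 18.0±0.3 |
| FOOD101 | LWR5k | - | - | - | 13.9±0.1 | 18.5±0.4 | 17.8±0.1 |
| FOOD101 | LabelEmbed | 10% | 65% | 35% | 15.8±0.1 | 19.8±0.3 | 21.6±0.5 |
| FOOD101 | LwAL (Ours) | 75% | 80% | 70% | 16.6±0.3 | 22.0±0.6 | 21.1±0.1 |
| FOOD101 | LwAL10 (Ours) | 80% | 80% | 75% | 17.5±0.3 | 20.5±0.5 | 22.5±0.1 |
| FOOD101 | LwAL10+rpl (Ours) | 80% | 80% | 70% | 17.7±0.1 | 20.9±0.2 | 22.9±0.1 |

Table 1: Learning accuracy and speed comparison between LwAL and other baselines. Accuracies are reported as mean ± spread. LwAL is trained using 0 warmup steps and an update frequency of once per step. A dash (-) indicates cases where the specific algorithm+backbone pair was unable to reach the reference STD test accuracy. Star (*) indicates the use of a different learning rate due to failure of convergence. N/A for the MNIST dataset using StaticLabel indicates that the BERT representation of MNIST categories is not appropriate.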
(a) Using ResNet50 backbone
(b) Using EfficientNetB0 backbone
(c) Using DenseNet121 backbone
Figure 2: Test Curves for LwAL and other baseline algorithms trained on CIFAR100 dataset. STD best accuracy is used as a reference for other algorithms.

Learning Speed and Test Performance


Datasets

To evaluate the robustness of our technique, we report results on several benchmark datasets with different sizes, number of categories and application domains. In particular we used the following datasets for our experiments.

| Dataset | Domain | # classes | # points |
|---|---|---|---|
| MNIST | Vision | 10 | 60k |
| Fashion MNIST | Vision | 10 | 60k |
| CIFAR10 | Vision | 10 | 50k |
| CIFAR100 | Vision | 100 | 50k |
| FOOD101 | Vision | 101 | 25k |
| IMDB Reviews | NLP | 2 | 25k |
| YELP Reviews | NLP | 2 | 560k |

We use the default train/test splits provided by the TensorFlow library as of Aug 2022.

Network Architectures

To check if LwAL works across different architectures, we test on ResNet50 (He et al., 2016), EfficientNetB0 (Tan and Le, 2019), and DenseNet121 (Huang et al., 2017) with ImageNet weights, for vision datasets. All of these architectures are available in the TensorFlow library as of Aug 2022. For text datasets, we use BERT (Devlin et al., 2019), which is available in the Hugging Face library as of Aug 2022.


Baselines

We compare LwAL with several important baselines. We compare with the standard one-hot training procedure (STD). Chen et al. (2021) employ a static pretrained (BERT or audio spectrogram) label representation (StaticLabel). For our comparisons, we chose the pretrained BERT embedding as it was reported to show good performance on the benchmark datasets. From the label smoothing techniques, we use LWR (Deng and Zhang, 2021) with varying choices of the update-frequency hyperparameter. We also compare with the network augmentation (LabelEmbed) technique by Sun et al. (2017).


Hyperparameters

In order for the backbones (ResNet50, EfficientNetB0, DenseNet121) to be used across different datasets, we attach a single dense layer with regularization at the top to be used as the classification head.

We train all algorithms with the same set of parameters for consistency. We first pick a learning rate within the same backbone so that all algorithms can converge: for ResNet50 and DenseNet121, we use the ADAM optimizer; for EfficientNetB0, we use the same optimizer but with a different learning rate. For small datasets such as MNIST, F.MNIST, and CIFAR10, we train all algorithms over 10 epochs. For large datasets such as CIFAR100 and FOOD101, we train all algorithms over 20 epochs, by which point the test accuracy reaches a plateau and the model starts to overfit. We repeat all runs with multiple seeds and report the mean and spread.

For LWR, we use the recommended temperature value. Since we are only training for a few epochs, we also experiment with varying values for the update frequency and report all results in Table 1. For LabelEmbed, we use the default parameter settings of the implementation (Sun et al., 2017).

For LwAL, we can vary the output label dimension. We compare the results for an output dimension of 10 times the number of classes (LwAL10). (We empirically found that increasing the output dimension often leads to improved performance, as discussed by Chen et al. (2020); 10 times the number of classes usually leads to the best performance.) We also compare the results with the addition of the optional push loss (LwAL10+rpl; cf. Algorithm 1). We use an update frequency of once per step and no warmup steps, as we use large batch sizes.

Results and Observations

Tables 1 (main text) and 3 (Appendix) summarize our results for the Vision and NLP datasets respectively. Best results are highlighted in bold. Blank (–) in the Time column indicates that a particular algorithm+backbone combination was not able to achieve the STD one-hot baseline test accuracy.

Observe that LwAL significantly cuts down on the overall training time (usually by more than 50% and sometimes up to 80%) while often achieving better test accuracies than other baselines. Figure 2 depicts how the test accuracy curve improves as the training proceeds for a typical run using various backbones. It clearly highlights that one can achieve the same test accuracy as STD with a 70% reduction in training time. This phenomenon is typical for various benchmark datasets and choices of backbone (cf. Table 1). One can conclude that LwAL10+rpl with the DenseNet121 backbone seems to give the best results, with significant savings overall. Curiously, StaticLabel and LWR are not able to achieve STD one-hot label test accuracies for large multi-class datasets like CIFAR100 and FOOD101.

Semantic Label Representation

Here we want to empirically evaluate the effectiveness of our learned labels in discovering semantic relationships among categories. For this, we shall use the semantic hierarchy induced by WordNet Miller (1995) as the gold standard relationship among the categories, and compare how well our learned labels reveal those relationships.

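To this end, we use Kendall’s Tau-b ($\tau_b$) correlation coefficient to compare the learned representations with the WordNet hierarchy. Specifically, we first compute the pairwise distances between distinct class labels for (i) the reference WordNet hierarchy tree (using the shortest-path distances between the tree nodes), denoted $D^{\mathrm{ref}}$, and (ii) the learned vectors from the label learning algorithm, denoted $D^{\mathrm{learned}}$. Next, treating the collected distance vectors $D^{\mathrm{ref}}_c$ and $D^{\mathrm{learned}}_c$ (where $D_c \in \mathbb{R}^{k-1}$ holds the distances from class $c$ to the other classes) for each of the $k$ classes as rank vectors, we can compute the average semantic correlation score as:

$$\frac{1}{k} \sum_{c=1}^{k} \tau_b\left(D^{\mathrm{ref}}_c,\, D^{\mathrm{learned}}_c\right). \tag{4}$$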
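A small sketch of this score, assuming two square distance matrices over the same classes. The Tau-b routine is written out directly (sign-concordance numerator with the tie-corrected denominator) so the example depends only on NumPy; the function names and toy matrices are hypothetical.

```python
import numpy as np

def kendall_tau_b(x, y):
    """Kendall's Tau-b between two 1-D sequences, with tie correction."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    i, j = np.triu_indices(len(x), k=1)
    sx, sy = np.sign(x[i] - x[j]), np.sign(y[i] - y[j])
    denom = np.sqrt((sx ** 2).sum() * (sy ** 2).sum())
    return float((sx * sy).sum() / denom) if denom > 0 else 0.0

def semantic_correlation(ref_dist, learned_dist):
    """Average Tau-b between matching rows of two k x k distance matrices."""
    k = ref_dist.shape[0]
    scores = []
    for c in range(k):
        others = np.arange(k) != c        # distances to the other classes only
        scores.append(kendall_tau_b(ref_dist[c, others], learned_dist[c, others]))
    return float(np.mean(scores))

ref = np.array([[0, 1, 3], [1, 0, 2], [3, 2, 0]], dtype=float)  # toy tree distances
same_order = 10.0 * ref                   # same ranking at a different scale
flipped = ref.max() - ref                 # reversed ranking
np.fill_diagonal(flipped, 0.0)

print(semantic_correlation(ref, same_order))   # perfectly correlated
print(semantic_correlation(ref, flipped))      # perfectly anti-correlated
```

Because the score depends only on ranks, rescaling the learned distances leaves it unchanged, which makes it suitable for comparing tree distances with distances in a learned vector space.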



Datasets

We report results on datasets for which the classes can be easily mapped to the WordNet (Miller, 1995) hierarchy. This includes the existing Fashion MNIST (8 out of 10 classes can be mapped) and CIFAR10 (10 out of 10 classes can be mapped). We also include the results for the Animals with Attributes 2 (AwA2) dataset (Xian et al., 2019), where 23 out of 50 classes can be mapped. We learn and evaluate the quality of the label representations of only the mappable classes for each of these datasets.

Architectures, Hyperparameters, and Baselines

We use ResNet50 (with ImageNet weights) as the underlying neural network backbone for our experiments. We compare the results of our LwAL algorithm with other label learning techniques: LWR (best across update frequencies) and LabelEmbed. For LWR, the explicit label representation is computed via Eq. (1). For LabelEmbed, since it returns a similarity matrix between the learned labels, we compute the vectorial representation in the standard way, via eigendecomposition.
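For readers unfamiliar with recovering vectors from a similarity matrix, the following sketch shows the usual eigendecomposition route: factor the symmetric matrix and scale the eigenvectors by the square roots of the (non-negative) eigenvalues. The function name and the toy similarity matrix are hypothetical, and this is not LabelEmbed's own code.

```python
import numpy as np

def vectors_from_similarity(S):
    """Rows V such that V @ V.T approximates the symmetric matrix S."""
    S = 0.5 * (S + S.T)                       # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals = np.clip(eigvals, 0.0, None)     # discard negative (non-PSD) parts
    return eigvecs * np.sqrt(eigvals)         # scale each eigenvector column

# Toy similarity matrix built from known 2-D label vectors.
labels_true = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
S = labels_true @ labels_true.T
V = vectors_from_similarity(S)

print(np.round(V @ V.T, 3))                   # reproduces S
```

The recovered vectors match the originals only up to a rotation or reflection, which leaves all pairwise distances unchanged and therefore does not affect a distance-based comparison.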

The rest of the hyperparameter settings (including random seed, batch size, etc.) are the same as in the previous section.

| Dataset | LWR | LabelEmbed | LwAL (Ours) | LwAL10 (Ours) | LwAL10+rpl (Ours) |
|---|---|---|---|---|---|
| CIFAR10 | -0.017±0.068 | 0.053±0.058 | 0.473±0.028 | 0.544±0.024 | 0.609±0.019 |
| F.MNIST | 0.019±0.068 | 0.079±0.172 | 0.306±0.056 | 0.494±0.054 | 0.305±0.039 |
| AwA2 | -0.097±0.074 | 0.088±0.078 | 0.299±0.021 | 0.288±0.024 | 0.260±0.030 |

Table 2: Structure correlation score (Eq. 4) between learned labels and WordNet. LWR and LabelEmbed are the other label learning algorithms; the LwAL variants are ours. Bold indicates best performance.
Figure 3: Hierarchical visualization (via average-linkage) of the learned labels for different algorithms on CIFAR10 dataset.


We present the correlation score for each of the label learning techniques in Table 2, and an example visualization of the learned hierarchy in Figure 3.

Observe that LwAL and its variants consistently generate significantly more semantically meaningful representations than other label learning methods. While these results are compelling, it is worth noting that the learned labels, and thus the semantic hierarchy, are derived from the data inputs. LwAL can thus only extract those relationships that are present in the input data representation and likely cannot capture every fine-grained semantic detail between classes. Indeed, if the input representation (for example pixels for image classification tasks) does not contain any information about the semantic relationships, then one cannot expect LwAL to capture any useful relationship.

Conclusion and Future Work

In this work we present a simple yet powerful Learning with Adaptive Labels (LwAL) algorithm that can learn semantically meaningful label representations that the vanilla one-hot encoding is unable to capture. Interestingly, we find that by allowing the network to flexibly learn a label representation during training, we can significantly cut down on the overall training time while achieving high test accuracies. Extensive experiments on multiple datasets with varying dataset sizes, application domains, and network architectures show that our learning algorithm is effective and robust.

As noted, although LwAL can learn high-level semantically meaningful label representations extracted from inputs, it is interesting to explore to what degree this is possible. Can fine-grained semantic relationships be derived just from the raw input space? Or does one need to incorporate additional “side-information” to accelerate semantic discovery? We leave this as a topic for future research.


  • A. Bellet, A. Habrard, and M. Sebban (2013) A survey on metric learning for feature vectors and structured data. CoRR abs/1306.6709. External Links: Link, 1306.6709 Cited by: Connection to Metric Learning.
  • L. Bossard, M. Guillaumin, and L. Van Gool (2014)

    Food-101 – mining discriminative components with random forests

    In European Conference on Computer Vision (ECCV), Cited by: Datasets.
  • B. Chen, Y. Li, S. Raghupathi, and H. Lipson (2021) Beyond categorical label representations for image classification. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: Introduction, Static Label Representations:, Baselines.
  • X. Chen, J. Xu, and A. Wang (2020)

    Label representations in modeling classification as text generation

    Ashia-Pacific Chapter of the Association for Computational Linguistics (ACL), Student Research Workshop, pp. 160–164. External Links: Link Cited by: footnote 3.
  • X. Deng and Z. Zhang (2021) Learning with retrospection.

    Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)

    35 (8), pp. 7201–7209.
    External Links: Link Cited by: Introduction, Representations by label smoothing:, Adapting to large-scale datasets, Baselines, Effects of Warmup Steps on LwAL.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. Association for Computational Linguistics (ACL), pp. 4171–4186. External Links: Link, Document Cited by: Introduction, Network Architectures, Experiments with Text Dataset.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In

    2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    Vol. , pp. 770–778. External Links: Document Cited by: Network Architectures.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. CoRR abs/1503.02531. External Links: Document, Link Cited by: Other Notable Related Techniques.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 2261–2269. External Links: Document Cited by: Network Architectures.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report Cited by: Datasets.
  • B. Kulis (2012) Metric learning: a survey.

    Foundations and Trends in Machine Learning

    5 (4), pp. 287–364.
    Cited by: Connection to Metric Learning.
  • Y. LeCun, C. Cortes, and C. Burges (2010) MNIST handwritten digit database. ATT Labs [Online]. Available: 2. Cited by: Datasets.
  • A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts (2011)

    Learning word vectors for sentiment analysis

    Association for Computational Linguistics (ACL), pp. 142–150. External Links: Link Cited by: Datasets, Experiments with Text Dataset.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (NIPS) 26. External Links: Link Cited by: Introduction.
  • G. A. Miller (1995) WordNet: a lexical database for english. Commun. ACM 38 (11), pp. 39–41. External Links: ISSN 0001-0782, Link, Document Cited by: Datasets, Semantic Label Representation.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller (2013)

    Playing atari with deep reinforcement learning

    CoRR abs/1312.5602. External Links: Link, 1312.5602 Cited by: Adapting to large-scale datasets.
  • G. Pereyra, G. Tucker, J. Chorowski, L. Kaiser, and G. Hinton (2017) Regularizing neural networks by penalizing confident output distributions. CoRR abs/1701.06548. External Links: Link Cited by: Representations by label smoothing:.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823. External Links: Document Cited by: Additional Improvements.
  • X. Sun, B. Wei, X. Ren, and S. Ma (2017) Label embedding network: learning label representation for soft training of deep networks. CoRR abs/1710.10393. Cited by: Introduction, Representations by network augmentation:, Baselines, Hyperparameters.
  • M. Tan and Q. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. Proceedings of the 36th International Conference on Machine Learning (ICML) 97, pp. 6105–6114. External Links: Link Cited by: Network Architectures.
  • N. Verma, D. Mahajan, S. Sellamanickam, and V. Nair (2012) Learning hierarchical similarity metrics. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2280–2287. External Links: Document Cited by: Connection to Metric Learning.
  • K. Q. Weinberger and L. K. Saul (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (9), pp. 207–244. External Links: Link Cited by: Additional Improvements.
  • Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata (2019) Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (9), pp. 2251–2265. External Links: Document Cited by: Datasets.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. CoRR abs/1708.07747. External Links: Link, 1708.07747 Cited by: Datasets.
  • J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. Proceedings of The 33rd International Conference on Machine Learning (ICML) 48, pp. 478–487. External Links: Link Cited by: Other Notable Related Techniques.
  • E. Xing, A. Ng, M. Jordan, and S. Russell (2002) Distance metric learning with application to clustering with side-information. Neural Information Processing Systems (NIPS), pp. 505–512. Cited by: Additional Improvements.
  • L. Yuan, F. E. H. Tay, G. Li, T. Wang, and J. Feng (2019) Revisiting knowledge distillation via label smoothing regularization. arXiv. External Links: Document, Link Cited by: Representations by label smoothing:.
  • X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems (NIPS) 28. External Links: Link Cited by: Datasets, Experiments with Text Dataset.


Algorithm            Percent Time/Epoch Reduced  Best Test Accuracy  Avg. AUAC

One-hot (STD)        Reference                   83.9 ± 0.1          82.9 ± 0.2
StaticLabel          20%                         84.1 ± 0.1          82.1 ± 0.1
LWR2k                -                           82.8 ± 0.2          79.8 ± 0.4
LWR3k                -                           83.2 ± 0.2          80.2 ± 0.2
LWR5k                -                           83.7 ± 0.0          82.0 ± 0.2
LabelEmbed           -                           82.7 ± 0.1          81.5 ± 0.1
LwAL (Ours)          30%                         84.3 ± 0.1          83.5 ± 0.2
LwAL10 (Ours)        -                           83.8 ± 0.1          83.3 ± 0.1
LwAL10+rpl (Ours)    60%                         84.2 ± 0.1          83.6 ± 0.0

One-hot (STD)        -                           88.5 ± 0.1          88.0 ± 0.0
StaticLabel          60%                         88.6 ± 0.0          88.3 ± 0.0
LWR2k                60%                         88.7 ± 0.2          88.5 ± 0.1
LWR3k                70%                         88.7 ± 0.2          88.5 ± 0.1
LWR5k                70%                         88.7 ± 0.1          88.5 ± 0.1
LabelEmbed           -                           88.3 ± 0.1          87.8 ± 0.2
LwAL (Ours)          -                           87.9 ± 0.1          87.5 ± 0.0
LwAL10 (Ours)        -                           87.6 ± 0.2          87.2 ± 0.1
LwAL10+rpl (Ours)    -                           88.2 ± 0.1          87.9 ± 0.1
Table 3: All algorithms are trained with the same learning rate (1e-4) over 10 epochs. LwAL used 0 warmup steps and an update frequency of once per step. A dash (-) indicates that the algorithm was unable to reach the reference STD test accuracy.
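The "Percent Time/Epoch Reduced" column can be read as the share of epochs saved in reaching the reference (STD) test accuracy. The exact definition is given in the main text, not in this appendix, so the sketch below is only one plausible reading: it assumes per-epoch test accuracy curves, takes the best accuracy of the reference curve as the target, and returns None (a dash in the table) when the target is never reached. The function names and the example curves are hypothetical.

```python
from typing import Optional, Sequence


def epochs_to_reach(acc_per_epoch: Sequence[float], target: float) -> Optional[int]:
    """Return the 1-based epoch at which accuracy first reaches `target`, or None."""
    for epoch, acc in enumerate(acc_per_epoch, start=1):
        if acc >= target:
            return epoch
    return None


def percent_epochs_reduced(ref_curve: Sequence[float],
                           algo_curve: Sequence[float]) -> Optional[float]:
    """Percent fewer epochs `algo_curve` needs to match the best accuracy of `ref_curve`.

    Assumed definition, not taken from the paper. Returns None when the
    reference accuracy is never reached (shown as a dash in the table).
    """
    target = max(ref_curve)
    ref_epoch = epochs_to_reach(ref_curve, target)
    algo_epoch = epochs_to_reach(algo_curve, target)
    if algo_epoch is None:
        return None
    return 100.0 * (ref_epoch - algo_epoch) / ref_epoch


# Hypothetical 10-epoch test accuracy curves, for illustration only.
reference = [70.0, 75.0, 78.0, 80.0, 81.0, 82.0, 83.0, 83.5, 83.8, 83.9]
candidate = [74.0, 79.0, 81.0, 82.5, 83.2, 83.6, 84.0, 84.1, 84.2, 84.3]
print(percent_epochs_reduced(reference, candidate))  # 30.0
```

With 10 training epochs this reading yields multiples of 10%, which is consistent with the values reported in the table.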

Additional Results on Learning Speed and Test Performance

In addition to the “Percent Time/Epoch Reduced” and “Best Test Accuracy” metrics in Table 1, we report the average area under the accuracy curve (AUAC) in Table 5. AUAC is another useful metric for comparing learning speed between algorithms, since a larger area under the test accuracy curve indicates faster learning.
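To make the AUAC idea concrete, here is a minimal sketch that integrates a per-epoch test accuracy curve with the trapezoidal rule. The normalization by the number of epoch intervals is an assumption (it keeps the value on the same 0 to 100 scale as accuracy, which matches the magnitudes in Table 5); the paper's exact computation is not stated in this section, and the example curves are hypothetical.

```python
from typing import Sequence


def average_auac(acc_per_epoch: Sequence[float]) -> float:
    """Trapezoidal area under the accuracy-vs-epoch curve, divided by the epoch span.

    Assumes evaluations are equally spaced (one per epoch).
    """
    if len(acc_per_epoch) < 2:
        raise ValueError("need at least two evaluation points")
    area = 0.0
    for prev, curr in zip(acc_per_epoch, acc_per_epoch[1:]):
        area += 0.5 * (prev + curr)  # unit spacing between consecutive epochs
    return area / (len(acc_per_epoch) - 1)


# Two hypothetical learners that end at the same accuracy.
fast_learner = [60.0, 80.0, 90.0, 90.0]
slow_learner = [30.0, 60.0, 80.0, 90.0]
print(average_auac(fast_learner) > average_auac(slow_learner))  # True
```

Both curves reach the same final accuracy, yet the one that climbs earlier has the larger area, which is why the metric serves as a proxy for learning speed.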

Experiments with Text Dataset

We also evaluate learning speed and test performance on text datasets, namely IMDB reviews Maas et al. (2011) and Yelp Polarity Reviews Zhang et al. (2015). Specifically, we first use BERT Devlin et al. (2019) to extract a 768-dimensional representation of each text, and then use two Dense layers for prediction (the first outputs 768 dimensions, and the second outputs one value per class). For StaticLabel, we use the BERT encodings of the word “negative” for class 0 and “positive” for class 1. We train all algorithms for 10 epochs using ADAM with a learning rate of 1e-4; the remaining training hyperparameters are the same as those discussed in the main text. The results are presented in Table 3.
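The prediction head described above can be sketched as a plain forward pass. This is an illustration under stated assumptions, not the authors' code: the ReLU between the two layers, the uniform weight initialization, and the final softmax (which corresponds to a one-hot baseline) are all assumptions, and a random vector stands in for the 768-dimensional BERT feature.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Create (weights, bias) for a dense layer; weights has one row per output unit."""
    scale = 1.0 / math.sqrt(n_in)  # initialization scheme is an assumption
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, bias):
    """Affine map: one dot product plus bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, hidden_layer, output_layer):
    """Dense(768) followed by Dense(num_classes)."""
    hidden = [max(0.0, h) for h in dense(features, *hidden_layer)]  # ReLU is an assumption
    return softmax(dense(hidden, *output_layer))


rng = random.Random(0)
FEATURE_DIM, NUM_CLASSES = 768, 2  # BERT feature size; negative/positive classes
hidden_layer = init_layer(FEATURE_DIM, FEATURE_DIM, rng)
output_layer = init_layer(FEATURE_DIM, NUM_CLASSES, rng)
fake_features = [rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]  # stand-in for a BERT encoding
probs = predict(fake_features, hidden_layer, output_layer)
```

In practice the same head would be a pair of framework-provided dense layers trained with ADAM; the pure-Python version only shows the shapes involved.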

Effects of Warmup Steps on LwAL

As discussed by Deng and Zhang (2021), we experiment with running LwAL with some initial warmup steps to see whether this provides a better initial label separation and hence better test performance. We run this experiment with the EfficientNetB0 backbone and report the results in Table 4. We find that using a few warmup steps can sometimes improve the test accuracy slightly (by up to 0.4 percentage points in Table 4). However, since this gain is not consistent, we only present results using 0 warmup steps in the main paper. In practice, the number of warmup steps is a tunable hyperparameter that can be used to further improve performance.
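The interaction between warmup steps and label update frequency can be sketched as a simple gating function. This is a hypothetical illustration: the function name, the 0-based step convention, and the assumption that labels simply stay at their initial values during warmup are not taken from the paper.

```python
def should_update_labels(step: int, warmup_steps: int = 0, update_every: int = 1) -> bool:
    """Return True when label representations should be refreshed at this 0-based step."""
    if update_every < 1:
        raise ValueError("update_every must be >= 1")
    if step < warmup_steps:
        return False  # still warming up: labels keep their initial values (assumption)
    return (step - warmup_steps) % update_every == 0


# Default setting reported in the tables: 0 warmup steps, update once per step.
updates_default = [s for s in range(6) if should_update_labels(s)]

# A hypothetical alternative: 4 warmup steps, then an update every 2 steps.
updates_warm = [s for s in range(10) if should_update_labels(s, warmup_steps=4, update_every=2)]
```

With the defaults every step triggers an update; with the alternative setting the first update happens at step 4 and then at every second step.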

Dataset         Algorithm     Best Test Accuracy (one column per warmup setting)
MNIST           LwAL          99.3 ± 0.1    99.3 ± 0.0    99.3 ± 0.1
                LwAL10        99.3 ± 0.0    99.3 ± 0.1    99.3 ± 0.0
                LwAL10+rpl    99.3 ± 0.0    99.4 ± 0.0    99.3 ± 0.1
Fashion MNIST   LwAL          93.0 ± 0.2    93.0 ± 0.1    93.2 ± 0.0
                LwAL10        92.7 ± 0.2    92.7 ± 0.1    92.8 ± 0.1
                LwAL10+rpl    92.8 ± 0.2    93.1 ± 0.1    93.0 ± 0.2
CIFAR10         LwAL          76.7 ± 0.4    76.8 ± 0.2    76.9 ± 0.5
                LwAL10        76.2 ± 0.2    76.1 ± 0.8    76.2 ± 0.4
                LwAL10+rpl    77.9 ± 0.5    78.3 ± 0.2    78.2 ± 0.1
CIFAR100        LwAL          43.2 ± 0.2    42.5 ± 0.1    42.5 ± 0.2
                LwAL10        41.6 ± 0.6    41.7 ± 0.4    41.8 ± 0.7
                LwAL10+rpl    42.2 ± 0.5    42.6 ± 0.2    42.3 ± 0.4
FOOD101         LwAL          22.0 ± 0.6    22.1 ± 0.1    22.0 ± 0.1
                LwAL10        20.5 ± 0.5    20.4 ± 0.1    20.3 ± 0.1
                LwAL10+rpl    20.9 ± 0.2    20.9 ± 0.1    20.8 ± 0.4
Table 4: LwAL warmup steps experiment with EfficientNetB0 backbone.
                     Avg. AUAC                                       Best Test Accuracy
Algorithm            ResNet50        EfficientNetB0  DenseNet121     ResNet50        EfficientNetB0  DenseNet121

MNIST

One-hot (STD)        99.0 ± 0.1      98.3 ± 0.1      99.0 ± 0.1      99.1 ± 0.1      99.4 ± 0.0      99.1 ± 0.1
StaticLabel          N/A             N/A             N/A             N/A             N/A             N/A
LWR2k                98.7 ± 0.1      99.3 ± 0.1      99.0 ± 0.1      99.2 ± 0.0      99.4 ± 0.1      99.3 ± 0.1
LWR3k                98.8 ± 0.1      99.2 ± 0.0      99.0 ± 0.1      99.1 ± 0.1      99.5 ± 0.1      99.2 ± 0.1
LWR5k                98.9 ± 0.0      99.1 ± 0.1      99.0 ± 0.1      99.2 ± 0.1      99.4 ± 0.1      99.2 ± 0.1
LabelEmbed           99.0 ± 0.1      99.2 ± 0.0      99.1 ± 0.1      99.2 ± 0.1      99.4 ± 0.0      99.4 ± 0.0
LwAL (Ours)          98.9 ± 0.0      98.3 ± 0.1      98.9 ± 0.0      99.2 ± 0.1      99.3 ± 0.1      99.2 ± 0.0
LwAL10 (Ours)        98.9 ± 0.0      98.4 ± 0.1      98.9 ± 0.0      99.3 ± 0.1      99.3 ± 0.0      99.1 ± 0.0
LwAL10+rpl (Ours)    99.1 ± 0.0      98.4 ± 0.0      99.1 ± 0.0      99.3 ± 0.1      99.3 ± 0.0      99.4 ± 0.0

Fashion MNIST

One-hot (STD)        91.7 ± 0.0      92.1 ± 0.1      91.8 ± 0.2      92.3 ± 0.2      93.1 ± 0.2      92.4 ± 0.3
StaticLabel          91.1 ± 0.2      91.5 ± 0.1      84.2 ± 0.1      92.8 ± 0.1      93.0 ± 0.1      92.6 ± 0.2
LWR2k                91.3 ± 0.3      92.3 ± 0.3      91.7 ± 0.3      92.1 ± 0.0      93.3 ± 0.3      92.2 ± 0.4
LWR3k                91.3 ± 0.4      92.4 ± 0.1      91.7 ± 0.3      92.1 ± 0.0      93.4 ± 0.1      92.3 ± 0.4
LWR5k                91.5 ± 0.2      92.1 ± 0.1      91.7 ± 0.2      92.3 ± 0.0      93.4 ± 0.1      92.5 ± 0.2
LabelEmbed           92.0 ± 0.3      92.1 ± 0.2      91.7 ± 0.3      92.7 ± 0.4      93.1 ± 0.2      92.9 ± 0.1
LwAL (Ours)          91.8 ± 0.1      92.0 ± 0.1      91.7 ± 0.1      92.9 ± 0.1      93.0 ± 0.2      92.4 ± 0.0
LwAL10 (Ours)        91.6 ± 0.0      91.6 ± 0.1      91.9 ± 0.1      92.3 ± 0.0      92.7 ± 0.2      92.6 ± 0.2
LwAL10+rpl (Ours)    91.7 ± 0.1      91.5 ± 0.1      92.1 ± 0.1      92.7 ± 0.2      92.8 ± 0.2      93.0 ± 0.2

CIFAR10

One-hot (STD)        70.8 ± 0.4      72.8 ± 0.2      76.8 ± 0.3      73.3 ± 0.5      75.9 ± 0.4      78.8 ± 0.5
StaticLabel          52.0 ± 1.5      67.5 ± 1.1      49.1 ± 0.6      74.0 ± 0.7      75.7 ± 0.5      77.7 ± 0.3
LWR2k                64.1 ± 2.5      71.9 ± 0.1      72.1 ± 1.7      67.8 ± 1.1      74.7 ± 0.3      74.1 ± 0.7
LWR3k                65.1 ± 1.7      72.3 ± 0.2      73.1 ± 1.2      69.3 ± 0.8      75.3 ± 0.1      75.6 ± 0.9
LWR5k                67.8 ± 1.0      72.7 ± 0.3      75.0 ± 0.9      69.9 ± 1.1      76.3 ± 0.4      76.9 ± 0.7
LabelEmbed           68.3 ± 0.8      72.4 ± 0.1      76.2 ± 0.5      72.2 ± 0.9      76.7 ± 0.3      79.4 ± 0.4
LwAL (Ours)          70.8 ± 0.2      73.3 ± 0.4      76.7 ± 0.2      72.0 ± 0.5      76.7 ± 0.4      78.9 ± 0.0
LwAL10 (Ours)        72.4 ± 0.1      72.5 ± 0.1      77.5 ± 0.6      73.9 ± 0.0      76.2 ± 0.2      79.2 ± 0.4
LwAL10+rpl (Ours)    73.4 ± 0.1      74.4 ± 0.4      78.0 ± 0.5      76.0 ± 0.4      77.9 ± 0.5      80.5 ± 0.3

CIFAR100

One-hot (STD)        30.3 ± 0.1      35.4 ± 0.4      35.6 ± 0.9      37.4 ± 0.6      40.5 ± 0.5      44.6 ± 0.8
StaticLabel          6.1 ± 0.2       2.6 ± 0.4       2.8 ± 0.0       16.8 ± 1.1      5.9 ± 0.6       7.8 ± 0.3
LWR2k                27.1 ± 0.4      33.6 ± 0.7      32.0 ± 0.5      32.9 ± 0.5      38.1 ± 0.5      38.7 ± 0.4
LWR3k                27.1 ± 0.2      33.8 ± 0.4      32.0 ± 0.6      32.7 ± 0.1      38.1 ± 0.4      38.6 ± 0.6
LWR5k                27.4 ± 0.1      33.9 ± 0.6      32.2 ± 0.5      32.9 ± 0.4      38.1 ± 0.7      38.6 ± 0.6
LabelEmbed           25.9 ± 0.5      34.6 ± 0.5      32.0 ± 1.4      37.7 ± 0.8      41.0 ± 0.5      44.6 ± 0.7
LwAL (Ours)          36.5 ± 0.4      38.6 ± 0.3      42.6 ± 0.1      38.8 ± 0.4      43.2 ± 0.2      46.8 ± 0.3
LwAL10 (Ours)        37.3 ± 0.2      37.9 ± 0.5      44.1 ± 0.3      39.3 ± 0.2      41.6 ± 0.6      47.5 ± 0.4
LwAL10+rpl (Ours)    37.5 ± 0.2      38.4 ± 0.4      43.7 ± 0.1      39.9 ± 0.4      42.2 ± 0.5      48.0 ± 0.0

FOOD101

One-hot (STD)        12.7 ± 0.2      16.3 ± 0.4      16.7 ± 0.1      16.3 ± 0.3      18.5 ± 0.5      20.6 ± 0.0
StaticLabel          1.3 ± 0.1       1.3 ± 0.3       2.2 ± 0.1       2.6 ± 0.5       2.0 ± 0.8       6.4 ± 0.5
LWR2k                10.7 ± 0.1      16.2 ± 0.3      14.8 ± 0.2      13.8 ± 0.1      18.0 ± 0.3      17.9 ± 0.2
LWR3k                10.8 ± 0.1      16.2 ± 0.3      14.8 ± 0.2      13.9 ± 0.1      18.2 ± 0.3      18.0 ± 0.3
LWR5k                10.9 ± 0.0      16.5 ± 0.4      14.9 ± 0.2      13.9 ± 0.1      18.5 ± 0.4      17.8 ± 0.1
LabelEmbed           9.7 ± 0.2       16.7 ± 0.2      15.6 ± 0.3      15.8 ± 0.1      19.8 ± 0.3      21.6 ± 0.5
LwAL (Ours)          14.5 ± 0.1      19.6 ± 0.4      18.1 ± 1.0      16.6 ± 0.3      22.0 ± 0.6      21.1 ± 0.1
LwAL10 (Ours)        15.8 ± 0.1      18.7 ± 0.4      19.7 ± 0.2      17.5 ± 0.3      20.5 ± 0.5      22.5 ± 0.1
LwAL10+rpl (Ours)    16.0 ± 0.2      19.0 ± 0.2      20.1 ± 1.0      17.7 ± 0.1      20.9 ± 0.2      22.9 ± 0.1
Table 5: Learning accuracy and speed comparison between LwAL and other baselines. LwAL is trained using 0 warmup steps and an update frequency of once per step. A star (*) indicates the use of a different learning rate ( ) due to failure to converge. N/A for the MNIST dataset using StaticLabel indicates that the BERT representation of MNIST categories is not appropriate.