Neural networks have become an essential tool for achieving high-quality classification in various application domains. Part of their appeal stems from the fact that a practitioner does not have to hand-design the input features for model training. Instead, they can simply use the raw data representation (such as pixels instead of highly processed SIFT or HOG features for a computer vision task) and learn a mapping to the target class. This high degree of flexibility enables neural networks to learn highly non-linear maps, and so the target output representation is usually kept relatively simple. It is customary to encode the target labels as a one-hot encoding (for a k-way classification task, the one-hot encoding of the i-th category is simply the i-th standard basis vector in k dimensions). While simple and computationally convenient, a one-hot representation is rather arbitrary. Indeed, such an encoding destroys any semantic relationships that the target categories may have. For instance, in a 3-class apparel classification task with categories sandal, sneaker, and shirt, the semantic similarity between sandal and sneaker (both being footwear) is clearly not captured by the one-hot encoding. An alternate label representation can allow us to capture this semantic connection, and perhaps even make the learning process easier (cf. Figure 1).
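The geometric limitation of one-hot labels can be seen directly: every pair of one-hot vectors is equidistant, so no inter-class similarity survives. A minimal numpy sketch (the apparel class names are just the running example from the text):

```python
import numpy as np

def one_hot(index, num_classes):
    """Standard basis vector for the given class index."""
    vec = np.zeros(num_classes)
    vec[index] = 1.0
    return vec

# 3-class apparel task: sandal, sneaker, shirt.
labels = {name: one_hot(i, 3)
          for i, name in enumerate(["sandal", "sneaker", "shirt"])}

# Every pair of one-hot labels is the same distance apart, so the
# footwear similarity between sandal and sneaker is not represented.
d_sandal_sneaker = np.linalg.norm(labels["sandal"] - labels["sneaker"])
d_sandal_shirt = np.linalg.norm(labels["sandal"] - labels["shirt"])
```

Both distances come out equal (to sqrt(2)), regardless of any semantic relationship between the classes.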
What might be a better representation of the output labels? Since powerful word embedding models such as Word2Vec (Mikolov et al., 2013) and BERT (Devlin et al., 2019) are known to capture the semantic meaning of commonly occurring words, one can use such pre-learned representations of the labels for classification. In fact, Chen et al. (2021) explored this idea in detail. They study the effectiveness of several embeddings, including BERT (pretrained on textual data) and audio spectrograms (trained on vocal pronunciations of the class labels), as target label representations and show improved performance.
An alternate approach, of course, is to explicitly learn the label representation from the data itself. This again can be done in several ways. Sun et al. (2017), for example, propose to augment the underlying neural network with specialized layers for data classification and label embedding that interact with each other during the training process. This, of course, adds complexity, potentially increasing network size. Deng and Zhang (2021), in contrast, learn a "soft" set of labels by "smoothing" the original one-hot encodings without modifying the underlying network architecture. While the learned labels are reasonably flexible in representation, simple smoothing can miss more complex semantic relationships among labels.
In this work, we learn a robust data-dependent label representation that addresses issues left unresolved in previous literature. We propose the Learning with Adaptive Labels (LwAL) algorithm, which simultaneously learns semantically appropriate label representations while training for the underlying classification task. LwAL is based on the insight that relationships between class labels should be inherent in the data belonging to those classes. Since one can view a neural network as a function f_θ that maps input data x to a latent representation f_θ(x), we can utilize this latent data representation to get an initial estimate of the label representation ℓ_c for each class c. Given this initial estimate of the target labels, we can tune the underlying network parameters θ to improve classification accuracy. The improved network can, in turn, be used to get a better data-dependent label representation in the latent space. We can thus alternate between learning the best representation of the labels in the latent space and learning the best parameters of the underlying network for classification, such that at convergence we achieve both high accuracy and an improved representation of the target labels.
We propose a simple yet powerful alternating-updates training algorithm, LwAL, that can learn high-quality representations of labels. Our algorithm works with any underlying network architecture, without any architectural modifications, and introduces only minimal additional hyperparameters to tune. We show that learning the labels simultaneously with LwAL significantly cuts down on the overall training time (usually by more than 50% and sometimes up to 80%) while often achieving better test accuracies than previous work on label learning. We further show that our learned labels are in fact semantically meaningful and can reveal hierarchical relationships that may be present in our data.
Label representations beyond the one-hot encoding have gained interest in recent years. Here we discuss the related literature in detail.
Learning Labels Directly
Representations by label smoothing:
Label smoothing techniques aim to modify the hard one-hot
class probability distribution into a softer target, which provides a broader training signal to the model and can thus yield better performance. Numerous smoothing-based regularization techniques, such as the Max-Entropy Regularizer (MaxEntReg; Pereyra et al., 2017), the Teacher-Free Regularizer (TFReg; Yuan et al., 2019), and Learning with Retrospection (LWR; Deng and Zhang, 2021), have been proposed in the literature, all showing promising improvements. Yet they do not consider unravelling or understanding the relationships between the learned class labels. Deng and Zhang (2021), for instance, focus on learning labels generated by a temperature-controlled softmax function for better training. Such representations, by construction, are limited to smooth unimodal class probability distributions and cannot capture the complex multimodal class distributions that may be necessary to model semantic relationships present in the data.
Representations by network augmentation:
Sun et al. (2017) go beyond label smoothing and propose a unique approach that augments the underlying neural network with specialized layers to learn sophisticated label representations during the training process. Interestingly, they show that even though their augmented network is more complex, it usually learns a good classifier at a faster rate, achieving state-of-the-art accuracies for label learning.
Static Label Representations:
Rather than learning a label representation tuned to a given classification task, Chen et al. (2021) take an alternate approach and use high-quality pre-trained embeddings (such as BERT or GloVe) to represent their target labels. Since no label training is involved, this approach has the advantage of using good label representations with no added complexity, but it yields relatively lower classification accuracies. The technique also relies on the practitioner knowing which pre-trained embedding is most suitable for the given classification task, which may not be obvious.
Other Notable Related Techniques
While not aiming to learn label representations explicitly, certain ML models yield labels beyond the traditional one-hot encoding as a side-effect. Student-Teacher learning paradigm Hinton et al. (2015), for instance, aims to learn a more compact network that approximates the behavior of a given large network. In this process of distillation, the original one-hot target labels of the larger network usually get an alternate “dense” representation in the learned compact network. While interesting, learning the distilled network is time-intensive and thus not an efficient mechanism to learn label representations.
Xie et al. (2016) develop an unsupervised framework for learning to cluster data in the latent space. They use an auto-encoder architecture to learn a compact latent representation of the input data in which it is forced to form clusters. One can thus take these learned latent clusters and use the cluster centers as a proxy for label representations. The lack of direct supervision, however, yields suboptimal partitions and hence suboptimal label encodings for classification.
Connection to Metric Learning
Metric learning aims to learn a transformation of the input space where data from the same category is mapped closer together than data from different categories Kulis (2012); Bellet et al. (2013). One can perhaps view learning labels as performing metric learning not on the input space, but rather on the output space. Interestingly, to the best of our knowledge, this viewpoint is not explored in existing literature and may be a fruitful avenue for future research.
Some metric learning literature does explore semantic hierarchical relationships between labels to learn more informed transformations. Notably, Verma et al. (2012) explicitly incorporate label hierarchy information to markedly improve nearest-neighbor classification accuracy. They additionally show that such a learned metric can also help in augmenting large taxonomies with new categories. Our work, in contrast, derives the label taxonomy directly from data without any prior hierarchical information.
Here we formally introduce our Learning with Adaptive Labels (LwAL) algorithm, which simultaneously learns label representations while training for the underlying classification task. We start by reviewing the standard training procedure for neural networks and introducing our notation. We then present the LwAL modifications that simultaneously learn the label encodings. Finally, we discuss additional optional variations to LwAL that can further improve performance in certain applications.
Standard Neural Network Training Procedure
Recall that, given a dataset {(x_i, y_i)}_{i=1}^n of n samples for a k-category classification task, where x_i denotes the application-specific input representation and y_i the one-hot output representation of the i-th sample, the goal of a neural network f_θ (parameterized by θ) is to learn a mapping from the inputs (x_i) to the outputs (y_i). This learning is usually done by finding a parameter setting θ that minimizes the loss between the predicted output and the desired (one-hot) output y_i. In particular, let z_i = f_θ(x_i) be the network encoding of the input x_i. First a softmax is applied to z_i to obtain a probability distribution p_i = softmax(z_i), which encodes the affinity of x_i to each of the k classes. Then this induced probability distribution is compared with the ideal distribution y_i using any distribution-comparing divergence, such as the cross-entropy (CE). Thus the classification loss for the i-th sample becomes

L_i(θ) = CE(y_i, p_i) = −Σ_c y_{i,c} log p_{i,c}.

The optimal parameter setting θ can thus be learned by the usual iterative gradient-type updates (such as SGD or Adam) on the aggregate loss over all training datapoints.
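The softmax-plus-cross-entropy computation above can be sketched in a few lines of numpy; the logit values here are hypothetical and serve only to illustrate the shapes involved:

```python
import numpy as np

def softmax(z):
    """Convert a logit vector into a probability distribution."""
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_onehot, p):
    """CE(y, p) = -sum_c y_c * log(p_c), with a small epsilon for safety."""
    return -float(np.sum(y_onehot * np.log(p + 1e-12)))

# Hypothetical network encoding z_i = f_theta(x_i) for one sample of a
# 3-class task, compared against the one-hot target for class 1.
logits = np.array([0.5, 2.0, -1.0])
y = np.array([0.0, 1.0, 0.0])
p = softmax(logits)
loss = cross_entropy(y, p)
```

Since y is one-hot, the loss reduces to the negative log-probability assigned to the true class.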
Learning with Adaptive Labels
To learn more enriched, semantically meaningful label representations, we posit that the semantic relationships between classes are contained within the samples belonging to those classes. Specifically, we model the label representation ℓ_c of a class c as the vector that minimizes the average (squared) distance to the network encodings of the samples belonging to class c. This is equivalent to setting

ℓ_c = (1/n_c) Σ_{i : y_i = c} f_θ(x_i),

where n_c is the number of samples belonging to class c.
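The per-class label estimate above is just the mean of each class's latent encodings. A minimal numpy sketch (the encodings are random stand-ins for f_θ(x_i)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent encodings f_theta(x_i) for 6 samples in 2 classes.
encodings = rng.normal(size=(6, 4))
classes = np.array([0, 0, 0, 1, 1, 1])

def class_labels(encodings, classes, num_classes):
    """Label representation of class c: the mean of its samples'
    encodings, which minimizes the average squared distance to them."""
    return np.stack([encodings[classes == c].mean(axis=0)
                     for c in range(num_classes)])

labels = class_labels(encodings, classes, 2)
```

Each row of `labels` is one learned class representation, living in the same latent space as the encodings.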
To bring the training in line with standard neural network updates, given these new class representations one can define the probability that the network encoding of the i-th datapoint belongs to class c as

p_{i,c} = exp(−‖f_θ(x_i) − ℓ_c‖²) / Σ_{c′} exp(−‖f_θ(x_i) − ℓ_{c′}‖²),

that is, a softmax over negative squared distances to the learned labels. Therefore, the modified cross-entropy loss for the i-th datapoint becomes

L_i(θ) = CE(y_i, p_i) = −Σ_c y_{i,c} log p_{i,c},

where p_i = (p_{i,1}, …, p_{i,k}) is the probability distribution that encodes the affinity of f_θ(x_i) to each of the k classes under the new label representation. One can thus train the optimal parameters θ of the underlying neural network the usual way.
One should note that the choice of cross-entropy as the loss function encourages the learned class representations ℓ_c to be well separated, yielding empirically better accuracies than other popular loss functions.
One can predict the label of a test example by simply assigning it to the closest learned label in the network-encoded space. That is,

ŷ(x) = argmin_c ‖f_θ(x) − ℓ_c‖.
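A sketch of the affinity and prediction rules in numpy, under the assumption (ours) that the class affinity is a softmax over negative squared distances to the learned labels; the two learned labels and the test encoding are hypothetical values:

```python
import numpy as np

def label_probs(encoding, labels):
    """Affinity of one encoding to each learned label, modeled here as a
    softmax over negative squared distances."""
    d2 = np.sum((labels - encoding) ** 2, axis=1)
    e = np.exp(-(d2 - d2.min()))         # shift for numerical stability
    return e / e.sum()

def predict(encoding, labels):
    """Predicted class: index of the closest learned label."""
    return int(np.argmin(np.linalg.norm(labels - encoding, axis=1)))

labels = np.array([[0.0, 0.0], [3.0, 3.0]])   # two learned class labels
z = np.array([0.5, 0.2])                       # encoding of a test point
p = label_probs(z, labels)
pred = predict(z, labels)
```

Note that the nearest-label prediction is exactly the argmax of this distance-based softmax, so training and prediction are consistent.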
Adapting to large-scale datasets
To accommodate large-scale datasets, we use the mini-batch paradigm. Mini-batch training usually suffers from the moving-target problem (Mnih et al., 2013): the label representations ℓ_c are constantly changing, leading to poor convergence. To alleviate this, we add hyperparameters that control the label update frequency (Deng and Zhang, 2021) and the number of initial warmup steps, which promote more initial separation between classes before the ℓ_c are learned. See Algorithm 1 for details.
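The alternating mini-batch loop can be sketched as follows; the helper names (`encode`, `grad_step`) and their signatures are ours, standing in for the network's forward pass and its gradient update on the loss from the previous section:

```python
import numpy as np

def lwal_train(encode, grad_step, batches, num_classes,
               update_freq=1, warmup=0):
    """Sketch of the alternating LwAL mini-batch loop: labels are
    re-estimated every `update_freq` steps, and only after `warmup`
    initial steps that let the encodings separate first."""
    labels = None
    for step, (x, y) in enumerate(batches):
        z = encode(x)                            # latent encodings f_theta(x)
        if step >= warmup and step % update_freq == 0:
            # data-dependent label estimate: per-class mean encoding
            labels = np.stack([z[y == c].mean(axis=0)
                               for c in range(num_classes)])
        if labels is not None:
            grad_step(z, y, labels)              # network update on the CE loss
    return labels

# Toy run: identity "network", no-op gradient step, two classes.
rng = np.random.default_rng(1)
toy_batches = [(rng.normal(size=(4, 3)), np.array([0, 0, 1, 1]))
               for _ in range(3)]
learned = lwal_train(lambda x: x, lambda z, y, l: None, toy_batches, 2)
```

Raising `update_freq` holds the label targets fixed for several steps, which is the lever against the moving-target problem described above.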
To further improve label quality, we draw inspiration from push-pull based losses in the metric learning literature (Xing et al., 2002; Weinberger and Saul, 2009; Schroff et al., 2015). We add an optional "push" loss that encourages the learned labels to be well separated, yielding better generalization accuracies. Specifically, we penalize the angle between the network encodings of datapoints from different classes, using cosine similarity (cf. Algorithm 1).
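One plausible reading of this push loss, sketched in numpy: average the cosine similarity over all cross-class pairs of encodings, so that minimizing it drives the classes apart (the exact aggregation is our assumption; the example encodings are hypothetical):

```python
import numpy as np

def push_loss(encodings, classes):
    """Mean cosine similarity between encodings of datapoints from
    different classes; minimizing it pushes the classes apart."""
    normed = encodings / np.linalg.norm(encodings, axis=1, keepdims=True)
    sims = normed @ normed.T
    cross = classes[:, None] != classes[None, :]   # cross-class pairs only
    return float(sims[cross].mean())

# Two well-separated classes give a lower push loss than overlapping ones.
separated = push_loss(np.array([[1.0, 0.0], [0.9, 0.1],
                                [0.0, 1.0], [0.1, 0.9]]),
                      np.array([0, 0, 1, 1]))
overlapping = push_loss(np.array([[1.0, 0.0], [0.9, 0.1],
                                  [0.8, 0.2], [0.7, 0.3]]),
                        np.array([0, 0, 1, 1]))
```

Same-class pairs are excluded, so the term only repels; the cross-entropy loss of the previous section supplies the attractive part.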
We have a two-fold aim for our empirical study. First, we evaluate how LwAL fares, both in terms of speed and accuracy, when compared to other popular label-learning methodologies on benchmark datasets. Second, we evaluate the effectiveness of our learned labels for revealing semantically meaningful categorical relationships in our data. (An implementation of our algorithm is available at https://github.com/jasonyux/Learning-with-Adaptive-Labels.)
[Table 1: Percent Time/Epoch Reduced and Best Test Accuracy for each method and backbone.]
Learning Speed and Test Performance
To evaluate the robustness of our technique, we report results on several benchmark datasets with different sizes, number of categories and application domains. In particular we used the following datasets for our experiments.
[Table: benchmark dataset summary — Dataset, Domain, # classes, # points.]
We use the default train/test splits provided by the tensorflow library as of Aug 2022.
As backbone architectures for the vision datasets, we use ResNet50, EfficientNetB0, and DenseNet121, all initialized with ImageNet weights. All of these architectures are available in the tensorflow library as of Aug 2022. For text datasets, we use BERT (Devlin et al., 2019), which is available in the huggingface library as of Aug 2022.
We compare LwAL with several important baselines. We compare with the standard one-hot training procedure (STD). Chen et al. (2021) employ a static pretrained (BERT or audio spectrogram) label representation (StaticLabel); for our comparisons, we chose the pretrained BERT embedding as it was reported to perform well on the benchmark datasets. From the label smoothing techniques, we use LWR (Deng and Zhang, 2021) with varying choices of the update-frequency hyperparameter. We also compare with the network augmentation technique (LabelEmbed) of Sun et al. (2017).
In order for the backbones (ResNet50, EfficientNetB0, DenseNet121) to be used across different datasets, we attach a single dense layer with regularization at the top to be used as the classification head.
We train all algorithms with the same set of parameters for consistency. We first pick a learning rate within each backbone so that all algorithms converge: for ResNet50 and DenseNet121, we use the ADAM optimizer; for EfficientNetB0, we use the same optimizer with a different learning rate. For small datasets such as MNIST, F.MNIST, and CIFAR10, we train all algorithms for 10 epochs. For large datasets such as CIFAR100 and FOOD101, we train all algorithms for 20 epochs, at which point the test accuracy reaches a plateau and the models start to overfit. We repeat all runs across multiple random seeds and report the mean and spread.
For LWR, we use the recommended temperature value. Since we are only training for a few epochs, we also experiment with varying values of the update frequency and report all results in Table 1. For LabelEmbed, we use the default parameter settings from the implementation (Sun et al., 2017).
For LwAL, we can vary the output label dimension. We report results for an output dimension of 10 times the number of classes (LwAL10); we empirically found that increasing the output dimension often leads to improved performance, as discussed by Chen et al. (2020), with 10 times the number of classes usually performing best. We also report results with the addition of the optional push loss (LwAL10+rpl). We use a fixed update frequency and no warmup steps, as we use large batch sizes (cf. Algorithm 1 for the settings used for LwAL10+rpl).
Results and Observations
Tables 1 (main text) and 3 (Appendix) summarize our results for the Vision and NLP datasets respectively. Best results are highlighted in bold. Blank (–) in the Time column indicates that a particular algorithm+backbone combination was not able to achieve the STD one-hot baseline test accuracy.
Observe that LwAL significantly cuts down on the overall training time (usually by more than 50% and sometimes up to 80%) while often achieving better test accuracies than the other baselines. Figure 2 depicts how the test accuracy curve improves as training proceeds for a typical run using various backbones. It clearly highlights that one can achieve the same test accuracy as STD with a 70% reduction in training time. This phenomenon is typical across benchmark datasets and choices of backbone (cf. Table 1). One can conclude that LwAL10+rpl with the DenseNet121 backbone gives the best results, with significant savings overall. Curiously, StaticLabel and LWR are not able to achieve the STD one-hot test accuracies for large multi-class datasets like CIFAR100 and FOOD101.
Semantic Label Representation
Here we want to empirically evaluate the effectiveness of our learned labels in discovering semantic relationships among categories. For this, we shall use the semantic hierarchy induced by WordNet Miller (1995) as the gold standard relationship among the categories, and compare how well our learned labels reveal those relationships.
To this end, we utilize the Kendall's Tau-b (τ_b) correlation coefficient to compare the learned representations against the WordNet hierarchy. Specifically, we first compute the pairwise distances between distinct class labels for (i) the reference WordNet hierarchy tree (using shortest-path distances between the tree nodes), and (ii) the learned vectors from the label-learning algorithm. Next, treating the collected distance vectors d_c^WN and d_c^alg (over the remaining classes c′ ≠ c) for each of the k classes as rank vectors, we compute the average semantic correlation score as

score = (1/k) Σ_c τ_b(d_c^WN, d_c^alg).
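The score can be sketched as follows; `kendall_tau_b` is a small self-contained implementation of the tie-corrected rank correlation, and the toy tree/vector values below are ours (two pairs of sibling classes whose learned vectors respect the tree geometry):

```python
import numpy as np

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length vectors, counting
    concordant/discordant pairs with the standard tie correction."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue                  # tied in both: excluded everywhere
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = np.sqrt((concordant + discordant + ties_x)
                    * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom > 0 else 0.0

def semantic_score(tree_dists, learned_vecs):
    """Average tau-b between each class's distance profile under the
    reference tree (k x k matrix) and under the learned vectors (k x d)."""
    k = len(learned_vecs)
    diffs = learned_vecs[:, None, :] - learned_vecs[None, :, :]
    emb_dists = np.linalg.norm(diffs, axis=-1)
    taus = []
    for c in range(k):
        others = np.arange(k) != c        # distances to the other classes
        taus.append(kendall_tau_b(tree_dists[c][others], emb_dists[c][others]))
    return float(np.mean(taus))

# Toy check: siblings are 2 tree-hops apart, non-siblings 4, and the
# learned vectors place siblings close together on a line.
tree = np.array([[0, 2, 4, 4],
                 [2, 0, 4, 4],
                 [4, 4, 0, 2],
                 [4, 4, 2, 0]], dtype=float)
vecs = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
score = semantic_score(tree, vecs)
```

Because only rank order matters, the score rewards labels whose relative distances agree with the tree, without requiring the distances themselves to match.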
We report results on datasets whose classes can be easily mapped to the WordNet (Miller, 1995) hierarchy. This includes Fashion MNIST (8 out of 10 classes can be mapped) and CIFAR10 (10 out of 10 classes can be mapped). We also include results for the Animals with Attributes 2 (AwA2) dataset (Xian et al., 2019), where 23 out of 50 classes can be mapped. We learn and evaluate the quality of the label representations of only the mappable classes for each of these datasets.
Architectures, Hyperparameters, and Baselines
We use ResNet50 (with ImageNet weights) as the underlying neural network backbone for these experiments. We compare the results of our LwAL algorithm with other label-learning techniques: LWR (best across update frequencies) and LabelEmbed. For LWR, the explicit label representation is computed via Eq. (1). For LabelEmbed, since it returns a similarity matrix between the learned labels, we compute the vectorial representation in the standard (eigendecomposition) way.
The rest of the hyperparameter settings (including random seeds, batch size, etc.) are the same as in the previous section.
[Table: semantic correlation scores per dataset for other label-learning algorithms and ours.]
Observe that LwAL and its variants consistently generate significantly more semantically meaningful representations than other label-learning methods. While these results are compelling, it is worth noting that the learned labels, and thus the semantic hierarchy, are derived from the data inputs. LwAL can therefore only extract relationships that are present in the input representation and likely cannot capture every fine-grained semantic detail between classes. Indeed, if the input representation (for example, pixels for image classification tasks) contains no information about the semantic relationships, then one cannot expect LwAL to capture any useful relationship.
Conclusion and Future Work
In this work we present a simple yet powerful Learning with Adaptive Labels LwAL algorithm that can learn semantically meaningful label representations that the vanilla one-hot encoding is unable to capture. Interestingly, we find that by allowing the network to flexibly learn a label representation during training, we can significantly cut down on the overall training time while achieving high test accuracies. Extensive experiments on multiple datasets with varying dataset sizes, application domains, and network architectures show that our learning algorithm is effective and robust.
As noted, although LwAL can learn high-level semantically meaningful label representations extracted from inputs, it is interesting to explore to what degree this is possible. Can fine-grained semantic relationships be derived just from the raw input space? Or does one need to incorporate additional “side-information” to accelerate semantic discovery? We leave this as a topic for future research.
References

- Bellet et al. (2013). A survey on metric learning for feature vectors and structured data. CoRR abs/1306.6709.
- Bossard et al. (2014). Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision (ECCV).
- Chen et al. (2021). Beyond categorical label representations for image classification. In International Conference on Learning Representations (ICLR).
- Chen et al. (2020). Label representations in modeling classification as text generation. In Asia-Pacific Chapter of the Association for Computational Linguistics, Student Research Workshop, pp. 160–164.
- Deng and Zhang (2021). Learning with retrospection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 35(8), pp. 7201–7209.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Association for Computational Linguistics (ACL), pp. 4171–4186.
- He et al. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- Hinton et al. (2015). Distilling the knowledge in a neural network. CoRR abs/1503.02531.
- Huang et al. (2017). Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269.
- Krizhevsky (2009). Learning multiple layers of features from tiny images. Technical report.
- Kulis (2012). Metric learning: a survey. Foundations and Trends in Machine Learning, 5(4), pp. 287–364.
- LeCun et al. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist
- Maas et al. (2011). Learning word vectors for sentiment analysis. In Association for Computational Linguistics (ACL), pp. 142–150.
- Mikolov et al. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS) 26.
- Miller (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), pp. 39–41.
- Mnih et al. (2013). Playing Atari with deep reinforcement learning. CoRR abs/1312.5602.
- Pereyra et al. (2017). Regularizing neural networks by penalizing confident output distributions. CoRR abs/1701.06548.
- Schroff et al. (2015). FaceNet: a unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823.
- Sun et al. (2017). Label embedding network: learning label representation for soft training of deep networks. CoRR abs/1710.10393.
- Tan and Le (2019). EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), 97, pp. 6105–6114.
- Verma et al. (2012). Learning hierarchical similarity metrics. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2280–2287.
- Weinberger and Saul (2009). Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(9), pp. 207–244.
- Xian et al. (2019). Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9), pp. 2251–2265.
- Xiao et al. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR abs/1708.07747.
- Xie et al. (2016). Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 48, pp. 478–487.
- Xing et al. (2002). Distance metric learning with application to clustering with side-information. In Neural Information Processing Systems (NIPS), pp. 505–512.
- Yuan et al. (2019). Revisiting knowledge distillation via label smoothing regularization. arXiv.
- Zhang et al. (2015). Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems (NIPS) 28.
[Table 5: Percent Time/Epoch Reduced, Best Test Accuracy, and Avg. AUAC for each method.]
Additional Results on Learning Speed and Test Performance
In addition to the "Percent Time/Epoch Reduced" and "Best Test Accuracy" metrics in Table 1, we include the average area under the accuracy curve (AUAC) in Table 5. This is another useful metric for comparing learning speed between algorithms, as a larger area under the test-accuracy curve indicates faster learning.
Experiments with Text Dataset
We also perform learning-speed and test-performance evaluations on text datasets: IMDB reviews (Maas et al., 2011) and Yelp Polarity Reviews (Zhang et al., 2015). Specifically, we first use BERT (Devlin et al., 2019) to extract a 768-dimensional representation of each text, and then use two dense layers for prediction (one outputs 768 dimensions, the other the number of classes). For StaticLabel, we use the BERT encoding of the word "negative" for class 0 and of "positive" for class 1. We train all algorithms for 10 epochs using ADAM with a learning rate of 1e-4; the remaining training hyperparameters are the same as those discussed in the main text. The results are presented in Table 3.
Effects of Warmup Steps on LwAL
As discussed by Deng and Zhang (2021), we experiment with running LwAL with some initial warmup steps to see whether they provide better initial label separation and hence better test performance. We experiment with the EfficientNetB0 backbone and report the results in Table 4. We find that using a few warmup steps can sometimes boost test accuracy by a few percentage points. However, since the gain is not consistent, we presented results without warmup steps in the main paper. In practice, this is a tunable hyperparameter that can further improve performance.
[Table 4: Best Test Accuracy with varying warmup steps.]
[Table 3: Avg. AUAC and Best Test Accuracy on the text datasets.]