Official repository for "Orthogonal Projection Loss" (ICCV'21)
Deep neural networks have achieved remarkable performance on a range of classification tasks, with softmax cross-entropy (CE) loss emerging as the de-facto objective function. The CE loss encourages features of a class to have a higher projection score on the true class-vector compared to the negative classes. However, this is a relative constraint and does not explicitly force different class features to be well-separated. Motivated by the observation that ground-truth class representations in CE loss are orthogonal (one-hot encoded vectors), we develop a novel loss function termed `Orthogonal Projection Loss' (OPL) which imposes orthogonality in the feature space. OPL augments the properties of CE loss and directly enforces inter-class separation alongside intra-class clustering in the feature space through orthogonality constraints on the mini-batch level. As compared to other alternatives of CE, OPL offers unique advantages: it adds no learnable parameters, requires no careful negative mining, and is not sensitive to batch size. Given the plug-and-play nature of OPL, we evaluate it on a diverse range of tasks including image recognition (CIFAR-100), large-scale classification (ImageNet), domain generalization (PACS) and few-shot learning (miniImageNet, CIFAR-FS, tieredImageNet and Meta-Dataset) and demonstrate its effectiveness across the board. Furthermore, OPL offers better robustness against practical nuisances such as adversarial attacks and label noise. Code is available at: https://github.com/kahnchana/opl.
Recent years have witnessed great success across a range of computer vision tasks owing to progress in deep neural networks (DNNs). Effective loss functions for DNN training have been a crucial component of these advancements. In particular, the softmax cross-entropy (CE) loss, commonly used for tackling classification problems, has been pivotal for stable and efficient training of DNNs.
Multiple variants of CE have been explored to enhance the discriminability and generalizability of feature representations learned during training. Contrastive and triplet loss functions are a common class of methods that have gained popularity on tasks requiring more discriminative features. At the same time, methods like centre loss and contrastive centre loss have attempted to explicitly enforce inter-class separation and intra-class clustering through Euclidean margins between class prototypes. Angular margin based losses [33, 32, 7, 55, 54] compose another class of objective functions that increase inter-class margins by altering the logits prior to the CE loss.
While these methods have proven successful at promoting better inter-class separation and intra-class compactness, they do possess certain drawbacks. Contrastive and triplet loss functions [16, 45] are dependent on carefully designed negative mining procedures, which are both time-consuming and performance-sensitive. Methods based on centre loss [59, 38], which work together with CE loss, promote margins in Euclidean space, which is counter-intuitive to the intrinsic angular separation enforced through CE loss. Further, these methods introduce additional learnable parameters in the form of new class centres. Angular margin based loss functions [32, 33], which are highly successful for face recognition tasks, make strong assumptions that face embeddings lie on a hypersphere manifold, which does not hold universally for all computer vision tasks. Some loss designs are also specific to certain architecture classes, e.g., those that can only work with DNNs which output Class Activation Maps.
In this work, we explore a novel direction of simultaneously enforcing inter-class separation and intra-class clustering through orthogonality constraints on feature representations learned in the penultimate layer (Fig. 1). We propose Orthogonal Projection Loss (OPL), which can be applied on the feature space of any DNN as a plug-and-play module. We are motivated by how image classification inherently assumes independent output classes and how orthogonality constraints in feature space go hand in hand with the one-hot encoded (orthogonal) label space used with CE. Furthermore, orthogonality constraints provide a definitive geometric structure, in contrast to arbitrarily increasing margins which change depending on the selected batch, thus reducing sensitivity to batch composition. Finally, simply maximizing margins can induce negative correlation between classes and thereby unnecessarily focuses on already well-separated classes; we instead aim to ensure independence between different class features, so as to disentangle the class-specific characteristics.
Compared with contrastive loss functions [16, 45], OPL operates directly on mini-batches, eliminating the requirement of complex negative sample mining procedures. By enforcing orthogonality through computing dot-products between feature vectors, OPL provides a natural augmentation to the intrinsic angular property of CE, as opposed to methods [59, 38, 17] that enforce a Euclidean margin in feature space. Furthermore, OPL introduces no additional learnable parameters unlike [59, 53, 38], operates independently of model architecture, and in contrast to losses operating on the hypersphere manifold [32, 33, 7, 55], performs well on a wide range of tasks. Our main contributions are:
We propose a novel loss, OPL, that directly enforces inter-class separation and intra-class clustering via orthogonality constraints with no learnable parameters.
We extensively evaluate on a diverse range of image classification tasks highlighting the discriminative ability of OPL. Further, our results on few-shot learning (FSL) and domain generalization (DG) datasets establish the transferability and generalizability of features learned with OPL. Finally, we establish the improved robustness of learned features to adversarial attacks and label noise.
Loss functions play a central role in all deep learning based computer vision tasks. Recent works include reformulating the generic softmax calculation to limit the variance and range of logits during training, constraining the feature space to follow specific distributions, and heuristically altering margins targeting specific tasks like few-shot learning [31, 28], class imbalance [21, 20, 17, 30] and zero-shot learning. OPL optimizes a different objective of inter-class separation and intra-class clustering through orthogonalization in the feature space.
Generalizable Representations. Recent works explore the transferability of features learned via supervised training, FSL [51, 4, 14] and DG [19, 9] tasks. Tian et al. establish a strong FSL baseline using only standard (non-episodic) supervised pre-training. Adaptation of supervised pre-trained models to the episodic evaluation setting of FSL tasks is explored in [61, 4]. Goldblum et al. show the importance of margin-based regularization methods for FSL. Our work differs by building on orthogonality constraints to learn more transferable features, and is more compatible with CE than margin-based alternatives. Multiple DG methods also explore constraints on feature spaces [19, 9] to boost cross-domain performance; in particular, one approach explores inter-class separation and intra-class clustering through contrastive and triplet loss functions. OPL improves on these while eliminating the need for compute-expensive and complex sample mining procedures.
Orthogonality. Orthogonality of kernels in DNNs is well explored with the aim of diversifying the learned weight vectors [34, 56]. The idea of orthogonality is also used for disentangled representations and to stabilize network training, since orthogonalization ensures energy preservation [8, 43]. Orthogonal weight initializations have also shown promise towards improving learning behaviours [37, 60]. However, all of these works operate in the parameter space. Remarkably, previous formulations to achieve orthogonality in the feature space generally depend on computing the singular value decomposition [27, 48], which can be numerically unstable, is difficult to estimate for rectangular matrices, and requires an iterative process. In contrast, our orthogonality constraints are enforced in a novel manner, realized via decomposition of the sample-to-sample relationships within a mini-batch, while simultaneously avoiding tedious pair/triplet computations.
Maximizing inter-class separation while enhancing intra-class compactness is highly desirable for classification. While the commonly used cross-entropy (CE) loss encourages logits of the same class to be closer together, it does not enforce any margin amongst different classes. There have been multiple efforts to integrate max-margin learning with CE. For example, Large-margin softmax enforces inter-class separability directly on the dot-product similarity, while SphereFace and ArcFace enforce multiplicative and additive angular margins on the hypersphere manifold, respectively. Directly enforcing max-margin constraints to enhance discriminability in the angular domain is ill-posed and requires approximations. Some works turn to the Euclidean space to enhance feature space discrimination. For example, centre loss clusters penultimate layer features using Euclidean distance. Similarly, Affinity Loss forms uniformly shaped equi-distant class-wise clusters based upon Gaussian distances in Euclidean space. Margin-maximizing objective functions in Euclidean space are not ideally suited to work alongside CE loss, since CE seeks to separate output logits in the angular domain. By enforcing orthogonality constraints, our proposed OPL loss maximally separates intermediate features in the angular domain, thus complementing cross-entropy loss which enhances angular discriminability in the output space. In the following discussion, we revisit CE loss in the context of max-margin learning, and argue why OPL loss is ideally suited to supplement CE.
Consider a deep neural network $\mathcal{F}$, which can be decomposed into $\mathcal{F} = \mathcal{F}_{cls} \circ \mathcal{F}_{feat}$, where $\mathcal{F}_{feat}$ is the feature extraction module and $\mathcal{F}_{cls}$ is the classification module. Given an input-output pair $(\mathbf{x}, y)$, let $\mathbf{f} = \mathcal{F}_{feat}(\mathbf{x})$ be the intermediate features and $\hat{\mathbf{y}} = \mathcal{F}_{cls}(\mathbf{f})$ be the output predictions. For brevity, let us define the classification module as a linear layer with no unit-biases, $\mathcal{F}_{cls}(\mathbf{f}) = \mathbf{W}^{\top}\mathbf{f}$, where $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_C]$ are class-wise learnable projection vectors for $C$ classes. The traditional CE loss can then be defined in terms of the discrepancy between the prediction $\hat{\mathbf{y}}$ and the ground-truth label $y$, by projecting the features onto the weight matrix $\mathbf{W}$:

$$\mathcal{L}_{CE} = -\log \frac{\exp(\mathbf{w}_y^{\top}\mathbf{f})}{\sum_{c=1}^{C}\exp(\mathbf{w}_c^{\top}\mathbf{f})} \quad (1)$$
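This projection view of CE can be made concrete with a small NumPy sketch; the function and variable names below are illustrative, not taken from the paper's code.

```python
import numpy as np

def cross_entropy(features, W, label):
    """Softmax CE for one sample: project the features onto the class-wise
    weight vectors (columns of W), then take negative log-softmax at `label`."""
    logits = W.T @ features                        # per-class projection scores
    logits = logits - logits.max()                 # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

rng = np.random.default_rng(0)
f = rng.normal(size=64)         # penultimate-layer features
W = rng.normal(size=(64, 10))   # projection vectors for 10 classes
loss = cross_entropy(f, W, label=3)
```

Note that the loss only depends on the projection of `f` onto each column of `W`: aligning the true-class vector with the features drives the loss toward zero, which is the relative constraint discussed above.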
Inter-class orthogonality enforced by OPL can be observed in this MNIST 2-D feature visualization. We plot only three classes to better illustrate in 2D the inter-class orthogonality achieved. Normalization refers to projection of vectors to a unit-hypersphere in feature-space, and the normalized plot contains a histogram for each angle.
Since CE does not explicitly enforce any margin between each class pair, previous works have shown that the learned class regions for some class samples tend to be bigger compared to others [32, 7, 17]. To counter this effect and ensure all classes are equally separated, efforts have been made to introduce a margin between different classes by modifying the target logit $\cos(\theta_y)$, where $\theta_y$ is the angle between the feature and the true-class vector $\mathbf{w}_y$: as $\cos(\theta_y + m)$ for additive angular margin, $\cos(m\,\theta_y)$ for multiplicative angular margin, and $\cos(\theta_y) - m$ for additive cosine margin. The gradient propagation for such margin-based softmax loss formulations is hard, and previous works rely on approximations. Instead of introducing any margins to ensure uniform separation between different classes, our proposed loss function simply enforces all classes to be orthogonal to each other with simultaneous clustering of within-class samples, using an efficient vectorized implementation with straightforward gradient computation and propagation.
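The three margin variants can be illustrated numerically; the margin values below are arbitrary choices for illustration, not the tuned values from the respective papers.

```python
import numpy as np

# Each margin variant shrinks the (normalized) target logit cos(theta_y),
# forcing the network to align features more tightly with the true class
# before the loss is satisfied.
theta_y = np.arccos(0.8)                       # angle to the true-class vector
additive_angular = np.cos(theta_y + 0.5)       # ArcFace-style: cos(theta + m)
multiplicative_angular = np.cos(4 * theta_y)   # SphereFace-style: cos(m * theta)
additive_cosine = np.cos(theta_y) - 0.35       # CosFace-style: cos(theta) - m
```

All three penalized logits are strictly smaller than the original similarity of 0.8, which is exactly how the margin enlarges inter-class separation.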
By considering the vectors $\mathbf{w}_c$ as individual class prototypes, the CE loss can be viewed as aligning the feature vector $\mathbf{f}$ along its relevant class prototype. The cosine similarity in the form of the dot product ($\mathbf{w}_y^{\top}\mathbf{f}$) gives CE an intrinsic angular property, which is observed in Fig. 2 where features naturally separate in the polar coordinates with CE only. Moreover, during standard SGD based optimization, the CE loss is applied on mini-batches. We note that there is no explicit enforcement of feature separation or clustering across multiple samples within the mini-batch. Given the opportunity to enforce such constraints, since supervised training is commonly conducted using random mini-batch based iterations, we explore within mini-batch constraints aimed at augmenting the intrinsic discriminative characteristics of CE loss.
The CE loss with one-hot-encoded ground-truth vectors seeks to implicitly achieve orthogonality between different classes in the output space. Our proposed OPL loss augments CE loss by enforcing class-wise orthogonality in the intermediate feature space. Given an input-output pair $(\mathbf{x}, y)$ in the dataset $\mathcal{D}$, let $\mathbf{f}$ be the features output by an intermediate layer of the network. Our objective is to enforce constraints to cluster the features such that the features for different classes are orthogonal to each other and the features for the same class are similar. To this end, we define a unified loss function that simultaneously ensures intra-class clustering and inter-class orthogonality within a mini-batch as follows:

$$s = \frac{\sum_{i,j \in B,\; y_i = y_j,\; i \neq j} \langle \mathbf{f}_i, \mathbf{f}_j \rangle}{\sum_{i,j \in B,\; y_i = y_j,\; i \neq j} 1} \quad (2)$$

$$d = \frac{\sum_{i,k \in B,\; y_i \neq y_k} \langle \mathbf{f}_i, \mathbf{f}_k \rangle}{\sum_{i,k \in B,\; y_i \neq y_k} 1} \quad (3)$$

$$\mathcal{L}_{OPL} = (1 - s) + |d| \quad (4)$$
where $\langle \cdot, \cdot \rangle$ is the cosine similarity operator applied on two vectors, $|\cdot|$ is the absolute value operator, and $B$ denotes the mini-batch of size $n$. Note that the cosine similarity operator used in Eq. 2 and 3 involves normalization of features (projection to a unit hyper-sphere) as follows:

$$\langle \mathbf{f}_i, \mathbf{f}_j \rangle = \frac{\mathbf{f}_i^{\top}\mathbf{f}_j}{\|\mathbf{f}_i\|_2 \, \|\mathbf{f}_j\|_2} \quad (5)$$
where $\|\cdot\|_2$ refers to the $\ell_2$ norm operator. This normalization is key to aligning the outcome of OPL with the intrinsic angular property of CE loss.
In Eq. 4, our objective is to push $s$ towards 1 and $d$ towards 0. Since $s \leq 1$ already, we take the absolute value of $d$ given $d \in [-1, 1]$. This in turn restricts the overall loss such that $0 \leq \mathcal{L}_{OPL} \leq 3$. When minimizing this overall loss, the first term will ensure clustering of same-class samples, while the second term will ensure the orthogonality of different-class samples. The loss can be implemented efficiently in a vectorized manner on the mini-batch level, avoiding any loops (see Algorithm 1).
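A minimal NumPy sketch of the vectorized computation in Eq. 2–4 is given below. The official implementation (https://github.com/kahnchana/opl) is in PyTorch; the names and the division-by-zero guards here are ours.

```python
import numpy as np

def opl_loss(features, labels):
    """Mini-batch Orthogonal Projection Loss, L = (1 - s) + |d|.

    features: (n, dim) array of intermediate features.
    labels:   (n,) array of integer class labels.
    """
    # L2-normalize so that dot products become cosine similarities
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    cos = f @ f.T                                   # pairwise cosine similarity

    eq = (labels[:, None] == labels[None, :]).astype(float)
    same = eq.copy()
    np.fill_diagonal(same, 0.0)                     # exclude self-pairs
    diff = 1.0 - eq                                 # different-class pairs

    s = (cos * same).sum() / max(same.sum(), 1.0)   # mean same-class similarity
    d = (cos * diff).sum() / max(diff.sum(), 1.0)   # mean cross-class similarity
    return (1.0 - s) + abs(d)
```

When same-class features coincide and cross-class features are orthogonal, the loss is exactly zero, matching the bounds discussed above.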
We further note that the ratio of contribution to the overall loss of each individual term in Eq. 4 can be controlled to re-prioritize between the two objectives of inter-class separation and intra-class compactness. While the unweighted combination of $(1-s)$ and $|d|$ alone performs well, specific use-cases could benefit from weighted combinations. We reformulate Eq. 4 as follows:

$$\mathcal{L}_{OPL} = (1 - s) + \gamma \, |d| \quad (6)$$

where $\gamma$ is the hyper-parameter controlling the weight for the two different constraints.
Since OPL acts only on intermediate features, we apply cross-entropy loss over the outputs of the final classifier. The overall loss used is a weighted combination of CE and OPL. We note that our proposed loss can also be used together with other common image classification losses, such as Guided Cross Entropy, Label Smoothing, or even task-specific loss functions in different computer vision tasks. The overall loss can be defined as:

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \, \mathcal{L}_{OPL} \quad (7)$$

where $\lambda$ is a hyper-parameter controlling the OPL weight.
Overall Objective: Consider $\mathcal{B}$, a set of mini-batch samples comprising normalized features $\hat{\mathbf{f}}_i$ with labels $y_i$ drawn from a given dataset $\mathcal{D}$. The overall OPL constraints can be viewed as a minimization of the following objective, to update the network over the random variables:

$$\mathbb{E}_{\mathcal{B} \sim \mathcal{D}} \left[ \sum_{i,j \in \mathcal{B}} \left| \langle \hat{\mathbf{f}}_i, \hat{\mathbf{f}}_j \rangle \right| \cdot [\![\, y_i \neq y_j \,]\!] \right] \quad (8)$$

where $|\cdot|$ is the absolute value operator and $[\![\cdot]\!]$ is the Iverson bracket operator. We refer to the term defined in Eq. 8 as the expected inter-class orthogonality. The behaviour of OPL in terms of minimizing this expectation is visualized in Fig. 4, where average per-class feature vectors over the CIFAR-100 dataset are calculated for ResNet-56 models trained under ‘CE-only’ and ‘CE+OPL’ settings. Clear improvements over the CE baseline in terms of minimizing the expected inter-class orthogonality can be observed. Moreover, the stochastic mini-batch based application of OPL prevents naively pushing all non-diagonal values to zero. This translates to allowing necessary inter-class relationships not encoded in one-hot labels of a dataset to be captured within the learned features.
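The expected inter-class orthogonality can be estimated as the mean absolute cosine similarity between average per-class feature vectors. The following NumPy sketch is a diagnostic in that spirit; function and variable names are ours, not from the paper's code.

```python
import numpy as np

def expected_interclass_orthogonality(features, labels):
    """Mean |cosine similarity| between normalized per-class mean features --
    the quantity OPL drives toward zero (cf. Eq. 8)."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    means = means / np.linalg.norm(means, axis=1, keepdims=True)
    cos = np.abs(means @ means.T)
    off_diag = ~np.eye(len(classes), dtype=bool)   # ignore self-similarity
    return cos[off_diag].mean()
```

A value near 0 indicates near-orthogonal class means (the OPL-trained regime in Fig. 4), while a value near 1 indicates collapsed class representations.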
Decomposing OPL: Taking a step further, we decompose OPL into its sub-components, $s$ and $d$, as defined in Eq. 4. While $s$ computes the pair-wise cosine similarity between all same-class features within the mini-batch, $d$ calculates the similarity between different-class features. These measures can directly be adopted to quantify the inter-class separation and intra-class compactness in any given feature space. Moreover, the unweighted OPL formulation in Eq. 4 can be considered a measure of the overall feature orthogonality within any given embedding space. It is interesting to compare the contribution of OPL towards inter-class separation and intra-class clustering of a feature space in contrast to the generic CE based training scenario. We present this comparison by training ResNet-56 on the CIFAR-100 dataset in Fig. 3. This separation of features achieved through OPL translates to performance improvements not only within the standard classification setting, but also in tasks requiring transferable or generalizable features. Goldblum et al. explore the significance of inter-class separation and intra-class clustering for better performance when transferring a feature embedding to a few-shot learning task. Similar notions regarding discriminative features are explored for domain generalization. We explore the effects of OPL in few-shot learning settings, and visualize the novel class embeddings learned with OPL in Appendix B.1 using LDA to preserve the inter-class to intra-class variance ratio.
Why orthogonality constraints?: One may wonder what benefit orthogonality in the feature space can provide in comparison to simply maximizing margin between classes. Our reasoning is twofold: reducing sensitivity to batch composition and avoiding negative-correlation constraints. Within the random mini-batch based training setting, the orthogonality objective provides a definitive geometric structure irrespective of the batch composition, while the optimal max-margin separation is dependent on the batch composition. Furthermore, in the common case where the output-space feature dimension is at least the number of classes, maximizing the angular margin between normalized features on a unit hyper-sphere will lead to negative correlation among the class prototypes (considering a maximal and equi-angular separation). We argue that this is an undesired constraint, since the categorical classification task itself assumes the non-existence of ordinal relationships between classes (i.e., the use of orthogonal one-hot encoded labels). Moreover, extending the constraints to additionally cause negative correlation between classes unnecessarily focuses on already well-separated classes during training, whereas our constraint, which tends to ensure independence, provides a more balanced objective to disentangle the class-specific characteristics of even more fine-grained classes.
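The negative-correlation effect of maximal angular separation is easy to verify numerically in a low-dimensional toy case, sketched below.

```python
import numpy as np

# In 2-D, three maximally (equi-angularly) separated unit vectors sit
# 120 degrees apart, so every pair has cosine similarity -0.5: maximizing
# the angular margin makes the class prototypes negatively correlated.
# Orthogonal prototypes would instead have zero pairwise similarity,
# encoding independence rather than opposition between classes.
angles = np.deg2rad([0.0, 120.0, 240.0])
protos = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (3, 2) unit vectors
cosines = protos @ protos.T
pairwise = cosines[~np.eye(3, dtype=bool)]   # the 6 off-diagonal similarities
```

Every off-diagonal cosine equals -0.5, illustrating why pure margin maximization imposes the ordinal (anti-correlated) structure that OPL's orthogonality constraint deliberately avoids.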
We extensively evaluate our proposed loss function on multiple tasks including image classification (Tables 1 & 2), robustness against label noise (Table 3), robustness against adversarial attacks (Table 4) and generalization to domain shifts (Table 5). We further observe the enhanced transferability of orthogonal features in the case of few-shot learning (Tables 6 & 7). Our approach shows consistent improvements and highlights the advantages of orthogonal features on this diverse set of tasks and datasets with various deep network backbones. Additionally, we demonstrate the plug-and-play nature of OPL by showing the benefits of its use over CE, Truncated Loss (for noisy labels), RSC, and various adversarial learning baselines.
We evaluate the effectiveness of orthogonal features in the penultimate layer on image classification using our proposed training objective (Eq. 7). Competitive results showing consistent improvements (Tables 1 & 2) are achieved on two datasets: CIFAR-100 and ImageNet.
CIFAR-100 consists of 60,000 natural images spread across 100 classes, with 600 images per class. We apply OPL over a cross-entropy baseline for supervised classification on CIFAR-100, and compare our results in Table 1 against other loss functions which impose margin constraints [32, 33, 55, 7], introduce regularization [47, 27, 30, 6], or promote clustering [59, 64] to enhance separation among classes. Despite its simplicity, our method performs well against state-of-the-art loss functions. Note that HNC is dependent on class activation maps, RBF and LGM involve learnable parameters, and CB Focal Loss specifically targets class imbalance. In contrast, OPL has a simple formulation that is easily integrable into any network architecture, involves no learnable parameters, and targets general classification. Additionally, we note how OPL has higher performance gains with respect to top-1 accuracy (in comparison to top-5 accuracy), which is the more challenging metric. We attribute this to the fact that the increased separation through OPL mostly helps in classifying difficult samples. Further, we note top-5 is not a preferred measure for CIFAR-100, since most classes are different in nature, as opposed to e.g., ImageNet with several closely related classes (where our gain is much more pronounced for top-5 accuracy, as discussed next).
| Loss | Top-1 | Top-5 | Top-1 | Top-5 |
|---|---|---|---|---|
| Center Loss | | | | |
| Focal Loss | 73.09% | 93.07% | 74.34% | 93.34% |
| LMC Loss | 71.52% | 91.64% | 73.15% | 91.88% |
| OLE Loss | 71.95% | 92.52% | 72.70% | 92.63% |
| LGM Loss | 73.08% | 93.10% | 74.34% | 93.06% |
| Anchor Loss | - | - | 74.38% | 92.45% |
| AAM Loss | 71.41% | 91.66% | 73.72% | 91.86% |
| CB Focal Loss | 73.09% | 93.07% | 74.34% | 93.34% |
ImageNet is a standard large-scale dataset used in visual recognition tasks, containing roughly 1.2 million training images and 50,000 validation images. We experiment with OPL by integrating it into common backbone architectures used in image classification tasks: ResNet18 and ResNet50. We train the models (following the official PyTorch example: https://github.com/pytorch/examples/tree/master/imagenet) for 90 epochs using SGD with momentum (initial learning rate 0.1, decayed by 10 every 30 epochs). Results for these experiments are presented in Table 2 and Fig. 5. We note that simply enforcing our orthogonality constraints increases the top-1 accuracy of ResNet50 from 76.15% to 76.98% without any additional bells and whistles. Moreover, given the large number of fine-grained classes among the 1000 categories of ImageNet (e.g., multiple dog species), which can be viewed as difficult cases, the more discriminative features learned by OPL obtain notable improvements in top-5 accuracy as well.
| Method | ResNet-18 Top-1 | ResNet-18 Top-5 | ResNet-50 Top-1 | ResNet-50 Top-5 |
|---|---|---|---|---|
| CE + OPL (ours) | 70.27% | 89.60% | 76.98% | 93.30% |
Given the rich representation capacity of deep neural networks, especially considering how most can even fit random labels or noise perfectly, errors in the sample labels pose a significant challenge for training. In most practical applications, label noise is almost impossible to avoid, in particular when it comes to large-scale datasets requiring millions of human annotations. Multiple works [13, 65] explore modifications to common objective functions aimed at building robustness to label noise. Despite the explicit inter-class separation constraints on the feature space enforced by OPL, we argue that the random mini-batch based optimization exploited by OPL mitigates the effects of noisy labels. This hypothesis is supported by our experiments presented in Table 3, which show the additional robustness of OPL against label noise. We simply integrate OPL over the Truncated Loss (TL) approach, without any task-specific modifications.
| Method | | |
|---|---|---|
| TL + OPL | 88.45% | 87.02% |
| TL + OPL | 65.62% | 53.94% |
Adversarial attacks modify a given benign sample by adding adversarial noise such that the deep neural network is deceived. Adversarial examples are out-of-distribution samples and remain a challenging problem to solve. Adversarial training emerges as an effective defense, where adversarial examples are generated and added into the training set. We enforce orthogonality on such adversarial examples in the feature space while optimizing the model weights, and show our benefit on different adversarial training mechanisms [35, 18, 57]. It is important to note that all the considered adversarial training schemes [35, 18, 57] are different in nature: training in Madry et al. is based on cross-entropy only, Hendrycks et al. propose to exploit pre-training, while Wang et al. introduce a surrogate loss along with cross-entropy. Our orthogonality constraint helps maximize adversarial robustness in all cases, showing the generic and plug-and-play nature of our proposed loss. For reliable evaluation, we report robustness gains against Auto-Attack (AA) in Table 4; on CIFAR-10, our method increases the robustness of all three baselines.
| Method | Clean | Auto-Attack |
|---|---|---|
| Madry + OPL | 87.76 | 49.15 |
| Hendrycks + OPL | 87.51 | 55.73 |
| MART + OPL | 84.41 | 56.23 |
| Madry + OPL | 61.13 | 23.01 |
| Hendrycks + OPL | 61.00 | 30.05 |
| MART + OPL | 58.01 | 25.74 |
The DG problem aims to train a model using multi-domain source data such that it can directly generalize to new domains without the need for retraining. We argue that the feature-space constraints of OPL tend to capture more general semantic features in images, which generalize better across domains. This is verified by the performance improvements for DG that we obtain by integrating OPL with the state-of-the-art RSC approach and evaluating on the popular PACS dataset. The results presented in Table 5 indicate that integrating OPL with RSC sets a new state-of-the-art across all four domains.
| Method | Art | Cartoon | Sketch | Photo | Avg. |
|---|---|---|---|---|---|
| RSC + OPL | 88.28 | 84.64 | 84.17 | 96.83 | 88.48 |
|RFS + OPL (Ours)||✓||73.02±0.4||86.12±0.2||63.10±0.36||79.87±0.26||70.20±0.41||85.01±0.27|
|SKD + OPL (Ours)||✓||74.94±0.4||88.06±0.3||66.90±0.37||83.23±0.25||72.10±0.41||86.70±0.27|
In this section, we explore the transferability of features learned with our loss function in relation to FSL tasks. We evaluate OPL on three benchmark few-shot classification datasets: miniImageNet, tieredImageNet, and CIFAR-FS. We run additional experiments on Meta-Dataset, a large-scale benchmark for evaluating FSL methods in more diverse and challenging settings. Following prior work, we expand Meta-Dataset by adding three additional datasets: MNIST, CIFAR10, and CIFAR100. In light of work showing the promise of learning strong features for FSL, we experiment with OPL as an auxiliary loss on the feature space during supervised training. Quantitative results highlighting the performance improvements are presented in Table 6. Our results on Meta-Dataset are obtained using the train-on-all setting. We integrate OPL over this baseline, train on the first 8 datasets of Meta-Dataset, and evaluate on the rest (including the three additional datasets). Results are presented in Table 7. See Appendix A.3 for robustness of OPL features against sample noise in FSL tasks.
OPL in its full form (Eq. 6) contains two hyper-parameters, $\gamma$ and $\lambda$. We conduct a hyper-parameter search over a held-out validation set of CIFAR-100 (see Table 8). The optimum values selected from these experiments are kept consistent across all other tasks and used when reporting test performance. Furthermore, we evaluate the performance of OPL on the test split of CIFAR-100 for varying values of one hyper-parameter while keeping the other fixed, illustrating the minimal sensitivity of our method to these values in Fig. 5(a).
Next, we consider how OPL operates on random mini-batches, and evaluate its performance against varying batch sizes (CIFAR-100 dataset). These results, presented in Fig. 5(b), exhibit how OPL consistently maintains its performance gains across varying batch sizes.
We present a simple yet effective loss function to enforce orthogonality on the output feature space and establish its performance improvements for a wide range of classification tasks. Our loss function operates in conjunction with the softmax CE loss, and can be easily integrated with any DNN. We also explore a variety of characteristics of the features learned with OPL, illustrating its benefit for few-shot learning, domain generalization, and robustness against adversarial attacks and label noise. In the future, we hope to explore other variants of OPL, including its adaptation to unsupervised representation learning.
Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3573–3587.
ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105.
NeuralScale: efficient scaling of neurons for resource-constrained deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1478–1487.
RBF-Softmax: learning deep representative prototypes with radial basis function softmax. In ECCV.
In this supplementary document, we include additional experimental results and visualizations.
Here, we present results of experiments conducted using OPL on a set of additional settings.
We conduct experiments on the MNIST dataset, integrating OPL over a CE baseline. We use a 4-layer convolutional neural network with a 32-dimensional feature embedding (after a global average pool operation), following a standard experimental setup. Our results are reported in Table 1. Additionally, we conduct experiments appending a fully-connected layer to reduce the feature dimensionality to 2, generating better visualizations of the behaviour of OPL in feature space (presented in Fig. 1 of the main article).
| Method | Backbone | Baseline | + OPL |
|---|---|---|---|
We have already established through empirical evidence how OPL improves performance for few-shot learning tasks as well as robustness to adversarial examples present during evaluation. We now explore the more challenging task of robustness to input sample noise in an FSL setting (similar to the one in Appendix B.1). The base training is conducted with no noise present in the training data. During evaluation, the support and query set images are corrupted with random Gaussian noise of varying standard deviation (referred to as $\sigma$). This can be considered a domain shift on top of unseen novel classes during evaluation. The features learned with OPL during base training exhibit better robustness to such input corruptions in this FSL setting. We report these results in Table 3. The experiments followed the RFS method integrated with OPL.
|RFS + OPL||()||65.42±0.40||81.41±0.30||56.21±0.36||73.20±0.29||66.60±0.41||83.21±0.29|
|RFS + OPL||()||71.05±0.41||84.46±0.28||61.70±0.37||77.59±0.27||69.60±0.40||84.50±0.29|
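The Gaussian corruption applied to support and query images can be sketched as follows; this is a minimal NumPy approximation of the described procedure, and the exact preprocessing in the released code may differ.

```python
import numpy as np

def corrupt(images, sigma, seed=0):
    """Add zero-mean Gaussian noise of standard deviation `sigma` to a batch
    of images with pixel values in [0, 1], clipping back to the valid range."""
    rng = np.random.default_rng(seed)
    noisy = images + rng.normal(scale=sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)
```

Increasing `sigma` corresponds to moving right along the noise-level rows of Table 3: the same trained features are evaluated under progressively heavier input corruption.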
In this section, we present additional visualizations exploring various aspects of OPL and its performance.
Consider a few-shot learning setting, where a model trained in a fully-supervised manner (referred to as the base model / base training) on a set of selected classes with training labels (referred to as base classes) is later evaluated on a set of unseen classes (referred to as novel classes). The sets of base and novel classes are disjoint. The evaluation protocol involves episodic iterations, where in each step a small set of labelled samples from the novel classes (referred to as the support set) is available during inference, and another set from those same novel classes (referred to as the query set) is used for calculating the accuracy metrics.
Given that our proposed loss explicitly enforces constraints on the feature space during base training, we want to examine whether the additional discriminative nature endowed on the features by OPL is aware of higher-level semantics. To evaluate this, we explore the more challenging task of inter-class separation and intra-class clustering of novel classes which are unseen during base training. We train a model following this approach with OPL integrated, and visualize the separation of different class features for both base and novel classes in Fig. 1.
We further explore the performance of our model (CE+OPL) trained on ImageNet (the model used for experiments presented in Table 2 of the main paper) by examining the failure cases of the baseline model that were improved upon when adding OPL. Visualizations for some randomly selected cases are illustrated in Fig. 4 and Fig. 2.
We defined the overall objective of OPL as a minimization of the expected inter-class orthogonality (refer to Eq. 8 of the main paper) and conducted empirical analysis using models trained with our proposed loss function against a CE-only baseline (illustrated in Fig. 4 of the main paper). In this section, we conduct additional analysis on those block matrices to further understand the outcomes of our orthogonality constraints on the learned feature space. It is interesting to note that while OPL enforces a higher degree of orthogonality between the average class vectors, it does not naively push everything to be orthogonal. We note that this allows any hidden knowledge learned during the training process (information not captured explicitly in the labels) to remain captured within the features. The results of these experiments are illustrated in Fig. 3.