Orthogonal Projection Loss

03/25/2021
by Kanchana Ranasinghe, et al.

Deep neural networks have achieved remarkable performance on a range of classification tasks, with softmax cross-entropy (CE) loss emerging as the de-facto objective function. The CE loss encourages features of a class to have a higher projection score on the true class-vector compared to the negative classes. However, this is a relative constraint and does not explicitly force different class features to be well-separated. Motivated by the observation that ground-truth class representations in CE loss are orthogonal (one-hot encoded vectors), we develop a novel loss function termed `Orthogonal Projection Loss' (OPL) which imposes orthogonality in the feature space. OPL augments the properties of CE loss and directly enforces inter-class separation alongside intra-class clustering in the feature space through orthogonality constraints at the mini-batch level. Compared to other alternatives to CE, OPL offers unique advantages: it introduces no additional learnable parameters, requires no careful negative mining, and is not sensitive to the batch size. Given the plug-and-play nature of OPL, we evaluate it on a diverse range of tasks including image recognition (CIFAR-100), large-scale classification (ImageNet), domain generalization (PACS) and few-shot learning (miniImageNet, CIFAR-FS, tieredImageNet and Meta-Dataset) and demonstrate its effectiveness across the board. Furthermore, OPL offers better robustness against practical nuisances such as adversarial attacks and label noise. Code is available at: https://github.com/kahnchana/opl.




1 Introduction

Figure 1: Orthogonal Projection Loss: During training of a deep neural network, within each mini-batch, OPL enforces separation between features of different class samples while clustering together features of the same class samples. OPL integrates well with softmax CE loss as it simply complements its intrinsic angular property, leading to consistent performance improvements on various classification tasks with a variety of DNN backbones.

Recent years have witnessed great success across a range of computer vision tasks owing to progress in deep neural networks (DNNs) [24]. Effective loss functions for DNN training have been a crucial component of these advancements [15]. In particular, the softmax cross entropy (CE) loss, commonly used for tackling classification problems, has been pivotal for stable and efficient training of DNNs.

Multiple variants of CE have been explored to enhance the discriminativity and generalizability of feature representations learned during training. Contrastive [16] and triplet [45] loss functions are a common class of methods that have gained popularity on tasks requiring more discriminative features. At the same time, methods like centre loss [59] and contrastive centre loss [38] have attempted to explicitly enforce inter-class separation and intra-class clustering through Euclidean margins between class prototypes. Angular margin based losses [33, 32, 7, 55, 54] constitute another class of objective functions that increase inter-class margins by altering the logits prior to the CE loss.

While these methods have proven successful at promoting better inter-class separation and intra-class compactness, they do possess certain drawbacks. Contrastive and triplet loss functions [16, 45] depend on carefully designed negative mining procedures, which are both time-consuming and performance-sensitive. Methods based on centre loss [59, 38], which work together with CE loss, promote margins in Euclidean space, which is counter-intuitive to the intrinsic angular separation enforced by CE loss [32]. Further, these methods introduce additional learnable parameters in the form of new class centres. Angular margin based loss functions [32, 33], which are highly successful for face recognition tasks, make the strong assumption that face embeddings lie on a hypersphere manifold, which does not hold universally across computer vision tasks [47]. Some loss designs are also specific to certain architecture classes; e.g., [47] can only work with DNNs that output Class Activation Maps [66].

In this work, we explore a novel direction of simultaneously enforcing inter-class separation and intra-class clustering through orthogonality constraints on feature representations learned in the penultimate layer (Fig. 1). We propose Orthogonal Projection Loss (OPL), which can be applied on the feature space of any DNN as a plug-and-play module. We are motivated by how image classification inherently assumes independent output classes, and by how orthogonality constraints in the feature space go hand in hand with the one-hot encoded (orthogonal) label space used with CE. Furthermore, orthogonality constraints provide a definitive geometric structure, in comparison to arbitrarily increasing margins which are prone to change depending on the selected batch, thus reducing sensitivity to batch composition. Finally, simply maximizing margins can cause negative correlation between classes, thereby unnecessarily focusing on already well-separated classes; we instead ensure independence between different class features to successfully disentangle class-specific characteristics.

Compared with contrastive loss functions [16, 45], OPL operates directly on mini-batches, eliminating the requirement of complex negative sample mining procedures. By enforcing orthogonality through dot-products between feature vectors, OPL provides a natural augmentation to the intrinsic angular property of CE, as opposed to methods [59, 38, 17] that enforce a Euclidean margin in the feature space. Furthermore, OPL introduces no additional learnable parameters unlike [59, 53, 38], operates independently of model architecture unlike [47], and, in contrast to losses operating on the hypersphere manifold [32, 33, 7, 55], performs well on a wide range of tasks. Our main contributions are:


  • We propose a novel loss, OPL, that directly enforces inter-class separation and intra-class clustering via orthogonality constraints with no learnable parameters.

  • Our orthogonality constraints are efficiently formulated compared to existing methods [27, 48], allowing mini-batch processing without the need to explicitly obtain singular values. This leads to a simple vectorized implementation of OPL that integrates directly with CE.

  • We extensively evaluate on a diverse range of image classification tasks highlighting the discriminative ability of OPL. Further, our results on few-shot learning (FSL) and domain generalization (DG) datasets establish the transferability and generalizability of features learned with OPL. Finally, we establish the improved robustness of learned features to adversarial attacks and label noise.

2 Related Work

Loss Functions. Loss functions play a central role in all deep learning based computer vision tasks. Recent works include reformulating the generic softmax calculation to limit the variance and range of logits during training [64], constraining the feature space to follow specific distributions [53], and heuristically altering margins targeting specific tasks like few-shot learning [31, 28], class imbalance [21, 20, 17, 30] and zero-shot learning [39]. OPL optimizes a different objective of inter-class separation and intra-class clustering through orthogonalization in the feature space.

Generalizable Representations. Recent works explore the transferability of features learned via supervised training [62], FSL [51, 4, 14] and DG [19, 9] tasks. Tian et al. [51] establish a strong FSL baseline using only standard (non-episodic) supervised pre-training. Adaptation of supervised pre-trained models to the episodic evaluation setting of FSL tasks is explored in [61, 4]. Goldblum et al. [14] show the importance of margin-based regularization methods for FSL. Our work differs by building on orthogonality constraints to learn more transferable features, and is more compatible with CE as opposed to [14]. Multiple DG methods also explore constraints on feature spaces [19, 9] to boost cross-domain performance. In particular, [9] explores inter-class separation and intra-class clustering through contrastive and triplet loss functions. OPL improves on these while eliminating the need for compute-expensive and complex sample mining procedures.

Orthogonality. Orthogonality of kernels in DNNs is well explored with an aim to diversify the learned weight vectors [34, 56]. The idea of orthogonality is also used for disentangled representations, such as in [58], and to stabilize network training, since orthogonalization ensures energy preservation [8, 43]. Orthogonal weight initializations have also shown promise towards improving learning behaviours [37, 60]. However, all of these works operate in the parameter space. Notably, previous formulations that achieve orthogonality in the feature space generally depend on computing the singular value decomposition [27, 48], which can be numerically unstable, is difficult to estimate for rectangular matrices, and requires an iterative process [2]. In contrast, our orthogonality constraints are enforced in a novel manner, realized via decomposition of the sample-to-sample relationships within a mini-batch, while simultaneously avoiding tedious pair/triplet computations.

3 Proposed Method

Maximizing inter-class separation while enhancing intra-class compactness is highly desirable for classification. While the commonly used cross entropy (CE) loss encourages logits of the same class to be closer together, it does not enforce any margin amongst different classes. There have been multiple efforts to integrate max-margin learning with CE. For example, large-margin softmax [33] enforces inter-class separability directly on the dot-product similarity, while SphereFace [32] and ArcFace [7] enforce multiplicative and additive angular margins on the hypersphere manifold, respectively. Directly enforcing max-margin constraints to enhance discriminability in the angular domain is ill-posed and requires approximations [33]. Some works turn to the Euclidean space to enhance feature space discrimination. For example, centre loss [59] clusters penultimate layer features using Euclidean distance. Similarly, Affinity Loss [17] forms uniformly shaped, equi-distant class-wise clusters based upon Gaussian distances in Euclidean space. Margin-maximizing objective functions in Euclidean space are not ideally suited to work alongside CE loss, since CE seeks to separate output logits in the angular domain. By enforcing orthogonality constraints, our proposed OPL loss maximally separates intermediate features in the angular domain, thus complementing the cross-entropy loss, which enhances angular discriminability in the output space. In the following discussion, we revisit CE loss in the context of max-margin learning, and argue why OPL is ideally suited to supplement CE.

3.1 Revisiting Softmax Cross Entropy Loss

Consider a deep neural network $F$, which can be decomposed as $F = g \circ h$, where $h$ is the feature extraction module and $g$ is the classification module. Given an input-output pair $(x, y)$, let $f = h(x)$ be the intermediate features and $\hat{y} = F(x)$ be the output predictions. For brevity, let us define the classification module as a linear layer with no unit-biases, $g(f) = W^{\top} f$, where $W = [w_1, \ldots, w_C]$ holds the class-wise learnable projection vectors for $C$ classes. The traditional CE loss can then be defined in terms of the discrepancy between the predicted $\hat{y}$ and the ground-truth label $y$, by projecting the features $f$ onto the weight matrix $W$:

$$\mathcal{L}_{CE} = -\frac{1}{|B|} \sum_{i \in B} \log \frac{\exp(w_{y_i}^{\top} f_i)}{\sum_{c=1}^{C} \exp(w_c^{\top} f_i)} \qquad (1)$$
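For concreteness, a minimal PyTorch sketch of Eq. 1 with a bias-free linear classifier (all sizes and tensor names here are illustrative):

import torch
import torch.nn.functional as F

# Illustrative sizes: batch B, feature dimension D, C classes.
B, D, C = 128, 512, 100
f = torch.randn(B, D)                 # penultimate-layer features h(x)
W = torch.randn(D, C)                 # class-wise projection vectors w_1, ..., w_C
y = torch.randint(0, C, (B,))         # ground-truth labels

logits = f @ W                        # projection scores w_c^T f_i for every class
loss_ce = F.cross_entropy(logits, y)  # softmax CE over the projections (Eq. 1)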
Figure 2: Feature space visualization for CE vs OPL:

Inter-class orthogonality enforced by OPL can be observed in this MNIST 2-D feature visualization. We plot only three classes to better illustrate in 2D the inter-class orthogonality achieved. Normalization refers to projection of vectors to a unit-hypersphere in feature-space, and the normalized plot contains a histogram for each angle.

Since CE does not explicitly enforce any margin between each class pair, previous works have shown that the learned class regions for some class samples tend to be bigger compared to others [32, 7, 17]. To counter this effect and ensure all classes are equally separated, efforts have been made to introduce a margin between different classes by modifying the logit term $\cos(\theta_{y_i})$ as $\cos(\theta_{y_i} + m)$ for an additive angular margin [7], as $\cos(m \cdot \theta_{y_i})$ for a multiplicative angular margin [32], and as $\cos(\theta_{y_i}) - m$ for an additive cosine margin [55]. Gradient propagation for such margin-based softmax loss formulations is difficult, and previous works rely on approximations. Instead of introducing any margins to ensure uniform separation between different classes, our proposed loss function simply enforces all classes to be orthogonal to each other, with simultaneous clustering of within-class samples, using an efficient vectorized implementation with straightforward gradient computation and propagation.

By considering the vectors $w_c$ as individual class prototypes, the CE loss can be viewed as aligning the feature vectors $f_i$ along their relevant class prototypes. The cosine similarity in the form of the dot product ($w_c^{\top} f_i$) gives CE an intrinsic angular property, which is observed in Fig. 2, where features naturally separate in polar coordinates with CE only. Moreover, during standard SGD based optimization, the CE loss is applied on mini-batches. We note that there is no explicit enforcement of feature separation or clustering across multiple samples within the mini-batch. Since supervised training is commonly conducted with random mini-batch based iterations, we have the opportunity to enforce such constraints, and we explore within-mini-batch constraints aimed at augmenting the intrinsic discriminative characteristics of the CE loss.

3.2 Orthogonal Projection Loss

The CE loss with one-hot-encoded ground-truth vectors seeks to implicitly achieve orthogonality between different classes in the output space. Our proposed OPL augments the CE loss by enforcing class-wise orthogonality in the intermediate feature space. Given an input-output pair $(x_i, y_i)$ in the dataset $D$, let $f_i$ be the features output by an intermediate layer of the network. Our objective is to enforce constraints that cluster the features such that features of different classes are orthogonal to each other, while features of the same class are similar. To this end, we define a unified loss function that simultaneously ensures intra-class clustering and inter-class orthogonality within a mini-batch $B$ as follows:

$$s = \frac{\sum_{i,j \in B,\; y_i = y_j,\; i \neq j} \langle f_i, f_j \rangle}{\sum_{i,j \in B,\; y_i = y_j,\; i \neq j} 1} \qquad (2)$$

$$d = \frac{\sum_{i,k \in B,\; y_i \neq y_k} \langle f_i, f_k \rangle}{\sum_{i,k \in B,\; y_i \neq y_k} 1} \qquad (3)$$

$$\mathcal{L}_{OPL} = (1 - s) + |d| \qquad (4)$$

where $\langle \cdot\,, \cdot \rangle$ is the cosine similarity operator applied on two vectors, $|\cdot|$ is the absolute value operator, and $B$ denotes the mini-batch. Note that the cosine similarity operator used in Eq. 2 and 3 involves normalization of the features (projection onto a unit hypersphere) as follows:

$$\langle f_i, f_j \rangle = \frac{f_i^{\top} f_j}{\lVert f_i \rVert_2 \, \lVert f_j \rVert_2} \qquad (5)$$

where $\lVert \cdot \rVert_2$ refers to the $\ell_2$ norm operator. This normalization is key to aligning the outcome of OPL with the intrinsic angular property of CE loss.

In Eq. 4, our objective is to push $s$ towards 1 and $d$ towards 0. Since $s \leq 1$ already, we take the absolute value of $d$ given $d \in [-1, 1]$. This in turn restricts the overall loss such that $0 \leq \mathcal{L}_{OPL} \leq 3$. When minimizing this overall loss, the first term ensures clustering of same-class samples, while the second term ensures the orthogonality of different-class samples. The loss can be implemented efficiently in a vectorized manner at the mini-batch level, avoiding any loops (see Algorithm 1).

We further note that the relative contribution of each individual term in Eq. 4 can be controlled to re-prioritize between the two objectives of inter-class separation and intra-class compactness. While the unweighted combination of the $s$ and $d$ terms alone performs well, specific use-cases could benefit from weighted combinations. We reformulate Eq. 4 as follows:

$$\mathcal{L}_{OPL} = (1 - s) + \lambda \, |d| \qquad (6)$$

where $\lambda$ is the hyper-parameter controlling the relative weight of the two constraints.
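A minimal self-contained sketch of the weighted variant in Eq. 6; it mirrors Algorithm 1 below, with the illustrative lam argument added (the element-wise absolute value on the different-class term follows Algorithm 1, and the 1e-6 guards against degenerate batches):

import torch
import torch.nn.functional as F

def opl_weighted(features, labels, lam=1.0):
    # Weighted OPL (Eq. 6): L = (1 - s) + lam * |d|
    # features: (B, D); labels: (B, 1) integer class targets.
    features = F.normalize(features, p=2, dim=1)        # unit hypersphere (Eq. 5)
    mask = torch.eq(labels, labels.t()).float()         # (B, B): 1 where labels match
    eye = torch.eye(mask.shape[0], device=mask.device)
    mask_pos = mask * (1.0 - eye)                       # same class, excluding self-pairs
    mask_neg = 1.0 - mask                               # different class
    dot_prod = torch.matmul(features, features.t())
    s = (mask_pos * dot_prod).sum() / (mask_pos.sum() + 1e-6)            # Eq. 2
    d = torch.abs(mask_neg * dot_prod).sum() / (mask_neg.sum() + 1e-6)   # Eq. 3, element-wise |.|
    return (1.0 - s) + lam * d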

Since OPL acts only on intermediate features, we apply the cross-entropy loss over the outputs of the final classifier $g$. The overall loss used is a weighted combination of CE and OPL. We note that our proposed loss can also be used together with other common image classification losses, such as guided cross entropy or label smoothing, or even task-specific loss functions in different computer vision tasks. The overall loss can be defined as:

$$\mathcal{L} = \mathcal{L}_{CE} + \gamma \, \mathcal{L}_{OPL} \qquad (7)$$

where $\gamma$ is a hyper-parameter controlling the OPL weight.
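A sketch of a single training step under Eq. 7, assuming model returns penultimate features and classifier is the final linear layer (both names illustrative); opl_weighted is the Eq. 6 sketch above, and the default weights simply echo one well-performing cell of Table 8 under our γ/λ reading:

import torch
import torch.nn.functional as F

def train_step(model, classifier, optimizer, images, labels, gamma=1.0, lam=0.5):
    # Overall objective (Eq. 7): L = L_CE + gamma * L_OPL
    features = model(images)                   # penultimate-layer features (B, D)
    logits = classifier(features)              # class scores (B, C)
    loss = F.cross_entropy(logits, labels) \
           + gamma * opl_weighted(features, labels.view(-1, 1), lam=lam)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()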

import torch
import torch.nn.functional as F

def forward(features, labels):
    """
    features: tensor shaped (B, D), penultimate-layer features
    labels:   tensor shaped (B, 1), integer class targets
    """
    features = F.normalize(features, p=2, dim=1)    # project onto unit hypersphere
    # masks for same- and different-class feature pairs
    mask = torch.eq(labels, labels.t()).float()     # (B, B): 1 where classes match
    eye = torch.eye(mask.shape[0], device=mask.device)
    mask_pos = mask * (1.0 - eye)                   # same class, excluding self-pairs
    mask_neg = 1.0 - mask                           # different class
    # s & d calculation (Eq. 2 and 3)
    dot_prod = torch.matmul(features, features.t())
    pos_total = (mask_pos * dot_prod).sum()
    neg_total = torch.abs(mask_neg * dot_prod).sum()
    pos_mean = pos_total / (mask_pos.sum() + 1e-6)  # s; eps guards single-class batches
    neg_mean = neg_total / (mask_neg.sum() + 1e-6)  # d, with element-wise absolute value
    # total loss (Eq. 4)
    loss = (1.0 - pos_mean) + neg_mean
    return loss
Algorithm 1: PyTorch-style pseudocode for OPL
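As a quick sanity check of Algorithm 1 on random inputs (shapes illustrative): identical same-class features and orthogonal different-class features would drive the loss towards 0, and gradients flow directly through the dot products.

import torch

B, D = 8, 128
features = torch.randn(B, D, requires_grad=True)
labels = torch.randint(0, 3, (B, 1))
loss = forward(features, labels)  # Algorithm 1 above
loss.backward()                   # straightforward gradient propagation
print(float(loss))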
(a) Feature orthogonality ($\mathcal{L}_{OPL}$)
(b) Similarity of same class features ($s$)
(c) Similarity of different class features ($d$)
Figure 3: Feature Analysis: We compare feature orthogonality as measured by OPL and feature similarity as measured by cosine similarity, and plot their convergence during training. Feature similarity is initially high because all features are random immediately after initialization. OPL simultaneously enforces higher intra-class similarity and inter-class dissimilarity in comparison with the CE baseline.
(a) ‘CE-only’ trained features
(b) ‘CE+OPL’ trained features
Figure 4: Orthogonality Visualization: We present matrices illustrating the orthogonality of average per class features computed over the CIFAR-100 test set. See Appendix B.3 for more analysis.

3.3 Interpretation and Analysis

Overall Objective: Consider $F_c$, the set of mini-batch samples comprising normalized features of class $c$ in a given dataset $D$. The overall OPL constraints can be viewed as a minimization of the following objective to update the network parameters $\theta$ over pairs of random variables $(f_i, y_i), (f_j, y_j)$:

$$\min_{\theta} \; \mathbb{E}_{(f_i, y_i),\,(f_j, y_j) \sim D} \Big[ \, \big| \langle f_i, f_j \rangle \big| \cdot [\![\, y_i \neq y_j \,]\!] \, \Big] \qquad (8)$$

where $|\cdot|$ is the absolute value operator and $[\![\cdot]\!]$ is the Iverson bracket operator. We refer to the term defined in Eq. 8 as the expected inter-class orthogonality. The behaviour of OPL in terms of minimizing this expectation is visualized in Fig. 4, where average per-class feature vectors over the CIFAR-100 dataset are calculated for ResNet-56 models trained under 'CE-only' and 'CE+OPL' settings. Clear improvements over the CE baseline in terms of minimizing the expected inter-class orthogonality can be observed. Moreover, the stochastic mini-batch based application of OPL prevents naively pushing all non-diagonal values to zero. This translates to allowing necessary inter-class relationships not encoded in the one-hot labels of a dataset to be captured within the learned features.
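A sketch of how the block matrices of Fig. 4 can be reproduced, assuming a feature extractor model and a test loader yielding (images, integer labels); all names are illustrative:

import torch
import torch.nn.functional as F

@torch.no_grad()
def class_similarity_matrix(model, loader, num_classes, dim):
    # Accumulate the mean normalized feature per class over the test set.
    sums = torch.zeros(num_classes, dim)
    counts = torch.zeros(num_classes)
    for images, labels in loader:
        feats = F.normalize(model(images), p=2, dim=1)
        sums.index_add_(0, labels, feats)
        counts.index_add_(0, labels, torch.ones_like(labels, dtype=torch.float))
    means = F.normalize(sums / counts.clamp(min=1).unsqueeze(1), p=2, dim=1)
    # Entry (i, j) estimates the inter-class orthogonality |<f_i, f_j>| of Eq. 8.
    return torch.abs(means @ means.t())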

Decomposing OPL: Taking a step further, we decompose OPL into its sub-components, $s$ and $d$, as defined in Eqs. 2 and 3. While $s$ computes the pair-wise cosine similarity between all same-class features within the mini-batch, $d$ calculates the similarity between different-class features. These measures can be directly adopted to quantify the inter-class separation and intra-class compactness in any given feature space. Moreover, the unweighted OPL formulation in Eq. 4 can be considered a measure of the overall feature orthogonality within any given embedding space. It is interesting to compare the contribution of OPL towards inter-class separation and intra-class clustering of a feature space against the generic CE based training scenario. We present this comparison by training ResNet-56 on the CIFAR-100 dataset in Fig. 3. This separation of features achieved through OPL translates to performance improvements not only in the standard classification setting, but also in tasks requiring transferable or generalizable features. Goldblum et al. [14] explore the significance of inter-class separation and intra-class clustering for better performance when transferring a feature embedding to a few-shot learning task. Similar notions regarding discriminative features are explored in [9] for domain generalization. We explore the effects of OPL in few-shot learning settings, and visualize the novel class embeddings learned with OPL in Appendix B.1 using LDA [36] to preserve the inter-class to intra-class variance ratio, as suggested in [14].

Why orthogonality constraints?: One may wonder what benefit orthogonality in the feature space can provide in comparison to simply maximizing the margin between classes. Our reasoning is twofold: reducing sensitivity to batch composition and avoiding negative correlation constraints. Within the random mini-batch based training setting, the orthogonality objective provides a definitive geometric structure irrespective of the batch composition, while the optimal max-margin separation is dependent on the batch composition. Furthermore, in the common case where the output space feature dimension is at least the number of classes, maximizing the angular margin between normalized features on a unit hypersphere will lead to negative correlation among the class prototypes (considering a maximal and equi-angular separation). We argue that this is an undesired constraint, since the categorical classification task itself assumes the non-existence of ordinal relationships between classes (i.e., the use of orthogonal one-hot encoded labels). Moreover, extending the constraints to additionally cause negative correlation between classes unnecessarily focuses on already well-separated classes during training, whereas our constraint, which tends to ensure independence, provides a more balanced objective that disentangles the class-specific characteristics of even fine-grained classes.
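To make the negative-correlation point concrete: for $C$ unit-norm prototypes $w_1, \ldots, w_C$ in a maximal equi-angular arrangement (the regular simplex configuration, in which the prototypes sum to zero), the shared pairwise cosine $\cos\theta$ satisfies

$$0 = \Big\lVert \sum_{c=1}^{C} w_c \Big\rVert^2 = \sum_{c} \lVert w_c \rVert^2 + \sum_{c \neq c'} w_c^{\top} w_{c'} = C + C(C-1)\cos\theta \;\;\Longrightarrow\;\; \cos\theta = -\frac{1}{C-1}.$$

For example, with $C = 4$ classes this forces $\cos\theta = -1/3$ between every pair of prototypes, whereas OPL instead targets $\cos\theta = 0$.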

4 Experiments

We extensively evaluate our proposed loss function on multiple tasks including image classification (Tables 1 & 2), robustness against label noise (Table 3), robustness against adversarial attacks (Table 4) and generalization to domain shifts (Table 5). We further observe the enhanced transferability of orthogonal features in the case of few shot learning (Tables 6 & 7). Our approach shows consistent improvements and highlights the advantages of orthogonal features on this diverse set of tasks and datasets with various deep network backbones. Additionally, we demonstrate the plug-and-play nature of OPL by showing benefits of its use over CE, Truncated Loss (for noisy labels) [65], RSC [19] and various adversarial learning baselines.

4.1 Image Classification

We evaluate the effectiveness of orthogonal features in the penultimate layer on image classification using our proposed training objective (Eq. 7). Competitive results are achieved, showing consistent improvements (Tables 1 & 2) on two datasets: CIFAR-100 [22] and ImageNet [23].

CIFAR-100 consists of 60,000 natural images spread across 100 classes, with 600 images per class. We apply OPL over a cross-entropy baseline for supervised classification on CIFAR-100 (following the experiment setup in [47]), and in Table 1 compare our results against other loss functions which impose margin constraints [32, 33, 55, 7], introduce regularization [47, 27, 30, 6], or promote clustering [59, 64] to enhance separation among classes. Despite its simplicity, our method performs well against state-of-the-art loss functions. Note that HNC [47] is dependent on class activation maps, RBF [64] and LGM [53] involve learnable parameters, and CB Focal Loss [6] specifically targets class imbalance. In contrast, OPL has a simple formulation easily integrated into any network architecture, involves no learnable parameters, and targets general classification. Additionally, we note that OPL has higher performance gains with respect to top-1 accuracy (in comparison to top-5 accuracy), which is the more challenging metric. We attribute this to the fact that the increased separation through OPL mostly helps in classifying difficult samples. Further, we note top-5 is not a preferred measure for CIFAR-100, since most classes are different in nature, as opposed to e.g., ImageNet with several closely related classes (where our gain is much more pronounced for top-5 accuracy, as discussed next).


Loss | ResNet-56 Top-1 | ResNet-56 Top-5 | ResNet-110 Top-1 | ResNet-110 Top-5
Center Loss [59] | 72.72% | 93.06% | 74.27% | 93.20%
Focal Loss [30] | 73.09% | 93.07% | 74.34% | 93.34%
A-Softmax [32] | 72.20% | 91.28% | 72.72% | 90.41%
LMC Loss [55] | 71.52% | 91.64% | 73.15% | 91.88%
OLE Loss [27] | 71.95% | 92.52% | 72.70% | 92.63%
LGM Loss [53] | 73.08% | 93.10% | 74.34% | 93.06%
Anchor Loss [44] | - | - | 74.38% | 92.45%
AAM Loss [7] | 71.41% | 91.66% | 73.72% | 91.86%
CB Focal Loss [6] | 73.09% | 93.07% | 74.34% | 93.34%
HNC [47] | 73.47% | 93.29% | 74.76% | 93.65%
RBF [64] | 73.36% | 92.94% | - | -
CE (Baseline) | 72.40% | 92.68% | 73.79% | 93.11%
CE+OPL (Ours) | 73.52% | 93.07% | 74.85% | 93.32%
Table 1: CIFAR-100: These results indicate that a simple combination of cross-entropy along with our proposed orthogonal constraint gives improvements over the baseline loss function.

ImageNet is a standard large-scale dataset used in visual recognition tasks, containing roughly 1.2 million training images and 50,000 validation images. We experiment with OPL by integrating it into common backbone architectures used in image classification tasks: ResNet18 and ResNet50. Following the official PyTorch ImageNet example (https://github.com/pytorch/examples/tree/master/imagenet), we train the models for 90 epochs using SGD with momentum (initial learning rate 0.1, decayed by 10 every 30 epochs). Results for these experiments are presented in Table 2 and Fig. 5. We note that simply enforcing our orthogonality constraints increases the top-1 accuracy of ResNet50 from 76.15% to 76.98% without any additional bells and whistles. Moreover, given the large number of fine-grained classes among the 1000 categories of ImageNet (e.g., multiple dog species), which can be viewed as difficult cases, the more discriminative features learned by OPL obtain notable improvements in top-5 accuracy as well.
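The stated schedule corresponds to a standard step decay; a minimal sketch (train_step is the Eq. 7 sketch above; momentum and weight-decay values follow the defaults of the referenced PyTorch example, and all other names are illustrative):

import torch

def train_imagenet(model, classifier, train_loader, epochs=90):
    # SGD with momentum: lr 0.1, decayed by 10x every 30 epochs.
    optimizer = torch.optim.SGD(
        list(model.parameters()) + list(classifier.parameters()),
        lr=0.1, momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
    for _ in range(epochs):
        for images, labels in train_loader:
            train_step(model, classifier, optimizer, images, labels)  # Eq. 7 objective
        scheduler.step()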


Method | ResNet-18 Top-1 | ResNet-18 Top-5 | ResNet-50 Top-1 | ResNet-50 Top-5
CE (Baseline) | 69.91% | 89.08% | 76.15% | 92.87%
CE + OPL (Ours) | 70.27% | 89.60% | 76.98% | 93.30%
Table 2: Results on ImageNet: OPL gives an improvement over a cross-entropy (CE) baseline for common backbone architectures.
Figure 5: Qualitative Results: We present the top-5 predictions of OPL and CE for images where training with OPL fixes the incorrect prediction of the 'CE-only' model. See Appendix B.2.

4.2 Robustness against Label Noise

Given the rich representation capacity of deep neural networks, especially considering how most can even fit random labels or noise perfectly [63], errors in the sample labels pose a significant challenge for training. In most practical applications, label noise is almost impossible to avoid, in particular when it comes to large-scale datasets requiring millions of human annotations. Multiple works [13, 65] explore modifications to common objective functions aimed at building robustness to label noise. Despite the explicit inter-class separation constraints on the feature space enforced by OPL, we argue that the random mini-batch based optimization exploited by OPL mitigates the effects of noisy labels. This hypothesis is supported by our experiments presented in Table 3, which show the additional robustness of OPL against label noise. We simply integrate OPL over the approach followed in [65], without any task-specific modifications.

Dataset | Method | Uniform | Class Dependent
CIFAR-10 | TL [65] | 87.62% | 82.28%
CIFAR-10 | TL [65] + OPL | 88.45% | 87.02%
CIFAR-100 | TL [65] | 62.64% | 47.66%
CIFAR-100 | TL [65] + OPL | 65.62% | 53.94%
Table 3: Results on CIFAR-10/100 with Noisy Labels: We explore the effect of noisy labels when training with OPL for image classification tasks. We use the method in [65] as the baseline, with a 0.4 noise level and a ResNet18 backbone.
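For reference, a sketch of uniform (symmetric) label corruption at the 0.4 noise level stated above; this is our reconstruction of one common convention (flip to a uniformly random *other* class), and the exact protocol of [65] may differ in details:

import torch

def corrupt_labels_uniform(labels, num_classes, noise_rate=0.4):
    # Flip each label with probability noise_rate to a uniformly random other class.
    labels = labels.clone()
    flip = torch.rand(labels.shape) < noise_rate
    offset = torch.randint(1, num_classes, labels.shape)
    labels[flip] = (labels[flip] + offset[flip]) % num_classes  # never the true class
    return labels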

4.3 Robustness against Adversarial Attacks

Adversarial attacks modify a given benign sample by adding adversarial noise such that the deep neural network is deceived [50]. Adversarial examples are out-of-distribution samples and remain a challenging problem to solve. Adversarial training [35] has emerged as an effective defense, where adversarial examples are generated and added to the training set. We enforce orthogonality on such adversarial examples in the feature space while optimizing the model weights, and show our benefit on different adversarial training mechanisms [35, 18, 57]. It is important to note that the considered adversarial training schemes [35, 18, 57] are different in nature: training in Madry et al. [35] is based on cross-entropy only, Hendrycks et al. [18] propose to exploit pre-training, while Wang et al. [57] introduce a surrogate loss along with cross-entropy. Our orthogonality constraint helps maximize adversarial robustness in all cases, showing the generic, plug-and-play nature of our proposed loss. In order to have a reliable evaluation, we report robustness gains against Auto-Attack (AA) [5] in Table 4. On CIFAR10, our method increases the robustness of [35] by 5.11%, [18] by 0.81% and [57] by 2.13%.
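A sketch of how OPL plugs into one adversarial training step: the orthogonality constraint is simply added to the outer (defense) objective computed on adversarial features. The Madry-style PGD inner loop shown here is our reconstruction ([18, 57] use different surrogate objectives), and opl_weighted refers to the Eq. 6 sketch:

import torch
import torch.nn.functional as F

def adv_train_step(model, classifier, optimizer, x, y,
                   eps=8/255, alpha=2/255, steps=10, gamma=1.0):
    # Inner maximization: PGD attack on the CE objective.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        ce = F.cross_entropy(classifier(model(x_adv)), y)
        grad = torch.autograd.grad(ce, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)
    # Outer minimization: CE plus the OPL term on adversarial features (Eq. 7).
    feats = model(x_adv.detach())
    loss = F.cross_entropy(classifier(feats), y) \
           + gamma * opl_weighted(feats, y.view(-1, 1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()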

Dataset | Method | Clean | Adversarial
CIFAR-10 | Madry [35] | 87.14 | 44.04
CIFAR-10 | Madry [35] + OPL | 87.76 | 49.15
CIFAR-10 | Hendrycks [18] | 87.11 | 54.92
CIFAR-10 | Hendrycks [18] + OPL | 87.51 | 55.73
CIFAR-10 | MART [57] | 84.49 | 54.10
CIFAR-10 | MART [57] + OPL | 84.41 | 56.23
CIFAR-100 | Madry [35] | 60.20 | 20.60
CIFAR-100 | Madry [35] + OPL | 61.13 | 23.01
CIFAR-100 | Hendrycks [18] | 59.23 | 28.42
CIFAR-100 | Hendrycks [18] + OPL | 61.00 | 30.05
CIFAR-100 | MART [57] | 58.90 | 23.40
CIFAR-100 | MART [57] + OPL | 58.01 | 25.74
Table 4: OPL performance on Adversarial Robustness: We show the impact of enforcing orthogonality on the robust features. We adversarially train baseline methods [35, 18, 57] adding the OPL constraint during training. The robust features obtained with OPL lead to better accuracy and show clear improvements over the baselines. Top-1 accuracy is reported against Auto-Attack [5] in the white-box setting (the attacker has full knowledge of the model architecture and pretrained weights).

4.4 Domain Generalization (DG)

The DG problem aims to train a model using multi-domain source data such that it can directly generalize to new domains without retraining. We argue that the feature space constraints of OPL tend to capture more general semantic features in images, which generalize better across domains. This is verified by the performance improvements for DG that we obtain by integrating OPL with the state-of-the-art approach in [19] and evaluating on the popular PACS dataset [29]. The results presented in Table 5 indicate that integrating OPL with [19] sets a new state of the art across all four domains as compared to [19].

Method | Art | Cartoon | Sketch | Photo | Avg
JiGen [3] | 86.20 | 78.70 | 70.63 | 97.66 | 83.29
MASF [10] | 82.89 | 80.49 | 72.29 | 95.01 | 82.67
MetaReg [1] | 87.20 | 79.20 | 70.30 | 97.60 | 83.60
RSC [19] | 87.89 | 82.16 | 83.35 | 96.47* | 87.47
RSC + OPL | 88.28 | 84.64 | 84.17 | 96.83 | 88.48
Table 5: Results on PACS dataset: We integrate OPL with [19], gaining improvements for domain generalization tasks (*best replicated value).
Method | New Loss | CIFAR-FS 1-shot | CIFAR-FS 5-shot | mini 1-shot | mini 5-shot | tiered 1-shot | tiered 5-shot
MAML [12] | - | 58.90±1.9 | 71.50±1.0 | 48.70±1.84 | 63.11±0.92 | 51.67±1.81 | 70.30±1.75
PN [46] | - | 55.50±0.7 | 72.00±0.6 | 49.42±0.78 | 68.20±0.66 | 53.31±0.89 | 72.69±0.74
RN [49] | - | 55.00±1.0 | 69.30±0.8 | 50.44±0.82 | 65.32±0.70 | 54.48±0.93 | 71.32±0.78
Shot-Free [41] | - | 69.20±N/A | 84.70±N/A | 59.04±N/A | 77.64±N/A | 63.52±N/A | 82.59±N/A
MetaOptNet [26] | - | 72.60±0.7 | 84.30±0.5 | 62.64±0.61 | 78.63±0.46 | 65.99±0.72 | 81.56±0.53
RFS [51] | - | 71.45±0.8 | 85.95±0.5 | 62.02±0.60 | 79.64±0.44 | 69.74±0.72 | 84.41±0.55
RFS + OPL (Ours) | ✓ | 73.02±0.4 | 86.12±0.2 | 63.10±0.36 | 79.87±0.26 | 70.20±0.41 | 85.01±0.27
NAML [28] | ✓ | - | - | 65.42±0.25 | 75.48±0.34 | - | -
Neg-Cosine [31] | ✓ | - | - | 63.85±0.81 | 81.57±0.56 | - | -
SKD [40] | ✓ | 74.50±0.9 | 88.00±0.6 | 65.93±0.81 | 83.15±0.54 | 71.69±0.91 | 86.66±0.60
SKD + OPL (Ours) | ✓ | 74.94±0.4 | 88.06±0.3 | 66.90±0.37 | 83.23±0.25 | 72.10±0.41 | 86.70±0.27
Table 6: Few-Shot Learning Improvements: We obtain performance improvements using OPL over the RFS [51] and SKD [40] baselines, both using ResNet-12 backbones. Our loss is simply plugged into their supervised feature learning phase. Results reported for our experiments are averaged over 3000 episodic runs. Note that [40, 28, 31] are recent loss functions specific to FSL.
Dataset | CNAPs [42] | SUR [11] | SUR + OPL (Ours)
ImageNet | 52.3±1.0 | 56.4±1.2 | 56.5±1.1
Omniglot | 88.4±0.7 | 88.5±0.8 | 89.8±0.7
Aircraft | 80.5±0.6 | 79.5±0.8 | 79.6±0.7
Birds | 72.2±0.9 | 76.4±0.9 | 76.9±0.7
Textures | 58.3±0.7 | 73.1±0.7 | 72.7±0.7
Quick Draw | 72.5±0.8 | 75.7±0.7 | 75.7±0.7
Fungi | 47.4±1.0 | 48.2±0.9 | 50.1±1.0
VGG Flower | 86.0±0.5 | 90.6±0.5 | 90.9±0.5
MSCOCO | 42.6±1.1 | 52.1±1.0 | 52.0±1.0
MNIST | 92.7±0.4 | 93.2±0.4 | 94.3±0.4
CIFAR10 | 61.5±0.7 | 66.4±0.8 | 66.6±0.7
CIFAR100 | 50.1±1.0 | 57.1±1.0 | 57.6±1.0
Average | 67.0 | 71.4 | 71.9
Table 7: Results on Meta-Dataset: OPL is integrated with the SUR-PNF method of [11] in the Meta-Dataset 'train on all' setting. The Traffic Signs dataset has been omitted from comparisons due to an error in Meta-Dataset possibly affecting prior work.
γ \ λ | λ = 2 | λ = 1 | λ = 0.5
γ = 0.05 | 70.48 | 70.66 | 72.02
γ = 0.1 | 70.12 | 70.94 | 71.30
γ = 0.5 | 70.26 | 71.18 | 70.66
γ = 1 | 69.78 | 70.48 | 72.20
γ = 2 | 67.64 | 69.58 | 70.52
Table 8: Hyper-parameter search: We report the top-1 accuracy values on a held-out validation set of CIFAR-100 using a ResNet-56 backbone, after training using OPL with various pairs of the γ and λ hyper-parameters.

4.5 Few Shot Learning (FSL)

In this section, we explore the transferability of features learned with our loss function in relation to FSL tasks. We evaluate OPL on three benchmark few-shot classification datasets: miniImageNet, tieredImageNet, and CIFAR-FS. We run additional experiments on Meta-Dataset [52], a large-scale benchmark for evaluating FSL methods in more diverse and challenging settings. Similar to [42], we expand Meta-Dataset by adding three additional datasets: MNIST, CIFAR10, and CIFAR100. In light of work that shows the promise of learning strong features for FSL [51], we experiment with OPL as an auxiliary loss on the feature space during supervised training. Quantitative results highlighting the performance improvements are presented in Table 6. Our results on Meta-Dataset are obtained using the 'train on all' setting presented in [52]. We integrate OPL over the method presented in [11], train on the first 8 datasets of Meta-Dataset, and evaluate on the rest (including the three additional datasets from [42]). Results are presented in Table 7. See Appendix A.3 for the robustness of OPL features against sample noise in FSL tasks.

(a) Sensitivity to γ
(b) Variation with Batch Size
Figure 6: Ablative Study: OPL achieves consistent performance improvements against a CE-only baseline when evaluated on the CIFAR-100 dataset with a ResNet-56 backbone.

4.6 Ablative Study

OPL in its full form (Eq. 6) contains two hyper-parameters, and . We conduct a hyper-parameter search over a held-out validation set of CIFAR-100 (see Table 8). The optimum values selected from these experiments are kept consistent across all other tasks and used when reporting test performance. Furthermore, we evaluate the performance of OPL on the test split of CIFAR-100 for varying values keeping fixed, to illustrate the minimal sensitivity of our method to different values in Fig. 5(a).

Next, we consider how OPL operates on random mini-batches, and evaluate its performance against varying batch sizes (CIFAR-100 dataset). These results presented in Fig. 5(b) exhibit how OPL consistently

5 Conclusion

We present a simple yet effective loss function to enforce orthogonality on the output feature space and establish its performance improvements for a wide range of classification tasks. Our loss function operates in conjunction with the softmax CE loss and can be easily integrated with any DNN. We also explore a variety of characteristics of the features learned with OPL, illustrating its benefit for few-shot learning, domain generalization, and robustness against adversarial attacks and label noise. In the future, we hope to explore other variants of OPL, including its adaptation to unsupervised representation learning.

References

  • [1] Y. Balaji, S. Sankaranarayanan, and R. Chellappa (2018) MetaReg: towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems, Vol. 31. Cited by: Table 5.
  • [2] N. Bansal, X. Chen, and Z. Wang (2018) Can we gain more from orthogonality regularizations in training deep networks?. Advances in Neural Information Processing Systems 31, pp. 4261–4271. Cited by: §2.
  • [3] F. M. Carlucci, A. D'Innocente, S. Bucci, B. Caputo, and T. Tommasi (2019-06) Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 5.
  • [4] Y. Chen, X. Wang, Z. Liu, H. Xu, and T. Darrell (2020) A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390. Cited by: §2.
  • [5] F. Croce and M. Hein (2020) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In ICML, Cited by: §4.3, Table 4.
  • [6] Y. Cui, M. Jia, T. Lin, Y. Song, and S. J. Belongie (2019) Class-balanced loss based on effective number of samples. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9260–9269. Cited by: §4.1, Table 1.
  • [7] J. Deng, J. Guo, and S. Zafeiriou (2019) ArcFace: additive angular margin loss for deep face recognition. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4685–4694. Cited by: §1, §1, §3.1, §3, §4.1, Table 1.
  • [8] G. Desjardins, K. Simonyan, R. Pascanu, and K. Kavukcuoglu (2015) Natural neural networks. arXiv preprint arXiv:1507.00210. Cited by: §2.
  • [9] Q. Dou, D. C. Castro, K. Kamnitsas, and B. Glocker (2019) Domain generalization via model-agnostic learning of semantic features. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2, §3.3.
  • [10] Q. Dou, D. C. Castro, K. Kamnitsas, and B. Glocker (2019) Domain generalization via model-agnostic learning of semantic features. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: Table 5.
  • [11] N. Dvornik, C. Schmid, and J. Mairal (2020) Selecting relevant features from a universal representation for few-shot classification. arXiv preprint arXiv:2003.09338. Cited by: §4.5, Table 7.
  • [12] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research. Cited by: Table 6.
  • [13] A. Ghosh, H. Kumar, and P. S. Sastry (2017) Robust loss functions under label noise for deep neural networks. In AAAI, Cited by: §4.2.
  • [14] M. Goldblum, S. Reich, L. Fowl, R. Ni, V. Cherepanova, and T. Goldstein (2020-13–18 Jul) Unraveling meta-learning: understanding feature representations for few-shot tasks. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 3607–3616. External Links: Link Cited by: Figure 1, §2, §3.3.
  • [15] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio (2016) Deep learning. Vol. 1, MIT press Cambridge. Cited by: §1.
  • [16] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Cited by: §1, §1, §1.
  • [17] M. Hayat, S. Khan, S. W. Zamir, J. Shen, and L. Shao (2019) Gaussian affinity for max-margin class imbalanced learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6469–6479. Cited by: §1, §2, §3.1, §3.
  • [18] D. Hendrycks, K. Lee, and M. Mazeika (2019) Using pre-training can improve model robustness and uncertainty. In International Conference on Machine Learning, pp. 2712–2721. Cited by: §4.3, Table 4.
  • [19] Z. Huang, H. Wang, E. P. Xing, and D. Huang (2020) Self-challenging improves cross-domain generalization. In ECCV, Cited by: §2, §4.4, Table 5, §4.
  • [20] S. H. Khan, M. Hayat, M. Bennamoun, F. A. Sohel, and R. Togneri (2017) Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems 29 (8), pp. 3573–3587. Cited by: §2.
  • [21] S. Khan, M. Hayat, S. W. Zamir, J. Shen, and L. Shao (2019) Striking the right balance with uncertainty. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 103–112. Cited by: §2.
  • [22] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1097–1105. Cited by: §4.1.
  • [24] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1.
  • [25] E. Lee and C. Lee (2020) NeuralScale: efficient scaling of neurons for resource-constrained deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1478–1487. Cited by: §Appendix A.2, Table 2.
  • [26] K. Lee, S. Maji, A. Ravichandran, and S. Soatto (2019-06) Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 6.
  • [27] J. Lezama, Q. Qiu, P. Musé, and G. Sapiro (2018) Ole: orthogonal low-rank embedding-a plug and play geometric loss for deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8109–8118. Cited by: 2nd item, §2, §4.1, Table 1.
  • [28] A. Li, W. Huang, X. Lan, J. Feng, Z. Li, and L. Wang (2020) Boosting few-shot learning with adaptive margin loss. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12573–12581. Cited by: §2, Table 6.
  • [29] D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2017-10) Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §4.4.
  • [30] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2999–3007. Cited by: §2, §4.1, Table 1.
  • [31] B. Liu, Y. Cao, Y. Lin, Q. Li, Z. Zhang, M. Long, and H. Hu (2020) Negative margin matters: understanding margin in few-shot classification. In ECCV, Cited by: §2, Table 6.
  • [32] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6738–6746. Cited by: §1, §1, §1, §3.1, §3, §4.1, Table 1.
  • [33] W. Liu, Y. Wen, Z. Yu, and M. Yang (2016) Large-margin softmax loss for convolutional neural networks. International Conference on Machine Learning. Cited by: §1, §1, §1, §3, §4.1.
  • [34] W. Liu, Y. Zhang, X. Li, Z. Yu, B. Dai, T. Zhao, and L. Song (2017) Deep hyperspherical learning. In Advances in Neural Information Processing Systems, pp. 3953–3963. Cited by: §2.
  • [35] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §4.3, Table 4.
  • [36] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K. R. Mullers (1999) Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468), pp. 41–48. External Links: Document Cited by: §3.3.
  • [37] D. Mishkin and J. Matas (2015) All you need is a good init. arXiv preprint arXiv:1511.06422. Cited by: §2.
  • [38] C. Qi and F. Su (2017) Contrastive-center loss for deep neural networks. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 2851–2855. External Links: Document Cited by: §1, §1, §1.
  • [39] S. Rahman, S. Khan, and N. Barnes (2018) Polarity loss for zero-shot object detection. arXiv preprint arXiv:1811.08982. Cited by: §2.
  • [40] J. Rajasegaran, S. Khan, M. Hayat, F. S. Khan, and M. Shah (2020) Self-supervised knowledge distillation for few-shot learning. https://arxiv.org/abs/2006.09785. Cited by: Table 6.
  • [41] A. Ravichandran, R. Bhotika, and S. Soatto (2019) Few-shot learning with embedded class models and shot-free meta training. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: Table 6.
  • [42] J. Requeima, J. Gordon, J. Bronskill, S. Nowozin, and R. E. Turner (2019) Fast and flexible multi-task classification using conditional neural adaptive processes. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 7957–7968. Cited by: §4.5, Table 7.
  • [43] P. Rodríguez, J. Gonzalez, G. Cucurull, J. M. Gonfaus, and X. Roca (2016) Regularizing cnns with locally constrained decorrelations. arXiv preprint arXiv:1611.01967. Cited by: §2.
  • [44] S. Ryou, S. Jeong, and P. Perona (2019-10) Anchor loss: modulating loss scale based on prediction difficulty. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Table 1.
  • [45] F. Schroff, D. Kalenichenko, and J. Philbin (2015-06) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §1.
  • [46] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In NIPS, Cited by: Table 6.
  • [47] G. Sun, S. Khan, W. Li, H. Cholakkal, F. Khan, and L. Van Gool (2020) Fixing localization errors to improve image classification. ECCV. Cited by: §1, §1, §4.1, Table 1.
  • [48] Y. Sun, L. Zheng, W. Deng, and S. Wang (2017) Svdnet for pedestrian retrieval. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3800–3808. Cited by: 2nd item, §2.
  • [49] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 6.
  • [50] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §4.3.
  • [51] Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola (2020) Rethinking few-shot image classification: a good embedding is all you need?. arXiv preprint arXiv:2003.11539. Cited by: §Appendix A.3, Table 3, §Appendix B.1, §2, §4.5, Table 6.
  • [52] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P. Manzagol, and H. Larochelle (2019) Meta-dataset: A dataset of datasets for learning to learn from few examples. http://arxiv.org/abs/1903.03096 abs/1903.03096. Cited by: §4.5.
  • [53] W. Wan, Y. Zhong, T. Li, and J. Chen (2018) Rethinking feature distribution for loss functions in image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §4.1, Table 1.
  • [54] F. Wang, X. Xiang, J. Cheng, and A. Yuille (2017) NormFace: l2 hypersphere embedding for face verification. Proceedings of the 25th ACM international conference on Multimedia. Cited by: §1.
  • [55] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu (2018) CosFace: large margin cosine loss for deep face recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5265–5274. Cited by: §1, §1, §3.1, §4.1, Table 1.
  • [56] J. Wang, Y. Chen, R. Chakraborty, and S. X. Yu (2020) Orthogonal convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [57] Y. Wang, D. Zou, J. Yi, J. Bailey, X. Ma, and Q. Gu (2020) Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, External Links: Link Cited by: §4.3, Table 4.
  • [58] Y. Wang, D. Gong, Z. Zhou, X. Ji, H. Wang, Z. Li, W. Liu, and T. Zhang (2018) Orthogonal deep features decomposition for age-invariant face recognition. In ECCV, Cited by: §2.
  • [59] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In ECCV, Cited by: §1, §1, §1, §3, §4.1, Table 1.
  • [60] D. Xie, J. Xiong, and S. Pu (2017) All you need is beyond a good init: exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6176–6185. Cited by: §2.
  • [61] H. Ye, H. Hu, D. Zhan, and F. Sha (2020) Few-shot learning via embedding adaptation with set-to-set functions. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8805–8814. Cited by: §2.
  • [62] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27, pp. . External Links: Link Cited by: §2.
  • [63] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. External Links: Link Cited by: §4.2.
  • [64] X. Zhang, R. Zhao, Y. Qiao, and H. Li (2020) RBF-softmax: learning deep representative prototypes with radial basis function softmax. In ECCV, Cited by: §Appendix A.1, §2, §4.1, Table 1.
  • [65] Z. Zhang and M. R. Sabuncu (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18. Cited by: §4.2, Table 3, §4.
  • [66] B. Zhou, A. Khosla, Lapedriza. A., A. Oliva, and A. Torralba (2016) Learning Deep Features for Discriminative Localization.. CVPR. Cited by: §1.

Supplementary: Orthogonal Projection Loss

In this supplementary document we include:

  • Additional comparisons with the baseline on the MNIST dataset (Appendix A.1).

  • OPL performance with scalable neural architecture method [25] (Appendix A.2).

  • Robustness of OPL against noise in the input images (Appendix A.3).

  • Visualization of classification results (Appendix B).

Appendix A Experimentation

Here, we present results of experiments conducted using OPL on a set of additional settings.

Appendix A.1 Digit Classification (MNIST)

We conduct experiments on the MNIST dataset integrating OPL over a CE baseline. We use a 4-layer convolutional neural network with a 32-dimensional feature embedding (after a global average pooling operation), following the experimental setup in [64]. Our results are reported in Table 1. Additionally, we conduct experiments appending a fully-connected layer to reduce the feature dimensionality to 2, generating better visualizations of the behaviour of OPL in feature space (presented in Fig. 2 of the main article).
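A sketch of the kind of backbone described above: 4 convolutional layers followed by global average pooling into a 32-dimensional embedding, on which OPL is applied. The exact architecture of [64] may differ; channel widths here are illustrative:

import torch.nn as nn

class SmallConvNet(nn.Module):
    # 4 conv layers -> global average pool -> 32-d feature embedding for OPL.
    def __init__(self, feat_dim=32, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pool
        )
        self.classifier = nn.Linear(feat_dim, num_classes, bias=False)

    def forward(self, x):
        z = self.features(x).flatten(1)   # (B, 32) embedding, input to OPL
        return z, self.classifier(z)      # features and logits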

Method | 1st | 2nd | 3rd | Avg
CE (baseline) | 99.28% | 99.27% | 99.25% | 99.27%
CE+OPL (ours) | 99.58% | 99.56% | 99.61% | 99.58%
Table 1: Results on MNIST: OPL obtains improvements over the CE baseline on the MNIST dataset. Each experiment is replicated thrice and the average across runs is reported.

Appendix A.2 Scalable Architectures

We consider the recent neural architecture scaling approach proposed in [25] and plug OPL in on top of it to study scalability. Refer to Table 2 for the results.

Method | Backbone | Baseline [25] | [25] + OPL
NeuralScale [25] | ResNet18 | 77.59% | 77.81%
NeuralScale [25] | VGG11 | 67.42% | 67.69%
Table 2: Additional results on CIFAR-100: Performance improvements from integrating OPL into small scalable backbones for classification. Reported values are top-1 classification accuracies.

Appendix A.3 Robustness to Noise: FSL

We have already established through empirical evidence how OPL improves performance on few-shot learning tasks, as well as robustness to adversarial examples present during evaluation. We now explore the more challenging task of robustness to input sample noise in an FSL setting (similar to the one in Appendix B.1). The base training is conducted with no noise present in the training data. During evaluation, the support and query set images are corrupted with random Gaussian noise of varying standard deviation (referred to as σ). This can be considered a domain shift on top of the unseen novel classes during evaluation. The features learned with OPL during base training exhibit better robustness to such input corruptions in this FSL setting. We report these results in Table 3. The experiments follow the method in [51] integrated with OPL.
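A sketch of the evaluation-time corruption described above; the clamping convention (images assumed in [0, 1]) is our assumption:

import torch

def add_gaussian_noise(images, sigma):
    # Corrupt support/query images with zero-mean Gaussian noise of std sigma.
    return (images + sigma * torch.randn_like(images)).clamp(0.0, 1.0)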

Method | Noise | CIFAR-FS 1-shot | CIFAR-FS 5-shot | mini 1-shot | mini 5-shot | tiered 1-shot | tiered 5-shot
RFS [51] | (σ₁) | 63.30±0.39 | 80.36±0.28 | 55.98±0.37 | 74.46±0.27 | 66.54±0.43 | 82.92±0.29
RFS [51] + OPL | (σ₁) | 65.42±0.40 | 81.41±0.30 | 56.21±0.36 | 73.20±0.29 | 66.60±0.41 | 83.21±0.29
RFS [51] | (σ₂) | 68.32±0.38 | 84.34±0.27 | 60.22±0.36 | 77.45±0.27 | 68.65±0.41 | 83.12±0.27
RFS [51] + OPL | (σ₂) | 71.05±0.41 | 84.46±0.28 | 61.70±0.37 | 77.59±0.27 | 69.60±0.40 | 84.50±0.29
Table 3: Additional FSL Experiments: We explore the robustness of models to noise (random Gaussian noise of varying standard deviation is added to input images) in FSL setting. Models trained with our proposed OPL loss are significantly more robust compared to the cross-entropy only baseline in [51].
Figure 1: LDA visualization for CE vs OPL in FSL setting: Training with OPL increases separation of features in both base and novel classes when applied in a few-shot learning setting. LDA has been used following the insights in [14].

Appendix B Visualization

Figure 2: Visualization of Images: We show images where OPL predicts the correct class but CE fails.

In this section, we present additional visualizations exploring various aspects of OPL and its performance.

Appendix B.1 Class Embeddings

Consider a few-shot learning setting, where a model trained in a fully-supervised manner (referred to as the base model / base training) on a set of selected classes with training labels (referred to as base classes) is later evaluated on a set of unseen classes (referred to as novel classes). The sets of base and novel classes are disjoint. The evaluation protocol involves episodic iterations: in each step, a small set of labelled samples from the novel classes (referred to as the support set) is available during inference, and another set from those same novel classes (referred to as the query set) is used for calculating the accuracy metrics.

Given that our proposed loss explicitly enforces constraints on the feature space during base training, we want to examine whether the additional discriminative nature endowed on the features by OPL is aware of higher-level semantics. To evaluate this, we explore the more challenging task of inter-class separation and intra-class clustering of novel classes, which are unseen during base training. We train a model following the approach in [51] integrating OPL, and visualize the separation of different class features for both base and novel classes in Fig. 1.

Appendix B.2 ImageNet Examples

We further explore the performance of our model (CE+OPL) trained on ImageNet (the model used for the experiments presented in Table 2 of the main paper) by examining the failure cases of the baseline model that were improved upon when adding OPL. Visualizations of some randomly selected cases are illustrated in Fig. 4 and Fig. 2.

Figure 3: Orthogonality Visualization: The diagram (an enlarged and elaborated version of Fig. 4(b) in the main paper) visualizes the cosine similarity between each pair of per-class feature vectors extracted from an OPL-trained ResNet-56 on the CIFAR-100 test set. Each per-class feature vector is calculated by averaging over the features of all samples belonging to that class within the test set. We analyse the relationships for two randomly selected classes, dolphin and pear. Consider the similarity of the dolphin class column (label highlighted in blue). In general, it has low similarity with the other classes, except in 3 instances. Two of those, shark and otter (pink arrows), align with our heuristics on the similarity of those categories. The similarity to the oak tree category can be attributed to some correlation present within the test-set images of these two classes (e.g., both contain large blue portions: ocean for dolphin and sky for oak tree). Now, consider pear (label highlighted in green), which has an average similarity to most other classes except two: tank and shark (labels highlighted in green; tank in CIFAR-100 is the military vehicle). These two classes have relatively lower similarity with the pear class, as seen from the diagram (pink lines and pink arrow), which again aligns with our intuition about the relationships between these categories. Overall, we note that the orthogonality constraints we enforce on the feature space through OPL allow room for learning hidden inter-class relationships which can be interpreted meaningfully, in comparison to the same relationships for the CE baseline.
Figure 4: Visualization of Classification Results: We show some examples of images where OPL predicts the correct class but CE fails.

Appendix B.3 Block Matrix

We defined the overall objective of OPL as a minimization of the expected inter-class orthogonality (refer to Eq. 8 of the main paper) and conducted empirical analysis using models trained with our proposed loss function against a CE-only baseline (illustrated in Fig. 4 of the main paper). In this section, we conduct additional analysis on those block matrices to further understand the outcomes of our orthogonality constraints on the learned feature space. It is interesting to note that while OPL enforces a higher degree of orthogonality between the average class vectors, it does not naively push everything to be orthogonal. We note that this allows any hidden knowledge learned during the training process (information not captured explicitly in the labels) to remain captured within the features. The results of the experiments conducted on this are illustrated in Fig. 3.