Max-margin Class Imbalanced Learning with Gaussian Affinity

01/23/2019 ∙ by Munawar Hayat, et al. ∙ 0

Real-world object classes appear in imbalanced ratios. This poses a significant challenge for classifiers which get biased towards frequent classes. We hypothesize that improving the generalization capability of a classifier should improve learning on imbalanced datasets. Here, we introduce the first hybrid loss function that jointly performs classification and clustering in a single formulation. Our approach is based on an ‘affinity measure’ in Euclidean space that leads to the following benefits: (1) direct enforcement of maximum margin constraints on classification boundaries, (2) a tractable way to ensure uniformly spaced and equidistant cluster centers, (3) flexibility to learn multiple class prototypes to support diversity and discriminability in feature space. Our extensive experiments demonstrate significant performance improvements on visual classification and verification tasks on multiple imbalanced datasets. The proposed loss can easily be plugged into any deep architecture as a differentiable block and demonstrates robustness against different levels of data imbalance and corrupted labels.




1 Introduction

Deep neural networks are data hungry in nature and require large amounts of data for successful training. For imbalanced datasets, where several (potentially important) classes have a scarce representation, the learned models are biased towards highly abundant classes. This is because the scarce classes have fewer representations during training, which results in a mismatch between the joint distribution models for the training and test sets. This leads to lower recall rates for rare classes, which are otherwise critically desirable in numerous scenarios. As an example, a malignant lesion is rare compared to benign ones, but should not be misclassified.

Figure 1: Affinity Loss integrates classification and clustering in a single objective. Its flexible formulation in Euclidean space allows enforcing a margin between classes and gives control over the learned clusters, the number of class prototypes and the distance between class prototypes. Such max-margin learning greatly helps in overcoming class imbalance by learning balanced classification regions and generalizable class boundaries.

The soft-max loss is a popular choice for conventional recognition tasks. However, through extensive experiments, we show that it is less suitable to handle mismatch between train and test distributions. This is partly due to no direct enforcement of margins in the classification space and the lack of a principled approach to control intra-class variations and inter-class separation. Here, we propose that max-margin learning can improve generalization, which can help mitigate classifier bias towards more frequent classes by learning balanced representations for all classes. Notably, some recent efforts focus on introducing max-margin constraints within the soft-max loss function [10, 33, 32]. Since soft-max loss computes similarities in the angular domain (vector dot-product or cosine similarity), direct enforcement of angular margins is ill-posed and existing works either involve approximations or make restricting assumptions (e.g., points lying on a hypersphere).

In this paper, we propose a novel loss formulation that enhances generalization by jointly reducing intra-class variations and maximizing inter-class distances. A notable difference from the previous works is the automatic learning of class representative prototypes in the Euclidean space with inherent flexibility to enforce certain geometric constraints on the learned prototypes. This is in contrast to soft-max loss where more abundant classes tend to occupy additional space in the projected feature space and rare classes get a skewed representation. The proposed objective is named the ‘Affinity loss function’ as it is based on a Gaussian similarity metric defined in terms of Bregman divergence. The proposed loss formulation learns to map input images to a highly discriminative Euclidean space where the distance to class representative prototypes provides a direct similarity measure for each class. The class prototypes are key points in the embedding space around which feature points are clustered.


The affinity loss function promotes the classifier to have a simpler, balanced and more generalizable inductive bias during training. The proposed loss function thus provides the following advantages:

  • An inherent mechanism to jointly cluster and classify feature vectors in the Euclidean space.

  • A tractable way to ensure uniformly spaced and equidistant class prototypes (when embedding dimension and prototype number are related as: ).

  • Along with uniformly spaced prototypes, our formulation ensures that the clusters formed around the prototypes are uniformly shaped (in terms of second order moments).

  • The resulting classifier shows robustness against different levels of label noise and imbalance amongst classes.

The proposed loss function is a differentiable module which is applicable to different network architectures, and complements the commonly deployed regularization techniques including dropout, weight decay and momentum. Through extensive evaluations on a number of datasets, we demonstrate that it achieves a highly balanced and generalizable classifier, leading to significant improvements over previous techniques.

2 Related Work

Class-imbalanced Learning: Imbalanced datasets exhibit complex characteristics and learning from such data requires designing new techniques and paradigms. The existing class imbalance approaches can be divided into two main categories, 1) data-level, and 2) algorithm-level approaches. The data-level schemes modify the distribution of data e.g., by oversampling the minority classes [41, 7, 14, 15, 21] or undersampling the majority classes [25, 3]. Such approaches are usually susceptible to redundancy and over-fitting (for over-sampling) and critical information loss (for under-sampling). In comparison, the algorithm-level approaches improve the classifier itself e.g., through cost-sensitive learning. Such methods incorporate prior knowledge about classes based upon their significance or representation in the training data [26, 38, 23]. These methods have been applied to different classifiers including SVMs [48], decision trees [61] and boosting [49]. Some works further explore ensembles of cost-sensitive classifiers to tackle imbalance [19, 24]. A major challenge associated with these cost-sensitive methods is that the class-specific costs are only defined at the beginning, and they lack mechanisms to dynamically update the costs during the course of training.

Deep Imbalanced Learning: Some recent attempts have been made to learn deep models from imbalanced data [20, 23, 5, 52, 36]. For example, the method in [20] first learns to undersample the training data using a neural network, followed by a Synthetic Minority Oversampling TEchnique (SMOTE) based technique to re-balance the data. Deep models are trained to directly optimize the imbalanced classification accuracy in [52, 36]. Wang et al. [53] propose a meta learning approach to progressively transfer the model parameters from majority towards less-frequent classes. Some works [23, 5] train cost-sensitive deep networks, which alternately optimize class costs and network weights. Continually determining class costs while training a deep model is still an open and challenging research problem, and makes optimization intractable when learning from large scale datasets [18].

Joint Loss Formulation: Popular loss functions used for classification in deep networks include hinge loss, soft-max loss, Euclidean loss and contrastive loss [22]. A triplet loss could simultaneously perform recognition and clustering, however its training is prohibitive due to the huge number of triplet combinations on large-scale datasets [40]. Since these loss functions are limited in their capability to achieve discriminability in feature space, recent literature explores the combination of multiple loss functions. To this end, [44] showed that the combination of soft-max and contrastive losses concurrently enforces intra-class compactness and inter-class separability. Along similar lines, [54] proposed ‘center loss’ that uses separate objectives for classification and clustering.

Max-margin Learning: Margin-maximizing learning objectives have been traditionally used in machine learning. Hinge loss in Support Vector Machines is one of the pioneering max-margin learning frameworks [16]. Some recent works aim to integrate max-margin learning with the cross-entropy loss function. Among these, Large-margin soft-max [33] enforces inter-class separability directly on the dot-product similarity while SphereFace [32] and ArcFace [10] enforce multiplicative and additive angular margins on the hypersphere manifold, respectively. The hypersphere assumption for feature space makes the resulting loss less generalizable to applications other than face recognition. Furthermore, enforcing margin based separation in the angular domain is an ill-posed problem and requires either approximations or assumptions (e.g., unit sphere) [12]. This paper proposes a new flexible loss function which simultaneously performs clustering and classification, and enables direct enforcement of the max-margin constraints. We describe the proposed loss formulation next.

3 Max-margin Framework

We propose a hybrid multi-task formulation to perform learning on imbalanced datasets. The proposed formulation combines classification and clustering in a single objective that minimizes intra-class variations while simultaneously achieving maximal inter-class separation. We first explain why traditional Soft-max Loss (SL) is unsuitable for large-margin learning and then introduce our novel objective function.

3.1 Soft-max Loss

Given an input-output pair $\{x_i, y_i\}$, a deep neural network transforms the input to a feature space representation $f_i$ using a function $\mathcal{F}$ parameterized by $\theta$, i.e., $f_i = \mathcal{F}(x_i; \theta)$. The soft-max loss can then compute the discrepancy between prediction and ground-truth in the label space as follows:

$$L_{sm} = \frac{1}{N} \sum_{i} -\log \left( \frac{\exp(w_{y_i}^T f_i)}{\sum_{j} \exp(w_j^T f_i)} \right), \qquad (1)$$

where $i \in [1, N]$, $j \in [1, C]$, and $N$ and $C$ are the number of training examples and classes respectively. It is worth noting that we have included the last fully connected layer in the definition of soft-max loss, which will be useful in further analysis. Also, for the sake of brevity, we do not mention unit biases in Eq. 1.

Although soft-max loss is one of the most popular choices for multi-class classification, in the following discussion, we argue that it is not suitable for class imbalanced learning due to several limitations.

Limitations of SL:

The loss function in Eq. 1 computes the inner vector product $\langle w_j, f_i \rangle$, which measures the projection of the feature representation $f_i$ onto each of the class vectors $w_j$. The goal is to perfectly align $f_i$ with the correct class vector $w_{y_i}$ such that the data likelihood is maximized. Due to the reliance of the soft-max loss on vector dot-product, it has the following limitations:

  • No inherent mechanism to ensure max margin constraints. Computation of inter-class margin for soft-max loss is intractable [12]. Large margin constraints promote better generalization in imbalanced distributions and robustness against input perturbations [9].

  • The learned projection vectors are not necessarily equi-spaced in the classification space. That is, ideally the angle between closest projection vectors should be equal (e.g., $2\pi/C$ in 2D where $C$ is the number of classes). However, in practice the projection vectors for majority classes occupy more angular space compared with minority classes. This has been visualized in Fig. 2 on the imbalanced MNIST dataset, and leads to poor generalization to test samples.

  • The length of the learnt projection vectors for different classes is not necessarily the same. It has been shown in the literature that the minority class projection vectors are weaker (i.e., with less magnitude) compared with the majority classes [33]. Cost-sensitive learning which artificially augments the magnitude of the minority class projection vectors has been shown to be effective for imbalance learning [23].

Unsuitability of SL for Imbalanced Learning:

We attribute the above limitations to the lack of directly enforced max-margin constraints on the classification boundaries. Considering the definition of soft-max loss (Eq. 1) in terms of dot-products $w_j^T f_i$, we can simplify the expression as follows:

$$L_{sm} = \frac{1}{N} \sum_{i} \log \Big( 1 + \sum_{j \neq y_i} \exp(w_j^T f_i - w_{y_i}^T f_i) \Big). \qquad (2)$$

The decision boundary for a class pair $\{j, k\}$ is given by the case where $w_j^T f_i = w_k^T f_i$, i.e., the class boundaries are shared between the pair of classes. Further, minimization of $L_{sm}$ requires $w_{y_i}^T f_i > w_j^T f_i$ for all $j \neq y_i$ for correct class assignment to $f_i$. This is a ‘relative constraint’ and therefore the soft-max loss does not necessarily: (a) reduce intra-class variations, (b) enforce a margin between each class pair. To address these issues, we propose our new loss function next.
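The equivalence between the two forms of the soft-max loss, and the ‘relative’ nature of its constraint, can be checked numerically. The sketch below is illustrative only: it assumes the standard soft-max cross-entropy form for a single sample, and the function names and toy dot-product values are hypothetical.

```python
import math

def softmax_loss(scores, label):
    """-log soft-max probability of the true class for one sample."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]

def softmax_loss_simplified(scores, label):
    """Same loss written with dot-product differences only."""
    s_true = scores[label]
    return math.log1p(sum(math.exp(s - s_true)
                          for j, s in enumerate(scores) if j != label))

# Hypothetical dot-products w_j^T f_i for one sample and three classes.
scores = [2.0, 0.5, -1.0]
loss_a = softmax_loss(scores, 0)
loss_b = softmax_loss_simplified(scores, 0)

# Shifting every dot-product by the same constant leaves the loss unchanged,
# so only relative scores matter and no absolute margin is enforced.
shifted = [s + 10.0 for s in scores]
loss_shifted = softmax_loss(shifted, 0)
print(loss_a, loss_b, loss_shifted)
```

Both forms return the same value, and the loss is invariant to a common shift of all class scores, which is one way to see why the soft-max loss by itself places no absolute requirement on intra-class compactness.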

Figure 2: 2D feature space projections in terms of penultimate layer activations. The model is trained on imbalanced MNIST data (by retaining only 10% of the samples for digits 0-4) using different losses: (a) soft-max loss learns floral petals in angular space; note that the minority class feature vectors are weaker (shorter in length) and occupy less angular space. (b) center loss reduces intra-class variations by performing clustering. However, the minority class vectors tend to be congested near the center and are confused with each other. (c) the proposed affinity loss learns equi-spaced clusters of uniform shapes for both majority and minority classes.

3.2 Max-margin Learning with Hybrid Objective

Euclidean space similarity measure: Instead of computing similarities with class prototypes using vector dot-product, we propose to measure class similarities for an input feature in the Euclidean space using a Gaussian similarity measure in terms of Bregman divergence (squared $\ell_2$ distance):

$$d(f_i, w_j) = \exp\left( -\frac{\|f_i - w_j\|^2}{\sigma} \right), \qquad (3)$$

where $\sigma$ denotes a weighting parameter. This provides us with: (a) the flexibility to directly enforce margin maximizing constraints, (b) equi-spaced classification boundaries for multiple classes, (c) control over the variance of learned clusters, thereby enhancing intra-class compactness, (d) the freedom to use standard distance measures in the Euclidean domain to measure similarity and, most importantly, (e) simultaneous classification and clustering in a single objective function.
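A minimal sketch of such a Gaussian similarity and of prototype-based classification is given below. It assumes the similarity takes the form exp(-||f - w||^2 / sigma); the exact placement of the weighting parameter, the function names and the toy prototypes are assumptions made for illustration.

```python
import math

def gaussian_similarity(f, w, sigma=1.0):
    """exp(-||f - w||^2 / sigma); the placement of sigma is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(f, w))
    return math.exp(-sq_dist / sigma)

def most_similar_class(f, prototypes, sigma=1.0):
    """Index of the prototype with the highest similarity to f."""
    sims = [gaussian_similarity(f, w, sigma) for w in prototypes]
    return max(range(len(sims)), key=lambda j: sims[j])

# Hypothetical 2D prototypes for three classes and one feature vector.
prototypes = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
feature = [3.5, 0.5]
predicted = most_similar_class(feature, prototypes)
print(predicted)
```

Because the similarity depends only on the distance to a prototype, the same quantity serves both as a class score and as a clustering criterion, and a larger sigma makes the similarity decay more slowly with distance.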

Proposition 1.

The proposed Gaussian similarity function is a valid similarity metric for any real-valued inputs.


The real-valued similarity function will define a valid similarity metric if it satisfies the following conditions [30]:

  • Non-negativity:

  • Symmetry:

  • Equivalence: iff

  • Self-similarity:

  • Triangular similarity:

Since all of the above conditions hold for the proposed similarity function, it is a valid similarity metric. ∎

Relation between Dot-product and Gaussian Similarity: The proposed Gaussian similarity measure is related to the dot-product as follows:


Intuitively, the above relation implies the dependence of soft-max loss on the scale/magnitude of feature vectors and class prototypes. It leads to two conclusions: (1) The Gaussian similarity is bounded between 0 and 1 since the squared distance is non-negative, while the dot-product can have large magnitudes. (2) The Gaussian measure can be considered as an inverse chord distance when magnitudes of vectors are normalized to be equal. The dot product in that case is directly proportional to the Gaussian similarity and both similarity measures will behave similarly if no additional constraints are included in our proposed similarity measure. However, the main flexibility of our formulation is the explicit introduction of margin constraints, which we describe next.
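The relation between the two similarity measures can be verified numerically. The sketch below assumes the Gaussian similarity has the form d = exp(-||f - w||^2 / sigma), in which case expanding the squared distance gives <f, w> = (||f||^2 + ||w||^2) / 2 + (sigma / 2) log d. The helper names and toy vectors are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gaussian_similarity(f, w, sigma):
    sq_dist = sum((x - y) ** 2 for x, y in zip(f, w))
    return math.exp(-sq_dist / sigma)

def dot_from_similarity(f, w, sigma):
    """Recover <f, w> from the Gaussian similarity and the vector norms."""
    d = gaussian_similarity(f, w, sigma)
    return 0.5 * (dot(f, f) + dot(w, w)) + 0.5 * sigma * math.log(d)

f, w, sigma = [1.0, 2.0], [0.5, -1.0], 2.0
lhs = dot(f, w)                         # direct dot-product
rhs = dot_from_similarity(f, w, sigma)  # via the Gaussian similarity
print(lhs, rhs)
```

The norm terms in this identity are what make the dot-product sensitive to the magnitudes of the feature and prototype vectors, whereas the Gaussian similarity depends on their distance alone.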

Enforcing margin between classes: Note that some variants of soft-max loss introduce angle based margin constraints [32, 10], however, the margins in angular domain are computationally expensive and implemented only as approximations due to intractability. Our formulation allows a more direct margin penalty in the loss function. The proposed max margin loss function based on Eq. 3 is given by,


where is the similarity of the sample with its true class, is its similarity with other classes, and is the enforced margin.
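One way to realize such a margin penalty is a hinge over similarities, sketched below for a single sample. The hinge form, the summation over all non-true classes, the function name and the toy values are assumptions made for illustration rather than the exact formulation.

```python
def max_margin_loss(similarities, label, margin):
    """Hinge penalty summed over all non-true classes for one sample."""
    s_true = similarities[label]
    return sum(max(0.0, margin + s - s_true)
               for j, s in enumerate(similarities) if j != label)

# Hypothetical similarities of one sample to three class prototypes.
well_separated = max_margin_loss([0.9, 0.2, 0.1], label=0, margin=0.5)
too_close = max_margin_loss([0.4, 0.3, 0.1], label=0, margin=0.5)
print(well_separated, too_close)
```

Under this form, a sample contributes no loss once its true-class similarity exceeds every other class similarity by at least the margin, which is an absolute rather than a relative requirement.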

Uniform classification regions: The soft-max loss does not ensure uniform classification regions for all classes. As a result, undersampled minority classes get a shrunken representation in the feature space compared to more frequent classes. To ensure equi-distant weight vectors, we propose to apply a regularization on the learned class weights. This regularizer is termed a ‘diversity regularizer’ as it enforces all class centers ($w$) to be uniformly spread out in the feature space. The diversity regularizer is formally defined as follows:


where is the mean distance between all class prototypes.
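A regularizer of this kind can be sketched as the mean squared deviation of pairwise prototype distances from their mean. The use of squared Euclidean distances, the averaging over unordered pairs and the function names are assumptions made for illustration.

```python
from itertools import combinations

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def diversity_regularizer(prototypes):
    """Mean squared deviation of pairwise distances from their mean."""
    dists = [squared_distance(a, b) for a, b in combinations(prototypes, 2)]
    mu = sum(dists) / len(dists)  # mean pairwise distance
    return sum((d - mu) ** 2 for d in dists) / len(dists)

# Equidistant prototypes (vertices of a simplex) incur no penalty ...
simplex = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# ... whereas unevenly spread prototypes are penalized.
uneven = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
print(diversity_regularizer(simplex), diversity_regularizer(uneven))
```

The penalty vanishes exactly when all pairwise distances are equal, so minimizing it pushes the prototypes towards an equidistant configuration.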

Multi-centered learning: For challenging classification problems, the feature space may be partitioned such that all samples belonging to the same class are not co-located in a single region. Therefore, clustering all same class samples with a single prototype (class center) will not be optimal in such cases. To resolve this limitation, we introduce a novel multi-centered learning paradigm based on our max-margin framework. Instead of learning a single projection vector for each class, the proposed framework enables learning multiple projection vectors per-class. Specifically, we can learn projection vectors per class, where similarity of a feature vector with a class is given by:


Max-margin loss is then defined similar to Eq. 6 above. The overall loss function therefore becomes:


The diversity regularizer for the multi-center case is enforced on the similarity between all prototypes.
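The multi-centered similarity can be illustrated with a short sketch. It assumes that the class similarity is the average of the Gaussian similarities to each of the class prototypes; this aggregation rule, the form of the Gaussian similarity and all names below are assumptions.

```python
import math

def gaussian_similarity(f, w, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(f, w))
    return math.exp(-sq_dist / sigma)

def multi_center_similarity(f, class_prototypes, sigma=1.0):
    """Average similarity of f to all prototypes of one class (assumed form)."""
    sims = [gaussian_similarity(f, w, sigma) for w in class_prototypes]
    return sum(sims) / len(sims)

# Hypothetical class whose samples form two distant clusters.
class_prototypes = [[0.0, 0.0], [10.0, 0.0]]
near_first = multi_center_similarity([0.0, 0.0], class_prototypes)
near_second = multi_center_similarity([10.0, 0.0], class_prototypes)
far_away = multi_center_similarity([5.0, 8.0], class_prototypes)
print(near_first, near_second, far_away)
```

A sample close to either prototype receives a high class similarity, so samples of the same class need not be co-located in a single region of the feature space.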

4 Experiments

Figure 3: Data Imbalance due to long-tail distribution.

To demonstrate the effectiveness of the proposed affinity loss, we perform experiments on datasets which exhibit natural imbalance. These include the Dermofit Image Library (DIL) for skin lesion classification and large scale image datasets for facial verification. We further extensively evaluate various components of the proposed approach by systematically generating imbalance and introducing different levels of label noise. Through these empirical evaluations, we provide evidence of the robustness of the proposed method against different data imbalance levels and noisy training labels. A brief description of the evaluated datasets is presented next.

4.1 Datasets

Skin Melanoma Dataset (DIL): Edinburgh Dermofit Image Library (DIL) contains images belonging to skin lesion categories including melanomas, seborrhoeic keratosis and basal cell carcinomas. The images are based upon diagnosis from dermatologists and dermatopathologists. The number of images vary amongst categories (between and , mean , median ), and show significant imbalance, with of all images belonging to only top two classes (Fig. 3). Similar to [2], we perform two experiments, considering five and ten class splits respectively, and report results for 3-fold cross validation.

Face Recognition: Datasets used to train large scale face recognition models have natural imbalance. This is because the data is web-crawled, and images for some identities are easily available in abundance compared with others. For unconstrained face recognition, we train our model on VGG2 [4], which is a large scale dataset with inherent class imbalance. We evaluate the trained network on four different datasets. These include two popular widely used benchmarks i.e., Labelled Faces in the Wild (LFW) [27] and YouTube Faces (YTF) [55]. We further evaluate on Celebrities in Frontal Profile (CFP) [39] and Age Database (AgeDB) [35].

VGG2: facial image dataset [4] contains million images belonging to identities. The number of samples for each subject exhibits imbalance and varies from to with a mean of . The data is collected from the Internet and has real-life variations in the form of ethnicities, head poses, illumination changes and age groups.

LFW: Labelled Faces in the Wild (LFW) [27] contains static images of individuals collected over the Internet in real-life situations. We follow the standard evaluation protocol ‘unrestricted with labeled outside data’ [27] and test on pairs for face verification.

YTF: YouTube Faces (YTF) [55] has videos belonging to different subjects. The length of video sequences varies between and frames, with an average of frames. We follow the standard evaluation protocol for face verification on video pairs.

CFP: contains frontal and profile view images for different identities [39]. Two evaluation protocols are used based upon the type of images in the gallery and probe: frontal-frontal (FF) and frontal-profile (FP). Each protocol has 10 runs, each with 700 face pairs (350 same and 350 different).

AgeDB: has images acquired in-the-wild for subjects [35]. Along with variations across expression deformations, head poses and illumination conditions, a distinct feature of this dataset is the diversity across ages of the subjects, which ranges between and years, with an average of years. Test evaluation protocol has four groups with different age gaps (5, 10, 20 and 30 years). Each group contains ten splits, each having 600 face image pairs (300 same, 300 different). We use the most challenging split with 30 years gap.

Imbalanced MNIST: Standard MNIST has 70,000 handwritten images of digits (0-9); 60,000 of these images are used for training (∼6,000/class) and the remaining 10,000 for testing (∼1,000/class). For this paper, we perform experiments on the standard evaluation split, as well as by systematically creating imbalance in the training set. For this, we reduce the even and odd digit samples to 10% and 25%. We further perform an ablative study (Sec. 4.6) by gradually introducing different imbalance ratios amongst classes and noise levels in the training labels.
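The construction of such an imbalanced training set can be sketched as follows. The function name, the fixed random seed and the toy label list (a stand-in for the actual MNIST training labels) are hypothetical; the sampling procedure used in the experiments may differ.

```python
import random
from collections import Counter

def make_imbalanced_indices(labels, minority_classes, keep_fraction, seed=0):
    """Keep all majority samples and a random fraction of minority samples."""
    rng = random.Random(seed)
    kept = []
    for c in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == c]
        if c in minority_classes:
            n_keep = max(1, round(keep_fraction * len(idx)))
            idx = rng.sample(idx, n_keep)
        kept.extend(idx)
    return sorted(kept)

# Toy stand-in for the MNIST training labels: 100 samples per digit.
labels = [i % 10 for i in range(1000)]
even_digits = {0, 2, 4, 6, 8}
subset = make_imbalanced_indices(labels, even_digits, keep_fraction=0.10)
counts = Counter(labels[i] for i in subset)
print(sorted(counts.items()))
```

Passing the odd digits, or a keep fraction of 0.25, gives the other imbalance settings in the same way.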

4.2 Experimental Settings

For experiments on the DIL dataset, a ResNet-18 backbone is used in combination with the proposed affinity loss. For training the model to learn features for face verification tasks, we deploy Squeeze and Excitation (SE) networks [17] with a ResNet-50 backbone and affinity loss. The face images are cropped and re-sized to using multi-task cascaded Convolution Neural Network (CNN) . The model is trained using random horizontal flips as data augmentation. The features extracted after the global pooling layer are then used for face verification evaluations on different datasets. The experiments on MNIST are performed on a simple network with four hidden layers having three convolution layers ( , and filters of ), one fully connected layer (neurons), and an output layer. The model is trained with the Stochastic Gradient Descent (SGD) optimizer with momentum and learning rate decay. For the ablative study in Sec. 4.6, we only replace the output soft-max layer with the proposed Affinity loss layer and keep the rest of the architecture fixed.

4.3 Results and Analysis

Table 2 presents our experimental results on the DIL dataset. In Exp#1, we report average performance for 3-fold cross validation on five classes (Actinic Keratosis, Basal Cell Carcinoma, Melanocytic Nevus, Squamous Cell Carcinoma and Seborrhoeic Keratosis). Compared with the existing state of the art [23], we achieve an absolute gain of 10.9% on Exp#1. For Exp#2 on the DIL dataset, all classes are considered. Evaluations on 3-fold cross validation in Table 2 show a significant performance improvement of 7.7% for Exp#2. Confusion matrix analysis for class-wise accuracy comparison in Fig. 4 shows that the performance boost is more pronounced for minority classes with lower representations. We attribute this to the capability of the proposed method to simultaneously optimize within-class compactness by performing feature space clustering, and enhance inter-class separability by enforcing max-margin constraints.

Our method achieves competitive performance on LFW and YTF datasets in Table 3. The performances on LFW and YTF are already saturated, with many recent methods surpassing human-level results. The top performing methods on these datasets have been trained on much larger models with significantly more data and model parameters. Further evaluations on other facial recognition benchmarks achieve verification accuracies of 95.9%, 99.5% and 96.0% on AgeDB30, CFP-FF and CFP-FP datasets respectively. These results demonstrate the effectiveness of the proposed approach for large scale imbalanced learning. It is worth noting that our proposed Affinity loss does not require additional compute and memory and is easily scalable to larger datasets. This is in contrast to some of the existing loss formulations (such as triplet loss [40] and contrastive loss [13]) which do enhance feature space discriminability, but scale poorly to large data due to the substantial number of possible combinations of training pairs, triplets or quintuplets.

Methods (using stand. split)      Performances
Deeply Supervised Nets [29]       99.6%
Generalized Pooling Func. [28]    99.7%
Maxout NIN [6]                    99.8%

Imbalanced ()          CoSen CNN [23]    Affinity Loss
Stand. split           99.3%             99.6%
10% of odd digits      98.6%             99.3%
10% of even digits     98.4%             99.3%
25% of odd digits      98.9%             99.4%
25% of even digits     98.5%             99.5%

Table 1: Evaluations on Imbalanced MNIST Database.

Methods (using stand. split)    Exp#1 (5-classes)    Exp#2 (10-classes)
                                74.3 ± 2.5%          68.8 ± 2.0%
Hierarchical-Bayes [1]          69.6 ± 0.4%          63.1 ± 0.6%
Flat-KNN [2]                    69.8 ± 1.6%          64.0 ± 1.3%
CoSen CNN [23]                  80.2 ± 2.5%          72.6 ± 1.6%
Affinity Loss                   91.1 ± 1.7%          80.3 ± 2.1%

Table 2: Evaluation on DIL Database.
Figure 4: Confusion matrices for Exp#1 on DIL dataset: (a) CoSen CNN [23], (b) Affinity Loss.
Methods                       #Models    Train Data      LFW      YTF
DeepFace [46]                 3          4M              97.35    91.4
FaceNet [40]                  1          200M            99.63    95.4
Web-scale [47]                4          4.5M            98.37    -
VGG Face [37]                 1          2.6M            98.95    97.3
DeepID2+ [45]                 25         0.3M            99.47    93.2
Baidu [31]                    1          1.3M            99.13    -
Center Face [54]              1          0.7M            99.28    94.9
Marginal Loss [11]            1          4M              99.48    95.98
Noisy Softmax [8]             1          Ext. WebFace    99.18    94.88
Range Loss [60]               1          1.5M            99.52    93.7
Augmentation [34]             1          WebFace         98.06    -
Center Invariant Loss [56]    1          WebFace         99.12    93.88
Feature transfer [58]         1          4.8M            99.37    -
Softmax+Contrastive [44]      1          WebFace         98.78    93.5
Triplet Loss [40]             1          WebFace         98.7     93.4
Large Margin Softmax [33]     1          WebFace         99.10    94.0
Center Loss [54]              1          WebFace         99.05    94.4
SphereFace [32]               1          WebFace         99.42    95.0
CosFace [51]                  1          WebFace         99.33    96.1
LMLE [18]                     1          WebFace         99.51    95.8
Affinity Loss                 1          VGG2            99.65    97.3

Table 3: Face Verification Performance on LFW and YTF datasets.
Figure 5: The effect of label noise on the soft-max and affinity loss functions.

4.4 Generalization

To test the generalization of the proposed method for different imbalance levels, we gradually reduce the training set by changing the representation of the minority class samples on MNIST data. Specifically, we gradually alter the majority to minority class ratios (up-to ) by randomly dropping samples of the first five digits (0-4). Under these settings, we therefore have significantly lower representation for half of the classes. The experimental results in terms of error rates against the fraction of retained minority class samples are shown in Fig. 6. We also repeat these experiments for standard soft-max loss. The comparison in Fig. 6 demonstrates a consistently superior performance of the proposed loss function across all settings. The effect on achieved performance is more noticeable for larger imbalance levels between majority and minority classes. The proposed Affinity loss enhances inter-class separability irrespective of the class frequencies by enforcing margin maximization constraints. Soft-max loss does not have inherent margin learning characteristics. Further, compared with soft-max loss, where intra-class variations can vary across classes depending upon their representative samples, affinity loss learns uniformly sized clusters. As visualized in Fig. 2, within-class disparities in feature space are observed for soft-max loss, with minority classes occupying compact regions compared with their majority counterparts. In comparison, our proposed loss formulation is flexible, and allows learnt class prototypes to be equi-spaced and form uniformly shaped clusters. This reduces bias towards the less frequent classes and enhances the overall generalization capabilities, thus yielding a more discriminatively learnt feature space and an improved performance.

Figure 6: Robustness analysis against different imbalance levels (fraction of retained minority class samples)

4.5 Robustness against Noisy Labels

For many real-world applications, the acquired data has noisy labels, and generalization of the learning methods against label noise is highly desirable [42, 50, 19]. To check the robustness of our proposed approach against noisy labels in the training data, we randomly flip the classes of MNIST training samples. The fraction of the mislabelled samples is gradually increased from to with an increment of . In order to avoid over-fitting on the noisy data, we deploy early stopping [57], and finish training when the performance on a held-out cross validation set starts to degrade. For comparison, we repeat all experiments using standard soft-max loss. The experimental results in Fig. 5 show that the proposed Affinity loss performs better across the entire range of different noise levels. Although the performance for both soft-max and affinity losses degrades with increasing noise factors, the proposed affinity loss shows more robustness, especially for larger noise ratios, with comparatively less performance degradation. The multi-centered learning in our loss provides flexibility to the noisy samples to associate themselves with class prototypes which are different from those of the clean samples.
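A label-flipping procedure of this kind can be sketched as follows. The choice to reassign each selected sample uniformly to one of the other classes, the function name, the fixed seed and the toy label list are assumptions; the exact noise model used in the experiments is not specified here.

```python
import random

def flip_labels(labels, noise_fraction, num_classes, seed=0):
    """Reassign a random fraction of labels to a different, random class."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(noise_fraction * len(labels))
    for i in rng.sample(range(len(labels)), n_flip):
        other = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(other)
    return noisy

clean = [i % 10 for i in range(1000)]  # toy stand-in for MNIST labels
noisy = flip_labels(clean, noise_fraction=0.2, num_classes=10)
changed = sum(a != b for a, b in zip(clean, noisy))
print(changed)
```

Because every selected sample is moved to a different class, the number of changed labels equals the requested fraction of the training set exactly.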

Figure 7: Effect of changing the parameter $\sigma$ that controls the spread of clusters. Results on imbalanced MNIST show that increasing the cluster variance above a certain point results in overlapped clusters and a higher error rate.

4.6 Ablation

Number of Cluster Centers: A unique aspect of the proposed affinity loss is its multi-centered learning which provides us the flexibility to have multiple class prototypes for each class. Here, we perform experiments on the imbalanced MNIST dataset ( representation for first five digits), by gradually changing the number of representative prototypes per class from to . The experimental results in terms of error rates vs prototypes in Fig. 8 show that the best performance is achieved for . Fewer prototypes per class () yield relatively poor performance. The proposed method performs consistently when prototypes are increased beyond . Such multi-centered learning supports diversity in input samples. It is specifically helpful in scenarios with complex data distributions where large differences are observed amongst samples of the same class. Such diverse samples might not necessarily cluster around a single region, and could form multiple clusters by virtue of the proposed multi-centered learning mechanism. Furthermore, our experiments in Sec. 4.5 show that by providing flexible class prototypes, multi-centered learning proves an effective and robust scheme against noisy samples.

Figure 8: Performance for different number of clusters per class.

Cluster Spread $\sigma$: The parameter $\sigma$ in Eq. 3 determines the cluster spread and helps achieve uniform intra-class variations. Our 2D visualization of the learnt features in Fig. 2 demonstrates that the clusters for each class are uniformly sized for both the majority and minority classes. This is in contrast to the traditional soft-max loss, where shrunken feature space regions are observed for the minority classes. For our proposed loss formulation, the size of the cluster is directly related to the value of the parameter $\sigma$, with larger $\sigma$ indicating larger variance for a cluster. We perform experiments on the imbalanced MNIST dataset for different values of the parameter $\sigma$. The results in Fig. 7 show that the optimal performance is achieved for values of $\sigma$ between and . Very high values of $\sigma$ result in larger cluster spreads, causing overlaps and confusion amongst classes and lower classification performance.

Distance Similarity Performance
Table 4: Evaluation with different combinations of distance and similarity measures.

Distance and Similarity Metrics: Our original affinity loss formulation in Eq. 3 first computes the squared distance between the feature f and the class prototype w, which is then converted to a similarity measure using the Gaussian metric. In this experiment, we evaluate different combinations of distance and similarity metrics. and metrics are used to compute distances, whereas Gaussian and inverse distance (defined by ) are the two similarity measures. We perform these experiments on imbalanced MNIST data (by retaining 10% of the samples for the first five digits). Table 4 shows our evaluation results. The proposed scheme works well with all combinations except for distance and Gaussian similarity, where it fails to converge. The best performance is achieved for Gaussian similarity in combination with squared distance.
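The way a distance and a similarity measure are paired can be sketched as follows. This is only an illustration: the L1 and squared L2 distances, the inverse-distance form 1 / (d + eps), the value of sigma and all function names are assumptions of this sketch, not definitions taken from Eq. 3 or Table 4.

```python
import math

def l1_distance(f, w):
    """Sum of absolute coordinate differences (assumed distance choice)."""
    return sum(abs(a - b) for a, b in zip(f, w))

def squared_l2_distance(f, w):
    """Sum of squared coordinate differences (assumed distance choice)."""
    return sum((a - b) ** 2 for a, b in zip(f, w))

def gaussian_sim(d, sigma=1.0):
    """Gaussian similarity exp(-d / sigma) applied to a distance value d."""
    return math.exp(-d / sigma)

def inverse_sim(d, eps=1e-6):
    """Inverse-distance similarity; the form 1 / (d + eps) is an assumption."""
    return 1.0 / (d + eps)

DISTANCES = {"l1": l1_distance, "squared_l2": squared_l2_distance}
SIMILARITIES = {"gaussian": gaussian_sim, "inverse": inverse_sim}

def affinity(f, w, distance, similarity):
    """Compute a distance, then map it to a similarity score."""
    return SIMILARITIES[similarity](DISTANCES[distance](f, w))

# Hypothetical feature and prototype.
feature, prototype = (1.0, 2.0), (0.0, 0.0)
for dname in DISTANCES:
    for sname in SIMILARITIES:
        print(dname, sname, affinity(feature, prototype, dname, sname))
```

Every combination yields a score that decreases as the feature moves away from the prototype; the combinations differ in how quickly the score decays, which affects the gradients seen during training.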

5 Conclusion

Class imbalance is ubiquitous in natural data, and learning from such data remains an unresolved challenge. This paper proposed a flexible loss formulation, aimed at producing a generalizable large-margin classifier, to tackle class-imbalanced learning with deep networks. Based upon a Euclidean space affinity defined using Gaussian similarity on the Bregman divergence, the proposed loss jointly performs feature space clustering and max-margin classification. It enables learning uniformly sized, equi-spaced clusters in the feature space, thus enhancing between-class separability and reducing intra-class variations. The proposed scheme complements existing regularizers such as weight decay, and can be incorporated into different architectural backbones without incurring additional compute overhead. Experimental evaluations validate the effectiveness of the affinity loss on face verification and image classification benchmarks involving imbalanced data.


  • [1] L. Ballerini, R. B. Fisher, B. Aldridge, and J. Rees. Non-melanoma skin lesion classification using colour image data in a hierarchical k-nn classifier. In Biomedical Imaging (ISBI), 2012 9th IEEE International Symposium on, pages 358–361. IEEE, 2012.
  • [2] L. Ballerini, R. B. Fisher, B. Aldridge, and J. Rees. A color and texture based hierarchical k-nn approach to the classification of non-melanoma skin lesions. In Color Medical Image Analysis, pages 63–86. Springer, 2013.
  • [3] R. Barandela, E. Rangel, J. S. Sánchez, and F. J. Ferri. Restricted decontamination for the imbalanced training sample problem. In

    Iberoamerican Congress on Pattern Recognition

    , pages 424–431. Springer, 2003.
  • [4] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. Vggface2: A dataset for recognising faces across pose and age. In International Conference on Automatic Face and Gesture Recognition, 2018.
  • [5] C. L. Castro and A. P. Braga.

    Novel cost-sensitive approach to improve the multilayer perceptron performance on imbalanced data.

    IEEE transactions on neural networks and learning systems, 24(6):888–899, 2013.
  • [6] J.-R. Chang and Y.-S. Chen. Batch-normalized maxout network in network. arXiv preprint arXiv:1511.02583, 2015.
  • [7] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002.
  • [8] B. Chen, W. Deng, and J. Du. Noisy softmax: Improving the generalization ability of dcnn via postponing the early softmax saturation. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , pages 5372–5381, 2017.
  • [9] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
  • [10] J. Deng, J. Guo, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
  • [11] J. Deng, Y. Zhou, and S. Zafeiriou. Marginal loss for deep face recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2006–2014. IEEE, 2017.
  • [12] G. F. Elsayed, D. Krishnan, H. Mobahi, K. Regan, and S. Bengio. Large margin deep networks for classification. arXiv preprint arXiv:1803.05598, 2018.
  • [13] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In null, pages 1735–1742. IEEE, 2006.
  • [14] H. Han, W.-Y. Wang, and B.-H. Mao. Borderline-smote: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pages 878–887. Springer, 2005.
  • [15] H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284, 2009.
  • [16] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf. Support vector machines. IEEE Intelligent Systems and their applications, 13(4):18–28, 1998.
  • [17] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 7, 2017.
  • [18] C. Huang, Y. Li, C. Change Loy, and X. Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5375–5384, 2016.
  • [19] C. Huang, C. C. Loy, and X. Tang. Discriminative sparse neighbor approximation for imbalanced learning. IEEE transactions on neural networks and learning systems, 29(5):1503–1513, 2018.
  • [20] P. Jeatrakul, K. W. Wong, and C. C. Fung. Classification of imbalanced data by combining the complementary neural network and smote algorithm. In International Conference on Neural Information Processing, pages 152–159. Springer, 2010.
  • [21] T. Jo and N. Japkowicz. Class imbalances versus small disjuncts. ACM Sigkdd Explorations Newsletter, 6(1):40–49, 2004.
  • [22] S. Khan, H. Rahmani, S. A. A. Shah, and M. Bennamoun. A guide to convolutional neural networks for computer vision. Synthesis Lectures on Computer Vision, 8(1):1–207, 2018.
  • [23] S. H. Khan, M. Hayat, M. Bennamoun, F. Sohel, and R. Togneri.

    Cost sensitive learning of deep feature representations from imbalanced data.

    IEEE Transactions on Neural Networks and Learning Systems, 2017.
  • [24] B. Krawczyk, M. Woźniak, and G. Schaefer. Cost-sensitive decision tree ensembles for effective imbalanced classification. Applied Soft Computing, 14:554 – 562, 2014.
  • [25] M. KUBAT. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the 14th International Conference on Machine Learning, pages 179–186. Morgan Kaufmann, 1997.
  • [26] S. Lawrence, I. Burns, A. Back, A. C. Tsoi, and C. L. Giles.

    Neural network classification and prior class probabilities.

    In Neural networks: Tricks of the trade, pages 295–309. Springer, 2012.
  • [27] G. B. H. E. Learned-Miller. Labeled faces in the wild: Updates and new reporting procedures. Technical Report UM-CS-2014-003, University of Massachusetts, Amherst, May 2014.
  • [28] C.-Y. Lee, P. W. Gallagher, and Z. Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 464–472, 2016.
  • [29] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. 2015.
  • [30] M. Li, X. Chen, X. Li, B. Ma, and P. M. Vitányi. The similarity metric. IEEE transactions on Information Theory, 50(12):3250–3264, 2004.
  • [31] J. Liu, Y. Deng, T. Bai, Z. Wei, and C. Huang. Targeting ultimate accuracy: Face recognition via deep embedding. arXiv preprint arXiv:1506.07310, 2015.
  • [32] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6738–6746. IEEE, 2017.
  • [33] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In International Conference on Machine Learning, pages 507–516, 2016.
  • [34] I. Masi, A. T. Trần, T. Hassner, J. T. Leksut, and G. Medioni. Do we really need to collect millions of faces for effective face recognition? In European Conference on Computer Vision, pages 579–596. Springer, 2016.
  • [35] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou. Agedb: the first manually collected, in-the-wild age database. In Proceedings of IEEE Int’l Conf. on Computer Vision and Pattern Recognition (CVPR-W 2017), Honolulu, Hawaii, June 2017.
  • [36] W. W. Ng, G. Zeng, J. Zhang, D. S. Yeung, and W. Pedrycz.

    Dual autoencoders features for imbalance classification problem.

    Pattern Recognition, 60:875–889, 2016.
  • [37] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In Proceedings of the British Machine Vision Conference, pages 6–14, 2015.
  • [38] M. D. Richard and R. P. Lippmann.

    Neural network classifiers estimate bayesian a posteriori probabilities.

    Neural computation, 3(4):461–483, 1991.
  • [39] C. C. V. P. R. C. D. J. S. Sengupta, J.C. Cheng. Frontal to profile face verification in the wild. In IEEE Conference on Applications of Computer Vision, February 2016.
  • [40] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
  • [41] L. Shen, Z. Lin, and Q. Huang.

    Relay backpropagation for effective learning of deep convolutional neural networks.

    In European conference on computer vision, pages 467–482. Springer, 2016.
  • [42] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 614–622. ACM, 2008.
  • [43] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
  • [44] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, pages 1988–1996, 2014.
  • [45] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2892–2900, 2015.
  • [46] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014.
  • [47] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2746–2754, 2015.
  • [48] Y. Tang, Y.-Q. Zhang, N. V. Chawla, and S. Krasser. Svms modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(1):281–288, 2009.
  • [49] K. M. Ting. A comparative study of cost-sensitive boosting algorithms. In In Proceedings of the 17th International Conference on Machine Learning. Citeseer, 2000.
  • [50] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
  • [51] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [52] S. Wang, W. Liu, J. Wu, L. Cao, Q. Meng, and P. J. Kennedy. Training deep neural networks on imbalanced data sets. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 4368–4374. IEEE, 2016.
  • [53] Y.-X. Wang, D. Ramanan, and M. Hebert. Learning to model the tail. In Advances in Neural Information Processing Systems, pages 7029–7039, 2017.
  • [54] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.
  • [55] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 529–534. IEEE, 2011.
  • [56] Y. Wu, H. Liu, J. Li, and Y. Fu. Deep face recognition with center invariant loss. In Proceedings of the on Thematic Workshops of ACM Multimedia 2017, pages 408–414. ACM, 2017.
  • [57] Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
  • [58] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker. Feature transfer learning for deep face recognition with long-tail data. arXiv preprint arXiv:1803.09014, 2018.
  • [59] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, Oct 2016.
  • [60] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao. Range loss for deep face recognition with long-tailed training data. In Proceedings of the IEEE International Conference on Computer Vision, pages 5409–5418, 2017.
  • [61] Z.-H. Zhou and X.-Y. Liu. On multi-class cost-sensitive learning. Computational Intelligence, 26(3):232–257, 2010.