Dynamic Curriculum Learning for Imbalanced Data Classification

01/21/2019 ∙ by Yiru Wang, et al. ∙ SenseTime Corporation

Human attribute analysis is a challenging task in the field of computer vision, since the data is largely imbalanced. Common techniques such as re-sampling and cost-sensitive learning require prior knowledge to train the system. To address this problem, we propose a unified framework called Dynamic Curriculum Learning (DCL) that online-adaptively adjusts the sampling strategy and the loss weighting within each batch, resulting in better generalization and discrimination. Inspired by curriculum learning, DCL consists of two levels of curriculum schedulers: (1) a sampling scheduler that manages the data distribution not only from imbalanced to balanced but also from easy to hard; (2) a loss scheduler that controls the learning importance between the classification loss and the metric learning loss. With these two schedulers, our DCL framework achieves new state-of-the-art performance on the widely used face attribute dataset CelebA and the pedestrian attribute dataset RAP.


1 Introduction

Human attribute analysis, covering facial characteristics and clothing categories, benefits society on many levels, for example in tracking and identification. However, unlike general image classification problems such as the ImageNet challenge [23], human attribute analysis naturally involves highly imbalanced data distributions. For example, when collecting face data for the attribute "Bald", most samples are labelled "No Bald", and their imbalance ratio to the "Bald" class is usually very large. Training a classifier that gives equal importance to samples of different classes biases it toward the majority class and yields poor accuracy on the minority class. Therefore, how to handle imbalanced data learning in human attribute analysis is an important topic in the field.

For the general classification problem, class-biased accuracy is defined as the number of correctly predicted samples divided by the total number of test samples. For imbalanced data classification, class-balanced accuracy, defined as the average of the per-class accuracies, is used for evaluation instead.

Impressive results have been achieved on general imbalanced data learning in the past years. One intuitive approach is re-sampling [3, 8, 12, 14, 31], which either oversamples the minority-class data or downsamples the majority-class data to balance the data distribution. However, oversampling easily causes overfitting by repeatedly visiting duplicated minority samples, while downsampling discards much useful information in the majority samples. Another approach, cost-sensitive learning, avoids these issues by directly imposing a heavier cost on misclassifying the minority class [38, 39, 43, 44]; however, it is difficult to determine a meaningful cost for different samples in different settings. Specifically for deep human attribute analysis, [13] proposed a batch-wise method that selects part of the majority samples and increases the weight of minority samples to match a pre-defined target distribution. Besides the standard cross-entropy classification loss, [6, 7] proposed an additional class rectification loss (CRL) regularizer to suppress the dominant effect of the majority classes.

Our proposed Dynamic Curriculum Learning (DCL) method is motivated by two considerations. (1) Sampling is a good strategy for the problem, but always targeting a balanced distribution throughout the learning process hurts the generalization ability of the system, particularly on highly imbalanced tasks. For example, in the early period of learning with a balanced target distribution, the system discards many majority samples and over-emphasizes the minority ones, so it tends to learn a good representation of the minority class but a poor or unstable representation of the majority class. What we want instead is for the system to first learn a good general representation for both classes of the target attribute and then classify the samples into the correct labels; this yields a good balance between class-biased and class-balanced accuracy. (2) Combining the cross-entropy loss (CE) with a metric learning loss (ML) is reasonable, but the two components have different emphases, and treating them equally throughout training cannot fully utilize the discriminative power of a deep CNN. Specifically, CE pays more attention to the classification task by assigning specific labels, while ML focuses on learning a soft feature embedding that separates samples in feature space without assigning labels. As with the previous point, we want the system to first learn a good feature representation and then classify the samples into the correct labels.

In the spirit of curriculum learning [1], we propose the Dynamic Curriculum Learning (DCL) framework for imbalanced data learning. Specifically, we design two levels of curriculum schedulers: (1) a sampling scheduler, which finds the most meaningful samples in each batch to train the network dynamically, with a target distribution moving not only from imbalanced to balanced but also from easy to hard; (2) a loss scheduler, which controls the learning weights between the classification loss and the metric learning loss. Both are defined through a scheduler function that reflects the network's learning status. To summarize our contributions:

  • For the first time, we introduce the idea of curriculum learning into the imbalanced data learning problem. Based on the designed scheduler function, two curriculum schedulers are proposed, for dynamic sampling and for loss backward propagation.

  • The proposed DCL framework is a unified representation that generalizes several existing state-of-the-art methods under corresponding setups.

  • We achieve new state-of-the-art performance on the commonly used face attribute dataset CelebA [30] and the pedestrian attribute dataset RAP [26].

2 Related Work

Imbalanced data learning. Several groups of methods address the imbalanced learning problem in the literature. (1) Data-level: given the imbalanced distribution of the data, one intuitive approach is re-sampling the data [3, 8, 12, 14, 31, 32] into a balanced distribution, by oversampling the minority-class data or downsampling the majority-class data. One advanced sampling method, SMOTE [3], augments the data with artificial examples created by interpolating neighboring data points, and several extensions of this technique have been proposed [12, 31]. However, oversampling easily causes overfitting by repeatedly visiting duplicated minority samples, while downsampling discards much useful information in the majority samples. (2) Algorithm-level: cost-sensitive learning avoids these issues by directly imposing a heavier cost on misclassifying the minority class [38, 39, 43, 44], but how to determine the cost representation in different problem settings or environments remains an open question. Besides cost-sensitive learning, another option is to change the decision threshold during testing, known as threshold-adjustment [5, 42, 44]. (3) Hybrid: this approach combines multiple techniques from one or both of the above categories; ensembling is a widely used example. EasyEnsemble and BalanceCascade train a committee of classifiers on undersampled subsets [29], while SMOTEBoost combines boosting with SMOTE oversampling [4].

Deep imbalanced learning. Recently, several deep methods have been proposed for imbalanced data learning [2, 6, 7, 9, 13, 16, 17, 19, 21, 22, 35, 41, 44]. One major direction integrates the sampling idea and cost-learning into an efficient end-to-end deep learning framework. [19] treated the Complementary Neural Network as an under-sampling technique and combined it with SMOTE-based over-sampling to rebalance the data. [44] studied data re-sampling for training cost-sensitive neural networks. In [2, 21], the cost-sensitive deep features and the cost parameter are jointly optimized. [32] re-sampled the numbers of foreground and background image patches for learning a convolutional neural network (CNN) for object classification. [13] proposed a selective learning (SL) method that adjusts the sample distribution within a batch toward a target distribution and assigns larger weights to minority classes for backward propagation. Another recent direction incorporates metric learning into the system. [6, 7] proposed a class rectification loss (CRL) regularizer that counters the dominant effect of majority classes by discovering sparsely sampled boundaries of minority classes. More recently, LMLE/CLMLE [16, 17] were proposed to preserve local class structures by enforcing large margins between intra-class and inter-class clusters.

Curriculum learning. The idea of curriculum learning was originally proposed in [1], which demonstrated that the strategy of learning from easy to hard significantly improves the generalization of a deep model. So far, work on curriculum learning has mainly focused on visual category discovery [24], object tracking [37], and semi-/weakly-supervised learning [10, 11, 20, 33]. [33] proposed processing multiple tasks in a sequence with sharing between subsequent tasks, instead of solving all tasks jointly, by finding the best order in which to learn them. Very few works approach imbalanced learning this way. [11] developed a principled learning strategy that leverages curriculum learning in a weakly supervised framework, with the goal of handling massive amounts of noisy labels and data imbalance.

3 Method

We propose a Dynamic Curriculum Learning (DCL) framework for the imbalanced data classification problem. DCL consists of two levels of curriculum schedulers. The first level is a sampling scheduler; the key idea of this design is to find the most meaningful samples in each batch to train the network dynamically, with the target distribution moving not only from imbalanced to balanced but also from easy to hard. This scheduler determines the sampling strategy of the proposed Dynamic Selective Learning (DSL) component. The second level is the loss scheduler; it controls the learning importance between two losses, the DSL loss and an additional metric learning loss (a triplet loss). In the early stage of training, the system focuses more on the soft feature-space embedding, while later on it pays more attention to the classification task.

3.1 Scheduler Function Design

Most traditional curriculum learning methods manually define different training strategies. In our proposed DCL framework for imbalanced data learning, we instead formulate the key idea of curriculum scheduling with different groups of functions, which we call scheduler functions, and give a semantic interpretation for each of them.

The scheduler function $g(l)$ is a function decreasing from 1 to 0 with input variable $l$, the current training epoch. It reflects the network's learning status, and the slope of the function measures the curriculum learning speed. We explore several function classes (illustrated in Figure 1):

  • Convex function: indicating a learning speed from slow to fast. For example:

    $g_{convex}(l) = \cos\left(\frac{l}{L} \cdot \frac{\pi}{2}\right)$   (1)

  • Linear function: indicating a constant learning speed. For example:

    $g_{linear}(l) = 1 - \frac{l}{L}$   (2)

  • Concave function: indicating a learning speed from fast to slow. For example:

    $g_{concave}(l) = \lambda^{l}$   (3)

  • Composite function: indicating a learning speed from slow to fast and then slow again. For example:

    $g_{composite}(l) = \frac{1}{2}\left(\cos\left(\frac{l}{L}\pi\right) + 1\right)$   (4)

where $L$ refers to the expected total number of training epochs and $\lambda$ is an independent hyperparameter in the range $(0, 1)$.

Figure 1: Four types of designed scheduler functions.

Different classes of $g(l)$ represent different curriculum learning styles. Based on the scheduler functions introduced above, we design the proposed Dynamic Curriculum Learning framework for imbalanced data classification.
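To make the four families concrete, here is a minimal Python sketch of the example scheduler functions above; the exact expressions follow the reconstructed Equations 1–4 and are illustrative rather than prescriptive:

```python
import math

def g_convex(l, L):
    # Slow-to-fast: stays near 1 early, then drops quickly toward 0.
    return math.cos(l / L * math.pi / 2)

def g_linear(l, L):
    # Constant curriculum speed.
    return 1.0 - l / L

def g_concave(l, L, lam=0.95):
    # Fast-to-slow: drops quickly at first, then levels off.
    # lam in (0, 1) is the decay hyperparameter (the value here is an assumption).
    return lam ** l

def g_composite(l, L):
    # Slow-to-fast-to-slow: half-cosine from 1 down to 0.
    return 0.5 * (math.cos(l / L * math.pi) + 1.0)
```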

3.2 Sampling Scheduler

Sampling is one of the most commonly used techniques for imbalanced data learning. In this section, we introduce the proposed Dynamic Selective Learning (DSL) component, which is built on our sampling scheduler. As mentioned in the introduction, a fixed target distribution for selective learning cannot serve class-balanced accuracy and class-biased accuracy at the same time. Our DSL component instead utilizes a dynamic curriculum scheduler to move the target batch distribution from imbalanced to balanced during training.

Specifically, we define element $j$ of the imbalanced training data distribution as the number of class-$j$ samples divided by the number of samples of the minority class. Putting the minority class first, we have

$D_{train} = [d_1, d_2, \dots, d_C], \qquad d_j = \frac{N_j}{N_{min}}, \qquad 1 = d_1 \le d_2 \le \dots \le d_C$

where $C$ is the number of classes.

For the sampling scheduler, the target distribution for one batch is set to the training set distribution, which is imbalanced, at the very beginning of training; during the training process it gradually transfers to a balanced distribution. Formally, the target distribution at epoch $l$ is

$D_{target}(l) = \left[d_1^{\,g(l)}, d_2^{\,g(l)}, \dots, d_C^{\,g(l)}\right]$   (5)

where $l$ refers to the current training epoch and $g(\cdot)$ is the sampling scheduler function, which can be any of the choices in Section 3.1. According to the current target distribution $D_{target}(l)$, the majority-class samples are dynamically selected and the minority-class samples are re-weighted in each epoch to realize the target distribution in one batch. The DSL loss is therefore defined as:

$L_{DSL} = -\frac{1}{n}\sum_{i=1}^{n} w_{y_i} \log p(y_i \mid x_i)$   (6)

$w_j = \frac{D_{target,j}(l)}{D_{batch,j}}$   (7)

where $n$ is the batch size, $n_j$ is the number of samples of class $j$ in the current batch, $C$ is the number of classes, and $y_i$ is the ground-truth label of sample $x_i$. $w_j$ is the cost weight for class $j$, $D_{target,j}(l)$ is the target share of class $j$ at the current epoch $l$, and $D_{batch,j}$ is the share of class $j$ in the current batch before sampling. If $w_j < 1$, we sample a $w_j$ fraction of the class-$j$ data with selection weight 1 and drop the remainder with weight 0.

With any of the sampling scheduler functions (the four types of the previous section), the batch target distribution changes from the biased training-set distribution to a balanced one. At the beginning, with epoch $l = 0$ and $g(0) = 1$, the target distribution equals the training set distribution, in other words, the real-world distribution. At the final epoch, $g(l)$ is close to zero, so every element of the target distribution is close to 1, in other words, it is a balanced distribution.
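As a sketch of Equations 5–7, the following computes the per-epoch target distribution and the per-class cost weights; the 1:9 class ratio and the inlined convex scheduler are illustrative assumptions:

```python
import math
import numpy as np

def target_distribution(d_train, l, L, g=lambda l, L: math.cos(l / L * math.pi / 2)):
    # Eq. 5: raise each class ratio d_j = N_j / N_min to the power g(l).
    # g(0) = 1 keeps the real-world distribution; g(L) ~ 0 gives all ones (balanced).
    return np.power(d_train, g(l, L))

def dsl_weights(d_target, d_batch):
    # Eq. 7: per-class cost weight = target share / current batch share.
    # w_j >= 1 re-weights an under-represented class; w_j < 1 means only a
    # w_j fraction of that class's samples is kept in the batch.
    return d_target / d_batch

d_train = np.array([1.0, 9.0])  # hypothetical 1:9 binary attribute
print(target_distribution(d_train, l=0, L=300))    # [1. 9.] -> imbalanced start
print(target_distribution(d_train, l=300, L=300))  # ~[1. 1.] -> balanced end
```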

The learning rate is usually set to follow a decay function. In the early period of training, with a large learning rate and a biased target distribution, the curriculum scheduler lets the model learn from the whole training data, and the system mostly learns easy samples at this stage. As training proceeds, the target distribution gradually becomes balanced; with the selected majority samples and re-weighted minority samples, the system then focuses more on the harder cases.

3.3 Metric Learning with Easy Anchors

Besides the cost-sensitive cross-entropy loss $L_{DSL}$ introduced above, we also involve a metric learning loss to learn a better feature embedding for imbalanced data classification.

A typical choice for the metric learning loss is the triplet loss, introduced in CRL [7] with hard mining. Hard examples are defined as those with a high prediction score on the wrong class. The CRL loss function is defined as:

$L_{CRL} = \frac{1}{T}\sum_{(x_a,\, x_+,\, x_-)} \max\!\big(0,\; m_j + d(x_a, x_+) - d(x_a, x_-)\big)$   (8)

where $m_j$ refers to the margin of class $j$ in the triplet loss and $d(\cdot,\cdot)$ denotes the feature distance between two samples. In the current batch, the anchors $x_a$ are all the samples of class $j$, while $x_+$ and $x_-$ are the positive and negative samples, respectively, and $T$ refers to the number of triplet pairs. In CRL [7], all minority-class samples are selected as anchors, while only hard samples are selected as positives and negatives.

Figure 2: A case of the triplet loss in CRL [7] where a hard positive sample is chosen as the anchor. Assuming the minority class is the positive class, the triplet pair shown tries to pull both the positive sample and the negative sample across the decision border, which actually moves the positive sample closer to the negative side and can make the features of positive samples more chaotic.
Figure 3: A case of our proposed triplet loss, which chooses only easy samples as anchors. Assuming the minority class is the positive class, since the features of easy positive samples are grouped easily, the hard positive sample can be pulled closer to all the easy positive samples. Our proposed method avoids the situation of Figure 2 and moves all samples toward the well-classified side.

Since the features of minority-class samples can be scattered among the majority samples' features, choosing all minority samples as anchors can cause problems such as pulling an easy positive sample toward the negative side (an easy sample is a correctly predicted sample). Such a case is illustrated in Figure 2.

We propose to improve the sampling operation of the triplet loss with easy anchors, giving the loss $L_{TEA}$:

$L_{TEA} = \frac{1}{T}\sum_{x_a \in E_j}\;\sum_{(x_+,\, x_-)} \max\!\big(0,\; m_j + d(x_a, x_+) - d(x_a, x_-)\big)$   (9)

where $E_j$ refers to the easy samples of class $j$; the remaining symbols are as in Equation 8. The number of hard positives, hard negatives, and easy anchors to select is determined by a single hyperparameter $k$.

With the $L_{TEA}$ loss, only easy samples of the minority class are chosen as anchors, which pulls the hard positive samples closer and pushes the hard negative samples away, as illustrated in Figure 3. Although rectifying the feature space for all minority samples, as CRL does, may sound helpful, our easy-anchor method is grounded in the classifier's output and pulls all samples toward the well-classified side. Note also that there is no conflict between hard mining and easy anchors: hard mining depends on the anchor, while the anchor itself is determined by the classification result.
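A PyTorch-style sketch of the easy-anchor triplet selection for one binary attribute follows; the top-k selection by classifier score and the helper signature are our assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def tea_loss(feats, labels, probs, margin=0.5, k=25):
    # Triplet loss with easy anchors (Eq. 9, sketch).
    # feats: (n, d) embeddings; labels: (n,), 1 = minority class;
    # probs: (n,) predicted probability of each sample's true class.
    pos, neg = labels == 1, labels == 0
    kp, kn = min(k, int(pos.sum())), min(k, int(neg.sum()))
    if kp == 0 or kn == 0:
        return feats.sum() * 0.0  # no valid triplets in this batch
    # Easy anchors: minority samples the classifier already predicts well.
    anchors = feats[pos][probs[pos].topk(kp).indices]
    # Hard positives/negatives: samples with the lowest true-class score.
    hard_pos = feats[pos][(-probs[pos]).topk(kp).indices]
    hard_neg = feats[neg][(-probs[neg]).topk(kn).indices]
    # Hinge over all anchor/positive/negative combinations.
    d_ap = torch.cdist(anchors, hard_pos)  # (kp, kp) anchor-positive distances
    d_an = torch.cdist(anchors, hard_neg)  # (kp, kn) anchor-negative distances
    return F.relu(margin + d_ap.unsqueeze(2) - d_an.unsqueeze(1)).mean()
```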

3.4 Loss Scheduler

To better train the network, we analyze the different characteristics of the two proposed losses. Generally speaking, the triplet loss targets learning a soft feature embedding that separates samples in feature space without assigning labels, while the cross-entropy loss aims to classify the samples by assigning specific labels.

Particularly for imbalanced data learning, we want the system to first learn a good feature representation and then benefit from it for correct classification. Therefore, to fully utilize these two properties, we design a loss curriculum scheduler to manage the two losses.

Even though we could choose any of the scheduler functions in Section 3.1, we use the composite function (Equation 4) as an example here. The network learns with the following scheduler (illustrated in Figure 4):

$L = L_{DSL} + f(l)\cdot L_{TEA}$   (10)

$f(l) = \begin{cases} \frac{1}{2}\left(\cos\left(\frac{l}{\epsilon L}\,\pi\right) + 1\right), & l \le \epsilon L \\ \mu, & l > \epsilon L \end{cases}$   (11)

where $l$ refers to the current training epoch and $L$ to the expected total number of training epochs. Two small modifications are introduced: a hyperparameter $\epsilon$ ranging in $(0, 1]$, defined as the advanced self-learning point, and $\mu$, the self-learning ratio. The reason for keeping a non-zero $\mu$ is that even in the self-learning stage, the network still needs to maintain the feature structure learned from $L_{TEA}$ in the previous stage.

In the early period of training, a large weight is assigned to the triplet loss for soft feature-embedding learning, and it decreases over time according to the scheduler function. In the later period, the scheduler assigns a small weight to $L_{TEA}$, and the system emphasizes the Dynamic Selective Learning loss to learn the classification. Finally, when training reaches the self-learning point, no "teacher" curriculum scheduler is needed, and the network automatically fine-tunes its parameters until convergence.

Figure 4: Network Loss Scheduler
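Under the reconstruction of Equation 11 above, the loss scheduler can be sketched as follows; ε = 0.3 matches Section 4.3.1, while μ = 0.1 is an assumed placeholder for the self-learning ratio:

```python
import math

def f_loss(l, L, eps=0.3, mu=0.1):
    # Eq. 11 (sketch): composite decay until the advanced self-learning
    # point eps * L, then a small constant mu that preserves the feature
    # structure learned from L_TEA.
    if l <= eps * L:
        return 0.5 * (math.cos(l / (eps * L) * math.pi) + 1.0)
    return mu

def total_loss(loss_dsl, loss_tea, l, L):
    # Eq. 10: classification loss plus the scheduled metric-learning loss.
    return loss_dsl + f_loss(l, L) * loss_tea
```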

3.5 Generalization of DCL Framework

To handle the imbalanced data learning problem, we propose the Dynamic Curriculum Learning framework. Revisiting the overall system, DCL consists of two levels of curriculum schedulers, one for sampling, $g(l)$, and one for loss learning, $f(l)$. Several state-of-the-art imbalanced learning methods can be generalized from the framework with particular scheduler setups; the correspondences are listed in Table 1. Selective Learning [13] contains no metric learning and uses only a fixed target distribution. CRL-I [6] contains no re-weighting or re-sampling operation and uses only a fixed weight for metric learning.

Method | g(l) | f(l)
--- | --- | ---
Cross Entropy | 1 | 0
Selective Learning [13] | 0/1 | 0
CRL-I [6] | 1 | 1
DCL (ours) | sampling scheduler | loss scheduler

Table 1: Generalization of the proposed Dynamic Curriculum Learning method to other non-clustering imbalanced learning methods with corresponding setups.

Attribute | Imb. level | DeepID2 (CE) [36] | Over-Sampling [8] | Down-Sampling [8] | Cost-Sensitive [14] | Selective-Learning [13] | CRL-I [6] | LMLE [16] | CLMLE [17] | DCL (ours)
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Attractive | 1 | 78 | 77 | 78 | 78 | 81 | 83 | 88 | 90 | 83
Mouth Open | 2 | 89 | 89 | 87 | 89 | 91 | 95 | 96 | 97 | 93
Smiling | 2 | 89 | 90 | 90 | 90 | 92 | 93 | 99 | 99 | 93
Wear Lipstick | 3 | 92 | 92 | 91 | 91 | 93 | 94 | 99 | 98 | 95
High Cheekbones | 5 | 84 | 84 | 80 | 85 | 86 | 89 | 92 | 94 | 88
Male | 8 | 94 | 95 | 90 | 93 | 97 | 96 | 99 | 99 | 98
Heavy Makeup | 11 | 88 | 87 | 89 | 89 | 90 | 84 | 98 | 98 | 92
Wavy Hair | 18 | 73 | 70 | 70 | 75 | 78 | 79 | 83 | 87 | 81
Oval Face | 22 | 63 | 63 | 58 | 64 | 66 | 66 | 68 | 72 | 70
Pointy Nose | 22 | 66 | 67 | 63 | 65 | 70 | 73 | 72 | 78 | 73
Arched Eyebrows | 23 | 77 | 79 | 70 | 78 | 79 | 80 | 79 | 86 | 82
Black Hair | 26 | 83 | 84 | 80 | 85 | 87 | 90 | 92 | 95 | 89
Big Lips | 26 | 62 | 61 | 61 | 61 | 66 | 68 | 60 | 66 | 69
Big Nose | 27 | 73 | 73 | 76 | 74 | 77 | 80 | 80 | 85 | 80
Young | 28 | 76 | 75 | 80 | 75 | 83 | 84 | 87 | 90 | 86
Straight Hair | 29 | 65 | 66 | 61 | 67 | 72 | 73 | 73 | 80 | 76
Brown Hair | 30 | 79 | 82 | 76 | 84 | 84 | 86 | 87 | 89 | 86
Bags Under Eyes | 30 | 74 | 73 | 71 | 74 | 79 | 80 | 73 | 82 | 82
Wear Earrings | 31 | 75 | 76 | 70 | 76 | 80 | 83 | 83 | 86 | 85
No Beard | 33 | 88 | 88 | 88 | 88 | 93 | 94 | 96 | 98 | 95
Bangs | 35 | 91 | 90 | 88 | 90 | 94 | 95 | 98 | 99 | 96
Blond Hair | 35 | 90 | 90 | 85 | 89 | 93 | 95 | 99 | 99 | 95
Bushy Eyebrows | 36 | 78 | 80 | 75 | 79 | 85 | 84 | 82 | 88 | 87
Wear Necklace | 38 | 70 | 71 | 66 | 71 | 73 | 74 | 59 | 69 | 76
Narrow Eyes | 38 | 64 | 65 | 61 | 65 | 74 | 72 | 59 | 71 | 79
5 o’clock Shadow | 39 | 85 | 85 | 82 | 84 | 89 | 90 | 82 | 91 | 93
Receding Hairline | 42 | 81 | 82 | 79 | 81 | 87 | 87 | 76 | 82 | 90
Wear Necktie | 43 | 83 | 79 | 80 | 82 | 92 | 88 | 90 | 96 | 95
Eyeglasses | 44 | 92 | 91 | 85 | 91 | 97 | 96 | 98 | 99 | 99
Rosy Cheeks | 44 | 86 | 90 | 82 | 92 | 90 | 88 | 78 | 86 | 92
Goatee | 44 | 90 | 89 | 85 | 86 | 94 | 96 | 95 | 98 | 97
Chubby | 44 | 81 | 83 | 78 | 82 | 87 | 87 | 79 | 85 | 93
Sideburns | 44 | 89 | 90 | 80 | 90 | 94 | 92 | 88 | 94 | 97
Blurry | 45 | 74 | 76 | 68 | 76 | 86 | 85 | 59 | 72 | 93
Wear Hat | 45 | 90 | 89 | 90 | 90 | 96 | 98 | 99 | 99 | 99
Double Chin | 45 | 83 | 84 | 80 | 84 | 89 | 89 | 74 | 87 | 94
Pale Skin | 46 | 81 | 82 | 78 | 80 | 92 | 92 | 80 | 94 | 96
Gray Hair | 46 | 90 | 90 | 88 | 90 | 94 | 95 | 91 | 96 | 99
Mustache | 46 | 88 | 90 | 60 | 88 | 92 | 94 | 73 | 82 | 97
Bald | 48 | 93 | 92 | 79 | 93 | 95 | 97 | 90 | 95 | 99
Average | – | 81.17 | 81.48 | 77.45 | 81.60 | 85.93 | 86.60 | 83.83 | 88.78 | 89.05

Table 2: Balanced mean accuracy (mA, %) for each attribute and the class imbalance level (majority class rate − 50%) of each attribute on the CelebA dataset. The 1st/2nd best results are highlighted in red/blue.

4 Experiments

4.1 Datasets

The proposed Dynamic Curriculum Learning (DCL) method is evaluated on two commonly used attribute datasets, comparing with the state-of-the-art methods.

CelebA [30] is a human facial attribute dataset with annotations for 40 binary classification tasks. CelebA is an imbalanced dataset: on some attributes the imbalance level (majority class rate − 50%) reaches 48, meaning a model that simply predicts every input as the majority class already attains 98% accuracy. The dataset contains 202,599 images of 10,177 identities. Following the standard protocol, it is partitioned into 162,770 / 19,867 / 19,962 images for training / validation / testing.

RAP [26] is a richly annotated dataset for pedestrian attribute recognition in real surveillance scenarios. It contains 41,585 images from 26 indoor cameras, annotated with 72 different attributes. RAP is a highly imbalanced dataset, with imbalance ratios (minority to majority samples) of up to 1:1800.

4.2 Evaluation Metric

Plain accuracy is not a proper metric when data imbalance is a main concern: a model that ignores every minority class and predicts each sample as the majority class achieves an impressive score while being practically useless. Following the standard protocol, we instead apply the class-balanced accuracy (binary classification) to every single task and then compute the mean accuracy over all tasks as the overall metric. It can be formulated as follows:

$mA_i = \frac{1}{2}\left(\frac{TP_i}{P_i} + \frac{TN_i}{N_i}\right)$   (12)

$mA = \frac{1}{M}\sum_{i=1}^{M} mA_i$   (13)

where $mA_i$ indicates the balanced mean accuracy of the $i$-th task, with $TP_i$ and $P_i$ indicating the counts of correctly predicted positive samples and of ground-truth positive samples for the $i$-th task, and $TN_i$ and $N_i$ the corresponding counts for negatives. $M$ is the number of tasks, and the overall balanced mean accuracy is the average of the per-task values. Balanced mean accuracy (mA) is widely used for evaluating the performance of attribute prediction.
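A small numpy sketch of Equations 12–13, assuming binary labels per task:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    # Eq. 12: average of the positive and negative recalls for one task.
    pos, neg = y_true == 1, y_true == 0
    tpr = (y_pred[pos] == 1).mean()  # TP / P
    tnr = (y_pred[neg] == 0).mean()  # TN / N
    return 0.5 * (tpr + tnr)

def mean_accuracy(y_true_all, y_pred_all):
    # Eq. 13: average the per-task balanced accuracy over all M tasks.
    return np.mean([balanced_accuracy(t, p)
                    for t, p in zip(y_true_all, y_pred_all)])
```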

4.3 Experiments on CelebA Face Dataset

4.3.1 Implementation Details

Network Architecture. We use DeepID2 [36], a CNN with 4 convolutional layers, as the backbone for the CelebA experiments for fair comparison; all experiments listed in Table 2 use this backbone. The baseline is trained with a plain cross-entropy loss. Since CelebA is a multi-task dataset, we attach an independent 64-d feature layer and a final output layer for each task. Hyper-Parameter Settings. We train DCL with a learning rate of 0.003, a batch size of 512, 300 training epochs, and a weight decay of 0.0005. Horizontal flipping is applied during training. Specifically, we set the sampling scheduler to the convex function in Equation 1, the loss scheduler to the composite function in Equation 11 with the advanced self-learning point $\epsilon = 0.3$, and $k$ in $L_{TEA}$ (Equation 9) to 25.

4.3.2 Overall Performance

We compare our proposed DCL method with DeepID2 [36], the over-sampling and down-sampling baselines of [8], Cost-Sensitive [14], Selective Learning (SL) [13], CRL [6], LMLE [16], and CLMLE [17].

Table 2 shows the overall results on CelebA, evaluated with balanced mean accuracy. Our proposed DCL method, built on the two dynamic curriculum schedulers, outperforms all competitors. The baseline of our evaluation is the general face classification framework DeepID2 [36] with the standard cross-entropy loss, over which we gain around 8 points. Compared to recent advanced methods, ours outperforms Selective Learning [13] by 3.12, CRL-I [6] by 2.45, LMLE [16] by 5.22, and CLMLE [17] by 0.27 points, respectively. Our framework is a light-weight system trained end-to-end on single batches, whereas CLMLE [17] requires a more complex training procedure with multiple rounds of quintuplet construction and learning.

4.3.3 Effect of Data Imbalance Level

Figure 5: Comparison of performance gain to the DeepID2 for DCL, CRL and CLMLE with respect to the imbalance ratio.

In this part, we show in Figure 5 the performance gain of each attribute over the baseline DeepID2, plotted against the data imbalance level (majority class rate − 50%). The red, blue, and green curves indicate DCL, CRL, and CLMLE, respectively; the horizontal axis is the imbalance level and the vertical axis is each method's gain over the baseline. Our proposed DCL method outperforms the baseline on every individual attribute. In contrast, CRL is poor on "Heavy Makeup" (−4 at level 11), and CLMLE is poor on "Wear Necklace" (−1 at level 43), "Blurry" (−2 at level 45), and "Mustache" (−6 at level 46). In addition, our method outperforms the other two by a large margin when the data is highly imbalanced: when the imbalance level exceeds 25 (refer to Table 2), our method gains on average 2.8 and 2.1 points over CRL and CLMLE, respectively. The most significantly improved attribute is "Blurry", at imbalance level 45 (a gain of 8 points over CRL and 21 over CLMLE). These results demonstrate the effectiveness of the two proposed curriculum schedulers for learning a model with better generalization and discrimination.

4.3.4 Ablation Study

The proposed DCL framework has several important parts: the sampling scheduler, the triplet loss with easy anchors, and the loss scheduler. We provide an ablation study in Table 3 to illustrate the advantage of each component. The sampling scheduler (C1) dynamically manages the target data distribution from imbalanced to balanced (easy to hard) and the weight of each sample in $L_{DSL}$ (Equation 6). Metric learning with easy anchors (C2) modifies the anchor selection of the triplet loss for better learning ($L_{TEA}$). The loss scheduler (C3) controls the learning importance between the $L_{DSL}$ and $L_{TEA}$ losses. The table shows that the two curriculum schedulers contribute large performance gains to the whole system.

Method | C1 | C2 | C3 | Performance
--- | --- | --- | --- | ---
1: Baseline (DeepID2) | 0 | 0 | 0 | 81.17
2: 1 + C1 | 1 | 0 | 0 | 86.58
3: 2 + C2 | 1 | 1 | 0 | 87.55
4: 3 + C3 | 1 | 1 | 1 | 89.05

Table 3: Ablation study of each component in the DCL method: C1 - sampling scheduler, C2 - triplet loss with easy anchors, C3 - loss scheduler.
Method | Performance
--- | ---
1: DeepID2 | 81.17
2: DeepID2 + Convex | 86.58
3: DeepID2 + Linear | 86.36
4: DeepID2 + Concave | 85.90
5: DeepID2 + Composite | 86.07

Table 4: Performance comparison between different scheduler function selections. We vary only the sampling scheduler and disable the metric learning with easy anchors and the loss scheduler to avoid mutual effects. Method 2 in this table corresponds to method 2 in Table 3.
Method | Deep-Mar [25] | Inception-v2 [18] | HP-net [28] | JRL [40] | VeSPA [34] | LG-Net [27] | DCL (ours)
--- | --- | --- | --- | --- | --- | --- | ---
mA | 73.8 | 75.4 | 76.1 | 77.8 | 77.7 | 78.7 | 83.7

Table 5: Comparison with the state-of-the-art methods on the RAP [26] dataset, evaluated by balanced mean accuracy (mA). The 1st/2nd best results are highlighted in red/blue.

4.3.5 Effect of Scheduler Function Selection

Since we design several scheduler functions with different properties, we also analyze their effect. The experimental setup varies only the sampling scheduler and disables the metric learning with easy anchors and the loss scheduler to avoid mutual effects. As Table 4 shows, even though the performance differences between the functions are below 1 point, since they all decrease gradually from 1 to 0, the convex function is clearly the best selection for the sampling scheduler. Since the scheduler function's slope indicates the learning speed, this suggests it is better for the system to learn slowly on the imbalanced data at the very beginning of training and then speed up toward balanced data learning.

4.4 Experiments on RAP Pedestrian Dataset

4.4.1 Implementation Details

Network Architecture. We use ResNet-50 [15] as the backbone for our proposed method. For each attribute, we attach an extra 64-dimensional feature layer and a final output layer. Our baseline in Table 6 is a ResNet-50 model trained with the cross-entropy loss. Hyper-Parameter Settings. We train DCL with a batch size of 512, a learning rate of 0.003, a weight decay of 0.0005, and 300 epochs. Horizontal flipping is applied during training. As on CelebA, we set the sampling scheduler to the convex function in Equation 1, the loss scheduler to the composite function in Equation 11 with the advanced self-learning point $\epsilon = 0.3$, and $k$ in $L_{TEA}$ (Equation 9) to 25.

4.4.2 Overall Evaluation

For the overall evaluation, we include several state-of-the-art methods that have been evaluated on this dataset: Deep-Mar [25], Inception-v2 [18], HP-net [28], JRL [40], VeSPA [34], and LG-Net [27].

Table 5 reports the balanced mean accuracy (mA) of each method on the RAP dataset; the 1st/2nd best results are highlighted in red/blue. Our proposed DCL method outperforms the previous best (LG-Net) by a large margin (5 points). In terms of computational complexity, methods like LG-Net and HP-net apply class-wise attention, so they require more resources for training and inference, while our method is an end-to-end framework with limited extra cost.

4.4.3 Effect of Data Imbalance Ratio

We also analyze the performance of DCL with respect to the imbalance ratio on the RAP dataset. Different from the imbalance level (majority class rate − 50%), the imbalance ratio (1:x) is the ratio of minority samples to majority samples; as mentioned, it reaches 1:1800 on this dataset. To show the advantage of our method for imbalanced data learning, we group the tasks into three categories by imbalance ratio and compare against the baseline, a ResNet-50 model trained with the cross-entropy loss. From Table 6, for group 1 with ratios from 1 to 25, our method outperforms the baseline by 3.8 points. On the more imbalanced group 2 (ratios 25 to 50) and group 3 (ratios above 50), DCL achieves gains of 15.0 and 17.5 points, respectively. This demonstrates that the proposed DCL method is indeed effective for extremely imbalanced data learning.

Imbalance Ratio (1:x) | 1–25 | 25–50 | >50
--- | --- | --- | ---
Baseline | 79.3 | 68.9 | 68.0
DCL | 83.1 | 83.9 | 85.5

Table 6: Balanced mean accuracy (mA) at different imbalance ratios. The baseline is a ResNet-50 model trained with the cross-entropy loss. The imbalance ratio is defined as the ratio of minority samples to majority samples.

5 Conclusion

In this work, we propose a unified framework for imbalanced data learning called Dynamic Curriculum Learning (DCL). For the first time, we introduce the idea of curriculum learning into this problem by designing two curriculum schedulers, one for sampling and one for loss backward propagation. Like teachers, these two schedulers dynamically manage the network so that it learns from imbalanced to balanced and from easy to hard. In addition, a metric learning triplet loss with easy anchors is designed for a better feature embedding. We evaluate our method on two widely used attribute analysis datasets (CelebA and RAP) and achieve new state-of-the-art performance, demonstrating the generalization and discriminative power of our model. In particular, DCL shows a strong classification ability when the data is highly imbalanced.

References

  • [1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009.
  • [2] C. L. Castro and A. P. Braga. Novel cost-sensitive approach to improve the multilayer perceptron performance on imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 24(6):888–899, 2013.
  • [3] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
  • [4] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer. Smoteboost: Improving prediction of the minority class in boosting. In European conference on principles of data mining and knowledge discovery, pages 107–119. Springer, 2003.
  • [5] J. Chen, C.-A. Tsai, H. Moon, H. Ahn, J. Young, and C.-H. Chen. Decision threshold adjustment in class prediction. SAR and QSAR in Environmental Research, 17(3):337–352, 2006.
  • [6] Q. Dong, S. Gong, and X. Zhu. Class rectification hard mining for imbalanced deep learning. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • [7] Q. Dong, S. Gong, and X. Zhu. Imbalanced deep learning by minority class incremental rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [8] C. Drummond, R. C. Holte, et al. C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, volume 11, pages 1–8. Citeseer, 2003.
  • [9] F. Fernández-Navarro, C. Hervás-Martínez, and P. A. Gutiérrez. A dynamic over-sampling procedure based on sensitivity for multi-class problems. Pattern Recognition, 44(8):1821–1833, 2011.
  • [10] C. Gong, D. Tao, S. J. Maybank, W. Liu, G. Kang, and J. Yang. Multi-modal curriculum learning for semi-supervised image classification. IEEE Transactions on Image Processing, 25(7):3249–3260, 2016.
  • [11] S. Guo, W. Huang, H. Zhang, C. Zhuang, D. Dong, M. R. Scott, and D. Huang. Curriculumnet: Weakly supervised learning from large-scale web images. arXiv preprint arXiv:1808.01097, 2018.
  • [12] H. Han, W.-Y. Wang, and B.-H. Mao. Borderline-smote: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pages 878–887. Springer, 2005.
  • [13] E. M. Hand, C. D. Castillo, and R. Chellappa. Doing the best we can with what we have: Multi-label balancing with selective learning for attribute prediction. In AAAI, 2018.
  • [14] H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering, (9):1263–1284, 2008.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [16] C. Huang, Y. Li, C. Change Loy, and X. Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5375–5384, 2016.
  • [17] C. Huang, Y. Li, C. C. Loy, and X. Tang. Deep imbalanced learning for face recognition and attribute prediction. arXiv preprint arXiv:1806.00194, 2018.
  • [18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [19] P. Jeatrakul, K. W. Wong, and C. C. Fung. Classification of imbalanced data by combining the complementary neural network and smote algorithm. In International Conference on Neural Information Processing, pages 152–159. Springer, 2010.
  • [20] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann. Self-paced curriculum learning. In AAAI, volume 2, page 6, 2015.
  • [21] S. H. Khan, M. Hayat, M. Bennamoun, F. A. Sohel, and R. Togneri. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE transactions on neural networks and learning systems, 29(8):3573–3587, 2018.
  • [22] T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano. Supervised neural network modeling: an empirical investigation into learning from imbalanced data with labeling errors. IEEE Transactions on Neural Networks, 21(5):813–830, 2010.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [24] Y. J. Lee and K. Grauman. Learning the easy things first: Self-paced visual category discovery. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1721–1728. IEEE, 2011.
  • [25] D. Li, X. Chen, and K. Huang. Multi-attribute learning for pedestrian attribute recognition in surveillance scenarios. In Pattern Recognition (ACPR), 2015 3rd IAPR Asian Conference on, pages 111–115. IEEE, 2015.
  • [26] D. Li, Z. Zhang, X. Chen, H. Ling, and K. Huang. A richly annotated dataset for pedestrian attribute recognition. arXiv preprint arXiv:1603.07054, 2016.
  • [27] P. Liu, X. Liu, J. Yan, and J. Shao. Localization guided learning for pedestrian attribute recognition. arXiv preprint arXiv:1808.09102, 2018.
  • [28] X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, S. Yi, J. Yan, and X. Wang. Hydraplus-net: Attentive deep features for pedestrian analysis. arXiv preprint arXiv:1709.09930, 2017.
  • [29] X.-Y. Liu, J. Wu, and Z.-H. Zhou. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):539–550, 2009.
  • [30] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
  • [31] T. Maciejewski and J. Stefanowski. Local neighbourhood extension of smote for mining imbalanced data. In Computational Intelligence and Data Mining (CIDM), 2011 IEEE Symposium on, pages 104–111. IEEE, 2011.
  • [32] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1717–1724, 2014.
  • [33] A. Pentina, V. Sharmanska, and C. H. Lampert. Curriculum learning of multiple tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5492–5500, 2015.
  • [34] M. S. Sarfraz, A. Schumann, Y. Wang, and R. Stiefelhagen. Deep view-sensitive pedestrian attribute inference in an end-to-end model. arXiv preprint arXiv:1707.06089, 2017.
  • [35] W. Shen, X. Wang, Y. Wang, X. Bai, and Z. Zhang. Deepcontour: A deep convolutional feature learned by positive-sharing loss for contour detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3982–3991, 2015.
  • [36] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, pages 1988–1996, 2014.
  • [37] J. S. Supancic and D. Ramanan. Self-paced learning for long-term tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2379–2386, 2013.
  • [38] Y. Tang, Y.-Q. Zhang, N. V. Chawla, and S. Krasser. Svms modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(1):281–288, 2009.
  • [39] K. M. Ting. A comparative study of cost-sensitive boosting algorithms. In In Proceedings of the 17th International Conference on Machine Learning. Citeseer, 2000.
  • [40] J. Wang, X. Zhu, S. Gong, and W. Li. Attribute recognition by joint recurrent learning of context and correlation. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • [41] S. Wang, W. Liu, J. Wu, L. Cao, Q. Meng, and P. J. Kennedy. Training deep neural networks on imbalanced data sets. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 4368–4374. IEEE, 2016.
  • [42] H. Yu, C. Sun, X. Yang, W. Yang, J. Shen, and Y. Qi. Odoc-elm: Optimal decision outputs compensation-based extreme learning machine for classifying imbalanced data. Knowledge-Based Systems, 92:55–70, 2016.
  • [43] B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, pages 435–442. IEEE, 2003.
  • [44] Z.-H. Zhou and X.-Y. Liu. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1):63–77, 2006.