Hard-Aware Fashion Attribute Classification

07/25/2019 ∙ by Yun Ye, et al. ∙ JD.com, Inc. ∙ Columbia University ∙ Peking University ∙ Intel

Fashion attribute classification is of great importance to many high-level tasks such as fashion item search, fashion trend analysis, and fashion recommendation. The task is challenging due to the extremely imbalanced data distribution, particularly for attributes with only a few positive samples. In this paper, we introduce a hard-aware pipeline to make full use of "hard" samples/attributes. We first propose Hard-Aware BackPropagation (HABP) to efficiently and adaptively focus training on "hard" data. Then, for the identified hard labels, we propose to synthesize complementary samples for training. To stabilize training, we extend semi-supervised GAN by directly deactivating outputs for synthetic complementary samples (Deact). In general, our method is more effective in addressing "hard" cases: HABP puts more weight on "hard" samples, while for "hard" attributes with insufficient training data, Deact contributes more stable synthetic samples that further improve performance. Our method is verified on a large-scale fashion dataset, outperforming other state-of-the-art methods without any additional supervision.




1 Introduction

Attributes, also known as mid-level semantic features [1, 2], are fundamental for describing fashion items. As an example, in Fig. 1, the skirt shown in the upper plot can be described with "print" texture, "tribal" style, and "a-line" shape. Attributes have been extensively used in many computer vision tasks, such as image retrieval [3, 4], person Re-ID [5], etc. Particularly in the fashion domain, clothing attributes are of great importance to many other high-level tasks including fashion image classification [6, 7], fashion item search [8, 9, 6, 10, 11], fashion style understanding [12, 13, 14, 15], fashion recommendation [7, 16, 17], fashion outfit learning [16, 18, 19], and fashion trend analysis [20, 7, 21].

Figure 1: Sample images and statistics of DeepFashion-C. Over 1/5 of the attributes have fewer than 100 positive samples; 80% of the attributes have fewer than 1000 positive samples.
Figure 2: Model-predicted probabilities for positive samples vs. numbers of positive samples on DeepFashion-C. CE: Cross Entropy loss. Left: train set; Right: test set. In general, our method performs better on attributes with only a few positive samples.

In this paper, we address one of the major problems in fashion attribute classification: imbalanced data distribution, specifically the samples or attributes with very few positive labels. Patterns in fashion images are highly diversified due to their non-rigid nature and the abundant semantics behind them. Combined with the very rich attributes of fashion items, this brings imbalance and sparsity of positive labels for some attributes or specific kinds of samples. The upper plot in Fig. 1 shows the positive attribute counts from the DeepFashion: Category and Attribute Prediction Benchmark (DeepFashion-C) [6]. The dataset contains images and tags from shopping websites and search engines, and is representative of a real-world scenario. Among the 1000 annotated attributes, the most frequent label "print" has 37,367 occurrences, whereas the rarest label "topstitched" shows up in only 51 images. In addition to imbalance, fashion attributes are usually sparsely distributed: as shown in Fig. 1, over 1/5 of the attributes have fewer than 100 positive labels, and on average there are only 3.3 positive tags per image. Moreover, the diversity of fashion items makes the problem even worse. Take "party" as an example: countless diversified fashion images can be labeled "party" (Fig. 1), such that a specific minority "party" case may not be easy to learn from the 2,882 tagged samples. So the problem exists at both the attribute and the sample level. A major difficulty in training with such a dataset is that the majority data are generally well trained while the minority data are either under-trained or prone to over-fitting with too few samples.

Many efforts have been devoted to tackling this problem [22]. A common solution is re-sampling [23, 24, 25]. Though widely used, over-sampling has limitations such as the tendency to over-fit, whereas under-sampling risks missing valuable information. Moreover, it is not trivial to extend re-sampling to multi-label datasets [26, 27, 28], and few works have focused on imbalanced multi-label computer vision problems [29]. Another popular family takes the misclassification errors into account, known as cost-sensitive learning [23, 30, 31, 32, 33]. Broadly speaking, it covers a wide range of methods that use cost-based algorithms or strategies. Within this scope, hard-aware methods have been actively studied in recent years with deep neural networks, such as focal loss [34], hard example mining [35, 36, 29], etc.

In this work, we develop an approach leveraging both cost-sensitive and re-sampling strategies to make full use of "hard" data. The key idea is to focus on the minority data as much as possible without affecting the majority, since the latter are usually already well trained. Minority data are often strongly correlated with high classification error, as suggested by [24, 34]. We also verified this by comparing the average predicted probability for positive attributes vs. the number of positive labels in Fig. 2 (more details will be discussed in Section 4.1). In the figure, the blue crosses are predicted probabilities for positive samples, from a well-trained model using cross entropy loss. Based on this, we use the error probability estimated by the model [34] as a metric to identify "hard" data. To make the best of this key metric in training, two techniques are developed. We first present a solution from the view of cost-sensitive learning: backpropagate losses on each sample and each attribute weighted by the estimated errors. We refer to this method as Hard-Aware BackPropagation (HABP). From the perspective of re-sampling, we further propose to sample synthetic complementary images, which are samples that lie around but do not overlap with real samples in feature space, to train hard/minority attributes with generative adversarial networks (GANs) [37]. The proposed method is similar to semi-supervised GAN [38] but is much easier to train and implement. A possible reason that GANs are not widely used in practical problems is the trickiness of training at high resolution, induced by problems including mode collapse [39] and gradient vanishing [40]. In order to generate diversified high-resolution complementary images, we introduce a decorrelation regularization loss to deal with mode collapse. It successfully relieves mode collapse in training the multi-resolution GAN (MR-GAN) architecture we use in this work.

Evaluations on DeepFashion-C demonstrate that our approach outperforms the state-of-the-art without using additional supervision. Our main contribution is to take full advantage of "hard" samples with two techniques, from the views of cost-sensitive learning and re-sampling respectively: 1) We propose Hard-Aware BackPropagation (HABP), which effectively reduces the impact of strong imbalance in multi-label image datasets. 2) Based on the identified hard labels, we present a method to train the model with synthetic complementary samples, together with a decorrelation loss for stably generating high-resolution synthetic samples.

2 Related Work

2.1 Fashion Attribute Classification

Fashion attribute classification has become a prevalent research topic [8]. In the early stage, however, most published datasets were either small-scale or annotated with a small number of attributes [41, 42, 10]. Based on DeepFashion-C, FashionNet [6] proposed to jointly learn clothing attributes and landmarks. Corbière et al. [43] collected noisy data from shopping websites to perform weakly supervised image tagging. In the recent work [44], the authors grounded human knowledge in landmark detection; attribute classification was then improved with landmark-enhanced visual attention. Most existing works incorporate other supervision (such as landmarks or low-level features [41]) to improve attribute classification. A few [29] use attribute annotations only, but the method is not strongly tied to vision problems. In contrast, our method uses only attribute annotations and makes full use of the training images in a semi-supervised manner.

2.2 Hard-Aware Learning

Hard example mining [45] has seen success with deep neural networks in areas including face recognition [46], object detection [35], person Re-ID [47], and metric learning [36]. Based on the same idea that hard samples are usually more informative, several variants have been proposed. Among them, focal loss (FL) [34] is closely related to our work, sharing the idea of modeling the estimated probability of classification error and taking it as a weight in the loss function. Variants of FL have been applied to attribute classification [48]. A key difference between HABP and FL is that HABP introduces an output-dependent normalization term for better stability and performance. OHEM [35] is also related to our method in the idea of sampling "hard" data. More details are discussed in Section 3.

2.3 GAN & Semi-Supervised GAN

GANs [37, 39] have enjoyed a resurgence of interest in recent years for their ability to generate high-fidelity images. A number of efforts have been made to synthesize higher-resolution images. Denton et al. [49] employed a Laplacian pyramid with multiple discriminators to generate images at multiple resolutions. The idea of multi-resolution was further developed in [50] with the progressive growing of GANs. Based on the idea of weight sharing across multiple resolutions, Karnewar et al. [51] published the multi-scale gradients GAN (MSG-GAN), which trains on multi-resolution images simultaneously. To make use of GANs for discriminative tasks, semi-supervised GANs [38, 52, 53] jointly train a generator and a discriminator that simultaneously classifies true labels for real samples and an auxiliary label for fake samples. The scheme is good at learning a better decision boundary with only a few samples. In this work, we introduce deactivation-based training with synthetic complementary samples (Deact), which is similar to semi-supervised GAN but easier to train and implement. To make the proposed method stable, we further propose decorrelation regularization to alleviate the mode collapse [39, 54] problem in GANs.

Figure 3: Overall pipeline of the proposed method. HABP is calculated as the error-probability-weighted mean of cross entropy losses over all nodes. Based on the approximated error probabilities, hard labels are sampled to generate synthetic complementary samples that further improve the performance. The part in the purple dashed box is the network for attribute classification.

3 Methodology

3.1 HABP

The key idea of HABP is to emulate sampling losses from the output nodes. Consider a batch with N samples and M attributes as illustrated in Fig. 3. After a forward pass there will be N × M output nodes, and for each node a cross entropy (CE) loss can be calculated from the labels:

L_ij = -log p_ij,   (1)

where p_ij is the model-predicted probability of the target label, for the j-th attribute of the i-th sample in the batch. As an example, in binary classification a commonly used formula is:

p = y σ(x) + (1 - y)(1 - σ(x)),   (2)

where σ, y, and x are the sigmoid function, ground truth label, and model output respectively. CE assumes that individual samples and attributes are equally important. When we apply the CE loss to training on an extremely imbalanced dataset, minority attributes are always much less trained than majority attributes, resulting in much higher prediction errors.
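As a concrete illustration, the per-node target-label probability and CE loss above can be sketched in NumPy (the function names are ours, for illustration only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def target_prob(x, y):
    # Probability of the ground-truth label for a binary attribute:
    # sigmoid(x) when y == 1, and 1 - sigmoid(x) when y == 0.
    return y * sigmoid(x) + (1 - y) * (1 - sigmoid(x))

def ce_loss(x, y):
    # Per-node cross entropy on the target-label probability.
    return -np.log(target_prob(x, y) + 1e-12)
```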

A natural idea is to only backpropagate losses on the more informative nodes. For example, one solution is to simply sample "hard" nodes to backpropagate losses. We borrow the idea from FL and model the sampling probability as the probability of wrong prediction:

w_ij = (1 - p_ij)^γ,   (3)

where γ is a tuning parameter. We then use Eq. (3) to calculate a weighted average of the losses (Eq. (1)) in a batch, emulating sampling nodes for backpropagation, which we call HABP:

L_HABP = (Σ_ij w_ij L_ij) / (Σ_ij w_ij).   (4)

Note that this is equivalent in expectation to sampling nodes with the error probabilities, while it is more efficient, because direct sampling suffers from the risk of missing the information in unsampled nodes. Compared to FL, HABP makes hard losses more prominent and stable: in multi-label training, losses on hard nodes may be averaged out by the large number of attributes, particularly in the late training stage. For example, in the beginning most attributes and samples tend to be "hard"; as training goes on, the ratio of "hard" samples becomes smaller. If the number of attributes is large, the total loss computed by FL at different training stages can differ by orders of magnitude, which may result in either instability at the beginning stage or too-slow learning at the late stage. More discussion with experiments is presented in Section 4.2.
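The error-weighted average described above can be sketched as follows; `habp_loss` and its interface are our own illustration, with `gamma` defaulting to the 1.2 used in the paper's experiments:

```python
import numpy as np

def habp_loss(p_target, gamma=1.2):
    # p_target: (N, M) model-predicted probabilities of the target labels
    # for N samples and M attributes.
    eps = 1e-12
    ce = -np.log(p_target + eps)        # per-node cross entropy losses
    w = (1.0 - p_target) ** gamma       # error-probability weights
    # Weighted mean: normalizing by the weight sum keeps the total loss
    # from hard nodes prominent no matter how many nodes are hard.
    return (w * ce).sum() / (w.sum() + eps)
```

Note how the denominator is the output-dependent normalization term that distinguishes HABP from FL: with FL, a batch containing few hard nodes yields a small total loss, while here the hard nodes always dominate the weighted mean.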

3.2 Deactivation Training with Synthetic Complementary Samples

As a popular re-sampling technique, semi-supervised GAN has two drawbacks: 1) Training a GAN is tricky. There are differences between training a GAN and training a discriminative model; for example, a GAN usually requires more iterations and a larger batch size [50, 55] to achieve good image quality, which may be neither optimal nor necessary for training a classification model. 2) Dai et al. stated and proved that a good semi-supervised GAN requires a "bad" generator. Ideally, the generator should synthesize samples around, but not overlapping with, real samples in feature space. This is again a tricky task.

For these reasons, we present an alternative scheme that is easier to implement and more stable in training. We first train a generator with MR-GAN for enough epochs to synthesize recognizable images. Then, to make sure the generator is "bad" enough for semi-supervised training, we degrade the generator by adding an element-wise perturbation to the most semantically meaningful feature maps (Fig. 4), which are the feature maps directly projected from the latent space. Empirically, the perturbation should be strong enough to synthesize images that are visually different from real samples, as in Fig. 4.
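This degradation step might be sketched as element-wise Gaussian noise on the projected feature maps (the paper reports σ = 1.5 for its fashion experiments; the function name and interface are our own assumptions):

```python
import numpy as np

def perturb_features(feat, sigma=1.5, rng=None):
    # Degrade a trained generator by adding element-wise Gaussian noise to
    # the feature maps projected from the latent space, so that synthesized
    # samples land near, but not on, the real-data distribution.
    rng = np.random.default_rng(0) if rng is None else rng
    return feat + rng.normal(0.0, sigma, size=feat.shape)
```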

Figure 4: Demonstration of generating complementary samples from a well-trained GAN. z represents the latent noise, and c represents the conditional inputs. The image above is generated without perturbation.

To make it easier both to implement and to extend to the binary attribute case, we propose an alternative to auxiliary-classifier-based semi-supervised GAN. Since activating the auxiliary output for fake samples is largely equivalent to deactivating the outputs for real classes, we simply pose a deactivation loss that minimizes the activations of the real classifier outputs when training with synthetic complementary images:

L_deact = (1/C) Σ_{c=1}^{C} max(0, p_c - δ),   (5)

where C is the number of classes, p_c is the predicted probability of class c, and δ is a threshold of activation. We use the same δ for all the experiments in this paper. For binary attribute classification, we want the output to activate for neither the positive nor the negative label, so the formula simplifies to:

L_deact = max(0, σ(x) - δ) + max(0, 1 - σ(x) - δ).   (6)

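One plausible form of the deactivation loss described above (penalizing class activations that exceed a threshold) can be sketched as follows; `deact_loss` and the `delta` default are illustrative assumptions, not the paper's exact code:

```python
import numpy as np

def deact_loss(p, delta=0.1):
    # Penalize real-class activations of synthetic complementary samples
    # that exceed a threshold delta; p: (C,) predicted class probabilities.
    # Probabilities at or below delta contribute nothing, so well-behaved
    # (deactivated) outputs are left untouched.
    return np.maximum(0.0, p - delta).mean()
```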
3.3 Decorrelation Regularization for MR-GAN

Aiming to synthesize high-resolution images with a GAN, we employ a conditional [56] multi-resolution architecture as illustrated in Fig. 5. Both generator and discriminator deal with images at different resolutions simultaneously. In Fig. 5, z is the latent noise and c is the conditional input vector of attribute/category annotations. Each dimension of c corresponds to an attribute: if a positive label is sampled, the value of the corresponding dimension is set to 1. In such a structure, the higher-resolution images are refined versions of the lower-resolution images, so training is much more stable than with a single-resolution scheme.

Figure 5: Architecture of MR-GAN. Decorrelation regularization is applied to the weights for projecting latent noise to corresponding feature maps.

As training to convergence is no longer a problem in such an architecture, we put our focus on mode collapse. Notice that the generated high-resolution images strongly depend on the low-resolution images, so if we can obtain diversified low-resolution images, the high-resolution images are unlikely to fall into strong mode collapse. We therefore simply use a decorrelation (DC) regularization loss to decrease the correlation between latent dimensions (Fig. 5). For a transposed convolution projecting d-dimensional noise to feature maps, we denote w_i as the (flattened) filter weights from the i-th dimension of the noise to all channels of the feature maps. Then we define the decorrelation regularization loss as:

L_DC = (2 / (d(d - 1))) Σ_{i<j} ( (w_i · w_j) / (||w_i|| ||w_j||) )^2.   (7)

Note that each term measures correlation as the square of cosine similarity, ranging from 0 to 1. Together with the multi-resolution architecture, we call our method MR-GAN.
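The squared-cosine decorrelation above can be sketched in NumPy; `decorrelation_loss` treats each latent dimension's projection filters as one flattened row vector (our interpretation, not the paper's exact code):

```python
import numpy as np

def decorrelation_loss(W):
    # W: (d, k) matrix whose i-th row holds the flattened projection
    # filters of the i-th latent dimension. The loss is the mean squared
    # cosine similarity over all pairs of distinct latent dimensions.
    norms = np.linalg.norm(W, axis=1, keepdims=True) + 1e-12
    cos = (W / norms) @ (W / norms).T      # (d, d) pairwise cosine similarities
    d = W.shape[0]
    off_diag = cos ** 2 - np.eye(d)        # zero out the diagonal (cos^2 = 1)
    return off_diag.sum() / (d * (d - 1))  # average over ordered pairs i != j
```

Orthogonal projection filters yield a loss near 0, while identical filters yield a loss near 1, matching the stated 0-to-1 range.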

3.4 Overall Training Pipeline

With the key components above, we present our overall pipeline in Fig. 3. The underlying idea is that semi-supervised GAN usually does not help on data with sufficient labels. So we want to train with synthetic samples only for the minority or hard attribute labels, whilst not affecting the majority or easy attributes. We implement this idea by sampling the synthetic samples from the hard labels only.

As illustrated in Fig. 3, in each iteration we first train a batch of real samples (green dashed line box) with HABP and obtain the model-estimated error probabilities for all labels. For each label, we update the error probability of the j-th attribute with an exponential moving average:

ê_j^y ← (1 - α) ê_j^y + α ē_j^y,   (8)

where y is the label of the j-th attribute, ê_j^y is the error being updated for label y (its previous value is the error recorded the last time that label showed up), ē_j^y is the average error probability, within a training batch, of samples whose j-th attribute is labeled y, and α is a momentum parameter. We normalize the recorded errors along each category/attribute by dividing by their sum. They are then used as the probability mass function of a categorical distribution to sample hard labels. The sampled labels are fed as inputs to MR-GAN to generate the synthetic complementary samples (red dashed line box). To make the deactivation-based part more focused on hard labels, we again use an error-weighted average of the deactivation losses over all nodes:

L_deact^w = (Σ_ij ê_ij L_deact,ij) / (Σ_ij ê_ij),   (9)

where ê_ij is the recorded error probability of the label at node (i, j).
The overall objective to minimize is then, with a tuning parameter λ:

L = L_HABP + λ L_deact^w.   (10)

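The bookkeeping in this pipeline (EMA update of per-label errors, then categorical sampling of hard labels) might be sketched as follows; `alpha` is an assumed momentum value, as the text does not specify one, and the function names are ours:

```python
import numpy as np

def update_error(err_prev, err_batch, alpha=0.1):
    # Exponential moving average of a per-label error probability.
    # alpha is an assumed momentum; the paper does not give a value.
    return (1.0 - alpha) * err_prev + alpha * err_batch

def sample_hard_labels(errors, n, rng=None):
    # Normalize the recorded per-label errors into a probability mass
    # function and sample n hard labels for MR-GAN's conditional input.
    rng = np.random.default_rng(0) if rng is None else rng
    pmf = errors / errors.sum()
    return rng.choice(len(errors), size=n, p=pmf)
```

Labels with higher recorded errors are drawn more often, so the generator is asked for complementary samples mostly on the hard attributes.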
4 Experiments

We first evaluate the proposed method on DeepFashion-C. We then explore each module with further experiments to verify its efficacy.

4.1 Experiments on DeepFashion-C

Category Texture Fabric Shape Part Style All
Top-k Top-3 Top-5 Top-3 Top-5 Top-3 Top-5 Top-3 Top-5 Top-3 Top-5 Top-3 Top-5 Top-3 Top-5

FashionNet [6] 82.58 90.17 37.46 49.52 39.30 49.84 39.47 48.59 44.13 54.02 66.43 73.16 45.52 54.61
Corbière et al. [43] 86.30 92.80 53.60 63.20 39.10 48.80 50.10 59.50 38.80 48.90 30.50 38.30 23.10 30.40
Wang et al. [44] 90.99 95.78 50.31 65.48 40.31 48.23 53.32 61.05 40.65 56.32 68.70 74.25 51.53 60.95
OHEM [35] 89.66 95.28 58.19 67.60 45.20 55.61 57.83 67.01 45.09 55.21 33.33 41.79 48.40 58.02
FL [34] 90.38 95.51 59.63 69.15 47.95 58.61 61.26 70.16 50.23 60.16 36.22 44.76 51.46 61.10
Weighted FL [48] 90.32 95.39 58.52 68.26 47.65 58.07 60.77 69.62 50.37 60.74 36.79 45.60 51.31 61.01

Weighted CE-A 90.20 95.25 58.74 68.96 47.31 57.65 60.35 69.54 48.59 59.37 37.20 45.68 50.79 60.68
Weighted CE-B 88.21 93.72 58.03 67.98 45.30 56.27 58.82 68.68 47.23 58.53 33.21 42.53 48.95 59.30
Baseline 89.93 95.20 57.08 66.72 43.96 54.19 56.79 65.98 44.36 54.09 33.10 41.40 47.46 57.00
Deact only 90.93 95.73 58.52 68.26 46.38 56.82 59.09 68.08 47.66 57.84 35.66 44.28 49.84 59.54
HABP only 89.96 94.89 60.34 70.06 48.73 59.65 61.44 70.70 50.73 61.09 37.69 46.23 52.17 62.07
FL+Deact 89.92 95.00 60.38 70.23 49.01 59.53 61.52 70.32 51.07 61.14 37.85 46.57 52.36 62.06
HABP+Deact 90.06 95.04 60.87 70.54 49.40 59.88 61.97 70.80 51.39 61.82 38.61 46.99 52.82 62.49
Table 1: Results of category and attribute classification (%)
Texture Fabric Shape Part Style All
CRL [29] 55.37 55.02 55.22 53.90 53.75 54.56
W-CE-B 76.88 77.10 81.86 76.81 72.23 76.73
HABP 77.10 77.58 82.26 77.31 73.26 77.29
Deact 78.59 78.31 83.83 78.57 74.20 78.45
Ours 78.69 78.82 83.80 79.23 74.26 78.74
Table 2: Class-balanced accuracy on DeepFashion-C (%)

Dataset. The 1000 attributes of DeepFashion-C [6] are divided into 5 groups by the authors, characterizing texture, fabric, shape, part, and style. We follow the official split of DeepFashion-C: 209,222 training samples, 40,000 validation samples, and 40,000 test samples. The validation set is used only to check that there is no overfitting.

Evaluation Metrics. Two evaluation metrics and the corresponding settings are used: 1) Top-k recall/accuracy. For binary attribute prediction, we calculate top-k recall following [57]: the classification scores are ranked, and we determine how many attributes are matched in the top-k list for each group. For category classification, top-k classification accuracy is calculated. 2) To further demonstrate the effectiveness and flexibility of our approach, we also conduct experiments evaluating the class-balanced accuracy for attributes, which is calculated by averaging the accuracy over both positive and negative labels attribute-wise [29].
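As an illustration of the top-k recall described above (our own sketch of the metric, not the official evaluation code):

```python
import numpy as np

def topk_recall(scores, labels, k=3):
    # scores: (N, A) classification scores; labels: (N, A) binary ground
    # truth. Rank each image's scores and count how many ground-truth
    # positive attributes appear in its top-k list.
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = sum(len(set(row.tolist()) & set(np.flatnonzero(y).tolist()))
               for row, y in zip(topk, labels))
    return hits / labels.sum()
```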

Comparisons. For attribute and category classification, we compare our method with recently published results [6, 43, 44], and with our reproduced results of popular hard-aware methods including OHEM [35], focal loss (FL) [34], and weighted FL for multi-label datasets [48]. Weighted FL weights the loss of each attribute according to the prior attribute distribution. For OHEM we tried different ratios (0.5, 0.33, 0.17, 0.1) of hard nodes and selected the best result, obtained with a ratio of 0.17. For FL-based methods, we use the same γ = 1.2 as for HABP. As discussed in Section 3, without an output-dependent normalization term, FL may result in either instability at the beginning stage or too-slow learning in the late stage. To avoid a low-performing FL caused by either case and make the comparison fairer, we tried different base learning rates for FL and report the best result for top-k recall/accuracy. Similarly, we ran experiments for weighted FL and report its best results. We also tried a commonly used strategy that weights the binary cross entropy loss by the positive/negative ratio:

L_a = v_a L_CE,a,

where v_a is a weight that depends on the positive/negative ratio of a given attribute a. Denoting n_a^+ and n_a^- as the numbers of positive and negative labels for the a-th attribute, we tried two ways to set the weight: A) a per-attribute adaptive weight; B) one weight shared across all attributes. (We did not use the negative-to-positive ratio n_a^-/n_a^+ to balance the CE loss: we tried different settings with it and observed either numerical instability or very poor performance due to the very large variation of the ratio.)

Implementation Details. 1) For top-k recall/accuracy, the base model is an ImageNet pre-trained VGG-16 [58]. We replace the fully-connected layers by two convolutions without padding; the first convolution outputs 2048 channels and the second outputs 4096 channels, each followed by a ReLU activation. The 4096 channels are then reduced to a vector by average pooling, followed by dropout with probability 0.5 to avoid over-fitting. The final outputs for the category and the binary attributes are fully-connected layers with 50 and 1000 outputs respectively. We view attribute classification as 1000 binary classification tasks and category classification as a multi-class task, such that the loss weight used for category classification is the same as for every single task in attribute classification. We train the network for 15 epochs with mini-batches of 16 images in all experiments. Each image is cropped with the ground truth bounding box and resized to a fixed resolution. For the first 6 epochs the learning rate is 0.01; it is then decreased by a factor of 10 every 3 epochs. γ for HABP is set to 1.2 in all experiments. The loss from synthetic complementary samples is added with a weight of 1e-4, and the σ for semantic feature perturbation is set to 1.5. For training efficiency, the deactivation loss is computed every 20 iterations. 2) For class-balanced accuracy, we follow [29] in using ResNet50 [59] as the network. We found that the best result is achieved by using weighted CE-B, described in the last paragraph, as the base loss term in Eq. (4). In these experiments, the weight for the deactivation loss is 0.001, and δ = 0.1. Other settings remain the same as in the experiments for top-k recall/accuracy.

Results. As mentioned before, the error probability strongly depends on the number of positive labels. We first verified this, and the effectiveness of our approach in reducing prediction errors on minority data. For convenience of visualization, we compute the average predicted positive probability over all positive labels instead of the error probability. Comparisons between a well-trained model with CE loss and our method, on both the train and test sets, are illustrated in Fig. 2. From the figure, we can see that our method significantly reduces errors on positive labels, particularly for minority labels.

The evaluation results using top-k recall/accuracy on the test set are summarized in Table 1. Our overall performance on attribute classification outperforms all others, including the current state-of-the-art [44]. The category classification result is also better than most of the others. We also observe that with HABP alone, the result surpasses FL, weighted FL, and OHEM, which are methods with similar spirits. This demonstrates the better stability of HABP. To understand this, consider that if there are only a few hard nodes in a batch, both FL and OHEM produce a small loss that does not contribute much to the gradients, while HABP constantly backpropagates a stable total loss from the hard nodes no matter how many hard nodes a batch contains. Together with HABP, deactivation-based training with synthetic complementary samples further improves the final result, as demonstrated in the lower part of Table 1. Note that our method uses only attribute annotations, while both [6] and [44] use landmark annotations to enhance attribute classification.

An ablation study is also presented in the lower part of Table 1. By independently activating HABP and synthetic complementary samples, we find that the two techniques both improve over the baseline. We also tried replacing HABP with FL in the pipeline. This achieves a better result than both the baseline and deactivation-based training with synthetic complementary samples alone, yet it is still lower than our proposed pipeline, which further demonstrates the advantage of HABP over FL.

The experimental results with class-balanced accuracy in Table 2 further show the flexibility and superiority of the proposed method. We observe that both HABP and the deactivation loss improve the performance by some margin. Unlike the baseline with CE loss, under the weighted CE-B setting, training with synthetic complementary samples contributes more than HABP. We think a direction worth studying in future work is how to optimally combine HABP and Deact for a given task.

4.2 HABP vs. FL

Figure 6: Top-3 recall of HABP & FL under different learning rates. HABP demonstrates better stability and performance than FL.
Figure 7: Training loss on attributes in the first epoch. HABP keeps the loss prominent as training progresses.

In Table 1 we already verified the better performance of our method over other popular choices. In this section, we focus on comparing HABP and FL with more experiments. As mentioned in Section 4.1, we tune the base learning rate for FL to avoid either too-slow learning or numerical instability. With the same experimental settings as for top-k recall/accuracy in Table 1, we show the top-3 recall on attributes for both HABP and FL under different base learning rates in Fig. 6. Compared to FL, HABP demonstrates not only better performance (as shown by the blue dashed line), but also less sensitivity to the base learning rate.

We also plot the training loss on attributes in the first epoch, with the experimental setting used for Table 1. From Fig. 7 we can see that the loss calculated by HABP stays prominent as training goes on, while the loss computed by FL at the beginning is almost two orders of magnitude larger than at the end of the first epoch. This sensitive behavior of FL limits its performance, because a learning rate that is too large may cause convergence issues, while a smaller one may fail to learn the parameters in the late stage.

Figure 8: Correlations between weights w/ and w/o decorrelation regularization
Figure 9: Images generated using MR-GAN. (a) Upper: samples generated using DeepFashion-C. Lower: samples generated using CelebA-HQ. (b) Random samples w/ (right) and w/o (left) decorrelation regularization loss.

4.3 More Experiments of Deact

We further independently verify the validity of the proposed deactivation-based training on MNIST classification. The network is a LeNet-5 [60] with ReLU activations. We train on randomly sampled subsets of the training data with sizes ranging from 25 to 1000. The total number of training samples is set to 500k, with a batch size of 64. The SGD optimizer is used with a learning rate of 0.01 and momentum of 0.9. The weight of the deactivation loss is 0.05 for all experiments. To generate complementary samples we use MR-GAN with the following sequential operations as the building block: (transposed) convolution → batch normalization → leaky ReLU. The number of channels depends inversely on the size of the feature maps; for generating the highest-resolution images we set it to 64. MR-GAN is trained on the same subset. In total, 1M samples are trained with a batch size of 128. The σ of the Gaussian noise used to perturb the feature maps, and the improvements achieved by our proposed method, are summarized in Table 3.

# of samples LeNet Deactivation AC
25 44.12% 40.06% (σ=3.6) 43.71%
50 29.73% 26.22% (σ=2.2) 27.65%
100 14.53% 12.14% (σ=2.4) 14.33%
500 4.88% 4.58% (σ=2.4) 5.01%
1000 3.93% 3.72% (σ=2.2) 4.01%
Table 3: Classification errors on MNIST [60]
DeepFashion-C [6] CelebA-HQ [50]
w/o DC 28.51 26.93
w/ DC 27.28 22.16
Table 4: FID with and without decorrelation regularization on DeepFashion-C and CelebA-HQ. The lower the better.

4.4 Effectiveness of Decorrelation Regularization and MR-GAN

We validate the effectiveness of MR-GAN on DeepFashion-C and CelebA-HQ [50]. In training, images at all resolutions are generated in one forward pass, whilst multiple forward passes through the discriminator are needed to produce the corresponding outputs. Due to the strong stability of MR-GAN, we simply use the vanilla GAN loss [37] for training. The proposed decorrelation regularization is added to the loss of G with a weight of 2e-6 in all experiments. For the discriminator, we take the mean over the losses at the multiple resolutions as the final loss. Implementation Details. We crop each image with the ground truth bounding box provided by DeepFashion-C and resize it. The Adam [61] optimizer is used for both G and D with a learning rate of 1e-4. We use 32 channels for generating the highest-resolution images, and the maximum number of channels is set to 512. The network is trained for 30 epochs with a batch size of 128, on the train set only. To further validate MR-GAN's ability to synthesize higher-resolution images, we also experiment with the CelebA-HQ dataset, which contains 30k face images at 1024×1024. We build the network for images from low to high resolution. The number of channels for the largest image is set to 12, and the number of channels is limited to at most 384. For CelebA-HQ we train without conditional inputs for 50 epochs, with mini-batches of 64 images.

We compute the Fréchet Inception Distance (FID) [62] from 30k images for each of the last 10 epochs on both datasets, and report the smallest FID. For DeepFashion-C, the labels are sampled from their prior distribution in the train set. The results are summarized in Table 4, showing that on both datasets the generated samples achieve a lower (better) FID with decorrelation regularization than without it.
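FID is the Fréchet distance between Gaussians fitted to feature statistics of real and generated images. A numpy-only sketch of the standard formula follows; in practice the feature vectors come from a pretrained Inception network, which is abstracted away here:

```python
import numpy as np

def _sqrtm_psd(m):
    # matrix square root of a symmetric PSD matrix via eigendecomposition
    w, v = np.linalg.eigh(m)
    w = np.clip(w, 0.0, None)  # clip tiny negative eigenvalues from noise
    return (v * np.sqrt(w)) @ v.T

def fid(feats_a, feats_b):
    """||mu_a - mu_b||^2 + Tr(Sa + Sb - 2 (Sa Sb)^{1/2}), computed from
    two (n_samples, dim) arrays of feature vectors."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    sa = np.cov(feats_a, rowvar=False)
    sb = np.cov(feats_b, rowvar=False)
    # Tr((Sa Sb)^{1/2}) via the symmetric PSD form (Sb^{1/2} Sa Sb^{1/2})^{1/2}
    sb_half = _sqrtm_psd(sb)
    tr_cov = np.trace(_sqrtm_psd(sb_half @ sa @ sb_half))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(sa) + np.trace(sb) - 2.0 * tr_cov)
```

Identical feature distributions yield an FID of zero; shifting or reshaping either distribution increases it, which is why lower values in Table 4 indicate better samples.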

The cosine similarities between the transposed-convolution kernels that project the latent noise are shown in Fig. 8. As shown, the decorrelation regularization loss clearly reduces the correlations between the weights. Sample images generated with MR-GAN are illustrated in Fig. 9(a). In Fig. 9(b), the left panel shows random samples without the decorrelation loss, in which very similar faces (marked with red boxes) can be seen; this is not observed in the right panel, generated with the decorrelation regularization loss.

5 Conclusion

We have proposed a pipeline to make use of “hard” data with two techniques, one from the view of cost-sensitive learning and the other from the view of re-sampling. It consists of HABP, which effectively and adaptively learns from hard data, and deactivation-based training with synthetic complementary samples, which is more stable to train and easier to implement. HABP focuses on positive minority data, whilst deactivation-based training helps to learn a better decision boundary by deactivating complementary samples for minority data. The two components can be combined or used separately depending on the specific metric. Along with the pipeline, we also presented a decorrelation regularization loss for training a multi-resolution GAN. Evaluations are performed on a large-scale fashion dataset and related datasets. Overall, our method achieves the state of the art for attribute classification. At the same time, based on our experimental observations, we believe that how to optimally combine the proposed components is a topic worth future study.


  • [1] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1785, June 2009.
  • [2] K. Duan, D. Parikh, D. Crandall, and K. Grauman. Discovering localized attributes for fine-grained recognition. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3474–3481, June 2012.
  • [3] B. Siddiquie, R. S. Feris, and L. S. Davis. Image ranking and retrieval based on multi-attribute queries. In CVPR 2011, pages 801–808, June 2011.
  • [4] B. Chen, Y. Chen, Y. Kuo, and W. H. Hsu. Scalable face image retrieval using attribute-enhanced sparse codewords. IEEE Transactions on Multimedia, 15(5):1163–1173, Aug 2013.
  • [5] Ryan Layne, Timothy M. Hospedales, and Shaogang Gong. Towards person identification and re-identification with attributes. In Andrea Fusiello, Vittorio Murino, and Rita Cucchiara, editors, Computer Vision – ECCV 2012. Workshops and Demonstrations, pages 402–412, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
  • [6] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1096–1104, June 2016.
  • [7] E. Simo-Serra, S. Fidler, F. Moreno-Noguer, and R. Urtasun. Neuroaesthetics in fashion: Modeling the perception of fashionability. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 869–877, June 2015.
  • [8] M. H. Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg. Where to buy it: Matching street clothing photos in online shops. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 3343–3351, Dec 2015.
  • [9] J. Huang, R. Feris, Q. Chen, and S. Yan. Cross-domain image retrieval with a dual attribute-aware ranking network. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1062–1070, Dec 2015.
  • [10] Q. Chen, J. Huang, R. Feris, L. M. Brown, J. Dong, and S. Yan. Deep domain adaptation for describing people based on fine-grained clothing attributes. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5315–5324, June 2015.
  • [11] S. Liu, Z. Song, G. Liu, C. Xu, H. Lu, and S. Yan. Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3330–3337, June 2012.
  • [12] Yihui Ma, Jia Jia, Suping Zhou, Jingtian Fu, Yejun Liu, and Zijian Tong. Towards better understanding the clothing fashion styles: A multimodal deep learning approach. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, pages 38–44. AAAI Press, 2017.
  • [13] Kevin Matzen, Kavita Bala, and Noah Snavely. Streetstyle: Exploring world-wide clothing styles from millions of photos. CoRR, abs/1706.01869, 2017.
  • [14] M. Takagi, E. Simo-Serra, S. Iizuka, and H. Ishikawa. What makes a style: Experimental analysis of fashion prediction. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pages 2247–2253, Oct 2017.
  • [15] W. Hsiao and K. Grauman. Learning the latent “look”: Unsupervised discovery of a style-coherent embedding from fashion images. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 4213–4222, Oct 2017.
  • [16] Xintong Han, Zuxuan Wu, Yu-Gang Jiang, and Larry S. Davis. Learning fashion compatibility with bidirectional lstms. In Proceedings of the 25th ACM International Conference on Multimedia, MM ’17, pages 1078–1086, New York, NY, USA, 2017. ACM.
  • [17] Si Liu, Jiashi Feng, Zheng Song, Tianzhu Zhang, Hanqing Lu, Changsheng Xu, and Shuicheng Yan. Hi, magic closet, tell me what to wear! In Proceedings of the 20th ACM International Conference on Multimedia, MM ’12, pages 619–628, New York, NY, USA, 2012. ACM.
  • [18] Xuemeng Song, Fuli Feng, Jinhuan Liu, Zekun Li, Liqiang Nie, and Jun Ma. Neurostylist: Neural compatibility modeling for clothing matching. In Proceedings of the 25th ACM International Conference on Multimedia, MM ’17, pages 753–761, New York, NY, USA, 2017. ACM.
  • [19] Wei-Lin Hsiao and Kristen Grauman. Creating capsule wardrobes from fashion images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [20] KuanTing Chen, Kezhen Chen, Peizhong Cong, Winston H. Hsu, and Jiebo Luo. Who are the devils wearing prada in new york city? In Proceedings of the 23rd ACM International Conference on Multimedia, MM ’15, pages 177–180, New York, NY, USA, 2015. ACM.
  • [21] Ziad Al-Halah, Rainer Stiefelhagen, and Kristen Grauman. Fashion forward: Forecasting visual style in fashion. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [22] Bartosz Krawczyk. Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 5(4):221–232, Nov 2016.
  • [23] Chris Drummond, Robert C. Holte, et al. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, volume 11. Citeseer, 2003.
  • [24] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. Smote: Synthetic minority over-sampling technique. J. Artif. Int. Res., 16(1):321–357, June 2002.
  • [25] Kate McCarthy, Bibi Zabar, and Gary Weiss. Does cost-sensitive learning beat sampling for classifying rare classes? In Proceedings of the 1st international workshop on Utility-based data mining, UBDM ’05, pages 69–77, New York, NY, USA, 2005. ACM.
  • [26] Francisco Charte, Antonio J. Rivera, María J. del Jesus, and Francisco Herrera. Mlsmote. Know.-Based Syst., 89(C):385–397, November 2015.
  • [27] Weizhong Lin and Dong Xu. Imbalanced multi-label learning for identifying antimicrobial peptides and their functional types. Bioinformatics, 32(24):3745–3752, 08 2016.
  • [28] Emily M. Hand, Carlos D. Castillo, and Rama Chellappa. Doing the best we can with what we have: Multi-label balancing with selective learning for attribute prediction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 6878–6885, 2018.
  • [29] Q. Dong, S. Gong, and X. Zhu. Imbalanced deep learning by minority class incremental rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(6):1367–1381, June 2019.
  • [30] Zhi-Hua Zhou and Xu-Ying Liu. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1):63–77, Jan 2006.
  • [31] S. H. Khan, M. Hayat, M. Bennamoun, F. A. Sohel, and R. Togneri. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3573–3587, Aug 2018.
  • [32] Matjaz Kukar and Igor Kononenko. Cost-sensitive learning with neural networks. In Proceedings of the 13th European Conference on Artificial Intelligence (ECAI-98), pages 445–449. John Wiley & Sons, 1998.
  • [33] Charles X. Ling and Victor S. Sheng. Cost-Sensitive Learning, pages 231–235. Springer US, Boston, MA, 2010.
  • [34] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [35] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [36] Yuhui Yuan, Kuiyuan Yang, and Chao Zhang. Hard-aware deeply cascaded embedding. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [37] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
  • [38] Augustus Odena. Semi-supervised learning with generative adversarial networks. CoRR, abs/1606.01583, 2016.
  • [39] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
  • [40] Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. CoRR, abs/1701.04862, 2017.
  • [41] Huizhong Chen, Andrew Gallagher, and Bernd Girod. Describing clothing by semantic attributes. In Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid, editors, Computer Vision – ECCV 2012, pages 609–623, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
  • [42] Lukas Bossard, Matthias Dantone, Christian Leistner, Christian Wengert, Till Quack, and Luc Van Gool. Apparel classification with style. In Kyoung Mu Lee, Yasuyuki Matsushita, James M. Rehg, and Zhanyi Hu, editors, Computer Vision – ACCV 2012, pages 321–335, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
  • [43] Charles Corbiere, Hedi Ben-Younes, Alexandre Rame, and Charles Ollion. Leveraging weakly annotated data for fashion image retrieval and label prediction. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017.
  • [44] Wenguan Wang, Yuanlu Xu, Jianbing Shen, and Song-Chun Zhu. Attentive fashion grammar network for fashion landmark detection and clothing category classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [45] J.A.K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, Jun 1999.
  • [46] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [47] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. CoRR, abs/1703.07737, 2017.
  • [48] Nikolaos Sarafianos, Xiang Xu, and Ioannis A. Kakadiaris. Deep imbalanced attribute classification using visual attention aggregation. In The European Conference on Computer Vision (ECCV), September 2018.
  • [49] Emily L Denton, Soumith Chintala, arthur szlam, and Rob Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1486–1494. Curran Associates, Inc., 2015.
  • [50] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
  • [51] Animesh Karnewar, Oliver Wang, and Raghu Sesha Iyengar. MSG-GAN: multi-scale gradient GAN for stable image synthesis. CoRR, abs/1903.06048, 2019.
  • [52] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2234–2242. Curran Associates, Inc., 2016.
  • [53] Zihang Dai, Zhilin Yang, Fan Yang, William W. Cohen, and Ruslan R. Salakhutdinov. Good semi-supervised learning that requires a bad gan. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6510–6520. Curran Associates, Inc., 2017.
  • [54] Guo-Jun Qi, Liheng Zhang, Hao Hu, Marzieh Edraki, Jingdong Wang, and Xian-Sheng Hua. Global versus localized generative adversarial nets. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [55] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
  • [56] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
  • [57] Yunchao Gong, Yangqing Jia, Thomas Leung, Alexander Toshev, and Sergey Ioffe. Deep convolutional ranking for multilabel image annotation. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
  • [58] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • [59] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
  • [60] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
  • [61] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • [62] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6626–6637. Curran Associates, Inc., 2017.