Aggregation and Finetuning for Clothes Landmark Detection

05/01/2020 ∙ by Tzu-heng Lin, et al. ∙ 5

Landmark detection for clothes is a fundamental problem for many applications. In this paper, we propose a new training scheme for clothes landmark detection: Aggregation and Finetuning. We investigate the homogeneity among landmarks of different clothes categories and use it to design the training procedure. Extensive experiments show that our method outperforms current state-of-the-art methods by a large margin. Our method also won 1st place in the DeepFashion2 Challenge 2020 - Clothes Landmark Estimation Track with an AP of 0.590 on the test set and 0.615 on the validation set. Code will be publicly available at .




1 Introduction

The last decade saw great improvement in computer vision, driven by the unprecedented performance of deep learning algorithms. Human keypoint detection [9] is one of the many problems that have been well studied in the literature [13, 7, 11, 2, 5, 8]. However, when it comes to landmark detection for clothes, fewer fundamental studies have been conducted. Normally, the best-performing approach is to directly apply state-of-the-art models from human pose estimation.

In the field of clothes landmark detection, there are three main publicly available datasets so far. DeepFashion [10] contains 4-8 landmarks across 50 categories per image, and FashionAI [14] contains 24 landmarks across 5 categories per image. The recently released DeepFashion2 [6] defines 294 landmarks from 13 categories and is currently the most informative and challenging dataset.

Unlike human pose estimation datasets, clothes landmark detection datasets usually contain more than one category of instances. Thus, performance depends not only on the accuracy of landmark detection but also heavily on the quality of object detection. In addition, the number of landmarks defined is significantly larger than the number of human keypoints, which makes the problem even harder.

To address the above problems, we propose the Aggregation and Finetuning scheme for clothes landmark detection. We investigate the homogeneity of different landmarks and aggregate landmarks with similar definitions. This reduces the number of landmarks to be learned, yields more data for each landmark, and makes the network converge faster. We further propose to finetune the keypoints detector on the data of each category independently. This largely boosts the landmark detection performance for clothes categories with an insufficient amount of labeled data. In what follows, we introduce our method in Section 2, present our experimental results in Section 3, and conclude the paper in Section 4.

2 Method

Figure 1: DeepFashion2 Dataset [6]

In this paper, we focus on the DeepFashion2 [6] dataset, which contains in total 294 different landmarks from 13 clothes categories as shown in Figure 1. We now introduce our Aggregation and Finetuning scheme.

2.1 Aggregation

Conventionally, one would treat each of the 294 landmarks independently, and a deep learning model is often designed to generate 294 heatmaps, one per landmark [6]. We argue that this approach is intuitive yet unreasonable. Among the 294 landmarks from different clothes categories, there are landmarks with very similar definitions. For example, collars for tops and collars for dresses should have similar definitions. If we aggregate similar landmarks from different categories, the amount of training data per landmark increases considerably. Thus, we manually aggregate similar landmarks, which eventually results in 81 aggregated landmarks. The keypoints detector is then trained to output only 81 heatmaps.
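The aggregation step can be pictured as a lookup from each (category, landmark) pair to a shared heatmap channel. The following is a minimal sketch under stated assumptions: the landmark names and the rule "same name means same channel" are hypothetical stand-ins, since the actual 294-to-81 grouping was made manually and is not listed here. Only the three training-set sizes are taken from Table 3.

```python
from collections import defaultdict

# Hypothetical toy landmark definitions for three categories.
# The real DeepFashion2 grouping (294 -> 81) was chosen by hand and is not reproduced here.
CATEGORY_LANDMARKS = {
    "short sleeve top": ["collar_left", "collar_right", "sleeve_left",
                         "sleeve_right", "hem_left", "hem_right"],
    "short sleeve dress": ["collar_left", "collar_right", "sleeve_left",
                           "sleeve_right", "skirt_hem_left", "skirt_hem_right"],
    "shorts": ["waist_left", "waist_right", "leg_hem_left", "leg_hem_right"],
}

# Training instances per category, as reported in Table 3.
TRAIN_INSTANCES = {
    "short sleeve top": 71645,
    "short sleeve dress": 17211,
    "shorts": 36616,
}


def build_aggregation(category_landmarks):
    """Give landmarks that share a definition (here: a name) one shared channel.

    Returns (mapping, num_channels), where mapping[(category, name)] is the
    index of the aggregated heatmap channel for that landmark.
    """
    shared_index = {}
    mapping = {}
    for category, names in category_landmarks.items():
        for name in names:
            if name not in shared_index:
                shared_index[name] = len(shared_index)
            mapping[(category, name)] = shared_index[name]
    return mapping, len(shared_index)


def instances_per_channel(mapping, instances_per_category):
    """Count how many training instances contribute to each aggregated channel."""
    counts = defaultdict(int)
    for (category, _name), channel in mapping.items():
        counts[channel] += instances_per_category.get(category, 0)
    return dict(counts)


mapping, num_aggregated = build_aggregation(CATEGORY_LANDMARKS)
num_original = sum(len(names) for names in CATEGORY_LANDMARKS.values())
channel_counts = instances_per_channel(mapping, TRAIN_INSTANCES)

collar = mapping[("short sleeve dress", "collar_left")]
print(num_original, "->", num_aggregated, "channels")
print("instances for the shared collar channel:", channel_counts[collar])
```

In this toy setting the dress collar is trained on the instances of both the dress and the top, which is the effect the aggregation step aims for.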

2.2 Finetuning

After training a universal model on the aggregated landmarks of all clothes categories, we propose to finetune the model for each category independently. There are two main motivations for doing this. Firstly, each category has only about 10-30 landmarks, so training on data with other landmarks distracts from learning these landmarks. Secondly, the dataset is severely imbalanced (cf. Table 3), so a single unified model for all categories can be harmful for the categories with very few labels. To apply the finetuning procedure, we start from the universal model trained in the aggregation step. Then only data from the specific clothes category is used to finetune the model for that category. After finetuning, we have 13 different models, each specialized for one category.
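The control flow of the finetuning step can be sketched as follows. This is only an illustration under strong assumptions: the "model" is a single scalar fitted by gradient descent on a squared error, standing in for the real keypoints detector, and the names `train` and `aggregate_then_finetune` are hypothetical. What the sketch is meant to show is the order of operations: one universal model trained on all data, then one independent copy per category trained on that category's data only.

```python
import copy


def train(model, samples, lr=0.1, steps=50):
    """Toy stand-in for a training loop: gradient descent on mean squared error.

    `model` is {"w": float} and `samples` is a list of floats. A real keypoints
    detector would be trained on images and heatmaps instead.
    """
    for _ in range(steps):
        grad = sum(2.0 * (model["w"] - y) for y in samples) / len(samples)
        model["w"] -= lr * grad
    return model


def aggregate_then_finetune(data_by_category):
    """Train one universal model, then finetune a separate copy per category."""
    # Stage 1: universal model trained on the data of all categories.
    all_samples = [y for samples in data_by_category.values() for y in samples]
    universal = train({"w": 0.0}, all_samples)

    # Stage 2: each category starts from the universal weights and sees
    # only its own data. deepcopy keeps the universal model unchanged.
    specialized = {}
    for category, samples in data_by_category.items():
        specialized[category] = train(copy.deepcopy(universal), samples, steps=20)
    return universal, specialized


# Imbalanced toy data: a frequent category and a rare one with a different target.
toy_data = {"frequent": [1.0] * 8, "rare": [5.0] * 2}
universal, specialized = aggregate_then_finetune(toy_data)
print("universal:", round(universal["w"], 3))
print("rare after finetuning:", round(specialized["rare"]["w"], 3))
```

In the toy data the universal model settles near the overall mean, which is dominated by the frequent category, while the finetuned copy for the rare category moves toward that category's own optimum.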

3 Experiment

det model     Cascade [1]  Cascade [1]  HTC [3]  HTC [3]  HTC [3]  HTC [3]  Ground-truth
det AP        0.707        0.707        0.764    0.764    0.764    0.764    1.000
hflip train
hflip test
landmark AP   0.556        0.559        0.579    0.584    0.612    0.614    0.652
Table 2: Ablation study.
category              #train   #val    det AP  w/o ft  w/ ft
all                   312,186  52,490  0.764   0.584   0.612
short sleeve top      71,645   12,556  0.867   0.734   0.736
long sleeve top       36,064   5,966   0.814   0.660   0.670
short sleeve outwear  543      142     0.540   0.382   0.386
long sleeve outwear   13,457   2,011   0.823   0.605   0.619
vest                  16,095   2,113   0.761   0.590   0.595
sling                 1,985    322     0.656   0.470   0.576
shorts                36,616   4,167   0.784   0.625   0.655
trousers              55,387   9,586   0.810   0.560   0.572
skirt                 30,835   6,522   0.818   0.601   0.627
short sleeve dress    17,211   3,127   0.807   0.689   0.693
long sleeve dress     7,907    1,477   0.659   0.496   0.509
vest dress            17,949   3,352   0.812   0.592   0.634
sling dress           6,492    1,149   0.773   0.586   0.686
Table 3: Per-category performance with and without the finetuning strategy.

In this section, we conduct various experiments to answer the following research questions:

  • RQ1: How does our method perform compared with the current state-of-the-art models?

  • RQ2: Is object detection a bottleneck for the performance?

  • RQ3: How effective is the proposed Aggregation and Finetuning training scheme?

3.1 Implementation details

We use a two-stage method to tackle the problem of clothes landmark detection. First, an object detection model detects the clothes in each image. Then, a keypoints detector detects the landmarks in each detected object. We apply our proposed Aggregation and Finetuning scheme to the keypoints detector. We use the Hybrid Task Cascade [3, 4] with ResNeXt-101-64x4d as our object detection model, and HRNet-w48 [13] as our keypoints detector.
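The two-stage flow described above can be sketched with placeholder components. Everything in the sketch is an assumption made for illustration: `toy_detector` and `toy_keypoint_model` are hypothetical stubs rather than the Hybrid Task Cascade or HRNet models, images are plain nested lists, and boxes are `(x0, y0, x1, y1)` with exclusive upper bounds. The point is only how landmarks predicted inside a crop are mapped back to image coordinates.

```python
def detect_landmarks(image, detector, keypoint_model):
    """Run a detector, then a keypoints model on each detected box.

    detector(image) -> list of (category, (x0, y0, x1, y1))
    keypoint_model(crop, category) -> list of (x, y) in crop coordinates
    """
    results = []
    for category, (x0, y0, x1, y1) in detector(image):
        # Stage 2 only sees the region proposed by stage 1.
        crop = [row[x0:x1] for row in image[y0:y1]]
        local_points = keypoint_model(crop, category)
        # Shift crop coordinates back into full-image coordinates.
        points = [(x + x0, y + y0) for x, y in local_points]
        results.append({"category": category,
                        "box": (x0, y0, x1, y1),
                        "landmarks": points})
    return results


def toy_detector(image):
    """Hypothetical stub: always reports one 'shorts' instance."""
    return [("shorts", (2, 1, 6, 4))]


def toy_keypoint_model(crop, category):
    """Hypothetical stub: returns the top-left and bottom-right crop pixels."""
    height, width = len(crop), len(crop[0])
    return [(0, 0), (width - 1, height - 1)]


image = [[0] * 8 for _ in range(6)]  # 6 rows x 8 columns of dummy pixels
output = detect_landmarks(image, toy_detector, toy_keypoint_model)
print(output[0]["landmarks"])
```

Because the keypoints model never sees anything outside the detected box, a missed or badly placed box cannot be recovered in the second stage.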

3.2 Results

Quantitative results (RQ1)

We first compare our method with other methods in Table 1. The result shown is an ensemble of two models from Table 2. We can see that our method significantly outperforms the others on both the validation set and the test set.

Effect of object detection performance (RQ2)

Next, we examine whether object detection performance is the bottleneck of the problem. To this end, Table 2 compares the performance of our model when the clothes instances come from an object detection model with its performance when they come from ground-truth annotations. We find that the performance with instances from an object detection model is significantly lower than with ground-truth instances. We also observe a considerable improvement when we switch to a better object detection model (from 0.559 to 0.579). Unlike human pose estimation [9], where object detection has little effect on keypoint performance, object detection clearly plays a more crucial role in clothes landmark detection.

Ablation study (RQ3)

Lastly, we examine how each part of the proposed Aggregation and Finetuning scheme helps. Firstly, we observe an increase of 0.003 (0.556 to 0.559) from the aggregation step in Table 2. Then, as mentioned above, switching to a better object detection model yields an increase of 0.02 (0.559 to 0.579). Lastly, our finetuning strategy yields an increase of 0.028 (0.584 to 0.612). If we take a closer look at the per-category performance in Table 3, we can see that the model suffers from the low performance of categories with only a few training labels. After applying the finetuning strategy, the performance of these categories improves significantly (e.g. sling, sling dress). These experiments validate the effectiveness of our method. However, the category short sleeve outwear, with only 543 training samples, improves only from 0.382 to 0.386. This implies that when the amount of training data is too small, our method also fails to generalize well.
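The per-category effect of finetuning can be read off Table 3 with a few lines of arithmetic. The snippet below copies the training-set sizes and the landmark AP values without and with finetuning from Table 3 and ranks the categories by their gain. The variable names are chosen for this illustration only.

```python
# (training instances, AP without finetuning, AP with finetuning), from Table 3.
TABLE_3 = {
    "short sleeve top": (71645, 0.734, 0.736),
    "long sleeve top": (36064, 0.660, 0.670),
    "short sleeve outwear": (543, 0.382, 0.386),
    "long sleeve outwear": (13457, 0.605, 0.619),
    "vest": (16095, 0.590, 0.595),
    "sling": (1985, 0.470, 0.576),
    "shorts": (36616, 0.625, 0.655),
    "trousers": (55387, 0.560, 0.572),
    "skirt": (30835, 0.601, 0.627),
    "short sleeve dress": (17211, 0.689, 0.693),
    "long sleeve dress": (7907, 0.496, 0.509),
    "vest dress": (17949, 0.592, 0.634),
    "sling dress": (6492, 0.586, 0.686),
}

# Gain from finetuning, rounded to the three decimals reported in the table.
gains = {
    category: round(with_ft - without_ft, 3)
    for category, (_n_train, without_ft, with_ft) in TABLE_3.items()
}

# Categories sorted from largest to smallest gain.
ranked = sorted(gains, key=gains.get, reverse=True)

for category in ranked:
    n_train = TABLE_3[category][0]
    print(f"{category:22s} #train={n_train:6d} gain={gains[category]:+.3f}")
```

The two largest gains belong to sling and sling dress, while short sleeve outwear, the category with the fewest training instances, gains only 0.004, in line with the discussion above.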

4 Conclusion

In this paper, we investigate the problem of clothes landmark detection. We exploit the homogeneity of landmarks across different categories of clothes. By leveraging the proposed Aggregation and Finetuning scheme, our method achieves state-of-the-art performance on the challenging DeepFashion2 [6] dataset. Future work includes incorporating more clothes knowledge into the models and developing effective methods for training with an insufficient amount of labeled data.