. However, when it comes to landmark detection for clothes, fewer fundamental studies have been conducted. In practice, the best-performing approach is usually to apply state-of-the-art models from human pose estimation directly.
In the field of clothes landmark detection, there are three main publicly available datasets so far. DeepFashion contains 4-8 landmarks across 50 categories per image, and FashionAI contains 24 landmarks across 5 categories per image. The recently released DeepFashion2 defines 294 landmarks from 13 categories and is currently the most informative and challenging dataset.
Unlike human pose estimation datasets, clothes landmark detection datasets usually contain instances of more than one category. Thus, performance depends not only on the accuracy of landmark detection but also, to a large extent, on the performance of object detection. In addition, the number of defined landmarks is significantly larger than the number of human keypoints, which makes the problem even harder.
To address the above problems, we propose the Aggregation and Finetuning scheme for clothes landmark detection. We investigate the homogeneity of different landmarks and aggregate landmarks with similar definitions. This reduces the number of landmarks that need to be learned, generates more data for each landmark, and makes the network converge faster. We further propose to finetune the keypoints detector on the data of each category independently. This largely boosts the landmark detection performance for clothes categories with an insufficient amount of labeled data. In what follows, we introduce our method in Section 2, present our experimental results in Section 3, and conclude the paper in Section 4.
In this paper, we focus on the DeepFashion2 dataset, which contains a total of 294 different landmarks from 13 clothes categories, as shown in Figure 1. We now introduce our Aggregation and Finetuning scheme.
Conventionally, one would treat each of the 294 landmarks independently, and a deep learning model is often designed to generate 294 heatmaps, one for each landmark. We argue that this approach is intuitive yet unreasonable. Among the 294 landmarks from different clothes categories, there are landmarks with very similar definitions. For example, the collars of tops and the collars of dresses should have similar definitions. If we aggregate similar landmarks from different categories, the amount of training data for each landmark can be increased considerably. Thus, we manually aggregate similar landmarks and obtain 81 aggregated landmarks. The keypoints detector is then trained to output only 81 heatmaps.
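The aggregation idea can be illustrated with a minimal sketch. The category definitions and landmark names below are hypothetical toy values, not the actual DeepFashion2 definitions, and grouping by a shared name is only a stand-in for the manual grouping described above.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]

# Hypothetical toy definitions: each category lists the semantic names of its
# landmarks in annotation order. These are NOT the real DeepFashion2 lists.
CATEGORY_LANDMARKS: Dict[str, List[str]] = {
    "short sleeve top": ["collar_left", "collar_right", "sleeve_left", "sleeve_right"],
    "short sleeve dress": ["collar_left", "collar_right", "sleeve_left",
                           "sleeve_right", "hem_left", "hem_right"],
}


def build_aggregation(
    category_landmarks: Dict[str, List[str]],
) -> Tuple[Dict[Tuple[str, int], int], int]:
    """Give landmarks that share a semantic name one shared output index."""
    name_to_index: Dict[str, int] = {}
    mapping: Dict[Tuple[str, int], int] = {}
    for category, names in category_landmarks.items():
        for local_idx, name in enumerate(names):
            # Reuse the index if the name was seen before, else allocate a new one.
            shared_idx = name_to_index.setdefault(name, len(name_to_index))
            mapping[(category, local_idx)] = shared_idx
    return mapping, len(name_to_index)


def remap_keypoints(
    category: str,
    keypoints: List[Point],
    mapping: Dict[Tuple[str, int], int],
    num_aggregated: int,
) -> List[Optional[Point]]:
    """Place a category's keypoints into the aggregated layout.

    Slots that the category does not define stay None, so a training loss
    could skip them (an assumption of this sketch).
    """
    out: List[Optional[Point]] = [None] * num_aggregated
    for local_idx, point in enumerate(keypoints):
        out[mapping[(category, local_idx)]] = point
    return out


MAPPING, NUM_AGG = build_aggregation(CATEGORY_LANDMARKS)
```

In this toy setting, the 4 + 6 = 10 category-specific landmarks collapse into 6 aggregated ones, so the four shared landmarks receive training samples from both categories.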
After training a universal model on the aggregated landmarks of all clothes categories, we propose to finetune the model for each category independently. There are two main motivations for doing this. First, each category has only about 10-30 landmarks, so training on data with other landmarks would distract the learning of these landmarks. Second, the dataset is severely imbalanced (cf. Table 3), so training a unified model for all categories could be harmful for the categories with very few labels. To apply this finetuning procedure, we start from the universal model trained in the aggregation step. Then, only data from the specific clothes category is used to finetune the model for that category. After this finetuning procedure, we have 13 different models, each specialized for one category.
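A rough sketch of the per-category finetuning procedure follows. The `finetune_fn` callable and the `(category, sample)` data layout are assumptions made for illustration; the actual training code and hyperparameters are not specified here.

```python
import copy
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple


def split_by_category(samples: Iterable[Tuple[str, Any]]) -> Dict[str, List[Any]]:
    """Group (category, sample) pairs by clothes category."""
    groups: Dict[str, List[Any]] = defaultdict(list)
    for category, sample in samples:
        groups[category].append(sample)
    return dict(groups)


def finetune_per_category(
    universal_model: Any,
    samples: Iterable[Tuple[str, Any]],
    finetune_fn: Callable[[Any, List[Any]], Any],
) -> Dict[str, Any]:
    """Produce one specialized model per category.

    Each model starts from a copy of the universal model, so finetuning one
    category never alters the universal weights or another category's model.
    `finetune_fn` is a hypothetical training routine supplied by the caller.
    """
    specialized: Dict[str, Any] = {}
    for category, subset in split_by_category(samples).items():
        model = copy.deepcopy(universal_model)  # start from the universal weights
        specialized[category] = finetune_fn(model, subset)  # category data only
    return specialized
```

With the 13 DeepFashion2 categories, the returned dictionary would hold 13 models, and the category predicted for an instance could be used to select the matching one at inference time.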
|det model|Cascade|Cascade|HTC|HTC|HTC|HTC|Ground-truth|

|category|#train|#val| |w/o ft|w/ ft|
|---|---|---|---|---|---|
|short sleeve top|71,645|12,556|0.867|0.734|0.736|
|long sleeve top|36,064|5,966|0.814|0.660|0.670|
|short sleeve outwear|543|142|0.540|0.382|0.386|
|long sleeve outwear|13,457|2,011|0.823|0.605|0.619|
|short sleeve dress|17,211|3,127|0.807|0.689|0.693|
|long sleeve dress|7,907|1,477|0.659|0.496|0.509|
In this section, we conduct various experiments to answer the following research questions:
RQ1: How does our method perform compared with the current state-of-the-art models?
RQ2: Is object detection a bottleneck for the performance?
RQ3: How effective is the proposed Aggregation and Finetuning training scheme?
3.1 Implementation details
We use a two-stage method to tackle the problem of clothes landmark detection. First, an object detection model detects the clothes in each image. Then, a keypoints detector detects the landmarks in each detected object. We apply our proposed Aggregation and Finetuning scheme to the keypoints detector. We use the Hybrid Task Cascade [3, 4] with ResNeXt-101-64x4d as our object detection model, and HRNet-w48 as our keypoints detector.
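The two-stage flow can be sketched as follows. The `detect_boxes` and `detect_keypoints` callables are hypothetical stand-ins for the object detector and the keypoints detector, and the convention that keypoints are returned in coordinates normalized to the box is an assumption of this sketch, not a property of HTC or HRNet.

```python
from typing import Any, Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in image pixels
Point = Tuple[float, float]


def detect_landmarks(
    image: Any,
    detect_boxes: Callable[[Any], List[Tuple[str, Box]]],
    detect_keypoints: Callable[[Any, Box, str], List[Point]],
) -> List[Tuple[str, Box, List[Point]]]:
    """Run detection first, then landmark estimation inside each detected box."""
    results = []
    # Stage 1: find clothes instances and their categories.
    for category, box in detect_boxes(image):
        x0, y0, x1, y1 = box
        width, height = x1 - x0, y1 - y0
        # Stage 2: landmarks inside the box, assumed normalized to [0, 1].
        normalized = detect_keypoints(image, box, category)
        # Map the box-relative coordinates back to image pixels.
        points = [(x0 + u * width, y0 + v * height) for u, v in normalized]
        results.append((category, box, points))
    return results
```

Because the second stage only sees what the first stage finds, a missed or badly localized box cannot be recovered by the keypoints detector.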
Qualitative results (RQ1)
Effect of object detection performance (RQ2)
Next, we examine whether object detection performance is a bottleneck. To this end, we compare the performance of our model when the instances come from object detection models and when they come from ground-truth annotations (Table 2). We find that the performance with instances from an object detection model is significantly lower than that with ground-truth instances. We also observe a considerable improvement when we switch to a better object detection model (from 0.559 to 0.579). Unlike in human pose estimation, where object detection has little effect on keypoint performance, object detection models clearly play a more crucial role in clothes landmark detection.
Ablation study (RQ3)
Lastly, we examine how each part of the proposed Aggregation and Finetuning scheme contributes. First, the aggregation step yields an increase of 0.003 (0.556 to 0.559) in Table 2. Then, as mentioned above, switching to a better object detection model yields an increase of 0.02 (0.559 to 0.579). Finally, our finetuning strategy yields an increase of 0.028 (0.584 to 0.612). A closer look at the per-category performance in Table 3 shows that the model suffers from low performance on categories with only a few training labels. After applying the finetuning strategy, the performance of these categories improves significantly (e.g. sling, sling dress). These experiments validate the effectiveness of our method. However, the short sleeve outwear category, which has only 543 training samples, improves only from 0.382 to 0.386. This suggests that when the amount of training data is too small, our method also fails to generalize well.
In this paper, we investigate the problem of clothes landmark detection. We utilize the homogeneity of landmarks between different categories of clothes. By leveraging the proposed Aggregation and Finetuning scheme, our method achieves state-of-the-art performance on the challenging DeepFashion2 dataset. Future work includes incorporating more clothes knowledge into the models and developing effective methods for training with an insufficient amount of labeled data.
-  Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 6154–6162, 2018.
-  Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: realtime multi-person 2d pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008, 2018.
-  Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. Hybrid task cascade for instance segmentation. In CVPR, 2019.
-  Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
-  Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7103–7112, 2018.
-  Yuying Ge, Ruimao Zhang, Lingyun Wu, Xiaogang Wang, Xiaoou Tang, and Ping Luo. A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In CVPR, 2019.
-  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2961–2969, 2017.
-  Wenbo Li, Zhicheng Wang, Binyi Yin, Qixiang Peng, Yuming Du, Tianzi Xiao, Gang Yu, Hongtao Lu, Yichen Wei, and Jian Sun. Rethinking on multi-stage networks for human pose estimation. arXiv preprint arXiv:1901.00148, 2019.
-  Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Proceedings of the European conference on computer vision (ECCV), pages 740–755. Springer, 2014.
-  Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), pages 1096–1104, 2016.
-  Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Posefix: Model-agnostic general human pose refinement network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7773–7781, 2019.
-  Alexey Sidnev, Alexey Trushkov, Maxim Kazakov, Ivan Korolev, and Vladislav Sorokin. Deepmark: One-shot clothing detection. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), 2019.
-  Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.
-  Xingxing Zou, Xiangheng Kong, Waikeung Wong, Congde Wang, Yuguang Liu, and Yang Cao. Fashionai: A hierarchical dataset for fashion understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.