DeepMark++: CenterNet-based Clothing Detection

06/01/2020 ∙ by Alexey Sidnev, et al. ∙ HUAWEI Technologies Co., Ltd.

A single-stage approach for fast clothing detection, built as a modification of the multi-target CenterNet network, is proposed in this paper. We introduce several powerful post-processing techniques that may be applied to increase the quality of keypoint localization. The semantic keypoint grouping approach and the post-processing techniques make it possible to achieve state-of-the-art accuracy of 0.737 mAP for the bounding box detection task and 0.591 mAP for the landmark detection task on the DeepFashion2 validation dataset. We also took second place in the DeepFashion2 Challenge 2020 with 0.582 mAP on the test dataset. The proposed approach can be used on low-power devices with relatively high accuracy without requiring any post-processing techniques.




1 Introduction

In recent studies, keypoints, also referred to as landmarks, have proved to be one of the most distinctive and robust representations in visual analysis. The class of keypoint-based methods in computer vision covers the detection and further processing of keypoints. These methods can be utilized in tasks such as object detection, pose estimation, facial landmark recognition, and more.

The performance of models that operate with keypoints highly depends on the number of unique landmarks defined in the task, which can be considerably large for some modern datasets. DeepFashion2, one of the newest fashion datasets, provides annotations for 13 classes, each characterized by a certain set of keypoints, with 294 unique ones in total.

Figure 1: Speed-accuracy trade-off for object detection and landmark estimation on the DeepFashion2 validation dataset [2]. NMS post-processing is applied to every model.

In this paper, we propose a study of the clothing landmark detection task on the DeepFashion2 dataset, as well as an approach to deal with it efficiently.

2 Related work

In general, there are numerous applications for the keypoint estimation task. For instance, keypoints can be used directly to identify human pose [7] or locate facial landmarks [10], or serve as the main part of an object detection pipeline [4]. Keypoint-based object detection methods have been gaining popularity in recent papers, especially because they are simpler, faster, and more accurate than the corresponding bounding box-based detectors.

Previous approaches, such as [5], required anchor boxes to be manually designed to train detectors. A series of anchor-free object detectors was then developed with the aim of predicting the bounding box's keypoints instead of trying to fit an object to an anchor. Without relying on manually designed anchors to match objects, CornerNet [4] improved performance on the MS COCO dataset significantly. Subsequently, several other variants of keypoint-based one-stage detectors came into existence, one of which is CenterNet [11].

This paper focuses on the clothing landmark prediction and clothing detection tasks using the DeepFashion2 dataset [2]. The baseline approach for landmark estimation was built on Mask R-CNN [3], whose two-stage nature makes it substantially heavy and difficult to use on low-power devices. We aim to propose a lightweight architecture without this drawback. These requirements are well met by CenterNet, which operates directly with keypoints.

3 Proposed approach

Our approach is based on the CenterNet [11] architecture (see Figure 2). It solves two tasks simultaneously: object detection and keypoint location estimation.

The DeepFashion2 dataset contains 13 classes; therefore, 13 channels are used to predict the probabilities of object centers for all classes (Center heatmap in Figure 2). An object center is defined as the center of a bounding box. Two additional channels in the output feature map refine the x and y coordinates of the center, and both the width and the height are predicted directly.
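The decoding of these detection heads can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, array shapes, score threshold, and output stride are assumptions, and the local-maximum selection that CenterNet applies to the heatmap (typically a 3×3 max-pooling) is omitted for brevity.

```python
import numpy as np

def decode_boxes(center_heatmap, offset, wh, stride=4, thresh=0.3):
    """Decode bounding boxes from CenterNet-style output heads.

    center_heatmap: (C, H, W) per-class center probabilities
    offset:         (2, H, W) sub-pixel x/y refinement of each center
    wh:             (2, H, W) directly regressed box width and height
    """
    boxes = []
    C, H, W = center_heatmap.shape
    for c in range(C):
        ys, xs = np.where(center_heatmap[c] > thresh)
        for y, x in zip(ys, xs):
            # Refine the center and map it back to input-image pixels.
            cx = (x + offset[0, y, x]) * stride
            cy = (y + offset[1, y, x]) * stride
            w, h = wh[0, y, x], wh[1, y, x]
            boxes.append((c, cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2,
                          float(center_heatmap[c, y, x])))
    return boxes
```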

Figure 2: Scheme of the proposed approach.

The fashion landmark estimation task involves estimating 2D keypoint locations for each item of clothing in an image. The coarse locations of the keypoints are regressed as relative displacements from the box center (Coarse keypoints in Figure 2). To refine a keypoint location, a heatmap with probabilities is used for each keypoint type. A local maximum with high confidence in the heatmap is used as the refined keypoint position. Similar to the detection case, two additional channels refine the x and y landmark coordinates. During model inference, each coarse keypoint location is replaced with the closest refined keypoint position.
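The replacement of a coarse keypoint with the closest refined heatmap peak can be sketched as below; the function name, threshold, and stride are illustrative assumptions.

```python
import numpy as np

def refine_keypoint(coarse_xy, kp_heatmap, kp_offset, stride=4, thresh=0.3):
    """Replace a coarse keypoint with the nearest confident heatmap peak.

    coarse_xy:  (x, y) regressed from the box center, in input-image pixels
    kp_heatmap: (H, W) probability map for this keypoint type
    kp_offset:  (2, H, W) sub-pixel x/y refinement of heatmap positions
    """
    ys, xs = np.where(kp_heatmap > thresh)
    if len(xs) == 0:              # no confident peak: keep the coarse estimate
        return coarse_xy
    px = (xs + kp_offset[0, ys, xs]) * stride
    py = (ys + kp_offset[1, ys, xs]) * stride
    d2 = (px - coarse_xy[0]) ** 2 + (py - coarse_xy[1]) ** 2
    i = int(np.argmin(d2))        # candidate closest to the coarse position
    return (float(px[i]), float(py[i]))
```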

3.1 Semantic keypoint grouping

One of the first steps involved in solving keypoint detection tasks is defining the model output. In the DeepFashion2 dataset, the number of keypoints per category varies from 8 for a skirt to 39 for long-sleeved outerwear, and the total number of unique keypoints is 294. A simple approach is to concatenate the keypoints from every category and deal with each keypoint separately. Directly predicting 294 keypoints leads to a huge number of output channels (588 from the coarse keypoints and 294 from the keypoint heatmap).

It is evident that certain clothing landmarks are a subset of others. For example, shorts do not require unique keypoints because they can be represented by a subset of trousers keypoints. The semantic grouping rule is defined as follows: keypoints with identical semantic meaning (collar center, top sleeve edge, etc.) from different categories can be merged into one group. This approach yields 62 groups and reduces the number of output channels from 901 to 205.
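The channel counts above can be checked with a small piece of arithmetic. The head layout below (per-class center heatmap, 2-channel center offset, 2-channel box size, 2 coarse-keypoint channels per keypoint, one heatmap channel per keypoint, and a shared 2-channel keypoint offset) is one plausible breakdown that reproduces the stated totals of 901 and 205 channels; the exact layout is an assumption.

```python
def output_channels(num_keypoints, num_classes=13):
    """Total output channels for a CenterNet-style clothing detector."""
    center = num_classes          # per-class center heatmap
    center_offset = 2             # sub-pixel x/y refinement of the center
    wh = 2                        # directly regressed box width and height
    coarse = 2 * num_keypoints    # x/y displacement per keypoint
    kp_heatmap = num_keypoints    # one probability map per keypoint type
    kp_offset = 2                 # shared x/y refinement of heatmap peaks
    return center + center_offset + wh + coarse + kp_heatmap + kp_offset

print(output_channels(294))  # 294 unique keypoints -> 901 channels
print(output_channels(62))   # 62 semantic groups   -> 205 channels
```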

Figure 3: GPU memory consumption and training iteration time on an RTX 2080 Ti. The input resolution is , the batch size is 32 for both DLA-34 and ResNet-50, and 8 for Hourglass. Time in ms was measured for one optimization step: batch loading to the GPU, the forward pass, and the backward pass. GPU memory was measured using the nvidia-smi tool.

The semantic grouping approach reduces training iteration time and memory consumption by up to 26% and 28%, respectively, without an accuracy drop (see Figure 3). The reduced memory footprint enables the use of larger batches during model training.

3.2 Post-processing techniques

We have developed 4 post-processing techniques that increase the model's accuracy without compromising performance.

3.2.1 Center rescoring

The first technique is a recalculation of the detection confidence score using keypoint scores from the keypoint heatmap. Let s_c be the original detection confidence score from the center heatmap and s_k be the average score of the refined keypoints for the predicted category from the keypoint heatmap. The final detection confidence score s is calculated through the following formula:

s = α·s_c + (1 − α)·s_k,

where α ∈ [0, 1].
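The rescoring amounts to a weighted average of the two scores, as in this one-line sketch; the default α is a placeholder, since the paper tunes it by grid search.

```python
def rescore_center(box_score, kp_scores, alpha=0.5):
    """Blend the box confidence with the mean refined-keypoint score."""
    kp_mean = sum(kp_scores) / len(kp_scores)
    return alpha * box_score + (1 - alpha) * kp_mean
```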

3.2.2 Heatmap rescoring with Gaussian kernel

The second technique is a general approach that can be applied to any keypoint-based architecture. Let H be a heatmap with the center or keypoint scores. Taking the training procedure into account, the 8-connected neighbors of each item can be expected to belong to the same object. This fact can be used to improve the estimation of each heatmap value. Therefore, we applied the following formula:

H′ = H ∗ G_σ,

where ∗ is the convolution operation and G_σ is the Gaussian kernel with standard deviation σ. Experimental results show that in our model, the proposed technique improves the localization of the peaks that correspond to object centers or keypoints, as well as their scores. A similar operation has been considered in [9] as a part of the proposed method.
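A minimal sketch of this smoothing with an explicit 3×3 Gaussian kernel over the 8-connected neighborhood is shown below; the σ value is a placeholder (the paper's σ is found by grid search), and edge padding is an implementation assumption.

```python
import numpy as np

def smooth_heatmap(hm, sigma=0.8):
    """Convolve a score heatmap with a normalized 3x3 Gaussian kernel so
    each value is supported by its 8-connected neighbours."""
    ax = np.arange(-1, 2)
    g1 = np.exp(-ax ** 2 / (2 * sigma ** 2))
    kernel = np.outer(g1, g1)
    kernel /= kernel.sum()                 # keep overall scores comparable
    padded = np.pad(hm, 1, mode="edge")
    out = np.zeros_like(hm)
    H, W = hm.shape
    for dy in range(3):                    # accumulate the 9 shifted copies
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + H, dx:dx + W]
    return out
```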

3.2.3 Keypoint location refinement

The third technique is a recalculation of the refined keypoint locations using the coarse positions. Let p_h be the refined keypoint location from the heatmap and p_c be the coarse position predicted as an offset from the object center. The final keypoint location p is calculated through the following expression:

p = β·p_h + (1 − β)·p_c,

where β ∈ [0, 1].
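As with the center rescoring, this refinement is a simple weighted average, here of the two candidate positions; the default β is a placeholder for the grid-searched value.

```python
def blend_keypoint(refined_xy, coarse_xy, beta=0.9):
    """Weighted average of the heatmap peak and the regressed coarse position."""
    return tuple(beta * r + (1 - beta) * c
                 for r, c in zip(refined_xy, coarse_xy))
```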

3.2.4 Keypoint heatmap rescoring

The fourth technique is a keypoint heatmap rescoring with a 2D Gaussian placed at the coarse keypoint location. Let G be a heatmap with zero values by default. We set 1 into G at the coarse keypoint position and fill the neighboring values with a 2D Gaussian function whose standard deviations are proportional to the object width w and height h. The keypoint heatmap K is rescored through the following expression:

K′ = K ⊙ G,

where ⊙ denotes element-wise multiplication.
3.3 Multi-inference strategies

We consider two extra inference strategies: fusing model outputs from the original and horizontally flipped images with equal weights, and fusing model results for the original image downscaled/upscaled by certain multipliers. These techniques increase accuracy but require several model inferences, significantly increasing the total processing time.
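The flip-fusion step for a single heatmap can be sketched as below. This is a simplification: in a full pipeline, left/right keypoint channels must also be swapped when un-flipping the mirrored prediction, which is omitted here.

```python
import numpy as np

def fuse_with_flip(heatmap, heatmap_from_flipped):
    """Average a heatmap with the prediction obtained from the mirrored
    input image, un-flipped back to the original orientation."""
    return 0.5 * (heatmap + heatmap_from_flipped[:, ::-1])
```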

3.4 Keypoint Refinement Network

At the final stage, detection results are refined with PoseFix [6], a model-agnostic pose refinement method. The method learns the typical error distributions of any other pose estimation method and corrects its mistakes at test time.

We trained a set of 13 PoseFix models, one per class in the DeepFashion2 dataset. The inference results of our method on the training set were used to train each of the 13 models. Subsequently, we applied the trained PoseFix models to the results.

Post-processing   mAP (keypoints)   mAP (boxes)   Infer. time, ms
Base              0.529             0.720         62
NMS               0.530             0.722         62
Technique (1)     0.538             0.717¹        64
Technique (2)     0.533             0.720         73
Technique (3)     0.536             0.720         62
Technique (4)     0.534             0.720         93

¹ Certain techniques can increase keypoint mAP and reduce bounding box mAP simultaneously. Note that bounding box detection and keypoint estimation results for the same object may have different IoU and OKS with the ground truth, for example, when the bounding box was detected correctly but the keypoints were not. In this case, technique (1) lowers the score for false positive keypoints, which is advisable; the corresponding true positive bounding box, however, also suffers from the lowered score.

Table 1: Different post-processing techniques applied independently to the Hourglass model. The technique numbers correspond to the numbers in Section 3.2.

4 Results

All experiments were performed on the publicly available DeepFashion2 Challenge dataset [2], which contains 191,961 images in the training set and 32,153 images in the validation set.

Approach           mAP (keypoints)   mAP (boxes)
Mask R-CNN [2]     0.529             0.638
DeepMark [8]       0.532             0.723
DAFE [1]           0.549             –
Hourglass (ours)   0.583             0.735
Hourglass (ours)   0.591             0.737

Table 2: Accuracy comparison of the proposed and alternative approaches on the DeepFashion2 validation dataset.

We used the CenterNet MS COCO model for object detection as the initial checkpoint and performed experiments with the Hourglass backbone and the Adam optimizer to achieve the state-of-the-art results (Table 2) for the object detection and keypoint estimation tasks on the DeepFashion2 validation dataset. The first Hourglass model was trained for 100 epochs with a batch size of 46 images and the learning rate schedule 1e-3 for 65 epochs, 4e-4 for 20 epochs, and 4e-5 for 15 epochs. The second Hourglass model was fine-tuned from the first for 25 epochs with a batch size of 22 images: 2e-5 for 20 epochs and 1e-5 for 5 epochs.

Technique           mAP (keypoints)   mAP (boxes)
Base                0.529 / 0.520     0.720 / 0.695
+ Post-processing   0.545 / 0.540     0.713 / 0.698
+ NMS               0.548 / 0.549     0.718 / 0.712
+ Flip              0.561 / 0.563     0.731 / 0.727
+ Multiscale        0.568 / 0.578     0.735 / 0.737
+ PoseFix           0.583 / 0.591     0.735 / 0.737

Table 3: Clothing detection and landmark estimation. Two Hourglass models with different input resolutions were used in the experiments; the two values in each cell correspond to the two models. Each technique is added on top of all the previous ones. Post-processing refers to applying all techniques from Section 3.2.

We considered five fast post-processing techniques: bounding box non-maximum suppression and the four techniques from Section 3.2. The individual (Table 1) and combined (Table 3) effectiveness of each technique is shown. During all experiments, our target was to increase the keypoint estimation accuracy rather than the object detection accuracy. For this reason, object detection mAP increased by only 0.015, while the techniques together added more than 0.07 mAP to keypoint estimation.

The parameters of the post-processing techniques were determined through a grid search with a step of 0.05 on a small validation subset (1,285 images).
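For a single parameter in [0, 1], that grid search can be sketched as follows; `score_fn` stands for a (hypothetical) routine that evaluates validation mAP for a given parameter value.

```python
import numpy as np

def grid_search(score_fn, step=0.05):
    """Pick the parameter value in [0, 1] that maximises score_fn."""
    values = np.arange(0.0, 1.0 + step / 2, step)
    return float(max(values, key=score_fn))
```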

Approach Parameter value
Technique (1)
HG technique (2) Center heatmap
Keypoint heatmap
HG technique (2) Center heatmap
Keypoint heatmap
Technique (3)
Multiscale Multipliers: 0.85, 0.95, 1.1
Table 4: Parameters of post-processing and multi-inference techniques.

5 Conclusion

A new approach is proposed as an adaptation of CenterNet [11] to clothing landmark estimation tasks. State-of-the-art accuracy was achieved on the DeepFashion2 dataset by applying several post-processing techniques: clothing detection reached 0.735 mAP and clothing landmark estimation 0.591 mAP. The proposed approach can also be used without the post-processing techniques. With the DLA-34 backbone, it takes 24 ms per image on an RTX 2080 Ti and still yields considerably high accuracy (0.5 and 0.714 mAP for the landmark estimation and detection tasks, respectively; see Figure 1).


  • [1] M. Chen, Y. Qin, L. Qi, and Y. Sun (2019-10) Improving fashion landmark detection by dual attention feature enhancement. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Cited by: Table 2.
  • [2] Y. Ge, R. Zhang, X. Wang, X. Tang, and P. Luo (2019) DeepFashion2: a versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5337–5345. Cited by: DeepMark++: CenterNet-based Clothing Detection, Figure 1, §2, Table 2, §4.
  • [3] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017) Mask R-CNN. CoRR abs/1703.06870. External Links: Link, 1703.06870 Cited by: §2.
  • [4] H. Law and J. Deng (2018) CornerNet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750. Cited by: §2, §2.
  • [5] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37. Cited by: §2.
  • [6] G. Moon, J. Y. Chang, and K. M. Lee (2018) PoseFix: model-agnostic general human pose refinement network. External Links: 1812.03595 Cited by: §3.4.
  • [7] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. External Links: 1603.06937 Cited by: §2.
  • [8] A. Sidnev, A. Trushkov, M. Kazakov, I. Korolev, and V. Sorokin (2019-10) DeepMark: one-shot clothing detection. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Cited by: Table 2.
  • [9] F. Zhang, X. Zhu, H. Dai, M. Ye, and C. Zhu (2019) Distribution-aware coordinate representation for human pose estimation. arXiv preprint arXiv:1910.06278. Cited by: §3.2.2.
  • [10] Z. Zhang, P. Luo, C. C. Loy, and X. Tang (2014) Facial landmark detection by deep multi-task learning. In European conference on computer vision, pp. 94–108. Cited by: §2.
  • [11] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: DeepMark++: CenterNet-based Clothing Detection, §2, §3, §5.