In recent studies, keypoints, also referred to as landmarks, have proved to be one of the most distinctive and robust representations in visual analysis. The class of keypoint-based methods in computer vision covers the detection and further processing of keypoints. These methods can be applied to tasks such as object detection, pose estimation, facial landmark recognition, and more.
The performance of models that operate on keypoints depends strongly on the number of unique landmarks defined in the task, which can be considerably large for some modern datasets. DeepFashion2, one of the newest fashion datasets, provides annotations for 13 classes, each characterized by a certain set of keypoints, with 294 unique keypoints in total.
In this paper, we present a study of the clothing landmark detection task on the DeepFashion2 dataset, as well as an approach to deal with it efficiently.
2 Related work
In general, there are numerous applications for the keypoint estimation task. For instance, keypoints can be used directly to identify human pose or locate facial landmarks, or as the main part of an object detection pipeline. Keypoint-based object detection methods are gaining popularity in recent papers, especially because they are simpler, faster, and more accurate than the corresponding bounding box-based detectors.
Previous approaches, such as , required anchor boxes to be manually designed to train detectors. A series of anchor-free object detectors were then developed with the aim of predicting a bounding box's keypoints instead of trying to fit an object to an anchor. Without relying on manually designed anchors to match objects, CornerNet's  performance on the MS COCO dataset improved significantly. Subsequently, several other variants of keypoint-based one-stage detectors came into existence, one of which is CenterNet .
This paper focuses on the clothing landmark prediction and clothing detection tasks using the DeepFashion2 dataset . The baseline approach for landmark estimation was built on Mask R-CNN , whose two-stage nature makes it substantially heavy and difficult to use on low-power devices. We aim to propose a lightweight architecture free of this drawback. These requirements are well met by CenterNet, which operates directly on keypoints.
3 Proposed approach
The DeepFashion2 dataset contains 13 classes; therefore, 13 channels are used to predict the probabilities of object centers for all classes (Center heatmap in Figure 2). An object center is defined as the center of a bounding box. Two additional channels in the output feature map, x and y, are used to refine the center coordinates, and both width and height are predicted directly.
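The decoding step described above can be sketched in a few lines. The function and tensor layout below (class-first heatmaps, separate offset and size maps) are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def decode_centers(center_heatmap, offsets, sizes, threshold=0.3):
    """Decode bounding boxes from CenterNet-style outputs.

    center_heatmap: (13, H, W) per-class center probabilities
    offsets:        (2, H, W)  sub-pixel x/y refinements of the center
    sizes:          (2, H, W)  directly regressed box width/height
    Names and shapes are illustrative; the real head layout may differ.
    """
    boxes = []
    num_classes, h, w = center_heatmap.shape
    for cls_id in range(num_classes):
        ys, xs = np.where(center_heatmap[cls_id] >= threshold)
        for y, x in zip(ys, xs):
            cx = x + offsets[0, y, x]                 # refined center x
            cy = y + offsets[1, y, x]                 # refined center y
            bw, bh = sizes[0, y, x], sizes[1, y, x]   # width/height, predicted directly
            boxes.append((cls_id,
                          cx - bw / 2, cy - bh / 2,
                          cx + bw / 2, cy + bh / 2,
                          float(center_heatmap[cls_id, y, x])))
    return boxes
```

In practice the peak extraction would also include non-maximum suppression on the heatmap, which is omitted here for brevity.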
The fashion landmark estimation task involves estimating 2D keypoint locations for each item of clothing in an image. The coarse locations of the keypoints are regressed as relative displacements from the box center (Coarse keypoints in Figure 2). To refine a keypoint location, a heatmap with probabilities is used for each keypoint type. A local maximum with high confidence in the heatmap is used as the refined keypoint position. Similar to the detection case, two additional channels, x and y, are used to obtain more precise landmark coordinates. During model inference, each coarse keypoint location is replaced with the closest refined keypoint position.
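The final matching step, replacing each coarse keypoint with the closest refined heatmap peak, can be sketched as follows (array shapes are illustrative):

```python
import numpy as np

def snap_to_refined(coarse_kps, refined_kps):
    """Replace each coarse keypoint with the closest refined heatmap peak.

    coarse_kps:  (N, 2) coarse (x, y) positions regressed from the center
    refined_kps: (M, 2) high-confidence local maxima of the keypoint heatmap
    """
    out = coarse_kps.astype(float).copy()
    if len(refined_kps) == 0:
        return out  # no confident peaks: keep the coarse positions
    for i, kp in enumerate(coarse_kps):
        d = np.linalg.norm(refined_kps - kp, axis=1)  # distance to every peak
        out[i] = refined_kps[np.argmin(d)]            # nearest refined position wins
    return out
```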
3.1 Semantic keypoint grouping
One of the first steps involved in solving keypoint detection tasks is defining the model output. The number of keypoints per category in the DeepFashion2 dataset varies from 8 for a skirt to 39 for long sleeve outerwear. The total number of unique keypoints is 294. The simple approach is to concatenate the keypoints of every category and deal with each keypoint separately. Directly predicting 294 keypoints leads to a huge number of output channels (588 from coarse keypoints and 294 from the keypoint heatmap).
It is evident that certain clothing landmarks are a subset of others. For example, shorts do not require unique keypoints because they can be represented by a subset of the trousers keypoints. The semantic grouping rule is defined as follows: keypoints with identical semantic meaning (collar center, top sleeve edge, etc.) from different categories can be merged into one group. This approach yields 62 groups and reduces the number of output channels from 901 to 205.
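The channel arithmetic behind these totals can be verified with a short helper. The head layout assumed below (13 class-center channels, 2 center offsets, 2 box sizes, 2K coarse-keypoint offsets, K keypoint heatmaps, and 2 keypoint offsets) is an inference from the text rather than a stated specification, but it reproduces both reported totals:

```python
def output_channels(num_keypoints, num_classes=13):
    """Count output channels for a CenterNet-style DeepFashion2 head.

    Assumed layout: class-center heatmap + 2 center offsets + 2 box sizes
    + 2*K coarse keypoint displacements + K keypoint heatmaps
    + 2 keypoint sub-pixel offsets.
    """
    detection = num_classes + 2 + 2          # centers, center offsets, sizes
    keypoints = 2 * num_keypoints + num_keypoints + 2  # coarse, heatmap, offsets
    return detection + keypoints
```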
The semantic grouping approach reduces training time and memory consumption by up to 26% and 28%, respectively, without any accuracy drop (see Figure 3). The latter reduction enables the use of larger batches during model training.
3.2 Post-processing techniques
We have developed four post-processing techniques that increase the model's accuracy without compromising performance.
3.2.1 Center rescoring
The first technique is a recalculation of the detection confidence score using keypoint scores from the keypoint heatmap. Let s_c be the original detection confidence score from the center heatmap and s_k the average score of the refined keypoints for the predicted category from the keypoint heatmap. The final detection confidence score is computed by combining s_c and s_k.
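The rescoring step can be sketched as follows. The simple average used here is an assumption for illustration; the exact weighting of s_c and s_k may differ from the original formula:

```python
def rescore_center(s_center, keypoint_scores):
    """Combine the detection score with the mean refined-keypoint score.

    The 50/50 average is an illustrative assumption, not the paper's
    exact weighting.
    """
    s_kp = sum(keypoint_scores) / len(keypoint_scores)  # average keypoint score
    return 0.5 * (s_center + s_kp)
```

A detection whose keypoints score poorly is thus pushed down the ranking even if its center score was high.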
3.2.2 Heatmap rescoring with Gaussian kernel
The second technique is a general approach that can be applied to any keypoint-based architecture. Let H be a heatmap with center or keypoint scores. Taking the training procedure into account, the 8-connected neighbors of each cell can be expected to relate to the same object. This fact can be used to improve the estimation of each heatmap value: the heatmap is convolved with a Gaussian kernel G with a fixed standard deviation, and the result replaces the original scores. Experimental results show that in our model the proposed technique improves the localization of the peaks that correspond to object centers or keypoints, as well as their scores. A similar operation has been considered in  as a part of the proposed method.
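The smoothing operation can be sketched in pure NumPy. Normalizing the kernel and using zero padding at the borders are choices made here for illustration; the original method may differ in these details:

```python
import numpy as np

def gaussian_rescore(heatmap, sigma=1.0, radius=2):
    """Smooth a 2D score heatmap with a Gaussian kernel.

    Convolving pools evidence from the neighbors of each cell, which
    sharpens peak localization. Kernel normalization and zero padding
    are illustrative choices.
    """
    size = 2 * radius + 1
    ax = np.arange(size) - radius
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()                       # keep total score mass comparable
    padded = np.pad(heatmap, radius, mode="constant")
    out = np.zeros_like(heatmap, dtype=float)
    h, w = heatmap.shape
    for y in range(h):
        for x in range(w):
            out[y, x] = (padded[y:y + size, x:x + size] * kernel).sum()
    return out
```

In a real pipeline this would be done with an optimized convolution (e.g. `scipy.ndimage.gaussian_filter`) rather than explicit loops.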
3.2.3 Keypoint location refinement
The third technique is a recalculation of the refined keypoint locations using the coarse positions. Let p_r be the refined keypoint location from the heatmap and p_c the coarse position predicted as an offset from the object center. The final keypoint location is computed by combining p_r and p_c.
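A natural way to combine the two estimates is a weighted average. Both the blending rule and the value of alpha below are illustrative assumptions, not the paper's exact expression:

```python
def combine_keypoint(refined, coarse, alpha=0.75):
    """Blend the heatmap-refined and the regressed (coarse) keypoint location.

    refined, coarse: (x, y) tuples. The weighted average and alpha=0.75
    are assumptions for illustration.
    """
    return (alpha * refined[0] + (1 - alpha) * coarse[0],
            alpha * refined[1] + (1 - alpha) * coarse[1])
```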
3.2.4 Keypoint heatmap rescoring
The fourth technique is a rescoring of the keypoint heatmap with a Gaussian prior built from the coarse keypoint location. Let G be a heatmap with zero values by default. We place the peak value at the coarse keypoint position and fill the neighboring values with a 2D Gaussian function whose standard deviations are proportional to the object width and height. The keypoint heatmap is then rescored by combining it with G.
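A sketch of this prior is below. The proportionality constant k and the element-wise multiplication used for rescoring are illustrative assumptions; the paper's exact combination rule may differ:

```python
import numpy as np

def gaussian_prior(shape, cx, cy, obj_w, obj_h, k=0.1):
    """Build a Gaussian prior map peaked at the coarse keypoint (cx, cy).

    The spread scales with the object size; k is an illustrative constant.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    sx, sy = k * obj_w, k * obj_h
    return np.exp(-(((xs - cx) ** 2) / (2 * sx ** 2)
                    + ((ys - cy) ** 2) / (2 * sy ** 2)))

def rescore_keypoint_heatmap(heatmap, prior):
    """Modulate the keypoint heatmap by the prior (element-wise product,
    an assumed combination rule). Peaks far from the coarse position are
    suppressed."""
    return heatmap * prior
```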
3.3 Multi-inference strategies
We consider two extra inference strategies: fusing model outputs from the original and horizontally flipped images with equal weights, and fusing model results with the original image downscaled/upscaled by certain multipliers. The proposed techniques increase accuracy but require several model inferences, significantly affecting the overall processing time.
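The flip-fusion strategy can be sketched as follows, assuming heatmaps shaped (channels, H, W). The optional channel swapping via `flip_pairs` (for left/right symmetric keypoints) is an assumption about the details:

```python
import numpy as np

def fuse_flip(heatmap, heatmap_flipped, flip_pairs=None):
    """Average heatmaps predicted on the original and the flipped image.

    heatmap_flipped is the raw prediction on the mirrored input; it is
    un-flipped along the x axis before fusing. flip_pairs, e.g.
    [(left_idx, right_idx), ...], swaps symmetric keypoint channels.
    """
    restored = heatmap_flipped[..., ::-1]      # undo the horizontal flip
    if flip_pairs:
        restored = restored.copy()
        for a, b in flip_pairs:
            restored[[a, b]] = restored[[b, a]]  # swap left/right channels
    return 0.5 * (heatmap + restored)            # equal-weight fusion
```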
3.4 Keypoint Refinement Network
At the final stage, detection results are refined with PoseFix , a model-agnostic pose refinement method. The method learns the typical error distributions of any other pose estimation method and corrects its mistakes at test time.
We trained a set of 13 PoseFix models, one per class in the DeepFashion2 dataset. The inference results of our method on the training set are used to train each of the 13 models. Subsequently, we applied the trained PoseFix models to the results.
|Technique (1)||0.538||0.717¹||64|

¹Certain techniques can increase one metric while reducing the other. Note that bounding box detection and keypoint estimation results for the same object may have different IoU and OKS with the ground truth, for example, when the bounding box was detected correctly but the keypoints were not. In this case, technique (1) lowers the score for false positive keypoints, which is advisable; however, the corresponding true positive bounding box also suffers from the lowered score.
4 Experiments

All experiments were performed on the publicly available DeepFashion2 Challenge dataset , which contains 191,961 images in the training set and 32,153 images in the validation set.
|Mask R-CNN ||0.529||0.638|
We used the CenterNet MS COCO object detection model as the initial checkpoint and performed experiments with the Hourglass backbone and the Adam optimizer to achieve state-of-the-art results (Table 2) for the object detection and keypoint estimation tasks on the DeepFashion2 validation dataset. The Hourglass model was trained for 100 epochs with a batch size of 46 images and the following learning rate schedule: 1e-3 for 65 epochs, 4e-4 for 20 epochs, and 4e-5 for 15 epochs. A second Hourglass model was fine-tuned from it for 25 epochs with a batch size of 22 images: 2e-5 for 20 epochs and 1e-5 for 5 epochs.
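The step schedule from the main training run can be written as a simple function of the epoch (values taken directly from the text):

```python
def learning_rate(epoch):
    """Step schedule for the 100-epoch Hourglass training run:
    1e-3 for epochs 0-64, 4e-4 for 65-84, 4e-5 for 85-99."""
    if epoch < 65:
        return 1e-3
    if epoch < 85:
        return 4e-4
    return 4e-5
```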
We considered 5 fast post-processing techniques: bounding box non-maximum suppression and the 4 techniques from Section 3.2. The individual (Table 1) and combined (Table 3) effectiveness of each technique is shown. During all experiments, our target was to increase the keypoint estimation accuracy rather than the object detection accuracy. For this reason, object detection mAP increased by only 0.015, while the techniques added more than 0.07 mAP to keypoint estimation.
The parameters of the post-processing techniques were determined through a grid search with step 0.05 on a small validation subset (1,285 images).
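A fixed-step grid search of this kind can be sketched as below. The `score_fn` callback (mapping a parameter tuple to validation accuracy) and the function names are hypothetical; in the experiments the objective would be mAP on the validation subset:

```python
import itertools

def grid_search(score_fn, ranges, step=0.05):
    """Exhaustive search over parameter ranges with a fixed step.

    score_fn: callable taking a tuple of parameter values, returning a score
    ranges:   list of (low, high) bounds, one per parameter
    """
    def frange(lo, hi):
        vals, v = [], lo
        while v <= hi + 1e-9:
            vals.append(round(v, 10))  # round away accumulated float error
            v += step
        return vals

    best, best_score = None, float("-inf")
    for combo in itertools.product(*(frange(lo, hi) for lo, hi in ranges)):
        s = score_fn(combo)
        if s > best_score:
            best, best_score = combo, s
    return best, best_score
```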
5 Conclusion

This new approach is proposed as an adaptation of CenterNet  for the clothing landmark estimation task. State-of-the-art accuracy was achieved on the DeepFashion2 dataset by applying several post-processing techniques: clothing detection reached 0.735 mAP and clothing landmark estimation 0.591 mAP. The proposed approach can also be used without the post-processing techniques: with the DLA-34 backbone , it takes 24 ms per image on an RTX 2080 Ti and still yields considerably high accuracy (0.5 and 0.714 mAP for landmark estimation and detection, respectively; see Figure 1).
- (2019) Improving fashion landmark detection by dual attention feature enhancement. In IEEE International Conference on Computer Vision (ICCV) Workshops.
- (2019) DeepFashion2: a versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5337–5345.
- (2017) Mask R-CNN. CoRR abs/1703.06870.
- (2018) CornerNet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750.
- (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
- (2018) PoseFix: model-agnostic general human pose refinement network.
- (2016) Stacked hourglass networks for human pose estimation.
- (2019) DeepMark: one-shot clothing detection. In IEEE International Conference on Computer Vision (ICCV) Workshops.
- (2019) Distribution-aware coordinate representation for human pose estimation. arXiv preprint arXiv:1910.06278.
- (2014) Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision, pp. 94–108.
- (2019) Objects as points. arXiv preprint arXiv:1904.07850.