The 2D humanoid pose estimation problem aims to detect and localize keypoints and parts and to infer the limb connections in order to reconstruct the human poses present in an image. Human pose estimation is important because it has many applications in areas such as human-computer interaction and action recognition. In this work, we address the real-time pose estimation problem for humanoid robots (see Fig. 1). The shape similarity between humanoid robots and persons is a double-edged sword. On the one hand, it enables us to start with existing methods designed for persons; on the other hand, it adds the difficulty of not confusing humans with humanoid robots, especially in the AdultSize league. In general, approaches to pose estimation for multiple persons can be categorized as either top-down or bottom-up.
In top-down models, the procedure consists of two distinct steps. The first step detects the individual people, and the second step estimates the pose of each detected person. One disadvantage of these models is that their performance is tightly coupled to the performance of the person detector. Although the state-of-the-art (SOTA) results are obtained by this type of model, including Cascaded Pyramid Network and High-Resolution Net (HRNet), the runtime of such approaches suffers from the number of persons present, as a single-person pose estimator is run for each detection. Hence, the computational cost increases linearly with the number of persons, so these methods often do not run in real time.
In contrast, bottom-up approaches detect body joints and group them into individuals simultaneously; therefore, they are less dependent on the number of persons in the image. One of the main challenges of bottom-up methods is to group the detected keypoints both accurately and in real time. Recent approaches [2, 19, 11] utilize a greedy algorithm to group the detected keypoints into individual instances. Moreover, compared to top-down approaches, the performance of bottom-up methods is more vulnerable to varying person scales within an image. To alleviate this issue, previous works exploit a scale search or rely on a high-resolution input size. These solutions increase the inference time, though. A time-efficient method that predicts keypoints at higher resolution was introduced by Cheng et al., narrowing the performance gap between bottom-up and top-down models.
This paper opts for a bottom-up approach designed for 2D pose estimation of multiple humanoid robots. We made this choice because the inference time of top-down methods is generally much higher than that of bottom-up approaches, which makes them unsuitable for real-time RoboCup applications. Furthermore, we wanted to avoid the cost of annotating bounding boxes. We remedy the scale-variation problem of our bottom-up model by using a feature pyramid structure that exploits high-resolution feature maps.
Despite the availability of several large-scale benchmark datasets such as MPII Human Pose and MS COCO for the task of human pose estimation, we cannot fully utilize them because of differences between robots and humans, such as types and sizes. Thus, we present a new pose dataset of robots from the RoboCup Humanoid League. The code and dataset of this paper are publicly available at https://github.com/AIS-Bonn/HumanoidRobotPoseEstimation. In summary, we make the following contributions:
- We propose a deep learning model specifically designed to address the 2D pose estimation problem for multiple humanoid robots.
- We introduce a new dataset, namely the HumanoidRobotPose dataset, consisting of robots from the RoboCup Humanoid League.
- We demonstrate that the proposed real-time lightweight model outperforms the SOTA bottom-up methods in our application.
2 Related Work
Although there are some works on the detection and tracking of humanoid robots [7, 6], to the best of our knowledge, there is no previous work on humanoid robot pose estimation that works on a variety of robots. Giambattista et al. propose a gesture-based communication between Nao robots that utilizes OpenPose for Nao robot pose estimation. Note that pose estimation on a single standardized Nao robot type is significantly easier than what we need in the Humanoid League, where we have to handle various unseen robots with different colors, kinematic shapes, and sizes.
Most of the existing top-down methods exploit detector models such as Feature Pyramid Networks and Faster R-CNN to detect persons. Papandreou et al. propose one of the first top-down models, which employs Faster R-CNN for the person detection step and presents a new keypoint representation that combines a binary activation heatmap with the corresponding offsets. The most recent top-down approaches that obtain SOTA accuracy are the Cascaded Pyramid Network (CPN) introduced by Chen et al., where multi-scale feature maps from different layers of the GlobalNet are integrated with an online hard keypoint mining loss for difficult-to-detect joints, and the model presented by Sun et al., which improves the heatmap estimation using high-resolution representations and multiple branches of different resolutions.
Among bottom-up methods, recent architectures [2, 16, 18, 19, 12] take advantage of confidence maps to detect the keypoints. Kreiss et al. introduce a combination of confidence maps and vectorial parts for keypoint detection. Moreover, the SOTA bottom-up models use different approaches for encoding the part association: OpenPose introduces the Part Affinity Fields (PAFs) method, which learns the body part associations by encoding their location and direction; offset regression uses the displacements of the keypoints [18, 19]; and tag heatmaps produce a heatmap as a tag for each keypoint heatmap [12, 16]. Pose Partition Networks present a dense regression approach over all the keypoints to generate the partitions of individuals using embedding spaces.
3 Pose Estimation Model
In this section, we present our real-time bottom-up approach to pose estimation of multiple humanoid robots. The aim is to predict the part coordinates and the part associations to build robot poses. In the following, we first describe the model, then explain the keypoint detection and the part association methods in detail.
3.1 Network Architecture
Following the successful results of NimbRo-Net  and NimbRo-Net2 , we decided to utilize a similar architecture. This decision ensures that later we can combine this model with NimbRo-Net2 to have a unified network for multiple tasks related to the humanoid league. The proposed network is depicted in Fig. 2.
Our model is an encoder-decoder network that takes an RGB image as input. We observe that the encoder needs to be deeper than the decoder to create a powerful feature extractor. The encoder is a pre-trained ResNet model, from which the final fully connected layer and the global average pooling layer are removed. The first layer is a $7 \times 7$ convolution with stride 2, followed by a max-pooling layer. The rest of the encoder network consists of four modules of residual blocks, with depth increasing and resolution decreasing from module to module. Each residual block consists of two or three convolutional layers, depending on the selected ResNet architecture, each followed by batch normalization and ReLU activation, together with a shortcut connection. More fine-grained spatial information is present in the early layers, while in the final layers, the network extracts more semantic information.
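To make the resolution hierarchy of the encoder concrete, the following minimal sketch traces the spatial size of the feature maps through a standard ResNet stem and its four modules. The kernel, stride, and padding values are those of the standard ResNet design, not values stated in this paper, and the 384 × 384 input in the usage line is purely a hypothetical example.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet_stage_sizes(height, width):
    """Trace spatial sizes through a standard ResNet encoder.

    Assumes the usual layout: 7x7/2 stem convolution (padding 3),
    3x3/2 max pooling (padding 1), then four modules where modules
    2 to 4 each halve the resolution with a stride-2 layer.
    """
    sizes = {}
    h, w = conv_out(height, 7, 2, 3), conv_out(width, 7, 2, 3)
    sizes["stem_conv"] = (h, w)
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    sizes["max_pool"] = (h, w)
    sizes["module1"] = (h, w)  # first module keeps the resolution
    for i in (2, 3, 4):
        h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
        sizes[f"module{i}"] = (h, w)
    return sizes


if __name__ == "__main__":
    # Hypothetical input size, chosen only for illustration.
    for name, size in resnet_stage_sizes(384, 384).items():
        print(name, size)
```

The four module outputs sit at 1/4, 1/8, 1/16, and 1/32 of the input resolution, which is why lateral connections from the early modules are the ones that carry fine spatial detail.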
In the decoder part, we utilize lateral connections from different parts of the encoder, which allows us to maintain the high-resolution information. For every lateral connection, we apply a convolution to generate a fixed number of channels. The decoder network has a feature pyramid structure involving four modules. At each level of the pyramid, the output of the previous level is fed to a transposed convolution followed by bilinear upsampling to obtain a fixed number of higher-resolution features. These upsampled features are concatenated with the features from the corresponding lateral connection. Similar to the encoder, ReLU and batch normalization are used to get the final output of the module.
As high-resolution feature maps are essential for precise keypoint localization, we leverage two scales of the feature pyramid hierarchy. The keypoint heatmaps at each scale are generated by performing a final convolution on the extracted features. As depicted in Fig. 2, we have two scales of keypoint heatmaps with intermediate supervision, inspired by HigherHRNet. The final keypoint heatmaps are the average of the predictions generated at these two scales after upsampling them to the resolution of the input image, which yields accurate high-resolution predictions. Note that only one scale of limb heatmaps is utilized, as we observe that following the same approach as for the keypoints yields a performance drop.
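The fusion of the two keypoint heatmap scales can be sketched as follows. This is an illustrative sketch only: it uses nearest-neighbour upsampling as a simple stand-in for whatever interpolation the actual model applies, and the function names `upsample_nearest` and `fuse_scales` are hypothetical.

```python
def upsample_nearest(heatmap, out_h, out_w):
    """Resize a 2D heatmap (list of rows) by nearest-neighbour sampling."""
    in_h, in_w = len(heatmap), len(heatmap[0])
    return [
        [heatmap[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def fuse_scales(heatmaps, out_h, out_w):
    """Upsample every scale to (out_h, out_w) and average them pixel-wise."""
    upsampled = [upsample_nearest(h, out_h, out_w) for h in heatmaps]
    n = len(upsampled)
    return [
        [sum(u[y][x] for u in upsampled) / n for x in range(out_w)]
        for y in range(out_h)
    ]


if __name__ == "__main__":
    coarse = [[0.0, 1.0]]                    # lower-resolution prediction
    fine = [[0.0, 0.0, 0.0, 1.0],
            [0.0, 0.0, 0.0, 1.0]]            # higher-resolution prediction
    print(fuse_scales([coarse, fine], 2, 4))
```

In the usage example, the two scales disagree about one column, and the averaged map assigns it an intermediate response, which is the smoothing effect that averaging over scales is meant to provide.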
3.2 Keypoint Detection and Part Association
The ground truth heatmaps of the keypoints in a given image can be represented as the set $H = \{H_1, H_2, \dots, H_K\}$, where $H_k \in \mathbb{R}^{w \times h}$, $k \in \{1, \dots, K\}$, and $K$ is the total number of keypoints of a robot instance, which is equal to six for our dataset (see Fig. 3 (left)). The heatmap $H_k$ with the resolution of $w \times h$ includes the Gaussian heatmaps of the $k$th part of all the robot instances. Let $p \in \mathbb{R}^2$ be a pixel location and $x_{k,n} \in \mathbb{R}^2$ be the location of the $k$th part of the $n$th instance, where $n \in \{1, \dots, N\}$ and $N$ is the total number of robots with a visible $k$th part in the image. To embed the position of the annotated $k$th part, we use the 2D unnormalized Gaussian distribution with the center $x_{k,n}$ and the standard deviation $\sigma$, which is fixed for all the parts:

$$H_{k,n}(p) = \exp\left(-\frac{\lVert p - x_{k,n} \rVert_2^2}{2\sigma^2}\right) \tag{1}$$

Due to occlusion or proximity of the robot instances in a given image, we apply the pixel-wise max operation to Eq. 1 over all instances, $H_k(p) = \max_n H_{k,n}(p)$, to preserve the Gaussian peak of the $k$th part for each instance.
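A minimal sketch of how such a ground truth keypoint heatmap could be rendered is given below. It assumes the conventional unnormalized Gaussian with $2\sigma^2$ in the denominator; the function name `keypoint_heatmap` and the toy sizes are hypothetical and not taken from the released code.

```python
import math


def keypoint_heatmap(centers, width, height, sigma):
    """Render one keypoint type for all instances into a single heatmap.

    centers: list of (x, y) locations of this keypoint, one per robot.
    Overlapping Gaussians are merged with a pixel-wise max so that every
    instance keeps a peak of exactly 1.0 at its own location.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for cx, cy in centers:
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap


if __name__ == "__main__":
    # Two nearby robots whose Gaussians overlap (toy example).
    hm = keypoint_heatmap([(2, 2), (4, 2)], width=7, height=5, sigma=1.0)
    print([round(v, 3) for v in hm[2]])
```

With a sum instead of a max, the pixel between the two robots would receive a response above 1 and could be mistaken for a third peak; the max keeps the two peaks distinct.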
For limbs, the ground truth heatmaps in a given image can be expressed as the set $L = \{L_1, L_2, \dots, L_C\}$, where $L_c \in \mathbb{R}^{w \times h}$, $c \in \{1, \dots, C\}$, and $C$ is the total number of limbs of a robot instance, which is five in this work (see Fig. 3 (left)). Note that the intended utility of limbs is only to encode the relations between keypoints, so they do not necessarily lie on actual robot limbs. Therefore, to encode the position of a limb, we first compute the line segment between its two keypoints and mark all of the points that lie on this limb, following the approach proposed by Li et al. Then, based on the offsets to these marked points, the Gaussian heatmap of each limb is generated by an unnormalized Gaussian distribution with the standard deviation $\sigma$, which controls the spread of the Gaussian peak in the same way as for the keypoint heatmap. The final limb heatmap is the average over the limbs of all the robots appearing in the image. In contrast to the PAFs method, which encodes each limb in two channels as vector directions, we encode each limb type in a single channel. This simpler encoding is sufficient for our application: in our experiments, it performs better than the PAF method.
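The single-channel limb encoding can be sketched as below. This is one plausible reading of the description, not the authors' implementation: each pixel's response is an unnormalized Gaussian of its distance to the line segment between the two keypoints, and the per-robot maps are averaged over the instances. The names `point_segment_distance` and `limb_heatmap` are hypothetical.

```python
import math


def point_segment_distance(px, py, ax, ay, bx, by):
    """Euclidean distance from point (px, py) to the segment a-b."""
    abx, aby = bx - ax, by - ay
    length_sq = abx * abx + aby * aby
    if length_sq == 0.0:  # degenerate limb: both keypoints coincide
        return math.hypot(px - ax, py - ay)
    # Project the point onto the line and clamp to the segment ends.
    t = ((px - ax) * abx + (py - ay) * aby) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * abx), py - (ay + t * aby))


def limb_heatmap(limbs, width, height, sigma):
    """Render one limb type for all robots into a single channel.

    limbs: list of ((ax, ay), (bx, by)) keypoint pairs, one per robot.
    Assumption: "average" means the mean over the per-robot maps.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    if not limbs:
        return heatmap
    for (ax, ay), (bx, by) in limbs:
        for y in range(height):
            for x in range(width):
                d = point_segment_distance(x, y, ax, ay, bx, by)
                heatmap[y][x] += math.exp(-d * d / (2.0 * sigma ** 2))
    return [[v / len(limbs) for v in row] for row in heatmap]


if __name__ == "__main__":
    hm = limb_heatmap([((1, 2), (5, 2))], width=7, height=5, sigma=1.0)
    print([round(v, 3) for v in hm[2]])
```

Because the response depends only on the distance to the segment, one channel per limb type is enough, in contrast to the two-channel vector field used by PAFs.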
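We use the mean square error to compute the loss between the predicted heatmaps and the ground truth heatmaps for both the keypoints and the limbs:

$$\mathcal{L}_{\text{keypoint}} = \sum_{s} \sum_{k=1}^{K} \sum_{p} M(p) \, \lVert \hat{H}_k^s(p) - H_k^s(p) \rVert_2^2 \tag{2}$$

$$\mathcal{L}_{\text{limb}} = \sum_{c=1}^{C} \sum_{p} M(p) \, \lVert \hat{L}_c(p) - L_c(p) \rVert_2^2 \tag{3}$$

where $\hat{H}_k^s$ and $H_k^s$ are the predicted and ground truth heatmaps of the $k$th keypoint at scale $s$, $\hat{L}_c$ and $L_c$ are the predicted and ground truth heatmaps of the $c$th limb, $M$ is a binary mask with $M(p) = 0$ when the annotation is missing at pixel $p$ in the image, and $s$ is the scale of the predicted heatmaps. Finally, the total loss used to train the network is the sum of the keypoint loss (2) and the limb loss (3).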
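The masked loss can be illustrated with the short sketch below. It is a simplified stand-in under stated assumptions: the mask is taken to be 0 where the annotation is missing and 1 elsewhere, the error is averaged over all pixels of a heatmap, and the function names are hypothetical. The exact normalization used for training is not specified here.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over one heatmap, ignoring masked-out pixels.

    mask holds 0 where the annotation is missing and 1 elsewhere
    (assumed convention); masked pixels contribute zero error.
    """
    total, count = 0.0, 0
    for prow, trow, mrow in zip(pred, target, mask):
        for p, t, m in zip(prow, trow, mrow):
            total += m * (p - t) ** 2
            count += 1
    return total / count if count else 0.0


def heatmap_set_loss(preds, targets, mask):
    """Sum of masked MSE over a list of heatmaps that share one mask."""
    return sum(masked_mse(p, t, mask) for p, t in zip(preds, targets))


def total_loss(keypoint_scales, limb_scale):
    """Keypoint loss summed over all scales plus the single-scale limb loss.

    keypoint_scales: list of (preds, targets, mask), one tuple per scale.
    limb_scale: one (preds, targets, mask) tuple.
    """
    keypoint_loss = sum(heatmap_set_loss(*scale) for scale in keypoint_scales)
    limb_loss = heatmap_set_loss(*limb_scale)
    return keypoint_loss + limb_loss


if __name__ == "__main__":
    pred, target = [[1.0, 0.0]], [[0.0, 0.0]]
    print(masked_mse(pred, target, [[1, 1]]))  # error counted
    print(masked_mse(pred, target, [[0, 1]]))  # error masked out
```

Masking matters for this kind of dataset because unannotated robots would otherwise be treated as background and penalized when the network correctly detects them.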
3.4 Post Processing
By performing Non-Maximum Suppression on the predicted keypoint heatmaps, we obtain the peak of each Gaussian and thus the location of the corresponding keypoint for each robot instance. We use the predicted limb heatmaps to obtain the candidate connections between the keypoints. As there are multiple robot instances in an image, the keypoints must be grouped to determine which pose belongs to which individual. Given the set of keypoints and the connection candidates, we employ the greedy algorithm proposed by Cao et al. to solve the assignment problem and obtain the final poses of all robot instances. Instead of considering the fully connected graph, this algorithm uses a minimum spanning tree of the pose instance and assigns adjacent tree nodes independently, resulting in a good approximate solution at low computational cost.
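The two post-processing steps can be sketched for a single limb type as follows. This is a simplified illustration rather than the algorithm of Cao et al. as published: the 3 × 3 suppression window, the thresholds, the scoring of a candidate by sampling the limb heatmap along the segment, and all function names are assumptions made for this example.

```python
def find_peaks(heatmap, threshold=0.1):
    """Return (x, y, score) for pixels strictly above all 3x3 neighbours."""
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            value = heatmap[y][x]
            if value < threshold:
                continue
            neighbours = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if all(value > n for n in neighbours):
                peaks.append((x, y, value))
    return peaks


def limb_score(limb_map, a, b, samples=10):
    """Mean limb-heatmap response sampled along the segment from a to b."""
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        x = round(a[0] + t * (b[0] - a[0]))
        y = round(a[1] + t * (b[1] - a[1]))
        total += limb_map[y][x]
    return total / samples


def greedy_connections(starts, ends, limb_map, min_score=0.5):
    """Greedily pair start and end keypoints of one limb type by score."""
    candidates = sorted(
        ((limb_score(limb_map, a, b), i, j)
         for i, a in enumerate(starts) for j, b in enumerate(ends)),
        reverse=True,
    )
    used_starts, used_ends, connections = set(), set(), []
    for score, i, j in candidates:
        if score < min_score:
            break  # remaining candidates are even weaker
        if i in used_starts or j in used_ends:
            continue  # each keypoint joins at most one limb of this type
        used_starts.add(i)
        used_ends.add(j)
        connections.append((i, j, score))
    return connections
```

Running this matching once per limb of the pose tree and merging connections that share a keypoint yields the full poses; treating each limb type independently is what keeps the cost low compared with matching over the fully connected graph.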
This section explains the paper’s additional contribution, the HumanoidRobotPose dataset, including data collection and annotation procedures and the evaluation metrics used for this dataset.
Our goal was to collect a dataset containing both single and multiple robots to reflect the real conditions at RoboCup. We gathered many YouTube videos from the RoboCup Humanoid League, as well as some in-house videos and ROS bags. Some videos originate from qualification videos, which only demonstrate a specific robot and therefore contain a single pose. To include videos with multiple robots and increase the diversity of robots in the dataset, we also employ videos from drop-in games and round-robin competitions. These videos are from recent years and contain various view angles, lens distortions, brightness levels, and robots. Note that in most of the videos, humans are present in the pictures, e.g., the robot handler, the referee, and the audience around the field. Overall, we annotated over manually selected frames from videos with around k robot instances. These frames include teen- and adult-sized robots and contain more than ten different robot types. About percent of the dataset was exclusively used for testing. Note that the testing frames were collected from different videos than the training frames.
4.1 Data Annotation
For data annotation, we used Supervise.ly (https://supervise.ly), a web-based data annotation and management tool. We decided to ignore the truncated or severely occluded points in the image, which are usually considered invisible keypoints. For each robot, six keypoints are annotated: head, trunk, hands, and feet. The head keypoint is important, for instance, to estimate the height of the goalie robot. We restrict ourselves to these few keypoints to limit annotation costs; however, the set can easily be extended with more keypoints. From these keypoints, we define a minimal pose representation of five limbs, which is sufficient for the current level of soccer behavior. The annotation for a robot instance is illustrated in Fig. 3 (left).
To show the diversity of our dataset, we visualize the variability of the annotated poses in Fig. 3 (right) and present statistics of the number and scales of the robot instances in Fig. 4. About percent of the collected frames contain single pose instances and robot instances with medium scale, i.e., $32^2 < \text{segment area} < 96^2$, where the segment area of a robot instance is measured using the size of the minimum encapsulating rectangle of the annotated keypoints. Our definition of the size scales is identical to that of the COCO dataset (https://cocodataset.org/#keypoints-eval).
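The scale assignment can be sketched as follows, assuming an axis-aligned enclosing rectangle and the standard COCO size thresholds of $32^2$ and $96^2$ pixels. The function names are hypothetical and the keypoint coordinates in the usage line are made up for illustration.

```python
def segment_area(keypoints):
    """Area of the axis-aligned rectangle enclosing the annotated keypoints.

    keypoints: list of (x, y) pixel coordinates of the visible keypoints.
    """
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def scale_category(area):
    """Map a segment area in pixels to the COCO size categories."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


if __name__ == "__main__":
    # Made-up keypoints: head, trunk, two hands, two feet.
    robot = [(35, 10), (35, 50), (10, 45), (60, 45), (25, 110), (45, 110)]
    area = segment_area(robot)
    print(area, scale_category(area))
```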
4.2 Evaluation Metrics
We use the Object Keypoint Similarity (OKS) metric from the COCO keypoint dataset. The OKS of a robot between the detected keypoints ($\hat{x}_i$) and their corresponding ground truth ($x_i$) can be written as follows:

$$\mathrm{OKS} = \frac{\sum_i \exp\left(-\dfrac{\lVert \hat{x}_i - x_i \rVert_2^2}{2 a k_i^2}\right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$

where $k_i$ is a constant specific to each keypoint, which is equal for all keypoints in our dataset, $a$ is the segment area of the robot instance measured in pixels, and $v_i$ is the keypoint visibility flag in the ground truth ($v_i = 0$ for an invisible keypoint). The OKS metric is robust to the number of visible keypoints as it gives equal importance to robot instances with different numbers of visible keypoints. The evaluation metrics used for the proposed dataset are as follows: AP (the mean average precision over 10 OKS thresholds = [0.50:0.05:0.95]), AP50 (AP at OKS threshold = 0.50), AP75, APM for medium scale robot instances, APL for large scale instances, and AR (the mean of average recall over 10 OKS thresholds).
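As a worked illustration of the metric, the sketch below computes the OKS of one robot instance following the COCO definition, where the squared distance of each visible keypoint is normalized by twice the segment area times the squared per-keypoint constant. The function name `oks` and the numeric values in the usage example are hypothetical.

```python
import math


def oks(pred, gt, visibility, area, k):
    """Object Keypoint Similarity of one instance (COCO definition).

    pred, gt: lists of (x, y) keypoint locations in pixels.
    visibility: ground truth flags, 0 marks an invisible keypoint.
    area: segment area of the instance in pixels.
    k: per-keypoint constant, shared by all keypoints here.
    """
    total, visible = 0.0, 0
    for (px, py), (gx, gy), v in zip(pred, gt, visibility):
        if v <= 0:
            continue  # invisible keypoints do not affect the score
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * k ** 2))
        visible += 1
    return total / visible if visible else 0.0


if __name__ == "__main__":
    gt = [(0.0, 0.0), (0.0, 0.0)]
    pred = [(0.0, 0.0), (1.0, 1.0)]  # second keypoint is off by (1, 1)
    print(round(oks(pred, gt, [1, 1], area=100.0, k=0.1), 4))
```

Since the sum is divided by the number of visible keypoints, an instance with only two visible keypoints is scored on the same 0 to 1 range as a fully visible one.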
We compare the proposed method with SOTA bottom-up approaches on the HumanoidRobotPose dataset. These approaches are OpenPose, Associative Embedding (AE), PifPaf, and HigherHRNet. OpenPose utilizes confidence maps to localize the keypoints and PAFs to encode the location and orientation of the body parts. To group the detected keypoints, it proposes a greedy algorithm in which each candidate connection is scored by computing the line integral over the corresponding PAF. Associative Embedding (AE) merges the stacked hourglass architecture with associative embedding. PifPaf proposes a Part Intensity Field to detect and localize the keypoints and a Part Association Field to associate body parts with each other. HigherHRNet uses an adapted top-down model as the backbone, together with a transposed convolution module, to predict higher-resolution heatmaps for keypoint detection. Similar to the AE approach, HigherHRNet employs associative embedding to parse the poses. Following the configurations provided by the original papers, we report the details of the models and the inference time evaluated on the NimbRo-OP2X robot hardware in Table 1.
For all methods, the hyperparameters are tuned to achieve the best possible results. Our model is trained using the AdamW optimizer with learning rate of , batch size 16 and weight decay of for the total epochs. Note that the encoder is initialized by pre-trained ResNet weights on ImageNet. We conduct data augmentation that includes random horizontal flip, random rotation, random scaling, and random translation during training.
The results on the test set are reported in Table 2. The reported results are achieved without the flip test or the multi-scale test in order to keep the methods real-time. When we train the models from scratch on our dataset, our proposed method with a ResNet18 backbone outperforms the best existing methods in all metrics except for the large scale (see Table 2). Note that compared to the other baselines, our model makes better use of our limited dataset. Based on the AP results for medium and large scales, our model handles different scales better than the other approaches. Moreover, the results on the strict metric demonstrate that, owing to the high-resolution predictions, the predicted pose instances are more accurate than those of the other methods. Fig. 5 illustrates some samples of estimated poses for all the approaches.
6 Ablation Study
This section investigates different backbones for the encoder part of our model and the importance of employing multi-scale predictions in our approach. As shown in Table 3, although applying a deeper encoder helps achieve better performance, it negatively affects the inference time of the model. Moreover, AP results demonstrate that without multi-scale heatmaps, the accuracy of predicted keypoints drops.
In this paper, we presented a lightweight bottom-up model for estimating the poses of multiple humanoid robots in real time. We showed that our proposed model is capable of multi-robot pose estimation on the NimbRo-OP2X robot hardware and is more suitable for the RoboCup Humanoid League than other SOTA models. In the future, we will use this model for advanced soccer behavior decisions, such as recognizing the actions of rival robots or anticipating the movement direction of the ball before the kicking motion. Since the developed model is very similar to NimbRo-Net2, we will combine them to produce a unified network for diverse perception tasks in RoboCup.
This work was partially funded by grant BE 2556/16-2 (Research Unit FOR 2535 Anticipating Human Behavior) of the German Research Foundation (DFG).
- 2D human pose estimation: new benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2021) OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2018) Cascaded pyramid network for multi-person pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2020) HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) On field gesture-based robot-to-robot communication with Nao soccer players. In Robot World Cup XXIII.
- (2016) Real-time visual tracking and identification for a team of homogeneous humanoid robots. In Robot World Cup XX.
- (2017) Online visual robot tracking and identification using deep LSTM networks. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
- (2018) NimbRo robots winning RoboCup 2018 humanoid AdultSize soccer competitions. In Robot World Cup XXII.
- NimbRo-OP2X: adult-sized open-source 3D printed humanoid robot. In IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids).
- (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) PifPaf: composite fields for human pose estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- (2020) Simple Pose: rethinking and improving a bottom-up approach for multi-person pose estimation. In AAAI.
- (2017) Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV).
- (2019) Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR).
- (2017) Associative embedding: end-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems.
- (2016) Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision (ECCV).
- (2018) Pose partition networks for multi-person pose estimation. In European Conference on Computer Vision (ECCV).
- (2018) PersonLab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In European Conference on Computer Vision (ECCV).
- (2017) Towards accurate multi-person pose estimation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems.
- (2019) RoboCup 2019 AdultSize winner NimbRo: deep learning perception, in-walk kick, push recovery, and team play capabilities. In Robot World Cup XXIII.
- (2017) Benchmarking and error diagnosis in multi-instance pose estimation. In IEEE International Conference on Computer Vision (ICCV).
- (2020) Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2018) Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision (ECCV).