While pose estimation has traditionally focused on human subjects, there has been increased interest in animal subjects in recent years. It is possible to put markers on certain trained animals, such as dogs, to employ marker-based motion capture techniques. Nevertheless, there are far more practical difficulties associated with this than with human subjects: some animals may find markers distressing, and it is impossible to place them on wild animals. Neural networks currently achieve the best results for human pose estimation, and generally require training on widely available large-scale datasets that provide 2D and/or 3D annotations. However, no datasets of 3D animal data are currently available at a comparable scale in terms of number of samples, variety and annotations, making comparable studies or approaches to pose prediction difficult to achieve.
In this paper, we propose a markerless approach for 3D skeletal pose estimation of canines from RGBD images. To achieve this, we present a canine dataset which includes skinned 3D meshes, as well as synchronised RGBD video and 3D skeletal data acquired from a motion capture system, which acts as ground truth. Dogs are chosen as our capture subject for several reasons: they are familiar with human contact and so generally accept wearing motion capture suits; they can be brought into the motion capture studio with ease; they respond to given directions, producing comparable motions across the numerous subjects; and their diverse body shapes and sizes produce data with interesting variations in shape. We propose that our resulting dog skeleton structure is more anatomically correct than that of the SMAL model, and that its larger number of bones allows more expressive poses.
It is challenging to control the capture environment with animals, and covering wide enough variability in a limited capture session proved difficult. Hence our method utilises the dog skeletons and meshes produced by the motion capture system to generate a large synthetic dataset. This dataset is used to train a predictive network and generative model using 3D joint data and the corresponding projected 2D annotations. Using RGB images alone may not be sufficient for pose prediction, as many animals have evolved to blend into their environment, and similarly coloured limbs can result in ambiguities. Depth images, on the other hand, do not rely on texture information and give us the additional advantage of providing surface information for predicting joints. We choose the Microsoft Kinect v2 as our RGBD depth sensor, due to its wide availability and the established area of research associated with the device. Images were rendered from our synthetically generated 3D dog meshes using the Kinect sensor model of Li et al. to provide images with realistic Kinect noise as training data for the network.
Details of the dataset generation process are provided in Section 3.2. Despite training the network with purely synthetic images, we achieve high accuracy when tested on real depth images, as discussed in Section 4.1. In addition, as discussed in Section 4.3, we found that training the network only on dogs still allows it to produce plausible results for similarly rendered quadrupeds such as horses and lions.
The joint locations predicted by deep networks may contain errors. In particular, they do not guarantee that the estimated bone lengths remain constant throughout a sequence of images of the same animal, and may also generate physically impossible poses. To address these limitations, we adopt a prior on the joint pose configurations: a Hierarchical Gaussian Process Latent Variable Model (H-GPLVM). This allows the representation of high-dimensional non-linear data in lower dimensions, while simultaneously exploiting the skeleton structure in our data. In summary, our main contributions are:
Prediction of 3D shape as PCA model parameters, 3D joint locations and estimation of a kinematic skeleton of canines using RGBD input data.
Combination of a stacked hourglass CNN architecture for initial joint estimation and a H-GPLVM to resolve pose ambiguities, refine fitting and convert joint positions to a kinematic skeleton.
A novel dataset of RGB and RGBD canine data with skeletal ground truth estimated from a synchronised 3D motion capture system, and a shape model containing information from both real and synthetic dogs. The dataset and model are available at https://github.com/CAMERA-Bath/RGBD-Dog.
2 Related work
2D Animal Pose Estimation. Animal and insect 2D pose and position data is useful in a range of behavioural studies. Most solutions to date use shallow neural network architectures trained on a few image examples of the animal or insect of interest to build a keyframe-based feature tracker, e.g. LEAP Estimates Animal Pose, DeepLabCut and DeepPoseKit. Cao et al. address the issue of wide variation in interspecies appearance by presenting a method for cross-domain adaptation when predicting the pose of unseen species. By creating a training dataset that combines a large dataset of human pose (MPII Human Pose), the bounding box annotations for animals in Microsoft COCO, and the authors' animal pose dataset, the method achieves good pose estimation for unseen animals.
3D Animal Pose Estimation. Zuffi et al.  introduce the Skinned Multi-Animal Linear model (SMAL), which separates animal appearance into PCA shape and pose-dependent shape parameters (e.g. bulging muscles), created from a dataset of scanned toy animals. A regression matrix calculates joint locations for a given mesh. SMAL with Refinement (SMALR)  extends the SMAL model to extract fur texture and achieves a more accurate shape of the animal. In both methods, silhouettes are manually created when necessary, and manually selected keypoints guide the fitting of the model. In SMAL with learned Shape and Texture (SMALST)  a neural network automatically regresses the shape parameters, along with the pose and texture of a particular breed of zebra from RGB images, removing the requirement of silhouettes and keypoints.
Biggs et al. fit the SMAL model to sequences of silhouettes that have been automatically extracted from video using Deeplab. A CNN is trained to predict 2D joint locations, with the training set generated using the SMAL model. Quadratic programming and genetic algorithms choose the best 2D joint positions. SMAL is then fit to the joints and silhouettes.
In training our neural network, we also generate synthetic RGBD data, but from a large basis of motion capture data recorded from the real motion of dogs, as opposed to the SMAL model and its variants, in which pose is based on toy animals and a human-created walk cycle.
Pose Estimation with Synthetic Training Data. In predicting pose from RGB images, it is generally found that training networks with a combination of real and synthetic images provides more accurate predictions than training with either alone. Previous work with depth images has also shown that synthetic training data alone provides accurate results when tested on real images. Random forests have been used frequently for pose estimation from depth images, including labelling pixels with human body parts, mouse body parts, and dense correspondences to the surface mesh of a human model. Sharp et al. robustly track a hand in real-time using the Kinect v2.
Recently, neural networks have also been used for pose estimation from depth images. Huang & Altamar generate a dataset of synthetic depth images of human body pose and use this to predict the pose of the top half of the body. Mueller et al. combine two CNNs to locate and predict hand pose. A kinematic model is fit to the 3D joints to ensure that joint rotations are temporally smooth and bone lengths remain consistent across the footage.
In our work, we use motion capture data from a selection of dogs to generate a dataset of synthetic depth images. This dataset is used to train a stacked hourglass network, which predicts joint locations in 3D space. Given the joints predicted by the network, a PCA model can be used to predict the shape of an unknown dog, and a H-GPLVM is used to constrain the joint locations to those which are physically plausible. We believe ours is the first method to train a neural network to predict 3D animal shape and pose from RGBD images, and to compare pipeline results to 3D ground truth, which is difficult to obtain for animals and has therefore remained unexplored.
Our pipeline consists of two stages: a prediction stage and a refinement stage. In the prediction stage, a stacked hourglass network by Newell et al. predicts a set of 2D heatmaps for a given depth image, from which 3D joint positions are reconstructed. To train the network, skeleton motion data was recorded from five dogs performing the same five actions using a Vicon optical motion capture system (Section 3.1). These skeletons pose a mesh of the respective dog, which is then rendered as RGBD images by a Kinect noise model to generate a large synthetic training dataset (Section 3.2). We provide more detail about the network training data and explain 3D joint reconstruction from heatmaps in Section 3.3. In the refinement stage, a H-GPLVM trained on skeleton joint rotations is used to constrain the predicted 3D joint positions (Section 3.4). The resulting skeleton can animate a mesh, provided by the user or generated from a shape model, which can then be aligned to the depth image points to further refine the global transformation of the root of the skeleton. We compare our results with the method of Biggs et al. and evaluate our method against ground truth joint positions in synthetic and real images in Section 4. Figures 2 and 3 outline the prediction and refinement stages of our approach respectively.
3.1 Animal Motion Data Collection
As no 3D dog motion data is available for research, we first needed to collect a dataset. A local rescue centre provided 16 dogs for recording. We focused on five dogs that covered a wide range of shapes and sizes. The same five actions were chosen for each dog for the training/validation set, with an additional arbitrary sequence chosen for testing. In addition to these five dogs, two further dogs were used to evaluate the pipeline and were not included in the training set. These dogs are shown in Figure 4.
A Vicon system with 20 infrared cameras was used to record the markers on the dogs’ bespoke capture suits. Vicon recorded the markers at 119.88 fps, with the skeleton data exported at 59.94 fps. Up to 6 Kinect v2s were also simultaneously recording, with the data extracted using the libfreenect2 library . Although the Kinects recorded at 30fps, the use of multiple devices at once reduced overall frame rate to 6fps in our ground truth set. However, this does not affect the performance of our prediction network. Further details on recording can be found in the supplementary material (Sec. 2.1).
3.2 Synthetic RGBD Data Generation
Our template dog skeleton is based on anatomical skeletons. Unlike humans, the shoulders of dogs are not constrained by a clavicle and so have translational as well as rotational freedom. The ears are modelled with rigid bones and are also given translational freedom, allowing them to move with respect to the base of the skull. In total, there are 43 joints in the skeleton, with 95 degrees of freedom. The neutral mesh of each dog was created by an artist, using a photogrammetric reconstruction as a guide. Linear blend skinning is used to skin the mesh to the corresponding skeleton, with the weights also created by an artist.
To create realistic Kinect images from our skinned 3D skeletons, we follow a process similar to InteriorNet. Given a 3D mesh of a dog within a virtual environment, we model the unique infrared dot pattern projected onto the object, and recover dense depth using stereo reconstruction. This process is presumed to retain most of the characteristics of the Kinect imaging system, including depth shadowing and occlusion. A comparison of real versus synthetic Kinect images is shown in Figure 5.
Up to 30 synthetic cameras were used to generate the depth images and corresponding binary masks for each dog. Details of the image and joint data normalisation for the generation of ground truth heatmaps are given in the supplementary material. The size of the dataset is doubled by using mirrored versions of these images, giving a total of 650,000 images in the training set and 180,000 images in the validation set. An overview of data generation can be seen in the "Train" section of Figure 2.
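The mirroring augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the depth image is flipped horizontally, joint x-coordinates are reflected, and symmetric joints are relabelled. The `SWAP_PAIRS` indices are hypothetical (the real skeleton has 43 joints whose left/right pairing is not given here).

```python
import numpy as np

# Hypothetical left/right joint index pairs for illustration only;
# the real skeleton has 43 joints with its own pairing.
SWAP_PAIRS = [(3, 7), (4, 8), (5, 9)]

def mirror_sample(depth, joints_2d):
    """Mirror a depth image and its 2D joint annotations about the vertical axis."""
    h, w = depth.shape
    flipped = depth[:, ::-1].copy()           # horizontal flip of the image
    joints = joints_2d.copy()
    joints[:, 0] = (w - 1) - joints[:, 0]     # reflect x-coordinates
    for l, r in SWAP_PAIRS:                   # relabel symmetric joints
        joints[[l, r]] = joints[[r, l]]
    return flipped, joints
```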
3.3 Skeleton Pose Prediction Network
In order to use the stacked-hourglass framework, we represent joints as 2D heatmaps.
The input to the network is a 256x256 greyscale image, with 3D joints defined in this coordinate space.
Given an input image, the network produces a set of 129 heatmaps, each 64x64 pixels in size. Each of the 43 joints in the dog skeleton is associated with three heatmaps of known indices, representing the xy-, yz- and xz-coordinates of that joint respectively. This set provided the most accurate results in our experiments. To produce the heatmaps required to train the network, the ground-truth 3D joint positions are transformed into 64x64 image coordinates. We generate 2D Gaussians centred at the xy-, yz- and xz-coordinates of each transformed joint, with a standard deviation of one pixel. Inspired by Biggs et al., symmetric joints along the sagittal plane of the animal (i.e. the legs and ears) produce multi-modal heatmaps. Further technical details on heatmap generation may be found in the supplementary material.
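The ground-truth heatmap rendering described above can be sketched as follows; this is a simplified illustration in our own naming, with unit standard deviation as in the text. Passing two centres yields the multi-modal maps used for symmetric joint pairs.

```python
import numpy as np

def gaussian_heatmap(centres, size=64, sigma=1.0):
    """Render a heatmap with a 2D Gaussian (sigma = 1 px) at each centre.

    One centre gives a unimodal map; two centres (e.g. a left/right joint
    pair) give the multi-modal map used for symmetric joints.
    """
    ys, xs = np.mgrid[0:size, 0:size]
    hm = np.zeros((size, size))
    for cx, cy in centres:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        hm = np.maximum(hm, g)   # keep the strongest mode at each pixel
    return hm
```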
Our neural network is a 2-stacked hourglass network by Newell et al. This particular architecture was chosen because its successive stages of down-sampling and up-scaling combine features at various scales. By observing the image at global and local scales, the global rotation of the subject can be more easily determined, and the relationships between joints can be used to produce more accurate predictions. We implement our network using PyTorch, based on code provided by Yang.
3.3.1 3D Pose Regression from 2D Joint Locations
Given the network-generated heatmaps, we determine the location of each joint along the x-, y- and z-axes in 64x64 image coordinates. Each joint is associated with three heatmaps, corresponding to its xy-, yz- and xz-coordinates. For joints that produce unimodal heatmaps, the heatmap with the highest peak among the three determines two of the three coordinates, with the remaining coordinate taken from the map with the second-highest peak.
For joints with multi-modal heatmaps, we repeat this step referring first to the highest peak in the three heatmaps, and then to the second-highest peak. This process results in two potential joint locations for each pair of symmetric joints. If the predicted xy-position of one joint in the pair is within a threshold of the xy-position of the other, we assume that the network has erroneously predicted the same position for both joints. In this case, the joint with the higher confidence retains this coordinate, and the remaining joint is assigned its next most likely location.
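The per-joint selection rule can be sketched for the simple unimodal case as follows. This is our own reconstruction of the rule as described (function names are ours), without the quarter-pixel offset or multi-modal handling: the two strongest maps together supply all three coordinates.

```python
import numpy as np

def decode_joint(h_xy, h_yz, h_xz):
    """Recover (x, y, z) for one joint from its three unimodal 64x64 heatmaps.

    The map with the strongest peak fixes two coordinates; the remaining
    coordinate comes from the second-strongest map, as described in the text.
    """
    maps = {'xy': h_xy, 'yz': h_yz, 'xz': h_xz}
    peaks = {k: np.unravel_index(np.argmax(m), m.shape) for k, m in maps.items()}
    order = sorted(maps, key=lambda k: maps[k].max(), reverse=True)
    coords = {}
    for name in order[:2]:
        row, col = peaks[name]          # e.g. for the xy-map: col = x, row = y
        first, second = name[0], name[1]
        coords.setdefault(first, col)   # keep values from the stronger map
        coords.setdefault(second, row)
    return coords['x'], coords['y'], coords['z']
```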
Once the 64x64 coordinates have been determined, they are transformed into the 256x256 input space. Prior to this step, as in Newell et al., a quarter-pixel offset is applied to the predictions: within a 4-pixel neighbourhood of each predicted joint, we find the neighbour with the highest value, and this location dictates the direction of the offset. The authors note that this offset increases joint prediction precision. The image scale and translation applied when preparing the network input are then inverted to recover the joint projections in the full-size image. To calculate the depth in camera space for each joint, the image and joint data normalisation process is inverted and applied. Finally, the joints are transformed into camera space using the intrinsic parameters of the camera and the depth of each predicted joint.
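The final lift into camera space is the standard pinhole back-projection; a minimal sketch (our own function, with generic intrinsics fx, fy, cx, cy):

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a pixel (u, v) with metric depth into camera space (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])
```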
3.4 Pose Prior Model
While some previous pose models represent skeleton rotations using a PCA model, such as the work by Safonova et al., we found that this type of model produces poses that are not physically possible for the dog. In contrast, a Gaussian Process Latent Variable Model (GPLVM) can model non-linear data and allows us to represent our high-dimensional skeleton on a low-dimensional manifold. A Hierarchical GPLVM (H-GPLVM) exploits the relationship between different parts of the skeleton. Ear motion is excluded from the model: as ears are made of soft tissue, they are mostly influenced by the velocity of the dog rather than by the pose of other body parts. This reduces the skeleton from 95 to 83 degrees of freedom. Bone rotations are represented as unit quaternions, and the translation of the shoulders is defined with respect to their rest position. Mirrored poses are also included in the model. The supplementary material contains further technical specifications for our hierarchy (Sec. 2.3).
We remove frames containing similar poses to reduce the number of frames in the training set. The similarity of two quaternions is calculated using the dot product, and we sum the results over all bones in the skeleton to give the final similarity value. Given a candidate pose, we calculate its similarity to all poses currently in the training set; if the minimum value over all comparisons is above a threshold, the candidate pose is added. Setting the similarity threshold to 0.1 reduces the number of frames in a sequence by approximately 50-66%. The data matrix is constructed from the retained poses and normalised. Back constraints are used when optimising the model, meaning that similar poses are located close to each other on the manifold.
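The pruning step can be sketched as a greedy filter. This is one plausible reading of the dot-product similarity test described above (we treat 1 − |dot| per bone, summed over bones, as the distance; the paper's exact formulation may differ):

```python
import numpy as np

def pose_distance(pose_a, pose_b):
    """Summed per-bone quaternion difference; poses are (n_bones, 4) arrays
    of unit quaternions. Assumes 1 - |dot| per bone as the per-bone term."""
    dots = np.abs(np.sum(pose_a * pose_b, axis=1))
    return float(np.sum(1.0 - dots))

def prune_poses(poses, threshold=0.1):
    """Greedily keep a pose only if it differs from every pose kept so far."""
    kept = []
    for pose in poses:
        if all(pose_distance(pose, k) > threshold for k in kept):
            kept.append(pose)
    return kept
```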
3.4.1 Fitting the H-GPLVM to Predicted Joints
A weight is associated with each joint predicted by the network to help guide the fitting of the H-GPLVM.
Information about these weights is given in the supplementary material. To find the initial coordinate in the root node of the H-GPLVM, we use k-means clustering to sample 50 candidate coordinates. Keeping the root translation fixed, we find the root rotation which minimises the Euclidean distance between the network-predicted joints and the model-generated joints. The pose and rotation with the smallest error are chosen as the initial values for the next optimisation step.
The H-GPLVM coordinate and root rotation are then refined. In this stage, a joint projection error is included, as this was found to help with pose estimation when the network gave a plausible 2D prediction but a noisy 3D prediction. The vector generated by the root node of the model provides the initial coordinates of the nodes further along the tree. All leaf nodes of the model, the root rotation and the root translation are then optimised simultaneously.
During the fitting process, we seek to minimise the distance between the joint locations predicted by the network and those generated by the H-GPLVM. The corresponding loss function is:

E(X, R, T, s) = \sum_{i=1}^{n} w_i \left( \lVert Y_i - f(X, R, T, s)_i \rVert + \lambda \lVert \pi(Y_i) - \pi(f(X, R, T, s)_i) \rVert \right)

Here, n is the number of joints in the skeleton, Y is the set of predicted joint locations from the network, w is the set of weights associated with each joint, \pi is the perspective projection function and \lambda is the influence of 2D information when fitting the model. X is the set of latent coordinates for the given node(s) of the H-GPLVM and f is the function that takes X, root rotation R, root translation T and shoulder translations s, and produces a set of 3D joints. Figure 3 shows the result of this process.
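A numeric sketch of such a weighted 3D-plus-reprojection loss follows. The function name, the generic `project` callback and the exact combination of terms are our own illustration of the described objective, not the authors' implementation:

```python
import numpy as np

def fitting_loss(model_joints, target_joints, weights, project, lam):
    """Per-joint weighted 3D distance plus a lambda-weighted 2D reprojection
    term, illustrating the objective described in the text."""
    d3 = np.linalg.norm(model_joints - target_joints, axis=1)
    d2 = np.linalg.norm(project(model_joints) - project(target_joints), axis=1)
    return float(np.sum(weights * (d3 + lam * d2)))
```

In practice this would be minimised over the H-GPLVM latent coordinates and the root transform, e.g. with a generic optimiser such as scipy.optimize.minimize.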
4 Evaluation and Results
To evaluate our approach, we predict canine shape and pose from RGBD data on a set of five test sequences, one for each dog. Each sequence was chosen so that the global orientation of the dogs covers a wide range, including both side views and foreshortened views, with actions consisting of a general walking/exploring motion. In each case we predict shape and pose and compare these predictions to ground truth skeletons obtained from the motion capture system (see Section 3.1). More detailed analysis of the experiments, further technical details of the experimental setup, and video results may be found in the supplementary material.
As no previous method automatically extracts a dog skeleton from depth images, we compare our results with Biggs et al., which we will refer to as the BADJA result. We note that the authors' method requires only silhouette data, and it is therefore expected that our method produces more accurate results.
Both algorithms are tested on noise-free images. We use two metrics to measure the accuracy of our system: Mean Per Joint Position Error (MPJPE) and Probability of Correct Keypoint (PCK). MPJPE measures Euclidean distance and is calculated after the roots of the two skeletons are aligned. A variant, PA MPJPE, uses Procrustes Analysis to align the predicted skeleton with the ground-truth skeleton. PCK measures whether a predicted joint lies within a threshold distance of the true value. The threshold is α√A, where A is the area of the image with non-zero pixel values and α = 0.05. The values range from [0,1], where 1 means that all joints are within the threshold. PCK can also be used for 3D prediction, where the threshold is conventionally set to half the width of the person's head. As we can only determine the length of the head bone, we set the threshold to one and scale each skeleton such that the head bone has a length of two units. To compare the values of MPJPE and PCK 3D, we also use PA PCK 3D, where the joints are aligned as in PA MPJPE before calculating PCK 3D. Due to the frequent occlusion of the dogs' limbs, errors are reported in the following groups: All -- all joints in the skeleton; Head -- the joints contained in the neck and head; Body -- the joints contained in the spine and four legs; Tail -- the joints in the tail. Figure 6 shows the configuration of the two skeletons used and the joints that belong to each group. Our pipeline for each dog contains a separate neural network, H-GPLVM and shape model, such that no data from that particular dog is seen by its corresponding models prior to testing.
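The two 3D metrics can be sketched as follows. This is a minimal illustration in our own naming; the convention of treating joint 0 as the root for alignment is an assumption.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint Euclidean distance after root alignment.

    Assumes joint 0 is the root; both arrays are (n_joints, 3).
    """
    pred = pred - pred[0]   # translate both skeletons so roots coincide
    gt = gt - gt[0]
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

def pck(pred, gt, threshold):
    """Fraction of joints lying within `threshold` of the ground truth."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1) < threshold))
```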
Table 1 contains the PA MPJPE and PA PCK 3D results for the comparison. Compared with the MPJPE and PCK 3D results, for our method, PA MPJPE decreases the error by an average of 0.416 and PA PCK 3D increases by 0.233. For BADJA, PA MPJPE decreases the error by an average of 1.557 and PA PCK 3D increases by 0.523, showing the difficulty of determining the root rotation from silhouettes alone, as is the case with BADJA.
4.1 Applying the Pipeline to Real Kinect Footage
Running the network on real-world data involves the additional step of generating a mask of the dog from the input image. We generate the mask from the RGB image for two reasons: (1) RGB segmentation networks pre-trained to detect animals are readily available; (2) the RGB image has a higher resolution than the depth image and contains less noise, particularly when separating the dogs' feet from the ground plane. The mask is therefore generated from the RGB image before being transformed into depth-image coordinates using a homography matrix. A combination of two pretrained networks is used to generate the mask: Mask R-CNN and Deeplab. More details are included in the supplementary material. We display 3D results in Table 2, for cases where the neutral shape of the dog is unknown and known. Examples of skeletons are shown in Figure 8.
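The homography transfer of the mask into depth-image coordinates can be sketched with a NumPy-only nearest-neighbour warp (a minimal stand-in for a library routine such as OpenCV's warpPerspective; the 3x3 matrix H would come from the RGB-to-depth calibration):

```python
import numpy as np

def warp_mask(mask, H, out_shape):
    """Warp a binary mask through a 3x3 homography H (source -> target).

    Inverse-maps each output pixel through H^-1 and samples the source mask
    with nearest-neighbour interpolation.
    """
    Hi = np.linalg.inv(H)
    out = np.zeros(out_shape, dtype=mask.dtype)
    ys, xs = np.mgrid[0:out_shape[0], 0:out_shape[1]]
    ones = np.ones_like(xs)
    pts = np.stack([xs, ys, ones], axis=-1).reshape(-1, 3) @ Hi.T
    sx = np.round(pts[:, 0] / pts[:, 2]).astype(int).reshape(out_shape)
    sy = np.round(pts[:, 1] / pts[:, 2]).astype(int).reshape(out_shape)
    valid = (sx >= 0) & (sx < mask.shape[1]) & (sy >= 0) & (sy < mask.shape[0])
    out[valid] = mask[sy[valid], sx[valid]]
    return out
```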
4.2 Shape Estimation of Unknown Dogs
If the skeleton and neutral mesh for the current dog are unknown beforehand -- as is the case in all our results apart from the 'known shape' result in Table 2 -- a shape model is used to predict this information. The model is built from 18 dogs: the five dogs used to train the CNN, whose meshes were created by an artist; an additional six dogs also created by the artist; three scans of detailed toy animals; and four purchased photogrammetry scans. All dogs are given a common pose, and all meshes share a common topology. The PCA model is built from the meshes, bone lengths and the joint rotations required to pose the dog from the common pose into its neutral standing position. The first four principal components of the model are used to find the dog with bone proportions that best match the recorded dog. This produces the estimated neutral mesh and skeleton of the dog.
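Matching bone proportions with the first few principal components can be sketched as a least-squares fit. This is a simplified illustration under our own assumptions: the real model spans mesh vertices and joint rotations as well as bone lengths, and the matching procedure may differ.

```python
import numpy as np

def fit_shape_coeffs(mean_bones, components, observed_bones, n_coeffs=4):
    """Least-squares fit of the first `n_coeffs` PCA coefficients so that the
    model's bone lengths best match those of an observed dog.

    `components` is (n_components, n_bones), rows being principal directions.
    """
    B = components[:n_coeffs].T                                # (n_bones, n_coeffs)
    coeffs, *_ = np.linalg.lstsq(B, observed_bones - mean_bones, rcond=None)
    return coeffs
```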
4.3 Extending to Other Quadruped Species
We tested our network on additional 3D models of other species provided by Bronstein et al. (, ). Images of the models are rendered as described in Section 3.2. The training data for the network consists of the same five motions for the five training dogs. As no ground truth skeleton information is provided for the 3D models, we evaluate the performance based on visual inspection. The example results provided in the first three columns of Figure 9 show that the network performs well when the pose of a given animal is similar to that seen in the training set, even if the subject is not a dog. However, when the pose of the animal is very different from the range of poses in the training set, prediction degrades, as seen in the last three columns of Figure 9. This provides motivation for further work.
5 Conclusion and Future Work
We have presented a system which can predict the 3D shape and pose of a dog from depth images. We also present to the community a dataset of dog motion from multiple modalities -- motion capture, RGBD and RGB cameras -- covering dogs of varying shapes and breeds. Our prediction network was trained using synthetically generated depth images leveraging this data, and is demonstrated to work well for 3D skeletal pose prediction given real Kinect input. We evaluated our results against 3D ground truth joint positions, demonstrating the effectiveness of our approach. Figure 9 shows the potential for extending the pipeline to other species of animals; we expect that a more pose-diverse training set would improve on the failure cases shown there. Apart from the option to estimate bone lengths over multiple frames, our pipeline does not include temporal constraints, which would lead to more accurate and smoother predicted motion sequences. At present, mask generation requires an additional pre-processing step and is based on the RGB channel of the Kinect. Instead, the pose-prediction network could extract the dog from the depth image itself, which may produce more robust masks, as extraction would no longer rely on texture information. As Generative Adversarial Networks (GANs) now produce state-of-the-art results, we intend to update our network to directly regress joint rotations and combine this with a GAN to constrain the pose prediction.
Acknowledgement. This work was supported by the Centre for the Analysis of Motion, Entertainment Research and Applications (EP/M023281/1), the EPSRC Centre for Doctoral Training in Digital Entertainment (EP/L016540/1) and the Settlement Research Fund (1.190058.01) of the Ulsan National Institute of Science & Technology.
-  (2014-06) 2D human pose estimation: new benchmark and state of the art analysis. In , Cited by: §2.
-  (2018) Creatures great and smal: recovering the shape and motion of animals from video. In Asian Conference on Computer Vision, pp. 3–19. Cited by: §1, §2, §3.3, §3, Figure 6, Figure 7, Table 1, §4.
-  (2018) Creatures great and smal: recovering the shape and motion of animals from video. In Asian Conference on Computer Vision, pp. 3–19. Cited by: §3.1.
libfreenect2: open-source library for Kinect v2 depth camera, release 0.1.1. External Links: Cited by: §3.1.
-  (2006) Efficient computation of isometry-invariant distances between surfaces. SIAM Journal on Scientific Computing 28 (5), pp. 1812–1836. Cited by: Figure 9, §4.3.
-  (2007) Calculus of nonrigid surfaces for geometry and texture manipulation. IEEE Transactions on Visualization and Computer Graphics 13 (5), pp. 902–913. Cited by: Figure 9, §4.3.
-  (2019) Cross-domain adaptation for animal pose estimation. External Links: Cited by: §1, §2.
-  Carnegie mellon university motion capture database. Note: http://mocap.cs.cmu.eduAccessed: 2019-08-05 Cited by: §1.
-  (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, Cited by: §2, §4.1.
-  (2016) Synthesizing training images for boosting human 3d pose estimation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 479–488. Cited by: §2.
-  (2019) Do dogs have a collarbone?. Note: https://dogdiscoveries.com/do-dogs-have-a-collarbone/Accessed: 2019-08-07 Cited by: §3.2.
-  (2014) Dog anatomy workbook. Trafalgar Square. Cited by: §3.2.
-  (2019) Fast and robust animal pose estimation. bioRxiv, pp. 620245. Cited by: §2.
-  (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §4.1.
Pose estimation on depth images with convolutional neural network. Cited by: §2.
-  (2013) Human3.6M: large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence 36 (7), pp. 1325–1339. Cited by: §1.
-  (2010) Clustered pose and nonlinear appearance models for human pose estimation. In Proceedings of the British Machine Vision Conference, Note: doi:10.5244/C.24.12 Cited by: §1.
-  (2017) A generative model of people in clothing. In Proceedings of the IEEE International Conference on Computer Vision, pp. 853–862. Cited by: §2.
-  (2007) Hierarchical Gaussian process latent variable models. In Proceedings of the 24th International Conference on Machine Learning, pp. 481–488. Cited by: §3.4, §3.
-  Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems, pp. 329–336. Cited by: §1, §3.4.
-  (2018) InteriorNet: mega-scale multi-sensor photo-realistic indoor scenes dataset. In British Machine Vision Conference (BMVC), Cited by: §1, Figure 5, §3.2.
-  (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: Figure 2, §2.
DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience. External Links: Cited by: §2.
-  (2017) Monocular 3d human pose estimation in the wild using improved cnn supervision. In 2017 International Conference on 3D Vision (3DV), pp. 506–516. Cited by: §4.
-  (2017) Real-time hand tracking under occlusion from an egocentric rgb-d sensor. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1284–1293. Cited by: §2.
-  (2015) Mouse pose estimation from depth images. arXiv preprint arXiv:1511.07611. Cited by: §2.
-  (2019) Using deeplabcut for 3d markerless pose estimation across species and behaviors.. Nature protocols. Cited by: §2.
-  (2016) Stacked hourglass networks for human pose estimation. In European conference on computer vision, pp. 483–499. Cited by: §3.3.1, §3.3, §3.
-  (2019) Fast animal pose estimation using deep neural networks. Nature methods 16 (1), pp. 117. Cited by: §2.
-  (2016) Mocap-guided data augmentation for 3d pose estimation in the wild. In Advances in neural information processing systems, pp. 3108–3116. Cited by: §2.
-  (2004) Synthesizing physically realistic human motion in low-dimensional, behavior-specific spaces. In ACM Transactions on Graphics (ToG), Vol. 23, pp. 514–521. Cited by: §3.4.
-  (2015) Accurate, robust, and flexible real-time hand tracking. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 3633–3642. Cited by: §2.
-  (2013) Real-time human pose recognition in parts from single depth images. Communications of the ACM 56 (1), pp. 116–124. Cited by: §2.
-  (2010) Humaneva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International journal of computer vision 87 (1-2), pp. 4. Cited by: §1.
-  (2012) The vitruvian manifold: inferring dense correspondences for one-shot human pose estimation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 103–110. Cited by: §2.
-  (2017) Learning from synthetic humans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 109–117. Cited by: §2.
-  (2019) A pytorch toolkit for 2d human pose estimation.. Note: https://github.com/bearpaw/pytorch-pose Accessed: 2019-08-07 Cited by: §3.3.
-  (2019) Three-d safari: learning to estimate zebra pose, shape, and texture from images "in the wild". External Links: Cited by: §1, §2.
-  (2018) Lions and tigers and bears: capturing non-rigid, 3d, articulated shape from images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3955–3963. Cited by: §1, §2.
-  (2017) 3D menagerie: modeling the 3d shape and pose of animals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6365–6373. Cited by: §2.
In this supplementary material we give additional technical details on our approach. We provide details on our dog dataset, which will be made available to the research community. We conduct additional experiments to test the pipeline's robustness to occlusions, to exploit depth information when solving for the shape of the dog, and to compare the neural network in the pipeline with two other networks. Finally, we compare the expressiveness of our dog shape model with that of SMAL zuffi20173dSUPP.
2.1 Animal Motion Data Collection
Each recorded dog wore a motion capture suit, on which additional texture information was painted. The number of markers on the suit depended on the size of a given dog, and ranged from 63 to 82 markers. We show an example of marker locations in Figure 1. These locations were chosen with reference to the placements used for human subjects and to canine anatomy. Vicon Shogun was used to record the dogs with 20 cameras at 119.88fps; the camera locations are shown in Figure 2.
For each dog, this data is available in the form of 3D marker locations, the solved skeleton, the neutral mesh of the dog and corresponding Linear Blend Skinning weights, multi-view HD RGB footage recorded at 59.97 fps, and multi-view RGB and RGB-D images from the Microsoft Kinect recorded at approximately 6 fps. The HD RGB footage will be available in 4K resolution on request. The number of cameras used per dog varied between eight and ten for the HD RGB cameras and between five and six for the Kinects. A visualisation of this data can be seen in Figure 4. The frame count for each sequence of each dog is given in Table 1.
The number of real Kinect RGB and depth images recorded from all cameras for all five motions of the five dogs is 1,950. The number of 4K RGB images recorded from all cameras for all five motions of the five dogs is 73,748. In total, 8,346 frames of skeleton motion data were recorded using the Vicon Shogun software.
In comparison with other available datasets of skeleton-annotated dog images, Biggs et al. biggs2018creaturesSUPP provide 20 landmarks for 218 frames from video sequences. The size of the images are either 1920x1080 or 1280x720 pixels. Cao et al. cao2019crossdomainSUPP provide 20 landmarks for 1,771 dogs in 1,398 images of various sizes.
Table 1: Average number of frames per camera (Vicon, Kinect) for each sequence of each dog.
2.2 Data Augmentation
Our synthetic dataset is generated by applying the processed skeleton motion to the neutral mesh using linear blend skinning. The same 20 virtual cameras were used to generate synthetic images for all five dogs, along with cameras using the extrinsic parameters of the 8 to 10 Sony RGB cameras used to record each dog. For each motion, two sets of images were generated: in the first set, the root of the skeleton contains the rotation and translation of the dog in the scene; in the second set, the root has fixed rotation and translation. Another version of the two sets was created by mirroring the images, giving our final synthetic dataset approximately 834,650 frames.
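The mirroring step can be sketched as follows. Flipping an image horizontally also requires swapping the identities of left/right joint pairs, otherwise the annotations would describe a dog with crossed limbs; the pair indices below are illustrative and not our actual 42-joint layout:

```python
import numpy as np

# Hypothetical left/right joint pairs (indices into the joint array);
# a reduced 7-joint example, not the full dog skeleton.
SWAP_PAIRS = [(1, 4), (2, 5), (3, 6)]

def mirror_sample(depth, joints_2d):
    """Horizontally flip a depth image and its 2D joint annotations."""
    h, w = depth.shape
    flipped = depth[:, ::-1].copy()            # mirror the image
    mirrored = joints_2d.copy()
    mirrored[:, 0] = (w - 1) - mirrored[:, 0]  # reflect x-coordinates
    for l, r in SWAP_PAIRS:                    # swap left/right identities
        mirrored[[l, r]] = mirrored[[r, l]]
    return flipped, mirrored
```

Applied to both the free-root and fixed-root image sets, this doubles the dataset size, consistent with the frame count above.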
2.3 Network Architecture
We use the network of Newell et al. newell2016stackedSUPP based on the implementation provided by Yang cnnGitSUPP. We provide a diagram of the network in Figure 5 and direct the reader to the paper by Newell et al. newell2016stackedSUPP for full details of the network components.
2.4 Data Normalisation for the Generation of Training Heatmaps
The 3D joint locations of the skeletons are defined in camera space, and the 2D joint locations are their projections into the synthetic image. Only images where all joints lie within the image bounds were included in the dataset. The images are shaped to fit the network inputs by following the steps outlined in Algorithm 1, producing images that are 256x256 pixels in size.
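Algorithm 1 itself is given as a figure; the following is a minimal numpy sketch of the kind of crop-and-resize it describes (mask bounding box, square padding, nearest-neighbour resize). The padding factor and function names are our own assumptions:

```python
import numpy as np

def crop_to_network_input(image, mask, out_size=256, pad=0.1):
    """Crop to the mask's (padded, squared) bounding box and resize to
    out_size x out_size, returning the parameters needed to map
    predictions back into the original full-size image."""
    ys, xs = np.nonzero(mask)
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    side = max(int(max(x1 - x0, y1 - y0) * (1 + pad)), 1)   # square crop
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
    x0, y0 = cx - side // 2, cy - side // 2                 # crop origin
    crop = np.zeros((side, side), dtype=image.dtype)
    sx0, sy0 = max(x0, 0), max(y0, 0)                       # clamp to image
    sx1 = min(x0 + side, image.shape[1])
    sy1 = min(y0 + side, image.shape[0])
    crop[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = image[sy0:sy1, sx0:sx1]
    scale = out_size / side
    # nearest-neighbour resize, avoiding external dependencies
    idx = np.minimum((np.arange(out_size) / scale).astype(int), side - 1)
    return crop[idx][:, idx], scale, (x0, y0)
```

A 2D joint (x, y) then maps into the network input as ((x - x0) * scale, (y - y0) * scale), and the inverse transform returns predictions to the full-size RGBD image.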
The bounding box of the transformed 256x256 image and the bounding box of the original mask are used to calculate the scale and translation required to transform the dog in the 256x256 image back to its position in the original full-size RGBD image. The 2D joint locations are also transformed using Algorithm 1 so that they align with the cropped image. Finally, the z-component of each camera-space joint is attached to its transformed 2D location, giving the 3D training targets. The x- and y-components of these targets lie in the range [0,255]; we transform the z-component to lie in the same range using Algorithm 2. In Algorithm 2, we make two assumptions:
1. The root joint of the skeleton lies within a distance of 8 metres from the camera, the maximum distance detected by a Kinect v2 kinectApiSUPP.
2. Following Sun et al. sun2018integralSUPP, the remaining joints are defined as offsets from the root joint, normalised to lie within two metres. This allows the algorithm to scale to large animals such as horses.
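Under these two assumptions, the z-normalisation of Algorithm 2 can be sketched as below. The exact mapping of root-relative offsets onto [0, 255] is our assumption; only the 8-metre root range and the two-metre offset range come from the text:

```python
import numpy as np

MAX_ROOT_DEPTH = 8.0   # metres: Kinect v2 maximum range (assumption 1)
OFFSET_RANGE = 2.0     # metres: span of root-relative offsets (assumption 2)

def normalise_z(joints_z, root_index=0):
    """Map joint depths to [0, 255]: the root by its absolute distance,
    the remaining joints as offsets from the root (Sun et al.-style)."""
    z = np.asarray(joints_z, dtype=float)
    out = np.empty_like(z)
    root = z[root_index]
    out[root_index] = np.clip(root / MAX_ROOT_DEPTH, 0, 1) * 255
    offsets = z - root                       # root-relative depths
    # assumed mapping: [-1 m, +1 m] -> [0, 255]
    off_norm = np.clip(offsets / OFFSET_RANGE + 0.5, 0, 1) * 255
    rest = np.ones(len(z), bool)
    rest[root_index] = False
    out[rest] = off_norm[rest]
    return out
```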
2.5 Pose Prior Model
We use a Hierarchical Gaussian Process Latent Variable Model (H-GPLVM) lawrence2007hierarchicalSUPP to represent high-dimensional skeleton motions lying in a lower-dimensional latent space.
Figure 6 shows how the structure of the H-GPLVM relates to the structure of the dog skeleton: the latent variable representing the full body controls the tail, leg, spine and head variables, while the legs node is further decomposed into individual limbs. Equation 1 shows the corresponding joint distribution:

$$p(Y_1, \dots, Y_7) = \int \prod_{i=1}^{7} p(Y_i \mid X_i)\; p(X_2, \dots, X_5 \mid X_8)\; p(X_1, X_6, X_7, X_8 \mid X_9)\, \mathrm{d}X \qquad (1)$$

where $Y_1$ to $Y_7$ are the rotations (and translations, if applicable) of the joints in the tail, back left leg, front left leg, back right leg, front right leg, spine and head respectively, $X_1$ to $X_7$ are the nodes in the model for each respective body part, $X_8$ is the node for all four legs, and $X_9$ is the root node.

Let $Y$ be the motion data matrix of $N$ frames and dimension $D$, $Y = [Y_1, \dots, Y_7]$, containing the data of $Y_1$ to $Y_7$. Let $k(\cdot, \cdot)$ be the radial basis function kernel that depends on the low-dimensional latent variables $X_i$ corresponding to $Y_i$, let $[s_i, e_i]$ be the start and end indices of the columns of $Y$ that contain the data for $Y_i$, and let $\mathcal{N}$ denote the normal distribution. Then,

$$p(Y_i \mid X_i) = \prod_{d=s_i}^{e_i} \mathcal{N}\big(Y_{(d)} \mid 0,\; k(X_i, X_i)\big)$$

where $Y_{(d)}$ denotes the $d$-th column of $Y$.
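For reference, the likelihood at a single node of such a hierarchy is the standard GPLVM likelihood, which can be sketched in numpy (RBF kernel with additive noise; columns of the data matrix treated as independent GP draws; all names are illustrative):

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0, variance=1.0, noise=1e-2):
    """RBF (squared-exponential) kernel matrix with additive noise."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise sq. distances
    return variance * np.exp(-0.5 * d2 / lengthscale**2) + noise * np.eye(len(X))

def gplvm_log_likelihood(Y, X):
    """log p(Y | X): each column of Y is an independent zero-mean GP over X."""
    N, D = Y.shape
    K = rbf_kernel(X)
    _, logdet = np.linalg.slogdet(K)
    Kinv = np.linalg.inv(K)
    return -0.5 * (D * (N * np.log(2 * np.pi) + logdet)
                   + np.trace(Y.T @ Kinv @ Y))
```

In the hierarchical model, one such term is evaluated per body-part node, with the latent variables of child nodes tied to their parents.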
2.5.1 Joint-specific Weights When Fitting the Model
When fitting the H-GPLVM to the network-predicted joints, each joint has an associated weight that guides the fitting. This weight is the elementwise multiplication of two sets of weights: a user-defined set, and a per-frame confidence set derived from bone lengths. The user-defined weights are inspired by the weights used by the Vicon software. Specifically, these are [5,5,5,0.8,0.5,0.8,1,1,1,0.8,0.5,0.8,1,1,1, 0.8,0.5,0.5,0.8,1,1,0.1,0,0.1,0,0.8,1,1,1,1,0.8,1,1,1,1,1,1,1,1, 1,1,1,1]. This has the effect of giving the root and spine the highest weight (5), while the end of each limb receives a higher weight (1) than the base of the limb (0.8). Each joint in the tail is given equal weight (1). As the ears were not included in the model, a weight of 0 was given to the ear tips and 0.1 to the base of the ears, in order to slightly influence head rotation.
Prior to the fitting stage, the shape and size of the dog skeleton has either been provided by the user or generated by the PCA shape model, so the bone lengths of this reference skeleton can be calculated. For the current frame, we also calculate the bone lengths of the skeleton predicted by the network. The confidence weight for each joint is then the inverse of the deviation of the predicted bone lengths from the reference bone lengths, capped to lie within the range [0,1].
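A sketch of how the two weight sets might be combined is given below. The use of a relative bone-length deviation and a one-bone-per-joint mapping are our assumptions; only the elementwise product and the inverse-deviation-capped-to-[0,1] rule come from the text:

```python
import numpy as np

def joint_fit_weights(user_w, ref_lengths, pred_lengths, eps=1e-8):
    """Combine user-defined joint weights with per-frame confidence
    weights derived from bone-length deviation (elementwise product).

    Assumes one bone per joint so the two arrays align."""
    ref = np.asarray(ref_lengths, float)
    pred = np.asarray(pred_lengths, float)
    deviation = np.abs(pred - ref) / (ref + eps)       # assumed: relative deviation
    conf = np.clip(1.0 / (deviation + eps), 0.0, 1.0)  # inverse, capped to [0,1]
    return np.asarray(user_w, float) * conf
```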
3 Evaluation and Results
3.1 Ground Truth for BADJA Comparison
In order to compare our results with BADJA biggs2018creaturesSUPP, we need to calculate the ground-truth joint positions of the SMAL skeleton. Using WrapX wrapxSUPP, an off-the-shelf mesh registration software package, the neutral mesh of the SMAL model is registered to the neutral mesh of each of the five dogs. We then represent the vertices of this registered mesh as barycentric coordinates of the dog mesh. Using these barycentric coordinates, given the dog mesh in a pose, we compute the corresponding posed registered mesh. The BADJA joint regressor then produces the ground-truth joint positions from this posed mesh.
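Recovering joint positions from barycentric coordinates, as described above, can be sketched as follows (array names are illustrative):

```python
import numpy as np

def joints_from_barycentric(vertices, faces, face_ids, bary):
    """Recover 3D positions from barycentric coordinates on a posed mesh.

    vertices : (V, 3) posed mesh vertices
    faces    : (F, 3) vertex indices per triangle
    face_ids : (J,)   triangle containing each point of interest
    bary     : (J, 3) barycentric weights per point (rows sum to 1)
    """
    tri = vertices[faces[face_ids]]            # (J, 3, 3): triangle corners
    return np.einsum('jk,jkd->jd', bary, tri)  # weighted corner average
```

Because the coordinates are defined once on the neutral mesh, the same weights transfer the registered mesh to any pose of the dog.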
The renderer of BADJA mirrors the projection of the predicted skeleton. This means that, for the 2D result, the identity of joints on the left side of the predicted skeleton is swapped with their corresponding paired joints on the right. For the 3D comparison, we mirror the predicted skeleton with respect to the camera. Next, we find the scales that, when applied to the SMAL ground-truth skeleton and to the BADJA prediction respectively, give both skeletons a head length of 2 units. We apply these scales, and also apply the ground-truth scale to the ground-truth skeleton in our joint configuration. Finally, our predicted skeleton is scaled to have the same head length as the ground truth.
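The head-length normalisation amounts to a uniform rescaling of each skeleton; in this sketch the joint indices for the head bone are placeholders:

```python
import numpy as np

def scale_to_head_length(joints, head_base=0, head_tip=1, target=2.0):
    """Uniformly scale a (J, 3) skeleton so its head length equals target."""
    head_len = np.linalg.norm(joints[head_tip] - joints[head_base])
    return joints * (target / head_len)
```

Scaling every skeleton to a common head length makes the per-joint error comparable across subjects and models of different absolute sizes.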
3.2 Comparison to BADJA
We include the 2D results when comparing the results of our pipeline with that of Biggs et al. biggs2018creaturesSUPP in Table 2.
3.3 Applying the Pipeline to Real Kinect Footage
Running the network on real-world data involves the additional step of generating a mask of the dog from the input image.
Two pre-trained networks are used to generate the mask: Mask R-CNN he2017maskSUPP and Deeplab deeplabv3plus2018SUPP.
Both were trained on the COCO dataset lin2014microsoftSUPP and implemented in Tensorflow.
During testing, it was found that although Deeplab provided a more accurate mask than Mask R-CNN, it would at times fail to detect any dog in the image, both when the dog is wearing a motion capture suit and when not.
It would also fail to reliably separate the dog from its handler.
In our experiments, Mask R-CNN detected the dog in the vast majority of images, although the edge of the mask was not as accurate as that provided by Deeplab.
Therefore, the image is first processed by Mask R-CNN and the bounding box produced is then used to initialise the input image to Deeplab where it is refined, if possible.
A comparison of the masks is shown in Figure 7. A homography matrix is automatically generated from the Kinect which, when applied to the RGB mask, produces the mask for the depth image.
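The cascade described above might be wired together as in the sketch below; `detect_with_mask_rcnn` and `segment_with_deeplab` stand in for the two pre-trained networks and are hypothetical wrappers, as is the padding margin:

```python
import numpy as np

def dog_mask(rgb, detect_with_mask_rcnn, segment_with_deeplab, margin=20):
    """Cascade: Mask R-CNN locates the dog, DeepLab refines the mask
    inside the (padded) detection box; falls back to the coarse mask."""
    coarse_mask, (x0, y0, x1, y1) = detect_with_mask_rcnn(rgb)  # box in pixels
    h, w = rgb.shape[:2]
    x0, y0 = max(x0 - margin, 0), max(y0 - margin, 0)           # pad the box
    x1, y1 = min(x1 + margin, w), min(y1 + margin, h)
    fine = segment_with_deeplab(rgb[y0:y1, x0:x1])
    if fine is None or not fine.any():       # DeepLab found no dog: fall back
        return coarse_mask
    refined = np.zeros((h, w), dtype=bool)
    refined[y0:y1, x0:x1] = fine             # place refined mask back
    return refined
```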
Table 3 contains the 2D results when applying our pipeline to real Kinect footage.
3.4 Exploiting Depth Information to Solve Shape
In this section, different methods for fitting the shape are described. In all cases, the shape is represented as model parameters of the PCA shape model. The results of each method are displayed in Table 4. Each entry in the "Method" column is described below.
In general, our pipeline's method of solving for shape by referring to bone lengths over a sequence (Original) provided the best results. This has the effect of keeping the shape constant for all frames. We compare this with the accuracy of the pipeline when the shape of the dog is allowed to change on a per-frame basis during the H-GPLVM refinement stage (Method 1).
Additionally, we compare the accuracy when our minimisation function takes into account the distance between the mesh produced by the model-generated skeleton and the reconstructed depth points. When fitting the model-generated skeleton to the network-predicted joints, we have a one-to-one correspondence, as we know the identity of each predicted joint. This is not the case for the vertices on the generated mesh and the reconstructed depth points. Matches are made from the generated mesh to the Kinect points, and vice versa, using Algorithm 3, where the angle threshold is set to 70 degrees, giving two sets of matches. Two tests are performed: the first creates the matches only once when fitting the model (Method 2), and the second repeats the matching stage after minimisation up to 3 times, provided that the error between the two sets of joints reduces by at least 5% (Method 3). Finally, two tests were performed with mutual matches only, i.e. the matches that appear in both sets. This matching is performed only once (Method 4) or repeated up to 3 times, provided that the error between the two sets of joints reduces by at least 5% (Method 5).
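One plausible form of the matching step in Algorithm 3 (nearest neighbours filtered by the 70-degree normal-agreement threshold) is sketched below; the brute-force distance computation and the unit-normal assumption are ours. Mutual matches are then the intersection of `match_points(A, ..., B, ...)` with the reversed output of `match_points(B, ..., A, ...)`:

```python
import numpy as np

def match_points(src, src_normals, dst, dst_normals, max_angle_deg=70.0):
    """Nearest-neighbour matches src -> dst, keeping only pairs whose
    (unit) surface normals agree within max_angle_deg."""
    cos_t = np.cos(np.radians(max_angle_deg))
    d2 = ((src[:, None] - dst[None]) ** 2).sum(-1)   # pairwise sq. distances
    nn = d2.argmin(1)                                # closest dst per src
    agree = (src_normals * dst_normals[nn]).sum(1) >= cos_t
    return [(i, int(j)) for i, (j, ok) in enumerate(zip(nn, agree)) if ok]
```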
3.5 Robustness to Occlusions
The training data for the network is free from occlusions. To test the pipeline's robustness to occlusions, we apply a mask of a randomly located square. This square is 75 pixels per side, which is approximately 30% of the image width/height. As expected, Table 5 shows that the pipeline performs worse on the masked images than on the original images. However, the H-GPLVM is able to reduce the error of the joint locations.
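The occlusion test can be sketched as zeroing out a randomly placed square:

```python
import numpy as np

def occlude(image, size=75, rng=None):
    """Zero out a randomly placed size x size square (synthetic occlusion)."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    y = rng.integers(0, h - size + 1)   # square stays inside the image
    x = rng.integers(0, w - size + 1)
    out = image.copy()
    out[y:y + size, x:x + size] = 0
    return out
```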
3.6 Comparison With Other Networks
We compare the network result of our pipeline, which uses the stacked-hourglass network of Newell et al. newell2016stackedSUPP, with the networks of Sun et al. sun2018integralSUPP and Moon et al. moon2018v2vSUPP.
The networks were given the same training, validation and test data and trained for the same number of epochs.
Tables 6 and 7 show that the network of Newell et al. newell2016stackedSUPP produced more accurate predictions in both 2D and 3D.
|Dog6|Newell et al.|MPJPE|14.754|7.496|10.099|36.559|
||Sun et al.|MPJPE|30.219|37.329|27.602|29.513|
||Moon et al.|MPJPE|16.791|14.148|14.779|26.383|
|Dog7|Newell et al.|MPJPE|8.758|6.461|5.811|20.390|
||Sun et al.|MPJPE|11.904|10.381|7.870|26.412|
||Moon et al.|MPJPE|14.693|10.593|15.479|17.358|
|Dog6|Newell et al.|MPJPE|0.866|0.491|0.776|1.523|
||Sun et al.|MPJPE|1.594|1.561|1.723|1.341|
||Moon et al.|MPJPE|0.896|0.879|0.912|0.867|
|Dog7|Newell et al.|MPJPE|0.563|0.364|0.507|0.939|
||Sun et al.|MPJPE|0.889|0.698|0.810|1.372|
||Moon et al.|MPJPE|0.901|0.667|1.017|0.832|
The method of Moon et al. moon2018v2vSUPP predicts 3D joint positions based on the voxel representation of the depth image. The authors' pipeline first uses the DeepPrior++ network of Oberweger and Lepetit oberweger2017deepprior++SUPP to predict the location of a reference point based on the centre of mass of the voxels. This reference point is used to define the other joints in the skeleton and is more feasible to predict than the root of the skeleton itself. Due to memory and time constraints, the training data for this network contained the synthetic jump sequence of a single dog as seen by 28 cameras.
To test the result of this network, we calculate the mean Euclidean distance from the reference point to the root of the ground-truth skeleton across all frames. We compare this to the distance from the centre of mass of the voxels to the root. First, we test the network on a single camera of a synthetic trot sequence of the training dog. The mean distance for the reference point was 302.64mm and the mean distance for the centre of mass was 302.55mm. Next, we tested the network on two real Kinect sequences, where again using the predicted reference point increased the error relative to the centre-of-mass point by approximately 0.1mm. As a result, the centre of mass was used as the reference point for each image when training the network of Moon et al. moon2018v2vSUPP, rather than the point predicted by DeepPrior++.
3.7 Comparison of Our Shape Model with SMAL
As the skeleton configurations of the two shape models are different, the SMAL model cannot be directly fit to the network-predicted joints. Instead, to compare the models, we fit each model to the neutral dog mesh and skeleton of each dog in the set Dog1-Dog5. For each dog, the average SMAL mesh is registered to the original dog mesh and the corresponding joint locations are calculated using the SMAL joint regressor. A different version of our shape model is created for each test, in which the information for the test dog is removed from the shape model.
We aim to find the shape parameters for each model that produce the mesh most accurately representing each dog. As the scale of the SMAL model differs from that of the dog meshes, the overall scale of both models is also optimised in this process along with the shape parameters. For each model, we report the error as the mean Euclidean distance, in millimetres, between each joint in the skeleton produced by the model and the corresponding ground-truth joint. We report the same error for each vertex in the meshes. These are shown in each row of Table 8. We perform tests where the models fit only joint information (the first row of Table 8), only vertex information (the second row), and both joint and vertex information (the third row).
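If the model is linear in its PCA coefficients, the joint-only fit with a global scale reduces to ordinary least squares, since the substitution g = s * beta makes the problem linear in (s, g). This is a sketch under that assumption, not our full mesh-and-pose optimisation:

```python
import numpy as np

def fit_shape(mean_joints, basis, target):
    """Least-squares fit of a global scale s and PCA coefficients beta
    so that s * (mean + basis @ beta) ~ target (all arrays flattened).

    With g = s * beta the model is linear: target ~ s * mean + basis @ g.
    """
    A = np.column_stack([mean_joints, basis])       # (J*3, 1+K) design matrix
    sol, *_ = np.linalg.lstsq(A, target, rcond=None)
    scale, g = sol[0], sol[1:]
    return scale, g / scale                         # recover beta = g / s
```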
This assumes that the pose of the model and that of the test dog are identical, which may not be the case. As such, we then performed tests in which the pose can change, i.e. we now solve for scale, shape parameters and pose parameters when fitting the model. The steps described above are repeated, and the results are reported in the final three rows of Table 8.
In general, the SMAL model achieved better results when the pose of the dog was fixed, whereas our model achieved better results when the pose was allowed to change. We believe this is because each animal in the SMAL model has a similar neutral pose, whereas the neutral pose in our model is dog-specific.
|Model Fit To|Ours (joints)|Ours (vertices)|SMAL (joints)|SMAL (vertices)|
|joints & mesh|45.190|26.915|37.636|23.242|
|joints & mesh|17.255|10.1585|22.175|14.689|