1. Introduction
There is a rapidly growing need for diverse, high-quality, animation-ready characters and avatars in the areas of games, films, mixed reality and social media. Hand-crafted character “rigs”, where users create an animation “skeleton” and bind it to an input mesh (or “skin”), have been the workhorse of articulated figure animation for over three decades. The skeleton represents the articulation structure of the character, and skeletal joint rotations provide an animator with direct hierarchical control of character pose.
We present a deep-learning-based solution for automatic rig creation from an input 3D character. Our method predicts both a skeleton and skinning that match animator expectations (Figures 1, 10). In contrast to prior work that fits predefined skeletal templates of fixed joint count and topology to input 3D meshes [Baran and Popović, 2007], our method outputs skeletons more tailored to the underlying articulation structure of the input. Unlike pose estimation approaches designed for particular shape classes, such as humans or hands [Shotton et al., 2011; Moon et al., 2018; Huang et al., 2018; Pavlakos et al., 2017; Haque et al., 2016; Xu et al., 2017], our approach is not restricted by shape categorization or fixed skeleton structure. Our network represents a generic model of skeleton and skin prediction capable of rigging diverse characters (Figures 1, 10).
Predicting an animation skeleton and skinning from an arbitrary single static 3D mesh is an ambitious problem. As shown in Figure 2, animators create skeletons whose number of joints and topology vary drastically across characters depending on their underlying articulation structure. Animators also imbue an implicit understanding of creature anatomy into their skeletons. For example, character spines are often created closer to the back rather than the medial surface or centerline, mimicking human and animal anatomy (Figure 2, cat); animators will also likely introduce a proportionate elbow joint into cylindrical arm-like geometry (Figure 2, teddy bear). Similarly, when computing skinning weights, animators often perceive structures as either highly rigid or smoother (Figure 2, snail). An automatic rigging approach should ideally capture this animator intuition about underlying moving parts and deformation. A learning approach is well suited for this task, especially if it is capable of learning from a large and diverse set of rigged models.
While animators largely agree on the skeletal topology and layout of joints for an input character, there is also some ambiguity both in terms of number and exact joint placement (Figure 3). For example, depending on animation intent, a hand may be represented using a single wrist joint or at a finer resolution with a hierarchy of hand joints (Figure 3, top row). Spine and tail-like articulations may be captured using a variable number of joints (Figure 3, bottom row). Thus, another challenge for a rigging method is to allow easy and direct control over the level-of-detail for the output skeleton.
To address the above challenges, we designed a deep modular architecture (Figure 4). The first module is a graph neural network, trained to predict an appropriate number of joints and their placement, to capture the articulated mobility of the input character. As skeletal joint resolution can depend on the intended animation task, we provide users an optional parameter that can control the level-of-detail of the output skeleton (Figure 5). A second module learns to predict a hierarchical tree structure (animation skeletons avoid cycles as a design choice) connecting the joints. The output bone structure is a function of joints predicted from the first stage and shape features of the input character. Subsequently, a third module produces a skinning weight vector per mesh vertex, indicating the degree of influence it receives from different bones. This stage is also based on a graph neural network operating on shape features and intrinsic distances from mesh vertices to the predicted bones.
Our evaluation is threefold: we show that RigNet is better than prior art when quantitatively compared to animator rigs (Tables 1, 2); qualitatively we show our rigs to be expressive and animationready (Figure 1 and accompanying video); and technically, we evaluate the impact of various algorithm choices on our output rigs (Tables 3, 4, 5).
In summary, the contribution of this paper is an automated, end-to-end solution to the fundamentally important and challenging problem of character rigging. Our technical contributions include a neural mesh attention and differentiable clustering scheme to localize joints, a graph neural network for learning mesh representations, and a network that learns connectivity of graph nodes (in our case, skeleton joints). Our approach significantly outperforms purely geometric approaches [Baran and Popović, 2007], and learning-based approaches that provide partial solutions to our problem, i.e., perform only mesh skinning [Liu et al., 2019], or only skeleton prediction for volumetric inputs [Xu et al., 2019].
2. Related Work
In the following paragraphs, we discuss previous approaches for producing animation skeletons, skin deformations of 3D models, and graph neural networks.
Skeletons.
Skeletal structures are fundamental representations in graphics and vision [Marr and Nishihara, 1978; Dickinson et al., 2009; Tagliasacchi et al., 2016]. Shape skeletons vary in concept from precise geometric constructs like the medial axis representations [Blum, 1973; Amenta and Bern, 1998; Attali and Montanvert, 1997; Siddiqi and Pizer, 2008], curvilinear representations or meso-skeletons [Singh and Fiume, 1998; Au et al., 2008; Cao et al., 2010; Tagliasacchi et al., 2009; Huang et al., 2013; Yin et al., 2018], to piecewise linear structures [Katz and Tal, 2003; Zhu and Yuille, 1996; Siddiqi et al., 1999; Hilaga et al., 2001]. Our work is mostly related to animator-centric skeletons [Magnenat-Thalmann et al., 1988], which are designed to capture the mobility of an articulated shape. As discussed in the previous section, apart from shape geometry, the placement of joints and bones in animation skeletons is driven by the animator’s understanding of the character’s anatomy and expected deformations.
The earliest approach to automatic rigging of input 3D models is the pioneering method of “Pinocchio” [Baran and Popović, 2007]. Pinocchio follows a combination of discrete and continuous optimization to fit a predefined skeleton template to a 3D model, and also performs skinning through heat diffusion. Fitting tends to fail when the input shape structure is incompatible with the selected template, and hand-crafting templates for every possible structural variation of an input character is cumbersome. More recently, inspired by 3D pose estimation approaches [Haque et al., 2016; Pavlakos et al., 2017; Newell et al., 2016; Ge et al., 2018; Moon et al., 2018; Huang et al., 2018; Wan et al., 2018], Xu et al. [2019] proposed learning a volumetric network for producing skeletons, without skinning, from input 3D characters. Preprocessing the input mesh to a coarser voxel representation can eliminate surface features (like elbow or knee protrusions) useful for accurate joint detection and placement; alter the input shape topology (like proximal fingers represented as a voxel mitten); or accumulate approximation errors. RigNet compares favorably to these methods (Figure 8, Table 1), without requiring predefined skeletal templates, preprocessing or lossy conversion between shape representations.
Skin deformations.
A wide range of approaches have also been proposed to model skin deformations, ranging from physics-based methods [Kim et al., 2017; Mukai and Kuriyama, 2016; Si et al., 2015; Komaritzan and Botsch, 2018, 2019], geometric methods [Kavan and Žára, 2005; Kavan et al., 2007; Kavan and Sorkine, 2012; Wareham and Lasenby, 2008; Jacobson et al., 2011; Bang and Lee, 2018; Dionne and de Lasa, 2013, 2014], to data-driven methods that produce skinning from a sequence of examples [Loper et al., 2015; Le and Deng, 2014; James and Twigg, 2005; Qiao et al., 2018]. Given a single input character, it is common to resort to geometric methods for skin deformation, such as Linear Blend Skinning (LBS) or Dual Quaternion Skinning (DQS) [Kavan et al., 2007; Le and Hodgins, 2016] due to their simplicity and computational efficiency. These methods require input skinning weights per vertex which are either interactively painted and edited [Bang and Lee, 2018], or automatically estimated based on hand-engineered functions of shape geometry and skeleton [Baran and Popović, 2007; Kavan and Sorkine, 2012; Wareham and Lasenby, 2008; Jacobson et al., 2011; Bang and Lee, 2018; Dionne and de Lasa, 2013, 2014]. It is difficult for such geometric approaches to account for any anatomic considerations implicit in input meshes, such as the disparity between animator and geometric spines, or the skin flexibility/rigidity of different articulations.
Data-driven approaches like ours, however, can capture anatomic insights present in animator-created rigs. Neuroskinning [Liu et al., 2019] attempts to learn skinning from an input family of 3D characters. Their network performs graph convolution by learning edge weights within mesh neighborhoods, and outputting vertex features as weighted combinations of neighboring vertex features. Our method instead learns edge feature representations within both mesh and geodesic neighborhoods, and combines them into vertex representations inspired by the edge convolution scheme of [Wang et al., 2019]. Our network input uses intrinsic shape representations capturing geodesic distances between vertices and bones, rather than relying on extrinsic features, such as Euclidean distance. Unlike Neuroskinning, our method does not require any input joint categorization during training or testing. Most importantly, our method proposes a complete solution (skeleton and skinning) with better results (Tables 1, 2).
We note that our method is complementary to physics-based or deep learning methods that produce non-linear deformations, such as muscle bulges, on top of skin deformations [Mukai and Kuriyama, 2016; Bailey et al., 2018; Luo et al., 2018], or rely on input bones and skinning weights to compute other deformation approximations [Jeruzalski et al., 2019]. These methods require input bones and skinning weights that are readily provided by our method.
Graph Neural Networks.
Graph Neural Networks (GNNs) have become increasingly popular for graph processing tasks [Wu et al., 2019; Scarselli et al., 2009; Bruna et al., 2014; Henaff et al., 2015; Kipf and Welling, 2016; Defferrard et al., 2016; Li et al., 2016; Battaglia et al., 2016; Hamilton et al., 2017a, b]. Recently, GNNs have also been proposed for geometric deep learning on point sets [Wang et al., 2019], meshes [Masci et al., 2015; Hanocka et al., 2019], and intrinsic or spectral representations [Bronstein et al., 2017; Boscaini et al., 2016; Monti et al., 2017; Yi et al., 2017]. Our graph neural network adapts the operator proposed in [Wang et al., 2019] to perform edge convolutions within mesh-based and geodesic neighborhoods. Our network also weighs and combines representations from mesh topology, and from local and global shape geometry. Notably, our approach judiciously combines several other neural modules for detecting and connecting joints, with a graph neural network, to provide an integrated deep architecture for end-to-end character rigging.
3. Overview
Given an input 3D mesh of a character, our method predicts an animation skeleton and skinning tailored for its underlying articulation structure and geometry. Both the skeleton and skinning weights are animator-editable primitives that can be further refined through standard modeling and animation pipelines. Our method is based on a deep architecture (Figure 4), which operates directly on the mesh representation. We do not assume known input character class, part structure, or skeletal joint categories during training or testing. Our only assumption is that the input training and test shapes have a consistent upright and front-facing orientation. Below, we briefly overview the key aspects of our architecture. In Section 4, we explain its stages in more detail.
Skeletal joint prediction.
The first module of our architecture is trained to predict the location of joints that will be used to form the animation skeleton. To this end, it learns to displace mesh geometry towards candidate joint locations (Figure 4a). The module is based on a graph neural network, which extracts topology- and geometry-aware features from the mesh to learn these displacements. A key idea of our architecture in this stage is to learn a weight function over the input mesh, a form of neural mesh attention, which is used to reveal which surface areas are more relevant for localizing joints (Figure 4b). Our experiments demonstrate that this leads to more accurate skeletons. The displaced mesh geometry tends to form clusters around candidate joint locations. We introduce a differentiable clustering scheme, which uses the neural mesh attention, to extract the joint locations (Figure 4c).
Since the final animation skeleton may depend on the task or the artists’ preferences, our method also allows optional user input in the form of a single parameter to control the level-of-detail, or granularity, of the output skeleton. For example, some applications, like crowd simulation, may not require rigging of small parts (e.g., hands or fingers), while for other applications, like FPS games, rigging such parts is more important. By controlling a single parameter through a slider, fewer or more joints are introduced to capture a different level-of-detail for the output skeleton (see Figure 5).
Skeleton connectivity prediction.
The next module in our architecture learns which pairs of extracted joints should be connected with bones. Our module takes as input the predicted joints from the previous step, along with a learned shape and skeleton representation, and outputs a probability representing whether each pair should be connected with a bone or not (Figure 4d). We found that learned joint and shape representations are important to reliably estimate bones, since the skeleton connectivity depends not only on joint locations but also on the overall shape and skeleton geometry. The bone probabilities are used as input to a Minimum Spanning Tree algorithm that prioritizes the most likely bones to form a tree-structured skeleton, starting from a root joint picked by another trained neural module (Figure 4e).
Skinning prediction.
Given a predicted skeleton (Figure 4f), the last module of our architecture produces a weight vector per mesh vertex indicating the degree of influence it receives from different bones (Figure 4g). Our method is inspired by Neuroskinning [Liu et al., 2019], yet, with important differences in the architecture, bone and shape representations, and the use of volumetric geodesic distances from vertices to bones (as opposed to Euclidean distances).
Training and generalization.
Our architecture is trained via a combination of loss functions measuring deviation in joint locations, bone connectivity, and skinning weight differences with respect to the training skeletons. Our architecture is trained on input characters that vary significantly in terms of structure, number and geometry of moving parts, e.g., humanoids, bipeds, quadrupeds, fish, toys, fictional characters. Our test set is also similarly diverse. We observe that our method is able to generalize to characters with different numbers of underlying articulating parts (Figure 10).
4. Method
We now explain our architecture (Figure 4) for rigging an input 3D model at test time in detail. In the following subsections, we discuss each stage of our architecture. Then in Section 5, we discuss training.
4.1. Joint prediction
Given an input mesh $\mathcal{M}$, the first stage of our architecture outputs a set of 3D joint locations $\mathbf{t} = \{\mathbf{t}_i\}$, where $\mathbf{t}_i \in \mathbb{R}^3$. One particular complication related to this mapping is that the number of articulating parts, and in turn, the number of joints is not the same for all characters. For example, a multiped creature is expected to have more joints than a biped. We use a combination of regression and adaptive clustering to solve for the joint locations and their number. In the regression step, the mesh vertices are displaced towards their nearest candidate joint locations. This step results in accumulating points near joint locations (Figure 4a). The second step localizes the joints by clustering the displaced points and setting the cluster centers as joint locations (Figure 4b). The number of resulting clusters is determined adaptively according to the underlying point density and learned clustering parameters. Performing clustering without first displacing the vertices fails to extract reasonable joints, since the original position of mesh vertices is often far from joint locations. In the next paragraphs, we explain the regression and clustering steps.
Regression.
In this step, the mesh vertices are regressed to their nearest candidate joint locations. This is performed through a learned neural network function that takes as input the mesh $\mathcal{M}$ and outputs vertex displacements. Specifically, given the original mesh vertex locations $\mathbf{v}$, our displacement module $f_d$ outputs perturbed points $\mathbf{q}$:
$$\mathbf{q} = \mathbf{v} + f_d(\mathcal{M}; \mathbf{w}_d) \tag{1}$$
where $\mathbf{w}_d$ are learned parameters of this module. Figure 4a visualizes displaced points for a characteristic example. This mapping is reminiscent of P2PNet [Yin et al., 2018] that learns to displace surface points across different domains, e.g., surface points to meso-skeletons. In our case, the goal is to map mesh vertices to joint locations. An important aspect of our setting is that not all surface points are equally useful to determine joint locations, e.g., the vertices located near the elbow region of an arm are more likely to reveal elbow joints compared to other vertices. Thus, we also designed a neural network function $f_a$ that outputs an attention map which represents a confidence of localizing a joint from each vertex. Specifically, the attention map $\mathbf{a} = \{a_v\}$ includes a scalar value per vertex, where $a_v \in [0, 1]$, and is computed as follows:
$$\mathbf{a} = f_a(\mathcal{M}; \mathbf{w}_a) \tag{2}$$
where $\mathbf{w}_a$ are learned parameters of the attention module. Figure 4b visualizes the map for a characteristic example.
Module internals.
Both displacement and attention neural network modules operate on the mesh graph. As we show in our experiments, operating on the mesh graph yields significantly better performance compared to using alternative architectures that operate on point-sampled representations [Yin et al., 2018] or volumetric representations [Xu et al., 2019]. Our networks build upon the edge convolution proposed in [Wang et al., 2019], also known as “EdgeConv”. Given feature vectors $\mathbf{X} = \{\mathbf{x}_v\}$ at mesh vertices, the output of an EdgeConv operation at a vertex $v$ is a new feature vector encoding its local graph neighborhood: $\mathbf{x}'_v = \max_{u \in \mathcal{N}(v)} \mathrm{MLP}(\mathbf{x}_v, \mathbf{x}_u - \mathbf{x}_v; \mathbf{w}_{mlp})$, where $\mathrm{MLP}$ denotes a learned multi-layer perceptron, $\mathbf{w}_{mlp}$ are its learned parameters, and $\mathcal{N}(v)$ is the graph neighborhood of vertex $v$. Defining a proper graph neighborhood for our task turned out to be important. One possibility is to simply use one-ring vertex neighborhoods for edge convolution. However, we found that this strategy makes the network sensitive to the input mesh tessellation and results in lower performance. Instead, we found that it is better to define the graph neighborhood of a vertex by considering both its one-ring mesh neighbors, and also the vertices located within a geodesic ball centered at it. We also found that it is better to learn separate MLPs for mesh and geodesic neighborhoods, then concatenate their outputs and process them through another MLP. In this manner, the networks learn to weigh the importance of topology-aware features over more geometry-aware ones. Specifically, our convolution operator, called GMEdgeConv (see also Figure 4, bottom) is defined as follows:
$$\mathbf{x}_{v,m} = \max_{u \in \mathcal{N}_m(v)} \mathrm{MLP}(\mathbf{x}_v, \mathbf{x}_u - \mathbf{x}_v; \mathbf{w}_m) \tag{3}$$
$$\mathbf{x}_{v,g} = \max_{u \in \mathcal{N}_g(v)} \mathrm{MLP}(\mathbf{x}_v, \mathbf{x}_u - \mathbf{x}_v; \mathbf{w}_g) \tag{4}$$
$$\mathbf{x}'_v = \mathrm{MLP}(\mathrm{concat}(\mathbf{x}_{v,m}, \mathbf{x}_{v,g}); \mathbf{w}_c) \tag{5}$$
where $\mathcal{N}_m(v)$ are the one-ring mesh neighborhoods of vertex $v$, and $\mathcal{N}_g(v)$ are the vertices from its geodesic ball. In all our experiments, we used a ball radius $r$ defined as a fraction of the longest dimension of the model, tuned through grid search on a hold-out validation set. The weights $\mathbf{w}_m$, $\mathbf{w}_g$, and $\mathbf{w}_c$ are learned parameters for the above MLPs. We note that we experimented with the attention mechanism proposed in [Liu et al., 2019], yet we did not find any significant improvements. This is potentially due to the fact that EdgeConv already learns edge representations based on the pairwise functions of vertex features, which may implicitly encode edge importance.
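To make the two-neighborhood convolution concrete, the following is a minimal NumPy sketch of one GMEdgeConv layer as described by Eqs. 3 to 5. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`mlp`, `init_mlp`, `edge_conv`, `gm_edge_conv`), the ReLU after every layer, the layer widths, and the random stand-in weights are all hypothetical, and the neighbor lists are assumed to be precomputed and non-empty.

```python
import numpy as np

def mlp(x, weights):
    """Apply a small MLP; ReLU after every layer is a simplifying assumption."""
    for W, b in weights:
        x = np.maximum(x @ W + b, 0.0)
    return x

def init_mlp(rng, sizes):
    """Random (W, b) pairs standing in for learned parameters."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def edge_conv(x, neighbors, weights):
    """EdgeConv: for each vertex v, max over neighbors u of MLP([x_v, x_u - x_v])."""
    out = []
    for v, nbrs in enumerate(neighbors):
        xv = np.repeat(x[v][None, :], len(nbrs), axis=0)
        edge_feats = np.concatenate([xv, x[nbrs] - xv], axis=1)
        out.append(mlp(edge_feats, weights).max(axis=0))
    return np.stack(out)

def gm_edge_conv(x, mesh_nbrs, geo_nbrs, w_m, w_g, w_c):
    x_m = edge_conv(x, mesh_nbrs, w_m)                    # one-ring branch, Eq. 3
    x_g = edge_conv(x, geo_nbrs, w_g)                     # geodesic-ball branch, Eq. 4
    return mlp(np.concatenate([x_m, x_g], axis=1), w_c)   # fusion MLP, Eq. 5

# Toy example: 4 vertices with 3D positions as input features.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
mesh_nbrs = [[1, 2], [0, 2], [0, 1, 3], [2]]       # hypothetical one-ring neighbors
geo_nbrs = [[1, 2, 3], [0, 2], [0, 1, 3], [0, 2]]  # hypothetical geodesic-ball neighbors
w_m, w_g = init_mlp(rng, [6, 16]), init_mlp(rng, [6, 16])
w_c = init_mlp(rng, [32, 8])
out = gm_edge_conv(x, mesh_nbrs, geo_nbrs, w_m, w_g, w_c)
```

Keeping separate weights for the two branches is what lets the fusion MLP trade off topology-aware against geometry-aware features, as the text describes.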
Both the vertex displacement and attention modules start with the vertex positions as input features. They share the same internal architecture, which we call GMEdgeNet (see also Figure 4, bottom). GMEdgeNet stacks three GMEdgeConv layers, each followed by a global max-pooling layer. The representations from each pooling layer are concatenated to form a global mesh representation. The output per-vertex representations from all GMEdgeConv layers, as well as the global mesh representation, are further concatenated, then processed through a 3-layer MLP. In this manner, the learned vertex representations incorporate both local and global information. In the case of the vertex displacement module, the feature representations are transformed to 3D displacements per vertex through another MLP. In the case of the vertex attention module, the per-vertex feature representations are transformed through an MLP and a sigmoid non-linearity to produce a scalar attention value per vertex. Both modules use their own set of learned parameters for their GMEdgeConv layers and MLPs. More details about their architecture are provided in the appendix.
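The local/global fusion described above can be sketched in a few lines of NumPy. This is only an illustration of the pooling and concatenation pattern; the function name `fuse_local_global` and the channel widths are hypothetical, and the subsequent MLPs are omitted.

```python
import numpy as np

def fuse_local_global(layer_feats):
    """layer_feats: list of (V, C_l) arrays, one per GMEdgeConv layer.

    Returns per-vertex features augmented with a global mesh descriptor,
    together with the global descriptor itself.
    """
    pooled = [f.max(axis=0) for f in layer_feats]        # global max-pooling per layer
    global_feat = np.concatenate(pooled)                 # (sum C_l,)
    local = np.concatenate(layer_feats, axis=1)          # (V, sum C_l)
    tiled = np.tile(global_feat, (local.shape[0], 1))    # same global vector for each vertex
    return np.concatenate([local, tiled], axis=1), global_feat

rng = np.random.default_rng(0)
feats = [rng.standard_normal((5, c)) for c in (4, 8, 16)]  # illustrative widths
fused, g = fuse_local_global(feats)
```

Every vertex receives the same global vector, so the following MLP can condition each local feature on the overall shape.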
Clustering.
This step takes as input the displaced points $\mathbf{q}$ along with their corresponding attention values $\mathbf{a}$, and outputs joints. As shown in Figure 4a, points tend to concentrate in areas around candidate joint locations. Areas with higher point density and greater attention are strong indicators of joint presence. We resort to density-based clustering to detect local maxima of point density and use those as joint locations. In particular, we employ a variant of mean-shift clustering, which also uses our learned attention map. A particular advantage of mean-shift clustering is that it does not explicitly require as input the number of target clusters.
In classical mean-shift clustering [Cheng, 1995], each data point is equipped with a kernel function. The sum of kernel functions results in a continuous density estimate, and the local maxima (modes) correspond to cluster centers. Mean-shift clustering is performed iteratively; at each iteration, all points are shifted towards density modes. In our implementation, the kernel is also modulated by the vertex attention. In this manner, points with greater attention influence the estimation of density more. Specifically, at each mean-shift iteration, each point $\mathbf{q}_v$ is displaced according to the vector:
$$\mathbf{m}_v = \frac{\sum_u a_u K(\mathbf{q}_u - \mathbf{q}_v, h)\, \mathbf{q}_u}{\sum_u a_u K(\mathbf{q}_u - \mathbf{q}_v, h)} - \mathbf{q}_v \tag{6}$$
where $K(\mathbf{q}_u - \mathbf{q}_v, h) = \max(1 - \lVert \mathbf{q}_u - \mathbf{q}_v \rVert^2 / h^2, 0)$ is the Epanechnikov kernel with learned bandwidth $h$. We found that the Epanechnikov kernel produces better clustering results than a Gaussian kernel or a triangular kernel. The mean-shift iterations are implemented through a recurrent module in our architecture, similarly to the recurrent pixel grouping in Kong and Fowlkes [2018], which also enables training of the bandwidth through backpropagation.
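As an illustration of the attention-modulated mean-shift update of Eq. 6, here is a small NumPy sketch. It assumes the standard Epanechnikov profile `max(1 - d^2 / h^2, 0)` and uses hypothetical names (`mean_shift_step`, `mean_shift`); the tolerance `tol`, the iteration cap, and the toy bandwidth are illustrative values rather than the paper's settings, and the differentiable recurrent formulation is not reproduced here.

```python
import numpy as np

def epanechnikov(sq_dist, h):
    """Epanechnikov kernel evaluated on squared distances: max(1 - d^2 / h^2, 0)."""
    return np.maximum(1.0 - sq_dist / h**2, 0.0)

def mean_shift_step(q, a, h):
    """Attention-weighted mean-shift vector for every point (one iteration)."""
    diff = q[None, :, :] - q[:, None, :]                   # diff[v, u] = q_u - q_v
    w = a[None, :] * epanechnikov((diff**2).sum(-1), h)    # w[v, u] = a_u * K(q_u - q_v, h)
    denom = w.sum(axis=1, keepdims=True)
    mean = (w @ q) / np.maximum(denom, 1e-12)
    # Points with no attention mass in their window are left in place.
    return np.where(denom > 1e-12, mean - q, 0.0)

def mean_shift(q, a, h, tol=1e-3, max_iter=100):
    """Iterate until no point moves by more than tol (illustrative default)."""
    q = np.array(q, dtype=float)
    for _ in range(max_iter):
        m = mean_shift_step(q, a, h)
        q = q + m
        if np.linalg.norm(m, axis=1).max() <= tol:
            break
    return q

# Toy data: two tight groups of displaced points, one unit apart along x.
offsets = np.array([[0, 0, 0], [0.05, 0, 0], [0, 0.05, 0], [0, 0, 0.05]], dtype=float)
points = np.vstack([offsets, offsets + np.array([1.0, 0.0, 0.0])])
attention = np.ones(len(points))
collapsed = mean_shift(points, attention, h=0.3)
```

Because the kernel has compact support, points farther apart than `h` do not influence each other, so the two groups collapse to separate modes.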
At test time, we perform mean-shift iterations until convergence (i.e., no point is shifted by a Euclidean distance larger than a small tolerance). As a result, the shifted points “collapse” into distinct modes (Figure 4c). To extract these modes, we start with the point with highest density, and remove all its neighbors within radius equal to the bandwidth $h$. This point represents a mode, and we create a joint at its location. Then we proceed by finding the point with the second largest density among the remaining ones, suppress its neighbors, and create another joint. This process continues until no other points remain. The output of this step is the set of modes, which correspond to the set of detected joints $\mathbf{t}$.
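The greedy mode extraction described above resembles non-maximum suppression, and can be sketched as follows. The function name `extract_modes` is hypothetical, and estimating density with the attention-weighted Epanechnikov kernel is an assumption made for this illustration; the text only states that points are visited in order of decreasing density.

```python
import numpy as np

def extract_modes(points, a, h):
    """Greedily pick the densest remaining point and suppress neighbors within radius h."""
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    # Assumed density estimate: attention-weighted Epanechnikov kernel sum.
    density = (a[None, :] * np.maximum(1.0 - sq / h**2, 0.0)).sum(axis=1)
    alive = np.ones(len(points), dtype=bool)
    joints = []
    while alive.any():
        idx = np.flatnonzero(alive)
        best = idx[np.argmax(density[idx])]
        joints.append(points[best])
        alive &= sq[best] > h**2       # removes `best` itself and everything within h
    return np.array(joints)

# Toy input: points that already collapsed onto two modes.
pts = np.array([[0, 0, 0], [0, 0, 0], [0, 0, 0], [1, 0, 0], [1, 0, 0]], dtype=float)
joints = extract_modes(pts, np.ones(len(pts)), h=0.1)
```

In this toy input the mode supported by three points is emitted first, followed by the mode supported by two.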
User control.
Since animators may prefer to have more control over the placement of joints, we allow them to override the learned bandwidth value, by interactively manipulating a slider controlling its value (Figure 5). We found that modifying the bandwidth directly affects the level-of-detail of the output skeleton. Lowering the bandwidth results in denser joint placement, while increasing it results in sparser skeletons. We note that the bandwidth cannot be set to arbitrary values, e.g., a zero bandwidth value will cause each displaced vertex to become a joint. In our implementation, we empirically set an editable range from 0.01 to 0.1. The resulting joints can be processed by the next modules of our architecture to produce the bone connectivity and skinning based on their updated positions.
Symmetrization.
3D characters are often modeled based on a neutral pose (e.g., “T-pose”), and as a result their body shapes usually have bilateral symmetry. In such cases, we symmetrize joint prediction by reflecting the displaced points and attention map according to the global bilateral symmetry plane before performing clustering. As a result, the joint prediction is more robust to any small inconsistencies produced on either side.
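A minimal sketch of this symmetrization step is given below. It assumes the symmetry plane is already known and, for the toy example, that it is the plane x = 0; the function name `symmetrize` and its `normal`/`offset` parameters are hypothetical.

```python
import numpy as np

def symmetrize(q, a, normal=(1.0, 0.0, 0.0), offset=0.0):
    """Append mirror images of the displaced points (and their attention) across a plane.

    The plane is {x : normal . x = offset}; it is assumed to be given.
    """
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    signed = q @ n - offset                        # signed distance of each point to the plane
    reflected = q - 2.0 * signed[:, None] * n
    return np.vstack([q, reflected]), np.concatenate([a, a])

q = np.array([[0.2, 1.0, 0.0], [-0.1, 0.5, 0.3]])
a = np.array([0.9, 0.4])
q_sym, a_sym = symmetrize(q, a)
```

Clustering the doubled point set then yields joints that are consistent across the two sides of the character.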
4.2. Connectivity prediction
Given the joints extracted from the previous stage, the connectivity prediction stage determines how these joints should be connected to form the animation skeleton. At the heart of this stage lies a learned neural module that outputs the probability of connecting each pair of joints via a bone. These pairwise bone probabilities are used as input to Prim’s algorithm that creates a Minimum Spanning Tree (MST) representing the animation skeleton. We found that using these bone probabilities to extract the MST resulted in skeletons that agree with animator-created ones more in topology compared to simpler schemes, e.g., using Euclidean distances between joints (see Figure 7 and experiments). In the following paragraphs, we explain the module for determining the bone probabilities for each pair of joints, then we discuss the cost function used for creating the MST.
Bone module.
The bone module, which we call “BoneNet”, takes as input our predicted joints $\mathbf{t}$ along with the input mesh $\mathcal{M}$, and outputs the probability $p_{i,j}$ for connecting each pair of joints via a bone. By processing all pairs of joints through the same module, we extract a pairwise matrix representing all candidate bone probabilities. The architecture of the module is shown in Figure 6. For each pair of joints, the module processes three representations that capture global shape geometry, skeleton geometry, and features from the input pair of joints. In our experiments, we found that this combination offered the best bone prediction performance. More specifically, BoneNet takes as input: (a) a representation $\mathbf{g}_s$ encoding global shape geometry, which is extracted from the max-pooling layers of GMEdgeNet (see also Figure 4, bottom), (b) a representation $\mathbf{g}_t$ encoding the overall skeleton geometry by treating joints as a collection of points and using a learned PointNet to produce it [Qi et al., 2017], and (c) a representation encoding the input pair of joints. To produce this last representation, we first concatenate the positions of the two joints $\{\mathbf{t}_i, \mathbf{t}_j\}$, their Euclidean distance $d_{i,j}$, and another scalar $o_{i,j}$ capturing the proportion of the candidate bone lying in the exterior of the mesh. The Euclidean distance and proportion are useful indicators of joint connectivity: the smaller the distance between two joints, the more likely is a bone between them. If the candidate bone protrudes significantly outside the shape, then it is less likely to be chosen for the final skeleton. We transform the raw features $[\mathbf{t}_i, \mathbf{t}_j, d_{i,j}, o_{i,j}]$ into a bone representation $\mathbf{f}_{i,j}$ through an MLP. The bone probability is computed via a 2-layer MLP operating on the concatenation of these three representations, followed by a sigmoid:
$$p_{i,j} = \mathrm{sigmoid}(\mathrm{MLP}(\mathrm{concat}(\mathbf{f}_{i,j}, \mathbf{g}_s, \mathbf{g}_t); \mathbf{w}_b)) \tag{7}$$
where $\mathbf{w}_b$ are learned module parameters. Details about the architecture of BoneNet are provided in the appendix.
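The raw per-pair features fed to the bone representation (joint positions, Euclidean distance, and the proportion of the candidate bone outside the mesh) can be sketched as below. The outside proportion is approximated here by sampling points along the segment; this sampling scheme, the `inside_fn` callback (any point-in-mesh test), and the function name `bone_pair_features` are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def bone_pair_features(t_i, t_j, inside_fn, n_samples=32):
    """Raw features for a candidate bone: [t_i, t_j, distance, outside proportion].

    inside_fn(p) is a hypothetical callback returning 1.0 if point p lies inside
    the mesh and 0.0 otherwise.
    """
    t_i, t_j = np.asarray(t_i, dtype=float), np.asarray(t_j, dtype=float)
    dist = np.linalg.norm(t_i - t_j)
    s = np.linspace(0.0, 1.0, n_samples)[:, None]
    samples = (1.0 - s) * t_i + s * t_j                  # points along the candidate bone
    outside = 1.0 - np.mean([inside_fn(p) for p in samples])
    return np.concatenate([t_i, t_j, [dist, outside]])

# Toy "mesh": the unit sphere, with an analytic inside test.
inside_unit_sphere = lambda p: float(np.linalg.norm(p) <= 1.0)
f_in = bone_pair_features([-0.5, 0, 0], [0.5, 0, 0], inside_unit_sphere)
f_out = bone_pair_features([0, 0, 0], [2, 0, 0], inside_unit_sphere, n_samples=21)
```

A bone fully inside the shape gets an outside proportion of zero, whereas the second candidate leaves the sphere for roughly half its length, which would make it a less likely bone.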
Skeleton extraction.
The skeleton extraction step aims to infer the most likely tree-structured animation skeleton among all possible candidates. If we consider the choice of selecting an edge in a tree as an independent random variable, the joint probability of a tree is equal to the product of its edge probabilities. Maximizing the joint probability is equivalent to minimizing the sum of negative log probabilities of the edges: $\sum_{(i,j) \in T} -\log p_{i,j}$, where $T$ denotes the set of edges in the tree. Thus, by defining a dense graph whose nodes are the extracted joints, and whose edges have weights $-\log p_{i,j}$, we can use an MST algorithm to solve this problem. In our implementation, we use Prim’s algorithm [Prim, 1957]. Any joint can serve as a starting, or root, joint for Prim’s algorithm. However, since the root joint is used to control the character’s global body position and orientation and is important for motion re-targeting tasks, this stage also predicts which joint should be used as root. One common choice is to select the joint closest to the center of gravity of the character. However, we found that this choice is not always consistent with animators’ preferences (Figure 2, root nodes in the cat and dragon are further away from their centroids). Instead, we found that the selection of the root joint can be performed more reliably using a neural module. Specifically, our method incorporates a module, which we call RootNet. Its internal architecture follows BoneNet. It takes as input the global shape representation $\mathbf{g}_s$ and global joint representation $\mathbf{g}_t$ (as in BoneNet). It also takes as input a joint representation $\mathbf{f}_i$ learned through an MLP operating on the joint's location and distance to the bilateral symmetry plane. The latter feature was driven by the observation that root joints are often placed along this symmetry plane. RootNet, denoted $f_r$, outputs the root joint probability as follows:
$$p_{i,r} = f_r(\mathbf{f}_i, \mathbf{g}_s, \mathbf{g}_t; \mathbf{w}_r) \tag{8}$$
where $\mathbf{w}_r$ are learned parameters. At test time, we select the joint with highest probability as root joint to initiate Prim’s algorithm.
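The skeleton extraction step can be sketched with a heap-based version of Prim's algorithm run on negative log bone probabilities. The function name `extract_skeleton`, the probability clipping constant, and the toy probability matrix are hypothetical; in the method, the probabilities and the root index would come from the learned bone and root modules.

```python
import heapq
import numpy as np

def extract_skeleton(prob, root):
    """Prim's algorithm on edge costs -log(p_ij); returns a {joint: parent} map."""
    n = prob.shape[0]
    cost = -np.log(np.clip(prob, 1e-12, 1.0))   # clipping avoids log(0); an assumption
    parent = {root: None}
    heap = [(cost[root, j], root, j) for j in range(n) if j != root]
    heapq.heapify(heap)
    while heap and len(parent) < n:
        _, i, j = heapq.heappop(heap)
        if j in parent:                          # joint already attached to the tree
            continue
        parent[j] = i
        for k in range(n):
            if k not in parent:
                heapq.heappush(heap, (cost[j, k], j, k))
    return parent

# Toy symmetric bone-probability matrix for 4 joints (illustrative values).
prob = np.array([[0.0, 0.9, 0.1, 0.1],
                 [0.9, 0.0, 0.8, 0.2],
                 [0.1, 0.8, 0.0, 0.7],
                 [0.1, 0.2, 0.7, 0.0]])
parent = extract_skeleton(prob, root=1)
```

Starting from joint 1, the tree attaches joints 0 and 2 directly to the root and joint 3 to joint 2, which are the three most probable bones that keep the graph acyclic.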
4.3. Skinning prediction
After producing the animation skeleton, the final stage of our architecture is the prediction of skinning weights for each mesh vertex to complete the rigging process. To perform skinning, we first extract a mesh representation capturing the spatial relationship of mesh vertices with respect to the skeleton. The representation is inspired by previous skinning methods [Dionne and de Lasa, 2013; Jacobson et al., 2011] that compute influences of bones on vertices according to volumetric geodesic distances between them. This mesh representation is processed through a graph neural network that outputs the per-vertex skinning weights. In the next paragraphs, we describe the representation and network.
Skeletonaware mesh representation.
The first step of the skinning stage is to compute a mesh representation $\mathbf{H} = \{\mathbf{h}_v\}$, which stores a feature vector for each mesh vertex $v$ and captures its spatial relationship with respect to the skeleton. Specifically, for each vertex $v$ we compute volumetric geodesic distances to all the bones, i.e., shortest path lengths from the vertex to bones passing through the interior mesh volume. We use an implementation that approximates the volumetric geodesic distances based on [Dionne and de Lasa, 2013]; other potentially more accurate approximations could also be used [Crane et al., 2013; Solomon et al., 2014]. Then for each vertex $v$, we sort the bones according to their volumetric geodesic distance to it, and create an ordered feature sequence $\{\mathbf{b}_{r,v}\}_{r=1 \ldots K}$, where $r$ denotes an index to the sorted list of bones. Each feature vector $\mathbf{b}_{r,v}$ concatenates the 3D positions of the starting and end joints of bone $r$, and the inverse of the volumetric geodesic distance from the vertex $v$ to this bone ($1 / D_{r,v}$). The reason for ordering the bones with respect to each vertex is to promote consistency in the resulting representation, i.e., the first entry always represents the closest bone to the vertex, the second entry represents the second closest bone, and so on. In our implementation, we use the $K$ closest bones, with $K$ selected based on hold-out validation. If a skeleton contains fewer than $K$ bones, we simply repeat the last bone in the sequence. The final per-vertex representation $\mathbf{h}_v$ is formed by concatenating the vertex position and the above ordered sequence $\{\mathbf{b}_{r,v}\}_{r=1 \ldots K}$.
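The construction of this per-vertex representation can be sketched as follows, assuming the volumetric geodesic distances have already been computed by some other routine and are passed in as a matrix. The function name `skeleton_aware_features`, the array layouts, and the small epsilon guarding the inverse distance are assumptions for illustration; the number of retained bones `k` is left as a parameter.

```python
import numpy as np

def skeleton_aware_features(verts, bones, dist, k):
    """Build per-vertex features from the k closest bones.

    verts: (V, 3) vertex positions.
    bones: (B, 6) start and end joint positions of each bone.
    dist:  (V, B) precomputed volumetric geodesic distances (assumed given).
    """
    V, B = dist.shape
    order = np.argsort(dist, axis=1)                       # closest bone first
    if B < k:                                              # pad by repeating the last bone
        pad = np.repeat(order[:, -1:], k - B, axis=1)
        order = np.concatenate([order, pad], axis=1)
    order = order[:, :k]
    inv_d = 1.0 / np.maximum(np.take_along_axis(dist, order, axis=1), 1e-8)
    per_bone = np.concatenate([bones[order], inv_d[..., None]], axis=2)   # (V, k, 7)
    feats = np.concatenate([verts, per_bone.reshape(V, -1)], axis=1)      # (V, 3 + 7k)
    return feats, order

verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
bones = np.array([[0, 0, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]], dtype=float)
dist = np.array([[0.5, 2.0], [2.0, 0.25]])
feats, order = skeleton_aware_features(verts, bones, dist, k=3)
```

Returning `order` alongside the features keeps track of which bone each slot refers to, which is needed later to interpret the predicted weights.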
Skinning module.
The module transforms the above skeleton-aware mesh representation to skinning weights:
(9) 
The parameters of this mapping are learned. The skinning network follows GMEdgeNet. The last layer outputs a per-vertex feature vector, which is transformed to a per-vertex skinning weight vector through a learned MLP and a softmax function. This ensures that the skinning weights for each vertex are positive and sum to 1. The entries of the output skinning weight vector are ordered according to the volumetric geodesic distance of the vertex to the corresponding bones.
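To make the output convention concrete, the sketch below applies a softmax to per-vertex scores over the sorted closest bones and then scatters the result back to the original bone indices. The scatter step, the function name `skinning_weights_from_logits`, and the choice to accumulate weights of repeated (padded) bones onto the same bone are assumptions of this example rather than details stated in the text.

```python
import numpy as np

def skinning_weights_from_logits(logits, order, num_bones):
    """Convert scores over the k closest bones into full skinning weights.

    logits:    (V, k) network outputs, one score per sorted bone slot
    order:     (V, k) bone index stored in each slot (closest first)
    num_bones: total number of bones in the skeleton
    Returns a (V, num_bones) array of weights that sum to 1 per vertex.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)               # softmax over slots
    weights = np.zeros((logits.shape[0], num_bones))
    rows = np.arange(logits.shape[0])[:, None]
    # np.add.at accumulates correctly when a bone index repeats in a row.
    np.add.at(weights, (rows, order), probs)
    return weights
```

Bones outside the selected closest set keep a weight of exactly zero in this sketch.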
5. Training
The goal of our training procedure is to learn the parameters of the networks used in each of the three stages of RigNet. Training is performed on a dataset of rigged characters described in Section 6.
5.1. Joint prediction stage training
Given a set of training characters, each with annotated skeletal joints, we learn the parameters of the attention and displacement modules and the bandwidth of this stage such that the estimated skeletal joints are as close as possible to the training ones. Since the estimated skeletal joints originate from mesh vertices that collapse into modes after mean shift clustering, we can alternatively formulate the above learning goal as a problem of minimizing the distance of collapsed vertices to the nearest training joints and vice versa. Specifically, we minimize the symmetric Chamfer distance between collapsed vertices and training joints:
(10) 
The loss is summed over the training characters (we omit this summation for clarity). We note that this loss is differentiable with respect to all the parameters of the joint prediction stage, including the bandwidth. The mean shift iterations of Eq. 6 are differentiable with respect to the attention weights and displaced points. This allows us to backpropagate the joint location error signal to both the vertex displacement and attention networks. The Epanechnikov kernel in mean shift is also a quadratic function with respect to the bandwidth, which makes it possible to learn the bandwidth efficiently through gradient descent. Learning converged to a value of based on our training dataset.
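The two ingredients of this loss can be illustrated with a short NumPy sketch: one attention-weighted mean shift update with an Epanechnikov kernel, and a symmetric Chamfer distance between the collapsed points and the training joints. The exact form of the update (Eq. 6 is not reproduced here), the unnormalized sums in the Chamfer distance, and all function names are assumptions of the example; a real training setup would use an automatic differentiation framework instead of plain NumPy.

```python
import numpy as np

def mean_shift_step(points, attention, bandwidth):
    """One attention-weighted mean shift update with an Epanechnikov kernel.

    Assumes strictly positive attention so every point has nonzero total weight.
    """
    sq_dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.maximum(0.0, 1.0 - sq_dist / bandwidth ** 2)
    weights = kernel * attention[None, :]
    return weights @ points / weights.sum(axis=1, keepdims=True)

def symmetric_chamfer(a, b):
    """Sum of nearest-neighbor distances from a to b and from b to a."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(dist.min(axis=1).sum() + dist.min(axis=0).sum())

# Toy example: four displaced points forming two tight clusters.
displaced = np.array([[0.0, 0, 0], [0.1, 0, 0], [1.0, 0, 0], [1.1, 0, 0]])
attention = np.ones(len(displaced))
modes = displaced.copy()
for _ in range(20):
    modes = mean_shift_step(modes, attention, bandwidth=0.5)

joints = np.array([[0.05, 0, 0], [1.05, 0, 0]])
loss = symmetric_chamfer(modes, joints)
```

In this toy case the points of each cluster collapse onto a common mode, so the Chamfer distance to joints placed at the cluster centers becomes negligible.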
We also found that adding a supervisory signal to the vertex displacements before clustering helped improve training speed and joint detection performance (see also experiments). To this end, we minimize the Chamfer distance between displaced points and ground-truth joints, favoring tighter clusters:
(11) 
This loss affects only the parameters of the displacement module. Finally, we found that adding supervision to the vertex attention weights also offered a performance boost, as discussed in our experiments. This loss is driven by the observation that the displacements of vertices located closer to joints are more helpful for localizing them accurately. Thus, for each training mesh, we find the vertices closest to each joint along different directions perpendicular to the bones. Then we create a binary mask whose values are equal to 1 for these closest vertices and 0 for the rest. We use cross-entropy to measure consistency between these masks and the neural attention.
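A minimal sketch of this attention supervision, assuming the binary mask has already been built and that the per-vertex terms are averaged (the averaging and the name `attention_mask_loss` are assumptions of the example):

```python
import numpy as np

def attention_mask_loss(attention, mask, eps=1e-7):
    """Binary cross-entropy between predicted attention and a 0/1 vertex mask.

    attention: (V,) predicted attention in [0, 1]
    mask:      (V,) 1 for vertices selected as closest to a joint, 0 otherwise
    """
    a = np.clip(attention, eps, 1.0 - eps)        # avoid log(0)
    per_vertex = -(mask * np.log(a) + (1.0 - mask) * np.log(1.0 - a))
    return float(per_vertex.mean())
```

An uninformative attention of 0.5 everywhere gives a loss of log 2, while attention that matches the mask drives the loss toward zero.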
Edge dropout.
During training of GMEdgeNet, for each batch, we randomly select a subset of edges within geodesic neighborhoods (in our implementation, we randomly select subsets up to edges). This sampling strategy can be considered as a form of mesh edge dropout. We found that it improved performance since it simulates varying vertex sampling on the mesh, making the graph network more robust to different tessellations.
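One way to picture this sampling is as a per-vertex cap on incoming edges, re-drawn for every batch. The sketch below follows that reading; the cap value, the `(2, E)` edge layout, and the function name `sample_edges` are assumptions of the example and are not taken from the implementation.

```python
import numpy as np

def sample_edges(edge_index, max_edges_per_vertex, rng):
    """Randomly keep at most `max_edges_per_vertex` incoming edges per vertex.

    edge_index: (2, E) integer array, row 0 = source vertex, row 1 = target vertex
    rng:        a numpy.random.Generator, re-used across batches
    Assumes at least one edge is present.
    """
    targets = edge_index[1]
    kept = []
    for v in np.unique(targets):
        idx = np.flatnonzero(targets == v)          # edges arriving at vertex v
        if len(idx) > max_edges_per_vertex:
            idx = rng.choice(idx, size=max_edges_per_vertex, replace=False)
        kept.append(idx)
    kept = np.sort(np.concatenate(kept))            # preserve original edge order
    return edge_index[:, kept]
```

Calling the function once per batch with the same generator yields a different edge subset each time, which is what makes it act like dropout on the mesh graph.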
Training implementation details.
We first pretrain the parameters of the attention module with the attention mask loss alone. We found that bootstrapping the attention module with this pretraining helped performance (see also experiments). Then we fine-tune these parameters and train the parameters of the displacement module and the bandwidth using the combined loss. For fine-tuning, we use the Adam optimizer with a batch size of training characters, and learning rate .
5.2. Connectivity stage training
Given a training character, we form the adjacency matrix encoding the connectivity of the skeleton, i.e., an entry is 1 if the two corresponding training joints are connected and 0 otherwise. The parameters of BoneNet are learned using binary cross-entropy between the training adjacency matrix entries and the predicted probabilities.
The BoneNet parameters are learned using the probabilities estimated for training joints rather than for the joints predicted by the previous stage. The reason is that the training adjacency matrix is defined on training joints (and not on the predicted ones). We tried to find correspondences between the predicted joints and the training ones using the Hungarian method and then transfer the training adjacencies to pairs of matched joints. However, we did not observe significant improvements by doing this, potentially due to matching errors. Finally, to train the parameters of the network used to extract the root joint, we use the softmax loss for classification.
Training implementation details.
Training BoneNet poses an additional challenge due to class imbalance: out of all pairs of joints, only a few are connected. To deal with this issue, we adopt the online hard-example mining approach from [Shrivastava et al., 2016]. For both networks, we employ the Adam optimizer with batch size and learning rate .
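The effect of hard-example mining on an imbalanced connectivity loss can be illustrated with a simplified variant: keep every connected pair and only the unconnected pairs with the highest loss. This is a stand-in for the cited approach, not a reproduction of it; the ratio `neg_per_pos` and the function name `ohem_bce` are hypothetical.

```python
import numpy as np

def ohem_bce(prob, target, neg_per_pos=3, eps=1e-7):
    """Binary cross-entropy over all positives and the hardest negatives.

    prob:   (P,) predicted connection probability for each joint pair
    target: (P,) 1 if the pair is connected in the training skeleton, else 0
    """
    p = np.clip(prob, eps, 1.0 - eps)
    loss = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    pos = target == 1
    neg_losses = np.sort(loss[~pos])[::-1]          # hardest negatives first
    n_keep = min(len(neg_losses), max(1, neg_per_pos * int(pos.sum())))
    kept = np.concatenate([loss[pos], neg_losses[:n_keep]])
    return float(kept.mean())
```

Easy negatives, which dominate the set of joint pairs, then no longer swamp the gradient contributed by the few connected pairs.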
5.3. Skinning stage training
Given a set of training characters, each with skin weights, we train the parameters of our skinning network so that the estimated skinning weights agree as much as possible with the training ones. By treating the per-vertex skinning weights as probability distributions, we use cross-entropy as the loss to quantify the disagreement between the training and predicted distributions for each vertex.
As in the case of the connectivity stage, we train the skinning network based on the training skeleton rather than the predicted one, since we do not have skinning weights for it. We tried to transfer skinning weights from the training bones to the predicted ones by establishing correspondences as before, but this did not result in significant improvements.
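For reference, a cross-entropy between per-vertex weight distributions can be written in a few lines. The sketch assumes both arrays are already aligned to the same bone ordering and averages over vertices; both choices, and the name `skinning_cross_entropy`, are assumptions of the example.

```python
import numpy as np

def skinning_cross_entropy(pred, ref, eps=1e-7):
    """Cross-entropy between reference and predicted skinning distributions.

    pred: (V, K) predicted weights, each row summing to 1
    ref:  (V, K) training weights, each row summing to 1
    """
    log_pred = np.log(np.clip(pred, eps, 1.0))      # avoid log(0)
    return float(-(ref * log_pred).sum(axis=1).mean())
```

Unlike the binary case, the targets here are soft: a vertex influenced equally by two bones contributes a nonzero minimum loss (its entropy) even for a perfect prediction.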
Training implementation details.
To train the skinning network, we use the Adam optimizer with a batch size of training characters, and learning rate . We also apply the edge dropout scheme during the training of this stage, as in the joint prediction stage.
6. Results
We evaluated our method and alternatives for animation skeleton and skinning prediction both quantitatively and qualitatively. Below we discuss the dataset used for evaluation, the performance measures, comparisons, and ablation study.
Dataset.
To train and test our method and alternatives, we chose the “ModelsResourceRigNetv1” dataset of 3D articulated characters from [Xu et al., 2019], which provides a non-overlapping training and test split, and contains diverse characters (please see also our project page: https://zhanxu.github.io/rignet). Specifically, the dataset contains rigged characters mined from an online repository [ModelsResource, 2019], spanning several categories, including humanoids, quadrupeds, birds, fish, robots, toys, and other fictional characters. Each character includes one rig (we note that the multiple rig examples of the two models of Figure 3 were made separately and do not belong to this dataset). The dataset does not contain duplicates or re-meshed versions of the same character; such duplicates were eliminated from the dataset. Specifically, all models were voxelized in a binary grid, then for each model in the dataset, we computed the Intersection over Union (IoU) with all other models based on their volumetric representation. We eliminated duplicates or near-duplicates whose IoU of volumes was more than 95%. We also manually verified that such re-meshed versions were filtered out. Under the guidance of an artist, we also verified that all characters have plausible skinning weights and deformations. We use a training, holdout validation, and test split, following a  proportion respectively, resulting in training, holdout validation, and test characters. Figure 2 shows examples from the training split. The models are consistently oriented and scaled. Meshes with fewer than vertices were subdivided; as a result all training and test meshes contained between K and K vertices. The number of joints per character varied from to , and the average is . The quantitative and qualitative evaluation was performed on the test split of the dataset.
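The duplicate filtering step lends itself to a short sketch. The 95% threshold comes from the description above; the greedy keep-first policy, the function names, and the assumption that all grids share one resolution and alignment are choices made for the example.

```python
import numpy as np

def voxel_iou(a, b):
    """Intersection over Union of two boolean occupancy grids of equal shape."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def drop_near_duplicates(grids, threshold=0.95):
    """Return indices of models to keep, scanning in order and skipping
    any model whose IoU with an already kept model exceeds the threshold."""
    kept = []
    for i, grid in enumerate(grids):
        if all(voxel_iou(grid, grids[j]) <= threshold for j in kept):
            kept.append(i)
    return kept
```

Because the scan compares each model only against those already kept, the result depends on the input order; an order-independent policy would need an extra tie-breaking rule.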
Quantitative evaluation measures.
Our quantitative evaluation aims to measure the similarity of the predicted animation skeletons and skinning to the ones created by modelers in the test set (denoted as “reference skeletons” and “reference skinning” in the following paragraphs).
For evaluating skeleton similarity, we employ various measures following [Xu et al., 2019]:
(a) CD-J2J is the symmetric Chamfer distance between joints. Given a test shape, we sum the Euclidean distances from each predicted joint to the nearest joint in its reference skeleton, then divide by the number of predicted joints. We also compute the Chamfer distance the other way around, from the reference skeletal joints to the nearest predicted ones. We denote the average of the two as CD-J2J.
(b) CD-J2B is the Chamfer distance between joints and bones. The difference from the previous measure is that for each predicted joint, we compute its distance to the nearest bone point on the reference skeleton. We symmetrize this measure by also computing the distance from reference joints to predicted bones. A low value of CD-J2B and a high value of CD-J2J mean that the predicted and reference skeletons tend to overlap, yet the joints are misplaced along the bone direction.
(c) CD-B2B is the Chamfer distance between bones (line segments). As above, we define it symmetrically. CD-B2B measures similarity of skeletons in terms of bone placement (rather than joints). Ideally, all CD-J2J, CD-J2B, and CD-B2B measures should be low.
(d) IoU (Intersection over Union) can also be used to characterize skeleton similarity. First, we find a maximal matching between the predicted and reference joints by using the Hungarian algorithm. Then we measure the number of predicted and reference joints that are matched and whose Euclidean distance is lower than a prescribed tolerance. This is then divided by the total number of predicted and reference joints.
By varying the tolerance, we can obtain plots demonstrating IoU for various tolerance levels (see Figure 11).
To provide a single, informative value, we set the tolerance to half of the local shape diameter [Shapira et al., 2008] evaluated at each corresponding reference joint. This is evaluated by casting rays perpendicular to the bones connected at the reference joint, finding ray-surface intersections, and computing the joint-surface distance averaged over all rays. The reason for this normalization is that thinner parts, e.g., arms, have lower shape diameter; as a result, small joint deviations can cause more noticeable misplacement compared to thicker parts like the torso.
(e) Precision & Recall can also be used here.
Precision is the fraction of predicted joints that were matched and whose distance to their nearest reference one is lower than the tolerance defined above.
Recall is the fraction of reference joints that were matched and whose distance to their nearest predicted joints is lower than the tolerance.
Note that the number of reference and predicted joints may not be the same. Unmatched predicted joints contribute no precision, and similarly unmatched reference joints contribute no recall.
(f) TreeEditDist (ED) is the tree edit distance measuring the topological difference of the predicted skeleton to the reference one. The measure is defined as the minimum number of joint deletions, insertions, and replacements that are necessary to transform the predicted skeleton into the reference one.
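Measures (a), (d), and (e) can be sketched with NumPy and SciPy's `linear_sum_assignment` for the Hungarian matching. Two readings are assumed here and may differ from the evaluation code: a matched pair counts toward both the predicted and the reference side of the IoU numerator, and the tolerance test uses the distance to the matched partner. The per-joint tolerances are taken as given.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cd_j2j(pred, ref):
    """Symmetric joint-to-joint Chamfer distance (average of both directions)."""
    dist = np.linalg.norm(pred[:, None, :] - ref[None, :, :], axis=-1)
    return float(0.5 * (dist.min(axis=1).mean() + dist.min(axis=0).mean()))

def joint_iou_precision_recall(pred, ref, tol):
    """IoU, precision and recall of predicted joints against reference joints.

    pred: (P, 3) predicted joints
    ref:  (R, 3) reference joints
    tol:  (R,) tolerance per reference joint (e.g. half the local shape diameter)
    """
    dist = np.linalg.norm(pred[:, None, :] - ref[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(dist)        # one-to-one matching
    hits = int((dist[rows, cols] < tol[cols]).sum())
    iou = 2.0 * hits / (len(pred) + len(ref))
    return iou, hits / len(pred), hits / len(ref)
```

Unmatched joints never appear in `rows` or `cols`, so they lower precision or recall simply by enlarging the denominator.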
Table 1: Skeleton prediction results of competing methods.
Method           IoU     Prec.   Rec.    CD-J2J  CD-J2B  CD-B2B
Pinocchio        36.5%   38.7%   35.9%   7.2%    5.5%    4.7%
Xu et al. 2019   53.7%   53.9%   55.2%   4.5%    2.9%    2.6%
Ours             61.6%   67.6%   58.9%   3.9%    2.4%    2.2%
To evaluate skinning, we use the reference skeletons for all methods, and measure similarity between predicted and reference skinning maps:
(a) Precision & Recall are measured by finding the set of bones that influence each vertex significantly, where influence corresponds to a skinning weight larger than a threshold (as described in [Liu et al., 2019]). Precision is the fraction of influential bones under the predicted skinning that are also influential under the reference skinning. Recall is the fraction of influential bones under the reference skinning that are also found by the predicted skinning.
(b) L1-norm measures the L1 norm of the difference between the predicted skinning weight vector and the reference one for each mesh vertex. We compute the average L1-norm over each test mesh.
(c) dist measures the Euclidean distance between the positions of vertices deformed based on the reference skinning and the predicted one. To this end, given a test shape, we generate different random poses, and compute the average and max distance error over the mesh vertices.
All the above skeleton and skinning evaluation measures are computed for each test shape, then averaged over the test split.
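Measures (a) and (b) for skinning reduce to a few array operations. In the sketch below the influence threshold is a required argument because its value is not restated here; treating a vertex with no influential bones as contributing zero, and the function name `skinning_metrics`, are assumptions of the example.

```python
import numpy as np

def skinning_metrics(pred, ref, threshold):
    """Per-mesh precision, recall and average L1-norm of skinning weights.

    pred, ref: (V, B) predicted and reference skinning weights
    threshold: weight above which a bone counts as influencing a vertex
    """
    pred_inf = pred > threshold
    ref_inf = ref > threshold
    both = np.logical_and(pred_inf, ref_inf).sum(axis=1)
    # Guard against division by zero for vertices with no influential bone.
    precision = both / np.maximum(pred_inf.sum(axis=1), 1)
    recall = both / np.maximum(ref_inf.sum(axis=1), 1)
    l1 = np.abs(pred - ref).sum(axis=1)
    return float(precision.mean()), float(recall.mean()), float(l1.mean())
```

The returned values are per-mesh averages over vertices, which would then be averaged again over the test split.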
Table 2: Skinning prediction results of competing methods.
Method          Prec.   Rec.    avg L1  avg dist  max dist
BBW             68.3%   77.6%   0.69    0.0061    0.055
GeoVoxel        72.8%   75.1%   0.65    0.0057    0.049
NeuroSkinning   76.3%   74.7%   0.57    0.0053    0.043
Ours            82.3%   80.8%   0.39    0.0041    0.032
Competing methods.
For skeleton prediction, we compare our method with Pinocchio [Baran and Popović, 2007] and [Xu et al., 2019]. Pinocchio fits a template skeleton for each model. The template is automatically selected among a set of predefined ones (humanoid, short quadruped, tall quadruped, and centaur) by evaluating the fitting cost for each of them, and choosing the one with the least cost. [Xu et al., 2019] is a learning method trained on the same split as ours, with hyperparameters tuned in the same validation split. For skinning weights prediction, we compare with the Bounded Biharmonic Weights (BBW) method [Jacobson et al., 2011], NeuroSkinning [Liu et al., 2019] and the geometric method from [Dionne and de Lasa, 2013], called “GeoVoxel”. For the BBW method, we adopt the implementation from libigl [Jacobson et al., 2018], where the mesh is first tetrahedralized, then the bounded biharmonic weights are computed based on this volume discretization. For NeuroSkinning, we trained the network on the same split as ours and optimized its hyperparameters in the same holdout validation split. For GeoVoxel, we adopt Maya’s implementation [Autodesk, 2019], which outputs skinning weights based on a hand-engineered function of volumetric geodesic distances. We set the max influencing bone number, weight pruning threshold, and dropoff parameter through holdout validation in our validation split (3 bones, 0.3 pruning threshold, and 0.5 dropoff).
Comparisons.
Table 1 reports the evaluation measures for skeleton extraction for the competing techniques. Our method outperforms the rest according to all measures. This is also shown in Figure 11, which plots IoU on the y-axis for different tolerance levels (multipliers of local shape diameter) on the x-axis.
Table 3: Joint prediction ablation study.
Variant            IoU     Prec.   Rec.    CD-J2J  CD-J2B  CD-B2B
P2PNet-based       40.6%   41.6%   42.0%   6.3%    4.6%    3.8%
No attn            52.4%   50.9%   50.7%   4.6%    3.1%    2.7%
One-ring           59.7%   65.6%   57.4%   4.1%    2.5%    2.4%
No vertex loss     59.3%   58.2%   57.6%   4.2%    2.7%    2.5%
No attn pretrain   60.6%   64.0%   58.1%   4.2%    2.6%    2.4%
Full               61.6%   67.6%   58.9%   3.9%    2.4%    2.2%
Figure 8 visualizes reference skeletons and predicted ones for different methods for some characteristic test shapes. We observe that our method tends to output skeletons whose joints and bones are closer to the reference ones. [Baran and Popović, 2007] often produces implausible skeletons when the input model has parts (e.g., tail, clothing) that do not correspond well to the used template. [Xu et al., 2019] tends to misplace joints around areas, such as elbows and knees, since voxel grids tend to lose surface detail.
Table 2 reports the evaluation measures for skinning. Our numerical results are significantly better than BBW, NeuroSkinning, and GeoVoxel according to all the measures. Figure 9 visualizes the skinning weights produced by our method, GeoVoxel, and NeuroSkinning, which were found to be the best alternatives according to our numerical evaluation. Ours tends to agree more with the artist-specified skinning. In the top example, the arms are close to the torso in terms of Euclidean distance, and to some degree also in a geodesic sense. Both NeuroSkinning and GeoVoxel over-extend the skinning weights to a larger area than the arm. In order to match GeoVoxel’s output to the artist-created one, all its parameters need to be manually tuned per test shape, which is laborious. Our method combines bone representations and vertex-skeleton intrinsic distances in our mesh network to produce skinning that better separates articulating parts. In the bottom example, a jaw joint is placed close to the lower lip to control the jaw animation. Most vertices on the front face are close to this joint in terms of both geodesic and Euclidean distances. This results in higher errors for both NeuroSkinning and GeoVoxel, even if the latter is manually tuned. Our method produces a sharper map capturing the part of the jaw.
Table 4: Connectivity prediction ablation study.
Variant                          Class. Acc.  CD-B2B  ED
Euclidean edge cost              61.2%        0.30%   5.0
bone descriptor only             71.9%        0.22%   4.2
bone descriptor+skel. geometry   80.7%        0.12%   2.9
Full stage                       83.7%        0.10%   2.4
Table 5: Skinning prediction ablation study.
Variant          Prec.   Rec.    avg L1  avg dist  max dist
No geod. dist.   80.0%   79.3%   0.41    0.0044    0.054
Ours             82.3%   80.8%   0.39    0.0041    0.032
Ablation study.
We present the following ablation studies to demonstrate the influence of different design choices of our method.
(a) Joint prediction ablation study: Table 3 presents an evaluation of variants of our joint detection stage trained in the same split and tuned in the same holdout validation split as our original method. We examined the following variants: “P2PNet-based” uses the same architecture as P2PNet [Yin et al., 2018], which relies on PointNet [Qi et al., 2017] for displacing points (vertices in our case). After displacement, mean shift clustering is used to extract joints as in our method. We experimented with the loss from their approach, and also the same loss as in our joint detection stage (excluding the attention mask loss, since P2PNet does not use attention). The latter choice worked better. The architecture was trained and tuned in the same split as ours.
“No attn” is our method without the attention module, thus all vertices have the same weight during clustering. “One-ring” is our method where GMEdgeConv uses only one-ring neighbors of each vertex without considering geodesic neighborhoods. “No vertex loss” does not use vertex displacement supervision with the Chamfer distance loss of Eq. 11 during training. It uses supervision from clustering only, based on the loss of Eq. 10. “No attn pretrain” does not pretrain the attention network with our created binary mask. We observe that removing any of these components, or using an architecture based on P2PNet, leads to a noticeable performance drop. In particular, the attention module has a significant influence on the performance of our method.
(b) Connectivity prediction ablation study. Table 4 presents an evaluation of alternative choices for our BoneNet. In these experiments, we examine the performance of the connectivity module when it is given as input the reference joints instead of the predicted ones. In this manner, we specifically evaluate the design choices for the connectivity stage, i.e., our evaluation here is not affected by any wrong predictions of the joint detection stage. Here, we report the binary classification accuracy (“Class. Acc.”), i.e., whether the prediction to connect each pair of given joints agrees with the ground-truth connectivity. We also report edit distance (ED) and bone-to-bone Chamfer distance (CD-B2B), since these measures are specific to bone evaluation. We first show the performance when the MST connects joints based on Euclidean distance as cost (see “Euclidean edge cost”). We also evaluate the effect of using only the bone descriptor without the skeleton geometry encoding and without the shape encoding (see “bone descriptor only”, and Eq. 7). We also evaluate the effect of using the bone descriptor with the skeleton geometry encoding but without the shape encoding (see “bone descriptor+skel. geometry”). The best performance is achieved when all three shape, skeleton, and bone representations are used as input to BoneNet.
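The “Euclidean edge cost” baseline can be pictured as a minimum spanning tree over pairwise joint distances. The sketch below uses a dense version of Prim's algorithm; the function name `prim_mst` and the choice of root are assumptions of the example, and the same routine could in principle accept any other symmetric cost matrix in place of Euclidean distances.

```python
import numpy as np

def prim_mst(cost, root=0):
    """Minimum spanning tree over a dense symmetric cost matrix (Prim).

    cost: (N, N) float array of pairwise edge costs
    Returns a list of (parent, child) edges in the order they are added.
    """
    n = len(cost)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[root] = True
    best = cost[root].astype(float)                 # cheapest link into the tree
    parent = np.full(n, root)
    edges = []
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best)
        j = int(np.argmin(candidates))              # closest joint outside the tree
        edges.append((int(parent[j]), j))
        in_tree[j] = True
        closer = cost[j] < best
        parent[closer & ~in_tree] = j
        best = np.minimum(best, cost[j])
    return edges

joints = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0], [2.0, 1, 0]])
euclid = np.linalg.norm(joints[:, None, :] - joints[None, :, :], axis=-1)
bones = prim_mst(euclid, root=0)
```

A purely geometric cost of this kind connects whichever joints are nearest, regardless of whether a bone between them is anatomically plausible, which is one intuition for why it serves as the weakest baseline in the ablation.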
We also observed the same trend in RootNet, where we evaluate the accuracy of predicting the root joint correctly. Skipping the skeleton geometry and shape encoding results in accuracy of . Adding the skeleton encoding increases it to . Using all three shape, skeleton, and joint representations achieves the best accuracy of .
(c) Skinning prediction ablation study. Table 5 presents the case of removing the volumetric geodesic distance feature from input to our skinning prediction network. We observe a noticeable performance drop without it. Still, it is interesting to see that even without it, our method is better than competing methods (Table 2). We also experimented with different choices of i.e., the number of closest bones used in our skinning prediction. Fig.12 shows the average L1norm difference of skinning weights for in our test set. Lowest error is achieved when (we noticed the same behavior and minimum in our validation split).
7. Limitations and Conclusion
We presented a method that automatically rigs input 3D character models. To the best of our knowledge, our method represents a first step towards a learning-based, complete solution to character rigging, including skeleton creation and skin weight prediction. We believe that our method is practical in various scenarios. First, it is useful for casual users or novices, who might not have the training or expertise to deal with modeling and rigging interfaces. Another motivation for using our method is the widespread effort for democratization of 3D content creation and animation that we currently observe in online asset libraries provided with modern game engines (e.g., Unity). We see our approach as one such step towards further democratization of character animation. Another scenario of use for our method is when a large collection of 3D characters needs to be rigged. Processing every single model manually would be cumbersome even for experienced artists.
Our approach has limitations that suggest exciting avenues for future work. First, our method currently uses a per-stage training approach. Ideally, the skinning loss could be backpropagated to all stages of the network to improve joint prediction. However, this implies differentiating volumetric geodesic distances and skeletal structure estimation, which are hard tasks. Although we trained our method such that it is more robust to different vertex sampling and tessellations, invariance to mesh resolution and connectivity is not guaranteed. Investigating the performance of other mesh neural networks (e.g., spectral) here could be impactful. There are a few cases where our method produces undesirable effects, such as putting extra arm joints (Figure 13, top). Our dataset also has limitations. It contains one rig per model. Many rigs often do not include bones for small parts, like feet, fingers, clothing and accessories, which makes our trained model less predictive of these joints (Figure 13, bottom). Enriching the dataset with more rigs could improve performance, though it might make the mapping more multi-modal than it is at present. A multi-resolution approach that refines the skeleton in a coarse-to-fine manner may instead be fruitful. Our current bandwidth parameter explores one mode of variation. Exploring a richer space to interactively control skeletal morphology and resolution is another interesting research direction. Finally, it would also be interesting to extend our method to handle skeleton extraction for point cloud recognition or reconstruction tasks.
Acknowledgements.
This research is partially funded by NSF (EAGER1942069) and NSERC. Our experiments were performed in the UMass GPU cluster obtained under the Collaborative Fund managed by the Massachusetts Technology Collaborative. We thank Gopal Sharma, Difan Liu, and Olga Vesselova for their help and valuable suggestions. We also thank anonymous reviewers for their feedback.
References
 Surface reconstruction by voronoi filtering. In Proc. Symposium on Computational Geometry, Cited by: §2.
 Computing and simplifying 2d and 3d continuous skeletons. Comput. Vis. Image Underst. 67 (3). Cited by: §2.
 Skeleton extraction by mesh contraction. ACM Trans. on Graphics 27 (3). Cited by: §2.
 Maya, version. Note: www.autodesk.com/products/autodeskmaya/ Cited by: §6.
 Fast and deep deformation approximations. ACM Trans. on Graphics 37 (4). Cited by: §2.
 Spline interface for intuitive skinning weight editing. ACM Trans. on Graphics 37 (5). Cited by: §2.
 Automatic rigging and animation of 3d characters. ACM Trans. on Graphics 26 (3). Cited by: §1, §1, §2, §2, §6, §6.
 Interaction networks for learning about objects, relations and physics. In Proc. NIPS, Cited by: §2.
 Biological shape and visual science (part i). Journal of Theoretical Biology 38 (2). Cited by: §2.

 Learning shape correspondence with anisotropic convolutional neural networks. In Proc. NIPS, Cited by: §2.
 Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34 (4). Cited by: §2.
 Spectral networks and locally connected networks on graphs. In Proc. ICLR, Cited by: §2.
 Point cloud skeletons via laplacian based contraction. In Proc. SMI, Cited by: §2.
 Mean shift, mode seeking, and clustering. IEEE Trans. Pat. Ana. & Mach. Int. 17 (8). Cited by: §4.1.
 Geodesics in heat: a new approach to computing distance based on heat flow. ACM Trans. on Graphics 32 (5). Cited by: §4.3.
 Convolutional neural networks on graphs with fast localized spectral filtering. arXiv:1606.09375. Cited by: §2.
 Object categorization: computer and human vision perspectives. Cited by: §2.
 Geodesic binding for degenerate character geometry using sparse voxelization. IEEE Trans. Vis. & Comp. Graphics 20 (10). Cited by: §2.
 Geodesic voxel binding for production character meshes. In Proc. SCA, Cited by: §2, §4.3, §4.3, §6.
 Pointtopoint regression pointnet for 3d hand pose estimation. In Proc. ECCV, Cited by: §2.
 Inductive representation learning on large graphs. In Proc. NIPS, Cited by: §2.
 Representation learning on graphs: methods and applications. IEEE Data Eng. Bull. 40 (3). Cited by: §2.
 MeshCNN: a network with an edge. ACM Trans. on Graphics 38 (4). Cited by: §2.
 Towards viewpoint invariant 3d human pose estimation. In Proc. ECCV, Cited by: §1, §2.
 Deep convolutional networks on graphstructured data. arXiv:1506.05163. Cited by: §2.
 Topology matching for fully automatic similarity estimation of 3d shapes. In Proc. ACM SIGGRAPH, Cited by: §2.
 Structureaware 3d hourglass network for hand pose estimation from single depth image. In Proc. BMVC, Cited by: §1, §2.
 L1medial skeleton of point cloud. ACM Trans. on Graphics 32 (4). Cited by: §2.
 Bounded biharmonic weights for realtime deformation. ACM Trans. on Graphics 30 (4). Cited by: §2, §4.3, §6.
 libigl: a simple C++ geometry processing library. Note: https://libigl.github.io/ Cited by: §6.
 Skinning mesh animations. ACM Trans. on Graphics. Cited by: §2.
 NASA: neural articulated shape approximation. arXiv:1912.03207. Cited by: §2.
 Hierarchical mesh decomposition using fuzzy clustering and cuts. ACM Trans. on Graphics 22 (3). Cited by: §2.
 Skinning with dual quaternions. In Proc. I3D, Cited by: §2.
 Elasticityinspired deformers for character articulation. ACM Trans. on Graphics 31 (6). Cited by: §2.
 Spherical blend skinning: a realtime deformation of articulated models. In Proc. I3D, Cited by: §2.
 Datadriven physics for human soft tissue animation. ACM Trans. on Graphics 36 (4). Cited by: §2.
 Semisupervised classification with graph convolutional networks. arXiv:1609.02907. Cited by: §2.
 Projective skinning. Vol. 1. Cited by: §2.
 Fast projective skinning. In Proc. MIG, Cited by: §2.
 Recurrent pixel embedding for instance grouping. In Proc. CVPR, Cited by: §4.1.
 Robust and accurate skeletal rigging from mesh sequences. ACM Trans. on Graphics 33 (4). Cited by: §2.
 Realtime skeletal skinning with optimized centers of rotation. ACM Trans. on Graphics 35 (4). Cited by: §2.
 Gated graph sequence neural networks. Cited by: §2.
 NeuroSkinning: automatic skin binding for production characters with deep graph networks. ACM Trans. on Graphics. Cited by: §1, §2, §3, §4.1, §6, §6.
 SMPL: a skinned multiperson linear model. ACM Trans. on Graphics 34 (6). Cited by: §2.
 DeepWarp: dnnbased nonlinear deformation. IEEE Trans. Vis. & Comp. Graphics. Cited by: §2.
 Jointdependent local deformations for hand animation and object grasping. In Proc. Graphics Interface ’88, Cited by: §2.
 Representation and recognition of the spatial organization of threedimensional shapes. Royal Society of London. Series B, Containing papers of a Biological character 200. Cited by: §2.
 Geodesic convolutional neural networks on riemannian manifolds. In Proc. ICCV Workshops, Cited by: §2.
 The modelsresource, https://www.modelsresource.com/. Cited by: §6.
 Geometric deep learning on graphs and manifolds using mixture model cnns. In Proc. CVPR, Cited by: §2.
 V2Vposenet: voxeltovoxel prediction network for accurate 3d hand and human pose estimation from a single depth map. In Proc. CVPR, Cited by: §1, §2.
 Efficient dynamic skinning with lowrank helper bone controllers. ACM Trans. on Graphics 35 (4). Cited by: §2, §2.
 Stacked hourglass networks for human pose estimation. In Proc. ECCV, Cited by: §2.
 Coarsetofine volumetric prediction for singleimage 3d human pose. In Proc. CVPR, Cited by: §1, §2.
 Shortest connection networks and some generalizations. The Bell Systems Technical Journal 36 (6). Cited by: §4.2.
 PointNet++: deep hierarchical feature learning on point sets in a metric space. Cited by: §4.2, §6.
 Learning bidirectional lstm networks for synthesizing 3d mesh animation sequences. arXiv:1810.02042. Cited by: §2.
 The graph neural network model. IEEE Trans. on Neural Networks 20 (1). Cited by: §2.
 Consistent mesh partitioning and skeletonisation using the shape diameter function. Visual Computer 24 (4). Cited by: §6.
 Realtime human pose recognition in parts from single depth images. In Proc. CVPR, Cited by: §1.
 Training regionbased object detectors with online hard example mining. In Proc. CVPR, Cited by: §5.2.
 Realistic biomechanical simulation and control of human swimming. ACM Trans. on Graphics 34 (1). Cited by: §2.
 Medial representations: mathematics, algorithms and applications. 1st edition, Springer Publishing Company, Incorporated. External Links: ISBN 1402086571 Cited by: §2.
 Shock graphs and shape matching. Int. J. Comp. Vis. 35 (1). Cited by: §2.
 Wires: a geometric deformation technique. In Proc. ACM SIGGRAPH, Cited by: §2.
 Earth mover’s distances on discrete surfaces. ACM Trans. on Graphics 33 (4). Cited by: §4.3.
 3D skeletons: a stateoftheart report. Computer Graphics Forum. Cited by: §2.
 Curve skeleton extraction from incomplete point cloud. ACM Trans. on Graphics 28 (3). Cited by: §2.
 Dense 3d regression for hand pose estimation. In Proc. CVPR, Cited by: §2.
 Dynamic graph cnn for learning on point clouds. ACM Trans. on Graphics. Cited by: §2, §2, §4.1.
 Bone glow: an improved method for the assignment of weights for mesh deformation. In Proc. the 5th International Conference on Articulated Motion and Deformable Objects, Cited by: §2.
 A comprehensive survey on graph neural networks. arXiv:1901.00596. Cited by: §2.
 Liex: depth image based articulated object pose estimation, tracking, and action recognition on lie groups. Int. J. Comp. Vis. 123 (3). Cited by: §1.
 Predicting animation skeletons for 3d articulated models via volumetric nets. In Proc. 3DV, Cited by: §1, §2, §4.1, §6, §6, §6, §6.
 SyncSpecCNN: synchronized spectral CNN for 3D shape segmentation. In Proc. CVPR, Cited by: §2.
 P2Pnet: bidirectional point displacement net for shape transform. ACM Trans. on Graphics 37 (4). Cited by: §2, §4.1, §4.1, §6.
 FORMS: a flexible object recognition and modeling system. Int. J. Comp. Vis. 20. Cited by: §2.
Appendix A. Architecture details
Table 6 lists the layers used in each stage of our architecture along with the size of their output maps. We also note that our project page with source code, datasets, and supplementary video is available at:
https://zhanxu.github.io/rignet.
Table 6: Layers used in each stage of our architecture.
Joint Prediction Stage
Layers  Input  Output
GMEdgeConv
GMEdgeConv
GMEdgeConv
concat()
MLP([832, 1024])
max_pooling & tile
concat()
MLP([1859, 1024, 256, 3])
Connectivity Stage
GMEdgeConv
GMEdgeConv
GMEdgeConv
concat()
MLP([448, 512, 256, 128])
max_pooling & tile
MLP([3, 64, 128, 1024])
max_pooling & tile
MLP([1024, 256, 128])
MLP([8, 32, 64, 128, 256])
concat()
MLP([512, 128, 32, 1])
Skinning Stage
MLP([38, 128, 64])
GMEdgeConv
max_pooling & tile
MLP([512, 512, 1024])
GMEdgeConv
GMEdgeConv
concat()
MLP([1280, 1024, 512, 5])