Active Learning of Neural Collision Handler for Complex 3D Mesh Deformations

by Qingyang Tan, et al.
University of Maryland

We present a robust learning algorithm to detect and handle collisions in 3D deforming meshes. Our collision detector is represented as a bilevel deep autoencoder with an attention mechanism that identifies colliding mesh sub-parts. We use a numerical optimization algorithm to resolve penetrations guided by the network. Our learned collision handler can resolve collisions for unseen, high-dimensional meshes with thousands of vertices. To obtain stable network performance in such large and unseen spaces, we progressively insert new collision data based on the errors in network inferences. We automatically label these data using an analytical collision detector and progressively fine-tune our detection networks. We evaluate our method for collision handling of complex, 3D meshes coming from several datasets with different shapes and topologies, including datasets corresponding to dressed and undressed human poses, cloth simulations, and human hand poses acquired using multiview capture systems. Our approach outperforms supervised learning methods and achieves 93.8-98.1% accuracy compared to the groundtruth computed by analytic methods. Compared to prior learning methods, our approach results in a 5.16%-25.50% lower false negative rate in terms of collision checking and a 9.65%-58.91% higher success rate in collision handling.





1 Introduction

Learning to model or simulate deformable meshes is becoming an important topic in computer vision and computer graphics, with rich applications in real-time physics simulation (Holden et al., 2019), animation synthesis (Qiao et al., 2020), and cross-domain model transformation (Cudeiro et al., 2019). Central to these methods are generative models that map high-dimensional deformed 3D meshes with rich details into low-dimensional latent spaces. These generative models can be trained from high-quality groundtruth datasets, and they infer visually or physically plausible meshes in real time. These 3D datasets can also be generated using physics simulations (Narain et al., 2012; Tang et al., 2012) or reconstructed from the physical world using multi-view capture systems (Smith et al., 2020). In general, 3D deformable meshes are more costly to acquire, so 3D mesh datasets typically come in smaller sizes than image or text datasets. Inference models trained using such small datasets can suffer from over-fitting and generate meshes with various visual artifacts. For example, human pose embedding networks (Tan et al., 2018b; Gao et al., 2018) can produce excessive deformations, and interaction networks (Battaglia et al., 2016) can result in non-physically-based object motions.

Our main goal is to resolve a major source of visual artifacts: self-collisions. Instead of acquiring more data, we argue that domain-specific knowledge can be utilized to significantly improve the accuracy of inference models. There have been several prior research efforts along this line. For example, (Yang et al., 2020b) exploited the fact that near-articulated meshes can be divided into multiple components and trained a recursive autoencoder to stitch the components together. (Zheng et al., 2021) utilized the locality of secondary physics motions to learn re-targetable and scalable real-time dynamics animation. Recently, (Tan et al., 2021) studied learning-based collision avoidance for 3D meshes corresponding to human poses. They proposed a deep architecture to detect collisions and used numerical optimization to resolve the detected collisions. However, (Tan et al., 2021) required a large mesh dataset to obtain stable performance of neural collision detection. Indeed, a deformed 3D mesh typically involves a large number of elements (voxels, points, triangles), and any pair of elements can collide. Therefore, a huge amount of data is required to present the inference model with enough examples of collisions between all possible element pairs.

Main Results: We present a robust method to train a neural collision handler for complex 3D deformable meshes using active learning. Our key observation is that the distribution of penetrating meshes can have a long tail, and active learning is an effective method for modeling the tail (Geifman and El-Yaniv, 2017). Specifically, most penetrating meshes have large overlapping patches, constituting the central part of the distribution, but many other meshes have small patches that penetrate each other, forming the tail. In fact, most 3D mesh datasets do not focus on generating samples in the tail and cannot be used to train stable collision detectors. To overcome these issues, our approach combines three main ideas: 1) We use active learning to progressively insert new samples into the dataset; with the help of exact collision detectors, collision labels for the new samples are generated automatically. 2) We use a risk-seeking approach to prioritize samples in the tail, so that the inserted samples best help the neural collision detector improve its accuracy. 3) We use different loss functions for samples far from and close to the decision boundary. Our overall approach is shown in fig:pipeline. We apply our training method to the same neural collision detector as (Tan et al., 2021), and we compare with their supervised learning approach on five deformable mesh datasets corresponding to dressed and undressed human poses, cloth simulations, and human hand poses. By comparison, our approach exhibits much higher data efficacy: it achieves higher accuracy in terms of collision detection while using far fewer training samples. Given a dataset of the same size, our neural collision pipeline reduces the false negative rate on average, and we successfully resolve more self-colliding meshes. Overall, ours is the first practical method for neural collision handling for complex 3D meshes.

2 Related Work

Our method is designed for general deformable meshes with fixed topology. We first learn a latent space of meaningful mesh deformations and then use active learning to train a neural collision detector that identifies a collision-free subspace. We review related work in these areas.

Generative Model of Dense 3D Shapes: Categorized by shape representations, generative models can be based on point clouds (Qi et al., 2017), volumetric grids (Wu et al., 2015), multi-charts Groueix et al. (2018), surface meshes (Tan et al., 2018a), or semantic data structures Liu et al. (2019). We use mesh-based representations with fixed topologies, because most collision detection libraries are designed for meshes. Note that this choice excludes several applications that require general meshes of changing topology, e.g., for modeling meshes of hierarchical structure (Yu et al., 2019) or modeling scenes with many objects (Ritchie et al., 2019). However, these applications typically involve only static meshes with no need for collision detection. There is a separate research direction on domain-specific mesh deformation representation, e.g., SMPL/STAR human models (Loper et al., 2015; Osman et al., 2020), wrinkle-enhanced cloth meshes (Lahner et al., 2018), and skeletal skinning meshes (Xu et al., 2020). There are even prior works (Fieraru et al., 2020, 2021; Muller et al., 2021) on collision detection and handling for human bodies. These domain-specific methods are typically more accurate than our representation, but by assuming general meshes, our representation can be applied to multiple domains as shown in sec:evaluation.

Collision Prediction & Handling: Although collision detection has been well studied and mature software packages are available, detecting and handling self-collisions can still be a non-trivial computational burden for large meshes. Prior methods (Pan et al., 2012; Kim et al., 2018; Govindaraju et al., 2005) use spatial hashing, bounding volume hierarchies, and GPU to accelerate the computation by pruning non-colliding primitives, but they cannot be generalized to learning methods. Handling collisions is even more challenging, and prior physically-based methods either use penalty forces coupled with discrete collision detectors (Tang et al., 2012) or hard constraints coupled with continuous collision detectors (Narain et al., 2012). All these methods rely on physics-based constraints to handle collisions. Recently, many learning methods such as (Gundogdu et al., 2019; Patel et al., 2020) have been designed to predict the cloth movement or deformation in 3D, but they do not perform collision handling explicitly.

Active Learning: An active learner alternates between drawing new samples and exploiting existing samples. These samples can be drawn guided by an acquisition function in Bayesian optimization (Niculescu et al., 2006) or from an expert algorithm (De Raedt et al., 2018). Active learning has been applied to approximate the boundary of the configuration space (Pan et al., 2013; Tian et al., 2016; Das et al., 2017), where the feasible domain of collision constraints is parameterized using kernel SVM. However, this method is limited to rigid or articulated deformation and is not applicable to general 3D deformations. More broadly, active learning has been adopted in various prior works to accelerate data labeling in image classification (Gal et al., 2017) and object detection (Aghdam et al., 2019) tasks. These methods progressively identify unlabeled images to be forwarded to experts for labelling. An alternative method for selecting the samples is identifying a coreset (Paul et al., 2014), and authors of (Sener and Savarese, 2018) propose a practical algorithm for coreset identification via k-center clustering. While these methods consider a pre-existing set of unlabeled images for training discriminative models, we assume a continuous latent space of samples for training generative models and use a risk-seeking method to identify critical new samples. Further, we automatically label the new samples using analytical collision detection algorithms, making the training method fully automatic.

Variable   Definition
           graph with vertices and edges
           encoder and decoder networks
           learnable parameters
           autoencoder latent code
           latent region of sampling
PD         penetration depth
CSE        global collision state encoder
CP         local collision state predictor
           collision classifier
           neural collision indicator
ACAP       feature transform function
           neural collision network parameters
           collision state label
           collision handler objective function
           dataset for learning the autoencoder
           dataset for learning the collision detector
           threshold for boundary samples
CE         cross-entropy loss
           neural network losses
           weights for each loss

3 Neural Collision Handler

In this section, we briefly review the mesh-based generative model with a neural collision handler of (Tan et al., 2021), on which we build our active learning method. All notations are summarized in the symbol table.

A mesh is represented by a graph consisting of a set of vertices and a set of edges. We assume that all the meshes share the same topology; that is, they differ only in their vertex positions while the connectivity stays the same. We further limit ourselves to manifold triangle meshes, i.e., each edge is incident to at most two triangles, and two triangles are adjacent if and only if they share an edge. No other assumptions are made on the mesh deformation. We call a mesh self-collision-free if and only if no pair of non-adjacent triangles intersect each other. Our goal is to design a mesh-based generative neural architecture that takes as input a coordinate in the latent space and outputs a 3D mesh without self-collisions. The latent space is a low-dimensional space that can be mapped injectively to high-dimensional meshes using a learned decoder function (fig:pipeline right). Furthermore, the latent-to-mesh mapping is differentiable and supports the downstream applications explained in sec:app. Our method consists of two parts. First, we train a bilevel mesh autoencoder using supervised learning. Second, we train a neural collision handler using active learning and a special boundary loss.

3.1 Bilevel Autoencoder

Our bilevel autoencoder architecture maps a deformed mesh to two levels of latent codes. The latent codes are used both to recover the high-dimensional deformed mesh vertices and to predict whether the deformed mesh contains self-collisions. We only want to encode intrinsic mesh information such as curvatures, not extrinsic rigid transformations, because mesh shapes are invariant to extrinsic transformation. Therefore, we first use the as-consistent-as-possible (ACAP) feature transformation (Gao et al., 2019) to factor out rigid transformations. The ACAP feature vector is first passed through the level-1 autoencoder and mapped to a latent code. Since our dataset size is small, we use a shallow autoencoder to avoid over-fitting, which makes it subject to large embedding error. We further hypothesize that the error is sparsely distributed throughout the mesh vertices. Therefore, we use an attention mechanism trained with a sparsity prior to decompose the mesh into near-rigid sub-domains. The sparsity prior is designed such that each domain is mapped to a single axis of the latent space, i.e., a single entry of the level-1 code. Afterwards, a set of level-2 autoencoders is introduced to further reduce the error (fig:pipeline left), with each autoencoder dedicated to one entry of the level-1 code. The ultimate mesh is reconstructed by combining the level-1 and level-2 latent codes through the level-2 decoders, each with its own learnable parameters, and the mesh vertices are recovered by inverting the ACAP transformation. Correspondingly, the encoder maps the vertices of a mesh to the latent space, again with its own learnable parameters.
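To make the two-level scheme concrete, here is a minimal NumPy sketch of the reconstruction path. The feature and latent sizes are hypothetical, and random matrices stand in for the trained encoders and decoders (the real model uses deep networks on ACAP features):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 30 ACAP features, 6 level-1 entries (one per
# near-rigid sub-domain), 2 level-2 dims per entry.
N_FEAT, N_L1, N_L2 = 30, 6, 2
W_enc1 = rng.standard_normal((N_L1, N_FEAT)) * 0.1
W_dec1 = rng.standard_normal((N_FEAT, N_L1)) * 0.1
W_enc2 = rng.standard_normal((N_L1, N_L2, 1)) * 0.1   # one tiny encoder per entry
W_dec2 = rng.standard_normal((N_L1, 1, N_L2)) * 0.1   # one tiny decoder per entry

def encode(acap):
    z1 = np.tanh(W_enc1 @ acap)                        # level-1 latent code
    z2 = [np.tanh(W_enc2[i] @ z1[i:i + 1]) for i in range(N_L1)]
    return z1, z2

def decode(z1, z2):
    # Each level-2 decoder contributes a residual correction to the single
    # entry of the level-1 code its sub-domain is mapped to.
    residual = np.array([(W_dec2[i] @ z2[i])[0] for i in range(N_L1)])
    return W_dec1 @ (z1 + residual)                    # back to ACAP features

acap = rng.standard_normal(N_FEAT)
z1, z2 = encode(acap)
recon = decode(z1, z2)                                 # shape (30,)
```

The per-entry level-2 codes mirror the attention-based decomposition described above: each sub-domain refines only its own axis of the level-1 latent space.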

3.2 Neural Collision Detector

Figure 2: PD (red arrow) is the locally minimal translation for mesh B (orange) to be collision-free from mesh A.

Our neural collision detector predicts whether the mesh is subject to self-collisions using the latent information. The extent to which two meshes collide can be measured by the notion of Penetration Depth (PD) (Zhang et al., 2014), defined as the norm of the smallest configuration change needed for a mesh to be self-collision-free, as illustrated in fig:PD. It is well known that PD is a non-smooth function of the configuration (especially at the boundaries), thereby making it difficult to resolve collisions by minimizing PD. By choosing appropriate smooth activation functions (e.g., CELU in our case), we design the neural collision detector to be a differentiable approximation of PD. As a result, gradient information can be propagated to a collision handler to minimize PD.

Since collisions can happen between any pair of geometric mesh primitives, a collision detector should consider possible contacts between any pair of near-rigid sub-domains, leading to quadratic complexity in the number of sub-domains. We use a global-local detection architecture that effectively reduces the number of learnable parameters. Specifically, we introduce a global collision state encoder (CSE) and a set of local collision predictors (CP), one per sub-domain, each of which predicts whether its sub-domain is in collision with the rest of the mesh. Finally, the collision information from all local collision predictors is summarized using a classifier network to derive a single overall collision classifier, with its own learnable parameters. The feasible-space boundary of the collision-free constraints corresponds to the zero level-set of this classifier. This architecture combining CSE, CP, and an MLP classifier has a number of learnable parameters proportional to the number of sub-domains.
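A minimal sketch of the global-local architecture, with hypothetical sizes (K sub-domains, latent dimension Z, hidden dimension H) and random weights standing in for the trained networks; note that the parameter count grows linearly in K rather than quadratically, which is the point of the design:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes; random weights stand in for CSE / CP / classifier.
K, Z, H = 6, 12, 8
W_cse = rng.standard_normal((H, Z)) * 0.1      # global collision state encoder
W_cp = rng.standard_normal((K, 1, H)) * 0.1    # one local predictor per sub-domain
W_clf = rng.standard_normal((1, K)) * 0.1      # classifier over local scores

def celu(x, alpha=1.0):
    # Smooth activation keeps the detector differentiable everywhere.
    return np.where(x > 0, x, alpha * (np.exp(x / alpha) - 1.0))

def detector(z):
    g = celu(W_cse @ z)                                         # shared global feature
    local = np.array([celu(W_cp[i] @ g)[0] for i in range(K)])  # per-domain scores
    return float((W_clf @ local)[0])                            # sign gives the label

score = detector(rng.standard_normal(Z))
```

Because every local predictor consumes the same global feature, the total parameter count is H*Z + K*H + K, i.e. linear in the number of sub-domains.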

3.3 Optimization-Based Collision Response

Existing collision handling techniques (Tang et al., 2012; Narain et al., 2012) are mostly based on numerical optimization, where a key challenge is making sure that the collision constraints are differentiable. (Tan et al., 2021) also uses this formulation, but it is guided by the learned collision detector, which is differentiable by construction. Suppose we take as input a randomly sampled latent code, which might not satisfy the collision-free constraints. We then need to project that latent code back onto the feasible domain of collision-free meshes. We achieve this by solving an optimization problem (eq:handler) under neural collision-free constraints using the Augmented Lagrangian Method (ALM). The objective function can take multiple forms, as specified by downstream applications. In the simplest case, we take as input a desired latent code and define the objective as the distance to it, which only involves latent-space variables. As a more intuitive interface, the user might want to edit meshes in Cartesian space instead of the latent space; for example, if the user wants a human hand to be at a certain position, we could define the objective as the distance between the decoded hand position and that target. A desirable feature of eq:handler is an invariant problem size: however many vertices a mesh has, there is only one constraint, which guarantees high test-time performance. Moreover, it has been shown in (Sun and Yuan, 2006, Theorem 10.4.3) that ALM either finds a feasible solution or returns an infeasible solution that is closest to the boundary of the feasible domain. In other words, ALM always makes a best effort to resolve collisions, even if feasible solutions are not available.
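The projection step can be sketched as follows, assuming a toy differentiable constraint in place of the learned detector (feasible outside the unit circle). The PHR-style multiplier and penalty updates are a standard ALM recipe, not the paper's exact solver:

```python
import numpy as np

# Toy stand-in for the neural constraint: feasible iff f(z) <= 0.
def f(z):
    return 1.0 - np.dot(z, z)

def grad_f(z):
    return -2.0 * z

def alm_project(z_target, lr=0.05, outer=25, inner=200):
    """Project z_target onto {z : f(z) <= 0} while minimizing ||z - z_target||^2."""
    z, lam, mu = z_target.astype(float).copy(), 0.0, 1.0
    for _ in range(outer):
        step = lr / mu                       # shrink steps as the penalty grows
        for _ in range(inner):               # inner minimization (gradient descent)
            t = lam + mu * f(z)              # PHR augmented-Lagrangian term
            g = 2.0 * (z - z_target)
            if t > 0.0:
                g = g + t * grad_f(z)
            z = z - step * g
        lam = max(0.0, lam + mu * f(z))      # multiplier update
        mu = min(mu * 2.0, 50.0)             # bounded penalty growth
    return z

z0 = np.array([0.1, 0.1])                    # infeasible input latent code
z_star = alm_project(z0)                     # nearly feasible: f(z_star) ~ 0
```

Even when the feasible set is far from the input (as here), the iterates move to the nearest feasible point rather than failing outright, which mirrors the best-effort behavior cited above.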

4 Active Learning Algorithm

The goal of active learning is to iteratively improve the accuracy of the neural collision detector. We assume the availability of an existing dataset of “high-quality” meshes with deformed vertices, which is used to train the encoder/decoder pair via a reconstruction loss. We further assume all the groundtruth meshes are (nearly) self-collision-free. However, the autoencoder can still suffer from residual embedding error after training, and users might explore regions of the latent space that are not well covered by the training dataset. All these factors can lead to self-collisions, which should be recognized by our neural collision detector. Therefore, our network cannot be trained with the mesh dataset alone: it contains only negative (collision-free) samples, while the neural collision detector must learn the decision boundary between positive and negative samples. In other words, the network must be presented with enough samples to cover all possible latent codes, with both self-penetrating and collision-free meshes. We denote the training dataset of neural collision detectors as a separate set of latent codes paired with groundtruth collision state labels.

It has been shown in (Gal et al., 2017; Aghdam et al., 2019) that many data points in a large image dataset are similar and that human labeling of each point is time-consuming and involves redundant work. In our case, the groundtruth collision state label can be generated automatically using a robust algorithm such as (Pan et al., 2012) to compute PD; a positive PD indicates self-collisions, so the label can be defined as the indicator of positive PD. However, the cost of computing penetration depth is superlinear in the number of mesh vertices, and computing PD for an entire dataset can still be a computational bottleneck. Moreover, we are considering a continuous space of possible training data that cannot be enumerated. To alleviate the computational burden, we design a three-stage method, as illustrated in fig:pipeline. During the first bootstrap stage, we sample an initial boundary set with which we train the detector to approximate the true decision boundary. In the second stage of data augmentation, new training data is selected and progressively injected into the dataset. Finally, in the third stage, our neural collision detector is updated to fit the augmented dataset. The criterion for selecting new samples is critical to the performance of active learning. We observe that our neural collision predictor is used as a constraint for nonlinear optimization methods, so samples far from the boundary are not used by the optimizer and only the boundary of the feasible domain (gray area in fig:pipeline right) is useful. Therefore, we propose using a Newton-type risk-seeking method to push samples towards the decision boundary. We provide more details for each step below.

4.1 Bootstrap

Active learning progressively populates the collision dataset, so prior work (Aghdam et al., 2019) simply initializes the dataset to an empty set. However, we find that a good initial guess can significantly improve the convergence of training. This is because we select new data by moving (randomly sampled) latent codes towards the decision boundary of the PD function using a risk-seeking method. However, the true boundary of the collision-free constraints corresponds to the boundary of C-obstacles, which is high-dimensional and unknown to us (PD is a non-smooth function, so we cannot even use gradient information to project a mesh onto the zero level-set of PD). Instead, we propose using the learned neural decision boundary, i.e., the zero level-set of the collision classifier, as an approximation. If we initialized the dataset to an empty set, the surrogate decision boundary would be undefined, and the training might diverge or suffer from slow convergence. For our bootstrap training, we uniformly sample a small set of latent codes at random positions from the latent space and compute PD for each of them. We define a valid region of sampling by mapping all the data to their latent codes and computing a bounding box in the latent space.

We hypothesize that all the meshes can be embedded by our autoencoder with small error, with latent codes inside this bounding box, so we can initialize the dataset with these bootstrap samples. We then divide the data points into three subsets (illustrated in fig:pipeline right): the positive set, consisting of samples with penetrations deeper than a threshold; the negative set, consisting of collision-free samples; and a boundary set, whose samples are nearly collision-free and lie close to the decision boundary. We propose a dedicated boundary loss on the boundary set to approximate the decision boundary.
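The three-way partition can be sketched directly from a PD oracle; `eps` here is a hypothetical boundary threshold:

```python
def partition_samples(latent_codes, pd_fn, eps=1e-3):
    """Split sampled latent codes into positive / boundary / negative subsets.

    pd_fn(z) returns the penetration depth of the decoded mesh; eps is a
    hypothetical threshold separating deep penetrations from boundary cases.
    """
    positive, boundary, negative = [], [], []
    for z in latent_codes:
        pd = pd_fn(z)
        if pd > eps:
            positive.append(z)     # penetration deeper than the threshold
        elif pd > 0.0:
            boundary.append(z)     # nearly collision-free: near the boundary
        else:
            negative.append(z)     # collision-free
    return positive, boundary, negative

# Toy PD oracle: treat the latent code itself as a "depth".
pos, bnd, neg = partition_samples([0.5, 5e-4, -0.2], pd_fn=lambda z: z)
```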


4.2 Data Aggregation

The accuracy of our neural collision detector can be measured by the discrepancy between the surrogate decision boundary given by the classifier and the true decision boundary of PD, formulated as an expectation over the true decision boundary, where CE is the cross-entropy loss. However, it is very difficult to derive a sampled approximation of this metric because PD is a non-smooth function whose zero level-set is measure-zero, corresponding to the boundaries of C-obstacles. Instead, we propose taking the expectation over the surrogate decision boundary. Generally speaking, the zero level-set of the classifier can also be measure-zero, but we have designed our neural networks to be differentiable functions. As a result, we can always project samples onto the zero level-set by solving a risk-seeking unconstrained optimization. We adopt a quasi-Newton method and update each sample using the recursion eq:newton, in which the exact Hessian is replaced by a first-order approximation that is much faster to compute because it avoids the second-order term. In summary, during each iteration of data augmentation, we sample a new set from the previous dataset. For each sampled latent code, we apply the recursion eq:newton to project it onto the surrogate decision boundary until the relative change between consecutive iterations is sufficiently small. We also sample directly from the latent bounding box, using random samples to discover uncovered regions, which achieves a balance between exploitation and exploration. Finally, we classify each new sample into the positive, negative, or boundary subset according to eq:subset, using the penetration depth.
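For a scalar constraint, one standard first-order replacement of the Newton step is the Gauss-Newton update z ← z − f(z)·∇f/‖∇f‖², which needs no second-order term. The sketch below, with a toy detector in place of the learned one, illustrates this projection (not the paper's exact recursion):

```python
import numpy as np

def project_to_levelset(z, f, grad_f, tol=1e-6, max_iter=100):
    """Drive f(z) to 0 with Gauss-Newton steps z <- z - f(z) * g / ||g||^2.

    Uses only first-order information, mirroring the fast Hessian
    approximation in the text; f and grad_f are assumed smooth.
    """
    for _ in range(max_iter):
        v, g = f(z), grad_f(z)
        step = v * g / (np.dot(g, g) + 1e-12)
        z = z - step
        # Stop when the relative change between iterations is small.
        if np.linalg.norm(step) < tol * (np.linalg.norm(z) + 1e-12):
            break
    return z

# Toy detector: f(z) = ||z||^2 - 1, whose zero level-set is the unit circle.
f = lambda z: np.dot(z, z) - 1.0
grad = lambda z: 2.0 * z
z = project_to_levelset(np.array([2.0, 0.0]), f, grad)   # lands near [1, 0]
```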

4.3 Model Update

After the dataset has been updated, we fine-tune the neural collision detector by minimizing a weighted sum of three loss terms. The first term is a regularization that enforces consistency between the detector outputs and the true PD, penalizing both the domain-decomposed penetration depth defined in Tan et al. (2021) and the total penetration depth. The second term is a margin ranking loss that enforces the correct ordering of penetration depths to avoid over-fitting; the margin specifies the maximal allowable order violation, and superscripts distinguish the two samples of each pair drawn from the dataset. The third term measures the discrepancy between the detector and PD over the entire latent space. We update our neural collision detector with this combined objective by running a fixed number of training epochs, warm-starting from the last iteration of the model update. Our overall training method is illustrated in alg:activeLearning.
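The margin ranking term can be sketched for a single pair of samples; the margin value below is hypothetical:

```python
def margin_ranking_loss(score_a, score_b, pd_a, pd_b, margin=0.1):
    """Penalize detector scores whose ordering disagrees with the PD ordering.

    margin (the maximal allowable order violation) is a hypothetical value.
    """
    sign = 1.0 if pd_a > pd_b else -1.0      # groundtruth ordering of the pair
    return max(0.0, margin - sign * (score_a - score_b))

consistent = margin_ranking_loss(0.9, 0.1, pd_a=2.0, pd_b=0.5)   # -> 0.0
violated = margin_ranking_loss(0.1, 0.9, pd_a=2.0, pd_b=0.5)     # -> 0.9
```

The loss is zero whenever the detector ranks the deeper penetration higher by at least the margin, so it constrains only the ordering, not the absolute scores.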

1:  Prepare the initial mesh dataset
2:  Update the autoencoder using the reconstruction loss
3:  Initialize the collision dataset by drawing random samples from the latent region
4:  for each initial sample do
5:      Compute its penetration depth and collision label
6:  Update the detector using the bootstrap loss
7:  while not converged do
8:      Draw samples from the current dataset
9:      for each drawn sample do
10:         while true do
11:             Update the sample using eq:newton
12:             if the change is sufficiently small then
13:                 break
14:     Draw random samples from the latent region
15:     for each new sample do
16:         Compute its penetration depth and collision label
17:     Update the detector using the combined loss
Algorithm 1: Active learning of the neural collision detector
Figure 3: Representative examples of collision handling on the five datasets: (a) SCAPE; (b) MIT Swing; (c) MIT Jump; (d) Skirt; (e) Hand. For each example, we show the given self-penetrating mesh (left), our result (middle, Active+bd), and Supv+bd (right).
Figure 4: We plot the accuracy of the neural collision detector against the dataset size. The baselines are trained using the same amount of data. On average, ours achieves higher accuracy than Supv. From left to right: SCAPE, Swing, Jump, Skirt, and Hand.
Figure 5: We plot the false negative rate against the dataset size. The baselines are trained using the same amount of data. On average, ours achieves a lower false negative rate than Supv. From left to right: SCAPE, Swing, Jump, Skirt, and Hand.
Figure 6: We plot the success rate of the neural collision handler against the dataset size. Our method resolves more collisions than Supv+bd. From left to right: SCAPE, Swing, Jump, Skirt, and Hand.

5 Evaluation

Datasets: We evaluate our method on five types of datasets, as illustrated in fig:examples. The first three (SCAPE (Anguelov et al., 2005), MIT Swing (Vlasic et al., 2008), and MIT Jump (Vlasic et al., 2008)) contain human bodies performing different sets of actions and poses. We have also tested our method on a skirt dataset introduced by Yang et al. (2020a) that contains simulated skirt meshes synthesized by NVIDIA clothing tools. The skirt is deformable everywhere, and the dataset is rather small; obtaining stable performance in this case is challenging, and we observe reasonably good results using active learning. Finally, we introduce a custom dataset of human hand poses. We captured various hand poses and transitions between the poses in a multi-view capture system. We ran 3D reconstruction (Galliani et al., 2015) and 3D keypoint detection (Simon et al., 2017) on the captured images and registered a linear blend skinning model to each frame of the data (Gall et al., 2009).

         SCAPE    Swing   Jump    Skirt   Hand
         200000   10000   10000   5000    200000
         50000    5000    5000    5000    50000
Table 1: Sample counts used for each dataset.


We implement our method in PyTorch with the same network architecture as Tan et al. (2021) and perform experiments on a desktop machine with an NVIDIA RTX 2080Ti GPU. We begin by training the autoencoder using Adam for a fixed number of epochs. For neural collision detector training, unless otherwise stated, we use a fixed set of hyper-parameters. We perform bootstrap training by supervised learning on the initial data points. We then progressively inject data points into the dataset until the “elbow point” of the accuracy vs. sample size curve is reached, detected using Satopaa et al. (2011). For each experiment, we train the detector using Adam for a fixed number of epochs and choose a suitable boundary threshold per dataset. The sample counts used for each dataset are summarized in table:Ninit. During data aggregation, we terminate Newton’s method when the relative change falls below a fixed tolerance. For each subsequent iteration of active data augmentation, we fine-tune the detector using Adam. For collision handling, we run ALM until the constraints in eq:handler are satisfied.

Collision Detection: We compare our method with two baseline algorithms. The first one, Tan et al. (2021) (denoted Supv), uses the same network architecture as ours, with both the autoencoder and the collision detector trained using supervised learning: its training set is constructed from randomly sampled poses, and the boundary set with its associated loss (eq:bd_loss, sec:bd) is not used. Our second baseline (denoted Supv+bd) also trains both networks using supervised learning, but the boundary set and the loss in eq:bd_loss are used. Our proposed neural collision handling pipeline uses both active learning and boundary information and is denoted Active+bd. After each iteration of active data augmentation, we obtain an enlarged dataset for training the detector. For fairness, we re-train our two baselines after each iteration using an equal number of points sampled at random. For all methods, we use most of the data for training, and the rest is used as a validation set for hyperparameter tuning. For each dataset, we create a test set of samples unseen in the training stage to evaluate performance. The performance of a neural collision detector is evaluated with two metrics: the fraction of successful predictions (accuracy) and the fraction of times a self-penetrating mesh is erroneously predicted as collision-free (false negative rate). False negatives are more detrimental to our applications than false positives, since our collision handler only acts on positive samples. As illustrated in fig:iterationAccuracy and fig:iterationFalseNegativeRate, our method effectively improves both metrics. The performance after active learning is summarized in table:perf. We reach 93.8-98.1% accuracy compared to the groundtruth generated by the exact method of Pan et al. (2012), with up to 124x speedup. On average, our method achieves higher accuracy and a lower false negative rate than Supv. In the last row of table:perf, we measure an equivalent dataset size, defined as the size of the dataset needed by Supv+bd to achieve the same accuracy as our method; we derive this number by interpolating the experimental results of Supv+bd. Our method achieves a similar accuracy using a smaller dataset than Supv+bd on average.

metric                          SCAPE     Swing     Jump      Skirt     Hand
final dataset size
accuracy (Ours (Active+bd))     0.9383    0.9638    0.9552    0.9817    0.9692
accuracy (Supv+bd)              0.9282    0.9609    0.9500    0.9795    0.9650
accuracy (Supv)                 0.9181    0.9460    0.9347    0.9660    0.9558
false neg. (Ours (Active+bd))   0.05151   0.01485   0.02573   0.01808   0.03582
false neg. (Supv+bd)            0.05576   0.01766   0.02644   0.01956   0.03652
false neg. (Supv)               0.06914   0.01969   0.02713   0.02056   0.03727
equi. dataset size (Supv+bd)
Table 2: We summarize the accuracy and false negative rate of three methods under comparison. We also include the equivalent dataset size for the baseline to reach the same performance as our method.
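As a sanity check on the "on average" claims, the aggregate gaps can be recomputed from Table 2. The numbers below are copied from the table; the aggregate statistics are our own computation, not figures reported in the paper.

```python
import numpy as np

# Accuracy and false-negative rates from Table 2
# (columns: SCAPE, Swing, Jump, Skirt, Hand).
acc_ours = np.array([0.9383, 0.9638, 0.9552, 0.9817, 0.9692])
acc_supv = np.array([0.9181, 0.9460, 0.9347, 0.9660, 0.9558])
fn_ours = np.array([0.05151, 0.01485, 0.02573, 0.01808, 0.03582])
fn_supv = np.array([0.06914, 0.01969, 0.02713, 0.02056, 0.03727])

# Mean absolute accuracy gain of Active+bd over Supv across the datasets.
acc_gain = float(np.mean(acc_ours - acc_supv))

# Mean relative reduction of the false-negative rate over Supv.
fn_rel_reduction = float(np.mean((fn_supv - fn_ours) / fn_supv))
```

Active+bd improves on Supv in every column of the table, with a mean accuracy gain of roughly 1.75 points and a mean relative false-negative-rate reduction of roughly 14%.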
Figure 7: We plot the joint distribution of relative PD reduction (x-axis) and embedding difference (y-axis) over successfully collision-handled test meshes in the SCAPE dataset for our method and Supv+bd. Our method resolves more collisions (higher average PD reduction) while remaining closer to the input (average embedding difference 56.17 vs. 66.11) compared to Supv+bd.

Collision Handling: We plug the trained neural collision detectors into ALM and compare our method, Supv, and Supv+bd in terms of resolving self-penetrating meshes. To this end, we randomly sample self-penetrating meshes that are unseen during training and use eq:handler to derive collision-free outputs. We compare the performance based on the relative PD reduction, i.e., the fraction of the initial penetration depth (PD) removed by the handler.

Collision resolution is completely successful if this value equals one, which may not always happen because ALM uses soft penalties to relax hard constraints. Thus, we consider a solution successful if the value exceeds a fixed threshold. We plot the success rate against the dataset size in fig:iterationColResSuccRate, which shows that our method resolves more collisions than Supv+bd. Thanks to our risk-seeking data aggregation method, our method monotonically improves the collision handling success rate as more data points are injected, while Supv+bd exhibits unstable performance. Since Supv uses the same randomly sampled dataset as Supv+bd, its performance exhibits similar instability. Meanwhile, our novel boundary loss improves the results of Supv+bd, since it better approximates the decision boundary. Another criterion for good collision handling is the embedding difference, the objective function in eq:handler: we want the output to be as close as possible to the input. We plot the relative PD reduction vs. embedding difference over successfully collision-handled test meshes in the SCAPE dataset for our method and Supv+bd in fig:PDReduction. Our method attains a higher mean relative PD reduction and a lower mean embedding difference (56.17 vs. 66.11) than Supv+bd. The results show that our method resolves more collisions while the outputs stay closer to the input latent codes. Some exemplary results are shown in fig:examples.
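The relative PD reduction and the resulting success rate can be computed as below. This is an illustrative sketch; the function names are ours, and the 0.95 threshold is a placeholder since the paper's exact success threshold is not reproduced here.

```python
import numpy as np

def relative_pd_reduction(pd_before, pd_after):
    """Relative penetration-depth (PD) reduction per mesh.

    Equals 1 when the handler removes all penetration, 0 when it
    removes none. pd_before must be positive (the inputs penetrate).
    """
    pd_before = np.asarray(pd_before, dtype=float)
    pd_after = np.asarray(pd_after, dtype=float)
    return (pd_before - pd_after) / pd_before

def success_rate(pd_before, pd_after, threshold=0.95):
    """Fraction of meshes whose relative PD reduction exceeds `threshold`.

    ALM's soft penalties rarely drive the PD exactly to zero, so a solve
    counts as successful once it clears the threshold.
    """
    return float(np.mean(relative_pd_reduction(pd_before, pd_after) > threshold))
```

For instance, a mesh whose penetration depth drops from 2.0 to 0.5 has a relative PD reduction of 0.75 and would not count as successful under a 0.95 threshold.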

6 Conclusion & Limitations

We present an active learning method for training a neural collision detector in which training data are progressively sampled from the learned latent space using a risk-seeking approach. Our approach is designed for general 3D deformable meshes, and we demonstrate its benefits on several complex datasets. In practice, our method outperforms supervised learning in terms of accuracy, false negative rate, and stability. A major limitation is that our collision handler does not consider physics models; this could be addressed in future work by integrating a learning-based physics simulation approach such as (Zheng et al., 2021). We also plan to extend the method to meshes with changing topologies, e.g., using a level-set-based mesh representation.


  • H. H. Aghdam, A. Gonzalez-Garcia, J. v. d. Weijer, and A. M. López (2019) Active learning for deep detection neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3672–3680. Cited by: §2, §4.1, §4.
  • D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis (2005) SCAPE: shape completion and animation of people. In ACM SIGGRAPH, pp. 408–416. Cited by: §5.
  • P. W. Battaglia, R. Pascanu, M. Lai, D. Rezende, and K. Kavukcuoglu (2016) Interaction networks for learning about objects, relations and physics. arXiv preprint arXiv:1612.00222. Cited by: §1.
  • D. Cudeiro, T. Bolkart, C. Laidlaw, A. Ranjan, and M. Black (2019) Capture, learning, and synthesis of 3D speaking styles. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 10101–10111. External Links: Link Cited by: §1.
  • N. Das, N. Gupta, and M. Yip (2017) Fastron: an online learning-based model and active learning strategy for proxy collision detection. In Conference on Robot Learning, pp. 496–504. Cited by: §2.
  • L. De Raedt, A. Passerini, and S. Teso (2018) Learning constraints from examples. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §2.
  • M. Fieraru, M. Zanfir, E. Oneata, A. Popa, V. Olaru, and C. Sminchisescu (2020) Three-dimensional reconstruction of human interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • M. Fieraru, M. Zanfir, E. Oneata, A. Popa, V. Olaru, and C. Sminchisescu (2021) Learning complex 3d human self-contact. In AAAI Conference on Artificial Intelligence (AAAI), Vol. 3. Cited by: §2.
  • Y. Gal, R. Islam, and Z. Ghahramani (2017) Deep bayesian active learning with image data. In International Conference on Machine Learning, pp. 1183–1192. Cited by: §2, §4.
  • J. Gall, C. Stoll, E. De Aguiar, C. Theobalt, B. Rosenhahn, and H. Seidel (2009) Motion capture using joint skeleton tracking and surface estimation. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 1746–1753. Cited by: §5.
  • S. Galliani, K. Lasinger, and K. Schindler (2015) Massively parallel multiview stereopsis by surface normal diffusion. In 2015 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 873–881. External Links: Document Cited by: §5.
  • L. Gao, Y. Lai, J. Yang, Z. Ling-Xiao, S. Xia, and L. Kobbelt (2019) Sparse data driven mesh deformation. IEEE Transactions on Visualization and Computer Graphics. Cited by: §3.1.
  • L. Gao, J. Yang, Y. Qiao, Y. Lai, P. L. Rosin, W. Xu, and S. Xia (2018) Automatic unpaired shape deformation transfer. ACM Transactions on Graphics (TOG) 37 (6), pp. 1–15. Cited by: §1.
  • Y. Geifman and R. El-Yaniv (2017) Deep active learning over the long tail. arXiv preprint arXiv:1711.00941. Cited by: §1.
  • N. K. Govindaraju, M. C. Lin, and D. Manocha (2005) Quick-cullide: fast inter-and intra-object collision culling using graphics hardware. In IEEE Proceedings. VR 2005. Virtual Reality, 2005., pp. 59–66. Cited by: §2.
  • T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry (2018) A papier-mâché approach to learning 3d surface generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 216–224. Cited by: §2.
  • E. Gundogdu, V. Constantin, A. Seifoddini, M. Dang, M. Salzmann, and P. Fua (2019) GarNet: a two-stream network for fast and accurate 3d cloth draping. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §2.
  • D. Holden, B. C. Duong, S. Datta, and D. Nowrouzezahrai (2019) Subspace neural physics: fast data-driven interactive simulation. In Proceedings of the 18th annual ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 1–12. Cited by: §1.
  • Y. Kim, M. Lin, and D. Manocha (2018) Collision and proximity queries. Handbook of Discrete and Computational Geometry. Cited by: §2.
  • Z. Lahner, D. Cremers, and T. Tung (2018) Deepwrinkles: accurate and realistic clothing modeling. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 667–684. Cited by: §2.
  • L. Liu, Y. Zheng, D. Tang, Y. Yuan, C. Fan, and K. Zhou (2019) NeuroSkinning: automatic skin binding for production characters with deep graph networks. ACM Trans. Graph. 38 (4). External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34 (6), pp. 248:1–248:16. Cited by: §2.
  • L. Muller, A. A. A. Osman, S. Tang, C. P. Huang, and M. J. Black (2021) On self-contact and human pose. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9990–9999. Cited by: §2.
  • R. Narain, A. Samii, and J. F. O’Brien (2012) Adaptive anisotropic remeshing for cloth simulation. ACM Transactions on Graphics 31 (6), pp. 147:1–10. Note: Proceedings of ACM SIGGRAPH Asia 2012, Singapore External Links: Link Cited by: §1, §2, §3.3.
  • R. S. Niculescu, T. M. Mitchell, R. B. Rao, K. P. Bennett, and E. Parrado-Hernández (2006) Bayesian network learning with parameter constraints. Journal of Machine Learning Research 7 (7). Cited by: §2.
  • A. A. A. Osman, T. Bolkart, and M. J. Black (2020) STAR: a sparse trained articulated human body regressor. In European Conference on Computer Vision (ECCV), pp. 598–613. External Links: Link Cited by: §2.
  • J. Pan, S. Chitta, and D. Manocha (2012) FCL: a general purpose library for collision and proximity queries. In 2012 IEEE International Conference on Robotics and Automation, pp. 3859–3866. Cited by: §2, §4, §5.
  • J. Pan, X. Zhang, and D. Manocha (2013) Efficient penetration depth approximation using active learning. ACM Transactions on Graphics (TOG) 32 (6), pp. 1–12. Cited by: §2.
  • C. Patel, Z. Liao, and G. Pons-Moll (2020) TailorNet: predicting clothing in 3d as a function of human pose, shape and garment style. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • R. Paul, D. Feldman, D. Rus, and P. Newman (2014) Visual precis generation using coresets. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 1304–1311. Cited by: §2.
  • C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NIPS, Cited by: §2.
  • Y. Qiao, Y. Lai, H. Fu, and L. Gao (2020) Synthesizing mesh deformation sequences with bidirectional lstm. IEEE Transactions on Visualization and Computer Graphics (), pp. 1–1. External Links: Document Cited by: §1.
  • D. Ritchie, K. Wang, and Y. Lin (2019) Fast and flexible indoor scene synthesis via deep convolutional generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6182–6190. Cited by: §2.
  • V. Satopaa, J. Albrecht, D. Irwin, and B. Raghavan (2011) Finding a "kneedle" in a haystack: detecting knee points in system behavior. In 2011 31st International Conference on Distributed Computing Systems Workshops, pp. 166–171. Cited by: §5.
  • O. Sener and S. Savarese (2018) Active learning for convolutional neural networks: a core-set approach. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • T. Simon, H. Joo, I. Matthews, and Y. Sheikh (2017) Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.
  • B. Smith, C. Wu, H. Wen, P. Peluse, Y. Sheikh, J. K. Hodgins, and T. Shiratori (2020) Constraining dense hand surface tracking with elasticity. ACM Trans. Graph. 39 (6). External Links: ISSN 0730-0301, Link, Document Cited by: §1.
  • W. Sun and Y. Yuan (2006) Optimization theory and methods: nonlinear programming. Vol. 1, Springer Science & Business Media. Cited by: §3.3.
  • Q. Tan, L. Gao, Y. Lai, and S. Xia (2018a) Variational autoencoders for deforming 3d mesh models. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5841–5850. Cited by: §2.
  • Q. Tan, L. Gao, Y. Lai, J. Yang, and S. Xia (2018b) Mesh-based autoencoders for localized deformation component analysis. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.
  • Q. Tan, Z. Pan, and D. Manocha (2021) LCollision: fast generation of collision-free human poses using learned non-penetration constraints. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI, Cited by: Figure 1, §1, §1, §3.3, §3, §4.3, §5, §5.
  • M. Tang, D. Manocha, M. A. Otaduy, and R. Tong (2012) Continuous penalty forces. ACM Transactions on Graphics (TOG) 31 (4), pp. 1–9. Cited by: §1, §2, §3.3.
  • H. Tian, X. Zhang, C. Wang, J. Pan, and D. Manocha (2016) Efficient global penetration depth computation for articulated models. Computer-Aided Design 70, pp. 116–125. Cited by: §2.
  • D. Vlasic, I. Baran, W. Matusik, and J. Popović (2008) Articulated mesh animation from multi-view silhouettes. In ACM SIGGRAPH 2008 papers, pp. 1–9. Cited by: §5.
  • Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §2.
  • Z. Xu, Y. Zhou, E. Kalogerakis, C. Landreth, and K. Singh (2020) RigNet: neural rigging for articulated characters. ACM Trans. on Graphics 39. Cited by: §2.
  • J. Yang, L. Gao, Q. Tan, Y. Huang, S. Xia, and Y. Lai (2020a) Multiscale mesh deformation component analysis with attention-based autoencoders. External Links: 2012.02459 Cited by: Figure 1, §5.
  • J. Yang, K. Mo, Y. Lai, L. J. Guibas, and L. Gao (2020b) DSM-net: disentangled structured mesh net for controllable generation of fine geometry. CoRR abs/2008.05440. External Links: Link, 2008.05440 Cited by: §1.
  • F. Yu, K. Liu, Y. Zhang, C. Zhu, and K. Xu (2019) PartNet: a recursive part decomposition network for fine-grained and hierarchical shape segmentation. In CVPR, to appear. Cited by: §2.
  • X. Zhang, Y. J. Kim, and D. Manocha (2014) Continuous penetration depth. Computer-Aided Design 46, pp. 3–13. Cited by: §3.2.
  • M. Zheng, Y. Zhou, D. Ceylan, and J. Barbic (2021) A deep emulator for secondary motion of 3d characters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5932–5940. Cited by: §1, §6.