
SOMA: Solving Optical Marker-Based MoCap Automatically

Marker-based optical motion capture (mocap) is the "gold standard" method for acquiring accurate 3D human motion in computer vision, medicine, and graphics. The raw output of these systems is noisy and incomplete 3D points or short tracklets of points. To be useful, one must associate these points with corresponding markers on the captured subject; i.e. "labelling". Given these labels, one can then "solve" for the 3D skeleton or body surface mesh. Commercial auto-labeling tools require a specific calibration procedure at capture time, which is not possible for archival data. Here we train a novel neural network called SOMA, which takes raw mocap point clouds with varying numbers of points and labels them at scale without any calibration data, independent of the capture technology, and with only minimal human intervention. Our key insight is that, while labeling point clouds is highly ambiguous, the 3D body provides strong constraints on the solution that can be exploited by a learning-based method. To enable learning, we generate massive training sets of simulated noisy and ground truth mocap markers animated by 3D bodies from AMASS. SOMA exploits an architecture with stacked self-attention elements to learn the spatial structure of the 3D body and an optimal transport layer to constrain the assignment (labeling) problem while rejecting outliers. We extensively evaluate SOMA both quantitatively and qualitatively. SOMA is more accurate and robust than existing state-of-the-art research methods and can be applied where commercial systems cannot. We automatically label over 8 hours of archival mocap data across 4 different datasets captured using various technologies and output SMPL-X body models. The model and data are released for research purposes at





1 Introduction

Marker-based optical motion capture (mocap) systems record 2D infrared images of light reflected or emitted by a set of markers placed at key locations on the surface of a subject’s body. Subsequently, the mocap systems recover the precise position of the markers as a sequence of sparse and unordered points or short tracklets. Powered by years of commercial development, these systems offer high temporal and spatial accuracy. Richly varied mocap data from such systems is widely used to train machine learning methods in action recognition, motion synthesis, human motion modeling, pose estimation, etc. Despite this, the largest existing mocap dataset, AMASS [30], has about 45 hours of mocap, much smaller than video datasets used in the field.

Mocap data is limited since capturing and processing it is expensive. Despite its value, there are large amounts of archival mocap in the world that have never been labeled; this is the “dark matter” of mocap. The problem is that, to solve for the 3D body, the raw mocap point cloud (MPC) must be “labeled”; that is, the points must be assigned to physical “marker” locations on the subject’s body. This is challenging because the MPC is noisy and sparse and the labeling problem is ambiguous. Existing commercial tools, e.g. [21, 29], offer partial automation; however, none provides a complete solution that automatically handles variations in marker layout (i.e. the number of markers used and their rough placement on the body), in subject body shape and gender, and across capture technologies, namely active vs. passive markers or brands of system. These challenges typically preclude cost-effective labeling of archival data, and add to the cost of new captures by requiring manual cleanup.

Automating the mocap labeling problem has been examined by the research community [14, 16, 19]. Existing methods focus on fixing the mistakes in already labeled markers through denoising [19, 8]. Recent work formulates the problem in a matching framework, directly predicting the label assignment matrix for a fixed number of markers in a restricted setup [14]. In short, the existing methods are limited to a constrained range of motions [14], a single body shape [16, 19, 8], a certain capture scenario, a special marker layout, or require a subject-specific calibration sequence [14, 21, 29]. Other methods require high-quality real mocap marker data for training, effectively prohibiting their scalability to novel scenarios [8, 14].

To address these shortcomings we take a data-driven approach and train a neural network end-to-end with self-attention components and an optimal transport layer to predict a per-frame constrained inexact matching between mocap points and labels. Having enough “real” data for training is not feasible, therefore we opt for synthetic data. Given a marker layout, we generate synthetic mocap point clouds with realistic noise, and then train a layout-specific network that can cope with realistic variations across a whole mocap dataset. While previous works have exploited synthetic data [19, 16], they are limited in terms of body shapes, motions, marker layouts, and noise sources.

Even with a large synthetic corpus of MPC, labeling a cloud of sparse 3D points, containing outliers and missing data, is a highly ambiguous task. The key to the solution lies in the fact that the points are structured, as is their variation with articulated pose. Specifically, they are constrained by the shape and motion of the human body. Given sufficient training data, our attentional framework learns to exploit local context at different scales. Furthermore, if there were no noise, the mapping between labels and points would be one-to-one. We formulate these concepts as a unified training objective enabling end-to-end model training. Specifically, our formulation exploits a transformer architecture to capture local and global contextual information using self-attention (Sec. 4.1). By generating synthetic mocap data with varying body shapes and poses, SOMA implicitly learns the kinematic constraints of the underlying deformable human body (Sec. 4.4). A one-to-one match between 3D points and markers, subject to missing and spurious data, is achieved by a special normalization technique (Sec. 4.2). To provide a common output framework, consistent with [30], we use MoSh [28, 30] as a post-processing step to fit SMPL-X [37] to the labeled points; this also helps deal with missing data caused by occlusion or dropped markers. The SOMA system is outlined in Fig. 3.

To generate training data, SOMA requires a rough marker layout that can be obtained from a single labeled frame, which requires minimal effort. Afterward, virtual markers are automatically placed on a SMPL-X body and animated by motions from AMASS [30]. In addition to common mocap noise models like occlusions [14, 16, 19], and ghost points [16, 19], we introduce novel terms to vary marker placement on the body surface and we copy noise from real marker data in AMASS (Sec. 4.4). We train SOMA once for each mocap dataset and apart from the one layout frame, we do not require any labeled real data. After training, given a noisy MPC frame as input, SOMA predicts a distribution over labels of each point, including a null label for ghost points.

We evaluate SOMA on several challenging datasets and find that we outperform the current state of the art numerically while being much more general. Additionally, we capture new MPC data using a Vicon mocap system and compare hand-labeled ground-truth to Shōgun and SOMA output. SOMA performs similarly compared with the commercial system. Finally, we apply the method on archival mocap datasets: Mixamo [10], DanceDB [4], and a previously unreleased portion of the CMU mocap dataset [11].

In summary, our main contributions are: (1) a novel neural network architecture exploiting self-attention to process sparse deformable point cloud data; (2) a system that consumes mocap point clouds directly and outputs a distribution over marker labels; (3) a novel synthetic mocap generation pipeline that generalizes to real mocap datasets; (4) a robust solution that works with archival data, different mocap technologies, poor data quality, and varying subjects and motions; (5) 220 minutes of processed mocap data in SMPL-X format, trained models, and code are released for research purposes.

2 Related Work

Learning to Process MoCap Data was first introduced by [49] in a limited scenario. More recently, [14] proposes a learning-based model for mocap labeling that directly predicts permutations of 44 input markers. The number of possible permutations is prohibitive, hence, the authors restrict them to a limited pool shown to the network during training and test time. Moreover, motions are restricted to four categories: walk, jog, jump and sit. Furthermore, [14] inherently cannot deal with ghost points. We compare directly with them and find that we are more accurate, while removing the limitations.

Solutions exist that “denoise” the possible incorrect labels of already labeled mocap markers [19, 8]. These approaches normalize the markers to a standard body size, and rely on fragile heuristics to remove ghost points, and must first compute the global orientation of the body. Our method starts one step earlier, with unlabeled point clouds, and outputs a fully labeled sequence, while learning to reject ghost points and dealing with varied body shapes.

Deep Learning on Point Cloud Data requires a solution to handle a variable number of unordered points, and a way to define a notion of “neighborhood”. To address these issues [13, 51] project 3D points into 2D images and [32, 59] rasterize the point cloud into 3D voxels to enable the use of traditional convolution operators. Han et al. [16] utilize this idea for labeling hand MPC data, by projecting them into multiple 2D depth images. Using these, they train a neural network to predict the 3D location of 19 known markers on a fixed-shape hand, assigning the label of a marker to a point closest to it in a disconnected graph matching step. In contrast, our pipeline directly works with mocap point clouds and predicts a distribution over labels for each point, end-to-end, without disconnected stages.

PointNet methods [7, 40] also process the 3D point cloud directly, while learning local features with permutation-invariant pooling operators. Further non-local networks [58] and self-attention-based [56] models can attend globally while learning to focus locally on specific regions of the input. This simple formulation enables learning robust features on sparse point clouds while being insensitive to variable numbers of points. SOMA is a novel application of this idea to mocap data. In Sec. 4.1 we demonstrate that by stacking multiple self-attention elements, SOMA can learn rich point features at multiple scales that enable robust, permutation-invariant, mocap labeling.

Inexact Graph Matching formulates the problem of finding an assignment between nodes of a model graph to data nodes, where the former occasionally has fewer elements than the latter. This is an NP-hard problem [1] in general and challenging for mocap due to occlusions and spurious data points. Such graph matching problems appear frequently in computer vision and are addressed either with engineered costs [5, 53, 27] or learned ones [6, 44]. Ghorbani et al. [14] apply this to mocap using an approximate solution with Sinkhorn normalization [2, 47] by assuming that the graph of a mocap frame is isomorphic to the graph of the labels. We relax this assumption by considering an inexact match between the labels and points, by opting for an optimal transport [57] solution.

Body Models, in the form of skeletons, are widely used to constrain the labeling problem and to solve for body pose [18, 42, 41, 34, 9, 45, 45]. Recently [50, 25] employ variants of Kalman filtering to estimate constrained configuration of a given body skeleton in real-time but are susceptible to occlusions and ghost points. The most mature commercial solution to-date is Shōgun [29], which can produce real-time labeling and skeletal solving. While this is an excellent product, out-of-the-box it works only with a Vicon-specific marker layout and requires subject-specific session calibration. Thus it is not a general solution and cannot be used on most archival mocap data or with customized marker layouts needed in many applications. MoSh [28, 30], goes beyond simple skeleton models and utilizes a realistic 3D body surface model learned from a large corpus of body scans [37, 60, 24]. It takes labeled mocap markers as well as a rough layout of their placement on the body, and solves for body parameters and the precise placement of markers on the body. We employ MoSh for post-processing auto-labeled mocap across different datasets into a unified SMPL-X representation.

3 The MoCap Labeling Problem

A mocap point cloud, MPC, is a time sequence with $T$ frames of 3D points

$$\mathrm{MPC} = \{P_1, \dots, P_T\}, \qquad P_t = \{P_{t,1}, \dots, P_{t,n_t}\}, \qquad P_{t,i} \in \mathbb{R}^3,$$

where $|P_t| = n_t$ for each time step $t \in \{1, \dots, T\}$. We visualize an MPC as a chart in Fig. 2 (top), where each row holds points reconstructed by the mocap hardware, and each column represents a frame of MPC. Each point is unlabeled but these can often be locally tracked over short time intervals, illustrated by the gray bars in the figure. Note that some of these tracklets may correspond to noise or “ghost points”. For passive marker systems like Vicon [29], a point that is occluded typically appears in a new row; i.e. with a new ID. For active marker systems like PhaseSpace [22], one can have gaps in the trajectories due to occlusion. The figure shows both types.

Figure 2: MoCap labeling problem. (top) Raw, unlabeled, MoCap Point Cloud (MPC). Each column represents a timestamp in a mocap sequence and each cell is a 3D point or a short tracklet (shown as a gray row). (middle) shows the MPC after labeling. Colors correspond to different labels in the marker layout. Red corresponds to ghost points (outliers). The red oblique lines show ghost points wrongly tracked as actual markers by the mocap system. (bottom) shows the final result, with the tracklets glued together to form full trajectories with only valid markers retained. Note that marker occlusion results in black (missing) sections.

The goal of mocap labeling is to assign each point (or tracklet) to a corresponding marker label

$$L = \{l_1, \dots, l_M, \text{null}\}$$

in the marker layout as illustrated in Fig. 2 (middle), where each color is a different label. The set of marker labels includes an extra null label for points that are not valid markers, hence $|L| = M + 1$ for a layout with $M$ markers. These are shown as red in the figure. Valid point labels and tracklets of them are subject to several constraints: (i) each point can be assigned to at most one label and vice versa; (ii) each point can be assigned to at most one tracklet; (iii) the label null is an exception that can be matched to more than one point and can be present in multiple tracklets in each frame.

4 SOMA

Figure 3: We train SOMA solely with synthetic data, Sec. 4.4. At runtime, SOMA receives an unprocessed 3D sparse mocap point cloud with a varying number of points. These are median centered and passed through the pipeline, consisting of self-attention layers, Sec. 4.1, and a final normalization to encourage bijective label-point correspondence, Sec. 4.2. The network outputs labels assigned to each point that correspond to markers in the training marker layout, with an additional null label. Finally, a 3D body is fit to the labeled points using MoSh, Sec. 4.3.

4.1 Self-Attention on MoCap Point Clouds

The SOMA system pipeline is summarized in Fig. 3. The input to SOMA is a single frame of sparse, unordered points, the cardinality of which varies with each timestamp due to occlusions and ghost points. To process such data, we exploit multiple layers of self-attention [56], with a multi-head formulation, concatenated via residual operations [17, 56]. Multiple layers increase the capacity of the model and enable the network to have a local and a global view of the point cloud, which helps disambiguate points.

We define self-attention span as the average of attention weights over random sequences picked from our validation dataset. Figure 4 visualizes the attention placed on markers at the first and last self-attention layers; the intensity of the red color correlates with the amount of attention. Note that the points are shown on a body in a canonical pose but the actual mocap point cloud data is in many poses. Deeper layers focus attention on geodesically near body regions (wrist: upper and lower arm) or regions that are highly correlated (left foot: right foot and left knee), indicating that the network has figured out the spatial structure of the body and correlations between parts even though the observed data points are non-linearly transformed in Euclidean space by articulation. In Appendix A, we provide further computational details and demonstrate the self-attention span as a function of network depth. Also, we present a model-selection experiment to choose the optimum number of layers.
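As a rough illustration of the idea in this section, the following NumPy sketch applies two stacked single-head self-attention layers with residual connections to a median-centered frame. It is not the SOMA architecture: the feature width, the random weights, and the helper names (`self_attention_layer`, `encode_frame`) are assumptions chosen only to show that such layers accept any number of points and that permuting the input points permutes the output features in the same way.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_layer(feats, wq, wk, wv):
    """Single-head scaled dot-product self-attention with a residual connection."""
    q, k, v = feats @ wq, feats @ wk, feats @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_points, n_points), rows sum to 1
    return feats + attn @ v, attn

def encode_frame(points, w_in, layers):
    """Median-center one frame of 3D points and run it through stacked attention layers."""
    centered = points - np.median(points, axis=0)
    feats = centered @ w_in
    for wq, wk, wv in layers:
        feats, _ = self_attention_layer(feats, wq, wk, wv)
    return feats

rng = np.random.default_rng(0)
d = 16  # hypothetical feature width, not the value used by SOMA
w_in = rng.normal(size=(3, d)) * 0.1
layers = [tuple(rng.normal(size=(d, d)) * 0.1 for _ in range(3)) for _ in range(2)]

frame = rng.normal(size=(41, 3))          # toy frame with 41 points
perm = rng.permutation(len(frame))
out = encode_frame(frame, w_in, layers)
out_perm = encode_frame(frame[perm], w_in, layers)  # same frame, shuffled point order
```

Because no step depends on the row order, `out[perm]` and `out_perm` agree up to floating-point error, and the same weights can be reused for frames with fewer or more points.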

4.2 Constrained Point Labeling

In the final stage of the architecture, SOMA predicts a non-square score matrix $S \in \mathbb{R}^{n_t \times M}$, with one row per point in the frame and one column per marker label. To satisfy the assignment constraints of Sec. 3, we employ a log-domain, stable implementation of optimal transport [44] described by [38]. The optimal transport layer depends on iterative Sinkhorn normalization [2, 47, 48], which constrains rows and columns to sum to 1 for available points and labels. To deal with missing markers and ghost points, following [44], we introduce dustbins by appending an extra last row and column to the score matrix. These can be assigned to multiple unmatched points and labels, hence respectively summing to $n_t$ and $M$. After normalization, we reach the augmented assignment matrix, $\hat{G} \in [0, 1]^{(n_t+1) \times (M+1)}$, from which we drop the appended row, for unmatched labels, yielding the final normalized assignment matrix $G \in [0, 1]^{n_t \times (M+1)}$.

While Ghorbani et al. [14] use a similar score normalization approach, their method cannot handle unmatched cases, which is critical for handling real mocap point cloud data in its raw form.
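The normalization described in this section can be sketched as follows. This is a minimal NumPy version of log-domain Sinkhorn iterations with one dustbin row and one dustbin column, written under stated assumptions rather than taken from the SOMA code: the function names, the fixed `bin_score`, the iteration count, and the toy score matrix are all hypothetical.

```python
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True)), axis=axis)

def log_optimal_transport(scores, bin_score=1.0, n_iters=100):
    """Sinkhorn normalization in the log domain with one dustbin row and column."""
    n_points, n_labels = scores.shape
    z = np.full((n_points + 1, n_labels + 1), bin_score, dtype=float)
    z[:n_points, :n_labels] = scores
    # Target marginals: every point and label carries mass 1; the dustbin row can
    # absorb all labels and the dustbin column (null label) can absorb all points.
    log_mu = np.log(np.append(np.ones(n_points), n_labels))   # row sums
    log_nu = np.log(np.append(np.ones(n_labels), n_points))   # column sums
    u = np.zeros(n_points + 1)
    v = np.zeros(n_labels + 1)
    for _ in range(n_iters):
        u = log_mu - logsumexp(z + v[None, :], axis=1)
        v = log_nu - logsumexp(z + u[:, None], axis=0)
    return z + u[:, None] + v[None, :]

scores = np.array([[4.0, 0.0],    # point 0 looks like label 0
                   [0.0, 4.0],    # point 1 looks like label 1
                   [0.0, 0.0]])   # point 2 resembles neither: a ghost-point candidate
augmented = np.exp(log_optimal_transport(scores))
assignment = augmented[:-1]            # drop the dustbin row (unmatched labels)
labels = assignment.argmax(axis=1)     # column index 2 is the null label
```

In this toy example the first two points keep their matching labels and the third point falls into the null column, which is the behaviour a plain row-wise softmax cannot enforce jointly over rows and columns.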

4.3 Solving for the Body

Once mocap points are labeled, we “solve” for the articulated body that lies behind the motion. Typical mocap solvers [19, 21, 29] fit a skeletal model to the labeled markers. Instead, here we fit an entire articulated 3D body mesh to markers using MoSh [28, 30]. This technique gives an animated body with a skeletal structure so nothing is lost over traditional methods while yielding a full 3D body model, consistent with other recent mocap datasets [30, 52]. Here we fit the SMPL-X body model, which provides forward compatibility for datasets with hands and face captures. For more details on MoSh we refer the reader to the original papers [28, 30].

4.4 Synthetic MoCap Generation

Human Body Model. To synthetically produce realistic mocap training data with ground truth labels, we leverage a gender-neutral, state of the art statistical body model, SMPL-X [37], that uses vertex-based linear blend skinning with learned corrective blend shapes to output the global position of vertices:

$$X = S(\theta_b, \beta, \gamma).$$

Here $\theta_b \in \mathbb{R}^{3(J+1)}$ is the axis-angle representation of the body pose where $J$ is the number of body joints of an underlying skeleton in addition to a root joint for global rotation. We use $\beta$ and $\gamma \in \mathbb{R}^3$ to respectively parameterize body shape and the global translation. Compared to the original SMPL-X notation, here we discard parameters that control facial expressions, face and hand poses; i.e. respectively $\psi$, $\theta_f$, $\theta_h$. We build on SMPL-X to enable extension of SOMA to datasets with face and hand markers but SMPL-X can be converted to SMPL if needed. For more details we refer the reader to [37].

MoCap Noise Model. Various noise sources can influence mocap data, namely: subject body shape, motion, marker layout and the exact placement of the markers on body, occlusions, ghost points, mocap hardware intrinsics, and more. To learn a robust model, we exploit AMASS [30] that we refit with a neutral gender SMPL-X body model and sub-sample to a unified 30 Hz. To be robust to subject body shape we generate AMASS motions for 3664 bodies from the CAESAR dataset [43]. Specifically, for training we take parameters from the following mocap sub-datasets of AMASS: CMU [11], Transitions [30] and Pose Prior [3]. For validation we use HumanEva [46], ACCAD [12], and TotalCapture [55].

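Given a marker layout of a target dataset, $v$, as a vector of length $M$ in which the index corresponds to the marker label and the entry to a vertex on the SMPL-X body mesh, together with a vector of marker-body distances $d$, we can place virtual markers, $\tilde{X}$, on the body:

$$\tilde{X} = X|_{v} + \operatorname{diag}(d)\, N|_{v}.$$

Here $X$ holds the body vertex positions, $N$ is a matrix of vertex normals, $\operatorname{diag}(d)$ scales each selected normal by its marker-body distance, and $|_{v}$ picks the vector of elements (vertices or normals) corresponding to vertices defined by the marker layout.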
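The placement rule can be illustrated with a few lines of NumPy. This is a sketch on a toy mesh, not SMPL-X: the function name `place_virtual_markers`, the cube-corner "body", the layout indices, and the distances are all made up for the example.

```python
import numpy as np

def place_virtual_markers(vertices, normals, layout, distances):
    """Offset the layout vertices along their unit normals by the marker-body distances."""
    layout = np.asarray(layout)
    distances = np.asarray(distances, dtype=float)
    return vertices[layout] + normals[layout] * distances[:, None]

# Toy "body": the 8 corners of a cube centered at the origin, normals pointing outwards.
vertices = np.array([[x, y, z] for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (-1.0, 1.0)])
normals = vertices / np.linalg.norm(vertices, axis=1, keepdims=True)

layout = [0, 3, 7]                 # hypothetical layout: marker label -> vertex index
distances = [0.01, 0.01, 0.02]     # hypothetical marker-body distances
markers = place_virtual_markers(vertices, normals, layout, distances)
```

Each row of `markers` sits exactly `distances[i]` away from its layout vertex, along that vertex's normal.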

With this, we produce a library of mocap frames and corrupt them with various controllable noise sources. Specifically, to generate a noisy layout, we randomly sample a vertex in the 1-ring neighborhood of the original vertex specified by the marker layout, effectively producing a different marker placement for each data point. Instead of normalizing the global body orientation, common to previous methods [14, 16, 19, 8], we add a random rotation to the global root orientation of every body frame to augment this value. Further, we copy the noise for each label from the real AMASS mocap markers to help generalize to mocap hardware differences. We create a database of the differences between the simulated and actual markers of AMASS and draw random samples from this noise model to add to the synthetic marker positions.
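The random marker placement step can be sketched as below. The helper names are hypothetical, the mesh is a toy triangle strip rather than a body mesh, and letting a marker stay on its original vertex is an assumption that the text does not settle.

```python
import numpy as np

def one_ring_neighbors(faces, n_vertices):
    """For every vertex, collect the vertices that share a triangle with it."""
    neighbors = [set() for _ in range(n_vertices)]
    for a, b, c in faces:
        neighbors[a].update((b, c))
        neighbors[b].update((a, c))
        neighbors[c].update((a, b))
    return [sorted(s) for s in neighbors]

def jitter_layout(layout, neighbors, rng):
    """Replace each layout vertex by a random pick from itself and its 1-ring (assumption)."""
    return np.array([rng.choice([v] + neighbors[v]) for v in layout])

# Toy triangle strip with 5 vertices; a real body mesh has thousands of vertices.
faces = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
neighbors = one_ring_neighbors(faces, n_vertices=5)

rng = np.random.default_rng(0)
layout = [0, 4]                               # hypothetical two-marker layout
noisy_layout = jitter_layout(layout, neighbors, rng)
```

Calling `jitter_layout` once per training sample yields a slightly different placement every time while keeping each marker on the same local patch of the surface.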

Furthermore, we append ghost points to the generated mocap frame, by drawing random samples from a 3D Gaussian distribution with mean and standard deviation equal to the median and standard deviation of the marker positions, respectively. Moreover, to simulate marker occlusions we take random samples from a uniform distribution representing the index of the markers and occlude selected markers by replacing their value with zero. The number of added ghost points and occlusions in each frame can also be subject to randomness.
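A minimal sketch of the ghost-point and occlusion noise described in this paragraph follows. The function name, the per-axis treatment of the median and standard deviation, the use of index `len(markers)` for the null label, and the noise counts are assumptions made for illustration.

```python
import numpy as np

def corrupt_frame(markers, n_occluded, n_ghosts, rng):
    """Return a noisy copy of one frame plus per-point labels (null = len(markers))."""
    n_markers = len(markers)
    null_label = n_markers
    # Ghost points: Gaussian with mean = per-axis median and std = per-axis std of the markers.
    ghosts = rng.normal(loc=np.median(markers, axis=0),
                        scale=markers.std(axis=0),
                        size=(n_ghosts, 3))
    # Occlusions: draw marker indices uniformly and replace their values with zero.
    noisy = markers.copy()
    occluded = rng.choice(n_markers, size=n_occluded, replace=False)
    noisy[occluded] = 0.0
    points = np.concatenate([noisy, ghosts], axis=0)
    labels = np.concatenate([np.arange(n_markers), np.full(n_ghosts, null_label)])
    return points, labels, occluded

rng = np.random.default_rng(0)
markers = rng.normal(size=(41, 3)) + 2.0   # hypothetical clean frame with 41 markers
points, labels, occluded = corrupt_frame(markers, n_occluded=5, n_ghosts=3, rng=rng)
```

Drawing `n_occluded` and `n_ghosts` from a random range per frame, instead of fixing them as here, would match the remark that their number can itself be randomized.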

Figure 4: Attention span for different markers on the body in a canonical pose. The cube shows the marker of interest and color intensity depicts the average value of attention across frames of 50 randomly selected sequences. Each column shows a different marker. At the first layer (top) we see wider attention compared to the deepest layer (bottom).

At test time, to mimic broken trajectories of passive mocap systems, we randomly choose a trajectory and break it at random timestamps. To break a trajectory we copy marker values at the onset of the breakage and create a new trajectory whose values up to the breakage are zero and whose remaining values are those of the marker of interest. The original marker trajectory after the breakage is replaced by zeros.
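The trajectory-breaking procedure can be written compactly. The sketch below assumes trajectories are stored as an array of shape (trajectories, frames, 3) and breaks a single trajectory once; the function name and toy data are hypothetical.

```python
import numpy as np

def break_trajectory(trajectories, index, t_break):
    """Split trajectory `index` at frame `t_break` into two zero-padded trajectories."""
    trajectories = trajectories.copy()
    tail = np.zeros_like(trajectories[index])
    tail[t_break:] = trajectories[index, t_break:]   # new trajectory: zeros, then the marker
    trajectories[index, t_break:] = 0.0              # original: the marker, then zeros
    return np.concatenate([trajectories, tail[None]], axis=0)

rng = np.random.default_rng(0)
clean = rng.normal(size=(4, 100, 3)) + 5.0     # 4 trajectories over 100 frames (toy values)
index = int(rng.integers(len(clean)))          # which trajectory to break
t_break = int(rng.integers(1, clean.shape[1])) # where to break it
broken = break_trajectory(clean, index, t_break)
```

Applying the function repeatedly to its own output produces several breaks, mimicking a passive system that assigns a new ID each time a marker reappears.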

Finally, at train and test times we randomly permute the markers to create an unordered set of 3D mocap points. In contrast to [14], the permutations are random and not limited to a specific set of permutations.

4.5 Implementation Details

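Loss. The total loss for training SOMA is formulated as $\mathcal{L} = c_l \mathcal{L}_l + c_{reg} \mathcal{L}_{reg}$, where:

$$\mathcal{L}_l = -\frac{1}{\sum_{i,j} G'_{i,j}} \sum_{i,j} W_{i,j}\, G'_{i,j} \log \hat{G}_{i,j}.$$

$\hat{G}$ is the augmented assignment matrix, and $G'$ is its ground-truth version. $W$ is a matrix to down-weight the influence of the over-represented class, i.e. the null label, by the reciprocal of its occurrence frequency. $\mathcal{L}_{reg}$ is regularization on the model parameters, and $c_l$ and $c_{reg}$ weight the two terms. In Appendix B we present further architecture details.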
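One plausible reading of the labeling term is a weighted negative log-likelihood between the predicted and ground-truth assignment matrices. The sketch below is written under that assumption; the function name, the normalization by the number of ground-truth assignments, and weighting every class (not only the null label) by its reciprocal frequency are illustrative choices, not confirmed details of the SOMA implementation.

```python
import numpy as np

def weighted_label_loss(pred, target, class_weights, eps=1e-12):
    """Weighted negative log-likelihood between predicted and ground-truth assignments."""
    weights = np.broadcast_to(class_weights, target.shape)   # one weight per label column
    return -(weights * target * np.log(pred + eps)).sum() / target.sum()

# Toy frame: 4 points, 2 marker labels + null (last column); two of the points are ghosts.
target = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 0, 1]], dtype=float)
frequency = target.mean(axis=0)     # label occurrence frequency: [0.25, 0.25, 0.5]
class_weights = 1.0 / frequency     # reciprocal frequency down-weights the null label

good = np.full_like(target, 0.05) + 0.85 * target   # rows sum to 1, mass on the right label
uniform = np.full_like(target, 1.0 / 3.0)           # uninformative prediction

loss_good = weighted_label_loss(good, target, class_weights)
loss_uniform = weighted_label_loss(uniform, target, class_weights)
```

The confident, correct prediction receives a much smaller loss than the uniform one, and the null column contributes with half the weight of the marker columns.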

Using SOMA. The automatic labeling pipeline starts with a labeled mocap frame that can roughly resemble the marker layout of the target dataset. If the dataset has significant variations in marker layout or many displaced or removed markers, one labeled frame per each major variation is needed. We then train one model for the whole dataset. After training with synthetic data produced for the target marker layout, we apply SOMA on mocap sequences in a per-frame mode;  each frame is processed independently. On a GPU, auto-labeling runs at in non-batched mode and, for a batch of 30 frames runtime is . In cases where the mocap hardware provides tracklets of points, we assign the most frequent label for a tracklet to all of the member points; we call this tracklet labeling. For detailed examples of using SOMA, including a general model to facilitate labeling the initial frame, i.e. “label priming”, see Appendix G.
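Tracklet labeling as described here amounts to a majority vote within each tracklet. A minimal sketch, assuming per-point predicted labels and tracklet IDs are available as integer arrays (the function name and toy values are hypothetical):

```python
import numpy as np

def tracklet_labeling(pred_labels, tracklet_ids):
    """Assign the most frequent predicted label of each tracklet to all of its points."""
    pred_labels = np.asarray(pred_labels)
    tracklet_ids = np.asarray(tracklet_ids)
    out = pred_labels.copy()
    for tid in np.unique(tracklet_ids):
        members = tracklet_ids == tid
        out[members] = np.bincount(pred_labels[members]).argmax()
    return out

# Per-frame predictions for two tracklets; one frame of tracklet 0 was mislabeled as 5.
pred_labels = [3, 3, 5, 3, 7, 7]
tracklet_ids = [0, 0, 0, 0, 1, 1]
smoothed = tracklet_labeling(pred_labels, tracklet_ids)
```

The isolated per-frame error inside tracklet 0 is overruled by the other frames of the same tracklet. On ties, `argmax` picks the smallest label index.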

5 Experiments

Evaluation Datasets. We evaluate SOMA quantitatively on various mocap datasets with real marker data and synthetic noise; namely: BMLrub [54], BMLmovi [15], and KIT [31]. The individual datasets offer various marker layouts with different marker density, subject shape variation, body pose, and recording systems. We take original marker data from their respective public access points and further corrupt the data with controlled noise, namely marker occlusion, ghost points, and per-frame random shuffling (Sec. 4.4). For per-frame experiments, broken trajectory noise is not used. We also collect a new “SOMA dataset”, which we use for direct comparison with Vicon.

To avoid overfitting hyper-parameters to test datasets, we utilize a separate dataset for model selection and validation experiments; namely HDM05 [35], containing 215 sequences, across 4 subjects, on average using 40 markers.

Evaluation Metrics. Primarily, we report mean and standard deviation of per-frame accuracy and F1 score in percentages. Accuracy is the proportion of correctly predicted labels over all labels, and F1 score is the harmonic mean of precision and recall:

$$F1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}},$$

where recall is the proportion of correctly predicted labels over actual labels and precision is the proportion of correctly predicted labels over predicted labels. The final F1 score for a mocap sequence is the average of the per-frame F1 scores.
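The per-frame metrics can be computed as in the sketch below. How the null label enters precision and recall is not spelled out in the text, so treating null as "no marker predicted" is an assumption of this example, as are the function name and the toy labels.

```python
import numpy as np

def frame_scores(true_labels, pred_labels, null_label):
    """Per-frame accuracy and F1, treating the null label as 'no marker' (assumption)."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    accuracy = float(np.mean(true_labels == pred_labels))
    correct = np.sum((true_labels == pred_labels) & (pred_labels != null_label))
    n_pred = np.sum(pred_labels != null_label)   # points predicted as markers
    n_true = np.sum(true_labels != null_label)   # points that really are markers
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_true if n_true else 0.0
    denom = precision + recall
    f1 = float(2 * precision * recall / denom) if denom else 0.0
    return accuracy, f1

# Five points, three marker labels (0-2) and null (3): one marker and one ghost are swapped.
true_labels = [0, 1, 2, 3, 3]
pred_labels = [0, 1, 3, 2, 3]
accuracy, f1 = frame_scores(true_labels, pred_labels, null_label=3)
```

Here three of five points are labeled correctly (accuracy 0.6), while two of three predicted markers and two of three actual markers are correct (F1 of 2/3). Sequence-level scores are then the mean of these per-frame values.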

5.1 Effectiveness of the MoCap Noise Generation

Train \ Test   B               B+C             B+G             B+C+G
               Acc.    F1      Acc.    F1      Acc.    F1      Acc.    F1
B              97.93   97.37   83.55   81.11   86.54   85.97   73.95   71.93
B+C            97.25   96.31   97.21   96.05   95.37   95.29   93.79   92.77
B+G            98.06   97.19   96.14   94.03   97.87   97.65   95.27   93.62
B+C+G          95.74   94.44   95.56   93.91   95.73   95.22   95.50   94.44
Table 1: Per-frame labeling with synthetic noise model evaluated on real mocap markers of HDM05 with added noise. “B, C, G” respectively stand for Base, Occlusion, and Ghost points. We report average accuracy and F1 scores.

Here we train and test SOMA with various amounts of synthetic noise. The training data is synthetically produced for the layout of HDM05 as described in Sec. 4.4. We test on original markers of HDM05 corrupted with synthetic noise. B stands for no noise, B+C for up to 5 markers occluded per frame, B+G for up to 3 ghost points per frame, and B+C+G for the full treatment of the occlusion and ghost point noise. Table 1 shows different training and testing scenarios. In general, matching the model training noise to the true noise produces the best results but training with the full noise model (B+C+G) gives competitive performance and is a good choice when the noise level is unknown. Training with more noise improves robustness.

5.2 Comparison with Previous Work

Method                    Number of Exact Per-Frame Occlusions
                          0       1       2       3       4       5       5+G
Holzreiter et al. [20]    88.16   79.00   72.42   67.16   61.13   52.10   -
Maycock et al. [33]       83.19   79.35   76.44   74.91   71.17   65.83   -
Ghorbani et al. [14]      97.11   96.56   96.13   95.87   95.75   94.90   -
SOMA-Real                 99.08   98.97   98.85   98.68   98.48   98.22   98.29
SOMA-Synthetic            99.16   98.92   98.54   98.17   97.61   97.07   95.13
SOMA*                     98.38   98.28   98.17   98.03   97.86   97.66   97.56
Table 2: Comparing SOMA with prior work on the same data. We train SOMA in three different ways: once with real data; once with synthetic markers placed on bodies of the same motions obtained from AMASS; and ultimately once with the data produced in Sec. 4.4, designated with *. See Fig. D.1 for standard deviations.

We compare SOMA to prior work in Tab. 2 under the same conditions. Specifically, we use train and test data splits of the BMLrub dataset explained by [14]. The test happens on real markers with synthetic noise. We train SOMA once with real marker data and once by synthetic markers produced by motions of the same split. Additionally, we train SOMA with the full synthetic data outlined in Sec. 4.4. The performance of other competing methods is as reported by [14]. All versions of SOMA outperform previous methods. The model (SOMA*) trained with synthetic markers and varied training poses from AMASS is competitive with the models trained on limited real data or synthetic marker data with dataset-specific motions. This is likely due to the rich variation in our noise generation pipeline. In contrast to prior work, SOMA is robust to increasing occlusion and can process ghost points without extra heuristics; i.e. last column in Tab. 2.

5.3 Performance Robustness

                Per-Frame                      Tracklet
Datasets        Acc.           F1              Acc.           F1              # Markers   # Motions   # Frames   # Subjects
BMLrub [54]     98.15 ± 2.78   97.75 ± 3.23    98.77 ± 1.58   98.65 ± 1.89    41          3013        3757725    111
KIT [31]        94.97 ± 2.42   95.51 ± 2.65    95.46 ± 1.87   97.10 ± 2.00    53          3884        3504524    48
BMLmovi [15]    95.90 ± 4.65   95.12 ± 5.26    97.33 ± 2.29   96.87 ± 2.60    67          1863        1255447    89
Table 3: Performance of SOMA on real marker data of various datasets with large variation in number of subjects, body pose, markers, and hardware specifics. We corrupt the real marker data with additional noise, and forget the labels, turning it into a raw MPC before passing through SOMA pipeline.

Performance on Various MoCap Datasets could vary due to variations in the marker density, mocap quality, subject shape and motions. To assess such cases we take three full scale mocap datasets and corrupt the markers with synthetic noise including up to 50 broken trajectories that best mimic the situation with a realistic unlabeled mocap scenario. Additionally we evaluate tracklet labeling explained in Sec. 4.5. Table 3 shows consistent high performance of SOMA across the datasets. When tracklets are available, as they often are, tracklet labeling improves performance across the board.

Performance on Subsets or Supersets of a Specific Marker Layout could vary since this introduces “structured” noise. A superset marker layout is the set of all labels in a dataset, which may combine multiple marker layouts. A base model trained on a superset marker layout and tested on subsets would be subject to structured occlusions, while a model trained on subset and tested on the superset base mocap would see structured ghost points. These situations commonly happen across a dataset when trial coordinators improvise on the planned marker layout for a special take with additional or fewer markers. Alternatively, markers often fall off during the trial. To quantitatively determine the range of performance variance we take the marker layout of the validation dataset, HDM05, and omit markers in progressive steps; Tab. 4. The model, trained on subset layout and tested on base markers (superset), shows greater deterioration in performance than the base model trained on the superset and tested on reduced marker sets.

3 5 6 12
Acc. F1 Acc. F1 Acc. F1 Acc. F1
Base Model 95.35 ± 6.52 94.40 ± 7.37 94.08 ± 7.04 91.75 ± 8.32 93.41 ± 7.84 90.73 ± 9.33 88.25 ± 14.04 8.80 ± 16.07
Base MoCap 91.78 ± 10.13 90.68 ± 10.90 90.73 ± 9.38 90.12 ± 9.96 91.89 ± 8.97 91.00 ± 9.71 87.46 ± 10.77 86.78 ± 12.00
Table 4: Robustness to variations in marker layout. First row: A base model is trained with full marker layout (superset) and tested per-frame on real markers from the validation set (HDM05) with omitted markers (subset). Second row: One model is trained per varied layout (subset) and tested on base mocap markers (superset).

5.4 Ablation Studies

Version Accuracy F1
Base 95.50 ± 5.33 94.66 ± 6.03
 - AMASS Noise Model 94.73 ± 5.52 93.73 ± 6.25
 - CAESAR bodies 95.21 ± 6.83 94.31 ± 7.57
 - Log-Softmax Instead of Sinkhorn 91.51 ± 10.69 90.10 ± 11.47
 - Random Marker Placement 89.41 ± 8.06 87.78 ± 8.85
 - Transformer 11.36 ± 6.54 7.54 ± 6.22
Table 5: Ablation study of SOMA components on the HDM05 dataset. The numbers reflect the contribution of each component in overall per-frame performance of SOMA. We take the full base model and remove one component at a time.

Table 5 shows the effect of the various components on the final performance of SOMA on the validation dataset, HDM05. The self-attention layers and the novel random marker placement noise play the most significant role in the overall performance of the model. The optimal transport layer marginally improves the accuracy of the model compared to the Log-Softmax normalization.

5.5 Application

Acc. F1
Shōgun 0.00 0.11 0.00 100.0 0.00 100.0 0.00
SOMA 0.08 2.09 0.00 99.94 0.47 99.92 0.64
Table 6: SOMA vs Shōgun. On a manually labeled dataset with passive markers, we compare SOMA against a commercial tool for labeling performance and surface reconstruction.
Type Points Subjects Minutes Success Ratio
CMU-II [11] P 40-255 41 116.30 80.0
DanceDB [4] A 38 20 203.38 81.26
Mixamo [10] A 38-96 29 195.37 78.31
SOMA P 53-140 2 18.27 100.00
Total 533.32
Table 7: Processing uncleaned, unlabeled mocap datasets with SOMA. Inputs to the pipeline are mocap sequences with a possibly varying number of points; SOMA labels the points as markers and afterwards MoSh is applied to the data to solve for the body surface. P and A stand for passive and active marker systems respectively.

Comparison with a Commercial Tool. To compare directly with Vicon’s Shōgun auto-labeling tool, we record a new “SOMA” mocap dataset with 2 subjects performing 11 motion types, including dance, clap, kick, etc., using a Vicon system with infrared “Vantage” [29] cameras operating at Hz. In total, we record 69 motions and intentionally use a marker layout preferred by Vicon. We manually label this dataset and treat these labels as ground truth. We process the reconstructed mocap point cloud data with both SOMA (using tracklet labeling) and with Shōgun. The results in Tab. 6 show that SOMA achieves sub-millimeter accuracy and similar performance compared with the proprietary tool while not requiring subject calibration data. In Tab. G.1 we present further details of this dataset.

Processing Real MoCap Datasets with different capture technologies and unknown subject calibration is presented in Tab. 7. For each dataset, SOMA is trained on the marker superset using only synthetic data. SOMA effectively enables running MoSh on mocap point cloud data to extract realistic bodies. The results are not perfect and we manually remove sequences that do not pass a subjective quality bar (see Appendix G and the accompanying video for examples). Table 7 indicates the percentage of successful minutes of mocap. Failures are typically due to poor mocap quality. Note that the SOMA dataset is very high quality with many cameras and, here, the success rate is 100%. For sample renders refer to the accompanying video.

6 Conclusion

SOMA addresses the problem of robustly labeling raw mocap point cloud sequences of human bodies in motion, subject to noise and variations across subjects, motions, marker placement, marker density, mocap quality, and capture technology. SOMA solves this problem using several innovations including a novel self-attention mechanism and a matching component that deals with outliers and missing data. We train SOMA end-to-end on synthetic data using several techniques to add realistic noise that enable generalization to real data. We extensively validate the performance of SOMA, showing that it is more accurate than previous research methods and comparable in accuracy to a commercial system while being significantly more flexible. SOMA is also freely available for research purposes.

Limitations and Future Work. SOMA performs per-frame MPC labeling and, hence, does not exploit temporal information. A temporal model could potentially improve accuracy. As with any learning-based method, SOMA may be limited in generalizing to new motions outside the training data. Using AMASS, however, the variability of the training data is large and we did not observe problems with generalization. By exploiting the full SMPL-X body model in the synthetic data generation pipeline we plan to extend the method to label hand and face markers. Because it relies on feed-forward components, SOMA is extremely fast and, coupled with a suitable mocap solver, could potentially recover bodies in real time from mocap point clouds.

Acknowledgments: We thank Senya Polikovsky, Markus Höschle, Galina Henz (GH), and Tobias Bauch for the mocap facility. We thank Alex Valisu, Elisha Denham, Leyre Sánchez Viñuela, Felipe Mattioni and Jakob Reinhardt for mocap cleaning. We thank GH, and Tsvetelina Alexiadis for trial coordination. We thank Benjamin Pellkofer and Jonathan Williams for website developments. Disclosure:


  • [1] M. A. Abdulrahim (1998) Parallel algorithms for labeled graph matching. Ph.D. Thesis, Colorado School of Mines, USA. Note: AAI0599838 Cited by: §2.
  • [2] R. P. Adams and R. S. Zemel (2011) Ranking via Sinkhorn Propagation. External Links: Link, 1106.1925 Cited by: §2, §4.2.
  • [3] I. Akhter and M. J. Black (2015) Pose-conditioned joint angle limits for 3D human pose reconstruction. In CVPR, pp. 1446–1455. External Links: Document Cited by: §4.4.
  • [4] A. Aristidou, E. Stavrakis, M. Papaefthimiou, G. Papagiannakis, and Y. Chrysanthou (2018-12-01) Style-based motion analysis for dance composition. The Visual Computer 34, pp. 1725–1737. External Links: Link Cited by: Appendix G, Appendix G, §1, Table 7.
  • [5] A. C. Berg, T. L. Berg, and J. Malik (2005) Shape matching and object recognition using low distortion correspondences. In CVPR, Vol. 1, pp. 26–33. External Links: Document Cited by: §2.
  • [6] T. S. Caetano, J. J. McAuley, L. Cheng, Q. V. Le, and A. J. Smola (2007) Learning graph matching. In ICCV, pp. 1–8. External Links: Document Cited by: §2.
  • [7] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In CVPR, Vol. , pp. 77–85. External Links: Document Cited by: §2.
  • [8] K. Chen, Y. Wang, S. Zhang, S. Xu, W. Zhang, and S. Hu (2021) MoCap-Solver: a neural solver for optical motion capture data. ACM Transactions on Graphics (TOG) 40 (4). External Links: Document, ISSN 0730-0301, Link Cited by: §1, §2, §4.4.
  • [9] R. Contini (1972) Body segment parameters, part II. Artificial Limbs 16 (1), pp. 1–19. External Links: Link Cited by: §2.
  • [10] Adobe Mixamo MoCap Dataset (2019) External Links: Link Cited by: Appendix G, Appendix G, §1, Table 7.
  • [11] Carnegie Mellon University (CMU) MoCap Dataset (2019) External Links: Link Cited by: Appendix G, Appendix G, §1, §4.4, Table 7.
  • [12] Advanced Computing Center for the Arts and Design (ACCAD) MoCap Dataset (2019) External Links: Link Cited by: §4.4.
  • [13] L. Ge, H. Liang, J. Yuan, and D. Thalmann (2016) Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. In CVPR, Vol. , pp. 3593–3601. External Links: Document Cited by: §2.
  • [14] S. Ghorbani, A. Etemad, and N. F. Troje (2019) Auto-labelling of markers in optical motion capture by permutation learning. In Advances in Computer Graphics, Cham, pp. 167–178. External Links: ISBN 978-3-030-22514-8 Cited by: §1, §1, §2, §2, §4.2, §4.4, §4.4, §5.2, Table 2.
  • [15] S. Ghorbani, K. Mahdaviani, A. Thaler, K. Kording, D. J. Cook, G. Blohm, and N. F. Troje (2021) MoVi: a large multi-purpose human motion and video dataset. PLOS ONE 16 (6), pp. 1–15. External Links: Document, Link Cited by: Table 3, §5.
  • [16] S. Han, B. Liu, R. Wang, Y. Ye, C. D. Twigg, and K. Kin (2018) Online optical marker-based hand tracking with deep labels. ACM Transactions on Graphics (TOG) 37 (4), pp. 1–10. External Links: Document, ISSN 0730-0301, Link Cited by: §1, §1, §1, §2, §4.4.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. External Links: Document Cited by: §4.1.
  • [18] L. Herda, P. Fua, R. Plänkers, R. Boulic, and D. Thalmann (2001) Using skeleton-based tracking to increase the reliability of optical motion capture. Human Movement Science 20 (3), pp. 313–341. External Links: Document, ISSN 0167-9457 Cited by: §2.
  • [19] D. Holden (2018) Robust solving of optical motion capture data by denoising. ACM Transactions on Graphics (TOG) 37 (4), pp. 165:1–165:12. External Links: Document, ISSN 0730-0301 Cited by: §1, §1, §1, §2, §4.3, §4.4.
  • [20] S. Holzreiter (2005) Autolabeling 3d tracks using neural networks. Clinical Biomechanics 20 (1), pp. 1–8. External Links: Document, ISSN 0268-0033, Link Cited by: Table 2.
  • [21] NaturalPoint Inc. (2019) OptiTrack motion capture systems. Cited by: §1, §1, §4.3.
  • [22] PhaseSpace Inc. (2019) Cited by: Appendix G, §3.
  • [23] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456. Cited by: Figure B.1.
  • [24] H. Joo, T. Simon, and Y. Sheikh (2018) Total capture: a 3D deformation model for tracking faces, hands, and bodies. In CVPR, Vol. , pp. 8320–8329. External Links: Document Cited by: §2.
  • [25] V. Joukov, J. F. S. Lin, K. Westermann, and D. Kulić (2020) Real-time unlabeled marker pose estimation via constrained extended kalman filter. In Proceedings of the International Symposium on Experimental Robotics, Cham, pp. 762–771. External Links: ISBN 978-3-030-33950-0 Cited by: §2.
  • [26] D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, External Links: Link Cited by: Appendix B.
  • [27] M. Leordeanu and M. Hebert (2005) A spectral technique for correspondence problems using pairwise constraints. In ICCV, USA, pp. 1482–1489. External Links: Document, ISBN 076952334X02, Link Cited by: §2.
  • [28] M. Loper, N. Mahmood, and M. J. Black (2014) MoSh: motion and shape capture from sparse markers. ACM Transactions on Graphics (TOG) 33 (6), pp. 1–13. External Links: Document, ISSN 0730-0301, Link Cited by: Figure C.4, Appendix G, Figure 1, §1, §2, §4.3.
  • [29] Vicon Motion Systems Ltd. (2019) Motion Capture Systems. Cited by: §1, §1, §2, §3, §4.3, §5.5.
  • [30] N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black (2019) AMASS: archive of motion capture as surface shapes. In ICCV, pp. 5441–5450. External Links: Document Cited by: Figure 1, §1, §1, §1, §2, §4.3, §4.4.
  • [31] C. Mandery, Ö. Terlemez, M. Do, N. Vahrenkamp, and T. Asfour (2015) The KIT whole-body human motion database. In ICAR, pp. 329–336. External Links: Document Cited by: Table 3, §5.
  • [32] D. Maturana and S. Scherer (2015) VoxNet: a 3D convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. External Links: Document Cited by: §2.
  • [33] J. Maycock, T. Rohlig, M. Schroder, M. Botsch, and H. Ritter (2015) Fully automatic optical motion tracking using an inverse kinematics approach. In IEEE-RAS International Conference on Humanoid Robots (Humanoids), Vol. , pp. 461–466. External Links: Document Cited by: Table 2.
  • [34] J. Meyer, M. Kuderer, J. Müller, and W. Burgard (2014) Online marker labeling for fully automatic skeleton tracking in optical motion capture. In ICRA, pp. 5652–5657. External Links: Document Cited by: §2.
  • [35] M. Müller (2007) Documentation mocap database HDM05. Technical Report CG-2007-2, Universität Bonn. Cited by: §5.
  • [36] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: Appendix B.
  • [37] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas, and M. J. Black (2019) Expressive body capture: 3D hands, face, and body from a single image. In CVPR, Vol. , pp. 10967–10977. External Links: Document Cited by: §1, §2, §4.4, §4.4.
  • [38] G. Peyré and M. Cuturi (2019) Computational optimal transport. Foundations and Trends® in Machine Learning 11 (5-6), pp. 355–607. Cited by: §4.2.
  • [39] (2019) PyTorch lightning. Vol. 3. External Links: Link Cited by: Appendix B.
  • [40] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NIPS, Vol. 30, pp. . External Links: Link Cited by: §2.
  • [41] M. Ringer and J. Lasenby (2002) Multiple hypothesis tracking for automatic optical motion capture. In ECCV, Berlin, Heidelberg, pp. 524–536. External Links: ISBN 978-3-540-47969-7 Cited by: §2.
  • [42] M. Ringer and J. Lasenby (2004) A procedure for automatically estimating model parameters in optical motion capture. Image and Vision Computing 22 (10), pp. 843–850. Note: British Machine Vision Conference (BMVC) External Links: Document, ISSN 0262-8856, Link Cited by: §2.
  • [43] K. Robinette, S. Blackwell, H. A. M. Daanen, M. Boehmer, S. Fleming, T. Brill, D. Hoeferlin, and D. Burnsides (2002) Civilian American and European surface anthropometry resource (CAESAR) final report. Technical Report AFRL-HE-WP-TR-2002-0169, US Air Force Research Laboratory. Cited by: §4.4.
  • [44] P. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich (2020) SuperGlue: learning feature matching with graph neural networks. In CVPR, Vol. , pp. 4937–4946. External Links: Document Cited by: Appendix B, §2, §4.2.
  • [45] T. Schubert, A. Gkogkidis, T. Ball, and W. Burgard (2015) Automatic initialization for skeleton tracking in optical motion capture. In ICRA, pp. 734–739. External Links: Document Cited by: §2.
  • [46] L. Sigal, A. O. Balan, and M. J. Black (2010) HumanEva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV 87 (4), pp. 4–27. External Links: Document Cited by: §4.4.
  • [47] R. Sinkhorn and P. Knopp (1967) Concerning nonnegative matrices and doubly stochastic matrices.. Pacific Journal of Mathematics 21 (2), pp. 343–348. External Links: Link Cited by: §2, §4.2.
  • [48] R. Sinkhorn (1964) A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics 35 (2), pp. 876–879. External Links: ISSN 00034851, Link Cited by: §4.2.
  • [49] Y. Song, L. Goncalves, and P. Perona (2003) Unsupervised learning of human motion. TPAMI 25 (7), pp. 814–827. External Links: Document, ISSN 0162-8828, Link Cited by: §2.
  • [50] J. Steinbring, C. Mandery, F. Pfaff, F. Faion, T. Asfour, and U. D. Hanebeck (2016) Real-time whole-body human motion tracking based on unlabeled markers. In IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), Vol. , pp. 583–590. External Links: Document Cited by: §2.
  • [51] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller (2015) Multi-view convolutional neural networks for 3D shape recognition. In ICCV, Cited by: §2.
  • [52] O. Taheri, N. Ghorbani, M. J. Black, and D. Tzionas (2020) GRAB: a dataset of whole-body human grasping of objects. In ECCV, Cham, pp. 581–600. External Links: ISBN 978-3-030-58548-8 Cited by: §4.3.
  • [53] L. Torresani, V. Kolmogorov, and C. Rother (2008) Feature correspondence via graph matching: models and global optimization. In ECCV, Berlin, Heidelberg, pp. 596–609. Cited by: §2.
  • [54] N. F. Troje (2002) Decomposing biological motion: a framework for analysis and synthesis of human gait patterns. Journal of Vision 2 (5), pp. 2–2. External Links: Document, ISSN 1534-7362, Link Cited by: Table 3, §5.
  • [55] M. Trumble, A. Gilbert, C. Malleson, A. Hilton, and J. Collomosse (2017) Total capture: 3D human pose estimation fusing video and inertial sensors. In BMVC, pp. 14.1–14.13. External Links: Document, ISBN 1-901725-60-X, Link Cited by: §4.4.
  • [56] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Red Hook, NY, USA, pp. 6000–6010. External Links: ISBN 9781510860964 Cited by: Appendix A, §2, §4.1.
  • [57] C. Villani (2008) Optimal transport – old and new. Vol. 338, pp. . External Links: Document Cited by: §2.
  • [58] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, Vol. , pp. 7794–7803. External Links: Document Cited by: §2.
  • [59] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, pp. 1912–1920. External Links: Document Cited by: §2.
  • [60] H. Xu, E. G. Bazavan, A. Zanfir, W. T. Freeman, R. Sukthankar, and C. Sminchisescu (2020) GHUM & GHUML: generative 3d human shape and articulated pose models. In CVPR, Vol. , pp. 6183–6192. External Links: Document Cited by: §2.

Appendix A Self-Attention Span

Figure A.1: Attention span, in meters, as a function of layer depth. The grey area indicates the confidence interval.
Figure A.2: Attention span for 14 markers, across all layers. Each row corresponds to a layer in ascending order, with the bottommost row showing the last layer.

As explained in Sec. 4.1, to increase the capacity of the network and learn rich point features at multiple levels of abstraction, we stack multiple self-attention residual layers. Following [56], a transformer self-attention layer, Fig. 3, takes as input two vectors, the query (Q) and the key (K), and computes a weight vector that learns to focus on different regions of the input value (V) to produce the final output. In self-attention, all three vectors (key, query and value) are projections of the same input; i.e. either 3D points or their features in deeper layers. All the projection operations are done by 1D convolutions, therefore the input and the output features only differ in the last dimension (number of channels). Following the notation of [56]:
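For reference, the scaled dot-product attention defined in [56] is reproduced below. The symbol $d_k$ (the dimension of the keys) comes from [56] rather than from this text, and the multi-head projections are left out of this sketch.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

The attention weights discussed in this appendix are the output of the softmax term, before it is multiplied with $V$.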


In a controlled experiment on the validation dataset, HDM05, with the marker layout presented in Fig. G.1(d), we pass the original markers (without noise) through the network and keep track of the attention weights at each layer; i.e. the output after Softmax in Eqn. 9. At each layer, the tensor shape for the attention weights is . We concatenate frames of 50 randomly selected sequences, roughly 50000 frames, and take the maximum weight across heads and the mean over all the frames to arrive at a mean attention weight per layer; (). In Fig. 4, the weights are visualized on the body with red color intensity for 3 markers. In the first layers, the attention span is wide and covers the entire body. In deeper layers, the attention becomes gradually more focused on the marker of interest and its neighboring markers on the body surface. Fig. A.2 shows the attention span for more markers.
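The aggregation described here (maximum across heads, then mean over frames) can be sketched in a few lines. The tensor layout `(frames, heads, markers, markers)` and the function name are assumptions made for illustration; random logits stand in for real attention weights.

```python
import numpy as np

def mean_attention(weights):
    """Aggregate post-softmax attention weights of one layer.

    weights: array of shape (frames, heads, markers, markers) (assumed layout).
    Returns a (markers, markers) matrix of mean attention weights.
    """
    per_frame = weights.max(axis=1)   # maximum across heads
    return per_frame.mean(axis=0)     # mean over all frames

rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 5, 41, 41))            # stand-in for real activations
w = np.exp(logits - logits.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)                    # softmax over the last axis
mean_w = mean_attention(w)
```

Because the maximum is taken across heads, the rows of `mean_w` no longer sum to one; they sum to at least one.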

To make this observation more concrete, we compute the Euclidean distance of each marker to all others on an A-posed body to create a distance discrepancy matrix of (), and multiply the previous mean attention weights with this distance discrepancy matrix to arrive at a scalar for attention span in meters. On average we observe a narrower focus for all markers in deeper layers; Fig. A.1.
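One way to read this computation is as an attention-weighted mean distance, sketched below. The row renormalisation and the final average over markers are assumptions; the text only states that the mean attention weights are multiplied with the distance matrix to obtain a scalar in meters.

```python
import numpy as np

def attention_span(mean_weights, marker_xyz):
    """Attention-weighted mean distance (meters) between markers.

    mean_weights: (markers, markers) mean attention weights of one layer.
    marker_xyz:   (markers, 3) marker positions on an A-posed body, in meters.
    """
    diff = marker_xyz[:, None, :] - marker_xyz[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)      # pairwise Euclidean distances
    # Renormalise each row so that the weights act as a distribution (assumption).
    w = mean_weights / mean_weights.sum(axis=-1, keepdims=True)
    per_marker = (w * dist).sum(axis=-1)      # expected distance attended to
    return per_marker.mean()                  # single scalar per layer

xyz = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
wide = attention_span(np.full((2, 2), 0.5), xyz)   # attention spread evenly
narrow = attention_span(np.eye(2), xyz)            # attention only on the marker itself
```

With this definition, a layer that attends only to the marker itself has a span of zero, and the span grows as attention spreads to distant markers.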

Appendix B Implementation Details

(a) Detailed SOMA Architecture
(b) Self-Attention block
(c) Conv1D-block
Figure B.1: Detailed components of SOMA model. fi and fo show the number of input and output features of the layer. is the number of points in a frame of data and is number of all labels including null. IN and OUT in (c) show the number of input and output features of the block. All convolutions are one dimensional. BN stands for batch normalization [23].

Through model selection, Sec. C, we choose iterations for Sinkhorn and as optimal choices and we empirically pick 5e-5. The model contains M parameters and full training on 8 Titan V100 GPUs takes roughly 3 hours. We implement SOMA in PyTorch [36]. We benefit from the log-domain stable implementation of Sinkhorn released by [44]. We use ADAM [26] with a base learning rate of and reduce it by a factor of when validation error plateaus with patience of epochs and train until validation error does not drop anymore for epochs. The training code is implemented in PyTorch Lightning [39] and easily extendable to run on multiple GPUs. For the LogSoftmax experiment, we replace the optimal transport layer and everything else in the architecture remains the same. In this case, the score matrix, in Fig. 3, will have an extra dimension for the null label. Fig. B.1 shows a detailed architecture of the SOMA model.
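To illustrate what a log-domain Sinkhorn normalization with a null label looks like, here is a minimal NumPy sketch. It is not the implementation released by [44] nor the SOMA code: the function names, the fixed `null_score`, and the choice of marginals (unit mass per point and per label, with the extra row and column absorbing the remainder) are assumptions for illustration.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sinkhorn_with_null(scores, null_score=0.0, n_iters=50):
    """scores: (points, labels). Returns a log-assignment of shape (points+1, labels+1)."""
    n, m = scores.shape
    # Augment with one extra row and column that collect unmatched mass.
    aug = np.full((n + 1, m + 1), null_score, dtype=float)
    aug[:n, :m] = scores
    log_mu = np.log(np.concatenate([np.ones(n), [m]]))   # row masses
    log_nu = np.log(np.concatenate([np.ones(m), [n]]))   # column masses
    u = np.zeros(n + 1)
    v = np.zeros(m + 1)
    for _ in range(n_iters):
        # Alternate row and column normalisation entirely in the log domain.
        u = log_mu - logsumexp(aug + v[None, :], axis=1)
        v = log_nu - logsumexp(aug + u[:, None], axis=0)
    return aug + u[:, None] + v[None, :]

rng = np.random.default_rng(0)
log_p = sinkhorn_with_null(rng.normal(size=(5, 4)))
labels = log_p[:5].argmax(axis=1)   # index 4 denotes the null label
```

Working with `u` and `v` as log-scalings avoids exponentiating large scores, which is the usual motivation for a log-domain formulation.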

Appendix C Hyper-parameter Search

Figure C.1: Validation accuracy as a function of number of attention layers
Figure C.2: Validation accuracy as a function of number of Sinkhorn normalization steps.
Figure C.3: Training convergence with extreme ghost point distributions on the validation dataset, i.e. HDM05, for 40 training epochs.
Figure C.4: Marker layout from MoSh [28] dataset with 89 markers. A model trained on this marker layout is used for rapid automatic label priming for labeling the single frame per significant marker layout variation.

To choose the optimum number of attention layers and iterations for Sinkhorn normalization we exploit the validation dataset HDM05 to perform a model selection experiment. We produce synthetic training data following the prescription of Sec. 4.4 using the marker layout of HDM05 (Fig. G.1(d)) and evaluate on real markers with synthetic noise as explained in Sec. 5. For hyperparameter evaluation, we want to eliminate random variations in the network weight initialization so we always use the same seed. In Fig. C.1, we train one model per given number of layers. Guided by this graph we choose 8 layers as a sensible choice for adequate model capacity, i.e. 1.44 M parameters, and generalization to real markers. In Fig. C.2, we repeat the same process, this time keeping the number of layers fixed at 8 and varying the number of Sinkhorn iterations. We choose iterations that seem a good trade-off between computation time and performance.

Appendix D Standard Deviations

Method Number of Exact Per-Frame Occlusions
0 1 2 3 4 5 5+G
SOMA-Real 1.76 1.90 2.03 2.22 2.44 2.73 3.08
SOMA-Synthetic 4.68 2.89 3.25 3.54 4.10 4.62 5.92
SOMA * 1.59 1.76 1.91 2.12 2.34 2.59 2.85
Table D.1: Accuracy standard deviation corresponding to Tab. 2 of the main paper.

In Tab. D.1, we report the accuracy standard deviation of Tab. 2 as complementary material. We observe lower variation for the model trained on synthetic data using AMASS bodies. The model trained on synthetic data of a limited number of bodies shows the largest variation.

Appendix E Marker Layout Variation of HDM05

Figure E.1: Modified HDM05 marker layout. Number of markers removed: (a) 3 (b) 5 (c) 9 (d) 12.

In Fig. E.1 we visualize the marker layout modifications for the experiment in Sec. 5.3.

Appendix F Stability of the Training Process

We consistently observe stable runtime and training processes. In Fig. C.3, we provide training curves for more “extreme” ghost point distributions. Specifically, we add a uniform distribution in a cubic volume in the range of meters and a skewed Gaussian with a mean location sampled uniformly from the same volume and a random covariance matrix. We also drastically increase the number of ghost points to up to 60 per frame. As suggested by the figure, training is stable and converges from early iterations on.
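A rough sketch of sampling such ghost points is given below. The cube half-extent is a hypothetical parameter (the range is not given here), the split between the two distributions is arbitrary, and the skew of the Gaussian is omitted for brevity, so this uses a plain Gaussian with a random covariance.

```python
import numpy as np

def sample_ghost_points(rng, n_uniform, n_gauss, half_extent):
    """Ghost points from a uniform cube plus a Gaussian blob with random covariance."""
    # Uniform samples inside a cube centred at the origin.
    uniform = rng.uniform(-half_extent, half_extent, size=(n_uniform, 3))
    # Gaussian blob: mean drawn uniformly from the same cube.
    mean = rng.uniform(-half_extent, half_extent, size=3)
    a = rng.normal(size=(3, 3))
    cov = a @ a.T + 1e-6 * np.eye(3)   # random symmetric positive-definite matrix
    gauss = rng.multivariate_normal(mean, cov, size=n_gauss)
    return np.vstack([uniform, gauss])

rng = np.random.default_rng(0)
ghosts = sample_ghost_points(rng, n_uniform=30, n_gauss=30, half_extent=2.0)
```

Here 60 ghost points are drawn in total, matching the upper bound mentioned in the text.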

Appendix G Processing Real MoCap Data

(a) BMLmovi
(b) BMLrub
(c) KIT
(d) HDM05
Figure G.1: Marker layout of test and validation datasets.
Figure G.2: Significant variation of marker placement on the hands and feet in the DanceDB dataset.
Figure G.3: Sample of marker layouts used for training SOMA model for CMU-II dataset.
Name Frames Motions Acc. F1
Clap 7572 6 100.00 0.00 100.00 0.00 0.00 0.08 0.00
Dance 15023 8 99.78 0.87 99.68 1.29 0.15 1.76 0.00
Jump 9621 6 99.99 0.13 99.99 0.25 0.03 0.72 0.00
Kick 10787 6 99.59 1.18 99.48 1.50 0.75 6.92 0.00
Lift 16932 6 100.00 0.00 100.00 0.00 0.00 0.06 0.00
Random 19617 7 100.00 0.05 100.00 0.09 0.00 0.21 0.00
Run 9356 6 100.00 0.00 100.00 0.00 0.00 0.06 0.00
Sit 9829 6 100.00 0.00 100.00 0.00 0.00 0.09 0.00
Squat 11287 6 100.00 0.00 100.00 0.00 0.01 0.13 0.00
Throw 9292 6 99.99 0.15 99.99 0.22 0.00 0.09 0.00
Walk 12264 6 100.00 0.00 100.00 0.00 0.00 0.11 0.00
131580 69 99.94 0.47 99.92 0.64 0.08 2.09 0.00
Table G.1: Per-motion-class statistics of the SOMA dataset and performance of the SOMA model.

Here we elaborate on Sec. 4.5, namely on real use-case scenarios of SOMA. The marker layouts of the test datasets, Sec. 5, are obtained by running MoSh on a single random frame chosen from the respective dataset. Fig. G.1 shows the marker layout used for training SOMA for each dataset.

In addition to the test datasets with synthetic noise, presented in Sec. 4.5, we demonstrate the real application of SOMA by automatically labeling four real mocap datasets captured with different technologies: two with passive markers, SOMA and CMU-II [11], and two with active markers, DanceDB [4] and Mixamo [10]; for an overview refer to Tab. 7.

For proper training of SOMA we require one labeled frame per significant variation of the marker layout throughout the dataset. Most of the time one layout is used to capture the entire dataset, yet as we see next, this is not always the case, especially when the marker layout is adapted to the target motion. To reduce the effort of labeling the single frame we offer a semi-automatic bootstrapping technique. To that end, we train a general SOMA model with a marker layout containing 89 markers selected from the MoSh dataset [28], visualized in Fig. C.4; this is a marker superset. We choose one sequence for each of the representative layouts and run the general SOMA model to prime the labels; we then choose one frame per auto-labeled sequence and correct any incorrect labels manually. The label priming step significantly reduces the manual effort required for labeling mocap datasets with diverse marker layouts. After this step, everything stays the same as before.

Labeling Active Marker Based MoCap should be the easiest case since the markers emit a frequency-modulated light that allows the mocap system to reliably track them. However, often the markers are placed at arbitrary locations on the body so correspondence of the frequency to the location on body is not the same throughout the dataset, hence these archival mocap datasets cannot be directly solved. This issue is further aggravated when the marker layout is unknown and changes drastically throughout the dataset. It should be noted that, for the case of active marker mocap systems, such issues could potentially be avoided by a carefully documented capture scenario, yet this is not the case with the majority of the archival data.

As an example, we take DanceDB [4], a publicly released dance-specific mocap database. This dataset is recorded with active marker technology from PhaseSpace [22]. The database contains a rich repertoire of dance motions with 13 subjects on the last access date. We observe a large variation in marker placement, especially on the feet and hands, hence we manually label one random frame per significant variation; in total 8 frames. We run the first stage of MoSh independently on each of the selected 8 frames to get a fine-tuned marker layout; a subset is visualized in Fig. G.2. It is important to note that we train only one model for the whole dataset while different marker layouts are handled as a source of noise. As presented in Tab. 7, manual evaluation of the solved sequences reveals a success rate above 80%. The failures are mainly due to impurities in the original data, such as excessive occlusions or large marker movement on the body due to several markers coming off (e.g. the headband).

The second active marker based dataset is Mixamo [10], which is widely used by the computer vision and graphics community for animating characters. We obtained the original unlabeled mocap marker data used to generate the animations. We observe more than 50 different marker layouts and placements on the body, of which we pick 19 key variants. The automatic label priming technique is greatly helpful for this dataset.

The Mixamo dataset contains many sequences with markers on objects, i.e. props, which SOMA is not specifically trained to deal with. However, we observe stable performance even with challenging scenarios with a guitar close to the body; see the third subject from the left of Fig. 1. A large number of solved sequences were rejected mostly due to issues with the raw mocap data; e.g. significant numbers of markers flying off the body mid capture.

Labeling Passive Marker Based MoCap is a greater challenge for an auto-labeling pipeline. For these systems, markers are assigned a new ID on their reappearance from an occlusion, which results in small tracklets instead of full trajectories. The assignment of the ID to markers is random.

For the first use case, we process an archived portion of the well-known CMU mocap dataset [11] summing to minutes of mocap which has not been processed before, mostly due to cost constraints associated with manual labeling. It is worth noting that the total amount of available data is roughly 6 hours, of which around 2 hours is pure MPC. Initial inspection reveals significant variations in marker layouts, with a minimum of 40 markers and a maximum of 62; a sample of these can be seen in Fig. G.3. Again we train one model for the whole dataset that can handle variations of these marker layouts. SOMA shows stable performance across the dataset even in the presence of occasional object props, as seen in Fig. 1; the second subject from the left is carrying a suitcase.

Due to extreme variation of marker layouts throughout the dataset we notice failure cases where many points could not be assigned to a marker on the body, most probably due to variation from the expected placement. As studied in Sec. 5.3, this deteriorates the labeling performance and could result in a failure in solving the body mainly because of introduced occlusions.

In the second case, we record our own dataset with two subjects, for which we pick one random frame and train SOMA for the whole dataset. In Tab. G.1 we present details of the dataset motions and per-motion-class performance of SOMA. We manually label this dataset to obtain ground truth and then fit the labeled data with MoSh. This provides ground truth 3D meshes for every mocap frame. The V2V error measures the average difference between the vertices of the body solved using the ground truth labels and the body solved using the SOMA labels. Mean V2V errors are under one millimeter, usually by an order of magnitude. Sub-millimeter accuracy is what users of mocap systems expect and SOMA delivers this.
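The V2V error described above reduces to a mean per-vertex Euclidean distance between two meshes with the same topology. The sketch below assumes vertices are given in meters as arrays of shape `(frames, vertices, 3)`; the function name and the toy data are illustrative only.

```python
import numpy as np

def v2v_error(verts_a, verts_b):
    """Mean Euclidean distance between corresponding vertices.

    verts_a, verts_b: arrays of shape (..., V, 3) with identical vertex ordering.
    The result is in the same unit as the inputs.
    """
    return np.linalg.norm(verts_a - verts_b, axis=-1).mean()

# Toy example: the second mesh is the first one shifted by 5 mm.
body_gt = np.zeros((2, 10, 3))
body_soma = body_gt + np.array([0.003, 0.004, 0.0])
err_m = v2v_error(body_gt, body_soma)
err_mm = 1000.0 * err_m
```

Reporting the value in millimeters makes it directly comparable with the sub-millimeter figures quoted in the text.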

Symbol Description
MPC MoCap Point Cloud
set of labels including the null label
a single label
vector of marker layout body vertices corresponding to labels not including null
vector of varied marker layout vertices
number of markers
set of all points
ground-truth augmented assignment matrix
predicted assignment matrix
augmented assignment matrix
score matrix
class balancing weight matrix
X markers
body vertices
marker distance from the body along the surface normal
body joints
number of attention heads
number of attention layers
Table G.2: List of Symbols

Appendix H List of Symbols

In Tab. G.2, we provide a table of mathematical symbols used throughout the paper.