Data obtained from motion capture (mocap) is used for a variety of purposes, including sports, animation, robotics and medicine. The data is obtained by tracking the movements of (typically) human performers, in either a marker-based or a markerless setup. This data is often noisy and incomplete because of errors in measurement, tracking or pose reconstruction. Causes include calibration error, sensor noise, poor sensor resolution, incorrectly affixed markers, and occlusion by body parts or clothing. Much effort is then spent cleaning this motion data prior to use. Most often the noisy segments are identified manually, and a particular cleaning technique is applied locally. Examples of such techniques include interpolation via spline fitting, simple smoothing kernels, inferring the motion of missing joints from neighboring joints on rigid segments [19, 2], applying kinematic or geometric constraints [14, 18], and various filtering techniques [21, 7]. This variety of approaches attests that cleaning motion capture data is a crucial process, and that the data is otherwise unusable.
In this paper, we present a deep recurrent framework for cleaning motion capture data (Figure 1). Our method handles both noisy signals and signals with missing data or gaps. After the supervised training phase, it requires no further annotation other than knowledge of where the gaps occurred, corresponding to intervals when markers went missing. The approach is not noise-specific or action-specific: it can be trained on any type of noise, and a heterogeneous mix of action types. The noise distribution can be completely unknown as long as noisy and clean motion pairs are available for training. Further, the model adapts on-the-fly to the test action type (e.g. “walk”, “run”, “jump”) as long as the training includes some (unlabeled) examples of this type. Finally, the approach can operate in a streaming setting, processing mocap data in real time as it arrives instead of needing to look at the complete motion clip at once.
Such a framework faces several challenges. First, different joints of a tracked character move at different and varying speeds, especially across different phases/types of actions (e.g. foot swing vs strike, or walk vs jump). Hence, naïve smoothing risks either blurring out sharp movements or not removing enough noise. Second, if the noise is significant, then an isolated short interval of the motion contains few cues to reliably recover the underlying clean signal by model fitting. Third, noise distributions can be unknown, arbitrary, and have non-zero mean. This makes handcrafting an approach for blind noise removal distinctly non-trivial. Fourth, gaps created by missing markers can be long enough to omit complex short-duration motions: such gaps cannot be filled by simple interpolation.
For a signal that has both noise and gaps, we preprocess the signal with a second B-LSTM network that synthesizes missing frames from surrounding context. This EBD network, a bidirectional variant of the Encoder-Recurrent-Decoder (ERD) architecture of Fragkiadaki et al. [12], is trained with data augmentation and dropout to be robust to long-duration synthesis. It fills a gap with a sequence that captures the correct trend but is still somewhat noisy. (As we show in our experiments, the EBD architecture alone is not sufficient to produce a clean signal.) The complete, noisy motion is then cleaned with our first (EBF) network.
We show through extensive evaluations that our framework cleans both small- and large-amplitude noise, and fills gaps, better than a variety of baselines. We also develop a benchmark dataset of a variety of motion types corrupted by various types of synthetic noise and gaps. We will put this dataset, as well as code and trained models, in the public domain.
2 Related Work
Rudimentary motion cleaning techniques that handle both noisy and missing samples exist in software systems available from mocap solution providers like Vicon [25] and 3D content creation tools like Maya [3] or Blender [6]. These rely on simple interpolation and filtering, along with kinematic constraints, and have to be applied manually to the input motion signals.
In prior research, mocap cleaning has often been combined with marker tracking, labeling and reconstruction. We now discuss a variety of such approaches.
Skeleton based methods.
Herda et al. [14] and Hornung et al. [18] try to improve robustness of marker tracking/labeling in marker-based mocap by using a kinematic skeleton to assist marker reconstruction. Zordan and Van Der Horst [28] also use a fixed skeleton to map markers and resolve joint angle state with a dynamics model. In these methods, the skeleton-to-marker mapping is fixed and explicitly specified. They can only reconstruct markers that are briefly missing, and are not robust to noise.
Kalman filter based methods.
Kalman filters have been used to track markers by assuming rigid constraints in marker placement [9]. Li et al. [19] present the BoLeRO method to enforce bone length constraints in a linear dynamical system. Aristidou and Lasenby [2] use improved tracking but still assume rigid limbs with markers on each limb. The use of these methods is limited to scenarios where the markers, missing or not, follow these assumptions and the noise model is known a priori.
Dimensionality reduction based methods.
Liu and McMillan [20] and Park and Hodgins [23] reconstruct missing markers by projecting motion onto its principal components. Burke and Lasenby [7] combine temporal smoothing with a Kalman filter and low rank matrix completion to learn an effective subspace for filtering. This fixes gaps, but the work is not robust to heavy noise. Akhter et al. [1] use a bilinear spatiotemporal basis to factor the motion into shape and trajectory. This does not enforce coherence and hence works only in a narrow linearly approximable regime. Xiao et al. [27] use L1-minimization to learn an optimal dictionary for denoising. This method requires solving an expensive optimization problem for each test motion, requires access to the entire motion clip, and is demonstrated only on tiny noise amplitudes (dB SNR, vs our tests with a default of dB). Lou and Chai [21] learn filter bases from a few clean samples, and then use non-linear optimization to determine filter weights and the clean motion. This does not generalize over a large variety of input motions, as separate filter bases need to be learnt for each motion type. It also requires expensive test-time optimization.
Generative models and deep learning based methods.
Taylor et al. [24] present a Conditional Restricted Boltzmann Machine that generates motion signals similar to learnt data and can thereby fill gaps. Fragkiadaki et al. [12] present an ERD network to generate human motions that is easier to train and generalizes better than earlier work. However, it produces very short motion sequences of up to ms, and does not address motion cleaning. Du et al. [10] train a hierarchical RNN to recognize actions in motion sequences. Holden et al. [17, 16, 15] develop a variety of deep networks to synthesize and edit motion. LSTM variants have been used to denoise speech [5, 11, 26].
In contrast to existing literature, our method is not restricted to any particular kind of marker placement, skeleton, noise or motion model, since these are implicitly learnt from data. It is robust to noise and missing samples, and works for a large variety and length of motions, different kinds of unknown large-amplitude noise, and long gaps.
Our approach leverages a recurrent, bidirectional long short-term memory neural network. We are inspired by the Encoder-Recurrent-Decoder (ERD) architecture of Fragkiadaki et al. [12], which synthesizes motion frames given a short initial sequence. However, while the ERD architecture can synthesize plausible gross motions, these sequences are not noise-free when conditioned on noisy observations. It captures general motion trends well, but is less adept at removing fine-grained high-frequency noise. Hence, our architecture focuses on outputting a denoising filter for the motion in each frame, rather than attempting to directly produce a denoised pose. (Recent work on denoising rendered images has explored filter prediction in a feedforward convolutional framework [4].) As we show in our evaluation, this approach was critical in producing satisfactory results.
Further, the ERD network contains an encoder component to transform the input to a higher-dimensional space to improve performance of the subsequent recurrent layers. In contrast, our network projects the input to a lower-dimensional manifold, similar to an autoencoder, to constrain processing to a set of plausible configurations. This simplifies the operation of the recurrent network.
Lastly, in contrast to Fragkiadaki et al., whose primary goal was motion synthesis, our recurrent architecture is bidirectional [13], allowing us to utilize both past and future context to estimate the best adaptive filters to clean the noisy input. We are able to do this because we have access to the entire (noisy) temporal sequence.
Our Encoder-Bidirectional-Filter (EBF) architecture comprises an encoder module (E) to project the input pose (and lookback) to a low-dimensional manifold, a recurrent bidirectional LSTM component (B) to exploit temporal coherence, and a filter prediction module (F) that outputs the smoothing filter to be applied to each joint in the current frame.
The input representation is the vector of joint angles of the skeletal model at time (measured in frames), plus the global angular velocities (around the X, Y and Z axes) of the root node of the character, for a total of parameters. The network consists of fully-connected layers (Encoder), followed by bidirectional recurrent layers (B-LSTM), followed by more fully-connected layers (Filter Prediction). In our default implementation, the network outputs values, each interpreted as the standard deviation of a Gaussian which will be used to smooth the motion of the corresponding joint in the current frame. We use as the activation function in the encoder and filter prediction layers, except for the last filter layer, which uses to ensure that the output standard deviations are positive.
The encoder module reduces the input -dimensional pose vector to a compact -dimensional code. This is then processed by the B-LSTM with frames of lookback and frames of lookahead (for a frame-rate of fps). The B-LSTM output is mapped to a -dimensional filter vector by the network. The overall EBF architecture is illustrated in Figure 2.
The extra channels containing the angular velocities for the root joint are assumed to be clean and are used only for disambiguation. To make the training invariant to some simple transformations such as heading changes, we use angular velocities instead of absolute angles in these channels.
For training the network for non-zero mean noise, we modify the existing EBF to predict both Gaussian filter parameters as well as a bias parameter for every channel. This network predicts standard deviations and values of bias for each channel at time . The smoothed pose produced by the network is given by the weighted sum:
Here is a normalization term. Our architecture can easily accommodate more fine-grained filter models. For instance, we might directly predict general weight values over the -frame filtering window. In our experiments with such models, however, we found that the predicted filter shapes were essentially Gaussian (Figure 3). We predicted parameters of Gaussian filters in all our experiments. Nevertheless, the framework can accommodate other filter shapes should the need arise.
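The per-frame filtering described above can be illustrated with a minimal sketch in plain Python. It is not the trained network: the standard deviations and biases are simply passed in as lists, whereas in the framework they would be predicted per frame and per channel. The function names, the clamping of indices at the clip boundaries, and the choice to add (rather than subtract) the bias are assumptions made for illustration.

```python
import math


def gaussian_weights(sigma, half_window):
    """Normalized Gaussian weights over offsets -half_window..half_window."""
    raw = [math.exp(-(k * k) / (2.0 * sigma * sigma))
           for k in range(-half_window, half_window + 1)]
    z = sum(raw)  # normalization term: the weights sum to one
    return [w / z for w in raw]


def smooth_channel(noisy, sigmas, biases, half_window):
    """Filter one joint channel with a per-frame Gaussian plus a per-frame bias.

    noisy, sigmas and biases are equal-length lists (sigmas must be positive).
    Frames outside the clip are handled by clamping the index to the nearest
    valid frame (an assumption; other boundary rules are possible).
    """
    n = len(noisy)
    out = []
    for t in range(n):
        weights = gaussian_weights(sigmas[t], half_window)
        acc = 0.0
        for j, k in enumerate(range(-half_window, half_window + 1)):
            idx = min(max(t + k, 0), n - 1)
            acc += weights[j] * noisy[idx]
        # Assumed sign convention: the predicted bias is added to the result.
        out.append(acc + biases[t])
    return out
```

A small standard deviation leaves a frame almost untouched, while a large one averages over most of the window, which is how a per-frame prediction can preserve sharp movements and still smooth noisy ones.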
The EBF network is trained end-to-end by providing a sequence of noisy frames as input and the corresponding clean ground truth frames as expected output. The network uses L2 loss between motion cleaned with the predicted filters and the ground truth. This setup supports back-propagation since the filtering operation itself is a simple dot product with the -D weight vector for each joint channel (plus a preceding set of exponentiations to compute the weights from the predicted Gaussian standard deviation). We initialize the network weights of the encoder and filter layers by sampling from a uniform distribution over. We train the network for epochs using stochastic gradient descent (SGD) with adaptive moments.
Mocap data can have gaps due to occlusions while tracking markers. To fill these gaps we use an Encoder-Bidirectional-Decoder (EBD) architecture. The encoder module is identical to that of EBF. The B-LSTM module is nearly identical, except it has fewer neurons in each hidden layer (vs in each of the forward and backward directions). The decoder module recovers the -D joint angle vector from the B-LSTM output through fully-connected layers. The network has a lookback and lookahead of frames with a stride of frames (i.e. covering a frame interval) in each direction. This architecture is directly inspired by the (unidirectional) ERD architecture of Fragkiadaki et al. [12]. It can reconstruct missing channels from the other joint angles by exploiting inter-joint correlations and temporal coherence.
We make the reasonable assumption that the locations of gaps are known (since we can tell when markers went missing). The gap is initially filled with values linearly interpolated from its endpoints, during both training and testing. The EBD learns to replace the linearly interpolated data with the correct sequence of values (non-gap portions at other times or for other joints are left unaltered). The completed sequence is then passed on to the EBF network for denoising.
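The linear-interpolation initialization of a gap can be sketched as follows. This is an illustrative helper, not code from our implementation: the function name and the half-open index convention are assumptions, and the sketch assumes the gap does not touch either end of the clip so that both endpoints exist.

```python
def lerp_fill(channel, gap_start, gap_end):
    """Fill channel[gap_start:gap_end] by linear interpolation.

    The gap is interpolated between the last valid sample before it
    (index gap_start - 1) and the first valid sample after it (index gap_end).
    Assumes 0 < gap_start < gap_end < len(channel).
    """
    left_idx = gap_start - 1
    left = channel[left_idx]
    right = channel[gap_end]
    span = gap_end - left_idx
    filled = list(channel)  # leave the input untouched
    for i in range(gap_start, gap_end):
        alpha = (i - left_idx) / span
        filled[i] = (1.0 - alpha) * left + alpha * right
    return filled
```

Samples outside the gap are copied through unchanged, matching the requirement that non-gap portions are left unaltered.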
Like EBF, the EBD network is also trained end-to-end for epochs using SGD with adaptive moments. The B-LSTM layers have an input dropout of . For successful training and robust long range prediction, we found it necessary to augment the training data with many different motions and noise samples, as described in the next section.
We collected motions from the CMU mocap database , comprising a total of walks, jogs, runs, jumps and kicks. Every motion is at fps and is stored as a BVH file.
Generating noisy train/test data.
The default “angular” noise for each joint rotation channel is sampled from a mean-zero Gaussian with standard deviation equal to times the standard deviation of the channel (dB SNR). Hereafter, we refer to this as “ noise”. We also consider motions corrupted by “spatial” noise, generated by applying 3D Gaussian noise to joint positions and optimizing to preserve bone lengths. Spatial noise models motion corruption due to errors in marker location. We test on two types of spatial noise: (a) the same variance at all joints, and (b) larger variance at wrist and ankle end effectors, modeling greater tracking error at fast moving markers. Lastly, we also test with a variety of other synthetic noise types to show the robustness of the method.
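The angular noise model can be sketched in a few lines of plain Python. The scale factor passed in below is a placeholder, not the value used in our experiments, and the helper names are hypothetical. The sketch also assumes that signal power is measured as the variance of the channel about its mean, so that a noise standard deviation of `scale` times the channel standard deviation corresponds to an SNR of -20 log10(scale) dB.

```python
import math
import random


def channel_std(values):
    """Population standard deviation of one joint-rotation channel."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def add_angular_noise(values, scale, rng):
    """Add zero-mean Gaussian noise with std equal to scale * channel std."""
    sigma = scale * channel_std(values)
    return [v + rng.gauss(0.0, sigma) for v in values]


def snr_db(scale):
    """SNR in dB when the noise std is `scale` times the signal std.

    The power ratio is 1 / scale**2, hence 10*log10(1/scale**2).
    """
    return -20.0 * math.log10(scale)
```

For example, `add_angular_noise(channel, 0.5, random.Random(0))` corrupts a channel with noise at half its standard deviation, where 0.5 is chosen purely for illustration.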
We generate missing samples or gaps in the data by sampling from a distribution over gap lengths that decays exponentially with increasing length . There is a probability of starting a gap in a specific joint channel at a specific frame in the motion. (Note that very few joint channels are likely to have gaps simultaneously under this model – this is critical for our method since it exploits correlation between joints to reconstruct missing data.)
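One way to realize this gap model is sketched below. The exact distribution parameters are not reproduced here: `start_prob`, `decay` and `max_len` are hypothetical arguments, and truncating the exponential distribution at `max_len` is an assumption made to keep the sketch simple.

```python
import math
import random


def sample_gap_length(rng, decay, max_len):
    """Draw a length in 1..max_len with P(L) proportional to exp(-decay * L)."""
    lengths = list(range(1, max_len + 1))
    weights = [math.exp(-decay * length) for length in lengths]
    return rng.choices(lengths, weights=weights, k=1)[0]


def sample_gap_mask(num_frames, num_channels, start_prob, decay, max_len, rng):
    """Return mask[c][t] == True where channel c is missing at frame t.

    At every frame not already inside a gap, a new gap starts in that channel
    with probability start_prob; its length is drawn from the decaying
    distribution above and clipped to the end of the clip.
    """
    mask = [[False] * num_frames for _ in range(num_channels)]
    for c in range(num_channels):
        t = 0
        while t < num_frames:
            if rng.random() < start_prob:
                end = min(t + sample_gap_length(rng, decay, max_len), num_frames)
                for i in range(t, end):
                    mask[c][i] = True
                t = end
            else:
                t += 1
    return mask
```

Because gaps are drawn independently per channel with a small start probability, simultaneous gaps in many channels are rare, which is the property the reconstruction relies on.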
We use randomly sampled holdouts of motions each for cross validation of our framework. In each holdout we ensure there are walks, jogs, runs, jump and kick motion, randomly selected from their respective types. For statistical robustness, all our tests, on each type of motion, are repeated times and the results are averaged across runs. In each run, we vary a motion by adding different noise and gap samples. The noise and gap distributions remain fixed across runs.
4.1 Denoising Results
Comparison of denoising methods.
We compare the performance of our EBF model (without bias outputs) for denoising motion capture data with two baseline Gaussian filters, an Encoder-Bidirectional-Decoder (EBD) model similar to Fragkiadaki et al. [12], and the example-based denoising method of Lou and Chai [21] (which the authors show to improve upon Kalman and data-driven Kalman filters). First, we plot the variation in RMS error over all joints with time. Figure 4 shows these plots for different test motions. A single EBF model trained over a mix of motion types (red curve) has lower error than all competing methods. Figure 5 presents the RMS error averaged over all frames of a motion, for all motions from the holdouts. The motions have been grouped by type, for clarity. Again, we see that our EBF model outperforms the other methods.
The average RMS error across all motions in our test set is , , , and 1.46 for a Gaussian with a standard deviation of ms, a Gaussian with a standard deviation of ms, EBD, Lou and Chai , and EBF, respectively. The corresponding figures for cm spatial noise are , , , and 2.72 respectively.
Visualizations of noisy and cleaned motions are shown in the supplementary video.
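For reference, the two error measures used in these comparisons (RMS error over all joints at each frame, and its average over all frames of a motion) can be computed as in the following sketch. The function names and the representation of a motion as a list of per-frame joint-angle lists are assumptions for illustration.

```python
import math


def rms_per_frame(cleaned, truth):
    """RMS error over all joint channels, computed separately for each frame.

    cleaned and truth are lists of frames; each frame is a list of joint
    angles in the same order and units.
    """
    errors = []
    for c_frame, t_frame in zip(cleaned, truth):
        squared = [(c - t) ** 2 for c, t in zip(c_frame, t_frame)]
        errors.append(math.sqrt(sum(squared) / len(squared)))
    return errors


def mean_rms(cleaned, truth):
    """Per-frame RMS error averaged over all frames of a motion."""
    per_frame = rms_per_frame(cleaned, truth)
    return sum(per_frame) / len(per_frame)
```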
Comparison of noise amplitudes.
We train multiple EBF models with different noise amplitudes. Three EBF models are trained on datasets that have , and noise respectively. We also train a single EBF with all the noisy motions used to train the previously mentioned variants. Then we test the performance of these models on each of the different noisy datasets, and plot the RMS error, averaged over all frames, for each motion from all the holdouts. It can be seen in Figure 6 (left) that for each noise amplitude, all EBF models perform almost similarly. Therefore, unless specified otherwise, we have used an EBF trained with noise in all our other experiments.
We also compare the denoising performance of our EBF models, trained and tested on (dB SNR) and (dB SNR) noise respectively, with a baseline Gaussian filter. A Gaussian with higher standard deviation is used as the baseline for higher noise since it was found to be more accurate. Similarly, we also compare the performance of EBF with a Gaussian baseline on different amplitudes of spatial noise. As can be seen in Figure 6 (middle and right) EBF always outperforms the baseline.
Performance on different noise types.
We test the performance of our EBF model (with bias outputs) on uniform noise; on noise plus a constant bias; and on noise plus a sinusoidally varying bias. The EBF is trained and tested on data containing each kind of noise in turn, and compared to a Gaussian baseline. Figure 7 shows that the EBF is able to successfully learn from data containing different types of noise, and is able to subsequently denoise it. Note that for bias-free distributions, the EBF-with-bias model learnt an almost exactly zero bias term, as expected.
Ablation studies with architecture variants.
We compare different network architectures. In the first architecture (denoted BF) we drop the E (Encoder) layer and train a network with only the B and F layers. In the second architecture (denoted NN) we train a feedforward network with fully connected dense layers, with activation in all layers except the final one. The input is a single vector of size and the output is a vector of values for joint channels. The third network is our EBF model. The variation in average RMS error for all test motions (Figure 8) shows that EBF (avg error 1.46) performs better at denoising than both the BF () and NN () networks. This indicates that the encoder module and the recurrent architecture are both important.
Extrapolation to other actions.
To test whether our framework trained on one type of motion can extrapolate to cleaning a different type, we held out each of the action categories (walk, jog, run, jump, kick) in turn for testing, and trained on the remaining categories. In this test, the average RMS error over the motions selected in the original holdouts is . This compares favorably with Gaussian smoothing () but is expectedly not as good as when the EBF is trained on all action types (1.42).
4.2 Gap Filling Results
We look at the performance of our EBF + EBD framework in the presence of missing samples. We train the networks on noisy data with missing samples. Runs of missing samples or gaps are generated in the data preparation step as explained above. We compare the performance of three methods for cleaning motion capture data with gaps and noise. In the first method, we use linear interpolation (lerp) to close the gaps, and then use a Gaussian filter to denoise the complete motion. The second method again uses lerp to close the gaps but uses an EBF to denoise it. The third method uses the trained EBD network to predict the missing samples and then uses the EBF to denoise it. Figure 9 presents the RMS error averaged over all frames of each motion from the holdouts. The combination of using EBD for gap filling, followed by EBF for denoising, performs the best for nearly all test motions.
Longer gaps, though occurring less frequently, are much harder to fill convincingly than smaller gaps. We specifically tested our gap filling method on data in which we introduced long gaps of up to seconds (i.e. 600 frames). The combination of EBD and EBF was able to successfully reconstruct the missing data in all our tests, whereas both the other methods mostly failed to do so. Examples of these long gap reconstructions can be seen in Figure 10. The remarkably accurate reconstructions can be attributed to learning correlations of the missing joint channel with all other joint channels in the skeleton, as well as temporal coherence.
4.3 Timing Results
We used TensorFlow to implement our networks, and trained on a variety of NVIDIA GPUs. The EBF model takes about hours and the EBD model hours to train for a single holdout on one GPU. Each frame of a motion clip can be denoised in real time (ms per frame), allowing live processing of streamed motion capture data (with a short delay for acquiring the lookahead frames).
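The streaming delay mentioned above comes from waiting for the lookahead frames of each frame before it can be processed. A minimal sketch of such buffering is given below. It is illustrative only: `stream_windows` is a hypothetical helper, the window sizes are arguments rather than the values used in our networks, and the sketch does not flush the final `lookahead` frames at the end of a stream.

```python
from collections import deque


def stream_windows(frames, lookback, lookahead):
    """Yield (center_index, window) as soon as each window is complete.

    A frame can be processed only after its `lookahead` successors have
    arrived, so output lags the input by `lookahead` frames. The window holds
    up to `lookback` earlier frames, the center frame, and `lookahead` later
    frames; early windows are shorter because no earlier frames exist yet.
    """
    buf = deque(maxlen=lookback + 1 + lookahead)
    for i, frame in enumerate(frames):
        buf.append(frame)
        center = i - lookahead
        if center >= 0:
            yield center, list(buf)
```

Each yielded window could then be passed to the denoising network, which would return the cleaned center frame.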
We present a deep recurrent framework for cleaning motion capture data with noisy and missing samples. The approach is not specific to a particular kind of noise or action, and requires no manual tuning. Our Encoder-Bidirectional-Filter (EBF) architecture predicts an adaptive low-pass filter for each joint channel in every frame. The network implicitly builds a model for temporal coherence and correlation between joint channels and can thus predict filters that adapt to account for large amounts of different noise types. To reconstruct missing samples in noisy motions, we use an Encoder-Bidirectional-Decoder (EBD) network. The result is then piped through the EBF network for denoising. We present an extensive set of experiments that measure and validate the performance of our methods. In particular, we illustrate the framework’s ability to work with noise from different distributions, including additive and high amplitude noise, and in the presence of very large gaps.
Since the training is supervised, the network needs to see sufficient data of a new motion type before it can learn to clean it reliably. Although experiments suggest the network can extrapolate to motion types it has never seen (Section 4), the quality of the cleaned data will certainly improve if the network has seen the motion during training. We were able to train a single network on all the different motions together. However, in order to make a generic argument about how much diversity a single network can handle, more testing with a wide variety of motions is needed. Similarly, we cannot make strong claims about the ability of a trained network to generalize to unseen noise distributions. Our method does not explicitly consider the lengths of the bones that connect the joints in the skeleton, so it should be able to clean motion captured from a different performer (than the one used to generate the training data), as long as the topology of the recovered skeleton stays the same. We would also like to try cleaning motion data obtained from a variety of motion capture systems, such as markerless systems based on RGBD cameras or inertial sensors, and to extend the approach to general time series data.
We hope our method can serve as a default push-button solution for cleaning motion data, thereby saving valuable time in the motion capture data processing pipeline.
- [1] I. Akhter, T. Simon, S. Khan, I. Matthews, and Y. Sheikh. Bilinear spatiotemporal basis models. Trans. Graphics, 31(2):17:1–17:12, 2012.
- [2] A. Aristidou and J. Lasenby. Real-time marker prediction and CoR estimation in optical motion capture. The Visual Computer, 29(1):7–26, 2013.
- [3] Autodesk. Maya. http://www.autodesk.com/products/maya/overview, 2017.
- [4] S. Bako, T. Vogels, B. McWilliams, M. Meyer, J. Novák, A. Harvill, P. Sen, T. DeRose, and F. Rousselle. Kernel-predicting convolutional networks for denoising Monte Carlo renderings. Trans. Graphics, 36(4), 2017.
- [5] J. Barker, R. Marxer, E. Vincent, and S. Watanabe. The third CHiME speech separation and recognition challenge: Analysis and outcomes. Computer Speech & Language, 2016.
- [6] BlenderFoundation. Blender. https://www.blender.org/, 2017.
- [7] M. Burke and J. Lasenby. Estimating missing marker positions using low dimensional Kalman smoothing. J. Biomechanics, 49(9):1854–1858, 2016.
- [8] CMU. CMU Graphics Lab Motion Capture Database. http://mocap.cs.cmu.edu/, 2017.
- [9] K. Dorfmüller-Ulhaas. Robust optical user motion tracking using a Kalman filter. In VRST, 2003.
- [10] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, pages 1110–1118, 2015.
- [11] A. El-Desoky Mousa, E. Marchi, and B. Schuller. The ICSTM+ TUM+ UP approach to the 3rd CHiME challenge: Single-channel LSTM speech enhancement with multi-channel correlation shaping dereverberation and LSTM language models. arXiv preprint arXiv:1510.00268, 2015.
- [12] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik. Recurrent network models for human dynamics. In ICCV, pages 4346–4354, 2015.
- [13] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005.
- [14] L. Herda, P. Fua, R. Plankers, R. Boulic, and D. Thalmann. Skeleton-based motion capture for robust reconstruction of human motion. In Computer Animation, pages 77–83, 2000.
- [15] D. Holden, T. Komura, and J. Saito. Phase-functioned neural networks for character control. Trans. Graphics, 36(4), 2017.
- [16] D. Holden, J. Saito, and T. Komura. A deep learning framework for character motion synthesis and editing. Trans. Graphics, 35(4), 2016.
- [17] D. Holden, J. Saito, T. Komura, and T. Joyce. Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia Technical Briefs, page 18, 2015.
- [18] A. Hornung, S. Sar-Dessai, and L. Kobbelt. Self-calibrating optical motion tracking for articulated bodies. In Virtual Reality, pages 75–82, 2005.
- [19] L. Li, J. McCann, N. Pollard, and C. Faloutsos. BoLeRO: a principled technique for including bone length constraints in motion capture occlusion filling. In SCA, pages 179–188, 2010.
- [20] G. Liu and L. McMillan. Estimation of missing markers in human motion capture. The Visual Computer, 22(9):721–728, 2006.
- [21] H. Lou and J. Chai. Example-based human motion denoising. TVCG, 16(5):870–879, 2010.
- [22] OrganicMotion. http://www.organicmotion.com/, 2017.
- [23] S. I. Park and J. K. Hodgins. Capturing and animating skin deformation in human motion. Trans. Graphics, 25(3):881–889, 2006.
- [24] G. W. Taylor, G. E. Hinton, and S. T. Roweis. Modeling human motion using binary latent variables. NIPS, 19:1345, 2007.
- [25] Vicon. https://www.vicon.com/, 2017.
- [26] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In Intl. Conf. Latent Variable Analysis and Signal Separation, pages 91–99, 2015.
- [27] J. Xiao, Y. Feng, M. Ji, X. Yang, J. J. Zhang, and Y. Zhuang. Sparse motion bases selection for human motion denoising. Signal Processing, 110:108–122, 2015.
- [28] V. B. Zordan and N. C. Van Der Horst. Mapping optical motion capture data to skeletal motion using a physical model. In SCA, pages 245–250, 2003.