Dilated Temporal Fully-Convolutional Network for Semantic Segmentation of Motion Capture Data

by Noshaba Cheema, et al.
Max Planck Society

Semantic segmentation of motion capture sequences plays a key part in many data-driven motion synthesis frameworks. It is a preprocessing step in which long motion capture recordings are partitioned into smaller segments. Afterwards, additional methods such as statistical modeling can be applied to each group of structurally similar segments to learn an abstract motion manifold. However, segmentation often remains a manual task, which increases the effort and cost of generating large-scale motion databases. We therefore propose an automatic framework for semantic segmentation of motion capture data using a dilated temporal fully-convolutional network. Our model outperforms a state-of-the-art model in action segmentation, as well as three networks for sequence modeling. We further show that our model is robust against highly noisy training labels.



1 Our Proposed Architecture

Recurrent neural networks (RNNs) are the go-to method for modeling time-dependent sequences. However, their major drawbacks are the exploding and vanishing gradient problem and the difficulty of parallelizing their training. Additionally, [BKK18] have shown that temporal convolutional networks (TCNs) perform just as well as or even better than RNNs in sequence modeling tasks. Hence, we introduce a model for semantic segmentation of motion capture data that is inspired by traditional image segmentation approaches [LSD15] and recent advances in sequence modeling [BKK18].

Figure 1: Dilated convolution. Left: acausal dilation. Right: causal dilation. Systematic dilation increases the receptive field size exponentially.

In a preprocessing step, we first transform our motion capture data to an RGB image domain, much in the spirit of [LB17]. Each column of the image represents a frame in the motion sequence. The rows represent the joints, and the RGB values are the scaled XYZ Euclidean coordinates of the corresponding joint. Such a motion image can be seen in Fig. 3 (top). We then pass it to our network (Fig. [). Akin to the five areas in our visual cortex (V1 - V5) [Rem12], our model has a total of five convolution layers. The initial layer is a traditional 2D convolutional layer that is applied only in the time dimension; to achieve this, we set the kernel height to the height of the image. Every layer has the same convolution width and stride 1. The next four layers are 1D temporal acausal convolutions with dilation. The dilation rate increases with each layer, so that the receptive field grows exponentially with depth (Fig. 1). A convolutionized dense layer with a Softmax activation function is added after that. We found that a normalizing ReLU function [L17] before the Softmax layer increases accuracy.
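The motion-image preprocessing described above can be sketched in a few lines. This is only an illustration: the text says the coordinates are "scaled" without giving the rule, so the per-axis min-max scaling to 0..255 used here is an assumption, and the function name `motion_to_image` and the sample data are hypothetical.

```python
def motion_to_image(frames):
    """Turn frames[t][j] = (x, y, z) into image[j][t] = (r, g, b).

    Rows are joints, columns are frames. Scaling is per-axis min-max
    to 0..255, which is an assumed choice (the text only says "scaled").
    """
    n_frames = len(frames)
    n_joints = len(frames[0])
    # Per-axis extrema over the whole sequence.
    lo = [min(frames[t][j][a] for t in range(n_frames) for j in range(n_joints))
          for a in range(3)]
    hi = [max(frames[t][j][a] for t in range(n_frames) for j in range(n_joints))
          for a in range(3)]

    def scale(value, axis):
        span = hi[axis] - lo[axis]
        # A constant axis carries no information; map it to 0.
        return round(255 * (value - lo[axis]) / span) if span > 0 else 0

    return [
        [tuple(scale(frames[t][j][a], a) for a in range(3)) for t in range(n_frames)]
        for j in range(n_joints)
    ]


# Hypothetical toy sequence: 3 frames, 2 joints.
sample = [
    [(0.0, 0.0, 0.0), (1.0, 2.0, 4.0)],
    [(1.0, 2.0, 4.0), (0.0, 0.0, 0.0)],
    [(0.0, 2.0, 0.0), (1.0, 0.0, 4.0)],
]
image = motion_to_image(sample)
print(len(image), len(image[0]))  # rows (joints), columns (frames)
```

Setting the first layer's kernel height to the number of rows of such an image means every kernel sees all joints at once and slides only along the frame axis.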

Fig. 1 shows how dilated convolutions increase the receptive field exponentially without loss of resolution, for both acausal and causal convolutions. Causal convolutions are used for temporal data in which the output depends on previous samples only. Since our goal is to distinguish motions such as left step (a step while walking) from begin and end left step (a step from or to a standing position), we use acausal convolutions, as these motion types rely on both past and future information.
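To make the difference between causal and acausal dilation concrete, the sketch below lists which input frames a single dilated kernel reads and computes the receptive field of a stack of stride-1 layers. The dilation rule used in the example (dilation equal to the kernel width raised to the layer index) is an assumption that the text does not confirm, and the function names are hypothetical.

```python
def tap_offsets(width, dilation, causal=False):
    """Frame offsets, relative to the output frame, read by one kernel."""
    if causal:
        # Only the current frame and earlier frames are read.
        return [-k * dilation for k in range(width - 1, -1, -1)]
    # Acausal: taps are centred on the output frame (odd width assumed).
    half = width // 2
    return [k * dilation for k in range(-half, half + 1)]


def receptive_field(width, dilations):
    """Receptive field in frames of stacked stride-1 dilated convolutions."""
    # Each layer widens the field by (width - 1) * dilation frames.
    return 1 + sum((width - 1) * d for d in dilations)


width = 3
# Assumed rule: dilation = width ** layer for layers 0..4.
dilations = [width ** layer for layer in range(5)]

print(tap_offsets(width, 3))               # [-3, 0, 3]
print(tap_offsets(width, 3, causal=True))  # [-6, -3, 0]
print(receptive_field(width, dilations))   # 243
```

Under the same assumed rule, a width of 5 gives five layers a receptive field of 3125 frames, which agrees with one of the values quoted in the experiments.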

2 Experiments and Results

Figure 2: Number of parameters (line) vs. train (dark blue) and test (red) accuracy for different convolution widths. The receptive field size with dilations after the first five layers is written in parentheses.

        ED-TCN   WaveNet  TDNN     LSTM     Ours
Train   90.05%   90.64%   87.22%   86.32%   96.11%
Test    88.69%   88.47%   85.54%   81.95%   91.64%
Table 1: Comparison against other models. ED-TCN: [L17], WaveNet: [VDO16], TDNN: [W90], LSTM: [HS97]

Our motion capture dataset consists of 70 sequences with 10 motion labels: standing, left/right step, begin/end left step, begin/end right step, reach, retrieve and turn. Our sequences reach up to 1500 frames. In all of our experiments, we use non-randomized 7-fold cross-validation. We use the Adam [KB14] optimizer with 100 epochs for training.
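One way to read "non-randomized 7-fold cross-validation" is that the 70 sequences are split into seven contiguous blocks in their stored order, with no shuffling. The sketch below follows that reading, which is an assumption; the function name `contiguous_folds` is hypothetical.

```python
def contiguous_folds(n_items, n_folds):
    """Yield (train_indices, test_indices) for unshuffled k-fold CV."""
    base, extra = divmod(n_items, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional item each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_items) if i < start or i >= stop]
        yield train, test
        start = stop


# 70 sequences, 7 folds: each fold tests on 10 and trains on 60.
folds = list(contiguous_folds(70, 7))
for train, test in folds:
    print(len(train), len(test), test[0], test[-1])
```

Because the split is deterministic, every model in a comparison sees exactly the same training and test sequences in each fold.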

To determine the optimal receptive field size (RFS), we test our model with different convolution kernel widths. Fig. 2 shows that even though the width with an RFS of 3125 frames covers the entire sequence, its accuracy does not differ much from that of the width with an RFS of 342 frames. The latter uses 438K fewer parameters, however. Since our model has to be robust against human error in the form of wrongly classified labels, we further train our model on noisy labels and test it on the true labels. Fig. 4 shows that despite 80% noisy labels in the training data, an accuracy of over 88% is still reached on the true test labels.
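The noisy-label experiment can be mimicked with a small helper that corrupts a chosen fraction of the training labels while leaving the test labels untouched. The noise model is not described in the text, so replacing each selected frame label with a different class drawn uniformly at random is an assumption, and `corrupt_labels` is a hypothetical name.

```python
import random

# The ten motion labels listed in the dataset description.
LABELS = [
    "standing", "left step", "right step",
    "begin left step", "end left step",
    "begin right step", "end right step",
    "reach", "retrieve", "turn",
]


def corrupt_labels(labels, noise_fraction, classes, seed=0):
    """Return a copy of `labels` with a fraction replaced by wrong classes."""
    rng = random.Random(seed)  # fixed seed for reproducible noise
    noisy = list(labels)
    n_corrupt = round(noise_fraction * len(labels))
    # Pick distinct frame positions, then assign a different class to each.
    for i in rng.sample(range(len(labels)), n_corrupt):
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy


# Hypothetical per-frame labels for one short training sequence.
clean = ["standing"] * 50 + ["begin left step"] * 50
noisy = corrupt_labels(clean, 0.8, LABELS)
changed = sum(a != b for a, b in zip(clean, noisy))
print(changed, "of", len(clean), "frame labels corrupted")
```

Training would then use `noisy`, while accuracy is measured against the unmodified labels of the held-out sequences.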

We test our model against a state-of-the-art TCN model [L17] for action segmentation and three commonly used neural network models [VDO16, HS97, W90] for sequence modeling and classification, using our dataset without noisy labels. Our model outperforms all of them (Tab. 1).

With this work, we have shown that our model is an effective tool for motion capture segmentation. To support various types of motion, we plan to enlarge our motion image database and to run further experiments on noisy data.

Figure 3: Top: Motion capture sequence in RGB image domain. Middle: True labels. Bottom: Our predictions.
Figure 4: Test accuracies for different receptive field sizes depending on noise level.


This work is funded by the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No 642841; and by the German Federal Ministry of Education and Research (BMBF) through the project Hybr-iT under the grant 01IS16026A.