1 Our Proposed Architecture
Recurrent neural networks (RNNs) are the go-to method for modeling time-dependent sequences. However, their major drawbacks are the exploding and vanishing gradient problem and the difficulty of parallelizing their training. Additionally, [BKK18] have shown that temporal convolutional networks (TCNs) perform as well as or even better than RNNs in sequence modeling tasks. Hence, we introduce a model for semantic segmentation of motion capture data that is inspired by traditional image segmentation approaches [LSD15] and recent advances in sequence modeling [BKK18].
In a preprocessing step, we first transform our motion capture data to an RGB image domain, much in the spirit of [LB17]. Each column of the image represents a frame in the motion sequence. The rows represent the joints, and the RGB values are the scaled XYZ Euclidean coordinates of the corresponding joint. Such a motion image can be seen in Fig. 3 (top). We then pass it to our network (Fig. [). Akin to the five areas in our visual cortex (V1–V5) [Rem12], our model has a total of five convolution layers. The initial layer is a traditional 2D convolutional layer that is applied only along the time dimension. To achieve this, we set the kernel height to the height of the image. Every layer has the same convolution width, with stride 1. The next four layers are 1D temporal acausal convolutions with dilation. The dilation rate increases with each layer. A convolutionized dense layer with a Softmax activation function is added after that. We found that a normalizing ReLU function [L17] before the Softmax layer increases accuracy.
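The motion-image preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `to_motion_image`, the input layout `frames[t][j] = (x, y, z)`, and the per-axis min/max scaling to the range 0 to 255 are all assumptions, since the text only states that the coordinates are scaled.

```python
def to_motion_image(frames):
    """Map frames[t][j] = (x, y, z) to image[j][t] = (r, g, b).

    Rows are joints and columns are frames, as in the text. The per-axis
    min/max scaling to 0..255 is an assumed choice for illustration.
    """
    n_frames, n_joints = len(frames), len(frames[0])
    # Smallest and largest value of each coordinate axis over the sequence.
    lo = [min(frames[t][j][a] for t in range(n_frames) for j in range(n_joints))
          for a in range(3)]
    hi = [max(frames[t][j][a] for t in range(n_frames) for j in range(n_joints))
          for a in range(3)]

    def scale(value, axis):
        extent = hi[axis] - lo[axis]
        # A constant axis carries no information; map it to 0.
        return 0 if extent == 0 else round(255 * (value - lo[axis]) / extent)

    return [
        [tuple(scale(frames[t][j][a], a) for a in range(3)) for t in range(n_frames)]
        for j in range(n_joints)
    ]


# Two frames, two joints (toy values).
frames = [
    [(0.0, 0.0, 0.0), (1.0, 2.0, 4.0)],
    [(0.5, 1.0, 2.0), (1.0, 2.0, 4.0)],
]
image = to_motion_image(frames)
```

Because the image height equals the number of joints, a 2D kernel of the same height can only slide along the columns, which is why the first layer effectively convolves over time alone.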
Fig. 1 shows how dilated convolutions increase the receptive field exponentially without loss of resolution for acausal and causal convolutions. Causal convolutions are used for temporal data, where the output depends on previous samples only. Since our goal is to distinguish motions like left step (step while walking) from begin and end left step (step from/to standing position), we use acausal convolutions, as these motion types rely on past and future information.
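To make the difference between causal and acausal dilated convolutions concrete, the following sketch applies both to a single spike and computes the receptive field of a stack of stride-1 dilated layers. The helper names, the zero padding, and the dilation values used here (1, 2, 4, 8) are illustrative assumptions and not the schedule used in our model.

```python
def receptive_field(kernel_width, dilations):
    """Input frames seen by one output sample after stacking stride-1
    dilated convolutions that all share the same kernel width."""
    return 1 + sum((kernel_width - 1) * d for d in dilations)


def dilated_conv1d(signal, kernel, dilation, causal):
    """Stride-1 dilated 1D convolution with zero padding (same length out)."""
    span = (len(kernel) - 1) * dilation
    # Causal: every tap looks at the current or an earlier frame.
    # Acausal: taps are centred on the current frame (past and future).
    offset = span if causal else span // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for i, weight in enumerate(kernel):
            idx = t + i * dilation - offset
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        out.append(acc)
    return out


impulse = [0.0] * 4 + [1.0] + [0.0] * 4  # single spike at frame 4
kernel = [1.0, 1.0, 1.0]

causal_out = dilated_conv1d(impulse, kernel, dilation=2, causal=True)
acausal_out = dilated_conv1d(impulse, kernel, dilation=2, causal=False)

causal_hits = [t for t, v in enumerate(causal_out) if v != 0]
acausal_hits = [t for t, v in enumerate(acausal_out) if v != 0]
rf = receptive_field(3, [1, 2, 4, 8])
```

In the causal case the spike at frame 4 only influences outputs at frames 4, 6 and 8, whereas in the acausal case it influences frames 2, 4 and 6, so each output also sees future input. With kernel width 3 and the assumed dilations, the stack covers 31 frames.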
2 Experiments and Results
Our motion capture dataset consists of 70 sequences with 10 motion labels: standing, left/right step, begin/end left step, begin/end right step, reach, retrieve and turn. Our sequences reach up to 1500 frames. In all of our experiments, we use non-randomized 7-fold cross-validation. We use the Adam [KB14] optimizer with 100 epochs for training.
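One way to read "non-randomized 7-fold cross-validation" is that the 70 sequences are split, in dataset order and without shuffling, into seven contiguous test folds of ten sequences each. The sketch below illustrates that reading; the function name `contiguous_folds` and the contiguous-block interpretation are assumptions.

```python
def contiguous_folds(n_items, n_folds):
    """Split indices 0..n_items-1 into unshuffled, contiguous test folds.

    Returns a list of (train_indices, test_indices) pairs.
    """
    base, extra = divmod(n_items, n_folds)
    folds, start = [], 0
    for f in range(n_folds):
        # Spread any remainder over the first `extra` folds.
        size = base + (1 if f < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


folds = contiguous_folds(70, 7)
first_train, first_test = folds[0]
```

Each of the seven splits then trains on 60 sequences and tests on the remaining 10, and every sequence appears in exactly one test fold.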
To determine the optimal receptive field size (RFS), we test our model with different convolution kernel widths. Fig. 2 shows that even though the width with an RFS of 3125 frames covers the entire sequence, its accuracy does not differ much from that of the width with an RFS of 342 frames. The latter, however, uses 438K fewer parameters. Since our model has to be robust against human error in the form of wrongly assigned labels, we further train our model on noisy labels and test it on the true labels. Fig. 4 shows that, despite 80% noisy labels in the training data, an accuracy of over 88% is reached on the true test labels.
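The noisy-label setup can be illustrated with the following sketch. How the noise was generated is not specified here, so the scheme below is an assumption: a fixed fraction of training labels is replaced by a different label drawn uniformly from the remaining classes, while test labels stay untouched. The function name `corrupt_labels` and the toy label sequence are hypothetical.

```python
import random

# The ten motion labels of the dataset.
LABELS = [
    "standing", "left step", "right step",
    "begin left step", "end left step",
    "begin right step", "end right step",
    "reach", "retrieve", "turn",
]


def corrupt_labels(labels, noise_fraction, rng):
    """Return a copy of `labels` in which a fraction is replaced by a
    different, uniformly drawn label (assumed noise model)."""
    noisy = list(labels)
    n_corrupt = round(noise_fraction * len(labels))
    for i in rng.sample(range(len(labels)), n_corrupt):
        alternatives = [c for c in LABELS if c != labels[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


rng = random.Random(0)  # fixed seed for reproducibility
clean = ["standing"] * 50 + ["left step"] * 50  # toy training labels
noisy = corrupt_labels(clean, 0.8, rng)
changed = sum(a != b for a, b in zip(clean, noisy))
```

With this noise model, exactly 80 of the 100 toy labels differ from the clean ones; the model would be trained on `noisy` and evaluated against clean test labels.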
We test our model against another state-of-the-art TCN model [L17] for action segmentation and three commonly used neural network models [VDO16, HS97, W90] for sequence modeling and classification, using our dataset without noisy labels, and show that our model outperforms these models (Tab. 1).
With this work, we have shown that our model provides an effective tool for motion capture segmentation. To support various types of motion, we further plan to enlarge our motion image database and to run more experiments on noisy data.
This work is funded by the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No 642841; and by the German Federal Ministry of Education and Research (BMBF) through the project Hybr-iT under the grant 01IS16026A.
- [BKK18] Bai S., Kolter J. Z., Koltun V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018).
- [HS97] Hochreiter S., Schmidhuber J.: Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
- [KB14] Kingma D. P., Ba J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- [L17] Lea C., et al.: Temporal convolutional networks for action segmentation and detection. 2017 IEEE CVPR (2017), 1003–1012.
- [LB17] Laraba S., Brahimi M., et al.: 3d skeleton-based action recognition by representing motion capture sequences as 2d-rgb images. Computer Animation and Virtual Worlds 28, 3-4 (2017).
- [LSD15] Long J., Shelhamer E., Darrell T.: Fully convolutional networks for semantic segmentation. In Proceedings of IEEE CVPR (2015), pp. 3431–3440.
- [Rem12] Remington L. A.: Chapter 13 - visual pathway. In Clinical Anatomy and Physiology of the Visual System, Remington L. A., (Ed.), 3rd ed. Butterworth-Heinemann, 2012, pp. 233 – 252.
- [VDO16] Van Den Oord A., et al.: Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).
- [W90] Waibel A., et al.: Phoneme recognition using time-delay neural networks. In Readings in speech recognition. 1990, pp. 393–404.