1 Introduction and Related Work
A human motion can be described by a sequence of skeleton poses, where each pose keeps the 3D coordinates of important body joints at a specific moment in time. Such spatio-temporal data can be acquired using dedicated hardware technologies or recent pose-estimation methods [18] capable of determining 2D or even 3D joint positions from ordinary videos. The acquired data have great potential in many application fields, e.g., in sports to automatically detect fouls during a football game; in security to detect potential threats like a running group of people; or in computer animation to recognize previously captured animations.
Action recognition is the most popular motion-processing task. It aims at determining the class of pre-segmented actions based on a labelled set of training actions. Solving this task is challenging as the actions of the same class can be performed by different subjects in various styles, speeds, and initial body postures. The variability of actions is often decreased by applying various normalization techniques [20, 14]. The normalized actions are then used to train a deep neural network classifier. To make deep learning more robust, the training data are enriched using augmentation techniques [22, 5]. However, the question is how to select a suitable combination of pre-processing techniques for a given dataset.
Traditional action-recognition methods based on handcrafted features have a limited ability to represent the complexity of spatio-temporal movement patterns and have largely been abandoned due to the progress in deep learning [21, 4]. In deep learning, the input actions are usually transformed into intermediate representations (e.g., graph structures [26, 11], 2D motion images [7, 15], or histograms [2]) that are then used to train a classifier based on convolutional neural networks (CNN) [1, 7], graph convolutional networks (GCN) [10, 11], or Long Short-Term Memory (LSTM) networks [23, 9].
To enhance the recognition accuracy, different classifiers or data modalities can be fused. The fusion approach in [13] proposes to learn features of individual 3D skeletons using a CNN and then train an LSTM network on top of a sequence of such features. In [23], multi-modal features are first extracted from the input actions and then fused by an autoencoder network. In [8], the authors even propose to fuse the RGB and 3D skeleton modalities.
To sum it up, there exist over a hundred papers that propose complex neural-network classifiers and various data normalization or augmentation techniques. This implies it is computationally unrealistic to train a model for every variant of input actions generated by a specific combination of pre-processing techniques (even for a fixed neural-network architecture). Moreover, the success of each combination of pre-processing techniques depends on a given dataset.
Contributions of this Paper
Our objective is to determine the best combination of data pre-processing techniques for a given classifier and dataset. The main idea lies in training only a single independent model for each pre-processing technique and applying the fusion approach to estimate the quality of a selected combination of techniques, instead of training several orders of magnitude more classifiers. Specifically, we introduce these contributions:
We propose two effective augmentation techniques for 3D actions (called BodyModel and KeyPose augmentations);
We design an online fusion of classifiers based on a strict majority vote rule;
We introduce an algorithmic procedure for efficient evaluation of a very large number of combinations of classifiers.
In addition, for the best combination of pre-processing techniques, we can retrospectively apply the normalized and augmented variants of input data to train only a single all-in-one model. We expect that such a model can be confused by the many variants of training data, which can lead to a lower recognition accuracy in comparison with the fusion of independent classifiers, each trained for a specific purpose.
2 Data Pre-Processing Techniques
We represent skeleton data of a single action as a sequence $A = (P_1, \ldots, P_n)$ of consecutive 3D poses, where the $i$-th pose $P_i \in \mathbb{R}^{3j}$ is captured in time moment $i$ ($1 \le i \le n$) and consists of the $xyz$-coordinates of $j$ tracked joints. In this paper, we use three variants of body models that differ in the number of joints $j$. To improve the quality of deep learning, we apply the following normalization and augmentation techniques.
2.1 Action Normalizations
Semantically equivalent actions, i.e., those belonging to the same class, can be performed by subjects (i) at different space locations, (ii) facing various directions, or (iii) having different heights. Since absolute positions, orientations, and subject sizes tend to introduce unwanted bias for recognizing daily/exercising actions [20, 14], we use the following normalization techniques.
P-normalization – To unify actions performed at different space locations, each skeleton pose is shifted into a skeleton-centric coordinate system so that the root joint is aligned to the origin $(0, 0, 0)$.
O-normalization – To unify various subject orientations, each pose is rotated so that the line connecting the left and right hip joints is parallel with a fixed coordinate axis.
S-normalization – Subjects are unified in sizes so that each skeleton is resized to the height of an “average” human.
All the three normalizations help to reduce the spatial variability in joint coordinates over the actions in the same class, and thus facilitate the neural-network training process.
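The three normalizations can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a pose is a list of `(x, y, z)` tuples, that the vertical axis is `y`, that the hip line is aligned with the `x`-axis, and that the subject height is known in advance. The function names, the joint indices, and the default `target_height` of 1.7 are hypothetical.

```python
import math

def p_normalize(pose, root=0):
    """Shift the pose so that the (assumed) root joint lies at the origin."""
    rx, ry, rz = pose[root]
    return [(x - rx, y - ry, z - rz) for x, y, z in pose]

def o_normalize(pose, left_hip, right_hip):
    """Rotate the pose about the vertical y-axis so that the hip line
    becomes parallel with the x-axis (axis choice is an assumption)."""
    dx = pose[right_hip][0] - pose[left_hip][0]
    dz = pose[right_hip][2] - pose[left_hip][2]
    angle = math.atan2(dz, dx)  # heading of the hip line in the x-z plane
    c, s = math.cos(angle), math.sin(angle)
    # Rotation by -angle in the x-z plane; y is left unchanged.
    return [(c * x + s * z, y, c * z - s * x) for x, y, z in pose]

def s_normalize(pose, subject_height, target_height=1.7):
    """Uniformly scale the pose to a hypothetical 'average' height."""
    factor = target_height / subject_height
    return [(x * factor, y * factor, z * factor) for x, y, z in pose]
```

Applying the functions in the order P, O, S keeps the root joint at the origin, because both the rotation and the scaling are centered there.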
2.2 Action Augmentations
Existing deep learning classifiers achieve a high recognition accuracy if a sufficiently large number of training actions is provided. However, action datasets might contain only a limited number of samples in each class, e.g., as in the HDM05 dataset [12] providing fewer than 20 actions per class. To enlarge training data, we utilize the following two augmentation techniques, originally proposed by [17].
Crop($c$)-augmentation – Each action is cropped by trimming away its left and right side. The parameter $c$ (in percent) determines how many poses are cut from the left side, with the same amount cut from the right side, relative to the action length $n$. This technique keeps the most important middle part of actions, while the boundary parts that need not be well segmented are discarded.
Noise($r$)-augmentation – Noise is added to the 3D coordinates of all action joints simply by moving the joint position along each of the $x$/$y$/$z$ axes by a random value. The parameter $r$ (in percent) bounds the maximum size of the random value with respect to the average length of the thighbone.
Both the Crop- and Noise-augmentation techniques are graphically illustrated in Fig. 1.
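A minimal sketch of both techniques follows. It rests on assumptions the text does not settle: the crop percentage is applied to each side and rounded down, the noise is drawn uniformly, and the thighbone length is passed in by the caller. An action is taken to be a list of poses, each pose a list of `(x, y, z)` tuples; all names are hypothetical.

```python
import random

def crop(action, percent):
    """Trim `percent` of the poses from each end (rounded down, assumed)."""
    cut = int(len(action) * percent / 100)
    return action[cut:len(action) - cut]

def add_noise(action, percent, thigh_length, rng=random):
    """Move every coordinate by a uniform random value bounded by
    `percent` of the thighbone length (uniform noise is an assumption)."""
    bound = thigh_length * percent / 100
    return [
        [tuple(v + rng.uniform(-bound, bound) for v in joint) for joint in pose]
        for pose in action
    ]
```

Passing a seeded `random.Random` instance as `rng` makes the generated variants reproducible.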
In addition, we propose the following two new augmentation techniques, illustrated in Fig. 2.
BodyModel($b$)-augmentation – The body model with 31 joints is simplified by selecting only a subset of joints, specified by the parameter $b$ (see Fig. 2a). This should facilitate learning since the spatial complexity of each pose is reduced by ignoring some very close joints that produce unnecessary movement noise.
KeyPose($k$)-augmentation – The original frame-per-second (FPS) rate is non-linearly decreased by considering only specific poses, so-called key poses. The first action pose is always considered a key pose and the other ones are gradually determined as the closest next pose which is sufficiently dissimilar, i.e., the dissimilarity between the candidate pose and the previous key pose is higher than the parameter $k$. The dissimilarity of two poses is quantified as the sum of Euclidean distances between their corresponding pairs of 3D joint coordinates. Compared to traditional downsampling techniques with fixed FPS rates, this technique better respects the changes in movement.
While the Crop- and KeyPose-augmentation techniques deform the temporal dimension of original actions, the Noise- and BodyModel-augmentations change their spatial domain. Except for the Noise-augmentation, the remaining three techniques can significantly reduce the original size of actions. Since each of the four augmentation techniques is parameterizable, we can generate several variants of actions using only a single technique, e.g., by applying Crop(10 %) and Crop(20 %).
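The key-pose selection described above can be sketched as a single greedy pass over the action. This is an illustrative reading of the description, not the authors' code; a pose is assumed to be a list of `(x, y, z)` tuples and the names `pose_distance`, `key_poses`, and `threshold` are hypothetical.

```python
import math

def pose_distance(p, q):
    """Sum of Euclidean distances between corresponding joints."""
    return sum(math.dist(a, b) for a, b in zip(p, q))

def key_poses(action, threshold):
    """Keep the first pose, then every next pose whose distance to the
    most recently kept key pose exceeds `threshold`."""
    if not action:
        return []
    keys = [action[0]]
    for pose in action[1:]:
        if pose_distance(keys[-1], pose) > threshold:
            keys.append(pose)
    return keys
```

A larger threshold yields fewer key poses, which matches the later observation that KeyPose(10.6) produces a lower effective FPS rate than KeyPose(3.7).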
3 Action Recognition using a Single Bi-LSTM Classifier
The normalization and augmentation techniques can be applied to generate a set of training actions for learning a classifier. As a classifier, we adopt a light-weight version of the bidirectional Long Short-Term Memory (Bi-LSTM) network. Formally, we classify an input action $A = (P_1, \ldots, P_n)$ into one of $m$ predefined classes $C_1, \ldots, C_m$, where each class is characterized by a non-empty set of training actions. We first embed each 3D action pose $P_i$ into an $e$-dimensional space by a linear projection (with parameters $W_e$ and $b_e$) followed by a ReLU activation:
$$\hat{P}_i = \mathrm{ReLU}(W_e P_i + b_e) \qquad (1)$$
where $e$ is a user-defined parameter. Learning a projection of the original data in the end-to-end training phase permits us to work with lower-dimensional data and a higher level of abstraction, with both effectiveness and efficiency advantages with respect to the original skeletons.
Each embedded pose $\hat{P}_i$ is then fed to both the past-to-future and future-to-past LSTM cells, which respectively produce the following hidden state vectors:
$$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(\hat{P}_i, \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(\hat{P}_i, \overleftarrow{h}_{i+1}), \qquad \overrightarrow{h}_i, \overleftarrow{h}_i \in \mathbb{R}^{h/2} \qquad (2)$$
where $h$ is a user-defined parameter denoting the total feature size, i.e., the sum of dimensions of both the state vectors. The initial states $\overrightarrow{h}_0$ and $\overleftarrow{h}_{n+1}$ are set to zeros. The state vector $\overrightarrow{h}_i$, together with the consecutive embedded pose $\hat{P}_{i+1}$ (and similarly $\overleftarrow{h}_i$ with $\hat{P}_{i-1}$ for the reverse direction), are given as input to the next step.
For the given action, the prediction for each class $C_k$ is quantified by the probability $p_k$ that is obtained from the concatenation of the hidden states $\overrightarrow{h}_n$ and $\overleftarrow{h}_1$ as follows:
$$p_k = \mathrm{softmax}_k\left(W_c \, [\overrightarrow{h}_n ; \overleftarrow{h}_1] + b_c\right) \qquad (3)$$
where $W_c$ and $b_c$ are the parameters of a linear projection with $m$ outputs and $\mathrm{softmax}_k$ denotes the result of the softmax function applied to the $k$-th component of its argument. The class with the highest softmax value is considered as the classification result, i.e., $\arg\max_{k} p_k$. We optimize the parameters ($W_e$, $b_e$, $W_c$, $b_c$, and the LSTM parameters) by minimizing the cross-entropy between the predictions and the targets. The whole architecture is illustrated in Fig. 3.
4 Online Fusion of Independent Classifiers
When multiple normalization and augmentation techniques are considered, training a single classifier for a given dataset is a hard task because of the extremely large number of ways in which training actions can be pre-processed. Suppose $t$ techniques are available; then there are $2^t$ possible combinations of techniques that generate different sets of training data. For example, if $t = 16$ (e.g., the four presented augmentation techniques, each parameterized in four different variants), then there are $2^{16} = 65{,}536$ different combinations. It is not computationally feasible to train such a number of Bi-LSTM classifiers to choose the best combination of pre-processing techniques for each specific dataset.
Instead of training $2^t$ combinations, our idea is to train only $t$ independent Bi-LSTM classifiers, i.e., a single classifier for each pre-processing technique. Then, we efficiently estimate the quality of a specific combination of classifiers by evaluating the test-data accuracy using an online fusion approach. In particular, each test action is classified using the following three-stage process:
The test action $A$ is normalized/augmented by the given $k$ pre-processing techniques to get the modified action instances $A_1, \ldots, A_k$;
The modified action instances are independently classified by the corresponding Bi-LSTM classifiers to get the partial classification outputs;
The partial outputs are processed based on the majority vote principle to determine the final classification of $A$.
The whole process is schematically illustrated in Fig. 4.
4.1 Strict Majority Vote Principle
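Traditionally, the class which receives the largest number of votes is selected as the consensus (majority) decision [6]. We use a much stricter version in which more than half of the independent classifiers have to agree on the same class. If there is no class with more than $k/2$ votes, the classification result is considered as “unknown”. Formally, we define the strict majority vote principle for the output classes $C_{i_1}, \ldots, C_{i_k}$ of $k$ independent classifiers as:
$$\mathrm{vote}(C_{i_1}, \ldots, C_{i_k}) = \begin{cases} C_{\max} & \text{if } |\{ i \in I : i = \max \}| > k/2, \\ \text{unknown} & \text{otherwise,} \end{cases} \qquad (4)$$
where $I = \{i_1, \ldots, i_k\}$ stands for the multiset of indexes of partial output classes and $\max$ determines the index of the class with the highest number of votes:
$$\max = \arg\max_{j \in I} |\{ i \in I : i = j \}| \qquad (5)$$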
Notably, the “unknown” output is always considered a misclassification when evaluating the test-data accuracy for a specific combination of classifiers. This implies that the strict majority vote principle cannot exceed the accuracy of the traditional vote rule [6]. On the other hand, the result has quite a high confidence and can be evaluated very efficiently, as described in the following.
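The strict rule and the three-stage process of this section can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: class labels are assumed to be hashable values, each pre-processing technique and each classifier is modelled as a plain callable, and the names `strict_majority_vote`, `classify_by_fusion`, and `pipelines` are hypothetical.

```python
from collections import Counter

UNKNOWN = "unknown"

def strict_majority_vote(partial_outputs):
    """Return the class voted for by more than half of the classifiers,
    or UNKNOWN if no class reaches a strict majority."""
    if not partial_outputs:
        return UNKNOWN
    label, votes = Counter(partial_outputs).most_common(1)[0]
    return label if votes > len(partial_outputs) / 2 else UNKNOWN

def classify_by_fusion(action, pipelines):
    """Pre-process the action per technique, classify each variant with
    its own classifier, and fuse the partial outputs by strict vote."""
    partial = [classifier(preprocess(action)) for preprocess, classifier in pipelines]
    return strict_majority_vote(partial)
```

With an even number of classifiers a tie such as two votes against two yields "unknown", which is exactly the case in which the strict rule is more conservative than the traditional plurality vote.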
4.2 Evaluating All Combinations Efficiently
Having $t$ pre-processing techniques, we need to evaluate the accuracy of $2^t$ combinations. Naively, for a specific combination of $k$ techniques, each test action needs to be classified $k$ times by $k$ independent Bi-LSTM classifiers. For a set $T$ of test actions, this results in the following huge number of classifications:
$$|T| \cdot \sum_{k=1}^{t} \binom{t}{k} \cdot k = |T| \cdot t \cdot 2^{t-1} \qquad (6)$$
Assuming $t = 16$ techniques and a batch of $|T| = 1{,}164$ test actions (as later used in the experiments), we would need to perform about 610 million classifications. This would probably require roughly one month of a single graphics-card capacity.
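The figure of about 610 million can be checked with a short calculation. The sketch below assumes 16 techniques and 1,164 test actions per fold, the values used later in the experiments, and counts one classification per technique in every non-empty combination; the function name is hypothetical.

```python
from math import comb

def naive_classifications(num_techniques, num_test_actions):
    """Classifications needed when every combination of size k
    re-classifies each test action k times."""
    per_action = sum(comb(num_techniques, k) * k for k in range(1, num_techniques + 1))
    return num_test_actions * per_action

total = naive_classifications(16, 1164)
print(total)  # 610271232
```

The inner sum equals `num_techniques * 2 ** (num_techniques - 1)`, a standard binomial identity, so the cost roughly doubles with every additional technique.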
However, each action is repeatedly processed by the same classifiers, which inherently leads to the same partial classifications. Therefore, we classify each action only once for each pre-processing technique, i.e., $t$ times in total for $t$ techniques, and store the partial classification results to disk. In this way, we keep the list of partial outputs of all test actions for each of the $t$ pre-processing techniques. We can further filter these lists to store only the true-positive matches, i.e., keeping only the actions with the correct partial class assigned.
In the evaluation phase for a given combination of $k$ techniques, we simply merge the corresponding true-positive lists and count how many times each test action is present. Then, we filter out the actions having a count equal to or less than $k/2$, which is the condition of the strict majority vote rule (see Equation 4). This allows us to very efficiently determine the class of each test action using the fusion approach. The test-data accuracy is finally determined as the ratio between the number of retained actions and the number of all test actions.
It is important to realize that this trick enables evaluating all combinations very efficiently. In particular, by storing the partial outputs for $t$ techniques and $|T|$ test actions, we save exactly $|T| \cdot t \cdot (2^{t-1} - 1)$ classifications in contrast to the naive approach. In our case, the naive number of 610 M classifications is reduced to 19 K ($16 \cdot 1{,}164 = 18{,}624$).
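The evaluation over stored true-positive lists can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: `tp_sets` is a hypothetical mapping from a technique name to the set of test-action identifiers that its classifier labelled correctly. Counting only true positives is sufficient because, under the strict rule, an action is classified correctly exactly when more than half of the involved classifiers output its true class.

```python
from itertools import combinations

def fusion_accuracy(tp_sets, combo, num_test_actions):
    """Accuracy of one combination: the share of test actions present in
    more than half of the combination's true-positive sets."""
    counts = {}
    for technique in combo:
        for action_id in tp_sets[technique]:
            counts[action_id] = counts.get(action_id, 0) + 1
    retained = sum(1 for c in counts.values() if c > len(combo) / 2)
    return retained / num_test_actions

def rank_combinations(tp_sets, num_test_actions, size):
    """Score every combination of the given size, best first."""
    scored = [
        (fusion_accuracy(tp_sets, combo, num_test_actions), combo)
        for combo in combinations(sorted(tp_sets), size)
    ]
    return sorted(scored, reverse=True)
```

No classifier is invoked during ranking, so iterating over all combination sizes only costs set look-ups and counting.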
5 Experimental Evaluation
In the following, we present a test dataset and methodology for training Bi-LSTM classifiers. Then, we determine interesting parameters of the presented normalization/augmentation techniques and their combinations, which results in definition of 16 pre-processing techniques. Next, we train 16 corresponding classifiers, efficiently evaluate all their combinations using the fusion approach, and determine the most useful techniques based on the best-performing combinations. We finally apply the best techniques to train a single all-in-one model as an alternative to independent classifiers.
We use the popular HDM05 dataset [12] captured at a 120 frame-per-second (FPS) rate. The dataset provides the ground truth that divides 2,345 actions into 130 classes. For our internal experiments, we ignore the 8 least populated classes that provide only 1 to 4 action samples each, which results in 2,328 actions in total. The actions correspond to daily/exercising activities and significantly differ in length, ranging from 13 frames (0.1 s) to 900 frames (7.5 s). We consider the HDM05 dataset very challenging since, in comparison with other datasets, it contains the highest number of classes (130) while providing fewer than 20 action samples per class on average.
For a fair evaluation, we apply the standard 2-fold cross validation procedure within all the experiments. We split the ground truth into two folds in a balanced way so that each fold contains 1,164 actions and roughly the same number of actions of the same class. The experiment accuracy is always determined as the average over the best accuracies achieved in both fold runs.
We train Bi-LSTM models using 10-times down-sampled data at a 12 FPS rate (except for the KeyPose augmentation with a variable FPS rate), since this rate is sufficient to retain the main characteristics of actions while substantially reducing the training time. The dimensionalities of the embeddings and of the combined hidden state vector are fixed to values that achieve a reasonable trade-off between accuracy and performance. The training of a single model on one fold of training actions is measured in minutes when performed on an NVIDIA Quadro K-series graphics card.
5.3 Evaluating the Best Combinations within 16 Classifiers
To evaluate the fusion approach, we train 16 independent Bi-LSTM classifiers. While the architecture of the classifiers stays the same, they use differently pre-processed training/test data. In particular, we use 3 variants of normalizations: skeleton-centric (P), full (P+O+S), and without any normalization (–). Since the full normalization is expected to contribute to the best classification accuracy [14], we apply augmentation techniques only to the P+O+S-normalized data. Specifically, we apply the four augmentation techniques introduced in Section 2.2, each of them parameterized in two or three settings (e.g., settings KeyPose(10.6) and KeyPose(3.7) generate actions of approximately 8 and 24 FPS rate, respectively). All these settings correspond to 12 different pre-processing techniques and thus 12 classifiers are trained – see the normalization and augmentation settings in rows 1–12 in Table I. The next two rows (13–14) denote classifiers with non-augmented training data and test data cropped by either 10 % or 20 %. The last two rows (15–16) correspond to the opposite variant, where the cropped actions are used for training and the non-augmented data for testing.
First, we have trained these 16 standalone Bi-LSTM classifiers and evaluated their recognition accuracy – see the fourth column (“Accuracy”) of Table I. We can see that the best accuracy of 92.40 % is achieved when both training and test data are P+O+S-normalized and Noise(2.5 %)-augmented. Then, we have fused these 16 classifiers to efficiently evaluate all their combinations (see Section 4.2 for more details) and illustrated the results of selected combinations in the right side of Table I. In particular, we present the five combinations achieving the best results for selected combination sizes. For each selected combination, the involvement of the specific classifiers is denoted by black points and the final fusion accuracy is reported at the bottom of the table (for better clarity, the fusion accuracies are also graphically plotted in Fig. 5). We can see that all the reported fusion accuracies are above 93.40 %, with the maximum at 94.03 %. As expected, this experiment confirms the superiority of the fusion approach, which clearly outperforms any of the standalone classifiers, reducing the recognition error by 21 % with respect to the best Noise(2.5 %)-classifier.
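The 21 % error reduction follows directly from the two accuracies quoted above. As a worked check using only the reported numbers, the error rates are $100 - 92.40 = 7.60$ % for the best standalone classifier and $100 - 94.03 = 5.97$ % for the best fusion, so the relative reduction is:

$$\frac{7.60 - 5.97}{7.60} = \frac{1.63}{7.60} \approx 0.214$$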
Interestingly, by focusing on black points in Table I, we can observe that the non-normalized data and P+O+S-normalized BodyModel(14) and KeyPose(3.7) augmentations are included in most of the best-performing combinations, even if their corresponding standalone classifiers do not achieve convincing results. This demonstrates a big strength of the proposed approach that can automatically select the pre-processing techniques suitable for a given dataset.
5.4 Retrospective Learning of All-in-One Model
As soon as we efficiently identify the best combination of pre-processing techniques, a new question arises: Is it better to use the fusion of the independent classifiers, or to rather train a new all-in-one classifier by all the normalized/augmented variants of actions obtained by the best identified techniques?
It is important to realize that the best combination can involve augmentation techniques that change the data format of original poses. In our case, the majority of best-performing combinations involves the BodyModel(14) technique that modifies the pose data format by decreasing the number of skeleton joints (from original 31 to 14). And it is hardly possible to mix different pose formats when training a single model. Consequently, we select the third 5/16 combination that reaches the 93.77 % fusion accuracy and involves the 5 techniques generating the same format of training/test actions: non-, P-, and P+O+S-normalized data along with P+O+S-normalized Noise(5 %) and KeyPose(3.7) augmentations. We pre-process the actions by the given 5 techniques and use all such 5 variants of training data to train a single all-in-one model. Then, we generate the same 5 variants of test data and evaluate the recognition power of the all-in-one model in two different ways: (i) each out of the 5 test-data variants is independently evaluated, which yields 5 recognition accuracies, or (ii) the proposed fusion approach is applied with the difference that individual variants of each test action are classified by the same all-in-one model.
The results of the first way of evaluation are reported in Table II, where the best KeyPose(3.7) variant of test data reaches the accuracy of 90.64 %. This is significantly worse than the accuracy of 93.77 % achieved by the original fusion of independent classifiers. Regarding the second way of evaluation, each variant of a given test action is classified by the same all-in-one model to get the 5 partial classification outputs. Such outputs are then fused by the strict majority vote rule to get the test-action classification. In this case, we achieve the accuracy of 91.24 %, which is higher than the first way of evaluation (90.64 %), but still lower than the fusion approach with independent classifiers (93.77 %). This result allows us to answer the introductory question: It is better to fuse the results of independent models, rather than of a single all-in-one model, with respect to the same pre-processed variants of training/test data. The reason is that the useful pre-processing techniques provide orthogonal views on the input data, which one neural-network model can hardly learn.
5.5 State-of-the-Art Comparison
We have achieved the best result of 94.03 % in the fusion approach on the ground truth with 2,328 actions. Since the state-of-the-art results are reported on the ground truth with 2,345 actions, we simply consider the remaining 17 actions as misclassified, resulting in an accuracy of 93.35 %. This accuracy is currently the clearly best action recognition result reported on the HDM05 dataset – see Table III.
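The adjusted accuracy can be reproduced from the reported numbers. As a rough check that treats the fold-averaged 94.03 % as a single proportion of the 2,328 evaluated actions (an approximation made here for illustration), about 2,189 actions are classified correctly, and counting the 17 excluded actions as errors gives:

$$\frac{0.9403 \cdot 2{,}328}{2{,}345} \approx \frac{2{,}189}{2{,}345} \approx 0.9335$$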
For the action recognition task, we have proposed the general approach for fusing independent classifiers and evaluating all their combinations efficiently. This independent-fusion approach has several advantages: (i) the majority vote rule enables selecting appropriate classifiers automatically for a given action, (ii) the data format of training actions can be different for individual classifiers in contrast to a single classifier, and (iii) the used Bi-LSTM architecture is completely independent, so it can be replaced by any kind of a neural network. We believe that it is always better to train independent models for individual pre-processing techniques instead of a single all-in-one model, which can hardly learn orthogonal views on the input data.
By considering 16 variants of normalized/augmented input data, we have revealed that the combination of 7 Bi-LSTM classifiers clearly outperforms the state-of-the-art result on the challenging HDM05 dataset distinguishing the highest number of 130 classes, in comparison with other datasets. Finally, we demonstrate the suitability of the proposed BodyModel and KeyPose augmentation techniques that are involved in the majority of the best-performing combinations of independent classifiers. This indicates that new augmentation techniques in combination with the fusion approach could increase the recognition accuracy of future classifiers.
This research has been supported by the GACR project No. GA19-02033S.
[1] (2018) Towards improved human action recognition using convolutional neural networks and multimodal fusion of depth and inertial sensor data. In 20th International Symposium on Multimedia (ISM), pp. 223–230.
[2] (2018) Deep motifs and motion signatures. ACM Transactions on Graphics 37 (6), pp. 187:1–187:13.
[3] (2017) Deep learning on Lie groups for skeleton-based action recognition. pp. 1243–1252.
[4] (2020) Learning latent global network for skeleton-based action prediction. IEEE Transactions on Image Processing 29, pp. 959–970.
[5] (2017) A new representation of skeleton sequences for 3D action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3288–3297.
[6] (1998) On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (3), pp. 226–239.
[7] (2017) 3D skeleton-based action recognition by representing motion capture sequences as 2D-RGB images. Computer Animation and Virtual Worlds 28 (3-4), pp. e1782.
[8] (2020) SGM-Net: skeleton-guided multimodal network for action recognition. Pattern Recognition, pp. 1–38.
[9] (2018) Skeleton based human action recognition with global context-aware attention LSTM networks. IEEE Transactions on Image Processing 27 (4), pp. 1586–1599.
[10] (2019) Graph convolutional networks-hidden conditional random field model for skeleton-based action recognition. In 21st International Symposium on Multimedia (ISM), pp. 25–31.
[11] (2019) Si-GCN: structure-induced graph convolution network for skeleton-based action recognition. In International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
[12] (2007) Documentation Mocap Database HDM05. Technical Report CG-2007-2, Universität Bonn.
[13] (2018) Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition. Pattern Recognition 76, pp. 80–94.
[14] (2014) AMAB: automated measurement and analysis of body motion. Behavior Research Methods (BRM) 46 (3), pp. 625–633.
[15] (2018) Effective and efficient similarity searching in motion capture data. Multimedia Tools and Applications 77 (10), pp. 12073–12094.
[16] (2018) Probabilistic classification of skeleton sequences. In 29th International Conference on Database and Expert Systems Applications (DEXA), Cham, pp. 50–65.
[17] (2019) Augmenting spatio-temporal human motion data for effective 3D action recognition. In 21st IEEE International Symposium on Multimedia (ISM), pp. 204–207.
[18] Deep high-resolution representation learning for human pose estimation. In International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5693–5703.
[19] (2018) Part-based graph convolutional network for action recognition. In British Machine Vision Conference (BMVC), pp. 1–13.
[20] Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3633–3642.
[21] (2019) Deep learning for sensor-based activity recognition: a survey. Pattern Recognition Letters 119, pp. 3–11.
[22] (2019) Adversarial action data augmentation for similar gesture action recognition. In International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
[23] (2019) Deep spatiotemporal LSTM network with temporal pattern feature for 3D human action recognition. Computational Intelligence 35 (3), pp. 535–554.
[24] (2020) PGCN-TCA: pseudo graph convolutional network with temporal and channel-wise attention for skeleton-based action recognition. IEEE Access 8, pp. 10040–10047.
[25] Deep manifold-to-manifold transforming network for skeleton-based action recognition. IEEE Transactions on Multimedia, pp. 1–12.
[26] (2019) Bayesian graph convolution LSTM for skeleton based action recognition. In IEEE International Conference on Computer Vision (ICCV), pp. 6882–6892.