I Introduction
Skeleton-based action recognition has received great attention in recent years, as the dynamics of human skeletons carries significant information for action recognition. Compared to other modalities of human action, such as video, depth images and optical flow, human skeletons have the advantage of a small amount of data and high information density. The dynamics of human skeletons can be seen as a time series of human poses, or as a combination of human joint trajectories. Among all human joints, the trajectories of the important joints indicating the action class convey the most significant information. It is also worth noting that, when the same action is performed in different attempts, the trajectories of these joints are subject to distortions. In this work, we propose a novel feature that is invariant under these distortions and utilize it to facilitate skeleton-based action recognition.
When the same action is performed, two similar trajectories of corresponding joints should share a basic shape. However, due to individual factors, these two trajectories always appear with diverse kinds of distortions. These distortions are caused by spatial and temporal factors. Spatial factors include changes of viewpoint, different skeleton sizes and action amplitude ([29, 27]), while temporal factors indicate time scaling along the time series ([5, 6]). All the spatial factors can be modeled by an affine transformation in 3D space, whereas uniform time scaling, the most commonly discussed temporal case, can be seen as an affine transformation in 1D space. We combine these two kinds of distortions as the spatio-temporal dual affine transformation.
In this paper, we propose a general method for constructing the Spatio-Temporal Dual Affine Differential Invariant (STDADI). Specifically, we utilize rational polynomials of the derivatives of joint trajectories to obtain the invariants. By bounding the degree of the polynomials and the order of the derivatives, we generate 8 independent STDADIs and combine them into an invariant vector at each moment for each human joint.
Recently, researchers have tended to explore the potential of data-driven methods for skeleton-based action recognition. To improve the generalization ability of neural networks under different transformations, a common practice is data augmentation. However, this additional data preprocessing generates more samples and lengthens the training phase. In this paper, we propose an intuitive yet effective method, extending the input data with STDADI along the channel dimension for training and evaluation, and call this practice channel augmentation. Experiments show that channel augmentation based on STDADI not only achieves stronger performance and generalization, but also provides more insights for skeleton-based action recognition.
The main contributions of this work are the following:

We propose a novel feature called the spatio-temporal dual affine differential invariant (STDADI).

To improve the generalization ability of neural networks, a channel augmentation method is proposed.
II Related Work
Skeleton-based action recognition
Before the rise of deep learning, handcrafted-feature-based methods were proposed for skeleton-based action recognition. Wang et al. [28] proposed using the relative locations of joints as motion features. Hussein et al. [7] exploited the covariance matrices of joint trajectories. Vemulapalli et al. [26] utilized rotations and translations between joint locations to capture the dynamics of human skeletons. However, the performance of these methods is limited, as the designed features do not cover all the factors affecting recognition. Thanks to the success of deep learning, data-driven methods achieve better performance. These methods can be further divided into RNN-based, CNN-based and GNN-based approaches. RNN-based approaches take the sequence of human joint coordinate vectors as input and predict the action label in a recursive manner ([27, 23, 25, 4, 16, 31, 18, 17]). CNN-based approaches express the skeleton data as a pseudo-image for conventional CNNs ([19, 14, 13, 12, 20]) or as a sequence of coordinate vectors for temporal CNNs ([9, 11, 10]). Compared to these two kinds of methods, GNN-based approaches are modeled on the natural connections between human joints and are thus better able to characterize the dynamics of human skeletons. Recently, Yan et al. [30] proposed the spatio-temporal graph convolutional network (ST-GCN) and achieved evident improvements over previous methods.
Transformations and invariant features for skeleton-based action recognition
For skeleton-based action recognition, the commonly discussed transformations are geometric, usually caused by changes of viewpoint and the magnitude of motion. Müller et al. [21] took a set of boolean features associated with four joints to describe their relative positions, which is invariant with respect to the skeleton's position, orientation and size. Vemulapalli et al. [26] utilized rigid transformations between human joints to describe the skeleton, which are geometrically invariant. Shao et al. [24] used integral invariants as a local description of joint trajectories. Boulahia et al. [2] integrated a set of features inspired by handwriting recognition, in which moment invariants [22] with respect to similarity transformations were utilized.
Time scaling, as a transformation along the time dimension, is hardly discussed by previous works on skeleton-based action recognition. Most of these works ([29, 24, 1, 8]) use dynamic time warping for time alignment and trajectory matching. Esling and Agon [5] explicitly defined time scaling and classified it as uniform or dynamic. We adopt the definition of uniform time scaling and model it as a temporal affine transformation.
In the deep learning domain, data augmentation is a universal approach to improving generalization under various transformations. However, this method is time-consuming during training, and the source of its improvement is hard to explain. Wang et al. [27] proposed a data augmentation method for 3D coordinates of human skeletons, including rotation, scaling and shear transformations, which is beneficial to the training of their proposed two-stream RNN.
III Approach
III-A Spatio-Temporal Dual Affine Transformation
Formally, we express the dynamics of a human joint in the form of a parameterized curve taking time as the parameter:
(1)  \mathbf{p}(t) = \big(x(t),\, y(t),\, z(t)\big)^{\mathsf{T}}
(2)  \tilde{\mathbf{p}}(\tilde{t}) = \big(\tilde{x}(\tilde{t}),\, \tilde{y}(\tilde{t}),\, \tilde{z}(\tilde{t})\big)^{\mathsf{T}}
where \mathbf{p}(t) and \tilde{\mathbf{p}}(\tilde{t}) represent joint trajectories before and after the transformation, respectively.
The dual affine transformation can be defined as
(3)  \tilde{\mathbf{p}}(\tilde{t}) = A\,\mathbf{p}(t) + \mathbf{b}, \qquad \tilde{t} = a t + c
where the matrix A and the vector \mathbf{b} express the spatial affine transformation, and the scalars a and c denote the temporal affine transformation. This can be detailed as follows:
Spatial affine transformation The matrix A controls rotation and scaling, while the vector \mathbf{b} represents translation. Spatial affine transformations are caused by multiple factors, including coordinate system conversion, pose orientation, different skeleton sizes and action amplitude.
Temporal affine transformation
The linear transformation of the time domain can be considered as a 1D affine transformation. The parameter a represents time scaling, indicating different speeds, and c represents phase shift, indicating different beginning times. We discuss uniform time scaling here, which assumes a uniform change of the time scale by the same proportion [6]. We follow this assumption and express it as the temporal affine transformation.
III-B Spatio-Temporal Dual Affine Differential Invariant
We utilize rational polynomials of the derivatives of joint trajectories to construct STDADI. Specifically, based on Equation (3), we can derive the relationship between the 1st derivatives of joint trajectories before and after the transformation:
(4)  \tilde{\mathbf{p}}'(\tilde{t}) = \frac{1}{a}\,A\,\mathbf{p}'(t)
Similarly, we can obtain the relationship between their derivatives of any order by the chain rule:
(5)  \tilde{\mathbf{p}}^{(k)}(\tilde{t}) = \frac{1}{a^{k}}\,A\,\mathbf{p}^{(k)}(t)
where the superscript (k) denotes the order of derivation.
It is worth noting that when k is equal to 0, Formula (5) is equivalent to Formula (3) without the translation vector \mathbf{b}. We can eliminate the effect of the translation vector by subtracting the mean value. That is,
(6)  \tilde{\mathbf{p}}(\tilde{t}) - \bar{\tilde{\mathbf{p}}} = A\,\big(\mathbf{p}(t) - \bar{\mathbf{p}}\big)
Thus, in Equation (5), k can be taken to be any nonnegative integer.
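As a quick numerical check of Equations (3)-(5), the sketch below builds a toy polynomial joint trajectory, applies an assumed dual affine transformation (the matrix A, translation b, and temporal parameters a and c are made-up values), and verifies the derivative relationship by exact polynomial differentiation. This is an illustrative check, not part of the recognition pipeline:

```python
import numpy as np
from numpy.polynomial import Polynomial as P

# Toy joint trajectory p(t): each coordinate is a polynomial in t.
p = [P([0.0, 1.0, 0.5]),        # x(t) = t + 0.5 t^2
     P([1.0, -0.3, 0.0, 2.0]),  # y(t) = 1 - 0.3 t + 2 t^3
     P([0.2, 0.7, -1.0])]       # z(t) = 0.2 + 0.7 t - t^2

# Made-up transformation parameters (Eq. 3).
A = np.array([[2.0, 0.1, 0.0],
              [0.3, 1.5, -0.2],
              [0.0, 0.4, 1.1]])   # spatial affine matrix
b = np.array([0.5, -1.0, 2.0])    # spatial translation
a, c = 1.7, 0.4                   # temporal affine: t~ = a t + c

# Transformed trajectory q(s) = A p((s - c)/a) + b, built by polynomial composition.
t_of_s = P([-c / a, 1.0 / a])
q = [sum(A[i, j] * p[j](t_of_s) for j in range(3)) + b[i] for i in range(3)]

def kth_deriv(traj, k, x):
    """k-th derivative of a polynomial trajectory, evaluated at x."""
    return np.array([comp.deriv(k)(x) for comp in traj])

t0 = 0.8
s0 = a * t0 + c                              # corresponding transformed time
for k in range(1, 4):
    lhs = kth_deriv(q, k, s0)                # derivatives after the transformation
    rhs = A @ kth_deriv(p, k, t0) / a**k     # Eq. (5): A p^{(k)}(t) / a^k
    assert np.allclose(lhs, rhs)
```

The polynomial composition makes the check exact up to floating-point roundoff, avoiding the noise of finite-difference derivatives.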
Based on the relationship in Equation (5), we construct a matrix using 3 derivatives of different orders as columns and derive their relationship:
(7)  \tilde{M}(i,j,k) = \big[\tilde{\mathbf{p}}^{(i)},\ \tilde{\mathbf{p}}^{(j)},\ \tilde{\mathbf{p}}^{(k)}\big] = A\,M(i,j,k)\,\operatorname{diag}\big(a^{-i}, a^{-j}, a^{-k}\big), \qquad M(i,j,k) = \big[\mathbf{p}^{(i)},\ \mathbf{p}^{(j)},\ \mathbf{p}^{(k)}\big]
where i, j and k are all nonnegative integers. To ensure that the determinant of M(i,j,k) is not equal to 0, i, j and k must differ from each other. We find that the determinant of M(i,j,k) is a relative invariant, related to the transformation parameters A and a:
(8)  \det\tilde{M}(i,j,k) = \frac{\det A}{a^{\,i+j+k}}\,\det M(i,j,k)
We eliminate the parameters A and a by constructing a rational formula:
(9)  \frac{\prod_{m=1}^{N}\det\tilde{M}(i_m,j_m,k_m)}{\prod_{n=1}^{N}\det\tilde{M}(i'_n,j'_n,k'_n)} = \frac{\prod_{m=1}^{N}\det M(i_m,j_m,k_m)}{\prod_{n=1}^{N}\det M(i'_n,j'_n,k'_n)}
This means that
(10)  I = \frac{\prod_{m=1}^{N}\det M(i_m,j_m,k_m)}{\prod_{n=1}^{N}\det M(i'_n,j'_n,k'_n) + \epsilon}
is an invariant feature with respect to the spatio-temporal dual affine transformation, namely STDADI. In this expression, N is a positive integer called the degree of the polynomials, and the degrees of the numerator and the denominator must be equal to guarantee the elimination of the matrix A. The maximum order of the derivatives is called the order of the STDADI. To ensure the elimination of the parameter a, the following condition needs to be met:
(11)  \sum_{m=1}^{N}\big(i_m + j_m + k_m\big) = \sum_{n=1}^{N}\big(i'_n + j'_n + k'_n\big)
To ensure that every determinant is not equal to 0, the orders within each triple (i, j, k) must also differ from each other. The parameter \epsilon is a small value added for computational stability.
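To make the construction concrete, the sketch below evaluates one hypothetical degree-2, order-4 STDADI on a toy polynomial trajectory and on its dual-affine-transformed copy, and checks that the two values agree. The determinant triples (2,3,4)(1,2,3) over (1,3,4)(1,2,4) are our own illustrative choice satisfying the conditions above (both order sums equal 15), not necessarily among the paper's 8 selected invariants:

```python
import numpy as np
from numpy.polynomial import Polynomial as P

# Made-up dual affine transformation parameters.
A = np.array([[2.0, 0.1, 0.0],
              [0.3, 1.5, -0.2],
              [0.0, 0.4, 1.1]])
b = np.array([0.5, -1.0, 2.0])
a, c = 1.7, 0.4

# Toy degree-4 polynomial trajectory p(t), so 1st..4th derivatives are nonzero.
p = [P([0.0, 0.3, 0.5, -0.1, 0.2]),
     P([1.0, -0.4, 0.2, 0.3, -0.1]),
     P([0.2, 0.6, -0.3, 0.1, 0.4])]

# Transformed trajectory q(s) = A p((s - c)/a) + b, via polynomial composition.
t_of_s = P([-c / a, 1.0 / a])
q = [sum(A[i, j] * p[j](t_of_s) for j in range(3)) + b[i] for i in range(3)]

def stdadi(traj, t, eps=1e-12):
    # Columns of M(i, j, k) are the i-th, j-th, k-th derivatives at time t.
    d = {k: np.array([comp.deriv(k)(t) for comp in traj]) for k in range(1, 5)}
    detM = lambda i, j, k: np.linalg.det(np.column_stack([d[i], d[j], d[k]]))
    # Degrees match (N = 2) and order sums match (2+3+4 + 1+2+3 = 1+3+4 + 1+2+4 = 15),
    # so det(A) and the time-scaling factors cancel, per Eqs. (9)-(11).
    return detM(2, 3, 4) * detM(1, 2, 3) / (detM(1, 3, 4) * detM(1, 2, 4) + eps)

t0 = 0.9
assert np.isclose(stdadi(p, t0), stdadi(q, a * t0 + c), rtol=1e-6)
```

Any other triples satisfying the distinctness and order-sum conditions would work the same way; bounding the degree and order simply limits the enumeration to a finite set of candidates.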
For computational simplicity, we set the upper limits of the degree and order to 2 and 4, respectively, and obtain 55 invariants in total. We select 8 of them that are function-independent [3] of each other, which means weaker correlation and better descriptive ability. The 8 invariants are listed as follows (arguments are omitted here for compact expression):
(12)  
III-C Channel Augmentation
Compared to other handcrafted features, our STDADI focuses on describing joint trajectories under the spatio-temporal dual affine transformation. As not all factors are covered, STDADI alone is not sufficient for the recognition task. However, as the feature is beneficial for recognizing actions under different transformations, it can help improve the generalization of data-driven methods. For this purpose, we propose an intuitive yet effective method named channel augmentation.
Specifically, we extend the input data with STDADI along the channel dimension, as shown in Fig. 1. Conventional inputs are the 3D coordinates of human joints, and we concatenate the coordinate vector and the STDADI vector at each joint for each frame. Before the concatenation, we apply a hyperbolic tangent function to the STDADI vector to make sure that it matches the magnitude of the coordinates. Channel augmentation introduces invariant information into the input data without changing the inner structure of the neural network.
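A minimal sketch of this concatenation, assuming the common (channels, frames, joints) input layout of ST-GCN and random placeholder values in place of real coordinates and STDADI features:

```python
import numpy as np

# Assumed tensor layout: (channels, frames, joints), as is common for ST-GCN inputs.
C, T, V = 3, 300, 25                   # xyz channels, frames, NTU's 25 body joints
coords = np.random.randn(C, T, V)      # conventional 3D-coordinate input (placeholder)
stdadi = np.random.randn(8, T, V)      # 8 STDADI values per joint per frame (placeholder)

# Squash STDADI into the coordinate magnitude range, then stack along the channel axis.
augmented = np.concatenate([coords, np.tanh(stdadi)], axis=0)
assert augmented.shape == (3 + 8, T, V)   # 11 input channels after augmentation
```

Only the first layer of the network needs to accept 11 input channels instead of 3; the rest of the architecture is untouched.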
In our experiments we choose the spatio-temporal graph convolutional network (ST-GCN) [30]. This method models skeleton data as a graph, considering spatial and temporal connections between human joints simultaneously. In particular, it can exploit local patterns and correlations from human skeletons, in other words, the importance of joints along the action sequence, expressed as joint weights in the spatio-temporal graph. This is in line with our STDADI: both focus on describing joint dynamics, and our features further provide an invariant expression that is not affected by the distortions.
TABLE I: Comparison with the state of the art (top-1 accuracy).

Method                               | NTU RGB+D                  | NTU RGB+D 120
                                     | Cross-subject | Cross-view | Cross-subject | Cross-setup
Part-Aware LSTM [23]                 | 62.9%         | 70.3%      | 25.5%         | 26.3%
Spatio-Temporal LSTM [16]            | 69.2%         | 77.7%      | 55.7%         | 57.9%
GCA-LSTM [18]                        | 74.4%         | 82.8%      | 58.3%         | 59.2%
Two-Stream Attention LSTM [17]       | 76.1%         | 84.0%      | 61.2%         | 63.3%
Skeleton Visualization [19]          | 80.0%         | 87.2%      | 60.3%         | 63.2%
Body Pose Evolution Map (*) [20]     | 82.4%         | 86.7%      | 64.6%         | 66.9%
Multi-Task Learning Network [9]      | 79.6%         | 84.8%      | 58.4%         | 57.9%
Multi-Task CNN with RotClips [10]    | 81.1%         | 87.4%      | 62.2%         | 61.8%
ST-GCN [30]                          | 81.5%         | 88.3%      | 71.7%         | 74.3%
ST-GCN + data augmentation           | 80.6%         | 90.5%      | 72.2%         | 79.0%
ST-GCN + channel augmentation        | 83.4%         | 91.3%      | 77.3%         | 78.8%
IV Results
In this section we validate the effectiveness of the proposed feature and method on the large-scale action recognition dataset NTU RGB+D [23] and its extended version, NTU RGB+D 120 [15]. In addition to the original ST-GCN, we adopt a data augmentation technique as a baseline method. As illustrated in [27], the data augmentation technique involves rotation, scaling and shear transformations of 3D skeletons during training. For all the experimental methods, we used the same training strategy and hyperparameters as suggested in [30].
IV-A Datasets & Evaluation Metrics
NTU RGB+D and its extended version, NTU RGB+D 120, are currently the largest action recognition datasets with 3D joint annotations, captured in a constrained indoor environment using Microsoft Kinect V2 cameras. Both provide 3D skeleton data containing the 3D locations of 25 major body joints in the camera coordinate system. NTU RGB+D contains 56880 samples in 60 action classes performed by 40 subjects, and NTU RGB+D 120 extends the original by adding 57600 more samples, expanding the number of action classes and subjects to 120 and 106, respectively. Both datasets have a cross-subject evaluation criterion, while NTU RGB+D 120 improves on the cross-view benchmark by introducing more factors that affect the angle of view, including the height and distance of the cameras to the subjects, and renames this benchmark "cross-setup". We report top-1 recognition accuracy on both datasets with the corresponding evaluation metrics.
IV-B Comparison with the State of the Art
As shown in Table I, our method, ST-GCN + channel augmentation, outperforms most of the previous state-of-the-art methods. Compared to the two baseline approaches, ST-GCN and ST-GCN + data augmentation, our method achieves clear improvements on both benchmarks. As data augmentation mainly consists of 3D geometric transformations, it substantially improves accuracy in cross-view recognition but contributes little to the cross-subject setting. This also verifies that our spatio-temporal dual affine transformation assumption is valid under both evaluation criteria.
TABLE II: Channel augmentation with different extended vectors on NTU RGB+D (top-1 accuracy).

Method          | Cross-subject | Cross-view
ST-GCN          | 81.5%         | 88.3%
 + derivatives  | 80.4%         | 87.6%
 + STDADI       | 83.4%         | 91.3%
IV-C Detailed Analysis
To validate the effectiveness of STDADI, we tried a different input setting using trajectory derivatives as the extended vector for channel augmentation. This vector contains the 1st, 2nd and 3rd derivatives of the joint trajectory and is thus 9-dimensional. As seen from Table II, while ST-GCN + STDADI improves accuracy, ST-GCN + derivatives decreases accuracy on both benchmarks. This shows that the improvement in accuracy comes from the invariance expressed by STDADI.
We also compare the improvements of ST-GCN + channel augmentation over ST-GCN across action classes. As shown in Fig. 2, actions such as "pointing to something" and "salute" achieve the greatest performance gain, while actions like "brushing hair" suffer a performance loss. We find that the action classes with improved accuracy have specific joint trajectory motion patterns. When performing actions like "pointing to something" and "salute", the trajectories of the performers' wrist joints are geometrically similar. This indicates that the geometric similarity of important joint trajectories helps to recognize the action class, and our STDADI provides an invariant representation of this similarity under various distortions.
V Conclusion
In this paper, we propose a general method for constructing the spatio-temporal dual affine differential invariant (STDADI). We prove the effectiveness of this invariant feature using a channel augmentation technique on the large-scale action recognition datasets NTU RGB+D and NTU RGB+D 120. The combination of handcrafted features and data-driven methods not only improves accuracy but also provides more insights. In the future, as the temporal affine transformation may not be sufficient to model complex transformations along the time dimension, we plan to explore invariance under nonlinear (dynamic) time scaling.
References
[1] (2016) Elastic functional coding of Riemannian trajectories. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (5), pp. 922–936.
[2] (2016) HIF3D: handwriting-inspired features for 3D skeleton-based action recognition. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 985–990.
[3] (1935) Functional dependence. Transactions of the American Mathematical Society 38 (2), pp. 379–394.
[4] (2015) Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1110–1118.
[5] (2012) Time-series data mining. ACM Computing Surveys (CSUR) 45 (1), pp. 12.
[6] (2014) A new similarity measure based on shape information for invariant with multiple distortions. Neurocomputing 129, pp. 556–569.
[7] (2013) Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations. In Twenty-Third International Joint Conference on Artificial Intelligence.
[8] (2018) A novel geometric framework on Gram matrix trajectories for human behavior understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[9] (2017) A new representation of skeleton sequences for 3D action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3288–3297.
[10] (2018) Learning clip representations for skeleton-based 3D action recognition. IEEE Transactions on Image Processing 27 (6), pp. 2842–2855.
[11] (2017) Interpretable 3D human action analysis with temporal convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1623–1631.
[12] (2017) Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN. In 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 601–604.
[13] (2017) Skeleton-based action recognition with convolutional neural networks. In 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 597–600.
[14] (2017) Two-stream 3D convolutional neural network for skeleton-based action recognition. arXiv preprint arXiv:1705.08106.
[15] (2019) NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[16] (2016) Spatio-temporal LSTM with trust gates for 3D human action recognition. In European Conference on Computer Vision, pp. 816–833.
[17] (2017) Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Transactions on Image Processing 27 (4), pp. 1586–1599.
[18] (2017) Global context-aware attention LSTM networks for 3D action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1647–1656.
[19] (2017) Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition 68, pp. 346–362.
[20] (2018) Recognizing human actions as the evolution of pose estimation maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1159–1168.
[21] (2005) Efficient content-based retrieval of motion capture data. In ACM Transactions on Graphics (ToG), Vol. 24, pp. 677–685.
[22] (1980) Three-dimensional moment invariants. IEEE Transactions on Pattern Analysis and Machine Intelligence (2), pp. 127–136.
[23] (2016) NTU RGB+D: a large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019.
[24] (2015) Integral invariants for space motion trajectory matching and recognition. Pattern Recognition 48 (8), pp. 2418–2432.
[25] (2015) Differential recurrent neural networks for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4041–4049.
[26] (2014) Human action recognition by representing 3D skeletons as points in a Lie group. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 588–595.
[27] (2017) Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 499–508.
[28] (2012) Mining actionlet ensemble for action recognition with depth cameras. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1290–1297.
[29] (2009) Flexible signature descriptions for adaptive motion trajectory representation, perception and recognition. Pattern Recognition 42 (1), pp. 194–214.
[30] (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence.
[31] (2017) View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2117–2126.