Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition

Human action recognition is a heavily investigated area in which most prominent networks are evaluated on large-scale coarse-grained datasets of daily human actions to demonstrate their superiority. We aim to recognize the actions in our small-scale fine-grained Tai Chi dataset using neural networks and propose a transfer-learning method that pre-trains our network on the NTU RGB+D dataset. More specifically, the proposed method first uses the large-scale NTU RGB+D dataset to pre-train a Transformer-based action recognition network so that it extracts features common to human motion. Then we freeze the network weights except for the fully connected (FC) layer, re-initialize the FC weights, and train only this layer on our Tai Chi actions. Experimental results show that our general model pipeline reaches high accuracy on small-scale fine-grained Tai Chi action recognition even with few training samples, and that it achieves state-of-the-art performance compared with previous Tai Chi action recognition methods.


I Introduction

Human action recognition attracts considerable interest among researchers and has been widely applied in scenarios including human-machine interaction and remote monitoring. Skeleton-based human action datasets are increasingly used as network inputs because they are convenient to collect and require little storage. With the development of portable devices and more efficient data transmission, many motion-capture systems, such as Microsoft Kinect [1] and Perception Neuron [2], can easily obtain 3D skeleton sequences. Microsoft Kinect V2 contains a depth sensor, a color camera, and a four-microphone array that provide full-body 3D motion capture, facial recognition, and voice recognition capabilities. It can easily extract the motion of multiple people with a pose estimation algorithm, but its accuracy is affected by occlusion. Perception Neuron is an inertial sensor-based motion-capture system in which each sensor measures its orientation and acceleration using a gyroscope, a magnetometer, and an accelerometer. The motion of an actor wearing the whole system can be calculated accurately, but the system captures only one actor's motion at a time.

With the advent of these portable devices, human action recognition using 3D skeleton sequences has attracted more attention in recent years, and many notable approaches have been proposed. Early studies in skeleton-based action recognition mostly created hand-crafted features to represent the relative 3D rotations and translations among joints [3], but these methods were not general and could not fully capture all features of human motion. With the development of deep neural networks, deep learning methods have been applied to action recognition. Some works treated 3D skeleton sequences as 2D pseudo-images, in which rows and columns represent the skeleton joints and the temporal dynamics respectively, and used Convolutional Neural Network-based (CNN-based) methods to extract features from the input pseudo-images [4][5]. A skeleton-based human action can also be regarded as a sequence of skeletons, with Long Short-Term Memory (LSTM) [6] used to model temporal dependencies among skeleton frames [7][8]. These methods ignore the dependencies among joints and break their physical connections, so they are not always optimal for all human actions.

Fig. 1: Illustration of proposed transfer learning network. (a) Spatial Transformer Block, (b) Overall framework of the model.

3D skeleton sequences are essentially graph-structured data. Inspired by CNNs, [9] designed an efficient variant that operates directly on graphs, named the Graph Convolutional Network (GCN). [10] first leveraged GCN to design a spatial-temporal GCN (ST-GCN), which can automatically capture the patterns embedded in the spatial configuration of the joints as well as their temporal dynamics. [11] proposed a multi-stream attention-enhanced adaptive graph convolutional neural network that includes a fixed structural graph matrix, a learnable structural graph matrix, and an actional graph matrix that varies with the input human motion. Bone information and joint information were used simultaneously to form a two-stream framework that improves recognition accuracy. These methods share the trend that the weight matrix should be formed according to the input data. The Transformer architecture [12], with the self-attention module [13] at its core, has become the de facto standard for natural language processing tasks, and its structure has been widely applied in other research fields, including computer vision [14] and long-sequence time-series forecasting [15]. Recently, Transformer algorithms have also been applied to action recognition. [16] proposed a spatial-temporal network (ST-TR) that models dependencies between joints using the Transformer self-attention operator. [17] added a multi-task self-supervised learning method on top of the spatial-temporal network to improve the robustness of the model. Both models achieve remarkable performance on the NTU RGB+D datasets [18][19] compared to the above methods.

Researchers focus on deep and complex neural networks to achieve better results on large-scale coarse-grained action datasets, whose training set sizes range from that of UCF-101 to that of Kinetics-600. When these methods are applied directly to small-scale datasets, the networks usually over-fit severely. We also notice that prevailing action recognition datasets contain daily human activities. These coarse-grained actions have large inter-class distances, which reduces the difficulty of recognition. Recently, some studies have focused on fine-grained action recognition [20] and have established corresponding datasets to tackle application problems in specific scenarios. [21] created the MoCA dataset, consisting of 20 fine-grained cooking actions captured from 3 camera views each. [22] published a fine-grained basketball action dataset consisting of annotated basketball game videos. These samples would all belong to the class ‘play basketball’ in coarse-grained datasets but are finely annotated in the basketball dataset. As can be expected, recognizing fine-grained actions is more challenging and has more application scenarios.

In this paper, we establish a small-scale fine-grained Tai Chi action dataset and use a deep learning method to recognize Tai Chi actions. Research on Tai Chi action recognition is valuable for action understanding and action evaluation. Tai Chi can be treated as a specific action class in which each action consists of multiple meta actions and is completed by the coordination of the whole body. Compared with daily human actions, attention needs to be paid to the movement characteristics of different body parts at different stages, and the similarity between Tai Chi action classes makes recognition a challenging task. As Tai Chi has become an Asian Games event, a good recognition network can help beginners evaluate their movements. Some studies have addressed Tai Chi action recognition: [23] created a fine-grained Tai Chi dataset consisting of video clips with dynamic backgrounds collected from websites, and [24][25] applied deep learning methods to recognize their own Tai Chi datasets. These datasets are not available online for comparison, and the methods are not general for fine-grained action recognition.

Our small-scale open-source Tai Chi dataset would lead to over-fitting if used directly as network input. Because Tai Chi actions and daily actions are all manifestations of human movement, they share similar physical features. Given the remarkable performance of the Transformer and the limited size of the Tai Chi dataset, we propose a transfer-learning Transformer network for Tai Chi action recognition, illustrated in Fig. 1. We first select 49 single-actor action classes from the NTU RGB+D dataset and pre-train a Spatial Transformer model on them to capture common human motion features. Then we freeze the weights of the network, except for the last FC layer, whose weights are re-initialized. Finally, we take Tai Chi actions as input to train the last FC layer for Tai Chi action recognition. Fig. 1(b) shows the overall framework of the model, while Fig. 1(a) gives a detailed illustration of a Spatial Transformer block. Our contributions are three-fold: (1) We analyze the application scenarios of Tai Chi action recognition and combine multiple methods, including CNN, GCN, and Transformer, to extract different features for Tai Chi action recognition. (2) Because of the limited size of our Tai Chi dataset, we use a similar human action dataset to pre-train the model for human action feature extraction. This reduces over-fitting and improves recognition accuracy. (3) Compared to the previous recognition method [26], our model is more concise and reaches higher accuracy even with few training samples.

II Tai Chi Dataset

In this section, we first introduce the collection of the Tai Chi dataset. Then we explain how we pre-process the Tai Chi sequences to form a suitable input for the model.

II-A Sequence Formation

For Tai Chi actions with a single actor, we use the wearable device Perception Neuron to collect action data. It has up to 32 neurons connected by wires and fixed in specific positions, as shown in Fig. 2. Each neuron measures its orientation and acceleration, and the measured data is sent to the hub, which gathers the data from every connected neuron. The hub then sends the data over a USB or WiFi connection to a PC for storage or real-time processing. In this experiment, we store the data as a Biovision Hierarchy (BVH) file that includes the relative rotations of 72 joints. The coordinates of the joints in the local coordinate system can be calculated from this file.

Fig. 2: Neuron position and Sensor wearing combination of Perception Neuron

A BVH file consists of two parts. The first part defines each joint’s name, its channels, and the relative position between each child joint and its parent joint. This gives the skeleton layout of the input sequence. The second part gives the channel data of each joint at every frame, including the rotation and translation information. The coordinates of a skeleton joint in the local coordinate system can therefore be calculated by chain multiplication with combined rotation matrices, moving from parent to parent until the root joint is reached. The combined rotation matrix is calculated by multiplying the rotation matrices of the three axes. Given the rotation angles about the three axes and the three translation components of a child joint relative to its parent joint, the rotation matrix of each axis is as shown in (1).

To calculate the coordinates of a joint, take the left foot joint as an example: given its position relative to its parent joint, its coordinates in the local coordinate system can be calculated as in (2), using the transformation matrix shown in (3). A sequence of the Tai Chi action “Preparation” obtained from this calculation is shown in the lower-left corner of Fig. 1.
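As an illustration of the chain multiplication described above, the per-axis rotations and the homogeneous transform can be written in their standard form. The notation is chosen here and is not taken from the original equations: $\alpha$, $\beta$, $\gamma$ are the rotation angles about the x, y, and z axes, $\mathbf{t}$ is the offset of a joint from its parent, and $T_0, \dots, T_n$ are the transforms of the joints on the chain from the root (index 0) to the joint of interest (index $n$). The multiplication order $R_y R_x R_z$ is only an example; in practice it follows the channel order declared in the BVH file.

```latex
R_x(\alpha) =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix},
\quad
R_y(\beta) =
\begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix},
\quad
R_z(\gamma) =
\begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix}

% Combined rotation (example order) and homogeneous transform of one joint
R = R_y(\beta)\, R_x(\alpha)\, R_z(\gamma),
\qquad
T = \begin{bmatrix} R & \mathbf{t} \\ \mathbf{0}^{\top} & 1 \end{bmatrix}

% Position of joint n, accumulated along the chain from the root
\begin{bmatrix} \mathbf{p}_n \\ 1 \end{bmatrix}
= T_0\, T_1 \cdots T_n \begin{bmatrix} \mathbf{0} \\ 1 \end{bmatrix}
```

Because $T_n$ is applied to the origin, the last joint's own rotation does not affect its position; only its offset and the transforms of its ancestors do.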


II-B Sequence Pre-processing

Our Tai Chi action dataset consists of 10 Tai Chi actions, each with 20 samples. Because the dataset is small, we use the NTU RGB+D dataset to pre-train the model. The NTU RGB+D dataset provides 25 joints [18], whereas our Tai Chi dataset has 72 joints. As illustrated in Fig. 3, the Tai Chi dataset contains abundant hand information, so converting its joints to the NTU RGB+D layout loses some action information about the hands. We then augment the training portion of the Tai Chi dataset to avoid over-fitting. We also normalize each sequence to a fixed view angle to reduce the impact of varying viewpoints. Finally, we resample all sequences to 64 frames by bilinear interpolation.
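The final resampling step can be sketched as follows. This is a minimal illustration rather than the authors' routine: it assumes a sequence stored as a `(frames, joints, 3)` array and uses plain linear interpolation along the time axis as a stand-in for the interpolation mentioned above. The function name `resample_sequence` and the random test clip are hypothetical.

```python
import numpy as np

def resample_sequence(seq, target_len=64):
    """Resample a (frames, joints, 3) skeleton sequence to target_len frames.

    Assumption: each coordinate is interpolated linearly over time,
    independently of the others.
    """
    seq = np.asarray(seq, dtype=float)
    n_frames = seq.shape[0]
    flat = seq.reshape(n_frames, -1)            # (frames, joints * 3)
    src_t = np.linspace(0.0, 1.0, n_frames)     # original time stamps
    dst_t = np.linspace(0.0, 1.0, target_len)   # resampled time stamps
    columns = [np.interp(dst_t, src_t, flat[:, c]) for c in range(flat.shape[1])]
    out = np.stack(columns, axis=1)             # (target_len, joints * 3)
    return out.reshape((target_len,) + seq.shape[1:])

# Hypothetical clip: 150 frames, 25 joints, 3D coordinates.
rng = np.random.default_rng(0)
clip = rng.normal(size=(150, 25, 3))
fixed = resample_sequence(clip)
```

The first and last frames are preserved exactly, and longer or shorter clips are mapped to the same fixed length, which is what a batched network input requires.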

Fig. 3: Joints of NTU RGB+D and Tai Chi dataset

III Transfer Learning Network

III-A Graph Convolutional Network

An action sample can be represented by the 2D or 3D coordinates of each human joint in each frame. Our input can thus be abstracted as a set of joints representing each skeleton and a set of frames composing the sequence. As in ST-GCN [10], we introduce the graph adjacency matrix of the skeleton, shown in Fig. 1(a)(1), to capture the structural links between joints, and normalize it to obtain the spatial adjacency matrix. In our model, we make this matrix trainable so that it learns the dynamic structural links between joints, which makes the model adaptive.
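A normalized skeleton adjacency matrix can be sketched as below. The exact normalization used in the model is not recoverable from the text, so this example assumes the symmetric normalization with self-loops from [9]; the function name and the 5-joint toy chain are hypothetical.

```python
import numpy as np

def normalized_adjacency(edges, num_joints):
    """Build D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph.

    Assumption: symmetric normalization with self-loops, as in GCN [9].
    """
    adj = np.zeros((num_joints, num_joints))
    for i, j in edges:
        adj[i, j] = 1.0
        adj[j, i] = 1.0
    adj_hat = adj + np.eye(num_joints)      # add self-loops
    degree = adj_hat.sum(axis=1)            # degree of every joint (>= 1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    return d_inv_sqrt[:, None] * adj_hat * d_inv_sqrt[None, :]

# Toy skeleton: a chain of 5 joints (not the NTU RGB+D layout).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
A_norm = normalized_adjacency(edges, num_joints=5)
```

In a trainable setting, a matrix like `A_norm` would serve as the initial value of a learnable parameter, so that training can strengthen or weaken links between joints.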

III-B Self-attention Mechanism

The self-attention mechanism can automatically learn the relationships among the input data with a variable weight matrix. Self-attention can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed from the query and the corresponding key. Similar to [16], we introduce spatial self-attention into our model to extract features that embed the relations between joints. Given the input sequence, we take the feature vector of one joint at one time step as an example. We first compute a query vector, a key vector, and a value vector by multiplying the input vector by trainable parameter matrices, which are shared across all nodes. For any pair of joints, we then obtain a score from the query of one joint and the key of the other. For a given joint, collecting these scores over all joints gives a score vector. The attention score vector of the joint, obtained by applying the softmax function, is shown in (4).

Concatenating the attention score vectors of all joints, we form the Spatial Attention Matrix in Fig. 1(a)(2). In our model, multi-head attention is applied by splitting the input channels into several folds; the self-attention outputs of the folds are concatenated and passed through an FC layer to form the final output features.
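The spatial self-attention described above can be sketched for a single frame as follows. This is an illustrative NumPy version, not the authors' implementation: the scaling by the square root of the head dimension follows [13] and is an assumption here, the FC layer is reduced to a bias-free matrix `w_o`, and all names, shapes, and random weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head self-attention over the joints of one frame.

    x: (joints, channels). Returns (output, attention), where attention
    has shape (heads, joints, joints) and each row sums to 1.
    """
    n_joints = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # projections shared by all joints
    d_head = q.shape[1] // num_heads

    def split(m):
        # (joints, heads * d_head) -> (heads, joints, d_head)
        return m.reshape(n_joints, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # query-key scores
    attn = softmax(scores, axis=-1)                       # per-joint attention
    heads = attn @ v                                      # weighted sum of values
    merged = heads.transpose(1, 0, 2).reshape(n_joints, num_heads * d_head)
    return merged @ w_o, attn                             # final linear (FC) map

rng = np.random.default_rng(0)
num_joints, channels, num_heads = 25, 64, 8
x = rng.normal(size=(num_joints, channels))
w_q, w_k, w_v, w_o = (rng.normal(scale=0.1, size=(channels, channels)) for _ in range(4))
out, attn = spatial_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

Each slice `attn[h]` plays the role of one head's spatial attention matrix: row `i` holds the weights that joint `i` assigns to every joint in the skeleton.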

III-C Spatial Transformer Block

The structure of our Spatial Transformer block is shown in Fig. 1(a). Besides the graph adjacency matrix and the spatial attention matrix, a relative positional encoding matrix, shown in Fig. 1(a)(3), is added to encode the identity of each joint, since hand and foot joints have different meanings even if they are at the same location. In the Spatial Transformer block, the input data are multiplied by the weight matrix to extract the spatial features of the skeleton. Residual connections are added to stabilize the model. Batch normalization and a ReLU [27] layer are applied to the extracted features to form the output of the Spatial Transformer block.

III-D Framework of Model Pipeline

As illustrated in Fig. 1(b), for a batch of skeleton sequences, we first use two-dimensional batch normalization to reduce internal covariate shift. Then a hierarchy of attention blocks is applied to extract spatial and temporal features from the normalized sequences. Each block contains a Spatial Transformer and a Temporal Convolution Network (TCN). The Spatial Transformer is described above, and we use a 2D convolutional layer as the TCN. We reshape the input so that the Spatial Transformer extracts the features of each frame. In the TCN, a 2D kernel is applied to the input for temporal modeling. The extracted features finally pass through the FC layer and the softmax function to predict the probability of each class.

IV Experimental Details

In this section, we first briefly introduce the two datasets. Then we describe the experimental settings and procedure. Finally, we report the recognition results and compare them with a conventional method.

IV-A Experiment Datasets

The NTU RGB+D dataset [18], used to pre-train the network, is a large-scale dataset of 3D human actions collected with Microsoft Kinect V2. It consists of 60 action classes, including single-person actions and multi-person interactions. In this work, we select the first 49 classes, which are all single-person actions, and use the Cross-View (X-View) evaluation protocol to pre-train the model. The denoised dataset has 30349 training and 15115 testing samples in X-View, split according to camera view.

Our small-scale fine-grained Tai Chi dataset has 10 Tai Chi actions with 20 samples each, and the dataset is publicly available. An iconic frame of each action is illustrated in Fig. 4. As shown in Fig. 4, fine-grained Tai Chi actions are composed of a series of meta actions and are hard to recognize from a single frame. For example, ‘Brush Knee and Twist Step’, ‘Hold the Lute’, ‘Pulling, Blocking and Pounding’, and ‘Apparent Close Up’ share common body poses with only slightly different hand gestures. The small size of the dataset and the small inter-class distances increase the difficulty of recognition.

Fig. 4: Illustration of 10 Tai Chi actions

IV-B Experiment Settings

We use 10%, 30%, 50%, and 70% of the Tai Chi action samples as training samples, respectively, for action recognition. We also augment the training set by a factor of up to 8 to enrich the samples and reduce over-fitting.

Our model is composed of 9 attention blocks, whose channel dimensions are 64, 64, 64, 128, 128, 128, 256, 256, and 256. In the Spatial Transformer block, the number of attention heads is 8, and the query, key, and value vectors all have the same dimension. Using the PyTorch [28] framework, we first pre-train our model on the NTU RGB+D dataset for a total of 100 epochs with a batch size of 32 and SGD as the optimizer. The initial learning rate is 0.1 and is reduced by a factor of 10 at epochs 50, 70, and 90. We also use the distributed data-parallel (DDP) framework to accelerate training, with 8 GPUs used simultaneously.
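The step schedule above can be written out explicitly. The sketch below is framework-free and assumes 0-indexed epochs with the drop taking effect at the milestone epoch itself; the function name is hypothetical. In PyTorch, `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[50, 70, 90]` and `gamma=0.1` expresses the same kind of schedule.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(50, 70, 90), gamma=0.1):
    """Learning rate at a given epoch under a multi-step decay schedule.

    Assumption: epochs are 0-indexed and the rate is multiplied by gamma
    once for every milestone that has been reached.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# Learning rate for each of the 100 pre-training epochs.
schedule = [learning_rate(epoch) for epoch in range(100)]
```

Under these assumptions the rate stays at 0.1 for the first 50 epochs and is divided by 10 at each of the three milestones.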

After pre-training on the NTU RGB+D dataset, we freeze the weights of the network except for the FC layer and re-initialize the FC weights, as the number of action classes drops from 49 to 10. All configuration parameters are the same as in NTU RGB+D training. 12 different training set configurations are used to demonstrate the effectiveness of the pre-trained model.
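The freeze-and-retrain step can be sketched in PyTorch as follows. `TinyBackbone` is a hypothetical stand-in for the pre-trained network, not the model described in this paper; only the pattern is the point: disable gradients everywhere, replace the classifier with a freshly initialized 10-class layer, and hand only the trainable parameters to the optimizer.

```python
import torch
from torch import nn

class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a network pre-trained on 49 classes."""

    def __init__(self, in_features=75, hidden=256, num_classes=49):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU())
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):
        return self.fc(self.features(x))

def prepare_for_finetuning(model, num_classes=10):
    """Freeze all weights, then attach a new trainable FC layer."""
    for param in model.parameters():
        param.requires_grad = False
    # A new nn.Linear has requires_grad=True and freshly initialized weights.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

model = prepare_for_finetuning(TinyBackbone())
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.1
)
```

After this step `trainable` contains only the parameters of the new FC layer, so back-propagation updates the classifier while the pre-trained feature extractor stays fixed.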

IV-C Experiment Results and Analysis

The recognition accuracy under these settings is shown in Table I. For the entry marked with a star in Table I, a sample of the action ‘Pulling, Blocking and Pounding’ is misclassified as ‘Apparent Close Up’. As shown in Fig. 4, both actions share a similar body trend and the same motion direction. Because of our pre-training process, we have to discard some hand information in the Tai Chi actions to pair the skeleton joints, which may lead to such misclassification.

Ratio    Acc (Aug 2)    Acc (Aug 4)    Acc (Aug 8)
10%      87.22%         87.22%         85.56%
30%      89.29%         90.71%         90.71%
50%      92.00%         93.00%         96.00%
70%      95.00%         98.33%*        96.67%
TABLE I: Tai Chi Action Recognition Accuracy

We first analyze the effect of sample augmentation. Recognition accuracy generally improves when the augmentation factor increases from 2 to 4, but it may drop slightly when the factor is raised further to 8, especially with a small number of training samples. This happens because too many similar samples limit the generality of the model and cause over-fitting.

From the perspective of training ratio, we find that even with 10% of the samples used for training, that is, only 2 samples per action, the model recognizes the actions well. As the training ratio grows, the accuracy increases steadily, which demonstrates the effectiveness of our transfer learning model.
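The sample counts behind these ratios can be worked out with a few lines of arithmetic. The sketch assumes that the split is made per action class and that all remaining samples are used for testing; neither is stated explicitly, but the accuracies in Table I are consistent with this reading (for example, 87.22% matches 157 correct out of 180 test samples, and 98.33% matches 59 out of 60).

```python
NUM_CLASSES = 10
SAMPLES_PER_CLASS = 20

def split_sizes(train_ratio):
    """Return (train, test) sample counts, assuming a per-class split."""
    train_per_class = round(SAMPLES_PER_CLASS * train_ratio)
    test_per_class = SAMPLES_PER_CLASS - train_per_class
    return NUM_CLASSES * train_per_class, NUM_CLASSES * test_per_class

def accuracy_percent(correct, total):
    """Accuracy in percent, rounded to two decimals as in Table I."""
    return round(100.0 * correct / total, 2)

for ratio in (0.1, 0.3, 0.5, 0.7):
    train, test = split_sizes(ratio)
    print(f"ratio {ratio}: {train} training samples, {test} test samples")
```

With a 70% ratio only 60 test samples remain, so a single misclassified sample changes the accuracy by about 1.67 percentage points.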

The conventional method of [26], whose recognition accuracy is shown in Table II, applies an SVM to segmented trajectories rather than a deep learning method, and it does not perform well under small-scale settings. Our model outperforms this method in all configurations. Our transfer learning model does not need handcrafted features or trajectory segmentation; it extracts features directly from each sample for action recognition and reaches higher recognition accuracy.

Training setting    Accuracy
25%                 80.05%
50%                 90.79%
75%                 93.86%
Temples             90.45%
TABLE II: Tai Chi Action Recognition Accuracy Using Conventional Method

V Conclusion and future work

In this paper, we build a fine-grained Tai Chi dataset with professional Tai Chi performers using a wearable motion-capture system. Each Tai Chi action is composed of multiple meta actions and requires the coordination of the whole body. The differences between action classes are small, which makes the dataset suitable for the fine-grained action recognition task. In contrast to creating hand-crafted features for Tai Chi actions or using other fine-grained action recognition methods, we propose a general model pipeline that can extract discriminative motion features even from small-scale fine-grained datasets; this method has not yet been applied in other fine-grained action recognition research. Experiments demonstrate that our model pipeline achieves steady and accurate performance with a small training set. Our future work will focus on improving recognition accuracy at small training set ratios by combining Spatial and Temporal Transformers.


  • [1] Z. Zhang, “Microsoft kinect sensor and its effect,” IEEE multimedia, vol. 19, no. 2, pp. 4–10, 2012.
  • [2] T. Baumann, T. Hao, Y. He, and R. Shoda, “Perception neuron unity handbook,” Perception Neuron Unity Integration 0.2, vol. 2, 2017.
  • [3] R. Vemulapalli, F. Arrate, and R. Chellappa, “Human action recognition by representing 3d skeletons as points in a lie group,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 588–595.
  • [4] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, “A new representation of skeleton sequences for 3d action recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3288–3297.
  • [5] M. Liu, H. Liu, and C. Chen, “Enhanced skeleton visualization for view invariant human action recognition,” Pattern Recognition, vol. 68, pp. 346–362, 2017.
  • [6] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [7] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1110–1118.
  • [8] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu, “Spatio-temporal attention-based lstm networks for 3d action recognition and detection,” IEEE Transactions on image processing, vol. 27, no. 7, pp. 3459–3471, 2018.
  • [9] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • [10] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Thirty-second AAAI conference on artificial intelligence, 2018.
  • [11] L. Shi, Y. Zhang, J. Cheng, and H. Lu, “Skeleton-based action recognition with multi-stream adaptive graph convolutional networks,” IEEE Transactions on Image Processing, vol. 29, pp. 9532–9545, 2020.
  • [12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  • [14] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [15] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, “Informer: Beyond efficient transformer for long sequence time-series forecasting,” in Proceedings of AAAI, 2021.
  • [16] C. Plizzari, M. Cannici, and M. Matteucci, “Skeleton-based action recognition via spatial and temporal transformer networks,” Computer Vision and Image Understanding, vol. 208, p. 103219, 2021.
  • [17] Y. Zhang, B. Wu, W. Li, L. Duan, and C. Gan, “Stst: Spatial-temporal specialized transformer for skeleton-based action recognition,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 3229–3237.
  • [18] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu rgb+ d: A large scale dataset for 3d human activity analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1010–1019.
  • [19] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot, “Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2684–2701, 2019.
  • [20] G. Sun, H. Cholakkal, S. Khan, F. S. Khan, and L. Shao, “Fine-grained recognition: Accounting for subtle differences between similar classes,” 2019. [Online]. Available:
  • [21] E. Nicora, G. Goyal, N. Noceti, A. Vignolo, A. Sciutti, and F. Odone, “The moca dataset, kinematic and multi-view visual streams of fine-grained cooking actions,” Scientific Data, vol. 7, no. 1, pp. 1–15, 2020.
  • [22] X. Gu, X. Xue, and F. Wang, “Fine-grained action recognition on a novel basketball dataset,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 2563–2567.
  • [23] S. Sun, F. Wang, Q. Liang, and L. He, “Taichi: A fine-grained action recognition dataset,” in Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, 2017, pp. 429–433.
  • [24] L. Dong, D. Li, S. Li, S. Lan, and P. Wang, “Tai chi action recognition based on structural lstm with attention module,” in 2019 International Conference on Image and Video Processing, and Artificial Intelligence, vol. 11321.   International Society for Optics and Photonics, 2019, p. 113211O.
  • [25] L. Liu, M. Qing, S. Chen, and Z. Li, “Tai chi movement recognition method based on deep learning algorithm,” Mathematical Problems in Engineering, vol. 2022, 2022.
  • [26] L. Xu, Q. Wang, L. Yuan, and X. Ma, “Using trajectory features for tai chi action recognition,” in 2020 IEEE International Instrumentation and Measurement Technology Conference (I2MTC).   IEEE, 2020, pp. 1–6.
  • [27] A. F. Agarap, “Deep learning using rectified linear units (relu),” arXiv preprint arXiv:1803.08375, 2018.
  • [28] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, pp. 8026–8037, 2019.