Action recognition is a fundamental research problem in computer vision. Driven by advances in deep learning models, such as 2D and 3D convolutional neural networks (CNNs), record-breaking accuracies have been achieved on large-scale action recognition datasets such as ActivityNet, Kinetics, and Sports1M. Nonetheless, fine-grained action recognition, where the appearances of action and background are very similar and recognition relies more on modeling temporal dynamics, remains a challenging research problem. To address it, various fine-grained datasets [10, 3, 18, 5] have recently been proposed. In particular, FineGym is a gymnastics video dataset with a structured temporal and semantic hierarchy that decomposes an action into coarse-to-fine event-level, set-level, and element-level annotations. The scale and diversity of FineGym pose significant challenges for fine-grained action recognition, involving not only the recognition of complex dynamics of intense motion, but also the differentiation of subtle spatial semantics. The hierarchical representation of fine-grained actions provides discriminative and complementary semantic, contextual, and motion information at different granularity levels. However, this information is not fully utilized for joint representation and recognition, as most existing methods still focus on learning and predicting a single level of action classes.
In this paper, inspired by the FineGym hierarchical representation, we propose a multi-task network for effective joint representation and learning for fine-grained action recognition. First, we introduce a new deep architecture built on the three-level hierarchy for effective learning of multi-level representations of fine-grained actions. The architecture is designed to learn and represent the associations and constraints across the different levels of the fine-grained action representation. Second, we propose a compact network of joint learning layers with multi-task learning to achieve joint learning and prediction of the three-level action classes of a fine-grained action. Our experimental results on the FineGym benchmark dataset show a clear improvement over the state-of-the-art performance.
2 Related Work
There are two main frameworks for fine-grained action recognition. The first is the 2D+1D framework [19, 21, 14, 9], where a 2D CNN extracts spatial features, followed by a 1D model for temporal aggregation. TSN samples frames from segmented videos and then aggregates the scores of each segment for the final prediction. TRN introduces a temporal relational reasoning module that allows aggregation of multi-scale temporal relations between frames. TSM proposes a temporal shift module that achieves 3D performance at 2D complexity by shifting some channels along the temporal dimension. ActionVLAD aggregates spatio-temporal features from two-stream networks by pooling across the space and time streams.
The second is the 3D CNN framework, which expands a 2D CNN into a 3D CNN by adding a temporal dimension to filters and pooling kernels to capture spatio-temporal features. The non-local neural network computes a weighted sum of features across space-time to capture long-range dependencies. SlowFast comprises a slow and a fast pathway to capture spatial semantics at a low frame rate and learn motion information at a high frame rate; lateral connections aggregate the features from the two pathways. SlowOnly is a variant of SlowFast without the fast pathway. There are also hybrid frameworks that fuse 2D and 3D CNNs in one network [15, 7, 17, 13, 8].
In existing works, investigations mainly focus on recognizing each level of fine-grained actions separately. In contrast, we investigate the joint spatio-temporal representation, learning, and prediction of hierarchical action classes in fine-grained action recognition.
3 Proposed Method
To exploit the hierarchical action representation and make use of the complementary contexts and semantics across different levels, we propose a multi-task network that consists of three input pathways, one for each hierarchy level, and a compact integration network to jointly learn and predict the coarse-to-fine action representations.
3.1 Hierarchy Representation of Fine-Grained Actions
In FineGym, a hierarchical representation format is proposed to represent fine-grained human actions. The hierarchy tree is composed of three levels: events, sets, and elements. Events represent the coarsest level of action categories; in FineGym, they correspond to different gymnastic routines. Sets represent mid-level sub-action categories under the corresponding event node in the hierarchy tree, where each set clusters several technically and visually similar element actions. Elements represent the sub-action categories of the finest granularity under the corresponding set node in the hierarchy tree.
The hierarchy representation provides a multi-level, coarse-to-fine representation of fine-grained human actions, with several advantages for both learning and prediction. First, it organizes a large set of fine-grained actions into a multi-level semantic hierarchy tree. This semantic hierarchy provides a domain constraint and a solid foundation for joint learning and prediction, from coarse-level to fine-grained action understanding. Second, the hierarchy annotations are derived from a decision tree, so the hierarchy representation encodes distinctive semantic, visual-appearance, and motion features at different levels. For example, some element actions can be distinguished by their background, objects, and viewing angles, while others differ mainly in fine-grained motion features and duration. Training a single model on element-level actions alone may therefore not be sufficient to learn discriminative feature representations within a hierarchical structure. In addition, the hierarchical structure encodes domain knowledge of the action classes. Hence, exploiting multi-level action representations may help better aggregate spatial semantics and temporal dynamics, and implicitly learn the domain class information. Such semantic information and domain knowledge may not be well exploited when training on a single level, especially the element level, for fine-grained action recognition.
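The three-level structure above can be sketched as a nested mapping, where each element label has exactly one set and one event ancestor. The label strings below are illustrative placeholders, not the actual FineGym annotations:

```python
# Minimal sketch of the event -> set -> element hierarchy.
# All label names here are hypothetical examples for illustration.
hierarchy = {
    "balance_beam": {                       # event (coarsest level)
        "beam_dismounts": [                 # set (middle level)
            "salto_backward_stretched",     # elements (finest level)
            "salto_forward_tucked",
        ],
    },
    "uneven_bars": {
        "bar_circles": ["clear_hip_circle", "stalder_backward"],
    },
}

def ancestors(element, tree):
    """Return the (event, set) pair an element label belongs to, or None."""
    for event, sets in tree.items():
        for set_name, elements in sets.items():
            if element in elements:
                return event, set_name
    return None
```

This is the domain constraint the joint prediction exploits: a correct element prediction implies unique set and event labels.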
3.2 Network Architecture for Multi-task Learning
Inspired by the SlowFast network, which captures spatial and temporal semantics through its slow and fast pathways, we design a three-pathway multi-task network for joint learning of the hierarchical representation of fine-grained actions. Specifically, the network has an Event pathway, a Set pathway, and an Element pathway, where each pathway samples the input clip at a different frame rate and outputs features for our joint prediction layers. We first encode the features from each pathway into a lower-dimensional vector before fusing them for joint prediction. The encoded features from all pathways are concatenated for multi-task prediction over the hierarchical action representation. The architecture of the proposed network is illustrated in Fig. 1, where the base model for each pathway is a SlowOnly network trained for single-level prediction.
3.2.1 Base model for individual pathways
The observations in FineGym suggest that action representations at the coarsest granularity, such as the event class, are more related to visual appearance and background context, and rely less on motion information. On the other hand, the element class at the finest granularity relies more on detailed motion information and exhibits subtle differences in visual appearance. Following the hierarchy representation, we propose to use a slow pathway with a low frame rate for the event class, which captures semantic information from a few sparsely sampled frames; a medium pathway with a moderate frame rate for the set class, which captures increasing motion information from more sampled frames; and a fast pathway with a high frame rate for the element class, which captures detailed motion information from densely sampled frames. Following the notation in SlowFast, we denote the feature shape of a pathway as T × S² × C, where T is the number of frames sampled, S² represents the spatial size, and C denotes the number of channels.
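The per-pathway sampling can be sketched as strided frame-index selection. The Event (4 frames, stride 16) and Element (32 frames, stride 2) configurations are taken from Section 4.2; the medium-pathway frame count here (8) is an illustrative assumption, since the text gives only its intervals:

```python
def sample_indices(num_frames, interval, clip_len):
    """Evenly strided frame indices for one pathway, clipped to video length."""
    return [min(i * interval, clip_len - 1) for i in range(num_frames)]

# Assumed 64-frame input clip; per-pathway T x tau configurations:
event_idx   = sample_indices(4, 16, 64)   # slow: sparse, semantic context
set_idx     = sample_indices(8, 8, 64)    # medium: frame count assumed, not stated
element_idx = sample_indices(32, 2, 64)   # fast: dense, fine-grained motion
```

All three pathways span the same temporal window, so coarse appearance context and dense motion detail describe the same action instance.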
3.2.2 Joint Prediction Layers
After training the base models for each hierarchy level, we combine them to form the three pathways of our multi-task network. The pre-trained features for the Event, Set, and Element pathways are encoded from the 2048-D output vector to lower-dimensional vectors of size 128, 256, and 1024 respectively. The Element pathway has the largest feature dimension, and those of the Set and Event pathways are reduced gradually, since more bits are required to encode the details of low-level sub-actions. This design incorporates more temporal dynamic features extracted from the high-frame-rate input, together with sufficient spatial information from the lower-frame-rate inputs. The encoded features are then concatenated and fully connected to a linear layer of dimension 1024. Finally, the linear layer is connected to three classifiers for joint prediction of the event, set, and element labels. We adopt the cross-entropy loss in (1) for the three categories,

L_c = - \sum_{i=1}^{N_c} y_{c,i} \log \frac{e^{s_{c,i}}}{\sum_{j=1}^{N_c} e^{s_{c,j}}},   (1)

where c denotes the category (event, set, or element), s_{c,i} is the CNN score of class i of category c, y_{c,i} is the ground-truth value, with only one non-zero element for the ground-truth positive class, and N_c denotes the number of classes in category c.
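The layer dimensions described above can be summarized in a shape-level sketch (weights and the actual forward pass omitted). Encoding sizes follow this section, and class counts follow Gym99 (4 events, 15 sets, 99 elements); the 2048-D pathway output is the stated base-model feature size:

```python
# Shape-level sketch of the joint prediction layers.
ENC = {"event": 128, "set": 256, "element": 1024}     # per-pathway encodings
NUM_CLASSES = {"event": 4, "set": 15, "element": 99}  # Gym99 label counts
FUSED = 1024                                          # shared linear layer size

def head_shapes(pathway_dim=2048):
    """Return (in, out) shapes of each linear layer in the fusion head."""
    concat = sum(ENC.values())  # 128 + 256 + 1024 = 1408 after concatenation
    return {
        "encoders": {c: (pathway_dim, d) for c, d in ENC.items()},
        "fusion": (concat, FUSED),
        "classifiers": {c: (FUSED, n) for c, n in NUM_CLASSES.items()},
    }
```

The three classifiers share the 1024-D fused representation, which is where the complementary multi-level features interact.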
The total loss in (2) for training the multi-task network is a weighted sum of the three losses,

L = w_event · L_event + w_set · L_set + w_element · L_element,   (2)

where the ratio of the weights w_event : w_set : w_element is set to 1 : 2 : 4.
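The per-category loss of Eq. (1) and the weighted combination of Eq. (2) are straightforward to sketch in plain Python (a framework implementation would use its built-in cross-entropy):

```python
import math

def cross_entropy(scores, target):
    """Softmax cross-entropy of Eq. (1) for one category head.

    `scores` are the raw CNN scores s_{c,i}; `target` is the index of the
    single non-zero ground-truth class.
    """
    m = max(scores)                           # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    return -math.log(exps[target] / sum(exps))

# Weighted multi-task loss of Eq. (2) with the 1:2:4 ratio.
WEIGHTS = {"event": 1.0, "set": 2.0, "element": 4.0}

def total_loss(losses):
    """Combine per-category losses; `losses` maps category name to a float."""
    return sum(WEIGHTS[c] * losses[c] for c in WEIGHTS)
```

The larger element weight biases training toward the finest granularity, which is the hardest of the three tasks.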
4 Experiments
4.1 Implementation Details
Our framework is implemented using MMAction2 on four GeForce GTX 2080 Ti GPUs. We adopt ResNet-50 as the backbone of the SlowOnly base models. The base models and the multi-task network are trained with a learning rate of 0.01, momentum of 0.9, weight decay, and gradient clipping of 40. Following the original configurations, SlowOnly is trained with learning rate decay at fixed steps 90 and 110.
Our experiments are conducted on the FineGym dataset, specifically Gym99, which contains annotations for 4 events, 15 sets, and 99 elements. At the time we downloaded the dataset, some of the videos were no longer available.
4.2 Base Models
Recently, a study (https://github.com/open-mmlab/mmaction2/tree/master/configs/recognition/slowonly) implemented the SlowOnly network on FineGym99. With its reported parameter settings, the model produced 79.3% Top-1 and 70.2% mean accuracies, which are still lower than the state-of-the-art performance. We first implement and investigate training SlowOnly networks individually for Event, Set, and Element actions as our base models, with different configurations. For Event prediction, we sample 4 frames at an interval of 16 frames as the low-frame-rate input. For Set prediction, we experiment with two configurations, with sampling intervals of 8 and 4 frames respectively. For Element prediction, we sample 32 frames at an interval of 2. The input image size and the first convolution kernels follow the original configurations. Additionally, we also implement a SlowFast network as a base model for Element prediction, with separate first convolution kernels for the slow and fast pathways. SlowFast follows the implementation of SlowOnly, but adopts a cosine annealing learning rate schedule. Both the SlowOnly and SlowFast models are trained for 120 epochs and tested on six clips with a center crop. The results of the individual models, evaluated on Top-1, Top-5, and mean accuracies, are shown in Table 1.
As observed in Table 1, SlowOnly outperforms the SlowFast network in Element prediction, with an 8.7% gain in Top-1 accuracy. This is due to the reduced channel dimension of the fast pathway in the SlowFast network, which cannot capture sufficient temporal context to distinguish the fine-grained motion. For Set and Element predictions, using a higher number of input frames improves the Top-1 and mean accuracies. The results in Table 1 also serve as baselines for comparison with our multi-task network.
4.3 Multi-task Network
In this section, we take one base model for each hierarchy level from Table 1 to form the pathways of our multi-task network. During training, the weights of the base models are frozen, and their output features are used to train our joint prediction layers. The network is trained for 60 epochs and tested on a single clip with a center crop. We experiment with four different configurations and report the results in Table 2.
As observed in Table 2, multi-task networks with fused features perform better than the corresponding base models trained separately for each category in Table 1. This is due to the encoding and learning of complementary feature representations from multi-level spatio-temporal semantics. It is also apparent that multi-task networks using SlowOnly perform better than those using SlowFast in Element prediction, due to the difference in channel dimensions.
4.4 Comparison with State-of-the-Art
We compare our multi-task learning results with state-of-the-art action recognition results on FineGym99. Specifically, we compare with TSN, TRNms, TSM, I3D, and NL I3D with two-stream modalities. The results are presented in Table 3. The Event, Set, and Element results from TSN are trained separately, whereas our results are joint predictions from our multi-task network. Additionally, we implement SlowFast with multi-task learning by modifying its last fully-connected layer into three classifiers and adopting the same weighted loss function for training.
Our network outperforms all the compared action models in Element prediction, achieving 91.80% Top-1 and 88.46% mean accuracies, improvements of 3.40 and 7.26 percentage points respectively over the state-of-the-art performance. For Set prediction, our joint learning results perform better than TSN, with increments of 1.25 and 6.90 in Top-1 and mean accuracies respectively. Compared to the baseline SlowFast multi-task network, our network with encoding and fusion layers learns a better joint representation, leading to improved performance in Element and Set predictions. For Event prediction, our performance is comparable to the TSN and SlowFast results.
5 Conclusion
In this paper, we presented a multi-task network for effective representation and joint learning of fine-grained human actions over the three-level hierarchy proposed for FineGym. Our experimental results show the effectiveness of exploiting and leveraging the semantic and temporal context of parallel pathways with varying input sampling rates. We added integration layers to allow joint encoding and learning of complementary spatio-temporal features of the hierarchical action categories. Our multi-task network outperforms networks with single-task prediction. For future work, we will investigate networks with end-to-end training to jointly learn and refine the hierarchical action representations for multi-task prediction.
References
- ActivityNet: a large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970, 2015.
- Quo vadis, action recognition? A new model and the Kinetics dataset, 2017.
- HAA500: human-centric atomic action dataset with curated videos, 2020.
- OpenMMLab's next generation video understanding toolbox and benchmark. https://github.com/open-mmlab/mmaction2, 2020.
- FineGym: a hierarchical video dataset for fine-grained action understanding. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- SlowFast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6202–6211, 2019.
- Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941, 2016.
- X3D: expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 203–213, 2020.
- ActionVLAD: learning spatio-temporal aggregation for action classification, 2017.
- The "something something" video database for learning and evaluating visual common sense, 2017.
- Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
- The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- Semi-CNN architecture for effective spatio-temporal learning in action recognition. Applied Sciences 10(2), pp. 557, 2020.
- TSM: temporal shift module for efficient video understanding, 2019.
- Human action recognition using factorized spatio-temporal convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4597–4605, 2015.
- Learning spatiotemporal features with 3D convolutional networks, 2015.
- A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6450–6459, 2018.
- Yoga-82: a new dataset for fine-grained classification of human poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 1038–1039, 2020.
- Temporal segment networks for action recognition in videos, 2017.
- Non-local neural networks, 2018.
- Temporal relational reasoning in videos, 2018.