1. Introduction and related works
Fine-grained action classification has been increasingly investigated in recent years due to its various potential applications, such as daily living care (Cartas et al., 2020; Das et al., 2019), video security and surveillance (Singh et al., 2016) or sport activities (Martin et al., 2020; Shao et al., 2020; Li et al., 2018). The difference with coarse-grained action classification (Smaira et al., 2020; Li et al., 2020; Soomro et al., 2012) lies in the high inter-class similarity of the actions. The movements performed are often similar since they belong to one particular activity. Moreover, since the videos are recorded in the same context, the background scene and the manipulated objects are similar in all videos. Consequently, all possible information should be extracted from the performed movement itself in order to discriminate actions. The target application of our research is fine-grained action recognition in sports with the aim of improving athletes' performance.
Collecting individual data from athletes with body-worn sensors (connected watches, smart clothes, exoskeletons) might be a valuable source of information for the classification of similar actions. However, the analysis of gestures is often confined to laboratory studies (Ebner and Findling, 2019; Liu et al., 2019; Xia et al., 2020). Sound has also proven to be efficient for event detection (Baughman et al., 2019) but may not be used for more complex tasks. In (Voeikov et al., 2020), the authors propose an advanced real-time solution for scene segmentation, ball trajectory estimation and event detection but do not consider stroke classification.
The use of pose, expressed as coordinates of skeleton joints, has also become popular for action recognition (Wu et al., 2016). Similarly, PoTion (Choutas et al., 2018) uses the movement of the human joints as features to improve the classification score of the I3D models (Carreira and Zisserman, 2017). Pose has also been used in sport: (Fani et al., 2019) performs classification of four classes of football (soccer) actions based on pose estimation. The authors of (Shimizu et al., 2019) investigate shot direction in tennis using the pose of the player. In (Luvizon et al., 2018), the authors propose a multi-task method for 2D/3D pose estimation and action recognition. Similarly, (Rogez et al., 2020), based on LCR-Net (Rogez et al., 2017), builds a pseudo ground truth for 3D poses from images using 2D pose search in a projected 3D pose dataset in order to offer 3D human pose from images. In (Yan et al., 2019), pose representations from the pose estimators are fed to a 3D CNN in order to obtain a spatio-temporal representation used for action classification. Furthermore, a 3D attention mechanism has been investigated on skeleton joints using LSTM (Liu et al., 2017). The pose can also be used for spatial segmentation as in (Soomro et al., 2019).
However, skeleton-based approaches to action recognition have limitations, as pointed out in (Zheng et al., 2020): the authors manage to induce large errors in pose-based models with attacks that apply only small variations to the input pose. The use of several modalities is therefore needed to be less dependent on the pose estimation alone. Pose can also be used for further analysis: in (Wu and Koike, 2020), the position of a table tennis ball is predicted according to the player’s pose. In (Morel et al., 2017), qualitative measures of tennis and karate gestures are computed for comparing the pose of expert and novice participants.
Recording “markerless” and “sensorless” video of performing athletes has an advantage: it does not bias human performance in the target task. In this case, the classification of actions has to be done using video only. Hence, as much information as possible must be extracted from the video stream in order to conduct movement analysis. The first modality is the raw information of pixel colour values. Motion, extracted by optical flow, is another important modality, as investigated in (Martin et al., 2019), where it proved efficient in terms of classification performance. Improvements can be achieved by making use of other information from the video recordings, such as the sound. Another possibility is to extract information that results from the interpretation of raw data. Thus, we consider the human pose, expressed via the spatial coordinates of the joints, which can be computed from the same videos. The purpose is to make cameras “smart” to analyse sport practices (Einfalt et al., 2018).
In this work, we investigate the use of pose information for classification, inspired by the work carried out in (Martin et al., 2020, 2021). A third branch with pose information is added to the original twin model, which takes as input the RGB stream and its estimated optical flow. The branches are fused at the last stage of the network through several bilinear layers. Experiments are performed on the TTStroke-21 dataset. We solve two distinct tasks: classification only, and joint classification and segmentation from videos. Through the use of pose and a fusion approach, our method achieves slightly better performance on the classification task and much better scores on the joint classification and segmentation task. We also present the opportunities that the pose offers for further movement analysis to enrich the feedback to the users.
2. Proposed approach
To deal with the low inter-class variability of the actions inherent to fine-grained action classification, the most complete information must be extracted from the video, i.e. both the appearance (RGB) and motion (Optical Flow) modalities. Spatio-temporal convolutions in the network are performed on cuboids of RGB frames and on cuboids of optical flow (OF). Pose joints are also processed by temporal convolutions. All three modalities are processed simultaneously through a three-stream architecture as presented in Figure 2.
2.1. Optical Flow Estimation
As presented in (Martin et al., 2019), the OF and its normalization can strongly impact the classification results. The motion estimator that reached the best classification performance in that study is used hereafter. The method is based on an iteratively re-weighted least squares solver (Liu, 2009). Each OF frame is encoded with its horizontal and vertical motion components, computed from two consecutive RGB frames. In order to keep only foreground motion, the estimated OF is smoothed with a Gaussian filter with kernel size and multiplied using the Hadamard product by the computed foreground mask : (Zivkovic and van der Heijden, 2006).
2.2. Region Of Interest Estimation
The region of interest (ROI) center is estimated from the maximum of the OF norm and the center of gravity of non-null OF values as follows:
with parameter set empirically to , the size of video frames. Function allows to define ROI without the image border. The size of cuboids are which corresponds to a duration of 0.83s. To avoid jittering within the RGB and OF, a Gaussian filter of kernel size ( second) and scale parameter is applied along the temporal dimension to average the ROI center position. These parameter values were chosen experimentally and are suitable for the 120 fps video frame rate.
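The ROI estimation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the exact equation is not reproduced here, so the sketch assumes the two candidates (maximum of the OF norm and center of gravity of non-null OF values) are blended with a hypothetical weight `alpha`, that the clamping keeps a hypothetical `margin` from the border, and that the temporal Gaussian filter replicates the first and last centers at the sequence ends. The kernel size and `sigma` are placeholders, not the values used in the experiments.

```python
import math

def roi_center(flow, alpha=0.5, margin=1):
    """Estimate an ROI center from one optical-flow frame.

    flow   -- 2D grid (list of rows) of (vx, vy) tuples
    alpha  -- hypothetical blending weight between the two candidates
    margin -- hypothetical distance (pixels) kept from the image border
    """
    height, width = len(flow), len(flow[0])
    best_norm, x_max, y_max = -1.0, 0, 0
    sum_x = sum_y = count = 0
    for y, row in enumerate(flow):
        for x, (vx, vy) in enumerate(row):
            norm = math.hypot(vx, vy)
            if norm > best_norm:   # candidate 1: strongest motion
                best_norm, x_max, y_max = norm, x, y
            if norm > 0.0:         # candidate 2: centroid of non-null flow
                sum_x, sum_y, count = sum_x + x, sum_y + y, count + 1
    if count == 0:                 # no motion at all: use the frame center
        cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    else:
        cx = alpha * x_max + (1.0 - alpha) * sum_x / count
        cy = alpha * y_max + (1.0 - alpha) * sum_y / count
    # Clamp so that the ROI stays away from the image border.
    cx = min(max(cx, margin), width - 1 - margin)
    cy = min(max(cy, margin), height - 1 - margin)
    return cx, cy

def smooth_centers(centers, kernel_size=5, sigma=1.0):
    """Gaussian smoothing of ROI centers along time (edges are replicated)."""
    half = kernel_size // 2
    offsets = range(-half, half + 1)
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    total = sum(weights)
    last = len(centers) - 1
    smoothed = []
    for t in range(len(centers)):
        sx = sy = 0.0
        for k, w in zip(offsets, weights):
            x, y = centers[min(max(t + k, 0), last)]
            sx, sy = sx + w * x, sy + w * y
        smoothed.append((sx / total, sy / total))
    return smoothed
```

Smoothing the centers rather than the cropped frames keeps the RGB and OF cuboids aligned, since both are cropped around the same filtered position.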
2.3. Pose Estimation
Poses are estimated with PersonLab (Papandreou et al., 2018). It supplies poses and human joint positions and their confidence scores. In addition, the pose position (mean of the joint coordinates) and its attributed score are used, leading to a descriptor vector with elements such as:
with the joint or the pose, and its horizontal and vertical coordinate and its associated score.
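As a minimal sketch of how such a descriptor could be assembled, the function below flattens the joints and appends the pose entry. The ordering of the elements (joints first, pose last, each as horizontal coordinate, vertical coordinate, score) and the name `pose_descriptor` are assumptions made for illustration; the pose score is taken as supplied by the pose estimator.

```python
def pose_descriptor(joints, pose_score):
    """Flatten joints and the pose position into one descriptor vector.

    joints     -- list of (x, y, score) tuples, one per joint
    pose_score -- score attributed to the whole pose by the estimator
    Returns a list of 3 * (len(joints) + 1) values.
    """
    n = len(joints)
    # The pose position is the mean of the joint coordinates.
    pose_x = sum(x for x, _, _ in joints) / n
    pose_y = sum(y for _, y, _ in joints) / n
    descriptor = []
    for x, y, score in joints:
        descriptor.extend([x, y, score])
    descriptor.extend([pose_x, pose_y, pose_score])
    return descriptor
```

With J joints, the descriptor holds 3 × (J + 1) values: three per joint plus three for the pose itself.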
2.4. Data Normalization
To map their values into interval , RGB data are normalized by their theoretical maximum, while joint positions and are normalized with respect to the width and height of the video frames: and . The OF is normalized using the mean of the maximum absolute values distribution of each OF component over the whole dataset, as described in equation 3:
with and representing respectively one component of the OF and its normalization. This normalization scales values into [-1,1] and increases the magnitude of most vectors, making the OF more relevant for classification.
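The three normalizations can be illustrated as follows. This is a sketch under assumptions, not the exact formulation of equation 3: it assumes 8-bit RGB values (theoretical maximum 255), and it assumes the OF component is divided by a dataset-level constant (here the mean of the per-frame maximum absolute values) and then clipped to [-1, 1]. The helper names are hypothetical.

```python
def normalize_rgb(value, max_value=255.0):
    """Map a pixel value into [0, 1], assuming 8-bit images."""
    return value / max_value

def normalize_joint(x, y, width, height):
    """Map joint coordinates into [0, 1] using the frame size."""
    return x / width, y / height

def flow_scale(frames):
    """Mean, over frames, of the maximum absolute value of one OF component.

    frames -- list of frames, each a flat list of values of that component
    """
    maxima = [max(abs(v) for v in frame) for frame in frames]
    return sum(maxima) / len(maxima)

def normalize_flow(value, scale):
    """Divide by the dataset-level scale, then clip into [-1, 1]."""
    return max(-1.0, min(1.0, value / scale))
```

Because the scale is a mean of maxima rather than the global maximum, most flow values are stretched towards larger magnitudes, and only the rare extreme values are clipped.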
2.5. Model Architecture
The architecture is inspired by the Twin Spatio-Temporal Convolutional Neural Network (T-STCNN) with attention mechanisms presented in (Martin et al., 2021), which takes as input the OF and RGB values through two branches using 3D convolutions and attention mechanisms. Compared to the latter, our network has three branches and takes joint coordinates as additional input. Furthermore, the fusion step is adapted to fuse the three modalities.
As depicted in Figure 2, the network performs 3D (spatio-temporal) and 1D (temporal) convolutions. The first two branches are composed of three 3D convolutional layers with , , filters respectively, which can be described by equation 4:
where is the output channel, the number of channels of the input and is the valid 3D cross-correlation operator. Each branch takes cuboids of RGB values and OF of size with respectively and channels. The 3D convolutional layers use in each direction. Their output is processed by max-pooling layers using kernels of size . Each max-pooling layer feeds an attention block. The output of the successive convolutions is then flattened to feed a fully connected layer: of length .
An extra branch processes the pose data of length . It follows the same organization as the two other branches but uses 1D temporal convolutions over all the joint coordinates and scores (see eq. 2) at the first convolution, leading to channels. This operation is similar to equation 4 using simple cross-correlation. A max-pooling operation is performed along the temporal dimension.
The three branches are fused two by two using bilinear fully connected layers: , of length , which represents the number of classes. The three resultant outputs are summed and processed by a Softmax function to output probabilistic scores used for classification.
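The pairwise fusion can be illustrated with a small, framework-free sketch. It assumes the standard bilinear form (for each class k, the output is x1ᵀ W_k x2 + b_k) and one bilinear layer per pair of branches; the pairing order and the function names are choices made for this illustration, and real feature vectors are of course much longer than in the usage below.

```python
import math

def bilinear(x1, x2, weights, bias):
    """Bilinear layer: out[k] = sum_ij x1[i] * weights[k][i][j] * x2[j] + bias[k]."""
    return [
        sum(x1[i] * w_k[i][j] * x2[j]
            for i in range(len(x1)) for j in range(len(x2))) + b_k
        for w_k, b_k in zip(weights, bias)
    ]

def softmax(logits):
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_three_streams(rgb, flow, pose, layers):
    """Fuse branch features two by two, sum the outputs, apply softmax.

    layers -- three (weights, bias) pairs, used in this (assumed) order:
              (RGB, OF), (RGB, Pose), (OF, Pose)
    """
    pairs = [(rgb, flow), (rgb, pose), (flow, pose)]
    outputs = [bilinear(a, b, w, bias)
               for (a, b), (w, bias) in zip(pairs, layers)]
    summed = [sum(values) for values in zip(*outputs)]
    return softmax(summed)
```

Each bilinear layer directly outputs one score per class, so the three pairwise outputs live in the same space and can be summed before the softmax.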
2.6. Data Augmentation
Data augmentation is performed on-the-fly on the train set. Each stroke sample is fed to the model once per epoch. For temporal augmentation, successive data from the RGB, OF and Pose modalities are extracted following a normal distribution around the center of the stroke video segment with a standard deviation of with . Spatial augmentation is performed with random rotation in range , random translation in range in and directions, random homothety in range and flip in horizontal direction with of probability. The OF and Pose values are updated accordingly. Transformations are applied on the region of interest to avoid inputting regions outside the image borders. During the test phase, no augmentation is performed and the extracted frames are temporally centered on the stroke segment.
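Two aspects of this augmentation lend themselves to a short sketch: drawing the temporal window around the segment center, and updating the OF and Pose values after a horizontal flip. The code below is an illustration under assumptions: the window start is clamped so that the window stays inside the segment, pose coordinates are assumed to be already normalized to [0, 1], and left/right joint labels are not swapped. Function names and parameter values are hypothetical.

```python
import random

def sample_window_start(n_frames, window, sigma, rng):
    """Draw the start index of a temporal window (training phase).

    The window center follows a normal distribution around the segment
    center; the start is clamped to keep the window inside the segment.
    """
    center = rng.gauss((n_frames - 1) / 2.0, sigma)
    start = int(round(center - (window - 1) / 2.0))
    return min(max(start, 0), max(n_frames - window, 0))

def centered_window_start(n_frames, window):
    """Start index of the temporally centered window (test phase)."""
    return max((n_frames - window) // 2, 0)

def flip_pose_horizontally(joints):
    """Mirror normalized joints (x, y, score); scores are unchanged."""
    return [(1.0 - x, y, score) for x, y, score in joints]

def flip_flow_horizontally(flow):
    """Mirror an OF frame: reverse each row and negate the horizontal
    component so that the motion direction stays consistent."""
    return [[(-vx, vy) for vx, vy in reversed(row)] for row in flow]
```

Negating the horizontal OF component is what "updated accordingly" amounts to for a flip: a motion towards the right of the original image is a motion towards the left of the mirrored one.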
2.7. Training Phase
All models are trained from scratch using stochastic gradient descent with Nesterov momentum and weight decay. Cross-entropy loss is used as the objective function. A learning rate scheduler is used, which reduces and increases the learning rate when the observed metric (validation loss) reaches a plateau. A warm restart technique (Loshchilov and Hutter, 2017) is used: the weights and state of the model are saved when it performs best (lowest validation loss) and re-loaded when the learning rate is updated. This allows the optimization process to restart from the past state with a new learning rate in the gradient descent optimizer.
The following parameters were found optimal after successive experimental trials using grid search. Grid search was used for the following parameters: , and the number of epochs considered for comparing the training loss averages.
Training process starts with a learning rate of . A number of epochs: , set to , is considered before updating the learning rate, unless the performance drastically dropped (in our case: of the best validation accuracy obtained).
The metric of interest is the training loss: if its average on the last epochs is greater than its average on the epochs before, the process is re-started from the past state and the learning rate is divided by ten until reaching . After this step, the learning rate is set back to and the process continues. This technique differs from decreasing only by step (Zagoruyko and Komodakis, 2016) since the learning rate might re-increase if no improvement is observed.
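The scheduling rule based on the training loss can be sketched as a small state machine. This is an illustration, not the training code: the class name, the choice to clear the loss history after each update, and the numeric values in the usage example are assumptions, and the exception for a drastic performance drop is not modelled. The caller is expected to reload the best saved state whenever `step` returns `True`.

```python
class PlateauRestartScheduler:
    """Divide the learning rate by ten on a plateau; once the minimum has
    been reached, the next plateau sets it back to the initial value."""

    def __init__(self, initial_lr, min_lr, window):
        self.initial_lr = initial_lr
        self.min_lr = min_lr
        self.window = window      # number of epochs in each averaged block
        self.lr = initial_lr
        self.losses = []

    def step(self, train_loss):
        """Record one epoch; return True if the learning rate was updated."""
        self.losses.append(train_loss)
        w = self.window
        if len(self.losses) < 2 * w:
            return False
        recent = sum(self.losses[-w:]) / w
        previous = sum(self.losses[-2 * w:-w]) / w
        if recent <= previous:    # still improving: keep the current rate
            return False
        if self.lr <= self.min_lr * (1.0 + 1e-9):
            self.lr = self.initial_lr          # minimum reached: restart high
        else:
            self.lr = max(self.lr / 10.0, self.min_lr)
        self.losses.clear()       # assumption: history restarts with the new rate
        return True
```

For example, with `PlateauRestartScheduler(0.1, 0.001, 2)`, three successive plateaus move the learning rate from 0.1 to 0.01, then to 0.001, then back to 0.1.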
3. Experiments and Results
We compare results with the original T-STCNN with attention mechanism from (Martin et al., 2021) and the Two-Stream I3D model (Carreira and Zisserman, 2017), all trained from scratch and tested on TTStroke-21 (Fig. 3), and fed with cuboids of the same size. As an ablation study, the three-stream network is trained with and without attention mechanism on the third branch, using in both cases a momentum of , a weight decay of and a batch size of over epochs. The learning rate varies between and . Two tasks are considered: i) pure classification and ii) joint classification and segmentation. We also widen the field of application by considering the pose estimation for movement analysis.
3.1. TTStroke-21 Dataset
TTStroke-21, depicted in Figure 3, is composed of table tennis videos recorded indoors at different frame rates. The players are filmed in game or training situations, performing in natural conditions without markers. The videos have been annotated by table tennis players and experts from the Faculty of Sports (STAPS) of the University of Bordeaux, France. The number of classes considered is 21:
8 services: Serve Forehand Backspin, Serve Forehand Loop, Serve Forehand Sidespin, Serve Forehand Topspin, Serve Backhand Backspin, Serve Backhand Loop, Serve Backhand Sidespin, Serve Backhand Topspin;
6 offensive strokes: Offensive Forehand Hit, Offensive Forehand Loop, Offensive Forehand Flip, Offensive Backhand Hit, Offensive Backhand Loop, Offensive Backhand Flip;
6 defensive strokes: Defensive Forehand Push, Defensive Forehand Block, Defensive Forehand Backspin, Defensive Backhand Push, Defensive Backhand Block, Defensive Backhand Backspin;
and an extra negative class.
In the following experiments, of videos recorded at fps are used. They represent a total of actions/strokes. From these temporally segmented table tennis strokes, negative additional samples are extracted from the rest of the videos. A larger number of negative samples could have been extracted, but this choice was made to have a lighter class imbalance, speed up the training process and be consistent with the previous experiments. The dataset is distributed in Train, Validation and Test sets with , and proportions as in (Martin et al., 2020). Extracted frames of size , are resized to before computing modalities.
Note that not all computed human joints are considered in the pose data. Some of them might not be visible in the videos, e.g. the knees and ankles, which are often hidden by the table. We consider human joints: nose, eyes, ears, shoulders, elbows, wrists and hips. This leads to a descriptor of length . Furthermore, other players may appear in the scene, which leads to the detection of several poses in the same frame. In this case, only the pose closest to the previously computed ROI center is considered. If no pose is detected (which is the case for 25% of the frames), the descriptor vector is filled with the ROI center coordinates and a score of . Misdetection of the pose happens when the player is out of the camera field of view. This happens when the ball leaves the table and the player has to catch it, or at the beginning and at the end of the game.
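The pose selection and the fallback for frames without detection can be sketched as follows. The sketch measures the distance between the ROI center and the pose position (mean of the joint coordinates), which is an assumption about how "closest" is computed. The fallback score is left as a parameter, and the joint count used in the usage note is obtained by counting the listed joints with paired ones counted twice, which is also an assumption.

```python
import math

def pose_position(joints):
    """Mean of the joint coordinates; joints are (x, y, score) tuples."""
    n = len(joints)
    return (sum(x for x, _, _ in joints) / n,
            sum(y for _, y, _ in joints) / n)

def closest_pose(poses, roi_center):
    """Return the detected pose closest to the ROI center, or None."""
    if not poses:
        return None

    def distance(joints):
        px, py = pose_position(joints)
        return math.hypot(px - roi_center[0], py - roi_center[1])

    return min(poses, key=distance)

def fallback_descriptor(roi_center, n_joints, fallback_score):
    """Descriptor used when no pose is detected: every (x, y, score)
    entry holds the ROI center coordinates and the fallback score."""
    entry = [roi_center[0], roi_center[1], fallback_score]
    return entry * (n_joints + 1)   # one entry per joint, plus the pose itself
```

Counting nose, eyes, ears, shoulders, elbows, wrists and hips in this way gives 13 joints, so `fallback_descriptor(center, 13, score)` returns 3 × (13 + 1) = 42 values.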
3.2. Pure Classification Task
Results are compared with different models that have been tested following an ablation method (Martin et al., 2021). As can be observed from Table 1, the I3D models perform worse than the others: the network is too deep for such a limited real-life dataset and such a challenging fine-grained task. Furthermore, the classification performances of the Twin model and the Three-Stream model are similar. However, room for improvement remains for the Three-Stream Network using attention mechanism since the gap between validation and train accuracy is lower. Convergence is achieved at epoch for the latter model after only effective epochs (counting only epochs following and saving of the state). The other models reach convergence only after epoch .
Preliminary results have also shown that the Pose alone could not achieve convergence and was able to classify with of accuracy only on the test set for the pure classification task. The importance of combining different information sources is thus obvious.
Table 1: models compared on the pure classification task.
|Model|
|RGB - I3D (Carreira and Zisserman, 2017)|
|RGB - STCNN (Martin et al., 2019)|
|Flow - I3D (Carreira and Zisserman, 2017)|
|Flow - STCNN (Martin et al., 2019)|
|RGB + Flow - I3D (Carreira and Zisserman, 2017)|
using attention mechanism on all branches
* using attention mechanism only on the OF and RGB branches
3.3. Joint Classification and Segmentation Task
Similarly to (Martin et al., 2020), joint classification and segmentation of video segments is performed using an overlapping sliding window. Different decision rules are investigated to aggregate the obtained probabilities along the temporal dimension, using a window size of for “Vote” and “Avg” rules, and size for “Gauss” rule. Once more, these window sizes were fixed after a preliminary grid search. The decision rules were respectively: i) majority vote, ii) average decision rule, and iii) weighted decision fusion using a Gaussian kernel. Performances are reported with all the labels, and also when the negative class is not considered. This second evaluation is motivated by the fact that most parts of a video consist of negative labels. Indeed, all portions between strokes are considered as negative, i.e. when the player is getting ready, when the match or training session ends, and when the player is resting.
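The three decision rules can be illustrated on a sequence of per-position probability vectors, as produced by the sliding window. This sketch assumes a window centered on each position and truncated at the sequence ends; the function name, the `sigma` parameter of the Gaussian rule and the window sizes are hypothetical placeholders.

```python
import math
from collections import Counter

def _argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def decide(probs, window, rule, sigma=1.0):
    """Return one label per temporal position.

    probs  -- list of probability vectors, one per sliding-window position
    window -- number of neighbouring positions taken into account
    rule   -- "vote" (majority vote), "avg" (average of probabilities)
              or "gauss" (Gaussian-weighted average)
    """
    half = window // 2
    n_classes = len(probs[0])
    labels = []
    for t in range(len(probs)):
        lo, hi = max(0, t - half), min(len(probs), t + half + 1)
        neighbours = probs[lo:hi]
        if rule == "vote":
            votes = Counter(_argmax(p) for p in neighbours)
            labels.append(votes.most_common(1)[0][0])
            continue
        if rule == "avg":
            weights = [1.0] * len(neighbours)
        elif rule == "gauss":
            weights = [math.exp(-((i - t) ** 2) / (2.0 * sigma ** 2))
                       for i in range(lo, hi)]
        else:
            raise ValueError("unknown rule: " + rule)
        scores = [sum(w * p[c] for w, p in zip(weights, neighbours))
                  for c in range(n_classes)]
        labels.append(_argmax(scores))
    return labels
```

All three rules remove isolated label changes; the Gaussian rule additionally gives more influence to the positions closest to the one being decided.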
Table 2: joint classification and segmentation results, reported with all the labels and without taking into account the negative labels.
using attention mechanism on all branches
* using attention mechanism only on the OF and RGB branches
The superiority of the Three-Stream network is more clearly observed for this joint classification and segmentation task, see Table 2. On the first part of the table, the Three-Stream Net is able to reach of accuracy against for the Twin model. Frame-wise, this represents a precision of and a recall of with regard to the negative class, leading to an F-score of . This means the model is still more likely to classify as a stroke some frames belonging to the negative class. However, this score is also biased by the frame-wise approach of the evaluation. This is why the second part of the table is of greater interest: the addition of the third branch boosts the performance up to % compared to the model without pose information. The attention mechanism performs slightly lower, which might be overcome with a longer training phase as observed earlier. Overall, better performances are noticed for all models when not considering the negative samples, which can be more challenging to classify. This may be due to the varied and unconventional gestures made when a player attempts to catch a lost ball, leading to features similar to those of a stroke and classified as such.
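For reference, the frame-wise precision, recall and F-score with regard to one class can be computed as in the sketch below, which uses the usual definitions (the F-score being the harmonic mean of precision and recall). The function name and the toy labels in the usage note are illustrative only and do not reproduce any result of the experiments.

```python
def frame_scores(y_true, y_pred, label):
    """Frame-wise precision, recall and F-score for one class.

    y_true, y_pred -- per-frame labels (ground truth and prediction)
    label          -- the class of interest, e.g. the negative class
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

With ground truth `[0, 0, 0, 1, 1]` and predictions `[0, 0, 1, 1, 0]`, class 0 has two true positives, one false positive and one false negative, so precision, recall and F-score are all 2/3.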
3.4. Perspectives for Movement Analysis
Pose information may also be very interesting to assess the player’s performance and the efficiency of his/her gesture. The organization of the skeleton joints during a movement can be compared with a baseline to give an appreciation of the stroke performed (Morel et al., 2017). Richer feedback could also be given by adding depth information, computed from a single image, in order to create a 3D model of the stroke. Such a representation is drawn in Figure 4. TTStroke-21 does not offer for the moment any qualitative annotations for strokes. Such quality assessment needs to be built by experts in the field and may be one perspective for this dataset in order to widen its application.
We have proposed a three-stream network with different kinds of convolutions and input data: raw pixel values and optical flow undergo 3D (2D + time) convolutions, while the pose vectors are submitted to the branch with temporal convolutions. Pose information, fused with the RGB and optical flow branches, yields much better performances (up to 18%) in the joint classification and segmentation task. Further analysis may be conducted on this task by developing other evaluation methods that are not frame-wise.
Improvements can be achieved by developing a better pose estimator that considers temporal information in order to avoid misdetected poses/joints and to obtain more precise skeletal joint coordinates. Furthermore, pose information coupled with other technologies may be one step forward to gesture analysis in order to assess athletes' performance.
References
- Detection of tennis events from acoustic data. In Proceedings of the 2nd International Workshop on Multimedia Content Analysis in Sports, MMSports@MM 2019, Nice, France, October 25, 2019, R. Lienhart, T. B. Moeslund, and H. Saito (Eds.), pp. 91–99.
- Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, pp. 4724–4733.
- Activities of daily living monitoring via a wearable camera: toward real-world applications. IEEE Access 8, pp. 77344–77363.
- PoTion: pose motion representation for action recognition. In CVPR, pp. 7024–7033.
- Toyota smarthome: real-world activities of daily living. In ICCV, pp. 833–842.
- Tennis stroke classification: comparing wrist and racket as IMU sensor position. In MoMM, pp. 74–83.
- Activity-conditioned continuous human pose estimation for performance analysis of athletes using the example of swimming. In WACV, pp. 446–455.
- Pose-projected action recognition hourglass network (PARHN) in soccer. In CRV, pp. 201–208.
- The ava-kinetics localized human actions video dataset. CoRR abs/2005.00214.
- RESOUND: towards action recognition without representation bias. In ECCV (6), Lecture Notes in Computer Science, Vol. 11210, pp. 520–535.
- Beyond pixels: exploring new representations and applications for motion analysis. Ph.D. Thesis, Massachusetts Institute of Technology.
- Global context-aware attention LSTM networks for 3D action recognition. In CVPR, pp. 3671–3680.
- Table tennis stroke recognition based on body sensor network. In IDCS, Lecture Notes in Computer Science, Vol. 11874, pp. 1–10.
- SGDR: stochastic gradient descent with warm restarts. In ICLR (Poster).
- 2D/3D pose estimation and action recognition using multitask deep learning. In CVPR, pp. 5137–5146.
- Optimal choice of motion estimation methods for fine-grained action classification with 3D convolutional networks. In ICIP, pp. 554–558.
- Fine grained sport action recognition with twin spatio-temporal convolutional neural networks. Multim. Tools Appl. 79 (27-28), pp. 20429–20447.
- 3D attention mechanism for fine-grained classification of table tennis strokes using a twin spatio-temporal convolutional neural networks. In ICPR.
- Graphical models for social behavior modeling in face-to face interaction. Pattern Recognit. Lett. 74, pp. 82–89.
- Automatic evaluation of sports motion: A generic computation of spatial and temporal errors. Image Vis. Comput. 64, pp. 67–78.
- PersonLab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In ECCV (14), Lecture Notes in Computer Science, Vol. 11218, pp. 282–299.
- SharpNet: fast and accurate recovery of occluding contours in monocular depth estimation. In ICCV Workshops, pp. 2109–2118.
- LCR-net: localization-classification-regression for human pose. In CVPR, pp. 1216–1224.
- LCR-net++: multi-person 2D and 3D pose detection in natural images. IEEE Trans. Pattern Anal. Mach. Intell. 42 (5), pp. 1146–1161.
- FineGym: A hierarchical video dataset for fine-grained action understanding. In CVPR, pp. 2613–2622.
- Prediction of future shot direction using pose and position of tennis player. In Proceedings of the 2nd International Workshop on Multimedia Content Analysis in Sports, MMSports@MM 2019, Nice, France, October 25, 2019, R. Lienhart, T. B. Moeslund, and H. Saito (Eds.), pp. 59–66.
- A multi-stream bi-directional recurrent neural network for fine-grained action detection. In CVPR, pp. 1961–1970.
- A short note on the kinetics-700-2020 human action dataset. CoRR abs/2010.10864.
- Online localization and prediction of actions and interactions. IEEE Trans. Pattern Anal. Mach. Intell. 41 (2), pp. 459–472.
- UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR abs/1212.0402.
- TTNet: real-time temporal and spatial video analysis of table tennis. pp. 3866–3874.
- Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 38 (8), pp. 1583–1597.
- FuturePong: real-time table tennis trajectory forecasting using pose prediction network. In CHI Extended Abstracts, pp. 1–8.
- Racquet sports recognition using a hybrid clustering model learned from integrated wearable sensor. Sensors 20 (6), pp. 1638.
- PA3D: pose-action 3D machine for video recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 7922–7931.
- Wide residual networks. In BMVC.
- Towards understanding the adversarial vulnerability of skeleton-based action recognition. CoRR abs/2005.07151.
- Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognit. Lett. 27 (7), pp. 773–780.