Group-aware Contrastive Regression for Action Quality Assessment

by   Xumin Yu, et al.
Tsinghua University

Assessing action quality is challenging due to the subtle differences between videos and large variations in scores. Most existing approaches tackle this problem by regressing a quality score from a single video, suffering a lot from the large inter-video score variations. In this paper, we show that the relations among videos can provide important clues for more accurate action quality assessment during both training and inference. Specifically, we reformulate the problem of action quality assessment as regressing the relative scores with reference to another video that has shared attributes (e.g., category and difficulty), instead of learning unreferenced scores. Following this formulation, we propose a new Contrastive Regression (CoRe) framework to learn the relative scores by pair-wise comparison, which highlights the differences between videos and guides the models to learn the key hints for assessment. In order to further exploit the relative information between two videos, we devise a group-aware regression tree to convert the conventional score regression into two easier sub-problems: coarse-to-fine classification and regression in small intervals. To demonstrate the effectiveness of CoRe, we conduct extensive experiments on three mainstream AQA datasets including AQA-7, MTL-AQA and JIGSAWS. Our approach outperforms previous methods by a large margin and establishes new state-of-the-art on all three benchmarks.



Code Repositories


[ICCV 2021] Group-aware Contrastive Regression for Action Quality Assessment


1 Introduction

Action quality assessment (AQA), which aims to evaluate how well a specific action is performed, has attracted growing attention in recent years since it plays a crucial role in many real world applications including sports [firstaqa, DBLP:conf/wacv/ParmarM19, DBLP:conf/icvs/JugPDK03, asdasfaf, DBLP:conf/eccv/PirsiavashVT14, DBLP:journals/corr/abs-1904-04346, DBLP:conf/bmvc/VenkataramanVT15, DBLP:conf/cvpr/ParmarM17], healthcare [DBLP:conf/ipcai/MalpaniVCH14, DBLP:journals/pami/ZhangL15, Sharma2014VideoBA, 10.1145/2072545.2072550, DBLP:conf/miccai/ZiaSBSCE15, DBLP:journals/cars/ZiaSBSE18] and others [DBLP:journals/corr/DoughtyDM17, DBLP:conf/cvpr/DoughtyMD19]. Unlike conventional action recognition tasks that focus on action classification [DBLP:conf/icml/JiXYY10, DBLP:conf/eccv/WangXW0LTG16, DBLP:conf/cvpr/Wang0T15, DBLP:conf/nips/SimonyanZ14, DBLP:conf/mmm/LiCHYCX19, DBLP:conf/iccv/Feichtenhofer0M19, DBLP:conf/cvpr/0004GGH18] and detection  [DBLP:conf/iccv/ZhaoXWWTL17, DBLP:journals/corr/abs-1907-09702, DBLP:conf/cvpr/TangDRZZZL019, DBLP:conf/cvpr/YeungRMF16, DBLP:journals/corr/MontesSN16], AQA is more challenging as it requires the model to predict fine-grained scores from videos that describe the same action. Considering the differences between videos and large variations in scores, we argue that a key to addressing this problem is to discover the differences among the videos and predict scores based on the differences.

Figure 1: Our Contrastive Regression (CoRe) framework for action quality assessment. Inspired by contrastive learning, which learns representations by encouraging the distances between samples to reflect their semantic relationship, we learn an AQA model that regresses relative scores reflecting the differences in action quality among videos. By comparing two videos with different scores, CoRe encourages the model to learn from the differences between videos for assessment.

Many efforts have been made to tackle this problem over the past few years [DBLP:conf/iccv/JiaHuiaction, DBLP:journals/corr/abs-1904-04346, DBLP:conf/cvpr/DoughtyMD19, score_figure_skating, DBLP:conf/wacv/ParmarM19]. Most of them formulate AQA as a regression problem, where the scores are directly predicted from a single video. While some promising results have been achieved, AQA still faces three challenges. First, since the score labels are usually annotated by human judges (e.g., the score of a dive is calculated by summing the scores from different judges and then multiplying by the degree of difficulty), the subjective appraisal of judges makes accurate score prediction quite difficult. Second, the differences between videos for AQA are very subtle, since actors usually perform the same action in a similar environment. Last, most current models are evaluated based on Spearman’s rank correlation, which may not faithfully reflect the prediction performance (see our discussions in Section 4.1).

Towards a better AQA framework that can utilize the differences among videos to predict the final rating, we borrow from the concept of contrastive learning [he2020momentum, chen2020simple]. Contrastive learning (Figure 1, top-left) aims to learn a better representation space where the distance between two similar samples is enforced to be small while the distance between dissimilar ones is encouraged to be large. Therefore, the distance in the representation space can already reflect the semantic relationship between two samples (e.g., whether they are from the same category). Analogously, in the context of AQA, we aim to learn a model that maps the input video into the score space, where the differences between action qualities can be measured by relative scores. Motivated by this, we propose a Contrastive Regression (CoRe) framework for the AQA task. Unlike previous works that aim to predict the scores directly, we propose to regress the relative scores between an input video and several exemplar videos used as references.

Moreover, as a step towards more accurate score prediction, we devise a group-aware regression tree (GART) to convert the relative score regression into two easier sub-problems: (1) coarse-to-fine classification. We first divide the range of the relative score into several non-overlapping intervals (i.e., groups) and then use a binary tree to allocate the relative score to a certain group by performing classification progressively; (2) regression in a small interval. We perform regression inside the group where the relative score lies and predict the final score. As another contribution, we design a new metric, called relative L2-distance (R-$\ell_2$), to more precisely measure the performance of action quality assessment by considering the intra-class variance.

To verify the effectiveness of our method, we conduct extensive experiments on three mainstream AQA datasets containing both Olympic and surgical actions, namely AQA-7 [DBLP:conf/wacv/ParmarM19], MTL-AQA [DBLP:journals/corr/abs-1904-04346] and JIGSAWS [gao2014jhu]. Experimental results demonstrate that our method largely outperforms the state of the art on the three benchmarks under Spearman’s rank correlation (81.0% to 84.0% on AQA-7, 92.7% to 95.1% on MTL-AQA and 70% to 85% on JIGSAWS) and the newly proposed R-$\ell_2$ metric, which clearly shows the advantages of our proposed contrastive regression framework.

2 Related Work

Figure 2: The pipeline of our proposed group-aware contrastive regression method. We first sample an exemplar video for each input video according to the category and degree of difficulty of the action. We then feed the video pair into a shared I3D backbone to extract spatio-temporal features and combine these two features with the reference score of the exemplar video. Finally, we pass the combined feature to the group-aware regression tree and obtain the score difference between the two videos. During inference, the final score can be computed by averaging the results from multiple different exemplars.

The past few years have witnessed the rapid development of AQA. Mainstream AQA methods formulate AQA as a regression task based on reliable score labels given by expert judges. For example, Gordan et al. [firstaqa] propose to use the trajectory of the skeleton to solve the problem of gymnastic vault action quality assessment in their pioneering work. Pirsiavash et al. [DBLP:conf/eccv/PirsiavashVT14] use DCT to encode body pose as input features. SVR is also used to build the mapping from the features to the final score. Thanks to the great success of deep learning in action recognition tasks, Parmar et al. [DBLP:conf/cvpr/ParmarM17] show that the spatio-temporal features from C3D [DBLP:conf/iccv/TranBFTP15] can better encode the video data and significantly improve the performance. They also propose a large-scale AQA dataset and explore all-action models to further enhance the scoring performance. Following [DBLP:conf/cvpr/ParmarM17], Xu et al. [score_figure_skating] propose a model containing two LSTMs to learn the multi-scale features of videos. Pan et al. [DBLP:conf/iccv/JiaHuiaction] propose to use spatial and temporal relation graphs to model the interaction among the joints. In addition, they also propose to use I3D [DBLP:conf/cvpr/CarreiraZ17] as a stronger backbone network to extract spatio-temporal features. Parmar et al. [DBLP:journals/corr/abs-1904-04346] propose a larger AQA dataset with more annotations for various tasks. The idea of multi-task learning is also introduced to improve the model capacity for AQA. Recently, Tang et al. [musdl] propose uncertainty-aware score distribution learning (USDL) to reduce the underlying ambiguity of the action score labels from human judges.

Different from this line of work, several methods [DBLP:journals/pami/ZhangL15, DBLP:journals/corr/DoughtyDM17, DBLP:conf/cvpr/DoughtyMD19, DBLP:conf/iccv/BertasiusPYS17a] formulate AQA as a pair-wise ranking task. However, they mainly focus on longer and more ambiguous tasks and only predict an overall ranking, which might limit the application of AQA where quantitative comparisons are required. In this work, we present a new contrastive regression framework to simultaneously rank videos and predict accurate scores, which distinguishes our method from previous works.

3 Approach

The overall framework of our method is illustrated in Figure 2. We will describe our method in detail as follows.

3.1 Contrastive Regression

Problem Formulation.

Most existing works [DBLP:conf/iccv/JiaHuiaction, DBLP:journals/corr/abs-1904-04346, DBLP:conf/cvpr/DoughtyMD19, score_figure_skating, DBLP:conf/wacv/ParmarM19, musdl] formulate AQA as a regression task, where the input is a video containing the target action and the output is the predicted quality score of the action. Note that in some AQA tasks (e.g., diving), each video is associated with a degree of difficulty, which is a known constant. The final score is the product of the action quality score (i.e., raw score) and the degree of difficulty. Since the degree of difficulty is already known, we only need to predict the action quality score following [musdl]. Formally, given the input video $v$ with action quality label $s$, the regression problem is to predict the action quality $\hat{s}$ based on the input video:

$$\hat{s} = R_\Theta(F_W(v)), \tag{1}$$

where $R_\Theta$ and $F_W$ are the regressor model and the feature extractor parameterized by $\Theta$ and $W$, respectively. The regression problem is usually solved by minimizing the mean-square error between the predicted score and the ground-truth score:

$$L(\Theta, W \mid v) = \mathrm{MSE}(\hat{s}, s), \tag{2}$$

where $\Theta$ and $W$ are the parameters of the regression model and the feature extractor.

However, since the action videos are usually captured in similar environments (e.g., diving competitions often take place in aquatics centers), it is difficult for the model to learn the diverse scores based on videos with subtle differences. To this end, we propose to reformulate the problem as regressing the relative score between the input and an exemplar. Let $v_m$ denote the input video and $v_n$ denote the exemplar video with score label $s_n$. The regression problem can be rewritten as:

$$\hat{s}_m = R_\Theta(F_W(v_m), F_W(v_n)) + s_n. \tag{3}$$

This formulation can also be viewed as a form of residual learning [he2016deep], where we aim to regress the difference of the scores between the input video and a reference video.

Exemplar-Based Score Regression.

We now describe how to implement the CoRe framework for the AQA problem. Since we aim to regress the relative score between the input and the exemplar, how to select the exemplar becomes critical. To make the input and the exemplar comparable, we select as the exemplar a video that shares certain attributes (e.g., category and degree of difficulty) with the input video. Formally, given an input video $v_m$ and the corresponding exemplar $v_n$, we first use an I3D [DBLP:conf/cvpr/CarreiraZ17] to extract the features $\{f_n, f_m\}$ following [musdl, DBLP:journals/corr/abs-1904-04346], and then aggregate them with the score of the exemplar $s_n$:

$$f_{(n,m)} = \mathrm{concat}([f_n, f_m, s_n/\epsilon]), \tag{4}$$

where $\epsilon$ is a normalizing constant to make sure $s_n/\epsilon \in [0, 1]$. We then predict the score difference of the pair through a regressor $R_\Theta$ as $\Delta s = R_\Theta(f_{(n,m)})$.
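The exemplar-based regression step can be illustrated with a minimal sketch. It assumes the two feature vectors are simply concatenated with the normalized exemplar score, and it uses a hypothetical `toy_regressor` in place of the learned regressor; the function names and the toy values are illustrative only.

```python
from typing import Callable, List, Sequence


def aggregate_pair(f_exemplar: Sequence[float], f_input: Sequence[float],
                   s_exemplar: float, epsilon: float) -> List[float]:
    """Concatenate exemplar features, input features and the normalized exemplar score."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return list(f_exemplar) + list(f_input) + [s_exemplar / epsilon]


def predict_score(regressor: Callable[[List[float]], float],
                  f_exemplar: Sequence[float], f_input: Sequence[float],
                  s_exemplar: float, epsilon: float) -> float:
    """Predicted score = exemplar score + regressed score difference."""
    delta = regressor(aggregate_pair(f_exemplar, f_input, s_exemplar, epsilon))
    return s_exemplar + delta


def toy_regressor(pair_feature: List[float]) -> float:
    """Hypothetical stand-in for the learned regressor: mean(input) - mean(exemplar)."""
    half = (len(pair_feature) - 1) // 2
    exemplar, query = pair_feature[:half], pair_feature[half:2 * half]
    return sum(query) / half - sum(exemplar) / half


# Toy usage: the exemplar scored 50.0 and the toy regressor predicts a difference of +2.0.
score = predict_score(toy_regressor, [1.0, 2.0], [3.0, 4.0],
                      s_exemplar=50.0, epsilon=100.0)
```

The point of the sketch is only the data flow: the regressor sees both videos plus the reference score, and its output is added back to the exemplar score.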

3.2 Group-Aware Regression Tree

Figure 3: The architecture of the proposed group-aware regression tree. Given the video features and the reference score, the regression tree determines the score difference in a coarse-to-fine manner, where a sequence of binary classification tasks is performed at first (purple nodes) and the regression modules in the leaf layer then give the final prediction (white nodes).

Although the contrastive regression framework can predict the relative score $\Delta s$, $\Delta s$ usually takes values from a wide range (e.g., in diving). Therefore, predicting $\Delta s$ directly is still difficult. To this end, we devise a group-aware regression tree (GART) to solve the problem in a divide-and-conquer manner. Specifically, we first divide the range of $\Delta s$ into $2^d$ non-overlapping intervals (namely “groups”). We then construct a binary regression tree of depth $d$, whose leaves represent the $2^d$ groups, as illustrated in Figure 3. The decision process of the group-aware regression tree is coarse-to-fine: in the first layer, we determine whether the input video is better or worse than the exemplar video; in the following layers, we gradually make a more accurate prediction about how much better or worse the input video is than the exemplar. Once we reach the leaf nodes, we know which group the input video belongs to and can then perform regression in the corresponding small interval.

Tree Architecture. We adopt the binary tree architecture to perform the regression task. To begin with, we apply an MLP to the combined pair feature $f_{(n,m)}$ and use the output as an initialization of the root node feature. We then perform the regression in a top-down manner. Each node takes the output feature from its parent node as input and produces the binary probability together with the updated feature. The probability of each leaf node can be computed by multiplying all the probabilities along the path to the root. We use the Sigmoid to map the output of each leaf node to $[0, 1]$, which is the predicted score difference w.r.t. the corresponding group.
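How leaf probabilities arise from per-node binary decisions can be sketched as follows. The parametrization is an assumption made for illustration: each internal node is represented only by its probability of routing to the right child, and the node features and MLPs of the actual model are omitted.

```python
from typing import List


def leaf_probabilities(right_probs: List[List[float]]) -> List[float]:
    """Multiply branch probabilities along every root-to-leaf path.

    right_probs[k][j] is the (assumed) probability that internal node j in
    layer k routes to its right child; layer k must contain 2**k nodes.
    Returns the probabilities of the 2**len(right_probs) leaves, left to right.
    """
    probs = [1.0]  # probability of reaching the root
    for k, layer in enumerate(right_probs):
        if len(layer) != 2 ** k:
            raise ValueError(f"layer {k} must have {2 ** k} nodes")
        next_probs = []
        for p_node, p_right in zip(probs, layer):
            next_probs.append(p_node * (1.0 - p_right))  # left child
            next_probs.append(p_node * p_right)          # right child
        probs = next_probs
    return probs


# Depth-2 toy tree: the root leans right, i.e. "input is better than exemplar".
leaves = leaf_probabilities([[0.8], [0.3, 0.6]])
```

Because each node splits its incoming probability between two children, the leaf probabilities always sum to one.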

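We then describe our partition strategy to define the boundary of each group. First, we collect the list of score differences of all possible training video pairs $\delta = [\delta_1, \dots, \delta_T]$. Then, we sort the list in ascending order to obtain $\delta^* = [\delta^*_1, \dots, \delta^*_T]$. Given the group number $R$, the partitioning algorithm gives the bounds of each interval $I^r = (\zeta^r_{left}, \zeta^r_{right})$ as:

$$\zeta^r_{left} = \delta^*\!\left[\left\lfloor (T-1) \cdot \frac{r-1}{R} \right\rfloor\right], \qquad \zeta^r_{right} = \delta^*\!\left[\left\lfloor (T-1) \cdot \frac{r}{R} \right\rfloor\right], \tag{5}$$

where we use $\delta^*[i]$ to represent the $i$-th element of $\delta^*$, counting from zero. It is worth noting that the partition strategy is non-trivial. If we simply divide the whole range uniformly into multiple groups, the numbers of training pairs whose score differences fall in each group may be unbalanced (see Figure 4 for details).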
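The balanced partition can be sketched as equal-frequency binning over the sorted score differences. The exact index rule used below (zero-based position `(T - 1) * r // R` in the sorted list) is an assumption chosen so that every group receives roughly the same number of training pairs.

```python
from typing import List, Sequence, Tuple


def group_bounds(deltas: Sequence[float], num_groups: int) -> List[Tuple[float, float]]:
    """Return (left, right) bounds of each group so that groups are equally populated."""
    if num_groups < 1 or len(deltas) < 2:
        raise ValueError("need at least one group and two score differences")
    ordered = sorted(deltas)
    last = len(ordered) - 1
    bounds = []
    for r in range(1, num_groups + 1):
        left = ordered[last * (r - 1) // num_groups]   # lower quantile of group r
        right = ordered[last * r // num_groups]        # upper quantile of group r
        bounds.append((left, right))
    return bounds


# Toy usage: nine score differences split into four groups.
toy_bounds = group_bounds([3, -4, 0, 1, -2, 4, -1, 2, -3], num_groups=4)
```

With skewed data, the intervals near zero become narrow and the tail intervals wide, which is the behaviour contrasted with uniform partitioning in Figure 4.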

Figure 4: The distribution of the differences of scores in the training set under different partition strategy. (a) Uniform partition. We can observe a large variation of frequency among different groups. (b) The proposed grouping strategy in Equation (5). The training pairs belonging to each group are balanced.

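Optimization. We train the regression tree by imposing a classification task on the leaf probabilities and a regression task on the ground-truth interval. Specifically, when the ground-truth score difference $\delta$ of the input pair lies in the $i$-th group, i.e., $\delta \in [\zeta^i_{left}, \zeta^i_{right}]$ with $\zeta^i_{left}$ and $\zeta^i_{right}$ the bounds of that group, the one-hot classification label $\{l_r\}$ is defined by assigning 1 to the $i$-th leaf node and the regression label is set as $\sigma_i = \frac{\delta - \zeta^i_{left}}{\zeta^i_{right} - \zeta^i_{left}}$.

For each video pair in the training data with classification label $\{l_r\}$ and regression label $\{\sigma_r\}$, the objective functions for the classification task and the regression task can be written as:

$$J_{cls} = -\sum_{r=1}^{R} \big( l_r \log P_r + (1 - l_r) \log(1 - P_r) \big),$$

$$J_{reg} = \sum_{r=1}^{R} l_r \, (\hat{\sigma}_r - \sigma_r)^2,$$

where $\{P_r\}$ and $\{\hat{\sigma}_r\}$ are the predicted leaf probabilities and regression results, and $R$ is the number of groups. The final objective function for the video pair is:

$$J = J_{cls} + J_{reg}. \tag{6}$$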
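The label construction and the two loss terms can be sketched as below. This is an illustrative reading, not the released training code: it assumes a binary cross-entropy over the leaf probabilities and a squared error on the leaf of the ground-truth group only, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def make_labels(delta: float, bounds: Sequence[Tuple[float, float]]) -> Tuple[List[int], float]:
    """Return the one-hot group label and the normalized in-group regression target."""
    for i, (left, right) in enumerate(bounds):
        is_last = i == len(bounds) - 1
        if left <= delta < right or (is_last and delta == right):
            one_hot = [1 if j == i else 0 for j in range(len(bounds))]
            width = right - left
            sigma = (delta - left) / width if width > 0 else 0.0
            return one_hot, sigma
    raise ValueError("delta lies outside every group")


def pair_loss(leaf_probs: Sequence[float], leaf_regs: Sequence[float],
              one_hot: Sequence[int], sigma: float, eps: float = 1e-12) -> float:
    """Classification loss over all leaves plus regression loss on the true leaf."""
    cls = -sum(l * math.log(max(p, eps)) + (1 - l) * math.log(max(1.0 - p, eps))
               for l, p in zip(one_hot, leaf_probs))
    true_leaf = list(one_hot).index(1)
    reg = (leaf_regs[true_leaf] - sigma) ** 2
    return cls + reg


# Toy usage with four groups and a ground-truth score difference of 1.0.
toy_bounds = [(-4, -2), (-2, 0), (0, 2), (2, 4)]
label, target = make_labels(1.0, toy_bounds)
loss_uniform = pair_loss([0.25] * 4, [0.5] * 4, label, target)
```

A score difference of 1.0 falls halfway through the third group, so the regression target is 0.5 and only that leaf contributes to the regression term.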


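Inference. The overall regression process of the proposed group-aware regression tree can be written as:

$$\hat{s}_m = R_\Theta(f_{(n,m)}) + s_n = \hat{\sigma}_{r^*} \, (\zeta^{r^*}_{right} - \zeta^{r^*}_{left}) + \zeta^{r^*}_{left} + s_n, \tag{7}$$

where $r^*$ is the group with the highest probability, $\hat{\sigma}_{r^*}$ is the regression output of the corresponding leaf, and $\zeta^{r^*}_{left}$, $\zeta^{r^*}_{right}$ are the bounds of that group. In our implementation, we also adopt a multi-exemplar voting strategy. Given an input video $v_{test}$, we select $M$ exemplars $\{v^m_{train}\}_{m=1}^{M}$ from the training data, whose scores are $\{s^m_{train}\}_{m=1}^{M}$, to construct $M$ pairs. The process of multi-exemplar voting can be summarized as:

$$\hat{s}^m_{test} = R_\Theta(F_W(v_{test}), F_W(v^m_{train})) + s^m_{train}, \quad m = 1, \dots, M,$$

$$\hat{s}_{test} = \frac{1}{M} \sum_{m=1}^{M} \hat{s}^m_{test}.$$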
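Decoding a relative score from the tree outputs and averaging over exemplars can be sketched as follows. The function names and the toy numbers are illustrative assumptions; the sketch only mirrors the two steps described above (pick the most probable group, rescale its regression output to the group interval, then average the per-exemplar predictions).

```python
from typing import Sequence, Tuple


def decode_delta(leaf_probs: Sequence[float], leaf_regs: Sequence[float],
                 bounds: Sequence[Tuple[float, float]]) -> float:
    """Map the regression output of the most probable leaf back to a score difference."""
    best = max(range(len(leaf_probs)), key=lambda r: leaf_probs[r])
    left, right = bounds[best]
    return leaf_regs[best] * (right - left) + left


def vote(deltas: Sequence[float], exemplar_scores: Sequence[float]) -> float:
    """Average the absolute scores predicted from several exemplars."""
    if not deltas or len(deltas) != len(exemplar_scores):
        raise ValueError("need one score difference per exemplar")
    return sum(d + s for d, s in zip(deltas, exemplar_scores)) / len(deltas)


# Toy usage: the third group (0, 2) wins, and its leaf output 0.25 maps to +0.5.
toy_bounds = [(-4, -2), (-2, 0), (0, 2), (2, 4)]
delta = decode_delta([0.1, 0.2, 0.6, 0.1], [0.5, 0.5, 0.25, 0.5], toy_bounds)
final = vote([delta, -1.5], [80.0, 82.0])
```

Averaging over exemplars trades extra forward passes for lower variance in the final score.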


4 Experiments

Sp. Corr.                                 Diving  Gym Vault  BigSki.  BigSnow.  Sync. 3m  Sync. 10m  Avg. Corr.       Year
Pose+DCT [DBLP:conf/eccv/PirsiavashVT14]  0.5300  0.1000     -        -         -         -          -                2014
ST-GCN [DBLP:conf/aaai/YanXL18]           0.3286  0.5770     0.1681   0.1234    0.6600    0.6483     0.4433           2018
C3D-LSTM [DBLP:conf/cvpr/ParmarM17]       0.6047  0.5636     0.4593   0.5029    0.7912    0.6927     0.6165           2017
C3D-SVR [DBLP:conf/cvpr/ParmarM17]        0.7902  0.6824     0.5209   0.4006    0.5937    0.9120     0.6937           2017
JRG [DBLP:conf/iccv/JiaHuiaction]         0.7630  0.7358     0.6006   0.5405    0.9013    0.9254     0.7849           2019
I3D+MLP [musdl]                           0.7438  0.7342     0.5190   0.5103    0.8915    0.8703     0.7472           2020
USDL [musdl]                              0.8099  0.7570     0.6538   0.7109    0.9166    0.8878     0.8102           2020
I3D + MLP                                 0.8685  0.6939     0.5391   0.5180    0.8782    0.8486     0.7601
CoRe + GART                               0.8824  0.7746     0.7115   0.6624    0.9442    0.9078     0.8401

R-$\ell_2$ (×100)                         Diving  Gym Vault  BigSki.  BigSnow.  Sync. 3m  Sync. 10m  Avg. R-$\ell_2$  Year
C3D-SVR [DBLP:conf/cvpr/ParmarM17]        1.53    3.12       6.79     7.03      17.84     4.83       6.86             2017
USDL [musdl]                              0.79    2.09       4.82     4.94      0.65      2.14       2.57             2020
I3D + MLP                                 0.81    2.54       6.06     5.31      1.41      3.08       3.20
CoRe + GART                               0.64    1.78       3.67     3.87      0.41      2.35       2.12

Table 1: Comparisons of Spearman’s correlation and R-$\ell_2$ distance on the AQA-7 dataset. Rows without a citation are our implementation.

4.1 Datasets and Experiment Settings

Datasets. We perform experiments on three widely used AQA benchmarks including AQA-7 [DBLP:conf/wacv/ParmarM19], MTL-AQA [DBLP:journals/corr/abs-1904-04346] and JIGSAWS [gao2014jhu]. For more details about the datasets, please refer to the Supplementary.

Evaluation Protocols. In order to compare with the previous work [DBLP:conf/iccv/JiaHuiaction, DBLP:conf/wacv/ParmarM19, DBLP:journals/corr/abs-1904-04346, musdl] in AQA, we adopt Spearman’s rank correlation as an evaluation metric. Spearman’s correlation is defined as:

$$\rho = \frac{\sum_i (p_i - \bar{p})(q_i - \bar{q})}{\sqrt{\sum_i (p_i - \bar{p})^2 \sum_i (q_i - \bar{q})^2}},$$

where $p$ and $q$ represent the rankings of the samples in the two series, respectively. We also follow the previous work and use Fisher’s z-value [DBLP:conf/wacv/ParmarM19] when measuring the average performance across actions.

We also propose a stricter metric to measure the performance of AQA models more precisely, called relative L2-distance (R-$\ell_2$). Given the highest and lowest scores for an action, $s_{max}$ and $s_{min}$, R-$\ell_2$ is defined as:

$$R\text{-}\ell_2 = \frac{1}{K} \sum_{k=1}^{K} \left( \frac{|s_k - \hat{s}_k|}{s_{max} - s_{min}} \right)^2,$$

where $s_k$ and $\hat{s}_k$ represent the ground-truth score and the prediction for the $k$-th sample, respectively, and $K$ is the number of samples. We use R-$\ell_2$ instead of the traditional L2-distance because different actions have different scoring intervals, so comparing and averaging raw distances across different classes of actions is not meaningful. Our proposed R-$\ell_2$ is different from Spearman’s correlation: Spearman’s correlation focuses on the ranks of the predicted scores, while R-$\ell_2$ focuses on their numerical values.
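Both metrics are easy to compute by hand, as the following pure-Python sketch shows. It assumes average ranks for ties in Spearman’s correlation and returns the unscaled relative L2-distance (the tables report it multiplied by 100); the function names are illustrative.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Pearson correlation between the rank series of predictions and labels."""
    p, q = ranks(pred), ranks(truth)
    mean_p, mean_q = sum(p) / len(p), sum(q) / len(q)
    num = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    den = math.sqrt(sum((a - mean_p) ** 2 for a in p) * sum((b - mean_q) ** 2 for b in q))
    if den == 0:
        raise ValueError("correlation is undefined for a constant series")
    return num / den


def relative_l2(pred: Sequence[float], truth: Sequence[float],
                s_min: float, s_max: float) -> float:
    """Mean squared error after normalizing by the score range of the action."""
    span = s_max - s_min
    return sum((abs(t - p) / span) ** 2 for p, t in zip(pred, truth)) / len(pred)


# Toy usage: perfectly ordered predictions that are still numerically off.
rho = spearman([10.0, 20.0, 35.0], [60.0, 70.0, 80.0])
dist = relative_l2([90.0, 50.0], [100.0, 50.0], s_min=0.0, s_max=100.0)
```

The toy call illustrates the motivation for the second metric: predictions far from the labels can still reach a rank correlation of 1.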

Implementation Details: We adopt the I3D model pre-trained on the Kinetics [DBLP:conf/cvpr/CarreiraZ17] dataset as the feature extractor. For all the experiments, we set the depth of GART to 5 and the node feature dimension to 256. The initial learning rate is 1e-3 for the regression tree and 1e-4 for the I3D backbone. We use the Adam [Kingma2014Adam] optimizer, and weight decay is set to zero. We select 10 exemplars for an input test video during inference and vote for the final score using the multi-exemplar voting strategy. In experiments on AQA-7 and MTL-AQA, we follow [musdl, DBLP:conf/iccv/JiaHuiaction, DBLP:conf/wacv/ParmarM19, DBLP:journals/corr/abs-1904-04346] to extract 103 frames for each video clip and segment them into 10 overlapping snippets, each containing 16 consecutive frames. In JIGSAWS, we follow [musdl] to evenly sample 160 frames to form 10 non-overlapping 16-frame snippets. In AQA-7 and JIGSAWS, we select the exemplar video only according to the coarse category of the video. For example, if the input video is from single diving-10m platform in AQA-7, we randomly select an exemplar video from the training set of single diving-10m platform in AQA-7. In the MTL-AQA dataset, since there are annotations of the degree of difficulty (DD) for diving sports, we select the exemplar based on both the category and the degree of difficulty. Note that this implementation is consistent with the real-world scenario, since DD is known to all judges before the action is completed.
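The snippet layout can be made concrete with a small sketch. The text only fixes the counts (103 frames into 10 overlapping 16-frame snippets, 160 frames into 10 non-overlapping ones); placing the snippet starts at evenly spaced, rounded positions is an assumption made here for illustration.

```python
from typing import List


def snippet_starts(num_frames: int, num_snippets: int, snippet_len: int) -> List[int]:
    """Evenly spaced start indices so that the last snippet ends at the last frame."""
    last_start = num_frames - snippet_len
    if last_start < 0 or num_snippets < 1:
        raise ValueError("video is shorter than one snippet")
    if num_snippets == 1:
        return [0]
    return [round(i * last_start / (num_snippets - 1)) for i in range(num_snippets)]


# 103 frames: consecutive starts are fewer than 16 frames apart, so snippets overlap.
starts_overlap = snippet_starts(103, 10, 16)
# 160 frames: starts are exactly 16 frames apart, so snippets do not overlap.
starts_disjoint = snippet_starts(160, 10, 16)
```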

We report the performance of the following methods, including the baseline and different versions of our method. Each method has two variants: one that uses DD and one that uses DD in neither training nor testing.

  • I3D + MLP (Baseline): Most existing works adopt this strategy. We use I3D [DBLP:conf/cvpr/CarreiraZ17] to encode a single input video and predict the score from the feature with a 3-layer MLP. The MSE loss between the prediction and the ground truth is used to optimize the model.

  • CoRe + MLP: We reformulate the regression problem as described in Section 3.1. We choose exemplar videos from the training set to construct the video pairs and also use the MSE loss for optimization.

  • I3D + GART: We replace the regression sub-network (MLP) in the baseline method with our group-aware regression tree. We use the loss defined in Equation (6).

  • CoRe + GART: The proposed method in Section 3.

Note that we did not evaluate some of them on the AQA-7 and JIGSAWS datasets due to the absence of degree of difficulty annotations.

4.2 Results on AQA-7 dataset

Figure 5: Effects of the depth of regression tree (a) and the number of exemplars for voting (b).

The experimental results of our method and other AQA approaches on AQA-7 are shown in Table 1. The state-of-the-art method USDL [musdl] uses a Gaussian distribution to create a soft distribution label for each video, which can reduce the subjective factor from human judges on the original labels. We achieve the same goal with contrastive regression. We also provide the results of the baseline I3D + MLP on this dataset, which clearly show the performance improvement obtained by our method. We reach the best results on almost all classes in AQA-7 under both Spearman’s correlation and R-$\ell_2$. Compared with USDL, our method achieves relative improvements of 8.95%, 2.32%, 8.83%, -6.82%, 3.01% and 2.25% on the six sports classes (in the column order of Table 1) under Spearman’s rank correlation. Meanwhile, the improvements in R-$\ell_2$ (×100) are 0.15, 0.31, 1.15, 1.07, 0.24 and -0.21, where negative values indicate a drop. For the average correlation and average R-$\ell_2$, we obtain improvements of nearly 3.7% and 0.45 over the USDL model, clearly showing the effectiveness of our model.
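The per-class percentages quoted above are relative gains over USDL and can be reproduced from the Spearman correlations in Table 1. The short check below uses only values from that table; the dictionary and variable names are illustrative.

```python
# Spearman correlations from Table 1 (USDL vs. CoRe + GART).
usdl = {"Diving": 0.8099, "Gym Vault": 0.7570, "BigSki.": 0.6538,
        "BigSnow.": 0.7109, "Sync. 3m": 0.9166, "Sync. 10m": 0.8878}
core = {"Diving": 0.8824, "Gym Vault": 0.7746, "BigSki.": 0.7115,
        "BigSnow.": 0.6624, "Sync. 3m": 0.9442, "Sync. 10m": 0.9078}

# Relative gain in percent: 100 * (ours - USDL) / USDL.
relative_gain = {name: 100.0 * (core[name] - usdl[name]) / usdl[name] for name in usdl}

for name, gain in relative_gain.items():
    print(f"{name:10s} {gain:+.2f}%")
```

Only BigSnow. shows a negative gain, which matches the single class where USDL remains ahead.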

We also conduct several analysis experiments to study the effects of the depth of the regression tree and the vote number M in the multi-exemplar voting on Diving class of AQA-7 dataset.

Effects of the depth of regression tree. In the regression tree module, the depth of the tree is a significant hyper-parameter determining the architecture of the regression tree. We conduct several experiments on Diving class of AQA-7 dataset with different values of depth, ranging from 2 to 7, and set the M to 10. As shown in Figure  5, our model performs better when depth is 5 and 6, where the total number of groups is 32 and 64. However, there is a little drop in performance when depth is smaller than 4 or bigger than 7. In general, our model is robust to different depths.

Effects of the number of exemplars for voting. The number of exemplars used in the inference phase is another important hyper-parameter. A larger M means the model can refer to more exemplars, at the price of a larger computational cost. We conduct experiments on the Diving class to study the impact of M. Figure 5 shows the result when the depth of the regression tree is set to 5. We observe that as M increases, the performance becomes better and the variance becomes lower. The improvement in Sp. Corr. becomes less significant when M exceeds 10. We can also find the same trend for R-$\ell_2$.

4.3 Results on MTL-AQA dataset

Method (w/o DD)                                  Sp. Corr.  R-$\ell_2$ (×100)  Year
Pose+DCT [DBLP:conf/eccv/PirsiavashVT14]         0.2682     -                  2014
C3D-SVR [DBLP:conf/cvpr/ParmarM17]               0.7716     -                  2017
C3D-LSTM [DBLP:conf/cvpr/ParmarM17]              0.8489     -                  2017
MSCADC-STL [DBLP:journals/corr/abs-1904-04346]   0.8472     -                  2019
C3D-AVG-STL [DBLP:journals/corr/abs-1904-04346]  0.8960     -                  2019
MSCADC-MTL [DBLP:journals/corr/abs-1904-04346]   0.8612     -                  2019
C3D-AVG-MTL [DBLP:journals/corr/abs-1904-04346]  0.9044     -                  2019
I3D + MLP [musdl]                                0.8921     0.707              2020
USDL [musdl]                                     0.9066     0.654              2020
MUSDL [musdl]                                    0.9158     0.609              2020
I3D + MLP                                        0.9196     0.465
CoRe + GART                                      0.9341     0.365

Method (w/ DD)                                   Sp. Corr.  R-$\ell_2$ (×100)  Year
USDL [musdl]                                     0.9231     0.468              2020
MUSDL [musdl]                                    0.9273     0.451              2020
I3D + MLP                                        0.9381     0.394
CoRe + GART                                      0.9512     0.260

Table 2: Comparisons of performance with existing methods on the MTL-AQA dataset. Rows without a citation are our implementation.

Method       Ablation  Sp. Corr.  R-$\ell_2$ (×100)
I3D + MLP    Baseline  0.9381     0.394
I3D + GART   + GART    0.9403     0.366
CoRe + GART  + CoRe    0.9512     0.260

Table 3: Ablation study on the MTL-AQA dataset.

Figure 6: A comparison of different methods in a scatter plot. Each point in the figure represents a video in the test set. The red line indicates perfect predictions.

Figure 7: Cumulative score curve on the MTL-AQA dataset. A larger area under the curve indicates better performance.

Figure 8: Case study. The exemplar and the input video are marked in the upper left corner. Each pair of exemplar and input videos has the same degree of difficulty (DD). We show the probability output for each layer of the regression tree and the regression value for each leaf on the right. We take the regression value of the leaf node with the highest probability as the final regression result. The very small errors between our predictions and the ground truths demonstrate the effectiveness of our method.

Table 2 shows the performance of existing methods and our method on the MTL-AQA dataset. Since degree of difficulty (DD) annotations are available for diving actions in MTL-AQA, we also verify the effects of DD on this dataset. We divide all methods into two types: those that use the DD labels in the training phase (bottom part of the table) and those that do not (upper part of the table). Without DD labels, CoRe + GART achieves improvements of 2.0% and 0.244 over MUSDL [musdl] under Spearman’s rank correlation and the R-$\ell_2$ metric, respectively. By training with the degree of difficulty, our method becomes even better, achieving improvements of 2.6% and 0.191 over MUSDL under the two metrics. We conjecture that there are two reasons: one is that we can select more suitable exemplars, and the other is that our method can exploit more information about the action from the degree of difficulty. To give an intuitive understanding of the differences between our method and the baseline methods, we visualize the prediction results as a scatter plot in Figure 6. Our method is much more accurate than the baseline. By using the degree of difficulty information, the performance of our method can be further improved, with almost all the points lying near the red line in the middle of the plot. In Figure 7, we show the cumulative score curves of our methods and the SOTA method MUSDL [musdl]. Given an error threshold, the samples whose absolute difference between prediction and ground truth is below the threshold are regarded as positive samples. It can be observed that under any error threshold, CoRe + GART (red line) shows a stronger ability to predict accurate scores.

Ablation Study. We further conduct an ablation study for our method. The results are shown in Table 3. Comparing I3D + MLP and I3D + GART, we see that replacing the MLP with our group-aware regression tree improves the performance by 0.0022 and 0.028 under Spearman’s rank correlation and the R-$\ell_2$ metric, which demonstrates the effectiveness of the design of GART. The performance is further improved when we replace the I3D baseline with our proposed CoRe framework. The above results demonstrate the effectiveness of the two components of our method.

Sp. Corr.                          S       NP      KT      Avg. Corr.
ST-GCN [DBLP:conf/aaai/YanXL18]    0.31    0.39    0.58    0.43
TSN [DBLP:conf/cvpr/ParmarM17]     0.34    0.23    0.72    0.46
JRG [DBLP:conf/iccv/JiaHuiaction]  0.36    0.54    0.75    0.57
USDL [musdl]                       0.64    0.63    0.61    0.63
MUSDL [musdl]                      0.71    0.69    0.71    0.70
I3D + MLP                          0.61    0.68    0.66    0.65
CoRe + GART                        0.84    0.86    0.86    0.85

R-$\ell_2$ (×100)                  S       NP      KT      Avg.
I3D + MLP                          4.795   11.225  6.120   7.373
CoRe + GART                        5.055   5.688   2.927   4.556

Table 4: Comparisons of performance with existing methods on the JIGSAWS dataset.

Figure 9: Visualization. We show the visualization result on MTL-AQA using Grad-CAM [selvaraju2017grad]. Our method can focus on the regions that are critical to assess the action quality.

Case Study. In order to have a deeper understanding of the behavior of our model, we present a case study in Figure 8. Based on the comparison between input and exemplar, the regression tree determines the relative score from coarse to fine. The first layer of the regression tree tries to determine which video is better, and the following layers try to make the prediction more accurate. The first case in the figure shows the behavior when the difference between input and exemplar is large, and the second case shows the behavior when the difference is small. In both situations, our model gives satisfactory predictions.

4.4 Results on JIGSAWS

We also conduct experiments on the surgical action dataset JIGSAWS. Four-fold cross-validation is used following previous works [musdl, DBLP:conf/iccv/JiaHuiaction]. Table 4 shows the experimental results. CoRe + GART largely improves over the previous state of the art. Our method also obtains a more balanced performance across the different action classes.

4.5 Visualization

To further demonstrate the effectiveness of our method, we visualize the baseline model (I3D + MLP) and our best model (CoRe + GART) using Grad-CAM [selvaraju2017grad] on MTL-AQA, as shown in Figure 9. We observe that our method focuses on certain regions (hands, body, etc.), which indicates that our contrastive regression framework can alleviate the influence of the background and pay more attention to the discriminative parts.

5 Conclusions

In this paper, we have proposed the CoRe framework for action quality assessment, which learns the relative scores based on the exemplars. We have also devised a group-aware regression tree to convert the conventional score regression into a coarse-to-fine classification task and a regression task in small intervals. The experiments on three AQA datasets have demonstrated the effectiveness of our approach. We expect the introduction of CoRe provides a new and generic solution for various AQA tasks.


This work was supported in part by the National Natural Science Foundation of China under Grant U1813218, Grant U1713214, and Grant 61822603, in part by a grant from the Beijing Academy of Artificial Intelligence (BAAI), and in part by a grant from the Institute for Guo Qiang, Tsinghua University.

Appendix A Datasets

We now describe the datasets we used in our experiments in detail.

AQA-7 [DBLP:conf/wacv/ParmarM19]: AQA-7 contains 1,189 samples from seven different actions collected from the winter and summer Olympic Games. It incorporates two previously released datasets: UNLV-Dive [DBLP:conf/cvpr/ParmarM17], named single diving-10m platform in AQA-7, contains 370 samples, and UNLV-Vault [DBLP:conf/cvpr/ParmarM17], named gymnastic vault in AQA-7, contains 176 samples. The other action classes are newly collected in this dataset: synchronous diving-3m springboard contains 88 samples, synchronous diving-10m platform contains 91 samples, big air skiing contains 175 samples, and big air snowboarding contains 206 samples.

MTL-AQA [DBLP:journals/corr/abs-1904-04346]: The MTL-AQA dataset contains all kinds of diving actions and is the largest AQA dataset to date. It comprises 1,412 samples collected from 16 different world events. The annotations are varied, including the degree of difficulty (DD), the score from each judge (7 judges in total), the type of the diver’s action, and the final score. We adopt the evaluation protocol suggested in [DBLP:journals/corr/abs-1904-04346] in our experiments.

JIGSAWS [gao2014jhu]: JIGSAWS is a surgical action dataset containing three types of surgical tasks: “Suture (S)”, “NeedlePassing (NP)”, and “Knotted (KT)”. For each task, each video sample is annotated with multiple scores assessing different aspects of the surgical action, and the final score is the sum of those sub-scores. We adopt a four-fold cross-validation strategy similar to that of [DBLP:conf/iccv/JiaHuiaction, musdl].

Appendix B More Discussions

More analysis on the regression tree.

To better understand the prediction process of the regression tree, we also investigate the prediction accuracy of each layer in the regression tree on the MTL-AQA dataset, as shown in Figure 10. We also compare the results with two baseline methods. Comparing CoRe + GART and GART, we can see that CoRe + GART performs better in each layer under all values of $K$, which indicates that measuring the relative score between input and exemplar is more effective than predicting the final score directly. Comparing the two CoRe-based methods, we see that the group-aware regression tree measures the relative score more accurately.

Figure 10: Classification accuracy for each layer of the group-aware regression tree. CoRe + GART is our final method, combining contrastive regression and the group-aware regression tree. CoRe + MLP replaces the regression tree with an MLP, and the GART method keeps only the regression tree without the contrastive regression framework. $K$ is a tolerance threshold: classifying a pair into one of the nearest-K groups is still regarded as a correct classification.
Figure 11: Case study. The exemplar and the input video are distinguished by the markers in the upper-left corner. Each pair of exemplar and input videos has the same degree of difficulty (DD). We show the probability output for each layer of the regression tree and the regression value for each leaf on the right. We take the regression value of the leaf node with the highest probability as the final regression result.

More analysis on CoRe.

Another advantage of CoRe is that it can alleviate the subjectiveness of human judges by predicting the difference, even though the exemplar video is also annotated by human judges. Formally, we can assume that a score can be decomposed into the actual value and a subjectiveness term that follows a normal distribution. If we directly predict the score, the prediction target carries the full variance of the subjectiveness term. By introducing exemplar videos with annotated scores, our goal is to predict the difference


which also follows a normal distribution:


We see the prediction becomes closer to the actual value when . The empirical results in Figure 5(b) in the original paper also support our assumption.

More analysis on R-$\ell_2$.

To measure AQA performance more precisely, we propose a stricter metric, called relative L2-distance (R-$\ell_2$), for evaluating score prediction models. We use R-$\ell_2$ instead of the traditional L2-distance because different actions may have different scoring intervals, so comparing and averaging distances across different action classes can be misleading. Given the highest and lowest scores for an action, R-$\ell_2$ is defined as:


The definition involves the ground-truth score and the prediction for each sample, a tolerance threshold, and the size of the dataset. If the error between the prediction and the ground truth is less than the threshold, the error is ignored.
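To make the role of the tolerance threshold concrete, the following is a minimal Python sketch of a range-normalized squared error. It is not the authors' implementation. It assumes that an absolute error below the threshold is set to zero while larger errors are kept unchanged, and that the normalized errors are squared and averaged over the dataset. The function name `relative_l2`, its parameters, and the example scores are hypothetical.

```python
def relative_l2(targets, predictions, s_min, s_max, tolerance=0.0):
    """Mean squared error after normalizing by the score range.

    Assumption: an absolute error below `tolerance` is treated as zero;
    errors at or above the tolerance are kept unchanged.
    """
    if s_max <= s_min:
        raise ValueError("s_max must be greater than s_min")
    if len(targets) != len(predictions) or len(targets) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    score_range = s_max - s_min
    total = 0.0
    for target, prediction in zip(targets, predictions):
        error = abs(target - prediction)
        if error < tolerance:
            error = 0.0  # ignored: within the tolerance threshold
        total += (error / score_range) ** 2
    return total / len(targets)


# Hypothetical scores on a 0-100 scale (not taken from any dataset).
targets = [90.0, 60.0, 75.0]
predictions = [88.0, 70.0, 75.0]
strict = relative_l2(targets, predictions, s_min=0.0, s_max=100.0)
tolerant = relative_l2(targets, predictions, s_min=0.0, s_max=100.0,
                       tolerance=5.0)
```

With a tolerance of 5 points, the 2-point error on the first sample no longer contributes, so `tolerant` is smaller than `strict`. Because each error is divided by the score range of its action, values computed for actions with different scoring intervals remain comparable.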

Compared to previous metrics like Spearman’s correlation, the proposed R-$\ell_2$ metric has two key advantages: 1) our metric can judge a single prediction, while Spearman’s correlation requires the whole test set, which makes our metric more flexible; 2) our metric is stricter and more reasonable, especially when the test set is relatively small. For example, suppose diver A and diver B receive scores of 95 and 65, respectively, from professional human judges. If the predictions for these two actions are 80 and 30, this is a perfect prediction under Spearman’s correlation because the ranking is preserved, while our metric clearly reflects the prediction error.
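The two-diver example can be checked numerically. The sketch below is illustrative only: it computes Spearman's rank correlation with the standard tie-free formula and compares it with a range-normalized squared error. The score range of 0 to 100 is an assumption made for this illustration and is not stated in the text. The helper names are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rank correlation for tie-free data with at least 2 items."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


judge_scores = [95.0, 65.0]   # divers A and B, from the example above
predictions = [80.0, 30.0]

# Assumed score range for normalization (hypothetical, not from the text).
s_min, s_max = 0.0, 100.0

rho = spearman(judge_scores, predictions)
normalized_sq_errors = [
    ((t - p) / (s_max - s_min)) ** 2
    for t, p in zip(judge_scores, predictions)
]
mean_error = sum(normalized_sq_errors) / len(normalized_sq_errors)
```

Here `rho` equals 1.0 because the ordering of the two divers is preserved, even though the predictions are off by 15 and 35 points. The normalized errors are nonzero for both samples, and each one can be evaluated without access to the rest of the test set.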

Appendix C Case study

We conduct two more case studies here, as shown in Figure 11. Based on the comparison between the input and the exemplar, the regression tree determines the relative score from coarse to fine. The first layer of the regression tree tries to determine which video is better, and the following layers try to make this determination more accurate. The first case in the figure shows the behavior when the difference between the pair is small, while the second case shows the behavior when this difference is large. When the difference between the two videos is large, it is easy to make the prediction. When the difference is small, the classification task is more difficult, but our method can still give a relatively accurate judgment. We see that the proposed contrastive regression framework and the regression tree are two key techniques for achieving accurate score prediction.
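The coarse-to-fine selection described in the case study can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a complete binary tree, that the probability of a leaf is the product of the branch probabilities along its path, that the leaf regression values are already expressed in score units, and that the final score is the exemplar score plus the value of the most probable leaf. All function names and numbers are hypothetical.

```python
def leaf_probabilities(branch_probs):
    """Leaf probabilities of a complete binary tree.

    branch_probs[d] lists, for each node in layer d (left to right), the
    probability of taking the right branch. Assumption: a leaf's
    probability is the product of the branch probabilities on its path.
    """
    probs = [1.0]  # probability of reaching the root
    for layer in branch_probs:
        if len(layer) != len(probs):
            raise ValueError("each layer needs one probability per node")
        next_probs = []
        for node_prob, p_right in zip(probs, layer):
            next_probs.append(node_prob * (1.0 - p_right))  # left child
            next_probs.append(node_prob * p_right)          # right child
        probs = next_probs
    return probs


def predict_score(exemplar_score, branch_probs, leaf_values):
    """Add the value of the most probable leaf to the exemplar score."""
    probs = leaf_probabilities(branch_probs)
    if len(leaf_values) != len(probs):
        raise ValueError("need one regression value per leaf")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return exemplar_score + leaf_values[best], best, probs


# Hypothetical two-layer tree. The first layer decides which video is
# better (right = input better than exemplar); the second layer refines.
branch_probs = [[0.8], [0.3, 0.6]]
leaf_values = [-12.0, -4.0, 3.5, 11.0]  # relative scores, in score units
score, best_leaf, probs = predict_score(70.0, branch_probs, leaf_values)
```

In this example the first layer favors the right branch with probability 0.8, so the input is judged better than the exemplar, and the second layer then selects the rightmost leaf. The predicted score is therefore the exemplar score of 70.0 plus the leaf value of 11.0.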