Analyzing sports video requires robust algorithms to automate fine-grained action recognition, retrieval, and detection in large-scale video collections. Human pose is a useful feature when sports are centered around people.
State-of-the-art skeleton-based deep learning techniques for action recognition [36, 68] rely on accurate 2D pose detection to extract the athletes’ motion, but the best pose detectors [54, 65] routinely fail on fast-paced sports video with complex blur and occlusions, often in frames crucial to the action (Figure 1). To circumvent these issues, end-to-end learned models operate directly on the video stream [8, 14, 32, 52, 62, 73]. However, because they consume pixels rather than pose as input, when trained with few labels they tend to latch onto specific visual patterns [10, 63] (e.g., an athlete’s clothes or the presence of a ball) rather than the fine-grained motion. As a result, prior pose and end-to-end methods often generalize poorly on fine-grained tasks in challenging sports video when labels are scarce. While collecting large datasets with fine action and pose annotations is possible, doing so for each new sport does not scale.
We propose Video Pose Distillation (VPD), a weakly-supervised technique in which a student network learns to extract robust pose features from RGB video frames in a new video domain (a single sport). VPD is designed such that, whenever pose is reliable, the features match the output of a pretrained teacher pose detector. Our strategy retains the best of both pose and end-to-end worlds. First, like directly supervised end-to-end methods, our student can exploit the rich visual patterns present in the raw frames, including but not limited to the athlete’s pose, and continue to operate when pose estimation is unsuccessful. Second, by constraining our descriptors to agree with the pose estimator whenever high-confidence pose is available, we avoid the pitfall of overfitting to visual patterns unrelated to the athlete’s action. And third, weak pose supervision allows us to enforce an additional constraint: we require that the student predicts not only instantaneous pose but also its temporal derivative. This encourages our features to pick up on visual similarities over time (e.g., an athlete progressing from pose to pose). When we train the student with weak-supervision over a corpus of unlabeled sports video, the student learns to ‘fill-in the gaps’ left by the noisy pose teacher. Together, these properties lead to a student network whose features outperform the teacher’s pose output when used in downstream applications.
VPD features improve performance on few-shot, fine-grained action recognition, retrieval, and detection tasks in the target sport domain, without requiring additional ground-truth action or pose labels. We demonstrate the benefits of VPD on four diverse sports video datasets with fine-grained action labels: diving, floor exercises, tennis, and a new dataset for figure skating. In a few-shot (limited supervision) setting, action recognition models trained with distilled VPD features can significantly outperform models trained directly on features from the teacher as well as baselines from prior skeleton-based and end-to-end learning work. For instance, when restricted to between 8 and 64 training examples per class from diving and floor exercises, the two datasets that are most challenging for pose, VPD features improve fine-grained classification accuracy by 6.8 to 22.8% and by 5.0 to 10.5%, respectively, over the next best method(s). Even when labels are plentiful, VPD remains competitive, achieving superior accuracy on three of the four test datasets. To summarize, VPD surpasses its teacher in situations where leveraging pose is crucial (e.g., few-shot) and is also competitive when end-to-end methods dominate (e.g., unreliable pose and the high-data / full supervision setting). Finally, we show applications of VPD features to fine-grained action retrieval and few-shot temporal detection tasks.
This paper makes the following contributions:
A weakly-supervised method, VPD, to adapt pose features to new video domains, which significantly improves performance on downstream tasks like action recognition, retrieval, and detection in scenarios where 2D pose estimation is unreliable.
State-of-the-art accuracy in few-shot, fine-grained action understanding tasks using VPD features, for a variety of sports. On action recognition, VPD features perform well with as few as 8 examples per class and remain competitive or state-of-the-art even as the training data is increased.
A new dataset (figure skating) and extensions to three datasets of real-world sports video, to include tracking of the performers, in order to facilitate future research on fine-grained sports action understanding.
2 Related Work
Pose representations provide a powerful abstraction for human action understanding. Despite significant progress in 2D and 3D pose estimation [44, 45, 54], downstream algorithms that depend on pose continue to suffer from unreliable estimates in sports video. With few labels available, for tasks such as fine-grained action recognition, models must learn both the actions and to cope with noisy inputs.
VIPE and CV-MIM show that learned pose embeddings, which factor out camera view and forgo explicit 3D pose estimation, can be useful; they are trained on out-of-domain 3D pose data to embed 2D pose inputs and are effective when 2D pose is reliable. VPD extends these works by using distillation to replace the unreliable 2D pose estimation step with a model that maps directly from pixels to a pose embedding. Other methods [25, 45, 70] learn human motion from video but produce 3D pose rather than embeddings.
Video action recognition is dominated by end-to-end models [3, 8, 14, 32, 52, 59, 62, 73], which are often evaluated on diverse but coarse-grained classification tasks (e.g., ‘golf’, ‘tennis’, etc.) [26, 29, 42, 51, 71]. Fine-grained action recognition in sports is a recent development [31, 49]. Besides being necessary for sports video analysis, fine-grained classification within a single sport is interesting because it avoids many contextual biases in coarse-grained tasks [10, 31, 63]. [2, 12, 17, 61] are also fine-grained datasets, but differ from body-centric actions in sports.
Pose or skeleton-based methods [11, 36, 68] appear to be a good fit for action recognition in human-centric sports. They depend on reliable 2D or 3D pose, which exists in datasets captured in controlled settings [35, 48] but not for public sports video, where no ground-truth is available and automatic detectors often perform poorly (e.g., [31, 49]).
VPD improves upon pose-based and end-to-end methods in human-centric sports datasets, especially when pose is not reliable. Like VIPE, VPD produces effective pose features, to the extent that comparatively simple downstream models such as nearest neighbor search or a generic BiGRU [16] network can compete with the state of the art in action recognition, in both few-shot and high-data regimes. To show this, we compare against several recent action recognition methods [36, 52] in Section 4.1.
VPD features can be used for any tasks where pretrained pose features may be helpful, such as action retrieval and temporally fine-grained detection (e.g., identifying tennis racket swings at 200 ms granularity). The latter is interesting because prior baselines [13, 24] focus on more general categories than human-centric action within a single sport and few papers [1, 67] address the few-shot setting.
Few-shot action recognition literature follows a number of paradigms, including meta-learning, metric learning, and data-augmentation approaches [1, 7, 30, 41]. These works focus on coarse-grained datasets [13, 26, 29, 51] and adopt various protocols that partition the dataset into seen/unseen classes and/or perform a reduced N-way, K-shot classification (e.g., 5-way, 1 or 5 shot). VPD differs in that it is completely agnostic to action labels when training features and does not require a particular architecture for downstream tasks such as action recognition. In contrast to ‘few-shot’ learning that seeks to generalize to unseen classes, we evaluate on the standard classification task, with all classes known, but restricted to only $k$ examples per class at training time. Our evaluation is similar to [50, 72], which perform action and image recognition with limited supervision, and, like [50, 72], we test at different levels of supervision.
Self-supervision/distillation. VPD relies on only machine-generated pose annotations for weak-supervision and distillation. VPD is similar to  in that the main goal of distillation is to improve the robustness and accuracy of the student rather than improve model efficiency. Most self-supervision work focuses on pretraining and joint-training scenarios, where self-supervised losses are secondary to the end-task loss, and subsequent or concurrent fine-tuning is necessary to obtain competitive results [9, 18, 23, 27, 33]. By contrast, our VPD student is fixed after distillation.
3 Video Pose Distillation
Our strategy is to distill inaccurate pose estimates from an existing, off-the-shelf pose detector (the teacher), trained on generic pose datasets, into a student network that is specialized to generate robust pose descriptors for videos in a specific target sport domain (Figure 2). The student (Section 3.2) takes RGB pixels and optical flow, cropped around the athlete, as input. It produces a descriptor, from which we regress the athlete’s pose as emitted by the teacher (Section 3.1). We run this distillation process over a large, uncut and unlabeled corpus of target domain videos (Section 3.3), using the sparse set of high-confidence teacher outputs as weak supervision for the student.
Since the teacher is already trained, VPD requires no new pose annotations in the target video domain. Likewise, no downstream application-specific labels (e.g., action labels for recognition) are needed to learn pose features. VPD does, however, require that the athlete be identified in each input frame, so we assume that an approximate bounding box for the athlete is provided in each frame as part of the dataset. Refer to Section 5 for discussion and limitations.
3.1 Teacher Network
To stress that VPD is a general approach that can be applied to different teacher models, we propose two teacher variants of VPD. The first uses an off-the-shelf pose estimator to estimate 2D joint positions from $x_t$, the RGB pixels of the $t$-th frame. We normalize the 2D joint positions by rescaling and centering, and we collect the joint coordinates into a vector $p_t \in \mathbb{R}^d$. We refer to this as 2D-VPD since the teacher generates 2D joint positions.
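The normalization step can be illustrated with a small sketch. The section does not state which reference point or scale factor is used, so the choices below (centering on the mean joint and dividing by the largest absolute coordinate) are assumptions for illustration only, and `normalize_joints` is a hypothetical helper name.

```python
from typing import List, Tuple

Joint = Tuple[float, float]


def normalize_joints(joints: List[Joint]) -> List[float]:
    """Center 2D joints and rescale them, then flatten into one vector.

    ASSUMPTION: centering on the mean joint and scaling by the largest
    absolute coordinate are illustrative choices, not the paper's recipe.
    """
    n = len(joints)
    cx = sum(x for x, _ in joints) / n
    cy = sum(y for _, y in joints) / n
    centered = [(x - cx, y - cy) for x, y in joints]
    # Fall back to 1.0 when every joint coincides, to avoid dividing by zero.
    scale = max(max(abs(x), abs(y)) for x, y in centered) or 1.0
    # Flatten the (x, y) pairs into a single pose vector.
    return [c / scale for xy in centered for c in xy]
```

With this kind of normalization, the pose vector does not change when the athlete is translated or uniformly scaled in the image.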
Our second teacher variant further processes the 2D joint positions into a view-invariant pose descriptor, also emitted as a vector $p_t$. Our implementation uses VIPE to generate this descriptor. VIPE is a reimplementation of concepts from Pr-VIPE that is extended to train on additional synthetic 3D pose data [39, 46, 74] for better generalization. We refer to this variation as VI-VPD since the teacher generates a view-invariant pose representation. (See supplemental for details about VIPE and its quality compared to Pr-VIPE.)
3.2 Student Feature Extractor
Since understanding an athlete’s motion, not just their current pose, is a key aspect of many sports analysis tasks, we design a student feature extractor that encodes information about both the athlete’s current pose $p_t$ and the rate of change in pose $\Delta_t := p_t - p_{t-1}$.
The student is a neural network $F$ that consumes a color video frame $x_t \in \mathbb{R}^{3hw}$, cropped around the athlete, along with its optical flow $\phi_t \in \mathbb{R}^{2hw}$ from the previous frame. Here $h$ and $w$ are the crop’s spatial dimensions, and $t$ denotes the frame index. The student produces a descriptor $F(x_t, \phi_t) \in \mathbb{R}^d$, with the same dimension $d$ as the teacher’s output. We implement $F$ as a standard ResNet-34 [19] with 5 input channels (3 for color and 2 for flow), and we resize the input crops to a fixed resolution.
During distillation, the features emitted by $F$ are passed through an auxiliary decoder $D$, which predicts both the current pose $p_t$ and the temporal derivative $\Delta_t$. Exploiting the temporal aspect of video, $\Delta_t$ provides an additional supervision signal that forces our descriptor to capture motion in addition to the current pose. $D$ is implemented as a fully-connected network, and we train the combined student pathway $D \circ F$ using the following objective, summed over the $N$ frames in the weak-supervision set:
$$\operatorname*{minimize}_{F,\,D} \; \sum_{t=1}^{N} \left\lVert D\big(F(x_t, \phi_t)\big) - \begin{bmatrix} p_t \\ \Delta_t \end{bmatrix} \right\rVert^2$$
Since only $F$ is needed to produce descriptors during inference, we discard $D$ at the end of training.
Unlike its teacher, which was trained to recognize a general distribution of poses and human appearances, the student specializes to frames and optical flow in the new target domain (e.g., players in tennis courts). Specialization via distillation allows $F$ to focus on patterns present in the sports data that explain pose. We do not expect, nor do downstream tasks require, that $F$ encode poses or people not seen in the target domain (e.g., sitting on a bench, ballet dancers), although they may be part of the teacher’s training distribution. Experiments in Section 4 show that our pose descriptors, $F(x_t, \phi_t)$, improve accuracy on several applications, including few-shot, fine-grained action recognition.
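To make the supervision signal in this section concrete, the following sketch builds the regression target from a sequence of teacher poses and scores decoder outputs against it. It assumes that the temporal derivative is the difference between consecutive teacher poses and that the loss is a sum of squared errors; the function names are hypothetical, and the student and decoder networks themselves are left out.

```python
from typing import List, Sequence

Vec = Sequence[float]


def distillation_targets(poses: Sequence[Vec]) -> List[List[float]]:
    """Stack each teacher pose with its change from the previous frame.

    ASSUMPTION: the temporal derivative is a finite difference between
    consecutive frames, so the first frame has no target.
    """
    targets = []
    for prev, cur in zip(poses, poses[1:]):
        delta = [c - p for c, p in zip(cur, prev)]
        targets.append(list(cur) + delta)
    return targets


def distillation_loss(decoded: Sequence[Vec], targets: Sequence[Vec]) -> float:
    """Sum of squared errors between decoder outputs and teacher targets."""
    return sum(
        (a - b) ** 2
        for out, tgt in zip(decoded, targets)
        for a, b in zip(out, tgt)
    )
```

Because the target has twice the dimension of the pose, the decoder output is twice as wide as the descriptor it is trained to explain; only the descriptor is kept at inference time.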
3.3 Training Data Selection and Augmentation
Data selection. The teacher’s output may be noisy due to challenges such as motion blur and occlusion or because of domain shift between our target videos and the data that the teacher was trained on. To improve the student’s ability to learn and to discourage memorization of the teacher’s noise, we exclude frames with low pose confidence scores (specifically, mean estimated joint score) from the teacher’s weak-supervision set. By default, the threshold is 0.5, although 0.7 is used for tennis. Tuning this threshold has an effect on the quality of the distilled features (see supplemental for details). We also withhold a fixed fraction of frames (20%) uniformly at random as a validation set for the student.
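The selection rule above can be sketched as follows. The per-frame joint scores are taken as a plain dictionary, and `select_frames` is a hypothetical helper. The order of operations (drawing the validation frames first, then filtering only the remaining frames by confidence) is an assumption, since the text does not specify it.

```python
import random
from typing import Dict, List, Tuple


def select_frames(
    joint_scores: Dict[int, List[float]],
    threshold: float = 0.5,
    val_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (weak-supervision frames, validation frames).

    ASSUMPTION: validation frames are drawn uniformly from all frames
    before the confidence filter is applied to the rest.
    """
    rng = random.Random(seed)
    frames = sorted(joint_scores)
    n_val = round(val_fraction * len(frames))
    val = set(rng.sample(frames, n_val))
    train = [
        f
        for f in frames
        if f not in val
        # Keep a frame only if its mean joint score reaches the threshold.
        and sum(joint_scores[f]) / len(joint_scores[f]) >= threshold
    ]
    return train, sorted(val)
```

Raising `threshold` (for example to 0.7, as used for tennis) trades a smaller supervision set for cleaner teacher targets.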
Data augmentation. We apply standard image augmentation techniques, such as random resizing and cropping, horizontal flipping, and color and noise jitter, when training the student. To ensure that left-right body orientations are preserved when horizontally flipping the input frame and its optical flow, we must also flip the teacher’s output. For 2D joint positions and 2D-VPD, this is straightforward. To flip VIPE (itself a chiral pose embedding) used to train VI-VPD, we must flip the 2D pose inputs to VIPE and then re-embed them.
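For the 2D case, flipping the teacher output amounts to mirroring the x coordinates and exchanging left and right joints. The sketch below assumes joints are already centered (so mirroring is a sign change) and that the caller supplies the index pairs of left/right joints; both the joint layout and the name `flip_pose` are hypothetical.

```python
from typing import List, Sequence, Tuple

Joint = Tuple[float, float]


def flip_pose(
    joints: Sequence[Joint], swap_pairs: Sequence[Tuple[int, int]]
) -> List[Joint]:
    """Mirror a centered 2D pose horizontally and swap left/right joints.

    ASSUMPTION: x = 0 is the mirror axis, and swap_pairs lists the
    (left, right) joint indices of the skeleton layout in use.
    """
    flipped = [(-x, y) for x, y in joints]
    for left, right in swap_pairs:
        # After mirroring, the joint stored as "left" is anatomically right.
        flipped[left], flipped[right] = flipped[right], flipped[left]
    return flipped
```

Without the swap, a mirrored pose would label the athlete's right arm as the left one, which would teach the student an inconsistent chirality.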
We evaluate the features produced by VPD on four fine-grained sports datasets that exhibit a wide range of motions.
Figure skating consists of 371 men’s and women’s singles short program performances from the Winter Olympics (2010-18) and World Championships (2017-19), totalling 17 video hours. In the classification task, FSJump6, there are six jump types defined by the ISU [22]. All videos from 2018 (134 routines, 520 jumps) are held out for testing. The remaining jumps are split 743/183 for training/validation.
Tennis consists of nine singles matches from two tournaments (Wimbledon and US Open), with swings annotated at the frame of ball contact. There are seven swing classes in Tennis7. The training/validation sets contain 4,592/1,142 examples from five matches and the test set contains 2,509 from the remaining four matches. Split by match video, this dataset is challenging due to the limited diversity in clothing and unique individuals (10 professional players).
Floor exercise. We use the women’s floor exercise event (FX35) of the FineGym99 dataset, containing 1,214 routines (34 hours). There are 35 classes and 7,634 actions.
Diving48 contains 16,997 annotated instances of 48 dive sequences, defined by FINA [15]. We evaluate on the corrected V2 labels released by the authors and retrain the existing state-of-the-art method, GSM, for comparison.
All four datasets contain frames in which pose is not well estimated or uncertain, though their distribution varies (see supplemental for details). As mentioned earlier, pose estimates are typically worse in frames with fast motion, due to motion blur and atypical, athletic poses such as flips or dives; see Figure 1 for examples. A challenge common to these datasets is that the fast-motion frames are often the ones needed to discriminate the fine-grained actions of interest.
We assume the subject of the action is identified and tracked. With multiple humans in the frame, fast-moving athletes in challenging poses are often missed otherwise: i.e., detected at lower confidence than static audience members or judges, or not detected at all. For fair comparison, we boost the baselines by providing them the same inputs as our method, which improves their results significantly.
|Dataset (Top-1 acc)||FSJump6||Tennis7||FX35||Diving48|
|†TRNms (2-stream) [49, 73]|| || ||84.9|| |
|TSN (w/o crop)||57.9||-||83.2||82.3|
|TSN (crop; 2-stream)||82.7||90.9||90.4||83.6|
|TRNms (w/o crop)||68.7||-||81.5||80.5|
|TRNms (crop; 2-stream)||84.0||76.3||87.3||81.5|
|GSM (w/o crop)||42.1||-||90.3||90.2|
|Skeleton / pose-based (w/ tracked 2D poses)|
|†ST-GCN (w/o tracking) [49, 68]|| || ||40.1|| |
|MS-G3D (ensemble)||91.7||91.0||92.1||80.2|
|Pose features (w/ BiGRU)|
|Normalized 2D joints||95.5||90.9||86.9||75.7|
| ||Input features \ Amount of training data||FSJump6 (Full)||FSJump6 (16-shot)||Tennis7 (Full)||Tennis7 (16-shot)||FX35 (Full)||FX35 (16-shot)||Diving48 (Full)||Diving48 (16-shot)|
|(a)||Normalized 2D joints (teacher)||95.5||72.5||90.9||64.3||86.9||65.6||75.7||25.5|
| ||distilled w/o motion; RGB (2D joints teacher)||96.1||73.2||90.9||66.5||92.0||76.3||85.3||52.8|
| ||distilled w/o motion; RGB & flow (2D joints teacher)||95.8||74.6||91.7||67.0||91.6||76.6||85.6||53.0|
| ||2D-VPD: distilled w/ motion; RGB & flow||97.0||74.4||92.6||66.9||94.5||82.7||86.4||57.6|
| ||distilled w/o motion; RGB (VIPE teacher)||97.1||81.3||92.1||67.6||93.5||83.4||86.5||54.9|
| ||distilled w/o motion; RGB & flow (VIPE teacher)||97.3||79.3||91.7||69.7||92.9||83.2||85.9||53.7|
| ||VI-VPD: distilled w/ motion; RGB & flow||97.4||80.2||93.3||71.1||94.6||84.9||88.6||58.8|
|(b)||VI-VPD (distilled on action video only)||96.3||79.4||92.4||69.1||94.1||84.3||-||-|
|(c)||VI-VPD (distilled w/ the entire video corpus)||97.2||81.9||93.8||72.6||94.5||84.9||88.4||59.6|
4.1 Fine-Grained Action Recognition
Fine-grained action recognition tests VPD’s ability to capture precise details about an athlete’s pose and motion. We consider both the few-shot setting, where only a limited number of action examples are provided, and the traditional full supervision setting, where all of the action examples in the training set are available.
Our VPD features are distilled over the training videos in the sports corpus, uncut and without labels. To extract features on the test set, we use the fixed VPD student. VI-VPD and 2D-VPD features have the same dimension as the outputs of their teachers (VIPE and normalized 2D joints, respectively). For Diving48, the VIPE dimension is doubled because we also extract pose embeddings on the vertically flipped poses and concatenate them. This data augmentation is beneficial for VIPE due to the often inverted nature of diving poses, which are less well represented in the out-of-domain 3D pose datasets that VIPE is trained on.
Action recognition model. We classify actions with a BiGRU [16] trained atop the (fixed) features produced by the student. Since our features are chiral and many actions can be performed with either left-right orientation, we embed both the regular and horizontally flipped frames with the student. See supplemental for implementation details.
Prior pose embedding work has explored using sequence alignment followed by nearest-neighbor retrieval. We also tested a nearest-neighbor search (NNS) approach that uses dynamic time warping to compute a matching cost between sequences of pose features. For NNS, each test example is searched against all the training examples, and the label of the best aligned match is predicted. The BiGRU is superior in most settings, though NNS can be effective in few-shot situations, and we indicate when this is the case.
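The NNS baseline can be sketched in a few lines. This is a textbook dynamic time warping recurrence with Euclidean frame distance, not necessarily the exact variant used in the experiments (step pattern, normalization, and distance are assumptions), and `dtw_cost` and `nns_predict` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def frame_distance(a: Vec, b: Vec) -> float:
    """Euclidean distance between two per-frame pose features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_cost(a: Sequence[Vec], b: Sequence[Vec]) -> float:
    """Cost of the best monotonic alignment between two feature sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(a[i - 1], b[j - 1])
            # Extend the cheapest of: skip in a, skip in b, or match both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def nns_predict(query: Sequence[Vec], train: List[Tuple[Sequence[Vec], str]]) -> str:
    """Predict the label of the training sequence that aligns best."""
    return min(train, key=lambda example: dtw_cost(query, example[0]))[1]
```

Because alignment tolerates differences in speed, two executions of the same action at different tempos can still match closely, which is one reason such a simple classifier can work with very few labeled examples.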
Baselines. We compare our distilled 2D-VPD and VI-VPD features against several baselines.
The features from the teacher: VIPE or the normalized 2D joint positions, using the same downstream action recognition models and data augmentations.
End-to-end: GSM, TSN, and TRNms (multiscale). We test with both the cropped athletes and the full frame (w/o cropping) as inputs, and we find that cropping significantly improves accuracy in the few-shot setting on all four datasets, and in the full supervision setting on all datasets except for Diving48. When applicable, combined results with RGB and optical flow models are indicated as 2-stream.
4.1.1 Few-shot and limited supervision setting
Experiment protocol. Each model is presented $k$ examples of each action class but may utilize unlabeled data or knowledge from other datasets as pre-training. For example, skeleton-based methods rely on 2D pose detection; VIPE leverages out-of-domain 3D pose data; and VPD features are distilled on the uncut, unlabeled training videos. This experimental setup mirrors real-world situations where few labels are present but unlabeled and out-of-domain data are plentiful. Our evaluation metric is top-1 accuracy on the full test set. To control for variation in the training examples selected for each few-shot experiment, we run each algorithm on five randomly sampled and fixed subsets of the data for each $k$, and report the mean accuracy.
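The protocol above can be sketched as follows: draw a fixed, seeded subset with at most $k$ examples per class, repeat for five seeds, and average the resulting accuracies. The `evaluate` callback stands in for training and testing a model on a subset and is a placeholder; the seed values and function names are likewise assumptions.

```python
import random
from collections import defaultdict
from typing import Callable, List, Sequence


def sample_k_shot(labels: Sequence[str], k: int, seed: int) -> List[int]:
    """Pick up to k training indices per class, reproducibly for a seed."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    subset: List[int] = []
    for label in sorted(by_class):
        members = by_class[label]
        subset.extend(rng.sample(members, min(k, len(members))))
    return sorted(subset)


def mean_accuracy(
    labels: Sequence[str],
    k: int,
    evaluate: Callable[[List[int]], float],
    seeds: Sequence[int] = (0, 1, 2, 3, 4),
) -> float:
    """Average the accuracy returned by evaluate over fixed random subsets."""
    accuracies = [evaluate(sample_k_shot(labels, k, seed)) for seed in seeds]
    return sum(accuracies) / len(accuracies)
```

Fixing the seeds means every method sees the same five subsets for a given $k$, so differences in mean accuracy are not caused by a lucky draw of training examples.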
Results. Figure 3 compares 2D-VPD and VI-VPD features to their teachers (and other baselines). On FSJump6 and Tennis7, VI-VPD provides a slight improvement over its state-of-the-art teacher, VIPE, with accuracies within a few percent. FX35 shows a larger improvement: VI-VPD increases accuracy over VIPE by up to 10.5% and over the MS-G3D ensemble by 5%, at different values of $k$. Likewise, on Diving48, where end-to-end GSM and 2-stream TSN are otherwise better than the non-VPD pose-based methods, VI-VPD improves accuracy by 6.8 to 22.8%. Our results on FX35 and Diving48 suggest that VI-VPD helps to transfer the benefits of pose to datasets where it is most unreliable.
While view-invariant (VI) features generally perform better than their 2D analogues, the difference in accuracy between VI-VPD and 2D-VPD is more noticeable in sports with diverse camera angles (such as figure skating and floor exercise) and at small $k$, where the action recognition model can only observe a few views during training.
|Normalized 2D joints||91.8||84.8||73.8||91.8||88.1||82.1||71.6||57.4||39.0||34.5||22.1||14.6|
4.1.2 Traditional, full training set setting
VPD features are competitive even in the high-data regime (Table 1). On all four datasets, both VI-VPD and 2D-VPD significantly improve accuracy over their teachers. VI-VPD also achieves state-of-the-art accuracy on the FSJump6 (0.6% over VIPE), Tennis7 (1.5% over VIPE), and FX35 (1.0% over GSM, with cropped inputs) datasets.
Diving48 is especially challenging for pose-based methods, and VI-VPD performs 1.6% worse than GSM without cropping. GSM with cropping is itself 1.5% worse than GSM without cropping, possibly due to errors and limitations of our tracking. VI-VPD does, however, perform significantly better than the top pose-based baseline (8.4% over MS-G3D, ensemble).
Our results demonstrate that VPD’s success is not limited to few-shot regimes. However, because many methods in Table 1 can produce high accuracies, at or above 90%, when given ample data, we view improving label efficiency as a more important goal for VPD and future work.
4.1.3 Ablations and additional experiments
We highlight two important ablations of VPD to understand the source of VPD’s improvements: (1) analyzing parts of the distillation method and (2) distilling with only the action segments of the video. We also consider (3) an unlabeled setting where VPD is distilled over the entire video corpus. Please refer to supplemental for additional experiments.
Analysis of the distillation method. Table 2(a) shows the increase in accuracy on action recognition for ablated 2D-VPD and VI-VPD features when we distill without flow input and without motion prediction (in the latter case, the student mimics the teacher’s output directly, without the auxiliary decoder and without the temporal derivative in the training loss). The incremental improvements are typically most pronounced in the few-shot setting, on the FX35 and Diving48 datasets, where VPD produces the largest benefits (see Section 4.1.1).
With VIPE as the teacher, distillation alone from RGB can have a large effect (2.7% and 7.7%, at full and 16-shot settings on FX35; 7.9% and 19.9% on Diving48). Adding flow in addition to RGB, without motion, gives mixed results. Finally, adding motion prediction and the decoder further improves results (1.1% and 1.5% on FX35, at full and 16-shot; 2.1% and 3.9% on Diving48). The effect of distilling motion on FSJump6 and Tennis7 is mixed at the 16-shot setting, though the full setting shows improvement.
2D-VPD can be seen as an ablation of view-invariance (VIPE) and shows a similar pattern when further ablated.
Training VPD on action parts of video only. Fine-grained action classes represent less than 7%, 8%, and 28% of the video in FSJump6, FX35, and Tennis7. We evaluate whether distillation of VI-VPD over uncut video improves generalization on action recognition, by distilling VI-VPD features with only the action parts of the training videos.
The results are summarized in Table 2(b) and show that distilling with only the action video performs worse on our datasets. This is promising because (1) uncut performances are much easier to obtain than performances with actions detected, and (2) in the low-supervision setting, VI-VPD improves accuracy even if actions have not been detected in the rest of the training corpus. This also suggests that distilling over more video improves the quality of the features.
Distillation with the entire video corpus. An unlabeled corpus is often the starting point when building real-world applications with videos in a new domain. Because VPD is supervised only by machine-generated pose estimates from unlabeled video, VPD features can be distilled over all of the video available, not just the training data. (This setting is similar to [55, 56], which propose self-supervision to align the training and testing distributions in situations with large domain shift.) Table 2(c) shows results when VI-VPD is distilled jointly with both the training and testing videos, uncut and without labels. The improvement, if any, is minor on all four datasets (at most 1.5%, attained on Tennis7 at 16-shot) and demonstrates that VI-VPD, distilled over a large dataset, is able to generalize without seeing the test videos.
|Figure skating jumps (trained on five routines)|
|Pretrained R3D ||39.5||30.0||23.1||15.0||9.0|
|Normalized 2D joints||80.6||70.0||53.5||40.2||24.6|
|Tennis swings at 200 ms (trained on five points)|
|Pretrained R3D ||41.3||37.8||29.9||15.8||7.6|
|Normalized 2D joints||59.7||58.2||43.7||24.6||10.3|
4.2 Action Retrieval
Action retrieval measures how well VPD features can be used to search for similar unlabeled actions. Here, the VPD features are distilled on the entire unlabeled corpus.
Experiment protocol. Given a query action, represented as a sequence of pose features, we rank all other actions in the corpus using the distance between pose features and dynamic time warping to compute an alignment score. A result is considered relevant if it has the same fine-grained action label as the query, and we assess relevance by the precision at k results, averaged across all the queries.
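The retrieval metric can be sketched as below. The `distance` argument stands in for the alignment score between two actions (dynamic time warping in the protocol above); its signature and the function names are assumptions for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def precision_at_k(ranked_labels: Sequence[str], query_label: str, k: int) -> float:
    """Fraction of the top-k results that share the query's label."""
    return sum(1 for label in ranked_labels[:k] if label == query_label) / k


def mean_precision_at_k(
    actions: Sequence[T],
    labels: Sequence[str],
    k: int,
    distance: Callable[[T, T], float],
) -> float:
    """Average precision at k, using every action in turn as the query."""
    total = 0.0
    for q in range(len(actions)):
        # Rank all other actions by increasing distance to the query.
        others: List[int] = [i for i in range(len(actions)) if i != q]
        others.sort(key=lambda i: distance(actions[q], actions[i]))
        total += precision_at_k([labels[i] for i in others], labels[q], k)
    return total / len(actions)
```

Note that labels are used only to score the ranking; the ranking itself depends on the features alone, which is why retrieval is a direct test of feature quality.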
Results. At all cut-offs in Table 3 and in all four datasets, VPD features outperform their teachers. Sizeable improvements are seen on FX35 and Diving48. View-invariance does not always result in the highest precision if the number of camera angles is limited (e.g., Tennis7 and Diving48), though it may be helpful for retrieving more diverse results.
4.3 Pose Features for Few-Shot Action Detection
Detection of fine-grained actions, at fine temporal granularity and with few labels, enables tasks such as few-shot recognition and retrieval. We evaluate VPD features on the figure skating and tennis datasets, to temporally localize the jumps and the swings, respectively. The average jump is 1.6 seconds in length (40 frames), while a swing is defined to be the 200 ms around the frame of ball contact (5 frames).
Experiment protocol. We follow the same video-level train/test splits as FSJump6 and Tennis7, and distill features on the training videos only. As a simple baseline method, we train a BiGRU that outputs per-frame predictions, which are merged to produce predicted action intervals (see supplemental for details). The BiGRU is trained on ground-truth temporal labels from five routines (figure skating) and five points (tennis). For more consistent results, we perform five-fold cross-validation and ensemble the per-frame predictions. In Table 4, we report average precision (AP) at various levels of temporal intersection over union (tIoU).
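Two pieces of this protocol are easy to illustrate: turning per-frame scores into predicted intervals, and measuring temporal IoU against a ground-truth interval. The merging rule used in the experiments is described in the supplemental; the simple thresholding below, the half-open interval convention, and the function names are assumptions.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open frame range [start, end)


def frames_to_intervals(scores: Sequence[float], threshold: float = 0.5) -> List[Interval]:
    """Merge consecutive frames scoring at or above threshold into intervals."""
    intervals: List[Interval] = []
    start = None
    for t, score in enumerate(scores):
        if score >= threshold and start is None:
            start = t
        elif score < threshold and start is not None:
            intervals.append((start, t))
            start = None
    if start is not None:
        # Close an interval that runs to the end of the video.
        intervals.append((start, len(scores)))
    return intervals


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection over union of two frame intervals."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0
```

For an action as short as a 5-frame tennis swing, shifting a prediction by a single frame already lowers tIoU noticeably, so per-frame noise matters more than it does for longer actions.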
Results. VPD improves AP on both tasks. The short duration of tennis swings means that noise in per-frame pose estimates has a large impact, and VI-VPD improves AP at every tIoU threshold (by up to 7.4 over VIPE).
5 Limitations and Discussion
Subject tracking is needed for VPD to ensure that the pose is of the correct person. Real-world sports video often contains many people, such as audience and judges, in addition to the subject. The tracking annotations in the datasets in Section 4.1 are computed automatically using off-the-shelf models and heuristics (see supplemental for details). This is possible because athletes are salient in appearance, space, and time; sports video is thus a natural application for work on tracking [4, 64] and detecting salient regions [6]. We observe that the difference in accuracy between the tracked and non-tracked inputs on prior methods such as [52, 62, 68] can be staggering (48% on FSJump6 for GSM and 40% on FX35 for ST-GCN; see Table 1).
To evaluate the quality of our pose features, we focused on motion by a single athlete or synchronized athletes (contained in Diving48). Tasks and actions involving many people require a more sophisticated downstream model that can handle multiple descriptors or poses per frame.
Future work. First, the 2D pose estimates used to supervise VPD are inherently ambiguous with respect to camera view, and additional information such as depth or a behavioral prior could help alleviate this ambiguity. Other weak supervision sources, in addition to motion and VIPE, may also help. Second, our distillation process is offline; supporting online training, similar to [43, 56], at the pose feature extraction stage could be beneficial in time-evolving datasets. Distillation for explicit 2D or 3D pose estimation is another possibility. Although VPD features can improve accuracy with limited data, performance on few-shot and semi-supervised tasks still has much room to improve, and we hope that future work continues to explore these topics.
Pose features are useful for studying human-centric action in novel sports video datasets. However, such datasets are often challenging for off-the-shelf models. Our method, VPD, improves the reliability of pose features in difficult and label-poor settings, by distilling knowledge from existing pose estimators. VPD learns features that improve accuracy on both traditional and few-shot action understanding tasks in the target (sport) domain. We believe that our distillation-based method is a useful paradigm for addressing challenges faced by applications in new video domains.
Acknowledgements. This work is supported by the National Science Foundation (NSF) under III-1908727 and Adobe Research. We also thank the anonymous reviewers.
[1] (2021) TAEN: Temporal Aware Embedding Network for Few-Shot Action Recognition. In CVPR Workshops.
[2] (2020) The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose. arXiv:2007.00394.
[3] (2021) Is Space-Time Attention All You Need for Video Understanding? In ICML.
[4] (2016) Simple Online and Realtime Tracking. In ICIP.
[5] (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools.
[6] (2016) Where Should Saliency Models Look Next? In ECCV.
[7] (2020) Few-Shot Video Classification via Temporal Alignment. In CVPR.
[8] (2017) Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In CVPR.
[9] (2020) A Simple Framework for Contrastive Learning of Visual Representations. In ICML.
[10] (2019) Why Can’t I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition. In NeurIPS.
[11] (2018) PoTion: Pose MoTion Representation for Action Recognition. In CVPR.
[12] (2018) Scaling Egocentric Vision: The EPIC-KITCHENS Dataset. In ECCV.
[13] (2015) ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In CVPR.
[14] (2019) SlowFast Networks for Video Recognition. In ICCV.
[15] Fédération Internationale de Natation.
[16] (2016) Deep Learning. http://www.deeplearningbook.org
[17] (2017) The “Something Something” Video Database for Learning and Evaluating Visual Common Sense. In ICCV.
[18] (2020) Self-supervised Co-Training for Video Representation Learning. In NeurIPS.
[19] (2015) Deep Residual Learning for Image Recognition. In CVPR.
[20] (2019) Pytorch-seq2seq.
[21] (2014) Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1325–1339.
[22] International Skating Union.
[23] (2020) Video Representation Learning by Recognizing Temporal Transformations. In ECCV.
[24] (2014) THUMOS Challenge: Action Recognition with a Large Number of Classes.
-  (2019) Learning 3D Human Dynamics from Video. In CVPR, Cited by: §2.
-  (2017) The Kinetics Human Action Video Dataset. Note: arXiv:1705.06950 External Links: Cited by: Appendix D, §2, §2.
-  (2019) Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles. In AAAI, Cited by: §2.
-  (2014) Convolutional Neural Networks for Sentence Classification. In EMNLP, Cited by: Table 8.
-  (2011) HMDB: A Large Video Database for Human Motion Recognition. In ICCV, Cited by: §2, §2.
-  (2019) ProtoGAN: Towards Few Shot Learning for Action Recognition. In ICCV Workshops, Cited by: §2.
-  (2018) RESOUND: Towards Action Recognition without Representation Bias. In ECCV, Cited by: Table 5, Table 9, §E5, Appendix F, Appendix F, §1, §2, §2, §4.
-  (2019) TSM: Temporal Shift Module for Efficient Video Understanding. In ICCV, Cited by: §1, §2.
-  MS2L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition. In ACM Multimedia, Cited by: §2.
-  Microsoft COCO: Common Objects in Context. In ECCV, Cited by: Appendix A, Appendix G.
-  (2020) NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (10), pp. 2684–2701. External Links: Cited by: §2.
-  (2020) Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition. In CVPR, Cited by: §B3, §1, §2, §2, Figure 3, item 2, Table 1.
-  (2019) Decoupled Weight Decay Regularization. In ICLR, Cited by: Appendix A, §B1, Appendix D.
-  (2017) Discriminative Correlation Filter with Channel and Spatial Reliability. In CVPR, Cited by: Appendix F.
-  (2019) AMASS: Archive of Motion Capture as Surface Shapes. In ICCV, Cited by: Appendix G, §3.1.
-  (2017) A simple yet effective baseline for 3d human pose estimation. In ICCV, Cited by: Appendix G.
-  (2018) A Generative Approach to Zero-Shot and Few-Shot Action Recognition. In WACV, Cited by: §2.
-  (2020) Moments in time dataset: one million videos for event understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2), pp. 502–508. External Links: Cited by: §2.
-  (2019) Online Model Distillation for Efficient Video Inference. In ICCV, Cited by: §5.
-  (2018) PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model. In ECCV, Cited by: §2.
-  (2019) 3D human pose estimation in video with temporal convolutions and semi-supervised training. In CVPR, Cited by: §2, §2.
-  (2019) 3DPeople: Modeling the Geometry of Dressed Humans. In ICCV, Cited by: Appendix G, §3.1.
-  (1978) Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26 (1), pp. 43–49. Cited by: §B2.
-  (2016) NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In CVPR, Cited by: §2.
-  (2020) FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding. In CVPR, Cited by: §B3, Appendix F, Appendix F, §1, §2, §2, Table 1, §4.
-  FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. In NeurIPS, Cited by: §2.
-  (2012) UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. Note: arXiv:1212.0402 External Links: Cited by: §2, §2.
-  (2020) Gate-Shift Networks for Video Action Recognition. In CVPR, Cited by: §B3, §B3, Table 9, §E5, §1, §2, §2, Figure 3, item 3, Table 1, §4, §5.
-  (2020) View-Invariant Probabilistic Embedding for Human Pose. In ECCV, Cited by: Appendix A, Appendix G, Appendix G, Appendix G, Appendix G, §2, §2, §3.1, §3.1, §4.1.
-  (2019) Deep High-Resolution Representation Learning for Human Pose Estimation. In CVPR, Cited by: Appendix A, Figure 1, §1, §2, §3.1.
-  (2019) Unsupervised Domain Adaptation through Self-Supervision. Note: arXiv:1909.11825 External Links: Cited by: footnote 2.
-  (2020) Test-Time Training with Self-Supervision for Generalization under Distribution Shifts. In ICML, Cited by: §5, footnote 2.
-  Rethinking the Inception Architecture for Computer Vision. In CVPR, Cited by: §B3.
-  (2020) RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In ECCV, Cited by: Appendix A.
-  (2018) A Closer Look at Spatiotemporal Convolutions for Action Recognition. In CVPR, Cited by: Table 7, §2.
-  (2018) A Closer Look at Spatiotemporal Convolutions for Action Recognition. In CVPR, Cited by: Appendix D, Table 4.
-  The 20BN-something-something Dataset V2. Cited by: §2.
-  (2016) Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In ECCV, Cited by: §B3, §1, §2, Figure 3, item 3, Table 1, §5.
-  (2021) Mimetics: towards understanding human actions out of context. Note: arXiv:1912.07249 External Links: Cited by: §B3, §1, §2.
-  (2018) Deep Cosine Metric Learning for Person Re-identification. In WACV, Cited by: §5.
-  (2019) Detectron2. Note: https://github.com/facebookresearch/detectron2 Cited by: Appendix F, §1.
-  Self-Training With Noisy Student Improves ImageNet Classification. In CVPR, Cited by: §2.
-  (2018) Similarity R-C3D for Few-shot Temporal Activity Detection. arXiv. Note: arXiv:1812.10000 External Links: Cited by: §2.
-  (2018) Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In AAAI, Cited by: §B3, §1, §2, item 2, Table 1, §5.
-  (2021) Vid2Player: Controllable Video Sprites That Behave and Appear Like Professional Tennis Players. ACM Transactions on Graphics 40 (3). Cited by: Appendix F, Appendix F, §1, §4.1.3, §4.
-  (2019) Predicting 3D Human Dynamics from Video. In ICCV, Cited by: §2.
-  (2013) From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding. In ICCV, Cited by: Appendix G, §2.
-  (2021) Learning View-Disentangled Human Pose Representation by Contrastive Cross-View Mutual Information Maximization. In CVPR, Cited by: Appendix G, §2, §2.
-  (2018) Temporal Relational Reasoning in Videos. In ECCV, Cited by: §B3, §B3, §1, §2, item 3, Table 1.
-  (2020) Reconstructing NBA Players. In ECCV, Cited by: Appendix G, §3.1.
Appendix A Implementation: Video Pose Distillation
This section provides additional implementation details for our method described in Section 3.
Pose: definition. VPD is not dependent on a specific 2D pose estimator or joint definition. We use an off-the-shelf HRNet to estimate pose in the detected region of the athlete, as is typical for top-down pose estimation. Heuristic tracking, described in Appendix F, can often provide bounding boxes in frames where person detection fails. We use only 13 of the 17 COCO keypoints (ignoring LEye, REye, LEar, and REar), and we apply the same joint normalization procedure as in Pr-VIPE.
Student inputs. The RGB crops are derived from the spatial bounding boxes of the athlete in each frame. We expand the bounding box to a square and then pad each side by 10% or 25 pixels, whichever is greater.
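The cropping rule above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `expand_box`, the `(x, y, w, h)` box layout, and the choice to measure the 10% relative to the side of the square are all assumptions, and clipping to the image bounds is left out.

```python
def expand_box(x, y, w, h, pad_frac=0.10, min_pad=25):
    """Expand an (x, y, w, h) box to a square about its center, then pad.

    Assumption: the 10% padding is taken relative to the square's side.
    """
    side = max(w, h)                      # square side before padding
    cx, cy = x + w / 2.0, y + h / 2.0     # center of the original box
    pad = max(pad_frac * side, min_pad)   # 10% or 25 px, whichever is greater
    half = side / 2.0 + pad
    return (cx - half, cy - half, 2 * half, 2 * half)
```

For a tall 40 x 200 box, the 25 pixel minimum applies (10% of 200 is only 20), so the crop becomes a 250 x 250 square centered on the athlete.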
Optical flow is computed using RAFT between the current frame and the preceding frame, where the preceding frame is cropped at the same location as the current one. In datasets where the frame rate differs between videos, a target frame rate of 25 frames per second (fps) determines which preceding frame is used. To obtain the final flow input, we subtract the median of the RAFT output, clip the result to a fixed pixel range, and quantize to 8 bits.
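As a rough sketch of the flow post-processing described here (median subtraction, clipping, 8-bit quantization), consider the function below. The name `normalize_flow`, the per-channel treatment, and especially the default `clip` value are placeholders chosen for illustration; the clipping range used in the paper is not reproduced here.

```python
from statistics import median

def normalize_flow(values, clip=20.0):
    """Quantize one flow channel (flat list of pixel displacements) to 0..255.

    `clip` is a hypothetical placeholder, not the paper's value.
    """
    med = median(values)                    # dominant (camera) motion estimate
    out = []
    for v in values:
        c = max(-clip, min(clip, v - med))  # center, then clip to [-clip, clip]
        out.append(int(round((c + clip) / (2 * clip) * 255)))
    return out
```

Subtracting the median removes motion shared by most pixels in the crop, so the quantized range is spent on the athlete's motion rather than on camera pans.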
During training and inference, the RGB crop is rescaled and standardized with respect to the dataset RGB mean and standard deviation; the optical flow input is also centered. In video frames where the athlete was explicitly detected by Mask R-CNN with a score above 0.8 (see Appendix F), we use the predicted mask to jitter the background with Gaussian noise as data augmentation.
For performance reasons, we pre-compute these quantities offline for the entire corpus.
The auxiliary decoder is a standard fully connected network whose sole purpose is to provide supervision for training the student. We use two hidden layers, each with a dimension of 128. Note that the ablations without motion in Table 2 do not use the decoder and directly optimize the loss between the student’s output and the teacher’s pose features.
The student is initialized with random weights. In each training epoch, we randomly sample 20,000 frames that meet the pose selection criteria outlined in Section E1. We use the AdamW optimizer with a batch size of 100. The student is trained for 1,000 epochs, though in practice the model often converges sooner, and using a higher learning rate is also possible. We use the loss on the held-out validation frames to select the best epoch. On a single Titan V GPU, the student model trains in approximately 8 hours.
Appendix B Implementation: Action Recognition
This section provides details about our fine-grained action recognition models and baselines.
B1 BiGRU Architecture
This is a standard bidirectional-GRU  architecture. The model is trained on sequences of VI-VPD, 2D-VPD, VIPE, and normalized 2D joint position features.
The inputs are variable length sequences of per-frame pose features (for each action). The features are sampled to 25 fps in FX35 and Diving48, where frame rate varies from 25 to 60 fps. FSJump6 is a small dataset and normalizing the features also reduces overfitting.
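One simple way to bring per-frame features to a common 25 fps is nearest-frame resampling, sketched below. The text does not say which resampling scheme was used, so the function `resample_indices` and the nearest-frame choice are assumptions for illustration only.

```python
def resample_indices(num_frames, src_fps, dst_fps=25.0):
    """Indices of the source frames nearest to a uniform dst_fps time grid."""
    duration = num_frames / src_fps
    n_out = max(1, int(round(duration * dst_fps)))
    return [
        min(num_frames - 1, int(round(i * src_fps / dst_fps)))
        for i in range(n_out)
    ]
```

A 50 fps clip keeps every second frame, while a clip that is already at 25 fps is returned unchanged.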
Architecture. We use a two-layer BiGRU as the backbone. The output of the BiGRU is a sequence of hidden states from the final layer. To obtain a fixed-size encoding of this sequence, we max-pool across the time steps. To output an action class, the pooled encoding is sent to a fully connected network consisting of BN-Dropout-FC-ReLU-BN-Dropout-FC, where the output dimension of the final FC layer is the number of output classes.
Training. We train the network with AdamW and a batch size of 50 for 500 epochs (200 on Diving48 due to the larger dataset). The learning rate is adjusted with a cosine schedule. Dropout is applied to the dense layers and to the input sequence. Data augmentation consists of horizontally flipping the input sequences.
On a single Titan V GPU, our model takes 7 minutes to train for FSJump6, 25 minutes for Tennis7, 50 minutes for FX35, and 100 minutes for Diving48 over the full datasets.
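For reference, a cosine learning rate schedule of the kind mentioned in the training setup can be written as below. This is a sketch under assumptions: the schedule is taken to anneal to zero with no warmup, and `base_lr` is a caller-supplied value rather than the paper's setting.

```python
import math

def cosine_lr(step, total_steps, base_lr):
    """Cosine-annealed learning rate, from base_lr at step 0 down to 0."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))
```

The rate starts at `base_lr`, passes through half of it at the midpoint, and reaches zero at the final step.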
Inference. At inference time, we feed the input sequence and its horizontal flip to the model; sum the predictions; and output the top predicted class.
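The flip-and-sum inference step can be sketched as follows. Both `model` (a callable returning per-class scores) and `flip` (a callable producing the horizontally flipped sequence) are hypothetical stand-ins; how the flip is realized depends on the feature type and is not specified by this sketch.

```python
def predict_with_flip(model, sequence, flip):
    """Sum class scores for a sequence and its flip; return the top class index."""
    scores = model(sequence)
    scores_flipped = model(flip(sequence))
    total = [a + b for a, b in zip(scores, scores_flipped)]
    return max(range(len(total)), key=total.__getitem__)
```

Summing the two score vectors makes the prediction symmetric with respect to left and right, which matches the horizontal-flip augmentation used during training.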
B2 Nearest-Neighbor Search
Our nearest-neighbor search (NNS) uses sequence alignment cost with dynamic time warping (DTW).
The inputs are the same as in Section B1, but with each feature vector normalized to unit length.
Inference. We treat the training set as an index. Alignment cost between two sequences of features, normalized by sequence length, is calculated using DTW with a pairwise distance between feature vectors and the symmetricP2 step pattern. Combinations of the regular and horizontally flipped pose sequences in the testing set and training set are considered, with the lowest cost match returned.
Because the computational complexity of inference grows linearly with training set size, this method is unsuited for larger datasets with more examples or classes. DTW is also sensitive to factors such as the precision of the temporal boundaries and the duration of the actions.
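The search procedure can be illustrated with a small pure-Python sketch. Note the substitution: for brevity it uses the basic symmetric step pattern (diagonal steps weighted by 2, normalized by the sum of sequence lengths) rather than the slope-constrained symmetricP2 pattern named above, and it assumes Euclidean distance between feature vectors. The function names are illustrative.

```python
import math

def l2(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_cost(seq_a, seq_b, dist=l2):
    """Length-normalized DTW cost with the basic symmetric step pattern."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    g = [[inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            g[i][j] = min(g[i - 1][j] + d,          # vertical step
                          g[i][j - 1] + d,          # horizontal step
                          g[i - 1][j - 1] + 2 * d)  # diagonal step
    return g[n][m] / (n + m)

def nearest_neighbor(query_variants, index):
    """index: list of (label, [sequence variants]). Returns (cost, label).

    Variants hold the regular and flipped versions of each sequence.
    """
    best = (float("inf"), None)
    for label, variants in index:
        for q in query_variants:
            for s in variants:
                c = dtw_cost(q, s)
                if c < best[0]:
                    best = (c, label)
    return best
```

Each query is compared against every indexed sequence, which is why the inference cost grows with the size of the training set.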
B3 Additional Baselines
We evaluated ST-GCN, MS-G3D, multiscale TRN, and GSM on our datasets using the reference implementations released by the authors. For TSN, we used the code from the authors of GSM. The GSM codebase extends the TRN and TSN frameworks, and we backported ancillary improvements (e.g., learning rate schedule) to the TRN codebase for fairness.
Skeleton based. The inputs to ST-GCN and MS-G3D are the tracked 2D skeletons of only the identified athlete. For MS-G3D, we trained both the bone and joint feature models and reported their ensemble accuracy. Ensemble accuracy exceeded the separate accuracies in all of our experiments.
End-to-end. We follow the best Diving48 configuration in the GSM paper for the GSM, TSN, and TRNms baselines. This configuration uses 16 frames, compared to 3 to 7 in earlier work, and samples 2 clips at inference time. As seen in previously reported benchmarks, additional frames are immensely beneficial for fine-grained action recognition tasks compared to coarse-grained tasks, where the class can often be guessed in a few frames from context [10, 63]. The backbone for these baselines is an InceptionV3, initialized using pretrained weights.
When comparing to TSN and TRN with optical flow, we train using the same cropped flow images as VPD, described in Appendix A. Flow and RGB model predictions are ensembled to obtain the 2-stream result. Recent architectures that model temporal information in RGB, such as GSM, often perform as well as or better than earlier flow based work.
Appendix C Implementation: Action Retrieval
The search algorithm for action retrieval is identical to nearest neighbor search described in Section B2, for action recognition, except that the pose sequence alignment scores are retained for ranking.
Query set. For FSJump6, Tennis7, and FX35 we evaluate with the entire corpus as queries. For the much larger Diving48 dataset, we use the 1,970 test videos as queries.
Appendix D Implementation: Action Detection
We evaluated pose features for few-shot figure skating jump and tennis swing detection. Our method should be interpreted as a baseline approach to evaluate VPD features, given the lack of prior literature on temporally fine-grained, few-shot video action detection, using pose features. More sophisticated architectures for accomplishing tasks such as generating action proposals and refining boundaries are beyond the scope of this paper.
The inputs are the uncut, per-frame pose feature sequences. For figure skating, the sequences are entire, 160 second long, short programs. ISU  scoring rules require that each performance contains two individual jumps and a jump combination (two jumps). For tennis, each point yields two pose sequences, one for each player. The points sampled for training have at least five swings each per player.
For the ResNet-3D baseline, we extracted features for each frame using a Kinetics-400 pretrained model on the subject crops, with a window of eight frames. A limitation of this baseline is that actions (e.g., tennis swings) can be shorter than the temporal window.
Architecture. We use a two-layer BiGRU as the backbone. The hidden states at each time step from the final GRU layer are sent to a fully connected network consisting of BN-Dropout-FC-ReLU-BN-Dropout-FC, where the final FC layer has an output dimension of 2 (a binary label for whether the frame is part of an action).
Training. The BiGRU is trained on randomly sampled sequences of 250 frames from the training set. We use a batch size of 100 with the AdamW optimizer. We apply dropout on the dense layers and on the input sequence. Because only five examples are provided in this few-shot setting, we use five-fold cross validation to train an ensemble.
The reported results are an average of separate runs on five randomly sampled, fixed few-shot dataset splits.
Inference. We apply the trained BiGRU ensemble to the uncut test videos to obtain averaged frame-level activations. Consecutive activations above 0.2 are selected as proposals; the low threshold is due to the large class imbalance, because actions represent only a small fraction of total time. A minimum proposal length of three frames is required. The mean action length in the training data is also used to expand or trim proposals that are too short or too long.
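The proposal step can be sketched as below. The 0.2 threshold and three-frame minimum come from the text; the `min_ratio` and `max_ratio` bounds that decide when a proposal counts as too short or too long are hypothetical placeholders, as is the choice to resize proposals about their center.

```python
def propose_actions(activations, threshold=0.2, min_len=3,
                    mean_len=None, min_ratio=0.5, max_ratio=2.0):
    """Return [start, end) frame intervals of consecutive high activations.

    min_ratio and max_ratio are illustrative, not the paper's values.
    """
    acts = list(activations)
    proposals, start = [], None
    # A trailing sentinel closes a run that reaches the end of the video.
    for t, a in enumerate(acts + [float("-inf")]):
        if a > threshold and start is None:
            start = t
        elif a <= threshold and start is not None:
            if t - start >= min_len:
                proposals.append((start, t))
            start = None
    if mean_len is None:
        return proposals
    adjusted = []
    for s, e in proposals:
        # Clamp the length to [min_ratio, max_ratio] times the mean length.
        target = min(max(e - s, min_ratio * mean_len), max_ratio * mean_len)
        center = (s + e) / 2.0
        new_s = max(0, int(round(center - target / 2.0)))
        new_e = min(len(acts), int(round(center + target / 2.0)))
        adjusted.append((new_s, new_e))
    return adjusted
```

Runs shorter than the minimum length are discarded before any resizing, so isolated spikes in the activations do not become proposals.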
|Score||% All||% Action||Full||16-shot||% All||% Action||Full||16-shot||% All||Full||16-shot|
| Features \ Model | BiGRU | NNS | BiGRU | NNS | BiGRU | NNS | BiGRU | NNS |
| Normalized 2D joints | 38.5 ± 3.7 | 50.8 ± 6.1 | 60.1 ± 4.5 | 65.3 ± 4.5 | 72.5 ± 3.9 | 71.7 ± 3.9 | 89.7 ± 0.9 | 79.7 ± 1.8 |
| (Ours) 2D-VPD | 43.2 ± 5.2 | 50.7 ± 5.8 | 66.1 ± 1.1 | 70.3 ± 3.7 | 74.4 ± 3.0 | 75.7 ± 1.5 | 90.8 ± 1.9 | 84.1 ± 1.2 |
| VIPE | 51.1 ± 3.0 | 64.3 ± 5.0 | 69.7 ± 2.9 | 75.7 ± 3.6 | 80.5 ± 3.5 | 78.3 ± 2.6 | 91.3 ± 1.7 | 84.5 ± 1.3 |
| (Ours) VI-VPD | 54.4 ± 5.0 | 65.9 ± 5.5 | 71.4 ± 1.7 | 78.4 ± 2.5 | 80.2 ± 1.9 | 81.1 ± 2.5 | 92.2 ± 1.2 | 86.2 ± 0.7 |
| Normalized 2D joints | 48.0 ± 1.9 | 54.2 ± 3.4 | 58.5 ± 3.0 | 57.0 ± 5.5 | 64.4 ± 2.6 | 63.0 ± 2.8 | 69.7 ± 2.6 | 64.6 ± 2.3 |
| (Ours) 2D-VPD | 53.0 ± 3.3 | 57.0 ± 3.4 | 62.0 ± 1.7 | 61.3 ± 4.8 | 66.9 ± 1.7 | 65.0 ± 2.0 | 71.5 ± 2.4 | 67.2 ± 1.5 |
| VIPE | 61.4 ± 4.1 | 62.4 ± 4.4 | 65.8 ± 3.4 | 65.6 ± 3.5 | 67.0 ± 2.8 | 68.8 ± 4.3 | 73.2 ± 2.3 | 70.1 ± 2.0 |
| (Ours) VI-VPD | 63.9 ± 6.1 | 62.4 ± 4.5 | 65.5 ± 4.5 | 66.1 ± 3.5 | 71.1 ± 2.4 | 68.4 ± 3.5 | 76.3 ± 2.0 | 70.3 ± 1.8 |
| Normalized 2D joints | 37.6 ± 1.2 | 38.0 ± 1.9 | 54.8 ± 2.6 | 45.8 ± 1.2 | 65.6 ± 0.9 | 52.8 ± 1.4 | 75.3 ± 0.9 | 59.0 ± 0.6 |
| (Ours) 2D-VPD | 51.2 ± 1.0 | 47.4 ± 2.1 | 70.0 ± 1.2 | 54.9 ± 1.5 | 82.7 ± 0.6 | 63.9 ± 1.4 | 88.8 ± 0.8 | 69.7 ± 0.5 |
| VIPE | 49.7 ± 0.7 | 43.0 ± 1.7 | 62.5 ± 2.1 | 49.1 ± 0.9 | 75.7 ± 0.4 | 54.3 ± 1.2 | 81.8 ± 0.5 | 59.7 ± 1.3 |
| (Ours) VI-VPD | 59.3 ± 1.9 | 51.0 ± 1.1 | 73.0 ± 0.6 | 57.1 ± 1.3 | 84.9 ± 0.5 | 65.4 ± 1.5 | 89.1 ± 0.6 | 70.6 ± 0.7 |
| Normalized 2D joints | 12.6 ± 1.2 | 13.3 ± 1.4 | 13.3 ± 1.2 | 15.3 ± 0.8 | 25.5 ± 3.5 | - | 44.2 ± 0.9 | - |
| (Ours) 2D-VPD | 27.6 ± 2.6 | 18.4 ± 2.4 | 29.4 ± 1.2 | 22.8 ± 1.4 | 57.6 ± 6.5 | - | 76.6 ± 0.9 | - |
| VIPE | 17.0 ± 1.6 | 12.9 ± 1.6 | 18.8 ± 1.0 | 16.1 ± 1.3 | 35.0 ± 4.5 | - | 53.2 ± 1.4 | - |
| (Ours) VI-VPD | 29.2 ± 2.5 | 16.9 ± 2.1 | 34.0 ± 1.2 | 21.2 ± 1.0 | 58.8 ± 3.6 | - | 76.7 ± 0.8 | - |
| Figure skating jumps |
| Pretrained R3D | 23.3 | 23.1 | 18.2 | 16.6 | 14.2 | 12.8 | 10.0 | 7.4 | 5.9 |
| Normalized 2D joints | 57.1 | 53.4 | 50.4 | 46.9 | 42.2 | 37.9 | 33.3 | 27.0 | 20.3 |
| Tennis swings at 200 ms |
| Pretrained R3D | 32.1 | 31.5 | 30.9 | 29.9 | 28.5 | 27.4 | 26.4 | 25.1 | 22.3 |
| Normalized 2D joints | 46.8 | 45.3 | 44.4 | 43.7 | 43.6 | 43.1 | 41.8 | 40.1 | 37.7 |
| Architecture \ Features | VIPE | VI-VPD | VIPE | VI-VPD | VIPE | VI-VPD | VIPE | VI-VPD |
| NNS (w/ DTW) [Section B2] | 90.6 | 92.7 | 89.1 | 88.6 | 71.8 | 81.2 | - | - |
| BiLSTM (w/ attn) | 97.3 | 97.9 | 90.7 | 92.0 | 88.8 | 93.9 | 76.8 | 87.5 |
| BiGRU [Section B1] | 96.8 | 97.4 | 91.8 | 93.3 | 90.8 | 94.6 | 78.6 | 88.6 |
| BiGRU (w/ attn) | 96.8 | 98.3 | 91.1 | 92.5 | 89.5 | 94.3 | 77.5 | 88.0 |
| 8 | 9.5 ± 1.3 | 15.1 ± 0.8 | +5.5 |
| 16 | 21.8 ± 1.5 | 33.0 ± 0.9 | +11.2 |
| 32 | 49.5 ± 3.1 | 59.3 ± 2.4 | +9.8 |
| 64 | 72.6 ± 1.0 | 75.6 ± 1.2 | +3.0 |
Appendix E Additional Experiments
This section includes results of additional ablations, analysis, and baselines omitted from the main text.
E1 Ablation: Data Selection Criterion
Mean estimated joint score from the teacher pose estimator is used as the weak-pose selection criterion. Figure 4 shows the distribution of such scores in each of the four sports datasets. Notice that the teacher produces significantly less confident pose estimates on the floor exercise (FX35) and Diving48 datasets, and also on the labeled action portions of all four datasets.
While the optimal selection threshold is ultimately dependent on the calibration and quality of the pose estimator used, we evaluate the effect of tuning the weak-pose selection criterion on three of our datasets: Tennis7, FX35, and Diving48. Table 5 shows results with VI-VPD when various thresholds are applied. There is benefit to ignoring the least confident pose estimates, though setting the threshold too high also diminishes performance, as insufficient data remains to supervise the student.
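A minimal sketch of the selection criterion follows: frames are kept when the teacher's mean joint score clears a threshold, and each epoch draws a random subset of the kept frames. The function names and the threshold passed by the caller are illustrative; the 20,000 frames per epoch figure is taken from Appendix A.

```python
import random

def select_frames(joint_scores, threshold):
    """Indices of frames whose mean joint confidence is at least `threshold`.

    joint_scores: one list of per-joint scores for each frame.
    """
    return [i for i, s in enumerate(joint_scores)
            if sum(s) / len(s) >= threshold]

def sample_epoch(frame_ids, k=20000, seed=None):
    """Randomly sample up to k selected frames for one training epoch."""
    rng = random.Random(seed)
    return rng.sample(frame_ids, min(k, len(frame_ids)))
```

Raising the threshold shrinks the pool returned by `select_frames`, which is the trade-off between label quality and quantity discussed above.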
E2 Ablation: NNS vs. BiGRU for Recognition
E3 Ablation: Activation Threshold for Detection
In Appendix D, we use a frame-level activation threshold of 0.2 when proposing action intervals for few-shot action detection. Table 7 shows the impact on average precision (AP) of other thresholds, scored at 0.5 temporal intersection over union (tIoU). The results are similar at nearby thresholds and results at 0.2 are reported for consistency.
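For readers unfamiliar with the metric, temporal intersection over union between a proposal and a ground-truth interval can be computed as in this short sketch (the function name and the `(start, end)` tuple layout are illustrative):

```python
def temporal_iou(a, b):
    """tIoU of two (start, end) intervals in frames or seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

Under the 0.5 tIoU criterion, a proposal spanning (0, 10) against a ground truth of (5, 15) would not count as a match, since their overlap is only one third of their union.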
E4 Ablation: Action Recognition Architectures
The BiGRU described in Section B1 was used in our experiments for consistency. This section includes a number of additional simple, well-studied architectures that we also tested. Results from these models are given in Table 8 and are often similar; the BiGRU is not necessarily the best performing model in all situations. As Section 4.1 shows, however, the BiGRU is competitive with recent, state-of-the-art methods when trained with VIPE or our VI-VPD features.
E5 Baseline: GSM Without Cropping on Diving48
In Section 4.1.1, on few-shot action recognition, we reported results from GSM with cropping. This is despite GSM, without cropping, having higher accuracy in the full supervision setting on Diving48 (see Table 1). Table 9 shows that GSM, with cropping, is the stronger baseline when limited supervision is available.
We speculate that cropping forces the GSM model to focus on the diver in few-shot settings. In the full supervision setting, the GSM model can learn this information by itself and is limited by noise in the crops and the loss of other information from the frame (e.g., the other diver in synchronized diving; the 3 metre springboard or 10 metre platform; and spatial information).
E6 Analysis: Visualizing Distilled 2D Pose
Although the goal of this paper is to distill pose features for downstream tasks, this section provides preliminary qualitative results on how well distilled features mimic their teachers and reflect the explicit 2D pose. Because the learned VIPE and VPD features are not designed to be human interpretable, we use normalized 2D joint positions (described in Appendix A) as the teacher instead, and we train an ablated student without the auxiliary decoder for motion.
Figure 5 compares the teacher’s normalized 2D joint features to the student’s distilled outputs. Visible errors in the student’s predictions show that the distillation method presented in this paper does not solve the explicit 2D pose estimation problem in challenging sports data. However, solving this explicit task is not necessarily required to improve results in downstream tasks that depend on pose.
Appendix F Additional Dataset Details
This section provides additional details about the fine-grained sports video datasets used in the results section.
Figure skating is a new dataset that contains the jumps in 371 singles short programs. Because professional skaters often repeat the same routine in a competitive season, all performances from 2018 are held out for testing.
The six jump types that occur in the FSJump6 dataset are: Axel, flip, loop, Lutz, Salchow, and toe-loop (see Table 11). The labels are verified against the ISU’s publicly accessible scoring data. For the classification task, the average label duration is 3.3 seconds and includes the poses from before taking off and after landing.
Tennis consists of Vid2Player’s swing annotations in nine matches. For action recognition, Tennis7 has seven swing types: forehand topspin, backhand topspin, forehand slice, backhand slice, forehand volley, backhand volley, and overhead. Note that the distribution of actions in tennis is unbalanced, with forehand topspin being the most common. Serves are intentionally excluded from the action recognition task because they always occur at the start of points and do not need to be classified. For swing detection, however, serves are included.
All action recognition models receive a one second interval, centered around the frame of ball contact for the swing.
Floor exercise. We use the videos, labels, and official train/validation split from the floor exercise event of FineGym99. We focus on floor exercises (FX35) because the data is readily tracked and because the FineGym99 authors report accuracies on this subset. Because actions are often short, for each action, we extracted frames from 250 ms prior to the annotated start time to the end time, and we use these frames as the inputs to our methods and the baselines.
Diving48 contains both individual and synchronized diving. We use the standard train/validation split. For synchronized diving, we track either diver as the subject, and tracks can flicker between divers due to missed detections. Tracking is the most challenging in this dataset because of the low resolution, motion blur, and occlusion upon entering the water. Also, because the clips are short, it is more difficult to initialize tracking heuristics that utilize periods of video before and after an action, where the athlete is more static and can be more easily detected and identified.
To focus on the athletes, we introduce subject tracking to the figure skating, floor exercise, and Diving48 datasets. Our annotations are created with off-the-shelf person detection and tracking algorithms. First, we run a Mask R-CNN detector with a ResNeXt-152-32x8d backbone on every frame to detect instances of people. We use heuristics such as “the largest person in the frame” (e.g., in figure skating, floor exercise, and diving) and “upside down pose” (e.g., in floor exercise and diving) to select the athlete. These selections are tracked across nearby frames with bounding box intersection-over-union, SORT, and OpenCV object tracking (CSRT) when detections are missed. This heuristic approach is similar to the one taken by the authors of Vid2Player.
Example images of tracked and cropped athletes are shown in Figure 6. We run pose estimation on the pixels contained in and around the tracked boxes.
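Two of the heuristics mentioned above, selecting the largest person and linking detections across frames by bounding box intersection-over-union, can be sketched as follows. The `(x1, y1, x2, y2)` box layout, the function names, and the `min_iou` cutoff are assumptions; the SORT and CSRT fallbacks are not shown.

```python
def box_area(b):
    """Area of an (x1, y1, x2, y2) box."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def largest_person(boxes):
    """'Largest person in the frame' heuristic; None if nothing was detected."""
    return max(boxes, key=box_area) if boxes else None

def link_by_iou(prev_box, boxes, min_iou=0.3):
    """Pick the detection that best overlaps the previous frame's box.

    min_iou is a hypothetical cutoff; returns None when no box overlaps enough.
    """
    best = max(boxes, key=lambda b: box_iou(prev_box, b), default=None)
    if best is None or box_iou(prev_box, best) < min_iou:
        return None
    return best
```

When `link_by_iou` returns None, a tracker would need to fall back to another mechanism, which is the role the text assigns to SORT and CSRT.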
Appendix G VIPE Details
We provide details of VIPE, which is used as the teacher for our view-invariant VI-VPD student. We use VIPE because the evaluation code and documentation for Pr-VIPE had not been released at the time of development. The experiments in this section demonstrate that VIPE is a suitable substitute, following the Pr-VIPE evaluation on coarse-grained action recognition.
Overview. View-invariant pose embedding (VIPE) methods embed 2D joints such that different camera views of the same pose in 3D are similar in the embedding space. VIPE is trained via 3D lifting to canonicalized features (w.r.t. rotation and body shape). We designed VIPE to train on multiple (publicly available) datasets with differing 3D joint semantics; we use Human3.6M as well as synthetic pose data from 3DPeople, AMASS, and NBA2K.
Inputs. VIPE learns view-invariant embeddings by regressing 3D joint features from 2D joint pose. The 2D joint inputs are the 13 COCO keypoints (excluding eyes and ears), normalized as in Appendix A. To obtain canonicalized 3D features, first, we rotate the 3D pose around the vertical axis, aligning the torso-normal vector to the depth axis. Then, we normalize each joint as two unit-length offsets from its parent and from the hip (centered to 0). We also concatenate the cosine bone angle at each 3D joint. These transformations standardize 3D poses with respect to body appearance and camera view.
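The rotation step of this canonicalization can be sketched as below. The axis convention (y vertical, z depth) and the assumption that a torso-normal vector is already available are illustrative choices; how the normal is derived from the joints is not shown, and the offset normalization and bone angles are omitted.

```python
import math

def canonicalize_yaw(joints, torso_normal):
    """Rotate 3D joints about the vertical (y) axis so that the horizontal
    part of `torso_normal` points along the depth (+z) axis.

    Assumed convention: joints are (x, y, z) with y vertical and z depth.
    """
    nx, _, nz = torso_normal
    theta = math.atan2(-nx, nz)   # yaw that zeroes the normal's x component
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c + z * s, y, -x * s + z * c) for x, y, z in joints]
```

After this rotation, two views of the same pose that differ only in camera azimuth map to the same 3D coordinates, which is what makes the regression target view-invariant.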
We use a fully-connected decoder that takes embeddings as input. This decoder is discarded after training. To support multi-task training with 3D datasets with different ground-truth joint semantics, we specialize the output layer weights for each dataset.
Contrastive embedding loss. We minimize the pairwise distance between embeddings of different 2D views of the same 3D pose (positive pairs). We also negatively sample pairs of 2D poses, corresponding to different 3D poses in each action sequence, and maximize their embedding distance. Two 3D poses are considered to be different if one of their joint-bone angles differs by 45° or more.
Substitute for Pr-VIPE. We compare VIPE’s performance to the coarse-grained action recognition results reported by [53, 72] on the Penn Action dataset. Our results suggest parity with Pr-VIPE when trained with Human3.6M only and a small improvement from extra synthetic data. VIPE has 98.2% top-1 accuracy (compared to 98.4%, the best result for Pr-VIPE) when trained on the same subjects of the Human3.6M dataset and using nearest-neighbor search as the action recognition method (see Section B2). VIPE obtains 98.6% accuracy when trained with extra synthetic 3D data. The saturated accuracies of VIPE, Pr-VIPE, and other prior work on the Penn Action dataset suggest that more challenging datasets, such as fine-grained sports, are needed to evaluate new techniques.
For fine-grained action recognition in sports, additional synthetic 3D data improves VIPE (Table 12). This is especially notable on FX35 and Diving48, which contain a variety of poses that are not well represented by Human3.6M. We use VIPE, improved with the synthetic 3D data, as the teacher for all of our VI-VPD experiments.
|VIPE training data|