Action recognition has made remarkable progress over the past few years [I3D, slowfast, TwoStream, TSN]. Most state-of-the-art methods [I3D, hara2018can, R2+1D] are built upon deep spatio-temporal convolutional architectures applied on short clips of RGB frames. These approaches achieved impressive classification performance, with a top-1 accuracy over 77% on the Kinetics dataset [Kinetics], and a top-5 accuracy of more than 93%. However, the explanations behind such performances remain unclear. In particular, recent works [resound, repair] have shown that most datasets, and thus what Convolutional Neural Networks (CNNs) learn, are biased by static context such as scenes and objects. For instance, Figure 1
shows some examples where context is only partial, absent or misleading, and that are misclassified by state-of-the-art 3D CNNs. In particular, the last video taking place on a soccer field is classified as shooting goal (soccer), regardless of the actual action performed in the video.
To further assess the bias of existing datasets towards scenes and objects, we retrain a model on Kinetics after masking out all the humans in the videos, see Figure 2. The performance of this model on the original test set is around 65%, which is extremely high for a model that has never seen any human at training. This shows that scenes and objects are often sufficient to correctly classify the actions.
While this contextual information is certainly useful to predict human actions, it is not sufficient to truly understand what is happening in a scene. Humans have a more complete understanding of actions and can even recognize them without any context, object or scene. The most obvious example is given by mime artists, see middle row of Figure 1, who can suggest emotions or actions to the audience using only facial expressions, gestures and movements, but without words or context. Mime as an art originates from ancient Greece and reached its height with the sixteenth-century Commedia dell’Arte, but it is considered one of the earliest mediums of expression, predating even the appearance of spoken language. We claim that an intelligent system should also be able to understand mimed actions.
To understand actions in out-of-context scenarios, i.e., when object and scene are absent or misleading as shown in Figure 1, action recognition can only rely on body language captured by human pose and motion. In particular, 3D action recognition methods [du2015hierarchical, liu2016spatio, zhu2016co], which take 3D pose skeleton sequences as input, have shown impressive results, validating that contextual information is not always necessary to recognize actions. However, these methods are usually trained and tested on accurate and scripted sequences of 3D human poses, captured with RGB-D sensors [NTU] or Motion Capture systems [du2015hierarchical, zhu2016co] in constrained and unrealistic environments. To the best of our knowledge, 3D action recognition has never been applied to real-world situations and videos captured in the wild. Recent human pose estimation methods [VNect, LCRNet++] make it possible to estimate the 3D poses of multiple people from a single image. In this paper, we follow [Smarthome] and employ LCR-Net++ [LCRNet++], which has shown robustness to challenging cases like occlusions and truncations by image boundaries, estimating full-body 2D and 3D poses for every person in an image. We compare three different baselines for action recognition based on these poses. The most intuitive pipeline is to detect 3D human poses in every frame, build 3D pose sequences by linking detections over time, and apply a state-of-the-art 3D action recognition algorithm. However, such a method is likely to be sensitive to the inherent noise when estimating 3D poses in the wild. The second baseline applies graph convolutions on 2D pose sequences, without 3D information, which might have the advantage of being more accurate. We finally study another approach where 1D temporal convolutions are applied on human-level intermediate pose feature representations from LCR-Net++.
In other words, we transfer the features learned for 2D-3D pose estimation to action recognition: they typically contain information about the human poses without explicitly representing them as body keypoint coordinates.
To benchmark action recognition methods in out-of-context scenarios, we introduce the Mimetics dataset (https://europe.naverlabs.com/research/computer-vision/mimetics/). It contains over 700 video clips of mimed actions for a subset of 50 classes from the Kinetics dataset. Mimetics makes it possible to evaluate, on mimed actions, models that have been trained on Kinetics. For further analysis, we additionally annotate for each clip whether an object gives clues on the action or not, and similarly for the scene. We evaluate a state-of-the-art 3D convolutional network, and confirm that these models are biased towards scenes and objects. Pose-based action recognition provides a more interpretable output but can lack the fine-grained pose details needed for higher performance.
This paper is organized as follows. After reviewing related work in Section 2, we study the bias of state-of-the-art action recognition datasets and models in Section 3. Section 4 then presents various pose-based baselines and compares them on existing action recognition datasets. Finally, Section 5 introduces the Mimetics dataset and analyzes the performance on out-of-context action recognition.
2 Related work
We benchmark action recognition approaches, comparing standard CNNs on RGB clips with pose-based methods. This latter category can be further split into 2D pose-based approaches and 3D action recognition.
Action classification in real-world videos. Different strategies have been deployed to handle video processing with CNNs, such as two-stream architectures [FeichtenhoferConvolutional, TwoStream], Recurrent Neural Networks (RNNs) [Donahue_LSTM], or spatio-temporal 3D convolutions [I3D, slowfast, C3D]. Simonyan and Zisserman [TwoStream] introduced a two-stream architecture with 2D convolutions, in which one stream captures appearance information from RGB inputs while the second one operates on an optical flow representation and models motion. While improvements of this approach have been proposed [FeichtenhoferConvolutional], most state-of-the-art methods now use a 3D deep convolutional network [I3D, C3D, R2+1D, XieRethinking], optionally in combination with a two-stream architecture. Compared to 2D convolutions, 3D convolutions make it possible to leverage spatio-temporal information at the cost of a higher number of parameters and a higher computational cost. With recent very large-scale datasets such as Kinetics [Kinetics], it is possible to train such 3D CNNs effectively [hara2018can], and impressive performance can be obtained even on small datasets, thanks to pretraining on Kinetics [I3D]. For instance, I3D [I3D] achieved state-of-the-art accuracy on HMDB51 [HMDB] and UCF101 [UCF101] using a two-stream network with a 3D Inception backbone [Inception]. Du et al. [R2+1D] and Xie et al. [XieRethinking] replaced 3D convolutions with separate spatial and temporal convolutions, which reduces the number of parameters to learn. However, all these methods lack a clear explanation of their classification choices. In particular, recent studies [resound, repair] suggest that they tend to leverage dataset biases instead of focusing on the human action.
2D pose for action classification in real-world videos. An insightful diagnostic to understand what affects the action recognition results most was provided by Jhuang et al. [JHMDB], who found that high-level 2D pose features greatly outperform low/mid level features. This has motivated further research on incorporating 2D body pose information in real-world action recognition models [ActionXPose, PCNN, rpan, iqbal2017pose, StarNet, ActionMachine]. For instance, this can be done by pooling features [cao2016action, PCNN] or defining an attention mechanism [rpan, girdharNIPS17]. However, this leads to limited gains and often assumes that humans are fully visible. Zolfaghari et al. [ChainedMS] trained a 3D CNN on human part segmentation inputs, and added a third stream to two-stream networks. Some other recent methods have shown improved action recognition performance by incorporating 2D pose information from off-the-shelf pose detectors [PoTion, liu2018recognizing, wang2018pose]. For instance, Choutas et al. [PoTion] and Liu et al. [liu2018recognizing] extract joint heatmaps and encode their evolution over time. Wang et al. [wang2018pose] define a two-stream network: one stream encodes the evolution of the pose while the second one models relationships with objects. However, it remains limited to single-person action recognition. Luvizon et al. [luvizon20182d] propose a multi-task architecture where 2D poses are predicted at the same time as appearance features are pooled over body joints for action recognition.
3D action recognition. Compared to 2D poses, 3D poses have the advantage of being unambiguous and of better handling motion dynamics. Recent attempts at 3D action recognition have employed RNNs to handle sequential data and to model the contextual dependencies in the temporal domain [du2015hierarchical, liu2016spatio, si2018skeleton, weng2018deformable, zhu2016co]. Du et al. [du2015hierarchical] propose a hierarchical RNN in which the human skeleton is divided into five parts (arms, legs and trunk) that feed five different subnets, later fused hierarchically. Zhu et al. [zhu2016co] added a mixed-norm regularization term to an RNN cost function in order to learn the co-occurrence features of skeleton joints for action classification. More recently, simple CNN-based methods applied to the 2D or 3D joint coordinates have been shown to outperform more complex RNN architectures [du2015skeleton]. In a similar spirit, Yan et al. [STGCN] represent the sequence of poses as a graph, and apply a spatio-temporal graph convolutional network (STGCN) to recognize actions. Most of these algorithms use 3D human poses obtained from a Motion Capture system [du2015hierarchical, zhu2016co], a Kinect sensor [liu2016spatio] or a multi-camera setting [yao2012coupled], and none of them has been evaluated on real-world videos with estimated 3D poses.
To the best of our knowledge, we are the first to analyze 3D action recognition in real-world videos. Yan et al. [STGCN] show that their STGCN method can also be applied in the wild, but they only use 2D poses in this scenario. More precisely, they extract 2D human poses with OpenPose [OpenPose], build a graph using the 2 highest-scored detections per frame, and apply their spatio-temporal graph network, replacing the (X, Y, Z) coordinates of the 3D poses by (x, y, s), where (x, y) are the normalized 2D coordinates of the joint and s is the score for this keypoint. In the framework of Luvizon et al. [luvizon20182d], the multi-task architecture can deal with 2D and 3D poses at the same time as action recognition. However, ground-truth keypoints are required for training, and the 3D component is disabled for in-the-wild datasets, i.e., those without 3D ground-truth poses.
3 Context biases in action recognition
To assess how much context is leveraged by current methods based on spatio-temporal CNNs, we consider videos where people are masked out. To do so, we extracted human tubes in all videos using LCR-Net++ [LCRNet++] detections linked over time (see Section 4.1 for a detailed description) and removed all the humans from the video frames by colouring the tubes content in grey, see Figure 2.
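The masking step can be illustrated with a minimal sketch. It assumes that each frame is a nested list of RGB pixels and that the boxes of the human tubes are already available for that frame; the function name `mask_humans`, the box format and the grey value (128, 128, 128) are illustrative assumptions, not details taken from the experiments.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Frame = List[List[Pixel]]          # frame[y][x] is an RGB pixel
Box = Tuple[int, int, int, int]    # (x1, y1, x2, y2), x2 and y2 exclusive

GREY: Pixel = (128, 128, 128)      # assumed fill value; the text only says "grey"


def mask_humans(frame: Frame, boxes: List[Box], fill: Pixel = GREY) -> Frame:
    """Return a copy of `frame` in which every pixel inside a box is set to `fill`."""
    height = len(frame)
    width = len(frame[0]) if height else 0
    masked = [list(row) for row in frame]
    for x1, y1, x2, y2 in boxes:
        # Clip the box to the image so that truncated people are handled.
        x1, y1 = max(0, x1), max(0, y1)
        x2, y2 = min(width, x2), min(height, y2)
        for y in range(y1, y2):
            for x in range(x1, x2):
                masked[y][x] = fill
    return masked
```

Applying such a function to every frame, with the boxes of all tubes that cover that frame, would give the kind of human-free videos described above; a practical implementation would more likely use array slicing than Python loops.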
We performed this experiment on the standard Kinetics dataset [Kinetics] which consists of around 240k training videos, 20k for validation and 40k for testing for a total of 400 classes. As a state-of-the-art model, we use a 3D CNN model, i.e., with spatio-temporal convolutions instead of 2D convolutions, using a ResNeXt-101 backbone [resnext]. We first evaluate a 3D CNN trained on original videos and tested on masked videos, thus measuring the biases learned by the model. Mean top-1 accuracy on the validation set is reported in Table 1. It remains close to 40%, which is extremely high given that there is no human from which the action can be recognized in the test videos. This prediction is thus based on the remaining content of the video, i.e., context such as objects or scenes.
To better measure the biases of the dataset itself, we have trained a 3D CNN model on the masked videos and obtain 65.7% on the original videos, down by only 8.8% compared to training on the original data. This performance is outstanding for a model that has not seen any human during training, and therefore has not really seen any action. To further analyze this aspect, we additionally show in Table 2 the classes with the largest increase in accuracy. Masking the actors at training increases the accuracy for classes in which the scene context (e.g. long jump, playing basketball) or the presence of large objects (e.g. driving tractor) is sufficient to recognize the actions, see also Figure 2.
|tying knot (not on a tie)||64.0||72.0||+8.0|
Such a bias problem can be tackled by sampling over multiple datasets or reweighting samples, as shown for action [resound, repair] or object recognition [hyojin2019arxiv, undoing, unbiased]. For action recognition, another direction is to leverage body language, which is not affected by this context bias.
4 Real-world 3D action recognition baselines
We benchmark three baselines that all require the extraction of human tubes (Section 4.1). We present two different methods that employ a spatio-temporal graph convolutional network, on explicit 3D (Section 4.2) or 2D (Section 4.3) pose sequences respectively. Next, we introduce a third approach that consists of a single 1D temporal convolution applied on mid-level implicit pose features (Section 4.4). Finally, we present experimental results on existing benchmarks in Section 4.5.
4.1 Extracting human tubes
Overview of LCR-Net++. We build our tube extraction and pose estimation upon LCR-Net++ [LCRNet++], which leverages a Faster R-CNN like architecture [Faster] with a ResNet-50 backbone [ResNet]. A Region Proposal Network extracts candidate boxes around humans. These regions are then classified into different so-called ‘anchor poses’ that replace standard object classes: these key poses typically correspond to a person standing, a person sitting, etc. Poses are then refined using a regression branch that takes as input the same features used for classification. Anchor poses are defined jointly in 2D and 3D, and the refinement occurs in this joint 2D-3D pose space. The detection framework makes it possible to handle multiple people in a scene. As the approach is holistic, it outputs full-body poses, even in case of occlusions or truncation by image boundaries. We use the real-time model released by the authors (http://thoth.inrialpes.fr/src/LCR-Net/), allowing experiments on large-scale datasets.
Tube extraction. In order to leverage the evolution of poses over time, one needs to track each individual, i.e., to obtain human tubes [HumanTubes]. We proceed by first running LCR-Net++ in every frame and follow standard procedures used in the spatio-temporal action localization literature [ACT, singh2017online] to link detections over time. Starting from the highest scored detection, we match it with the detections in the next frame based on the Intersection-over-Union (IoU) between boxes. We link it if the IoU is over a fixed threshold. Otherwise, we match it to the frame after, and perform linear interpolation in the missing frames. We stop a tube if there was no match during a fixed number of consecutive frames. This procedure is run forward and backward to obtain a human tube. We then delete all detections in this first link, and repeat the procedure for the remaining detections. At training, we label the tubes with the video class. At test time, for each video and for each class, we take the maximum score over all tubes.
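The linking procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the released implementation: the names `link_tubes`, `extend` and `video_scores` are hypothetical, and the IoU threshold `iou_thr` and the number of tolerated unmatched frames `max_gap` are left as parameters because their values are not given here.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Det = Tuple[Box, float]                   # (box, detection score)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union between two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def interpolate(a: Box, b: Box, w: float) -> Box:
    """Linear interpolation between two boxes, w=0 gives a and w=1 gives b."""
    return tuple((1 - w) * p + w * q for p, q in zip(a, b))


def extend(tube: Dict[int, Box], dets: List[List[Det]], start: int, step: int,
           iou_thr: float, max_gap: int) -> None:
    """Grow `tube` (frame index -> box) from `start`, forward (+1) or backward (-1)."""
    last, t = start, start + step
    # Stop at the video boundary or after `max_gap` consecutive unmatched frames.
    while 0 <= t < len(dets) and abs(t - last) <= max_gap:
        best = max(dets[t], key=lambda d: iou(tube[last], d[0]), default=None)
        if best is not None and iou(tube[last], best[0]) > iou_thr:
            # Fill the skipped frames by linear interpolation.
            for m in range(last + step, t, step):
                tube[m] = interpolate(tube[last], best[0], (m - last) / (t - last))
            tube[t] = best[0]
            dets[t].remove(best)          # a detection belongs to one tube only
            last = t
        t += step


def link_tubes(detections: List[List[Det]], iou_thr: float,
               max_gap: int) -> List[Dict[int, Box]]:
    """Greedily link per-frame detections into tubes, best-scored seed first."""
    dets = [list(frame) for frame in detections]   # working copy
    tubes = []
    while any(dets):
        t0, seed = max(((t, d) for t, frame in enumerate(dets) for d in frame),
                       key=lambda td: td[1][1])
        dets[t0].remove(seed)
        tube = {t0: seed[0]}
        extend(tube, dets, t0, +1, iou_thr, max_gap)
        extend(tube, dets, t0, -1, iou_thr, max_gap)
        tubes.append(dict(sorted(tube.items())))
    return tubes


def video_scores(tube_scores: List[List[float]]) -> List[float]:
    """Test-time aggregation: per class, the maximum score over all tubes."""
    return [max(per_class) for per_class in zip(*tube_scores)]
```

In this sketch a detection is consumed as soon as it is linked, which mirrors the "delete all detections in this first link, and repeat" step of the description.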
4.2 Baseline based on explicit 3D pose
Figure 3 shows an overview of the most intuitive baseline. It is based on explicit 3D pose information. More precisely, given the human tubes, we extract the 3D poses estimated by LCR-Net++ for each box, thus building a 3D pose skeleton sequence for each tube. We finally run a state-of-the-art 3D action recognition method using the code released by Yan et al. [STGCN] (https://github.com/yysijie/st-gcn). The idea consists in building a graph in space and time from the pose sequence, on which spatio-temporal convolutions are applied. We denote this first baseline as STGCN3D.
4.3 Variant based on explicit 2D pose
As the STGCN method of Yan et al. [STGCN] has also been applied to 2D poses, we use a variant of the previous pipeline, replacing the 3D poses estimated by LCR-Net++ by its 2D poses. On the one hand, this variant is likely to get worse performance, as 3D poses are more informative than 2D poses which are inherently ambiguous. But on the other hand, 2D poses extracted from images and videos tend to be more accurate than 3D poses which are more prone to noise. We call this second baseline STGCN2D.
4.4 Temporal convolution on implicit pose features
We finally study a baseline that transfers the implicit pose representation carried by mid-level features within LCR-Net++, without using explicit body keypoint coordinates, see Figure 4. We select the features used as input to the final layers for pose classification and refinement. These features have 2048 dimensions with a ResNet50 backbone and carry information about both 2D and 3D poses. The features are stacked over time along human tubes and a temporal convolution of kernel size T is applied on top of the resulting matrix. This convolution outputs action scores for the sequence.
At training, we sample random clips of T consecutive frames and use a cross-entropy loss. At test time, we use a fully-convolutional architecture and average the class probabilities, obtained by a softmax on the scores, over all clips in the video. We did experiment with deeper networks on top of the stacked features but did not see any significant improvement. Due to GPU memory constraints, we freeze the weights of LCR-Net++ during training, allowing larger temporal windows to be considered. We denote this third baseline as SIP-Net, for Stacked Implicit Pose Network.
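As a rough illustration of this design, the sketch below applies a single temporal convolution to features stacked along a tube and averages the per-clip softmax probabilities. It is written in plain Python for readability and makes several assumptions: the function names are hypothetical, the features are given as a list of per-frame vectors, and the kernel covers the full feature dimension so that one kernel position yields one score per class.

```python
import math
from typing import List

Vector = List[float]


def temporal_conv_scores(features: List[Vector],
                         weights: List[List[Vector]],
                         bias: Vector) -> List[Vector]:
    """Slide one temporal convolution over the stacked features.

    features: L x D, one feature vector per frame of a tube
    weights:  C x T x D, one T x D kernel per action class
    bias:     C values
    returns:  (L - T + 1) x C class scores, one row per clip of T frames
    """
    kernel_size = len(weights[0])
    if len(features) < kernel_size:
        raise ValueError("the sequence is shorter than the kernel size")
    scores = []
    for start in range(len(features) - kernel_size + 1):
        clip = features[start:start + kernel_size]
        scores.append([
            b + sum(w * x
                    for k_t, f_t in zip(kernel, clip)
                    for w, x in zip(k_t, f_t))
            for kernel, b in zip(weights, bias)
        ])
    return scores


def softmax(scores: Vector) -> Vector:
    peak = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores: Vector, label: int) -> float:
    """Training loss for one sampled clip of T frames."""
    return -math.log(softmax(scores)[label])


def predict(features: List[Vector], weights: List[List[Vector]],
            bias: Vector) -> Vector:
    """Test time: average the softmax probabilities over all clips."""
    probs = [softmax(row) for row in temporal_conv_scores(features, weights, bias)]
    return [sum(p[c] for p in probs) / len(probs) for c in range(len(bias))]
```

With 2048-dimensional features, the kernel of this single layer holds 2048 × T weights per class, which is why the model stays shallow; in practice the same operation would be expressed with a 1D convolution layer of a deep learning framework.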
4.5 Comparison on existing datasets
Before comparing these baselines on out-of-context actions (Section 5), we assess their performance for real-world action recognition on existing datasets, with various levels of ground-truth. Table 3 summarizes them in terms of number of videos, classes, splits, as well as frame-level ground-truths. For datasets with multiple splits, some results are reported on the first split only, denoted for instance as JHMDB-1 for the split 1 of JHMDB. While our goal is to perform action recognition in real-world videos, we validate the baselines on the constrained NTU 3D action recognition dataset [NTU] that contains ground-truth poses in 2D and 3D, using the standard cross-subject (cs) split. We also experiment on the JHMDB [JHMDB] and PennAction [PennAction] datasets that have ground-truth 2D poses, but no 3D poses. Finally, we use HMDB51 [HMDB], UCF101 [UCF101] and Kinetics [Kinetics] that contain no more information than the ground-truth label of each video. As metric, we report the standard mean accuracy, i.e., the ratio of correctly classified videos per class, averaged over all classes.
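The metric defined above can be written down in a few lines. The sketch below is only an illustration of the definition (accuracy computed per class, then averaged over classes); the function name `mean_class_accuracy` is an assumption.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def mean_class_accuracy(labels: Sequence[Hashable],
                        predictions: Sequence[Hashable]) -> float:
    """Ratio of correctly classified videos per class, averaged over classes."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for label, prediction in zip(labels, predictions):
        total[label] += 1
        correct[label] += int(label == prediction)
    return sum(correct[c] / total[c] for c in total) / len(total)
```

Unlike the global accuracy, this measure gives every class the same weight regardless of its number of videos: with three videos of one class (two correct) and one video of another class (correct), the global accuracy is 3/4 while the mean per-class accuracy is (2/3 + 1)/2 = 5/6.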
|#cls||#vid||#splits||in-the-wild||GT 2D||GT 3D|
In Appendix A, we report various experiments based on these various levels of ground-truth, allowing us to study the impact of extracted tubes and extracted poses, as well as the benefit of transferring pose features for SIP-Net. We also plot the performance of SIP-Net for varying clip sizes T (the number of consecutive frames per clip) and use a fixed value in the remainder of this work.
|Zolfaghari et al. [ChainedMS] (pose only)||45.5||-||-||67.8||36.0||-||56.9||-||-|
|MultiTask [luvizon20182d] (uses RGB)||-||-||97.4||74.3||-||-||-||-||-|
|STGCN [STGCN] (OpenPose)||25.2||25.4||71.6||79.8||38.6||34.7||54.0||50.6||30.7|
Table 4 provides a comparison of the mean accuracy on all datasets (last three rows). The method based on implicit pose features (SIP-Net) significantly outperforms the baselines that employ explicit 2D and 3D poses, except on NTU. The gap is over 10% on HMDB51, UCF101 and Kinetics. This can be explained by the fact that explicitly extracting the poses leads to a significant level of noise in the body keypoint representations for in-the-wild videos. Using an implicit pose representation as in SIP-Net allows for more robustness. Interestingly, on HMDB51, UCF101 and Kinetics, the 2D pose baseline performs slightly better than the 3D one, suggesting that 3D pose suffers from much more noise in unconstrained videos.
Finally, we compare our baselines to the state of the art among pose-based methods, see Table 4. SIP-Net obtains a higher accuracy than PoTion [PoTion] with a margin over 5% on JHMDB, HMDB51 and UCF101-1, and of 16% on Kinetics. Compared to the pose-only model of Zolfaghari et al. [ChainedMS], we obtain a higher accuracy on JHMDB, HMDB51 and UCF101. On NTU and PennAction, Luvizon et al. [luvizon20182d] obtain a higher accuracy because their approach also leverages appearance features. When combining SIP-Net with a standard RGB stream using a 3D ResNeXt-101 backbone, we obtain 98.9% on PennAction. Finally, as in [STGCN], we run the STGCN code on 2D poses detected by OpenPose [OpenPose]. We significantly outperform this approach on JHMDB, PennAction, HMDB51 and UCF101. On Kinetics, the gap is much smaller, at only 2%. This is because this dataset contains many videos with extreme close-ups on faces, or captured from a first-person viewpoint, leading to misdetections by LCR-Net++. For videos where only the face is visible, OpenPose, which outputs 18 keypoints including 5 on the head (nose, two ears, two eyes), is able to detect a pose. In contrast, LCR-Net++, which estimates only 1 keypoint (out of 13) on the center of the head, fails to detect humans in such cases. Table 5 shows the 10 classes with the highest and lowest accuracy for SIP-Net. Classes with high top-1 accuracy can be clearly recognized from body pose only. In contrast, the classes at 0% are either actions often captured from a first-person viewpoint where the poses are not detected (making a cake), or classes with no motion of the body keypoints, as they mainly contain motion of the face (sniffing) or the hands (drumming fingers).
|highest top-1 accuracy||lowest top-1 accuracy|
|crawling baby||91.8||rock scissors paper||0.0|
|presenting weather forecast||90.0||throwing ball||0.0|
|riding mechanical bull||89.8||eating chips||0.0|
|surfing crowd||87.5||tossing coin||0.0|
|filling eyebrows||84.4||unloading truck||0.0|
|shearing sheep||83.7||holding snake||0.0|
|bench pressing||82.0||making a cake||0.0|
|front raises||81.6||ripping paper||0.0|
5 Experiments on mimed actions
To assess the bias of action recognition algorithms towards scenes and objects, and evaluate their generalizability in absence of such visual context, we introduce Mimetics, a dataset of mimed actions.
5.1 The Mimetics dataset
Mimetics contains short YouTube video clips of mimed human actions that mostly consist of manipulations of, or interactions with, certain objects. These include sport actions, such as playing tennis or juggling a soccer ball, daily activities such as drinking, personal hygiene, e.g. brushing teeth, or playing musical instruments including bass guitar, accordion or violin. These classes were selected from the action labels of the Kinetics dataset, which makes it possible to evaluate models trained on Kinetics. Mimetics contains 713 video clips for a subset of 50 human action classes, i.e., an average of 14.3 clips per class. As it is hard to find mimed actions on the web, we restrict Mimetics to testing purposes, not training. These actions are performed on stage or on the street by mime artists (middle row of Figure 1) but also by people in everyday life, typically during mime games, or captured and shared for fun on social media. For instance, the top row of Figure 1 shows a video of someone training indoors for surfing water, and the bottom row shows soccer players mimicking the action bowling to celebrate a goal.
The clips for each class were obtained by searching for candidates through the use of key words such as miming or imitating followed by the desired action, or using query words such as imaginary and invisible followed by a certain object category. The dataset was built making sure that a human observer was able to recognize the mimed actions. The videos have variable resolutions and frame rates and have been manually trimmed between 1 and 10 seconds, following the Kinetics dataset. The URLs of the original YouTube videos and the temporal intervals of the video clips will be shared to spur further research on this topic. The detailed list of classes with the number of videos per class is available in Appendix B.
5.2 Experimental results
We compare several approaches on the Mimetics dataset: our three pose-based baselines, a state-of-the-art 3D CNN method on RGB input or Flow input as well as their late fusion, in addition to STGCN [STGCN] with OpenPose. For optical flow input, we use the TV-L1 algorithm [TVL1]. All methods were trained on the 400 classes of Kinetics. We then run them on the videos from the Mimetics dataset, and report top-1, top-5 accuracies as well as the mean average-precision (mAP). As each video has a single label, average-precision computes for each class the inverse of the rank of the ground-truth label, averaged over all videos of this class. Overall performances are reported in Table 6. We refer to Appendix B for per-class results. Figure 5 shows some qualitative examples.
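The three reported measures can be illustrated with a short sketch that follows the definitions above. It assumes that each video comes with one score per class and a single ground-truth label; the function names are hypothetical, and ties are resolved optimistically (only strictly higher scores push the ground-truth label down the ranking), which is an assumption of this sketch.

```python
from collections import defaultdict
from typing import List, Sequence


def rank_of_label(scores: Sequence[float], label: int) -> int:
    """1-based rank of the ground-truth class among all class scores."""
    return 1 + sum(s > scores[label] for s in scores)


def top_k_accuracy(all_scores: List[Sequence[float]],
                   labels: List[int], k: int) -> float:
    """Fraction of videos whose ground-truth class is among the k best scores."""
    hits = sum(rank_of_label(s, y) <= k for s, y in zip(all_scores, labels))
    return hits / len(labels)


def mean_average_precision(all_scores: List[Sequence[float]],
                           labels: List[int]) -> float:
    """Per class, average the inverse rank of the label, then average over classes."""
    inverse_ranks = defaultdict(list)
    for scores, label in zip(all_scores, labels):
        inverse_ranks[label].append(1.0 / rank_of_label(scores, label))
    per_class = [sum(v) / len(v) for v in inverse_ranks.values()]
    return sum(per_class) / len(per_class)
```

Because each video has a single label, a video whose ground-truth class is ranked second contributes 1/2, one ranked third contributes 1/3, and so on, so the mAP rewards near misses that top-1 accuracy ignores.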
|RGB+Flow (late fusion)||10.5||26.9||19.1|
|STGCN [STGCN] (OpenPose)||12.6||27.4||20.7|
We first observe that the performance is relatively low for all methods, below 15% top-1 accuracy and 25% mAP, showing that the recognition of mimed actions is challenging. In fact, all methods completely fail for a certain number of actions including climbing a rope, reading newspaper, eating cake or, more surprisingly, sweeping floor. One reason for this overall low accuracy is that some Kinetics actions are fine-grained (e.g. different classes correspond to eating various types of food) and are hard to distinguish, especially when mimed. Another difficulty is that mimed actions tend to be exaggerated (e.g., when performing air guitar or reading newspaper, particularly when performed by mime artists, see first row of Figure 5) and are therefore harder to understand. However, humans are still able to recognize these mimed actions and so should an intelligent system. We manually flag, for each video, whether the actor is a mime artist or not, and show the global top-1 accuracy in Table 7. For all approaches, the performance is significantly lower on videos where actions are performed by mime artists than on videos with ordinary people.
The best overall performance is achieved by SIP-Net, which consists of a temporal convolution applied on pose features, reaching 14.2% top-1 accuracy and a mAP of 22.7%. Some failure cases occur when several people are present in the scene: the tubes can erroneously mix several individuals, or other persons (e.g. spectators) sometimes obtain higher scores than the one miming the action of interest.
In comparison, the state-of-the-art 3D CNN model trained on RGB clips performs more poorly, with 8.6% mean top-1 accuracy and 15.6% mAP. For some classes such as archery, playing accordion, playing bass guitar and playing trumpet, this state-of-the-art RGB model obtains 0% while SIP-Net performs decently. One key reason for that is the bias learned by the model: it focuses on the objects being manipulated or the scenes where the video is captured more than on the performed actions. For instance, in the second row of Figure 5, someone mimics playing piano on a console table covered with a tablecloth, which looks like a massage table. As a consequence, the RGB model predicts the action massage back without considering what the person is really doing. To further verify the bias towards object and scene, we manually label for each video if there is any relevant object or not, and if the scene is relevant for the action. We report the global top-1 accuracy (as some classes have no video or just a few, global accuracy is better suited than mean per-class accuracy) in Table 7 for the subset of videos where object and/or scene are not relevant. On these videos, the performance of the state-of-the-art RGB 3D CNN drops significantly while the SIP-Net baseline is more robust. The RGB 3D CNN still performs better than SIP-Net on classes such as brushing teeth, catching or throwing baseball, or juggling balls. This corresponds to classes in which the object is barely visible in most training videos, either too small (e.g. cigarette for smoking) or mostly occluded by hands (baseball ball, toothbrush, hair brush). In such cases, the 3D CNN model focuses on face and hands (for brushing teeth, smoking) or on the body (throwing baseball) and therefore performs reasonably well on these mimed actions. To further verify this, we manually annotate for each of the 50 classes of Mimetics whether there is an object being manipulated or not, and if it is small or large.
We report the global top-1 accuracy in Table 8. RGB performs better than SIP-Net on actions with no object or with small objects, while SIP-Net clearly outperforms RGB in case of large objects.
|not a mime artist||(510)||9.8||13.5||17.8|
|object is not relevant||(671)||7.2||10.4||13.7|
|scene is not relevant||(644)||6.4||9.8||13.5|
|both object and scene are not relevant||(610)||4.9||8.7||13.0|
We then also evaluate a similar 3D CNN that takes as input optical flow clips instead of RGB clips. The overall performance is higher than RGB, with 11.8% top-1 accuracy and 21.1% mAP. This suggests that this flow model learns fewer biases than RGB, because it does not see the appearance of the scenes and objects. For instance, playing piano is correctly predicted in the example of the second row of Figure 5, because in the optical flow, a piano and a covered table look roughly the same. Sevilla-Lara et al. [sevilla2018integration] suggest that flow may still capture the global shape of the actor or objects. This explains why flow, like RGB, performs better on classes without objects or with small objects than on classes with larger objects, see Table 8: when the subject is manipulating small objects, the network is not able to capture these details and focuses on bigger structures like the person, thus generalizing better to out-of-context actions. We evaluated in Table 6 a late fusion of RGB and Flow, i.e., a two-stream model [TwoStream], and observe a small decrease in performance as both models tend to perform well on the same kind of classes and videos.
Next, we also benchmark other pose-based approaches. Our two other baselines based on explicit 2D or 3D poses perform quite poorly, as they do on the Kinetics dataset. This can be explained by the difficulty of extracting accurate body keypoint coordinates for in-the-wild videos with abrupt camera and actor motion, blur, and occlusions. In particular, the low performance on Kinetics itself suggests that this also occurs in the training set, leading to a poor model. We also compare to STGCN [STGCN] that uses OpenPose to estimate the pose, i.e., with more keypoints on the head than LCR-Net++. The performance is higher, with 12.6% top-1 accuracy, but remains lower than that of the SIP-Net baseline, which does not explicitly compute poses but transfers the learned pose features to action recognition.
To explain the relatively poor performance of all methods, we argued that Kinetics classes might be too fine-grained and too difficult to distinguish when mimed. This is illustrated by the significantly higher top-5 accuracy (32.0%) than top-1 accuracy (14.2%), see Table 6. To further verify this statement, we trained a SIP-Net model on the Kinetics training videos from the 50 classes of Mimetics and report the results in Table 9. Top-1 accuracy increases to more than 25% and top-5 accuracy to more than 50%.
|SIP-Net training set||top-1||top-5||mAP|
|Kinetics (400 classes)||14.2||32.0||22.7|
|Kinetics subset (50 classes)||25.1||51.4||38.3|
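For readers unfamiliar with the gap between top-1 and top-5 accuracy discussed above, here is a minimal sketch of how top-k accuracy can be computed from per-video class scores. The function name and the toy scores are assumptions for illustration and are not taken from the evaluation code.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of videos whose true label is among the k highest-scoring classes.

    scores: array of shape (num_videos, num_classes)
    labels: integer array of shape (num_videos,)
    """
    top_k = np.argsort(-scores, axis=1)[:, :k]          # indices of the k best classes
    hits = (top_k == labels[:, None]).any(axis=1)       # True if the label is among them
    return float(hits.mean())

# Toy example with 4 videos and 4 classes (illustrative values only).
scores = np.array([
    [0.60, 0.30, 0.10, 0.00],
    [0.20, 0.50, 0.30, 0.00],
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.35, 0.20, 0.05],
])
labels = np.array([0, 2, 0, 1])

top1 = top_k_accuracy(scores, labels, k=1)  # 0.25
top2 = top_k_accuracy(scores, labels, k=2)  # 0.75
```

A large gap between the two values, as in the toy example, indicates that the correct class is often ranked just below a confusable one, which is the behaviour expected for fine-grained classes.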
In this paper, we have highlighted the context biases of existing action recognition datasets and 3D CNN models. To benchmark performance on out-of-context actions, we have introduced the Mimetics dataset. Our experiments show that models leveraging body language via human pose are less prone to context biases. Applying a shallow neural network, such as a single convolution, over features transferred from human poses performs surprisingly well compared to 3D action recognition applied in the wild. We think that Mimetics will help to better understand what action recognition models learn and is a step towards designing more intelligent systems for human action recognition.
Appendix A Extended experiments on existing datasets
In this section, we provide more analysis about the performance of the three pose-based baselines on existing action recognition datasets. We first perform a parametric study of SIP-Net in Section A.1. We then use the various levels of ground-truth (see Table 3 of the main paper) to study the impact of using ground-truth or extracted tubes and poses (Section A.2).
Tubes. For datasets with ground-truth 2D poses, we compare the performance when using ground-truth tubes (GT Tubes) obtained from GT 2D poses, or estimated tubes (LCR Tubes) built from estimated 2D poses, see Section 4.1 of the main paper. In the latter case, tubes are labeled positive if the spatio-temporal IoU with a GT tube is over , and negative otherwise. When there is no tube annotation, we assume that all tubes are labeled with the video class label. Note that in some videos, no tube is extracted, in which case the videos are ignored when training, and considered as wrongly classified for test videos. In particular, this happens when only the head is visible, as well as for many clips with first person viewpoint, where only one hand or the main manipulated object is visible. We obtain no tube for 0.1% of the videos on PennAction, 2.5% on JHMDB, 2.7% on HMDB51, 6.7% on UCF101 and 15.3% on Kinetics.
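The tube labeling rule above can be sketched as follows. This is a minimal illustration under stated assumptions: a tube is represented as a mapping from frame index to a box `(x1, y1, x2, y2)`, the spatio-temporal IoU is taken as the sum of per-frame box IoUs over shared frames divided by the number of frames in the temporal union, and `iou_threshold` is a hypothetical parameter whose value in the usage example is a placeholder, not the value used in the experiments.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def spatio_temporal_iou(tube_a, tube_b):
    """Per-frame IoU summed over shared frames, normalized by the temporal union.

    Frames covered by only one tube contribute zero overlap (assumed definition).
    """
    all_frames = set(tube_a) | set(tube_b)
    if not all_frames:
        return 0.0
    shared = set(tube_a) & set(tube_b)
    total = sum(box_iou(tube_a[f], tube_b[f]) for f in shared)
    return total / len(all_frames)

def is_positive_tube(tube, gt_tubes, iou_threshold):
    """A tube is positive if it overlaps some GT tube above the threshold."""
    best = max((spatio_temporal_iou(tube, gt) for gt in gt_tubes), default=0.0)
    return best > iou_threshold

# Toy tubes (illustrative boxes only).
gt_tube = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10)}
good_tube = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10), 2: (0, 0, 10, 10)}
bad_tube = {1: (5, 0, 15, 10)}

PLACEHOLDER_THRESHOLD = 0.5  # hypothetical value for the example only
good_label = is_positive_tube(good_tube, [gt_tube], PLACEHOLDER_THRESHOLD)
bad_label = is_positive_tube(bad_tube, [gt_tube], PLACEHOLDER_THRESHOLD)
```

With an empty list of ground-truth tubes the function returns a negative label, so videos without tube annotation would need the separate rule described above, where every tube inherits the video class label.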
A.1 SIP-Net baseline
We first present the results for the SIP-Net baseline with GT tubes (blue curve ‘GT tubes, Pose Feats’) and LCR tubes (green curve ‘LCR Tubes, Pose Feats’) on all datasets for varying clip length T, see Figure 6. Overall, a larger clip size leads to a higher classification accuracy, in particular for datasets with longer videos such as NTU and Kinetics. This holds both when using GT tubes (blue curve) and LCR tubes (green curve). We keep T=32 in the remainder of this paper.
Next, we measure the impact of transfer learning from the pose domain to action recognition. To this end, we compare the temporal convolution on LCR pose features (blue curve, ‘Pose Feats’) to the same convolution on features extracted from a Faster R-CNN model with a ResNet50 backbone trained to classify actions (red curve, ‘Action Feats’). This latter method is not meant to be state-of-the-art in action recognition, but it allows a fair comparison between pose features and action features: the network architecture is exactly the same and only the learned weights change. Note that such a frame-level action detector has been used in the spatio-temporal action detection literature [saha2016deep, weinzaepfel2015learning], before the rise of 3D CNNs. Results in Figure 6 show a clear drop in accuracy when using action features instead of pose features: about 20% on JHMDB-1 and PennAction, and around 5% on NTU for T=32. Interestingly, this also holds for T=1 on JHMDB-1 and PennAction, i.e., without temporal integration, showing that ‘Pose feats’ are more powerful. To better understand why using pose features considerably increases performance compared to action features, we visualize the distances between features inside tubes in Figure 7. When training a per-frame detector specifically for actions, most features of a given tube are correlated. It is therefore hard to leverage temporal information from them. In contrast, LCR-Net++ pose features change considerably over time, as does the pose, and thus benefit more from temporal integration. Figure 8 shows confusion matrices on PennAction when using ‘Pose feats’ (left) vs. ‘Action feats’ (right). With ‘Action feats’, confusions happen between the two tennis actions or between the two baseball actions, whereas ‘Pose feats’ disambiguate them.
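The intuition about feature distances inside a tube can be illustrated with a small sketch. It computes the matrix of pairwise distances between the per-frame features of one tube; the Euclidean metric, the function name and the synthetic features are assumptions for the example, and the numbers it produces are toy values, not experimental results.

```python
import numpy as np

def pairwise_feature_distances(features):
    """Euclidean distances between all pairs of per-frame features.

    features: array of shape (T, D), one D-dimensional feature per frame.
    Returns an array of shape (T, T).
    """
    diff = features[:, None, :] - features[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
T, D = 8, 16

# Synthetic stand-in for action features: nearly constant along the tube.
static_feats = np.tile(rng.normal(size=(1, D)), (T, 1)) + 0.01 * rng.normal(size=(T, D))
# Synthetic stand-in for pose features: changing from frame to frame.
varying_feats = rng.normal(size=(T, D))

static_dist = pairwise_feature_distances(static_feats)
varying_dist = pairwise_feature_distances(varying_feats)
```

A distance matrix that is close to zero everywhere, as for `static_feats`, means that additional frames carry little new information, so a temporal convolution has little to exploit.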
A.2 Comparison between baselines
We compare the performance of the baselines using GT and LCR tubes, on the JHMDB-1, PennAction and NTU datasets in Table 10. On JHMDB-1 and PennAction, despite being a much simpler architecture, the SIP-Net baseline outperforms the methods based on explicit 2D-3D pose representations, both with GT and LCR tubes. Estimated 3D pose sequences are usually noisy and may lack temporal consistency. We also observe that the STGCN3D approach significantly outperforms its 2D counterpart (STGCN2D), confirming that 2D poses contain less discriminative and more ambiguous information.
On the NTU dataset, the 3D pose baseline obtains 74.8% accuracy when using GT tubes and estimated poses (STGCN3D on GT Tubes), compared to 81.5% reported in [STGCN] when using ground-truth 3D poses. This gap of about 7% in a constrained environment is likely to increase for videos captured in the wild. The performance of the features-based baseline (SIP-Net) is lower, 66.4% on GT tubes, suggesting that SIP-Net performs better only in unconstrained scenarios.
|STGCN3D (GT 3D poses)||-||-||81.5|
Appendix B Per-class results on Mimetics
In Table 11, we present for each class the top-1 accuracy and the AP of the different methods. For the top-1 accuracy metric, SIP-Net obtains the best performance for 19 out of 50 classes, with a mean accuracy of 14.2%. The RGB 3D CNN baseline obtains the highest AP for 8 classes, which often correspond to classes in which manipulated objects are small, making the network less biased towards context (e.g. the ball for the action catching or throwing baseball). Table 11 also highlights that the recognition of mimed actions is a very challenging and open task, as none of the videos are correctly classified (i.e. 0% top-1 accuracy) by the 5 baselines for 5 out of the 50 classes.
|canoeing or kayaking||14||0.0||(1.5)||0.0||(5.3)||0.0||(2.6)||0.0||(2.8)||0.0||(8.2)||0.0||(3.9)|
|catching or throwing baseball||14||21.4||(27.0)||0.0||(22.9)||0.0||(9.2)||0.0||(5.9)||0.0||(2.6)||0.0||(17.6)|
|catching or throwing frisbee||14||21.4||(31.5)||21.4||(42.7)||7.1||(28.1)||0.0||(10.3)||0.0||(6.7)||21.4||(39.5)|
|clean and jerk||13||15.4||(25.3)||38.5||(47.7)||46.2||(52.3)||23.1||(43.0)||30.8||(47.5)||46.2||(50.1)|
|climbing a rope||14||0.0||(1.2)||0.0||(1.1)||0.0||(9.5)||0.0||(6.2)||0.0||(4.8)||0.0||(5.1)|
|eating ice cream||11||0.0||(4.7)||0.0||(11.3)||0.0||(4.4)||0.0||(2.4)||0.0||(1.8)||18.2||(21.5)|
|juggling soccer ball||18||11.1||(23.9)||5.6||(25.5)||50.0||(61.6)||0.0||(12.1)||27.8||(41.8)||44.4||(57.5)|
|playing bass guitar||13||0.0||(5.2)||7.7||(12.4)||7.7||(20.3)||0.0||(6.0)||0.0||(3.2)||15.4||(27.7)|
|punching person (boxing)||16||12.5||(22.8)||18.8||(30.3)||25.0||(31.3)||6.2||(19.8)||0.0||(8.8)||12.5||(20.3)|
|shooting goal (soccer)||14||7.1||(23.9)||0.0||(21.2)||7.1||(22.6)||7.1||(24.3)||0.0||(10.0)||14.3||(29.8)|
|skiing (not slalom or crosscountry)||10||0.0||(4.1)||20.0||(23.0)||0.0||(1.5)||0.0||(1.1)||0.0||(1.4)||0.0||(2.0)|
|walking the dog||15||6.7||(11.7)||0.0||(4.9)||0.0||(2.8)||0.0||(1.6)||0.0||(2.0)||0.0||(3.7)|
|avg (50 classes)||713||8.6||(15.6)||11.8||(21.1)||12.6||(20.7)||9.0||(15.4)||5.8||(11.3)||14.2||(22.7)|
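The two per-class metrics reported in the table can be computed as in the following minimal sketch. It assumes AP is the mean of the precision values at the rank of each positive video, with videos ranked by the score of the class, which is a common definition; the function names and toy values are hypothetical and the exact evaluation protocol may differ.

```python
import numpy as np

def average_precision(class_scores, is_positive):
    """AP for one class: mean precision at the rank of each positive video."""
    order = np.argsort(-class_scores)             # rank videos by decreasing score
    hits = is_positive[order].astype(float)
    if hits.sum() == 0:
        return 0.0
    precision_at_rank = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_rank * hits).sum() / hits.sum())

def per_class_top1(scores, labels, class_index):
    """Top-1 accuracy restricted to the videos of one class."""
    mask = labels == class_index
    if not mask.any():
        return 0.0
    return float((scores[mask].argmax(axis=1) == class_index).mean())

# Toy example for one class over 4 videos (illustrative values only).
class_scores = np.array([0.9, 0.8, 0.7, 0.6])
is_positive = np.array([True, False, True, False])
ap = average_precision(class_scores, is_positive)  # (1/1 + 2/3) / 2

# Toy example with 3 videos and 2 classes.
scores = np.array([[0.7, 0.3], [0.4, 0.6], [0.2, 0.8]])
labels = np.array([0, 0, 1])
acc_class0 = per_class_top1(scores, labels, class_index=0)  # 0.5
```

Because AP depends on the ranking of all videos, a class can have 0% top-1 accuracy and still a non-zero AP, as happens for several rows of the table.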