Graph Embedded Pose Clustering for Anomaly Detection

12/26/2019, by Amir Markovitz et al.

We propose a new method for anomaly detection of human actions. Our method works directly on human pose graphs that can be computed from an input video sequence. This makes the analysis independent of nuisance parameters such as viewpoint or illumination. We map these graphs to a latent space and cluster them. Each action is then represented by its soft-assignment to each of the clusters. This gives a kind of "bag of words" representation to the data, where every action is represented by its similarity to a group of base action-words. Then, we use a Dirichlet process based mixture, which is useful for handling proportional data such as our soft-assignment vectors, to determine if an action is normal or not. We evaluate our method on two types of datasets. The first is a fine-grained anomaly detection dataset (e.g., ShanghaiTech) where we wish to detect unusual variations of some action. The second is a coarse-grained anomaly detection dataset (e.g., a Kinetics-based dataset) where few actions are considered normal, and every other action should be considered abnormal. Extensive experiments on the benchmarks show that our method performs considerably better than other state-of-the-art methods.

1 Introduction

Anomaly detection in video has been investigated extensively over the years. This is because the amount of video captured far surpasses our ability to manually analyze it. Anomaly detection algorithms are designed to help human operators deal with this problem. The question is how to define anomalies and how to detect them.

The decision of whether an action is normal or not is nuanced. In some cases, we are interested in detecting abnormal variations of an action. For example, an abnormal type of walking. We term this fine-grained anomaly detection. In other cases, we might be interested in defining normal actions and regard any other action as abnormal. For example, we might be interested in determining that dancing is normal, while gymnastics are abnormal. We call this coarse-grained anomaly detection.

We desire an algorithm that can handle both types of anomaly detection in a single, unified fashion. Such an algorithm should take as input an unlabeled set of videos that capture normal actions only (fine- or coarse-grained) and use that to train a model that will distinguish normal from abnormal actions.

We take advantage of the recent progress in human pose estimation and assume our algorithm takes human pose graphs as input. This offers several advantages. First, it abstracts the problem and lets the algorithm focus on human pose and not on irrelevant features such as viewing direction, illumination, or background clutter. In addition, a human pose can be represented as a compact graph, which makes analyzing, training and testing much faster.

Given a sequence of video frames, we use a pose estimation method to extract the keypoints of every person in each frame. Every person in a clip is represented as a temporal pose graph. We use a combination of an autoencoder and a clustering branch to map the training samples into a latent space where samples are soft clustered. Each sample is then represented by its soft-assignment to each of the K clusters. This can be understood as learning a bag-of-words representation for actions: each cluster corresponds to an action-word, and each action is represented by its similarity to each of the action-words. Figure 1 gives an overview of our method.

Figure 1: Model Diagram (Inference Time): To score a video, we first perform pose estimation. The extracted poses are encoded using the encoder part of a spatio-temporal graph autoencoder (ST-GCAE), resulting in a latent vector. The latent vector is soft-assigned to clusters using a deep clustering layer, with each entry of the resulting assignment vector denoting the probability of the sample being assigned to the corresponding cluster.

The soft-assignment vectors capture proportional data and the tool to measure their distribution is the Dirichlet Process Mixture Model. Once we fit the model to the data, we can obtain a normality score for each sample and determine if the action is to be classified as normal or not.

The algorithm thus consists of a series of abstractions. Using human pose graphs eliminates the need to deal with viewpoint and illumination changes. And the soft-assignment representation abstracts the type of data (fine-grained or coarse-grained) from the Dirichlet model.

We evaluate our algorithm in two settings. The first is the ShanghaiTech Campus [luo2017revisit] dataset, a large and extensively evaluated anomaly detection benchmark. This is a typical (fine-grained) anomaly detection benchmark in which normal behavior is taken to be walking, and the goal is to detect abnormal events, such as people running, fighting, riding bicycles, throwing objects, etc.

The second is a new problem setting we propose, and denote Coarse-grained anomaly detection. Instead of focusing on a single action (i.e., walking), as in the ShanghaiTech dataset, we construct a training set consisting of a varying number of actions that are to be regarded as normal. For example, the training set may consist of video clips of different dancing styles. At test time, every dance video should be classified as normal, while any other action should be classified as abnormal.

We demonstrate this new, challenging, Coarse-grained anomaly detection setting on two action classification datasets. The first is the NTU-RGB+D dataset, where 3D body joints are detected using Kinect. The second is a larger and more challenging dataset that consists of 250 out of the 400 actions in the Kinetics-400 dataset (we only use a subset of the classes because not all classes can be detected using human pose detectors). For both datasets, we use a subset of the actions to define a training set of normal actions and use the rest of the videos to test if the algorithm can correctly distinguish normal from abnormal videos.

We conduct extensive experiments, compare to a number of competing approaches and find that our algorithm outperforms all of them.

To summarize, we propose three key contributions:

  • The use of embedded pose graphs and a Dirichlet process mixture for video anomaly detection;

  • A new coarse-grained setting for exploring broader aspects of video anomaly detection;

  • State-of-the-art AUC (0.761, see Table 1) on the ShanghaiTech Campus anomaly detection benchmark.

2 Background

2.1 Video Anomaly Detection

The field of anomaly detection is broad, with large variation in settings and assumptions, as is evident from the different datasets proposed for evaluating methods in the field.

For our fine-grained experiment, we use the ShanghaiTech Campus dataset [luo2017revisit]. Containing 130 anomalous events in 13 different scenes, with various camera angles and lighting conditions, it is more diverse and significantly larger than previous commonly used datasets. It is presented in detail in section 4.1.

In recent years, numerous works tackled the problem of anomaly detection in video using deep learning based models. Those could be roughly categorized into reconstructive models, predictive models, and generative models.

Reconstructive models learn a feature representation for each sample and attempt to reconstruct the sample from that embedding, often using autoencoders [Abati_2019_CVPR, Chong_2017, Hasan_2016_CVPR]. Predictive methods aim to model the current frame based on a set of previous frames, often relying on recurrent neural networks [luo2017remembering, luo2017revisit, medel2016anomaly] or 3D convolutions [sabokrou2017deep, zhao2017spatio]. In some cases, reconstruction-based models are combined with prediction-based methods for improved accuracy [zhao2017spatio]. In both cases, samples that are poorly reconstructed or predicted are considered anomalous.

Generative models were also used to reconstruct, predict or model the distribution of the data, often using Variational Autoencoders (VAEs) [an2015variational] or GANs [akccay2019skip, lotter2015unsupervised, ravanbakhsh2017abnormal, ravanbakhsh2017training].

A method proposed by Liu et al. [Liu_2018] uses a generative future-frame prediction model and compares a prediction with its ground truth by evaluating differences in gradient-based features and optical flow. This method requires optical flow computation and generating a complete scene, which makes it costly and less robust to large scenery changes.

Recently, Morais et al. [Morais_2019_CVPR] proposed an anomaly detection method that uses a fully connected RNN to analyze pose sequences. The method embeds a sequence, then uses reconstruction and prediction branches to generate past and future poses, respectively. The anomaly score is determined by the reconstruction and prediction errors of the model.

2.2 Graph Convolutional Networks

To represent human poses as graphs, the inner-graph relations are described using weighted adjacency matrices. Each matrix could be static or learnable and represent any kind of relation.

In recent years, many approaches have been proposed for applying deep learning to graph data. Kipf and Welling [kipf2017semi] proposed the notion of Fast Approximate Convolutions On Graphs. Following Kipf and Welling, both temporal and multiple-adjacency extensions were proposed. Works by Yan et al. [Yan2018SpatialTG] and Yu et al. [Yu_2018] proposed temporal extensions, with the former proposing separable spatial and temporal graph convolutions (ST-GCN), applied sequentially. We follow the basic ST-GCN block design, illustrated in Figure 2.

Veličković et al. [velickovic2018graph] proposed Graph Attention Networks, a GCN extension in which the weights of neighboring nodes are inferred using an attention mechanism, relying on a fixed adjacency matrix only to determine which nodes are neighbors.

Shi et al. [2sagcn2019cvpr] recently extended the concept of spatio-temporal graph convolutions by using several adjacency matrices, some of which are learned or inferred. The inferred adjacency is determined using an embedded similarity measure, optimized during training. The adjacency matrices are summed prior to applying the convolution.

2.3 Deep Clustering Models

Deep clustering methods aim to provide useful cluster assignments by optimizing a deep model under a cluster inducing objective. For example, several recent methods jointly embed and cluster data using unsupervised representation learning methods, such as autoencoders, with clustering modules [caron2018deep, Dizaji_2017, Wang_2016, xie2016unsupervised].

Xie et al. [xie2016unsupervised] proposed Deep Embedded Clustering (DEC), an alternating two-step approach. In the first step, a target distribution is calculated using the current cluster assignments. In the next step, the model is optimized to provide cluster assignments similar to the target distribution. Recent extensions tackled DEC’s susceptibility to degenerate solutions using regularization methods and various post-processing means [Dizaji_2017, haeusser2018associative].

3 Method

We design an anomaly detection algorithm that can operate in a number of different scenarios. The algorithm consists of a sequence of abstractions that are designed to help each step of the algorithm work better. First, we use a human pose detector on the input data. This abstracts the problem and prevents the next steps from dealing with nuisance parameters such as viewpoint or illumination changes.

Human actions are represented as space-time graphs and we embed (sub-sections 3.1, 3.2) and cluster (sub-section 3.3) them in some latent space. Each action is now represented as a soft-assignment vector to a group of base actions. This abstracts the underlying type of actions (i.e., fine-grained or coarse-grained), leading to the final stage of learning their distribution. The tool we use for learning the distribution of soft-assignment vectors is the Dirichlet process mixture (sub-section 3.4), and we fit a model to the data. This model is then used to determine if an action is normal or not.

3.1 Feature Extraction

We wish to capture the relations between body joints while providing robustness to external factors such as appearance, viewpoint, and lighting. Therefore, we represent a person’s pose with a graph.

Each node of the graph corresponds to a keypoint, a body joint, and each edge represents some relation between two nodes. Many keypoint relations exist, such as physical relations defined anatomically (e.g., the left wrist and elbow are connected) and action relations defined by movements that tend to be highly correlated in the context of a certain action (e.g., the left and right knees tend to move in opposite directions while running). The graph is directed because some relations are learned during optimization and are not symmetric. An added benefit of this representation is its compactness, which is important for efficient video analysis.

In order to extend this formulation temporally, pose keypoints extracted from a video sequence are represented as a temporal sequence of pose graphs. The temporal pose graph is a time series of human joint locations. Temporal adjacency is defined similarly, by connecting each joint to itself in successive frames, allowing graph convolution operations that exploit both the spatial and temporal dimensions of the sequence of pose graphs.
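To make the construction concrete, the following is a minimal sketch (not taken from the paper) of how a normalized spatial adjacency and the joint-to-joint temporal links can be built; the joint count and edge list are illustrative placeholders, and the normalization follows the standard GCN convention.

```python
import numpy as np

# Toy skeleton: a handful of joints and anatomical edges. The actual keypoint
# layout depends on the pose estimator (e.g., 17 COCO joints or 25 Kinect joints).
NUM_JOINTS = 5
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]  # a single limb chain, for illustration

def normalized_spatial_adjacency(num_joints, edges):
    """Symmetric adjacency with self-loops, normalized as D^-1/2 (A + I) D^-1/2."""
    a = np.eye(num_joints)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = np.diag(a.sum(axis=1) ** -0.5)
    return d_inv_sqrt @ a @ d_inv_sqrt

def temporal_edges(num_joints, num_frames):
    """Connect each joint to the same joint in the following frame."""
    return [(t * num_joints + v, (t + 1) * num_joints + v)
            for t in range(num_frames - 1) for v in range(num_joints)]

A_spatial = normalized_spatial_adjacency(NUM_JOINTS, EDGES)   # shared across frames
print(A_spatial.shape, len(temporal_edges(NUM_JOINTS, num_frames=12)))
```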

We propose a deep temporal graph autoencoder based architecture for embedding the temporal pose graphs. Building on the basic block design of ST-GCN, presented in Figure 2, we substitute the basic GCN operator with a novel Spatial Attention Graph Convolution, presented next.

We use this building block to construct a Spatio-Temporal Graph Convolutional Auto-Encoder, or ST-GCAE. We use ST-GCAE to embed the spatio-temporal graph and take the embedding to be the starting point for our clustering branch.

Figure 2: Spatio-Temporal Graph Convolution Block: The basic block used for constructing ST-GCAE. A spatial attention graph convolution (Figure 3) is followed by a temporal convolution and batch normalization. A residual connection is used.
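As a rough illustration of the block in Figure 2, the sketch below applies a pluggable spatial graph operator, then a temporal convolution with batch normalization, and adds a residual connection; the channel counts, temporal kernel size, and ordering of normalization and activation are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SimpleGCN(nn.Module):
    """Minimal spatial operator: 1x1 channel projection followed by adjacency mixing."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x, adj):                                 # x: (N, C, T, V), adj: (V, V)
        return torch.einsum('nctv,vw->nctw', self.proj(x), adj)

class STGraphConvBlock(nn.Module):
    """Spatial graph conv -> temporal conv -> batch norm, plus a residual connection."""
    def __init__(self, spatial_op, in_ch, out_ch, t_kernel=9, t_stride=1):
        super().__init__()
        self.spatial = spatial_op                              # e.g. SimpleGCN above
        pad = (t_kernel - 1) // 2
        self.temporal = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=(t_kernel, 1),
                      stride=(t_stride, 1), padding=(pad, 0)),
            nn.BatchNorm2d(out_ch),
        )
        if in_ch == out_ch and t_stride == 1:                  # residual path
            self.residual = nn.Identity()
        else:
            self.residual = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=(t_stride, 1)),
                nn.BatchNorm2d(out_ch),
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, adj):
        res = self.residual(x)
        x = self.spatial(x, adj)                               # mix information across joints
        x = self.temporal(x)                                   # mix information across time
        return self.relu(x + res)

block = STGraphConvBlock(SimpleGCN(3, 64), in_ch=3, out_ch=64)
x = torch.randn(2, 3, 12, 18)                                  # (batch, channels, frames, joints)
print(block(x, torch.eye(18)).shape)                           # torch.Size([2, 64, 12, 18])
```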

3.2 Spatial Attention Graph Convolution

We propose a new graph operator, presented in Figure 3, that uses adjacency matrices of three types: static, globally-learned, and inferred (attention-based). Each adjacency type is applied with its own GCN, using separate weights. The outputs of the GCNs are stacked along the channel dimension, and a learnable convolution weights the stacked outputs and reduces them to the required number of output channels.

The three adjacency matrices capture different aspects of the model: (i) The use of body-part connectivity as a prior over node relations, represented using the static adjacency matrix. (ii) Dataset level keypoint relations, captured by the global adjacency matrix, and (iii) Sample specific relations, captured by inferred adjacency matrices. Finally, the learnable reduction measure weights the different outputs.

The static adjacency is fixed and shared by all layers. The globally-learnable matrix is learned individually at each layer and applied equally to all samples during the forward pass. The inferred adjacency matrices are based on an attention mechanism that uses learned weights to calculate a sample-specific adjacency matrix, a different one for every sample in a batch. For example, for a batch of N samples over graphs with V nodes, the inferred adjacency has shape N x V x V, while the other adjacencies are V x V matrices.

The globally-learned adjacency is initialized as a fully-connected graph, i.e., a complete, uniform adjacency matrix. The matrix is jointly optimized with the rest of the model’s parameters during training. The computational overhead of this adjacency is small for graphs containing no more than a few dozen nodes.

An inferred adjacency matrix is constructed using a graph self-attention layer. After evaluating a few attention models, we chose a simple multiplicative attention mechanism: we embed the input twice, using two sets of learned weights, transpose one of the embedded matrices, take the dot product between the two, and normalize the result to obtain the inferred adjacency matrix. The attention mechanism is modular and may be replaced with other common alternatives. Further details are provided in the supplementary material.

Figure 3: Spatial-Attention Graph Convolution: A zoom into our spatial graph convolution operator, comprised of three GCN [kipf2017semi] operators: one using a hard-coded physical adjacency matrix, a second using a global adjacency matrix learned during training, and a third using an adjacency matrix inferred by an attention submodule. A residual connection is used. GCN modules include batch normalization and ReLU activation, omitted for readability.
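The sketch below is a hedged PyTorch rendering of the operator described above: three GCN branches driven by static, globally-learned, and inferred adjacencies, stacked along the channel dimension and reduced with a learnable 1x1 convolution. The attention is pooled over time for brevity, and the module names, embedding size, and softmax normalization are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNBranch(nn.Module):
    """One graph convolution branch: 1x1 channel projection, then adjacency mixing."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x, adj):                                   # x: (N, C, T, V)
        x = self.proj(x)
        if adj.dim() == 2:                                       # static / global: (V, V)
            x = torch.einsum('nctv,vw->nctw', x, adj)
        else:                                                    # inferred, per sample: (N, V, V)
            x = torch.einsum('nctv,nvw->nctw', x, adj)
        return F.relu(self.bn(x))

class SpatialAttentionGraphConv(nn.Module):
    def __init__(self, in_ch, out_ch, num_joints, a_static, embed_ch=16):
        super().__init__()
        self.register_buffer('a_static', a_static)               # fixed anatomical prior
        self.a_global = nn.Parameter(torch.ones(num_joints, num_joints) / num_joints)
        self.theta = nn.Conv2d(in_ch, embed_ch, kernel_size=1)   # attention embeddings
        self.phi = nn.Conv2d(in_ch, embed_ch, kernel_size=1)
        self.branches = nn.ModuleList(GCNBranch(in_ch, out_ch) for _ in range(3))
        self.reduce = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)  # learnable reduction

    def inferred_adjacency(self, x):
        q = self.theta(x).mean(dim=2).permute(0, 2, 1)            # (N, V, E), time-pooled
        k = self.phi(x).mean(dim=2)                               # (N, E, V)
        return F.softmax(torch.bmm(q, k), dim=-1)                 # (N, V, V), one per sample

    def forward(self, x):
        outs = [self.branches[0](x, self.a_static),
                self.branches[1](x, self.a_global),
                self.branches[2](x, self.inferred_adjacency(x))]
        return self.reduce(torch.cat(outs, dim=1))                # stack, then 1x1 reduce

# Example: 18-joint pose sequences, 2 input channels (x, y), 64 output channels.
sagc = SpatialAttentionGraphConv(2, 64, num_joints=18, a_static=torch.eye(18))
print(sagc(torch.randn(4, 2, 12, 18)).shape)                      # torch.Size([4, 64, 12, 18])
```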

3.3 Deep Embedded Clustering

To build our dictionary of underlying actions, we take the training set samples and jointly embed and cluster them in some latent space. Each sample is then represented by its assignment probability to each of the underlying clusters. The objective is selected to provide distinct latent clusters, over which actions reside.

We adapt the notion of Deep Embedded Clustering [xie2016unsupervised] for clustering temporal graphs with our ST-GCAE architecture. The proposed clustering model consists of three parts, an encoder, a decoder, and a soft clustering layer.

Specifically, our ST-GCAE model maintains the graph’s structure but uses large temporal strides with an increasing channel number to compress an input sequence to a latent vector. The decoder uses temporal up-sampling layers and additional graph convolutional blocks, for gradually restoring original channel count and temporal dimension.

The ST-GCAE’s embedding is the starting point for clustering the data. The initial reconstruction based embedding is fine-tuned during our clustering optimization stage to reach the final clustering optimized embedding.

For each input sample x_i, we denote the encoder’s latent embedding by z_i and the soft cluster assignment calculated using the clustering layer by q_i. We denote the clustering layer’s parameters (the cluster centers) by μ_k. The probability q_ik of the i-th sample being assigned to the k-th cluster is:

$$ q_{ik} = \frac{\left(1 + \lVert z_i - \mu_k \rVert^2\right)^{-1}}{\sum_{k'} \left(1 + \lVert z_i - \mu_{k'} \rVert^2\right)^{-1}} \qquad (1) $$

We adopt the clustering objective and optimization algorithm proposed by [xie2016unsupervised]. The clustering objective is to minimize the KL divergence between the current probabilistic clustering prediction Q of the model and a target distribution P:

$$ L_{cluster} = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_k p_{ik} \log \frac{p_{ik}}{q_{ik}} \qquad (2) $$

The target distribution aims to strengthen the current cluster assignments by normalizing them and pushing each value closer to either 0 or 1. Repeated application of the function transforming Q into P would eventually result in a hard assignment vector. Each entry of the target distribution is calculated as:

$$ p_{ik} = \frac{q_{ik}^2 / f_k}{\sum_{k'} q_{ik'}^2 / f_{k'}}, \qquad f_k = \sum_i q_{ik} \qquad (3) $$

The clustering layer is initialized with the K-means centroids calculated for the encoded training set. Optimization is done in an Expectation-Maximization (EM) like fashion. During the Expectation step, the entire model is fixed and the target distribution P is updated. During the Maximization step, the model is optimized to minimize the clustering loss L_cluster.
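A minimal sketch of these quantities, directly following the DEC formulation of [xie2016unsupervised] that the paper adopts; the Student's t kernel and the symbol names come from that work rather than anything specific to this paper.

```python
import torch

def soft_assignment(z, centroids, alpha=1.0):
    """Q of Eq. (1): Student's t kernel between embeddings z (N, D) and centroids (K, D)."""
    dist_sq = torch.cdist(z, centroids) ** 2
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """P of Eq. (3): square the assignments, divide by the soft cluster size, re-normalize."""
    weight = q ** 2 / q.sum(dim=0)
    return weight / weight.sum(dim=1, keepdim=True)

def clustering_loss(q, p):
    """Eq. (2): KL(P || Q)."""
    return torch.sum(p * torch.log(p / q))

# EM-style alternation: the Expectation step fixes the model and recomputes P from the
# current Q; the Maximization step updates the model to minimize clustering_loss(q, p).
```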

3.4 Normality Scoring

This model supports two types of multimodal distributions. One is at the cluster assignment level; the other is at the soft-assignment vector level. For example, an action may be assigned to more than one cluster (cluster-level assignment), leading to a multimodal soft-assignment vector. The soft-assignment vectors themselves (that capture actions) can be modeled by a multimodal distribution as well.

The Dirichlet process mixture model (DPMM) is a useful measure for evaluating the distribution of proportional data. It meets our required setup: (i) An estimation (fitting) phase, during which a set of distribution parameters is evaluated, and (ii) An inference stage, providing a score for each embedded sample using the fitted model. A thorough overview of the model is given by Blei and Jordan [blei2006variational].

The DPMM is a common mixture extension to the unimodal Dirichlet distribution and uses the Dirichlet Process, an infinite-dimensional extension of the Dirichlet distribution. This model is multimodal and able to capture each mode as a mixture component. A fitted model has several modes, each representing a set of proportions that correspond to one normal behavior. At test time, each sample is scored by its log probability using the fitted model. Further explanations and discussion on the use of DPMM are available in [blei2006variational, dinari2019distributed].
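As a rough stand-in for the fitting and scoring steps, the sketch below uses scikit-learn's BayesianGaussianMixture with a Dirichlet-process prior over the mixture weights. The paper points to dedicated DPMM treatments [blei2006variational, dinari2019distributed], so this is only an approximate illustration: the Gaussian components, the truncation level, and the random proportions standing in for real soft-assignment vectors are all assumptions.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_normality_model(train_assignments, max_components=10, seed=0):
    """Fit a Dirichlet-process mixture to the soft-assignment vectors of normal samples."""
    dpmm = BayesianGaussianMixture(
        n_components=max_components,                      # truncation level of the DP
        weight_concentration_prior_type='dirichlet_process',
        covariance_type='full',
        random_state=seed,
    )
    dpmm.fit(train_assignments)                           # (N_train, K) proportions
    return dpmm

def normality_scores(dpmm, test_assignments):
    """Log-likelihood under the fitted mixture; low scores flag anomalies."""
    return dpmm.score_samples(test_assignments)           # (N_test,)

# Toy example with random proportions standing in for real soft-assignment vectors.
rng = np.random.default_rng(0)
model = fit_normality_model(rng.dirichlet(np.ones(20), size=500))
print(normality_scores(model, rng.dirichlet(np.ones(20), size=5)))
```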

3.5 Training

The training phase of the model consists of two stages, a pre-training stage for the autoencoder, in which the clustering branch of the network remains unchanged, and a fine-tuning stage in which both embedding and clustering are optimized. In detail:

Pre-Training: the model learns to encode and reconstruct a sequence by minimizing a reconstruction loss, denoted L_rec, between the original temporal pose graphs and those reconstructed by ST-GCAE.

Fine-Tuning: the model optimizes a combined loss function consisting of both the reconstruction loss and a clustering loss. Optimization is done such that the clustering layer is optimized w.r.t. L_cluster, the decoder is optimized w.r.t. L_rec, and the encoder is optimized w.r.t. both. The clustering layer is initialized via K-means. As shown by [Dizaji_2017], while the encoder is optimized w.r.t. both losses, the decoder is kept and acts as a regularizer that maintains the embedding quality of the encoder. The combined loss for this stage is:

$$ L = L_{rec} + \lambda \cdot L_{cluster} \qquad (4) $$
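Written out this way, the optimization split above falls out naturally, since L_rec does not depend on the clustering layer and L_cluster does not depend on the decoder. The sketch below assumes generic encoder/decoder modules, an MSE reconstruction loss, a lambda value, and the DEC helper functions from the sketch in section 3.3, none of which are taken verbatim from the paper.

```python
import torch.nn.functional as F

def fine_tune_step(encoder, decoder, centroids, x, optimizer, lam=0.5):
    """One fine-tuning step on a batch x (Eq. 4). `centroids` is the clustering layer's
    (K, D) parameter; the optimizer should cover encoder, decoder, and centroids."""
    optimizer.zero_grad()
    z = encoder(x)                                   # latent embeddings, (N, D)
    x_hat = decoder(z)                               # reconstructed pose sequences

    l_rec = F.mse_loss(x_hat, x)                     # reconstruction loss L_rec
    q = soft_assignment(z, centroids)                # Q, from the DEC sketch above
    p = target_distribution(q).detach()              # target P is held fixed within a step
    l_cluster = clustering_loss(q, p)                # clustering loss L_cluster

    loss = l_rec + lam * l_cluster                   # combined loss of Eq. (4)
    loss.backward()                                  # decoder receives only L_rec gradients,
    optimizer.step()                                 # centroids only L_cluster, encoder both
    return loss.item()
```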

4 Experiments

We evaluated our model in two different settings, using three datasets. The first setting is the common video anomaly detection setting, which we denote as the Fine-grained setting. In this setting, the normal sample consists of a single class and we seek to find fine-grained variations compared to it. For this setting, we use the ShanghaiTech Campus dataset. The second is our new problem setting, which we denote Coarse-grained anomaly detection, in which we seek to find abnormal actions that are different from those defined as normal.

4.1 ShanghaiTech Campus

Dataset

The ShanghaiTech Campus dataset [luo2017revisit] is one of the largest and most diverse datasets available for video anomaly detection. Presenting mostly person-based anomalies, it contains 130 abnormal events captured in 13 different scenes with complex lighting conditions and camera angles. Clips contain any number of people, from no people at all to over 20 people. The dataset contains over 300 untrimmed training and 100 untrimmed testing clips ranging from 15 seconds to over a minute long.

Experimental Setting

An experiment is comprised of two data splits, a training split containing normal examples only and a test split containing both normal and abnormal examples. Training is conducted solely using the training split. A score is calculated for each frame individually, and the combined score is the area under ROC curve for the concatenation of all frame scores in the test set.

We evaluate video streams of unknown length using a sliding-window approach. We split the input pose sequence into fixed-length, overlapping segments and score each segment individually. For clips with more than a single person, each person is scored individually, and the maximal score over all people in the frame is taken. As the ShanghaiTech Campus dataset is not annotated for pose, we use a 2D pose estimation model to extract human pose from every clip.
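A small numpy sketch of this evaluation protocol, with an assumed window length and stride and a user-supplied per-segment scoring function; frames covered by no detected person are left at -inf here and would need separate handling.

```python
import numpy as np

def score_clip(per_person_poses, score_fn, seg_len=12, stride=6):
    """per_person_poses: list of (T, V, C) keypoint arrays, one per tracked person.
    score_fn maps a (seg_len, V, C) segment to a scalar normality score.
    Returns one score per frame: max over covering segments, then max over people."""
    num_frames = max(p.shape[0] for p in per_person_poses)
    frame_scores = np.full((len(per_person_poses), num_frames), -np.inf)

    for i, pose in enumerate(per_person_poses):
        for start in range(0, pose.shape[0] - seg_len + 1, stride):
            s = score_fn(pose[start:start + seg_len])
            # a frame keeps its highest score among the segments that cover it
            frame_scores[i, start:start + seg_len] = np.maximum(
                frame_scores[i, start:start + seg_len], s)

    return frame_scores.max(axis=0)                  # max over people in each frame

# Toy usage: two tracked people and a scorer that rewards low average joint motion.
rng = np.random.default_rng(0)
people = [rng.normal(size=(40, 18, 2)), rng.normal(size=(25, 18, 2))]
toy_score = lambda seg: -np.abs(np.diff(seg, axis=0)).mean()
print(score_clip(people, toy_score).shape)           # (40,)
```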

We also evaluate our model using patch embeddings as input features instead of keypoint coordinates. RGB patches are cropped around each keypoint, embedded using a CNN, and the resulting patch feature vectors are used as the initial embedding of each keypoint. All other aspects of the model are kept the same.

Given the use of a pose estimation model, the patch embedding may be taken from one of the pose estimation model’s hidden layers, requiring no additional computation compared to the coordinate-based variant, other than increased dimension for the input layer. Further details regarding this variant of our model, implementation, and the pose estimation method used are available in the supplemental material.

Evaluation

We follow the evaluation protocol of Luo et al[luo2017revisit] and report the Area under ROC Curve (AUC) for our model in Table 1. ’Pose’ denotes the use of keypoint coordinates as the initial graph node embedding. ’Patch’ denotes the use of patch embeddings vectors, as discussed in this section. Our model outperforms previous state of the art methods, both pose and pixel based, by a large margin.

4.2 Coarse-Grained Anomaly Detection

4.2.1 Experimental Setting

For our second setting of Coarse-Grained Anomaly Detection, a model is trained using a sample of a few action classes considered normal. Training is done without labels, in an unsupervised manner. The model is evaluated by its ability to tell whether a new unseen clip belongs to any of the actions that make up the normal sample. For this setting, we adapt two action recognition datasets to our needs. This gives us great flexibility and control over the type of normal/abnormal actions that we want to detect. The datasets are NTU-RGB+D and Kinetics-250, both of which are provided with clip-level action labels.

ShanghaiTech Campus
Luo et al. [luo2017revisit] 0.680
Abati et al. [Abati_2019_CVPR] 0.725
Liu et al. [Liu_2018] 0.728
Morais et al. [Morais_2019_CVPR] 0.734
Ours - Pose 0.752
Ours - Patches 0.761
Table 1: Fine-Grained Anomaly Detection Results: Scores represent frame level AUC. [Morais_2019_CVPR] uses keypoint coordinates as input.

In this setting, we first select 3-5 action classes and denote them as our split. Samples are grouped into two sets: split samples and non-split samples. All labels are then dropped; no labels are used beyond this point, except in the final evaluation phase.

We conduct two complementary experiments. In Few vs. Many, there are few normal actions (say 3-5) in the training set and many (tens or even hundreds) actions denoted abnormal in the test set. We then repeat the experiment but switch the roles of the train and test sets, and denote this Many vs. Few.

We repeat the above experiments for two types of splits. The first kind, termed random splits, is made of sets of 3-5 classes selected at random from each dataset. The second, which we call meaningful splits, is made of action splits that are subjectively grouped following some binding logic regarding the action’s physical or environmental properties. A sample of meaningful and random splits is provided in Table 3. We use 10 random and 10 meaningful splits for evaluating each dataset.

NTU-RGB+D Kinetics-250
Few vs. Many Many vs. Few Few vs. Many Many vs. Few
Method Random Meaningful Random Meaningful Random Meaningful Random Meaningful
Supervised 0.86 0.83 0.82 0.90 0.77 0.71 0.63 0.82
Rec. Loss 0.50 0.54 0.53 0.54 0.45 0.46 0.51 0.61
OC-SVM 0.60 0.67 0.60 0.69 0.56 0.56 0.52 0.47
Liu et al. [Liu_2018] 0.57 0.64 0.56 0.63 0.55 0.60 0.55 0.58
Morais et al. [Morais_2019_CVPR] - - - - 0.57 0.59 0.56 0.58
Ours 0.73 0.82 0.72 0.85 0.65 0.73 0.62 0.74
Table 2: Coarse-Grained Experiment Results: Values represent area under the ROC curve (AUC). In bold are the results of the best performing unsupervised method. Underlined is the best method of all. See section 3.3 for the number of clusters used in all experiments. Note that random choice corresponds to AUC = 0.50.

4.2.2 Methods Evaluated

We compare our algorithm to several anomaly detection algorithms. All algorithms but the last one are unsupervised:

Autoencoder reconstruction loss

We use the reconstruction loss of our ST-GCAE model. In all experiments, the ST-GCAE reached convergence prior to the deep clustering fine-tuning stage. Further optimization of the ST-GCAE yielded no consistent improvement in results.

Autoencoder based one-class SVM

We fit a one-class SVM model using the encoded pose sequence representations (the latent embeddings z_i from section 3.3). At test time, the corresponding representation of each sample is scored using the fitted model.

Video anomaly detection methods

We train the Future Frame Prediction model proposed by Liu et al. [Liu_2018] and the Skeleton Trajectory model proposed by Morais et al. [Morais_2019_CVPR] using our various dataset splits. Anomaly scores for each video are obtained by averaging the per-frame scores provided by the model. As the method proposed by Morais et al. only handles 2D pose, it was not applied to the 3D-annotated NTU-RGB+D dataset.

Classifier softmax scores

The supervised baseline uses a classifier trained to classify each of the classes from the dataset split. The classifier architecture is based on the one proposed by [Yan2018SpatialTG]. To handle the significantly smaller number of samples, we use a shallower variant. For classifier architecture and implementation details, see suppl.

During the evaluation phase, a sample is passed through the classifier and its softmax output values are recorded. Anomaly score in this method is calculated by either using the softmax vector’s max value or by using the Dirichlet normality score from section 3.4, using softmax probabilities as input. We found Dirichlet based scoring to perform better for most cases, and we report results based on it.

It is important to note that this method is fundamentally different from our method and the other baselines. The classifier based method is a supervised method, relying on class action labels that were not used by other methods. It is thus not directly comparable and is here for reference only.

Kinetics
Random 1 Arm wrestling (6), Crawling baby (77)
Presenting weather forecast (254),
Surfing crowd (336)
Dancing Belly dancing (18), Capoeira (43),
Line dancing (75), Salsa (283),
Tango (348), Zumba (399)
Gym Lunge (183), Pull Ups (255), Push Up (260),
Situp (305), Squat (330)
NTU-RGB+D
Office Answer phone (28), Play with phone/tablet (29),
Typing on a keyboard (30), Read watch (33)
Fighting Punching (50), Kicking (51), Pushing (52),
Patting on back (53)
Table 3: Split Examples: A subset of the random and meaningful splits used for evaluating Kinetics and NTU-RGB+D datasets. For each split we list the included classes. Numbers in parentheses are the numeric class labels. For a complete list, See suppl.

4.2.3 Datasets

NTU-RGB+D

The NTU-RGB+D dataset by Shahroudy et al. [Shahroudy_2016_CVPR] consists of clips showing one or two people performing one of 60 action classes. Classes include both actions of a single person and two-person interactions, captured using static cameras. The dataset is provided with 3D joint measurements estimated using a Kinect depth sensor.

For this dataset, we use a model configuration similar to the one used for the ShanghaiTech experiments, with dimensions adapted for 3D pose.

Kinetics-250

The Kinetics dataset by Kay et al. [kay2017kinetics] is a collection of 400 action classes, each with over 400 clips that are 10 seconds long. The clips were downloaded from YouTube and may contain any number of people, who are not guaranteed to be fully visible.

Since Kinetics was not intended originally for pose estimation, some classes are unidentifiable by human pose extraction methods, e.g., the hair braiding class contains mostly clips focused on arms and heads. For such videos, a full-body pose estimation algorithm will yield zero keypoints for most cases.

Therefore, we use a subset of Kinetics-400 that is suitable for evaluation using pose sequences. To do that, we turn to the action classification results of [Yan2018SpatialTG]. Using their publicly available model, we pick the 250 best-performing action classes, ranked by their top-1 training classification accuracy; the cutoff accuracy of the lowest-ranked included class is shown in Figure 7 in the supplementary material. We denote our subset Kinetics-250.
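The selection step itself amounts to sorting classes by their per-class accuracy and keeping the top 250, as in the small sketch below; the accuracy values shown are hypothetical placeholders, and in practice they would come from evaluating the public model of [Yan2018SpatialTG] on the Kinetics-400 training set.

```python
# Hypothetical per-class top-1 accuracies; real values come from the ST-GCN model.
per_class_acc = {'arm wrestling': 0.93, 'braiding hair': 0.04, 'squat': 0.88}

def select_classes(acc_by_class, keep=250):
    ranked = sorted(acc_by_class.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:keep]]

print(select_classes(per_class_acc, keep=2))   # ['arm wrestling', 'squat']
```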

Due to the vast size of Kinetics (1000x larger than ShanghaiTech), we used a single GCN for the spatial convolution, using static adjacency matrices only, and no pooling. This makes this block identical to the one proposed by [Yan2018SpatialTG], used for this specific setting only. We quantify the degradation of this variant in the suppl. Kinetics is not annotated for pose and we use a 2D pose estimation model.

4.2.4 Evaluation

We report area under ROC curve (AUC) results in Table 2. As these datasets are annotated at the clip level, the sliding-window approach is not required for our method; each temporal pose graph is evaluated in a single forward pass, with the highest-scoring person taken.

As can be seen, our algorithm outperforms all four competing (unsupervised) methods, often by a large margin. The algorithm works well in both random and meaningful split modes, as well as in the Few vs. many and Many vs. few settings. Observe, however, that the algorithm works better on the meaningful splits (compared to the random splits). We believe this is because meaningful splits share similar patterns.

The table also reveals the impact of pose estimation quality on the results. The NTU-RGB+D dataset is cleaner, and the human pose is recovered using the Kinect depth sensor. As a result, the estimated poses are more accurate and the results are generally better than on the Kinetics-250 dataset.

4.3 Fail Cases

Figure 4 shows some failure cases, with the recovered pose graph superimposed on the image. As can be seen, there is significant variability in scenes, viewpoints, and poses of the people in a single clip. In column (a), a highly crowded scene causes numerous occlusions, and people are only partially detected. The large number of partially extracted people causes large variation in the scores provided by the model, and the abnormal skater is missed for multiple frames.

The two failures depicted in columns (b-c) show the weakness of relying on extracted pose for representing actions in a clip. Column (b) shows a cyclist who is only partially extracted by the pose estimation method and therefore missed by the model. Column (c) shows a non-person related event that is not handled by our model: a vehicle crossing the frame.

4.4 Ablation Study

We conduct a number of experiments to evaluate the robustness of our model to noisy normal training sets, i.e., having some percentage of abnormal actions present in the training set, presented next. We also conduct experiments to evaluate the importance of key model components and the stages of our clustering approach, presented in the suppl.


Figure 4: Failure Cases, ShanghaiTech: Frames overlaid with the extracted pose. In column (a), the large crowd occludes the abnormal skater and one another, causing multiple misses. Column (b) depicts a cyclist, considered abnormal; fast movement caused a pose estimation failure, preventing detection. Column (c) depicts a vehicle in the frame, which is not addressed by our method.
Robustness to Noise

In many scenarios, it is impossible to guarantee that a dataset contains nothing but normal samples, and some robustness to noise is required. To evaluate the model’s robustness to the presence of abnormal examples in the normal training sample, we introduce a varying number of abnormal samples, chosen at random, into the training set. These are taken from the unused abnormal portion of the dataset. Results are presented in Figure 5. Our model is robust and handles a large amount of abnormal data during training with little performance loss.

For most anomaly detection settings, abnormal events occurring at even a small rate are considered very frequent. Our model loses on average only a small amount of performance when trained with such levels of distraction. Only at the highest noise levels is there a considerable decline in performance; in this setting, the training set usually consists of 5 classes, so the distraction rate may be larger than the share of an individual underlying class.

Figure 5: AUC Loss for Training with Noisy Data: Performance of models trained on NTU-RGB+D splits when a percentage of abnormal samples is added at random. The model is robust to a significant amount of noise. At the highest noise level, the noise surpasses the amount of data for some of the underlying classes making up the split. Different curves denote different dataset splits.

5 Conclusion

We propose an anomaly detection algorithm that relies on estimated human poses. The human poses are represented as temporal pose graphs and we jointly embed and cluster them in a latent space. As a result, each action is represented as a soft-assignment vector in latent space. We analyze the distribution of these vectors using the Dirichlet Process Mixture Model. The normality score provided by the model is used to determine if the action is normal or not.

The proposed algorithm works on both fine-grained anomaly detection, where the goal is to detect variations of a single action (e.g., walking), as well as a new coarse-grained anomaly detection setting, where the goal is to distinguish between normal and abnormal actions.

Extensive experiments show that we achieve state-of-the-art results on ShanghaiTech, one of the leading (fine-grained) anomaly detection data sets. We also outperform existing unsupervised methods on our new coarse-grained anomaly detection test.

References

Appendix A Supplementary Material

The supplementary material provides additional ablation experiments, as well as details regarding experiment splits and results, and information regarding the proposed spatial attention graph convolution, and implementation of our method and of the baseline methods.

Specifically, Section B presents further ablation experiments conducted to evaluate our model. Section C presents the base action-words learned by our model in both settings.

In Section D we go into further detail regarding the proposed spatial attention graph convolution operator. Section E provides implementation details for our method, and Section F describes the implementations of the baseline methods used.

For the Coarse-grained experiments, per-split results and class lists are available in Section G and in Section H respectively. Finally, the complete list of classes used for Kinetics-250 is provided in Section I.

Appendix B Ablation Experiments - Cont.

In this section we provide further ablation experiments used to evaluate different model components:

Input and Spatial Convolution

In Table 4 we evaluate the contribution of two key components of our configuration. First, the input representation for nodes: we compare the Pose and Patch keypoint representations. In the Pose representation, each graph node is represented by its coordinate values as provided by the pose estimation model. In the Patch representation, we use features extracted by a CNN from a patch surrounding each keypoint.

Then, we evaluate the spatial graph operator used. We denote our spatial attention graph convolution by SAGC and the single-adjacency variant by GCN. It is evident that both the use of patches and the spatial attention graph convolution play a key role in our results.

Clustering Components

We conducted a number of ablation tests on one of the splits to measure the importance of the number of clusters K, the clustering initialization method, the proposed normality score, and the fine-tuning training stage. Results are summarized in Table 5.

The different columns correspond to different numbers of clusters. As can be seen, the best results are usually achieved for K = 20, and we use that value throughout our experiments in the coarse-grained setting. Each pair of rows corresponds to the two normality scores we evaluate: ”Dir.” stands for the Dirichlet based normality score, while ”Max” simply takes the maximum value of the softmax output, i.e., of the soft-assignment vector. Our proposed normality score performs consistently better, with a single exception visible in Table 5.

The first two rows of the table evaluate the importance of initializing the clustering layer. Rows 3-4 show the improvement gained by using K-means for initialization compared to the random initialization used in rows 1-2.

Next, we evaluate the importance of the fine-tuning stage. Models that were fine-tuned are denoted by DEC in the table; models in which the fine-tuning stage was skipped are denoted by No DEC. Rows 3-4 show results without the fine-tuning stage, while rows 5-6 show results with it. As can be seen, results improve considerably in nearly all configurations.

Method GCN SAGC
Pose Coords. 0.750 0.753
Patches 0.749 0.761
Table 4: Input and Spatial Convolution: Results of different model variations for the ShanghaiTech Campus dataset. Rows denote input node representations: Pose for keypoint coordinates, Patches for surrounding patch features. Columns denote different spatial convolutions: GCN uses the physical adjacency matrix only; SAGC is our proposed operator. SAGC provides a meaningful improvement when used with patch embedding. Values represent frame-level AUC.
Method 5 20 50
Random init, DEC, Max 0.45 0.42 0.44
Random init, DEC, Dir. 0.48 0.52 0.49
K-means init, No DEC, Max 0.57 0.51 0.48
K-means init, No DEC, Dir. 0.51 0.59 0.57
K-means init, DEC, Max 0.58 0.71 0.72
K-means init, DEC, Dir. 0.68 0.82 0.74
Table 5: Clustering Components: Results for the Kinetics-250 Few vs. Many experiment, split ”Music”. Values represent area under the ROC curve (AUC). Column headers denote the number of clusters. “Max” / “Dir.” denotes the normality scoring method used: the maximal softmax value or the Dirichlet based model. See section B for further details.

Appendix C Visualization of Action-words

It is instructive to look at the clusters of the different datasets (Figure 6). The top row shows some cluster centers in the fine-grained setting and the bottom row shows some cluster centers in the coarse-grained setting. As can be seen, the variation in the fine-grained setting is mainly due to viewpoint, because most of the actions are variations of walking. On the other hand, the variability of the coarse-grained dataset demonstrates the large variation in the actions handled by our algorithm.

Fine-grained

In this setting, actions close to the different cluster centroids depict common variations of the single action taken to be normal, in this case different walking directions. The dictionary action-words depict clear, unoccluded, full-body samples from normal actions.

Coarse-grained

Frames are selected from clips corresponding to base words extracted from a model trained on the Kinetics-250 dataset, split Random 6. Here, actions close to the centroids depict an essential manifestation of the underlying action classes. Several clusters in this case depict the underlying actions used to construct the split: Image (d) shows a sample from the ’presenting weather forecast’ class; facing the camera and pointing at a screen with the left arm while keeping the right one mostly static is highly representative of presenting weather. Image (e) depicts the common pose from the ’arm wrestling’ class, and Image (f) does the same for the ’crawling’ class.



Figure 6: Base Words: Samples closest to different cluster centroids, extracted from a trained model. Top: fine-grained, action words of the ShanghaiTech model. Bottom: coarse-grained, action words from a model trained on Kinetics-250, split Random 6. For the fine-grained setting, clusters capture small variations of the same action. For the coarse-grained setting, actions close to the centroids vary considerably and depict an essential manifestation of the underlying actions.

Appendix D Spatial Attention Graph Convolution

We now present in detail several components of our spatial attention graph convolution layer. It is important to note that each kind of adjacency is applied independently, with different convolutional weights. After concatenating the outputs from all GCNs, dimensions are reduced using a learnable convolution operator.

For this section, N denotes the number of samples, V the number of graph nodes, and C the number of channels. During the spatial processing phase, the pose in each frame is processed independently of temporal relations.

GCN Modules

We use three GCN operators, each corresponding to a different kind of adjacency matrix. Following each GCN we apply batch normalization and a ReLU activation. If a single adjacency matrix is provided, as in the static and globally-learnable cases, it is applied equally to all inputs. In the inferred case, each sample is convolved using its own inferred adjacency matrix.

Attention Mechanism

Generally, the attention mechanism is modular and can be replaced by any graph attention model meeting the same input and output dimensions. Several alternatives exist ([vaswani2017attention, velickovic2018graph]), but they come at significant computational cost. We chose a simpler mechanism, inspired by [Luong_2015, 2sagcn2019cvpr]. Each sample’s node feature matrix is multiplied by two separate learned attention weight matrices; one of the resulting embeddings is transposed, the dot product between the two is taken, and the result is normalized, yielding a V x V adjacency matrix. We found this simple mechanism to be useful and powerful.

Appendix E Implementation Details

Pose Estimation

For extracting pose graphs from the ShanghaiTech dataset we used Alphapose [fang2017rmpe]. Pose tracking is done using Poseflow [xiu2018poseflow]. Each keypoint is provided with a confidence value. For Kinetics-250 we use the publicly available keypoints (https://github.com/open-mmlab/mmskeleton) extracted using Openpose [Cao_2017]. Both datasets use 2D keypoints with confidence values.

The NTU-RGB+D dataset is provided with 3D keypoint annotations, acquired using a Kinect sensor. For 3D annotations, there are 25 keypoints for each person.

Patch Inputs

The ShanghaiTech model variant using patch features as input node embeddings works as follows: first, a pose graph is extracted; then, a patch is cropped around each keypoint in the corresponding frame. Given that pose estimation models rely on object detectors (Alphapose uses Faster R-CNN [ren2015faster]), intermediate features from the detector may be used with no added computation. For simplicity, we embedded each patch using a publicly available ResNet model (https://github.com/akamaster/pytorch_resnet_cifar10). The features used as input are the output of the model's global average pooling layer. Other than the input layer's shape, no changes were made to the network.

Architecture

A symmetric structure was used for ST-GCAE. Temporal downsampling by factors of 2 and 3 was applied in the second and fourth blocks, and the decoder is symmetrically reversed. We use the K = 20 clusters selected in Section B for NTU-RGB+D and Kinetics-250, and a separately chosen number of clusters for ShanghaiTech. During the training stage, the samples were augmented using random rotations and flips. During evaluation we average results for each sample over its augmented variants. Pre- and post-processing practices were applied equally to our method and all baseline methods.
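A minimal sketch of keypoint augmentation by random rotation and horizontal flip; the rotation range, flip probability, and left/right joint pairing are placeholders that would have to match the actual skeleton layout used.

```python
import numpy as np

def augment_pose(seq, max_angle_deg=10.0, flip_pairs=((1, 2), (3, 4)), rng=None):
    """seq: (T, V, 2) keypoint coordinates. Applies a random in-plane rotation and,
    with probability 0.5, a horizontal flip that swaps left/right joint pairs."""
    rng = rng or np.random.default_rng()
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    out = seq @ rot.T
    if rng.random() < 0.5:
        out[..., 0] = -out[..., 0]                  # mirror the x coordinate
        for a, b in flip_pairs:                     # swap each left/right joint pair
            out[:, [a, b]] = out[:, [b, a]]
    return out

aug = augment_pose(np.random.default_rng(0).normal(size=(12, 18, 2)))
print(aug.shape)                                     # (12, 18, 2)
```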

Training

Each model begins with a pre-training stage in which the clustering loss is not used. A fine-tuning stage of roughly equal length follows, during which the model is optimized using the combined loss, with a fixed clustering loss coefficient (the lambda of Eq. 4) for all experiments. The Adam optimizer [kingma_adam] is used.

Appendix F Baseline Implementation Details

Video anomaly detection methods

The evaluation of the future frame prediction method by Liu et al. [Liu_2018] was conducted using their publicly available implementation (https://github.com/stevenliuwen/ano_pred_cvpr2018/). Similarly, the evaluation of the trajectory based anomaly detection method by Morais et al. [Morais_2019_CVPR] was conducted using their publicly available implementation (https://github.com/RomeroBarata/skeleton_based_anomaly_detection). Training was done using the default parameters used by the authors, and changes were only made to adapt the data loading portion of the models to our datasets.

Classifier softmax scores

The classifier based supervised baseline is built on the basic ST-GCN block used for our method. We use a model based on the architecture proposed by Yan et al. [Yan2018SpatialTG], using their implementation (https://github.com/yysijie/st-gcn/). For the Few vs. Many experiments we use 6 ST-GCN blocks: two with 64 output channels, two with 128, and two with 256. This is the smaller model of the two, designed for the smaller amount of data available in the Few vs. Many experiments. For the Many vs. Few experiments we use 9 ST-GCN blocks: three with 64 output channels, three with 128, and three with 256. Both architectures use residual connections in each block and a temporal kernel size of 9. In both, the last layers with 64 and 128 channels perform temporal downsampling by a factor of 2. Training was done using the Adam optimizer.

The method provides a probability vector of per-class assignments. The vector is used as the input to the Dirichlet based normality scoring method that was used by our model. The scoring function’s parameters are fitted using the training set data considered “normal”, and in test time, each sample is scored using the fitted parameters.

Appendix G Detailed Experiment Results

Detailed results are provided for each dataset, method, and setting. Results for NTU-RGB+D are provided in Table 6 and results for Kinetics-250 in Table 7.

We use “sup.” to denote the supervised, classifier-based baseline in all figures. This method is fundamentally different from all others, and uses the class labels for supervision.

One can observe that for all settings our method is the top performer in most splits compared to unsupervised methods, often by a large margin.

Appendix H Class Splits Table

The list of random and meaningful splits used for evaluation is available in Table 8 for NTU-RGB+D and Table 9 for Kinetics-250.

Random splits were used to objectively evaluate the ability of a model to capture a specific subset of unrelated actions. Meaningful splits were chosen subjectively to contain a binding logic regarding the action’s physical or environmental properties, e.g. actions depicting musicians playing or actions one would likely see in a gym.

Figure 7 provides the top-1 training classification accuracy achieved by Yan et al[Yan2018SpatialTG] for each class in Kinetics-400 in descending order. It is used to show our cutoff point for choosing the Kinetics-250 classes.

Few vs. Many Many vs. Few
Method Rec. Loss OC-SVM FFP [Liu_2018] Ours Sup. Rec. Loss OC-SVM FFP [Liu_2018] Ours Sup.
Arms 0.58 0.77 0.67 0.86 0.69 0.31 0.67 0.67 0.73 0.97
Brushing 0.41 0.64 0.69 0.74 0.86 0.66 0.58 0.70 0.73 0.97
Dressing 0.60 0.68 0.62 0.80 0.87 0.61 0.74 0.63 0.80 0.86
Dropping 0.42 0.71 0.62 0.89 0.87 0.47 0.68 0.61 0.79 0.91
Glasses 0.49 0.77 0.51 0.86 0.82 0.41 0.66 0.55 0.76 0.94
Handshaking 0.87 0.51 0.70 0.99 0.90 0.87 0.85 0.72 0.71 0.71
Office 0.45 0.56 0.71 0.73 0.84 0.43 0.57 0.62 0.69 0.91
Fighting 0.81 0.76 0.62 0.99 0.78 0.77 0.84 0.61 0.99 0.88
Touching 0.40 0.68 0.60 0.78 0.72 0.40 0.64 0.55 0.66 0.98
Waving 0.37 0.62 0.65 0.71 0.90 0.38 0.59 0.65 0.74 0.89
Average 0.54 0.67 0.64 0.84 0.83 0.54 0.69 0.63 0.76 0.90
Random 1 0.38 0.65 0.64 0.76 0.85 0.51 0.56 0.65 0.61 0.95
Random 2 0.50 0.56 0.54 0.72 0.79 0.50 0.58 0.51 0.84 0.89
Random 3 0.64 0.54 0.54 0.77 0.93 0.64 0.64 0.51 0.84 0.61
Random 4 0.38 0.66 0.73 0.79 0.74 0.43 0.59 0.71 0.62 0.96
Random 5 0.53 0.53 0.50 0.58 0.83 0.51 0.53 0.52 0.59 0.78
Random 6 0.41 0.64 0.54 0.80 0.89 0.45 0.63 0.54 0.75 0.94
Random 7 0.44 0.53 0.51 0.66 0.87 0.46 0.52 0.51 0.63 0.74
Random 8 0.65 0.54 0.57 0.85 0.82 0.63 0.72 0.56 0.89 0.78
Random 9 0.45 0.69 0.52 0.62 0.86 0.48 0.62 0.52 0.61 0.81
Random 10 0.61 0.64 0.54 0.75 0.93 0.64 0.54 0.55 0.74 0.67
Average 0.50 0.60 0.57 0.73 0.86 0.53 0.60 0.56 0.69 0.82
Table 6: Coarse-Grained Experiment Results - NTU-RGB+D: Values represent area under the ROC curve (AUC). In bold are the results of the best performing unsupervised method. Underlined is the best method of all. “Sup.” denotes the supervised baseline. “FFP” denotes the future frame prediction method by Liu et al. [Liu_2018].
Few vs. Many Many vs. Few
Method Rec. OC-SVM FFP [Liu_2018] TBAD [Morais_2019_CVPR] Ours Sup. Rec. OC-SVM FFP [Liu_2018] TBAD [Morais_2019_CVPR] Ours Sup.
Batting 0.40 0.46 0.58 0.64 0.86 0.76 0.55 0.46 0.57 0.64 0.77 0.90
Cycling 0.41 0.56 0.61 0.59 0.80 0.63 0.62 0.54 0.64 0.54 0.68 0.81
Dancing 0.30 0.63 0.53 0.68 0.87 0.73 0.84 0.37 0.54 0.57 0.97 0.87
Gym 0.56 0.54 0.57 0.59 0.74 0.83 0.50 0.61 0.54 0.60 0.74 0.58
Jumping 0.44 0.42 0.61 0.53 0.70 0.52 0.65 0.49 0.59 0.52 0.67 0.80
Lifters 0.68 0.62 0.62 0.64 0.61 0.79 0.57 0.51 0.57 0.58 0.70 0.84
Music 0.43 0.52 0.61 0.60 0.82 0.90 0.57 0.50 0.59 0.64 0.62 0.62
Riding 0.49 0.61 0.60 0.53 0.56 0.66 0.65 0.52 0.61 0.55 0.76 0.88
Skiing 0.42 0.45 0.71 0.54 0.68 0.62 0.54 0.51 0.63 0.51 0.59 0.90
Throwing 0.41 0.58 0.51 0.6 0.68 0.70 0.62 0.46 0.53 0.65 0.90 0.95
Average 0.46 0.56 0.60 0.59 0.73 0.71 0.61 0.47 0.58 0.58 0.74 0.82
Random 1 0.48 0.55 0.55 0.53 0.53 0.81 0.39 0.61 0.58 0.54 0.63 0.58
Random 2 0.42 0.55 0.56 0.65 0.71 0.69 0.49 0.39 0.56 0.59 0.57 0.87
Random 3 0.49 0.54 0.55 0.56 0.70 0.75 0.57 0.49 0.44 0.55 0.55 0.53
Random 4 0.49 0.48 0.48 0.56 0.56 0.56 0.53 0.43 0.52 0.51 0.65 0.56
Random 5 0.41 0.60 0.61 0.61 0.71 0.76 0.66 0.41 0.57 0.52 0.62 0.56
Random 6 0.46 0.66 0.54 0.54 0.94 0.90 0.57 0.56 0.49 0.62 0.87 0.56
Random 7 0.38 0.46 0.59 0.54 0.57 0.67 0.48 0.45 0.59 0.54 0.54 0.54
Random 8 0.37 0.53 0.56 0.56 0.63 0.88 0.50 0.69 0.56 0.56 0.65 0.77
Random 9 0.40 0.56 0.52 0.64 0.59 0.80 0.59 0.49 0.54 0.57 0.55 0.76
Random 10 0.52 0.59 0.53 0.59 0.52 0.85 0.54 0.60 0.60 0.57 0.53 0.52
Average 0.45 0.56 0.55 0.57 0.65 0.77 0.51 0.52 0.55 0.56 0.62 0.63
Table 7: Coarse-Grained Experiment Results - Kinetics-250: Values represent area under the ROC curve (AUC). In bold are the results of the best performing unsupervised method. Underlined is the best method of all. “Sup.” denotes the supervised baseline. “FFP” denotes the future frame prediction method by Liu et al. [Liu_2018]. “TBAD” denotes the trajectory based anomaly detection method by Morais et al. [Morais_2019_CVPR].
NTU-RGB+D
Arms Pointing to something with finger (31), Salute (38), Put the palms together (39), Cross hands in front (say stop) (40)
Brushing Drink water (1), Brushing teeth (3), Brushing hair (4)
Dressing Wear jacket (14), Take off jacket (15), Wear a shoe (16), Take off a shoe (17)
Dropping Drop (5), Pickup (6), Sitting down (8), Standing up (from sitting position) (9)
Glasses Wear on glasses (18), Take off glasses (19), Put on a hat/cap (20), Take off a hat/cap (21)
Handshaking Hugging other person (55), Giving something to other person (56), Touch other person’s pocket (57), Handshaking (58)
Office Make a phone call/answer phone (28), Playing with phone/tablet (29), Typing on a keyboard (30),
Check time (from watch) (33)
Fighting Punching/slapping other person (50), Kicking other person (51), Pushing other person (52), Pat on back of other person (53)
Touching Touch head (headache) (44), Touch chest (stomachache/heart pain) (45), Touch back (backache) (46),
Touch neck (neckache) (47)
Waving Clapping (10), Hand waving (23), Pointing to something with finger (31), Salute (38)
Random 1 Brushing teeth (3), Pointing to something with finger (31), Nod head/bow (35), Salute (38)
Random 2 Walking apart from each other (0), Throw (7), Wear on glasses (18), Hugging other person (55)
Random 3 Brushing teeth (3), Tear up paper (13), Wear jacket (14), Staggering (42)
Random 4 Eat meal/snack (2), Writing (12), Taking a selfie (32), Falling (43)
Random 5 Playing with phone/tablet (29), Check time (from watch) (33), Rub two hands together (34), Pushing other person (52)
Random 6 Eat meal/snack (2), Take off glasses (19), Take off a hat/cap (21), Kicking something (24)
Random 7 Drop (5), Tear up paper (13), Wear on glasses (18), Put the palms together (39)
Random 8 Falling (43), Kicking other person (51), Point finger at the other person (54), Point finger at the other person (54)
Random 9 Wear on glasses (18), Rub two hands together (34), Falling (43), Punching/slapping other person (50)
Random 10 Throw (7), Clapping (10), Use a fan (with hand or paper)/feeling warm (49), Giving something to other person (56)
Table 8: Complete List of Splits - NTU-RGB+D: The splits used for evaluating the NTU-RGB+D dataset. Numbers in parentheses are the numeric class labels. Split names often carry no particular significance; each was typically named after one of the classes in its split.
          
Figure 7: Kinetics Classes by Classification Accuracy: Sorted top-1 accuracy values for the Kinetics-400 classes. Each tuple denotes a class's rank and its training classification accuracy, as achieved by the classification method proposed by Yan et al. [Yan2018SpatialTG]. The dashed line marks the cut-off accuracy used for selecting the classes included in Kinetics-250.
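Figure 7 summarizes how the Kinetics-250 subset was formed: classes are ranked by skeleton-based top-1 training accuracy and those below a cut-off are discarded. A minimal sketch of that selection step follows; the accuracy values and the 0.4 threshold are hypothetical placeholders, not the figure's actual numbers.

# Hypothetical per-class top-1 training accuracies for a few Kinetics-400 classes.
per_class_acc = {
    "abseiling": 0.62,
    "air drumming": 0.55,
    "answering questions": 0.08,  # low-accuracy classes fall below the cut-off
    "archery": 0.47,
}

CUTOFF = 0.4  # placeholder threshold; the real cut-off is the dashed line in Figure 7
kept = sorted(name for name, acc in per_class_acc.items() if acc >= CUTOFF)
print(f"{len(kept)} classes kept:", kept)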
Kinetics
Batting Golf driving (143), Golf putting (144), Hurling (sport) (162), Playing squash or racquetball (246), Playing tennis (247)
Cycling Riding a bike (268), Riding mountain bike (272), Riding unicycle (276), Using segway (376)
Dancing Belly dancing (19), Capoeira (44), Country line dancing (76), Salsa dancing (284), Tango dancing (349), Zumba (400)
Gym Lunge (184), Pull Ups (256), Push Up (261), Situp (306), Squat (331)
Jumping High jump (152), Jumping into pool (173), Long jump (183), Triple jump (368)
Lifters Bench pressing (20), Clean and jerk (60), Deadlifting (89), Front raises (135), Snatch weight lifting (319)
Music Playing accordion (218), Playing cello (224), Playing clarinet (226), Playing drums (231), Playing guitar (233), Playing harp (235)
Riding Lunge (184), Pull Ups (256), Push Up (261), Situp (306), Squat (331)
Skiing Roller skating (281), Skateboarding (307), Skiing slalom (311), Tobogganing (361)
Throwing Hammer throw (149), Javelin throw (167), Passing american football (in game) (209), Shot put (299), Throwing axe (357), Throwing discus (359)
Random 1 Climbing tree (69), Juggling fire (171), Marching (193), Shaking head (290), Using segway (376)
Random 2 Drop kicking (106), Golf chipping (142), Pole vault (254), Riding scooter (275), Ski jumping (308)
Random 3 Bench pressing (20), Hammer throw (149), Playing didgeridoo (230), Sign language interpreting (304), Wrapping present (395)
Random 4 Cleaning floor (61), Ice fishing (164), Using segway (376), Waxing chest (388)
Random 5 Barbequing (15), Golf chipping (142), Kissing (177), Lunge (184)
Random 6 Arm wrestling (7), Crawling baby (78), Presenting weather forecast (255), Surfing crowd (337)
Random 7 Bobsledding (29), Canoeing or kayaking (43), Dribbling basketball (100), Playing ice hockey (236)
Random 8 Playing basketball (221), Playing tennis (247), Squat (331)
Random 9 Golf putting (144), Juggling fire (171), Walking the dog (379)
Random 10 Jumping into pool (173), Krumping (180), Presenting weather forecast (255)
Table 9: Complete List of Splits - Kinetics-250: The splits used for evaluating the Kinetics-250 dataset. Numbers in parentheses are the numeric class labels. Split names often carry no particular significance; each was typically named after one of the classes in its split.

Appendix I Kinetics-250 Class List

  1. Abseiling (1)

  2. Air drumming (2)

  3. Archery (6)

  4. Arm wrestling (7)

  5. Arranging flowers (8)

  6. Assembling computer (9)

  7. Auctioning (10)

  8. Barbequing (15)

  9. Bartending (16)

  10. Beatboxing (17)

  11. Belly dancing (19)

  12. Bench pressing (20)

  13. Bending back (21)

  14. Biking through snow (23)

  15. Blasting sand (24)

  16. Blowing glass (25)

  17. Blowing out candles (28)

  18. Bobsledding (29)

  19. Bookbinding (30)

  20. Bouncing on trampoline (31)

  21. Bowling (32)

  22. Braiding hair (33)

  23. Breakdancing (35)

  24. Building cabinet (39)

  25. Building shed (40)

  26. Bungee jumping (41)

  27. Busking (42)

  28. Canoeing or kayaking (43)

  29. Capoeira (44)

  30. Carrying baby (45)

  31. Cartwheeling (46)

  32. Catching or throwing softball (51)

  33. Celebrating (52)

  34. Cheerleading (56)

  35. Chopping wood (57)

  36. Clapping (58)

  37. Clean and jerk (60)

  38. Cleaning floor (61)

  39. Climbing a rope (67)

  40. Climbing tree (69)

  41. Contact juggling (70)

  42. Cooking chicken (71)

  43. Country line dancing (76)

  44. Cracking neck (77)

  45. Crawling baby (78)

  46. Curling hair (81)

  47. Dancing ballet (85)

  48. Dancing charleston (86)

  49. Dancing gangnam style (87)

  50. Dancing macarena (88)

  51. Deadlifting (89)

  52. Dining (92)

  53. Disc golfing (93)

  54. Diving cliff (94)

  55. Doing aerobics (96)

  56. Doing nails (98)

  57. Dribbling basketball (100)

  58. Driving car (104)

  59. Driving tractor (105)

  60. Drop kicking (106)

  61. Dunking basketball (108)

  62. Dying hair (109)

  63. Eating burger (110)

  64. Eating spaghetti (117)

  65. Exercising arm (120)

  66. Extinguishing fire (122)

  67. Feeding birds (124)

  68. Feeding fish (125)

  69. Feeding goats (126)

  70. Filling eyebrows (127)

  71. Finger snapping (128)

  72. Flying kite (131)

  73. Folding clothes (132)

  74. Front raises (135)

  75. Frying vegetables (136)

  76. Gargling (138)

  77. Giving or receiving award (141)

  78. Golf chipping (142)

  79. Golf driving (143)

  80. Golf putting (144)

  81. Grooming horse (147)

  82. Gymnastics tumbling (148)

  83. Hammer throw (149)

  84. Headbanging (150)

  85. High jump (152)

  86. Hitting baseball (154)

  87. Hockey stop (155)

  88. Hopscotch (157)

  89. Hula hooping (160)

  90. Hurdling (161)

  91. Hurling (sport) (162)

  92. Ice climbing (163)

  93. Ice fishing (164)

  94. Ice skating (165)

  95. Ironing (166)

  96. Javelin throw (167)

  97. Jetskiing (168)

  98. Jogging (169)

  99. Juggling balls (170)

  100. Juggling fire (171)

  101. Juggling soccer ball (172)

  102. Jumping into pool (173)

  103. Jumpstyle dancing (174)

  104. Kicking field goal (175)

  105. Kicking soccer ball (176)

  106. Kissing (177)

  107. Knitting (179)

  108. Krumping (180)

  109. Laughing (181)

  110. Long jump (183)

  111. Lunge (184)

  112. Making bed (187)

  113. Making snowman (190)

  114. Marching (193)

  115. Massaging back (194)

  116. Milking cow (198)

  117. Motorcycling (200)

  118. Mowing lawn (202)

  119. News anchoring (203)

  120. Parkour (208)

  121. Passing american football (in game) (209)

  122. Passing american football (not in game) (210)

  123. Picking fruit (215)

  124. Playing accordion (218)

  125. Playing badminton (219)

  126. Playing bagpipes (220)

  127. Playing basketball (221)

  128. Playing bass guitar (222)

  129. Playing cello (224)

  130. Playing chess (225)

  131. Playing clarinet (226)

  132. Playing cricket (228)

  133. Playing didgeridoo (230)

  134. Playing drums (231)

  135. Playing flute (232)

  136. Playing guitar (233)

  137. Playing harmonica (234)

  138. Playing harp (235)

  139. Playing ice hockey (236)

  140. Playing kickball (238)

  141. Playing organ (240)

  142. Playing paintball (241)

  143. Playing piano (242)

  144. Playing poker (243)

  145. Playing recorder (244)

  146. Playing saxophone (245)

  147. Playing squash or racquetball (246)

  148. Playing tennis (247)

  149. Playing trombone (248)

  150. Playing trumpet (249)

  151. Playing ukulele (250)

  152. Playing violin (251)

  153. Playing volleyball (252)

  154. Playing xylophone (253)

  155. Pole vault (254)

  156. Presenting weather forecast (255)

  157. Pull ups (256)

  158. Pumping fist (257)

  159. Punching bag (259)

  160. Punching person (boxing) (260)

  161. Push up (261)

  162. Pushing car (262)

  163. Pushing cart (263)

  164. Reading book (265)

  165. Riding a bike (268)

  166. Riding camel (269)

  167. Riding elephant (270)

  168. Riding mechanical bull (271)

  169. Riding mountain bike (272)

  170. Riding or walking with horse (274)

  171. Riding scooter (275)

  172. Riding unicycle (276)

  173. Robot dancing (278)

  174. Rock climbing (279)

  175. Rock scissors paper (280)

  176. Roller skating (281)

  177. Running on treadmill (282)

  178. Sailing (283)

  179. Salsa dancing (284)

  180. Sanding floor (285)

  181. Scrambling eggs (286)

  182. Scuba diving (287)

  183. Shaking head (290)

  184. Shaving head (293)

  185. Shearing sheep (295)

  186. Shooting basketball (297)

  187. Shot put (299)

  188. Shoveling snow (300)

  189. Shuffling cards (302)

  190. Side kick (303)

  191. Sign language interpreting (304)

  192. Singing (305)

  193. Situp (306)

  194. Skateboarding (307)

  195. Ski jumping (308)

  196. Skiing (not slalom or crosscountry) (309)

  197. Skiing crosscountry (310)

  198. Skiing slalom (311)

  199. Skipping rope (312)

  200. Skydiving (313)

  201. Slacklining (314)

  202. Sled dog racing (316)

  203. Smoking hookah (318)

  204. Snatch weight lifting (319)

  205. Snorkeling (322)

  206. Snowkiting (324)

  207. Spinning poi (327)

  208. Springboard diving (330)

  209. Squat (331)

  210. Stomping grapes (333)

  211. Stretching arm (334)

  212. Stretching leg (335)

  213. Strumming guitar (336)

  214. Surfing crowd (337)

  215. Surfing water (338)

  216. Sweeping floor (339)

  217. Swimming backstroke (340)

  218. Swimming breast stroke (341)

  219. Swimming butterfly stroke (342)

  220. Swinging legs (344)

  221. Tai chi (347)

  222. Tango dancing (349)

  223. Tap dancing (350)

  224. Tapping guitar (351)

  225. Tapping pen (352)

  226. Tasting beer (353)

  227. Testifying (355)

  228. Throwing axe (357)

  229. Throwing discus (359)

  230. Tickling (360)

  231. Tobogganing (361)

  232. Training dog (364)

  233. Trapezing (365)

  234. Trimming or shaving beard (366)

  235. Triple jump (368)

  236. Tying tie (371)

  237. Using segway (376)

  238. Vault (377)

  239. Waiting in line (378)

  240. Walking the dog (379)

  241. Washing feet (381)

  242. Water skiing (384)

  243. Waxing chest (388)

  244. Waxing eyebrows (389)

  245. Welding (392)

  246. Windsurfing (394)

  247. Wrapping present (395)

  248. Wrestling (396)

  249. Yoga (399)

  250. Zumba (400)