Graph Embedded Pose Clustering for Anomaly Detection
We propose a new method for anomaly detection of human actions. Our method works directly on human pose graphs that can be computed from an input video sequence. This makes the analysis independent of nuisance parameters such as viewpoint or illumination. We map these graphs to a latent space and cluster them. Each action is then represented by its soft-assignment to each of the clusters. This gives a kind of "bag of words" representation of the data, where every action is represented by its similarity to a group of base action-words. Then, we use a Dirichlet process based mixture, which is well suited to handling proportional data such as our soft-assignment vectors, to determine whether an action is normal or not. We evaluate our method on two types of datasets. The first is a fine-grained anomaly detection dataset (e.g., ShanghaiTech), where we wish to detect unusual variations of some action. The second is a coarse-grained anomaly detection dataset (e.g., a Kinetics-based dataset), where a few actions are considered normal and every other action should be considered abnormal. Extensive experiments on the benchmarks show that our method performs considerably better than other state-of-the-art methods.
Anomaly detection in video has been investigated extensively over the years. This is because the amount of video captured far surpasses our ability to manually analyze it. Anomaly detection algorithms are designed to help human operators deal with this problem. The question is how to define anomalies and how to detect them.
The decision of whether an action is normal or not is nuanced. In some cases, we are interested in detecting abnormal variations of an action. For example, an abnormal type of walking. We term this fine-grained anomaly detection. In other cases, we might be interested in defining normal actions and regard any other action as abnormal. For example, we might be interested in determining that dancing is normal, while gymnastics are abnormal. We call this coarse-grained anomaly detection.
We desire an algorithm that can handle both types of anomaly detection in a single, unified fashion. Such an algorithm should take as input an unlabeled set of videos that capture normal actions only (fine- or coarse-grained) and use that to train a model that will distinguish normal from abnormal actions.
We take advantage of the recent progress in human pose estimation and assume our algorithm takes human pose graphs as input. This offers several advantages. First, it abstracts the problem and lets the algorithm focus on human pose and not on irrelevant features such as viewing direction, illumination, or background clutter. In addition, a human pose can be represented as a compact graph, which makes analyzing, training and testing much faster.
Given a sequence of video frames, we use a pose estimation method to extract the keypoints of every person in each frame. Every person in a clip is represented as a temporal pose graph. We use a combination of an autoencoder and a clustering branch to map the training samples into a latent space where samples are soft-clustered. Each sample is then represented by its soft-assignment to each of the clusters. This can be understood as learning a bag-of-words representation for actions: each cluster corresponds to an action-word, and each action is represented by its similarity to each of the action-words. Figure 1 gives an overview of our method.
Each sample is thus described by a soft-assignment vector, with the k-th entry denoting the probability of the sample being assigned to cluster k.
The soft-assignment vectors capture proportional data and the tool to measure their distribution is the Dirichlet Process Mixture Model. Once we fit the model to the data, we can obtain a normality score for each sample and determine if the action is to be classified as normal or not.
The algorithm thus consists of a series of abstractions. Using human pose graphs eliminates the need to deal with viewpoint and illumination changes. And the soft-assignment representation abstracts the type of data (fine-grained or coarse-grained) from the Dirichlet model.
We evaluate our algorithm in two settings. The first is the ShanghaiTech Campus [luo2017revisit] dataset, a large and extensively evaluated anomaly detection benchmark. This is a typical (fine-grained) anomaly detection benchmark in which normal behavior is taken to be walking, and the goal is to detect abnormal events, such as people running, fighting, riding bicycles, throwing objects, etc.
The second is a new problem setting we propose, and denote Coarse-grained anomaly detection. Instead of focusing on a single action (i.e., walking), as in the ShanghaiTech dataset, we construct a training set consisting of a varying number of actions that are to be regarded as normal. For example, the training set may consist of video clips of different dancing styles. At test time, every dance video should be classified as normal, while any other action should be classified as abnormal.
We demonstrate this new, challenging, Coarse-grained anomaly detection setting on two action classification datasets. The first is the NTU-RGB+D dataset, where 3D body joints are detected using a Kinect sensor. The second is a larger and more challenging dataset that consists of 250 of the 400 actions in the Kinetics-400 dataset (we use only a subset of the classes, as not all classes can be detected using human pose detectors). For both datasets, we use a subset of the actions to define a training set of normal actions and use the rest of the videos to test whether the algorithm can correctly distinguish normal from abnormal videos.
We conduct extensive experiments, compare to a number of competing approaches and find that our algorithm outperforms all of them.
To summarize, we propose three key contributions:
The use of embedded pose graphs and a Dirichlet process mixture for video anomaly detection;
A new coarse-grained setting for exploring broader aspects of video anomaly detection;
State-of-the-art AUC (0.761) on the ShanghaiTech Campus anomaly detection benchmark.
The field of anomaly detection is broad and has a large variation in setting and assumptions, as is evident by the different datasets proposed to evaluate methods in the field.
For our fine-grained experiment, we use the ShanghaiTech Campus dataset [luo2017revisit]. Containing 130 anomalous events in 13 different scenes, with various camera angles and lighting conditions, it is more diverse and significantly larger than all previous common datasets. It is presented in detail in section 4.1.
In recent years, numerous works have tackled the problem of anomaly detection in video using deep learning based models. These can be roughly categorized into reconstructive, predictive, and generative models.
Reconstructive models learn a feature representation for each sample and attempt to reconstruct the sample from that embedding, often using autoencoders [Abati_2019_CVPR, Chong_2017, Hasan_2016_CVPR]. Predictive model based methods aim to model the current frame based on a set of previous frames, often relying on recurrent neural networks [luo2017remembering, luo2017revisit, medel2016anomaly] or 3D convolutions [sabokrou2017deep, zhao2017spatio]. In some cases, reconstruction-based models are combined with prediction-based methods for improved accuracy [zhao2017spatio]. In both cases, samples that are poorly reconstructed or predicted are considered anomalous.
Generative models were also used to reconstruct, predict or model the distribution of the data, often using Variational Autoencoders (VAEs) [an2015variational] or GANs [akccay2019skip, lotter2015unsupervised, ravanbakhsh2017abnormal, ravanbakhsh2017training].
A method proposed by Liu et al. [Liu_2018] uses a generative future frame prediction model and compares a prediction with its ground truth by evaluating differences in gradient-based features and optic flow. This method requires optic flow computation and generating a complete scene, which makes it costly and less robust to large scenery changes.
Recently, Morais et al. [Morais_2019_CVPR] proposed an anomaly detection method using a fully connected RNN to analyze pose sequences. The method embeds a sequence, then uses reconstruction and prediction branches to generate past and future poses, respectively. Anomaly score is determined by the reconstruction and prediction errors of the model.
To represent human poses as graphs, the inner-graph relations are described using weighted adjacency matrices. Each matrix could be static or learnable and represent any kind of relation.
In recent years, many approaches were proposed for applying deep learning based methods to graph data. Kipf and Welling [kipf2017semi] proposed the notion of Fast Approximate Convolutions On Graphs. Following Kipf and Welling, both temporal and multiple adjacency extensions were proposed. Works by Yan et al. [Yan2018SpatialTG] and Yu et al. [Yu_2018] proposed temporal extensions, with the former work proposing the use of separable spatial and temporal graph convolutions (ST-GCN), applied sequentially. We follow the basic ST-GCN block design, illustrated in Figure 2.
Veličković et al. [velickovic2018graph] proposed Graph Attention Networks, a GCN extension in which the weighting of neighboring nodes are inferred using an attention mechanism, relying on a fixed adjacency matrix only to determine neighboring nodes.
Shi et al. [2sagcn2019cvpr] recently extended the concept of spatio-temporal graph convolutions by using several adjacency matrices, of which some are learned or inferred. Inferred adjacency is determined using an embedded similarity measure, optimized during training. Adjacency matrices are summed prior to applying the convolution.
Deep clustering methods aim to provide useful cluster assignments by optimizing a deep model under a cluster inducing objective. For example, several recent methods jointly embed and cluster data using unsupervised representation learning methods, such as autoencoders, with clustering modules [caron2018deep, Dizaji_2017, Wang_2016, xie2016unsupervised].
A method proposed by Xie et al. [xie2016unsupervised], denoted Deep Embedded Clustering (DEC), proposed an alternating two-step approach. In the first step, a target distribution is calculated using the current cluster assignments. In the next step, the model is optimized to provide cluster assignments similar to the target distribution. Recent extensions tackled DEC’s susceptibility to degenerate solutions using regularization methods and various post-processing means [Dizaji_2017, haeusser2018associative].
We design an anomaly detection algorithm that can operate in a number of different scenarios. The algorithm consists of a sequence of abstractions that are designed to help each step of the algorithm work better. First, we use a human pose detector on the input data. This abstracts the problem and prevents the next steps from dealing with nuisance parameters such as viewpoint or illumination changes.
Human actions are represented as space-time graphs and we embed (sub-sections 3.1, 3.2) and cluster (sub-section 3.3) them in some latent space. Each action is now represented as a soft-assignment vector to a group of base actions. This abstracts the underlying type of actions (i.e., fine-grained or coarse-grained), leading to the final stage of learning their distribution. The tool we use for learning the distribution of soft-assignment vectors is the Dirichlet process mixture (sub-section 3.4), and we fit a model to the data. This model is then used to determine if an action is normal or not.
We wish to capture the relations between body joints, while at the same time provide robustness to external factors such as appearance, viewpoint and lighting. Therefore, we represent a person’s pose with a graph.
Each node of the graph corresponds to a keypoint, a body joint, and each edge represents some relation between two nodes. Many keypoint relations exist, such as physical relations defined anatomically (e.g., the left wrist and elbow are connected) and action relations defined by movements that tend to be highly correlated in the context of a certain action (e.g., the left and right knees tend to move in opposite directions while running). The graph is directed because some relations are learned during the optimization process and are not symmetric. An added benefit of this representation is its compactness, which is important for efficient video analysis.
In order to extend this formulation temporally, pose keypoints extracted from a video sequence are represented as a temporal sequence of pose graphs. The temporal pose graph is a time series of human joint locations. Temporal domain adjacency could be similarly defined by connecting joints in successive frames, allowing us to perform graph convolution operations exploiting both spatial and temporal dimensions of our sequence of pose graphs.
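As a concrete sketch, the spatio-temporal adjacency can be built by tiling a per-frame skeleton adjacency and linking each joint to itself in the next frame. The 3-joint chain below is illustrative only, not the paper's skeleton layout:

```python
import numpy as np

def st_adjacency(spatial_edges, num_joints, num_frames):
    """Build a block adjacency matrix for a temporal pose graph.

    Nodes are indexed t * num_joints + j. Spatial edges connect joints
    within a frame; temporal edges connect the same joint in
    consecutive frames.
    """
    n = num_joints * num_frames
    A = np.zeros((n, n))
    for t in range(num_frames):
        base = t * num_joints
        for i, j in spatial_edges:          # intra-frame (spatial) links
            A[base + i, base + j] = A[base + j, base + i] = 1.0
        if t + 1 < num_frames:              # inter-frame (temporal) links
            nxt = (t + 1) * num_joints
            for j in range(num_joints):
                A[base + j, nxt + j] = A[nxt + j, base + j] = 1.0
    return A

# Toy 3-joint chain (e.g. head-torso-hip) over 2 frames.
A = st_adjacency([(0, 1), (1, 2)], num_joints=3, num_frames=2)
```

A graph convolution applied with this adjacency then mixes information across both the spatial and temporal dimensions of the pose sequence.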
We propose a deep temporal graph autoencoder based architecture for embedding the temporal pose graphs. Building on the basic block design of ST-GCN, presented in Figure 2, we substitute the basic GCN operator with a novel Spatial Attention Graph Convolution, presented next.
We use this building block to construct a Spatio-Temporal Graph Convolutional Auto-Encoder, or ST-GCAE. We use ST-GCAE to embed the spatio-temporal graph and take the embedding to be the starting point for our clustering branch.
We propose a new graph operator, presented in Figure 3, that uses adjacency matrices of three types: Static, Globally-learned and Inferred (attention-based). Each adjacency type is applied with its own GCN, using separate weights. The outputs from the GCNs are stacked in the channel dimension. A convolution is applied as a learnable reduction measure for weighting the stacked outputs, and provides the required output channel number.
The three adjacency matrices capture different aspects of the model: (i) The use of body-part connectivity as a prior over node relations, represented using the static adjacency matrix. (ii) Dataset level keypoint relations, captured by the global adjacency matrix, and (iii) Sample specific relations, captured by inferred adjacency matrices. Finally, the learnable reduction measure weights the different outputs.
The static adjacency is fixed and shared by all layers. The globally-learnable matrix is learned individually at each layer and applied equally to all samples during the forward pass. The inferred adjacency matrices are based on an attention mechanism that uses learned weights to calculate a sample-specific adjacency matrix, a different one for every sample in a batch. For example, for a batch of B graphs with N nodes each, the inferred adjacency is of size B × N × N, while the other adjacency types are N × N matrices.
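A minimal numpy sketch of the three-branch convolution described above; the shapes, initializations, and the weighted 1×1-style reduction here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, C_in, C_out = 4, 18, 3, 8            # batch, nodes, channels

X = rng.normal(size=(B, N, C_in))          # node features
A_static = np.eye(N)                       # body-part prior (placeholder)
A_global = np.full((N, N), 1.0 / N)        # globally-learned, shared by all samples
A_infer = rng.dirichlet(np.ones(N), size=(B, N))  # per-sample, (B, N, N)

# Each adjacency type is applied with its own GCN weights (A @ X @ W).
W = rng.normal(size=(3, C_in, C_out))
out_static = A_static @ X @ W[0]           # (B, N, C_out)
out_global = A_global @ X @ W[1]
out_infer = A_infer @ X @ W[2]

# Stack in the channel dimension, then apply a learnable reduction
# (the 1x1-convolution analogue) to restore the output channel count.
stacked = np.concatenate([out_static, out_global, out_infer], axis=-1)
W_reduce = rng.normal(size=(3 * C_out, C_out))
out = stacked @ W_reduce                   # (B, N, C_out)
```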
The globally-learned adjacency is learned by initializing a fully-connected graph, with a complete, uniform, adjacency matrix. The matrix is jointly optimized with the rest of the model’s parameters during training. The computational overhead of this adjacency is small for graphs containing no more than a few dozen nodes.
An inferred adjacency matrix is constructed using a graph self-attention layer. After evaluating a few attention models we chose a simple multiplicative attention mechanism. First, we embed the input twice, using two sets of learned weights. We then transpose one of the embedded matrices and take the dot product between the two and normalize. We then get the inferred adjacency matrix. The attention mechanism chosen is modular and may be replaced with other common alternatives. Further details are provided in the supplementary material.
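Under the description above, the multiplicative attention can be sketched as follows. The embedding dimension and the softmax row-normalization are assumptions; the paper defers the exact details to its supplementary material:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def inferred_adjacency(X, W1, W2):
    """Multiplicative (dot-product) self-attention over graph nodes.

    X: (B, N, C) node features; W1, W2: (C, D) learned embedding weights.
    Returns a per-sample adjacency of shape (B, N, N) with rows
    normalized to sum to one.
    """
    E1 = X @ W1                            # first embedding,  (B, N, D)
    E2 = X @ W2                            # second embedding, (B, N, D)
    scores = E1 @ E2.transpose(0, 2, 1)    # pairwise similarities, (B, N, N)
    return softmax(scores, axis=-1)        # normalize each node's row

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 18, 3))
A = inferred_adjacency(X, rng.normal(size=(3, 4)), rng.normal(size=(3, 4)))
```

As noted in the text, this attention block is modular: the dot-product scoring could be swapped for an additive or other common attention variant without changing the surrounding architecture.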
(Figure 3 caption: a residual connection is used; GCN modules include batch normalization and ReLU activation, omitted for readability.)
To build our dictionary of underlying actions, we take the training set samples and jointly embed and cluster them in some latent space. Each sample is then represented by its assignment probability to each of the underlying clusters. The objective is selected to provide distinct latent clusters, over which actions reside.
We adapt the notion of Deep Embedded Clustering [xie2016unsupervised] for clustering temporal graphs with our ST-GCAE architecture. The proposed clustering model consists of three parts, an encoder, a decoder, and a soft clustering layer.
Specifically, our ST-GCAE model maintains the graph’s structure but uses large temporal strides with an increasing channel number to compress an input sequence to a latent vector. The decoder uses temporal up-sampling layers and additional graph convolutional blocks, for gradually restoring original channel count and temporal dimension.
The ST-GCAE’s embedding is the starting point for clustering the data. The initial reconstruction based embedding is fine-tuned during our clustering optimization stage to reach the final clustering optimized embedding.
For each input sample x_i, we denote the encoder's latent embedding by z_i and the soft cluster assignment calculated using the clustering layer by q_i. We denote the clustering layer's parameters (the cluster centroids) by {μ_k}. The probability for the i-th sample to be assigned to the k-th cluster is:

q_ik = (1 + ‖z_i − μ_k‖²)⁻¹ / Σ_k' (1 + ‖z_i − μ_k'‖²)⁻¹
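Assuming the DEC-style Student's t kernel of the cited work [xie2016unsupervised], the soft assignment can be computed as:

```python
import numpy as np

def soft_assign(Z, centroids):
    """DEC-style soft assignment: a Student's t kernel between
    embeddings Z (n, d) and cluster centroids (k, d). Each output
    row is a soft-assignment probability vector."""
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (n, k)
    q = 1.0 / (1.0 + d2)                 # unnormalized kernel values
    return q / q.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
Z = rng.normal(size=(5, 8))              # toy latent embeddings
mu = rng.normal(size=(3, 8))             # toy centroids
Q = soft_assign(Z, mu)                   # (5, 3) soft-assignment vectors
```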
We adopt the clustering objective and optimization algorithm proposed by [xie2016unsupervised]. The clustering objective is to minimize the KL divergence between the model's current probabilistic clustering prediction Q and a target distribution P:

L_cluster = KL(P ‖ Q) = Σ_i Σ_k p_ik log(p_ik / q_ik)
The target distribution aims to strengthen current cluster assignments by normalizing and pushing each value closer to either 0 or 1. Recurrent application of the function transforming Q into P would eventually result in a hard assignment vector. Each member of the target distribution is calculated as:

p_ik = (q_ik² / f_k) / Σ_k' (q_ik'² / f_k'),  where f_k = Σ_i q_ik is the soft cluster frequency.
The clustering layer is initialized with the K-means centroids calculated for the encoded training set. Optimization is done in an Expectation-Maximization (EM) like fashion. During the Expectation step, the entire model is fixed and the target distribution P is updated. During the Maximization step, the model is optimized to minimize the clustering loss, L_cluster.
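The E-step (target computation) and the clustering loss can be sketched following the DEC formulation the paper adopts; the toy assignment matrix is illustrative:

```python
import numpy as np

def target_distribution(Q):
    """DEC target: square assignments to sharpen them, normalize by
    the soft cluster frequency f_k = sum_i q_ik, then re-normalize
    each row to a probability vector."""
    weight = Q ** 2 / Q.sum(axis=0, keepdims=True)
    return weight / weight.sum(axis=1, keepdims=True)

def clustering_loss(P, Q):
    """KL(P || Q), summed over samples and clusters."""
    return float((P * np.log(P / Q)).sum())

Q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.4, 0.3, 0.3]])
P = target_distribution(Q)   # E-step: recompute target from current Q
loss = clustering_loss(P, Q)  # M-step minimizes this w.r.t. the model
```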
This model supports two types of multimodal distributions. One is at the cluster assignment level; the other is at the soft-assignment vector level. For example, an action may be assigned to more than one cluster (cluster-level assignment), leading to a multimodal soft-assignment vector. The soft-assignment vectors themselves (that capture actions) can be modeled by a multimodal distribution as well.
The Dirichlet process mixture model (DPMM) is a useful measure for evaluating the distribution of proportional data. It meets our required setup: (i) An estimation (fitting) phase, during which a set of distribution parameters is evaluated, and (ii) An inference stage, providing a score for each embedded sample using the fitted model. A thorough overview of the model is given by Blei and Jordan [blei2006variational].
The DPMM is a common mixture extension to the unimodal Dirichlet distribution and uses the Dirichlet Process, an infinite-dimensional extension of the Dirichlet distribution. This model is multimodal and able to capture each mode as a mixture component. A fitted model has several modes, each representing a set of proportions that correspond to one normal behavior. At test time, each sample is scored by its log probability using the fitted model. Further explanations and discussion on the use of DPMM are available in [blei2006variational, dinari2019distributed].
The training phase of the model consists of two stages, a pre-training stage for the autoencoder, in which the clustering branch of the network remains unchanged, and a fine-tuning stage in which both embedding and clustering are optimized. In detail:
Pre-Training: the model learns to encode and reconstruct a sequence by minimizing a reconstruction loss, denoted L_rec, which is an L2 loss between the original temporal pose graphs and those reconstructed by the ST-GCAE.
Fine-Tuning: the model optimizes a combined loss function consisting of both the reconstruction loss and a clustering loss. Optimization is done such that the clustering layer is optimized w.r.t. L_cluster, the decoder is optimized w.r.t. L_rec, and the encoder is optimized w.r.t. both. The initialization of the clustering layer is done via K-means. As shown by [Dizaji_2017], while the encoder is optimized w.r.t. both losses, the decoder is kept and acts as a regularizer for maintaining the embedding quality of the encoder. The combined loss for this stage is:

L = L_rec + λ · L_cluster
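The combined objective can be sketched as follows; the weighting λ and the toy values are illustrative hyperparameters, not values from the paper:

```python
import numpy as np

def combined_loss(x, x_rec, P, Q, lam=0.5):
    """Fine-tuning objective: L = L_rec + lambda * L_cluster, with an
    L2 reconstruction term and a KL clustering term. `lam` is an
    illustrative weighting."""
    l_rec = float(((x - x_rec) ** 2).mean())        # L2 reconstruction loss
    l_cluster = float((P * np.log(P / Q)).sum())    # KL(P || Q)
    return l_rec + lam * l_cluster

x = np.ones((2, 4))
x_rec = np.full((2, 4), 0.9)                        # imperfect reconstruction
P = np.array([[0.9, 0.1], [0.2, 0.8]])              # target distribution
Q = np.array([[0.7, 0.3], [0.4, 0.6]])              # current soft assignments
loss = combined_loss(x, x_rec, P, Q)
```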
We evaluated our model in two different settings, using three datasets. The first setting is the common video anomaly detection setting, which we denote as the Fine-grained setting. In this setting, the normal sample consists of a single class and we seek to find fine-grained variations compared to it. For this setting, we use the ShanghaiTech Campus dataset. The second is our new problem setting, which we denote Coarse-grained anomaly detection, in which we seek to find abnormal actions that are different from those defined as normal.
The ShanghaiTech Campus dataset [luo2017revisit] is one of the largest and most diverse datasets available for video anomaly detection. Presenting mostly person-based anomalies, it contains 130 abnormal events captured in 13 different scenes with complex lighting conditions and camera angles. Clips contain any number of people, from no people at all to over 20 people. The dataset contains over 300 untrimmed training and 100 untrimmed testing clips ranging from 15 seconds to over a minute long.
An experiment is comprised of two data splits, a training split containing normal examples only and a test split containing both normal and abnormal examples. Training is conducted solely using the training split. A score is calculated for each frame individually, and the combined score is the area under ROC curve for the concatenation of all frame scores in the test set.
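The frame-level AUC of this protocol can be computed with the rank-based (Mann-Whitney) formulation; the scores and labels below are toy values:

```python
def frame_auc(scores, labels):
    """Area under the ROC curve for per-frame anomaly scores, via the
    Mann-Whitney formulation. labels: 1 for abnormal frames, 0 for
    normal; higher score = more abnormal. Ties count as half a win."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Concatenated frame scores from all test clips, as in the protocol.
auc = frame_auc([0.9, 0.8, 0.4, 0.3, 0.2], [1, 1, 0, 1, 0])
```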
We evaluate video streams of unknown length using a sliding-window approach. We split the input pose sequence to fixed-length, overlapping segments and score each individually. For clips with more than a single person, each person is scored individually. The maximal score over all the people in the frame is taken. As the ShanghaiTech Campus dataset is not annotated for pose, we use a 2D pose estimation model to extract human pose from every clip.
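A sketch of the sliding-window scoring with a per-frame maximum over people; the window length, stride, and score-combination rule here are assumptions consistent with the description above:

```python
import numpy as np

def clip_frame_scores(person_scores, num_frames, win, stride):
    """Combine per-window, per-person anomaly scores into per-frame
    scores: each window's score is spread over the frames it covers,
    and the max over people (and covering windows) is taken.

    person_scores: one 1-D array of window scores per person.
    """
    frame = np.full(num_frames, -np.inf)
    for windows in person_scores:
        for w, s in enumerate(windows):
            start = w * stride
            end = min(start + win, num_frames)
            frame[start:end] = np.maximum(frame[start:end], s)
    return frame

# Two people in an 8-frame clip; windows of 4 frames with stride 2.
scores = clip_frame_scores([np.array([0.1, 0.9, 0.2]),
                            np.array([0.3, 0.2, 0.4])], 8, win=4, stride=2)
```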
We also evaluate our model using patch embeddings as input features instead of keypoint coordinates. Fixed-size RGB patches are cropped around each keypoint. The patches are embedded using a CNN, and the resulting patch feature vectors are used to embed each keypoint. All other aspects of the model are kept the same.
Given the use of a pose estimation model, the patch embedding may be taken from one of the pose estimation model’s hidden layers, requiring no additional computation compared to the coordinate-based variant, other than increased dimension for the input layer. Further details regarding this variant of our model, implementation, and the pose estimation method used are available in the supplemental material.
We follow the evaluation protocol of Luo et al. [luo2017revisit] and report the Area under ROC Curve (AUC) for our model in Table 1. ’Pose’ denotes the use of keypoint coordinates as the initial graph node embedding. ’Patch’ denotes the use of patch embeddings vectors, as discussed in this section. Our model outperforms previous state of the art methods, both pose and pixel based, by a large margin.
For our second setting of Coarse-Grained Anomaly Detection, a model is trained using a sample of a few action classes considered normal. Training is done without labels, in an unsupervised manner. The model is evaluated by its ability to tell whether a new unseen clip belongs to any of the actions that make up the normal sample. For this setting, we adapt two action recognition datasets to our needs. This gives us great flexibility and control over the type of normal/abnormal actions we want to detect. The datasets are NTU-RGB+D and Kinetics-250, both provided with clip-level action labels.
| Method | AUC |
|---|---|
| Luo et al. [luo2017revisit] | 0.680 |
| Abati et al. [Abati_2019_CVPR] | 0.725 |
| Liu et al. [Liu_2018] | 0.728 |
| Morais et al. [Morais_2019_CVPR] | 0.734 |
| Ours - Pose | 0.752 |
| Ours - Patches | 0.761 |
In this setting, we first select 3-5 action classes and denote them our split. Classes are grouped into two sets of samples, split samples, and non-split samples. All labels are dropped. No labels are used beyond this point, except for the final evaluation phase.
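The split construction can be sketched as follows; the class count and seed are illustrative:

```python
import random

def make_split(all_classes, k_min=3, k_max=5, seed=0):
    """Random split: pick 3-5 classes as the 'normal' split; the rest
    are non-split (abnormal at test time). Labels are dropped after
    this partitioning and used only for the final evaluation."""
    rng = random.Random(seed)
    k = rng.randint(k_min, k_max)
    split = set(rng.sample(sorted(all_classes), k))
    non_split = set(all_classes) - split
    return split, non_split

# e.g. NTU-RGB+D's 60 action classes.
split, non_split = make_split(range(60))
```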
We conduct two complementary experiments. Few vs. Many where there are few normal actions (say 3-5) in the training set and many (tens or even hundreds) actions that are denoted abnormal in the test set. We then repeat the experiment but switch roles of the train and test sets and denote this as Many vs. Few.
We repeat the above experiments for two types of splits. The first kind, termed random splits, is made of sets of 3-5 classes selected at random from each dataset. The second, which we call meaningful splits, is made of action splits that are subjectively grouped following some binding logic regarding the action’s physical or environmental properties. A sample of meaningful and random splits is provided in Table 3. We use 10 random and 10 meaningful splits for evaluating each dataset.
(Table 2, partial; the left four result columns are NTU-RGB+D, the right four are Kinetics-250.)

| Method | Few vs. Many | Many vs. Few | Few vs. Many | Many vs. Few | Few vs. Many | Many vs. Few | Few vs. Many | Many vs. Few |
|---|---|---|---|---|---|---|---|---|
| Liu et al. [Liu_2018] | 0.57 | 0.64 | 0.56 | 0.63 | 0.55 | 0.60 | 0.55 | 0.58 |
| Morais et al. [Morais_2019_CVPR] | - | - | - | - | 0.57 | 0.59 | 0.56 | 0.58 |
We compare our algorithm to several anomaly detection algorithms. All algorithms but the last one are unsupervised:
We use the reconstruction loss of our ST-GCAE model. In all experiments, the ST-GCAE reached convergence prior to the deep clustering fine-tuning stage. Further optimization of the ST-GCAE yielded no consistent improvement in results.
We fit a one-class SVM model using the encoded pose sequence representations (the latent embeddings described in section 3.3). At test time, the corresponding representation of each sample is scored using the fitted model.
We train the Future Frame Prediction model proposed by Liu et al. [Liu_2018] and the Skeleton Trajectory model proposed by Morais et al. [Morais_2019_CVPR] using our various dataset splits. Anomaly scores for each video are obtained by averaging the per-frame scores provided by the model. As the method proposed by Morais et al. only handles 2D pose, it was not applied to the 3D annotated NTU dataset.
The supervised baseline uses a classifier trained to classify each of the classes from the dataset split. The classifier architecture is based on the one proposed by [Yan2018SpatialTG]. To handle the significantly smaller number of samples, we use a shallower variant. For classifier architecture and implementation details, see suppl.
During the evaluation phase, a sample is passed through the classifier and its softmax output values are recorded. Anomaly score in this method is calculated by either using the softmax vector’s max value or by using the Dirichlet normality score from section 3.4, using softmax probabilities as input. We found Dirichlet based scoring to perform better for most cases, and we report results based on it.
It is important to note that this method is fundamentally different from our method and the other baselines. The classifier based method is a supervised method, relying on class action labels that were not used by other methods. It is thus not directly comparable and is here for reference only.
| Split | Classes (class index) |
|---|---|
| Random 1 | Arm wrestling (6), Crawling baby (77), Presenting weather forecast (254), Surfing crowd (336) |
| Dancing | Belly dancing (18), Capoeira (43), Line dancing (75), Salsa (283), Tango (348), Zumba (399) |
| Gym | Lunge (183), Pull Ups (255), Push Up (260), Situp (305), Squat (330) |
| Office | Answer phone (28), Play with phone/tablet (29), Typing on a keyboard (30), Read watch (33) |
| Fighting | Punching (50), Kicking (51), Pushing (52), Patting on back (53) |
The NTU-RGB+D dataset by Shahroudy et al. [Shahroudy_2016_CVPR] consists of clips showing one or two people performing one of 60 action classes. Classes include both actions of a single person and two-person interactions, captured using static cameras. It is provided with 3D joint measurements that are estimated using a Kinect depth sensor.
For this dataset, we use a model configuration similar to the one used for the ShanghaiTech experiments, with dimensions adapted for 3D pose.
The Kinetics dataset by Kay et al. [kay2017kinetics] is a collection of 400 action classes, each with over 400 clips that are 10 seconds long. The clips were downloaded from YouTube and may contain any number of people that are not guaranteed to be fully visible.
Since Kinetics was not intended originally for pose estimation, some classes are unidentifiable by human pose extraction methods, e.g., the hair braiding class contains mostly clips focused on arms and heads. For such videos, a full-body pose estimation algorithm will yield zero keypoints for most cases.
Therefore, we use a subset of Kinetics-400 that is suitable for evaluation using pose sequences. To do that, we turn to the action classification results of [Yan2018SpatialTG]. Using their publicly available model we pick a subset of the 250 best-performing action classes, ranked by their top-1 training classification accuracy. The accuracy of the class that had the lowest score is . We denote our subset Kinetics-250.
Due to the vast size of Kinetics (1000x larger than ShanghaiTech), we used a single GCN for the spatial convolution, using static adjacency matrices only, and no pooling. This makes this block identical to the one proposed by [Yan2018SpatialTG], used for this specific setting only. We quantify the degradation of this variant in the suppl. Kinetics is not annotated for pose and we use a 2D pose estimation model.
We report Area under ROC Curve (AUC) results in Table 2. As these datasets require clip level annotations, the sliding window approach is not required for our method, and each temporal pose graph is evaluated in a single forward pass, with the highest scoring person taken.
As can be seen, our algorithm outperforms all four competing (unsupervised) methods, often by a large margin. The algorithm works well in both random and meaningful split modes, as well as in the Few vs. many and Many vs. few settings. Observe, however, that the algorithm works better on the meaningful splits (compared to the random splits). We believe this is because meaningful splits share similar patterns.
The table also reveals the impact of the quality of pose estimation on results. That is, the NTU-RGB+D dataset is cleaner and the human pose is recovered using the Kinect depth sensor. As a result, the estimated poses are more accurate and the results are generally better than the Kinetics-250 dataset.
Figure 4 shows some failure cases. The recovered pose graph is superimposed on the image. As can be seen, there is significant variability in scenes, viewpoints, and poses of the people in a single clip. In column (a), a highly crowded scene causes numerous occlusions and partially detected people. The large number of partially extracted people causes large variation in the scores the model provides, and the abnormal skater is missed for multiple frames.
The two failures depicted in columns (b-c) show the weakness of relying on extracted pose to represent actions in a clip. In column (b), a cyclist is only partially extracted by the pose estimation method and is missed by the model. Column (c) shows an event not related to people, which our model does not handle: a vehicle crossing the frame.
We conduct a number of experiments to evaluate the robustness of our model to noisy normal training sets, i.e., training sets in which some percentage of the actions are abnormal; these are presented next. Additional experiments, evaluating the importance of key model components and the stages of our clustering approach, are presented in the suppl.
In many scenarios, it is impossible to guarantee that a dataset contains nothing but normal samples, so some robustness to noise is required. To evaluate the model's robustness to abnormal examples in the normal training set, we introduce a varying number of abnormal samples, chosen at random from the unused abnormal portion of the dataset, into the training set. Results are presented in Figure 5. Our model is robust and handles large amounts of abnormal data during training with little performance loss.
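The contamination procedure can be sketched as follows; all names are hypothetical and the samples stand in for pose sequences:

```python
# Illustrative sketch of the robustness experiment: mix a given fraction
# of abnormal samples, drawn at random from the held-out abnormal pool,
# into the normal training set.
import random

def contaminate(normal_set, abnormal_pool, noise_frac, seed=0):
    """Return the normal set plus noise_frac * len(normal_set) abnormal samples."""
    rng = random.Random(seed)
    n_noise = int(round(noise_frac * len(normal_set)))
    return normal_set + rng.sample(abnormal_pool, n_noise)

# Toy example: 100 normal samples contaminated at a 5% rate.
train = contaminate(list(range(100)), list(range(100, 200)), 0.05)
print(len(train))  # 105
```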
For most anomaly detection settings, events occurring at a rate of are considered very frequent. Our model loses, on average, less than of its performance when trained with this amount of distraction. When trained with abnormal noise, there is a considerable decline in performance; in this setting, the training set usually consists of 5 classes, so the distraction rate may be larger than that of an individual underlying class.
We propose an anomaly detection algorithm that relies on estimated human poses. The human poses are represented as temporal pose graphs and we jointly embed and cluster them in a latent space. As a result, each action is represented as a soft-assignment vector in latent space. We analyze the distribution of these vectors using the Dirichlet Process Mixture Model. The normality score provided by the model is used to determine if the action is normal or not.
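The scoring stage above can be sketched as follows. The paper fits a Dirichlet-process-based mixture suited to proportional data; here scikit-learn's Dirichlet-process Gaussian mixture serves as an approximate stand-in, and all data is synthetic:

```python
# Hedged sketch: fit a Dirichlet-process mixture to the soft-assignment
# vectors of normal actions, then score test actions by log-likelihood
# (higher = more normal). sklearn's Gaussian DPMM is a stand-in for the
# proportional-data mixture used in the paper.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Toy soft-assignment vectors over 3 clusters (rows sum to 1); normal
# actions are peaked on cluster 0.
normal = rng.dirichlet([8.0, 1.0, 1.0], size=200)
test = np.vstack([rng.dirichlet([8.0, 1.0, 1.0]),    # normal-looking
                  rng.dirichlet([1.0, 1.0, 8.0])])   # abnormal-looking

dpmm = BayesianGaussianMixture(
    n_components=10, weight_concentration_prior_type="dirichlet_process",
    random_state=0).fit(normal)
scores = dpmm.score_samples(test)   # log-likelihood normality scores
```

With these synthetic vectors, the normal-looking sample receives a higher normality score than the abnormal-looking one, mirroring how the fitted model separates the two.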
The proposed algorithm works on both fine-grained anomaly detection, where the goal is to detect variations of a single action (e.g., walking), as well as a new coarse-grained anomaly detection setting, where the goal is to distinguish between normal and abnormal actions.
Extensive experiments show that we achieve state-of-the-art results on ShanghaiTech, one of the leading (fine-grained) anomaly detection data sets. We also outperform existing unsupervised methods on our new coarse-grained anomaly detection test.
The supplementary material provides additional ablation experiments, details regarding experiment splits and results, information on the proposed spatial attention graph convolution, and implementation details for our method and the baseline methods.
In Section D we provide further details regarding the proposed spatial attention graph convolution operator. Section E provides implementation details for our method, and Section F describes the implementations of the baseline methods used.
In this section we provide further ablation experiments used to evaluate different model components:
In Table 4 we evaluate the contribution of two key components of our configuration. First, we consider the input representation for nodes, comparing the Pose and Patch keypoint representations.
In the Pose representation, each graph node is represented by its coordinate values provided by the pose estimation model. In the Patch representation, we use features extracted by a CNN from a patch surrounding each keypoint.
Then, we evaluate the spatial graph operator used. We denote our spatial attention graph convolution by SAGC, and the single-adjacency variant by GCN. It is evident that both the use of patches and the spatial attention graph convolution play a key role in our results.
We conducted a number of ablation tests on one of the splits to measure the importance of the number of clusters, the clustering-initialization method, the proposed normality score, and the fine-tuning training stage. Results are summarized in Table 5.
The different columns correspond to different numbers of clusters. As can be seen, best results are usually achieved for the number of clusters we use throughout all our experiments in the coarse setting. Each pair of rows corresponds to the two normality scores we evaluate: "Dir." stands for the Dirichlet-based normality score, while "Max" simply takes the maximum value of the softmax layer, i.e., of the soft-assignment vector. Our proposed normality score performs consistently better (with one exception).
The first four rows of the table evaluate the importance of initializing the clustering layer: rows 3-4 show the improvement gained by using k-means for initialization compared to the random initialization used in rows 1-2.
Next, we evaluate the importance of the fine-tuning stage. Models that were fine-tuned are denoted by DEC in the table; models in which the fine-tuning stage was skipped are denoted by No DEC. Rows 3-4 show results without the fine-tuning stage, while rows 5-6 show results with it. As can be seen, results improve considerably (with one exception).
| Configuration | | | |
|---|---|---|---|
| Random init, DEC, Max | 0.45 | 0.42 | 0.44 |
| Random init, DEC, Dir. | 0.48 | 0.52 | 0.49 |
| K-means init, No DEC, Max | 0.57 | 0.51 | 0.48 |
| K-means init, No DEC, Dir. | 0.51 | 0.59 | 0.57 |
| K-means init, DEC, Max | 0.58 | 0.71 | 0.72 |
| K-means init, DEC, Dir. | 0.68 | 0.82 | 0.74 |
It is instructive to look at the clusters of the different data sets (Figure 6). The top row shows some cluster centers in the fine-grained setting, and the bottom row shows some cluster centers in the coarse-grained setting. As can be seen, the variation in the fine-grained setting is mainly due to viewpoint, because most of the actions are variations of walking. On the other hand, the variability of the coarse-grained data set demonstrates the large variation in the actions handled by our algorithm.
In this setting, actions close to the different cluster centroids depict common variations of the single action taken to be normal, in this case walking directions. The dictionary action words depict clear, unoccluded, full-body samples of normal actions.
Frames selected from clips corresponding to base words extracted from a model trained on the Kinetics-250 dataset, split Random 6. Here, actions close to the centroids depict an essential manifestation of the underlying action classes. Several clusters in this case depict the underlying actions used to construct the split: image (d) shows a sample from the 'presenting weather' class; facing the camera and pointing at a screen with the left arm while keeping the right one mostly static is highly representative of presenting weather. Image (e) depicts the common pose of the 'arm wrestling' class, and image (f) does the same for the 'crawling' class.
We now present in detail several components of our spatial attention graph convolution layer. It is important to note that each kind of adjacency is applied independently, with its own convolutional weights. After concatenating the outputs of all GCNs, the dimension is reduced using a learnable convolution operator.
In this section we refer to the number of samples, the number of graph nodes, and the number of channels. During the spatial processing phase, the pose from each frame is processed independently of temporal relations.
We use three GCN operators, each corresponding to a different kind of adjacency matrix. Following each GCN we apply batch normalization and a ReLU activation. If a single adjacency matrix is provided, as in the static and globally-learnable cases, it is applied equally to all inputs. In the inferred case, each sample is processed with its own adjacency matrix.
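A minimal PyTorch sketch of this structure, with assumed shapes (N samples, C channels, T frames, V joints) and illustrative names; for simplicity each branch here receives a shared adjacency matrix:

```python
# Sketch of the three-branch spatial convolution: one GCN per adjacency
# type, each followed by BatchNorm + ReLU; the branch outputs are
# concatenated and reduced back to the target channel count with a
# learnable 1x1 convolution. Shapes and names are assumptions.
import torch
import torch.nn as nn

class GCNBranch(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.w = nn.Conv2d(c_in, c_out, kernel_size=1)  # per-node linear map
        self.bn = nn.BatchNorm2d(c_out)

    def forward(self, x, adj):          # x: (N, C, T, V), adj: (V, V)
        x = self.w(x)
        x = torch.einsum("nctv,vw->nctw", x, adj)  # graph aggregation
        return torch.relu(self.bn(x))

class SAGCSketch(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.branches = nn.ModuleList(GCNBranch(c_in, c_out) for _ in range(3))
        self.reduce = nn.Conv2d(3 * c_out, c_out, kernel_size=1)

    def forward(self, x, adjs):         # adjs: [static, global, inferred]
        outs = [b(x, a) for b, a in zip(self.branches, adjs)]
        return self.reduce(torch.cat(outs, dim=1))
```

In the actual layer the inferred adjacency would be per-sample; this sketch keeps all three shared to stay short.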
Generally, the attention mechanism is modular and can be replaced by any graph attention model with the same input and output dimensions. Several alternatives exist ([vaswani2017attention, velickovic2018graph]), but they come at significant computational cost. We chose a simpler mechanism, inspired by [Luong_2015, 2sagcn2019cvpr]: each sample's node feature matrix is multiplied by two separate attention weight matrices; one result is transposed, the dot product of the two is taken, and the result is normalized. We found this simple mechanism to be both useful and powerful.
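The described attention can be sketched as follows. Shapes and names are assumptions (X is an (N, V, C) node-feature tensor, d an embedding size), and softmax is assumed as the normalization:

```python
# Hedged sketch of the simple dot-product attention: two learned
# projections of the node features are multiplied (one transposed) to
# produce node-to-node affinities, normalized into an inferred adjacency.
import torch

def inferred_adjacency(x, w1, w2):
    """x: (N, V, C); w1, w2: (C, d). Returns (N, V, V) attention adjacency."""
    a = x @ w1                       # (N, V, d)
    b = x @ w2                       # (N, V, d)
    scores = a @ b.transpose(1, 2)   # (N, V, V) dot-product affinities
    return torch.softmax(scores, dim=-1)  # rows sum to 1

# Toy example: 4 samples, 18 pose keypoints, 32-channel features.
x = torch.randn(4, 18, 32)
w1, w2 = torch.randn(32, 8), torch.randn(32, 8)
adj = inferred_adjacency(x, w1, w2)
```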
For extracting pose graphs from the ShanghaiTech dataset we used Alphapose [fang2017rmpe]; pose tracking is done using Poseflow [xiu2018poseflow]. Each keypoint is provided with a confidence value. For Kinetics-250 we use the publicly available keypoints (https://github.com/open-mmlab/mmskeleton) extracted using Openpose [Cao_2017]. Both datasets use 2D keypoints with confidence values.
The NTU-RGB+D dataset is provided with 3D keypoint annotations, acquired using a Kinect sensor. For 3D annotations, there are 25 keypoints for each person.
The ShanghaiTech model variant using patch features as input network embeddings works as follows: first, a pose graph is extracted; then, a patch is cropped around each keypoint in the corresponding frame. Since pose estimation models rely on object detectors (Alphapose uses Faster R-CNN [ren2015faster]), intermediate features from the detector may be used at no added computational cost. For simplicity, we embedded each patch using a publicly available ResNet model (https://github.com/akamaster/pytorch_resnet_cifar10). The features used as input are the output of the global average pooling layer. Other than the input layer's shape, no changes were made to the network.
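The patch-cropping step can be sketched as follows; the patch size and zero-padding policy are assumptions for illustration, and the CNN embedding step is omitted:

```python
# Illustrative sketch of the Patch representation's cropping step:
# extract a fixed-size patch around each keypoint, zero-padded at
# image borders. The patches would then be embedded by a CNN.
import numpy as np

def crop_patches(frame, keypoints, size=16):
    """frame: (H, W, 3) image; keypoints: (V, 2) integer (x, y) coords.
    Returns (V, size, size, 3) patches, zero-padded at image borders."""
    h, w = frame.shape[:2]
    half = size // 2
    patches = np.zeros((len(keypoints), size, size, 3), frame.dtype)
    for i, (x, y) in enumerate(keypoints):
        y0, y1 = max(0, y - half), min(h, y + half)
        x0, x1 = max(0, x - half), min(w, x + half)
        patches[i, (y0 - y + half):(y1 - y + half),
                   (x0 - x + half):(x1 - x + half)] = frame[y0:y1, x0:x1]
    return patches
```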
A symmetric structure was used for ST-GCAE. Temporal downsampling by factors of 2 and 3 was applied in the second and fourth blocks; the decoder is symmetrically reversed. We use a fixed number of clusters for NTU-RGB+D and Kinetics-250 and a different one for ShanghaiTech. During the training stage, samples were augmented using random rotations and flips. During evaluation we average each sample's results over its augmented variants. Pre- and post-processing practices were applied equally to our method and all baseline methods.
Each model begins with a pre-training stage in which the clustering loss is not used. A fine-tuning stage of roughly equal length follows, during which the model is optimized using the combined loss, with the same clustering-loss coefficient for all experiments. The Adam optimizer [kingma_adam] is used.
The evaluation of the future frame prediction method by Liu et al. [Liu_2018] was conducted using their publicly available implementation (https://github.com/stevenliuwen/ano_pred_cvpr2018/). Similarly, the evaluation of the trajectory-based anomaly detection method by Morais et al. [Morais_2019_CVPR] was conducted using their publicly available implementation (https://github.com/RomeroBarata/skeleton_based_anomaly_detection). Training was done using the authors' default parameters; changes were only made to adapt the models' data-loading code to our datasets.
The classifier-based supervised baseline used for comparison is built on the basic ST-GCN block used for our method. We use a model based on the architecture proposed by Yan et al. [Yan2018SpatialTG], using their implementation (https://github.com/yysijie/st-gcn/). For the Few vs. Many experiments we use 6 ST-GCN blocks: two with 64 output channels, two with 128, and two with 256. This is the smaller of the two models, designed for the smaller amount of data available in the Few vs. Many experiments. For the Many vs. Few experiments we use 9 ST-GCN blocks: three with 64 output channels, three with 128, and three with 256. Both architectures use residual connections in each block and a temporal kernel size of 9. In both, the last layers with 64 and 128 channels perform temporal downsampling by a factor of 2. Training was done using the Adam optimizer.
The method provides a probability vector of per-class assignments. This vector is used as input to the same Dirichlet-based normality scoring method used by our model. The scoring function's parameters are fitted using the training-set data considered "normal", and at test time each sample is scored using the fitted parameters.
We use “sup.” to denote the supervised, classifier-based baseline in all figures. This method is fundamentally different from all others, and uses the class labels for supervision.
One can observe that, in all settings, our method is the top performer among unsupervised methods in most splits, often by a large margin.
Random splits were used to objectively evaluate the ability of a model to capture a specific subset of unrelated actions. Meaningful splits were chosen subjectively to contain a binding logic regarding the action’s physical or environmental properties, e.g. actions depicting musicians playing or actions one would likely see in a gym.
Figure 7 provides the top-1 training classification accuracy achieved by Yan et al. [Yan2018SpatialTG] for each class in Kinetics-400 in descending order. It is used to show our cutoff point for choosing the Kinetics-250 classes.
|Few vs. Many||Many vs. Few|
|Method||Rec. Loss||OC-SVM||FFP [Liu_2018]||Ours||Sup.||Rec. Loss||OC-SVM||FFP [Liu_2018]||Ours||Sup.|
|Few vs. Many||Many vs. Few|
|Method||Rec.||OC-SVM||FFP [Liu_2018]||TBAD [Morais_2019_CVPR]||Ours||Sup.||Rec.||OC-SVM||FFP [Liu_2018]||TBAD [Morais_2019_CVPR]||Ours||Sup.|
|Arms||Pointing to something with finger (31), Salute (38), Put the palms together (39), Cross hands in front (say stop) (40)|
|Brushing||Drink water (1), Brushing teeth (3), Brushing hair (4)|
|Dressing||Wear jacket (14), Take off jacket (15), Wear a shoe (16), Take off a shoe (17)|
|Dropping||Drop (5), Pickup (6), Sitting down (8), Standing up (from sitting position) (9)|
|Glasses||Wear on glasses (18), Take off glasses (19), Put on a hat/cap (20), Take off a hat/cap (21)|
|Handshaking||Hugging other person (55), Giving something to other person (56), Touch other person’s pocket (57), Handshaking (58)|
|Office||Make a phone call/answer phone (28), Playing with phone/tablet (29), Typing on a keyboard (30),|
|Check time (from watch) (33)|
|Fighting||Punching/slapping other person (50), Kicking other person (51), Pushing other person (52), Pat on back of other person (53)|
|Touching||Touch head (headache) (44), Touch chest (stomachache/heart pain) (45), Touch back (backache) (46),|
|Touch neck (neckache) (47)|
|Waving||Clapping (10), Hand waving (23), Pointing to something with finger (31), Salute (38)|
|Random 1||Brushing teeth (3), Pointing to something with finger (31), Nod head/bow (35), Salute (38)|
|Random 2||Walking apart from each other (0), Throw (7), Wear on glasses (18), Hugging other person (55)|
|Random 3||Brushing teeth (3), Tear up paper (13), Wear jacket (14), Staggering (42)|
|Random 4||Eat meal/snack (2), Writing (12), Taking a selfie (32), Falling (43)|
|Random 5||Playing with phone/tablet (29), Check time (from watch) (33), Rub two hands together (34), Pushing other person (52)|
|Random 6||Eat meal/snack (2), Take off glasses (19), Take off a hat/cap (21), Kicking something (24)|
|Random 7||Drop (5), Tear up paper (13), Wear on glasses (18), Put the palms together (39)|
|Random 8||Falling (43), Kicking other person (51), Point finger at the other person (54)|
|Random 9||Wear on glasses (18), Rub two hands together (34), Falling (43), Punching/slapping other person (50)|
|Random 10||Throw (7), Clapping (10), Use a fan (with hand or paper)/feeling warm (49), Giving something to other person (56)|
|Batting||Golf driving (143), Golf putting (144), Hurling (sport) (162), Playing squash or racquetball (246), Playing tennis (247)|
|Cycling||Riding a bike (268), Riding mountain bike (272), Riding unicycle (276), Using segway (376)|
|Dancing||Belly dancing (19), Capoeira (44), Country line dancing (76), Salsa dancing (284), Tango dancing (349), Zumba (400)|
|Gym||Lunge (184), Pull Ups (256), Push Up (261), Situp (306), Squat (331)|
|Jumping||High jump (152), Jumping into pool (173), Long jump (183), Triple jump (368)|
|Lifters||Bench pressing (20), Clean and jerk (60), Deadlifting (89), Front raises (135), Snatch weight lifting (319)|
|Music||Playing accordion (218), Playing cello (224), Playing clarinet (226), Playing drums (231), Playing guitar (233),|
|Playing harp (235)|
|Riding||Lunge (184), Pull Ups (256), Push Up (261), Situp (306), Squat (331)|
|Skiing||Roller skating (281), Skateboarding (307), Skiing slalom (311), Tobogganing (361)|
|Throwing||Hammer throw (149), Javelin throw (167), Passing american football (in game) (209), Shot put (299), Throwing axe (357),|
|Throwing discus (359)|
|Random 1||Climbing tree (69), Juggling fire (171), Marching (193), Shaking head (290), Using segway (376)|
|Random 2||Drop kicking (106), Golf chipping (142), Pole vault (254), Riding scooter (275), Ski jumping (308)|
|Random 3||Bench pressing (20), Hammer throw (149), Playing didgeridoo (230), Sign language interpreting (304),|
|Wrapping present (395)|
|Random 4||Cleaning floor (61), Ice fishing (164), Using segway (376), Waxing chest (388)|
|Random 5||Barbequing (15), Golf chipping (142), Kissing (177), Lunge (184)|
|Random 6||Arm wrestling (7), Crawling baby (78), Presenting weather forecast (255), Surfing crowd (337)|
|Random 7||Bobsledding (29), Canoeing or kayaking (43), Dribbling basketball (100), Playing ice hockey (236)|
|Random 8||Playing basketball (221), Playing tennis (247), Squat (331)|
|Random 9||Golf putting (144), Juggling fire (171), Walking the dog (379)|
|Random 10||Jumping into pool (173), Krumping (180), Presenting weather forecast (255)|
Air drumming (2)
Arm wrestling (7)
Arranging flowers (8)
Assembling computer (9)
Belly dancing (19)
Bench pressing (20)
Bending back (21)
Biking through snow (23)
Blasting sand (24)
Blowing glass (25)
Blowing out candles (28)
Bouncing on trampoline (31)
Braiding hair (33)
Building cabinet (39)
Building shed (40)
Bungee jumping (41)
Canoeing or kayaking (43)
Carrying baby (45)
Catching or throwing softball (51)
Chopping wood (57)
Clean and jerk (60)
Cleaning floor (61)
Climbing a rope (67)
Climbing tree (69)
Contact juggling (70)
Cooking chicken (71)
Country line dancing (76)
Cracking neck (77)
Crawling baby (78)
Curling hair (81)
Dancing ballet (85)
Dancing charleston (86)
Dancing gangnam style (87)
Dancing macarena (88)
Disc golfing (93)
Diving cliff (94)
Doing aerobics (96)
Doing nails (98)
Dribbling basketball (100)
Driving car (104)
Driving tractor (105)
Drop kicking (106)
Dunking basketball (108)
Dying hair (109)
Eating burger (110)
Eating spaghetti (117)
Exercising arm (120)
Extinguishing fire (122)
Feeding birds (124)
Feeding fish (125)
Feeding goats (126)
Filling eyebrows (127)
Finger snapping (128)
Flying kite (131)
Folding clothes (132)
Front raises (135)
Frying vegetables (136)
Giving or receiving award (141)
Golf chipping (142)
Golf driving (143)
Golf putting (144)
Grooming horse (147)
Gymnastics tumbling (148)
Hammer throw (149)
High jump (152)
Hitting baseball (154)
Hockey stop (155)
Hula hooping (160)
Hurling (sport) (162)
Ice climbing (163)
Ice fishing (164)
Ice skating (165)
Javelin throw (167)
Juggling balls (170)
Juggling fire (171)
Juggling soccer ball (172)
Jumping into pool (173)
Jumpstyle dancing (174)
Kicking field goal (175)
Kicking soccer ball (176)
Long jump (183)
Making bed (187)
Making snowman (190)
Massaging back (194)
Milking cow (198)
Mowing lawn (202)
News anchoring (203)
Passing american football (in game) (209)
Passing american football (not in game) (210)
Picking fruit (215)
Playing accordion (218)
Playing badminton (219)
Playing bagpipes (220)
Playing basketball (221)
Playing bass guitar (222)
Playing cello (224)
Playing chess (225)
Playing clarinet (226)
Playing cricket (228)
Playing didgeridoo (230)
Playing drums (231)
Playing flute (232)
Playing guitar (233)
Playing harmonica (234)
Playing harp (235)
Playing ice hockey (236)
Playing kickball (238)
Playing organ (240)
Playing paintball (241)
Playing piano (242)
Playing poker (243)
Playing recorder (244)
Playing saxophone (245)
Playing squash or racquetball (246)
Playing tennis (247)
Playing trombone (248)
Playing trumpet (249)
Playing ukulele (250)
Playing violin (251)
Playing volleyball (252)
Playing xylophone (253)
Pole vault (254)
Presenting weather forecast (255)
Pull ups (256)
Pumping fist (257)
Punching bag (259)
Punching person (boxing) (260)
Push up (261)
Pushing car (262)
Pushing cart (263)
Reading book (265)
Riding a bike (268)
Riding camel (269)
Riding elephant (270)
Riding mechanical bull (271)
Riding mountain bike (272)
Riding or walking with horse (274)
Riding scooter (275)
Riding unicycle (276)
Robot dancing (278)
Rock climbing (279)
Rock scissors paper (280)
Roller skating (281)
Running on treadmill (282)
Salsa dancing (284)
Sanding floor (285)
Scrambling eggs (286)
Scuba diving (287)
Shaking head (290)
Shaving head (293)
Shearing sheep (295)
Shooting basketball (297)
Shot put (299)
Shoveling snow (300)
Shuffling cards (302)
Side kick (303)
Sign language interpreting (304)
Ski jumping (308)
Skiing (not slalom or crosscountry) (309)
Skiing crosscountry (310)
Skiing slalom (311)
Skipping rope (312)
Sled dog racing (316)
Smoking hookah (318)
Snatch weight lifting (319)
Spinning poi (327)
Springboard diving (330)
Stomping grapes (333)
Stretching arm (334)
Stretching leg (335)
Strumming guitar (336)
Surfing crowd (337)
Surfing water (338)
Sweeping floor (339)
Swimming backstroke (340)
Swimming breast stroke (341)
Swimming butterfly stroke (342)
Swinging legs (344)
Tai chi (347)
Tango dancing (349)
Tap dancing (350)
Tapping guitar (351)
Tapping pen (352)
Tasting beer (353)
Throwing axe (357)
Throwing discus (359)
Training dog (364)
Trimming or shaving beard (366)
Triple jump (368)
Tying tie (371)
Using segway (376)
Waiting in line (378)
Walking the dog (379)
Washing feet (381)
Water skiing (384)
Waxing chest (388)
Waxing eyebrows (389)
Wrapping present (395)