1 Introduction
Video Anomaly Detection (VAD) [chalapathy2019deep]
is a challenging yet essential task in computer vision, which aims to recognize frames containing abnormal events in video, such as criminal activities and traffic accidents. However, the fact that abnormal events are far rarer than normal ones makes it nearly impossible to apply conventional discriminative methods to VAD. Thus, self-supervised methods are widely deployed in the field to learn the normal pattern. In this way, detecting anomalies can be viewed as a task of recognizing out-of-distribution samples.
Pixel-based methods [liu2018ano_pred, ionescu2019object, SSMT, gong2019memorizing, zaheer2020old, cai2021appearance, chang2020clustering]
have been extensively studied for VAD. In these methods, intuitive motion representations, such as optical flow and frame gradients, have been exploited to improve model sensitivity towards dynamics. Recently, thanks to the success of pose estimation algorithms, pose, as a cleaner and more well-structured data modality, has gradually attracted the attention of researchers
[2019Multi, 2020Learning, markovitz2020graph]. Since pose is immune to background noise and contains high-level semantics, it is naturally a convenient feature for human-related VAD. Consequently, pose-based methods are viewed as a more promising approach and are expected to outperform their pixel-based counterparts. Nonetheless, even with the aforementioned advantages, the performance gain of pose-based methods in previous works is still limited. Herein, we claim the reason lies in the fact that, unlike image anomaly detection, which relies on a static feature only (e.g., appearance), VAD depends more on dynamic features. For example, as shown in the top half of Figure 1, the man in the red box may be judged as normal if only the right frame is given, while the confidence of this assertion is weakened once the left frame is available, since he is jumping on the crosswalk. Therefore, an effective motion representation is essential to regular video pattern learning in VAD. However, in pose-based methods, motion representations like optical flow are hard to obtain without RGB information. Previous pose-based methods [2020Learning, markovitz2020graph, 2019Multi] usually leverage individual keypoints to encode visual features, which lack an intuitive motion representation. In such a situation, the detection model is overwhelmed by learning both motion and normality simultaneously, which undermines its performance. Hence, it is urgent to find a substitute for pose-based methods.
In this paper, we propose a novel Motion Prior Regularity Learner (MoPRL) to alleviate the aforementioned limitations of pose-based methods. MoPRL is composed of two sub-modules: the Motion Embedder (ME) and the Spatial-Temporal Transformer (STT). Specifically, ME is designed to extract the spatial-temporal representation of input poses from a probabilistic perspective. Inspired by the frame gradients [yu2020cloze] commonly used in pixel-based methods, we model pose motion based on the displacement between the center points of poses in adjacent frames. However, directly applying the displacement as the motion representation is oversimplified. Given the common assumption that anomalies rarely happen, we further transform such movement into the probability domain. In detail, we obtain the motion prior, an explicit distribution of displacements on the training data, via statistics. In this way, as shown in the bottom half of Figure 1, every pose displacement is mapped to a certain probability based on the motion prior to represent the corresponding motion. Equipped with a designed pose masking strategy, STT is then deployed as a task-specific model to learn regular patterns with the input of poses and their motion features from ME. Different from previous RNN-based or CNN-based frameworks, the transformer is adopted for its self-supervised and sequential input structure, which naturally fits pose regularity learning. In summary, the main contributions of this paper are threefold:

The Motion Embedder is proposed to intuitively represent pose motion in the probability domain, which provides an effective pose motion representation for regularity learning.

The Spatial-Temporal Transformer with pose masking and divided attention is further deployed to model regularity in pose trajectories. To the best of our knowledge, this is the first time a transformer has been applied to video anomaly detection.

The proposed MoPRL, which consists of the Motion Embedder and the Spatial-Temporal Transformer, achieves state-of-the-art performance on two of the most challenging datasets. Ablation studies demonstrate the effectiveness of each module, and insights for future work based on failure case analysis are provided.
2 Related Works
2.1 Self-supervised Video Anomaly Detection
In self-supervised video anomaly detection, anomalies are recognized as outliers of the distribution of normality. Pixel-based frameworks handle the problem by reconstruction and prediction. In
[hasan2016learning], Hasan et al. reconstructed normal appearance [dalal2005histograms] and motion [dalal2006human] features to learn video regularity. In [luo2017revisit], the authors leveraged sparse coding to enforce adjacent frames to be encoded with similar reconstruction coefficients. In [liu2018ano_pred], Liu et al. proposed a future frame prediction framework with optical flow [ilg2017flownet] as additional input. In [Nguyen_2019_ICCV], Nguyen et al. proposed a cross-channel translation framework to learn the coherence between motion and appearance. In a recent work [liu2021hybrid], Liu et al. exploited a Conditional Variational Autoencoder to capture the correlation between the frame and the optical flow.
Recently, pose-based methods have become popular because of their efficiency and immunity to background noise. In [2020Learning], the authors proposed a connected RNN for learning pose regularity with decomposed keypoints. In [2019Multi], Rodrigues et al. deployed a multi-timescale prediction framework to model trajectories. Moreover, Markovitz et al. [markovitz2020graph] learned pose graph embeddings with autoencoders and generated soft assignments via clustering. However, it is rather difficult to obtain a sophisticated dynamic feature in pose-based methods due to the lack of RGB information. In this work, we propose a novel Motion Embedder to generate a pose-based motion representation from the probability domain.
2.2 Vision Transformer
The transformer [vaswani2017attention] has gradually become a mainstream framework in the computer vision community [lin2019tsm, wang2016temporal, ren2015faster, su2018cascaded, qing2021temporal] for its tremendous potential in sequence modeling. It has achieved competitive or even superior performance compared with CNN-based methods in image classification [dosovitskiy2020vit, liu2021Swin, pmlrv139touvron21a], object detection [2020End], semantic segmentation [zheng2021rethinking], etc. Dosovitskiy et al. [dosovitskiy2020vit] viewed an image as a patch sequence and constructed ViT to achieve effective recognition. Carion et al. [2020End] regarded object detection as a set prediction task and solved it with a transformer-based encoder-decoder called DETR. Similarly, SETR [zheng2021rethinking] was proposed for image context modeling in semantic segmentation. The transformer has also been deployed to estimate 3D human pose, as in [zheng20213d], where the authors build a divided temporal-spatial transformer to model the pose sequence. HOT-Net [huang2020hot] fully exploits the correlation between joints and object corners to obtain more accurate estimations. In this work, we also utilize a divided transformer, similar to [zheng20213d], to process the scaled pose data obtained from our Motion Embedder.
3 Methods
In this section, we introduce our proposed pose-based video anomaly detection method, the Motion Prior Regularity Learner (MoPRL), which consists of two sub-modules as shown in Figure 2, namely the Motion Embedder (ME) and the Spatial-Temporal Transformer (STT). We first utilize a pose detector to obtain pose trajectories. Unlike pixel-based methods, which adopt the widely-used optical flow as the motion representation, MoPRL models the pose-based motion representation as a probability distribution according to statistical velocity and fuses spatial and temporal representations via the ME. Then, the STT is applied to learn spatial-temporal regularity with a self-supervised reconstruction task. In this way, the model learns the distribution of the normal samples; thus, anomalies can be detected according to the Frame Anomaly Score, which will be discussed in detail.
3.1 Task Definition
Given the training set D_tr = {I_1, …, I_N} and the test set D_te = {(I_1, y_1), …, (I_M, y_M)}, where I denotes a frame and y ∈ {0, 1} indicates the label of normality or anomaly, there are only normal samples in the training set and both normal and abnormal ones in the test set. We denote I = {S_1, …, S_P} and S = {X_1, …, X_T}, which means each frame contains P trajectory sequences of human poses and each sequence consists of T single poses. We denote a pose as X_i = {x_{i,1}, …, x_{i,J}}, where x_{i,j} is the j-th joint in the i-th pose and J represents the maximum number of joints in a single pose. Moreover, each joint is represented as a coordinate (u, v). A sliding window strategy is applied to match a single frame with its corresponding trajectories. The goal of pose-based video anomaly detection is to distinguish the anomalies in the test set according to these human poses.
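The sliding window matching above can be sketched as follows; `size` and `stride` are hypothetical parameters here, since the concrete values belong to the implementation details.

```python
def sliding_windows(seq, size, stride):
    """Split one person's pose sequence into fixed-length trajectory
    windows; each frame is then matched with the windows covering it.
    Illustrative sketch, not the authors' code."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, stride)]
```

For a 10-frame sequence with `size=4, stride=2`, this yields windows starting at frames 0, 2, 4 and 6.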
3.2 Pose Preprocessing
Following the preprocessing operation proposed in [2020Learning], we decompose an original pose into a local normalized pose and a global center point. We first calculate the center point c and the size w of the human box according to the maximum and minimum coordinates of the keypoints, and then normalize the pose into \hat{X} based on the human box size, where \hat{x}_{i,j} denotes a normalized coordinate. The normalized pose unifies the scale across different distances, so even tiny changes of far-away poses can be amplified and captured.
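A minimal sketch of this decomposition (function and variable names are ours, not from the paper's code):

```python
import numpy as np

def normalize_pose(pose):
    """Decompose a pose (J x 2 joint coordinates) into a global center
    point and a locally normalized pose scaled by the human box size."""
    mins, maxs = pose.min(axis=0), pose.max(axis=0)
    center = (mins + maxs) / 2.0          # center of the human box
    size = np.maximum(maxs - mins, 1e-6)  # box width/height, avoid div-by-0
    local = (pose - center) / size        # scale-invariant local pose
    return center, local
```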
3.3 Motion Embedder
Since dense motion features, such as optical flow, cannot be obtained, pose-based methods lack effective motion representations. In this work, we propose a multi-step approach to obtain intuitive pose motion and embed it with the pose via the novel Motion Embedder (ME). We first calculate the normalized displacement between adjacent poses in a sequence and then obtain an explicit discrete distribution describing the displacement statistics of the training dataset. After this, we choose a predefined distribution (e.g., Rayleigh or Gaussian) to fit the discretized distribution and obtain its continuous version, which we refer to as the motion prior. In the end, we leverage both the normalized pose and its motion probability, which represent spatial and temporal information, respectively, to obtain the motion-embedded pose. We introduce ME in detail in the following subsections.
Displacement Calculation. The displacement between adjacent poses can be regarded as an average velocity during a short period. Thus, we utilize the displacement to construct the foundation of our motion representation. Empirically, the displacement is calculated as follows:
(1) v_i = c_{i+1} − c_i,
(2) \hat{v}_i = v_i / w_i,
where v_i represents the pose displacement between adjacent frames, which is also the average velocity from pose X_i to pose X_{i+1}, and c_i and w_i denote the center point and box size of the i-th pose. Similar to what we have done to the pose, we normalize the velocity to obtain \hat{v}_i, which eliminates the influence of perspective. Nonetheless, directly leveraging \hat{v}_i as the motion representation is so oversimplified that the essence of normal motion (normality is common) would be overlooked, and the spatial-temporal feature could not be effectively represented. We resolve this issue from the perspective of probability.
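The displacement and its normalization amount to a finite difference followed by box-size division; a sketch follows, where the scalar speed fed to the prior is assumed to be the L2 norm of the normalized displacement:

```python
import numpy as np

def normalized_speeds(centers, sizes):
    """Per-step average velocity between adjacent poses, normalized by
    box size; returns the scalar speed per step.
    centers, sizes: arrays of shape (T, 2). Illustrative sketch."""
    v = centers[1:] - centers[:-1]            # displacement between frames
    v_hat = v / np.maximum(sizes[:-1], 1e-6)  # perspective-free velocity
    return np.linalg.norm(v_hat, axis=1)
```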
Probabilities as Scaling Factors. We first obtain the statistics of the displacements by counting the modes of their normalized versions in the training set. Intuitively, we utilize a predefined distribution function to fit this discretized data distribution, by means of which we obtain a continuous displacement distribution. We call this continuous distribution the Motion Prior. Based on the fact that different distributions model the low-frequency part in different manners, we should carefully select an appropriate prior that fits the real-world distribution well, which is believed to benefit the quality of the representation derived from the Motion Embedder. As shown in Figure 3, the real distribution of displacements corresponds most closely to the Rayleigh distribution, followed by the Gaussian. The experimental results also demonstrate that the Rayleigh prior performs best. To obtain a versatile representation that contains both temporal and spatial information, we combine the normalized pose, which stands for spatial information, with the motion prior, which is effectively a temporal representation, by employing the probability in the motion prior as a scaling factor:
(3) p_i = f(‖\hat{v}_i‖_2),
where f is the prior selected to fit the discretized explicit distribution of displacement statistics, and the scaling mechanism is as follows:
(4) \tilde{X}_i = \hat{X}_i / (α p_i + β),
where \tilde{X}_i is the pose feature after the scaling operation and represents the motion-embedded pose that fuses the spatial and temporal information of the i-th pose. This is exactly why we call this module the Motion Embedder. It is worth noting that, to avoid numerical errors, we additionally apply an affine transformation (α, β) to the scaling factor, since it is used as a denominator. Consequently, as shown in Figure 4, we obtain a pose with a larger size if its emergence frequency is lower. \tilde{X}_i is then used as the input of the following module.
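Under a Rayleigh motion prior, the scaling step can be sketched as below; `sigma`, `alpha` and `beta` are hypothetical values, and the division form follows the fusion study reported later:

```python
import math

def rayleigh_pdf(x, sigma=1.0):
    """Rayleigh density used as the motion prior (assumed parametrization)."""
    return (x / sigma ** 2) * math.exp(-x ** 2 / (2 * sigma ** 2))

def motion_embed(coord, speed, sigma=1.0, alpha=1.0, beta=1e-3):
    """Scale a normalized joint coordinate by the inverse prior probability
    of its pose's displacement: rarer motion -> larger pose. alpha/beta
    form the affine guard against a near-zero denominator."""
    p = rayleigh_pdf(speed, sigma)
    return coord / (alpha * p + beta)
```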
3.4 Spatial-Temporal Transformer
To learn the regularity of human pose trajectories, we propose to utilize a transformer-based module to process the aforementioned motion embedding, because of its acknowledged advantage in modeling sequential data. However, the orthodox transformer results in a computational complexity of O((JT)^2) (where J is the number of joints in a single pose and T the number of poses in a single trajectory), which grows rapidly with increasing J and T. Thus, inspired by [gberta_2021_ICML, zheng20213d], we divide the attention mechanism into spatial and temporal parts to decrease the computational complexity to O(J^2 T + T^2 J). We call this variant of the transformer the Spatial-Temporal Transformer (STT). Specifically, STT contains an N_s-layer spatial transformer and an N_t-layer temporal transformer. Aiming to fully exploit the potential of STT, we view N_s and N_t as hyperparameters and identify their values experimentally, as shown in the experiment section. In the following subsections, we introduce the structure of STT.
Masked Pose Embedding. Before the transformer blocks, we first obtain the embedding of the joints. Specifically, for joint \hat{x}_{i,j}, we follow the mask operation proposed in [devlin2018bert] to obtain its masked version, and then map it into the embedding space to obtain the joint vector e_{i,j} ∈ R^C, where C is the embedding dimension, as in the following equation:
(5) e_{i,j} = M(\hat{x}_{i,j}) W_e + p^s_j,
where M(·) is the mask function operated on \hat{x}_{i,j} with a certain probability and W_e is the learnable embedding matrix. Moreover, p^s_j represents a learnable spatial position embedding (SPE) deployed to encode the spatial position of the j-th joint in a single pose. Notably, the mask operation only works during training. After this, we obtain the embedding of the i-th pose as E_i = {e_{i,1}, …, e_{i,J}} ∈ R^{J×C}. As a result, the embedding matrix for a whole trajectory is Z ∈ R^{T×J×C}.
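A sketch of the masked embedding step; here a masked joint is simply zeroed before projection, though a learned mask token would work equally well (all names are illustrative):

```python
import numpy as np

def embed_pose(local_pose, W_e, spe, mask_ratio=0.1, training=True, seed=0):
    """Project each joint into the embedding space with BERT-style masking.
    local_pose: (J, 2); W_e: (2, C) learnable matrix; spe: (J, C) spatial
    position embedding. Masking is applied only during training."""
    rng = np.random.default_rng(seed)
    x = local_pose.copy()
    if training:
        mask = rng.random(len(x)) < mask_ratio  # mask operation M(.)
        x[mask] = 0.0
    return x @ W_e + spe                        # (J, C) joint embeddings
```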
Spatial Transformer. We model the trajectory in the spatial domain with an N_s-layer Spatial Transformer. Note that we conduct self-attention along the dimension of the joint number, i.e., J. Without loss of generality, we denote the input trajectory of the l-th layer as Z^{l−1}, where Z^0 = Z. The multi-layer attention operation is given by:
(6) Q = Z^{l−1}_{LN} W_Q,  K = Z^{l−1}_{LN} W_K,  V = Z^{l−1}_{LN} W_V,
(7) A = SM(Q K^T / √C) V + Z^{l−1},
(8) Z^l = FC(A_{LN}) + A,
where Q, K and V are the query, key and value matrices, and W_Q, W_K and W_V the corresponding projection heads. The subscript LN indicates a tensor after layer normalization. SM and FC represent the softmax operation and a fully-connected layer, respectively. In practice, we leverage multi-head self-attention as our attention operation for a stronger representation. As it has become a common structure, we omit its formulation here for simplicity; please refer to [vaswani2017attention] for more details.

Temporal Transformer. We then model pose trajectories in the temporal domain with an N_t-layer Temporal Transformer. Taking the output of the Spatial Transformer as input, we first incorporate temporal position information into each joint embedding as follows:
(9) \tilde{z}_{t,j} = z_{t,j} + p^t_t,
where z_{t,j} represents the j-th joint embedding in the t-th frame and p^t_t a learnable temporal position embedding (TPE). Thus, the trajectory embedding matrix can be obtained accordingly. Following the same steps as Equations 6, 7 and 8 in the aforementioned spatial transformer, we finally obtain the spatial-temporal output Z_out.
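The divided scheme can be sketched as two plain self-attention passes over different axes (projections, residuals and layer norm omitted for brevity; only the shapes and the order of the two passes are faithful to the text):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    """Unprojected single-head attention over the middle axis of (B, L, C)."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def divided_attention(traj):
    """Spatial then temporal attention over a trajectory of shape (T, J, C):
    joints attend within each frame, then frames attend per joint, costing
    O(J^2 T + T^2 J) rather than the joint O((JT)^2)."""
    spatial = self_attention(traj)                         # attend over J
    temporal = self_attention(spatial.transpose(1, 0, 2))  # attend over T
    return temporal.transpose(1, 0, 2)                     # back to (T, J, C)
```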
3.5 Self-supervised Training
In this subsection, we introduce our self-supervised training scheme for learning the regularity of human pose trajectories. Specifically, we achieve this via the commonly used reconstructive method: with the help of a reconstruction head, we are able to learn the distribution of the normal samples.
Reconstruction Head. As shown on the left side of Figure 2, taking the motion-embedded trajectory as input, the Reconstruction Head recovers the normalized trajectory \bar{X} from the output Z_out of the STT:
(10) \bar{X} = FC(Z_out).
Objective Functions. The final objective function can be described as follows:
(11) L = Σ_{t=1}^{T} Σ_{j=1}^{J} w_{t,j} ‖\hat{x}_{t,j} − \bar{x}_{t,j}‖_2,
where w_{t,j} is the confidence score of each pose joint provided by the pose detector, and \bar{x}_{t,j} is the reconstructed joint in \bar{X}. Similar to the operation in [2019Multi], we also normalize the raw joint confidences.
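The objective reduces to a confidence-weighted L2 error over joints; a minimal sketch (the sum is averaged here, which only rescales the objective):

```python
import numpy as np

def recon_loss(pred, target, conf):
    """Confidence-weighted reconstruction objective: joints the detector is
    unsure about contribute less. pred/target: (T, J, 2); conf: (T, J)."""
    err = np.linalg.norm(pred - target, axis=-1)  # per-joint L2 error
    return float((conf * err).mean())
```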
3.6 Inference
Frame Anomaly Score. In this section, we introduce the mechanism by which the proposed method detects frame-level anomalies with human pose trajectories. Firstly, an anomaly score s_{i,t}, where i and t index the i-th trajectory in the t-th frame, is obtained from each pose trajectory via MoPRL. The anomaly score is the norm of the difference between \hat{X}_{i,t} and \bar{X}_{i,t}, given by:
(12) s_{i,t} = ‖\hat{X}_{i,t} − \bar{X}_{i,t}‖_2.
Since each frame may contain multiple trajectories, we select the highest s_{i,t} as the frame-level anomaly score S_t:
(13) S_t = max_i s_{i,t}.
A higher frame anomaly score suggests a higher possibility that the current frame is abnormal.
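The two inference steps together give a simple rule, which can be sketched directly:

```python
import numpy as np

def frame_anomaly_score(recons, targets):
    """Frame-level anomaly score: the L2 reconstruction error of each pose
    trajectory in the frame, reduced by max over trajectories.
    recons/targets: lists of (T, J, 2) arrays, one per person."""
    scores = [float(np.linalg.norm(r - t)) for r, t in zip(recons, targets)]
    return max(scores)
```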
4 Experiments
In this section, we first introduce the experimental details of the proposed MoPRL. Extensive experiments are then conducted on two challenging datasets to evaluate the effectiveness and superiority of MoPRL, with convincing qualitative examples.
Methods | SHT | SHT-HR | Corridor
MPED-RNN [2020Learning] | 73.40 | 75.40 | 64.27
GEPC [markovitz2020graph] | 75.50 | \ | \
MTP [2019Multi] | 76.03 | 77.04 | 67.12
Ours | 81.26 | 82.38 | 70.66
Methods | Appearance Representation | Motion Representation | Sequence Modeling | AUC
MPED-RNN [2020Learning] | Pose Embedding | \ | RNN [schuster1997bidirectional] | 72.20
MPED-RNN + ME | Pose Embedding | ME | RNN [schuster1997bidirectional] | 73.08
Baseline | Normalized Pose | \ | \ | 68.32
Baseline + STT | Normalized Pose | \ | STT | 68.56
Baseline + ME | Normalized Pose | ME | \ | 76.92
Ours (ME + STT) | Normalized Pose | ME | STT | 81.26
4.1 Datasets and Setup
Datasets. We evaluate our method on two of the most challenging datasets. ShanghaiTech [luo2017revisit] contains 330 training videos and 107 testing ones, consisting of 13 training scenes and 12 testing scenes. ShanghaiTech-HR [2020Learning] is a subset of ShanghaiTech containing only Human-Related anomalies, with 101 testing videos. Corridor [2019Multi], a recent video anomaly detection dataset of the largest size, contains 10 abnormal classes in a single scene and includes both single- and multiple-person anomalies.
Pose Estimator. For a fair comparison, we adopt the same pose estimator as the compared methods to avoid variance caused by pose quality. Specifically, we use the tools [fang2017rmpe, xiu2018poseflow] to obtain pose trajectories on ShanghaiTech, as in [markovitz2020graph], while for the Corridor dataset we extract pose trajectories with the tools [openpose, chen2018real], as in [2019Multi]. Note that each pose joint is provided with a confidence score.

Implementation Details. We apply the AdamW [kingma2014adam] optimizer and adopt a warm-up schedule with 1000 steps. Empirically, the numbers of layers of the Spatial and Temporal Transformers are set to the same value. The dimension of the vector embedding is 128. Each trajectory contains 8 poses (T = 8), and each pose contains a fixed number of joints J depending on whether AlphaPose [fang2017rmpe] or OpenPose [openpose] results are used. To obtain the pose sequences, we sample each pose trajectory using sliding windows. Following BERT [devlin2018bert], a fixed mask ratio is applied to the poses. We normalize the frame-level anomaly scores in each scenario for the final evaluation, as in [Nguyen_2019_ICCV], and all experiments are conducted on the entire dataset without division by scenarios. More hyperparameter experiments are reported in the Supplementary.

Evaluation Metrics. Following convention, the Area Under Curve (AUC) is calculated as the evaluation metric. A higher AUC indicates better anomaly detection performance.
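For reference, the frame-level AUC can be computed directly from scores and labels via the rank statistic equivalent to the ROC area (a plain re-implementation, assuming label 1 marks an anomalous frame):

```python
def frame_auc(scores, labels):
    """ROC AUC as the probability that a random anomalous frame scores
    higher than a random normal one (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```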
4.2 Comparison with the StateoftheArts
Table 1 shows the comparison of our MoPRL with other state-of-the-art methods on two popular datasets, namely ShanghaiTech (SHT) and Corridor. We observe that our method obtains consistent and significant performance improvements over other pose-based methods on all conducted datasets. Qualitative examples are further provided to illustrate the anomaly prediction results of our method, as shown in Figure 5. Notably, our method can not only detect obvious anomalous events (e.g., "Biking"), but is also capable of observing subtle activities, such as "Robbing" happening far away from the camera. Besides, our MoPRL is sensitive to anomalies with extreme movements (e.g., "running" and "pushing") rather than appearance-based anomalous events (e.g., "holding a suspicious object"), which is reasonable given that the original RGB information (e.g., clothing and belongings) is absent when only the pose modality is adopted as input. However, many object-related anomaly classes (e.g., "wearing a mask" and "carrying a box") are included in the Corridor dataset, which explains why the performance achieved on the ShanghaiTech dataset is higher than on the Corridor dataset by a large margin (+10.6%). Therefore, our proposed method achieves effective and non-trivial performance in abnormal pose discrimination.
4.3 Ablation Studies
In this section, we first conduct a series of exploration studies of each module in our MoPRL method. As shown in Table 2, a linear autoencoder (including only an encoder and a decoder) with normalized poses as input is adopted as our baseline, which obtains only 68.32% AUC in our experiments. The proposed Motion Embedder is then also applied to another pose-based method [2020Learning] to show the generalization ability and necessity of motion representation.
Motion Representation. As shown in Table 2, with the help of ME, a significant AUC improvement can be observed even without sequence modeling (+8.60%), which demonstrates that motion representation is essential for pose-based anomaly detection methods. Without loss of generality, we further evaluate the necessity of motion representation in a classical multi-task pose-based method, MPED-RNN [2020Learning], which adopts an RNN for sequence modeling of input pose embeddings. For clarity, we reproduce the MPED-RNN method to fit the proposed motion prior of ME, and only the reconstruction task with the local normalized pose is applied, aligning with the setting of our MoPRL for a fair comparison. As expected, the proposed method brings a consistent performance improvement, as listed in Table 2. This confirms that ME can truly benefit other pose-based methods, and that motion representation should be accounted an essential factor in developing pose-based methods. However, the huge contrast in performance gains between the frameworks (8.60% vs. 0.88%) also raises the concern that such a distribution-based handcrafted feature extractor is still not optimal for all pose-based methods. Future research could focus on developing general motion representations for the pose modality.
Sequence Modeling. We further explore the impact of sequence modeling, which is considered the core of spatial-temporal regularity learning. However, our STT brings only a trivial performance improvement (+0.24%) without motion representation. After combining it with ME, STT further boosts the overall performance by a large margin (+4.34%), which reveals that the proposed ME module helps obtain discriminative motion clues from the input poses, while the STT module further models the temporal interdependencies. Thus, we argue that normalized poses alone poorly describe the motion dynamics of pose trajectories, which hinders the potential of such sequence modeling. We also observe that the RNN [schuster1997bidirectional] adopted in [2020Learning] actually leads to a performance decline (−3.84%) when ME is applied. This demonstrates that the RNN-based model limits the performance gain from such motion representations. We hypothesize that, unlike the transformer-based STT, which benefits from big data, the cascaded and history-dependent RNN underfits such a large data size with its limited model capacity. The visualization of poses reconstructed by STT and by the RNN (please refer to the Supplementary) further confirms this assumption: compared with STT, the RNN model reconstructs poses with huge deviations even for normal samples, and such deviations shadow the distinction between normality and anomaly brought by the motion prior. In conclusion, both motion representation and effective sequence modeling are important and indispensable for pose-based VAD.
4.4 Analysis on Motion Embedder
Motion Prior Type. As mentioned in Section 3.3, ME is designed to represent intuitive pose motion via a probability prior. Since the shape of this prior directly controls the probability of pose motion and explicitly decides how rare extreme motion should be, the selection of the prior matters to the final performance. In this section, we choose several explicit priors with clearly different shapes to quantitatively evaluate the impact of the motion prior type. As shown in the left half of Figure 6, we select a uniform prior and a Gaussian prior to compare with the Rayleigh prior applied in our experiments. Referring to Figure 3, the results show that the larger the difference between the selected prior and the statistical distribution on the training dataset, the more the performance of MoPRL declines (−12.07% with the uniform prior and −7.47% with the Gaussian prior). This demonstrates that the model can indeed benefit from a well-selected motion prior, and that the gain increases as the discrepancy between the prior and the real-world motion distribution decreases.
Spatial-Temporal Fusion Type. To obtain the spatial-temporal input for STT, ME fuses the motion prior into the poses. Hence, the fusion operation mentioned in Section 3.3 should be thoroughly explored to verify its effectiveness. In this section, besides division, we also deploy other common operations to conduct the fusion. As shown in the right half of Figure 6, the results show that MoPRL is less sensitive to different scale-related operations (80.14% with multiplication and 81.12% with division). In contrast, the common fusion strategy in pixel-based methods, i.e., addition, does not work in pose-based methods and impairs the model performance compared with the baseline (62.69% versus 67.58%). In this case, poses may simply be moved from their original spatial locations and perturbed by the fusion. This demonstrates the effectiveness and rationality of our proposed fusion method.
4.5 Analysis on Spatial-Temporal Transformer
Attention | Computational Complexity | AUC
No Attention | \ | 76.92
Joint | O((JT)^2) | 77.88
Spatial Only | O(J^2 T) | 78.64
Temporal Only | O(T^2 J) | 80.45
Spatial-Temporal | O(J^2 T + T^2 J) | 81.26
Attention Mechanism. A major concern with transformer-based models is the rapidly increasing complexity with input size. To alleviate this issue, we applied divided spatial and temporal attention. As shown in Table 3, we list comparative results among different attention mechanisms for quantitative evaluation. We establish the baseline as the model without any attention but with the motion prior. Joint attention represents the vanilla transformer taking the entire sequence as input: it improves the AUC only slightly while incurring the heaviest computational burden. Moreover, observing its decline compared with the spatial-only (78.64%) and temporal-only (80.45%) cases, we claim that indiscriminate attention over the entire sequence can actually harm the performance. Furthermore, temporal modeling matters more to the performance, with the most significant boost (+3.53%), which demonstrates that motion information is essential to video anomaly detection. Finally, the highest performance with Spatial-Temporal Attention demonstrates the effectiveness of our design.
N_s \ N_t  1  2  4  6  8

1  81.02  81.04  81.22  80.39  80.19 
2  81.15  81.26  80.14  80.86  79.42 
4  75.34  79.98  77.41  80.29  79.03 
6  78.25  74.92  78.67  79.79  78.50 
8  74.25  77.34  76.94  79.28  79.07 
Model Depth. In this section, we ablate several combinations of layer depths in the spatial and temporal dimensions. The results listed in Table 4 demonstrate that MoPRL does not actually benefit from a deeper attention structure: the deepest model brings a 2.19% decrease. Further, compared with the deepest temporal attention (N_t = 8), which brings an average 2.26% decrease, the deepest spatial attention (N_s = 8) leads to a more significant average drop (3.39%).
Pose Masking and Position Embedding. In this section, we verify the effectiveness of pose masking and position embedding. We leverage pose masking as a data augmentation operation. As shown in the left of Figure 7, model performance first improves as the pose mask ratio increases and then drops to about 81.0%, yet remains higher than without pose masking, which demonstrates that the model benefits from such perturbation. However, the model fails to handle regularity learning under too large a ratio. Furthermore, we also evaluate the performance change brought by different position embedding strategies. As shown in the right of Figure 7, the different position embedding approaches applied to MoPRL bring consistent performance gains, and temporal position embedding contributes more to model performance.
4.6 Runtime Analysis
Running speed is an essential factor for online video anomaly detection. The proposed MoPRL is deployed on a single NVIDIA RTX 3090 GPU. In inference, the best-AUC setting of MoPRL runs at 108 fps, and the shallowest setting runs at 158 fps with 81.02% AUC. For comparison, another pose-based method [2020Learning] runs at 161 fps and a pixel-based method [liu2021hybrid] runs at about 66 fps on the same GPU. The speed bottleneck of pose-based pipelines lies in the feature extraction, i.e., the pose estimation and tracking process, which usually runs at 15 fps. Thus, the end-to-end speed could be about 14 fps.

5 Discussion
Failure Cases. Pose-based methods depend heavily on the quality of the estimated poses. Therefore, when the pose estimator fails to extract the pose structure (e.g., when subjects are occluded or fast-moving), MoPRL performs poorly. Besides, the tracking algorithm often captures inaccurate trajectories in crowded scenes, which directly and seriously impedes MoPRL's detection. Moreover, failure cases are also observed for object-related anomalies (e.g., a human with a trailer or a mask) and motion direction anomalies (e.g., suddenly turning around). The absence of RGB information probably causes the former, and the Motion Embedder alone cannot capture directional motion features.
Limitations. Although this work adopts high-level pose features extracted from the video, we still find it too limiting to model pose motion with displacement alone; other essential characteristics, like motion direction, are ignored. Thus, in the future, we suggest fully considering all possible motion features to construct a complete and universal pose motion representation for related tasks.
Broader Impacts. The two surveillance video datasets [2019Multi, luo2017revisit] utilized in this work may pose potential risks to portrait rights, since the frames inevitably contain unwitting passersby despite curated collection.
6 Conclusion
In this paper, we propose a novel Motion Prior Regularity Learner (MoPRL) for pose-based video anomaly detection. Via the proposed Motion Embedder, MoPRL takes the pose motion probability under a prior derived from training-set statistics as an intuitive dynamic representation. MoPRL then models the regularity of pose trajectories with a spatial-temporal transformer equipped with divided attention. It achieves state-of-the-art results on two challenging mainstream datasets. Ablation studies and failure case analysis provide insights for future work.