Human in Events: A Large-Scale Benchmark for Human-centric Video Analysis in Complex Events

Along with the development of the modern smart city, human-centric video analysis faces the challenge of diverse and complex events in real scenes. A complex event involves dense crowds or anomalous individual or collective behavior. However, limited by the scale of available surveillance video datasets, few existing human analysis approaches report their performance on such complex events. To this end, we present a new large-scale dataset, named Human-in-Events or HiEve (human-centric video analysis in complex events), for understanding human motions, poses, and actions in a variety of realistic events, especially crowded and complex events. It contains a record number of poses (>1M), the largest number of action labels (>56k) for complex events, and one of the largest numbers of long-term trajectories (with average trajectory length >480). In addition, an online evaluation server is built for researchers to evaluate their approaches. Furthermore, we conduct extensive experiments on recent video analysis approaches, demonstrating that HiEve is a challenging dataset for human-centric video analysis. We expect that the dataset will advance the development of cutting-edge techniques in human-centric analysis and the understanding of complex events. The dataset is available at .



I Introduction

The development of the modern smart city relies heavily on advances in human-centric analysis. Multimedia understanding is one of the essential technologies for visual analysis, and it requires many human-centered and event-driven visual understanding tasks such as human pose estimation, pedestrian tracking, and action recognition.

Recently, several public datasets (e.g. MSCOCO [1], PoseTrack [2], UCF-Crime [3]) have been proposed to handle the aforementioned tasks. However, they have some limitations when applied to real scenarios with various complex events such as dining, earthquake escape, subway getting-off, and collision. First, most benchmarks focus on normal or relatively simple scenes. These scenes either have few occlusions or contain many easily predictable motions and poses. Second, the coverage and scale of existing benchmarks are still limited. For example, although the UCF-Crime dataset contains challenging scenes, it only has coarse video-level action labels, which may not be enough for fine-grained action recognition. Similarly, although the numbers of pose labels in MSCOCO and PoseTrack are sufficiently large for simple scenes with limited occlusion, these datasets lack realistic crowded scenes and complex events.

To this end, we present a new large-scale human-centric dataset, named Human-in-Events (HiEve), for understanding a hierarchy of human motions, poses, and actions in a variety of realistic events, especially crowded and complex events. Among all datasets for realistic crowd scenarios, HiEve has substantially larger scale and complexity: it contains a record number of poses (>1M), action labels (>56k), and long trajectories (with average trajectory length >480). Compared with existing datasets, HiEve contains more comprehensive and larger-scale annotations in more complex scenes, making it better suited for developing new human-centric analysis techniques and evaluating them in realistic scenes. (Table I provides a quantitative comparison of the HiEve dataset with related datasets in terms of their nature and scale.)

Additionally, we build an online evaluation server to enable timely and scalable evaluation on the held-out test videos. We also retrain existing state-of-the-art solutions on HiEve to benchmark their performance, demonstrating that HiEve is challenging and of great value for advancing human-centric video analysis.

Dataset          # pose      # box       # traj. (avg)   # action   pose track   surveillance   complex events
MSCOCO [1]       105,698     105,698     NA              NA
MPII [4]         14,993      14,993      NA              410
CrowdPose [5]    80,000      80,000      NA              NA
PoseTrack [2]    267,000     26,000      5,245 (49)      NA
MOT16 [6]        NA          292,733     1,276 (229)     NA
MOT17            NA          901,119     3,993 (226)     NA
MOT20 [7]        NA          1,652,040   3,457 (478)     NA
Avenue [8]       NA          NA          NA              15
UCF-Crime [3]    NA          NA          NA              1,900
Ours             1,099,357   1,302,481   2,687 (485)     56,643
Table I: Comparison between HiEve and existing datasets. “NA” indicates not available. “” denotes approximated value. “traj.” means trajectory and “avg” indicates average trajectory length.

II Related Works and Comparison

II-A Multi-object Tracking Datasets

Different from single-object tracking, multi-object tracking (MOT) does not rely solely on sophisticated appearance models to track objects across frames. In recent years, a corpus of datasets providing multi-object bounding-box and track annotations in video sequences has fostered the development of this field. PETS [9] is an early multi-sensor video dataset that includes annotations of crowd person counts and tracking of individuals within a crowd. Its sequences are all shot in the same scene, which leads to relatively simple samples. The KITTI [10] tracking dataset features videos from a vehicle-mounted camera and focuses on street scenarios; it provides 2D and 3D bounding-box and tracklet annotations. However, its single sensor limits the variety of camera angles. The most popular benchmark to date for evaluating tracking is MOT-Challenge [7], which shows pedestrians from a variety of viewpoints. However, given the rapid development of MOT algorithms and the limited scale and complexity of the MOT-Challenge dataset, it can no longer accurately reflect the tracking performance of each method in complex real-world scenarios.



Figure 1: (a) Keypoints definition (b) Example pose and bounding-box annotations from our dataset.

II-B Pose Estimation & Tracking Datasets

Human pose estimation in images has made great progress over the last few years. For single-person pose estimation, LSP [11] and FLIC [12] are the two most representative benchmarks: the former focuses on sports scenes, while the latter is collected from popular Hollywood movie sequences. Compared with LSP, FLIC only labels 10 upper-body joints and has a smaller data scale. As the natural extension of single-person pose estimation, multi-person pose estimation has gained much importance recently for its ability to handle varying numbers of people. WAF [13] is the first to establish a benchmark for multi-person pose estimation, with a simplified key-point and body definition. Then, the MPII [4] and MSCOCO [1] datasets were proposed to further advance the multi-person pose estimation task through the diversity and difficulty of their human poses. In particular, MSCOCO is regarded as the most widely used large-scale dataset, with 105,698 pose annotations in hundreds of everyday activities. Taking the tracking task into consideration, PoseTrack [2] builds a new video dataset that provides multi-person pose estimation and articulated tracking annotations. Similar to PoseTrack, JAT [14], a recently proposed massive CG dataset, simulates realistic urban scenarios for human pose estimation and tracking.

Figure 2: Samples of different actions from our training set and testing set.
Figure 3: The distribution of different scenes in HiEve dataset.


(a) MPII
(c) HiEve
Figure 4: Crowd Index distributions of MSCOCO and our HiEve dataset. MSCOCO is dominated by uncrowded images, while the HiEve dataset pays more attention to crowded cases.
Category                  Sub-events
Complex emergency event   fighting, quarreling, accident, robbery
Complex daily event       after-school, shopping, getting-off, dining
Simple daily event        walking, playing, waiting
Figure 5: The classification of sub-events.
Figure 6: The distribution of sub-events. Different colors represent different kinds of events.

II-C Action Recognition Datasets

Two human action video datasets have emerged as the standard benchmarks for the action recognition task: HMDB-51 [15] and UCF-101 [16]. HMDB-51 is mainly collected from movie sequences and contains 51 action categories. UCF-101 is one of the datasets with the largest number of action categories (101 classes) and samples, and it has significantly promoted the development of action recognition. Avenue [8] and UCF-Crime [3] were proposed to recognize realistic anomalous behavior. UCF-Crime annotates 13 anomalies in real-world surveillance videos, such as fighting, accident, and robbery. Recently, action recognition datasets with larger scale and detailed object information have been constructed [17, 18] to facilitate the advancement and evaluation of video analysis techniques. However, most of the content in these videos is collected from either unrealistic movie clips or uncrowded scenarios.

II-D Comparisons

The related datasets mentioned above have served the community very well, but their usefulness is now diminishing for the following reasons: (1) Most of them focus on normal or simple scenes (e.g. street, sports scene, single-person movement), which have few occlusions and are relatively simple for predicting motions or poses. (2) Their coverage and scale are no longer adequate for evaluating state-of-the-art algorithms. (3) Different human-centric video analysis tasks have to be trained and evaluated on separate benchmarks. Overall, compared with these datasets, our dataset has the following unique characteristics:

  • The HiEve dataset covers a wide range of human-centric understanding tasks including motion, pose, and action, while previous datasets only focus on a subset of our tasks.

  • The HiEve dataset has substantially larger data scales, including the currently largest number of poses (>1M), the largest number of complex-event action labels (>56k), and one of the largest numbers of long-term trajectories (with average trajectory length >480).

  • The HiEve dataset focuses on challenging scenes under various crowded and complex events (such as dining, earthquake escape, subway getting-off, and collision, cf. Figure 1), while the related datasets mostly cover normal or relatively simple scenes.

In a nutshell, our dataset contains more comprehensive and larger-scale annotations in various complex-event scenes, making it better suited for evaluating human-centric analysis techniques in realistic scenes.

III The HiEve dataset

III-A Collection and Annotation

Collection We start by selecting several crowded places with complex and diverse events for video collection (e.g. shopping mall, school, subway station, airport). Most of these videos are selected from our own private sequences and contain complex interactions between persons. We also shot some sequences on campus and on streets from various angles. Then, to further increase the variety and complexity of behavior in the videos, we searched YouTube for videos that record unusual scenes (e.g. jail, factory) and anomalous events (e.g. fighting, earthquake, robbery). Data redundancy is avoided through manual checking. To protect the privacy of the people and organizations involved, we blurred the faces and key text in the videos. Finally, 32 real-world video sequences are collected, each containing one or more complex events.

Figure 7: The number of tracks, objects and poses in events. Different colors represent different kinds of events.


(a) PoseTrack
(b) HiEve
Figure 8: The distribution of the length of track in PoseTrack and HiEve dataset.
(a) MOT17
(b) MOT20
(c) PoseTrack
(d) HiEve
Figure 9: The distribution of the number of people per frame in the MOT17, MOT20, PoseTrack and HiEve datasets. The scenes in the HiEve dataset contain more people.

Annotation In our dataset, the bounding boxes, keypoint-based poses, human identities, and human actions are all manually annotated. The annotation procedure is as follows: First, similar to the MOT dataset, we annotate bounding boxes for all moving pedestrians (e.g. running, walking, fighting, riding) and static people (e.g. standing, sitting, lying). A unique track ID is assigned to each person until the person moves out of the camera field of view.

Second, we annotate poses for each person in the entire video. Different from PoseTrack and COCO, our annotated pose for each body contains 14 key-points (Figure 1): nose, chest, shoulders, elbows, wrists, hips, knees, and ankles. We skip pose annotation in either of the following cases: (1) the person is strongly occluded; (2) the area of the bounding box is less than 500 pixels. Figure 1 presents some pose and bounding-box annotation examples.

Third, we annotate the actions of all individuals every 20 frames in a video. For group actions, we assign the action label to each group member involved in the group activity. In total, we define 14 action categories: walking-alone, walking-together, running-alone, running-together, riding, sitting-talking, sitting-alone, queuing, standing-alone, gathering, fighting, fall-over, walking-up-down-stairs, and crouching-bowing. Finally, all annotations are double-checked to ensure their quality.
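The annotation rules above can be summarized in a few lines of code. This is only a sketch under stated assumptions: the left/right keypoint names, their order, the boolean `strongly_occluded` flag, and the choice of frame 0 as the first action-annotated frame are hypothetical, since the text does not specify them.

```python
# Hypothetical sketch of the annotation rules described in the text.
# Keypoint order and left/right naming are assumptions; the text only
# lists the 14 body parts.
KEYPOINTS = [
    "nose", "chest",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]

MIN_BOX_AREA = 500   # pixels; smaller boxes receive no pose annotation
ACTION_STRIDE = 20   # actions are annotated every 20 frames


def should_annotate_pose(box_w, box_h, strongly_occluded):
    """Return True if a pose is annotated for this bounding box."""
    return (not strongly_occluded) and (box_w * box_h >= MIN_BOX_AREA)


def action_annotated_frames(num_frames, stride=ACTION_STRIDE):
    """Frame indices that carry action labels (assumes indexing starts at 0)."""
    return list(range(0, num_frames, stride))
```

For example, a 20x20 box (400 pixels) would be skipped for pose annotation, while a 25x20 box (500 pixels) would be kept.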

III-B HiEve Statistics

Our dataset contains 32 video sequences, most of them longer than 900 frames. Their total length is 33 minutes and 18 seconds. We split these video sequences into 19 videos for training and 13 videos for testing. Table I shows the basic statistics of our HiEve dataset: it has 49,820 frames, 1,302,481 bounding-box annotations, 2,687 track annotations, 1,099,357 human pose annotations, and 56,643 action annotations, making it the largest-scale human-centric dataset to our knowledge. Moreover, our video sequences are collected from 9 different scenes: airport, dining hall, indoor, jail, mall, square, school, station, and street. Figure 3 shows the distribution of scenes in our HiEve dataset.

We group our video sequences into 11 sub-events: fighting, quarreling, accident, robbery, after-school, shopping, getting-off, dining, walking, playing, and waiting. Then, according to their complexity, we divide these sub-events into 3 categories: complex emergency event, complex daily event, and simple daily event. The sub-events belonging to these 3 categories are shown in Figure 5. We count the number of poses, objects, and tracks for the above 3 categories in Figure 7, which shows that the complex events we define do contain more human-centric information. Moreover, Figure 6 presents the average frame number of each sub-event. It can be observed that complex events dominate the HiEve dataset.
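As a quick consistency check, the headline statistics can be related to each other with simple arithmetic. The sketch below assumes that every bounding box belongs to exactly one track, so the average trajectory length is the number of boxes divided by the number of tracks; under that assumption the result agrees with the value of 485 listed for HiEve in Table I. The variable names are illustrative only.

```python
# Figures taken from the statistics paragraph above.
frames = 49_820
boxes = 1_302_481
tracks = 2_687
duration_s = 33 * 60 + 18  # 33 minutes 18 seconds

# Assumption: each bounding box belongs to exactly one track.
avg_track_length = boxes / tracks
avg_boxes_per_frame = boxes / frames
approx_fps = frames / duration_s

print(round(avg_track_length))       # 485
print(f"{avg_boxes_per_frame:.1f}")  # 26.1
print(f"{approx_fps:.1f}")           # 24.9
```

The derived figure of roughly 26 annotated people per frame illustrates how crowded the sequences are on average.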

Figure 10: Sample distribution of all action classes in the HiEve dataset.
Figure 11: The distribution of the number of concurrent actions in the HiEve dataset. Different colors represent different kinds of events.

To further illustrate the characteristics of our dataset, we conduct the following statistical analysis for each task.

First, we count the number of people per frame in our dataset. Figure 9 shows that the scenes in our video sequences contain more people than those in MOT17 and PoseTrack [2], which makes our tracking task more difficult. Although MOT20 [7] collects some video sequences with more people (up to 141 people), its number of frames is limited.

Second, we adopt the Crowd Index defined in CrowdPose [5] to measure the crowding level of our dataset. For a given frame, its Crowd Index (CI) is computed as:

$$\mathrm{CI} = \frac{1}{n} \sum_{i=1}^{n} \frac{N_i^{b}}{N_i^{a}} \qquad (1)$$

where $n$ is the total number of persons in this frame, $N_i^{a}$ denotes the number of joints from the $i$-th human instance, and $N_i^{b}$ is the number of joints located in the bounding box of the $i$-th human instance but not belonging to the $i$-th person. We evaluate the Crowd Index distributions of our HiEve dataset and the widely used pose datasets MSCOCO [1] and MPII [4]. Figure 4 shows that our HiEve dataset pays more attention to crowded scenes, while other benchmarks are dominated by uncrowded ones. This characteristic discourages methods evaluated on our dataset from focusing only on simple cases and ignoring crowded ones.
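To make the Crowd Index concrete, the following sketch computes it for one frame from the verbal definition above: for each person, the number of joints of other people that fall inside that person's bounding box is divided by the number of the person's own joints, and the ratios are averaged over all persons. The data layout (boxes as `(x1, y1, x2, y2)` tuples, joints as `(x, y)` tuples) and the handling of persons without annotated joints are assumptions made for illustration.

```python
def crowd_index(boxes, keypoints):
    """Crowd Index of one frame.

    boxes:     list of (x1, y1, x2, y2), one per person
    keypoints: list of lists of (x, y) joints, aligned with `boxes`
    """
    n = len(boxes)
    if n == 0:
        return 0.0
    total = 0.0
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        own = len(keypoints[i])
        if own == 0:
            # Assumption: a person without annotated joints contributes 0.
            continue
        # Joints of other persons that lie inside person i's bounding box.
        intruding = sum(
            1
            for j, joints in enumerate(keypoints)
            if j != i
            for (x, y) in joints
            if x1 <= x <= x2 and y1 <= y <= y2
        )
        total += intruding / own
    return total / n


# Two overlapping persons: each box contains one joint of the other person.
boxes = [(0, 0, 10, 10), (5, 0, 15, 10)]
joints = [[(2, 2), (6, 6)], [(7, 3), (12, 5)]]
print(crowd_index(boxes, joints))  # 0.5
```

A frame with well-separated people yields a Crowd Index of 0, and the value grows as bounding boxes increasingly contain joints of neighboring people.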

Third, we analyze the ratio of disconnected human tracks in our dataset. Disconnected human tracks are defined as trajectory annotations whose bounding boxes are unavailable in some frames for one of the following reasons: (1) The object temporarily moves out of the camera view and returns some time later. (2) The object is severely occluded by foreground objects or obstacles for so long that annotators cannot assign an approximate bounding box to it (as exemplified in Figure 13). Note that in datasets like PoseTrack [2], the reappearance of an individual in the scene is taken as the start of a distinct trajectory rather than the continuation of the original track, so these datasets contain more tracks of shorter duration (as reflected in Figure 8). In contrast, in HiEve we assign the same ID to the tracks before and after the disappearance, which encourages algorithms that can properly handle long-term re-identification. The numbers of disconnected and continuous tracks are reported in Figure 12. The statistics show that the proportion of disconnected tracks is non-negligible. This characteristic makes HiEve more challenging for robust and accurate tracking.
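The notion of a disconnected track can be expressed as a simple check on the frame indices in which a track ID has a bounding box. This sketch assumes that tracks are annotated on every frame while visible and are stored as a mapping from a track ID to a list of frame indices; both the storage format and the function names are hypothetical.

```python
def is_disconnected(frame_ids):
    """True if the track has at least one gap between consecutive boxes."""
    frames = sorted(set(frame_ids))
    return any(b - a > 1 for a, b in zip(frames, frames[1:]))


def split_tracks(tracks):
    """Split track IDs into (disconnected, continuous) sets.

    tracks: dict mapping track ID -> list of frame indices with a box
    """
    disconnected = {tid for tid, frames in tracks.items() if is_disconnected(frames)}
    continuous = set(tracks) - disconnected
    return disconnected, continuous


# Track 2 leaves the view after frame 1 and reappears at frame 5 with the same ID.
tracks = {1: [0, 1, 2, 3], 2: [0, 1, 5, 6]}
print(split_tracks(tracks))  # ({2}, {1})
```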

Finally, the distribution of all action classes in our dataset is shown in Figure 10, which can be regarded as a long-tailed sample distribution. Figure 11 shows that the complex events in our dataset have more concurrent actions, which increases the complexity and difficulty of identifying behaviors in such scenes.

Overall, these statistics further show that HiEve is a large-scale and challenging dataset that is dominated by complex events.

(a) training set
(b) test set
Figure 12: The number of disconnected and continuous tracks in (a) training set and (b) test set.
Figure 13: Examples of disconnected tracks (highlighted with bounding box)

III-C Challenges

The proposed HiEve dataset poses four challenges:

Multi-person Motion Tracking. This task estimates the location and trajectory of each identity throughout a video. Specifically, we provide two sub-tracks:

  • Public: In this sub-track, all competitors can only use the public object detection results provided by us, which are obtained via Faster R-CNN [19].

  • Private: Participants in this sub-track can generate their own detection bounding boxes with any approach.

Crowd Pose Estimation. This task is similar to the ones covered by existing datasets like MPII Pose and MSCOCO Keypoints, but our dataset involves more real-scene pose patterns in various complex events.

Crowd Pose Tracking. This task requires temporally consistent poses for all people visible in the videos. Compared with PoseTrack, our dataset is much larger in scale and includes more frequent occlusions.

Person-level Action Recognition. The action recognition task requires participants to simultaneously detect specific individuals and assign correct action labels to them on every sampled frame. This task is similar to the AVA challenge [18]. However, our action recognition track not only contains atomic-level action definitions but also involves more interaction and occlusion among individuals, making recognition more difficult.

IV Evaluation

IV-A Multi-person tracking

  • MOTA & MOTP [6]: These are standard metrics for evaluating object tracking performance in video sequences. MOTA accounts for false positives, missed targets, and identity switches. MOTP measures the trajectory similarity between the predicted results and the ground truth. These metrics are adopted in the tracks of multi-person tracking and multi-person pose estimation and tracking. Our final performance ranking in this track is based on MOTA.

  • ID F1 Score [20]: The ratio of correctly identified detections over the average number of ground-truth and computed detections.

  • ID Sw [20]: The total number of identity switches.
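The tracking metrics above can be computed from aggregate counts. The sketch below uses the standard CLEAR-MOT form of MOTA, which subtracts the ratio of all errors to the number of ground-truth boxes from 1, and writes ID F1 exactly as the ratio described in the list. The function and argument names are illustrative; in practice the counts would come from a matching step between predicted and ground-truth tracks, which is not shown here.

```python
def mota(num_false_negatives, num_false_positives, num_id_switches, num_gt_boxes):
    """MOTA = 1 - (FN + FP + IDSw) / GT, using the standard CLEAR-MOT form."""
    if num_gt_boxes <= 0:
        raise ValueError("num_gt_boxes must be positive")
    errors = num_false_negatives + num_false_positives + num_id_switches
    return 1.0 - errors / num_gt_boxes


def id_f1(idtp, idfp, idfn):
    """Correctly identified detections over the average number of
    ground-truth (idtp + idfn) and computed (idtp + idfp) detections."""
    denominator = 2 * idtp + idfp + idfn
    return 2 * idtp / denominator if denominator else 0.0


# Toy counts, not taken from the paper.
print(round(mota(10, 5, 5, 100), 4))  # 0.8
print(round(id_f1(80, 20, 20), 4))    # 0.8
```

Note that MOTA can become negative when the number of errors exceeds the number of ground-truth boxes.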

IV-B Multi-person pose estimation

  • AP@α We adopt Average Precision (AP) for measuring multi-person pose accuracy. The evaluation protocol is similar to DeepCut [21]: First, if a pose prediction has the highest PCKh [4] (head-normalized Percentage of Correct Keypoints, where α is a distance threshold that determines whether a detected keypoint is matched to an annotated keypoint) with a certain ground truth (GT), it is assigned to that ground truth. Unassigned predictions are counted as false positives. Finally, AP is computed as the area under the precision-recall curve.

  • w-AP@α To discourage methods from focusing only on simple, uncrowded cases in the dataset (although Figure 4 shows that our dataset contains a large number of crowded and complex scenarios), we assign larger weights to a test image during evaluation if it has (1) a higher Crowd Index or (2) anomalous behavior (e.g. fighting, fall-over, crouching-bowing). Specifically, the weight for a frame in one video sequence can be formulated as:

    where is the crowd index on frame calculated via Equation 1, denotes the number of categories of anomalous actions. During our evaluation, the coefficients are set to respectively. The values of AP calculated with assigned weights are called weighted AP (w-AP).

  • AP@avg We take the average value of AP@0.5, AP@0.75, and AP@0.9 as an overall measurement of keypoint estimation results, where 0.5, 0.75, and 0.9 are the distance thresholds used for computing PCKh.

  • w-AP@avg We take the average value of w-AP@0.5, w-AP@0.75, and w-AP@0.9 as an overall measurement of keypoint estimation results on weighted video frames, where 0.5, 0.75, and 0.9 are the distance thresholds used for computing PCKh. Our final performance ranking in this track is based on w-AP@avg.
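The two building blocks of these metrics, the PCKh keypoint test and the averaging over thresholds, can be sketched as follows. The `pckh` function assumes that a keypoint counts as correct when its distance to the annotation is at most α times a head size given in pixels; how the head size is obtained is not specified here and is treated as an input. The averaging step is checked against the DHRN row of Table III.

```python
import math


def pckh(pred, gt, head_size, alpha):
    """Fraction of keypoints whose error is within alpha * head_size.

    pred, gt: equally long lists of (x, y) keypoints
    head_size: head segment length in pixels (assumed to be given)
    """
    if not gt:
        return 0.0
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= alpha * head_size
    )
    return correct / len(gt)


def ap_avg(ap_by_threshold, thresholds=(0.5, 0.75, 0.9)):
    """AP@avg: mean of the AP values at the given PCKh thresholds."""
    return sum(ap_by_threshold[t] for t in thresholds) / len(thresholds)


# One keypoint is 1 px off, the other 6 px off; with head_size=10 and
# alpha=0.5 only the first is within the 5 px tolerance.
print(pckh([(0, 0), (10, 0)], [(0, 1), (10, 6)], head_size=10, alpha=0.5))  # 0.5

# AP values of DHRN from Table III reproduce its reported AP@avg.
dhrn_ap = {0.5: 57.10, 0.75: 50.11, 0.9: 44.95}
print(f"{ap_avg(dhrn_ap):.2f}")  # 50.72
```

The same averaging applies to w-AP@avg once the per-threshold w-AP values are available.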

IV-C Pose tracking

  • MOTA & MOTP The MOTA and MOTP metrics from the tracking task are also adopted for pose tracking evaluation. Our final performance ranking for pose tracking is based on MOTA.

  • AP We calculate AP for pose tracking evaluation in the same way as in multi-person pose estimation.

IV-D Action recognition

  • f-mAP@α The frame mAP (f-mAP) is a common metric for evaluating spatial action detection accuracy on a single frame. Each prediction consists of a bounding box and a predicted action label. If it has an overlap larger than a certain threshold α with an unmatched ground-truth box of the same label, it is counted as a true positive; otherwise it is a false positive. This process is conducted on each frame annotated with action boxes. The AP value is computed for each label as the area under the precision-recall curve, and the mean AP value is computed by averaging the AP values of all labels. Additionally, to avoid evaluating on objects that are visually ambiguous or whose behavior is impossible to tell, we filter out bounding boxes that are extremely small or severely occluded by others from the annotations in the test set. Consequently, only a subset of the annotated boxes is used to evaluate the performance.

  • wf-mAP@α Considering the unbalanced distribution of the action categories in the dataset, we assign smaller weights to the test samples belonging to dominant categories. In addition, we assign larger weights to frames in crowded and occluded scenarios to encourage models that perform better in complex scenes. Specifically, the weight is calculated in the same way as in the pose estimation track, but a higher factor is adopted. As with weighted AP (w-AP), the frame mAP value calculated with these assigned weights is called weighted frame-mAP (wf-mAP for short).

  • f-mAP@avg We report f-mAP@0.5, f-mAP@0.6, and f-mAP@0.75, where 0.5, 0.6, and 0.75 are the IoU thresholds used to determine true and false positives, and take their mean value as an overall measurement of f-mAP. We denote this measurement as f-mAP@avg.

  • wf-mAP@avg Similarly, we report wf-mAP@0.5, wf-mAP@0.6, and wf-mAP@0.75, and take their mean value as an overall measurement of wf-mAP. We denote this measurement as wf-mAP@avg, and our final performance ranking in this track is based on wf-mAP@avg.
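The per-frame matching step behind f-mAP can be sketched as below. Predictions are processed in descending score order, and each one is matched to the unmatched ground-truth box of the same label with the highest IoU, provided the IoU exceeds α. Processing by descending score and choosing the best-overlapping box are common conventions assumed here, not details stated in the text, and the data layout is hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_frame(preds, gts, alpha):
    """Label each prediction of one frame as true or false positive.

    preds: list of (box, label, score)
    gts:   list of (box, label)
    Returns a list of (score, is_true_positive), highest score first.
    """
    matched = set()
    results = []
    for box, label, score in sorted(preds, key=lambda p: p[2], reverse=True):
        best_j, best_iou = None, alpha
        for j, (gt_box, gt_label) in enumerate(gts):
            if j in matched or gt_label != label:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:  # overlap must be larger than alpha
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched.add(best_j)
        results.append((score, best_j is not None))
    return results


preds = [
    ((0, 0, 10, 10), "walking-alone", 0.9),
    ((0, 0, 10, 10), "walking-alone", 0.8),  # duplicate of an already matched box
    ((20, 20, 30, 30), "fighting", 0.7),     # right box, wrong label
]
gts = [((0, 0, 10, 10), "walking-alone"), ((20, 20, 30, 30), "queuing")]
print(match_frame(preds, gts, alpha=0.5))
# [(0.9, True), (0.8, False), (0.7, False)]
```

The resulting true/false positive flags, collected over all annotated frames, would then feed the per-label precision-recall curves from which AP is computed.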

V Experiments

In this section, we first introduce the baseline methods that we evaluate on the HiEve dataset for the four challenges. Then, their performance is reported for each challenge.

Method            MOTA    MOTP    IDF1    MT       ML       FP     FN      IDSw
DeepSORT [22]     27.12   70.47   28.55   8.50%    41.45%   5894   42668   2220
MOTDT [23]        26.09   76.50   32.88   8.70%    54.56%   6318   43577   1599
IOUtracker [24]   38.59   76.23   38.62   28.33%   27.60%   9640   28993   4153
Table II: Results of multi-person tracking baselines.
Method                 w-AP@avg   w-AP@0.5   w-AP@0.75   w-AP@0.9   AP@avg   AP@0.5   AP@0.75   AP@0.9
DHRN [25]              47.12      53.66      46.42       41.28      50.72    57.10    50.11     44.95
Simple Baseline [26]   41.36      47.67      40.73       35.68      44.56    50.84    43.85     38.98
RMPE [27]              49.56      57.85      47.90       42.92      53.26    61.26    51.69     46.84
Table III: Results of multi-person pose estimation.
Method            MOTA    MOTP    AP
PoseFlow [28]     44.17   48.33   60.10
LightTrack [29]   27.44   55.23   29.36
Table IV: Results of multi-pose tracking baselines.
Method             wf-mAP@avg   wf-mAP@0.5   wf-mAP@0.6   wf-mAP@0.75   f-mAP@avg   f-mAP@0.5   f-mAP@0.6   f-mAP@0.75
RPN+I3D            6.88         9.65         7.91         3.07          8.31        11.01       9.65        4.26
Faster R-CNN+I3D   10.13        13.35        11.57        5.49          10.95       14.50       12.33       6.01
Transformer+I3D    7.28         9.88         8.32         3.65          7.03        9.32        8.10        3.66
Table V: Results of action recognition baselines.

V-A Multi-person tracking

  • DeepSORT [22]. Based on the SORT [30] algorithm, it introduces a model pre-trained offline on person re-ID datasets. During online tracking, appearance features are extracted by the pre-trained model and simple nearest-neighbor queries are performed, which improves multi-object tracking performance under occlusion and effectively reduces the number of identity switches.

  • MOTDT [23]. MOTDT tackles unreliable detections by selecting candidates from the outputs of both detection and tracking. In addition, a new scoring function for candidate selection is formulated with an efficient R-FCN, which shares computations over the entire image.

  • IOUtracker [24]. IOUtracker is a very simple and efficient tracking algorithm that only leverages the detection results and uses an IoU-based association strategy to improve multi-object tracking performance at very low cost.
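To illustrate why an IoU-based tracker needs no image information, here is a deliberately simplified sketch of greedy IoU association between consecutive frames. It is not the implementation of IOUtracker [24]: the function names and the single `iou_threshold` parameter are assumptions, and the published tracker may apply additional filtering that is omitted here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def track_by_iou(frames, iou_threshold=0.5):
    """Greedy IoU association.

    frames: list with one entry per frame, each a list of detection boxes
    Returns a list of tracks; each track is a list of (frame_index, box).
    """
    active, finished = [], []
    for t, detections in enumerate(frames):
        unmatched = list(detections)
        still_active = []
        for track in active:
            last_box = track[-1][1]
            # Extend the track with the best-overlapping unmatched detection.
            best = max(unmatched, key=lambda d: iou(last_box, d), default=None)
            if best is not None and iou(last_box, best) >= iou_threshold:
                track.append((t, best))
                unmatched.remove(best)
                still_active.append(track)
            else:
                finished.append(track)  # no continuation found: end the track
        # Every remaining detection starts a new track.
        still_active.extend([(t, det)] for det in unmatched)
        active = still_active
    return finished + active


frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 0, 11, 10), (51, 50, 61, 60)],
    [(2, 0, 12, 10)],  # the second person is no longer detected
]
tracks = track_by_iou(frames)
print(sorted(len(track) for track in tracks))  # [2, 3]
```

Because a track ends as soon as no detection overlaps its last box, a scheme of this kind cannot re-identify a person after a gap, which is consistent with the relatively high number of identity switches reported for IOUtracker in Table II.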

Implementation Details

Faster R-CNN [19] is first used to obtain the public detection bounding boxes. For MOTDT and DeepSORT, we use the HiEve training set and its ground truth to fine-tune the official deep models of these methods. Then, using the official code of each baseline, we evaluate them on the HiEve test set with the public detection results. The detection threshold is set to 0.2. All the experiments were carried out on a single NVIDIA-2080 GPU.

Results and Analysis

The results of these three baselines are shown in Table II. All three methods perform poorly on our dataset, because it has complex scenes and a large number of overlapping targets, which makes identification and tracking more difficult. IOUtracker performs best on our dataset, while MOTDT [23] and DeepSORT [22] perform relatively poorly. The reason is that our dataset contains a large number of crowded scenes and occlusions, so it is hard for a ReID model to extract discriminative features from many overlapping bounding boxes. The ReID model is an important component of MOTDT and DeepSORT, whereas IOUtracker does not use image information, so it performs better than the other two methods.

V-B Multi-pose estimation

  • RMPE [27]. RMPE is a multi-person pose estimation framework. It designs a symmetric spatial transformer network (SSTN) to transform and correct inaccurate object localization. In addition, pose NMS is proposed to solve the problem of redundant human detections.

  • Simple-Baseline [26]. Its pose estimation model consists of a few deconvolutional layers added to a backbone network (ResNet [31]). Extensive experiments show that it is a simple and strong baseline for pose estimation and tracking.

  • DHRN [25]. DHRN aims to learn reliable high-resolution representations for pose estimation. High-to-low resolution subnetworks are added one by one to form more stages, and the multi-resolution subnetworks are connected in parallel. Thus, DHRN maintains high-resolution representations through the whole process.

Implementation Details

For the above top-down methods, we take the detection results of YOLO v3 [32] as their input. All performance is reported using their official code and models. For DHRN, the HRNet-32 model pre-trained on MPII [4] is selected as the backbone, with input object size 256. For Simple Baseline, we use the ResNet-50 [31] model pre-trained on MPII and adopt the flip strategy for testing. For RMPE, SPPE [33] is the backbone during testing.

Results and Analysis

We present their evaluation results in Table III. Although DHRN [25] is the current state-of-the-art pose estimation model, RMPE [27] performs better on our dataset. The reason might be the poor performance of the object detector in crowded scenes. The STN module in RMPE is able to correct inaccurate detection results, while DHRN is more sensitive to the quality of the detection boxes. In addition, RMPE’s pose-NMS algorithm can better filter out redundant low-quality poses in dense scenes. Simple Baseline performs considerably worse than both RMPE and DHRN when dealing with crowded poses in complex scenes.

V-C Pose tracking

  • PoseFlow [28]. It is an efficient pose tracker based on flows and top-down approaches. An online optimization framework is designed to build the association of cross-frame poses and form pose flows (PF-Builder). Then, a pose flow non-maximum suppression (PF-NMS) is designed to robustly reduce redundant pose flows and re-link temporally disjoint ones.

  • LightTrack [29]. LightTrack is an effective lightweight framework for online human pose tracking. It unifies single-person pose tracking with multi-person identity association and bridges keypoint tracking with object tracking. Moreover, a Siamese Graph Convolution Network (SGCN) is proposed for human pose matching as a Re-ID module.

Implementation Details

In LightTrack, YOLO v3, Siamese GCN, and MobileNet are selected as the keyframe detector, ReID module, and pose estimator, respectively. In PoseFlow, we use DeepMatching to extract dense correspondences between adjacent frames. All model weights are inherited from models pre-trained on MSCOCO [1].

Results and Analysis

The performance comparison of these two methods is presented in Table IV. As expected, the flow-based algorithm PoseFlow achieves higher performance while LightTrack [29] mainly aims to strike a balance between speed and accuracy.

V-D Action recognition

  • RPN+I3D [34]. This is a strong baseline model for the AVA challenge, where the I3D [35] network is applied for feature extraction and classification, and the feature from the labelled key-frame is fed to an RPN [19] for region proposal.

  • Faster R-CNN+I3D. We further improve the baseline in [34] for better localization. Specifically, because the scenes in HiEve are crowded and the I3D network downsamples the feature resolution, we do not predict regions directly on the I3D feature. Instead, an independent Faster R-CNN detector [19] is applied to the input key-frame to obtain bounding box proposals.

  • Video Transformer Network [36]. The Video Transformer Network (VTN) takes the I3D network as backbone and applies a key-value attention mechanism to model the interaction among objects before the classification layer to improve recognition results.
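The key-value attention mentioned for VTN can be sketched as generic scaled dot-product attention, in which one person's feature (the query) aggregates information from context features (the keys and values). This is a simplified illustration rather than the transformer head of [36]: it omits learned projections and multiple heads, and the function names `softmax` and `attend` are chosen here for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    query:  list of d floats (e.g., one person's RoI feature).
    keys:   list of n key vectors, each with d floats.
    values: list of n value vectors (any common length).
    Returns (attended feature, attention weights).
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    out = [
        sum(w * v[c] for w, v in zip(weights, values))
        for c in range(len(values[0]))
    ]
    return out, weights
```

The output is a weighted average of the value vectors, so the resulting feature depends on the surrounding context as well as on the person itself.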

Implementation Details

For all three action recognition baselines, we adopt the RGB-I3D [35] network with Inception-V1, initialized with Kinetics-pretrained weights, as the video feature extractor. In RPN+I3D, following [34], we generate region proposals with an RPN on the key-frame feature and perform action classification and box regression with the I3D head. In Faster R-CNN+I3D, we use the detection results of a Faster R-CNN detector as RoIs and perform action classification on RoI-aligned features. In VTN, we use the same Faster R-CNN detection results as RoIs, but employ the transformer head in [36] for action classification. For all experiments, we fix the statistics of the batch norm layers in I3D and use batches of 8 clips, each with 20 frames of input size

. We train the model for 50 epochs using the SGD optimizer with a momentum of 0.9. The learning rate is initially set to 0.01 and is multiplied by 0.1 at epoch 30. The experiments are carried out on one NVIDIA TITAN RTX GPU.
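The optimization schedule described above can be written out as a short sketch. Two points are assumptions rather than details from the text: the decay is taken to be multiplicative (the rate becomes 0.001, since subtracting 0.1 from 0.01 would be negative), and epochs are 0-indexed. The momentum update follows the common convention in which the velocity accumulates raw gradients; the function names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, decay_epoch=30, gamma=0.1):
    """Step schedule: base_lr before decay_epoch, base_lr * gamma afterwards.

    Epochs are assumed to be 0-indexed, and the decay is assumed to be
    multiplicative.
    """
    return base_lr * gamma if epoch >= decay_epoch else base_lr


def sgd_momentum_step(param, grad, velocity, lr, momentum=0.9):
    """One SGD-with-momentum update for a scalar parameter.

    velocity <- momentum * velocity + grad
    param    <- param - lr * velocity
    Returns the updated (param, velocity).
    """
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity


# The 50-epoch schedule: 0.01 for epochs 0-29, then 0.001 for epochs 30-49.
schedule = [learning_rate(epoch) for epoch in range(50)]
```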

Results and Analysis

The performance of these three baselines is shown in Table V. The model employing I3D [35] for action classification and Faster R-CNN for person detection performs best on our dataset, outperforming the one that uses I3D for both detection and classification. This is probably because our dataset contains a large number of crowded scenes, which are challenging for detection. Therefore, utilizing a high-quality detector, e.g., Faster R-CNN, significantly improves the detection performance, and the detected bounding boxes also provide more useful information for action recognition. The transformer-based method is superior on the AVA [37] dataset but performs comparatively poorly on ours. This might be because the AVA dataset contains many action categories that focus on human-human and human-object interaction, while our dataset pays more attention to the actions of a single identity under complex event conditions. The I3D head could be more suitable for summarizing an individual’s behavior over a period of time, while the transformer places too much emphasis on the context.

VI Conclusion

We present HiEve, a large-scale dataset for human-centric video analysis. The HiEve dataset covers a wide range of crowded scenes and complex events. We report the results of a number of approaches on our dataset. Extensive experiments show that HiEve is a challenging dataset for pose estimation, pose tracking, multi-person tracking, and action recognition.


  • [1] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
  • [2] M. Andriluka, U. Iqbal, E. Insafutdinov, L. Pishchulin, A. Milan, J. Gall, and B. Schiele, “Posetrack: A benchmark for human pose estimation and tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5167–5176.
  • [3] W. Sultani, C. Chen, and M. Shah, “Real-world anomaly detection in surveillance videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6479–6488.
  • [4] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human pose estimation: New benchmark and state of the art analysis,” in Proceedings of the IEEE Conference on computer Vision and Pattern Recognition, 2014, pp. 3686–3693.
  • [5] J. Li, C. Wang, H. Zhu, Y. Mao, H.-S. Fang, and C. Lu, “Crowdpose: Efficient crowded scenes pose estimation and a new benchmark,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10863–10872.
  • [6] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “Mot16: A benchmark for multi-object tracking,” arXiv preprint arXiv:1603.00831, 2016.
  • [7] P. Dendorfer, H. Rezatofighi, A. Milan, J. Shi, D. Cremers, I. Reid, S. Roth, K. Schindler, and L. Leal-Taixé, “Mot20: A benchmark for multi object tracking in crowded scenes,” arXiv preprint arXiv:2003.09003, 2020.
  • [8] C. Lu, J. Shi, and J. Jia, “Abnormal event detection at 150 fps in matlab,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 2720–2727.
  • [9] J. Ferryman and A. Shahrokni, “Pets2009: Dataset and challenge,” in 2009 Twelfth IEEE international workshop on performance evaluation of tracking and surveillance.   IEEE, 2009, pp. 1–6.
  • [10] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2012, pp. 3354–3361.
  • [11] S. Johnson and M. Everingham, “Clustered pose and nonlinear appearance models for human pose estimation,” in BMVC, vol. 2, no. 4.   Citeseer, 2010, p. 5.
  • [12] B. Sapp and B. Taskar, “Modec: Multimodal decomposable models for human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3674–3681.
  • [13] M. Eichner and V. Ferrari, “We are family: Joint pose estimation of multiple persons,” in European conference on computer vision.   Springer, 2010, pp. 228–242.
  • [14] M. Fabbri, F. Lanzi, S. Calderara, A. Palazzi, R. Vezzani, and R. Cucchiara, “Learning to detect and track visible and occluded body joints in a virtual world,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 430–446.
  • [15] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
  • [16] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
  • [17] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman, “The kinetics human action video dataset,” 2017.
  • [18] C. Gu, C. Sun, S. Vijayanarasimhan, C. Pantofaru, D. A. Ross, G. Toderici, Y. Li, S. Ricco, R. Sukthankar, C. Schmid, and J. Malik, “Ava: A video dataset of spatio-temporally localized atomic visual actions,” CVPR, pp. 6047–6056, 2018.
  • [19] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • [20] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in European Conference on Computer Vision.   Springer, 2016, pp. 17–35.
  • [21] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele, “Deepcut: Joint subset partition and labeling for multi person pose estimation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4929–4937.
  • [22] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in 2017 IEEE international conference on image processing (ICIP).   IEEE, 2017, pp. 3645–3649.
  • [23] L. Chen, H. Ai, Z. Zhuang, and C. Shang, “Real-time multiple people tracking with deeply learned candidate selection and person re-identification,” in 2018 IEEE International Conference on Multimedia and Expo (ICME).   IEEE, 2018, pp. 1–6.
  • [24] E. Bochinski, V. Eiselein, and T. Sikora, “High-speed tracking-by-detection without using image information,” in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).   IEEE, 2017, pp. 1–6.
  • [25] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5693–5703.
  • [26] B. Xiao, H. Wu, and Y. Wei, “Simple baselines for human pose estimation and tracking,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 466–481.
  • [27] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, “Rmpe: Regional multi-person pose estimation,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2334–2343.
  • [28] Y. Xiu, J. Li, H. Wang, Y. Fang, and C. Lu, “Pose flow: Efficient online pose tracking,” arXiv preprint arXiv:1802.00977, 2018.
  • [29] G. Ning and H. Huang, “Lighttrack: A generic framework for online top-down human pose tracking,” arXiv preprint arXiv:1905.02822, 2019.
  • [30] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in 2016 IEEE International Conference on Image Processing (ICIP).   IEEE, 2016, pp. 3464–3468.
  • [31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [32] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv, 2018.
  • [33] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European conference on computer vision.   Springer, 2016, pp. 483–499.
  • [34] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman, “A better baseline for ava,” arXiv preprint arXiv:1807.10066, 2018.
  • [35] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
  • [36] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman, “Video action transformer network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 244–253.
  • [37] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar et al., “Ava: A video dataset of spatio-temporally localized atomic visual actions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6047–6056.