A Large Scale Urban Surveillance Video Dataset for Multiple-Object Tracking and Behavior Analysis

04/26/2019
by Guojun Yin, et al., USTC

Multiple-object tracking and behavior analysis are essential parts of surveillance video analysis for public security and urban management. With billions of surveillance videos captured all over the world, multiple-object tracking and behavior analysis by manual labor are cumbersome and costly. The rapid development of deep learning algorithms in recent years has created an urgent demand for a large-scale, well-annotated surveillance video dataset that reflects the diverse, congested, and complicated scenarios of real applications. This paper introduces an urban surveillance video dataset (USVD) which is by far the largest and most comprehensive. The dataset consists of 16 scenes captured in 7 typical outdoor scenarios: street, crossroads, hospital entrance, school gate, park, pedestrian mall, and public square. Over 200k video frames are annotated carefully, resulting in more than 3.7 million object bounding boxes and about 7.1 thousand trajectories. We further use this dataset to evaluate the performance of typical algorithms for multiple-object tracking and anomaly behavior analysis, and explore the robustness of these methods in congested urban scenarios.


1 Introduction

With the rapid development of digital acquisition and storage technologies, video surveillance has become one of the most important safety monitoring methods in use all around the world. As a very active topic in computer vision, research on surveillance video mainly aims to effectively analyze and extract information from the large amounts of unstructured video data acquired by surveillance cameras: automatically detecting, tracking, and identifying targets, analyzing their behaviors, understanding the events occurring in the scene, and raising alarms for suspicious events, so as to provide technical support for public security.

Among the various research topics in surveillance video analysis and scene recognition, multiple-object tracking and behavior analysis are major research fields. Since the concepts of intelligent transportation and the smart city were proposed, more and more researchers have focused on object tracking and behavior analysis in surveillance videos [1, 2]. However, the explosive growth in the number of vehicles and in the population has resulted in more congested and complicated urban environments, which brings many new challenges to research on surveillance video.

As many object tracking and anomaly behavior analysis algorithms have been proposed to deal with congested and complicated scenarios [3, 4, 5, 6], correspondingly challenging public datasets are required to provide fair comparisons. However, only a few real-world urban surveillance video datasets serve the purpose of evaluating the performance and robustness of object tracking and behavior analysis algorithms, and most of the existing surveillance video datasets [16] used in previous works are relatively small and simple, which makes them less suitable for assessing performance in the increasingly congested and complex scenarios of real-world applications.

Figure 1: The congested traffic scenes in urban environments. The ID numbers in the corners represent the scenarios respectively: street, crossroads, hospital entrance, school gate, park, pedestrian mall and public square.

 

Dataset #Clips #Annotated frames #Tracks #Boxes Density Camera Task
MIT Traffic[2] 20 520 - 2054 3.95 static Pedestrian detection
Caltech Pedestrian[1] 11 250000 - 350000 1.40 dynamic Pedestrian detection
Daimler Pedestrian * [7] 1 21790 - 56492 2.59 dynamic Pedestrian detection
KITTI Tracking ** [8] 21 7924 917 8896 1.12 dynamic Multiple object tracking
MOT Challenge 2015 2D [9] 22 11286 1221 101345 8.98 diverse Multiple object tracking
MOT Challenge 2016 [10] 14 11235 1342 292733 26.06 diverse Multiple object tracking
MOT Challenge 2017 [10] *** 14 11235 1331 300373 26.73 diverse Multiple object tracking
Our dataset 32 211706 7173 3758821 17.75 static Object detection and tracking

 

  • * The statistics of the Daimler Pedestrian detection dataset only include the test split.

  • ** The statistics of the KITTI Tracking dataset only include the training split, and the box counts include the bounding boxes with DontCare labels.

  • *** The sequences in MOT17 Challenge are the same as MOT16 sequences with a new, more accurate ground truth.

Table 1: Comparison of existing datasets in urban environments (object tracking and detection).

In this work, we propose a large-scale urban surveillance video dataset (USVD) with congested and complex scenarios for multiple-object tracking and anomaly behavior analysis. To the best of our knowledge, it is to date the largest and most realistic public dataset for real video surveillance. Our dataset offers four main advantages over existing datasets.

Realistic. All the data come from real public surveillance scenes, which enables evaluating computer vision algorithms for direct application to the real world.

Complex. The dataset comprises typical scenarios with differently congested scenes, which exhibit frequent occlusions, deformations, various viewpoints, and diverse targets.

Large Scale. The dataset consists of over 200k annotated frames and more than 3.7 million bounding boxes belonging to about 7.1 thousand unique trajectories.

Well-Annotated. All the bounding boxes are manually annotated and checked. The annotation includes location, size, object category, occlusion, and trajectory identity.
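The paper does not specify a storage format for these annotations. Purely as an illustration, a per-box record carrying the listed fields might look like the following sketch (all field names are hypothetical), which the sketches in later sections reuse:

```python
from dataclasses import dataclass

# The seven annotated target classes (defined in Sec. 3.2).
CLASSES = ("pedestrian", "riding", "car", "van", "bus", "tricycle", "truck")

@dataclass
class BoxAnnotation:
    """One annotated bounding box; the paper lists the annotated
    attributes but not a file format, so these names are assumptions."""
    frame: int        # frame index within the sequence
    track_id: int     # trajectory identity, stable across frames
    class_id: int     # index into CLASSES
    x: float          # top-left corner, in pixels
    y: float
    w: float          # box width, in pixels
    h: float          # box height, in pixels
    occluded: bool    # True if the target is completely occluded
```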

We also use the proposed dataset to evaluate the performance of typical algorithms for multiple-object tracking and anomaly detection, and explore the robustness of these methods in congested urban conditions.

2 Related Works

In recent years, the computer vision community has created benchmarks for video-related tasks such as scene recognition, pedestrian and object detection, object tracking, action recognition, and anomaly behavior detection. Despite the potential pitfalls of such datasets, they have proved extremely helpful in advancing the state of the art in the corresponding areas [11, 9, 12, 13]. An overview of existing datasets in urban environments for object detection and tracking is given in Tab. 1.

Real Urban Video Datasets. The MIT Traffic dataset [2] is an example of recent efforts to build more realistic urban traffic surveillance video datasets for research on pedestrian detection and activity analysis. It includes a 90-minute traffic video sequence recorded by a stationary camera, divided into 20 clips, with a frame size of 720 by 480. To evaluate human detection performance on this dataset, the pedestrians in some sampled frames are manually labeled as ground truth, for a total of 520 annotated frames and 2054 bounding boxes.

The Caltech Pedestrian dataset [1, 14] consists of approximately 10 hours of 30 Hz video taken from a vehicle driving through regular traffic in an urban environment. The data is roughly divided in half, setting aside 6 sessions for training and 5 for testing. About 250,000 frames are annotated, with a total of 350,000 bounding boxes and roughly 2,300 unique pedestrians.

The Daimler Monocular Pedestrian Detection dataset [7] is another dataset for pedestrian detection in urban environments. The training set contains pedestrian samples (image cut-outs at a fixed resolution) and 6,744 additional full images without pedestrians for extracting negative samples. The test set contains an independent sequence of 21,790 images with 56,492 pedestrian labels (fully visible or partially occluded), captured from a vehicle during a 27-minute drive through urban traffic.

Although the three datasets mentioned above help to evaluate pedestrian detection algorithms on real surveillance video to some extent, they only include fairly uncrowded urban scenes and few annotated bounding boxes, which makes them simple compared with today's complex and congested traffic conditions.

MOT Datasets. Recently, the KITTI benchmark [8] was introduced for challenges in autonomous driving, covering stereo flow, odometry, road and lane estimation, object detection, and tracking. Some of its sequences include crowded pedestrian crossings, making the dataset quite challenging; however, the camera is moving, which differs greatly from conventional static traffic surveillance video.

The MOT Challenge 2015 [9], 2016 [10], and 2017 [10] benchmarks are recent challenging benchmarks for multiple-object tracking. The videos in these benchmarks are diverse, and some are selected from existing datasets such as the KITTI Tracking dataset; however, they mix various video types, including surveillance and moving-camera videos. This motivates us to establish a public, challenging urban surveillance video dataset that is more realistic for evaluating object tracking and behavior analysis algorithms.

(a) Raw data
(b) Ground truth
(c) Detection results
(d) Trajectory distribution
Figure 2: An example sequence in the proposed USVD dataset.

3 Large Scale Urban Surveillance Video Dataset

An example sequence from our proposed dataset is shown in Fig. 2, in which the trajectory distribution only includes the trajectory locations in adjacent frames, and dotted bounding boxes denote completely-occluded targets. The targets of interest in urban environments are movable individuals or units, e.g., pedestrians, cars, or vans, rather than stationary objects, e.g., trees, pillars, or traffic lights. In this section, we introduce our large-scale urban surveillance video dataset (USVD) in detail.

 

Train sequences Test sequences
Seq Length Tracks Boxes Density Seq Length Tracks Boxes Density
01 10601 240 91055 8.5893 02 10989 219 118424 10.7766
03 11000 389 234806 21.3460 04 11101 460 241924 21.7930
05 4500 500 246923 54.8718 06 3501 407 209580 59.8629
07 6001 178 67160 11.1915 08 6001 174 71361 11.8915
09 6000 250 80973 13.4955 10 7501 253 100381 13.3823
11 7500 363 80416 10.7221 12 7501 465 92760 12.3664
13 2700 94 27964 10.3570 14 2801 141 42252 15.0846
15 5000 166 74398 14.8796 16 5001 164 73601 14.7173
17 7000 193 101165 14.4521 18 7001 212 103357 14.7632
19 7500 81 48701 6.4935 20 7501 86 50832 6.7767
21 7501 99 48517 6.4681 22 7500 119 65430 8.7240
23 7500 115 161220 21.4960 24 7501 101 150504 20.0645
25 5000 177 93921 18.7842 26 5001 173 92729 18.5421
27 7500 177 117658 15.6877 28 7501 207 140085 18.6755
29 5001 168 103983 20.7924 30 5000 230 119420 23.8840
31 4001 239 201006 50.2389 32 6000 333 306315 51.0525
Total 104305 3429 1779866 17.0636 Total 107401 3744 1978955 18.4258

 

Table 2: Statistics of sequences included in the proposed dataset.

3.1 Data Collection

The videos in the dataset are captured by surveillance cameras distributed in public places. We collect over 5 TB of video in total (72 hours per camera, captured from 7:00 to 19:00 over 6 days) and select 16 representative challenging scenes based on factors such as density, diversity, occlusion, deformation, viewpoint, and motion (the factor details are given in the supplementary material). The selected scenes are shown in Fig. 1, where the ID numbers in the corners represent the scenarios in turn: street, crossroads, hospital entrance, school gate, park, pedestrian mall, and public square.

3.2 Annotation

In order to evaluate object tracking and behavior analysis algorithms, we propose a clear protocol that was followed throughout the annotation of the entire dataset to guarantee consistency. The annotation rules (only part of the rules is shown here due to limited space; the full rules are provided in the supplementary material) include the following aspects:

Class. We annotate 7 classes of targets in urban scenarios: pedestrian, riding, car, van, bus, tricycle, and truck. The definitions used for the classes are as follows: 1) pedestrian: a single pedestrian, including walking, skating, and sitting persons, but not a person on a vehicle, bicycle, motorcycle, or scooter; note that pictures of people on advertising boards are regarded as background. 2) riding: a two-wheel vehicle with a person on it, e.g., a bicycle, motorcycle, or scooter. The annotated bounding box surrounds the extent of both the vehicle and the person, and the rider and vehicle are treated as one moving unit rather than being split into vehicle and person. Although a rider looks almost the same as a pedestrian from the waist up, the person on a vehicle is considered part of the vehicle and is not annotated as a pedestrian, which differs from the annotation in [9, 10]; parked cycles without people are ignored and regarded as background. 3) car: a four-wheel vehicle for transporting several persons, such as a hatchback, sedan, SUV, MPV, taxi, jeep, or convertible. 4) van: a four-wheel medium-size passenger car or van for transporting a small amount of cargo or used for engineering operations, such as an ambulance or a van. 5) bus: a four-wheel vehicle, bigger than a van, for carrying a large number of persons, such as a bus, mini-bus, or coach. 6) tricycle: a three-wheel vehicle for cargo or passenger transport, with the bounding box surrounding the extent of both the vehicle and the person or cargo. 7) truck: a four-wheel vehicle for cargo transport, such as a pickup, garbage truck, lorry, fire engine, or trailer.

Minimal size. Targets that are too small are ignored to ensure that annotation accuracy is not forfeited in congested and complex scenes. A threshold in pixels is defined on the bounding box's longest side: if a target's longest side stays below the threshold throughout its whole trajectory, the target is ignored. However, targets whose longest side falls below the threshold in only part of the trajectory are still annotated there, in order to keep the trajectory complete; all such too-small boxes in a trajectory are ignored when evaluating multiple-object tracking methods, as sketched below.
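A minimal sketch of how this rule could be applied per trajectory, reusing the hypothetical BoxAnnotation record above; the pixel threshold is left as a parameter since its exact value is not reproduced here:

```python
def apply_min_size_rule(track, min_side):
    """Apply the minimal-size rule to one trajectory (a list of
    BoxAnnotation sharing a track_id). If every box falls below the
    threshold, the whole target is ignored; otherwise the trajectory
    is kept complete for annotation, but its too-small boxes are still
    excluded when evaluating tracking performance."""
    if all(max(b.w, b.h) < min_side for b in track):
        return []  # longest side below threshold everywhere: ignore target
    return [b for b in track if max(b.w, b.h) >= min_side]
```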

 

Seq Pedestrian Riding Car Van Bus Tricycle Truck
02 2.7k (2.36%) 4.0k (3.51%) 100.9k (88.16%) 4.4k (3.93%) 1.7k (1.50%) 0.0k (0.02%) 0.5k (0.50%)
09 5.5k (6.79%) 11.6k (14.28%) 52.1k (64.16%) 1.9k (2.35%) 8.6k (10.65%) 1.1k (1.36%) 0.3k (0.42%)
11 7.1k (8.49%) 34.8k (41.31%) 29.6k (35.18%) 0.0k (0.00%) 6.0k (7.17%) 2.2k (2.66%) 4.3k (5.20%)
18 59.7k (57.47%) 9.7k (9.35%) 24.4k (23.51%) 0.0k (0.07%) 0.7k (0.69%) 8.6k (8.32%) 0.5k (0.58%)
24 140.6k (93.43%) 9.8k (6.57%) 0.0k (0.00%) 0.0k (0.00%) 0.0k (0.00%) 0.0k (0.00%) 0.0k (0.00%)
32 251.1k (81.98%) 1.2k (0.39%) 54.0k (17.63%) 0.0k (0.00%) 0.0k (0.00%) 0.0k (0.00%) 0.0k (0.00%)
Overall 1665.1k (44.30%) 405.0k (10.78%) 1496.3k (39.81%) 51.1k (1.36%) 81.8k (2.18%) 38.0k (1.01%) 21.2k (0.57%)

 

Table 3: Target counts per class in sample sequences. The numbers in parentheses are the percentage of each class within the sequence.

3.3 Data Statistics

Ultimately, we annotated 32 sequences in total. The total number of annotated frames is over 211k, resulting in about 3.76 million bounding boxes and 7,173 unique trajectories, far more than the annotations in MOT Challenge 2017 (300,373 bounding boxes of 1,331 trajectories in 11,235 frames). The average density of the sequences, i.e., the average number of annotated targets per frame, is almost 18, so the scenes in our dataset are extremely congested and challenging. The sequences are divided into two subsets, a training split and a test split, and the statistics of the annotated sequences are listed in Tab. 2.
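As a quick arithmetic check, the reported densities are simply the total box counts divided by the total frame counts from Tab. 2:

```python
# Density = annotated boxes / annotated frames, using the Tab. 2 totals.
splits = {"train": (1779866, 104305), "test": (1978955, 107401)}
for name, (boxes, frames) in splits.items():
    print(f"{name}: {boxes / frames:.2f} targets per frame")
# train: 17.06 targets per frame  (Tab. 2 reports 17.0636)
# test: 18.43 targets per frame   (Tab. 2 reports 18.4258)
```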

(a) Height of pedestrians
(b) Width of cars
Figure 3: Histogram of target size in pixels

All the targets are grouped into the 7 classes defined in Sec. 3.2, each with a unique class ID. Pedestrian and car are the most frequent classes in our dataset, together accounting for about 3.16 million bounding boxes versus roughly 0.6 million for the other five classes combined (Tab. 3). Furthermore, the diversity among different scenarios is large: general road scenes contain mostly cars (e.g., Seq. 01), while the square contains nearly only pedestrians (e.g., Seq. 25). Owing to limited space, Tab. 3 lists the class statistics of only some sample sequences.

Most targets in the scenes are relatively small because the surveillance cameras are placed far from the targets to obtain a wide field of view, and small targets account for a large portion of all annotated targets (not counting the ignored too-small ones). After excluding the too-small targets, histograms of pedestrian height and car width are shown in Fig. 3.

Scale variation is one of the major deformations of targets in surveillance videos. The target size is measured as s = √(h × w), where h and w denote the height and width of the bounding box, respectively, and the scale change of a target is the ratio between the maximum and minimum size over its entire trajectory. Scale change occurs frequently in our dataset; for example, the bounding box grows greatly as a target approaches the camera from a distance. The distribution of scale change is grouped into three intervals: small, large, and huge. Most of the targets (trajectories) in our dataset undergo at least a large scale change, which makes our dataset very challenging.
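A short sketch of this measure under the reconstruction above, reusing the hypothetical BoxAnnotation record; the interval bounds separating small, large, and huge changes are illustrative assumptions, since the paper's values are not reproduced here:

```python
import math

# Illustrative bounds for the small/large/huge intervals (assumptions,
# not the paper's actual thresholds).
LARGE, HUGE = 2.0, 4.0

def scale_change(track):
    """Ratio of the largest to the smallest size s = sqrt(h * w)
    over an entire trajectory (a list of BoxAnnotation)."""
    sizes = [math.sqrt(b.h * b.w) for b in track]
    return max(sizes) / min(sizes)

def scale_bucket(track):
    """Assign a trajectory's scale change to small, large, or huge."""
    r = scale_change(track)
    return "huge" if r >= HUGE else "large" if r >= LARGE else "small"
```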

Occlusion is also very frequent in these scenes, which makes object tracking difficult. Nearly all trajectories in our dataset are partly occluded at some point, and a number of trajectories contain bounding boxes that are completely occluded.

The trajectory length is the time from a target's appearance to its disappearance and depends on the target's speed and route; the average trajectory length in our dataset is 20.94 seconds. Generally, the trajectory of a fast target, e.g., a car, is shorter than that of a slower one, e.g., a pedestrian. Fig. 4 shows histograms of the trajectory lengths of pedestrians and cars.

(a) Pedestrian
(b) Car
Figure 4: Histogram of trajectory length

4 Applications and Evaluations

4.1 Detection

 

Methods Pedestrian Riding Car Van Bus Tricycle Truck Average
SSD [5] 0.6761 0.6768 0.5390 0.8772 0.6333 0.3604 0.8967 0.6656
Faster RCNN [6] 0.6696 0.6671 0.5404 0.8902 0.6260 0.3572 0.8696 0.6600

 

Table 4: Detection results (mAP) of Faster RCNN and SSD.

For each of the seven classes defined in Sec. 3.2, the goal of object detection is to predict the bounding boxes of each target of that class in a test image (if any), each with an associated real-valued confidence.

All images for object detection in our dataset are sampled from the sequences at one frame per second, disregarding the too-small and completely-occluded targets. The images are divided into train and test splits following Tab. 2, and completely-occluded bounding boxes are ignored in both training and testing of the detection methods. Faster RCNN [6] and the Single Shot MultiBox Detector (SSD) [5] are among the most typical object detection algorithms and achieve very good performance on the PASCAL VOC [13], COCO [12], and ILSVRC [15] datasets. In our experiments, Faster RCNN and SSD are evaluated for object detection in congested urban environments.

The evaluation of the detectors in our experiments follows PASCAL VOC 2007 [13], and the results are listed in Tab. 4. Among all the target classes, we select pedestrian and car for analysis of the Precision/Recall curves, as shown in Fig. 5, with the underlying AP computation sketched below.

(a) Pedestrian Detection
(b) Car Detection
Figure 5: Performance of detectors on pedestrians and cars.
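For reference, a minimal sketch of the 11-point interpolated average precision used by the PASCAL VOC 2007 protocol, assuming per-class precision and recall arrays over ranked detections have already been computed:

```python
import numpy as np

def voc2007_ap(recall, precision):
    """11-point interpolated AP (PASCAL VOC 2007): average the maximum
    precision attained at recall >= t for t in {0.0, 0.1, ..., 1.0}."""
    recall, precision = np.asarray(recall), np.asarray(precision)
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recall >= t
        ap += precision[mask].max() / 11.0 if mask.any() else 0.0
    return ap
```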

4.2 Multiple Object Tracking

For multiple-object tracking, all sequences are divided into training and test splits as shown in Tab. 2; the train sequence and the test sequence in the same row are captured from the same scene. We run multiple-object tracking experiments using the detection results of SSD, chosen for its good balance of accuracy and speed. Evaluation metrics for multiple-object tracking should both summarize performance in a single number to enable direct comparison and provide several performance estimates that expose the individual errors made by algorithms. Following a recent trend [9, 10], we employ two sets of measures as evaluation metrics: the CLEAR metrics [16] and a set of track quality measures [17], e.g., MOTA and MOTP, with the core accuracy score sketched below.
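The MOTA score referenced above folds the three main error sources into one number; a minimal sketch of its definition, assuming the per-sequence error counts (as reported in Tab. 5) are already available:

```python
def mota(false_negatives, false_positives, id_switches, num_gt_boxes):
    """CLEAR MOT accuracy [16]: one minus the sum of misses, false
    positives, and identity switches, normalized by the number of
    ground-truth boxes. 1.0 is a perfect score, and the value can go
    negative when the errors outnumber the ground-truth boxes."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt_boxes
```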

 

Methods MOTA MOTP FAR MT(%) ML(%) FP FN IDsw rel. ID FM rel. FM
SORT [18] 37.38 83.01 0.47 11.04 28.54 50244 941580 18079 434.38 30650 736.42
IOU [19] 40.94 82.27 0.61 14.78 21.13 65344 847000 40241 847.46 40940 862.19
TC_ODAL [3] 41.05 78.32 1.59 16.25 29.06 671244 244335 14607 172.86 24662 291.86
DP_NMS [20] 20.31 83.95 3.06 19.68 38.20 328783 923420 5335 128.55 6184 149.01

 

Table 5: Quantitative results for multiple-object tracking.

In our experiments, we use several multiple-object tracking algorithms (preferring real-time methods) as baselines: 1) TC_ODAL [3] proposes a tracklet confidence based on the detectability and continuity of a tracklet and formulates multi-object tracking around this confidence. 2) DP_NMS [20] introduces a greedy algorithm that sequentially instantiates tracks using shortest-path computations and allows pre-processing steps, such as non-max suppression, to be embedded within the tracking algorithm. 3) SORT [18] is a pragmatic approach to multiple-object tracking whose main focus is to associate objects efficiently for online applications. 4) The IOU tracker [19] shows that much simpler tracking algorithms can compete with more sophisticated approaches at a fraction of the computational cost (a stripped-down version is sketched below). We use the default parameters suggested by the authors; the quality measures of these baselines are listed in Tab. 5. The reported numbers may not represent the best possible performance of each method.
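To illustrate how simple such a tracker can be, here is a stripped-down sketch of the IOU tracker's core association step; the original method's detection-score and minimum-track-length filters [19] are omitted, so this is not the exact baseline evaluated above:

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def iou_track(frames, sigma_iou=0.5):
    """Greedy frame-to-frame association in the spirit of [19]: extend
    each active track with its best-overlapping unmatched detection if
    the IoU exceeds sigma_iou; otherwise the track ends. Unmatched
    detections start new tracks."""
    finished, active = [], []
    for dets in frames:                       # dets: list of (x, y, w, h)
        dets, survivors = list(dets), []
        for tr in active:
            best = max(dets, key=lambda d: iou(tr[-1], d), default=None)
            if best is not None and iou(tr[-1], best) >= sigma_iou:
                tr.append(best)               # extend the track
                dets.remove(best)
                survivors.append(tr)
            else:
                finished.append(tr)           # no match: track ends
        active = survivors + [[d] for d in dets]
    return finished + active
```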


Figure 6: Examples of different anomalies in our dataset. The top row shows the ground truth and the bottom row the detection results of HMOF [21].
Figure 7: The ROC curves on the proposed dataset.

4.3 Anomaly Detection

Due to the urgent requirements of city security, anomaly detection and localization is a vital task in video surveillance. Usually, behaviors that rarely appear in the videos are defined as anomalous [22]. Common practice is to detect anomalous behaviors at test time by modeling only the normal videos in the training split (a zero-shot setting). In daily life, anomalous behaviors suffer from vague definitions due to diverse scenes and relationships among objects, so it is challenging to measure diverse anomalous behaviors with a consistent standard.

Furthermore, we select parts of the annotated videos in our dataset for anomaly behavior detection: several sequences are used for training and the rest for testing. The anomalous behaviors in the test split include running, jumping, bicycling, motoring, etc. As evaluation metrics, following [22, 23, 21], we use the AUC (area under the curve) and the EER (equal error rate) to summarize the ROC curves, as sketched below.
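A sketch of how these two numbers can be read off a frame-level ROC curve, using scikit-learn (an implementation assumption; the paper does not name a library):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def auc_and_eer(labels, scores):
    """Frame-level AUC and equal error rate. `labels` holds binary
    ground-truth anomaly flags per frame and `scores` the predicted
    anomaly scores; the EER is the false-positive rate at the ROC
    point where FPR = 1 - TPR (false alarms equal misses)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    idx = np.argmin(np.abs(fpr - (1.0 - tpr)))
    return auc(fpr, tpr), fpr[idx]
```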

In our experiments, we use the following anomaly detection methods as baselines: 1) Zhu et al. [21] use HMOF features to distinguish anomalous behaviors in videos through a Gaussian Mixture Model, with visual tracking adopted for better detection. 2) PT-HOF [24] captures fine-grained information, and its consistent motion object (CMO) component clusters similar point trajectories in a local region for better anomaly localization. 3) Chen et al. [23] introduce a novel foreground object localization method and present SL-MHOF, an effective descriptor modeling local motion patterns, while a CNN-based model is adopted for appearance features. We use the default parameters suggested by the authors; the ROC curves of these baselines are shown in Fig. 7. The videos in our dataset are captured from real surveillance of congested crowds and are much more complicated and challenging than other anomaly detection datasets; although these methods achieve excellent performance on other datasets, they remain far from ready for real practice.

5 Conclusion

In this work, we have proposed a challenging large-scale urban surveillance video dataset, one of the largest and most realistic datasets for object tracking and behavior analysis. The dataset consists of 16 scenes captured in 7 typical urban outdoor scenarios: street, crossroads, hospital entrance, school gate, park, pedestrian mall, and public square. We carefully annotated over 200k video frames, resulting in more than 3.7 million object bounding boxes and about 7.1 thousand trajectories. The proposed dataset is highly challenging and well suited for evaluating object tracking and anomaly detection in urban environments.

During annotation, we also labeled a novel target class, group, defined as a unit of at least two pedestrians walking together (close locations with similar velocities and directions). In the future, the dataset will be extended for crowd analysis using this annotated information. The dataset will also expand considerably, since the data annotated so far covers only a small part of the collected data.

Acknowledgment.

This work is supported by the National Natural Science Foundation of China (Grant No. 61371192), the Key Laboratory Foundation of the Chinese Academy of Sciences (CXJJ-17S044) and the Fundamental Research Funds for the Central Universities (WK2100330002, WK3480000005).

References

  • [1] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: A benchmark,” in CVPR, 2009.
  • [2] M. Wang and X. Wang, “Automatic adaptation of a generic pedestrian detector to a specific traffic scene,” in CVPR, 2011.
  • [3] S. H. Bae and K. J. Yoon, “Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning,” in CVPR, 2014.
  • [4] J. H. Yoon, C. R. Lee, M. H. Yang, and K. J. Yoon, “Online multi-object tracking via structural constraint event aggregation,” in CVPR, 2016.
  • [5] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in ECCV, 2016.
  • [6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” TPAMI, 2016.
  • [7] A. Ess, B. Leibe, and L. Van Gool, “Depth and appearance for mobile scene analysis,” in ICCV, 2007.
  • [8] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in CVPR, 2012.
  • [9] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, “MOTChallenge 2015: Towards a benchmark for multi-target tracking,” arXiv preprint arXiv:1504.01942, 2015.
  • [10] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “MOT16: A benchmark for multi-object tracking,” arXiv preprint arXiv:1603.00831, 2016.
  • [11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” in IJCV, 2010.
  • [12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014.
  • [13] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” IJCV, 2015.
  • [14] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” TPAMI, 2012.
  • [15] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei, “Ilsvrc-2012, 2012,” http://www.image-net.org/challenges/LSVRC, 2012.
  • [16] R. Stiefelhagen, K. Bernardin, R. Bowers, J. Garofolo, D. Mostefa, and P. Soundararajan, “The clear 2006 evaluation,” in 2006 International Evaluation Workshop on Classification of Events, Activities and Relationships, 2006.
  • [17] B. Wu and R. Nevatia, “Tracking of multiple, partially occluded humans based on static body part detection,” in CVPR, 2006.
  • [18] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in ICIP, 2016.
  • [19] E. Bochinski, V. Eiselein, and T. Sikora, “High-speed tracking-by-detection without using image information,” in Advanced Video and Signal Based Surveillance (AVSS), 2017.
  • [20] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes, “Globally-optimal greedy algorithms for tracking a variable number of objects,” in CVPR, 2011.
  • [21] H. Zhu, B. Liu, G. Yin, Y. Lu, W. Li, and N. Yu, “Real-time anomaly detection with HMOF feature,” arXiv preprint arXiv:1812.04980, 2018.
  • [22] Y. Cong, J. Yuan, and J. Liu, “Sparse reconstruction cost for abnormal event detection,” in CVPR, 2011.
  • [23] Z. Chen, W. Li, C. Fei, B. Liu, and N. Yu, “Robust anomaly detection via fusion of appearance and motion features,” in VCIP, 2018.
  • [24] K. Zhao, B. Liu, W. Li, N. Yu, and Z. Liu, “Anomaly detection and localization: A novel two-phase framework based on trajectory-level characteristics,” in ICMEW, 2018.