With the rapid development of digital acquisition and storage technologies, video surveillance has become one of the most important safety monitoring methods and is widely used around the world. As a very active research field in computer vision, research on surveillance video aims to effectively analyze and extract information from the large amount of unstructured video data acquired by surveillance cameras: to automatically detect, track and identify targets, analyze their behaviors, understand the events occurring in the scene, and raise alarms for suspicious events, thereby providing technical support for public security.
Among the research topics in surveillance video analysis and scene recognition, multiple-object tracking and behavior analysis is one of the major fields. Since the concepts of intelligent transportation and the smart city were proposed, more and more researchers have begun to focus on object tracking and behavior analysis in surveillance videos [1, 2]. However, the explosive growth in the number of vehicles and in population has resulted in more congested and complicated urban environments, which brings many new challenges to research on surveillance video.
As many object tracking and anomaly behavior analysis algorithms have been proposed to deal with congested and complicated scenarios [3, 4, 5, 6], corresponding public and challenging datasets are required to provide a fair comparison. However, only a few real-world urban surveillance video datasets serve the purpose of evaluating the performance and robustness of object tracking and behavior analysis algorithms. Moreover, most of the surveillance video datasets used in previous works are relatively small and simple, which makes them less suitable for assessing performance in real-world applications with increasingly congested and complex scenarios.
Table 1: Overview of existing urban datasets for object detection and tracking, compared with our dataset. Density is the average number of annotated boxes per frame.
|Dataset||Sequences||Frames||Trajectories||Boxes||Density||Camera||Task|
|MIT Traffic||20||520||-||2054||3.95||static||Pedestrian detection|
|Caltech Pedestrian||11||250000||-||350000||1.40||dynamic||Pedestrian detection|
|Daimler Pedestrian *||1||21790||-||56492||2.59||dynamic||Pedestrian detection|
|KITTI Tracking **||21||7924||917||8896||1.12||dynamic||Multiple object tracking|
|MOT Challenge 2015 2D||22||11286||1221||101345||8.98||diverse||Multiple object tracking|
|MOT Challenge 2016||14||11235||1342||292733||26.06||diverse||Multiple object tracking|
|MOT Challenge 2017 ***||14||11235||1331||300373||26.73||diverse||Multiple object tracking|
|Our dataset||32||211706||7173||3758821||17.75||static||Object detection and tracking|
* The statistics of the Daimler Pedestrian detection dataset include only the test split.
** The statistics of the KITTI Tracking dataset include only the training split, and the boxes include the bounding boxes of DontCare labels.
*** The sequences in the MOT17 Challenge are the same as the MOT16 sequences, with a new, more accurate ground truth.
In this work, we propose a large scale urban surveillance video dataset (USVD) with congested and complex scenarios for multiple-object tracking and anomaly behavior analysis. To the best of our knowledge, it is to date the largest and most realistic public dataset for real video surveillance. Our dataset has four main advantages over the existing datasets.
Realistic. All the data come from real public surveillance scenes, which enables the evaluation of computer vision algorithms in direct real-world application.
Complex. The dataset comprises typical scenarios with different congested scenes. These congested scenes feature frequent occlusion, deformation, various viewpoints and diverse targets.
Large Scale. The dataset consists of over 200k annotated frames and more than 3.7 million bounding boxes of about 7 thousand unique trajectories (see Tab. 1).
Well-Annotated. All the bounding boxes are manually annotated and checked. The annotation includes location, size, object category, occlusion, and trajectory identity.
We also use the proposed dataset to evaluate the performance of typical algorithms for multiple-object tracking and anomaly detection, and to explore the robustness of these methods in congested urban conditions.
2 Related Works
In recent years, the computer vision community has created benchmarks for video-related tasks such as scene recognition, pedestrian and object detection, object tracking, action recognition, and anomaly behavior detection. Despite the potential pitfalls of such datasets, they have proved extremely helpful in advancing the state of the art in the corresponding areas [11, 9, 12, 13]. An overview of existing datasets in urban environments for object detection and tracking is shown in Tab. 1.
Real Urban Video Datasets. The MIT Traffic dataset is an example of recent efforts to build more realistic urban traffic surveillance video datasets for research on pedestrian detection and activity analysis. It includes a 90-minute traffic video sequence recorded by a stationary camera, and the whole sequence is divided into 20 clips. The size of the scene is 720 by 480. To evaluate the performance of human detection on this dataset, the ground truth of the pedestrians in some sampled frames is manually labeled. In total, the dataset contains 520 annotated frames and 2054 bounding boxes.
The Caltech Pedestrian dataset [1, 14] consists of approximately 10 hours of 30 Hz video taken from a vehicle driving through regular traffic in an urban environment. The data is roughly divided in half, with 6 sessions set aside for training and 5 for testing. About 250,000 frames with a total of 350,000 bounding boxes and 2,300 unique pedestrians are annotated.
The Daimler Monocular Pedestrian Detection dataset is another dataset for pedestrian detection in urban environments. The training set contains 15,560 pedestrian samples (image cut-outs at 48×96 resolution) and 6744 additional full images without pedestrians for extracting negative samples. The test set contains an independent sequence with more than 21,790 images and 56,492 pedestrian labels (fully visible or partially occluded), captured from a vehicle during a 27-minute drive through urban traffic.
Although they help to some extent in evaluating pedestrian detection algorithms on real surveillance video, the three datasets mentioned above include only some uncrowded urban scenes and a few annotated bounding boxes, which makes them relatively simple compared with today's complex and congested traffic conditions.
MOT Datasets. Recently, the KITTI benchmark [8] was introduced for challenges in autonomous driving, including stereo, flow, odometry, road and lane estimation, and object detection as well as tracking. Some of the sequences include crowded pedestrian crossings, which makes the dataset quite challenging, but the camera is moving, so the videos differ greatly from conventional traffic surveillance video.
MOT Challenge 2015 [9], 2016 [10] and 2017 are recent challenging benchmarks for multiple-object tracking. The videos in the benchmarks are diverse, and some of them are selected from existing datasets, e.g., the KITTI Tracking dataset. However, the benchmarks mix various video types, including surveillance videos and moving-camera videos. This motivates us to establish a public, challenging urban surveillance video dataset that is more realistic for evaluating the performance of various algorithms for object tracking and behavior analysis.
3 Large Scale Urban Surveillance Video Dataset
An example sequence in our proposed dataset is shown in Fig. 2, in which the trajectory distribution includes only the trajectory locations in the adjacent frames, and bounding boxes drawn with dotted lines denote completely occluded targets. The targets of interest in urban environments are movable individuals or units, e.g., pedestrians, cars or vans, rather than stationary objects, e.g., trees, pillars or traffic lights. In this section, we introduce our large scale urban surveillance video dataset (USVD) in detail.
|Train sequences||Test sequences|
3.1 Data Collection
The videos in the dataset are captured from surveillance cameras distributed in public places. We collected over 5 terabytes of video in total; each camera recorded from 7:00 to 19:00 for 6 days, i.e., 12 hours per day and 72 hours per camera. From this footage we selected 16 representative and challenging scenes based on factors such as density, diversity, occlusion, deformation, viewpoint, and motion (the factor details are given in the supplementary material). The selected scenes are shown in Fig. 1, and the ID numbers in the corners indicate the scenarios in turn: street, crossroads, hospital entrance, school gate, park, pedestrian mall and public square.
To evaluate the performance of object tracking and behavior analysis algorithms, we defined a clear protocol that was followed throughout the annotation of the entire dataset to guarantee consistency. The annotation rules (only part of them is shown here owing to limited space; the full rules are given in the supplementary material) cover the following aspects:
Class. We annotated 7 classes of targets in urban scenarios: pedestrian, riding, car, van, bus, tricycle, and truck. The classes are defined as follows:
1) pedestrian: a single pedestrian, including a walking, skating or sitting person, but not a person on a vehicle, bicycle, motorcycle or scooter. Note that pictures on advertising boards are regarded as background.
2) riding: a two-wheel vehicle with people on it, e.g., a bicycle, motorcycle or scooter. The annotated bounding box covers the extent of both the vehicle and the person. A riding person is considered one moving individual rather than being divided into vehicle and person. Although such a person looks almost the same as a pedestrian from the waist up, a person on a vehicle is considered part of the vehicle and is not annotated as a pedestrian, which differs from the annotation in [9, 10]. Parked cycles without people are ignored and regarded as background.
3) car: a four-wheel vehicle for transporting several persons, such as a hatchback, sedan, SUV, MPV, taxi, jeep or convertible.
4) van: a four-wheel medium-size passenger vehicle or van used to transport a small amount of cargo or for engineering operations, such as an ambulance or van.
5) bus: a four-wheel vehicle for carrying a large number of persons, bigger than a van, such as a bus, mini-bus or coach.
6) tricycle: a three-wheel vehicle for cargo or passenger transport; the bounding box covers the extent of both the vehicle and the person or cargo.
7) truck: a four-wheel vehicle for cargo transport, such as a pickup, garbage truck, lorry, fire engine or trailer.
Minimal size. Targets that are too small are ignored so that annotation accuracy is not sacrificed in congested and complex scenes. A size threshold, in pixels, is defined on the longest side of the bounding box. If the longest side of a target's bounding box stays below this threshold over the whole trajectory, the target is ignored. However, for targets whose longest side falls below the threshold only in part of the trajectory, we still annotate the too-small boxes in order to keep the trajectory complete. All too-small boxes in a trajectory are ignored when evaluating the performance of multiple-object tracking methods.
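The minimal-size rule can be illustrated with a short sketch. This is only an illustration under assumptions: the `Box` type, the function names, and the `min_side` parameter are hypothetical, and the threshold value used in the usage note is arbitrary rather than the one used for the dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """Hypothetical per-frame annotation; only the fields needed here."""
    frame: int
    width: float
    height: float


def longest_side(box: Box) -> float:
    """Length in pixels of the longest side of a bounding box."""
    return max(box.width, box.height)


def apply_min_size_rule(trajectory: List[Box], min_side: float) -> Tuple[bool, List[bool]]:
    """Apply the minimal-size rule to one trajectory.

    Returns (keep, too_small_flags):
      keep            -- False if every box is below the threshold, i.e. the
                         whole target is ignored.
      too_small_flags -- per-box flags; flagged boxes stay in the annotation
                         but would be skipped during tracking evaluation.
    `min_side` is an assumed parameter, not the paper's actual threshold.
    """
    flags = [longest_side(b) < min_side for b in trajectory]
    keep = not all(flags)
    return keep, flags
```

For example, with an arbitrary threshold of 25 pixels, a trajectory whose boxes measure 10×20 and then 30×60 is kept, with only its first box flagged as too small.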
Table 3: Number of bounding boxes per class (with percentage of the sequence total) for sample sequences.
|Seq.||pedestrian||riding||car||van||bus||tricycle||truck|
|02||2.7k (2.36%)||4.0k (3.51%)||100.9k (88.16%)||4.4k (3.93%)||1.7k (1.50%)||0.0k (0.02%)||0.5k (0.50%)|
|09||5.5k (6.79%)||11.6k (14.28%)||52.1k (64.16%)||1.9k (2.35%)||8.6k (10.65%)||1.1k (1.36%)||0.3k (0.42%)|
|11||7.1k (8.49%)||34.8k (41.31%)||29.6k (35.18%)||0.0k (0.00%)||6.0k (7.17%)||2.2k (2.66%)||4.3k (5.20%)|
|18||59.7k (57.47%)||9.7k (9.35%)||24.4k (23.51%)||0.0k (0.07%)||0.7k (0.69%)||8.6k (8.32%)||0.5k (0.58%)|
|24||140.6k (93.43%)||9.8k (6.57%)||0.0k (0.00%)||0.0k (0.00%)||0.0k (0.00%)||0.0k (0.00%)||0.0k (0.00%)|
|32||251.1k (81.98%)||1.2k (0.39%)||54.0k (17.63%)||0.0k (0.00%)||0.0k (0.00%)||0.0k (0.00%)||0.0k (0.00%)|
|Overall||1665.1k (44.30%)||405.0k (10.78%)||1496.3k (39.81%)||51.1k (1.36%)||81.8k (2.18%)||38.0k (1.01%)||21.2k (0.57%)|
3.3 Data Statistics
Ultimately, we have annotated 32 sequences in total. The total number of annotated frames is over 200k, resulting in more than 3.7 million bounding boxes and about 7.2k unique trajectories, much more than the annotation in MOT Challenge 2017 (300,373 bounding boxes of 1,331 trajectories in 11,235 frames; see Tab. 1). The average density of the sequences, i.e., the average number of annotated targets per frame, is almost 18. The scenes in our dataset are therefore extremely congested and challenging. The sequences are divided into two subsets, a training split and a test split, and the statistics of the annotated sequences are listed in Tab. 2.
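As a quick check of the density figure, the average number of annotated targets per frame can be recomputed from the box and frame counts in Tab. 1. The function name below is our own; only the counts come from the table.

```python
def average_density(num_boxes: int, num_frames: int) -> float:
    """Average number of annotated boxes per annotated frame."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return num_boxes / num_frames


# (boxes, frames) taken from Tab. 1
counts = {
    "MIT Traffic": (2054, 520),
    "MOT Challenge 2016": (292733, 11235),
    "Our dataset": (3758821, 211706),
}

for name, (boxes, frames) in counts.items():
    print(f"{name}: {average_density(boxes, frames):.2f} boxes per frame")
```

The resulting values agree with the density column of Tab. 1 to two decimal places for these three rows.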
All the targets are grouped into the 7 classes defined in Sec. 3.2, each with a unique class ID. Pedestrian and car are the most frequent classes in our dataset, with about 3.16 million bounding boxes together versus about 0.60 million boxes for all other classes (Tab. 3). Furthermore, the diversity among scenarios is relatively large. For example, general roads contain mostly cars (e.g., Seq.01), while the square contains almost only pedestrians (e.g., Seq.25). Owing to limited space, Tab. 3 lists the class statistics of only some sample sequences.
Most of the targets in the scenes are relatively small because the surveillance cameras are placed far from the targets to obtain a wide field of view. Small targets are identified by the length in pixels of the longest side of the bounding box, and their proportion is measured over all targets excluding the too-small ones. After ignoring the too-small targets, histograms of the pedestrian height and the car width are shown in Fig. 3.
Scale variation is one of the major deformations of targets in surveillance videos. The measure of scale change is defined in terms of the height and width of the bounding box, where the maximum and minimum sizes of the target are calculated over the entire trajectory. Scale change occurs frequently in our dataset. For example, the size of the bounding box changes greatly when the target approaches the camera from a distance. The distribution of scale change (in pixels, as defined above) is grouped into three intervals: small, large, and huge. Most of the targets (trajectories) in our dataset undergo at least a large scale change, which makes our dataset very challenging.
Occlusion is also very frequent in the scenes, which makes object tracking very difficult. Nearly all the trajectories in our dataset are partially occluded at some point, and some of them contain completely occluded bounding boxes.
The trajectory length is the time from the appearance to the disappearance of the target and depends on the speed and route of the target. The average trajectory length in our dataset is 20.94 seconds. Generally, the trajectory of a fast target, e.g., a car, is shorter than that of a slower one, e.g., a pedestrian. Fig. 4 shows the histograms of trajectory length for pedestrians and cars.
4 Applications and Evaluations
|Faster RCNN ||0.6696||0.6671||0.5404||0.8902||0.6260||0.3572||0.8696||0.6600|
For each of the seven classes defined in Sec. 3.2, the goal of object detection is to predict the bounding boxes of each target of that class in a test image (if any), with an associated real-valued confidence.
All images for object detection in our dataset are sampled from the sequences at one frame per second, and the too-small and completely occluded targets are excluded. All images are divided into a training split and a test split following Tab. 2. The completely occluded bounding boxes are ignored in both training and testing of the detection methods. Faster RCNN [6] and Single Shot MultiBox Detector (SSD) [5] are among the most typical algorithms for object detection and achieve very good performance on the PASCAL VOC [11], COCO [12], and ILSVRC datasets. In our experiments, Faster RCNN and SSD are evaluated on object detection in congested urban environments.
The evaluation of the detectors in our experiments follows PASCAL VOC 2007, and the results are listed in Tab. 4. Among all the target classes, we select pedestrian and car for analysis of the Precision/Recall curves, as shown in Fig. 5.
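For readers unfamiliar with the PASCAL VOC 2007 protocol, the per-class score is the 11-point interpolated average precision. The following is a minimal sketch of that computation, not the evaluation code used in our experiments. It assumes that detections have already been matched to ground truth (conventionally with an IoU threshold of 0.5), so each detection arrives with a true-positive flag; the function and argument names are our own.

```python
from typing import Sequence


def voc07_average_precision(scores: Sequence[float],
                            is_true_positive: Sequence[bool],
                            num_ground_truth: int) -> float:
    """11-point interpolated AP for one class (PASCAL VOC 2007 style).

    scores           -- detection confidences
    is_true_positive -- matching result per detection (assumed precomputed)
    num_ground_truth -- number of ground-truth boxes of this class
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")

    # Rank detections by decreasing confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))

    # Average the maximum precision at recall >= t for t = 0.0, 0.1, ..., 1.0.
    ap = 0.0
    for k in range(11):
        t = k / 10
        best = max((p for r, p in zip(recalls, precisions) if r >= t), default=0.0)
        ap += best / 11
    return ap
```

The mean of the per-class values over the seven classes then gives the mAP.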
4.2 Multiple Object Tracking
For multiple-object tracking, all the sequences are divided into a training split and a test split as shown in Tab. 2. The train sequence and the test sequence in the same row are captured from the same scene. We conduct the multiple-object tracking experiments using the detection results of SSD because of its good accuracy and speed. Evaluation metrics for multiple-object tracking should summarize performance in a single number to enable direct comparison, and should also provide several performance estimates that expose the individual errors made by the algorithms. Following a recent trend, we employ two sets of measures as evaluation metrics for multiple-object tracking: the CLEAR metrics [16], e.g., MOTA and MOTP, and a set of track quality measures [17].
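As a reminder of how the headline CLEAR number is formed, MOTA combines false negatives (FN), false positives (FP) and identity switches (IDsw) relative to the total number of ground-truth boxes, and FAR is the number of false alarms per frame. The sketch below uses these standard definitions; the function names and the counts in the usage note are illustrative and are not results from Tab. 5.

```python
def mota(fn: int, fp: int, idsw: int, num_gt_boxes: int) -> float:
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDsw) / GT.

    Can be negative when the number of errors exceeds the number of
    ground-truth boxes.
    """
    if num_gt_boxes <= 0:
        raise ValueError("num_gt_boxes must be positive")
    return 1.0 - (fn + fp + idsw) / num_gt_boxes


def false_alarm_rate(fp: int, num_frames: int) -> float:
    """FAR: average number of false positives per frame."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return fp / num_frames
```

For instance, with made-up counts of 10 misses, 5 false positives and 5 identity switches over 100 ground-truth boxes, `mota(10, 5, 5, 100)` evaluates to about 0.8.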
|Methods||MOTA||MOTP||FAR||MT(%)||ML(%)||FP||FN||IDsw||rel. ID||FM||rel. FM|
In our experiments, we use several multiple-object tracking algorithms (preferring real-time methods) as baselines: 1) TC_ODAL [3] proposed a tracklet confidence based on the detectability and continuity of a tracklet, and formulated the multi-object tracking problem on top of this confidence. 2) DP_NMS [20] introduced a greedy algorithm that sequentially instantiates tracks using shortest-path computations and allows pre-processing steps, such as non-max suppression, to be embedded within the tracking algorithm. 3) SORT [18] was introduced as a pragmatic approach to multiple-object tracking whose main focus is to associate objects efficiently for online applications. 4) IOU [19] builds on the observation that increasingly reliable detectors enable much simpler tracking algorithms, which can compete with more sophisticated approaches at a fraction of the computational cost. We use the default parameters suggested by the authors, and the quality measures of these baseline methods are listed in Tab. 5. The reported numbers may not represent the best possible performance of each method.
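To give a flavor of how lightweight such detection-only association can be, the following sketch matches each active track to the unused detection with the highest intersection-over-union (IoU) and accepts the match only above a threshold. It is a simplified illustration in the spirit of IoU-based tracking, not the authors' implementation of any baseline; the box format, function names and the default `sigma_iou` value are assumptions.

```python
from typing import List, Optional, Tuple

BoxXYXY = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: BoxXYXY, b: BoxXYXY) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(track_boxes: List[BoxXYXY],
              detections: List[BoxXYXY],
              sigma_iou: float = 0.5) -> List[Optional[int]]:
    """Greedy IoU association for one frame.

    track_boxes -- last known box of each active track
    detections  -- detector output for the current frame
    Returns, per track, the index of the matched detection or None.
    """
    used = set()
    matches: List[Optional[int]] = []
    for box in track_boxes:
        candidates = [j for j in range(len(detections)) if j not in used]
        best = max(candidates, key=lambda j: iou(box, detections[j]), default=None)
        if best is not None and iou(box, detections[best]) >= sigma_iou:
            matches.append(best)
            used.add(best)
        else:
            matches.append(None)  # track has no continuation in this frame
    return matches
```

Tracks that receive `None` would be terminated or kept alive depending on the tracker, and unmatched detections would start new tracks; both steps are omitted here.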
4.3 Anomaly Detection
Owing to the urgent requirements of city security, anomaly detection and localization is a vital task in video surveillance. Usually, behaviors that rarely appear in the videos are defined as anomaly behaviors. It is common practice to model the normal videos in the training split and then detect anomaly behaviors at evaluation time (zero-shot learning). In daily life, anomaly behaviors are only vaguely defined because of the diverse scenes and the relationships among objects, so it is challenging to measure diverse anomaly behaviors with a consistent standard.
Furthermore, we select parts of the annotated videos in our dataset for anomaly behavior detection and divide them into training sequences and test sequences. The anomaly behaviors in the test split include running, jumping, bicycling, motoring, etc. As evaluation metrics, following [22, 23, 21], we use the AUC (area under the curve) and the EER (equal error rate) of the ROC curves.
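The two summary numbers can be computed from any set of anomaly scores and binary labels. The sketch below is a minimal illustration, assuming one score and one label per frame (label 1 for anomalous); the function names are our own, and the EER is approximated by the ROC point at which the false positive rate and the miss rate are closest.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """ROC curve as (FPR, TPR) points, sweeping the threshold from high to low."""
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    if num_pos == 0 or num_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i]:
            tp += 1
        else:
            fp += 1
        # Emit a point only after a group of tied scores has been consumed.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            points.append((fp / num_neg, tp / num_pos))
    return points


def area_under_curve(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


def equal_error_rate(points: List[Tuple[float, float]]) -> float:
    """Approximate EER: mean of FPR and FNR where the two are closest."""
    fpr, tpr = min(points, key=lambda p: abs(p[0] - (1.0 - p[1])))
    return (fpr + (1.0 - tpr)) / 2
```

A higher AUC and a lower EER indicate better separation between normal and anomalous frames.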
In our experiments, we use several anomaly detection methods as baselines: 1) Zhu et al. [21] used HMOF features to distinguish anomaly behaviors in the videos through a Gaussian Mixture Model, with visual tracking adopted for better detection. 2) PT-HOF features were utilized to capture fine-grained information, and the consistency motion object (CMO) clusters similar point trajectories in a local region for better anomaly localization. 3) Chen et al. [23] introduced a novel foreground object localization method and presented SL-MHOF, an effective descriptor modeling the local motion pattern, while a CNN-based model is adopted for appearance features. We use the default parameters suggested by the authors, and the ROC curves of these baseline methods are shown in Fig. 7. The videos in our dataset are captured from real surveillance of congested crowds and are much more complicated and challenging than those in other anomaly detection datasets. Therefore, these methods are still far from practical use, although they achieve excellent performance on other datasets.
In this work, we have proposed a challenging large scale urban surveillance video dataset, one of the largest and most realistic datasets for object tracking and behavior analysis. The dataset consists of 16 scenes captured in 7 typical urban outdoor scenarios: street, crossroads, hospital entrance, school gate, park, pedestrian mall, and public square. We carefully annotated over 200k video frames, resulting in more than 3.7 million object bounding boxes and about 7.2k trajectories. The proposed dataset is challenging and well suited to the evaluation of object tracking and anomaly detection in urban environments.
During annotation, we also annotated a novel target class, group, defined as one unit of at least two pedestrians walking together (close in location, with similar velocity and direction). In the future, the dataset will be extended for crowd analysis using this annotated information. In addition, the dataset will be greatly expanded, since the data annotated so far account for only a small part of the collected data.
This work is supported by the National Natural Science Foundation of China (Grant No. 61371192), the Key Laboratory Foundation of the Chinese Academy of Sciences (CXJJ-17S044) and the Fundamental Research Funds for the Central Universities (WK2100330002, WK3480000005).
[1] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: A benchmark,” in CVPR, 2009.
[2] M. Wang and X. Wang, “Automatic adaptation of a generic pedestrian detector to a specific traffic scene,” in CVPR, 2011.
[3] S. H. Bae and K. J. Yoon, “Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning,” in CVPR, 2014.
[4] J. H. Yoon, C. R. Lee, M. H. Yang, and K. J. Yoon, “Online multi-object tracking via structural constraint event aggregation,” in CVPR, 2016.
[5] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in ECCV, 2016.
[6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” TPAMI, 2016.
[7] A. Ess, B. Leibe, and L. Van Gool, “Depth and appearance for mobile scene analysis,” in ICCV, 2007.
[8] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in CVPR, 2012.
[9] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, “MOTChallenge 2015: Towards a benchmark for multi-target tracking,” arXiv preprint arXiv:1504.01942, 2015.
[10] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “MOT16: A benchmark for multi-object tracking,” arXiv preprint arXiv:1603.00831, 2016.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” IJCV, 2010.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in ECCV, 2014.
[13] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” IJCV, 2015.
[14] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” TPAMI, 2012.
[15] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei, “ILSVRC-2012,” http://www.image-net.org/challenges/LSVRC, 2012.
[16] R. Stiefelhagen, K. Bernardin, R. Bowers, J. Garofolo, D. Mostefa, and P. Soundararajan, “The CLEAR 2006 evaluation,” in 2006 International Evaluation Workshop on Classification of Events, Activities and Relationships, 2006.
[17] B. Wu and R. Nevatia, “Tracking of multiple, partially occluded humans based on static body part detection,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006.
[18] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in ICIP, 2016.
[19] E. Bochinski, V. Eiselein, and T. Sikora, “High-speed tracking-by-detection without using image information,” in Advanced Video and Signal Based Surveillance (AVSS), 2017.
[20] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes, “Globally-optimal greedy algorithms for tracking a variable number of objects,” in CVPR, 2011.
[21] H. Zhu, B. Liu, G. Yin, Y. Lu, W. Li, and N. Yu, “Real-time anomaly detection with HMOF feature,” arXiv preprint arXiv:1812.04980, 2018.
[22] Y. Cong, J. Yuan, and J. Liu, “Sparse reconstruction cost for abnormal event detection,” in CVPR, 2011.
[23] Z. Chen, W. Li, C. Fei, B. Liu, and N. Yu, “Robust anomaly detection via fusion of appearance and motion features,” in VCIP, 2018.
[24] K. Zhao, B. Liu, W. Li, N. Yu, and Z. Liu, “Anomaly detection and localization: A novel two-phase framework based on trajectory-level characteristics,” in ICMEW, 2018.