The WILDTRACK Multi-Camera Person Dataset

07/28/2017 ∙ by Tatjana Chavdarova, et al. ∙ 0

People detection methods are highly sensitive to the perpetual occlusions among the targets. As multi-camera set-ups become more frequently encountered, joint exploitation of the information across views would allow for improved detection performance. We provide a large-scale HD dataset named WILDTRACK which finally makes advanced deep learning methods applicable to this problem. The seven-static-camera set-up captures realistic and challenging scenarios of walking people. Notably, its camera calibration with jointly high-precision projection widens the range of algorithms which may make use of this dataset. To help accelerate the research on automatic camera calibration, annotations for this purpose also accompany the dataset. Furthermore, the rich-in-appearance visual context of the pedestrian class makes this dataset attractive for monocular pedestrian detection as well, since the HD cameras are placed relatively close to the people, and the size of the dataset further increases seven-fold. In summary, we give an overview of existing multi-camera datasets and detection methods, enumerate details of our dataset, and benchmark state-of-the-art multi-camera detectors on this new dataset.




1 Introduction

Pedestrian detection has been an active line of research and is an essential computer vision problem. From a goal-definition point of view, it represents a sub-category of object detection. However, due to the wide variety in people’s appearance, combined with the importance of solving this task with high accuracy (take for example autonomous driving), pedestrian detection has developed into a separate branch on which vast research time has been spent. As a result, many interesting algorithms have been developed which have found even wider applications than originally intended.

Despite the remarkable recent advances, owing notably to the integration of deep learning methods, the performance of monocular detectors remains limited to applications with at most a medium level of occlusion. This is to be expected since, given a monocular observation, the underlying cause, in our case the persons to identify, is ambiguous in highly occluded scenes.

This is where multi-camera detectors come in handy. A substantial body of research over the past decade has been devoted to this topic. In general, simple averaging of the per-view predictions can only improve upon a single-view detector. More sophisticated methods use the information from all of the views jointly to yield a prediction.

In our recent work, Chavdarova and Fleuret (2017), we showed that deep learning methods outperform the existing joint methods on the multi-camera people detection problem. To achieve this, we took advantage of an existing larger monocular pedestrian detection dataset to train a monocular detection model, and later used this model to initialize sub-parts of the complete multi-view architecture. This dependency on monocular pre-training limits the architectures that can be employed. Furthermore, because only small-scale datasets exist, many peculiarities, such as the correlation of multi-stream information, may not be exploited to the fullest. On the other hand, we showed that deep learning methods greatly outperform standard methods, and that a deep-learning-based method in particular benefits from additional views in terms of accuracy, prediction confidence, robustness to diverse occlusions, and generalization.

Widely recognized, and up to this point the largest multi-camera dataset with strictly overlapping fields of view, is the PETS 2009 dataset, which at the time of publishing fitted the needs of a challenging benchmark dataset. As deep learning became widely applied in computer vision, an important drawback of this dataset emerged: it does not allow assessing an algorithm’s generalisation, as it is recorded in a so-called actor set-up. By this we mean that the same persons appear throughout the sequence. In addition, it exhibits calibration and synchronization inconsistencies, and it is not of a sufficient size.

To meet the needs of the currently best-performing methods in computer vision, we acquired the WILDTRACK dataset. Note that we do not claim to propose a new method in this paper. Instead, we provide a new person dataset, which we hope will help accelerate research progress. Moreover, our dataset comes with calibration annotations, and standard calibration patterns are also recorded in the sequences, which makes it very suitable for improving calibration algorithms. In summary:

  • we provide a large scale HD dataset, whose advantages regarding the following research topics, are:

    • Multi-View detection: overlapping fields of view while being large-scale and the first dataset with a high-precision joint camera calibration;

    • Monocular detection: high-resolution regions of interest, and being larger than the monocular pedestrian detection dataset Caltech;

    • Camera calibration: the overlapping fields of view of the cameras make it very suitable for calibration algorithms; we also provide annotations of corresponding points across views for performing bundle adjustment; by-hand measurements on the ground surface, allowing for extraction of arbitrarily many points for performing extrinsic calibration; and large chessboard patterns recorded in the videos, simultaneously visible from the views;

  • we provide experimental benchmark results on this dataset of state of the art multi-camera detection methods;

  • we give an overview of the existing methods and datasets and we discuss research directions.

Our dataset is publicly available. In addition, the source code of the annotation tool that we used, which was designed specifically for annotating multi-camera datasets, is also available.

We give an overview of multi-camera methods and datasets in § 2. Practical information, including the number of annotations, what the dataset includes, and the cameras’ layout, is elaborated in § 3. There we also give more details on how the data acquisition and the annotation were carried out. We benchmark current state-of-the-art multi-camera person detectors in § 4. Besides the evident research topics of person detection and tracking, we discuss several other applications which may make use of this dataset.

Please note that throughout the paper, we use the term jointly when referring to all the views at the same time. This applies both to methods, referring to those which use cues from all of the views at the same time to yield an estimate, rather than operating per-view and then averaging; and to calibration accuracy, meaning that a given point projects accurately to pixel coordinates in all of the views.

2 Related work

2.1 Related datasets

We give an overview of pedestrian datasets in Table 1, with a greater focus on the multi-view ones. Before discussing the traits of the datasets of interest, we clarify the listing.

Notably, the INRIA dataset deviates from the rest of the listed datasets since it is a collection of high-resolution images of pedestrians, and thus has no temporal consistency. This explains why many fields in Table 1 are not applicable to it. The Daimler-stereo dataset was extended to the given training sizes by shifting and mirroring the specified number of annotations, whereas its testing sizes are listed in number of labels obtained during min. drive. The Daimler dataset may also be considered a collection, as different sequences were used for labelling, and its original number of annotations was also enlarged by mirroring and shifting. Regarding PETS 2009, there are three different sequences within this dataset and the details in the table refer to the S2.L1 sequence of walking pedestrians. The KITTI dataset contains multiple different sequences, and the time listed in Table 1 refers to the person sequence. The most common length of the sequences of this dataset is seconds, and the maximum one is minutes, for a sequence which belongs to the road category.

Caltech-USA is the most widely used pedestrian dataset. It consists of fully annotated 30 Hz videos taken from a moving vehicle in regular traffic. As the number of annotations we specify the ones for training, obtained with fps sampling, as has lately been adopted by deep learning monocular methods, whereas the specified duration is that of the complete recording time.

For further details on monocular datasets please refer to the exhaustive list provided by Dollar et al. (2012, chap. 2.4). Since our interest is in a set-up of multiple static cameras whose fields of view overlap in large part, below we separately discuss the datasets most related to ours, which highlights the novelty of the WILDTRACK dataset.

It is fair to note that the term overlapping is ambiguous in the literature. In some cases the authors use it to indicate a particular topology of a network of cameras positioned in the same area, thus sharing targets, but not necessarily pointing towards the same center or 3D space visible to all cameras. In this paper, by overlapping we mean cameras pointed strictly towards a shared volume visible in all of their fields of view.

Please note that in the later discussion we omit the very recent DukeMTMC dataset. This dataset was published while we were in the process of obtaining the annotations of our dataset, and it is also a challenging, large-scale HD dataset of walking pedestrians. However, it differs from our motivation as it is not overlapping. In particular, only two of its eight cameras’ fields of view slightly overlap, and each of the remaining cameras covers its own designated sub-area.

Dataset  Resolution  Cameras  FPS  Mobile/Static  Overlapping  Video  Annotations  Size/Duration
INRIA  high  n/a  n/a  n/a  n/a  No  pos.
ETH  M  n/a  Yes  pos.
TUD-Brussels  -  M  n/a  No  imp.
Daimler  n/a  M  n/a  No  -
Daimler-stereo  M  n/a  Yes
Caltech-USA  M  n/a  Yes  K@fps  hours
KITTI  10  M  n/a  Yes  imp.  7 min.
APIDIS  S  Yes  Yes  IDs  min.
PETS 2009  ; ;  S  Yes  Yes  795 frames
DukeMTMC  S  No  Yes  IDs  min.
Campus  S  Yes  Yes  IDs  min.
EPFL  S  Yes  Yes  IDs  min.
SALSA  S  Yes  Yes  IDs  min.
EPFL-RLC  S  Yes  Yes  frames
WILDTRACK  S  Yes  Yes  @fps  min.

  • No color channels.

  • color and grayscale cameras.

  • Stereo camera(s).

Table 1: Commonly used datasets for pedestrian detection. K denotes thousands; IDs - identities; imp. - image pairs; the column FPS refers to the frame rate during the data acquisition; and with addition (+) we denote the pre-defined splits into training and testing partitions where applicable. For more details please refer to § 2.1.

2.1.1 WILDTRACK’s novelty

The first dataset with a camera set-up like ours is the PETS 2009 dataset, and in the discussion below we refer to its “S2.L1” sequence. It contains one additional camera which is left out to be used solely for cross validation. As several authors report (Peng et al., 2015, p. 10; Ge and Collins, 2010, p. 10; Chavdarova and Fleuret, 2017, p. 3), this dataset exhibits notable calibration inaccuracy in terms of joint consistency. In addition, there is also mis-synchronization, as the providers of this dataset remark. Importantly, the dataset is acquired in a non-realistic environment, in the sense that the same persons are walking throughout the sequence. At the time of publishing this had no effect on the benchmarking of methods, since most of the earlier methods operate per-frame and perform background subtraction preprocessing. However, it clearly introduces uncertainty in estimating the generalization of trained deep learning models, since they are able to memorize appearance cues. Nevertheless, this dataset is widely recognized and used, which indicates the need for “re-providing” a dataset of such a set-up that fits the needs of deep learning methods.

The three datasets Campus, EPFL and SALSA are multi-camera with overlapping fields of view. However, EPFL and SALSA have a very small number of different people and are relatively not crowded. Furthermore, EPFL is short and has low image quality. In SALSA, a cocktail party is filmed for minutes, where the people are static most of the time, which makes this dataset less challenging for tracking. Finally, Campus provides neither the calibration of the cameras nor the annotations of the locations of the people.

The EPFL-RLC dataset demonstrates improved joint-calibration accuracy and synchronization compared to the PETS 2009 dataset. As we shall see in the later sections, WILDTRACK’s joint calibration is more precise still. We remark that despite being a sequence of frames, the current publication of EPFL-RLC does not contain full ground-truth annotations of the entire sequence. Instead, the annotations were initially intended for classification of a given multi-view sample as being occupied or not. It contains a balanced set of multi-view examples, where conveniently for monocular classification, each negative multi-view sample is additionally annotated as containing a pedestrian or not. Currently, full ground-truth annotations are provided solely for the last frames. In addition, it is acquired with three cameras, whereas WILDTRACK uses seven; and its cameras have relatively more limited fields of view, which results in WILDTRACK having a -fold increased number of detections per frame on average.

We conclude that the novelty of this dataset is that it is the largest overlapping seven-static-camera HD dataset acquired in a realistic, non-actor environment.

2.2 Related methods

We review joint multi-camera methods, which unless otherwise stated, in their original formulation utilise background subtraction pre-processing.

Fleuret et al. (2008) are the first to propose a method which jointly uses the multi-view streams, called Probabilistic Occupancy Map (POM). Based on a crude generative model, it estimates the probabilities of occupancy through mean-field inference, naturally handling occlusions. Further, it can be combined with a convex max-cost flow optimization to leverage time consistency, Berclaz et al. (2009).

Alahi et al. (2011) re-cast the problem as a linear inverse problem, regularized by enforcing a sparsity constraint on the occupancy vector. It uses a dictionary whose atoms approximate multi-view silhouettes. To alleviate the need for iterative and thus demanding O-Lasso computations, Golbabaee et al. (2014) derive a regression model which includes solely Boolean arithmetic and sustains the sparsity assumption of Alahi et al. (2011). In addition, the iterative method is replaced with a greedy algorithm based on set covering.

Peng et al. (2015) model the occlusions explicitly per view by a separate Bayesian Network, and a multi-view network is then constructed by combining them, based on the ground locations and the geometrical constraints.

Although considering crowd analysis, Ge and Collins (2010) model the multi-view image generation with a stochastic generative process of random crowd configurations, and then use a maximum a posteriori (MAP) estimate to find the best fit with respect to the image observations.

In our recently published work, Chavdarova and Fleuret (2017), we show for the first time that deep learning methods outperform existing methods even on smaller-scale datasets. To obtain generalization, we first make use of the larger-scale monocular pedestrian detection dataset Caltech-USA. We then build an architecture which processes the multi-stream frames in parallel and jointly estimates the occupancy of the inspected position. To demonstrate generalization on test data, given the actor set-up of PETS 2009 discussed above, we manually annotated multi-view examples and tested the performance on a completely different part of the sequence, hence the reasons for providing the EPFL-RLC dataset. Given the trained monocular models which we provide, the resulting method is straightforward to apply, as it only requires re-training on a small dataset, and yields an end-to-end deep learning model. Our published summary also included implementation insights.

3 The WILDTRACK dataset

3.1 Hardware and data acquisition


The dataset was acquired using seven high-tech statically positioned cameras with overlapping fields of view. Precisely, three GoPro Hero 4 and four GoPro Hero 3 cameras were used, of which example frames are illustrated in the bottom and top row of Fig. 1, respectively.

Data acquisition.

The data acquisition took place in front of the main building of ETH Zurich, Switzerland, during nice weather conditions. The sequences are of resolution pixels, shot at 60 frames per second.

Figure 1: Synchronized corresponding frames from the seven views.


Figure 2: Top view visualisation of the amount of overlap between the cameras’ fields of view. Each cell represents a position, and the darker it is coloured the more visible it is from different cameras. See § 3.1 for details.

Camera layout.

As Fig. 1 shows, the camera layout is such that their fields of view overlap in large part. As can be noticed, the height of the positions of the cameras is above humans’ average height.

In Fig. 2 we make a top-view visualisation to illustrate the level of overlap between the seven cameras. Namely, to obtain the illustration we pre-define an area of interest, and discretize it into a regular grid of points each defining a position. For each position we sum the cameras for which it is visible. The normalized values are then displayed, where the darker the filling color of a cell is the higher the number of such cameras is. We see that in large part the fields of view between the cameras overlap. Precisely, in the illustration we considered grid. Out of the total of positions, , , , , , , , are simultaneously visible to views, respectively. On average, each position is seen from cameras.
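The counting procedure behind such an overlap map can be sketched as follows. This is a minimal illustration, assuming an ideal pinhole camera without lens distortion; the camera parameters, the grid and the function names (`project`, `visibility_counts`) are hypothetical and are not taken from the dataset or its tools.

```python
def project(point, cam):
    """Project a 3D world point to pixels with an ideal pinhole camera.

    `cam` is a hypothetical dict with rotation "R" (3x3), translation "t",
    focal lengths "fx", "fy" and principal point "cx", "cy".
    Returns None when the point lies behind the camera.
    """
    R, t = cam["R"], cam["t"]
    xc = [sum(R[i][k] * point[k] for k in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        return None
    return (cam["fx"] * xc[0] / xc[2] + cam["cx"],
            cam["fy"] * xc[1] / xc[2] + cam["cy"])


def visibility_counts(cams, xs, ys, width, height):
    """For every ground position (x, y, 0) count the cameras that see it."""
    counts = []
    for y in ys:
        row = []
        for x in xs:
            seen = 0
            for cam in cams:
                uv = project((x, y, 0.0), cam)
                if uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height:
                    seen += 1
            row.append(seen)
        counts.append(row)
    return counts


# Toy set-up (made-up values): two cameras 10 m above the ground, looking straight down.
DOWN = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
cam_a = {"R": DOWN, "t": (0.0, 0.0, 10.0), "fx": 100.0, "fy": 100.0, "cx": 50.0, "cy": 50.0}
cam_b = {"R": DOWN, "t": (-4.0, 0.0, 10.0), "fx": 100.0, "fy": 100.0, "cx": 50.0, "cy": 50.0}

counts = visibility_counts([cam_a, cam_b], xs=[-3.0, 0.0, 3.0, 6.0], ys=[0.0],
                           width=100, height=100)
average = sum(map(sum, counts)) / sum(len(row) for row in counts)
```

Normalising `counts` by the number of cameras gives the cell shading, and `average` is the mean number of cameras per position.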


The sequences were initially synchronised with an accuracy of 50 ms, which was further refined by detailed manual inspection. The synchronization precision can also be observed in Fig. 4, which illustrates cropped regions of synchronized corresponding frames.

3.2 Statistics

Annotated frames.

Currently the first frames - extracted from the videos with fps - are being annotated. The annotation was done with a frame rate of fps, or in other words of the afore specified extracted frames, we annotated every fifth. Hence, this corresponds to a total of annotated frames. For details on the file formats and on how the annotation process was carried out, please see App. A and  C, respectively.

Figure 3: Multi-view examples of our dataset. Each row represents a single positive multi-view annotation.

Multi-view annotations.

There are multi-view annotations in total. In Fig. 3 we illustrate multi-view examples of our dataset, visible in all of the seven views at the same time.

Monocular annotations.

As each annotated multi-view example is not always visible in all of the views, the number of monocular examples is , , , , , , , respectively for each of the views. This amounts to a total of monocular detections while using frame rate of fps.

3.3 Calibration of the cameras

Camera calibration refers to the estimation of the extrinsic and the intrinsic parameters of a given camera. The former provide the rigid mapping of the world coordinates into the camera’s coordinates, whereas the latter, also known as the projective transformation, define a projection model that relates the 3D scene points to the 2D image points.

In our set-up all of the seven cameras are static. Unlike the existing multi-camera datasets, our focus was on obtaining a joint camera calibration which is as accurate as possible. By this we mean obtaining the cameras’ calibration parameters so that a given point in 3D space which lies within several cameras’ fields of view is observed in each view at the 2D location where a human would expect it. This does not necessarily coincide with obtaining an accurate per-camera calibration: as a 2D point from a single camera can be ambiguously mapped into the 3D space, parameters obtained per camera are not necessarily adjusted to resolve this. We thus emphasise that our aim was joint accuracy, and in this section we explain how we performed the calibration of the cameras, which consists of three steps.

First, let us note that several camera calibration algorithms exist, each of which models these mappings differently using different parameters, and thus has different requirements for obtaining the estimates. We used the simplest, and yet in practice powerful, Pinhole camera model (Wikipedia (2017b)), because it is supported by the widely used OpenCV library, Bradski (2000), which provides easy-to-use mapping modules.

We briefly discuss our approach to obtaining the calibration of the cameras; for further implementation details please refer to App. B.

Intrinsic calibration.

The Pinhole intrinsic matrix includes the focal length and the principal point, as well as the skew coefficient; the model further includes parameters for the radial and the tangential distortion of the lens. These parameters are camera-specific, and are estimated once per camera.
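To make the role of these parameters concrete, the following sketch maps a point given in camera coordinates to pixel coordinates. It assumes the polynomial distortion model documented by OpenCV, with three radial and two tangential coefficients; the number of coefficients, the function name `project_pinhole` and all numeric values are illustrative assumptions rather than the calibration of the dataset.

```python
def project_pinhole(point_cam, fx, fy, cx, cy, skew=0.0,
                    radial=(0.0, 0.0, 0.0), tangential=(0.0, 0.0)):
    """Map a 3D point in camera coordinates to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    # Normalised image coordinates (perspective division).
    xn, yn = x / z, y / z
    r2 = xn * xn + yn * yn
    k1, k2, k3 = radial
    p1, p2 = tangential
    # Radial distortion scales the point along the ray from the image centre.
    radial_factor = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    # Tangential distortion models a lens that is not parallel to the sensor.
    xd = xn * radial_factor + 2.0 * p1 * xn * yn + p2 * (r2 + 2.0 * xn * xn)
    yd = yn * radial_factor + p1 * (r2 + 2.0 * yn * yn) + 2.0 * p2 * xn * yn
    # The intrinsic matrix: focal lengths, skew and principal point.
    u = fx * xd + skew * yd + cx
    v = fy * yd + cy
    return u, v


# Made-up intrinsics for illustration only.
u, v = project_pinhole((1.0, 2.0, 10.0), fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
u_dist, v_dist = project_pinhole((1.0, 2.0, 10.0), fx=1000.0, fy=1000.0,
                                 cx=960.0, cy=540.0, radial=(0.1, 0.0, 0.0))
```

With all distortion coefficients at zero the function reduces to the ideal pinhole projection, which is how the undistorted frames can be treated.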

Extrinsic calibration.

To obtain the orientation of each of the cameras, we need a set of accurate measurements of distances between 3D-space points, which have to be annotated in the view whose extrinsic matrix is being estimated. We used points on the ground between which we measured the distances by hand, in centimetres.

3.3.1 Bundle adjustment

Bundle adjustment (Wikipedia (2017a)) is commonly performed as a final step, as it provides jointly optimal reconstruction and parameter refinement. The term is coined by referring to a bundle of light rays. In a practical sense, it represents a re-projection error minimization between the image locations of observed and predicted image points.

Let I and E denote the intrinsic and the extrinsic parameters of all of the cameras, respectively. Given a dataset $\mathcal{D}$ whose elements are sets of corresponding points, or precisely $\mathcal{D} = \{x_j^c\}$, where $j \in \{1, \dots, N\}$ and $c \in \{1, \dots, C\}$, with $C$ denoting the number of cameras, the goal is to find projection matrices $P_c$ whose parameters are contained in I and E, and the 3D points $X_j$, $j = 1, \dots, N$, such that:

$$\min_{I,\, E,\, \{X_j\}} \; \sum_{j=1}^{N} \sum_{c=1}^{C} v_j^c \, d\left(P_c(X_j),\, x_j^c\right)^2 \qquad (1)$$

where $d(\cdot, \cdot)$ denotes the Euclidean image distance, and $v_j^c$ is the indicator variable equal to 1 when the point $j$ is visible in view $c$, and is 0 otherwise. In other words, we formulated the optimisation as a non-linear least squares problem, where the error is the squared norm of the difference between the observed feature location and the projection of the corresponding 3D point on the image plane of the camera.
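The objective described above can be evaluated with a few lines of code. The sketch below is only an illustration under simplifying assumptions: an ideal pinhole camera without distortion, cameras stored as hypothetical dicts, and the visibility indicator encoded by the presence or absence of an observation. A solver would minimise this value over the camera parameters and the 3D points.

```python
def project(point, cam):
    """Ideal pinhole projection of a 3D world point (no lens distortion)."""
    R, t = cam["R"], cam["t"]
    xc = [sum(R[i][k] * point[k] for k in range(3)) + t[i] for i in range(3)]
    return (cam["fx"] * xc[0] / xc[2] + cam["cx"],
            cam["fy"] * xc[1] / xc[2] + cam["cy"])


def reprojection_cost(cameras, points_3d, observations):
    """Sum of squared image distances between observed and projected points.

    `observations` maps (point_index, camera_index) to the annotated pixel.
    A missing key plays the role of the indicator variable being zero,
    i.e. the point is not visible in that view and contributes nothing.
    """
    total = 0.0
    for (j, c), (u_obs, v_obs) in observations.items():
        u, v = project(points_3d[j], cameras[c])
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total


# Toy example with made-up values: one camera at the origin, two 3D points.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = [{"R": IDENTITY, "t": (0.0, 0.0, 0.0),
            "fx": 1.0, "fy": 1.0, "cx": 0.0, "cy": 0.0}]
points_3d = [(1.0, 2.0, 2.0), (0.0, 0.0, 4.0)]
perfect = {(0, 0): (0.5, 1.0)}                     # second point not visible
one_pixel_off = {(0, 0): (1.5, 1.0), (1, 0): (0.0, 0.0)}
```

Here `reprojection_cost(cameras, points_3d, perfect)` is zero, while the second set of observations yields a cost of one squared pixel.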

To this end, we manually annotated corresponding points by clicking on visually corresponding locations across the seven views and throughout multiple frames. Since we use the Pinhole camera model, the parameter set in our implementation consists of parameters for rotation, for translation, for focal length (x and y), for the principal point, for radial distortion and for tangential distortion. To optimize eq. 1, gradient descent, (Gauss-)Newton, or the Levenberg-Marquardt methods are usually used. In addition, the sparse structure of the problem is often exploited since, although the expression is simple, the number of variables grows rapidly with the number of points.

As the frames undistorted with the per-camera intrinsic calibration showed good results, as far as is visible to the human eye, we also experimented with variants of the problem in eq. 1. In particular, we either: (1) solved the problem in eq. 1 as stated, optimising both I and E; (2) fixed I and optimised E; or (3) fixed I for one iteration of the algorithm, followed by another iteration where we optimised both. The third variant was motivated by the aim of only slightly refining the intrinsic parameters. However, in our observations, regular optimisation of both I and E, i.e. solving eq. 1 as specified, provided the best results.

We conclude that, in our observations, bundle adjustment significantly improved the estimation of the calibration parameters in terms of joint accuracy.

3.3.2 Illustration of the final camera calibration precision

Finally, we provide calibration files which are compatible with the OpenCV library, thus are straightforward to use. We observe high precision joint-camera projections. In Fig. 4 we illustrate an example where we click on two views (displayed in blue color), find the point as an intersection of the two, and project it to the rest of the views (displayed in red color).

Figure 4: Illustration of the camera calibration precision. Best seen in color: blue - clicked points; red - projection of the intersection of the two clicked points. Note that we omit one of the views, since the considered point is occluded in it.
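The check illustrated in Fig. 4 can be sketched in code: two clicked pixels are back-projected to rays, a 3D point is estimated from the two rays, and that point is projected into a further view. Since two back-projected rays rarely intersect exactly, this sketch takes the midpoint of the shortest segment between them, which is an assumption on our side; the text does not specify how the intersection was computed. The ideal pinhole cameras, the dict layout and all values are hypothetical.

```python
def project(point, cam):
    """Ideal pinhole projection of a 3D world point (no lens distortion)."""
    R, t = cam["R"], cam["t"]
    xc = [sum(R[i][k] * point[k] for k in range(3)) + t[i] for i in range(3)]
    return (cam["fx"] * xc[0] / xc[2] + cam["cx"],
            cam["fy"] * xc[1] / xc[2] + cam["cy"])


def pixel_to_ray(pixel, cam):
    """Back-project a pixel to a world-space ray (origin, direction)."""
    R, t = cam["R"], cam["t"]
    # Camera centre is -R^T t; the direction is R^T applied to the normalised pixel.
    origin = tuple(-sum(R[k][i] * t[k] for k in range(3)) for i in range(3))
    d_cam = ((pixel[0] - cam["cx"]) / cam["fx"], (pixel[1] - cam["cy"]) / cam["fy"], 1.0)
    direction = tuple(sum(R[k][i] * d_cam[k] for k in range(3)) for i in range(3))
    return origin, direction


def midpoint_between_rays(o1, d1, o2, d2):
    """Midpoint of the shortest segment connecting two non-parallel lines."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    w0 = [p - q for p, q in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [o + s * v for o, v in zip(o1, d1)]
    p2 = [o + t * v for o, v in zip(o2, d2)]
    return tuple((u + v) / 2.0 for u, v in zip(p1, p2))


# Toy example with made-up cameras: three views shifted along the x axis.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
def make_cam(center_x):
    return {"R": IDENTITY, "t": (-center_x, 0.0, 0.0),
            "fx": 1000.0, "fy": 1000.0, "cx": 960.0, "cy": 540.0}

cam1, cam2, cam3 = make_cam(0.0), make_cam(2.0), make_cam(4.0)
true_point = (1.0, 0.5, 5.0)
click1, click2 = project(true_point, cam1), project(true_point, cam2)
estimate = midpoint_between_rays(*pixel_to_ray(click1, cam1), *pixel_to_ray(click2, cam2))
reprojected_in_cam3 = project(estimate, cam3)
```

With a jointly accurate calibration, `reprojected_in_cam3` lands on the same physical point in the third view, which is what the red markers in the figure show.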

4 Benchmark experiments

We benchmark state of the art multi-camera people detection methods, which unless otherwise stated, operate per-frame. In other words, the reported results of the methods do not leverage time consistency which in general further improves performance as the missed detections would be smoothed and the false positives would be suppressed.

4.1 Evaluation protocol

Performance is always measured in terms of Euclidean distance to the ground-truth on the ground (or from top-view).

We compute false positives (FP), false negatives (FN) and true positives (TP) by assigning detections to ground truth using Hungarian matching. Since we operate in the ground plane, we impose that a detection can be assigned to a ground truth annotation only if they are less than a distance away. Given FP, FN and TP, we can evaluate:

  • Multiple Object Detection Accuracy (MODA) which we will plot as a function of , and the Multiple Object Detection Precision (MODP)  Kasturi et al. (2009).

  • Precision-Recall. Precision and Recall are taken to be TP/(TP+FP) and TP/(TP+FN), respectively.

We will report MODP, Precision, and Recall for radius , which roughly corresponds to the width of a human body. Note that these metrics are unforgiving of projection errors because we measure distances in the ground plane, which would not be the case if we evaluated overlap in the image plane as is often done in the monocular case. Nevertheless, we believe them to be the appropriate metrics for a multi-camera system that computes the 3D location of people.
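The protocol above can be illustrated for a single frame as follows. This is a sketch under stated assumptions: instead of the Hungarian algorithm it uses an exhaustive search over permutations, which returns the same optimal assignment but is only practical for a handful of points; MODA is computed as 1 - (FN + FP) / (number of ground-truth annotations), i.e. with unit weights; MODP is omitted. The function names and the example coordinates are hypothetical.

```python
from itertools import permutations
from math import hypot

UNMATCHED = 1e6  # cost of a pair that may not be matched (padding or too far apart)


def match_detections(detections, ground_truth, radius):
    """Optimal one-to-one assignment of detections to annotations on the ground plane."""
    n, m = len(detections), len(ground_truth)
    size = max(n, m)

    def cost(i, j):
        if i < n and j < m:
            dist = hypot(detections[i][0] - ground_truth[j][0],
                         detections[i][1] - ground_truth[j][1])
            if dist < radius:
                return dist
        return UNMATCHED

    # Brute-force stand-in for Hungarian matching: try every assignment.
    best = min(permutations(range(size)),
               key=lambda perm: sum(cost(i, j) for i, j in enumerate(perm)))
    return [(i, j) for i, j in enumerate(best) if cost(i, j) < UNMATCHED]


def detection_metrics(detections, ground_truth, radius):
    """TP, FP, FN, precision, recall and single-frame MODA."""
    tp = len(match_detections(detections, ground_truth, radius))
    fp = len(detections) - tp
    fn = len(ground_truth) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    moda = 1.0 - (fn + fp) / len(ground_truth) if ground_truth else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "precision": precision, "recall": recall, "MODA": moda}


# Made-up ground-plane coordinates in metres.
detections = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
ground_truth = [(0.1, 0.0), (1.0, 1.2), (9.0, 9.0), (10.0, 10.0)]
metrics = detection_metrics(detections, ground_truth, radius=0.5)
```

Note that MODA becomes negative as soon as the number of errors exceeds the number of annotations, for example with three false positives and one missed person in a frame that contains a single annotation.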

4.2 Tested methods

We tested the following methods:

  • DeepMCD. We used the deep learning method of Chavdarova and Fleuret (2017), described in § 2.2. So far we performed the following preliminary experiments: fully testing on the WILDTRACK dataset with a model pre-trained on the PETS 2009 dataset; as well as training solely the top classifier on the WILDTRACK dataset. We denote the two experiments with Pre-DeepMCD and Top-DeepMCD, respectively.

  • Deep-Occlusion. The recent work of Baque et al. (2017) uses a hybrid CNN-CRF method to exploit information about the calibration while leveraging the discriminative power of pre-trained monocular CNNs.

  • POM-CNN. The multi-camera detector Fleuret et al. (2008) described in § 2.2 takes background subtraction images as its input. In its original implementation, they were obtained using traditional algorithms Ziliani and Cavallaro (1999); Oliver et al. (2000). For a fair comparison reflecting the progress that has occurred since then, we use the same CNN-based segmentor.

  • RCNN-projected. The recent work of Xu et al. (2016b) proposes a MCMT tracking framework that relies on a powerful CNN for detection purposes Ren et al. (2015). Since the code of Xu et al. (2016b) is not publicly available, we reimplemented their detection methodology as faithfully as possible but without the tracking component for a fair comparison with our approach that operates on images acquired at the same time. Specifically, we run the 2D detector proposed by  Ren et al. (2015) on each image. We then project the bottom of the 2D bounding box onto the ground reference frame as in Xu et al. (2016b) to get 3D ground coordinates. Finally, we cluster all the detections from all the cameras using 3D proximity to produce the final set of detections.
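The projection and clustering steps of the RCNN-projected baseline can be sketched as follows. The sketch assumes that each camera provides a 3x3 homography from image pixels to ground-plane coordinates, and it uses a simple greedy centroid clustering as a stand-in, since the text does not name the clustering algorithm. The function names, the homography and the boxes are all hypothetical.

```python
from math import hypot


def bbox_foot_point(box):
    """Bottom-centre of a 2D box given as (xmin, ymin, xmax, ymax)."""
    xmin, _, xmax, ymax = box
    return ((xmin + xmax) / 2.0, ymax)


def apply_homography(H, pixel):
    """Map an image pixel to ground-plane coordinates with a 3x3 homography."""
    u, v = pixel
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return (x / w, y / w)


def cluster_ground_points(points, radius):
    """Greedy clustering: join the first cluster whose centroid is within `radius`."""
    clusters = []
    for p in points:
        for members in clusters:
            cx = sum(q[0] for q in members) / len(members)
            cy = sum(q[1] for q in members) / len(members)
            if hypot(p[0] - cx, p[1] - cy) <= radius:
                members.append(p)
                break
        else:
            clusters.append([p])
    return [(sum(q[0] for q in m) / len(m), sum(q[1] for q in m) / len(m))
            for m in clusters]


# Made-up example: both cameras share one homography (1 pixel = 1 cm) for brevity.
H = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 1.0]]
boxes_per_camera = [
    [(90, 0, 110, 200), (480, 0, 520, 400)],   # detections in camera 1
    [(100, 50, 130, 205)],                     # detections in camera 2
]
ground_points = [apply_homography(H, bbox_foot_point(box))
                 for boxes in boxes_per_camera for box in boxes]
final_detections = cluster_ground_points(ground_points, radius=0.5)
```

In this toy case the first box of camera 1 and the box of camera 2 fall within half a metre of each other on the ground and are merged into one detection.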

4.3 Results

Method               MODA    MODP    Precision  Recall
Deep-Occlusion+KSP   0.752   -       -          -
Deep-Occlusion       0.741   0.538   0.95       0.80
Pre-DeepMCD          0.334   0.528   0.93       0.36
Top-DeepMCD          0.601   0.642   0.80       0.79
POM-CNN              0.232   0.305   0.75       0.55
RCNN-projected       0.113   0.184   0.68       0.43

Table 2: Benchmark results on WILDTRACK using its seven views at the same time.

In Tab. 2 we list the results we obtained using the methods enumerated in § 4.2 on the WILDTRACK dataset, while using all of its seven views. Given that the MODA metric can be negative, it is interesting to observe that the pre-trained DeepMCD models generalized well, despite the fact that this dataset is of higher resolution and of different statistics. Fine-tuning solely the common classifier further increased the detection performance. Our current experiments include training the full models jointly, as the size of this dataset allows for it.

5 Research Directions and Discussions


We provided a new large-scale seven-static-cameras dataset whose fields of view overlap. It comes along with highly accurate camera calibration, annotations for camera calibration algorithms, as well as an open-source annotation tool.

Research directions.

The provided dataset is realistic and challenging. As demonstrated in the experiments section, state-of-the-art methods perform well, considering that the MODA metric can take negative values, but still leave room for improvement. The direct use-cases of this dataset are improving algorithms for monocular or multi-view people detection, people tracking, and camera calibration.

Furthermore, its overlapping fields of view allow for testing ideas on utilising multi-stream information, a core problem that arises relatively often.

Future work.

The WILDTRACK dataset contains an additional part of the same size, which has not been annotated yet. In the near future it will either be made public as is, to be used for unsupervised methods, or it will be published with annotations as well.


This work was supported by the Swiss National Science Foundation, under the grant CRSII2-147693 ”WILDTRACK”. We also gratefully acknowledge NVIDIA’s support through their academic GPU grant program; the granted GPU was used for this work. We would also like to thank Florent Monay and Salim Kayal for their advice and help regarding the calibration of the cameras.

Appendix A File formats and size

We explain the details of the downloadable files which contain the annotations, and we note that these may be subject to slight modification.

File formats.

For each annotated frame, we provide a separate file in the language independent JSON file format.

Each multi-view annotation contains the following information:

  • Person ID: A unique identifier corresponding to a tracked person.

  • 3D location: (X, Y) location of the target in meters on the ground plane with respect to the origin.

  • pixel coordinates: For each camera , the detection location in pixel coordinates for that view is given: and which define the rectangle.
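As an illustration of how such a per-frame file might be consumed, the sketch below parses one frame and counts the monocular boxes per view. The key names (`personID`, `location`, `views`, `viewNum`, `xmin`, `ymin`, `xmax`, `ymax`), the convention that views without a visible box are simply absent, and all values are hypothetical; the files shipped with the dataset should be consulted for the exact schema.

```python
import json
from collections import Counter

# Hypothetical content of one per-frame annotation file.
EXAMPLE_FRAME = """
[
  {"personID": 7, "location": [3.2, 11.5],
   "views": [{"viewNum": 0, "xmin": 100, "ymin": 220, "xmax": 160, "ymax": 400},
             {"viewNum": 3, "xmin": 900, "ymin": 300, "xmax": 950, "ymax": 450}]},
  {"personID": 9, "location": [8.0, 2.75],
   "views": [{"viewNum": 0, "xmin": 500, "ymin": 210, "xmax": 570, "ymax": 420}]}
]
"""


def load_frame(text):
    """Parse the multi-view annotations of a single frame."""
    return json.loads(text)


def boxes_per_view(annotations):
    """Count monocular boxes per view, since a person is not visible in every view."""
    counts = Counter()
    for person in annotations:
        for view in person["views"]:
            counts[view["viewNum"]] += 1
    return counts


annotations = load_frame(EXAMPLE_FRAME)
counts = boxes_per_view(annotations)
person_ids = sorted(person["personID"] for person in annotations)
```

Summing such counters over all annotated frames gives the per-view numbers of monocular examples discussed in § 3.2.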


We refer to a set of images synchronized with the same time stamp as a frame. The extracted and pre-processed frames with removed distortions contain images, while each image is of size MB. This corresponds to frames per second for h and cameras. Currently there are annotated frames, at fps.


Each of the videos is approximately h long, and of size GB.

Appendix B Camera calibration details

We list the details of our implementation of the camera calibration; the resulting calibration files are available for download.


The intrinsic calibration was done for each camera separately, using the OpenCV function calibrateCamera, which also provides the distortion coefficients. Precisely, we used radial distortion coefficients. In particular, we used the asymmetric circle grid provided by OpenCV with sizes of , and 20 frames to obtain each camera’s intrinsic matrix. We found it useful in terms of accuracy to make sure that the target, in our case the circle grid, is captured in as many parts of the camera's field of view as possible.
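To illustrate what the radial distortion coefficients represent, the sketch below applies the polynomial radial part of the standard distortion model to a normalized image point and then maps it to pixels. It is only a sketch under assumptions: tangential terms are ignored, the intrinsic matrix is taken to be skew-free, and the function names and numeric values are illustrative rather than the calibrated parameters of the dataset.

```python
def distort_radial(x, y, coeffs):
    """Apply polynomial radial distortion to a normalized image point.

    (x, y) are coordinates after dividing by depth and before applying the
    intrinsic matrix; coeffs = (k1, k2, ...) multiply r^2, r^4, ...
    """
    r2 = x * x + y * y
    factor = 1.0
    power = r2
    for k in coeffs:
        factor += k * power
        power *= r2
    return x * factor, y * factor


def to_pixels(x, y, fx, fy, cx, cy):
    """Map a normalized point to pixels with a skew-free intrinsic matrix."""
    return fx * x + cx, fy * y + cy


# Illustrative values only, not the calibrated WILDTRACK parameters.
xd, yd = distort_radial(0.5, 0.0, coeffs=(-0.2, 0.05))
u, v = to_pixels(xd, yd, fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
```

Removing the distortion from the images, as done for the released frames, amounts to inverting this mapping.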


In our implementation, for each of the seven views we used , , , , , and pairs of points, respectively. We used the OpenCV function solvePnP, which, given the intrinsics, provides the rotation and translation vectors. The 3D measurements and the annotated corresponding points will also be made available, so as to make the dataset suitable for testing camera calibration algorithms.
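The rotation and translation vectors can be read as follows: the rotation vector is an axis-angle representation which the Rodrigues formula turns into a rotation matrix R, and a world point X maps to camera coordinates as R X + t before the pinhole projection. The sketch below spells this out in plain Python, assuming an undistorted, skew-free camera; the function names and the numeric values are illustrative assumptions.

```python
import math


def rodrigues(rvec):
    """Convert an axis-angle rotation vector into a 3x3 rotation matrix."""
    theta = math.sqrt(sum(comp * comp for comp in rvec))
    if theta < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (comp / theta for comp in rvec)
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]


def project(point, rvec, tvec, fx, fy, cx, cy):
    """Project a 3D world point to pixels: x_cam = R * X + t, then pinhole."""
    R = rodrigues(rvec)
    cam = [sum(R[i][j] * point[j] for j in range(3)) + tvec[i] for i in range(3)]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    return fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy


# Illustrative values only: a 90-degree rotation about the z-axis and a
# 5-unit offset along the optical axis.
u, v = project((1.0, 0.0, 0.0), rvec=(0.0, 0.0, math.pi / 2),
               tvec=(0.0, 0.0, 5.0), fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
```

The distance between such a projection and the annotated pixel location of the same point is the reprojection error commonly used to judge the quality of the estimated pose.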

Bundle adjustment.

In our implementation, we used the open-source C++ library Ceres, provided by Agarwal et al. [1], which offers extensive support for bundle adjustment problems. We used the linear solver which in Ceres is referred to as Iterative Schur.

Appendix C Annotation process

After the camera calibration, we designed an annotation tool and hosted it online. We describe the tool and the annotation process separately.

Annotation tool.

In order to make use of the jointly accurate camera calibration, we designed a dedicated tool in which each multi-view annotation amounts to adjusting a cylinder so as to find its best position. In other words, rather than placing bounding boxes in each of the views separately, the annotator shifts an imaginary volume so that its visible projections best fit the same person in all of the views. This allows for: (1) more efficient annotation: a single adjustment rather than placing and adjusting a bounding box in each of the seven views separately; and (2) more accurate annotation, where again we refer to joint accuracy. The latter is because separate per-view annotation, which assigns the most probable 2D rectangle in each view, is prone to errors: first, the annotators are less motivated to find the best fit, and second, it is far less evident which is the best position of a bounding box in a single view, due to the ambiguity which comes with the loss of one dimension.

To this end, we generated a high-density grid of regularly positioned points, and at each position we centered a cylinder whose height corresponds to the average human height. Each such cylinder projects into each of the separate 2D views as a rectangle whose position in the view is given in pixel coordinates. We then integrated these pre-calculated projections into our annotation tool.
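The pre-calculation described above can be sketched as follows: generate regularly spaced ground positions, sample the top and bottom circles of a cylinder at each position, project the samples into a view, and take their bounding rectangle. This is a minimal sketch under assumptions, not the authors' implementation: the camera is given as a generic 3x4 projection matrix, and the radius, height, grid spacing and camera values are illustrative.

```python
import math


def project(P, point):
    """Apply a 3x4 projection matrix to a 3D point and dehomogenise."""
    x, y, z = point
    u, v, w = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in P)
    return u / w, v / w


def cylinder_rectangle(P, ground_xy, radius, height, samples=16):
    """Bounding rectangle (left, top, right, bottom) of a projected cylinder."""
    gx, gy = ground_xy
    pixels = []
    for i in range(samples):
        angle = 2.0 * math.pi * i / samples
        px = gx + radius * math.cos(angle)
        py = gy + radius * math.sin(angle)
        for z in (0.0, height):  # bottom and top circles
            pixels.append(project(P, (px, py, z)))
    us = [p[0] for p in pixels]
    vs = [p[1] for p in pixels]
    return min(us), min(vs), max(us), max(vs)


def ground_grid(x0, y0, step, nx, ny):
    """Regularly spaced ground-plane positions, row by row."""
    return [(x0 + i * step, y0 + j * step) for j in range(ny) for i in range(nx)]


# Illustrative camera: 1.5 units above the ground, looking along world +y,
# focal length 1000 px, principal point (960, 540). World z points up.
P = [[1000.0, 960.0, 0.0, 0.0],
     [0.0, 540.0, -1000.0, 1500.0],
     [0.0, 1.0, 0.0, 0.0]]
positions = ground_grid(0.0, 10.0, 0.5, nx=3, ny=2)
rectangles = [cylinder_rectangle(P, pos, radius=0.3, height=1.8) for pos in positions]
```

Since the rectangles depend only on the calibration and the grid, they can be computed once per camera and looked up while the annotator moves the cylinder.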

Figure 5: Interface of the multi-view annotation tool.

We built a custom web application with a responsive design, whose interface is illustrated in Fig. 5. The Python-based tool is hosted on a website, created and managed using Django. The source code is available for download.

As illustrated in Fig. 5, for the selected frame the tool displays the seven corresponding images at the same time. It allows the user to place a 3D bounding box around each pedestrian visible in the frame. This is achieved by clicking on the pedestrian's feet and refining the position of the box using the arrow keys. Once the frame is fully labelled and the user has moved to the next frame, (s)he can optionally reload the annotations from the previous frame, traverse each of them, and refine their positions. Additional features, such as zooming in on the multi-view detection currently being annotated and keyboard shortcuts, are also implemented.

Mechanical Turk Annotation.

Since the labelling process is time consuming and tedious, the tool was shared on Amazon Mechanical Turk and external people were paid to label frames. To obtain accurate annotations, we were highly involved in the process, because of the risk that annotators would prioritize profit over quality and thus degrade the accuracy of the annotations. To exploit the capability of loading the labels from the previous frame, and thereby speed up the annotation process, the annotators recruited via Mechanical Turk were assigned frames in batches of size .

As explained, annotators were found via Mechanical Turk. However, since the dataset is challenging at some points, annotating locations in 3D for crowded scenes requires substantial attention and dedication. Despite all our efforts to make the tool easy to use, it turned out that most MT workers were reluctant to provide this level of effort and almost never achieved the required quality. We therefore had to select a few workers to whom we personally explained the level of detail needed. They were then able to annotate with high accuracy.

Annotating one frame takes on average minutes for a trained person, and approximately half of that when initialized using the previous frame.


  • [1] Sameer Agarwal, Keir Mierle, and Others. Ceres solver.
  • Alahi et al. [2011] A. Alahi, L. Jacques, Y. Boursier, and P. Vandergheynst. Sparsity driven people localization with a heterogeneous network of cameras. Journal of Mathematical Imaging and Vision, 41(1-2):39–58, 2011.
  • Alameda-Pineda et al. [2016] X. Alameda-Pineda, J. Staiano, R. Subramanian, L. Batrinca, E. Ricci, B. Lepri, O. Lanz, and N. Sebe. Salsa: A novel dataset for multimodal group behavior analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1707–1720, Aug 2016. ISSN 0162-8828. doi: 10.1109/TPAMI.2015.2496269.
  • Baque et al. [2017] P. Baque, F. Fleuret, and P. Fua. Deep Occlusion Reasoning for Multi-Camera Multi-People tracking. 2017.
  • Berclaz et al. [2009] J. Berclaz, F. Fleuret, and P. Fua. Multiple object tracking using flow linear programming. Idiap-RR Idiap-RR-10-2009, Idiap, 6 2009.
  • Berclaz et al. [2011] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9):1806–1819, Sept 2011. ISSN 0162-8828. doi: 10.1109/TPAMI.2011.21.
  • Bradski [2000] G. Bradski. Opencv. Dr. Dobb’s Journal of Software Tools, 2000.
  • Chavdarova and Fleuret [2017] T. Chavdarova and F. Fleuret. Deep multi-camera people detection. CoRR, abs/1702.04593, 2017.
  • Dalal and Triggs [2005] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893, June 2005. doi: 10.1109/CVPR.2005.177.
  • De Vleeschouwer et al. [2008] Christophe De Vleeschouwer, Fan Chen, Damien Delannay, Christophe Parisot, Christophe Chaudy, Eric Martrou, Andrea Cavallaro, et al. Distributed video acquisition and annotation for sport-event summarization. In NEM summit 2008: Towards Future Media Internet, 2008.
  • Dollar et al. [2009] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 304–311, June 2009. doi: 10.1109/CVPR.2009.5206631.
  • Dollar et al. [2012] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(4):743–761, April 2012. ISSN 0162-8828. doi: 10.1109/TPAMI.2011.155.
  • Enzweiler and Gavrila [2009] M. Enzweiler and D.M. Gavrila. Monocular pedestrian detection: Survey and experiments. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(12):2179–2195, Dec 2009. ISSN 0162-8828. doi: 10.1109/TPAMI.2008.260.
  • Ess et al. [2008] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. A mobile vision system for robust multi-person tracking. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, June 2008. doi: 10.1109/CVPR.2008.4587581.
  • Ferryman and Shahrokni [2009] J. Ferryman and A. Shahrokni. Pets2009: Dataset and challenge. In Performance Evaluation of Tracking and Surveillance (PETS-Winter), 2009 Twelfth IEEE International Workshop on, pages 1–6, Dec 2009. doi: 10.1109/PETS-WINTER.2009.5399556.
  • Fleuret et al. [2008] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua. Multi-Camera People Tracking with a Probabilistic Occupancy Map. 30(2):267–282, February 2008.
  • Ge and Collins [2010] W. Ge and R. T. Collins. Crowd detection with a multiview sampler. 2010.
  • Geiger et al. [2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • Golbabaee et al. [2014] M. Golbabaee, A. Alahi, and P. Vandergheynst. Scoop: A real-time sparsity driven people localization algorithm. Journal of Mathematical Imaging and Vision, 48(1):160–175, 2014. ISSN 1573-7683. doi: 10.1007/s10851-012-0405-4.
  • Kasturi et al. [2009] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, M. Boonstra, V. Korzhova, and J. Zhang. Framework for Performance Evaluation of Face, Text, and Vehicle Detection and Tracking in Video: Data, Metrics, and Protocol. 31(2):319–336, 2009.
  • Keller et al. [2009] Christoph Gustav Keller, David Fernández Llorca, and Dariu M. Gavrila. Dense stereo-based roi generation for pedestrian detection. In Pattern Recognition, volume 5748 of Lecture Notes in Computer Science, pages 81–90. Springer Berlin Heidelberg, 2009. ISBN 978-3-642-03797-9. doi: 10.1007/978-3-642-03798-6_9.
  • Oliver et al. [2000] N.M. Oliver, B. Rosario, and A.P. Pentland. A Bayesian Computer Vision System for Modeling Human Interactions. 22(8):831–843, 2000.
  • Peng et al. [2015] P. Peng, Y. Tian, Y. Wang, J. Li, and T. Huang. Robust multiple cameras pedestrian detection with multi-view bayesian network. Pattern Recogn., 48(5):1760–1772, May 2015. ISSN 0031-3203. doi: 10.1016/j.patcog.2014.12.004.
  • Ren et al. [2015] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. 2015.
  • Ristani et al. [2016] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision workshop on Benchmarking Multi-Target Tracking, 2016.
  • Wikipedia [2017a] Wikipedia. Bundle adjustment — wikipedia, the free encyclopedia, 2017a. [Online; accessed 13-July-2017].
  • Wikipedia [2017b] Wikipedia. Pinhole camera model — wikipedia, the free encyclopedia, 2017b. [Online; accessed 12-July-2017].
  • Wojek et al. [2009] C. Wojek, S. Walk, and B. Schiele. Multi-cue onboard pedestrian detection. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 794–801, June 2009. doi: 10.1109/CVPR.2009.5206638.
  • Xu et al. [2016a] Y. Xu, X. Liu, Y. Liu, and S. C. Zhu. Multi-view people tracking via hierarchical trajectory composition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4256–4265, June 2016a. doi: 10.1109/CVPR.2016.461.
  • Xu et al. [2016b] Y. Xu, X. Liu, Y. Liu, and S.C. Zhu. Multi-View People Tracking via Hierarchical Trajectory Composition. pages 4256–4265, 2016b.
  • Ziliani and Cavallaro [1999] F. Ziliani and A. Cavallaro. Image Analysis for Video Surveillance Based on Spatial Regularization of a Statistical Model-Based Change Detection. 1999.