Semantic Driven Multi-Camera Pedestrian Detection

by   Alejandro López-Cifuentes, et al.

Pedestrian detection is one of the pivotal fields in computer vision, especially in video surveillance scenarios. People detection methods are highly sensitive to occlusions among pedestrians, which dramatically degrade performance in crowded scenarios. Falling camera prices have made multi-camera set-ups commonplace, and these can better confront occlusions by using different points of view to disambiguate detections. In this paper we present an approach to improve the performance of these multi-camera systems and to make them independent of the considered scenario, via an automatic understanding of the scene content. This semantic information, obtained from a semantic segmentation, is used 1) to automatically generate a common Area of Interest for all cameras, instead of the usual manual definition of this area; and 2) to improve the 2D detections of each camera via an optimization technique which maximizes the coherence of every detection both in all 2D views and in the 3D world, obtaining best-fitted bounding boxes and a consensus height for every pedestrian. Experimental results on five publicly available datasets show that the proposed approach, which does not require any training stage, outperforms state-of-the-art multi-camera pedestrian detectors not specifically trained for these datasets, which demonstrates the expected semantic-based robustness to different scenarios.




I Introduction

Pedestrian or people detection is a pivotal step in several computer vision applications, including pedestrian tracking, crowd monitoring and event detection. Automatic people detection is generally considered a solid and mature technology able to operate with nearly human accuracy in generic scenarios [3, 4]. However, the handling of severe occlusions is still a major challenge. Occlusions occur due to the projection of 3D objects onto a 2D image representation. Although recent deep-learning based methods are able to cope with partial occlusions, the detection process fails when only a small part or no part of the person is visible. To cope with severe occlusions, a potential solution is the use of additional cameras: if they are adequately positioned, the different points of view might allow for disambiguation.

Disambiguation is generally achieved by projecting every camera’s detections on a common reference plane. The ground plane is usually the preferred option as it constitutes a common reference in which people’s height can be disregarded. Per-camera detections can then be combined on the ground plane to refine and complete pedestrian detection. However, there are several challenges to be addressed during the fusion process. The most prominent ones are the need to define common visibility areas where the cameras’ views overlap, and how to cope with camera calibration errors and persons’ self-occlusions. See Figure 1 for visual examples of these challenges, which we detail below:

In multi-camera approaches a common strategy is to define an operational area on the ground plane. This area represents the overlapping field-of-view of all the involved cameras. It can be used to reduce the impact of calibration errors in the process and to generally ease the combination or fusion of per-camera detections. This area is generally manually defined for each scenario, precluding the automation of the process.

Scene calibration is a well-known task [5] which can be performed either manually or using automatic calibration methods based on image cues. In both cases, small perturbations in the calibration process may cause uncertainty in the fusion of the detections on the ground plane. The impact of calibration errors increases with the distance to the camera: generally, calibration is more accurate for pixels which represent objects close to the camera.

Self-occlusions are caused by the intrinsic three-dimensional nature of people, resulting in the occlusion of some human parts by some others. If the visible parts are different for different cameras and these are used to project a person location on the ground plane, the cameras’ projections will diverge, hindering their fusion.

To cope with these challenges, in this paper we present a multi-camera pedestrian detection method which is driven by semantic information in the 2D image planes and the 3D ground plane. Its novel contributions are:

  1. Adaptation to these scenarios of a method [6] to automatically define the operational area.

  2. A novel approach to globally combine pedestrian detections in a multi-camera scenario by creating connected components in a graph representation of detections.

  3. A height-adaptive optimization algorithm which uses semantic cues to globally refine the location and size of people detections by aggregating information from all the cameras.

Experimental results on public datasets (PETS 2009 [7], EPFL RLC [8, 9], EPFL Terrace [10, 11] and EPFL Wildtrack [12, 13]) show that: (1) the proposed method outperforms state-of-the-art monocular pedestrian detectors [1] as well as state-of-the-art scene-independent multi-camera detection approaches; and (2) the performance is comparable to recent deep-learning multi-camera detection approaches while requiring neither a manually annotated operational area nor preliminary training on the target scenario.

The rest of the paper is organized as follows: Section II reviews the State of the Art, Section III describes the proposed method, Section IV presents and discusses experimental results, leading to a set of conclusions in Section V.

II Related Work

Multi-camera people detection faces the combination, fusion and refinement of visual cues from several individual cameras to obtain more precise people locations. First, an operational area is generally defined, either manually or, as we propose, based on a semantic segmentation. Then, in approaches relying on a common reference plane, a common three-stage strategy is usually followed: (1) extracting detections on each camera, (2) projecting detections onto the plane and (3) combining detections and back-projecting them to the individual views to obtain per-camera people detections. Finally, obtained detections are sometimes post-processed to further improve their localization.

II-A Definition of the operational area

There are several approaches [2, 14] that rely on manually annotated operational areas where evaluation is performed. An advantage of these areas is that camera calibration errors are limited and controlled. Besides, these areas are defined to maximize the overlap between the fields of view of the involved cameras. However, the manual annotation of these operational areas hinders the generalization of people detection approaches. Our previous work in this domain [6] resulted in an automatic method for the cooperative extraction of operational areas in scenarios recorded with multiple moving cameras: semantic evidence from different junctures, cameras, and points of view is spatio-temporally aligned on a common ground plane and is used to automatically define an operational area or Area of Interest (AOI).

II-B Semantic Segmentation

Semantic segmentation is the task of assigning a unique object label to every pixel of an image. Top-performing strategies for semantic segmentation are based on CNNs. For instance, a dense up-sampling CNN can be used to generate pixel-level predictions within a hybrid dilated convolution framework [15]. Performance can be boosted through the use of an ensemble of many relatively shallow networks [16]. Contextual information can be implicitly used by including relationships between different labels (e.g. an airplane is likely to be on a runway or flying in the sky but not on the water) [17]. These relationships make it possible to reduce the complexity associated with large sets of object labels, generally improving performance.

II-C Monocular people detection

As stated in Section I, automatic monocular pedestrian detection is considered a mature technology able to obtain accurate results in a broad range of scenarios. According to exhaustive surveys and evaluation benchmarks [3, 4], there is a plethora of pedestrian detectors yielding good performance in varied scenarios. Traditional algorithms such as ACF [18] and DPM [19] successfully operate in low-populated scenarios. In the more challenging crowded scenarios, well-established object detectors such as Faster-RCNN [1] and YOLO [20], based on Convolutional Neural Networks, obtain high performance provided that their architectures are trained to detect people. Nevertheless, in scenarios with severe occlusions, the performance of these algorithms decreases.

II-D Projection of per-camera detections

Multi-camera pedestrian detection is fundamentally based on the projection of monocular detections onto a common reference plane. Projection is typically achieved either by using calibrated camera models that relate any 2D image point with a corresponding referenced 3D world direction [14], or by relying on homographic transformations that project image pixels to a specific 3D plane [2]. In both cases, the ground plane, on which people usually stand, is chosen as the reference for simplicity.

II-E Fusion and refinement of per-camera detections

Fusion and refinement approaches can be mainly divided into three different groups depending on how global detections are obtained. The first group encompasses geometrical methods, which combine detections based on the geometrical intersections between image cues. The second group embraces probabilistic methods, which combine detections via optimization frameworks and statistical modeling of the image cues. The third group is composed of solutions based on the ability of deep learning architectures to model occlusions and achieve accurate pedestrian detection at scene level.

Regarding geometrical methods, detections can be combined by projecting foreground masks to the ground plane in a multi-view scenario: the intersection of foreground regions leads to pedestrian detection [21]. Accuracy can be increased by projecting the middle vertical axis of pedestrians, leading to a more accurate intersection on the ground plane and, therefore, to a better estimation of the pedestrian’s position [22]. Following the same hypothesis, the use of a space occupancy grid to combine silhouette cues has been proposed: each ground pixel is considered as an occupancy sensor and observations are then used to infer pedestrian detection [23]. All of these approaches outperform per-camera pedestrian detection algorithms through the use of ground-plane homography projections. Nevertheless, the evaluation of foreground intersections in crowded spaces may lead to the appearance of phantoms or false detections. To handle this problem, the general multi-camera homography framework has been extended by using additional planes parallel to the ground plane [24, 25]. The intersection of the image cues with these parallel planes is expected to suppress these phantoms. Similarly, parallel planes can be used to create a full 3D reconstruction of pedestrians, which can then be back-projected to each of the camera views, improving monocular pedestrian detection [26].

Fig. 2: Overall pedestrian detection method. Top (a): processing starts by performing both a semantic segmentation and a pedestrian detection over a set of cameras (four, in the illustration) with overlapping fields of view. The segmentation, the detections and the camera calibration parameters feed the Multi-camera Pedestrian Detection module, which is described in detail in bottom (b): detections are projected onto a 3D reference plane; a Pedestrian Semantic Filtering module is used to remove detections located outside the automatically generated Area of Interest (AOI); the remaining detections are fused, based on a disconnected graph, to obtain global detections. The so-obtained global detections are back-projected to the camera views, and the Semantic-driven Back-Projection module globally refines the location of these detections by also using semantic cues.

Concerning probabilistic methods, an interesting example is the use of a multi-view model shaped by a Bayesian network to model the relationships between occlusions [2]. Detections are here assumed to be images of either pedestrians or phantoms, the former differentiated from the latter by inference on the network.

Recent approaches are focused on deep learning methods. The combination of CNNs and Conditional Random Fields (CRF) can be used to explicitly model ambiguities in crowded scenes [8]. High-order CRF terms are used to model potential occlusions, providing robust pedestrian detection. Alternatively, multi-view detection can be handled by an end-to-end deep learning method based on an occlusion-aware model for monocular pedestrian detection and a multi-view fusion architecture [27].

II-F Improving detection’s localization

Algorithms in all of these groups require accurate scene calibration: small calibration errors can produce inaccurate projections and back-projections which may contravene key assumptions of the methods. These errors may lead to misaligned detections, hindering their later use. To cope with this problem, one can rely on a Height-Adaptive Projection (HAP) procedure in which a gradient descent process is used to find both the optimal pedestrian’s height and location on the ground plane by maximizing the alignment of their back-projections with foreground masks on each camera [2].

III Proposed Pedestrian Detection Method

The proposed method is depicted in Figure 2. First, state-of-the-art algorithms for monocular pedestrian detection and semantic segmentation are used to extract the respective image cues for each camera. Semantic cues drive the automatic definition of the Area of Interest (AOI), and detections outside this area are discarded. The remaining per-camera detections are combined to obtain global detections by establishing rules and constraints on a disconnected graph. These detections are back-projected to the camera views, where their location and height estimates are further refined.

Fig. 3: Top row: RGB frames from the Terrace dataset [10, 11]. Bottom row: the corresponding semantic labels obtained by the PSP-Net algorithm [17]. Columns from left to right represent cameras 1 to 4 of this dataset. The bottom legend indicates the detected semantic classes.

III-A Preliminaries

Monocular Pedestrian Detection

Monocular pedestrian detection is performed using a state-of-the-art detector (Faster-RCNN [1]). In order to avoid a potential height bias, we ignore the height and width of the detected bounding boxes, i.e. each pedestrian detection in a given camera is represented only by the middle point of the base of its bounding box, expressed in homogeneous coordinates. (We use common notation: upper case denotes 3D points/coordinates and lower case denotes 2D camera-plane points/coordinates.)

Semantic Segmentation

Semantic segmentation is also performed using a state-of-the-art algorithm (Pyramid Scene Parsing Network (PSPNet) [17]) trained with the ADE20K dataset. This method assigns to each image pixel, for every camera and every frame, one of the 150 pre-trained semantic classes, e.g. floor, building, wall. Figure 3 depicts examples of semantic labels for selected camera frames of the Terrace Dataset [10, 11].

Projection of People Detections

Each camera has an associated homography matrix that transforms points from its image plane to the world ground plane. A detection of a given camera is projected onto the ground plane by multiplying its homogeneous image point by the homography matrix of that camera. The height of the projected point equals zero, as it represents a point on the ground plane.
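The projection step can be illustrated with a minimal NumPy sketch. The bounding-box layout `(x, y, w, h)` with a top-left origin, the function name `project_to_ground` and the toy homography values are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def project_to_ground(H, box):
    """Project the base midpoint of a bounding box onto the ground plane.

    H   : 3x3 image-to-ground homography of the camera (assumed given).
    box : (x, y, w, h), top-left corner plus size, image y growing downwards
          (an assumed convention).
    """
    x, y, w, h = box
    p = np.array([x + w / 2.0, y + h, 1.0])  # base midpoint, homogeneous 2D
    q = H @ p                                 # homogeneous ground-plane point
    X, Y = q[0] / q[2], q[1] / q[2]           # normalise the scale factor
    return np.array([X, Y, 0.0])              # height is zero on the ground

# Toy homography (hypothetical values): scales pixels to metres and shifts.
H_toy = np.array([[0.01, 0.0, -1.0],
                  [0.0, 0.02, -2.0],
                  [0.0, 0.0, 1.0]])
ground_point = project_to_ground(H_toy, (100, 50, 40, 150))
print(ground_point)
```

Only the base midpoint is projected, which matches the choice of ignoring the box height and width to avoid a height bias.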

III-B Pedestrian Semantic Filtering

Automatic Definition of the AOI

To obtain a semantic partition of the ground plane, an adaptation of [6] for static-camera scenarios is carried out. We first project every image pixel onto the ground plane via the homography of its camera. Every projected point inherits the semantic label assigned to its source pixel.


Thereby, a semantic locus—a ground-plane semantic partition—is obtained for each camera. The extent of each locus is defined by the image support, and missing points inside the locus are completed by nearest-neighbor interpolation.

In order to globally reduce the impact of moving objects and segmentation errors, we propose to temporally aggregate each locus along several frames. In a set of loci obtained for consecutive frames, a given point on the ground plane receives one semantic label per frame, and these labels may differ owing to inaccuracies in the semantic segmentation or to the presence of moving objects. A single temporally-smoothed label is obtained as the mode value of this set. Examples of these per-camera smoothed loci are included in the first four columns of Figure 4.
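A minimal sketch of the temporal smoothing, assuming each locus is stored as an integer label image on a common ground-plane grid. The array layout, the function name `temporal_mode` and the tie-breaking rule (lowest class index wins) are assumptions of this example.

```python
import numpy as np

def temporal_mode(loci, num_classes):
    """Per-cell mode of the labels over time.

    loci : integer array of shape (T, H, W), one ground-plane label map
           per frame (assumed layout).
    Returns an (H, W) array with the most frequent label of each cell.
    """
    loci = np.asarray(loci)
    # counts[c, i, j] = number of frames in which cell (i, j) has label c
    counts = np.stack([(loci == c).sum(axis=0) for c in range(num_classes)])
    return counts.argmax(axis=0)  # ties resolved towards the lowest index

# Three frames of a 1x2 grid; a moving object briefly changes each cell.
frames = np.array([[[0, 1]],
                   [[0, 2]],
                   [[3, 2]]])
smoothed = temporal_mode(frames, num_classes=4)
print(smoothed)
```

For the 150 ADE20K classes this loop stays cheap, since each pass is a vectorised comparison over the whole stack of frames.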

We propose to combine these loci to define the AOI. The definition of the AOI is scenario-dependent but can be generalized by defining a set of ground-related semantic classes: floor, grass, pavement, etc. The operational area is obtained as the union of the projected pixels from any camera which are labeled with any class in this set.

An example of a so-obtained AOI is included in the right-most column of Figure 4.

Fig. 4: Temporally-smoothed projected loci for each camera (columns 1 to 4) of the Terrace dataset [10, 11], both in the RGB domain (top) and the semantic labels domain (bottom). The last column depicts, again in both domains, the resulting AOI which, in the example, consists of the combined floor class of the four smoothed loci.

Detection Filtering

Projected detections lying outside the operational area are filtered out and thus discarded from subsequent stages.
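The union of ground-labeled cells and the subsequent filtering can be sketched as follows. The grid discretisation (`origin`, `cell_size`), the class identifiers and both function names are hypothetical choices made for illustration; the paper does not specify them.

```python
import numpy as np

def build_aoi(smoothed_loci, ground_classes):
    """Union, over cameras, of the cells labeled with a ground-related class."""
    aoi = np.zeros(smoothed_loci[0].shape, dtype=bool)
    for locus in smoothed_loci:
        aoi |= np.isin(locus, list(ground_classes))
    return aoi

def filter_detections(points, aoi, origin=(0.0, 0.0), cell_size=1.0):
    """Keep the ground-plane points (X, Y) that fall on an AOI cell."""
    kept = []
    for X, Y in points:
        col = int(np.floor((X - origin[0]) / cell_size))
        row = int(np.floor((Y - origin[1]) / cell_size))
        inside = 0 <= row < aoi.shape[0] and 0 <= col < aoi.shape[1]
        if inside and aoi[row, col]:
            kept.append((X, Y))
    return kept

# Hypothetical class ids: 3 = floor, 9 = grass, 0 = anything else.
locus_cam1 = np.array([[3, 0], [0, 0]])
locus_cam2 = np.array([[0, 0], [0, 9]])
aoi = build_aoi([locus_cam1, locus_cam2], ground_classes={3, 9})
kept = filter_detections([(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (5.0, 5.0)], aoi)
print(aoi)
print(kept)
```

Points that fall outside the grid are treated as outside the AOI, which is a conservative assumption of this sketch.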

III-C Fusion of Multi-Camera Detections

We propose a geometrical approach to combine detections on the ground plane. Every single-camera detection is considered a vertex of a disconnected graph located in the reference plane. Vertices are then joined to generate connected components, each representing a joint 3D global detection. The whole fusion process is summarized in Figure 5. Two vertices or detections are joined only if the following conditions are satisfied:

  1. That vertices in a connected component are close enough. The distance between any two vertices in a connected component shall be smaller than a predefined distance threshold (Figure 5 (a)). This threshold may be varied within a range of values with no influence on the results. We choose its value to: 1) reduce the computational cost of the final stage (see below), assuming that vertices separated by more than this distance do not belong to the same object, and 2) protect against calibration errors, assuming that they are not larger than this distance.

  2. That vertices in a connected component come from different cameras. This condition prevents two different detections from the same camera, which are near in the ground plane, from influencing the final global detection. (Figure 5 (b))

To avoid ambiguities, the creation of connected components is performed in order, according to the spatial position of the detections: those with a smaller norm, i.e. closer to the origin of the ground plane, are combined first.

The outcome of the fusion process is a set of connected components, each containing at most one detection per camera. A component contains fewer detections than there are cameras when a person is occluded or not detected in one or more cameras.

As each connected component is assumed to represent a single person, an initial ground position of the person is obtained by simply computing the arithmetic mean of all the detections in the connected component (Figure 5 (c)).
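One possible reading of the fusion rules is the greedy procedure sketched below: detections are visited by increasing norm and join the first component in which every member is closer than the threshold and comes from a different camera. The greedy assignment, the function name `fuse_detections` and the toy threshold of 0.5 are assumptions of this sketch, not values confirmed by the text.

```python
import numpy as np

def fuse_detections(points, cameras, max_dist):
    """Group ground-plane detections into connected components.

    points   : (N, 2) ground-plane positions.
    cameras  : length-N sequence with the camera id of each detection.
    max_dist : distance threshold between any two members of a component.
    Returns (components, centers): lists of index lists and mean positions.
    """
    points = np.asarray(points, dtype=float)
    order = np.argsort(np.linalg.norm(points, axis=1))  # smaller norm first
    components = []
    for i in order:
        for comp in components:
            close = all(np.linalg.norm(points[i] - points[j]) < max_dist
                        for j in comp)
            new_camera = all(cameras[i] != cameras[j] for j in comp)
            if close and new_camera:
                comp.append(int(i))
                break
        else:  # no compatible component was found: start a new one
            components.append([int(i)])
    centers = [points[c].mean(axis=0) for c in components]
    return components, centers

pts = [(1.0, 1.0), (1.2, 1.0), (1.1, 1.2), (4.0, 4.0), (1.05, 1.0)]
cams = [0, 1, 2, 0, 0]
components, centers = fuse_detections(pts, cams, max_dist=0.5)
print(components)
print(centers[0])
```

In the toy data, detection 4 lies next to detection 0 but comes from the same camera, so it is kept as a separate component instead of biasing the mean position.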

Fig. 5: Fusion of multi-camera detections in the ground plane. (a) The distance threshold, depicted here as circumferences around detections, defines the neighbors of each detection. (b) Connected components are defined for detections (i) whose mutual distance is lower than the threshold and (ii) that are projected from different cameras. Connected components fulfilling (i) but not (ii) are represented by crossed-out dashed lines. (c) The ground-plane detection is obtained via the arithmetic mean of all the detections in a connected component.

III-D Semantic-driven Back-projection

To obtain correctly positioned detections, i.e. visually precise detections, in each camera, ground-plane detections need to be back-projected to each camera and 2D bounding-boxes enclosing pedestrians need to be outlined based on these projections.

The problem of back-projecting 3D detections

Each detected pedestrian is represented by a line segment orthogonal to the ground plane, which extends from the ground-plane detection to a 3D point located at the pedestrian's height above it. Using the camera calibration parameters, the segment can be back-projected onto each camera. This back-projection defines a 2D line segment, which extends between the back-projections of the two 3D endpoints (see Figure 6 (a)).

We propose to create 2D bounding-boxes around these back-projected 2D line segments. To this end, each segment is used as the vertical middle axis of its associated 2D bounding-box. For simplicity, the width of each bounding-box is made proportional to its height. Due to pedestrian self-occlusion, calibration errors and the uncertainty on the pedestrians’ height, this back-projection process results in misaligned bounding-boxes (see Figure 6 (a)), hindering their later use and degrading camera-wise performance.
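The construction of a bounding-box around a back-projected segment can be sketched as follows. The 3x4 projection matrix `P_toy` is a contrived example rather than a calibrated camera, and the width-to-height ratio of 0.4 is a hypothetical value, since the text only states that width is proportional to height.

```python
import numpy as np

def back_project(P, point3d):
    """Project a 3D world point to the image with a 3x4 camera matrix P."""
    q = P @ np.append(point3d, 1.0)
    return q[:2] / q[2]

def box_from_segment(foot, head, aspect=0.4):
    """Box (x, y, w, h) whose vertical middle axis is the 2D segment.

    aspect is an assumed width-to-height ratio.
    """
    height = abs(foot[1] - head[1])
    width = aspect * height
    x_axis = (foot[0] + head[0]) / 2.0
    top = min(foot[1], head[1])
    return (x_axis - width / 2.0, top, width, height)

# Toy camera matrix (hypothetical): x = X, y = 300 - 100 * Z.
P_toy = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, -100.0, 300.0],
                  [0.0, 0.0, 0.0, 1.0]])
ground = np.array([100.0, 5.0, 0.0])        # ground-plane detection
top3d = ground + np.array([0.0, 0.0, 2.0])  # point 2 units above it
foot2d = back_project(P_toy, ground)
head2d = back_project(P_toy, top3d)
box = box_from_segment(foot2d, head2d)
print(box)
```

Any error in the assumed height or in the calibration shifts `head2d` and therefore both the height and the width of the box, which is the misalignment that the subsequent optimization tries to correct.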

To handle this problem, we define an iterative method which aims to globally optimize the alignment between all 3D detections and their respective views or back-projections in all cameras. This method is based on the idea proposed in [2]. While the referenced method is guided by a foreground segmentation, we instead propose to use a cost function driven by the set of pedestrian-labeled pixels in the semantic segmentation (e.g. see the person label in Figure 3). Next we detail the full process for the sake of reproducibility.

Fig. 6: (a) Back-projecting the global segment results in misaligned bounding-boxes due to pedestrian self-occlusion, calibration errors and the uncertainty on the pedestrians’ height. (b) The proposed optimization process results in the best-aligned segments for each camera.

Method Overview

As a 3D detection with a given height inevitably results in misaligned back-projected 2D detections, the proposed method adapts the 3D detection segment to each camera. It generates a set of 3D detection segments, one per camera, for each 3D detection, and iteratively modifies their positions and height to maximize the alignment of the 2D detections with the semantic segmentation masks. All the segments are constrained to have the same final height (as they are all projections of the same pedestrian) and to be located sufficiently close to each other. This process is not performed independently for each 3D detection but jointly and iteratively for all 3D detections. The joint nature of the optimization problem is key, as a region of pedestrian-labeled pixels may contain segmentations from more than one pedestrian.

For each 3D segment, the method starts by initializing the per-camera adapted segments at the first iteration.


Iterative steepest-ascent algorithm

For each 3D segment, consider the set of detections adapted to each camera at a given iteration, and the set of camera-adapted segments for all cameras at that same iteration.

The optimization process aims to find the set of camera-adapted segments that solves a constrained optimization problem: a cost function is maximized subject to the constraints that all the segments of a pedestrian share the same height and that the distance between the 3D projections of a single pedestrian does not exceed a maximum distance. We set this maximum distance to twice the average width of the human body, i.e. 1 meter, to forestall the effect of nearby pedestrians in the image plane. Variations in this value have no significant influence on the results.

The cost function to maximize is based on the alignment of the back-projected bounding boxes with the set of pedestrian-labeled pixels in each camera, and it considers the information from all the cameras. It aggregates a per-pixel loss over the pixels of each camera image plane, with one weight for pedestrian pixels and another for non-pedestrian pixels. The loss of a pixel with respect to a back-projected bounding box depends on the distance from the pixel to the vertical middle axis of that bounding box.

At each iteration, the set of camera-adapted segments is moved in the direction of maximum increase of the cost function, scaled by a gradient step which in our approach decreases with the iteration. The gradient at each iteration is approximated by a forward difference approximation.

The algorithm runs until convergence is reached or until the maximum-distance constraint is violated.
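The iteration scheme (steepest ascent, forward-difference gradient, decreasing step, stop on convergence or constraint violation) can be sketched generically. The concave toy cost below merely stands in for the semantic alignment cost, and the step schedule, tolerances and function names are assumptions of this example.

```python
import numpy as np

def forward_difference_gradient(cost, x, eps=1e-4):
    """Approximate the gradient of cost at x with forward differences."""
    base = cost(x)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (cost(x + step) - base) / eps
    return grad

def steepest_ascent(cost, x0, constraint, step0=0.4, max_iter=200, tol=1e-10):
    """Maximise cost from x0 while constraint(x) holds."""
    x = np.asarray(x0, dtype=float)
    for t in range(max_iter):
        step = step0 / (1.0 + 0.05 * t)  # step decreases with the iteration
        candidate = x + step * forward_difference_gradient(cost, x)
        if not constraint(candidate):    # stop when the constraint is violated
            break
        converged = abs(cost(candidate) - cost(x)) < tol
        x = candidate
        if converged:
            break
    return x

# Toy problem: the unknowns are a 1D offset and a height (hypothetical).
target = np.array([1.0, 2.0])

def toy_cost(x):
    return -np.sum((x - target) ** 2)

start = np.array([0.0, 0.0])

def within_max_distance(x):
    return np.linalg.norm(x - start) < 10.0

result = steepest_ascent(toy_cost, start, within_max_distance)
print(result)
```

Forward differences need one extra cost evaluation per unknown, which is convenient when the cost is computed from label images and has no analytic gradient.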

IV Results

This section addresses the evaluation of the proposed method. To this aim, we first describe the evaluation framework; then, in the first experiment, we measure the performance improvement of each of the method’s stages; and in the second experiment, we finish by comparing the approach with alternative state-of-the-art approaches in classic and recent multi-camera datasets.

IV-A Evaluation Framework


Results are obtained by evaluating the proposed method over five sequences extracted from 4 publicly available multi-camera datasets in which cameras are calibrated and temporally synchronized:

  • EPFL Terrace [10, 11]: Generally used in the state of the art to evaluate multi-camera approaches. It consists of a 5000-frame sequence per camera showing up to eight people walking on a terrace captured by four different cameras. All the cameras record a close-up view of the scene.

  • EPFL RLC [8, 9]: Consists of an indoor sequence of 2000 frames per camera recorded in the EPFL Rolex Learning Center using three static HD cameras with overlapping fields of view. All these cameras provide close-up views of the scene.

  • EPFL Wildtrack [12, 13]: A challenging multi-camera dataset which has been explicitly designed to evaluate deep learning approaches. It has been recorded with 7 HD cameras with overlapping fields of view. Pedestrian annotations for 400 frames are provided, composing the evaluation set used in this paper.

  • PETS 2009 [7]: The most commonly used sequences from this popular benchmark dataset have been chosen.

    • PETS 2009 S2 L1, which contains 795 frames recorded by eight different cameras of a medium-density crowd (in this evaluation, we have selected just 4 of these cameras: view 1 (far-field view) and views 5, 6 and 8 (close-up views)).

    • PETS 2009 City Center (CC), recorded only using two far-field view cameras with around 1 minute of annotated recording (400 frames per camera).

Performance Indicators

To obtain quantitative performance statistics according to an experiment-based evaluation criterion, the following state-of-the-art [28, 29] performance indicators have been selected: Precision (P), Recall (R), F-Score (F-S), Area Under the Curve (AUC), N-MODA (N-A) and N-MODP (N-P). To globally assess performance, a single value for each statistic and each configuration is provided by averaging the per-camera values.
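As a reference for how some of these indicators are computed, the sketch below gives Precision, Recall and F-Score from detection counts, together with N-MODA in its usual CLEAR form with unit costs for misses and false positives (an assumption here, since the text does not spell out the cost weights). The function names and the per-camera counts are hypothetical.

```python
def precision_recall_fscore(tp, fp, fn):
    """Precision, Recall and F-Score from true/false positives and misses."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    fscore = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, fscore

def n_moda(misses, false_positives, ground_truths):
    """N-MODA over a sequence, given per-frame counts (unit costs assumed)."""
    errors = sum(misses) + sum(false_positives)
    return 1.0 - errors / sum(ground_truths)

# Hypothetical counts for two cameras, then the per-camera average.
per_camera = [precision_recall_fscore(8, 2, 2),
              precision_recall_fscore(9, 1, 3)]
mean_fscore = sum(f for _, _, f in per_camera) / len(per_camera)
print(per_camera[0], mean_fscore)
print(n_moda(misses=[1, 0], false_positives=[0, 1], ground_truths=[5, 5]))
```

The averaging at the end mirrors the per-camera averaging used to report a single value per indicator.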

Algorithm                                       | EPFL Terrace        | PETS 2009 S2 L1     | PETS 2009 CC        | EPFL RLC
                                                | AUC  F-S  N-A  N-P  | AUC  F-S  N-A  N-P  | AUC  F-S  N-A  N-P  | AUC  F-S  N-A  N-P
Baseline [1]                                    | 0.82 0.84 0.71 0.74 | 0.90 0.91 0.85 0.76 | 0.90 0.91 0.85 0.76 | 0.77 0.78 0.58 0.69
Baseline + Filtering                            | 0.84 0.85 0.73 0.74 | 0.90 0.91 0.85 0.76 | 0.90 0.91 0.85 0.76 | 0.80 0.82 0.68 0.70
Baseline + Filtering + Fusion + Back-Projection | 0.87 0.90 0.83 0.77 | 0.92 0.93 0.89 0.79 | 0.94 0.94 0.88 0.79 | 0.81 0.82 0.70 0.70
TABLE I: Experiment 1: Stage-wise performance of the proposed method. Bold values indicate best result. Indicators are Area Under the Curve (AUC), F-Score (F-S), N-MODA (N-A) and N-MODP (N-P).

IV-B System Setup

To evaluate the performance of the proposed system via the aforementioned statistics, the whole evaluation has been carried out using the same setup. In the Pedestrian Semantic Filtering stage, all frames in each sequence are used for temporal and spatial semantic aggregation. For the Semantic-driven Back-Projection stage, the initial height estimate has been set to an average pedestrian height. Besides, convergence of the iterative steepest-ascent algorithm has been reached for all the datasets.

IV-C Experiments Overview

Evaluation has been performed carrying out two different experiments:

  • Experiment 1 aims to gauge the impact of the different stages in the proposed approach. To this end, the baseline pedestrian detector Faster-RCNN [1] is compared with two versions of the proposed method:

    1. “Baseline + Filtering” is a simplified version of our method which aims to independently evaluate the effect of the proposed automatic AOI computation.

    2. “Baseline + Filtering + Fusion + Back-Projection” is the full version of the proposed method, which additionally evaluates the fusion and semantic-driven back-projection stages.

    This experiment is conducted on 4 of the sequences: Terrace, PETS 2009 S2 L1, PETS 2009 CC and RLC.

  • Experiment 2 compares the proposed method with several non-deep learning state-of-the-art multi-camera pedestrian detectors on Terrace, PETS 2009 S2 L1, PETS 2009 CC and RLC sequences. Additionally, the method is compared with novel deep-learning methods on the Wildtrack dataset.

Fig. 7: Experiment 1: Automatically obtained AOI (superimposed in green) compared to the AOI manually annotated (red box) by the authors of EPFL Terrace [10, 11] (left), EPFL RLC Dataset [8, 9] (middle) and PETS2009 [7] (right).
Fig. 8: Experiment 1: Stage-wise performance of the proposed method: precision-recall curves for (a) EPFL Terrace, (b) PETS 2009 S2 L1, (c) PETS 2009 CC and (d) EPFL RLC. Colored dots in each curve represent the confidence threshold which yields maximum F-Score.

IV-D Experiment 1

Evaluation Criterion

The availability of bounding-box annotations makes it possible to use the classic performance criterion [3]: a detection is considered a true positive (TP) if it overlaps a ground-truth bounding-box by at least 50%.
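A minimal sketch of the overlap test, assuming that overlap is measured as intersection over union (IoU) between boxes given as `(x, y, w, h)`; both the IoU reading of the 50% criterion and the function names are assumptions of this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(detection, ground_truth, threshold=0.5):
    """A detection counts as TP when its overlap reaches the threshold."""
    return iou(detection, ground_truth) >= threshold

gt = (0, 0, 10, 10)
print(iou((0, 0, 10, 5), gt), is_true_positive((0, 0, 10, 5), gt))
print(iou((5, 5, 10, 10), gt), is_true_positive((5, 5, 10, 10), gt))
```

In a full evaluation each ground-truth box would be matched to at most one detection; that matching step is omitted here.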


Table I agglutinates the method’s performance on a per-stage basis. Results are obtained by thresholding detection’s confidence at the optimal F-Score operation point according to the Precision-Recall curves in Figure 8. Qualitative examples of automatically generated s and algorithm results are depicted in Figure 7 and Figure 9 respectively. A visual example of the limitations of the Semantic-drive back-projection stage is included in Figure 10.


The ablation analysis of Precision-Recall curves from Figure 8 shows that the use of automatically generated s (red curves) tends to increase Precision performance with respect to the baseline (blue curves) due to the false detections filtering process. When combined with the Semantic-driven Back-projection stage (green line), Recall is increased with respect to the baseline due to the creation of new detections, which also may also lead to a slight descent in Precision, as some of these detections are as occluded than there were not included in the ground-truth annotations.

Table I supports these results and shows that filtering out detections using automatically generated AOIs (Baseline + Filtering) improves the baseline performance for datasets where the ground-plane area does not cover the whole image, i.e. datasets containing close-up views of the scene such as EPFL Terrace and RLC. In these datasets, our more precise AOIs reduce the phantom detections produced by the baseline detector. Overall, the proposed AOIs yield relative improvements over the baseline detector in terms of both AUC and N-MODA on the EPFL Terrace dataset, and likewise on the EPFL RLC dataset.

Fig. 9: Experiment 1: Qualitative results on selected frames of the EPFL Terrace, PETS S2 L1, PETS CC and EPFL RLC datasets. From left to right: the first three columns depict the same time frame captured by three of the available cameras, showing colored bounding boxes (one color per pedestrian) corresponding to the final per-camera detections. The rightmost column depicts the obtained detections (one per pedestrian in the scene) on the ground plane, preserving the identifying colors.

Automatically extracted AOIs do not improve the baseline’s performance for datasets in which the ground plane dominates the scene, i.e. those recorded with far-field cameras such as PETS 2009, as no false pedestrians are suppressed. However, as depicted in Figure 7, the automatically obtained AOIs are larger than the original Operational Areas in the datasets and would help to include pedestrians moving off the paths. Furthermore, observe how the proposed generation method also effectively handles multi-class ground partitions as in the PETS 2009 dataset, where the proposed AOI (see Figure 7, right) encompasses the road, grass, pavement and sidewalk classes.

Fig. 10: Experiment 1: Semantic-driven Back-Projection. First row: back-projected bounding boxes at the initial iteration of the optimization algorithm. Global detections obtained by the Multi-camera Detection Fusion algorithm are displaced with respect to the real pedestrians when back-projected to each camera. Second row: the semantic-driven optimization algorithm correctly refines the locations and heights of the bounding boxes in Cameras 2 and 3. However, when semantic pedestrian cues overlap heavily, some bounding boxes might be refined to an incorrect location (Camera 1, green bounding box).
                             EPFL Terrace      PETS S2 L1        PETS CC           EPFL RLC
Algorithm                    F-S   N-A   N-P   F-S   N-A   N-P   F-S   N-A   N-P   F-S   N-A   N-P
POM [10]                     -     0.19  0.56  -     0.65  0.67  -     0.70  0.55  -     -     -
Baseline: Faster-RCNN [1]    0.84  0.71  0.74  0.91  0.85  0.76  0.91  0.85  0.76  0.78  0.58  0.69
MvBN + HAP [2]               -     0.82  0.73  -     0.87  0.76  -     0.87  0.78  -     -     -
Proposed Approach            0.90  0.83  0.77  0.93  0.89  0.79  0.93  0.88  0.79  0.82  0.70  0.70
TABLE II: Experiment 2: Comparison with state-of-the-art methods not based on deep learning. Bold values indicate best results. Indicators are F-Score (F-S), N-MODA (N-A) and N-MODP (N-P).

Table I also shows that our complete method (Baseline + Filtering + Fusion + Back-Projection) notably improves the baseline’s performance, mainly in scenarios with heavy occlusions (EPFL Terrace and EPFL RLC datasets). Specifically, AUC, F-Score and N-MODA all increase in relative terms for both the EPFL Terrace and the EPFL RLC datasets. Overall, performance indicators and qualitative results (see Table I and Figure 9 respectively) support that the proposed multi-camera detection approach is able to cope with partial, severe and complete occlusions by combining detections from all the cameras, guided by a semantic segmentation.

Focusing on the semantic-driven back-projection process, the results in Figure 9 show tightly fitted pedestrian bounding boxes regardless of people’s height, self-occlusions and calibration problems, suggesting that the optimization process automatically adapts the bounding boxes by jointly estimating pedestrian heights and world positions. Results in Table I support this idea: semantic-driven back-projection leads to a higher overlap between detections and ground-truth annotations, with relative improvements in the N-MODP metric with respect to the baseline on EPFL Terrace, on both PETS 2009 S2 L1 and PETS 2009 CC, and on the RLC dataset. However, the optimization cost function is biased towards wider pedestrians, which may lead to wrong relocations of the back-projected bounding boxes (Figure 10 shows an example of this misbehavior in Camera 1).

IV-E Experiment 2

Evaluation Criterion

The same criterion used in Experiment 1 applies to the Terrace, PETS and RLC datasets. In the Wildtrack dataset, as the ground truth is provided via points on the world ground plane (i.e., no bounding boxes are provided), the evaluation criterion is different. Specifically, a detection is considered a TP if it lies within a fixed radius, given in meters, of a ground-truth annotated point. This radius roughly corresponds to the average width of the human body. Due to the absence of bounding boxes, the Semantic-driven Back-projection stage is not included for this dataset (this truncated version of the method is denoted as Proposed Approach*).
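A minimal sketch of this point-based criterion follows. It assumes detections and annotations are (x, y) ground-plane coordinates in meters and that each annotated point is matched at most once, nearest first. The function name is hypothetical and the radius is left as a caller-supplied parameter.

```python
import math


def count_ground_plane_true_positives(detections, ground_truth, radius_m):
    """Count detections lying within radius_m meters of an unmatched annotation."""
    unused = list(ground_truth)
    true_positives = 0
    for det in detections:
        if not unused:
            break
        # Nearest annotation that has not been claimed by an earlier detection.
        nearest = min(unused, key=lambda gt: math.dist(det, gt))
        if math.dist(det, nearest) <= radius_m:
            unused.remove(nearest)
            true_positives += 1
    return true_positives
```

For instance, `count_ground_plane_true_positives([(0.0, 0.0)], [(0.2, 0.0)], radius_m=0.25)` counts one match; the coordinates and radius here are arbitrary illustration values.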

State-of-the-art Algorithms

The following algorithms have been selected to carry out the comparison:

  • POM [10]. This algorithm estimates the marginal probabilities of pedestrian presence at every location inside an AOI. It is based on a preliminary background subtraction stage.

  • POM-CNN [10]. An upgraded version of POM in which the background subtraction stage is performed based on an encoder-decoder CNN architecture.

  • MvBN+HAP [2]. Relies on a multi-view Bayesian network model (MvBN) to obtain pedestrian locations on the ground plane. Detections are then refined by a Height-Adaptive Projection method (HAP) based on an optimization framework similar to the one proposed in this paper, but driven by background-subtraction cues.

  • RCNN-Projected [30]. The bottoms of the bounding boxes obtained through per-camera CNN detectors are projected onto the ground plane, where 3D proximity is used to cluster detections.

  • Deep-Occlusion [8] is a hybrid method which combines a CNN trained on the Wildtrack dataset with a Conditional Random Field (CRF) to incorporate information on the geometry and calibration of the scene.

  • DeepMCD [27] is an end-to-end deep learning approach based on different architectures and training scenarios:

    • Pre-DeepMCD: a GoogLeNet architecture trained on the PETS dataset.

    • Top-DeepMCD: a GoogLeNet architecture trained on the Wildtrack dataset.

    • ResNet-DeepMCD: a ResNet-18 architecture trained on the Wildtrack dataset.

    • DenseNet-DeepMCD: a DenseNet-121 architecture trained on the Wildtrack dataset.
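Several of the methods listed above, RCNN-Projected among them, rely on projecting the bottom of a per-camera bounding box onto the ground plane. A minimal sketch of that step is given below, assuming a known 3x3 image-to-ground homography per camera and image coordinates with y growing downwards. The function names and the matrix in the usage note are hypothetical illustration values.

```python
def box_foot_point(box):
    """Bottom-center of an (x1, y1, x2, y2) box, with y growing downwards."""
    x1, _, x2, y2 = box
    return ((x1 + x2) / 2.0, y2)


def project_to_ground(homography, point):
    """Apply a 3x3 homography (nested lists) to an image point (u, v)."""
    u, v = point
    # Homogeneous product H * [u, v, 1]^T, one row at a time.
    x, y, w = (row[0] * u + row[1] * v + row[2] for row in homography)
    return (x / w, y / w)
```

With a purely scaling homography such as `[[0.01, 0, 0], [0, 0.01, 0], [0, 0, 1]]`, the foot point of the box `(100, 50, 140, 200)` is `(120.0, 200)` and maps to approximately `(1.2, 2.0)` on the ground plane; a real calibration would normally have a non-trivial third row.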

EPFL Wildtrack
Algorithm                  F-Score  N-MODA  N-MODP

Deep-Occlusion [8]         0.86     0.74    0.53
Top-DeepMCD [27]           0.79     0.60    0.64
ResNet-DeepMCD [12]        0.83     0.67    0.64
DenseNet-DeepMCD [12]      0.79     0.63    0.66

Proposed approach*         0.69     0.39    0.55
Pre-DeepMCD [27]           0.51     0.33    0.52
POM-CNN [10]               0.63     0.23    0.30
RCNN-Projected [30]        0.52     0.11    0.18
TABLE III: Experiment 2: Wildtrack Dataset Comparison Results. Bold values indicate best results.


Table II includes performance indicators for the proposed method compared with POM [10] and MvBN+HAP [2] on the Terrace, PETS and RLC sequences (results for the compared methods have been published and are extracted from [2]). Table III compares the performance of the proposed approach against deep-learning methods, some trained with a subset of the Wildtrack dataset (which we denote as Trained) and others trained with data from other datasets (which we denote as Non-Trained). Performance indicators for these methods are extracted from [12]. In addition, qualitative results for the Wildtrack dataset are presented in Figure 11, including the obtained non-adapted detections in camera frames, the global detections on the ground plane and the automatically computed AOI.

Fig. 11: Experiment 2: Qualitative results from a sample frame of the Wildtrack dataset. For representation purposes, camera frames depict bounding boxes adapted via the semantic-driven back-projection stage, although this stage is not used for evaluation on the Wildtrack dataset. In addition, the figure depicts the automatically obtained AOI superimposed in green and the manually annotated AOI proposed by the authors [13, 12] (area delimited by red lines). The last image represents the cameras’ positions, the obtained detections, and the authors’ ground truth and AOI over the ground plane. Pedestrians are identified with different colors (one per detection) across views and ground plane.


Results in Table II show that our method (Baseline + Filtering + Fusion + Back-Projection) outperforms the baseline algorithm, the MvBN + HAP method and the POM method. Analyzing these results, we observe that the proposed method yields a higher recall, i.e. it increases the number of correct detections by coping with occlusions and pedestrian detector errors, while keeping a similar precision, i.e. without increasing the number of false positives. The proposed method also obtains better results in terms of N-MODA, which measures detection accuracy: it improves on the best-performing competing method (MvBN + HAP [2]) for the EPFL Terrace, the PETS 2009 S2 L1 and the PETS 2009 CC datasets. Besides, N-MODP results are better than those obtained by the HAP method [2], which proposed the back-projection alignment that inspired ours. This indicates that our use of semantic segmentation masks instead of foreground masks benefits the optimization process. The relative increments in N-MODP for EPFL Terrace, PETS 2009 S2 L1 and PETS 2009 CC support this assumption.
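As a worked example of how such relative improvements are obtained, the sketch below applies the usual formula, 100 · (new − reference) / reference, to the N-MODA entries of Table II. Because it uses the rounded table values, the percentages it prints are only indicative and need not coincide with figures computed from unrounded results. The function name is hypothetical.

```python
def relative_improvement(new, reference):
    """Relative change of `new` with respect to `reference`, in percent."""
    return 100.0 * (new - reference) / reference


# N-MODA values read from Table II: (MvBN + HAP [2], Proposed Approach).
n_moda = {
    "EPFL Terrace": (0.82, 0.83),
    "PETS 2009 S2 L1": (0.87, 0.89),
    "PETS 2009 CC": (0.87, 0.88),
}

for dataset, (reference, proposed) in n_moda.items():
    print(f"{dataset}: {relative_improvement(proposed, reference):+.2f}%")
```

The same function applies unchanged to the N-MODP or F-Score columns.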

Finally, results on the Wildtrack dataset (Table III) indicate that the proposed method also outperforms deep-learning approaches that have not been specifically adapted to the Wildtrack dataset. Our method improves on Pre-DeepMCD, the second-ranked method among them, which is an end-to-end deep learning architecture trained on the PETS dataset. However, algorithms trained on the Wildtrack dataset, i.e., DenseNet-DeepMCD, ResNet-DeepMCD, Top-DeepMCD, and Deep-Occlusion, outperform the proposed method, in our opinion for two main reasons:

First, they learn their occlusion modeling and their probabilistic ground-occupancy inference models on the Wildtrack scenario. This training is highly effective, as indicated by the increase in performance obtained with the same architecture when it is tuned for the Wildtrack scenario (compare the results of Pre-DeepMCD and Top-DeepMCD). However, this training requires human-annotated detections in each scenario, hindering the scalability of these solutions, whereas the proposed approach is identical (i.e., not adapted) for every experiment reported in this paper.

Second, the qualitative results presented in Figure 11 suggest that the results in Table III are highly biased by the authors’ manually annotated area. The proposed method obtains a broader AOI (Figure 11, green area) than the one provided by the authors (Figure 11, red area). Although the automatically obtained AOI seems to fit the ground floor of the scene better than the manually annotated one, the performance of our method decreases because ground-truth data is reported only on the manually annotated area. Thereby, our true positive detections outside this area count as false positives in the statistics (see Figure 11, cameras 1 and 4).

On average, and contrary to state-of-the-art approaches, the proposed method adapts to different target scenarios without needing a separate training stage for each situation or a manually annotated area of interest.

V Conclusions

This paper describes a novel approach to perform pedestrian detection in a scenario recorded by multiple cameras. The proposed strategies for the temporal and spatial aggregation of semantic cues, along with homography projections, are used to obtain an estimation of the ground plane. Through this process, a broader, accurate and role-annotated Area of Interest (AOI) is automatically defined. Per-camera detections, obtained by a state-of-the-art detector, are projected to the reference plane, and those lying outside the obtained AOI are filtered out. A fusion approach based on creating connected components on a graph representation of the detections is used to fuse per-camera detections, yielding global pedestrian detections. Then, a semantic-driven back-projection method handles occlusions and uses semantic cues to globally refine the location and size of the back-projected detections by aggregating information from all the cameras. Results on a broad set of sequences confirm that the method outperforms every compared method not based on deep learning and also every deep-learning method not trained on the target dataset. The proposed method performs close to scenario-tailored methods, but without their training stage, which severely hinders their direct use in new scenarios. Overall, results suggest that the proposed approach is able to obtain accurate, robust, tight-to-object and generic pedestrian detection in varied scenarios, including crowded ones.
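The connected-components fusion summarized above can be illustrated with a small sketch. The edge rule (two detections are linked when they come from different cameras and lie within a distance threshold on the ground plane) and the averaging of each component into one global position are assumptions made for illustration, not necessarily the rules used by the proposed method. The function name is hypothetical.

```python
import math


def fuse_detections(detections, max_distance):
    """Group ground-plane detections into connected components.

    detections: list of (camera_id, x, y) tuples on the ground plane.
    Returns one averaged (x, y) position per component.
    """
    parent = list(range(len(detections)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (cam_i, xi, yi) in enumerate(detections):
        for j in range(i + 1, len(detections)):
            cam_j, xj, yj = detections[j]
            # Assumed edge rule: different cameras and close on the ground plane.
            if cam_i != cam_j and math.hypot(xi - xj, yi - yj) <= max_distance:
                parent[find(i)] = find(j)

    groups = {}
    for i, (_, x, y) in enumerate(detections):
        groups.setdefault(find(i), []).append((x, y))
    return [
        (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
        for pts in groups.values()
    ]
```

Detections seen by a single camera remain as one-element components, so a pedestrian occluded in the other views is still reported.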


This study has been partially supported by the Spanish Government through its TEC2014-53176-R HA-Video project.


  • [1] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NIPS), 2015.
  • [2] P. Peng, Y. Tian, Y. Wang, J. Li, and T. Huang, “Robust multiple cameras pedestrian detection with multi-view bayesian network,” Pattern Recognition, vol. 48, no. 5, pp. 1760–1772, 2015.
  • [3] Á. García-Martín and J. M. Martínez, “People detection in surveillance: classification and evaluation,” IET Computer Vision, vol. 9, no. 5, pp. 779–788, 2015.
  • [4] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 4, pp. 743–761, 2012.
  • [5] R. Hartley and A. Zisserman, Multiple view geometry in computer vision.   Cambridge university press, 2003.
  • [6] A. Lopez-Cifuentes, M. Escudero, and J. Bescos, “Automatic semantic parsing of the ground-plane in scenarios recorded with multiple moving cameras,” IEEE Signal Processing Letters, vol. 25, no. 10, pp. 1495–1499, 2018.
  • [7] J. Ferryman, J. L. Crowley, and A. Shahrokni. (2018) PETS 2009 dataset. [Online]. Available:
  • [8] P. Baqué, F. Fleuret, and P. Fua, “Deep occlusion reasoning for multi-camera multi-target detection,” arXiv preprint arXiv:1704.05775, 2017.
  • [9] T. Chavdarova and F. Fleuret. (2018) EPFL RLC dataset. [Online]. Available:
  • [10] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, “Multicamera people tracking with a probabilistic occupancy map,” IEEE transactions on pattern analysis and machine intelligence, vol. 30, no. 2, pp. 267–282, 2008.
  • [11] F. Fleuret, J. Berclaz, and R. Lengagne. (2018) EPFL Terrace dataset. [Online]. Available:
  • [12] T. Chavdarova, P. Baqué, S. Bouquet, A. Maksai, C. Jose, T. Bagautdinov, L. Lettry, P. Fua, L. Van Gool, and F. Fleuret, “WILDTRACK: A multi-camera HD dataset for dense unscripted pedestrian detection,” 2018.
  • [13] T. Chavdarova, P. Baqué, S. Bouquet, A. Maksai, C. Jose, T. Bagautdinov, L. Lettry, P. Fua, L. Van Gool, and F. Fleuret. (2018) Wildtrack dataset. [Online]. Available:
  • [14] Á. Utasi and C. Benedek, “A bayesian approach on people localization in multicamera systems,” IEEE transactions on circuits and systems for video technology, vol. 23, no. 1, pp. 105–115, 2013.
  • [15] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” arXiv preprint arXiv:1702.08502, 2017.
  • [16] Z. Wu, C. Shen, and A. v. d. Hengel, “Wider or deeper: Revisiting the resnet model for visual recognition,” arXiv preprint arXiv:1611.10080, 2016.
  • [17] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” arXiv preprint arXiv:1612.01105, 2016.
  • [18] P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1532–1545, 2014.
  • [19] H. Cho, P. E. Rybski, A. Bar-Hillel, and W. Zhang, “Real-time pedestrian detection with deformable part models,” in Intelligent Vehicles Symposium (IV), 2012 IEEE.   IEEE, 2012, pp. 1035–1042.
  • [20] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
  • [21] A. Alahi, L. Jacques, Y. Boursier, and P. Vandergheynst, “Sparsity driven people localization with a heterogeneous network of cameras,” Journal of Mathematical Imaging and Vision, vol. 41, no. 1, pp. 39–58, 2011.
  • [22] K. Kim and L. S. Davis, “Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering,” in European Conference on Computer Vision.   Springer, 2006, pp. 98–109.
  • [23] J.-S. Franco and E. Boyer, “Fusion of multiview silhouette cues using a space occupancy grid,” in Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, vol. 2.   IEEE, 2005, pp. 1747–1753.
  • [24] D. Delannay, N. Danhier, and C. De Vleeschouwer, “Detection and recognition of sports (wo) men from multiple views,” in Distributed Smart Cameras, 2009. ICDSC 2009. Third ACM/IEEE International Conference on.   IEEE, 2009, pp. 1–7.
  • [25] S. M. Khan and M. Shah, “Tracking multiple occluding people by localizing on multiple scene planes,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 3, pp. 505–519, 2009.
  • [26] H. Aliakbarpour, V. B. S. Prasath, K. Palaniappan, G. Seetharaman, and J. Dias, “Heterogeneous multi-view information fusion: Review of 3-d reconstruction methods and a new registration with uncertainty modeling,” vol. 4, pp. 8264–8285, 12 2016.
  • [27] T. Chavdarova and F. Fleuret, “Deep multi-camera people detection,” arXiv preprint arXiv:1702.04593, 2017.
  • [28] P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: A benchmark,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on.   IEEE, 2009, pp. 304–311.
  • [29] R. Stiefelhagen, K. Bernardin, R. Bowers, J. Garofolo, D. Mostefa, and P. Soundararajan, “The clear 2006 evaluation,” in International Evaluation Workshop on Classification of Events, Activities and Relationships.   Springer, 2006, pp. 1–44.
  • [30] Y. Xu, X. Liu, Y. Liu, and S.-C. Zhu, “Multi-view people tracking via hierarchical trajectory composition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4256–4265.