Pedestrian detection through computer vision is a building block for many smart-city applications, such as surveillance of sensitive areas, personal safety, and the monitoring and control of pedestrian flow. Recently, there has been increasing interest in deep learning architectures for this task. A critical objective of these algorithms is to generalize the knowledge gained during training to new scenarios with varying characteristics, and a suitably labeled dataset is fundamental to achieving this goal. The main problem is that manually annotating a dataset is time-consuming and requires considerable human effort. For this reason, in this work we introduce ViPeD (Virtual Pedestrian Dataset), a new synthetically generated set of images collected from a realistic 3D video game, where the labels can be generated automatically by exploiting 2D pedestrian positions extracted from the graphics engine. We use this new synthetic dataset to train a state-of-the-art, computationally efficient Convolutional Neural Network (CNN) that is ready to be installed in smart low-power devices, such as smart cameras. We address the problem of domain adaptation from the virtual world to the real one by fine-tuning the CNN using the synthetic data and also by exploiting a mixed-batch supervised training approach. Extensive experimentation carried out on different real-world datasets shows very competitive results compared to other methods presented in the literature in which the algorithms are trained using real-world data.
When it comes to smart cities, it is impossible not to consider video surveillance, which is becoming a key technology for a myriad of applications ranging from security to the well-being of people. In this context, pedestrian detection, in particular, is a fundamental aspect of smart city applications. These smart city applications have two conflicting requirements: to respond in real-time and to reduce false alarms as much as possible.
Deep learning, and in particular Convolutional Neural Networks (CNNs), offer excellent solutions to these conflicting requirements provided that we feed the neural network with a sufficient amount of data during the training phase. However, the data must be quite diversified to ensure that the CNNs can generalize and adapt to different scenarios having different characteristics, like different perspectives, illuminations, and object scales. Current state-of-the-art object detectors are powered by CNNs, since they can automatically learn features characterizing the objects by themselves; these solutions outperformed approaches relying instead on hand-crafted features.
This aspect becomes even more crucial in smart city applications, where the smart devices that are typically used should be easy to install and deploy, without the need for an early tuning phase. Therefore, a key point for the training of cutting-edge CNNs is the availability of large sets of labeled training data that cover as much as possible the differences between the various scenarios. Although there are some large annotated generic datasets, such as ImageNet Deng2009 and MS COCO coco, annotating the images is a very time-consuming operation, since it requires great human effort, and it is error-prone. Furthermore, these hand-labeled datasets are sometimes not large or diverse enough to learn general models and, as a consequence, they lead to poor cross-dataset performance, which restricts real-world utility. Finally, it should also be noted that it is sometimes problematic to create a training/testing dataset with specific characteristics.
To this end, one of the contributions of this work is to provide a suitable dataset, collecting images from virtual world environments that mimic as much as possible all the characteristics of our target real-world scenarios. In particular, we introduce a new dataset named ViPeD (Virtual Pedestrian Dataset), a huge collection of images taken from the highly photo-realistic video game GTA V (Grand Theft Auto V) developed by Rockstar North, which extends the JTA (Joint Track Auto) dataset presented in Fabbri2018. We demonstrate that by using ViPeD during the training phase we can improve performance and achieve competitive results compared to the state-of-the-art approaches in the pedestrian detection task.
In this work, we extend our previous contribution amato2019viped presented at the International Conference of Image Analysis and Processing (ICIAP) 2019. In particular, we experimented with another state-of-the-art object detector, Faster-RCNN Ren2015, and we employed an extended set of real-world datasets for improving the baselines and for better validating our approach. Furthermore, other than using a simple fine-tuning methodology as in amato2019viped, we try also to perform domain-adaptation using a mixed-batch supervised method.
In the end, as in amato2019viped, we adapted the detector on specific real-world scenarios, specifically on the MOT detection benchmarks (MOT17Det and MOT19Det) MOT17; MOT19. They are real-world datasets suited for pedestrian detection. This latest experiment is intended to measure the ability of our model to transfer knowledge from our virtual-world trained model to some crowded real-world scenarios.
To summarize, in this work we propose a CNN-based system able to detect pedestrians for surveillance smart cameras. We train the detector using a new dataset collected using images from a realistic video game, and we take advantage of the graphics engine for extracting the annotations without any human intervention. This paper is an extension of our previous work amato2019viped. We extend it with the following:
we experiment with another state-of-the-art object detector, Faster-RCNN Ren2015;
we add a new set of experiments for evaluating the generalization capabilities of our training procedure;
we test a mixed-batch supervised domain-adaptation approach, that we compare with the basic fine-tuning methodology.
The full code for replicating our experiments, together with the scripts that generate the ViPeD dataset from the JTA (Joint Track Auto) dataset Fabbri2018, is accessible through our project web page (https://ciampluca.github.io/viped/).
In this section, we review the most relevant works in object and pedestrian detection. We also analyze previous studies on using synthetic datasets as training sets.
Pedestrian detection is highly related to object detection. It deals with recognizing the specific class of pedestrians, usually walking in urban environments. We can subdivide approaches for the pedestrian detection problem into two main research areas. The first class of detectors is based on handcrafted features, such as ICF (Integral Channel Features) 10yearspedestrian; Zhang2014; Zhang2015; Zhang2016; Nam2014. Those methods can usually rely on higher computational efficiency, at the cost of lower accuracy. On the other hand, deep neural network approaches have been explored. Tian2015; Yang2016; Cai2016; Sermanet2013 proposed some modifications around the standard CNN network LeCun1998 to detect pedestrians, even accounting for different scales.
Many datasets are available for pedestrian detection. Caltech Dollar2012, INRIA Dalal2005, MOT17Det MOT17, MOT19Det MOT19, and CityPersons citypersons are among the most important ones. Since they were collected in different living scenarios, they are intrinsically very heterogeneous datasets. Some of them Dalal2005; Dollar2012 were specifically collected for detecting pedestrians in self-driving contexts. Our interest, however, is mostly concentrated on video-surveillance tasks and, in this scenario, the recently introduced MOT17Det and MOT19Det datasets have proved to be challenging enough due to the high variability of the video subsets.
With the need for huge amounts of labeled data, generated datasets have recently gained considerable interest. Kaneva2011; Marin2010 have studied the possibility of learning features from synthetic data, validating them on real scenarios. Unlike our work, however, they did not explore deep learning approaches. Vazquez2012; Vazquez2014 focused their attention on the possibility of performing domain adaptation to map virtual features onto real ones. Authors in Fabbri2018 created a dataset taking images from the highly photo-realistic video game GTA V and demonstrated that it is possible to reach excellent results on tasks such as people tracking and pose estimation when validating on real data.
To the best of our knowledge, Roberson2016 and bochinski2016training are the works closest to our setup. In particular, Roberson2016 also used GTA V as the virtual world, but, unlike our method, they concentrated on vehicle detection.
Instead, bochinski2016training used a synthetically generated dataset to train a simple CNN to detect objects belonging to various classes in a video. The convolutional network dealt only with the classification, while the detection of objects relied on a background subtraction algorithm based on Gaussian mixture models (GMMs). The real-world performance was evaluated on two standard pedestrian detection datasets, and one of these (MOTChallenge 2015 mot2015) is an older version of the dataset we used to carry out our experimentation.
In this section, we describe the datasets exploited in this work. First, we introduce ViPeD (Virtual Pedestrian Dataset), a new synthetically generated collection of images used for training the network. Then we outline four real-world datasets that we used for validating our approach: MOT17Det MOT17, MOT19Det MOT19, CityPersons citypersons, and COCOPersons.
As mentioned above, CNNs need large annotated datasets during the training phase to learn models robust to different scenarios, and creating the annotations is a very time-consuming operation that requires a great human effort.
ViPeD is a huge collection of images taken from the highly photo-realistic video game GTA V developed by Rockstar North. This newly introduced dataset extends the JTA (Joint Track Auto) dataset presented in Fabbri2018. Since we are dealing with images collected from a virtual world, we can extract pedestrian bounding boxes for free and without the manual human effort, exploiting 2D pedestrian positions extracted from the video card. The dataset includes a total of about 500K images, extracted from 512 full-HD videos (256 for training, 128 for validating and 128 for testing) of different urban scenarios.
In the following, we report some details on the construction of the bounding boxes and on the data augmentation procedure that we used to extend the JTA dataset for the pedestrian detection task.
Since the JTA dataset is specifically designed for pedestrian pose estimation and tracking, the provided annotations are not directly suitable for the pedestrian detection task. In particular, the annotations included in JTA are related to the joints of the human skeletons present in the scene (Fig. 1a). What we need for our task, instead, are the coordinates of the bounding boxes surrounding each pedestrian instance.
Bounding box estimation can be addressed using different approaches. The GTA graphic engine is not publicly available, so it is not easy to extract the detailed masks around each pedestrian instance; Roberson2016 overcame this issue by extracting semantic masks and separating the instances by exploiting depth information. Instead, our approach uses the skeletons annotations already derived by the JTA team to reconstruct the precise bounding boxes. This seems to be a more reliable solution than the depth separation approach, especially when instances are densely distributed, as in the case of crowded pedestrian scenarios.
The very basic setup consists of drawing the smallest bounding box that encloses all the skeleton joints. The main issue with this simple approach is that each bounding box entirely contains the skeleton, but not the pedestrian mesh. Indeed, we can notice that the mesh is always larger than the skeleton (Fig. 1b). We can solve this problem by estimating a pad for the skeleton bounding box, exploiting another piece of information produced by the GTA graphics engine and already present in JTA, i.e., the distance of all the pedestrians in the scene from the camera.
In particular, the height of the mesh, denoted as $h_m$, can be estimated from the height of the skeleton $h_s$ by means of the formula:

$$h_m = h_s + \frac{\alpha}{z}$$

where $z$ is the distance of the pedestrian center of mass from the camera, and $\alpha$ is a parameter that depends on the camera projection matrix. Since we do not have access to the camera parameters, $\alpha$ is unknown.
Given that $z$ is already available for every pedestrian, we estimate the parameter $\alpha$ by manually annotating 30 random pedestrians, thus obtaining for them the correct value of $h_m$. At this point, we can perform linear regression on the parameter $\alpha$ to find the best fit.
Then, we estimate the mesh's width $w_m$. Unlike the height, the width is strongly linked to the specific pedestrian pose, so it is difficult to estimate it having access only to the camera distance information. For this reason, we simply estimate $w_m$ directly from $h_m$, assuming no changes in the aspect ratio between the original and adjusted bounding boxes:

$$w_m = h_m \, r$$

where $r = w_s / h_s$ is the aspect ratio of the skeleton bounding box, with $w_s$ denoting its width. Examples of final estimated bounding boxes are shown in Fig. 1b.
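The bounding-box construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' script: it assumes a padding model of the form h_m = h_s + alpha / z, assumes that the pad is applied symmetrically around the center of the skeleton box, and uses hypothetical function names (`skeleton_box`, `fit_alpha`, `mesh_box`).

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def skeleton_box(joints: Sequence[Tuple[float, float]]) -> Box:
    """Smallest axis-aligned box enclosing all 2D skeleton joints."""
    xs = [x for x, _ in joints]
    ys = [y for _, y in joints]
    return min(xs), min(ys), max(xs), max(ys)


def fit_alpha(h_skel: Sequence[float],
              h_mesh: Sequence[float],
              dist: Sequence[float]) -> float:
    """Least-squares fit of alpha in the assumed model h_mesh = h_skel + alpha / dist."""
    # With u = 1 / z and y = h_mesh - h_skel the model reads y = alpha * u,
    # whose least-squares solution is alpha = sum(u * y) / sum(u * u).
    u = [1.0 / z for z in dist]
    y = [hm - hs for hm, hs in zip(h_mesh, h_skel)]
    return sum(ui * yi for ui, yi in zip(u, y)) / sum(ui * ui for ui in u)


def mesh_box(joints: Sequence[Tuple[float, float]],
             distance: float,
             alpha: float) -> Box:
    """Enlarge the skeleton box to the estimated mesh size, keeping its aspect ratio."""
    x0, y0, x1, y1 = skeleton_box(joints)
    w_s, h_s = x1 - x0, y1 - y0          # assumes a non-degenerate skeleton (h_s > 0)
    h_m = h_s + alpha / distance         # padded height
    w_m = h_m * (w_s / h_s)              # same aspect ratio as the skeleton box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    # Assumption: the pad is distributed symmetrically around the box center.
    return cx - w_m / 2, cy - h_m / 2, cx + w_m / 2, cy + h_m / 2
```

In this sketch, `fit_alpha` would be called once on the small set of manually annotated pedestrians, and the resulting value reused in `mesh_box` for every pedestrian in the dataset.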
Finally, we perform a global analysis of these new annotations. As we can see in Fig. 2, the dataset contains annotations of pedestrians farther than 30-40 meters from the camera. However, human annotators tend to avoid annotating objects farther than this distance. We estimate this limit by measuring the height of the smallest bounding boxes in the human-annotated MOT17Det dataset MOT17 and finding at what distance from the camera the bounding boxes in our dataset reach this human-limit size. Therefore, to obtain annotations comparable to real-world human-annotated ones, we prune all the pedestrian annotations farther than 40 meters from the camera.
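As a minimal sketch of the pruning step, assuming each annotation is a dictionary with a hypothetical `distance` key holding the camera distance in meters (the actual JTA/ViPeD annotation layout may differ):

```python
from typing import Dict, List


def prune_far_annotations(annotations: List[Dict],
                          max_distance_m: float = 40.0) -> List[Dict]:
    """Keep only pedestrians no farther than max_distance_m from the camera."""
    # "distance" is a hypothetical field name used for illustration only.
    return [ann for ann in annotations if ann["distance"] <= max_distance_m]
```

The 40-meter default mirrors the threshold stated in the text; a different target dataset might call for a different value.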
In Fig. 3, we report some examples of images of the ViPeD dataset together with the sanitized bounding boxes.
Synthetic datasets should contain scenarios as close as possible to real-world ones. Even though images grabbed from the GTA game are already very realistic, some details are missing. In particular, images grabbed from the game are very sharp, edges are very pronounced, and common lens effects are not present. In light of this, we also prepare a more realistic version of the original images, adding effects such as radial blur, Gaussian blur, and bloom, and adjusting the exposure/contrast. Parameters for these filters are randomly sampled from a uniform distribution.
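The random choice of filter parameters can be illustrated as follows. The parameter names and ranges below are hypothetical, since the text does not report the values actually used, and the exposure/contrast function is just one common way to define such an adjustment on intensities normalized to [0, 1].

```python
import random
from typing import Dict

# Hypothetical ranges, chosen only for illustration.
FILTER_RANGES = {
    "radial_blur_strength": (0.0, 0.05),
    "gaussian_blur_sigma": (0.0, 1.5),
    "bloom_intensity": (0.0, 0.3),
    "exposure_gain": (0.8, 1.2),
    "contrast": (0.8, 1.2),
}


def sample_filter_params(rng: random.Random) -> Dict[str, float]:
    """Draw one value per filter parameter from a uniform distribution."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in FILTER_RANGES.items()}


def adjust_exposure_contrast(pixel: float, gain: float, contrast: float) -> float:
    """Apply an exposure gain, then stretch contrast around mid-gray, clamping to [0, 1]."""
    value = (pixel * gain - 0.5) * contrast + 0.5
    return min(1.0, max(0.0, value))
```

Passing an explicitly seeded `random.Random` instance keeps the augmentation reproducible across runs.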
In the following, we report some details about the real-world datasets we employ for evaluating our approach. MOT17Det MOT17 and MOT19Det MOT19 are two real-world pedestrian-detection benchmarks for surveillance-based applications. CityPersons citypersons is a real-world dataset for pedestrian detection more focused on self-driving applications. Finally, COCOPersons is a split of the MS-COCO dataset coco comprising images collected in general contexts.
The MOT17Det MOT17 and MOT19Det MOT19 datasets are recently introduced benchmarks for surveillance-based applications (the latter was presented at the Computer Vision and Pattern Recognition Conference (CVPR) 2019). They comprise a collection of challenging images for pedestrian detection taken from multiple sequences with various crowded scenarios having different viewpoints, weather conditions, and camera motions. The annotations for all the sequences are generated by human annotators from scratch, following a specific protocol described in their papers. Training images of the MOT17Det collection are taken from sequences 2, 4, 5, 9, 10, 11 and 13 (for a total of 5,316 images), while the MOT19Det training set comprises sequences 1, 2, 3 and 5 (for a total of 8,931 images). Test images for both datasets are taken from the remaining sequences (for a total of 5,919 images in MOT17Det and 4,479 images in MOT19Det). It should be noted that the authors released only the ground-truth annotations belonging to the training subsets. The performance metrics concerning the test subsets are instead available only by submitting results to the MOT Challenge website (https://motchallenge.net). The main peculiarity of MOT19Det compared to MOT17Det is the massive crowding of the collected scenarios.
CityPersons dataset citypersons is a recent collection of images of interest for the pedestrian detection community. It consists of a large and diverse set of stereo video sequences recorded in streets from different cities in Germany and neighboring countries. In particular, authors provide 5,000 images from 27 cities labeled with bounding boxes and divided across train/validation/test subsets. This dataset is more focused on self-driving applications, and images are collected from a moving car.
COCOPersons dataset is a split of the popular COCO dataset coco comprising images collected in general contexts belonging to 80 categories. We filter these images considering only the ones belonging to the persons category. Hence, we obtain a new dataset of about 66,000 images containing at least one pedestrian instance.
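A filtering step of this kind can be sketched directly on a COCO-style annotation dictionary (with the standard `images`, `annotations`, and `categories` lists). The function name is hypothetical, and this is not necessarily how the authors built their split.

```python
from typing import Dict

PERSON_CATEGORY_ID = 1  # id of the "person" category in the COCO annotations


def filter_person_images(coco: Dict) -> Dict:
    """Keep only person annotations and the images that contain at least one of them."""
    person_anns = [ann for ann in coco["annotations"]
                   if ann["category_id"] == PERSON_CATEGORY_ID]
    keep_ids = {ann["image_id"] for ann in person_anns}
    return {
        "images": [img for img in coco["images"] if img["id"] in keep_ids],
        "annotations": person_anns,
        "categories": [cat for cat in coco["categories"]
                       if cat["id"] == PERSON_CATEGORY_ID],
    }
```

The input would typically be obtained by loading one of the COCO instance annotation files with `json.load`.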
Differently from amato2019viped, we use Faster-RCNN Ren2015 as an object detector, exploiting the TorchVision v0.3 implementation provided with the PyTorch deep-learning library. Unlike YOLOv3 Redmon2018, Faster-RCNN is a two-stage detector that employs a Region Proposal Network (RPN) to suggest interesting regions in the image to attend to. Then, features extracted from each proposed region contribute to the final object classification. For the detector backbone, we prefer the ResNet-50 FPN over the ResNet-101 FPN, since it gives satisfactory performance when also taking into account the computational resources and the time required during the training phase.
As a starting point, we consider Faster-RCNN pre-trained on the COCO dataset coco, a large dataset composed of images describing complex everyday scenes of common objects in their natural context, categorized in 80 different categories. Since this network is a generic object detector, we substitute the last layers of the detector to recognize and localize object instances belonging only to a specific category - i.e., the pedestrian category in our case.
In amato2019viped, domain adaptation between virtual and real scenarios is simply carried out by fine-tuning the pre-trained Faster-RCNN architecture. In this work, we also experiment with another type of supervised domain adaptation approach, called Balanced Gradient Contribution (BGC) ros2016synthia; ros2016bgc. It consists of mixing into the same mini-batch images from both real and virtual domains. As explained in ros2016bgc, we use the real-world data as a regularization term over the synthetic data training loss. For this reason, the mini-batch is filled mostly with synthetic images, while the real-world ones are used to slightly constrain the gradients to not overfit on synthetic data.
Our entire approach is developed under the guidance of two different use-cases:
a general-purpose use case where we are interested in obtaining a good overall detector, able to generalize to different scenarios, using the available synthetic data;
a more specific use-case, where we want to maximize the performances on a particular dataset by fine-tuning the model previously trained with the synthetic data.
We explain these two scenarios in detail in the following section.
We evaluate the detection performance using standard mean Average Precision (mAP) metrics. In particular, we rely on the COCO mAP and the MOT AP metrics. In all the experiments, we fix the IoU threshold to 0.5, which is a widely used choice. Therefore, the mAP is computed varying only the detection confidence threshold. We feed into the evaluators all the detector proposals having detection confidence greater than 0.05.
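To make the evaluation protocol concrete, the following simplified sketch computes an average precision for a single image at an IoU threshold of 0.5, discarding detections with confidence not greater than 0.05. It is not the COCO or MOT evaluator: it uses greedy matching and a plain, non-interpolated area under the precision-recall curve, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections: Sequence[Tuple[float, Box]],
                      ground_truth: List[Box],
                      iou_thr: float = 0.5,
                      min_score: float = 0.05) -> float:
    """Non-interpolated AP for one image; detections are (score, box) pairs."""
    if not ground_truth:
        return 0.0
    # Keep confident detections and process them from highest to lowest score.
    dets = sorted((d for d in detections if d[0] > min_score),
                  key=lambda d: d[0], reverse=True)
    matched = set()
    ap, true_pos, prev_recall = 0.0, 0, 0.0
    for rank, (_, box) in enumerate(dets, start=1):
        # Greedily match the detection to the best still-unmatched ground-truth box.
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(ground_truth):
            if j not in matched:
                overlap = iou(box, gt)
                if overlap > best_iou:
                    best_iou, best_j = overlap, j
        if best_j is not None and best_iou >= iou_thr:
            matched.add(best_j)
            true_pos += 1
            recall = true_pos / len(ground_truth)
            ap += (true_pos / rank) * (recall - prev_recall)  # precision * recall step
            prev_recall = recall
    return ap
```

Because only the confidence threshold varies while the IoU threshold stays fixed, each detection is labeled as a true or false positive once, and the curve is traced by sweeping the ranked list.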
We evaluate our solution performing experiments separately for the two different use-cases introduced in Section 4 and detailed below.
The first use-case consists of a domain-agnostic training of the detector by exploiting the synthetic data so that the same model can handle diverse real-world scenarios while keeping good performances on all of them. To achieve this, we rely on the heavy amount of synthetic data available with ViPeD, captured in different urban scenarios, and under different lighting and weather conditions. We experiment with two different kinds of supervised domain-adaptation approaches. The first method, already used in amato2019viped, is a simple fine-tuning of the COCO pre-trained model on ViPeD. The second one, instead, is built upon the Balanced Gradient Contribution (BGC) framework by ros2016synthia; ros2016bgc and consists in mixing into the same mini-batch images from both real and virtual domains, as explained in Section 4.
First of all, we obtain a baseline for this scenario using the detector trained only on the real-world general-purpose COCO dataset. We evaluate this initial model by testing it on all the remaining real-world datasets, i.e., MOT17Det, MOT19Det and CityPersons, considering only the detections belonging to the person category.
Other significant baselines are obtained by fine-tuning the COCO pre-trained model with the other real-world datasets.
Then, as in amato2019viped, we fine-tune this initial COCO pre-trained detector using the ViPeD dataset. In particular, we re-initialize the box predictor module with random weights and a different number of classes. The box predictor module is placed at the end of the detector pipeline, and it is responsible for outputting bounding boxes coordinates and class scores. Instead, all the other layers are initialized with the weights of the COCO pre-trained model.
All the weights are left unfrozen so that they can be adjusted by the back-propagation algorithm. With this technique, we force the architecture to adjust the learned features to match those from the destination dataset. Again, we evaluate this new model by testing it on all the remaining real-world datasets, i.e., MOT17Det, MOT19Det and CityPersons.
Finally, to avoid overfitting our detector on synthetic images, we employ the mixed-batch domain-adaptation approach to inject images coming from real-world scenarios into each batch. As in the previous case, we initialize all the layers except the ones from the box predictor module using the weights of the COCO pre-trained model. However, this time we fine-tune the network using batches composed of 2/3 synthetic images and 1/3 real-world images. In this experiment we consider COCOPersons as the real-world dataset, since it depicts humans in highly heterogeneous scenarios, and it is not biased towards a specific application (e.g., autonomous driving).
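The batch composition can be sketched with a simple generator. This is an illustrative sketch under stated assumptions rather than the training code used in the experiments: an epoch is defined here as one pass over the synthetic samples, real samples are drawn at random for every batch, and the real set is assumed to contain at least as many items as are needed per batch.

```python
import random
from typing import Iterator, List, Sequence


def mixed_batches(synthetic: Sequence,
                  real: Sequence,
                  batch_size: int = 6,
                  real_fraction: float = 1 / 3,
                  seed: int = 0) -> Iterator[List]:
    """Yield batches with roughly 2/3 synthetic and 1/3 real-world samples."""
    rng = random.Random(seed)
    n_real = round(batch_size * real_fraction)
    n_syn = batch_size - n_real
    syn = list(synthetic)
    rng.shuffle(syn)
    # One pass over the synthetic data; an incomplete final chunk is dropped.
    for start in range(0, len(syn) - n_syn + 1, n_syn):
        batch = syn[start:start + n_syn] + rng.sample(list(real), n_real)
        rng.shuffle(batch)  # avoid a fixed synthetic-then-real ordering
        yield batch
```

With the default values, each batch of six samples contains four synthetic and two real-world items, so the real data contributes to every gradient step without dominating it.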
As in the previous cases, we evaluate this model testing on all the real-world datasets.
Results are reported in Table 1. In the first section of the table, we report the baselines, while the latter is related to our approaches. Note that we omit results concerning a specific dataset if it has been employed during the training phase, for a fair evaluation of the overall generalization capabilities.
| Training data | MOT17Det | MOT19Det | CityPersons |
|---|---|---|---|
| ViPeD + Real | 0.733 | 0.582 | 0.546 |
| ViPeD Aug. + Real | 0.730 | 0.581 | 0.557 |
Results in Table 1 show that our solution can generalize the knowledge learned from the virtual world to different real-world datasets. In most cases, our network also performs better than the ones trained using only the real-world manually annotated datasets, taking advantage of the high variability and size of our ViPeD dataset. In particular, concerning the MOT17Det dataset, all our solutions trained with synthetic data perform better than the ones trained with real data. The best result is obtained using the mixed-batch approach. Considering the MOT19Det dataset, we achieve the best result by fine-tuning the detector with our basic version of ViPeD. CityPersons is the only dataset on which the detector maintains higher performance when trained with real-world data. In particular, the highest mAP on CityPersons is obtained when Faster-RCNN is trained with the MOT17Det dataset. However, in this case the mixed-batch approach achieves results comparable with the baselines.
It should be noted that, differently from amato2019viped, the augmentation of our ViPeD dataset has a minor impact on the final performances with respect to the basic non-augmented version. It is very likely that Faster-RCNN is more robust to effects such as radial-blur, Gaussian-blur, and bloom effects than YOLOv3. Given these results, in the next section we consider only the basic version of our ViPeD dataset.
The second use-case consists in adapting the previously trained pedestrian detection model to maximize its performance on a particular real-world scenario. In this case, we want to test the ability of our synthetic trained model to adapt to novel real-world scenarios, when fine-tuned on the target training set.
In particular, we specialize our detector on the MOT17Det and MOT19Det datasets, as they contain very interesting smart-cities scenarios mainly designed for surveillance applications, rather than for self-driving tasks. Even in this case, the knowledge transfer to a specific real-world domain is performed both using fine-tuning and through batch mixing, as explained above.
First, we consider the model previously trained with synthetic images of our ViPeD dataset, and we fine-tune it with the training sets of the MOT17Det or MOT19Det datasets. Even in this scenario, we initialize all the layers except the ones from the box predictor module using the weights of the COCO pre-trained model.
Then, we exploit the mixed-batch domain adaptation approach, starting from the COCO pre-trained model and fine-tuning the network using batches composed of 2/3 synthetic images from the ViPeD dataset and 1/3 real-world images from the MOT17Det or the MOT19Det dataset.
Results are reported in Table 2 and Table 3. Since the authors do not release the ground truth for the test sets, we submit our results to the MOT Challenge website, as mentioned in Section 3.2.1; therefore, the results are evaluated using the MOT mAP. We report our results together with the state-of-the-art approaches publicly released in the MOT Challenges (at the time of writing). We also report our previous results obtained in amato2019viped. Please note that the results in Table 3 are related only to our previous approach in amato2019viped since, at the time of writing, it was not possible to submit our results to the challenge concerning the MOT19Det dataset and, as explained before, we do not have the ground truth for the test set. The authors will open this challenge again, and we will report the updated results as soon as they are available, before the final version of this article. However, given that the new detector shows higher performance than YOLOv3 on the MOT17Det dataset, we expect better results also on the MOT19Det dataset.
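Table 2: results on the MOT17Det test set.

| Method | MOT AP |
|---|---|
| Faster R-CNN ViPeD FT (our) | 0.89 |
| Faster R-CNN ViPeD MB (our) | 0.87 |
| YOLOv3 ViPeD FT amato2019viped (our) | 0.71 |

Table 3: results on the MOT19Det test set.

| Method | MOT AP |
|---|---|
| YOLOv3 ViPeD FT amato2019viped (our) | 0.734 |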
Results in Table 2 demonstrate that our training procedure can reach competitive performance even when compared to specialized pedestrian detection approaches. In particular, concerning the MOT17Det dataset, we obtain a mAP of 0.89, the same value obtained by the top finishers.
Furthermore, even though the mixed-batch approach cannot reach the state-of-the-art results, a performance loss of only 0.02 mAP can be justified if we consider that real-world images from the target training set make up only 1/3 of each batch. We will discuss the results concerning the MOT19Det dataset in more detail as soon as the authors reopen the challenge.
In this work, we proposed a novel approach for training pedestrian detectors using synthetically generated data. The choice of training a network using synthetic data is motivated by the fact that a huge number of different examples is needed for the algorithm to generalize well. This huge amount of data is typically collected and annotated manually, but this procedure usually takes a lot of time, and it is error-prone.
To this end, we introduced a synthetic dataset named ViPeD. This dataset contains a massive collection of images rendered from the highly photo-realistic video game GTA V developed by Rockstar North, and a full set of precise bounding-box annotations around all the visible pedestrians.
We fine-tuned Faster-RCNN, a state-of-the-art two-stage object detector, with ViPeD, and we validated this approach on different publicly available real-world pedestrian detection datasets. To address the problem of domain adaptation from the virtual world to the real one, we also exploited a mixed-batch supervised training approach.
Using Faster-RCNN, we demonstrated that the network trained with the help of synthetic data can generalize to a great extent to multiple real-world scenarios. Furthermore, our solution can be easily transferred to specific real-world environments, outperforming the same architecture trained instead only on real-world manually-labeled datasets.
Even though in this work we considered the specific task of pedestrian detection, we think that the presented procedure could be applied at a larger scale even on other related tasks, such as image classification or object segmentation.
This work was partially supported by the Automatic Data and documents Analysis to enhance human-based processes (ADA), CUP CIPE D55F17000290009.