Pedestrian-Synthesis-GAN: Generating Pedestrian Data in Real Scene and Beyond

04/05/2018 ∙ by Xi Ouyang, et al.

State-of-the-art pedestrian detection models have achieved great success on many benchmarks. However, these models require large amounts of annotated data, and the labeling process usually takes considerable time and effort. In this paper, we propose a method to generate labeled pedestrian data and use it to support the training of pedestrian detectors. The proposed framework is built on the Generative Adversarial Network (GAN) with multiple discriminators, which synthesizes realistic pedestrians and learns the background context simultaneously. To handle pedestrians of different sizes, we adopt the Spatial Pyramid Pooling (SPP) layer in the discriminator. We conduct experiments on two benchmarks. The results show that our framework can smoothly synthesize pedestrians on background images with varied content and levels of detail. To quantitatively evaluate our approach, we add the generated samples to the training data of baseline pedestrian detectors and show that the synthetic images improve the detectors' performance.




1 Introduction

Pedestrian detection is a crucial task in computer vision with a wide range of applications, including autonomous driving, surveillance and robotics [1, 2, 3, 4]. Recently, pedestrian detectors based on convolutional neural networks (CNNs), such as Faster R-CNN [5] and YOLO9000 [6], have been applied to various benchmarks. Trained on a tremendous number of examples, these models achieve significant performance improvements over previous baselines.

However, labeling ground-truth bounding boxes for pedestrian locations is time-consuming and requires considerable human effort. Meanwhile, the performance of CNN-based pedestrian detectors depends heavily on the quality and diversity of the annotations in the training datasets. In other words, those methods expect the training set to cover the same scenes as the testing data, or at least similar ones, in terms of camera configuration, lighting conditions and backgrounds. This becomes an issue when one applies these methods to a new unannotated video or a video with limited supervision. Therefore, it is very important to design approaches that rely only on limited supervision and can be extended smoothly to new unannotated datasets.

One way to solve this problem is to develop methods that automatically generate labeled datasets. Some efforts use simulation techniques to generate the pedestrian appearance and its location in the image [7, 8], but these methods apply only in restricted environments such as fixed cameras. One model, proposed for the moving-camera case [9], can combine real-world background information in a scene with synthetically generated pedestrians. Nevertheless, since it generates pedestrians by rendering 3D human models, the synthetic images look unrealistic and unnatural.

Figure 1: The PS-GAN model learns to smoothly synthesize pedestrians in background images through a network with multiple discriminators ($D_b$ and $D_p$).

Motivated by the recent success of generative adversarial networks (GANs) [10] in several applications [11, 12, 13], we propose a GAN-based model that generates realistic pedestrian images in real scenes and uses them as augmented data to train a CNN-based pedestrian detector. Compared with using a regular GAN as a tool for generating images, the goal of our model is different and more challenging because it must: 1) generate pedestrians that fit the background scene well; 2) provide the corresponding locations of those synthetic pedestrians as ground truths for the CNN-based detectors. We name it Pedestrian-Synthesis-GAN (PS-GAN).

PS-GAN adopts the adversarial learning recipe and contains multiple discriminators ($D_b$ for background context learning and $D_p$ for pedestrian classification), as shown in Figure 1. We replace the pedestrians in the bounding boxes with random noise and train the generator $G$ to synthesize new pedestrians within that noise region. The discriminator $D_b$ learns to discriminate between real and synthesized pairs. Meanwhile, the discriminator $D_p$ learns to judge whether the synthetic pedestrian cropped from the bounding box is real or fake. $D_b$ forces $G$ to learn background information, such as the road and the lighting conditions, in the noise boxes, which leads to a smooth connection between the background and the synthetic pedestrian. $D_p$ pushes $G$ to generate pedestrians with more realistic shape and details. Moreover, because the cropped synthetic pedestrians vary in size, we use the Spatial Pyramid Pooling (SPP) layer [14] in $D_p$ to avoid the effect of resizing. After training, the generator can produce photo-realistic pedestrians in the noise box regions, and the locations of the noise boxes are taken as the ground truths for detectors.

To the best of our knowledge, PS-GAN is the first work that uses a GAN to generate data for the pedestrian/object detection task. We evaluate it on two large-scale datasets: Cityscapes [15] and the Tsinghua-Daimler Cyclist Benchmark [16]. We use the model to generate results on these two datasets, and also train Faster R-CNN detectors [5] with real and synthetic data to demonstrate the effectiveness of the data augmentation. We show that:

  • Our proposed model can generate sharp and photo-realistic pedestrian images that fit the background well in real scenes;

  • The data generated by PS-GAN can be used together with some real samples to train CNN-based detectors. This data augmentation step improves both detection performance and stability over the original model;

  • In cross-dataset experiments, i.e., where the model is trained on one dataset and tested on the other, PS-GAN is also able to generate good samples and improve the performance of CNN-based detectors.

2 Related Work

2.0.1 Pedestrian Detection

Pedestrian detection attracts great interest due to its wide applications, including driving systems, surveillance and robotics [1, 2, 3, 4]. Built upon parameterized CNN models, recent works [5, 6, 17, 18, 19] achieve good detection performance on several benchmarks. However, these models require a large number of training samples, and collecting and labeling them is time-consuming and takes much human effort.

To handle this issue, researchers have proposed different solutions, one of which is to develop data augmentation techniques. Existing data augmentation methods are generally limited to certain tasks or conditions: [7] focuses exclusively on crowd behavior, and [8] works only when the camera is stationary. The paper [9] provides an automatic and relatively robust model, STD-PD, which selects possible locations to place synthetic agents. Because it uses a 3D model for pedestrian rendering, its generated pictures are not realistic. Since it is difficult to model the complex distribution of pedestrians in real scenes using hand-crafted rules only, we adopt a data-driven approach based on GANs to perform the task.

2.0.2 Generative Adversarial Network

The original GAN was proposed by [10], and plenty of follow-up works have improved the training stability and the visual quality of the generation [20, 21, 22, 23, 24, 25, 26]. GANs have also been employed in many other applications, for example super-resolution [13], image in-painting [12, 27, 28] and image translation [11, 29, 30, 31]. [21] proposed DCGAN, and [36] adopted it to augment training data for person re-identification; that work focuses on verifying the effectiveness of label smoothing regularization rather than on the quality of the generated pictures. [32] proposed PGGAN to synthesize persons in arbitrary poses in cropped person images.

The work most related to ours is Pix2Pix GAN [11], which produces solid and robust results when paired training samples are available. [31] adds a cycle consistency loss to the original version, enabling the model to perform translations without paired training examples or task-specific functions. To synthesize pedestrians in the noise boxes, whose locations can be taken as the bounding box labels, we adopt the paired training of Pix2Pix GAN but use a different architecture with multiple discriminators.

Compared with image in-painting work [12, 27, 28], which aims to fill randomly removed monochromatic patches in the original image, our framework fills the missing area with noise rather than monochromatic blocks in order to generate patches with diverse shapes and colors. When synthesizing pedestrians in noise boxes, we only need to learn the background information from the context provided by the surrounding parts of the image. The work in [28] exploits a similar two-discriminator GAN for image in-painting to learn more context information from the surrounding pixels. Different from that, we pass the pedestrian patch cropped from the generated output into the discriminator to encourage the model to generate persons in diverse shapes.

3 Pedestrian-Synthesis-GAN

Generative Adversarial Networks [10] consist of a generator and a discriminator that compete in a two-player minimax game. In this paper, we adopt the adversarial learning idea and propose PS-GAN with multiple discriminators, which is able to synthesize photo-realistic pedestrians together with the corresponding bounding box information. Unlike the regular GAN, our method leverages an adversarial process between the generator $G$ and two discriminators: $D_b$ for background context learning and $D_p$ for discriminating pedestrians.

Our framework is inspired by the conditional GAN work [33]. During training, we replace the pedestrian region in the original image with random noise and feed the result to the generator $G$. Suppose the noise image is $x$ and the original image with the pedestrian is $y$. $G$ tries to generate from $x$ a fake image $G(x)$ that is as similar as possible to $y$, in order to fool the two discriminators $D_b$ and $D_p$. Therefore, when generating new data, we can place noise boxes in the areas where pedestrians are expected and use the generator $G$ to synthesize pedestrians within the noise boxes. In this section, we first introduce the model architecture, then the detailed formulation of the overall objective.
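To make the input construction concrete, the following is a minimal sketch of how a noise image could be built from an image and a bounding box. The function name `add_noise_box`, the nested-list image layout, the max-exclusive box convention and the choice of uniform noise are all assumptions made for illustration; the paper does not specify the noise distribution.

```python
import random


def add_noise_box(image, box, rng):
    """Return a copy of `image` with the pixels inside `box` replaced by noise.

    image: H x W x C nested lists of ints in [0, 255]
    box:   (x_min, y_min, x_max, y_max), with max coordinates exclusive
    rng:   a random.Random instance (uniform noise is an assumption)
    """
    x_min, y_min, x_max, y_max = box
    # Copy every pixel so the original image is left untouched.
    noisy = [[list(pixel) for pixel in row] for row in image]
    for r in range(y_min, y_max):
        for c in range(x_min, x_max):
            noisy[r][c] = [rng.randrange(256) for _ in noisy[r][c]]
    return noisy


# Toy example: a uniform gray 6 x 8 RGB image with a 4 x 3 noise box.
rng = random.Random(0)
image = [[[128, 128, 128] for _ in range(8)] for _ in range(6)]
box = (2, 1, 5, 5)
noisy = add_noise_box(image, box, rng)
```

The box passed to the function doubles as the ground-truth label for whatever the generator later synthesizes inside it, which is why no separate annotation step is needed.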

Figure 2: The discriminator $D_b$ classifies real and synthesized pairs in order to learn the background context in the noise box. The discriminator $D_p$ learns to classify the real and synthesized pedestrians within the noise box. We adopt a 3-level SPP layer ($1 \times 1$, $2 \times 2$, $4 \times 4$, 21 bins in total) before the final feature representation.

3.1 Model architecture

3.1.1 U-Net for Generator

The generator $G$ learns a mapping function $G: x \rightarrow y$, where $x$ is the input noisy image and $y$ is the ground truth image. In this work, we adopt the enhanced encoder-decoder network (U-Net) [11] for $G$. It follows the main structure of the encoder-decoder architecture, where the input image is passed through a series of convolutional layers acting as downsampling layers until a bottleneck layer. The bottleneck layer then feeds the encoded information of the original input to the deconvolutional layers, where it is upsampled. U-Net uses skip connections to link the downsampling and upsampling layers that sit in symmetric positions with respect to the bottleneck layer, which preserves richer local information.
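The symmetric pairing of skip connections can be illustrated by tracing feature-map sizes through the network. This is only a sketch: it assumes every downsampling layer halves the resolution and every upsampling layer doubles it, and the 256-pixel input and 8 layers used in the example are hypothetical values not taken from the paper. The helper name `unet_skip_plan` is also illustrative.

```python
def unet_skip_plan(input_size, num_down):
    """Trace spatial sizes through a symmetric encoder-decoder.

    Returns (enc_sizes, plan), where enc_sizes[i] is the output size of
    encoder layer i and plan[j] = (decoder_output_size, skip_from) with
    skip_from the index of the encoder layer whose output is concatenated
    with decoder layer j's output (None for the final output layer).
    """
    enc_sizes = []
    size = input_size
    for _ in range(num_down):
        size //= 2  # stride-2 convolution halves the resolution
        enc_sizes.append(size)

    plan = []
    for j in range(num_down):
        size *= 2  # stride-2 transposed convolution doubles the resolution
        mirror = num_down - 2 - j  # encoder layer mirrored around the bottleneck
        plan.append((size, mirror if mirror >= 0 else None))
    return enc_sizes, plan


enc_sizes, plan = unet_skip_plan(256, 8)
for j, (size, skip_from) in enumerate(plan):
    print(f"decoder {j}: output {size}x{size}, skip from encoder {skip_from}")
```

Each skip connection joins two feature maps of identical spatial size, which is what lets the decoder reuse fine local detail that the bottleneck would otherwise discard.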

3.1.2 $D_p$ to Discriminate fake/real Pedestrians

For this discriminator $D_p$, we crop the synthetic pedestrian from the generated image as the negative example and the real pedestrian from the original image as the positive example. $D_p$ is therefore used to classify whether the pedestrian in the noise box is real or fake. It forces $G$ to learn the mapping from the noise $z$ to the real pedestrian $y_p$, where $z$ is the noise region in the noise image $x$.

The overall structure of $D_p$ is shown in Figure 2. We apply a 5-layer convolutional network with LeakyReLU and BatchNorm layers. Normally the discriminator network accepts a fixed-size input. However, the input to our $D_p$ is the pedestrian cropped from the generated image or the ground truth image, which varies in size. To address this issue, we adopt the Spatial Pyramid Pooling (SPP) layer [14] in $D_p$; the details of the SPP layer are also shown in Figure 2. In our experiments, for each cropped pedestrian, we use a 3-level spatial pyramid ($1 \times 1$, $2 \times 2$, $4 \times 4$, 21 bins in total) to pool the features. After that, we concatenate the features from all 3 levels into a single feature vector and apply the GAN loss [11] here.
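The following sketch shows why spatial pyramid pooling yields a fixed-length vector regardless of the crop size. It assumes pyramid levels of 1×1, 2×2 and 4×4 (the combination that gives the 21 bins mentioned above) and max pooling within each bin; the pooling operator, the function name `spp_max_pool` and the single-channel nested-list layout are assumptions for illustration.

```python
def spp_max_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2-D feature map (list of rows) into sum(n * n) bins.

    For each level n the map is divided into an n x n grid; bin boundaries
    use floor for the start and ceil for the end, so every bin is non-empty
    even when the map is smaller than the grid.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0 = (i * h) // n
            r1 = ((i + 1) * h + n - 1) // n  # integer ceil
            for j in range(n):
                c0 = (j * w) // n
                c1 = ((j + 1) * w + n - 1) // n
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


# Two crops of different sizes map to vectors of the same length.
tall_crop = [[r * 5 + c for c in range(5)] for r in range(7)]
wide_crop = [[r * 12 + c for c in range(12)] for r in range(9)]
small = [[r * 4 + c for c in range(4)] for r in range(4)]
vec_small = spp_max_pool(small)
```

In a real discriminator this pooling would be applied to every channel of the last convolutional feature map and the per-channel vectors concatenated, so the classifier input length depends only on the channel count and not on the pedestrian size.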

3.1.3 $D_b$ to Learn Background Context

The goal of our model is not only to synthesize a realistic pedestrian but also to blend the synthetic pedestrian smoothly into the background. This requires the model to learn context information such as lighting conditions and surrounding backgrounds. Following the pair-training recipe of Pix2Pix GAN [11], $D_b$ is used to classify real and synthesized pairs. The real pair concatenates the noise image and the ground truth image, while the synthesized pair concatenates the noise image and the generated image. The overall training framework is shown in Figure 2.

The main structure of $D_b$ follows the design of DCGAN [21] with the following modifications: 1) we make the first convolutional layer accept the 6-channel input of the stacked pair of images; 2) we use the PatchGAN in this discriminator as in [11], which means $D_b$ tries to classify whether each $N \times N$ patch in an image is real or fake (in our experiments, $N$ is set to 70); 3) we adopt the loss function of LSGAN [22] in $D_b$. To fit the PatchGAN setting, we calculate the mean squares between the output of $D_b$ and the corresponding all-ones or all-zeros matrix as the loss function for $D_b$.
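As a small illustration of the two mechanics described here, the sketch below stacks an image pair along the channel axis and computes a least-squares loss against an all-ones or all-zeros target. The function names, the channel-first nested-list layout and the toy score map are hypothetical; the discriminator itself is not modeled.

```python
def stack_pair(noise_image, other_image):
    """Concatenate two channel-first images (lists of channels): 3 + 3 = 6 channels."""
    return noise_image + other_image


def patch_ls_loss(patch_scores, is_real):
    """Mean squared difference between a patch score map and its target.

    The target is an all-ones matrix for a real pair and an all-zeros
    matrix for a synthesized pair, one entry per image patch.
    """
    target = 1.0 if is_real else 0.0
    flat = [score for row in patch_scores for score in row]
    return sum((score - target) ** 2 for score in flat) / len(flat)


# Toy 2 x 2 images with 3 channels each.
noise_image = [[[0.1, 0.2], [0.3, 0.4]] for _ in range(3)]
generated = [[[0.5, 0.6], [0.7, 0.8]] for _ in range(3)]
pair = stack_pair(noise_image, generated)

# Hypothetical 2 x 2 score map, one score per patch.
scores = [[1.0, 0.5], [0.0, 1.0]]
loss_if_real = patch_ls_loss(scores, is_real=True)
loss_if_fake = patch_ls_loss(scores, is_real=False)
```

Because the loss is averaged over all patch scores, every local region of the image contributes to the gradient, which is consistent with using this discriminator for background context.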


3.2 Loss Function

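As illustrated in Figure 1, our model consists of two adversarial learning procedures: one between $G$ and $D_b$, and one between $G$ and $D_p$. The adversarial learning between $G$ and $D_b$ can be formulated as:

$$\mathcal{L}_{LSGAN}(G, D_b) = \mathbb{E}_{x, y}\big[(D_b(x, y) - 1)^2\big] + \mathbb{E}_{x}\big[D_b(x, G(x))^2\big]$$

where $x$ is the image with the noise box and $y$ is the ground truth image. We use LSGAN here to replace the original GAN loss by a least square loss.

To encourage $G$ to generate realistic pedestrians within the noise box $z$ in the input image $x$, we conduct the other adversarial procedure between $G$ and $D_p$:

$$\mathcal{L}_{GAN}(G, D_p) = \mathbb{E}_{y_p}\big[\log D_p(y_p)\big] + \mathbb{E}_{z}\big[\log\big(1 - D_p(G(z))\big)\big]$$

where $z$ is the noise box in $x$ and $y_p$ is the cropped pedestrian in the ground truth image $y$. We use the negative log likelihood objective to update the parameters of $G$ and $D_p$.

The training of GANs can benefit from the traditional $\ell_1$ loss [11]. In this paper, we apply the $\ell_1$ loss to control the difference between the generated image $G(x)$ and the ground truth image $y$:

$$\mathcal{L}_{\ell_1}(G) = \mathbb{E}_{x, y}\big[\lVert y - G(x) \rVert_1\big]$$

Finally, combining the losses defined above gives the final loss function:

$$\mathcal{L}(G, D_b, D_p) = \mathcal{L}_{LSGAN}(G, D_b) + \mathcal{L}_{GAN}(G, D_p) + \lambda \mathcal{L}_{\ell_1}(G)$$

where $\lambda$ controls the relative importance of the $\ell_1$ loss. The value of $\lambda$ was set empirically and kept fixed for all experiments.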
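The way the three terms described in this section combine on the generator side can be sketched numerically. This is an illustration under stated assumptions rather than the authors' implementation: it assumes the standard least-squares generator term (fake pairs pushed toward a score of 1), a non-saturating negative log likelihood term for the pedestrian discriminator, and a mean absolute error term for the traditional reconstruction loss. The function name, the flat-list inputs and the weight used in the example are hypothetical, and the weight is not the paper's setting.

```python
import math


def generator_objective(db_fake_scores, dp_fake_prob, generated, target, lam):
    """Combine the three generator-side terms into one scalar.

    db_fake_scores: background-discriminator patch scores for the synthesized pair
    dp_fake_prob:   pedestrian-discriminator probability that the synthesized crop is real
    generated:      flattened generated image
    target:         flattened ground truth image
    lam:            weight of the reconstruction term (caller-supplied)
    """
    # Least-squares term: the generator wants every patch score to be 1.
    ls_term = sum((s - 1.0) ** 2 for s in db_fake_scores) / len(db_fake_scores)
    # Negative log likelihood term: the generator wants the crop judged real.
    nll_term = -math.log(dp_fake_prob)
    # Reconstruction term: mean absolute difference to the ground truth.
    l1_term = sum(abs(g - t) for g, t in zip(generated, target)) / len(target)
    return ls_term + nll_term + lam * l1_term


total = generator_objective(
    db_fake_scores=[1.0, 0.0],
    dp_fake_prob=1.0,
    generated=[0.0, 1.0],
    target=[0.5, 0.5],
    lam=2.0,  # arbitrary illustrative weight
)
```

In the toy call the least-squares term is 0.5, the log likelihood term is 0 and the weighted reconstruction term is 1.0, so the reconstruction weight directly sets how strongly pixel fidelity competes with the two adversarial signals.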

Figure 3: We compare PS-GAN with four different models. The baseline, in the third column, is Pix2Pix GAN [11], which contains only one discriminator to classify real and synthesized pairs. The following columns (A-C) show the ablation tests of the proposed PS-GAN. Model A: the main structure is the same as PS-GAN but the SPP layer is removed; Model B: unlike our final model, this model adopts the LSGAN loss in both adversarial learning procedures (between $G$ and $D_b$, and between $G$ and $D_p$); Model C: the regular GAN loss is kept in both adversarial learning procedures.

4 Experimental Results

We test the PS-GAN model on Cityscapes [15] and show the quality of the synthesized images. To analyze the effect of the data augmentation, we combine real and synthesized data to train Faster R-CNN [5] detectors and evaluate their performance. Moreover, to evaluate the ability to generate training examples for a new video with limited supervision, we test the PS-GAN model trained on Cityscapes on the Tsinghua-Daimler Cyclist Benchmark [16]. All experiments are implemented in PyTorch and run on Titan X GPUs.

4.1 Cityscapes

The Cityscapes dataset is a large-scale dataset for semantic urban scene understanding that contains a diverse set of stereo video recordings from 50 cities [15]. Compared to other benchmarks like Caltech Pedestrian [34] and KITTI [35], Cityscapes has higher-resolution pictures and contains more pedestrians with rich variety, which makes it more suitable for training GANs.

4.1.1 Qualitative Result

We generate the bounding boxes for all pedestrians from the pixel-wise labels. Some labeled pedestrians are too small or are partially blocked by cars or walls, so we filter out all the bounding boxes with height smaller than 70 pixels and width smaller than 25 pixels. After that, we obtain 2326 images containing 9708 labeled pedestrians in total and randomly select 500 of these images as the testing set. We do not feed the original images ($2048 \times 1024$) into PS-GAN directly. Instead, we crop patches around the chosen pedestrians from the original images. Moreover, from the 1826 training images we select 1200 pedestrian patches that display intact body shapes. In those 1200 patches the pedestrian positions are covered with noise boxes, and the resulting noise images are taken as the training data for PS-GAN.
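The two preprocessing steps above, deriving a box from a pixel-wise label and discarding small boxes, could be sketched as follows. The sketch assumes one binary mask per pedestrian instance and interprets the size rule as dropping a box when either dimension falls below its threshold; the paper's wording leaves that combination ambiguous, so this reading is an assumption. Function names and the max-exclusive box convention are illustrative.

```python
def mask_to_box(mask):
    """Tight box (x_min, y_min, x_max, y_max), max-exclusive, around nonzero pixels.

    mask: H x W nested lists of 0/1 for a single pedestrian instance.
    Returns None when the mask is empty.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def keep_box(box, min_height=70, min_width=25):
    """Keep a box only if it is at least min_height tall and min_width wide."""
    x_min, y_min, x_max, y_max = box
    return (y_max - y_min) >= min_height and (x_max - x_min) >= min_width


# Toy instance mask: an 80-pixel-tall, 30-pixel-wide blob in a 100 x 40 image.
mask = [[1 if 10 <= r < 90 and 5 <= c < 35 else 0 for c in range(40)]
        for r in range(100)]
box = mask_to_box(mask)
```

For the toy mask the resulting box is 80 pixels tall and 30 pixels wide, so it passes both thresholds.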

Figure 4: Results of different models for synthesizing pedestrians in blank background.

To show the pedestrians generated by PS-GAN, we conduct two experiments: 1) generating pedestrians at the real pedestrian positions, and 2) generating pedestrians on background images without pedestrians. For the first setting, we crop patches around the pedestrians from the original images of the 500 test examples and place noise boxes over the real pedestrians in those patches. Our pre-trained generator synthesizes pedestrians within those noise boxes, and we compare the synthetic and real pedestrians in Figure 3. For the second setting, we randomly crop patches from blank scene images without any labeled pedestrians. Since pedestrians cannot appear in unreasonable positions, such as inside a wall or a car, we remove those implausible patches and add the noise boxes to the remaining image patches. The results are shown in Figure 4.

We list the synthesized samples of all baseline models, trained on the same training set for 200 epochs. Compared with the baseline Pix2Pix GAN, PS-GAN generates better-quality images in both Figure 3 and Figure 4. Most of the Pix2Pix GAN results contain only murky person shapes, while PS-GAN produces very clear pedestrian shapes. This indicates that our discriminator $D_p$ can effectively guide the generator to learn more realistic shape information and details of pedestrians. To evaluate the effect of the SPP layer in $D_p$, we compare the results of PS-GAN with model A, which does not have the SPP layer in $D_p$. As shown in Figure 3 and Figure 4, the model with the SPP layer learns more detailed information about pedestrians. For instance, in the first row of both Figure 3 and Figure 4, the legs of the person generated by PS-GAN can clearly be seen, while they are blurry in the one generated by model A.

In our experiments, we find that using LSGAN [22] for $D_b$ helps to learn the background context. PS-GAN obtains the best picture quality when applying the least square loss to the adversarial learning between $G$ and $D_b$ and keeping the regular GAN loss for the adversarial learning between $G$ and $D_p$. We design model B, which adopts the LSGAN loss in both adversarial learning procedures, but its results are not competitive with PS-GAN, as shown in both Figure 3 and Figure 4. Model B performs only slightly better than Pix2Pix GAN. We also study model C, which uses the regular GAN loss in both adversarial learning procedures. This model can generate pedestrians with good human-body shape; in the last row of Figure 3, it even generates a pedestrian with a better shape. However, it cannot learn enough background context information to fit the surrounding pixels.

We analyze why the two discriminators $D_b$ and $D_p$ have different optimal GAN losses in our work: 1) for $D_p$, as we apply the GAN trick, LSGAN with the least square loss yields a larger error than the regular GAN loss. This makes the model more sensitive to every pixel in the image than the regular GAN, so the generator may be forced to learn too much detailed information about pedestrians instead of capturing the global distribution; 2) in contrast, our discriminator $D_b$ benefits from the least square loss when learning the background context information, because we expect the generator to learn the background information strictly from the surrounding pixels.

Figure 5: Comparison of (a) generated pedestrians and (b) real pedestrians.

Figure 6: Examples of synthesizing pedestrians in the original scenes. The original images are shown on the left and the corresponding synthesized images on the right.

| Training data | AP, no synthetic data | AP, synthetic data from Pix2Pix GAN | AP, synthetic data from PS-GAN |
|---|---|---|---|
| 1826 real images (7729 labels) | | | |
| + 3000 synthetic pedestrians | | 59.95% | 61.02% |
| + 5000 synthetic pedestrians | | 60.23% | 61.79% |
| + 8000 synthetic pedestrians | | 58.41% | 61.59% |
| Pascal VOC 2007 | 34.13% | | |
| Pascal VOC 2007 & 2012 | 36.85% | | |
| 300 real images (1173 labels) | | | |
| + 500 synthetic pedestrians | | 46.97% | 47.36% |
| + 1000 synthetic pedestrians | | 46.71% | 48.79% |
| + 2000 synthetic pedestrians | | 46.12% | 48.11% |
| 1000 real images (4368 labels) | | | |
| + 2000 synthetic pedestrians | | 52.07% | 54.41% |
| + 4000 synthetic pedestrians | | 51.68% | 56.19% |
| + 5000 synthetic pedestrians | | 51.24% | 55.96% |

Table 1: Performance comparison of different settings for training the Faster R-CNN, including adding different amounts of synthetic data from Pix2Pix GAN and PS-GAN, separately.

We crop the pedestrians from the generated images, and demonstrate that PS-GAN can generate pedestrians with sharp body shapes and detailed information as illustrated in Figure 5. Compared with the work in [36], which uses 12,936 images to train the GAN for the person re-identification task, we only use 1200 images to train PS-GAN and get sharper and more photo-realistic results.

4.1.2 Quantitative Analysis

In this section, we combine the data generated by PS-GAN with some real data to train the Faster R-CNN detector [5] and analyze the effects of data augmentation. We follow the setting of the qualitative section above and randomly place noise boxes to generate pedestrians on patches from the Cityscapes images. After that, we paste those patches with generated pedestrians back into the original images. Some examples are shown in Figure 6. Many pedestrians synthesized by PS-GAN, which is trained only on the 1826 training images, look convincingly real in the original images. Notably, all the patches are added into the original 1826 training images, which means we do not introduce any new images with synthetic pedestrians. To demonstrate how the augmented synthetic images help boost the performance of the Faster R-CNN model, we train three Faster R-CNN detectors [5] (VGG-16 [37] based models). The baseline detector is trained on the original 1826 training images, and the other two detectors are trained on those images with synthetic pedestrians added from Pix2Pix GAN and PS-GAN, respectively. All the detectors are tested on the 500 testing images, and the reported average precisions (AP) are the best performance after all the models converge. We also add different amounts of synthetic pedestrians to the 1826 training images and present the results in Table 1. Although the Faster R-CNN detector is already well trained (60.11%) on the 1826 images, adding synthetic pedestrians to the original images for training is still beneficial. With 5000 synthetic pedestrians from PS-GAN, we improve the detector performance from 60.11% to 61.79%. In contrast, adding 8000 synthetic pedestrians from Pix2Pix GAN degrades the performance to 58.41%, since adding too many examples from Pix2Pix GAN distorts the normal data distribution. This result is consistent with the poor visual quality of the Pix2Pix GAN samples.
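Pasting a synthesized patch back into its source image, and turning the noise box into a detector label, mainly involves a coordinate shift. The sketch below illustrates this with a single-channel image stored as nested lists; the function names, the (x, y) offset convention and the max-exclusive boxes are assumptions for illustration, and the patch is assumed to fit entirely inside the image.

```python
def paste_patch(image, patch, top_left):
    """Return a copy of `image` with `patch` written at `top_left` = (x, y)."""
    x0, y0 = top_left
    out = [list(row) for row in image]
    for r, patch_row in enumerate(patch):
        # Same-length slice assignment keeps the row width unchanged.
        out[y0 + r][x0:x0 + len(patch_row)] = patch_row
    return out


def box_in_image(box_in_patch, top_left):
    """Translate a noise box from patch coordinates to full-image coordinates."""
    x0, y0 = top_left
    x_min, y_min, x_max, y_max = box_in_patch
    return (x_min + x0, y_min + y0, x_max + x0, y_max + y0)


# Toy example: a 2 x 3 synthesized patch pasted into a 4 x 6 image.
image = [[0] * 6 for _ in range(4)]
patch = [[1, 1, 1], [1, 1, 1]]
top_left = (2, 1)
augmented = paste_patch(image, patch, top_left)
label = box_in_image((0, 0, 2, 2), top_left)
```

The translated box is what would be appended to the image's existing annotations, so each pasted pedestrian adds one training label without adding a new image.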

To gain deeper insight into the effect of the augmented synthetic images, we conduct more experiments, as shown in Table 1. We train a baseline Faster R-CNN detector [5] (VGG-16 [37] based) using 300 real images and also adopt detectors [5] pretrained on Pascal VOC [38]. All the detectors are again tested on the 500 testing images. Moreover, to prevent the GAN models from seeing more data than the Faster R-CNN, all the Pix2Pix GAN and PS-GAN models are retrained on the same image set used to train the Faster R-CNN. In other words, we retrain those GAN models on the 300 images for a fair comparison. The synthetic pedestrians are again added into the original images, without adding any new image to the training set. As shown in Table 1, the detectors pretrained on the Pascal VOC 2007 dataset and on the 2007 & 2012 datasets achieve 34.13% and 36.85% AP, respectively. This observation indicates that a model pretrained on different backgrounds cannot perform well. The baseline detector, trained on 300 real images with 1173 pedestrians from Cityscapes, achieves 47.08% average precision (AP) for pedestrian detection. Adding the synthetic images improves the AP. We get the best performance when adding 1000 synthetic pedestrians, which outperforms the baseline by 1.71%, while adding 2000 synthetic pedestrians improves it by only 1.04%. In both cases, we compare the results with images synthesized by Pix2Pix GAN, which slightly degrade the performance in all the experiments.

We also train another baseline Faster R-CNN detector using 1000 real images, with 4368 pedestrians annotated in total. Meanwhile, the GAN models are retrained on the 1000 real images. The motivation here is to see how the augmented synthetic images help boost the performance when the Faster R-CNN model is given different amounts of real training data. We add 2000 and 4000 synthetic pedestrians to the original 1000 real images and retrain the Faster R-CNN detector. Even though the Faster R-CNN is trained in a more saturated state, the model with data augmentation achieves 56.19% AP, outperforming the baseline by 3.47%.

4.2 Tsinghua-Daimler Cyclist Benchmark

Tsinghua-Daimler Cyclist Benchmark [16] is a dataset for cyclist detection, which contains 4 subsets: train, validation, test and “NonVRU” set. The train set contains 9741 images with annotations as “cyclist”. There are 1019 images in validation set and 2914 images in test set, which contain the annotations as “pedestrian”, “cyclist”, “motorcyclist”, “tricyclist”, “wheelchairuser”, and “mopedrider”. The “NonVRU” set contains 1000 images with background image only (no pedestrian).

To explore the generalization ability of PS-GAN, we perform a cross-dataset test. The goal of this experiment is to simulate the situation in which our GAN model is applied to a new unannotated video or a video with limited supervision. Data augmentation is known to improve performance when the training set contains scenes similar to those in the testing set. If PS-GAN also generalizes well to new data, it could be very helpful when we face a new task with limited annotated information.

Figure 7: Results of generating pedestrians on the Tsinghua-Daimler Cyclist Benchmark: (a) pedestrians synthesized in background images; (b) synthetic images for data augmentation. All those images are generated by PS-GAN pretrained on Cityscapes without any data from the Tsinghua-Daimler Cyclist Benchmark.

First, we directly apply the PS-GAN model pretrained on Cityscapes (using 1826 images) to generate pedestrians on the empty background images from the “NonVRU” set. Since some images in the “NonVRU” set are not suitable for synthesizing pedestrians (e.g., no road, too dark or too bright), we obtain 650 images after removing them. As we did for Cityscapes, we crop patches from those images and place noise boxes in which to synthesize pedestrians. The generated examples are shown in Figure 7(a). Without any data from the Tsinghua-Daimler Cyclist Benchmark, PS-GAN can still generate high-quality and realistic images on this dataset. Note that there are many differences between the two datasets, such as the background, lighting conditions and pedestrian styles, so a slight drop in generated image quality compared to the Cityscapes results is expected. Specifically, in some cases the region around the pedestrian does not match the background well and the body of the pedestrian loses some details. Nevertheless, the generated images still look natural and are of satisfactory quality.

We also compare training the Faster R-CNN on the real Cityscapes data alone with training on the real data plus synthetic data. For the test, all 2914 test images of the Tsinghua-Daimler Cyclist Benchmark, with the bounding boxes annotated as “pedestrian” and “cyclist”, are used directly. The results are presented in Table 2: adding the 650 synthetic images yields a large improvement (2.64%) over the baseline trained on the real Cityscapes data. Different from the Cityscapes setting, here we add new background images when adding the synthetic pedestrians. To isolate the effect of adding new images, we also compare with a detector trained on the real Cityscapes data plus the 650 empty background images. Adding background images brings only a slight improvement, about 0.29%. In this case, the images synthesized by Pix2Pix GAN slightly improve the AP, but the result is 2.3% lower than that of PS-GAN.

Meanwhile, we run the detection experiments with different amounts of training data for the Faster R-CNN. Table 2 reports the results of using 300 and 1000 real images, and of adding synthetic images and background images separately. As in Section 4.1.2, we use the GAN models retrained on the 300 and 1000 images. Performance improves in both cases. Adding background images brings limited improvement, 0.91% and 0.6%, respectively. Adding synthetic data helps significantly, boosting the performance by 2.62% and 2.52%, respectively. In particular, when adding 650 synthetic images to the 1000 real images, the AP rises from 42.42% to 44.94%, which even outperforms the 43.77% AP obtained by training the detector on 1826 real images. Moreover, in all cases, we compare the results with images synthesized by Pix2Pix GAN, which achieve an AP similar to the baseline detectors and never exceed PS-GAN.
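The improvements quoted in this section can be reproduced from the AP values reported in Table 2 with simple subtraction. The snippet below is a small worked check; the dictionary layout and key names are illustrative, while the numbers are copied from Table 2.

```python
# AP values (%) on the Tsinghua-Daimler test set, copied from Table 2.
results = {
    "1826 real": {"baseline": 43.77, "background": 44.06, "ps_gan": 46.41},
    "300 real": {"baseline": 32.15, "background": 33.06, "ps_gan": 34.77},
    "1000 real": {"baseline": 42.42, "background": 43.02, "ps_gan": 44.94},
}


def gains(row):
    """Absolute AP gain of each augmentation over the real-data baseline."""
    return {name: round(ap - row["baseline"], 2)
            for name, ap in row.items() if name != "baseline"}


all_gains = {setting: gains(row) for setting, row in results.items()}
for setting, gain in all_gains.items():
    print(setting, gain)
```

The check also confirms the cross-budget observation in the text: 1000 real images plus synthetic data (44.94%) exceed the 1826-image baseline (43.77%).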

| Training data | AP, no synthetic data | AP, synthetic data from Pix2Pix GAN | AP, synthetic data from PS-GAN |
|---|---|---|---|
| 1826 real images from Cityscapes (7729 labels) | 43.77% | | |
| + 650 background images (no pedestrian) | 44.06% | | |
| + 650 synthetic images (4500 pedestrians) | | 44.11% | 46.41% |
| Pascal VOC 2007 | 23.24% | | |
| Pascal VOC 2007 & 2012 | 26.50% | | |
| 300 real images from Cityscapes (1173 labels) | 32.15% | | |
| + 300 background images (no pedestrian) | 33.06% | | |
| + 300 synthetic images (2000 pedestrians) | | 32.64% | 34.77% |
| 1000 real images from Cityscapes (4368 labels) | 42.42% | | |
| + 650 background images (no pedestrian) | 43.02% | | |
| + 650 synthetic images (4500 pedestrians) | | 42.70% | 44.94% |

Table 2: Comparison of adding different amounts of synthetic data to train the Faster R-CNN. The number of labels is the number of real pedestrians in the real images or of generated pedestrians in the synthetic images.

| Generator | Pretrained detector | AP, Cityscapes background | AP, Tsinghua-Daimler Cyclist background |
|---|---|---|---|
| PS-GAN | Pascal VOC | 84.55% | 88.85% |
| PS-GAN | Cityscapes | 90.11% | 90.46% |
| Pix2Pix GAN | Pascal VOC | 52.46% | 69.42% |
| Pix2Pix GAN | Cityscapes | 58.82% | 71.68% |

Table 3: AP comparison of different pretrained detectors. The two sets, Cityscapes and Tsinghua-Daimler Cyclist, are used as the background.

4.3 Evaluation with Pretrained Detectors

Finally, we use detectors pretrained on real images to detect the synthetic samples (using 500 samples) and report the AP. Two Faster R-CNN detectors [5], trained on Pascal VOC and on Cityscapes (300 samples), are used. We also compare PS-GAN with Pix2Pix GAN on this task. The results are listed in Table 3. The AP of the detectors on the samples generated by PS-GAN is much higher than on those generated by Pix2Pix GAN, which shows the generation power of PS-GAN from another perspective.

5 Conclusion

We propose PS-GAN to synthesize pedestrians within given bounding boxes in real scenes. The experimental results show that our model can generate high-quality pedestrian images and that the synthetic images effectively improve the performance of CNN-based detectors. In the cross-dataset test, our PS-GAN model trained on Cityscapes generates good samples on the other dataset and helps boost detection there, which demonstrates its ability to generalize and transfer knowledge. This is helpful when we face a new task with limited annotated information.

Currently, the pedestrians generated by PS-GAN vary only within a moderate range of scales (they cannot be too small or too large), which limits the diversity and naturalness of the generated data. Making PS-GAN handle the extreme cases is challenging. Besides that, controlling PS-GAN so that it generates pedestrians in reasonable locations (e.g., a pedestrian should not be in a tree or in the water) is also an interesting problem.

In the meantime, applying PS-GAN to other detection tasks is definitely one of our future works.


6 Supplementary Experiments

In this supplement, we provide more results to help understand the approach described in the paper. First, we show more generation results from PS-GAN on Cityscapes and the Tsinghua-Daimler Cyclist Benchmark. Second, we investigate the effects of the data augmentation used to boost the detection performance on both datasets.

6.1 Generated images comparison

We show more results of the work in Section 4.1.1 in Figure 8. PS-GAN still gets the best performance. We also compare the synthetic images of all models on Tsinghua-Daimler Cyclist. Here we further evaluated different approaches in the cross-dataset setting.

As shown in Figure 9, PS-GAN, even though it is trained on Cityscapes, generates the best-quality images and fits the background well. In contrast, the pedestrians synthesized by Pix2Pix GAN are poor, and some of them are very blurry. We also examine the results of the other variants of PS-GAN and make observations similar to those on Cityscapes.

Figure 8: More results of synthesizing pedestrians on Cityscapes with different models.
Figure 9: Results of synthesizing pedestrians on Tsinghua-Daimler Cyclist Benchmark with different models.

6.2 Visualization of detection results

To further investigate how the synthesized data helps improve the performance of Faster R-CNN, we visualize some detection results for both Cityscapes and Tsinghua-Daimler Cyclist. The experimental details are described in Sections 4.1.2 and 4.2.

We first show some examples from Cityscapes in Figures 10 and 11. For the 300-real-image setting, it is clear that the data augmentation step increases true positives while reducing some false positives. Even for the 1000-real-image setting, it yields more correct detections than the original model.

For Tsinghua-Daimler Cyclist, the results are demonstrated in Figures 12 and 13. In both settings, adding the generated data is able to boost the performance of detector by getting more real detections.

Figure 10: Visualization of the detection result on Cityscapes. The left column contains the results of the Faster R-CNN trained on 300 real images, while the right are the results for adding 1000 synthetic pedestrians from PS-GAN.
Figure 11: Visualization of the detection result on Cityscapes. The left column contains the results of the Faster R-CNN trained on 1000 real images, while the right are the results for adding 4000 synthetic pedestrians from PS-GAN.
Figure 12: Visualization of the detection result on Tsinghua-Daimler Cyclist Benchmark. The left column contains the results of the Faster R-CNN trained on 300 real images from Cityscapes, while the right are the results for adding 300 synthetic images from PS-GAN.
Figure 13: Visualization of the detection result on Tsinghua-Daimler Cyclist Benchmark. The left column contains the results of the Faster R-CNN trained on 1000 real images from Cityscapes, while the right are the results for adding 650 synthetic images from PS-GAN.