As autonomous vehicles become an everyday reality, high-accuracy pedestrian detection is of paramount practical importance. Pedestrian detection is a highly researched topic with mature methods, but most datasets focus on common scenes of people engaged in typical walking poses on sidewalks. But performance is most crucial for dangerous scenarios, such as children playing in the street or people using bicycles/skateboards in unexpected ways. Such "in-the-tail" data is notoriously hard to observe, making both training and testing difficult. To analyze this problem, we have collected a novel annotated dataset of dangerous scenarios called the Precarious Pedestrian dataset. Even given a dedicated collection effort, it is relatively small by contemporary standards (around 1000 images). To allow for large-scale data-driven learning, we explore the use of synthetic data generated by a game engine. A significant challenge is selecting the right "priors" or parameters for synthesis: we would like realistic data with poses and object configurations that mimic true Precarious Pedestrians. Inspired by Generative Adversarial Networks (GANs), we generate a massive amount of synthetic data and train a discriminative classifier to select a realistic subset, which we deem the Adversarial Imposters. We demonstrate that this simple pipeline allows one to synthesize realistic training data by making use of rendering/animation engines within a GAN framework. Interestingly, we also demonstrate that such data can be used to rank algorithms, suggesting that Adversarial Imposters can also be used for "in-the-tail" validation at test-time, a notoriously difficult challenge for real-world deployment.
There’s no software designer in the world that’s ever going to be smart enough to anticipate all the potential circumstances an autonomous car is going to encounter. The dog that runs out into the street, the person who runs up the street, the bicyclist, the policeman or the construction worker.
As autonomous vehicles become an everyday reality, high-accuracy pedestrian detection is of paramount practical importance. Pedestrian detection is a highly researched topic with mature methods, but most datasets focus on “everyday” scenes of people engaged in typical walking poses on sidewalks [9, 6, 13, 12, 48]. However, perhaps the most important operating point for a deployable system is its behavior in dangerous, unexpected scenarios, such as children playing in the street or people using bicycles/skateboards in unexpected ways.
Precarious Pedestrian Dataset: Such “in-the-tail” data is notoriously hard to observe, making both training and evaluation of existing systems difficult. To analyze this problem, we have collected a novel annotated dataset of dangerous scenarios called the Precarious Pedestrian Dataset. Even given a dedicated collection effort, it is relatively small by contemporary standards (951 images). To enable large-scale data-driven learning, we explore the use of synthetic data generated by a game engine. Synthetic training data is an actively explored topic because it provides a potentially infinite well of annotated data for training data-hungry architectures [24, 30, 14, 19, 40, 42]. Particularly attractive are approaches that combine a large amount of synthetic training data with a small amount of real data (which may have been difficult to acquire and/or label).
Challenges in Synthesis: We see two primary difficulties with the use of synthetic training data. The first is that not all data is created “equal”: when combining synthetic data with real data, synthesizing common scenes may not be particularly useful since they will likely already appear in the training set. Hence we argue that the real power of synthetic data is generating examples “in-the-tail”, which would otherwise have been hard to collect. The second difficulty arises in building good generative models of images, a notoriously difficult problem. Rather than building generative pixel-level models, we make use of state-of-the-art rendering/animation engines that contain an immense amount of knowledge (about physics, light transfer, etc.). The challenge of generative synthesis then lies in constructing the right “priors”, or scene-parameters, to render/animate. In our case, these correspond to body poses and spatial configurations of people and other objects in the scene.
Adversarial Imposters: We address both concerns with a novel variant of Generative Adversarial Networks (GANs), a method for synthesizing data from latent noise vectors. Traditional GANs learn generative feedforward models that process latent noise vectors, typically drawn from a fixed, known prior distribution. Instead, we fix the feedforward model to be a rendering engine, and use an adversarial framework to learn the latent priors. To do so, we define a rendering pipeline that takes as input a vector of scene parameters capturing object attributes and spatial layout. We use rejection sampling to construct a set of scene parameters (and their associated rendered images) that maximally confuse the discriminator. We call such examples Adversarial Imposters, and use them within a simple pipeline for adapting detectors from synthetic data to the world of real images.
RPN+: We use our dataset of real and imposter images to train a suite of contemporary detectors. We find surprisingly good results with a (to our knowledge) novel variant of a region proposal network (RPN) tuned for particular objects (precarious people) rather than a general class of objectness detections. Instead of classifying a sparse set of proposed windows (as nearly all contemporary object detection systems based on RCNN do), this network returns a dense heatmap of pedestrian detections, along with a regressed bounding-box location for each pixel location in the heatmap. We call this detector RPN+. Our experiments show that our RPN+, trained on real+imposter data, outperforms other detectors trained only on real data.
Validation: Interestingly, we also demonstrate that our Adversarial Imposter Dataset can be used to rank algorithms, suggesting that our pipeline can also be used for “in-the-tail” validation at test-time, a notoriously difficult challenge for real-world deployment.
Contributions: The contributions of our work are as follows: (1) a novel dataset of pedestrians in dangerous situations (Precarious Pedestrians); (2) a general architecture for creating realistic synthetic data “in-the-tail”, for which limited real data can be collected; and (3) a demonstration of our overall pipeline for the task of pedestrian detection using a novel detector. Our datasets and code can be found here: https://github.com/huangshiyu13/RPNplus.
Synthetic datasets have been used to train and evaluate the performance of computer vision algorithms. Some forms of ground truth, such as optical flow, are hard to obtain by hand-labelling but easy to synthesize via simulation. Adam et al. used a 3D game engine to generate synthetic data and learned an intuitive physics model to predict falling towers of blocks. Mayer et al. released a benchmark suite of various tasks using synthetic data, including disparity and optical flow. Richter et al. used synthetic data to improve image segmentation performance, but notably do not control the scene so as to explore targeted arrangements of objects. Ros et al. used the Unity development platform to generate a synthetic urban scene dataset.
Generative Adversarial Nets: GANs are deep networks that can generate synthetic images from latent noise vectors. They do so by adversarially training a neural network to discriminate between real and synthetic images. Recent works have shown impressive performance in the generation of synthetic images [31, 7, 37, 44, 5]. However, it remains challenging to synthesize high-resolution images with semantically valid content. We circumvent these limitations with a rendering-based adversarial approach to image synthesis.
We begin by describing our Precarious Pedestrian Dataset. We perform a dedicated search for targeted keywords (such as “pedestrian fall”, “traffic violation” and “dangerous bike rider”) on Google Images, Baidu Images, and some selected images from the MPII Dataset, producing a total of 951 reasonable images. We then manually label bounding boxes for each image. Precarious Pedestrians contains various kinds of scenes, such as children running in the road, people tripping, motorcyclists performing dangerous maneuvers, and people interacting with objects (such as bicycles or umbrellas). One important and increasingly common scenario consists of people watching their phones or texting while crossing the street, which is potentially dangerous as the person may not be aware of their surroundings (Figure 4). To quantify the (dis)similarity of Precarious Pedestrians to standard pedestrian benchmarks such as Caltech, we tabulate the percentage of images with more than one person, as well as the number of irregular “pedestrians” such as bicyclists or motorcyclists. Compared to Caltech, Precarious Pedestrians contains images with many more people overall as well as many more cyclists and motorbikes (Figure 12). We split the Precarious dataset equally for training and testing.
To help both train and evaluate algorithms for detecting precarious pedestrians, we make use of synthetic data. In this section, we describe our rendering pipeline for generating synthetic data. We use the Unity 3D game engine as our basic platform for simulation and rendering, due to the large availability of both commercial and user-generated assets, in the form of 3D models and character animations.
Figure 7 shows the commercial 3D human models that we use for data generation, consisting of 20 models spanning different women, men, cyclist and skateboarder avatars. Because these are designed for game-engine play, each 3D model is associated with characteristic animations such as jumping, talking, running, cheering and applauding. We animate these models in a 3D scene with a 2D billboard to capture the scene background, as shown in Figure 7. Billboards are randomly sampled from the 1726 background images of the INRIA dataset and a custom set of outdoor scenes downloaded from the Internet. Our approach can generate a diverse set of background scenes, unlike approaches that are limited to a single virtual urban city.
|Parameter|Range|
|Number of 3D models||
|Index of background images||
|Index of 3D models||
|Position of 3D models|Within the field of vision|
|Index of animations||
|Time of animation||
|Model’s angle on the x axis||
|Model’s angle on the y axis||
|Model’s angle on the z axis||
|Light’s angle on the x axis||
|Light’s angle on the y axis||
Scene parameters: To build a large library of synthetic images that will potentially be used for training and evaluation, we first define a set of parameters and parameter ranges. We index the set of background images, the set of 3D models, and the animation frame number for each model. In brief, the scene parameters include directional light intensity and direction (capturing sunlight), the background image index, the number of 3D models, and, for each model, an index specifying the avatar ID and animation frame, as well as a root position and orientation (rotation in the ground plane). We assume a fixed camera viewpoint. Note that the root position affects both the location and scale of the 3D model in the rendered image. All these parameters can be summarized as a variable-length vector θ, where each vector corresponds to a particular scene instantiation.
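As a concrete illustration, such a variable-length scene-parameter vector can be sketched as a plain Python structure. The specific ranges below are hypothetical placeholders, not the values from Table 1:

```python
import random

def sample_scene_params(num_backgrounds=1726, num_avatars=20,
                        num_animations=10, max_people=5, seed=None):
    """Sample one variable-length scene-parameter vector theta.

    Ranges here are illustrative placeholders; the actual viable
    ranges are defined in Table 1.
    """
    rng = random.Random(seed)
    n = rng.randint(1, max_people)  # number of 3D models in the scene
    return {
        "background": rng.randrange(num_backgrounds),
        "light_angle_x": rng.uniform(0.0, 90.0),   # sun elevation (deg)
        "light_angle_y": rng.uniform(0.0, 360.0),  # sun azimuth (deg)
        "models": [
            {
                "avatar": rng.randrange(num_avatars),
                "animation": rng.randrange(num_animations),
                "anim_time": rng.uniform(0.0, 1.0),  # normalized frame
                # root position on the ground plane; distance from the
                # camera also determines apparent scale in the image
                "position": (rng.uniform(-10.0, 10.0), rng.uniform(2.0, 30.0)),
                "rotation_y": rng.uniform(0.0, 360.0),
            }
            for _ in range(n)
        ],
    }
```

Each call yields one scene instantiation ready to be handed to the renderer.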
Synthesis: Our generator G(θ), or rendering engine, synthesizes an image corresponding to θ. Importantly, we can also synthesize labels for each rendered image, specifying object type, 3D location, pixel segmentation masks, etc. In practice, we make use of only 2D object bounding boxes. Table 1 shows the viable ranges of each parameter. In addition, we found the following heuristic useful for simulating reasonable object layouts: we enforce a maximum overlap of 20% between any two 3D models (to avoid congestion) and require that the projected location of each 3D model lie within the camera’s field of view. These conditions are straightforward to verify for a given vector θ without rendering any pixels, and so can be efficiently enforced through rejection sampling (i.e., generate a random vector and render only those that pass these conditions). Unlike Hironori et al., who generate training data by manually tuning scene parameters to match specific scenes, our approach is not scene-specific and does not require any manual intervention.
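The layout heuristic above can be sketched as a rejection test. The geometry here (2D ground-plane footprint boxes, a camera at the origin looking down +y with an illustrative 30° half-angle wedge) is a simplified stand-in for the engine's actual visibility checks:

```python
import math

def overlap_ratio(a, b):
    """Intersection area divided by the smaller box's area;
    boxes are (x1, y1, x2, y2) ground-plane footprints."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    smaller = min((a[2] - a[0]) * (a[3] - a[1]),
                  (b[2] - b[0]) * (b[3] - b[1]))
    return ix * iy / smaller if smaller > 0 else 0.0

def in_fov(pos, half_angle_deg=30.0):
    """Camera at the origin looking down +y; accept ground positions
    inside the viewing wedge (half-angle is an illustrative value)."""
    x, y = pos
    return y > 0 and abs(math.degrees(math.atan2(x, y))) <= half_angle_deg

def accept_scene(footprints, positions, max_overlap=0.2):
    """Rejection test: every model inside the field of view, and no
    pair of footprints overlapping by more than 20%."""
    if not all(in_fov(p) for p in positions):
        return False
    return all(overlap_ratio(footprints[i], footprints[j]) <= max_overlap
               for i in range(len(footprints))
               for j in range(i + 1, len(footprints)))
```

Sampling then simply loops: draw a random θ, and render it only if `accept_scene` passes.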
Pre-processing: Synthesized images and Precarious Pedestrian images may be of different sizes. We isotropically scale each image to a resolution of 960×720, zero-padding as necessary. Our experiments also make use of the Caltech Pedestrian benchmark, to which we apply the same pre-processing.
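The scale-and-pad geometry can be computed as follows. Centering the padding is our assumption, since the text does not specify where the zero rows/columns are placed:

```python
def fit_isotropic(w, h, target_w=960, target_h=720):
    """Compute the isotropic-resize geometry for zero-padding an
    image of size (w, h) into a target_w x target_h canvas.

    Returns (new_w, new_h, pad_left, pad_top).  Padding is centered
    here, which is an assumption about the placement.
    """
    scale = min(target_w / w, target_h / h)  # preserve aspect ratio
    new_w, new_h = round(w * scale), round(h * scale)
    pad_left = (target_w - new_w) // 2
    pad_top = (target_h - new_h) // 2
    return new_w, new_h, pad_left, pad_top
```

The same transform must also be applied to bounding-box labels (scale, then offset by the pads) so that annotations stay aligned with the pixels.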
Domain adaptation: In this section, we introduce a novel framework for adversarially adapting detectors from synthetic training data to real training data. We use x to denote an image and y to denote its label vector (a set of bounding-box labels). Let P_s(x, y) refer to the distribution of image-label pairs from the source domain (of synthetic images), and P_t(x, y) to the target domain (of real Precarious Pedestrians). In our problem, we expect large amounts of source samples but only a limited amount of target ones. We factorize the joint into a marginal over image appearance and a conditional on the label given the appearance, e.g., P(x, y) = P(y|x) P(x). Importantly, we discriminatively train a feedforward function f(x) to match the conditional distribution. Our central question is how to transfer feedforward predictors trained on source samples to the target domain P_t.
Fine-tuning: The most natural approach to domain adaptation may simply be to fine-tune a predictor f, originally trained on the source, with samples from the target. Indeed, virtually all contemporary methods for visual recognition make use of fine-tuned models that were pre-trained on ImageNet. We compare to such a strategy in our experiments, but find that fine-tuning works best when the source and target distributions are similar. As we argue, while rendering engines can produce photorealistic scenes, it is difficult to specify a prior over scene parameters that mimics real (Precarious) scenes. We describe a solution that adversarially learns such a prior.
Generators: As introduced in Sec. 3.2, let θ be a vector of scene parameters, G(θ) be a feedforward generator function that renders a synthetic image given the scene parameters, and L(θ) be a function that generates labels from the scene parameters. We can then reparameterize the distribution over synthetic images as a distribution over scene parameters p(θ). We now describe a procedure for learning a prior p(θ) that allows for easier transfer. Specifically, we learn a prior that fools an adversary trying to distinguish samples from the source and target.
Adversarial generators: To describe our approach, we first recall the traditional generative adversarial network (GAN) objective:

    min_G max_D  E_{x ~ P_real}[ log D(x) ] + E_{z ~ p(z)}[ log(1 - D(G(z))) ]    (1)

where the minimax optimization jointly tries to estimate a discriminator D that can distinguish real versus synthesized data examples, while the generator G tries to synthesize realistic examples that fool the discriminator. Typically, the discriminator D(x) is trained to output the probability that x is real (e.g., a real Precarious Pedestrian), while the prior p(z) is fixed to be a zero-mean, unit-variance Gaussian. This optimization can be performed with stochastic gradient updates that converge (in the limit) to a fixed point of the minimax problem. We refer the reader to the literature for an excellent introduction. Importantly, the generator must encode complex constraints about the manifold of natural images, capturing amongst other knowledge the physical properties of light transport and material appearance.
Adversarial priors: We note that rendering engines can be viewed as generators that already contain much of this knowledge, and so we fix G to be a production-quality rendering platform (Unity 3D). Instead, we learn the prior p(θ) over scene-parameter vectors in an adversarial manner:

    min_{p(θ)} max_D  E_{x ~ P_t}[ log D(x) ] + E_{θ ~ p(θ)}[ log(1 - D(G(θ))) ]    (2)

If the generator is differentiable with respect to θ, it is possible to use backprop to compute gradient updates for simple prior distributions p(θ), such as Gaussians [22, 39]. This implies that the above formulation of adversarial priors is amenable to gradient-based learning.
Imposter search: We see two difficulties with directly applying (2) to our problem: (1) it seems unlikely that the optimal prior over precarious scene parameters will be a simple unimodal distribution with a single mean parameter vector (and associated covariance matrix); and (2) rendering, while readily expressed as a feedforward function, is not naturally differentiable at object boundaries (where small changes in parameters can generate large changes in the rendered image). While approximate differentiable renderers do exist, we wish to make use of highly-optimized commercial packages for animation and image synthesis (such as Unity 3D). As such, we adopt a simple sampling-based approach that addresses both limitations:

    Θ* = { θ_i : D(G(θ_i)) ≥ 0.5 }

where {θ_i} is a large set of randomly sampled parameter vectors and D(G(θ_i)) is the discriminator's estimate of the probability that the rendered image is real. That is, we search for a subset of parameter vectors (the “imposters”) that fool the discriminator. One could employ various sequential sampling strategies for optimizing the above: start with a random sample of parameter vectors, update the discriminator (with gradient-based updates using a batch of real and synthesized data), generate additional samples close to those imposters that fool the discriminator, and repeat. We found a single iteration to work quite well. Our algorithm for synthesizing a realistic set of precarious scenes is given in Alg. 1, and the overall approach for adversarial domain adaptation is given in Alg. 2.
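A single round of the imposter search can be sketched as follows. Here `sample_params`, `render`, and `discriminator` are hypothetical stand-ins for the prior sampler, the Unity rendering call G(θ), and the trained discriminator D:

```python
def find_imposters(sample_params, render, discriminator,
                   n_candidates=8000, threshold=0.5):
    """One round of imposter search: sample scene parameters, render
    each candidate, and keep those the discriminator scores as
    likely-real (probability >= threshold).

    All three callables are placeholders for the real components.
    """
    imposters = []
    for _ in range(n_candidates):
        theta = sample_params()   # draw a scene-parameter vector
        image = render(theta)     # hypothetical rendering call G(theta)
        if discriminator(image) >= threshold:
            imposters.append((theta, image))
    return imposters
```

Sequential variants would re-train the discriminator on the surviving imposters and repeat; as noted above, a single iteration already works well.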
Here, the set S consists of synthetic image-label pairs rendered from an exhaustive set of scene parameters, and the set T consists of real (Precarious) image-label pairs. Without Step 2, Alg. 2 reduces to standard fine-tuning from a source to a target domain. Step 2 can be thought of as “marginal distribution adaptation”, since the distribution of imposter images mimics the true target distribution P_t(x), at least from the discriminator’s perspective. Importantly, however, the discriminator has not made use of labels to find imposters, and so imposter labels may not mimic the true target label distribution. Because of this, we opt to finally fine-tune on the target image-label pairs. Alternatively, one may explore a discriminator that directly operates on pairs of data and labels, as in .
Discriminator D: Our discriminator is a VGG16 network trained to output the probability that an input image is real (with label 1) or synthetic (with label 0). We found that a modest number of images sufficed for training: 500 images from the Precarious Pedestrian train split and 1000 random synthetic images. We downsample images to 384×288 to accelerate training. After training D, we generate another set of 8000 synthetic images and select various subsets of size N to define the imposter set (examined further in our experiments). We roughly find that 2.5% of the synthesized images can serve as reasonable imposters.
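Selecting a fixed-size imposter subset (e.g., the roughly 2.5% scored as most realistic) can be sketched as a simple top-fraction cut; this is a minimal illustration, not the paper's exact selection code:

```python
def select_imposters(scored, fraction=0.025):
    """Keep the top `fraction` of synthetic images ranked by the
    discriminator's probability-real score.

    `scored` is a list of (image_id, score) pairs; returns the ids of
    the most realistic samples.
    """
    n = max(1, int(len(scored) * fraction))
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [image_id for image_id, _ in ranked[:n]]
```

Ranking by score (rather than thresholding at 0.5) makes the subset size N easy to control, which matters for the balance experiments discussed later.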
Predictor f: We make use of a detection system based on a region proposal network (RPN) [38, 49]. Rather than training an RPN to return objectness proposals, we train it to directly return pedestrian bounding boxes. Our network, denoted RPN+, is illustrated in Figure 16. RPN+ is a fully convolutional network implemented in TensorFlow. We concatenate feature maps from several stages of the network in order to improve localization of people at different resolutions. We use 9 anchors (reference boxes with 3 scales and 3 aspect ratios) at each sliding position. During training, a candidate bounding box is treated as a positive if its intersection-over-union overlap with a ground-truth box exceeds 50%, and as a negative if the overlap is less than 20%. To accelerate training, we initialize with a pre-trained VGG-16 model whose first two convolutional layers are frozen.
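The anchor construction and the 50%/20% labeling rule can be sketched as follows. The stride and scale values here are illustrative assumptions following common RPN defaults, not RPN+'s actual configuration:

```python
import itertools

def make_anchors(stride=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate 9 reference boxes (3 scales x 3 aspect ratios) centered
    at the origin, as (x1, y1, x2, y2).  Values are illustrative."""
    anchors = []
    for scale, ratio in itertools.product(scales, ratios):
        area = float(stride * scale) ** 2
        w = (area / ratio) ** 0.5   # ratio = height / width
        h = w * ratio
        anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def label_anchor(anchor, gt_boxes, pos_thresh=0.5, neg_thresh=0.2):
    """Assign 1 (positive), 0 (negative), or None (ignored) using the
    50%/20% IoU rule described in the text."""
    best = max((iou(anchor, g) for g in gt_boxes), default=0.0)
    if best > pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return None
```

Anchors whose best overlap falls between the two thresholds contribute no loss, which is the usual treatment of the ambiguous band.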
We follow the evaluation protocol of the Caltech pedestrian dataset, which uses ROC curves for 2D bounding-box detection at 50% and 70% overlap thresholds.
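The core quantity in this protocol, miss rate at a fixed IoU threshold, can be sketched as follows; greedy highest-confidence matching is our simplifying assumption, and the official Caltech toolkit implements the full protocol:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def miss_rate(gt_boxes, detections, iou_thresh=0.5):
    """Fraction of ground-truth boxes left unmatched at the given IoU
    threshold.  `detections` is a list of (box, confidence) pairs,
    matched greedily in order of decreasing confidence."""
    matched = set()
    for box, _score in sorted(detections, key=lambda d: -d[1]):
        best, best_ov = None, iou_thresh
        for i, gt in enumerate(gt_boxes):
            if i in matched:
                continue
            ov = iou(box, gt)
            if ov >= best_ov:
                best, best_ov = i, ov
        if best is not None:
            matched.add(best)
    return (1.0 - len(matched) / len(gt_boxes)) if gt_boxes else 0.0
```

Sweeping the detector's confidence threshold and plotting this miss rate against false positives per image yields the ROC-style curves used below.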
Testsets: We use three different datasets for evaluation: our novel Precarious Pedestrian testset of real images, our novel Adversarial Imposter testset, and, for diagnostics, a standard pedestrian benchmark (Caltech).
Baselines: We compare our approach with the following baselines:
ACF: An aggregate channel features detector.
LDCF: An LDCF detector.
HOG+Cascade: A cascade of boosted classifiers working with HOG features.
Haar+Cascade: A cascade of boosted classifiers working with Haar-like features [47, 25].
RPN/BF: An RPN detection model trained with a boosted forest, which appears to be the state-of-the-art pedestrian detection system at the time of publication.
Precarious Pedestrians: Results on Precarious Pedestrians are presented in Figure 19. Our detector significantly outperforms alternative approaches, including the state-of-the-art RPN/BF model. At a fixed number of false positives per image, our miss rate of 42.47% significantly outperforms all baselines, including the state-of-the-art RPN/BF model (with a miss rate of 54.5%). Note that all baseline detectors are trained on Caltech, so comparing to baselines is complicated by the fact that both the detection system and the training dataset have changed. However, in some sense, our fundamental contribution is a method for generating more accurate training datasets through Adversarial Imposters. To isolate the impact of our underlying detection network RPN+, we also train a variant solely on the Caltech training set (denoted RPN+Caltech), making it directly comparable to all baselines, which use the same training set. RPN+Caltech performs slightly worse than RPN/BF (with a miss rate of 58.82%), though it outperforms RPN/BF at higher false-positive rates. This suggests that our underlying network is close to state-of-the-art, and moreover validates the significant improvement from training with Adversarial Imposters. Figure 22 visualizes the results of RPN+, both trained on Caltech and trained with Adversarial Imposters. Qualitatively, we find that Precarious Pedestrians tend to exhibit more pose variation than typical pedestrians, which requires detection systems able to report a wider range of bounding-box scales and aspect ratios.
Adversarial Imposters: We also explore how detectors perform on a testset of Adversarial Imposters. Note that we can generate an arbitrarily large testset since it is synthetic. Figure 25 and Figure 31 show that performance on real and synthetic test data yields the same ranking of detectors. These results suggest that synthetic data may be useful as a testset for evaluating detectors on rare (but important) scenarios that are difficult to observe in real test data.
Caltech: Finally, for completeness, we also test our RPN+ network on the Caltech dataset in Figure 28. Here, all detectors are trained on the Caltech dataset. For reference, the RPN+Caltech model would currently rank 6th out of 68 entries on the Caltech leaderboard. We also evaluated our final model (trained with Adversarial Imposters) on Caltech, but saw lackluster performance. We posit that this is due to the different distribution of scales and aspect ratios in Precarious Pedestrians. We leave further cross-dataset analysis to future work.
|Fine-tuning method|50% overlap|70% overlap|
In this section, we explore several variants of our approach. Table 2 examines different fine-tuning strategies for adapting detectors from the source domain of synthetic images to the target domain of real Precarious images. Fine-tuning via Imposters performs best (42.47% miss rate), noticeably outperforming the commonplace baselines of traditional fine-tuning (by 6%) and training only on the target data (by 24%).
Figure 31 examines the effect of N, the size of the imposter set. We find good performance when N is equal to the size of the target set of Precarious Pedestrians used for training. In retrospect, this may not be surprising, as it produces a balanced distribution of real images and Adversarial Imposters for training. Finally, Figure 31 also explores the impact of the discriminator, plotting performance as a function of the training epoch used to learn D. As we train a better discriminator, the performance of our overall adversarial pipeline gets noticeably better.
We have explored methods for analyzing “in-the-tail” urban scenes, which represent important modes of operation for autonomous vehicles. Motivated by the fact that rare but dangerous scenes are exactly the scenarios on which visual recognition should excel, we first analyze existing datasets and illustrate that they do not contain sufficient rare scenarios (because they naturally focus on common or typical urban scenes). To address this gap, we have collected our own dataset of Precarious Pedestrians, which we will release to spur further research on this important (but underexplored) problem. Precarious scenes are challenging because little data is available for both evaluation and training. To address this challenge, we propose the use of synthetic data generated with a game engine. However, it is challenging to ensure that the synthesized data matches the statistics of real precarious scenarios. Inspired by generative adversarial networks, we introduce the use of a discriminative classifier (trained to distinguish real vs. synthetic data) to implicitly specify this distribution. We then use the synthesized data that fools the discriminator (the “synthetic imposters”) to both train and evaluate state-of-the-art, robust pedestrian detection systems.
Acknowledgements. This work was supported by NSF Grant 1618903, NSF Grant 1208598, and Google. We thank Yinpeng Dong and Mingsheng Long for helpful discussions.