Data simulation, as can be conveniently performed in graphics engines, provides valuable convenience and flexibility for computer vision research [richter2016playing, sakaridis2018semantic, ruiz2019learning, tremblay2018training, sun2019dissecting]. One can simulate a large amount of data under various combinations of environmental factors, even from a small number of 3D object/scene models. In order for simulated data to be effective in the real world, the domain gap should be addressed on two levels: the content level and the appearance level [kar2019meta]. While much existing work focuses on appearance (style)-level domain adaptation [deng2018image, hoffman2018cycada, zhong2018camera], we focus on the content level, i.e., learning to simulate data with similar content to the real data, as different computer vision tasks require different image content.
This paper investigates the content simulation of synthetic datasets, which can augment real-world datasets for the specific real-world task of vehicle re-ID. This task matches a query vehicle image with every database image so as to find the true matches containing the same vehicle. Here, based on the graphics software Unity, we propose an editable vehicle generation engine called VehicleX. Given a target real dataset, we propose a method allowing the VehicleX engine to generate a synthetic dataset with similar attributes to the real data, such as orientation, illumination, camera coordinates, and so on.
Our system is designed based on the following considerations. It is expensive to collect large-scale datasets for re-ID tasks. During annotation, one needs to associate an object across different cameras, a difficult and laborious process, as objects might exhibit very different appearances in different cameras. In addition, there has also been increasing concern over privacy and data security, which makes collection of large real datasets difficult. On the other hand, we can see that datasets can be very different in their content. Here, content means the object layout, illumination, and background in the image. For example, the VehicleID dataset [liu2016deep] consists mostly of car rears and car fronts, while vehicle viewpoints in VeRi-776 [liu2016large] cover a very diverse range. This content-level domain gap might cause a model trained on VehicleID to have poor performance on VeRi. Most existing domain adaptation methods work on the pixel level or the feature level, so as to allow the source and target domains to have similar appearance or feature distributions. However, these approaches are not capable of handling content differences, as often encountered when training on synthetic data and testing on real data.
The flexibility of 3D graphics engines allows us to 1) scale up the training data and 2) potentially reduce the content domain gap between simulated data and real-world data. In order to make effective use of our approach, we make contributions from two aspects. First, we introduce a large-scale synthetic dataset named VehicleX, which lays the foundation of our work. It contains 272 backbone models, which, after coloring, yield 1,209 different vehicles. Similar to existing 3D synthetic datasets such as PersonX [sun2019dissecting] and ShapeNet [chang2015shapenet], VehicleX has editable attributes and is able to generate a large training set by varying object and environment attributes.
Based on the VehicleX platform, we propose an attribute descent method which automatically configures the vehicle attributes, such that the simulated data shares similar content distributions with the real data of interest. Specifically, we manipulate a range of five key attributes closely related to the vehicle. To measure the distribution discrepancy between the simulated and real data, we use the FID score and aim to minimize it. In each iteration, we optimize the values of attributes in a specific sequence, which usually terminates within two iterations of optimization. The system workflow is shown in Figure 1.
We show the efficacy of performing attribute descent on VehicleX by jointly training the resulting simulated data with real-world datasets. We show that the simulated training data with optimized attributes can consistently improve re-ID accuracy on the real-world test set. Under this augmentation scheme, we achieve very competitive re-ID accuracy compared with the state-of-the-art approaches, validating the effectiveness of learning from data synthesis.
2 Related Work
Vehicle re-identification has received increasing attention in the past few years, and many effective systems have been proposed, such as those based on vehicle keypoints [khorramshahi2019dual, wang2017orientation], color [tang2019pamtri] and viewpoints [zhou2018aware]. Moreover, useful loss functions have been successfully adopted, such as the cross-entropy loss [zheng2016mars], the triplet loss [hermans2017defense] and label smoothing regularization (LSR) [szegedy2016rethinking]. In this paper, our baseline system is built with commonly used loss functions, with no bells and whistles. Depending on the camera conditions, location and environment, existing vehicle re-ID datasets usually have their own distinct characteristics. For example, images in VehicleID [liu2016deep] are captured either from the front or the back. In comparison, VeRi-776 [liu2016large] includes a wide range of viewpoints. The recently introduced CityFlow [tang2019cityflow] has distinct camera heights and backgrounds. Despite these characteristics, our proposed data simulation approach can effectively augment various re-ID datasets due to its strong ability in content adaptation.
Appearance (style)-level domain adaptation. Domain adaptation is used to reduce the domain gap between the distributions of two datasets. To our knowledge, the majority of work in this field focuses on discrepancies in image style, such as real vs. synthetic [bak2018domain] and real vs. sketch [peng2019moment]. For example, some works use generative adversarial networks (GANs) to reduce the style gap between two domains [hoffman2018cycada, shrivastava2017learning, deng2018image]. Various constraints are exerted on the generative model such that useful properties are preserved during image translation. While these works have been shown to be effective in reducing the style domain gap, a fundamental problem remains to be solved, i.e., the content difference.
Content-level domain adaptation, to our knowledge, has been discussed by only a few existing works [kar2019meta, ruiz2019learning]. This paper adopts their advantages and makes new contributions. On the one hand, we adopt the idea of Ruiz et al. [ruiz2019learning] of representing attributes using predefined distributions. We are also motivated by Kar et al. [kar2019meta], who suggest that GAN evaluation metrics (e.g., KID [binkowski2018demystifying]) are potentially useful for measuring content differences. On the other hand, these two methods [kar2019meta, ruiz2019learning] use gradient-based techniques for attribute optimization, such as reinforcement learning and finite differences. Such gradient-based methods contain many random variables and are sensitive to learning rates. To overcome this training difficulty, we propose attribute descent, which does not involve random variables and has easy-to-configure step sizes. Our optimization method is easy to train, and the objective function converges stably.
Learning from synthetic data. Due to low data acquisition costs, learning from synthetic data is an attractive way to increase training set scale. Many applications exist in areas such as semantic segmentation [hoffman2018cycada, gaidon2016virtual], navigation [kolve2017ai2], object re-identification [sun2019dissecting, tang2019pamtri], etc. Usually, prior knowledge is utilized during data synthesis, since we inevitably need to determine the distribution of attributes in our defined environment. Tremblay et al. suggest that attribute randomness within a reasonable range is beneficial [tremblay2018training]. Even with randomization, the ranges of the random variables must be specified in advance. Our work investigates and learns these attribute distributions for vehicle re-ID tasks.
Automatic data augmentation. This paper is also related to automatic data augmentation [cubuk2019autoaugment, geng2018learning], which learns parameters for data augmentation techniques like random crop or image flipping. They share a similar objective with us in generating extra data for model training. The key difference is that they are not specifically designed for the problem of domain gaps, let alone on the content level.
3 VehicleX Generation Engine
We introduce a large-scale synthetic dataset generator named VehicleX that includes two components: (1) vehicles created using the graphics engine Unity and (2) a Python API that interacts with the Unity 3D engine.
VehicleX has a diverse range of vehicle models and identities, allowing it to adapt to the variance of real-world datasets. It has 272 backbones hand-crafted by artists, covering various vehicle types such as sedan, minivan, police car, ambulance, fire truck and tank truck. Each backbone represents a real-world model. From these backbones, we obtain 1,209 identities by adding various colors or accessories. A comparison of VehicleX with some existing vehicle re-ID datasets is presented in Table 1. VehicleX is three times larger than the synthetic PAMTRI dataset [tang2019pamtri] and can potentially render an unlimited number of images and cameras.
In this work, we define a training mode and a testing mode of VehicleX. The training mode has a black background and is used for attribute descent learning (see Section 4); in comparison, the testing mode uses random images (e.g., from CityFlow [tang2019cityflow]) as backgrounds, and generates attribute-edited images. In addition, to increase randomness, the testing mode contains street objects such as lamp posts, billboards and trash cans. Figure 2 shows the simulation platform, and some sample vehicle identities.
Table 1: Comparison of some real-world and synthetic vehicle re-ID datasets. VehicleX will be released open source and can generate an unlimited number of images from an unlimited number of cameras.
We build the Unity-Python interface using the Unity ML-Agents plugin [juliani2018unity]. It allows Python to modify the attributes of the environment and vehicles, and obtain the rendered images. With this API, users can easily obtain rendered images by editing attributes without needing expert knowledge about Unity. The code of this API will be released together with VehicleX.
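To illustrate how such an interface might be used from Python, the sketch below stubs out the Unity side entirely; the class and method names are hypothetical and do not correspond to the released API, and the render call returns a placeholder instead of an actual frame.

```python
class VehicleXClient:
    """Hypothetical sketch of a Unity-Python interface: Python sets
    attribute values and requests rendered images. Rendering is stubbed
    out here because it requires the Unity runtime."""

    def __init__(self):
        self.attributes = {}

    def set_attributes(self, **attrs):
        # Update environment/vehicle attributes (orientation, lighting, camera).
        self.attributes.update(attrs)

    def render(self):
        # The real API would request a frame from Unity (e.g., via the
        # ML-Agents channel); here we return a placeholder record.
        return {"image": None, "attributes": dict(self.attributes)}


client = VehicleXClient()
client.set_attributes(orientation=45.0, camera_height=6.0)
frame = client.render()
```

A user only edits attribute values and receives rendered images, without touching Unity itself.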
4 Proposed Method
4.1 Attribute Distribution Modeling
Important attributes. For vehicle re-ID, we consider the following attributes to be potentially influential on the training set simulation and testing accuracy. Figure 3 shows examples of the attribute editing process.
Vehicle orientation is the horizontal viewpoint of a vehicle and takes a value between 0° and 359°. In the real world, this attribute is important because camera positions are usually fixed and vehicles usually move along predefined trajectories. Therefore, the vehicle orientations captured by a certain camera can be highly similar, and we would like to learn this attribute using the VehicleX engine.
Light direction models the daylight, as vehicles generally appear in outdoor scenes. Here, we assume directional parallel light, and the light direction is modeled from east (0°) to west (180°), following the trajectory of the sun. For simplicity, we set the light's color to light yellow.
Light intensity is usually considered a critical factor for re-ID tasks. We manually define a reasonable intensity range from dark to bright.
Camera height describes the vertical distance from the ground, and significantly influences viewpoints.
Camera distance determines the horizontal distance from vehicles. This factor has a strong effect on the vehicle resolution, since the resolution of the entire image is predefined as 1920×1080. Additionally, the distance has a slight impact on viewpoints.
We model the aforementioned attributes with a multivariate Gaussian or a Gaussian mixture model (GMM). This modeling strategy is also used in Ruiz et al.'s work [ruiz2019learning]. We denote the attribute list as $A = (a_1, \dots, a_n)$, where $n$ is the number of attributes considered in the system, and $a_i$ is the random variable representing the $i$-th attribute.
For the vehicle orientation, we use a GMM with six components. This is based on our prior knowledge that the field-of-view of a camera covers either a road or an intersection. If we do not consider vehicle turning, there are rarely more than four major directions at a crossroad. For the lighting conditions and camera coordinates, we use four independent Gaussian distributions, modeling light direction, light intensity, camera height, and camera distance, respectively. Therefore, given $n$ attributes, we optimize $m$ mean values of the Gaussians, where $m \geq n$.
We speculate that the means of the Gaussian distributions or components are more important than the standard deviations, because the means reflect how the majority of the vehicles look. Although our method has the ability to handle variances, doing so would significantly increase the search space. As such, we predefine the values of the standard deviations from experience and only optimize the means $\boldsymbol{\theta} = (\theta_1, \dots, \theta_m)$ of all the Gaussians, where $\theta_i$ is the mean of the $i$-th Gaussian. As a result, given the means of the Gaussians, we can sample an attribute list as $A = G(\boldsymbol{\theta})$, where $G$ is a function that generates a set of attributes given the Gaussian means.
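The sampling function described above can be sketched in pure Python. All means, standard deviations, and attribute names below are illustrative placeholders, not the values learned by our engine; the GMM components are drawn with equal weights for simplicity.

```python
import random

# Illustrative GMM component means for orientation (degrees); the paper
# uses six components, one per major traffic direction.
ORIENTATION_MEANS = [0, 60, 120, 180, 240, 300]
ORIENTATION_STD = 10.0  # standard deviations are predefined, not learned

# Illustrative (mean, std) pairs for the four single-Gaussian attributes.
OTHER_GAUSSIANS = {
    "light_direction": (90.0, 20.0),
    "light_intensity": (0.5, 0.1),
    "camera_height": (5.0, 1.0),
    "camera_distance": (10.0, 2.0),
}


def sample_attribute_list(rng):
    """Draw one attribute list A = G(theta): pick a GMM component for
    orientation, then sample every attribute from its Gaussian."""
    mean = rng.choice(ORIENTATION_MEANS)
    attrs = {"orientation": rng.gauss(mean, ORIENTATION_STD) % 360}
    for name, (mu, sigma) in OTHER_GAUSSIANS.items():
        attrs[name] = rng.gauss(mu, sigma)
    return attrs
```

Each call yields one attribute list; rendering one image per sampled list produces the simulated dataset.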
The objective of our optimization step is to generate a dataset that has a similar content distribution with respect to a target real dataset.
Measuring distribution difference. We use the Fréchet Inception Distance (FID) [heusel2017gans] to quantitatively measure the distribution difference between two datasets. The FID score was originally introduced to measure the performance of GANs by analyzing the difference between generated and real images. Adversarial loss is not used as the measurement since there exists a huge appearance difference between synthetic and real data, and the discriminator would easily tell the difference between real and fake.
Formally, we denote the sets of simulated data and real data as $S$ and $T$, respectively, where $S = \{R(A_j)\}_{j=1}^{N}$ and $R$ is our rendering function through the 3D graphics engine, working on a given attribute list that controls the environment; $N$ is the number of images in the simulated dataset. For FID calculation, we employ the Inception-V3 network [szegedy2016rethinking] to map an image into its feature space. We view the feature as a multivariate real-valued random variable and assume that it follows a Gaussian distribution. To measure the distribution difference between two Gaussians, we resort to their means and covariance matrices. Under FID, the distribution difference between the simulated data and the real data is written as,

$$\text{FID}(S, T) = \left\lVert \boldsymbol{\mu}_S - \boldsymbol{\mu}_T \right\rVert_2^2 + \operatorname{Tr}\!\left( \boldsymbol{\Sigma}_S + \boldsymbol{\Sigma}_T - 2\,(\boldsymbol{\Sigma}_S \boldsymbol{\Sigma}_T)^{\frac{1}{2}} \right), \tag{1}$$

where $\boldsymbol{\mu}_S$ and $\boldsymbol{\Sigma}_S$ denote the mean and covariance matrix of the feature distribution of the simulated data, and $\boldsymbol{\mu}_T$ and $\boldsymbol{\Sigma}_T$ are from the real data.
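As a small illustration of the Fréchet distance, the closed form for one-dimensional Gaussian features reduces to $(\mu_S - \mu_T)^2 + \sigma_S^2 + \sigma_T^2 - 2\sigma_S\sigma_T$. The actual computation uses 2048-dimensional Inception-V3 features and a matrix square root, which we omit here; this sketch is only the scalar special case.

```python
import math


def fid_univariate(mu_s, sigma_s, mu_t, sigma_t):
    """FID between two 1-D Gaussians: the scalar special case of Eq. 1,
    (mu_s - mu_t)^2 + sigma_s^2 + sigma_t^2 - 2*sigma_s*sigma_t."""
    return (mu_s - mu_t) ** 2 + sigma_s ** 2 + sigma_t ** 2 - 2.0 * sigma_s * sigma_t


def gaussian_stats(features):
    """Mean and (population) standard deviation of a 1-D feature list."""
    n = len(features)
    mu = sum(features) / n
    var = sum((x - mu) ** 2 for x in features) / n
    return mu, math.sqrt(var)
```

Identical distributions give a distance of zero, and the distance grows as the means or spreads diverge, which is exactly the property the optimization exploits.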
An important difficulty for attribute optimization is that the rendering function (through the 3D engine) is not differentiable, so widely used gradient-descent based methods cannot be readily applied. In this situation, there exist several methods for gradient estimation, such as finite difference [kar2019meta] and reinforcement learning [ruiz2019learning]. These methods are developed in scenarios where there are many attributes/parameters to optimize. In comparison, our system only contains a few parameters, allowing us to design a more stable and efficient approach that is sufficiently effective in finding a global minimum.
We are motivated by coordinate descent, an optimization algorithm that can work in derivative-free contexts [wright2015coordinate]. The most commonly known algorithm that uses coordinate descent is k-means [lloyd1982least]. Coordinate descent successively minimizes along coordinate directions to find the minimum of a function. The algorithm selects a coordinate to perform the search at each iteration.
Using Eq. 1 as the objective function, we propose attribute descent to optimize each single attribute in the attribute list. Specifically, we view each attribute as a coordinate in the coordinate descent algorithm. In each iteration, we successively change the value of an attribute to search for the minimum value of the objective function.
Formally, for the defined parameters $\boldsymbol{\theta} = (\theta_1, \dots, \theta_m)$ of the attribute list, the objective is to find,

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \text{FID}\big(S(\boldsymbol{\theta}),\, T\big),$$

where $S(\boldsymbol{\theta})$ denotes the set of images rendered from attributes sampled by $G(\boldsymbol{\theta})$. We achieve this objective iteratively. Initially, we have,

$$\boldsymbol{\theta}^0 = (\theta_1^0, \dots, \theta_m^0).$$

At epoch $t$, we optimize a single variable $\theta_i$ in $\boldsymbol{\theta}$,

$$\theta_i^t = \arg\min_{\theta_i \in X_i} \text{FID}\big(S(\theta_1^t, \dots, \theta_{i-1}^t, \theta_i, \theta_{i+1}^{t-1}, \dots, \theta_m^{t-1}),\, T\big),$$

where the $X_i$ define a specific search space for mean variable $\theta_i$. For example, the search space for vehicle orientation runs from 0° to 350° in 10° increments; the search space for camera height is its editable range equally divided into 9 segments. $t = 1, \dots, T$ are the training epochs. We find that training 1–2 epochs usually suffices. In this algorithm, we perform a greedy search for the optimal value of one attribute in each step, reaching a local minimum for that attribute while the rest are fixed. The algorithm is visualized in Figure 4.
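The greedy coordinate search described above can be sketched as follows. This is a minimal illustration with a mock objective; in the paper, the objective is the FID between rendered and real images, and the function and variable names here are our own.

```python
def attribute_descent(search_spaces, objective, epochs=2):
    """Greedy coordinate search: in each epoch, sweep the attributes in
    order and fix each one at the value minimizing the objective while
    the other attributes are held at their current values.

    search_spaces: list of candidate-value lists, one per attribute.
    objective: callable taking the full attribute-mean vector, lower is better
               (FID against the real dataset in the paper; mocked in tests).
    """
    # Initialize every mean at the minimum of its search space.
    current = [space[0] for space in search_spaces]
    for _ in range(epochs):
        for i, space in enumerate(search_spaces):
            best_val, best_score = current[i], None
            for candidate in space:
                trial = list(current)
                trial[i] = candidate
                score = objective(trial)
                if best_score is None or score < best_score:
                    best_val, best_score = candidate, score
            current[i] = best_val
    return current
```

Because each sweep only ever keeps the best candidate seen, the objective is non-increasing across steps, which matches the stable convergence observed in practice.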
Discussions. Our method can be seen as an analogy to k-means. In k-means, the search for the minimum of the objective function is done by iterating between the E-step and M-step. The E-step or M-step can be viewed as searching along a coordinate. In our method, we view each Gaussian mean (associated with the attributes) as a coordinate and search for the minimum along it.
Existing gradient-based methods [kar2019meta, ruiz2019learning] involve many intermediate random variables to be optimized and are relatively sensitive to learning rates. Moreover, FID measures the distribution difference of two datasets, so the non-differentiable supervision signal provided by FID is far less precise than a cross-entropy loss. These difficulties potentially over-complicate training in a simple environment like ours. Our method does not contain any random variables, and the step size for optimizing each attribute is stable. Therefore, attribute descent converges stably and is much easier to train. We speculate, however, that when simulating more complex environments where the search space is much larger, reinforcement learning or finite difference could also be effective.
5 Experiments
5.1 Datasets and Evaluation Protocol
Datasets. We use three real-world datasets for evaluation. The VeRi-776 dataset [liu2016large] contains 49,357 images of 776 vehicles captured by 20 cameras. The vehicle viewpoints and illumination cover a diverse range. The training set has 37,778 images, corresponding to 576 identities; the test set has 11,579 images of 200 identities. There are 1,678 query images. The train/test sets share the same 20 cameras. VehicleID [liu2016deep] is larger in scale, containing 221,567 images of 26,328 identities. Half of the identities are used for training, and the other half for testing. Officially, there are six test splits for the gallery. When performing the re-ID task, one random image from each identity is used to form the gallery and the rest become queries. This procedure is repeated ten times and the reported numbers are averaged across these ten runs. Here, we use VehicleID (large), which has 2,400 images in the gallery. CityFlow [tang2019cityflow] has more complex environments: it has 40 cameras in a diverse environment, of which 34 are used for the training set. This dataset has 666 IDs in total, where half are used for training and the rest for testing. In order to perform attribute learning on the CityFlow test set, we manually labeled the camera information on the CityFlow test set.
Evaluation metrics. We use the mean average precision (mAP) and rank-1 accuracy to measure the vehicle re-ID performance. The average precision records the area under the precision-recall curve, and the rank-1 accuracy denotes the success rate of the top-1 match.
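Both metrics can be computed from per-query ranked match lists, as in the generic sketch below. This is our own illustration of the standard definitions, not the official evaluation code of any of the benchmarks.

```python
def average_precision(ranked_matches):
    """AP for one query: ranked_matches is a list of booleans in ranked
    order, True where the gallery image shares the query identity."""
    hits, precisions = 0, []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)  # precision at each recall point
    return sum(precisions) / len(precisions) if precisions else 0.0


def evaluate(all_ranked_matches):
    """mAP and rank-1 accuracy over a list of queries."""
    n = len(all_ranked_matches)
    mAP = sum(average_precision(m) for m in all_ranked_matches) / n
    rank1 = sum(m[0] for m in all_ranked_matches) / n
    return mAP, rank1
```

In practice, the ranked lists come from sorting gallery images by feature distance to the query.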
5.2 Implementation Details
Data generation. Vehicle re-ID datasets like VeRi-776 and CityFlow are naturally divided according to camera views. Because a specific camera view usually has stable attribute features (e.g., viewpoint), we perform the proposed attribute descent algorithm on each individual camera, so as to simulate images with similar content to images from each camera. For CityFlow, because the training set and test set contain entirely different cameras, we do the optimization on the test set. Note here that we do not use any ID labels from the test set, just the images themselves. For example, we optimize 20 attribute lists for the VeRi-776 test set, which has 20 cameras. An exception is for the VehicleID dataset. Its training set and test set have similar front and rear views. As such, we only optimize a single attribute list.
Moreover, during data simulation, we follow the same ratio of image numbers among the different cameras as in the target dataset. For example, cameras in the VeRi test set (e.g., camera 4 and camera 8) contribute different numbers of images, so we simulate a proportional number of images with the attributes learned from each camera. For CityFlow, we manually label the cameras in the test set because they are not provided. Note that camera labels can be conveniently obtained in practice.
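The proportional allocation above can be sketched in a few lines; the camera names and counts are hypothetical, and rounding is a simplification of whatever scheme an implementation might use.

```python
def simulated_counts(real_counts, total_simulated):
    """Allocate simulated images per camera proportionally to the number
    of real images each camera contributes in the target dataset."""
    total_real = sum(real_counts.values())
    return {cam: round(total_simulated * n / total_real)
            for cam, n in real_counts.items()}
```

For instance, a camera holding three quarters of the real images receives three quarters of the simulated images.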
Image style transformation. We apply SPGAN [deng2018image] for image style transformation, a state-of-the-art algorithm in unsupervised domain adaptation for person re-ID. Results are shown in Figure 5. In our implementation, each image is resized to 256×256. We pretrain an image translation model for each of the three datasets using random attributes, with 112,042 images as the source domain and the training set of each vehicle dataset as the target domain. To apply the style transformation to images with learned attributes, we directly run inference on them, based on the fact that our learned attributes are a subset of the random range.
Baseline configuration. For VeRi and VehicleID, we use the ID-discriminative embedding (IDE) [zheng2016mars]. We also adopt the part-based convolutional baseline (PCB) [sun2018beyond] for VeRi for improved accuracy. In PCB, we horizontally divide the image into six equal parts and perform classification on each part. For training on VehicleID, we adopt the strategy from [luo2019bag], using a combination of the cross-entropy loss and the triplet loss.
Two-stage training is applied for CityFlow [zheng2019vehiclenet]. In the first stage, we train on both real and simulated data: we classify a vehicle image into one of 1,542 identities (333 from real data + 1,209 from simulated data). In the second stage, we replace the classification layer with a new classifier trained on the real dataset only (i.e., recognizing 333 classes). From Table 2, two-stage training leads to a significant performance boost on CityFlow. We attribute this to the huge data variance in CityFlow. We tested this method on VeRi and VehicleID but did not observe noticeable improvement.
5.3 Simulated Data for Data Augmentation
An important hyperparameter under the data augmentation setting is the ratio between the number of simulated images and the number of real images. We vary this ratio and report re-ID accuracy on VeRi in Figure 6. First, we find that under an appropriate ratio (e.g., between 0.25 and 1), the data augmentation strategy consistently outperforms training with real data only. This supports our idea of simulating images to augment the real data. Second, we observe that setting the ratio to 0.75 achieves the best result on VeRi. In our experiments, we empirically use 0.65 for CityFlow and 0.4 for VehicleID.
Simulated data with randomly distributed attributes does not work consistently well. Several existing works indicate that random attributes are beneficial for data simulation in real-world tasks [tremblay2018training, tang2019pamtri]. We evaluate the effectiveness of this strategy in Table 4. On VeRi-776, the use of random attributes slightly improves over the IDE and PCB baselines: we obtain mAP improvements of +0.21% and +0.67% under the IDE and PCB baselines, respectively. For the CityFlow results in Table 5, we see a significant mAP improvement of +1.82% under the IDE baseline.
Interestingly, random attributes barely improve the performance on VehicleID. For example, when the medium test set is used, random attributes lead to a drop of 0.26% in mAP. In fact, while VeRi-776 and CityFlow have diverse attribute distributions (e.g., nearly all orientations are included), VehicleID mostly contains rear and front orientations. As such, simulating a training set with random orientations may decrease the domain gap between VeRi-776 and the simulated data, but will increase the domain gap for VehicleID. This observation suggests that random attributes are effective when the attributes of the target dataset also tend to have a random distribution. In other words, for a target dataset like VehicleID, where some important attributes follow a particular distribution, simulating a dataset with random attributes is most likely not a good idea.
The effectiveness of simulated data with learned attributes (attribute descent). We evaluate the effectiveness of our key contribution, attribute descent based data simulation. Results displayed in Table 3, Table 4 and Table 5 show that our learned-attribute method achieves consistent improvement on all three datasets, compared with both the baselines and random generation. We achieve +1.66%, +1.61% and +2.60% over random attributes on VehicleID (large), VeRi-776 and CityFlow, respectively. In the supplementary material, we further strengthen this observation using statistical significance analysis.
Style domain adaptation is indispensable. Our ablation study on style domain adaptation is shown in Figure 7, which shows a significant improvement under joint training. The effect of using synthetic data only is shown in Table 6, where style transfer achieves a +7.21% mAP improvement. Due to the consistent and significant improvement brought by this approach, we always use it in our re-ID system.
Comparison with the state-of-the-art. We report comparisons with some state-of-the-art approaches in Table 3, Table 4 and Table 5. Our reported accuracy is competitive when compared with these methods. Note that our accuracy improvement mainly comes from the addition of synthetic data (with the necessary style domain adaptation), without modifying the re-ID learning architectures [zhou2018aware], using more annotations [chu2019vehicle], applying pose processing steps [huang2019multi], or leveraging spatial-temporal cues [wang2017orientation]. We speculate that these techniques are orthogonal to ours.
5.4 Training with Simulated Data Only
Training with simulated data only. In the experiments above, our simulated data is mixed with the real data. Here, we evaluate re-ID accuracy when only simulated data is used for training. We report results on VeRi-776 in Table 6 and Table 7. Three observations are made.
First, if we compare the numbers in Table 4 and Table 6, it is obvious that training with only the simulated data has much lower accuracy. For example, under random attributes, data augmentation is higher than the simulation-only scheme by a significant margin of +44.97% in rank-1 accuracy. As such, the inclusion of real data is essential for achieving competitive accuracy. However, as graphics engines become stronger, it could become possible for simulated data alone to support real-world systems.
Second, the benefit of attribute learning is more significant than under joint training. Because the training set consists of only the simulated data, the quality of the attributes has a more direct impact on the re-ID results. Under the settings of Section 5.3 and Section 5.4, the improvements of learned attributes over random attributes are +1.49% and +3.87% in rank-1 accuracy, respectively.
Last, from Table 7, we observe accuracy improvement as the number of simulated images increases. This suggests the potential of simulated data for future research. Since training with only simulated data is not our major focus, we do not explore further in this direction here.
6 Conclusion
This paper addresses the domain gap problem on the content level. That is, we automatically edit the source-domain image content in a graphics engine so as to reduce the content gap between the simulated images and the real images. We use this idea to study the vehicle re-ID task, where the usage of vehicle bounding boxes reduces the set of attributes to be optimized. Fewer attributes of interest allow us to optimize them one by one using our proposed attribute descent approach. We show that the learned attributes bring about improvement in re-ID accuracy with statistical significance. Moreover, our experiments reveal some important insights regarding the usage of simulated data, e.g., style domain adaptation is beneficial.
We will continue investigating data simulation and its role in computer vision. Potential directions include more effective objective functions, alternative derivative-free optimization methods, and new recognition training schemes.
7 Appendix
In this appendix, we first detail the attribute descent algorithm. Then, we demonstrate the effectiveness of successively optimizing each attribute and of using the Fréchet Inception Distance (FID) as the measure of distribution difference. Finally, we provide the statistical significance of the results on the three datasets.
7.1 Implementation Details
In this section, we provide the detailed algorithm and the hyperparameter settings for the proposed attribute descent method. First, we present step-by-step instructions for the proposed attribute descent method in Algorithm 1, where we successively optimize each attribute by minimizing the FID value. Second, we set the hyperparameters as follows. The synthetic dataset size is set to $N = 400$; that is, we generate 400 images in each iteration. For the number of attribute descent epochs, we use one epoch for the VehicleID [liu2016deep] dataset and two epochs for the VeRi [liu2016large] and CityFlow [tang2019cityflow] datasets. For FID calculation only, all images are resized to 64×64, since this is nearly the minimal image size among the real-world datasets. FID is calculated on the 2048-dimensional feature vector from the Inception-V3 [szegedy2016rethinking] network. The initial parameter vector of the attribute generation function is set to the minimal value in each search space (see Figure 4 of the main paper).
7.2 Effectiveness of Attribute Descent
In this section, we demonstrate the effectiveness of the proposed attribute descent method. As shown in Algorithm 1, we successively optimize each attribute by minimizing the FID value. Over the course of optimizing each attribute, the FID value decreases (lower is better) and the re-ID accuracy (mAP) increases (Figure 8). Based on this, we make the following observations.
FID can measure the distribution difference, and thus can be used as an objective function to optimize the attributes of synthetic data;
The inclusion of synthetic data can improve the re-ID accuracy, even with the initial attribute parameter vector used for data generation;
Our attribute descent algorithm can successively find a more suitable value for each attribute. Synthetic data generated from these attributes is more beneficial to re-ID feature training, resulting in higher mAP.
7.3 Statistical Significance Analysis
Learned attributes bring statistically significant improvements in re-ID accuracy (mAP) (Figure 9). Specifically, the improvements on the CityFlow, VehicleID large and VehicleID small datasets are statistically very significant ($p$-value $< 0.01$).