Simulation is a useful tool in situations where training data for machine learning models is costly to annotate or even hard to acquire. In this work, we propose a reinforcement learning-based method for automatically adjusting the parameters of any (non-differentiable) simulator, thereby controlling the distribution of synthesized data in order to maximize the accuracy of a model trained on that data. In contrast to prior art that hand-crafts these simulation parameters or adjusts only parts of the available parameters, our approach fully controls the simulator with the actual underlying goal of maximizing accuracy, rather than mimicking the real data distribution or randomly generating a large volume of data. We find that our approach (i) quickly converges to the optimal simulation parameters in controlled experiments and (ii) can indeed discover good sets of parameters for an image rendering simulator in actual computer vision applications.
In order to train deep neural networks, significant effort has been directed towards collecting large-scale datasets for tasks such as machine translation (Luong et al., 2015), image recognition (Deng et al., 2009) or semantic segmentation (Geiger et al., 2013; Cordts et al., 2016). It is, thus, natural for recent works to explore simulation as a cheaper alternative to human annotation (Gaidon et al., 2016; Ros et al., 2016; Richter et al., 2016). Besides, simulation is sometimes the most viable way to acquire data for rare events such as traffic accidents. However, while simulation makes data collection and annotation easier, it is still an open question what distribution should be used to synthesize data. Consequently, prior approaches have used human knowledge to shape the generating distribution of the simulator (Sakaridis et al., 2018), or synthesized large-scale data with random parameters (Richter et al., 2016). In contrast, this paper proposes to automatically determine simulation parameters such that the performance of a model trained on synthesized data is maximized.
Traditional approaches seek simulation parameters that try to model a distribution that resembles real data as closely as possible, or generate enough volume to be sufficiently representative. By learning the best set of simulation parameters to train a model, we depart from the above in three crucial ways. First, the need for laborious human expertise to create a diverse training dataset is eliminated. Second, learning to simulate may allow generating a smaller training dataset that achieves similar or better performances than random or human-synthesized datasets (Richter et al., 2016), thereby saving training resources. Third, it allows questioning whether mimicking real data is indeed the best use of simulation, since a different distribution might be optimal for maximizing a test-time metric (for example, in the case of events with a heavy-tailed distribution).
More formally, a typical machine learning setup aims to learn a function $h_\theta$ that is parameterized by $\theta$ and maps from domain $\mathcal{X}$ to range $\mathcal{Y}$, given training samples $(x, y) \sim p(x, y)$. Data $x$ usually arises from a real world process (for instance, someone takes a picture with a camera) and labels $y$ are often annotated by humans (someone describing the content of that picture). The distribution $p(x, y)$ is assumed unknown and only an empirical sample $D = \{(x_i, y_i)\}_{i=1}^{N}$ is available.
The simulator attempts to model a distribution $q(x, y \mid \psi)$. In prior works, the aim is to adjust the form of $q$ and parameters $\psi$ to mimic $p$ as closely as possible. In this work, we attempt to automatically learn the parameters $\psi$ of the simulator such that the loss $\mathcal{L}$ of a machine learning model $h_\theta$ is minimized over some validation data set $D_{\text{val}}$. This objective can be formulated as the bi-level optimization problem
$$\psi^* = \arg\min_{\psi} \sum_{(x, y) \in D_{\text{val}}} \mathcal{L}\big(y, h_\theta(x; \theta^*(\psi))\big) \tag{1a}$$
$$\text{s.t.} \quad \theta^*(\psi) = \arg\min_{\theta} \sum_{(x, y) \in D_{q(x, y \mid \psi)}} \mathcal{L}\big(y, h_\theta(x; \theta)\big) \tag{1b}$$
where $h_\theta$ is parameterized by model parameters $\theta$, $D_{q(x, y \mid \psi)}$ describes a data set generated by the simulator and $\theta(\psi)$ denotes the implicit dependence of the model parameters on the model’s training data and consequently, for synthetic data, the simulation parameters $\psi$. In contrast to Fan et al. (2018), who propose a similar setup, we focus on the actual data generation process and are not limited to selecting subsets of existing data. In our formulation, the upper-level problem (equation 1a) can be seen as a meta-learner that learns how to generate data (by adjusting $\psi$) while the lower-level problem (equation 1b) is the main task model (MTM) that learns to solve the actual task at hand. In Section 2, we describe an approximate algorithm based on policy gradients (Williams, 1992) to optimize the objective 1. Since we must work with a black-box simulator, an interface is required for the algorithm to interact with it, which is also described in Section 2.
In various experiments on both toy data and real computer vision problems, Section 4 analyzes different variants of our approach and investigates interesting questions, such as: “Can we train a model with less but targeted high-quality data?”, or “Are simulation parameters that approximate real data the optimal choice for training models?”. The experiments indicate that our approach is able to quickly identify good scene parameters that compete with, and in some cases even outperform, the actual validation set parameters for synthetic as well as real data, on computer vision problems such as object counting or semantic segmentation.
Given a simulator that samples data as $(x, y) \sim q(x, y \mid \psi)$, our goal is to adjust $\psi$ such that the MTM $h_\theta$ trained on that simulated data minimizes the risk on real data $(x, y) \sim p(x, y)$. Assume we are given a validation set $D_{\text{val}}$ from real data and we can sample synthetic datasets $D_{q(x, y \mid \psi)} \sim q(x, y \mid \psi)$. Then, we can train $h_\theta$ on $D_{q(x, y \mid \psi)}$ by minimizing equation 1b. Note the explicit dependence of the trained model parameters $\theta^*$ on the underlying data generating parameters $\psi$ in equation 1b. To find $\psi^*$, we minimize the empirical risk over the held-out validation set $D_{\text{val}}$, as defined in equation 1a. Our desired overall objective function can thus be formulated as the bi-level optimization problem in equation 1.
Attempting to solve it with a gradient-based approach imposes multiple constraints on the lower-level problem 1b, such as smoothness, twice differentiability and an invertible Hessian (Bracken & McGill, 1973; Colson et al., 2007). For our case, even if we choose the model to fulfill these constraints, the objective would still be non-differentiable because (i) we sample from a distribution that is parameterized by the optimization variable and (ii) the underlying data generation process (e.g., an image rendering engine) is assumed non-differentiable for the sake of generality of our approach. In order to cope with the above defined objective, we resort to a reinforcement learning approach and use policy gradients (Williams, 1992) to optimize the simulation parameters $\psi$.
Our goal is to generate a synthetic dataset such that the main task model (MTM) $h_\theta$, when trained on this dataset until convergence, achieves maximum accuracy on the test set. The test set is evidently not available at training time. Thus, the task of our algorithm is to maximize the MTM’s performance on the validation set by generating suitable data. Similar to reinforcement learning, we define a policy $\pi_\omega$ parameterized by $\omega$ that can sample parameters $\psi \sim \pi_\omega$ for the simulator. The simulator can be seen as a generative model which generates a set of data samples $(x, y)$ conditioned on $\psi$. We provide more details on the interface between the policy and the data generating function in the following section and give a concrete example for computer vision applications in Section 4. The policy receives a reward $R$ that we define based on the accuracy of the trained MTM on the validation set. Figure 1 provides a high-level overview.
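Specifically, we want to maximize the objective
$$J(\omega) = \mathbb{E}_{\psi \sim \pi_\omega}[R]$$
with respect to the policy parameters $\omega$. The reward $R$ is computed as the negative loss or some other accuracy metric on the validation set. Following the REINFORCE rule (Williams, 1992) we obtain gradients for updating $\omega$ as
$$\nabla_\omega J(\omega) = \mathbb{E}_{\psi \sim \pi_\omega}\big[\nabla_\omega \log \pi_\omega(\psi)\, R(\psi)\big]$$
An unbiased, empirical estimate of the above quantity is
$$\nabla_\omega J(\omega) \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_\omega \log \pi_\omega(\psi_k)\, \hat{A}_k$$
where $\hat{A}_k = R(\psi_k) - b$ is the advantage estimate and $b$ is a baseline (Williams, 1992) that we choose to be an exponential moving average over previous rewards. In this empirical estimate, $K$ is the number of different datasets sampled in one policy optimizing batch and $R(\psi_k)$ designates the reward obtained by the $k$-th MTM trained until convergence.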
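As a minimal sketch of this estimate, the following Python snippet assumes a Gaussian policy with a fixed isotropic variance, for which the gradient of the log-probability with respect to the mean has a closed form. The function names (`gaussian_log_prob_grad`, `reinforce_gradient`, `update_baseline`) and the decay value are illustrative assumptions and not part of the method description.

```python
def gaussian_log_prob_grad(psi, mean, var):
    """Gradient of log N(psi; mean, var * I) with respect to the mean."""
    return [(p - m) / var for p, m in zip(psi, mean)]


def reinforce_gradient(samples, rewards, mean, var, baseline):
    """Average of grad-log-prob times advantage over the sampled parameter vectors."""
    k = len(samples)
    grad = [0.0] * len(mean)
    for psi, reward in zip(samples, rewards):
        advantage = reward - baseline  # advantage estimate for this sample
        for i, g in enumerate(gaussian_log_prob_grad(psi, mean, var)):
            grad[i] += g * advantage / k
    return grad


def update_baseline(baseline, rewards, decay=0.9):
    """Exponential moving average over rewards (decay value is an assumption)."""
    for r in rewards:
        baseline = decay * baseline + (1.0 - decay) * r
    return baseline
```

The baseline is updated only after the gradient has been computed, so each advantage uses rewards from earlier batches.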
Given the basic update rule for the policy $\pi_\omega$, we can design different variants of our algorithm for learning to simulate data by introducing three control knobs. First, we define the number of training epochs $\xi$ of the MTM in each policy iteration as a variable. The intuition is that a reasonable reward signal may be obtained even if the MTM is not trained until full convergence, thus reducing computation time significantly. Second, we define the size $M$ of the data set generated in each policy iteration. Third, we either choose to retain the MTM parameters $\theta$ from the previous iteration and fine-tune on the newly created data or we estimate $\theta$ from scratch (with a random initialization). This is a trade-off: by retaining parameters the model has seen more training data in total but, at the same time, may be influenced by suboptimal data in early iterations. We explore the impact of these three knobs in our experiments. Algorithm 1 summarizes our approach.
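The loop below is a toy sketch of how the three knobs could interact, not the implementation used in the experiments. The simulator, the main task model (a single scalar fitted by SGD) and all hyperparameter values are hypothetical stand-ins chosen so that the example runs with the Python standard library only.

```python
import random


def simulate(psi, size, rng):
    """Hypothetical black-box simulator: scalar samples centred on psi."""
    return [rng.gauss(psi, 1.0) for _ in range(size)]


def train_mtm(theta, data, epochs, lr=0.1):
    """Toy main task model: fit a scalar theta by SGD on squared error."""
    for _ in range(epochs):
        for x in data:
            theta -= lr * 2.0 * (theta - x)
    return theta


def reward(theta, val_data):
    """Negative mean squared error on the validation set."""
    return -sum((theta - v) ** 2 for v in val_data) / len(val_data)


def learn_to_simulate(val_data, iterations=300, batch=8, dataset_size=20,
                      epochs=1, retain=False, var=0.05, lr=0.02, seed=0):
    rng = random.Random(seed)
    mean, theta, baseline = 0.0, 0.0, None
    for _ in range(iterations):
        psis, rewards = [], []
        for _ in range(batch):
            psi = rng.gauss(mean, var ** 0.5)        # sample simulator parameter
            data = simulate(psi, dataset_size, rng)  # knob: dataset size
            init = theta if retain else 0.0          # knob: fine-tune vs. scratch
            theta = train_mtm(init, data, epochs)    # knob: training epochs
            psis.append(psi)
            rewards.append(reward(theta, val_data))
        batch_mean = sum(rewards) / batch
        if baseline is None:
            baseline = batch_mean  # initialise the baseline from the first batch
        # Policy-gradient step on the mean of the Gaussian policy.
        grad = sum((p - mean) / var * (r - baseline)
                   for p, r in zip(psis, rewards)) / batch
        mean += lr * grad
        baseline = 0.9 * baseline + 0.1 * batch_mean
    return mean
```

Calling `learn_to_simulate([2.5, 3.0, 3.5])` should drive the policy mean towards the validation mean; passing `retain=True` switches from re-initializing the toy model to fine-tuning it across iterations.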
We defined a general black-box simulator as a distribution $q(x, y \mid \psi)$ over data samples $(x, y)$ parameterized by $\psi$. In practice, a simulator is typically composed of a deterministic “rendering” process $\mathcal{R}$ and a sampling step as $q(x, y \mid \psi) = \mathcal{R}(S(\rho \mid \psi), P(\phi))$, where the actual data description $\rho$ (e.g., what objects are rendered in an image) is sampled from a distribution $S$ parameterized by the provided simulation parameters $\psi$ and specific rendering settings $\phi$ (e.g., lighting conditions) are sampled from some prior $P$. To enable efficient sampling via ancestral sampling (Bishop, 2006), the data description distribution is often modeled as a Bayesian network (directed acyclic graph) where $\psi$ defines the parameters of the distributions in each node, but more complex models are possible too.
The interface to the simulator is thus $\psi$, which describes parameters of the internal probability distributions of the black-box simulator. Note that $\psi$ can be modeled as an unconstrained continuous vector and still describe various probability distributions. For instance, a continuous Gaussian is modeled by its mean and variance. A K-dimensional discrete distribution is modeled with K real values. We assume the black-box normalizes the values to a proper distribution via a softmax.
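To illustrate the discrete case, the following sketch shows how K unconstrained real values could be normalized with a softmax and then used to draw a category. The helper names `softmax` and `sample_categorical` are hypothetical, and the black-box simulator is only assumed to behave in this way.

```python
import math
import random


def softmax(values):
    """Map K unconstrained real values to a probability vector."""
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_categorical(values, rng):
    """Draw one category index from the softmax-normalized values."""
    probs = softmax(values)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Because the normalization happens inside the simulator, the policy can output any real vector without having to enforce positivity or a sum-to-one constraint.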
With this convention all input parameters to the simulator are unconstrained continuous variables. We thus model our policy $\pi_\omega$ as a multivariate Gaussian with as many dimensions as simulation parameters $\psi$. For simplicity, we only optimize for the mean and keep the variance fixed in all cases, although the policy gradients defined above can handle both. Note that our policy can be extended to a more complicated form, e.g., by including the variance or by taking various contextual data as input and implementing the policy as a sequential model similar to Zoph & Le (2016).
The proposed approach can be seen as a meta-learner that alters the data a machine learning model is trained on to achieve high accuracy on a validation set. This concept is similar to recent papers that learn policies for neural network architectures (Zoph & Le, 2016) and optimizers (Bello et al., 2017). In contrast to these works, we focus on the data generation parameters and actually create new, randomly sampled data in each iteration. While Fan et al. (2018) propose a subset selection approach for altering the training data, we create new data. This difference is important because we are not limited to a fixed probability distribution at data acquisition time. We can thus generate or oversample unusual situations that would otherwise not be part of the training data.
Similar to the above-mentioned papers, we also choose a variant of stochastic gradients (policy gradients (Williams, 1992)) to overcome the non-differentiable sampling and rendering and estimate the parameters $\omega$ of the policy $\pi_\omega$. While alternatives for black-box optimization exist, like evolutionary algorithms (Salimans et al., 2017) or sampling-based methods (Bishop, 2006), we favor policy gradients in this work for their sample efficiency and success in prior art. Another direction to explore is to approximate the data-generation/rendering process in a differentiable way (Loper & Black, 2014), which may enable better training. However, for applications like computer vision, differentiable rendering (Loper & Black, 2014) cannot yet compete with more mature rendering engines in terms of data fidelity and expressive power.
When relying on simulated data for training machine learning models, the issue of “domain gap” between real and synthetic data arises. Many recent works (Ganin et al., 2016; Chen et al., 2017; Tsai et al., 2018) focus on bridging this domain gap, particularly for computer vision tasks. We believe the contributions of our work are orthogonal.
The intent of our experimental evaluation is (i) to illustrate the concept of our approach in a controlled toy experiment (section 4.1), (ii) to analyze different properties of the proposed algorithm 1 on a high-level computer vision task (section 4.3) and (iii) to demonstrate our ideas on real data for semantic image segmentation (section 4.5).
To illustrate the concept of our proposed ideas we define a binary classification task on the 2-dimensional Euclidean space, where the data distribution $p(x, y)$ of each of the two classes is represented by a Gaussian mixture model (GMM) with 3 components. We generate validation and test sets from $p(x, y)$. Another GMM distribution $q(x, y \mid \psi)$ reflects the simulator that generates training data for the main task model (MTM) $h_\theta$, which is a non-linear SVM with RBF kernels in this case. To demonstrate the practical scenario where a simulator is only an approximation to the real data, we fix the number of components per GMM in $q$ to be only 2 and let the policy $\pi_\omega$ only adjust the means and variances of the GMMs. Again, the policy adjusts $\psi$ such that the accuracy (i.e., reward $R$) of the SVM is maximized on the validation set.
The top row of figure 2 illustrates how the policy gradually adjusts the data generating distribution such that reward is increased. The learned decision boundaries in the last iteration (right) well separate the test data. The bottom row of figure 2 shows the SVM decision boundary when trained with data sampled from (left) and with the converged parameters from the policy (middle). The third figure in the bottom row of figure 2 shows samples from . The sampled data from the simulator is clearly different than the test data, which is obvious given that the simulator’s GMM has less components per class. However, it is important to note that the decision boundaries are still learned well for the task at hand.
For the following experiments we use computer vision applications and thus require a generative scene model and an image rendering engine. We focus on traffic scenes as simulators/games for this scenario are publicly available (CARLA (Dosovitskiy et al., 2017) with Unreal engine (Epic-Games, 2018) as backend). However, we need to note that substantial extensions were required to actually generate different scenes according to a scene model rather than just different viewpoints of a static map. Many alternative simulators (Mueller et al., 2017; Richter et al., 2016; Shah et al., 2018) are similar in that an agent can navigate a few pre-defined maps, but the scene itself is not parameterized and cannot be changed on the fly.
To actually synthesize novel scenes, we first need a scene model $S(\rho \mid \psi)$ that allows us to sample instances of scenes $\rho$ given parameters $\psi$ of the probability distributions of the scene model. Recall that $\psi$ is produced by our learned policy $\pi_\omega$.
Our traffic scene model handles different types of intersections, various car models driving on the road, road layouts, buildings on the side or weather conditions. Please see the appendix for more details. In our experiments, the model is free to adjust some of the variables, e.g., the probability of cars being on the road, while we keep others at a fixed distribution, e.g., the weather conditions. These fixed distributions are considered the rendering parameters $\phi$ and modeled with a uniform distribution $P(\phi)$. Given these two distributions, we can sample a new scene and render it with the rendering process $\mathcal{R}$ as $\mathcal{R}(S(\rho \mid \psi), P(\phi))$. Figure 3 shows examples of rendered scenes.
As a first high-level computer vision task we choose counting cars in rendered images, where the goal is to train a convolutional neural network $h_\theta$ to count all instances individually for different types of cars in an image. The evaluation metric (and also the loss) is the $\ell_1$ distance between predicted and ground truth count, averaged over the different car types. The reward is the negative loss. For this experiment, we generate validation and test sets with a fixed and pre-defined distribution.
We first evaluate our proposed policy (dubbed “lts” in the figures) for two different initializations, a standard random one and a deliberately adversarial one (compared to validation and test sets), and we also compare with a model trained on a data set sampled with the test set parameters. Figure 4 shows our results. We can see in figure 4 that our policy (“lts”) quickly reaches high reward on the validation set, equal to the reward we get when training models with the test set parameters. Figure 4 shows that high rewards are also obtained on the test set for both initializations, although convergence is slower for the adversarial initialization. The reward after convergence is comparable to that of the model trained on the test set parameters. Since our policy is inherently stochastic, we show in figure 4 convergence for different random initializations and observe a very stable behavior.
Next, we explore the difference between training the MTM from scratch in each policy iteration and retaining its parameters and fine-tuning, see algorithm 1. We call the second option the “accumulated main task model (AMTM)” because it is not re-initialized and accumulates information over policy iterations. The intention of this experiment is to analyze the situation where the simulator is used for generating large quantities of data, like in (Richter et al., 2016). First, by comparing figures 4 and 5, we observe that the reward gets significantly higher than when training the MTM from scratch in each policy iteration. Note that we still use the MTM reward as our training signal; we only observe the AMTM reward for evaluation purposes.
For the case of accumulating the MTM parameters, we further compare with two baselines. First, replicating a hand-crafted choice of simulation parameters, we assume no domain expertise and randomly sample simulator parameters (within a sensible range) in each iteration (“random policy params”). Second, we take the parameters given by our learned policy after convergence (“final policy params”) as a stronger baseline. For reference, we train another AMTM with the ground truth validation set parameters (“validation params”). All baselines are accumulated main task models, but with fixed parameters for sampling data, i.e., resembling the case of generating large datasets. We can see from figure 5 that our approach gets very close to the ground truth validation set parameters and significantly outperforms the random parameter baseline. Interestingly, “lts” even outperforms the “final policy params” baseline, which we attribute to increased variety of the data distribution. Again, “lts” converges to a high reward even with an adversarial initialization, see figure 5.
Next, we explore the parameter $M$ of our algorithm that controls the dataset size generated in each policy iteration. For example, when $M = 100$, we generate at each policy step a dataset of 100 images using the parameters from the policy, which are then used to train our main task model $h_\theta$. We evaluate policies with sizes 20, 50, 100 and 200. For a fair comparison, we generate 40,000 images with our final learned set of parameters, train for 5 epochs and evaluate on the test set. We observed that for this task a dataset size of just 20 suffices for our model to converge to good scene parameters $\psi$, which is highly beneficial for the wall-time convergence speed. Having less data per policy iteration means faster training of the MTM.
Similar to the previous experiment, we now analyze the impact of the number of epochs $\xi$ used to train the main task model. The setup is similar to the previous experiment: we generate 40,000 images with our final learned set of parameters, train for a fixed number of epochs and evaluate on the test set. Figure 5 shows the reward of the accumulated MTM for four different values of $\xi$ (1, 3, 7, 10). Our conclusion, for the car-counting task, is that learning to simulate is robust to fewer training epochs, which means that even if the MTM has not fully converged yet the reward signal is good enough to provide guidance for our system, again leading to a potential wall-time speed up of the overall algorithm. All four cases converge, including the one where we train the MTM for only one epoch. Note that this is dependent on the task at hand, and a more complicated task might necessitate convergence of the main task model to provide discriminative rewards.
For the next set of experiments we use semantic segmentation as our test bed, which aims at predicting a semantic category for each pixel in a given RGB image (Chen et al., 2018). Our modified CARLA simulator (Dosovitskiy et al., 2017) provides ground truth semantic segmentation maps for generated traffic scenes, including categories like road, sidewalk or cars. For the sake of these experiments, we focus on the segmentation accuracy of cars, measured as intersection-over-union (IoU), and allow our policy to adjust parameters related to cars to maximize reward (i.e., car IoU). This includes the probability of generating different types of cars, but also the length of the road because the more road available, the more cars can be placed in the scene. The main task model is a CNN that takes a rendered RGB image as input and predicts a per-pixel classification output.
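For reference, the per-class intersection-over-union used as the reward can be sketched as follows. The function name `class_iou`, the use of nested lists as label maps, and the convention of returning NaN when the class is absent from both maps are assumptions made for this illustration.

```python
def class_iou(pred, target, class_id):
    """IoU of one class between two equally sized 2-D label maps."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            in_pred = p == class_id
            in_target = t == class_id
            intersection += int(in_pred and in_target)
            union += int(in_pred or in_target)
    # Undefined when the class appears in neither map; NaN is one convention.
    return intersection / union if union else float("nan")
```

With the car category passed as `class_id`, the returned value corresponds to the car IoU that the policy is asked to maximize.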
We first generate validation set parameters that reflect traffic scenes moderately crowded with cars, random intersections and buildings on the side. As a reference point for our proposed algorithm, we sample a few data sets with the validation set parameters, train MTMs and report the maximum reward (IoU of cars) achieved. We compare this with our learned policy and can observe in figure 6 that it actually outperforms the validation set parameters. This is an interesting observation because it shows that the validation set parameters may not always be the optimal choice for training a segmentation model. Our observation also reflects recent strategies, particularly for semantic segmentation, to adjust the label distribution of the training set via weighting schemes to counteract unbalanced or heavy-tail data distributions (Rota Bulo et al., 2017).
We finally demonstrate the practical impact of our learning-to-simulate approach on semantic segmentation on KITTI (Geiger et al., 2013) by training a main task model (MTM) with a reward signal coming from real data. Using simulated data for semantic segmentation was recently investigated from a domain adaptation perspective (Tsai et al., 2018; Richter et al., 2016), where an abundant set of simulated data is leveraged to train models applicable on real data. Here, we investigate targeted generation of simulated data and its impact on real data. Since the semantic label space of KITTI and our CARLA simulator are not identical, we again focus on segmentation of cars by measuring IoU for that category. The form of the main task model is the same as in the previous experiment.
Note that for real data, we actually do not have access to the underlying data generating parameters. Our baselines in this experiment use training data generated by the simulator with random parameters sampled within a range to generate sensible scenes with a moderate amount of cars. However, it is important to mention that it is unclear and hard to guess what parameters are actually good for training an MTM, making our automatic approach attractive in such situations. Figure 6 shows the rewards obtained by all methods on the validation set, where we observe that learning-to-simulate outperforms mean and max of the baseline models trained on samples from the random parameters. The results on the real KITTI test set in table 1 confirm the superior results of learning-to-simulate. We use a set of 10 sampled datasets for random and learned parameters and display the mean value of the Car IoU metric across these sampled datasets.
|Generative Parameters||Car IoU|
Learning to simulate can be seen as a meta-learning algorithm that adjusts parameters of a simulator to generate synthetic data such that a machine learning model trained on this data achieves high accuracies on validation and test sets, respectively. Given the need for large-scale data sets to feed deep learning models and the often high cost of annotation and acquisition, we believe our approach is a sensible avenue for practical applications to leverage synthetic data. Our experiments illustrate the concept and demonstrate the capability of learning to simulate on both synthetic and real data.
Cordts et al. (2016). The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223, 2016.
Luong et al. (2015). Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1412–1421, Lisbon, Portugal, September 2015. Association for Computational Linguistics. URL http://aclweb.org/anthology/D15-1166.
Rota Bulo et al. (2017). Loss max-pooling for semantic image segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
Our model comprises the following elements:
A straight road of variable length.
Either an L, T or X intersection at the end of the road.
Cars of 5 different types which are spawned randomly on the straight road.
Houses of a unique type which are spawned randomly on the sides of the road.
Four different types of weather.
All of these elements are tied to parameters: $\psi$ can be decomposed into parameters which regulate each of these objects. The scene is generated “block” by “block”. A block consists of a unitary length of road with sidewalks. Buildings can be generated on both sides of the road and cars can be generated on the road. $p_{\text{car}}$ designates the probability of car presence in any road block. Cars are sampled block by block from a Bernoulli distribution with parameter $p_{\text{car}}$. To determine which type of car is spawned (from our selection of 5 cars) we sample from a Categorical distribution which is determined by 5 parameters $p_{\text{car}_i}$, where $i$ is an integer representing the identity of the car. $p_{\text{house}}$ designates the probability of house presence in any road block. Houses are sampled block by block from a Bernoulli distribution with parameter $p_{\text{house}}$.
Length to intersection is sampled from a Categorical distribution determined by 10 parameters $p_{\text{length}_i}$, where $i$ denotes the length from the camera to the intersection in “block” units. Weather is sampled randomly from a Categorical distribution determined by 4 parameters $p_{\text{weather}_i}$, where $i$ is an integer representing the identity of the weather. L, T and X intersections are sampled randomly with equal probability.
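The sampling procedure described above can be sketched in a few lines of Python. This is an illustration only: the dictionary keys, the choice of lengths 1 to 10, and the simplification to at most one car and one house flag per block are assumptions, and the probabilities are passed in directly rather than produced by the simulator.

```python
import random


def sample_scene(params, rng):
    """Sample one scene description block by block (hypothetical layout)."""
    # Length to intersection: Categorical over 10 values (assumed to be 1..10).
    length = rng.choices(range(1, 11), weights=params["length_probs"], k=1)[0]
    blocks = []
    for _ in range(length):
        car = None
        # Bernoulli draw for car presence, then Categorical over 5 car types.
        if rng.random() < params["p_car"]:
            car = rng.choices(range(5), weights=params["car_type_probs"], k=1)[0]
        # Bernoulli draw for house presence in this block.
        house = rng.random() < params["p_house"]
        blocks.append({"car": car, "house": house})
    return {
        "blocks": blocks,
        # Weather: Categorical over 4 types.
        "weather": rng.choices(range(4), weights=params["weather_probs"], k=1)[0],
        # Intersection type: uniform over L, T and X.
        "intersection": rng.choice(["L", "T", "X"]),
    }
```

A scene description produced this way would then be handed to the rendering engine, which is outside the scope of this sketch.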
We explore the parameter $M$ of our algorithm that controls the dataset size generated in each policy iteration. In figure 7 we show a comparative graph of final errors on the validation and test sets for different values of $M$. For a fair comparison, we generate 40,000 images with our final learned set of parameters and train for 5 epochs. All values of $M$ used resulted in very similar errors on the validation and test sets; we conclude that learning to simulate is robust to the choice of $M$ when used on the car counting task.
Since our method is stochastic in nature, we verify that “learning to simulate” converges in the car counting task using different random seeds. We observe in figure 8 that the reward converges to the same value with three different random seeds. Additionally, in figure 8, we observe that the accumulated main task network test reward also converges with different random seeds.