1 Introduction
Data-driven optimization problems arise in a range of domains: from protein design (Brookes et al., 2019) to automated aircraft design (Hoburg and Abbeel, 2012), from the design of robots (Liao et al., 2019) to the design of neural network architectures (Zoph and Le, 2017). Such problems require optimizing unknown score functions using datasets of input-score pairs, without direct access to the score function being optimized. This can be especially challenging when valid inputs lie on a low-dimensional manifold in the space of all inputs, e.g., the space of valid aircraft designs or valid images. Existing methods to solve such problems often use derivative-free optimization (Snoek et al., 2012). Most of these techniques require active data collection, where the unknown function is queried at new inputs. However, when function evaluation involves a complex real-world process, such as testing a new aircraft design or evaluating a new protein, such active methods can be very expensive. On the other hand, in many cases there is considerable prior data (existing aircraft and protein designs, advertisements and user click rates, etc.) that could be leveraged to solve the optimization problem.
In this work, our goal is to develop an approach to such optimization problems that can (1) readily operate on high-dimensional inputs comprising a narrow, low-dimensional manifold in the input space, (2) readily utilize offline static data, and (3) learn with minimal active data collection if needed. We can define this problem setting formally as the optimization problem
$$x^\star = \arg\max_x f(x), \tag{1}$$
where the function $f(x)$ is unknown, and we have access to a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$, where $y_i$ denotes the value $f(x_i)$. If no further data collection is possible, we call this data-driven model-based optimization; otherwise, we refer to it as active model-based optimization. This can also be extended to the contextual setting, where the aim is to optimize the expected score function across a context distribution. That is,
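$$\pi^\star = \arg\max_\pi \mathbb{E}_{c \sim p_0(c)}\left[f(c, \pi(c))\right], \tag{2}$$
where $\pi^\star$ maps contexts $c$ to inputs $x$, such that the expected score under the context distribution $p_0(c)$ is optimized. As before, $f(c, x)$ is unknown, and we use a dataset $\mathcal{D} = \{(c_i, x_i, y_i)\}_{i=1}^N$, where $y_i$ is the value of $f(c_i, x_i)$. Such contextual problems with logged datasets have been studied in the context of contextual bandits (Swaminathan and Joachims, 2015a; Joachims et al., 2018).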
A simple way to approach these model-based optimization problems is to train a proxy function $f_\theta(x)$ or $f_\theta(c, x)$, with parameters $\theta$, to approximate the true score, using the dataset $\mathcal{D}$. However, directly using $f_\theta(x)$ in place of the true function $f(x)$ in Equation (1) generally works poorly, because the optimizer will quickly find an input $x$ for which $f_\theta(x)$ outputs an erroneously large value. This issue is especially severe when the inputs $x$ lie on a narrow manifold in a high-dimensional space, such as the set of natural images (Zhu et al., 2016). The function $f_\theta(x)$ is only valid near the training distribution, and can output erroneously large values when queried at points chosen by the optimizer. Prior work has sought to address this issue by using uncertainty estimation and Bayesian models (Snoek et al., 2015) for $f_\theta(x)$, as well as active data collection (Snoek et al., 2012). However, explicit uncertainty estimation is difficult when the function $f_\theta(x)$ is very complex or when $x$ is high-dimensional.
Instead of learning $f_\theta(x)$, we propose to learn the inverse function, mapping from values $y$ to corresponding inputs $x$. This inverse mapping is one-to-many, and therefore requires a stochastic mapping, which we can express as $f_\theta^{-1}(y, z) \rightarrow x$, where $z$ is a random variable. We term such models model inversion networks (MINs). MINs can handle high-dimensional input spaces such as images, can tackle contextual problems, and can accommodate both static datasets and active data collection. We discuss how to design active data collection methods for MINs, leverage advances in deep generative modeling (Goodfellow et al., 2014; Brock et al., 2019), and scale to very high-dimensional input spaces. We experimentally demonstrate MINs in a range of settings, showing that they outperform prior methods on high-dimensional input spaces, perform competitively with Bayesian optimization methods on tasks with active data collection and lower-dimensional inputs, and substantially outperform prior methods on contextual bandit optimization from logged data (Swaminathan and Joachims, 2015a).
2 Related Work
Bayesian and model-based optimization. Most prior work on model-based optimization has focused on the active setting. This includes algorithms such as the cross-entropy method (CEM) and related derivative-free methods (Rubinstein, 1996; Rubinstein and Kroese, 2004), reward weighted regression (Peters and Schaal, 2007), Bayesian optimization methods based on Gaussian processes (Shahriari et al., 2016; Snoek et al., 2012, 2015), and variants that replace GPs with parametric acquisition function approximators, such as Bayesian neural networks (Snoek et al., 2015) and latent variable models (Kim et al., 2019; Garnelo et al., 2018b, a), as well as more recent methods such as CbAS (Brookes et al., 2019). These methods require the ability to query the true function $f(x)$ at each iteration to iteratively arrive at a near-optimal solution. We show in Section 3.4 that MINs can be applied to such an active setting as well, and in our experiments we show that MINs can perform competitively with these prior methods. Additionally, we show that MINs can be applied to the static setting, where these prior methods are not applicable. Furthermore, most conventional BO methods do not scale favourably to high-dimensional input spaces, such as images, while MINs can handle image inputs effectively.
Contextual bandits. Equation 2 describes contextual bandit problems. Prior work on batch contextual bandits has focused on batch learning from bandit feedback (BLBF), where the learner needs to produce the best possible policy that optimizes the score function from logged experience. Existing approaches build on the counterfactual risk minimization (CRM) principle (Swaminathan and Joachims, 2015a, b), and have been extended to work with deep nets (Joachims et al., 2018). In our comparisons, we find that MINs substantially outperform these prior methods in the batch contextual bandit setting.
Deep generative modeling. Recently, deep generative modeling approaches have been very successful at modelling high-dimensional manifolds such as natural images (Goodfellow et al., 2014; Van Den Oord et al., 2016; Dinh et al., 2016), speech (van den Oord et al., 2018), text (Yu et al., 2017), alloy composition prediction (Nguyen et al., ), and other data. Unlike inverse map design, MINs solve an easier problem by learning the inverse map accurately only on selectively chosen datapoints, which is sufficient for optimization. MINs combine the strength of such generative models with important algorithmic choices to solve model-based optimization problems. In our experimental evaluation, we show that these design decisions are important for adapting deep generative models to model-based optimization, and it is difficult to perform effective optimization without them.
3 Model Inversion Networks
Here, we describe our model inversion networks (MINs) method, which can perform both active and passive model-based optimization over high-dimensional inputs.
Problem statement. Our goal is to solve optimization problems of the form $x^\star = \arg\max_x f(x)$, where the function $f(x)$ is not known, but we must instead use a dataset of input-output tuples $\mathcal{D} = \{(x_i, y_i)\}$. In the contextual setting described in Equation (2), each datapoint is also associated with a context $c_i$. For clarity, we present our method in the non-contextual setting, but the contextual setting can be derived analogously by conditioning all learned models on the context. In the active setting, which is most often studied in prior work, the algorithm can query $f(x)$ one or more times on each iteration to augment the dataset, while in the static or data-driven setting, only an initial static dataset is available. The goal is to obtain the best possible $x^\star$ (i.e., the one with the highest possible value of $f(x^\star)$).
One naïve way of solving model-based optimization (MBO) problems is to learn a proxy score function $f_\theta(x)$ via empirical risk minimization, and then maximize it with respect to $x$. However, naïve applications of such a method would fail for two reasons. First, the proxy function $f_\theta(x)$ may not be accurate outside the distribution on which it is trained, and optimization with respect to it may simply lead to values of $x$ for which $f_\theta(x)$ makes the largest mistake. The second problem is more subtle. When $x$ lies on a narrow manifold in a very high-dimensional space, such as the space of natural images, the optimizer can produce invalid values of $x$, which result in arbitrary outputs when fed into $f_\theta(x)$. Since the shape of this manifold is unknown, it is difficult to constrain the optimizer to prevent this. This second problem is rarely addressed or discussed in prior work, which typically focuses on optimization over low-dimensional and compact domains with known bounds.
3.1 Optimization via Inverse Maps
Part of the reason for the brittleness of the naïve approach above is that $f_\theta(x)$ has a high-dimensional input space, making it easy for the optimizer to find inputs $x$ for which the proxy function produces an unreasonable output. Can we instead learn a function with a small input space, which implicitly understands the space of valid, in-distribution values for $x$? The main idea behind our approach is to model an inverse map that produces a value of $x$ given a score value $y$, given by $f_\theta^{-1}: \mathcal{Y} \rightarrow \mathcal{X}$. The input to the inverse map is a scalar, making it comparatively easy to constrain to valid values, and by directly generating the inputs $x$, an approximation to the inverse function must implicitly understand which input values are valid. As multiple $x$ values can correspond to the same $y$, we design $f_\theta^{-1}$ as a stochastic map that maps a score value along with a $d_z$-dimensional random vector $z$ to an input $x$, $f_\theta^{-1}: \mathcal{Y} \times \mathcal{Z} \rightarrow \mathcal{X}$, where $z$ is distributed according to a prior distribution $p_0(z)$.
The inverse map training objective corresponds to standard distribution matching, analogously to standard generative models, which we will express in a somewhat more general way to simplify the exposition later. Let $p_\mathcal{D}(x, y)$ denote the data distribution, such that $p_\mathcal{D}(y)$ is the marginal over $y$, and let $p(y)$ be any distribution on $\mathcal{Y}$, which could be equal to $p_\mathcal{D}(y)$. We can train the proxy inverse map $f_\theta^{-1}$ by minimizing the following objective:
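$$\mathcal{L}_p(\mathcal{D}) = \mathbb{E}_{y \sim p(y)}\left[D\left(p_\mathcal{D}(x \mid y),\; p_{f_\theta^{-1}}(x \mid y)\right)\right], \tag{3}$$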
where $p_{f_\theta^{-1}}(x \mid y)$ is obtained by marginalizing over $z$, and $D$ is a measure of divergence between the two distributions. Using the Kullback-Leibler divergence leads to maximum likelihood learning, while the Jensen-Shannon divergence motivates a GAN-style training objective. MINs can be adapted to the contextual setting by passing in the context as an input and learning $f_\theta^{-1}(y_i, z, c_i)$.
While the basic idea behind MINs is simple, a number of implementation choices are important for good performance. Instead of choosing $p(y)$ to be $p_\mathcal{D}(y)$, as in standard ERM, in Section 3.3 we show that a careful choice of $p(y)$ leads to better performance. In Section 3.4, we then describe a method to perform active data sampling with MINs. We also present a method to generate the best optimization output $x^\star$ from a trained inverse map during evaluation in Section 3.2. The structure of the full MIN algorithm is shown in Algorithm 1, and a schematic flowchart of the procedure is shown in Figure 1.
3.2 Inference with Inverse Maps (ApproxInfer)
Once the inverse map is trained, the goal of our algorithm is to generate the best possible $x^\star$, which will maximize the true score function as well as possible. Since a score $y$ needs to be provided as input to the inverse map, we must select for which score $y$ to query the inverse map to obtain a near-optimal $x$. One naïve heuristic is to pick the best $y_{\max} \in \mathcal{D}$ and produce $x_{\max} \sim f_\theta^{-1}(y_{\max})$ as the output. However, the method should be able to extrapolate beyond the best score seen in the dataset, especially in contextual settings, where a good score may not have been observed for all contexts.
In order to extrapolate as far as possible, while still staying on the valid data manifold, we need to measure the validity of the generated values of $x$. One way to do this is to measure the agreement between the learned inverse map and an independently trained forward model $f_\theta$: the values of $y$ for which the generated samples $x$ are predicted to have a score similar to $y$ are likely in-distribution, whereas those where the forward model predicts a very different score may be too far outside the training distribution. This amounts to using the agreement between independently trained forward and inverse maps to quantify the degree to which a particular score $y$ is out-of-distribution. Since the latent variable $z$ captures the multiple possible outputs of the one-to-many inverse map, we can further optimize over $z$ for a given $y$ to find the best, most trustworthy $x$ subject to the constraint that $z$ has a high likelihood under the prior. This can be formalized as the following optimization:
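$$\tilde{y}^\star, \tilde{z}^\star := \arg\max_{y, z}\; y + \epsilon_1 \log p_0(z) - \epsilon_2 \left\| f_\theta\left(f_\theta^{-1}(y, z)\right) - y \right\|_2, \tag{4}$$
where $\epsilon_1$ and $\epsilon_2$ are scalar trade-off coefficients, and the returned solution is $x^\star = f_\theta^{-1}(\tilde{y}^\star, \tilde{z}^\star)$.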
This optimization can be motivated as finding an extrapolated score, higher than the scores observed in the dataset $\mathcal{D}$, that corresponds to values of $x$ that lie on the valid input manifold, and for which independently trained forward and inverse maps agree. Although this optimization uses an approximate forward map $f_\theta(x)$, we show in our experiments in Section 4 that it produces substantially better results than optimizing with respect to a forward model alone. The inverse map substantially constrains the search space, requiring an optimization over a 1-dimensional $y$ and a (relatively) low-dimensional $z$, rather than the full space of inputs. This can be viewed as a special (deterministic) case of a probabilistic optimization procedure, which we describe in Appendix A.
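To make the inference step concrete, the following is a minimal sketch of the search over a score and a latent variable described above. It is only an illustration under stated assumptions: `forward_model` and `inverse_map` are hypothetical hand-written stand-ins for the learned networks on a 1-D toy score $f(x) = -x^2$, the latent prior is taken to be a standard normal, the disagreement penalty is squared for simplicity, and the coefficients `eps1` and `eps2` and the grid search are illustrative choices rather than the procedure used in our experiments.

```python
import math


def forward_model(x):
    # Hypothetical stand-in for the learned forward model on a toy score f(x) = -x^2.
    return -x * x


def inverse_map(y, z):
    # Hypothetical stand-in for the learned inverse map: one of the two roots of
    # -x^2 = y, chosen by the sign of z. Scores above 0 are unattainable, so the
    # map saturates at x = 0 there (an off-distribution query).
    return math.copysign(math.sqrt(max(-y, 0.0)), z)


def log_prior(z):
    # Log-density of an assumed standard normal prior over the latent z.
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def approx_infer(eps1=1.0, eps2=5.0):
    # Grid search over (y, z) for the penalised objective
    #   y + eps1 * log p(z) - eps2 * (forward(inverse(y, z)) - y)^2.
    best = None
    for i in range(401):
        y = -2.0 + 0.01 * i
        for j in range(61):
            z = -3.0 + 0.1 * j
            gap = forward_model(inverse_map(y, z)) - y
            value = y + eps1 * log_prior(z) - eps2 * gap * gap
            if best is None or value > best[0]:
                best = (value, y, z)
    _, y_star, z_star = best
    return y_star, z_star, inverse_map(y_star, z_star)


y_star, z_star, x_star = approx_infer()
print(y_star, z_star, x_star)
```

In this toy, the selected score sits only slightly above the best attainable score of 0, because the forward-inverse disagreement penalty grows as soon as the queried score leaves the range where the two maps agree.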
3.3 Reweighting the Training Distribution
A naïve implementation of the training objective in Equation (3) samples $y$ from the data distribution $p_\mathcal{D}(y)$. However, as we are most interested in the inverse map's predictions for high values of $y$, it is much less important for the inverse map to predict accurate $x$ values for values of $y$ that are far from the optimum. We could consider increasing the weights on points with larger values of $y$. In the extreme case, we could train only on the best points: either the single datapoint with the largest $y$ or, in the contextual case, the points with the largest $y$ for each context. To formalize this notion, we can define the optimal $y$ distribution $p^\star(y)$, which is simply the delta function centered on the best $y$, $\delta_{y^\star}(y)$, in the deterministic case. If we assume that the observed scores have additive noise (i.e., we observe $f(x) + \varepsilon$ for a noise term $\varepsilon$), then $p^\star(y)$ would be a distribution centered around the optimal $y$.
We could attempt to train only on $p^\star(y)$, as values far from the optimum are much less important. However, this is typically impractical, since $p^\star(y)$ heavily down-weights most of the training data, leading to a very high-variance training objective. We can instead choose $p(y)$ to trade off the variance due to an overly peaked training distribution and the bias due to training on the "wrong" distribution (i.e., anything other than $p^\star(y)$).
When training under a distribution $p(y)$ other than $p_\mathcal{D}(y)$, we can use importance sampling, where we sample from $p_\mathcal{D}$ and assign an importance weight $w_i = p(y_i) / p_\mathcal{D}(y_i)$ to each datapoint $(x_i, y_i)$. The reweighted objective is given by $\hat{\mathcal{L}}_p(\mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_i w_i \cdot \hat{D}(x_i, f_\theta^{-1}(y_i))$. By bounding the variance and the bias of the gradient of $\hat{\mathcal{L}}_p(\mathcal{D})$, we obtain the following result, with the proof given in Appendix B:
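Theorem 3.1 (Bias + variance bound in MINs).
Let $\mathcal{L}(p^\star)$ be the objective under $p^\star(y)$ without sampling error: $\mathcal{L}(p^\star) = \mathbb{E}_{y \sim p^\star(y)}\left[D\left(p(x \mid y), f_\theta^{-1}(y)\right)\right]$. Let $N_y$ be the number of datapoints with the particular $y$ value observed in $\mathcal{D}$. For some constants $C_1, C_2, C_3$, with high confidence,
$$\mathbb{E}\left[\left\|\nabla_\theta \hat{\mathcal{L}}_p(\mathcal{D}) - \nabla_\theta \mathcal{L}(p^\star)\right\|_2^2\right] \le C_1\, \mathbb{E}_{y \sim p(y)}\left[\frac{1}{N_y}\right] + C_2\, \frac{d_2(p \,\|\, p_\mathcal{D})}{|\mathcal{D}|} + C_3 \cdot D_{\mathrm{TV}}(p^\star, p)^2,$$
where $d_2$ is the exponentiated Renyi divergence.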
Theorem 3.1 suggests a tradeoff between being close to the optimal $y$ distribution $p^\star(y)$ (third term) and reducing variance by covering the full data distribution $p_\mathcal{D}$ (second term). The distribution $p(y)$ that minimizes the bound in Theorem 3.1 has the following form: $p(y) \propto \frac{N_y}{N_y + K} \cdot g(p^\star(y))$, where $g(\cdot)$ is a monotonically increasing function of $p^\star(y)$ that ensures that the distributions $p$ and $p^\star$ are close. We empirically choose an exponential parametric form for this function $g$, which we describe in Section 3.5. This upweights the samples with higher scores and reduces the weight on rare $y$-values (i.e., those with low $N_y$), while preventing the weight on common $y$-values from growing, since $\frac{N_y}{N_y + K}$ saturates to $1$ for large $N_y$. This is consistent with our intuition: we would like to upweight datapoints with high $y$-values, provided the number of samples at those values is not too low. For continuous-valued scores, we rarely see the same score twice, so we bin the $y$-values into discrete bins for the purpose of reweighting, as we discuss in Section 3.5.
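As an illustration of this reweighting, the sketch below bins scores and computes per-bin target weights together with per-datapoint importance weights. It is a sketch under assumptions, not our exact implementation: it assumes the bin weight is proportional to $\frac{N_b}{N_b + K} \exp(-|b - y_{\max}| / \tau)$, uses bin centers to represent each bin, and the function names and the defaults for `k`, `tau`, and `num_bins` are hypothetical.

```python
import math
from collections import Counter


def bin_index(y, lo, width, num_bins):
    # Map a score to a bin index, clamping the maximum score into the last bin.
    return min(int((y - lo) / width), num_bins - 1)


def bin_weights(scores, num_bins=10, k=1.0, tau=1.0):
    # Target distribution over bins (assumed form):
    #   p(b) proportional to N_b / (N_b + k) * exp(-|centre_b - y_max| / tau).
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width if all scores are equal
    counts = Counter(bin_index(y, lo, width, num_bins) for y in scores)
    unnormalised = {}
    for b, n in counts.items():
        centre = lo + (b + 0.5) * width
        unnormalised[b] = n / (n + k) * math.exp(-abs(centre - hi) / tau)
    total = sum(unnormalised.values())
    return {b: w / total for b, w in unnormalised.items()}, counts, (lo, width)


def importance_weights(scores, num_bins=10, k=1.0, tau=1.0):
    # Per-datapoint weight w_i = p(b_i) / p_D(b_i), where p_D(b) is the
    # empirical frequency of bin b in the dataset.
    target, counts, (lo, width) = bin_weights(scores, num_bins, k, tau)
    n_total = len(scores)
    weights = []
    for y in scores:
        b = bin_index(y, lo, width, num_bins)
        weights.append(target[b] / (counts[b] / n_total))
    return weights


scores = [0.0] * 10 + [5.0] * 10 + [10.0] * 10
target, _, _ = bin_weights(scores, num_bins=3, tau=5.0)
print(target)
```

With equally populated bins, as in this usage example, the weights increase monotonically with the score; the count-dependent factor only matters when some bins are sparsely populated.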
3.4 Active Data Collection via Randomized Labeling
While the passive setting requires care in finding the best value of $y$ for the inverse map, the active setting presents a different challenge: choosing a new query point $x$ at each iteration to augment the dataset $\mathcal{D}$ and make it possible to find the best possible optimum. Prior work on bandits and Bayesian optimization often uses Thompson sampling (TS) (Russo and Van Roy, 2016; Russo et al., 2018; Srinivas et al., 2010) as the data-collection strategy. TS maintains a posterior distribution over functions $p(f_t \mid \mathcal{D}_{1:t})$. At each iteration, it samples a function from this distribution and queries the point $x_t^\star$ that greedily optimizes this function. TS offers an appealing query mechanism, since it achieves sublinear Bayesian regret (the expected cumulative difference between the value of the optimal input and the selected input), given by $\mathcal{O}(\sqrt{T})$, where $T$ is the number of queries. Maintaining a posterior over high-dimensional parametric functions is generally intractable. However, we can approximate Thompson sampling with MINs. First, note that sampling $f_t$ from the posterior is equivalent to sampling $(x, y)$ pairs consistent with $f_t$: given sufficiently many $(x, y)$ pairs, there is a unique smooth function $f_t$ that satisfies $y_i = f_t(x_i)$. For example, we can infer a quadratic function exactly from three points. For a more formal description, we refer readers to the notion of eluder dimension (Russo and Van Roy, 2013). Thus, instead of maintaining intractable beliefs over the function, we can identify a function by the samples it generates, and define a way to sample synthetic $(x, y)$ points such that they implicitly define a unique function sample from the posterior.
To apply this idea to MINs, we train the inverse map $f_{\theta_t}^{-1}$ at each iteration $t$ with an augmented dataset $\mathcal{D}'_t = \mathcal{D}_t \cup \mathcal{S}_t$, where $\mathcal{S}_t = \{(\tilde{x}_j, \tilde{y}_j)\}_{j=1}^{K}$ is a dataset of synthetically generated input-score pairs corresponding to unseen $y$ values in $\mathcal{D}_t$. Training $f_{\theta_t}^{-1}$ on $\mathcal{D}'_t$ corresponds to training $f_{\theta_t}^{-1}$ to be an approximate inverse map for a function $f_t$ sampled from $p(f_t \mid \mathcal{D}_{1:t})$, as the synthetically generated samples $\mathcal{S}_t$ implicitly induce a model of $f_t$. We can then approximate Thompson sampling by obtaining $x_t^\star$ from $f_{\theta_t}^{-1}$, labeling it via the true function, and adding it to $\mathcal{D}_t$ to produce $\mathcal{D}_{t+1}$. Pseudocode for this method, which we call "randomized labeling," is presented in Algorithm 2. In Appendix C, we further derive regret guarantees under mild assumptions.
Implementation-wise, this method is simple, does not require estimating explicit uncertainty, and works with arbitrary function classes, including deep neural networks.
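The loop described above can be sketched in a few lines. This is a deliberately simplified illustration, not Algorithm 2 itself: `true_f` is a hypothetical 1-D oracle, the "inverse map" is replaced by a nearest-score lookup over the augmented dataset instead of a trained generative model, and the number of synthetic points, the noise scale, and the input bounds are all assumed values.

```python
import random


def true_f(x):
    # Hypothetical oracle for the unknown score function (maximum at x = 0.7).
    return -(x - 0.7) ** 2


def make_synthetic(data, k, low, high, rng):
    # Pair random inputs with optimistic labels: a high observed score plus
    # positive noise, so synthetic labels are at least as high as an observed one.
    top = sorted(y for _, y in data)[-max(1, len(data) // 4):]
    return [(rng.uniform(low, high), rng.choice(top) + abs(rng.gauss(0.0, 0.1)))
            for _ in range(k)]


def fit_inverse(pairs):
    # Toy stand-in for training an inverse map: return the input whose
    # (real or synthetic) label is closest to the queried score.
    def inverse(y):
        return min(pairs, key=lambda p: abs(p[1] - y))[0]
    return inverse


def randomized_labeling(data, rounds, rng, low=0.0, high=1.0):
    data = list(data)
    for _ in range(rounds):
        synthetic = make_synthetic(data, k=5, low=low, high=high, rng=rng)
        augmented = data + synthetic
        inverse = fit_inverse(augmented)          # "train" on real + synthetic pairs
        y_query = max(y for _, y in augmented)    # ask for the most optimistic score
        x_new = inverse(y_query)
        data.append((x_new, true_f(x_new)))       # only true labels enter the dataset
    return data


rng = random.Random(0)
start = [(x, true_f(x)) for x in (0.1, 0.4, 0.9)]
final = randomized_labeling(start, rounds=10, rng=rng)
print(max(y for _, y in final))
```

Note that the synthetic labels are discarded after each round; only inputs labeled by the true function are added to the dataset.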
3.5 Practical Implementation of MINs
In this section, we describe our instantiation of MINs for high-dimensional inputs with deep neural network models. GANs (Goodfellow et al., 2014) have been successfully used to model the manifold of high-dimensional inputs without the need for explicit density modelling, and are known to produce more realistic samples than other models such as VAEs (Kingma and Welling, 2013) or Flows (Dinh et al., 2016). The inverse map in MINs needs to model the manifold of valid $x$, thus making GANs a suitable choice. We can instantiate our inverse map with a GAN by choosing $D$ in Equation 3 to be the Jensen-Shannon divergence measure. Since we generate $x$ conditioned on $y$, the discriminator is parameterized as $\mathrm{Discr}(x \mid y)$, and trained to output 1 for a valid $(x, y)$ pair (i.e., where $y = f(x)$ and $x$ comes from the data) and 0 otherwise. Thus, we optimize the following objective:
$$\min_\theta \max_{\mathrm{Discr}}\; \mathcal{L}_p(\mathcal{D}) = \mathbb{E}_{y \sim p(y)}\left[\mathbb{E}_{x \sim p_\mathcal{D}(x \mid y)}\left[\log \mathrm{Discr}(x \mid y)\right] + \mathbb{E}_{z \sim p_0(z)}\left[\log\left(1 - \mathrm{Discr}\left(f_\theta^{-1}(y, z) \mid y\right)\right)\right]\right]$$
This model is similar to a conditional GAN (cGAN), which has been used in the context of modeling the distribution of $x$ conditioned on a discrete-valued label (Mirza and Osindero, 2014). As discussed in Section 3.3, we additionally reweight the data distribution using importance sampling. To that end, we discretize the space $\mathcal{Y}$ into $B$ discrete bins $b_1, \cdots, b_B$ and, following Section 3.3, weight each bin $b_i$ according to $p(b_i) \propto \frac{N_{b_i}}{N_{b_i} + K} \exp\left(-\frac{|b_i - y^\star|}{\tau}\right)$, where $N_{b_i}$ is the number of datapoints in the bin, $y^\star$ is the maximum score observed, and $\tau$ is a hyperparameter. (After discretization, using notation from Section 3.3, for any $y$ that lies in bin $b$, $p^\star(y) := p^\star(b) \propto \exp\left(-\frac{|b - y^\star|}{\tau}\right)$ and $p_\mathcal{D}(y) := p_\mathcal{D}(b) \propto N_b$.) Experimental details are provided in Appendix C.4.
In the active setting, we perform active data collection using the randomized labeling algorithm described in Section 3.4. In practice, we train two copies of $f_\theta^{-1}$. The first, which we call the exploration model $f_{\mathrm{expl}}^{-1}$, is trained with data augmented via synthetically generated samples (i.e., $\mathcal{D}'_t$). The other copy, called the exploitation model $f_{\mathrm{exploit}}^{-1}$, is trained on only real samples (i.e., $\mathcal{D}_t$). This improves stability during training, while still performing data collection as dictated by Algorithm 2. To generate the augmented dataset $\mathcal{D}'_t$ in practice, we sample $y$ values from $p^\star(y)$ (the distribution over high-scoring $y$s observed in $\mathcal{D}_t$), and add positive-valued noise, thus making the augmented $y$ values higher than those in the dataset, which promotes exploration. The corresponding inputs $x$ are simply sampled from the dataset $\mathcal{D}_t$, or uniformly sampled from the bounded input domain when one is provided in the problem statement (for example, benchmark function optimization). After training, we infer the best possible $x^\star$ from the trained model using the inference procedure described in Section 3.2. In the active setting, the inference procedure is applied on $f_{\mathrm{exploit}}^{-1}$, the inverse map that is trained only on real data points.
4 Experimental Evaluation
The goal of our empirical evaluation is to answer the following questions. (1) Can MINs successfully solve optimization problems of the form shown in Equations (1) and (2), in static settings and active settings, better than or comparably to prior methods? (2) Can MINs generalize to high-dimensional spaces, where valid inputs lie on a lower-dimensional manifold, such as the space of natural images? (3) Is reweighting the data distribution important for effective data-driven model-based optimization? (4) Does our proposed inference procedure effectively discover valid inputs with better values than any value seen in the dataset? (5) Does randomized labeling help in active data collection?
4.1 Data-Driven Optimization with Static Datasets
We first study the data-driven model-based optimization setting. This requires generating points that achieve a better score than any point in the training set or, in the contextual setting, better than the policy that generated the dataset for each context. We evaluate our method on a batch contextual bandit task proposed in prior work (Joachims et al., 2018), and on a high-dimensional contextual image optimization task. We also evaluate our method on several non-contextual tasks that require optimizing over high-dimensional image inputs to evaluate a semantic score function, including handwritten characters and real-world photographs.
Batch contextual bandits. We first study the contextual optimization problem described in Equation (2). The goal is to learn a policy, purely from static data, that predicts the correct bandit arm $x$ for each context $c$, such that the policy achieves a high score $f(c, \pi(c))$ on average across contexts drawn from a distribution $p_0(c)$. We follow the protocol set out by Joachims et al. (2018), which evaluates contextual bandit policies trained on a static dataset for a simulated classification task. The data is constructed by selecting images from the (MNIST/CIFAR) dataset as the context $c$, a random label as the input $x$, and a binary indicator of whether or not the label is correct as the score $y$. Multiple schemes can be used for mapping contexts to labels for generating the training dataset, and we evaluate on two such schemes, as described below. We report the average score on a set of new contexts, which is equal to the average 0-1 accuracy of the learned model on a held-out test set of images (contexts). We compare our method to previously proposed techniques, including the BanditNet model proposed by Joachims et al. (2018), on the MNIST and CIFAR-10 (Krizhevsky, 2009) datasets. Note that this task is different from regular classification, in that the observed feedback ($(c_i, x_i, y_i)$ tuples) is partial, i.e., we do not observe the correct label for each context (image) $c_i$, but only whether or not the label $x_i$ in the training tuple is correct.
Dataset & Type  BanditNet  BanditNet  MIN w/o I  MIN (Ours)  MINs w/o R 

MNIST (49% corr.)  95.0 ± 0.16  95.0 ± 0.21  
MNIST (Uniform)  93.67 ± 0.51  92.8 ± 0.01  
CIFAR-10 (49% corr.)  92.21 ± 1.0  89.02 ± 0.05  
CIFAR-10 (Uniform)  77.12 ± 0.54  74.87 ± 0.12 
We evaluate on two datasets: (1) data generated by selecting random labels for each context and (2) data where the correct label is used 49% of the time, which matches the protocol in prior work (Joachims et al., 2018). We compare to BanditNet (Joachims et al., 2018) on identical dataset splits. We report the average 0-1 test accuracy for all methods in Table 1. The results show that MINs drastically outperform BanditNet on both MNIST and CIFAR-10 datasets, indicating that MINs can successfully perform contextual model-based optimization in the static (data-driven) setting.
Ablations. The results in Table 1 also show that utilizing the inference procedure in Section 3.2 produces an improvement of about 1.5% and 1.0% in test accuracy on MNIST and CIFAR-10, respectively. Utilizing reweighting gives a slight performance boost of about 2.5% on CIFAR-10.
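The two logging schemes can be illustrated with a short sketch that builds (context, arm, score) tuples from a labeled classification set. This is an assumption-laden illustration rather than the exact data pipeline: the function name is hypothetical, contexts are represented by integer identifiers instead of images, and for the 49% scheme we assume the remaining labels are drawn uniformly from the incorrect classes, a detail the protocol description above does not specify.

```python
import random


def make_logged_data(true_labels, num_classes, p_correct, rng):
    # Build logged bandit feedback. Each tuple is (context_id, arm, score),
    # where score is 1 if the logged arm equals the true label and 0 otherwise.
    logged = []
    for context_id, label in enumerate(true_labels):
        if p_correct is None:
            arm = rng.randrange(num_classes)      # uniform logging policy
        elif rng.random() < p_correct:
            arm = label                           # correct label is logged
        else:
            wrong = [a for a in range(num_classes) if a != label]
            arm = rng.choice(wrong)               # assumed: uniform over wrong labels
        logged.append((context_id, arm, int(arm == label)))
    return logged


rng = random.Random(0)
labels = [i % 10 for i in range(10000)]
data_49 = make_logged_data(labels, 10, 0.49, rng)
data_uniform = make_logged_data(labels, 10, None, rng)
print(sum(y for _, _, y in data_49) / len(data_49))
print(sum(y for _, _, y in data_uniform) / len(data_uniform))
```

The learner only ever sees the binary score for the logged arm, never the true label, which is what distinguishes this setting from supervised classification.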



Figure 2: Results for MINs, ablations of MINs, and direct optimization of a forward model, which starts with a random image from the dataset and updates it via stochastic gradient descent for the highest score based on the forward model. Observe that MINs can produce the thickest characters that still resemble valid digits. Optimizing the forward function often turns non-digit pixels on, thus going off the valid manifold. Both the reweighting and the inference procedure are important for good results. Scores are mentioned beneath each figure. The larger the score the better, provided the solution $x$ is the image of a valid digit. Dataset average is 149.0.
Character stroke width optimization. In the next experiment, we study how well MINs optimize over high-dimensional inputs, where valid inputs lie on a lower-dimensional manifold. We constructed an image optimization task out of the MNIST (LeCun and Cortes, 2010) dataset. The goal is to optimize directly over the image pixels, to produce images with the thickest stroke width, such that the image corresponds, in the first scenario (a), to any valid character, and in the second scenario (b), to a valid instance of a particular character class ( in this case). A successful algorithm will produce the thickest character that is still recognizable.
In Figure 2, we observe that MINs generate images that maximize the respective score functions in each case. We also evaluate on a harder task, where the goal is to maximize the number of disconnected blobs of black pixels in an image of a digit. For comparison, we evaluate a method that directly optimizes the image pixels with respect to a forward model, of the form $\arg\max_x f_\theta(x)$. In this case, the solutions are far off the manifold of valid characters.
Ablations. We also compare to MINs without reweighting (MIN-R) and without the inference procedure (MIN-I), where $y$ is the maximum possible $y$ in the dataset, to demonstrate the benefits of these two aspects. Observe that MIN-I sometimes yields invalid solutions ((a) and (c)), and MIN-R fails to output $x$ belonging to the highest score class ((a) and (c)), thus indicating the importance of both components.
Semantic image optimization. The goal in these tasks is to quantify the ability of MINs to optimize high-level properties that require semantic understanding of images. We consider MBO tasks on the IMDB-Wiki faces (Rothe et al., 2015, 2016) dataset, where the function $f(x)$ is the negative of the age of the person in the image. Hence, images with younger people have higher scores.


We construct two versions of this task: one where the training data consists of all faces older than 15 years, and the other where the model is trained on all faces older than 25 years. This ensures that our model cannot simply copy the youngest face. To obtain ground truth scores for the generated faces, we use subjective judgement from human participants. We perform a study with 13 users. Each user was asked to answer a set of 35 binary-choice questions, each asking the user to pick the older image of the two provided alternatives. We then fit an age function to this set of binary preferences, analogously to Christiano et al. (2017).
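One way to fit a scalar function to such binary preferences is a logistic (Bradley-Terry-style) model, sketched below. This is an assumed illustration, not a description of our exact fitting procedure: each image gets a latent score, the probability that image $i$ is judged older than image $j$ is modeled as a sigmoid of the score difference, and the function name, learning rate, step count, and L2 penalty are hypothetical choices. Such a model recovers only a relative ordering; mapping latent scores to ages in years would additionally require images with known ages as anchors.

```python
import math


def fit_latent_scores(num_items, comparisons, steps=500, lr=0.1, l2=0.01):
    # comparisons: list of (older, younger) index pairs from the user study.
    # Gradient ascent on the regularised log-likelihood of the logistic model
    #   P(older judged older than younger) = sigmoid(s[older] - s[younger]).
    scores = [0.0] * num_items
    for _ in range(steps):
        grad = [-l2 * s for s in scores]  # L2 penalty keeps scores finite
        for older, younger in comparisons:
            p = 1.0 / (1.0 + math.exp(-(scores[older] - scores[younger])))
            grad[older] += 1.0 - p
            grad[younger] -= 1.0 - p
        scores = [s + lr * g for s, g in zip(scores, grad)]
    return scores


# Toy usage: image 2 is always judged oldest; image 1 usually beats image 0.
votes = [(2, 1)] * 3 + [(1, 0)] * 3 + [(2, 0)] * 3 + [(0, 1)]
latent = fit_latent_scores(3, votes)
print(latent)
```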
Task  MIN  MIN (best) 

15  13.6  12.2 
25  26.2  23.9 
Figure 3 shows the images produced by MINs. For comparison, we also present a sample of images from the dataset partitioned by the ground truth score. We find that the most likely age for optimal images produced by training MINs on images of people 15 years or older was 13.6 years, with the best image having an age of 12.2. The model trained on ages 25 and above produced more mixed results, with an average age of 26.2, and a minimum age of 23.9. We report these results in Table 2. This task is exceptionally difficult, since the model must extrapolate outside of the ages seen in the training set, picking up on patterns in the images that can be used to produce faces that appear younger than any face that the model had seen, while avoiding unrealistic images.
We also conducted experiments on contextual image optimization with MINs. We studied contextual optimization over handwritten digits to maximize stroke width, using either the character category as the context $c$, or the top one-fourth or top half of the image. In the latter case, MINs must learn to complete the image while maximizing the stroke width. In the case of class-conditioned optimization, MINs attain an average score over the classes of 237.6, while the dataset average is 149.0.
Mask  MIN  Dataset 

mask A  223.57  149.0 
mask B  234.32  149.0 
In the case where the context is the top half or quarter of the image, MINs obtain average scores of 223.57 and 234.32, respectively, while the dataset average is 149.0 for both tasks. We report these results in Table 3. We also conducted a contextual optimization experiment on faces from the CelebA dataset, with some example images shown in Figure 4. The context corresponds to the choice for the attributes brown hair, black hair, bangs, or moustache. The optimization score is given by the sum of the attributes wavy hair, eyeglasses, smiling, and no beard. Qualitatively, we can see that MINs successfully optimize the score while obeying the target context, though evaluating the true score is impossible without subjective judgement on this task. We discuss these experiments in more detail in Appendix D.1.
4.2 Optimization with Active Data Collection
In the active MBO setting, MINs must select which new datapoints to query to improve their estimate of the optimal input. In this setting, we compare to prior model-based optimization methods, and evaluate the exploration technique described in Section 3.4.
Global optimization on benchmark functions. We first compare MINs to prior work in Bayesian optimization on standard benchmark problems (DNGO) (Snoek et al., 2015): the 2D Branin function and the 6D Hartmann function. As shown in Table 4, MINs reach within units of the global minimum (minimization is performed here, instead of maximization), performing comparably with commonly used Bayesian optimization methods based on Gaussian processes. We do not expect MINs to be as efficient as GP-based methods, since MINs rely on training parametric neural networks with many parameters, which is less efficient than GPs on low-dimensional tasks. Exact Gaussian processes and adaptive Bayesian linear regression (Snoek et al., 2015) outperform MINs in terms of optimization precision and the number of samples queried, but MINs achieve comparable performance with about more samples.
Ablations. We also report the performance of MINs without the randomized labeling query method, instead selecting the next query point by greedily maximizing the current model with some additive noise (MIN + greedy). The randomized labeling method produces better results than the greedy data collection approach, indicating the importance of effective active data collection methods for MINs.
Function  Spearmint  DNGO  MIN  MIN + greedy 

Branin (0.398)  (800)  
Hartmann6 (3.322)  (600)  (1200) 
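For reference, the 2D Branin benchmark can be written in a few lines. The sketch below uses the standard parameterization of the function from the optimization literature (an assumption here, since this section does not restate it); its global minimum value of roughly 0.398 matches the value listed next to Branin in the table above. The function name `branin` is illustrative.

```python
import math

def branin(x1, x2):
    """Standard 2D Branin function, usually evaluated on x1 in [-5, 10], x2 in [0, 15]."""
    a = 1.0
    b = 5.1 / (4.0 * math.pi ** 2)
    c = 5.0 / math.pi
    r = 6.0
    s = 10.0
    t = 1.0 / (8.0 * math.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1.0 - t) * math.cos(x1) + s

# One of the three global minimizers of the standard Branin function.
x_star = (math.pi, 2.275)
f_star = branin(*x_star)
```

Since the benchmark is posed as minimization, a maximizing optimizer would be applied to the negated function.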
Protein fluorescence maximization. In the next experiment, we study a high-dimensional active MBO task, previously studied by Brookes et al. (2019). This task requires optimizing over protein designs by selecting variable-length sequences of codons, where each codon can take on one of 20 values. In order to model discrete values, we use a Gumbel-softmax GAN, previously employed by Gupta and Zou (2018) and used as a baseline by Brookes et al. (2019). For backpropagation, we choose a temperature for the Gumbel-softmax operation. This is also mentioned in Appendix D. The aim in this task is to produce a protein with maximum fluorescence. Each algorithm is provided with a starting dataset, and then allowed an identical, limited number of score function queries. For each query made by an algorithm, it receives a score value from an oracle. We use the trained oracles released by Brookes et al. (2019). These oracles are separately trained forward models, and are inaccurate, especially for datapoints not observed in the starting static dataset.
We compare to CbAS (Brookes et al., 2019) and other baselines, including CEM (cross entropy method), RWR (reward weighted regression), and a method that uses a forward model, GB (Gómez-Bombarelli et al., 2018), as reported by Brookes et al. (2019).
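The Gumbel-softmax relaxation used for the discrete sequence variables can be sketched as follows. This is a minimal NumPy illustration, not the GAN used in the experiments: the function name, the sequence length of 5, and in particular the temperature value of 1.0 are placeholders chosen for the example and are not the settings used in the paper.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw a relaxed one-hot sample for each row of `logits`."""
    # Perturb the logits with i.i.d. standard Gumbel noise.
    gumbel_noise = rng.gumbel(size=logits.shape)
    z = (logits + gumbel_noise) / temperature
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    return probs / probs.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 20))   # 5 positions, 20 possible values each
relaxed = gumbel_softmax_sample(logits, temperature=1.0, rng=rng)
hard = relaxed.argmax(axis=-1)      # discrete choice per position
```

Lower temperatures push each row of `relaxed` closer to a one-hot vector, at the cost of higher-variance gradients when the operation is differentiated.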
Method  Max  50ile 

MIN (Ours)  3.42  3.24 
MIN - R  3.37  3.28 
CbAS  3.36  3.28 
RWR  3.00  2.97 
CEM-PI  2.92  2.9 
GB  3.25  3.25 
For evaluation, we report the ground truth score of the output of optimization (max), and the 50th-percentile ground truth score of all the samples produced via sampling (without inference, in the case of MINs), so as to be comparable to Brookes et al. (2019). In Table 5, we show that MINs are comparable to the best performing method on this task, and produce samples with the highest score among all the methods considered.
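The two reported statistics can be computed from a batch of sample scores as in the sketch below. The helper name `summarize_scores` and the input values are made-up placeholders for illustration, not results from the experiments.

```python
import numpy as np

def summarize_scores(scores):
    """Return the maximum and the 50th-percentile (median) score of a batch."""
    scores = np.asarray(scores, dtype=float)
    return {"max": float(scores.max()), "50ile": float(np.percentile(scores, 50))}

# Placeholder scores, used only to exercise the helper.
example = summarize_scores([2.9, 3.1, 3.0, 3.4, 3.2])
```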
Ablations. In Table 5, we also compare to MINs without reweighting in the active setting, which lead to more consistent sample quality (higher 50% score), but do not produce the highest scoring sample unlike MINs with reweighting.
These results suggest that MINs can perform competitively with previously proposed modelbased optimization methods in the active setting, reaching comparable or better performance when compared both to Bayesian optimization methods and previously proposed methods for a higherdimensional protein design task.
5 Discussion
In this work, we presented a novel approach towards model-based optimization (MBO). Instead of learning a proxy forward function from inputs to scores, MINs learn a stochastic inverse mapping from scores to inputs. MINs are resistant to out-of-distribution inputs and can optimize over high-dimensional values where valid inputs lie on a narrow manifold. By using simple and principled design decisions, such as reweighting the data distribution, MINs can perform effective model-based optimization even from static, previously collected datasets in the data-driven setting without the need for active data collection. We also described ways to perform active data collection if needed. Our experiments showed that MINs are capable of solving MBO tasks in both contextual and non-contextual settings, and are effective over highly semantic score functions such as the age of the person in an image.
Prior work has usually considered MBO in the active or "onpolicy" setting, where the algorithm actively queries data as it learns. In this work, we introduced the datadriven MBO problem statement and devised a method to perform optimization in such scenarios. This is important in settings where data collection is expensive and where abundant datasets exist, for example, protein design, aircraft design and drug design. Further, MINs define a family of algorithms that show promising results on MBO problems on extremely large input spaces.
While MINs scale to highdimensional tasks such as modelbased optimization over images, and are performant in both contextual and noncontextual settings, we believe there are a number of interesting open questions for future work. The interaction between active data collection and reweighting should be investigated in more detail, and poses interesting consequences for MBO, bandits and reinforcement learning. Better and more principled inference procedures are also a direction for future work. Another avenue is to study various choices of training objectives in MIN optimization.
Acknowledgements
We thank all members of the Robotic AI and Learning Lab at UC Berkeley for their participation in the human study. We thank anonymous reviewers for feedback on an earlier version of this paper. This research was funded by the DARPA Assured Autonomy program, the National Science Foundation under IIS-1700697, and compute support from Google, Amazon, and NVIDIA.
References
 Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, External Links: Link Cited by: §1.
 Conditioning by adaptive sampling for robust design. In Proceedings of the 36th International Conference on Machine Learning, External Links: Link Cited by: §1, §2, §4.2, §4.2, §4.2, Table 5.
 Deep reinforcement learning from human preferences. In NIPS, Cited by: §4.1.
 Density estimation using real nvp.. CoRR abs/1605.08803. External Links: Link Cited by: §2, §3.5.
 Conditional neural processes. In Proceedings of the 35th International Conference on Machine Learning, Cited by: §2.
 Neural processes. CoRR abs/1807.01622. External Links: Link, 1807.01622 Cited by: §2.
 Automatic chemical design using a datadriven continuous representation of molecules. In ACS central science, Cited by: §4.2.
 Generative adversarial nets. NIPS’14. Cited by: §1, §2, §3.5.
 Feedback gan (fbgan) for dna: a novel feedbackloop architecture for optimizing protein functions. ArXiv abs/1804.01694. Cited by: §4.2.
 Geometric programming for aircraft design optimization. Vol. 52, pp. . External Links: ISBN 9781600869372, Document Cited by: §1.
 Categorical reparameterization with gumbelsoftmax.. CoRR abs/1611.01144. External Links: Link Cited by: §D.4.
 Deep learning with logged bandit feedback. In International Conference on Learning Representations, External Links: Link Cited by: §D.4, §1, §2, §4.1, §4.1, §4.1, Table 1.
 Attentive neural processes. In International Conference on Learning Representations, External Links: Link Cited by: §2.
 Autoencoding variational bayes. Note: cite arxiv:1312.6114 External Links: Link Cited by: §3.5.
 Learning multiple layers of features from tiny images. Technical report . Cited by: §4.1.
 MNIST handwritten digit database. Note: http://yann.lecun.com/exdb/mnist/ External Links: Link Cited by: §4.1.
 Dataefficient learning of morphology and controller for a microrobot. In 2019 IEEE International Conference on Robotics and Automation, External Links: Link Cited by: §1.
 [18] PyTorch GAN. External Links: Link Cited by: §D.4.

 Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: §D.1.
 Policy optimization via importance sampling. NIPS’18. External Links: Link Cited by: Lemma B.1.
 Conditional generative adversarial nets. Note: cite arxiv:1411.1784 External Links: Link Cited by: §3.5.
 [22] Incomplete conditional density estimation for fast materials discovery. In Proceedings of the 2019 SIAM International Conference on Data Mining, pp. 549–557. External Links: Document, Link, https://epubs.siam.org/doi/pdf/10.1137/1.9781611975673.62 Cited by: §2.

 Variational discriminator bottleneck: improving imitation learning, inverse RL, and GANs by constraining information flow. In International Conference on Learning Representations, External Links: Link Cited by: §D.4.
 Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07. Cited by: §2.
 DEX: deep expectation of apparent age from a single image. In IEEE International Conference on Computer Vision Workshops (ICCVW), Cited by: §4.1.
 Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision (IJCV). Cited by: §4.1.

 The cross entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation (information science and statistics). Springer-Verlag, Berlin, Heidelberg. External Links: ISBN 038721240X Cited by: §2.
 Optimization of computer simulation models with rare events. European Journal of Operations Research 99, pp. 89–112. Cited by: §2.
 A tutorial on thompson sampling. Found. Trends Mach. Learn. 11 (1), pp. 1–96. External Links: ISSN 19358237, Link, Document Cited by: Appendix C, §3.4.
 Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems 26, Cited by: Appendix C, §3.4.
 An informationtheoretic analysis of thompson sampling. J. Mach. Learn. Res. 17 (1), pp. 2442–2471. External Links: ISSN 15324435, Link Cited by: Appendix C, Appendix C, Appendix C, Appendix C, Appendix C, Lemma C.1, §3.4.
 [32] BanditNet. External Links: Link Cited by: §D.4.
 Taking the human out of the loop: a review of bayesian optimization. Proceedings of the IEEE 104, pp. 148–175. Cited by: §2.
 Practical bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems  Volume 2, NIPS’12. External Links: Link Cited by: §1, §1, §2.
 Scalable bayesian optimization using deep neural networks. In Proceedings of the 32nd International Conference on Machine Learning, Cited by: §1, §2, §4.2.
 Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10. Cited by: §3.4.
 Counterfactual risk minimization: learning from logged bandit feedback. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning  Volume 37, ICML’15. Cited by: §1, §1, §2.
 The selfnormalized estimator for counterfactual learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems  Volume 2, NIPS’15. Cited by: §2.
 . In Proceedings of the 33rd International Conference on International Conference on Machine Learning  Volume 48, ICML’16. Cited by: §2.
 Parallel WaveNet: fast highfidelity speech synthesis. In Proceedings of the 35th International Conference on Machine Learning, Cited by: §2.

 SeqGAN: sequence generative adversarial nets with policy gradient. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17. Cited by: §2.
 Generative visual manipulation on the natural image manifold. In Proceedings of European Conference on Computer Vision (ECCV), Cited by: §1.
 Neural architecture search with reinforcement learning. External Links: Link Cited by: §1.
Appendix A Probabilistic Interpretation of Section 3.2
In this section, we show that the inference scheme described in Equation 4, Section 3.2 emerges as a deterministic relaxation of the probabilistic inference scheme described below. We reiterate that in Section 3.2, a singleton is the output of optimization, however the procedure can be motivated from the perspective of the following probabilistic inference scheme.
Let denote a stochastic inverse map, and let be a probabilistic forward map. Consider the following optimization problem:
(5) 
where is the probability distribution induced by the learned inverse map (in our case, this corresponds to the distribution of induced due to randomness in ), is the learned forward map, is Shannon entropy, and is KL-divergence measure between two distributions. In Equation 4, maximization is carried out over the input to the inverse map, and the input which is captured in in the above optimization problem, i.e. maximization over in Equation 4 is equivalent to choosing subject to the choice of singleton/Dirac-delta . The Lagrangian is given by:
In order to derive Equation 4, we restrict to the Dirac-delta distribution generated by querying the learned inverse map at a specific value of . Now note that the first term in the Lagrangian corresponds to maximizing the "reconstructed" similarly to the first term in Equation 4. If is assumed to be a Gaussian random variable with a fixed variance, then , where is the mean of the probabilistic forward map. With deterministic forward maps, we make the assumption that (the queried value of ), which gives us the second term from Equation 4.
Finally, in order to obtain the term, note that, (by the data processing inequality for KLdivergence). Hence, constraining instead of the true divergence gives us a lower bound on . Maximizing this lower bound (which is the same as Equation 4) hence also maximizes the true Lagrangian .
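The Gaussian step in the argument above can be made explicit. Writing $\mu(x)$ for the mean of the probabilistic forward map and $\sigma^2$ for its fixed variance (notation assumed here for illustration), the log-likelihood of a score $y$ is, up to an additive constant, a negative squared error:

```latex
\log p(y \mid x)
  = \log \mathcal{N}\!\left(y;\, \mu(x),\, \sigma^2\right)
  = -\frac{\left(y - \mu(x)\right)^2}{2\sigma^2}
    - \frac{1}{2}\log\!\left(2\pi\sigma^2\right).
```

Since the last term depends on neither $x$ nor $y$, maximizing this log-likelihood amounts to minimizing the squared distance between $y$ and $\mu(x)$.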
Appendix B BiasVariance Tradeoff during MIN training
In this section, we provide details on the bias-variance tradeoff that arises in MIN training. Our analysis is primarily based on analysing the bias and variance in the norm of the gradient in two cases: if we had access to infinite samples of the distribution over optimal s, (this is a Dirac-delta distribution when function evaluations are deterministic, and a distribution with non-zero variance when the function evaluations are stochastic or are corrupted by noise). Let denote the empirical objective that the inverse map is trained with. We first analyze the variance of the gradient estimator in Lemma B.2. In order to analyse this, we will need the expression for the variance of the importance sampling estimator, which is captured in the following Lemma.
Lemma B.1 (Variance of IS (Metelli et al., 2018)).
Let and be two probability measures on the space such that . Let be randomly drawn samples from , and let be a uniformly bounded function. Then for any , with probability at least ,
(6) 
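To build intuition for Lemma B.1, the following sketch estimates an expectation under one distribution using samples from another, and inspects the second moment of the importance weights, which is the quantity that an exponentiated Rényi divergence of order 2 measures. The two unit-variance Gaussians, the test function, and all names are assumptions made for illustration; for this particular pair the second moment of the weights has the closed form exp(shift^2).

```python
import numpy as np

def importance_weights(x, shift):
    """Density ratio p(x)/q(x) for P = N(shift, 1) and Q = N(0, 1)."""
    return np.exp(shift * x - 0.5 * shift ** 2)

rng = np.random.default_rng(0)
shift = 0.5
x = rng.normal(size=200_000)             # samples drawn from Q
w = importance_weights(x, shift)

is_estimate = np.mean(w * x)             # importance-sampling estimate of E_P[x] = shift
weight_second_moment = np.mean(w ** 2)   # Monte Carlo estimate of E_Q[(p/q)^2]
closed_form = np.exp(shift ** 2)         # exact value of E_Q[(p/q)^2] for this pair
```

Increasing `shift` moves P away from Q and the second moment of the weights grows quickly, which is the regime where reweighting toward rarely observed scores produces high-variance estimates.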
Equipped with Lemma B.1, we are ready to show the variance in the gradient due to reweighting to a distribution for which only a few datapoints are observed.
Lemma B.2 (Gradient Variance Bound for MINs).
Let the inverse map be given by . Let denote the number of datapoints observed in with score equal to , and let be as defined above. Let , where the expectation is computed with respect to the dataset . Assume that and . Then, there exist some constants such that with a confidence at least ,
Proof.
We first bound the range in which the random variable can take values as a function of number of samples observed for each . All the steps follow with high probability, i.e. with probability greater than ,
(7) 
where is the exponentiated Rényi divergence between the two distributions and , i.e. . The first step follows by applying Hoeffding’s inequality on each inner term in the sum corresponding to and then bounding the variance due to importance sampling s, finally using concentration bounds on the variance of importance sampling from Lemma B.1.
Thus, the gradient can fluctuate in the entire range of values as defined above with high probability. Hence, with high probability, at least ,
(8) 
∎
The next step is to bound the bias in the gradient that arises due to training on a different distribution than the distribution of optimal s, . This can be written as follows:
(9) 
where is the total variation divergence between two distributions and , and L is a constant that depends on the maximum magnitude of the divergence measure . Combining Lemma B.2 and the above result, we prove Theorem 3.1.
Appendix C Argument for Active Data Collection via Randomized Labeling
In this section, we explain in more detail the randomized labeling algorithm described in Section 3.4. We first revisit Thompson sampling, then argue how our randomized labeling algorithm relates to it and highlight the differences, and finally prove a regret bound for this scheme under mild assumptions. Our proof follows commonly available proof strategies for Thompson sampling.
Notation
The TS algorithm queries the true function at locations and observes true function values at these points . The true function is one of many possible functions that can be defined over the space . Instead of representing the true objective function as a point object, it is common to represent a distribution over the true function . This is justified because, often, multiple parameter assignments , can give us the same overall function. We parameterize by a set of parameters .
The period regret over queries is given by the random variable
Since the selection of can be stochastic, we analyse the Bayes risk (Russo and Van Roy, 2016; Russo et al., 2018). We define the Bayes risk as the expected regret over randomness in choosing , observing , and over the prior distribution . This definition is consistent with Russo and Van Roy (2016).
Let be the policy with which Thompson sampling queries new datapoints. We do not make any assumptions on the stochasticity of , therefore, it can be a stochastic policy in general. However, we make 2 assumptions (A1, A2). The same assumptions have been made in Russo and Van Roy (2016).
A1: (Difference between max and min scores is bounded by 1) – If this is not true, we can scale the function values so that this becomes true.
A2: Effective size of is finite. (Footnote 1: By effective size we refer to the intrinsic dimensionality of . This does not necessarily imply that should be discrete. For example, under a linear approximation to the score function , i.e., if , this defines a polyhedron, but analyzing just the finite set of extremal points of the polyhedron suffices, thus making effectively finite.)
TS (Alg 3) queries the function value at based on the posterior probability that is optimal. More formally, the distribution that TS queries from can be written as: . When we use parameters to represent the function parameter, and thus this reduces to sampling an input that is optimal with respect to the current posterior at each iteration: .
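The posterior-sampling loop described above can be sketched on a toy problem. The example assumes a finite set of candidate score functions over a finite input set, a uniform prior, and Gaussian observation noise; these choices, the function name `thompson_sampling`, and the per-round regret definition (best true score minus the true score of the queried input) are illustrative assumptions rather than the setup analysed in this appendix.

```python
import numpy as np

def thompson_sampling(candidates, true_idx, num_rounds, noise_std=0.1, seed=0):
    """Toy Thompson sampling over a finite function class.

    candidates: array of shape (num_funcs, num_inputs); row k holds the scores
    of candidate function k at every input. Returns the per-round regret.
    """
    rng = np.random.default_rng(seed)
    num_funcs, _ = candidates.shape
    log_post = np.full(num_funcs, -np.log(num_funcs))    # uniform prior
    f_true = candidates[true_idx]
    regrets = []
    for _ in range(num_rounds):
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        k = rng.choice(num_funcs, p=post)                 # sample a function from the posterior
        x = int(np.argmax(candidates[k]))                 # query its optimal input
        y = f_true[x] + noise_std * rng.normal()          # noisy observation of the true score
        # Bayes update with a Gaussian likelihood for every candidate function.
        log_post += -0.5 * ((y - candidates[:, x]) / noise_std) ** 2
        regrets.append(f_true.max() - f_true[x])
    return np.array(regrets)

rng = np.random.default_rng(1)
candidates = rng.uniform(size=(20, 10))   # scores in [0, 1], consistent with A1
regrets = thompson_sampling(candidates, true_idx=3, num_rounds=200)
cumulative_regret = regrets.cumsum()
```

Because the scores lie in [0, 1], each per-round regret is bounded by 1, so the cumulative regret over the run is at most the number of rounds.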
MINs (Alg 2) train inverse maps , parameterized as , where . We call an inverse map optimal if it is uniformly optimal given , i.e. , where is controllable (as is usually the case in supervised learning, where errors can be controlled by cross-validation).
Now, we are ready to show that the regret incurred by the randomized labeling active data collection scheme is bounded by . Our proof follows the analysis of Thompson sampling presented in Russo and Van Roy (2016). We first define the information ratio and then use it to prove the regret bound.
Information Ratio
Russo and Van Roy (2016) related the expected regret of TS to its expected information gain i.e. the expected reduction in the entropy of the posterior distribution of . Information ratio captures this quantity, and is defined as:
where is the mutual information between two random variables and all expectations are defined to be conditioned on . If the information ratio is small, Thompson sampling can only incur large regret when it is expected to gain a lot of information about which is optimal. Russo and Van Roy (2016) then bounded the expected regret in terms of the maximum amount of information any algorithm could expect to acquire, which they observed is at most the entropy of the prior distribution of the optimal .
Lemma C.1 (Bayes-regret of vanilla TS (Russo and Van Roy, 2016)).
For any , if (i.e. information ratio is bounded above) a.s. for each ,
We refer the readers to the proof of Proposition 1 in Russo and Van Roy (2016). The proof presented in Russo and Van Roy (2016) does not rely specifically on the property that the query made by the Thompson sampling algorithm at each iteration is posterior optimal; rather, it suffices to have a bound on the maximum value of the information ratio at each iteration . Thus, if an algorithm chooses to query the true function at a datapoint such that these queries always contribute to learning more about the optimal function, i.e. appearing in the denominator of is always more than a threshold, then the information ratio is upper bounded, and that active data collection algorithm will have a sublinear asymptotic regret. We are interested in the case when the active data collection algorithm queries a datapoint at iteration , such that is the optimum for a function , where is a sample from the posterior distribution over , i.e. lies in the high confidence region of the posterior distribution over given the data seen so far. In this case, the mutual information between the optimal datapoint and the observed input-score pair is likely to be greater than 0. More formally,
(10) 
The randomized labeling scheme for active data collection in MINs performs this step. The algorithm samples a set of synthetically generated datapoints; for example, in our experiments, we add noise to the values of , and randomly pair them with unobserved or rarely observed values of . If the underlying true function is smooth, then there exist a finite number of points that are sufficient to uniquely describe this function . One measure to formally characterize this finite number of points that are needed to uniquely identify all functions in a function class is given by the Eluder dimension (Russo and Van Roy, 2013).
By augmenting synthetic datapoints and training the inverse map on this data, the MIN algorithm ensures that the inverse map is implicitly trained to be an accurate inverse for the unique function that is consistent with the set of points in the dataset and the augmented set . Which sets of functions can this scheme represent? The functions should be consistent with the data seen so far , and can take randomly distributed values outside of the seen datapoints. This can roughly be argued to be a sample from the posterior over functions, which Thompson sampling would have maintained given identical history .
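The augmentation step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the experiments: the function name, the Gaussian input noise, and the choice to draw synthetic scores uniformly from a band just above the largest observed score (one way to obtain "unobserved or rarely observed" values) are all hypothetical.

```python
import numpy as np

def randomized_labeling(xs, ys, num_synthetic, noise_std=0.1, y_margin=1.0, seed=0):
    """Augment (xs, ys) with noisy copies of observed inputs paired with unseen scores."""
    rng = np.random.default_rng(seed)
    # Pick observed inputs at random and perturb them with Gaussian noise.
    idx = rng.integers(0, len(xs), size=num_synthetic)
    x_syn = xs[idx] + noise_std * rng.normal(size=xs[idx].shape)
    # Pair them with scores above the best score seen so far (assumed labeling rule).
    y_syn = rng.uniform(ys.max(), ys.max() + y_margin, size=num_synthetic)
    return np.concatenate([xs, x_syn]), np.concatenate([ys, y_syn])

rng = np.random.default_rng(0)
xs = rng.uniform(size=(50, 3))     # toy inputs
ys = xs.sum(axis=1)                # toy scores
aug_x, aug_y = randomized_labeling(xs, ys, num_synthetic=10)
```

An inverse map trained on the augmented pairs is then queried at a high score to propose the next input to evaluate with the true function.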
Lemma C.2 (Boundederror training of the posterioroptimal preserves asymptotic Bayesregret).
, let be any input such that . If MIN chooses to query the true function at and if the sequence satisfies , then, the regret from querying this optimal which is denoted in general as the policy is given by