1 Introduction
We design objects with the expectation that they will exhibit a certain behavior. In fluid dynamics optimization, we want an airplane wing to experience low drag forces, but also to have a particular lift profile, depending on angle of attack and air speed. We want to understand how the design of our public transportation hubs, which deal with large influxes of travelers, can cause congestion at maximal flow rates. We want our buildings to cause as little wind nuisance as possible and to understand how their shape and the wind turbulence they cause are linked. In all these cases, it is not easy to design without prior experience, and we often require long iterative design processes or trial-and-error methods.
What if we could quickly understand the possible types of behavior in expensive engineering problems and get an early intuition about how shape and behavior are related? In this work, we try to answer these questions, and in particular, whether we can discover different high-performing behaviors of shapes, designing air flows simultaneously with the shapes that cause them. An overview of related work is given in Section 2, where we explain quality diversity (QD) algorithms and the use of surrogate assistance. In Section 3 we introduce a new QD algorithm that performs surrogate-assisted phenotypic niching. Two problem domains are used (Section 4): an inexpensive domain that optimizes the symmetry of polygons, allowing us to perform an in-depth evaluation of various QD methods, and an expensive air flow domain (Section 5).
2 Quality Diversity
QD algorithms combine performance-based search with "blind" novelty search, which searches for novel solutions without taking performance into account [14]. QD finds a diverse set of high-performing solutions [3, 15] by only allowing solutions to compete in local niches. Niches are based on features that describe phenotypic aspects, like shape, structure or behavior. QD keeps track of an archive of niches; a solution is added if its phenotype fills an empty niche or its quality is higher than that of the solution previously placed inside.
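The niche-competition rule described above can be sketched in a few lines. This is a hypothetical minimal implementation: the bin count, the tuple-keyed archive and the assumption of features normalized to [0, 1] are illustrative choices, not taken from the paper.

```python
def insert(archive, solution, performance, features, bins=16):
    """Insert a solution into a niche archive keyed by discretized features.

    `archive` maps a niche index (tuple of bin indices) to a
    (solution, performance) pair. Features are assumed normalized to [0, 1].
    """
    niche = tuple(min(int(f * bins), bins - 1) for f in features)
    incumbent = archive.get(niche)
    # A solution enters if its niche is empty or it beats the incumbent.
    if incumbent is None or performance > incumbent[1]:
        archive[niche] = (solution, performance)
    return archive

archive = {}
insert(archive, "a", 0.5, (0.2, 0.9))
insert(archive, "b", 0.7, (0.2, 0.9))   # same niche, better: replaces "a"
insert(archive, "c", 0.1, (0.8, 0.1))   # new niche: kept despite low quality
```

Note that "c" survives despite its low performance: competition is purely local, which is what preserves diversity.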
QD became applicable to expensive optimization problems after the introduction of surrogate-assisted illumination (SAIL) [7]. In this Bayesian interpretation of QD, a multi-dimensional archive of phenotypic elites (MAP-Elites) [3] is created based on upper confidence bound (UCB) [1] sampling, which takes an optimistic view of surrogate-assisted optimization. A Gaussian process (GP) regression model [19] predicts the performance of new solutions based on their distance to previous examples, which is modeled using a covariance function. A commonly used covariance function is the squared exponential, which has two hyperparameters: the length scale (or sphere of influence) and the signal variance, which are found by minimizing the negative log-likelihood of the process. For any location x, the GP model predicts a mean value μ(x) and a variance σ²(x) of the prediction. σ(x) is added to μ(x) with the idea that a location where the model has low confidence also holds the promise of a better performing solution: UCB(x) = μ(x) + κ·σ(x). The parameter κ allows us to tune the UCB function between exploitation (κ = 0) and exploration (large κ).

In SAIL, after MAP-Elites fills the acquisition map, which contains "optimistic" solution candidates, a random selection of those candidates is analyzed in the expensive evaluation function to form additional training samples for the GP model. This loop continues until the evaluation budget is exhausted. Then κ is set to 0 in a final MAP-Elites run to create a feature map that now contains a diverse set of solutions predicted to be high-performing. SAIL needs a budget orders of magnitude smaller than MAP-Elites because it can exploit the surrogate model without "wasting" samples. SAIL, however, is constrained to features that are cheap to calculate, like shape features [7] that can be determined without running the expensive evaluation.
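The exploitation/exploration trade-off of the UCB can be illustrated with a small sketch; the two candidate (mean, standard deviation) pairs are invented for illustration.

```python
def ucb(mean, std, kappa):
    """Optimistic value: predicted mean plus kappa times predicted std."""
    return mean + kappa * std

# Two hypothetical candidates: a well-known good one (no uncertainty) and
# an uncertain one that looks slightly worse on the mean alone.
candidates = [(0.5, 0.0), (0.4, 0.2)]

def best_candidate(kappa):
    return max(range(len(candidates)),
               key=lambda i: ucb(candidates[i][0], candidates[i][1], kappa))
```

With κ = 0 the known candidate wins (pure exploitation); with κ = 1 the uncertain one becomes more attractive (exploration).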
With SAIL it became possible to apply QD to the performance functions of expensive optimization domains. But the strength of QD, niching based on behavior, cannot be exploited when determining those behaviors is expensive. In this work we evaluate whether we can include surrogate models for such features as well.
3 Surrogate-assisted Phenotypic Niching
To be able to handle expensive features, we introduce surrogate-assisted phenotypic niching (SPHEN) (Fig. 1 and Alg. 1). Building on the insight that replacing the performance function with a surrogate model decreases the necessary evaluation budget, we replace the exact feature descriptors with surrogate models as well.
The initial training sample set, used to form the first seeds of the acquisition map, is produced by a space-filling, quasi-random, low-discrepancy Sobol sequence [21] in the design space (Fig. 1a). Due to the lack of prior knowledge in black-box optimization, using space-filling sequences has become a standard method to ensure good coverage of the search domain. The initial set is evaluated, for example in a computational fluid dynamics simulator. Performance and phenotypic features of those samples are derived from the results or, in the case of simpler non-behavioral features, from the solutions' expression or shape itself. The key issue here is to check the range of the initial set's features. Since we do not know what part of the phenotypic feature space will be discovered in the process, the initial set's feature coordinates only give us a hint of the reachable feature space. Just because we used a space-filling sampling technique in the design space does not mean the samples are space-filling in feature space.
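The idea of a space-filling initial set can be illustrated as follows. As a dependency-free stand-in for a Sobol generator, this sketch uses a Halton sequence, which shares the low-discrepancy property; the function names and the toy dimensionality are our own, not from the paper.

```python
def van_der_corput(n, base=2):
    """n-th element of the van der Corput low-discrepancy sequence in [0, 1)."""
    q, denom = 0.0, 1.0
    while n > 0:
        n, r = divmod(n, base)
        denom *= base
        q += r / denom
    return q

def halton(n_points, dims, primes=(2, 3, 5, 7, 11, 13)):
    """Space-filling Halton points: one van der Corput sequence per dimension,
    each with a different prime base."""
    return [[van_der_corput(i + 1, primes[d]) for d in range(dims)]
            for i in range(n_points)]

initial_set = halton(16, 2)   # 16 well-spread seed designs in [0, 1)^2
```

Unlike uniform random sampling, consecutive points deliberately avoid each other, which is why such sequences cover the design space evenly even for small sample counts.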
After collecting performance and feature values, the surrogate models are trained (Fig. 1b). We use GP models, which limit the number of samples to around 1,000, as training and prediction become quite expensive beyond that. A squared exponential covariance function is used and the (constant) mean function is set to the training samples' mean value. The covariance function's hyperparameters, length scale and signal variance, are found using the GP toolbox GPML's [18] conjugate gradient based minimization method for 1,000 iterations.
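A GP prediction with a squared exponential covariance can be sketched as follows. This is a simplified zero-mean version with fixed hyperparameters (the paper sets the mean function to the samples' mean and fits hyperparameters with GPML); all names are illustrative.

```python
import numpy as np

def sq_exp_kernel(X1, X2, length_scale, signal_var):
    """Squared-exponential covariance: similarity decays with distance."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_predict(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0,
               noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at the test locations."""
    K = sq_exp_kernel(X_train, X_train, length_scale, signal_var)
    K += noise * np.eye(len(X_train))
    Ks = sq_exp_kernel(X_test, X_train, length_scale, signal_var)
    Kss = sq_exp_kernel(X_test, X_test, length_scale, signal_var)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha
    var = np.diag(Kss - Ks @ np.linalg.solve(K, Ks.T))
    return mean, var

X_train = np.array([[0.0], [1.0]])
y_train = np.array([1.0, 2.0])
mean, var = gp_predict(X_train, y_train, np.array([[0.0], [10.0]]))
```

Near a training point the prediction reproduces the observed value with near-zero variance; far away the mean falls back to the prior and the variance rises to the signal variance, which is exactly the uncertainty the UCB exploits.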
MAP-Elites then creates the acquisition map by optimizing the UCB of the performance model (with a large exploration factor κ), using the feature models to assign niches to the samples and new solutions (Fig. 1c). Notably, we do not take into account the confidence of those feature models. Surrogate-assisted QD works because, although the search takes place in a high-dimensional space, QD only has to find the elite hypervolume [23], or prototypes [9]: the small volumes consisting of high-performing solutions. Only the performance function can guide the search towards the hypervolume. Taking into account the feature models' confidence intervals would add unnecessary complexity to the modeling problem. SPHEN's goal is to only have to predict features for high-performing solutions, so we let feature learning "piggyback" on this search. We use a Sobol sequence on the bins of the acquisition map to select new solutions (Fig. 1d) that are then evaluated to continue training the surrogate models. This process iterates as long as the evaluation budget is not depleted. Finally, MAP-Elites is used to create a prediction map, ignoring the models' confidence (Fig. 1e), which is filled with diverse, high-performing solutions.
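The overall SPHEN loop can be sketched as follows. This is a heavily simplified toy: a nearest-neighbour lookup stands in for the GP surrogates, the UCB term is omitted, and random sampling replaces the Sobol sequence and the mutation operator, so it only illustrates the control flow of steps (a) to (d).

```python
import random

def sphen_loop(evaluate, sample_design, n_init=8, budget=32, batch=4):
    # (a) initial design set (random stand-in for a Sobol sequence)
    X = [sample_design() for _ in range(n_init)]
    data = [evaluate(x) for x in X]          # expensive: (performance, feature)

    def predict(x):
        # (b) surrogate stand-in: copy values of the nearest known sample.
        i = min(range(len(X)), key=lambda j: abs(X[j] - x))
        return data[i]

    spent = n_init
    while spent < budget:
        # (c) fill an acquisition map with surrogate-evaluated candidates
        archive = {}
        for x in [sample_design() for _ in range(64)]:
            perf, feat = predict(x)
            niche = min(int(feat * 10), 9)
            if niche not in archive or perf > archive[niche][1]:
                archive[niche] = (x, perf)
        # (d) select solutions from the map and evaluate them exactly,
        # growing the surrogate's training set
        picked = [sol for sol, _ in archive.values()][:batch]
        X += picked
        data += [evaluate(x) for x in picked]
        spent += len(picked)
    return X, data

random.seed(1)
designs, results = sphen_loop(lambda x: (1 - abs(x - 0.5), x), random.random)
```

The essential point the sketch preserves is that the expensive `evaluate` is only called on the initial set and on solutions selected from the acquisition map, never during the inner map-filling search.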
4 Domains
Phenotypic features describe phenomena that can be related to complex domains, like behavioral robotics, mechanical systems, or computational fluid dynamics (CFD). Before we apply SPHEN to an expensive CFD domain, we compare its performance to MAP-Elites and SAIL in a simpler, inexpensive domain.
4.1 Polygons
To be able to calculate all performance and feature values exactly, we optimize free-form deformed, eight-sided polygons. The polygons are encoded by 16 parameters controlling the polar coordinate deviation of the control points (Fig. 2a). The first half of the genome determines the corner points' radial deviation; the second half determines their angular deviation. The phenotypic features are the area of the polygon and its circumference (Fig. 2b). These values are normalized to [0, 1] using predetermined ranges. The performance function (Fig. 2c) is defined as the point symmetry P. The polygon is sampled at equidistant locations on its circumference, after which the symmetry metric is calculated (Eq. 1), based on the symmetry error E, the sum of Euclidean distances of all opposing sampling locations to the center:
P = 1 / (1 + E)    (1)
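Assuming the sampling locations are taken at the control points and the center is the centroid (our reading of the metric, as an illustration), the decoding and the symmetry error can be sketched as follows; the unit base radius is an illustrative assumption.

```python
import math

def decode(genome):
    """Map a 16-parameter genome to polygon vertices: the first half gives
    radial deviations, the second half angular deviations, around a unit
    octagon (base radius is an illustrative assumption)."""
    n = len(genome) // 2
    pts = []
    for k in range(n):
        r = 1.0 + genome[k]                            # radial deviation
        theta = 2 * math.pi * k / n + genome[n + k]    # angular deviation
        pts.append((r * math.cos(theta), r * math.sin(theta)))
    return pts

def point_symmetry_error(points):
    """Symmetry error E: distances between opposing sample points mirrored
    through the centroid. Zero for perfect point symmetry."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    err = 0.0
    for i in range(n // 2):
        ax, ay = points[i]
        bx, by = points[i + n // 2]
        # Perfect point symmetry: a + b == 2 * center, so this vanishes.
        err += math.hypot(ax + bx - 2 * cx, ay + by - 2 * cy)
    return err

octagon = decode([0.0] * 16)   # undeformed genome: regular, symmetric octagon
```

A regular octagon yields an error of (numerically) zero, while deforming a single radius breaks the point symmetry and produces a positive error.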
4.2 Air Flow
The air flow domain is inspired by the problem of wind nuisance in the built environment. Wind nuisance is defined in building norms [10, 16] using a wind amplification factor measured in standardized environments, with respect to the hourly mean wind speed. In a simplified 2D setup, we translate this problem to that of minimizing the maximum air flow speed for a fixed inflow speed. The performance is determined as the inverse of the normalized maximum velocity. However, we only need to keep the velocity within a fixed nuisance threshold.
The encoding from the polygon domain is used to produce 2D shapes that are then placed into a CFD simulation. To put emphasis on the architectural nature of the domain, we use two features: area and air flow turbulence. The chaotic behavior of turbulence provokes oscillations around a mean flow velocity, which influences the maximum flow velocity. Neither feature is an optimization goal. Rather, we want to analyze, under the condition of keeping the flow velocity low, how the size of the area and the turbulence are related to each other. We want to produce polygons that are combinations of their appearance (small to large) and their effect on the flow (low to high turbulence). Concretely, at the lowest and highest values of area and turbulence, regular, intuitive shapes should be generated by the algorithm, such as slim arrow-like shapes for low turbulence and area, or regular polygons for high turbulence and area. However, for area/turbulence combinations in between, the design of the shape is not unique and will possibly differ from intuition.
4.2.1 Lattice Boltzmann Method
The Lattice Boltzmann method (LBM) is an established tool for the simulation of CFD [13]. Instead of directly solving the Navier-Stokes equations, the method operates a stream-and-collide algorithm of particle distributions derived from the Boltzmann equation. In this contribution, LBM is used on a 2D grid with the usual lattice of nine discrete particle velocities. At the inlets and outlets, the distribution function values are set to equilibrium according to the flow velocity. The full bounce-back boundary condition is used at the solid grid points corresponding to the polygon. Although there are more sophisticated approaches for the boundaries, this configuration is stable throughout all simulations. In addition, the bounce-back boundary condition is flexible, as the boundary algorithm is purely local with respect to the grid points. As an extension of the Bhatnagar-Gross-Krook (BGK) collision model [2], a Smagorinsky subgrid model [6] is used to account for the under-resolved flow in the present configuration. For a more detailed description of the underlying mechanisms, we refer to [13]. Note that the results of the 2D domain do not entirely coincide with results that would be found in 3D, due to the difference in turbulent energy transport [22].
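A minimal D2Q9 BGK collide-and-stream step can be sketched on a periodic toy domain; this sketch omits the bounce-back obstacle, the inlet/outlet conditions and the Smagorinsky model used in the actual simulations, and only illustrates the core update.

```python
import numpy as np

# D2Q9 lattice: nine discrete particle velocities and their weights.
C = np.array([(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)])
W = np.array([4 / 9] + [1 / 9] * 4 + [1 / 36] * 4)

def equilibrium(rho, ux, uy):
    """Second-order expansion of the Maxwell-Boltzmann equilibrium."""
    cu = C[:, 0, None, None] * ux + C[:, 1, None, None] * uy
    usq = ux ** 2 + uy ** 2
    return W[:, None, None] * rho * (1 + 3 * cu + 4.5 * cu ** 2 - 1.5 * usq)

def bgk_step(f, tau):
    """One LBM step: relax toward equilibrium (collide), then shift each
    population along its lattice velocity (stream, periodic here)."""
    rho = f.sum(axis=0)
    ux = (f * C[:, 0, None, None]).sum(axis=0) / rho
    uy = (f * C[:, 1, None, None]).sum(axis=0) / rho
    f = f + (equilibrium(rho, ux, uy) - f) / tau        # collide
    for i, (cx, cy) in enumerate(C):
        f[i] = np.roll(np.roll(f[i], cx, axis=0), cy, axis=1)  # stream
    return f

# A fluid at rest on an 8x8 periodic grid stays at rest, conserving mass.
f = equilibrium(np.ones((8, 8)), np.zeros((8, 8)), np.zeros((8, 8)))
mass0 = f.sum()
for _ in range(5):
    f = bgk_step(f, tau=0.6)
```

The locality of the update (each grid point only touches its neighbors) is what makes boundary conditions like bounce-back easy to add and the method attractive for GPU execution.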
The simulation domain consists of a regular grid of points, into which a bitmap representation of the polygon is placed. As the Lattice Boltzmann method is a solver of weakly compressible flows, it is necessary to specify a Mach number (0.075), a compromise between computation time and accuracy. The Reynolds number is defined with respect to the largest possible extent of the polygon. For the actual computation, the software package Lettuce [11] is used, which is based on the PyTorch framework [17], allowing easy access to GPU functionality. The fluid dynamics experiment was run on a cluster with four GPU nodes, each simulation taking ten minutes. Fig. 3 shows the air flow around a circular polygon at four different, consecutive time steps. Brighter colors represent higher magnitudes of air flow velocity. Throughout the 100,000 time steps of the simulation, maximum velocity and enstrophy are measured. The enstrophy, a measure for the turbulent energy dissipation in the system with respect to the resolved flow quantities [8, 12], increases as turbulence intensity increases in the regarded volume.

4.2.2 Validation and Prediction of Flow Features
The maximum velocity and enstrophy are measured every 50 steps, and we employ a running average over the last 50,000 time steps. To test whether this indeed converges to a stable value, we run simulations with 18 different shapes (nine varied-size circles and nine deformed star shapes) and calculate the moving average of the enstrophy values, which is plotted in Fig. 4. The value converges to the final feature value (red).
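The trailing running average used here can be sketched as follows; the window size and the oscillating toy signal are illustrative.

```python
def running_average(values, window):
    """Mean over the trailing `window` samples (shorter at the start)."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        chunk = values[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# An oscillating stand-in for the enstrophy trace: the raw signal never
# settles, but its running average converges to the underlying mean.
trace = [5 + (-1) ** i for i in range(100)]
smoothed = running_average(trace, 50)
```

This is exactly why a trailing average is needed for turbulent quantities: the instantaneous value keeps oscillating, while the averaged value stabilizes into a usable feature.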
To validate the two measures, we simulate two small shape sets of circles and stars. Increasing the radius in the circle set should lead to a higher maximum velocity and enstrophy, as more air is displaced by the larger shapes. The star set is expected to show larger values for the more irregular shapes. This is confirmed in Fig. 5.
Next, we investigate whether we can predict the simple shapes' flow feature values correctly. Although GP models are often called "parameter-free", this is not entirely accurate. The initial guess for the hyperparameters' values, made before minimization of the negative log-likelihood of the model takes place, can have large effects on the accuracy of the model, as the log-likelihood landscape can contain local optima. We therefore perform a grid search on the initial guesses for length scale and signal variance. Using leave-one-out cross-validation, GP models are trained on all but one shape, after which we measure the accuracy using the mean absolute percentage error (MAPE), giving a good idea of the magnitude of the prediction error. The process is repeated until each example has been part of the test set once. The MAPE on the maximum velocity was low for both sets. The enstrophy was harder to model, but the errors were still small enough to give us confidence that these two small hypervolumes can be modeled.
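The leave-one-out MAPE procedure can be sketched generically; the `fit_predict` callback stands in for GP training and prediction, and the toy data are illustrative.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t)
                       for t, p in zip(y_true, y_pred)) / len(y_true)

def loo_cv_mape(X, y, fit_predict):
    """Leave-one-out CV: each sample is the test set exactly once."""
    preds = []
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        preds.append(fit_predict(X_tr, y_tr, X[i]))
    return mape(y, preds)

# Toy run with a trivial mean predictor standing in for the GP model.
err = loo_cv_mape([0, 1, 2], [1.0, 2.0, 3.0],
                  lambda X_tr, y_tr, x: sum(y_tr) / len(y_tr))
```

Because MAPE is relative, it reads directly as "the prediction is off by N percent on average", which is what makes it useful for judging whether the small hypervolumes are modelable.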
5 Evaluation
We evaluate how well SPHEN performs in comparison to SAIL and MAP-Elites when we include the cost of calculating the features, how accurate the feature models are when trained with a performance-based acquisition function, and whether we can apply SPHEN to an expensive domain.
5.1 Quality Diversity Comparison
Table 1. Parameterization of MAP-Elites, SAIL and SPHEN.

Parameter                 MAP-Elites  SAIL^A      SAIL^B      SPHEN
Generations               4,096       1,024       63          1,024
Descendants               16          32          16          32
Budget (per iteration)    -           1,024 (16)  1,024 (16)  1,024 (16)
Resolution (acquisition)  -           16x16       16x16       32x32

^A Due to the number of feature evaluations, MAP-Elites uses 65,536 and SAIL uses 2,097,152 evaluations in total.
^B Here, SAIL is restricted to the 65,536 evaluations used by MAP-Elites, resulting in 63 generations.
We run QD optimization without (MAP-Elites) and with surrogate models (SAIL, SPHEN) on the polygon domain (Section 4.1). This allows us to check all ground truth performance and feature values in a feasible amount of time. The shape features should be easier to learn than the flow features of the air flow domain. The working hypothesis is that SPHEN performs somewhere between SAIL and MAP-Elites, as it has the advantage of using a surrogate model but also has to learn two phenotypic features. But in the end, since the ultimate goal is to be able to use QD on expensive features, SPHEN will be our only choice. The parameterization of all algorithms is listed in Table 1. The initial sample set of 16 examples, as well as the selection of new samples (16 in every iteration), is handled by a quasi-random Sobol sequence. The mutation operator adds a value drawn from a Gaussian distribution.

Due to the expected inaccuracy of the feature models, misclassifications will decrease the accuracy of the maps. Fig. 6 shows a prediction map at a resolution of 32x32 and the true performance and feature map. Holes appear due to misclassifications, which is why we train SPHEN on a higher-resolution map and then reduce the prediction map to a resolution of 16x16. Most bins are then filled. In this experiment all prediction maps have a resolution of 16x16 solutions.
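The map reduction can be sketched as follows, assuming (as a plausible reading) that each 2x2 block of fine bins is merged by keeping its best-performing occupant, which fills holes left by feature-model misclassifications.

```python
def reduce_map(archive, factor=2):
    """Coarsen a niche map: merge factor x factor blocks of bins, keeping
    the best-performing solution in each block."""
    coarse = {}
    for (i, j), (sol, perf) in archive.items():
        key = (i // factor, j // factor)
        if key not in coarse or perf > coarse[key][1]:
            coarse[key] = (sol, perf)
    return coarse

# Toy 4x4 -> 2x2 reduction: bins (0,0) and (1,1) fall into the same
# coarse bin, and only the better solution survives.
fine = {(0, 0): ("a", 0.1), (1, 1): ("b", 0.9), (2, 2): ("c", 0.5)}
coarse = reduce_map(fine)
```

A coarse bin is empty only if all of its fine bins were empty, so a sparsely misclassified 32x32 map yields a densely filled 16x16 map.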
The mean number of filled map bins and mean performance values for five replicates are shown in Fig. 7. SAIL and SPHEN find about the same number of solutions using the same number of performance evaluations (PE). Notably, the mean performance of SPHEN's solutions is higher than that of SAIL's. However, in domains with expensive feature evaluations we need to count performance or feature evaluations (PFE). SAIL then needs more than two million PFE to perform almost as well as SPHEN, which only needs 1,024: over three orders of magnitude fewer, and still more than an order of magnitude fewer than MAP-Elites. Since in expensive real-world optimization problems we cannot expect to run more than about 1,000 function evaluations, due to the infeasibly large computational investment, the efficiency gain of SPHEN is substantial. If we lower the number of PFE of SAIL to the same budget as MAP-Elites and give it more time to search the iteratively improving surrogate model before running out of the budget of 65,536 PFE (see Table 1), SAIL still takes a big hit, not being able to balance quality and diversity. The example prediction maps are labeled with the number of PFE necessary to achieve them. Although we do not sample new training examples to improve the feature models specifically, their root mean square error (RMSE) still ended up low for both features. Finally, we compare SPHEN to the three alternative configurations on the null hypothesis that they need the same number of PFE to reach an equally filled map or equal performance. Significance levels, calculated using a two-sample t-test, are shown in Fig. 7. In all cases, the null hypothesis is improbable, although for the comparison of filled levels to SAIL it is rejected with less certainty. We conclude that we do not need to adjust the acquisition function: SPHEN and SAIL search for the same elite hypervolume, which is determined only by the performance function.
5.2 Designing Air Flow
After showing that SPHEN can learn both performance and feature models, we now run SPHEN in the air flow domain (Section 4.2). The objective is to find a diverse set of air flows using one behavioral feature, turbulence, and one shape feature, the surface area of the polygon. We want to find out how the size of the area and the turbulence are related to each other and which shapes do not pass the wind nuisance threshold. We use the same parameters for SPHEN as listed in Table 1, but allow 4,096 generations in the prediction phase. The enstrophy and velocity are normalized to [0, 1] using predetermined value ranges.
The resulting map of solutions in Fig. 8 shows that turbulence and surface area tend to increase the mean maximum air flow velocity, as expected. A small selection of air flows is shown in detail. The RMSE of the models is 0.06, 0.01 and 0.10, respectively, and Kendall's tau rank correlation to the ground truth amounts to 0.78, 1.00 and 0.73 (1.00 for A, B, C and D).
Due to the chaotic evolution of turbulent and transient flows, a static snapshot of the velocity field provides only limited information about the flow structures. Therefore, dynamic mode decomposition (DMD) is used to extract and visualize coherent structures and patterns over time from the flow field [4, 20].
Especially the shapes at the extrema of area and turbulence align with aerodynamic expectations, as detailed in Section 4.2. At low turbulence intensity, the shapes tend to be slim and long with respect to the flow direction (shapes A and B). High turbulence levels at small shape areas are achieved when the shapes are oriented perpendicularly to the flow (shape E). Pentagons or hexagons evoke high turbulence levels at large areas (shapes H and I). However, impressively, there is an enormous variety of nuances in between these extrema, with non-intuitive shapes, enabling the designer to determine a shape for given flow parameters down to a lower turbulence bound for each area value. Furthermore, the algorithm also suggests known tricks to manipulate the flow. Side arms are an appropriate measure to vary the turbulence intensity in the wake (shapes C, D, E, and G). Indentations or curved shapes redirect the flow and extract kinetic energy similar to turbine blades [5], which can be observed in shape D. In conclusion, for the highest and lowest area and turbulence values SPHEN matches expectations, while for the shapes in between it exceeds them by introducing unusual shape nuances, which encourage further investigation.
5.3 Discussion
In the polygon domain, both surrogate-assisted algorithms are able to find a large variety of solutions. When features do not have to be modeled, they show similar performance, although SAIL converges much sooner. However, when taking into account the number of feature evaluations, SPHEN clearly outperforms both SAIL and MAP-Elites. Modeling features does not lower the performance of a prediction map. In terms of solution performance, both surrogate-assisted algorithms are outperformed by MAP-Elites in this simple domain, but SPHEN clearly beats MAP-Elites in efficiency by requiring far fewer evaluations. The feature models become more accurate even though we sample only to improve the performance model.
When designing diverse air flows, one SPHEN run took 23 hours, producing 494 different air flow profiles. With SAIL, obtaining the same result would have taken over five years. Although MAP-Elites outperformed SAIL in the simple polygon domain, and might have outperformed it in the air flow domain as well, it still would have taken two months to compute, with an uncertain result. Fig. 8 shows that we can find structure in the air flows that can appear in this problem domain. We can efficiently combine variations of the object we want to design (area) with their effect on the environment (turbulence). Even when using only two phenotypic features, the nuances between the variations give us an idea which shapes pass the wind nuisance threshold and which do not, and we could continue the design process based on this new intuition.
6 Conclusion
In this work we showed that expensive phenotypic features can be learned along with an expensive performance function, allowing SPHEN, an evolutionary QD algorithm, to find a large diversity of air flows. In an inexpensive domain we showed that, when we take into account the number of feature evaluations, SPHEN clearly outperforms state-of-the-art algorithms like MAP-Elites and SAIL. This result clears the way for QD to find diverse phenotypes as well as behaviors in engineering domains without the need for an infeasible number of expensive simulations. This is made possible because only the elite hypervolume needs to be modeled. Fluid dynamics domains count among the most complicated. Although often solved in ingenious ways by engineers relying on experience, QD can add automated intuition to the design process. Variations of the object we want to optimize as well as variations in the effects on the object's environment can be seen "at a glance", which is what intuition is all about.
The most urgent future work is to study whether we can make adjustments to the acquisition function, taking into account the feature models' confidence intervals to improve SPHEN. Furthermore, solution diversity should be analyzed in higher-dimensional feature spaces and applied to 3D shapes.
We showed which expected and unexpected behavioral patterns can emerge in complicated problem domains using surrogate-assisted phenotypic niching. Our main contribution, the automatic discovery of a broad intuition of the interaction between shape and behavior, allows engineers to think more out-of-the-box.
Acknowledgments.
This work was funded by the Ministry for Culture and Science of the state of North Rhine-Westphalia (grant agreement no. 13FH156IN6) and the German Research Foundation (DFG) project FO 674/171. The authors thank Andreas Krämer for the discussions about the Lettuce solver.
References

[1] (2002) Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3 (Nov), pp. 397–422.
[2] (1954) A model for collision processes in gases. I. Small amplitude processes in charged and neutral one-component systems. Physical Review 94 (3), pp. 511.
[3] (2015) Robots that can adapt like animals. Nature 521 (7553), pp. 503–507.
[4] (2018) PyDMD: Python dynamic mode decomposition. Journal of Open Source Software 3 (22), pp. 530.
[5] (2017) Transitional flows with the entropic lattice Boltzmann method. Journal of Fluid Mechanics 824, pp. 388–412.
[6] (2018) Application of a lattice Boltzmann method combined with a Smagorinsky turbulence model to spatially resolved heat flux inside a refrigerated vehicle. Computers & Mathematics with Applications 76 (10), pp. 2315–2329.
[7] (2017) Data-efficient exploration, optimization, and modeling of diverse designs through surrogate-assisted illumination.
[8] (2013) On the accuracy of high-order discretizations for underresolved turbulence simulations. Theoretical and Computational Fluid Dynamics 27 (3-4), pp. 221–237.
[9] (2018) Prototype discovery using quality-diversity. In International Conference on Parallel Problem Solving from Nature, pp. 500–511.
[10] (2013) Pedestrian wind comfort around buildings: comparison of wind comfort criteria based on whole-flow field data for a complex case study. Building and Environment 59, pp. 547–562.
[11] Lettuce: PyTorch-based Lattice Boltzmann solver.
[12] (2019) Pseudo-entropic derivation of the regularized lattice Boltzmann method. Physical Review E 100 (2), pp. 1–16.
[13] (2017) The Lattice Boltzmann Method: Principles and Practice.
[14] (2011) Abandoning objectives: evolution through the search for novelty alone. Evolutionary Computation 19 (2), pp. 189–223.
[15] (2011) Evolving a diversity of virtual creatures through novelty search and local competition. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, pp. 211–218.
[16] (2006) Wind comfort and wind danger in the built environment (in Dutch). Norm NEN 8100.
[17] (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035.
[18] (2010) Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research 11, pp. 3011–3015.
[19] (1997) Evaluation of Gaussian processes and other methods for non-linear regression. Ph.D. thesis, University of Toronto.
[20] (2010) Dynamic mode decomposition of numerical and experimental data. Journal of Fluid Mechanics 656, pp. 5–28.
[21] (1967) On the distribution of points in a cube and the approximate evaluation of integrals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki 7 (4), pp. 784–802.
[22] (1978) Turbulent flow in two and three dimensions. Bulletin of the American Meteorological Society 59 (1), pp. 22–28.
[23] (2018) Discovering the elite hypervolume by leveraging interspecies correlation. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 149–156.