1 Introduction
In mechanics, optics, thermal transport, physical chemistry, climate models, and many other fields, data-driven surrogate models—such as polynomial fits, radial basis functions, or neural networks—are widely used as an efficient replacement for repetitive calls to slow numerical solvers
[baker2019workshop, benner2015survey, willard2020integrating, hoffmann2019machine, pestourie2018inverse]. However, the reuse benefit of surrogate models comes at a significant cost in training time: a costly high-fidelity numerical solver must be evaluated many times to provide an adequate training set, and this cost rapidly increases with the number of model parameters (the “curse of dimensionality”)
[boyd2007chebyshev]. In this paper, we explore one promising route to increasing training-data efficiency: incorporating some knowledge of the underlying physics into the surrogate by training a generative neural network (NN) “end-to-end” with an approximate physics model. We call this hybrid system a “physics-enhanced deep surrogate” (PEDS), and demonstrate multiple-order-of-magnitude improvements in sample and time complexity on a test problem involving optical metamaterials—composite materials whose properties are designed via microstructured geometries [pestourie2020active]. In inverse design of metamaterials, similar geometric components may be reused thousands or millions of times in a large structure such as an optical metasurface [pestourie2018inverse, pestourie2020assume], making surrogate models especially attractive to accelerate computational design [bayati2021inverse, li2021inverse]. In particular, we present a PEDS architecture for modeling transmission through a microstructured multilayer “metasurface” [pestourie2020active], where the high-fidelity model solves Maxwell’s equations, in which a deep NN is combined with a fast approximate Maxwell solver based on an extremely coarse discretization, as depicted in Fig. 1 (Sec. 2). By itself, the coarse model yields substantial error if it is applied directly to a downsampled/coarsified geometry, but it qualitatively captures key underlying physics of scattering and resonance. To obtain an accurate surrogate, we apply a deep NN to generate a coarse geometry, adaptively mixed with the downsampled input geometry, which is then used as input to the approximate solver and trained end-to-end to minimize the overall error. In this way, the NN learns to nonlinearly correct for the errors of the coarse model, while at the same time the coarse model “builds in” some knowledge of the physics and geometry. We compare the result of our PEDS model against a NN-only baseline model (Sec.
4) as well as a previous “space-mapping” (SM) [bakr2000neural, zhu2016novel, feng2019coarse] approach, in which we combine a coarse Maxwell solver with a NN transforming only a low-dimensional parameterization of the fine geometry to a similar low-dimensional parameterization of the coarse geometry (Sec. 2). We find that PEDS not only lowers the error of the surrogate for a given amount of data, but it actually seems to improve the asymptotic rate of learning (a larger power-law exponent), so that the benefits increase as the accuracy tolerance is lowered (Fig. 2 and Sec. 2). For 3.5% accuracy, PEDS requires several orders of magnitude less data than the competing approaches. We show through an ablation study that adding information from the downsampled structure increases the accuracy by 15% in a low-data regime. Furthermore, we find that PEDS gains significant additional benefits by combining it with active-learning techniques from our earlier work [pestourie2020active], and in fact the benefits of active learning seem to be even greater for PEDS than for competing approaches. Although the resulting PEDS surrogate is more expensive to evaluate than a NN by itself, due to the coarse Maxwell solver, it is still much faster than the high-fidelity Maxwell solver (by about two orders of magnitude in 2D and at least four in 3D; Sec. 2.3). Furthermore, since our NN generates a coarsified version of the geometry, this output can be further examined to gain insight into the fundamental physical processes affecting the output.
PEDS should not be confused with physics-informed neural networks (PINNs), which solve the full PDE (imposed pointwise throughout the domain) for the entire PDE solution (not a surrogate for a finite set of outputs like the complex transmission) [karniadakis2021physics, lu2021physics], and which do not employ any pre-existing solver; indeed, current PINNs tend to be slower than conventional “fine” PDE solvers (e.g. based on finite elements) [shin2020convergence], but offer potentially greater flexibility. Universal ordinary differential equations (UODEs)
[rackauckas2020universal] also tackle a different problem from PEDS: they identify unknown dynamics in an ODE by replacing the unknown terms with neural networks trained on data. In contrast to DeepONet [lu2021learning] and Fourier neural operators [li2020fourier], PEDS includes a numerical-solver layer. Finally, in contrast to error-correction techniques at the output level of the surrogate [lu2020extraction, koziel2006space], PEDS includes the solver in an end-to-end fashion during the training process.
2 Results
2.1 Physical model and solvers
Similarly to Ref. pestourie2020active, our surrogate model predicts the complex transmission of a 2D “meta-atom” unit cell with a parameterized geometry, which consists of ten layers of air holes with independent widths etched in a substrate (of dielectric constant corresponding to silica), with periodic boundary conditions in the transverse direction, outgoing radiation boundary conditions in the vertical direction, and an incoming normal-incident planewave from below. In units of the vacuum wavelength of the incident wave (for the largest wavelength considered below), the hole heights are 0.75 and the interstices 0.35; the fact that the structure is several wavelengths in diameter causes the transmission to be a complicated oscillatory function that makes the surrogate training challenging [pestourie2020active]. (A “metasurface” consists of a collection of many of these meta-atoms, designed to perform some optical function such as focusing [li2021inverse]. The full solution for a metasurface can be approximated in terms of the transmissions of the individual periodic unit cells via a local periodic approximation [pestourie2018inverse].) A schematic unit cell with 3 holes is shown in Fig. 1, and an example 10-hole structure from the training set is shown in Fig. 2 (right).
Both the “fine” (high-fidelity) and “coarse” (low-fidelity) solvers in this paper employ finite-difference frequency-domain (FDFD) discretizations of Maxwell’s equations
[champagne2001fdfd], using perfectly matched layers (PMLs) [sacks1995perfectly] to implement outgoing boundary conditions. FDFD essentially represents the geometry by a grid of discretized “pixels,” which is some function of the parameters (hole widths). (In particular, each pixel’s dielectric constant is assigned a subpixel average of the infinite-resolution structure, which both increases accuracy [oskooi2009accurate] and makes the discretized geometry piecewise differentiable in the parameters.) An FDFD resolution of 40 pixels per wavelength is used as our “fine” solver, the source of our training data, as indicated in Fig. 1 (inset). This resolution is typical for high-fidelity solvers in electromagnetism, because it is comparable to the manufacturing accuracy in nanophotonics and hence suffices for practical metalens design [li2021inverse, bayati2021inverse] within fabrication uncertainty. (Sharp/narrowband resonances can shift if one refines the resolution further, but the positions and bandwidths of the resonances are accurate to within a few percent.) Each fine-solver data point required seconds of computation (on a 3.5 GHz 6-Core Intel Xeon E5); an analogous simulation in 3D takes tens of minutes. Our PEDS surrogate (below) uses an FDFD solver at a coarser resolution of 10 pixels per wavelength, which is far faster in both 2D and 3D, but has much worse accuracy; as quantified in Sec. 2, it differs from the fine solver’s transmission by over 100% on our test set.
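As an illustration of the subpixel averaging described above, the following sketch (a hypothetical 1-D analogue in Python; the pixel count and dielectric values are placeholders, not the paper's settings) assigns each coarse pixel the area-weighted average of an air hole in a silica layer, so that the discretized geometry varies continuously with the hole width:

```python
import numpy as np

def subpixel_average(width, n_pixels=10, cell=1.0, eps_air=1.0, eps_silica=2.1):
    """Dielectric profile of one layer: a centered air hole of the given
    width in a silica background, averaged onto n_pixels coarse pixels.
    The output is piecewise linear (hence differentiable a.e.) in width."""
    dx = cell / n_pixels
    edges = np.linspace(0.0, cell, n_pixels + 1)
    lo, hi = 0.5 * (cell - width), 0.5 * (cell + width)
    # overlap of the hole [lo, hi] with each pixel [edges[i], edges[i+1]]
    overlap = np.clip(np.minimum(edges[1:], hi) - np.maximum(edges[:-1], lo),
                      0.0, dx)
    fill = overlap / dx  # fraction of each pixel occupied by air
    return fill * eps_air + (1.0 - fill) * eps_silica
```

Because each pixel interpolates smoothly between the two dielectric values, gradients with respect to the hole widths remain well defined, which is what lets the downsampling step (and the SM baseline of Sec. 2.2) backpropagate through the discretization.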
The upfront cost of building the training dataset is the most time-consuming part of developing a supervised surrogate model. By building some approximate “coarse” physics knowledge into the surrogate, we will show that PEDS greatly reduces the number of expensive simulations, especially when combined with active learning.
2.2 PEDS
The PEDS surrogate model is shown schematically in Fig. 1, and is computed in the following stages (“layers”):

Given the parameters of the geometry, a deep generative NN model yields a grid of pixels describing a coarse FDFD geometry. We call this NN the generator.

We also compute a coarse-grid downsampling (subpixel averaging) of the input geometry.

We make a weighted combination of the NN-generated and downsampled geometries, with a weight (independent of the input parameters) that is another learned parameter.

If there are any additional constraints/symmetries that the physical problem imposes on the geometry, they can be applied as projections. (For example, in our metasurface problem we could average the generated geometry with its mirror image to ensure that it is mirror-symmetric like the exact structure.)

Finally, given the combined coarse geometry, we evaluate the coarse solver to obtain the complex transmission.
In summary, the PEDS model is
(1) 
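The stages above can be sketched end-to-end as follows (a minimal Python/NumPy mock-up: the layer sizes, the mixing weight, and especially the stand-in `coarse_solver` are hypothetical placeholders, since the real model uses a coarse FDFD solve and a 1100-pixel geometry):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 11 inputs (widths + frequency flag), 25 coarse pixels.
W1, b1 = 0.1 * rng.normal(size=(32, 11)), np.zeros(32)
W2, b2 = 0.1 * rng.normal(size=(25, 32)), np.zeros(25)
w_mix = 0.5  # mixing weight (a trainable scalar in the real model)

def generator(p):
    """Stage 1: NN generator mapping parameters to a coarse-pixel grid."""
    h = np.maximum(W1 @ p + b1, 0.0)  # ReLU hidden layer
    return np.tanh(W2 @ h + b2)

def downsample(p):
    """Stage 2 stand-in: subpixel-averaged coarse grid of the input geometry."""
    return np.resize(p, 25)

def symmetrize(G):
    """Stage 4: projection enforcing mirror symmetry."""
    return 0.5 * (G + G[::-1])

def coarse_solver(G):
    """Stage 5 stand-in: any cheap differentiable map from pixels to output
    (the real layer is a coarse-FDFD solve returning a complex transmission)."""
    return np.tanh(G).mean()

def peds(p):
    G = w_mix * generator(p) + (1.0 - w_mix) * downsample(p)  # stage 3
    return coarse_solver(symmetrize(G))
```

Training then backpropagates from the output through `coarse_solver` (via the adjoint methods described below) into the generator weights and the mixing weight.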
A basic PEDS training strategy could simply minimize the mean-square error over a training set with respect to the parameters of the NN and the mixing weight. In our case, we employ a more complicated loss function that allows us to easily incorporate active-learning strategies
[pestourie2020active]. We optimize the Gaussian negative log-likelihood of a Bayesian model [lakshminarayanan2016simple]
(2)
where the Gaussian likelihood is parameterized by the model parameters, and the heteroskedastic “standard deviation” is the output of another NN (trained along with our surrogate model). In practice, rather than examining the entire training set at each training step, we follow the standard “batch” approach [goodfellow2016deep] of sampling a random subset of the training set and minimizing the expected loss with the Adam stochastic gradient-descent algorithm
[kingma2014adam] (via the Flux.jl [innes:2018] software in the Julia language). We train our model to predict the complex transmission for 3 frequencies, which are encoded with a one-hot encoding vector. The input of the model is the concatenation of the 10 widths and the one-hot encoding of the frequency. The coarse solver is a layer of the PEDS model, which is trained end-to-end, so we must backpropagate its gradient through the other layers to obtain the overall sensitivities of the loss function. This is accomplished efficiently using well-known “adjoint” methods [molesky2018inverse], which yield a vector-Jacobian product that is then automatically composed with the other layers using automatic differentiation (AD) (via the Zygote.jl [innes2018don] software).
Static and dynamic training
In this paper, we investigated two types of supervised end-to-end training: a static training, which takes a training set sampled at random, and a dynamic Bayesian training, where the training set is iteratively expanded using an active-learning algorithm [pestourie2020active]. Essentially, active learning attempts to sample training points where the model uncertainty is greatest, thereby reducing the number of costly training points that must be generated (by the fine solver). Our previous work on active learning reached more than an order of magnitude improvement in data efficiency for a black-box NN, and in this paper (Sec. 2.3 below) we also find substantial improvements from active learning for PEDS.
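A hypothetical sketch of this uncertainty-driven sampling (the solver, the uncertainty measure, and all hyperparameter values below are stand-ins; in the real algorithm the surrogate and its uncertainty network are retrained between iterations):

```python
import numpy as np

rng = np.random.default_rng(1)

def fine_solver(p):
    """Stand-in for the expensive high-fidelity solve."""
    return float(np.sin(5.0 * p).sum())

def uncertainty(p):
    """Stand-in for the trained heteroskedastic uncertainty network."""
    return float(np.abs(p - 0.5).mean())

def active_learning(n_init=8, n_iter=3, n_candidates=64, n_keep=8, dim=10):
    """Filtering loop: draw n_candidates random points per iteration, keep
    the n_keep with the largest predicted uncertainty, and label only those
    with the fine solver; surrogate retraining between iterations is elided."""
    X = rng.uniform(size=(n_init, dim))
    y = [fine_solver(p) for p in X]
    for _ in range(n_iter):
        cand = rng.uniform(size=(n_candidates, dim))
        scores = np.array([uncertainty(p) for p in cand])
        keep = cand[np.argsort(scores)[-n_keep:]]  # highest uncertainty
        X = np.vstack([X, keep])
        y += [fine_solver(p) for p in keep]
        # ... retrain the surrogate and uncertainty network on (X, y) here ...
    return X, np.array(y)
```

Only `n_init + n_iter * n_keep` fine-solver calls are made, while `n_iter * n_candidates` candidate points are screened at negligible cost.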
The active-learning algorithm iteratively builds a training set by filtering randomly generated points with respect to a trained measure of uncertainty [pestourie2020active]. The hyperparameters of this algorithm are the number of points the surrogate model is initially trained with, the number of exploration iterations, and two sampling parameters such that a batch of points is randomly generated at each iteration and only the points with highest uncertainty are retained, i.e. we run the expensive fine solver to obtain the PDE solutions at these points only. We trained single surrogates as well as ensembles of 5 independent surrogates. We found that models optimizing the negative log-likelihood perform similarly to models optimizing the mean squared error in the case of static training. This is not surprising, because the mean squared error is part of the negative log-likelihood objective.
SM baseline
Our PEDS approach has similarities with input space mapping (SM) [koziel2008space], especially neural SM [bakr2000neural] and coarse-mesh/fine-mesh SM [feng2019coarse], where the input of a fine solver is mapped into the input of a coarse solver. However, SM uses the same parameterization (e.g. the widths of the holes) for the fine solver and the coarse solver, whereas PEDS uses a much richer coarse-geometry input (a grid of material values, whose dimensionality is different in the coarse and fine geometries) and can therefore incorporate more geometric and physical inductive biases, such as symmetries and the downsampled structure. For comparison, we trained an input-SM baseline model, which is a combination of neural and coarse-mesh/fine-mesh SM [zhu2016novel, feng2019coarse]. In this model, the NN learns a mapping that creates modified geometry parameters, which are then fed to the coarse solver (see Sec. 4 for implementation details). Since the coarsified parameterized geometry is implemented via subpixel averaging as described in Sec. 2.1, this function is differentiable, so the gradient can backpropagate all the way to the mapping NN.
2.3 Accuracy and performance
We compared PEDS to a NN-only baseline (Sec. 4) and an SM baseline. In Fig. 2, we show that PEDS clearly outperforms all other models when combined with active learning. In the low-data regime, it is more accurate than the baseline. Asymptotically, in the high-data regime, it converges to the true value with a better power-law exponent: a slope of 0.5 on the log–log plot, in contrast to 0.1 for the baseline. SM does not gain much accuracy from ensembling nor from active learning, and is even worse than the baseline NN. (We found that SM can perform comparably to the baseline NN if the coarse-solver resolution is doubled to 20, not shown here, at the expense of more computational effort.)
From a data-efficiency perspective, the PEDS+AL solver achieves 20% accuracy on the test set with far less data than the baseline NN, less data than the baseline NN with AL, and orders of magnitude less data than space mapping (SM) with AL. Only PEDS+AL reaches 3.5% accuracy, but if we extrapolate the other curves it is clear that they would require at least two orders of magnitude more data to achieve similar accuracy.
(Note that the fractional error (FE) of the coarse solver alone is 1.24 on the test set, which shows that the coarse solver by itself is clearly insufficient to achieve reasonable accuracy.)
Timing was compared on a 3.5 GHz 6-Core Intel Xeon E5. Evaluating the baseline (with an ensemble of neural networks) takes about 500 μs, while PEDS is about ten times slower; the fine solver, in turn, is about a hundred times slower than PEDS. In order to simulate the data set quickly, and without loss of generality, we showed results for PEDS in 2D. Although PEDS is already faster than the fine model by two orders of magnitude, this difference will be even starker for 3D simulations, where the coarse model is vastly cheaper than the fine model; in that case, PEDS would represent a speedup of at least four orders of magnitude. Moreover, as we discuss in Sec. 3, the general PEDS approach can be applied to a wide variety of “coarse physics” models, which can be chosen to have a wide range of performance benefits compared to a high-fidelity model.
In order to understand the effect of mixing the generated structure with a downsampled structure, we performed an ablation study on an AL ensemble model in the low-data regime (1280 training points), with results given in Table 1. The edge case of using only the downsampled structure with the coarse solver in (1) performs the worst (1.24 error). Conversely, using the NN generator only in (1) is still about 15% worse (0.20 error) than using adaptive mixing. Imposing mirror symmetry via the projection step in (1) did not improve the accuracy of the model in this case (but is a useful option in general, since symmetry may have a larger effect on the physics in other applications).
Generative model for coarse geometry | FE on test set | PEDS improvement
Downsampled structure only (coarsified) | 1.24 | 86%
NN generator only | 0.20 | 15%
PEDS with symmetry | 0.18 | 5%
PEDS | 0.17 | —
2.4 Data analysis of the PEDS model
Because the trained PEDS model includes a NN that generates coarse-grained geometries “equivalent” to the input structure, it can be interesting to analyze these geometries and potentially extract physical insights.
Frequency dependence:
The neural network generates structures that are qualitatively different as a function of the input frequency (Fig. 3, right insets). As might be expected on physical grounds (e.g. effective-medium theory [holloway2011characterizing]), the lowest frequency (longest wavelength) corresponds to the smoothest generated structures, because the wavelength sets the minimum relevant lengthscale for wave scattering. To help quantify this, we performed a principal components analysis (PCA) of the generated coarse geometries for uniformly random input values (including random frequency). We show the first few principal components in Fig. 3 (left). The first and second components explain 67% and 13% of the variance, respectively. We show in Fig. 3 (right) that the coordinates of the first two components are sufficient to classify generated geometries according to the input frequency.
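This kind of PCA can be reproduced with a plain SVD of the centered data matrix; the sketch below uses random stand-in data in place of the generated geometries (the dimensions are hypothetical, not the paper's 1100-pixel grids):

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in "generated geometries": 200 samples of 50 pixels, with a few
# dominant directions of variation built in via per-pixel scaling.
G = rng.normal(size=(200, 50)) * np.linspace(3.0, 0.1, 50)

def pca(X, k=2):
    """First k principal components, their explained-variance ratios, and
    the coordinates of each sample in those components, via SVD."""
    Xc = X - X.mean(axis=0)          # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)  # variance ratio per component
    return Vt[:k], explained[:k], Xc @ Vt[:k].T

components, explained, coords = pca(G)
```

In the paper, the coordinates in the first two components (67% and 13% of the variance) suffice to separate geometries by input frequency; on this stand-in data they simply recover the two largest built-in variances.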
Scattering richness:
To explore the effect of additional scattering physics produced by multiple layers of holes, we generated coarse geometries for different numbers of layers (equivalently, fixing the parameters of the “deleted” layers to zero). We then decomposed the resulting geometries into the PCA components from above. As we increase the number of layers, the average coordinates of some principal components monotonically increase in magnitude. Since we know that more layers contain more scattering richness, the corresponding principal-component geometries provide some geometrical insight into how scattering richness translates into the generated structure. From our analysis of generated structures for the smallest frequency, the first principal-component geometry clearly contributes to scattering richness, with an average coordinate (across 1000 generated structures) increasing from 11 to 26 as the number of layers goes from 1 to 9.
3 Discussion and outlook
The significance of the PEDS approach is that it can easily be applied to a wide variety of physical systems. It is common across many disciplines to have models at varying levels of fidelity, whether they simply differ in spatial resolution (as in this paper) or in the types of physical processes they incorporate. For example, in fluid mechanics the “coarse” model could be Stokes flow (neglecting inertia), while the “fine” model might be a full Navier–Stokes model (vastly more expensive to simulate) [ferziger2002computational], with a generator NN correcting for the deficiencies of the simpler model. As another example, we are currently investigating a PEDS approach to construct a surrogate for complex Boltzmann-transport models [romano2021openbte], where the “coarse” heat-transport equation can simply be a diffusion equation. PEDS provides a data-driven strategy to connect a vast array of simplified physical models with the accuracy of brute-force numerical solvers, offering both more insight and more data efficiency than physics-independent black-box surrogates.
In addition to applying the PEDS approach to additional physical systems, there are a number of other possible technical refinements. For example, one could easily extend the PEDS NN to take an image of the fine-structure geometry rather than its parameterization, perhaps employing convolutional neural networks to represent a translation-independent “coarsification.” This type of surrogate could then be employed for topology optimization, in which “every pixel” is a degree of freedom
[molesky2018inverse]. Another interesting direction might be to develop new “coarsified” physics models that admit ultra-fast solvers but are too inaccurate to be used except with PEDS; for instance, mapping Maxwell’s equations in 3D onto a simpler (scalar-like) wave equation, or mapping the materials into objects that admit especially efficient solvers (such as impedance surfaces [perez2018sideways] or compact objects for surface-integral-equation methods [jin2015finite]).
4 Methods
Coarse solver and gradient
In the present work, the coarse solver is similar to the fine solver except that it uses a much coarser resolution of 10 (which corresponds to fewer than 5 pixels per wavelength in the worst case), instead of 40 for the fine model. The symmetry action was a simple mirror symmetry, implemented by averaging the geometry with its mirror flip.
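The adjoint computation of the coarse-solver gradient (Sec. 2.2) can be illustrated on a toy linear system A(G)u = b with scalar output c·u, standing in for the coarse FDFD solve (all matrices and sizes below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
b, c = rng.normal(size=n), rng.normal(size=n)
K = 0.1 * rng.normal(size=(n, n))  # fixed off-diagonal coupling

def A_of(G):
    """Toy system matrix: the coarse geometry G enters on the diagonal,
    much as the dielectric constant enters an FDFD operator."""
    return np.diag(2.0 + G) + K

def output(G):
    """Forward pass: solve A(G) u = b, then take the scalar output c.u."""
    return c @ np.linalg.solve(A_of(G), b)

def adjoint_grad(G):
    """One extra (transposed) solve yields the gradient w.r.t. every pixel
    at once; this is the vector-Jacobian product exposed to the AD system."""
    u = np.linalg.solve(A_of(G), b)      # forward solve
    lam = np.linalg.solve(A_of(G).T, c)  # adjoint solve
    # d(c.u)/dG_i = -lam.(dA/dG_i).u = -lam_i * u_i, since dA/dG_i = e_i e_i^T
    return -lam * u
```

The cost of the gradient is one additional solve regardless of the number of pixels, which is why backpropagating through the solver layer is affordable.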
Implementation details of PEDS and baselines
The generator neural network of PEDS has two hidden layers with 256 nodes and ReLU activation functions, and outputs a flattened version of the coarse geometry of dimension 1100 with a hard-tanh activation function. The network that outputs the variance of the model takes the generated coarse geometry as input, has 3 hidden layers with ReLU activation functions, and outputs a scalar with a ReLU activation function. The corresponding baseline, which is a neural-network-only (NN-only) method, was chosen to be as close as possible to the PEDS architecture: it replaces the coarse solver with a fully connected layer, and outputs two scalars with a tanh activation function. Note that it does not have the information of the downsampled structure. The mapping neural network of the input-SM implementation has two hidden layers with 256 nodes and ReLU activation functions, and outputs the coarse-geometry parameters of dimension 10 with a hard-tanh activation function. The variance network is similar to that of PEDS, except that its inputs are the SM output geometry parameters. The batch size was set to 64, with a fixed learning rate. Every training ran for 10 epochs.
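The Gaussian negative log-likelihood objective of Eq. (2), with the heteroskedastic standard deviation produced by the variance network, can be sketched as (a generic NumPy version, not the paper's Flux.jl code; the clipping constant is a placeholder):

```python
import numpy as np

def gaussian_nll(y, mean, sigma, eps=1e-6):
    """Batch-averaged negative log-likelihood of y under N(mean, sigma^2);
    sigma is clipped below by eps for numerical safety."""
    var = np.maximum(sigma, eps) ** 2
    return np.mean(0.5 * np.log(2.0 * np.pi * var)
                   + 0.5 * (y - mean) ** 2 / var)
```

With the standard deviation held constant, the data-dependent term reduces to a scaled mean-squared error, consistent with the observation (Sec. 2.2) that the two objectives perform similarly under static training.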
Active learning implementation details
The active-learning training [pestourie2020active] used the hyperparameters described in Sec. 2.2, with the total number of training points K taking successive powers of 2.
Parallelization
In order to accelerate the training of the surrogate model, we parallelized the training at the batch-loop level. For ensemble learning, we used 320 computing units, split into 5 groups (one group per model in the ensemble) of 64 computing units. With a batch size of 64, each worker evaluates the surrogate only once per batch loop (a batch size equal to any multiple of 64 would work well too).
Code
The code was implemented in the Julia language, version 1.6, using MPI.jl for parallelization with MPI, Flux.jl as the neural-network training framework, ChainRules.jl for custom differentiation rules, and Zygote.jl for other automatic differentiation.