1 Introduction
Numerous physical systems are described by ordinary or partial differential equations whose solutions are holomorphic or meromorphic functions in the complex domain, written $F(s) = |F(s)|\,e^{i\phi(s)}$, where $|F(s)|$ and $\phi(s)$ are respectively the magnitude and phase of $F(s)$. In many cases, only $|F|$, the magnitude of $F$, is observed on a finite set of points $S$ of the purely imaginary axis (readers unfamiliar with why function values are observed only on the purely imaginary line may refer to the explanation in Appendix A.1), since the coherent measurement of the phases may be expensive (see Ch. 13 of PhaseRetrieval2016) or require significant expertise (king2009hilbert). However, it is desirable to retrieve the lost phase from measurements of the magnitudes at the frequencies in $S$, since the phase often encodes important information. For example, in both spectroscopy and optical imaging, direct estimation of the phase is often not possible. However, phase information determines the optical constants in the spectroscopy of materials (lucarini2005kramerskronig), and is more important than the magnitude in optical imaging (PhaseRetrieval2016). The phase retrieval problem is also important in a number of other disciplines, including holography (Fienup1982), X-ray crystallography (Nugent2007), astronomy (stark1987image), blind deconvolution (Shechtman2015), and coherent diffractive imaging (Latychevskaia2018).

Many techniques for phase retrieval have been developed in the literature. The Kramers–Kronig formula (lucarini2005kramerskronig), also known as the Sokhotski–Plemelj formula, is noteworthy; it is derived by applying the Hilbert transform to $\ln|F|$. Unfortunately, for phase retrieval this elegant formula requires knowledge of the magnitude along the entire imaginary axis. As mentioned above, acquiring these values is often impossible, since $S$ is typically a finite set with a small number of elements. This severely limits the applicability of the Kramers–Kronig formula. Another approach is based on compressed sensing, but it requires sparsity assumptions, as well as stringent assumptions on the design or projection matrix (e.g., the Restricted Isometry Property (RIP) (candes2008restricted) or mutual incoherence (foucart2013invitation)), that often do not hold for the meromorphic functions encountered in practice.
To overcome the above limitations, phase retrieval can also be cast as a data-driven regression problem, where we wish to learn a mapping from the magnitude of a signal to its phase. If the number of frequencies measured is $N$, then the phase can be represented as a vector of complex numbers $(e^{i\phi_1}, \dots, e^{i\phi_N})$, where $\phi_k$ is the phase at the $k$th frequency. Similarly, the signal's corresponding magnitude vector is $(|F(i\omega_1)|, \dots, |F(i\omega_N)|)$, and the goal is to infer the former based upon the latter. We can then learn this mapping using a data-driven model (see Figure 1 below for an example). With a sufficient amount of training data, it is likely that a universal function approximator (e.g., a neural network) can learn the underlying function to within a tolerance. For instance, deep learning has shown great success on phase retrieval problems in imaging, where abundant training data is available (sinha2017lensless; metzler2018prdeep). But sufficient training data can be expensive, or sometimes impossible, to acquire in many practical applications.

In this paper, we depart from the existing literature by exploring the application of physics-infused neural networks to this setting. The infusion of physics reduces the number of required training examples to a level that is practical for many measurement and design applications of interest. The main idea of the approach is as follows. Given that any phase value has magnitude 1 (i.e., $|e^{i\phi(\omega)}| = 1$ for all $\omega$), we observe that the conditions of a theorem of Helson and Sarason are satisfied, indicating that the phase function can be approximated by a rational function whose numerator and denominator are both Blaschke products (Garcia). This observation is then used to design a neural network that calculates the coefficients of these Blaschke products, which may in turn be used for phase retrieval.
In Section 2, we briefly review conventional phase retrieval methods and existing applications of neural networks to phase retrieval problems. In Section 3, we review Blaschke products and the Helson–Sarason theorem, and propose our approach for baking these results into neural networks, resulting in Blaschke Product Neural Networks. In Section 4, we provide experimental results on various synthetic and metamaterial datasets, along with a comprehensive set of baselines, to demonstrate the performance improvements of our construction.
2 Related Work
We first briefly review existing phase retrieval methods applicable to the analytic and meromorphic functions of interest.
Perhaps the most important result in this area is the elegant calculation of Kramers and Kronig (also known as the Sokhotski–Plemelj formula), obtained by applying the Hilbert transform to $\ln|F|$:

$$\phi(\omega) \;=\; -\frac{2\omega}{\pi}\,\mathrm{P.V.}\int_{0}^{\infty}\frac{\ln|F(i\omega')|}{\omega'^{2}-\omega^{2}}\,d\omega',$$

where P.V. denotes Cauchy's principal value. We note that this relation can also be derived by invoking the Cauchy integral formula along a path containing the imaginary axis.
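As a concrete illustration, the Kramers–Kronig relation can be discretized as a Riemann sum; the sketch below is our own illustrative implementation of the textbook formula (grid, principal-value handling, and names are assumptions, not the authors' code):

```python
import numpy as np

def kk_phase(omega, log_mag):
    """Riemann-sum approximation of the Kramers-Kronig phase relation:
    phi(w) = -(2w/pi) P.V. int_0^inf log|F(iw')| / (w'^2 - w^2) dw'.
    The principal value is handled crudely by skipping the singular sample."""
    dw = np.gradient(omega)                   # local grid spacing
    idx = np.arange(len(omega))
    phi = np.empty_like(omega, dtype=float)
    for k, w in enumerate(omega):
        mask = idx != k                       # P.V.: drop the singular point
        phi[k] = -(2.0 * w / np.pi) * np.sum(
            log_mag[mask] * dw[mask] / (omega[mask] ** 2 - w ** 2))
    return phi
```

On a dense grid this works reasonably well away from the band edges; with only a handful of measured frequencies the sum degrades badly, which is exactly the obstacle discussed next.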
However, there is a major obstacle to applying this elegant formula. In many important practical scenarios, the set of frequencies (points on the imaginary axis) at which we have measurements has small cardinality. In such cases, approximating the Kramers–Kronig integral by a Riemann sum produces unsatisfactory results. Other approaches used to analytically extend or extrapolate the measurements have also demonstrated limited success (Chalashkanov2012).
Another approach to phase retrieval is via sparse representation. This approach assumes it is known a priori that $F$ has a sparse representation in a known dictionary of functions, i.e., that it can be written as a linear combination of dictionary elements with only a small number of nonzero coefficients. Under this assumption, many techniques have been developed to recover these coefficients and hence the phase of $F$. This is an interesting proposal, but the underlying assumptions may not hold in many problems of interest: a dictionary may not be known, and the classes of functions of interest may not be sparse in any fixed dictionary. Recently, new techniques based on deep learning have been developed to learn the structure of the underlying signal whose phase we wish to recover. That is, deep neural networks are used to encode prior knowledge of the signal's structure instead of hard-coded assumptions such as sparsity (jagatap2019phase). However, these methods require a significant number of training samples of the magnitude of the underlying signal, which limits their applicability in materials science applications.
Recently, deep learning has demonstrated its capability to solve the phase retrieval problem, especially in imaging: sinha2017lensless achieved lensless imaging for phase objects using a CNN; xue2019reliable constructed a Bayesian convolutional neural network to achieve phase imaging while quantifying the uncertainty of the network output; metzler2018prdeep combined the regularization-by-denoising framework with a convolutional neural network for noisy phase retrieval of images; and goy2018low achieved low-photon-count phase retrieval with a convolutional neural network (CNN) as well. To avoid the difficulty of obtaining a large set of paired phase and intensity measurements, zhang2021phasegan used an unpaired dataset for phase retrieval. The works of tayal2020inverse and houhou2020deep have also used deep learning for end-to-end phase recovery. These works either do not use the underlying physics in their deep learning approach or require a large number of training samples; hence, they are not directly applicable to the physics applications of interest in this paper. We take a different path, applying physical prior knowledge to the phase retrieval process.

One closely related line of work is solving partial differential equations using neural networks (dissanayake1994neural; berg2018unified; zhang2019bilinear; chen2018neural). However, these methods typically need substantial data and careful selection of hyperparameters, and they do not consider solving the phase retrieval problem directly. Our approach, by contrast, needs no specific hyperparameter search algorithm, and achieves high-accuracy results with very limited training samples.
3 Mathematical Background
We briefly review the mathematical background required in the rest of this paper; in particular, we review Blaschke products and the Helson–Sarason theorem. We note that the standard presentation of Blaschke products is motivated by inner functions in Hardy spaces over the unit disk, whose boundary is the unit circle (Rudin; Garcia). In our case, we are interested in the right half-plane, whose boundary is the imaginary axis. The two settings are equivalent via a complex Möbius transformation. In this light, our exposition is closer to Akotuwicz.
Let $\{a_k\}_{k=1}^{\infty}$ be an infinite sequence of complex numbers with $\operatorname{Re}(a_k) \ge 0$ for all $k$, for which

$$\sum_{k=1}^{\infty}\frac{\operatorname{Re}(a_k)}{1+|a_k|^{2}} \;<\; \infty;$$

then, for any integer $m \ge 0$, any complex function given by

$$B(s) \;=\; c\left(\frac{s-1}{s+1}\right)^{m}\prod_{k=1}^{\infty}\frac{|1-a_k^{2}|}{1-a_k^{2}}\cdot\frac{s-a_k}{s+\overline{a_k}}, \qquad |c| = 1,$$

is referred to as an infinite Blaschke product (the unimodular factors $|1-a_k^{2}|/(1-a_k^{2})$ ensure convergence of the product). If the set $\{a_k\}$ is finite, $B$ is referred to as a finite Blaschke product. For a finite Blaschke product, the convergence condition clearly holds, since there are only a finite number of coefficients $a_k$, and it can therefore be dropped. Note that roots with $\operatorname{Re}(a_k) = 0$ do not contribute to the product, since the corresponding factors are identically one.
For both finite and infinite Blaschke products, we can see that $|B(s)| = 1$ for $s$ on the purely imaginary axis (a proof is given in Appendix A.2). In this paper, we are only interested in finite Blaschke products, because of the following approximation theorem of Helson and Sarason.
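For intuition, a finite Blaschke product for the right half-plane can be evaluated directly and the unit-modulus property on the imaginary axis checked numerically (the factor notation below is an assumption matching the half-plane factors discussed above):

```python
import numpy as np

def blaschke(s, roots):
    """Finite Blaschke product for the right half-plane:
    B(s) = prod_k (s - a_k) / (s + conj(a_k)), with Re(a_k) > 0."""
    s = np.asarray(s, dtype=complex)
    out = np.ones_like(s)
    for a in roots:
        out *= (s - a) / (s + np.conj(a))
    return out
```

Since the conjugate of $i\omega - a$ equals $-(i\omega + \overline a)$, each factor has modulus one on the imaginary axis, and so does the product.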
Theorem 1
Let $u$ be any continuous function on the imaginary axis for which $|u(i\omega)| = 1$ for all $\omega$, and let $\varepsilon > 0$. Then there exist finite Blaschke products $B_1$ and $B_2$ such that

$$\left\|\,u - \frac{B_1}{B_2}\,\right\|_{\infty} \;<\; \varepsilon,$$

where $\|\cdot\|_{\infty}$ denotes the standard supremum norm computed over the imaginary axis.
Let $u(i\omega) = F(i\omega)/|F(i\omega)| = e^{i\phi(\omega)}$. We know that $|u| = 1$ by construction. Applying the Helson–Sarason theorem, for analytic or meromorphic functions $F$ (with no poles on the imaginary axis), the phase can be approximated by a rational function of finite Blaschke products. The main theme of this paper is that these Blaschke products can in turn be computed using neural networks with a small amount of data.
From the expression of a finite Blaschke product, we can see that for $s = i\omega$, $B(i\omega) = c\,P(i\omega)/\overline{P}(-i\omega)$, where $c$ is a complex number with $|c| = 1$, $P(s) = \prod_k (s - a_k)$ is a complex polynomial, and $\overline{P}$ is constructed from $P$ by replacing all its coefficients with their respective conjugates. Note that $\overline{P}(-i\omega) = \overline{P(i\omega)}$. Let $P_1$ and $P_2$ represent the Blaschke products $B_1$ and $B_2$ in the above theorem through their complex polynomials, and write $Q(s) = P_1(s)\,\overline{P_2}(-s)$; then the approximating rational function becomes $B_1/B_2 = c\,Q(i\omega)/\overline{Q(i\omega)}$. Thus we have the estimate of the phase function:

$$\hat u(i\omega) \;=\; c\,\frac{Q(i\omega)}{\overline{Q(i\omega)}}, \qquad Q = Q_{+}\,Q_{-}, \tag{1}$$

where $|c| = 1$, and $Q_{+}$ and $Q_{-}$ correspond respectively to the factors of $Q$ whose roots have positive and negative real parts.
3.1 Blaschke Product Neural Network
Inspired by the above, we model the phase function by equation 1, and propose the Blaschke Product Neural Network (BPNN). A BPNN is a neural network that takes the magnitude of the observations as input and produces the coefficients of the rational function of Blaschke products as output. In this paper, we only present BPNNs whose underlying networks are fully-connected neural networks (FCNNs), although such a restriction is not necessary. We next describe the operation of BPNNs by discussing the forward and backward pass mechanisms.
Forward Pass
The outputs of the BPNN are the Blaschke coefficients of the complex polynomials, which are in turn used to construct the predicted phase function according to equation 1. Without loss of generality, let $\{\omega_1, \dots, \omega_N\}$ be the set of frequencies at which the magnitudes of the signals are measured experimentally and whose corresponding phases must be retrieved; it can be assumed that $\omega_1 < \cdots < \omega_N$. The BPNN then produces estimates $\hat u(i\omega_k)$ of the phase. Combining these with the known magnitudes at $\omega_k$ gives the estimate $\hat F(i\omega_k) = |F(i\omega_k)|\,\hat u(i\omega_k)$.
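A minimal sketch of such a forward pass is given below. For simplicity it uses a single Blaschke product rather than the ratio in equation 1, and the layer sizes, frequency grid, and the positivity trick for the roots are all illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class BPNNSketch(nn.Module):
    """Illustrative BPNN forward pass: magnitudes in, Blaschke roots out,
    phase reconstructed from the roots on the imaginary axis."""
    def __init__(self, n_freq: int, n_roots: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 2 * n_roots),  # real and imaginary parts of roots
        )
        omega = torch.linspace(1.0, 5.0, n_freq)        # assumed grid
        self.register_buffer("s", torch.complex(torch.zeros_like(omega), omega))

    def forward(self, magnitude: torch.Tensor) -> torch.Tensor:
        re, im = self.net(magnitude).chunk(2, dim=-1)
        roots = torch.complex(re.abs() + 1e-3, im)      # force Re(a_k) > 0
        phase = torch.ones_like(self.s).expand(magnitude.shape[0], -1).clone()
        for k in range(roots.shape[-1]):
            a = roots[:, k:k + 1]
            phase = phase * (self.s - a) / (self.s + a.conj())
        # |phase| = 1 on the imaginary axis, so output = magnitude * e^{i phi}
        return magnitude * phase
```

Because each Blaschke factor has unit modulus on the imaginary axis, the output's magnitude always matches the input magnitude by construction; only the phase is learned.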
Backward Pass
After predictions $\hat F(i\omega_k)$ are generated by the forward pass, the prediction error is computed as

$$\mathcal{L} \;=\; \frac{1}{N}\sum_{k=1}^{N}\left|\hat F(i\omega_k) - F(i\omega_k)\right|^{2},$$

i.e., the averaged mean square error of the real and imaginary parts. Training seeks to minimize $\mathcal{L}$ over all training samples in a standard manner using the stochastic gradient descent algorithm and its variants. Cross-validation and testing can also be done in the standard fashion.
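The loss and one optimization step can be sketched as follows; the squared real/imaginary error matches the MSE metric used in the experiments, while the model and optimizer names are placeholders:

```python
import torch

def complex_mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Averaged mean square error over real and imaginary parts."""
    return ((pred.real - target.real) ** 2 + (pred.imag - target.imag) ** 2).mean()

def train_step(model, optimizer, magnitude, target):
    """One gradient step minimizing the complex MSE (illustrative loop)."""
    optimizer.zero_grad()
    loss = complex_mse(model(magnitude), target)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice this step is iterated over minibatches with SGD or Adam, exactly as for any regression network.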
An important hyperparameter of interest is the degree of the complex polynomials $Q_{+}$ and $Q_{-}$. This hyperparameter can either be cross-validated or chosen based on the practitioner's knowledge of the complexity of the specific phase retrieval problem. In our experiments, we found that complex phase retrieval problems may require higher-degree polynomials, and evaluating these high-degree polynomials over large frequency ranges (100 THz–1500 THz and above) may lead to overflow. To overcome this problem, we propose a piecewise implementation of the BPNN.
Piecewise Implementation of BPNN
In this approach, given a large frequency range, the whole frequency band is partitioned into non-overlapping contiguous segments, and the Blaschke method is applied to each segment (see Figure 1(b)). The BPNN is then trained to output the respective Blaschke coefficients for each segment. During training, the predicted phases on all segments are concatenated, after which training proceeds exactly as for the non-piecewise BPNN. The exact method of partitioning and the degrees of the corresponding Blaschke polynomials for each segment are hyperparameters of choice. We include two example applications of piecewise BPNNs to metamaterial datasets in our numerical experiments.
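The segmentation step can be sketched as follows (an equal-length split, the naive partition used in our experiments; other partitions are possible):

```python
import numpy as np

def partition_frequencies(omega, n_segments):
    """Split a frequency grid into contiguous, non-overlapping segments
    for the piecewise BPNN. Each segment then gets its own Blaschke roots."""
    return np.array_split(np.asarray(omega), n_segments)
```

Each segment is handled by its own small Blaschke product, and the per-segment phase predictions are concatenated before the loss is computed.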
4 Experiments
We conduct simulations on five phase retrieval datasets, two of which are synthetic, while the remaining three are metamaterial datasets. A detailed listing of the datasets is shown in Table 1. We endeavor to test the BPNN on diverse, yet complete, datasets, and thus we choose datasets that include both holomorphic and meromorphic complex functions. Additionally, the selected metamaterial datasets are comprehensive, and well represent the discipline.
4.1 Datasets
Table 1: The phase retrieval datasets used in our experiments.

PHASE RETRIEVAL PROBLEM  |  TYPE  |  # FREQUENCIES
Polynomial ODE (4.1.1)  |  Synthetic  |  200
Lorentzian response function (A.7)  |  Synthetic  |  1000
Metamaterial absorber (4.1.3)  |  Metamaterial  |  1000
Huygens' metasurface (4.1.4)  |  Metamaterial  |  2001
Membrane metasurface (4.1.5)  |  Metamaterial  |  2001
4.1.1 Synthetic I: Polynomial ODE
We first select 200 points on the imaginary axis and fix these values. The first synthetic dataset consists of solutions to an ordinary differential equation (ODE) whose driving term is a complex polynomial $p$ with all of its roots lying on the unit circle. We uniformly sample the roots of $p$ on the unit circle, then solve the ODE on the imaginary axis using the Euler method, with the real and imaginary parts of the initial condition sampled uniformly at random for each generated trajectory. The solution values are then calculated, and their magnitudes and phases are used in this experiment, with a portion used for training and the remainder for testing.
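The forward Euler integration used to generate these trajectories can be sketched generically; the specific right-hand side is determined by the polynomial $p$ above, so `f` here is only a placeholder:

```python
import numpy as np

def euler_solve(f, y0, s_grid):
    """Forward Euler along a grid of (possibly complex) points:
    y[k+1] = y[k] + h * f(s[k], y[k]). `f` stands in for the ODE's
    right-hand side, which the paper defines via the polynomial p."""
    y = np.empty(len(s_grid), dtype=complex)
    y[0] = y0
    for k in range(len(s_grid) - 1):
        h = s_grid[k + 1] - s_grid[k]
        y[k + 1] = y[k] + h * f(s_grid[k], y[k])
    return y
```

The magnitudes and phases of the resulting samples then form one input/output pair of the dataset.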
4.1.2 Synthetic II: Lorentzian Response Function
The second synthetic dataset is generated by considering a causal model of a non-magnetic material, with a frequency-dependent complex dielectric permittivity defined as a sum of Lorentzian oscillators. A simple Lorentzian model of the dielectric function is applicable to a wide range of resonant material systems with response functions that must obey the Kramers–Kronig relations. Using the transfer-matrix method (Markos2008), the complex transmission coefficient can be computed directly from the dielectric function via closed-form equations (please see the appendix for more details). We generate a dataset of complex transmission coefficients by uniformly sampling Lorentzian parameters for the frequency range 100–500 THz. We consider a fixed set of four Lorentzian oscillators, and generate 50 sample transmission spectra for training and 2000 spectra for the test dataset.
4.1.3 Metamaterial I: Metamaterial Absorber
Metamaterials are structured materials consisting of periodic arrays of resonant elements that derive their spectral properties from their unit cell geometry (smith2000composite). With effective material properties that can exhibit resonances in both the electric permittivity and the magnetic permeability, and hence more unknown material parameters, metamaterials often represent a more complicated phase retrieval problem than many conventional systems.
The first metamaterial dataset considers a metal-based absorber geometry with a metallic ground plane backing a dielectric spacer layer, designed to operate in the millimeter-wave (50–200 GHz) frequency range. This metamaterial absorber (MMA) dataset has 55,000 pairs of MMA geometry and complex frequency-dependent reflection coefficient (the transmission is zero due to the metal backing). The data are obtained with commercial computational electromagnetic simulation software (CST Microwave Studios).
4.1.4 Metamaterial II: Huygens’ Metasurface
Huygens' metasurfaces made of dielectric materials represent another class of metamaterials (decker2015high), with underlying properties that differ from those of their metallic counterparts, including, for example, the lack of Ohmic losses (liu2017experimental). When individual designs are combined into supercells, all-dielectric metasurfaces can exhibit very rich spectral phenomena. The Huygens' metasurface dataset consists of the complex reflection and transmission response of a supercell of elliptical resonators adopted from the work of deng2021neural. The phase information can be inferred directly from the complex reflection or transmission. Out of a total of 1000 samples, we randomly sample 80 for training and reserve the rest for testing.
4.1.5 Metamaterial III: Membrane Metasurface
The membrane metasurfaces are similar to the Huygens' metasurfaces. However, instead of four standalone elliptical resonators in the supercell, each consists of a thin slab of SiC with regions of empty holes. The SiC slab filling the supercell has a defined thickness and periodicity. In the SiC slab, we place four elliptical holes where the elliptical SiC resonators were previously located; the four holes thus share the same geometrical parameterization. The fourteen geometrical parameters share the same ranges as in the elliptical resonators dataset. Similarly, the inputs of this dataset are the fourteen geometry parameters, and the outputs are the complex reflection and transmission at 2001 frequency points ranging from 100 THz to 500 THz. Out of a total of 7331 samples, we randomly sample 80 for training and reserve the rest for testing.
4.2 Benchmarking Neural Networks
Comparison is done on each dataset with training sets of increasing size. For each phase retrieval problem and each training set, we compare the BPNN to a family of NNs of various sizes and structures. For each structure, we search over its training hyperparameters to ensure the resulting NNs are close to optimal, record the test error at each training epoch, and report the best test error. We note that selecting on test error is unconventional, but it guarantees that the reported baseline performance is close to its limit.
We use NNs of various structures, and all hidden layers of a given NN have the same width. The range of hidden-layer widths in the search is [32, 64, 128, 256], and the number of hidden layers is searched over [1, 2, 3, 4]. For training hyperparameters, we search the dropout rate over [0., 0.05, 0.1, 0.15, 0.2] and the learning rate over [1e-4, 5e-4, 1e-3, 5e-3, 1e-2]. Each NN is trained for 6000 epochs, after which the training losses of all networks have converged. Training is performed with 3 random initializations, and we report the best test error. The Huygens' metasurface dataset is more complex than the others, so its range of hidden-layer widths is increased to [64, 128, 256, 512] while keeping the other search ranges the same. Unlike the extensive search for the NNs, we do not search hyperparameters for the BPNN and use only one BPNN per phase retrieval problem. We also train the BPNN with 3 random initializations on each dataset and report its best test error. We use ReLU as the activation function for both NNs and BPNNs, while Adam (kingma2017adam) is used for training all NNs and BPNNs.

4.3 Results
For each dataset, we compare the family of regular fully-connected neural networks, each with its best searched hyperparameters, to the BPNN without any hyperparameter search. For the three metamaterial datasets, we further include four baselines: (1) the Kramers–Kronig method (KK method), a conventional phase retrieval method (lucarini2005kramerskronig); (2) the AAA-network algorithm, an established rational approximation technique whose functional form, like the BPNN's, we use to approximate the phase function (AAA); (3) the MUSIC algorithm, a data-driven function approximation method that estimates functions as sums of exponentials (MUSIC); and (4) Linear+BP, which uses a linear regressor instead of a neural network in the BPNN (please see the Appendix for more details about the KK method, the AAA algorithm, and the MUSIC algorithm). All y-axes are plotted on a dB scale ($10\log_{10} E$, where $E$ is the MSE) for visual clarity. For the legends in Fig. 3, we use entries of the form (A, B), where A refers to the width (i.e., number of neurons) of each hidden layer of the neural network being evaluated, and B refers to its depth (i.e., number of hidden layers).
Performance of Baselines We note that both the KK and MUSIC methods exhibit poor performance: the KK method requires a large number of magnitude measurements to achieve accurate estimates, while the MUSIC method expects the target function to be sparse in some basis, which cannot be assumed for our problems. The Linear+BP model also performs substantially worse than the BPNN, suggesting that the relationship between the magnitude and the Blaschke product coefficients is nonlinear. Although representing the phase function with a Blaschke product decreases the complexity of the problem, the mapping from magnitude to Blaschke coefficients still appears to be beyond the capability of a linear regressor. In the following paragraphs we describe the results for each dataset in detail.
Synthetic Datasets We use a BPNN with 4 roots and a fully-connected neural network of 2 hidden layers, each of size 64, with no dropout. It is evident from Figure 3 that the BPNN outperforms a family of optimized neural networks by a large margin. This supports our theoretical justification for using the BPNN when the phase functions are precisely those of meromorphic functions. The outstanding performance of the BPNN on the Lorentzian dataset suggests its value for phase retrieval problems in metamaterials. To confirm this, we conduct further experiments on real metamaterial datasets.
MMA, ADM, Membrane
For the MMA dataset, as for the synthetic datasets, we use a BPNN with 4 roots and a fully-connected neural network with 2 hidden layers of size 64. No dropout or weight decay was added. Even without any regularization, the BPNN still demonstrates performance superior to all the optimized neural networks. Given that the ADM and membrane datasets are much more complex than the synthetic and MMA datasets, we use a piecewise BPNN with the frequencies partitioned into 20 equal-length segments, assigning 3 roots to each segment. We note that this is the most naive split for a piecewise BPNN, and a more considered split can lead to improved performance. We find that the best errors of the regular neural networks are close to the BPNN's best error. However, we also find that the median error of the BPNN is much smaller than the median error of the best neural network (the network with the smallest test MSE across all architectures). This shows that the BPNN consistently makes good predictions at the cost of a small number of poor ones. This property can be valuable when the prediction problem at hand is of the on-or-off type, owing to the BPNN's higher probability of making a good prediction.
5 Conclusion
We presented a physics-infused neural network, referred to as the Blaschke Product Neural Network (BPNN), for the phase retrieval problem of holomorphic and meromorphic functions. Through extensive experiments on synthetic and metamaterial datasets, we found that the data required for training neural networks is significantly less than expected when the underlying functions are holomorphic or meromorphic. By baking the physics into the neural network, even less training data is required and the performance is further improved.
6 Reproducibility Statement
We upload four datasets. Due to the 100 MB limit on supplementary material, we include three datasets in the supplementary: the two synthetic datasets and the Huygens' metasurface dataset. The membrane metasurface dataset can be accessed at https://figshare.com/s/966f196d1847269a447c. Details about the hyperparameter search space and the search process can be found in the experiments section. The supplementary also includes our PyTorch implementation of the BPNN and the baseline neural networks, with easily configurable hyperparameters.
Acknowledgments
This work was supported in part by the Office of Naval Research (ONR) under grant number N00014-21-1-2590. YD, OK, SR, JM, and WJP acknowledge funding from the U.S. Department of Energy (DOE) (DE-SC0014372).
References
Appendix A Appendix
A.1 Observations of phase function values
Here we provide a short explanation of why the function values are only available on the imaginary line. In physics, electrical engineering, and other fields, solutions to many linear PDEs with constant coefficients, as well as many other measurements, are obtained in the frequency domain. In principle this is achieved by multiplying a real signal $x(t)$ by the sinusoids $\cos(\omega t)$ and $\sin(\omega t)$ and integrating over time (calculating the cosine and sine transforms) for various values of $\omega$. If $x(t)$ is a real physical signal, these cosine and sine transforms are the real and imaginary parts of the complex Fourier transform of $x(t)$. Thus

$$F(i\omega) \;=\; \int_{-\infty}^{\infty} x(t)\,e^{-i\omega t}\,dt$$

gives the value of $F$ over the imaginary axis.

A.2 Unit modulus of Blaschke products on the imaginary axis
We provide the details proving that the norm of a Blaschke product is 1 when $s$ lies on the imaginary axis. It suffices to show that each factor in the product has modulus one individually: for $s = i\omega$,

$$\left|\frac{i\omega - a_k}{\,i\omega + \overline{a_k}\,}\right| \;=\; \frac{|i\omega - a_k|}{\bigl|\overline{i\omega - a_k}\bigr|} \;=\; 1,$$

since $i\omega + \overline{a_k} = -\,\overline{(i\omega - a_k)}$ and conjugation preserves modulus.
A.3 Hyperparameters of the Piecewise BPNN
In order to provide analysis and some insight into the hyperparameters of the piecewise BPNN, we conduct experiments on the two most complex datasets, the Huygens' metasurface dataset and the membrane metasurface dataset, to obtain empirical results on the effect of the hyperparameters (e.g., the number of segments and the number of roots per segment) on performance. We use a training dataset of size 50 for both datasets and fix the BPNN's neural network to have 2 hidden layers of size 64 each. From Figure 6 we can see that the piecewise BPNN's performance is robust to its hyperparameters: a wide range of settings generates satisfactory results as long as the total number of roots is above the necessary threshold. Similarly, the piecewise BPNN is robust to overfitting caused by a high number of roots: while the optimal total number of roots appears to lie in the range of 60–80, overshooting the total to 160 or even 200 does not cause a sharp increase in test error. We believe this is a strong indication that the piecewise BPNN learns the necessary complexity and zeros out the unnecessary degrees of freedom.
A.4 Baseline model 1: Kramers–Kronig
The Kramers–Kronig relations, implied by the causality of a stable physical system, are often used to convert between the imaginary and real parts of physical functions such as dielectric functions, susceptibilities, and reflection coefficients. We adopt the notation from grosse1991analysis, since the metamaterials problems in the main text focus on phase retrieval tasks for spectral data, including reflectance. In general, the Kramers–Kronig relations for the complex reflection coefficient $r(\omega) = r'(\omega) + i\,r''(\omega)$ are given by:

$$r'(\omega) \;=\; \frac{2}{\pi}\,\mathrm{P.V.}\int_{0}^{\infty}\frac{\omega'\,r''(\omega')}{\omega'^{2}-\omega^{2}}\,d\omega', \qquad r''(\omega) \;=\; -\frac{2\omega}{\pi}\,\mathrm{P.V.}\int_{0}^{\infty}\frac{r'(\omega')}{\omega'^{2}-\omega^{2}}\,d\omega',$$

where $r'$ and $r''$ are the real and imaginary parts of the reflection coefficient, respectively.

In practice, $r'$ and $r''$ are often unknown. The reflectance $R = |r|^{2}$ is the physical quantity directly measurable by intensity measurement, but the phase information is lost. It is therefore beneficial to derive the phase from the reflectance $R$. A general form of the complex reflection coefficient can be written as $r(\omega) = \sqrt{R(\omega)}\,e^{i\theta(\omega)}$. We can infer that:

$$\ln r(\omega) \;=\; \tfrac{1}{2}\ln R(\omega) + i\,\theta(\omega).$$

If the Kramers–Kronig relations are valid for $\ln r$, we can first measure $R$ from zero frequency to the highest measurable frequency, and then apply the Kramers–Kronig relations with $\tfrac{1}{2}\ln R$ as the real part of $\ln r$ to infer the imaginary part $\theta$. In our implementation, we follow the above procedure to derive $\theta$, and can therefore obtain the Kramers–Kronig-reconstructed complex reflection coefficient $\hat r = \sqrt{R}\,e^{i\theta}$. Since the loss metric in the main text is the averaged mean square error of the real and imaginary parts of the reflection coefficient, the real and imaginary parts are further taken from $\hat r$ and compared with those of the ground truth.
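The procedure above can be sketched numerically as follows (Riemann-sum discretization with the singular point skipped; an illustrative sketch, not the exact implementation):

```python
import numpy as np

def kk_reconstruct_r(omega, R):
    """KK-reconstructed complex reflection coefficient:
    theta(w) = -(w/pi) P.V. int_0^inf ln R(w') / (w'^2 - w^2) dw'
    (Riemann sum, singular sample skipped), then r_hat = sqrt(R) e^{i theta}."""
    dw = np.gradient(omega)
    idx = np.arange(len(omega))
    lnR = np.log(R)
    theta = np.empty_like(omega, dtype=float)
    for k, w in enumerate(omega):
        mask = idx != k                       # crude principal value
        theta[k] = -(w / np.pi) * np.sum(
            lnR[mask] * dw[mask] / (omega[mask] ** 2 - w ** 2))
    return np.sqrt(R) * np.exp(1j * theta)
```

By construction, the reconstruction preserves the measured reflectance exactly; only the phase is inferred.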
The above procedure is built on the premise that causality holds for $\ln r$, but $\ln r$ is not a physical response function that is proven to be governed by causality. Thus, we cannot prove that the Kramers–Kronig relations are valid for $\ln r$. Furthermore, the reflection spectra in the metamaterial datasets span only a finite frequency range, which degrades the performance of the Kramers–Kronig relations. Nonetheless, we provide the Kramers–Kronig relations as a baseline for phase retrieval.
A.5 Baseline model 2: AAA-network algorithm details
Here we supply the details of the Adaptive Antoulas–Anderson (AAA) network baseline included in the main text. Following the notation of nakatsukasa2018aaa, it uses the barycentric representation of rational approximation, which approximates an arbitrary function as a rational function of type (m, m):

$$r(z) \;=\; \frac{\displaystyle\sum_{j=1}^{m}\frac{w_j f_j}{z - z_j}}{\displaystyle\sum_{j=1}^{m}\frac{w_j}{z - z_j}},$$

where $z_j$, $w_j$, and $f_j$ are the support points, weights, and value parameters of the approximation. AAA approximates a complex function over a set of complex support points using an iterative greedy selection and solves for the weights via linear least squares.
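For reference, evaluating the barycentric form at a point away from the support set is a short computation (the function and argument names are ours):

```python
import numpy as np

def barycentric_eval(z, support, values, weights):
    """Evaluate the barycentric rational form
    r(z) = (sum_j w_j f_j / (z - z_j)) / (sum_j w_j / (z - z_j)).
    Assumes z does not coincide with a support point."""
    z = np.asarray(z, dtype=complex)[..., None]
    C = 1.0 / (z - np.asarray(support))          # Cauchy matrix
    num = C @ (np.asarray(weights) * np.asarray(values))
    den = C @ np.asarray(weights)
    return num / den
```

At a support point itself the formula is interpreted in the limit, where $r(z_j) = f_j$.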
Although AAA also utilizes rational approximation, there are a couple of fundamental differences between AAA and our Blaschke product approximation. (1) The functional form: AAA uses the barycentric representation, which, although limited to type (m, m) rationals, covers a much larger functional space than our Blaschke product, which is specialized to approximate unimodular phase functions according to the Helson–Sarason theorem. (2) The fitting process: AAA greedily picks the support points first and then solves a least-squares problem for the weights and values. Our Blaschke product fits all parameters simultaneously and requires no internal iterative steps (although the actual fitting is done with gradient-based methods, which are iterative). This means AAA is difficult to build into a neural network so as to receive gradient feedback from the fit, unlike our BPNN approach.
With AAA, we can successfully approximate the phase function using three sets of barycentric parameters (also known as AAA parameters). However, this process requires knowledge of the phase function, or at least the ability to sample points on it. Therefore, naive AAA alone cannot be applied to our phase retrieval setting, which requires not only the approximation of the phase function but also the extrapolation of the mapping between the magnitude and the approximation parameters.
For the implementation, we use the existing package of Driscoll2014 (we used this public Python version: https://github.com/cfh/baryrat). Setting the number of support points comparable to our BP (so that the two have similar degrees of freedom), we first run the AAA algorithm on all of our training sets (with the number of training points ranging from 1 to 50) to obtain the training AAA parameters. We then use a neural network to model the relationship between the measured magnitude and the AAA parameters on the training data. The network structure used was 2 hidden layers of 64 neurons with ReLU activation. After the network is properly trained (to convergence over 300 epochs), we input the test spectra magnitudes and obtain the inferred AAA parameters for the test spectra. We then reconstruct the barycentric representation and compare the reconstructed phase function with the ground truth to obtain the MSE.
As the graphs in the main text show, all of the AAA networks performed poorly compared to both the BP and the naive NN results. This is expected for the following reasons:

The functional form of barycentric rational functions lacks physical meaning: encompassing all type (m, m) rational functions, the space is too large and insufficiently restrictive for phase functions. This lack of physical structure in the rational approximation, in contrast to our Blaschke product (which builds in the unit-modulus property of phase functions on the unit circle), is, we suspect, the most important reason the approach does not work.

Two-stage process error: our AAA-network baseline has two stages, an AAA parameter-fitting stage and a network mapping-learning stage. The errors of the two stages compound and degrade the overall performance.

One-to-many issue: unlike the Blaschke product, whose parameters form an ordered set, the parameters of the barycentric representation of AAA form an unordered set. Without a canonical ordering, the same phase function corresponds to many parameter vectors; this one-to-many ambiguity makes the regression target ill-defined and damages training performance.
a.6 Baseline model 3: MUSIC algorithm details
The Multiple Signal Classification (MUSIC) algorithm (schmidt1986multiple) is a signal-processing method popular in frequency estimation and direction finding, where the signal is a linear combination of multiple sources at different frequencies and/or directions. MUSIC assumes that the signal vector consists of p complex exponentials whose frequencies w are unknown.
Since MUSIC was not originally designed for phase retrieval and there is no single widely accepted way to apply it to this problem, we describe our MUSIC phase-retrieval procedure in full. First, we assume that our signal consists of the sum of p complex exponentials with unknown frequencies w plus a noise term n:

x = As + n,

where A is the Vandermonde matrix of steering vectors and s is the amplitude vector. We assume that the actual number of sources is much smaller than the number of observations, so p < M. We further assume that the signal snapshots come from the same distribution, i.e., their (M-by-M) autocorrelation matrix is constant, and we estimate it by averaging the sampled correlation matrices:

\hat{R} = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^H.
Now, with the estimate of the autocorrelation matrix, we estimate the frequencies using the eigenspace method, where the eigenvectors associated with small eigenvalues are treated as the noise subspace and used to recover the signal frequencies. Specifically, we decompose \hat{R} into eigenvalues and eigenvectors and take the (M-p) smallest eigenvalues and their corresponding eigenvectors to form the noise subspace U_N. Since the eigenvectors of a Hermitian matrix are mutually orthogonal, any pure signal from the signal subspace is orthogonal to all of the vectors that span the noise subspace. Using a squared-norm distance, for a steering vector e(w), the distance

d(w) = \| U_N^H e(w) \|^2

should be zero for all pure signals. Equivalently, the inverse 1/d(w) exhibits peaks at the signal frequencies as we sweep w through a range of possible values. We therefore sweep w over a fine grid, and the peaks of 1/d(w) give our estimated values of w; the top-p peaks in the curve define the columns of the A matrix.
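The frequency-finding stage above can be sketched as follows (a minimal numpy illustration on synthetic snapshots; the grid size, snapshot count, noise level, and frequencies are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
M, p, n_snap = 16, 2, 200
true_w = np.array([0.7, 1.9])              # frequencies to be recovered
t = np.arange(M)

# Synthetic snapshots x_i = A s_i + n_i (random amplitudes, small noise)
A = np.exp(1j * np.outer(t, true_w))       # M-by-p Vandermonde steering matrix
S = rng.standard_normal((p, n_snap)) + 1j * rng.standard_normal((p, n_snap))
N = 0.05 * (rng.standard_normal((M, n_snap)) + 1j * rng.standard_normal((M, n_snap)))
X = A @ S + N

# Autocorrelation estimate: R_hat = (1/N) sum_i x_i x_i^H
R_hat = X @ X.conj().T / n_snap

# Noise subspace U_N: eigenvectors of the (M - p) smallest eigenvalues
eigvals, eigvecs = np.linalg.eigh(R_hat)   # eigh returns ascending eigenvalues
U_noise = eigvecs[:, : M - p]

# Sweep a fine grid; 1 / ||U_N^H e(w)||^2 peaks at the signal frequencies
grid = np.linspace(0, np.pi, 2000)
E = np.exp(1j * np.outer(t, grid))         # steering vectors on the grid
d = np.sum(np.abs(U_noise.conj().T @ E) ** 2, axis=0)
pseudo = 1.0 / d

# Keep the top-p local maxima of the pseudospectrum as frequency estimates
inner = np.arange(1, len(grid) - 1)
peaks = inner[(pseudo[inner] > pseudo[inner - 1]) & (pseudo[inner] > pseudo[inner + 1])]
w_est = np.sort(grid[peaks[np.argsort(pseudo[peaks])[-p:]]])
print(np.round(w_est, 2))
```

The estimates land within the grid resolution of the true frequencies; the corresponding steering vectors then form the columns of the estimated A.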
The above is the standard MUSIC algorithm for finding the signal frequencies (assuming the signal is a sum of complex exponentials). With the frequencies found, recovering the actual signal x is equivalent to finding the corresponding amplitude vector s. We therefore cast phase retrieval as the optimization problem:

\min_s \| \, |As| - |x|_{gt} \, \|^2,

where |x|_{gt} is the ground-truth magnitude or experimental measurement. For each test phase-retrieval problem (each test magnitude spectrum), we rerun the optimization using the estimated A matrix. We tried the Newton–Raphson method for this step, but it did not converge even with a larger tolerance. We therefore used the PyTorch framework (without any neural network) for its parallel, GPU-based gradient optimization. We constructed the computational graph from the equation above, randomly initialized 1024 guesses, and ran the Adam optimizer for 300 epochs until the loss converged. Afterward, we took the best-performing single fit among the 1024 candidates as the final answer for that magnitude spectrum. Due to time constraints (300 epochs with 1024 initializations for each test case), we evaluated 500 test cases instead of the full dataset used by the other methods.
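The amplitude-recovery stage can be sketched as below (a minimal CPU illustration with made-up sizes: a known steering matrix, 64 random initializations instead of 1024, and synthetic ground-truth amplitudes):

```python
import torch

torch.manual_seed(0)
M, p, n_init = 16, 2, 64

# Steering matrix A, assumed already estimated by the MUSIC stage
t = torch.arange(M, dtype=torch.float64)
w = torch.tensor([0.7, 1.9], dtype=torch.float64)
A = torch.exp(1j * torch.outer(t, w))                    # M-by-p, complex

# Synthetic ground-truth amplitudes and the "measured" magnitude |A s|
s_true = torch.tensor([1.0 + 0.5j, -0.8 + 0.3j], dtype=torch.complex128)
mag_gt = (A @ s_true).abs()

# Optimize n_init random guesses in parallel: min_s || |A s| - |x|_gt ||^2
s_re = torch.randn(n_init, p, dtype=torch.float64, requires_grad=True)
s_im = torch.randn(n_init, p, dtype=torch.float64, requires_grad=True)
opt = torch.optim.Adam([s_re, s_im], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    s = torch.complex(s_re, s_im)                        # n_init-by-p candidates
    per_candidate = (((s @ A.T).abs() - mag_gt) ** 2).sum(dim=1)
    per_candidate.sum().backward()
    opt.step()

with torch.no_grad():
    s = torch.complex(s_re, s_im)
    loss_final = (((s @ A.T).abs() - mag_gt) ** 2).sum(dim=1)
print(float(loss_final.min()) < 0.1)                     # best of 64 fits closely
```

Optimizing the real and imaginary parts as separate real tensors sidesteps complex-parameter support in the optimizer while keeping the magnitude loss exactly as in the equation above.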
a.7 Details on metamaterials dataset
Metamaterial absorber:
The input parameters include the four geometric dimensions of the top metal layer, the material properties of a top dielectric layer, the conductivity of the metal, and the thicknesses of all layers. The output is the complex reflection coefficient, with 1000 frequency points for each of the real and imaginary parts, evaluated from 0–200 GHz.
Membrane Metasurface:
The metasurface supercell geometry has a fixed periodicity and includes four elliptical SiC resonators of consistent height. Each elliptical resonator has a major-axis radius, a minor-axis radius, and a rotational angle. The fourteen geometry parameters define the inputs of the computational simulations and yield complex reflection and transmission responses. This ADM dataset focuses on the frequency-dependent reflection and transmission in the range 100–500 THz.
Synthetic II: Lorentzian Response Function
The second synthetic dataset is generated from a causal model of a nonmagnetic material, with a frequency-dependent complex dielectric permittivity defined as a sum of Lorentzian oscillators:

\epsilon(\omega) = \epsilon_\infty + \sum_{j} \frac{\omega_{p,j}^2}{\omega_{0,j}^2 - \omega^2 - i\gamma_j\omega}.
A simple Lorentzian model of the dielectric function is applicable to a wide range of resonant material systems whose response functions must obey the Kramers–Kronig relations. Using the transfer-matrix method (Markos2008), the complex transmission coefficient can be computed directly from the dielectric function via the closed-form equation:

t(\omega) = \frac{4 n e^{i n \omega d / c}}{(1 + n)^2 - (1 - n)^2 e^{2 i n \omega d / c}}, \qquad n = \sqrt{\epsilon(\omega)},

where the slab thickness d, the speed of light c, and \epsilon_\infty are predefined physical constants.
We generate a dataset of complex transmission coefficients by uniformly sampling the Lorentzian oscillator parameters over ranges optimized for the frequency range 100–500 THz. We consider a fixed set of four Lorentzian oscillators, and generate 50 sample transmission spectra for training and 2000 spectra for the test dataset.
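The dataset-generation step can be sketched as follows (a minimal numpy illustration using the standard single-slab Airy transmission formula; the oscillator parameters, slab thickness, and background permittivity are made up and do not reproduce the paper's sampling ranges):

```python
import numpy as np

def epsilon_lorentz(omega, eps_inf, oscillators):
    """Permittivity as a sum of Lorentzian oscillators:
    eps(w) = eps_inf + sum_j wp_j^2 / (w0_j^2 - w^2 - 1j * g_j * w)."""
    eps = np.full_like(omega, eps_inf, dtype=complex)
    for wp, w0, g in oscillators:
        eps += wp**2 / (w0**2 - omega**2 - 1j * g * omega)
    return eps

def slab_transmission(omega, eps, d, c=3.0e8):
    """Normal-incidence transmission coefficient of a slab of thickness d
    in vacuum (standard Airy summation, equivalent to the transfer-matrix
    result for a single layer)."""
    n = np.sqrt(eps)
    phase = n * omega * d / c
    return 4 * n * np.exp(1j * phase) / ((1 + n) ** 2 - (1 - n) ** 2 * np.exp(2j * phase))

# Illustrative parameters: four oscillators in the 100-500 THz window
omega = 2 * np.pi * np.linspace(100e12, 500e12, 1000)      # rad/s
oscillators = [(2 * np.pi * 50e12, 2 * np.pi * f0, 2 * np.pi * 5e12)
               for f0 in (150e12, 250e12, 350e12, 450e12)]
t = slab_transmission(omega, epsilon_lorentz(omega, 2.25, oscillators), d=1.0e-6)
print(t.shape)  # (1000,): one complex transmission spectrum
```

Sampling the oscillator strengths, center frequencies, and damping rates over chosen ranges and repeating this computation yields the training and test spectra.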