1 Introduction
The retrieval of phase from intensity is one of the most important and most challenging problems in Classical Optics. The utility of phase retrieval stems from the fact that it allows the shape of transparent objects, biological cells in particular, to be quantified in two and three spatial dimensions using visible light [1, 2]. In the xray band, quantitative phase imaging is also useful because phase contrast in tissue is orders of magnitude higher than attenuation contrast [3, 4]. The same argument can be made for identification of liquids [5] and semiconductor materials for integrated circuit characterization and inspection [6].
There are two wellknown challenges in phase retrieval. Firstly, for the phase of the optical field to be welldefined, the illumination needs to be temporally and spatially coherent to a fairly good approximation; this is especially difficult with xrays. One way to relax this requirement is to acknowledge that it is usually the phase delay through the object that is of interest, and seldom the phase of the optical field itself. The former can be obtained even from partially coherent light [7]
but requires the correlation function (mutual intensity) to be measured, which is often problematic because of low contrast. The second challenge is that, since only the intensity of a light beam is observable, the phase may only be inferred indirectly from intensity measurements. Computational approaches to this operation may be classified as interferometric/holographic
[8, 9], where a reference beam is provided; and noninterferometric, or referenceless such as direct/iterative [10, 11] and ptychographic [12, 13], which are both nonlinear, and transportbased [14, 15], where the problem is linearized through a hydrodynamic approximation. Direct methods attempt to retrieve the phase from a single raw intensity image, whereas the transport and ptychographic methods implement axial and lateral scanning, respectively. What referenceless methods have in common is the need to obtain intensity measurements at some distance away from the conjugate plane of the object, i.e. with a small defocus. Direct measurement with defocus is the approach we take here.All computational phase retrieval approaches, interferometric and noninterferometric, involve the solution of a nonlinear and highly illposed inverse problem. For direct phase imaging, which is a nonlinear problem—see Section 2.2.1—the classical GerchbergSaxtonFienup (GSF) algorithm [10, 11, 16] and its variants [17]
are widely used. The main idea is to start with a random estimate for the unknown phase distribution and then to iteratively update it until the modulussquared of its Fourier (or Fresnel) transform matches the observed intensity. For wellbehaved phase fields, the iteration usually converges to the correct phase
[18, 19]. Alternatively, the WienerTikhonov functional minimization approach, described in Section 2.2.2, exploits prior knowledge about the class of phase objects being imaged to combat noise artifacts. A modern and popular implementation is compressive imaging [20, 21, 22, 23, 24] that utilizes sparsity priors, valid when the object is expressed in the appropriate set of basis functions. If the sparsifying set is unknown, it can be learned from examples according to the SVD method or other dictionary approaches [25, 26, 27, 28].In 2010, [29]
proposed a deep neural network in a recurrent scheme to learn the prior from examples, as an alternative to the dictionary approach. Subsequently, generatordiscriminator competition
[30] was adopted as a more secure means of learning the statistical relationship between valid objects and their forward models; and the recursion was unfolded into a cascade for better numerical stability [31]. In these schemes, the physical model of the measurement is taken explicitly into account as a projection operator applied to the reconstruction estimate repeatedly at each recursion or cascade stage. This generalization of dictionaries to deep learning has been successful in a number of linear inverse problems, most notably superresolution [32, 33, 34] and tomography [35, 36].Recently, deep learning regression has been investigated for application to nonlinear inverse problems, in particular phase retrieval: direct [37, 38, 39], holographic [40, 41], and ptychographic [42, 43]. The idea, described briefly in Section 2.2.2, is to train a deep neural network (DNN) in supervised mode from examples of phase objects and their intensity images so that, after training, given an intensity image as input, the DNN outputs an estimate of the phase object. In this case, the physical model is either learned implicitly together with the prior from the examples [37, 38]; or incorporated as a preprocessor (“Approximant”) [39, 40, 41]. An interesting alternative method for the inverse problem, also nonlinear, of reconstructing the threedimensional (3D) refractive index distribution from intensity projections, is to define the DNN architecture according to the strong scattering model and store the refractive index values as weights of the DNN after training [44]. This “indexstoring” DNN itself was subsequently used as Approximant to a traditional DNN for improving the estimates in 3D distributions with exceptionally small and highcontrast features or when the range of available angles of projection is severely limited [45]. Extensive reviews of deep learning use for inverse problems can be found in [36, 46, 47].
Here, we propose a new DNNbased computational architecture for phase retrieval with the unique feature of processing low and high spatial frequency bands as separate channels with two corresponding DNNs trained from an original object database and a highpass filtered version of the database, respectively. Subsequently, the outputs of the two channels are recombined using a third DNN also specifically trained for this task. The motivation for this new approach is an earlier observation [38] that nonlinearities in DNN training and execution algorithms tend to amplify imbalances in the spatial frequency content of the training database and in the way different spatial frequencies are treated as they propagate through the physical optical system; this amplified imbalance typically results in the lower spatial frequencies becoming dominant and ultimately limiting resolution of fine spatial features in the reconstructions. A more detailed overview of this phenomenon can be found in Section 2.2.3. Because the essential feature of our new proposed technique is the synthesis of the two spatial bands through a trained DNN, we refer to it as “learning to synthesize” (LS).
Splitting the spatial frequency content into several bands and processing the bands separately has a long history in signal processing [48, 49, 50, 51, 52, 53, 54, 55]. For image reconstruction, dual band processing has been used in fluorescence microscopy [56, 57, 58] and phase retrieval [59]. However, these cases, unlike ours, required structured illumination. In the context of learningbased inversion, a dual channel method has been tried for superresolution [60] (to be understood as upsampling) albeit the two processed channels were combined as a simple convex sum to form the final image. By contrast, the LS method presented here uses a learnednonlinear filter, implemented as a third DNN trained to optimally recombine the two channels according to the spectral properties of the class of objects that the training database represents.
In addition to requiring a single raw image to retrieve the phase through a learned recombination of the spectral channels, the LS method presented here has the desirable property of resilience to noise, especially in the case of weak photon flux down to a single photon per pixel. We achieved that by using an Approximant filter [39] to preprocess the raw image before submitting it to the two spectral channels. The Approximant produces an inverse estimate that expressly uses the physical model (a single iteration of the GSF algorithm in [39] and here). For very noisy inputs, the Approximant is of very poor quality; yet, if the subsequent learning architecture is trained with this lowquality estimate as input, the final reconstruction results are significantly improved. The LS method with Approximant, as presented here, drastically improves over [39], especially in the reconstruction of fine detail, as [39] did not use separate spectral channels to rebalance frequency content.
The detailed implementation of the LS method is described in Section 3, where we also show how to optionally include the Approximant as initial estimate for input to the LS scheme. Results from experiments are presented in Section 4 with detailed characterization of the LS method’s behavior under different noise conditions. Conclusions and suggestions for future work are in Section 5.
2 Problem formulation
2.1 Phase retrieval
Let
denote the complex transmittance of an optically thin object of modulus response and phase response , and let denote the coherent incident field of wavelength on the object plane. The noiseless intensity measurement (also referred to as noiseless raw image) is carried out on the detector plane located at a distance away from the object plane, and can be written as
(1) 
where denotes the Fresnel (paraxial) propagation operator for distance , i.e. the convolution
(2) 
and is the (nonlinear) noiseless forward operator. Alternatively, may be expressed in the spatial frequency domain as
(3) 
where
denotes the 2D (spatial) Fourier transform operator and
its inverse.We are interested in weakly absorbing objects, i.e. we assume . In all the experiments described here, the illumination is also a normally incident plane wave . Therefore, to a good approximation, we may write
(4) 
This is what we refer to as the direct phase retrieval problem, which GerchbergSaxton and related algorithms solve iteratively [10, 16].
In practice, the measurement is subject to Poisson statistics due to the quantum nature of light; and to Gaussian thermal noise added by the photoelectric conversion process. We express the noisy measurement as
(5) 
where
denotes a Poisson random variable with mean
anda Gaussian random variable with zero mean and variance
. The photon flux in photons per pixel per frame is denoted as and the spatial average of the noiseless raw image in the denominator is necessary as normalization factor. The noisy forward operator is and the purpose of phase retrieval is to invert so as to recover as accurately as possible, despite the nonlinearity and randomness present in the measurements.2.2 Solution of the inverse problem
The WienerTikhonov approach to solving inverse problems of the form is to obtain the estimate of the inverse as
(6) 
Here, is the fitness (or datafidelity) term and is a distance operator. Under the assumption of solely additive Gaussian noise, it is appropriate to use the norm
(7) 
Somewhat less rigorously, but conveniently for mathematical manipulations, the same
metric is used for the fitness term even for measurements strongly subject to nonGaussian statistics, as we consider here. The topic of appropriate fitness metrics is beyond the scope of this paper; in any case, when machine learning is used to approximate (
6), the dilemma of choosing a metric shifts to the loss function for training a deep neural network. We will address this latter problem in some detail in Section
3.3.1.The second term in (6) is the regularizer, or prior knowledge term. Its purpose is to compete with the fitness term in the minimization so as to mitigate illposedness in the solution. That is, the regularizer penalizes solutions that are promoted by the noise in the forward problem, as in (5) for example, but does not meet general criteria known a priori for valid objects. The prior may be defined explicitly, e.g. as a minimum energy [61, 62] or sparsity [20, 21, 22, 23, 24] criterion; or learned from examples as a dictionary [25, 26, 27, 28] or through a deep learning scheme [29, 31, 32, 33, 34, 36, 35, 37, 38, 39, 40, 41, 42, 43].
Here, as in earlier works on direct phase retrieval [37, 38, 39, 40, 41, 42, 43], and due to the nonlinearity of the forward model, we adopt the EndtoEnd and Approximant methods. These we denote as
EndtoEnd:  (8)  
Approximant:  (9) 
where is the output of a deep neural network. In the EndtoEnd approach, the burden is on the DNN to learn from examples both the forward operator and the prior so as to execute, in one shot, an approximation to the ideal solution (6). Training takes place in supervised mode, with known pairs of phase objects and their raw intensity images generated on a phase spatial light modulator (SLM) and measured on a digital camera, respectively. Note that training is generally very slow, taking several hours or days if a few thousand examples are used. However, after training is complete, the execution of (8) or (9) is very fast as it only requires forward (noniterative) computations. This is one significant advantage over the standard way of minimizing the WienerTikhonov functional (6) iteratively for each image.
When the inverse problem becomes severely illposed or the noise is extremely strong, the learning burden on the DNN becomes too high; then, generally, better results are obtained by training the DNN to receive as input the Approximant instead of the raw measurement directly. The Approximant is obtained through an approximate inversion of the forward operator; for example, in [40]
it was implemented as a digital holographic backpropagation algorithm, whereas in
[39] it was the outcome of a single iteration of the GerchbergSaxton algorithm [10]. While these Approximants generally do not look very good, especially in highly noisy situations [39], through training the DNN is able to learn a better association of with its corresponding true object than it can learn with the noisy raw measurement .2.3 Spectral properties of training
The design of deep neural networks is an active field of research and a comprehensive review of methods and caveats is well beyond the scope of this paper. We refer the reader to [36, 46, 47] for more extensive background and references. Here, we discuss the influence on the quality of training of the spatial power spectral density of the database where example are drawn from.
In both EndtoEnd and Approximant methods (89
) the examples determine the object class prior to be learned by the DNN. For instance, if we train on examples drawn from the ImageNet database
[63] of natural objects, the prior learned is weaker than if we train on the MNIST [64] database of handwritten digits or a database of integrated circuit (IC) segments because both latter shapes are more restricted than natural images and with stronger spatial correlations across. Recent work [39, 65] has shown that, unsurprisingly, training with stronger priors results in more robust reconstructions as long as the test objects are drawn from the same class.In [38], we addressed the influence of the spatial power spectral density (PSD) of the example database on the quality of training. It is well known [66, 67, 68, 69, 70, 71] that two dimensional (2D) images of natural objects, such as those contained in ImageNet[63], follow the inverse quadratic PSD law
(10) 
Other types of object classes of practical interest exhibit similar powerlaw decay, perhaps with slightly different exponents. This means that, if a neural network is trained on such an object class, higher spatial frequencies are presented less frequently to the DNN during the training stage. At face value, this is as it should be, since the relative popularity of different spatial frequencies in the database is precisely one of the priors that the DNN ideally should learn.
This understanding needs to be modified in the context of inverse problems because the representation of high spatial frequencies in the raw images is also uneven—typically, to the high spatial frequencies’ disadvantage. In the specific case of phase retrieval, higher spatial frequencies within the spatial bandwidth (as determined by the numerical aperture NA) have uniform transmission modulus but are more severely scrambled by the chirped oscillations of the transfer function (3). Thus, higher spatial frequencies suffer a double penalty [38]: their recovery becomes more sensitive to noise due to the scrambling; and they are less popular due to the inversesquare (or similar) PSD law so they are presented less frequently to the DNN’s training process. Moreover, since the DNN itself and its training routine are both highly nonlinear, there is an acute risk that any unevenness in the treatment of different spatial frequency bands may be amplified in the final result, eventually causing the lower frequencies to dominate. Ref. [38] attributed the inability of the Phase Extraction Neural Network (PhENN) [37] to resolve spatial features well within its admitted spatial bandwidth to this unequal treatment of spatial frequencies. They showed that PhENN’s resolution is approximately doubled by prefiltering the training examples as to flatten their PSD. That is, during the training, each example from the database was replaced with its filtered version
(11) 
The transfer function was defined as the highpass filter
(12) 
exactly compensating for the inversequadratic dependence (10) and flattening the spectrum. Raw images for training were correspondingly filtered as
whereas, during the test, the unfiltered measurements (i.e., as received from the camera) were used to obtain the reconstructions. Unfortunately, with this implementation amplification of high spatial frequency features, especially of artifacts caused even by weak noise, was also evident in the reconstructions. This is not surprising, since technically (11) trades off violating the prior in return for finer spatial resolution. The LS approach that we describe in the next section is meant to fix this problem.
3 Methodology of Learning to Synthesize
3.1 Description of the LSDNN system
In [72], we proposed a twostep approach to tackle the difficulty of restoring high frequency components without introducing significant artifacts and distortions. Here we describe the two steps for training and execution in detail, as well as the design of the DNNs involved in the LS system. For unity in notation, we denote the input to the entire LS system as , to be understood as
(13) 
We will discuss the Approximant implementation in more detail in section 3.2 below.
The two training steps are shown in blockdiagram form in Figure 1. The first step consists of training two separate DNNs in parallel, as follows:

DNNL is trained to match unfiltered patterns at its input with the corresponding unfiltered example phase patterns as ground truth at its output (the superscript enumerates the examples).

DNNH is trained to match unfiltered patterns at its input to corresponding spectrally filtered versions of the ground truth examples at its output. The filter’s transfer function is chosen more generally than (12) as
(14) We have investigated implementations with spanning a broad range, as we discuss in more detail below.
The output of DNNL for a general test input is denoted as . Assuming similar training conditions, matches the output of PhENN as presented in [37] in the EndtoEnd scheme or [39] in the Approximant scheme; that is, is expected to be fairly accurate at low spatial frequencies but missing fine detail.
The output of DNNH is denoted as . Note that [38] required spatial prefiltering the raw inputs ; here, we do not spatially prefilter the input (i.e., or according to whether the EndtoEnd or Approximant scheme is used). We instead train DNNH to produce the filtered output based on an unfiltered input. This leads to better generalization, because DNNH is trained on the broadest set of possible images (whereas the training in [38] was on highfrequency containing images only). Moreover, using unfiltered inputs to DNNH allows for the training process to be parallelized for better efficiency.
Depending on the value of , the PSD of the patterns training DNNH will be flat or almost flat. The output of DNNH is expected to have better fidelity at fine spatial features of the phase objects. However, spectral flattening may also generate artifacts due to overlearning the high spatial frequencies. Therefore, looks rather like a highpass filtered version of the true object , which we found to be more beneficial for subsequent use in the LS scheme.
The second training step consists of combining the two partially accurate reconstructions and into a final estimate with uniform quality at all spatial frequencies, low and high up to the passband. To this end, we train the synthesizer DNNS to receive and as inputs, and use the unfiltered examples as the output. To avoid any further damage to the highspatial frequency content in , we bypass and present it intact to the last layer of DNNS. By operating on alone, DNNS learns how to treat the low frequency reconstruction so as to compensate for artifacts at all bands. The use of the synthesizer DNNS also makes our results less sensitive to the choice of power in the transfer function (14). We found that can produce reconstructions of approximately equal quality, as will be presented in section 4.
After DNNL, DNNH and DNNS have been trained, they are combined in the LS system and operated as shown in Figure 1(b). The input is passed to DNNL and DNNH in parallel fashion, and the respective outputs and are passed to DNNS, which produces the final estimate . It is worth noting that it is not valid to lump the three networks in Figure 1(b) into a single network, due to their separate training schemes described above.
There is a wide variety of DNN structures one may choose to implement DNNL, H and S. In this work, we use for DNNL the same architecture in [39], a residual Unet architecture with skip connections [73]. For simplicity, DNNH and DNNS, are also chosen to be structures similar to DNNL. The details of the implementations are in the Supplementary Material. We made these choices to enable specific and fair comparisons with the earlier works; alternative architectures are certainly possible within the LS scheme, though we judged a full exploration to be outside the scope of the present paper.
Training a neural network is typically implemented as a stochastic optimization procedure [74, 75] where the neural network’s internal coefficients (weights) are adjusted so as to minimize a metric of the distance between the actual output of the neural network and its desired output to a given input (training example). This distance is called training loss function (TLF). In the context of training to solve an inverse problem, the TLF is defined as
(15) 
where the superscript is again used to enumerate the examples, and the dilemma of choosing the appropriate metric operator emerges.
It is generally accepted [76, 77, 78, 79, 80, 37] that the metric (also referred to as mean square error, MSE) is a poor choice that does not generalize well; i.e., deep neural networks trained with MSE do not perform well when presented with examples outside their training set. For image classification tasks, and in some early work on phase retrieval [37], the metric (mean absolute error, MAE) was used instead. In direct analogy with compressive sensing, the metric promotes sparsity in the internal connectivity of the neural network, and that leads to better generalization. However, [65] found that in highly illposed problems this benefit is eclipsed by the inability of MAE and pixelwise metrics more generally to learn spatial structure priors about the object class that are crucial for regularization.
In this paper, we train DNNL, H and S using the Negative Pearson Correlation Coefficient (NPCC) [81, 82, 65] as the TLF. This is defined as
(16) 
where and are the spatial averages of the generic functions , . If and are uncorrelated, the expected value of is zero, whereas if they are identical then . Thus, training the neural network minimizes the TLF toward , where is the number of training examples.
The NPCC has been shown [83] to be more effective in recovering fine features than conventional loss functions such as the Mean Square Error (MSE), Mean Absolute Error (MAE) and the Structural Similarity (SSIM) index [84, 85]. However, the NPCC has the disadvantage that it is invariant to affine transformations to its arguments, i.e.
(17) 
for arbitrary real numbers , , , . For quantitative phase retrieval, where the scale of phase difference matters, the affine ambiguity is resolved with a histogram equalization step after inversion [38].
3.2 Computation of the Approximant
It has been shown that, even under extreme noise conditions, just a single iteration of the GerchbergSaxton (GS) algorithm suffices as Approximant in scheme (9) for phase retrieval [39]. We elected to use the same approach here for the LSDNN architecture. More recently, a comparative study [86] showed that higher iterates or regularized versions of GS do improve the appearance of the Approximant result but do not yield significant improvement in the end output of the DNN. Similar conclusions hold for alternatives to GS, e.g. Gradient Descent. While these alternative schemes are interesting for the LSDNN method, we chose to not pursue them here.
The general form of the GS iterate from the iterate is
(18) 
where we have taken into account that . Accordingly, our Approximant is
(19) 
where denotes the function that is uniformly equal to one within the frame [86].
Figure 2 compares the 2D (logscale) Fourier spectrum magnitude of a ground truth image (from ImageNet [63]), Approximant (19) computed without noise, and Approximant (19) computed from an input subject to Poisson statistics corresponding to average flux of one photon per pixel. We can see that although the single photon Approximant (which we will later use as the input to the LSDNN) has a large support in its spectrum, it is the noise that dominates the midtohigh frequency range. Therefore, the learning process still bears the burden of restoring the correct high frequency contents and relying heavily on high frequency priors, as our DNNH does, is justified.
4 Results
4.1 Experimental Apparatus and Data Acquisition
In each experiment carried out to train and test different LSDNN schemes, 10,000 image objects from ImageNet [63] were successively projected on a phase SLM as phase objects (i.e., with phase value at each pixel proportional to the intensity of the corresponding pixel in the original from the database) and their raw images were recorded by the EMCCD camera at an outoffocus plane. These 10,000 groundtruth phase images and their corresponding raw intensity images were split into a training set of 9,500 images, a validation set of 450 images and a test set of 50 images. The choice of ImageNet [63] is reasonable, since the lowfrequency dominance in its spatial PSD is representative of the broader classes of objects of interest, and therefore, we anticipate our results will generalize well in practical applications.
The experiments were carried out using the apparatus described in Figure 3. The light source was a continuous wave HeliumNeon gas laser at 632.8nm. The laser beam first passed through a variable neutral density filter (VND) that served the purpose of adjusting the photon flux. The beam was then spatially filtered and expanded into a 18mm diameter collimated pencil and sent onto a transmission spatial light modulator (SLM) with pixels, each of size . Phase objects were projected on the SLM and imaged by a telescope (4F system) consisting of lenses L1 (focal length 230mm) and L2 (100mm). The reduction factor in the 4F system was designed to reduce the spatial extent of the defocused raw image to approximately fit the size of the camera. An aperture was placed in the Fourier plane to suppress higher diffraction orders due to the periodicity of the SLM pixels. The raw intensity images were captured by a QImaging EMCCD camera with squareshaped pixels of size 88m placed at a distance from the image plane of the 4F system. Additional details about the implementation of the optical apparatus and its numerical simulation with digital Fresnel transforms are in Supplementary Material.
The photon flux is quantified as the number of photons received by each pixel on average for an unmodulated beam, i.e. with no phase modulation driving the SLM. During an initial calibration procedure, for different positions of the VND filter, the photon level is measured using a calibrated silicon photodetector placed at the position of the camera. The quoted photon count is also corrected for the quantum efficiency of the CCD (60% at ), meaning that we refer to the number of photons actually detected and not the incident number of photons.
Here, we report results for two levels of photon flux % and %, respectively quoted in the text as “10” and “1” photons. The data acquisition, training and testing procedures of the entire LSDNN architecture were repeated separately for each value of .
4.2 Reconstructions and Analysis
Figure 4 shows the reconstructions obtained by the LSDNN method () and its components at fluxes 1 photon and 10 photons per pixel, as defined immediately above. As expected, the reconstructions by DNNL have good fidelity at low spatial frequencies but lose fine details, as in [39]; whereas the reconstructions by DNNH look like highpass filtered versions of the true objects with some additional highfrequency artifacts due to the noise. The reconstructions by DNNS preserve detail at both low and high frequencies, while significantly attenuating the artifacts. The improvement of over is more pronounced under severe noise, i.e. in the photon/pixel case. More examples of reconstructions (obtained with ) for the noisier case () are given in the Supplementary Material.
In Figures 5 and 6, we compare reconstructions by LSDNN with different values of the prefiltering parameter for photon and photons per pixel, respectively. The most detail at high frequencies in the DNNS output is preserved in the range . At lower values of , the quality of reconstructions by DNNS does not noticeably exceed that of DNNL. This is expected, since in the limit , training DNNH becomes identical to DNNL. On the other hand, for values , the DNNH output is dominated by highfrequency artifacts and again the quality of DNNS reconstructions regresses to that of DNNL, since the highfrequency channel is no longer contributing. These observations are valid for both values of , and even stronger for the most severely noiselimited case .
Similar trends are evident according to various quantitative metrics averaged over the entire set of test examples compared to the true phase signals , summarized in Table 1. For comparisons we used the PCC, defined as in (16) but without the minus sign; peak signaltonoise ratio (PSNR) [87]; and structural similarity index metric (SSIM) [84, 85]. As we noted in Section 3.3.1, DNNH is trained with spectrally prefiltered version of the true object so a quantitative comparison of its output with the ground truth does not make sense.
It is noteworthy that in both visualization and quantitative comparisons of Figures Figures 56 and Table 1, respectively, the performance of DNNS remains approximately the same within the range . This is desirable as it suggests that one need not prefilter exactly with the inverse of the PSD power law. This further suggests that for datasets that do not represent natural images and may obey power laws different than (10), not knowing the exact value of may not be catastrophic. We have not tested this hypothesis exhaustively, as it is beyond the scope of this paper.
In Table 2 and in Supplementary Material, we also analyze the case of a bigger DNN (denoted DNNL3) with computational capacity equal to the sum of DNNL, H and S together, but trained with unfiltered examples, and show that DNNL3 cannot achieve reconstructions of equal quality. Therefore, the improvements over [39] resulted from the training procedure followed in the LSDNN method and not simply by brute force due to the use of larger computational capacity.
Average PCC std.dev  Average PSNR std.dev (dB)  Average SSIM std.dev  
Approximant  0.149 0.067  0.181 0.082  8.082 3.979  8.098 3.986  0.220 0.105  0.222 0.106 
DNNL output  0.808 0.099  0.880 0.059  16.207 2.466  18.299 2.536  0.875 0.071  0.925 0.039 
DNNS output ()  0.848 0.075  0.892 0.053  17.511 2.317  18.426 2.321  0.905 0.054  0.929 0.048 
DNNS output ()  0.858 0.074  0.898 0.050  17.758 2.523  18.469 2.469  0.903 0.056  0.926 0.036 
DNNS output ()  0.859 0.071  0.896 0.050  17.769 2.420  18.487 2.313  0.912 0.048  0.930 0.043 
DNNS output ()  0.865 0.067  0.897 0.050  18.116 2.319  18.634 2.380  0.915 0.061  0.934 0.034 
DNNS output ()  0.866 0.069  0.899 0.048  17.940 2.281  18.902 2.443  0.915 0.051  0.937 0.034 
DNNS output ()  0.860 0.071  0.898 0.050  17.880 2.455  18.940 2.320  0.915 0.062  0.935 0.038 
DNNS output ()  0.859 0.088  0.892 0.051  17.827 2.140  18.487 2.311  0.911 0.070  0.927 0.047 
DNNS output ()  0.845 0.081  0.887 0.050  17.375 2.221  18.445 2.470  0.902 0.059  0.934 0.041 
DNNS output ()  0.825 0.111  0.883 0.052  16.991 2.008  18.484 2.363  0.894 0.075  0.934 0.042 
DNNS output ()  0.819 0.077  0.882 0.055  16.820 2.254  18.391 2.600  0.893 0.058  0.933 0.037 
Average PCC std.dev  Average PSNR std.dev (dB)  Average SSIM std.dev  

Approximant  0.149 0.067  0.181 0.082  8.082 3.979  8.098 3.986  0.220 0.105  0.222 0.106 
DNNL output  0.808 0.099  0.880 0.059  16.207 2.466  18.299 2.536  0.875 0.071  0.925 0.039 
DNNL3 output  0.811 0.125  0.880 0.117  16.209 2.522  18.268 2.030  0.835 0.092  0.926 0.073 
DNNS output ()  0.866 0.069  0.899 0.048  17.940 2.281  18.902 2.443  0.915 0.051  0.937 0.034 
To further study the behavior of the LS components in the low and high spatial frequency bands, we studied the reconstructions in the Fourier domain. Figure 7 shows the spectra (2D Fourier transforms) of two randomly selected test examples. Figure 8 and Figure 5 in Supplementary Material show normalized diagonal crosssections of the PSD averaged over all the 50 test images, for and photons per pixel, respectively. These plots illustrate that DNNL and DNNH’s outputs are depleted at the high and low frequencies, respectively, with the losses being more severe in the noisy case ; whereas DNNS’s output mostly recovers the frequency content at both bands, albeit still with some minor loss at high frequencies.
Lastly, we experimentally characterized the spatial resolution of the LSDNN reconstructions, i.e. the ability of DNNS to resolve two pixels at nearby locations having phase delay higher than the rest of the signal. Similar analyses were carried out in [65, 38], where the methodology was also described in detail. In the work presented here, we carried out the analysis under ample illumination, i.e. not under strong Poisson statistics. We made that choice because spatial resolution under highly noisy conditions becomes nontrivially coupled to the noise statistics, and a complete investigation would have been outside the scope of the present investigation. The results are shown in section 6 in Supplementary Material.
5 Concluding Remarks
The LSDNN reconstruction scheme [72] for quantitative phase retrieval has been shown to be resilient to highly noisy raw intensity inputs while preserving high spatial frequency detail better than [39]. The reconstructions are also quite robust to variations of the prefiltering power law around the value following from natural image statistics. Beyond the scope of the work reported here, further improvements may be obtained through modifying the architecture of the DNNs used to process and recombine the two spatial frequency bands. It is also possible that for highly restricted datasets, needs to be investigated further. However, our observations for natural images suggests that the LSDNN approach is relatively insensitive to under fairly wide conditions.
Another obvious alternative strategy is to split the signals into more than two bands, then process and recombine these multiple bands with a synthesizer DNN, according to the LS scheme. While we did not investigate this approach in detail here, we expect it to present a tradeoff between the improvements and the complexity of having to train multiple neural networks implying the need for more examples and the danger of poor generalization.
Acknowledgments
This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) grant No. FA865017C9113 and by the SenseTime company. I. Kang was supported in part by KFAS (Korea Foundation for Advanced Studies) scholarship. We are grateful to Kwabena Arthur for useful discussions and critique of the manuscript.
References
 [1] P. Marquet, B. Rappaz, P. J. Magistretti, E. Cuche, Y. Emery, T. Colomb, and C. Depeursinge, “Digital holographic microscopy: a noninvasive contrast imaging technique allowing quantitative visualization of living cells with subwavelength axial accuracy,” Opt. Lett. 30, 468–470 (2005).
 [2] G. Popescu, T. Ikeda, R. R. Dasari, and M. S. Feld, “Diffraction phase microscopy for quantifying cell structure and dynamics,” Optics letters 31, 775–777 (2006).
 [3] S. C. Mayo, T. J. Davis, T. E. Gureyev, P. R. Miller, D. Paganin, A. Pogany, A. W. Stevenson, and S. W. Wilkins, “Xray phasecontrast microscopy and microtomography,” Opt. Express 11, 2289–2302 (2003).
 [4] F. Pfeiffer, T. Weitkamp, O. Bunk, and C. David, “Phase retrieval and differential phasecontrast imaging with lowbrilliance Xray sources,” Nature Physics 2, 258–261 (2006).
 [5] A. Pan, Ling Xu, J. C. Petruccelli, R. Gupta, B. Singh, and G. Barbastathis, “Contrast enhancement in xray phase contrast tomography,” Optics express 22, 18020–18026 (2014).
 [6] M. Holler, M. GuizarSicairos, E. H. R. Tsai, R. Dinapoli, O. Bunk, J. Raabe, and G. Aeppli, “Highresolution nondesctructive threedimensional analysis of integrated circuits,” Nature 543, 402–406 (2017).
 [7] J. C. Petruccelli, L. Tian, and G. Barbastathis, “The transport of intensity equation for optical path length recovery using partially coherent illumination,” Opt. Exp. 21, 14430 (2013).
 [8] J. W. Goodman and R. Lawrence, “Digital image formation from electronically detected holograms,” Applied physics letters 11, 77–79 (1967).
 [9] K. Creath, “Phaseshifting speckle interferometry,” Applied Optics 24, 3053–3058 (1985).
 [10] R. W. Gerchberg and W. O. Saxton, “A practical algorithm for the determination of the phase from image and diffraction plane pictures,” Optik 35, 237–250 (1972).
 [11] J. R. Fienup, “Reconstruction of an object from the modulus of its Fourier transform,” Optics letters 3, 27–29 (1978).
 [12] G. Zheng, R. Horstmeyer, and C. Yang, “Widefield, highresolution fourier ptychographic microscopy,” Nature photonics 7, 739 (2013).
 [13] L. Tian, X. Li, K. Ramchandran, and L. Waller, “Multiplexed coded illumination for fourier ptychography with an led array microscope,” Biomedical optics express 5, 2376–2389 (2014).
 [14] M. R. Teague, “Deterministic phase retrieval: a Green’s function solution,” JOSA 73, 1434–1441 (1983).
 [15] N. Streibl, “Phase imaging by the transportequation of intensity,” Opt. Commun. 49, 6–10 (1984).
 [16] J. R. Fienup, “Phase retrieval algorithms – a comparison,” Applied Optics 21, 2758–2769 (1982).
 [17] H. H. Bauschke, P. L. Combettes, and D. R. Luke, “Phase retrieval, error reduction algorithm, and fienup variants: a view from convex optimization,” JOSA A 19, 1334–1345 (2002).
 [18] R. W. Gerchberg, “The lock problem in the GerchbergSaxton algorithm for phase retrieval,” Optik 74, 91–93 (1986).
 [19] J. Fienup and C. Wackerman, “Phaseretrieval stagnation problems and solutions,” JOSA A 3, 1897–1907 (1986).

[20]
E. Candès and T. Tao, “Decoding by linear programming,” IEEE Trans. Inform. Theory
51, 4203–4215 (2005).  [21] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on information theory 52, 489–509 (2006).
 [22] D. L. Donoho, “Compressed sensing,” IEEE Transactions on information theory 52, 1289–1306 (2006).
 [23] E. Candès, J. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Comm. Pure Appl. Math. 59, 1207–1223 (2006).
 [24] Y. C. Eldar and G. Kutyniok, Compressed Sensing: Theory and Applications (Cambridge University Press, 2012).
 [25] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Transactions on Image processing 15, 3736–3745 (2006).
 [26] M. Aharon, M. Elad, and A. Bruckstein, “Ksvd: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on signal processing 54, 4311 (2006).
 [27] R. Rubinstein, A. M. Bruckstein, and M. Elad, “Dictionaries for sparse representation modeling,” Proceedings of the IEEE 98, 1045–1057 (2010).
 [28] Changlong Bao, Hui Ji, Yuhui Quan, and Zuowei Shen, “Dictionary learning and sparse coding: algorithms and convergence analysis,” IEEE Tras. Patt. Anal. Mach. Intel. 38, 1356–1369 (2016).
 [29] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the 27th International Conference on International Conference on Machine Learning, (Omnipress, 2010), pp. 399–406.
 [30] I. J. Goodfellow, J. PougetAbadie, M. Mirza, Bing Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” in Neural Information Processing Systems (NIPS), , vol. 27 (2014).
 [31] M. Mardani, H. Monajemi, V. Papyan, S. Vasanawala, D. L. Donoho, and J. M. Pauly, “Recurrent generative adversarial networks for proximal learning and automated compressive image recovery,” CoRR (2017).

[32]
C.Y. Yang, C. Ma, and M.H. Yang, “Singleimage superresolution: A benchmark,” in
European Conference on Computer Vision,
(Springer, 2014), pp. 372–386. 
[33]
Chao Dong, Chen Loy, Kaiming He, and Xiaoou Tang, “Learning a deep convolutional neural network for image superresolution,” in
European Conference on Computer Vision (ECCV) / Lecture Notes on Computer Science Part IV, , vol. 8692 (2014), pp. 184–199.  [34] Chao Dong, Chen Loy, Kaiming He, and Xiaoou Tang, “Image superresolution using deep convolutional networks,” pami 38, 295–307 (2015).
 [35] Kyong Hwan Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Trans. Image Process. 26, 4509–4522 (2017).
 [36] M. T. McCann, Kyong Hwan Jin, and M. Unser, “Convolutional neural networks for inverse problems in imaging,” IEEE Sig. Process. Mag. pp. 85–95 (2017).
 [37] A. Sinha, Justin Lee, Shuai Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica 4, 1117–1125 (2017).
 [38] Shuai Li and G. Barbastathis, “Spectral premodulation of training examples enhances the spatial resolution of the phase extraction neural network (PhENN),” Opt. Express 26, 29340–29352 (2018).
 [39] A. Goy, K. Arthur, Shuai Li, and G. Barbastathis, “Low photon count phase retrieval using deep learning,” Phys. Rev. Lett. 121, 243902 (2018).
 [40] Y. Rivenson, Yibo Zhang, H. Gunaydin, Da Teng, and A. Ozcan, “Phase recovery and holographic image reconstruction using deep learning in neural networks,” Light: Science and Applications 7, 17141 (2018).
 [41] Y. Wu, Y. Rivenson, Y. Zhang, Z. Wei, H. Günaydin, X. Lin, and A. Ozcan, “Extended depthoffield in holographic imaging using deeplearningbased autofocusing and phase recovery,” Optica 5, 704–710 (2018).
 [42] T. Nguyen, Y. Xue, Y. Li, L. Tian, and G. Nehmetallah, “Deep learning approach for fourier ptychography microscopy,” Optics express 26, 26470–26484 (2018).
 [43] Y. Xue, S. Cheng, Y. Li, and L. Tian, “Reliable deeplearningbased phase imaging with uncertainty quantification,” Optica 6, 618–629 (2019).
 [44] U. S. Kamilov, I. N. Papadopoulos, M. H. Shoreh, A. Goy, C. Vonesch, M. Unser, and D. Psaltis, “Learning approach to optical tomography,” Optica 2, 517–522 (2015).
 [45] A. Goy, G. Rughoobur, Shuai Li, K. Arthur, A. Akinwande, and G. Barbastathis, “Highresolution limitedangle phase tomography of dense layered objects using deep neural networks,” Proc. Nat. Acad. Sci. ((accepted) 2019).

[46]
YoungJu Jo, Hyungjoo Cho, Sang Yun Lee, Gunho Choi, Geon Kim, HyunSeok Min, and YongKeun Park, “Quantitative phase imaging and artificial intelligence: A review,” IEEE J. Sel. Topics in Quant. Electronics
25 (2019).  [47] G. Barbastathis, A. Ozcan, and Guohai Situ, “On the use of deep learning for computational imaging,” Optica (2019).
 [48] I. Daubechies, “Orthonormal bases of compactly supported wavelets,” Comm. Pure Appl. Math. 41, 909–996 (1988).
 [49] I. Daubechies, Ten lectures on wavelets (Soc. Ind. Appl. Math., 1992).
 [50] R. Coifman and D. L. Donoho, “Translation invariant denoising,” in Wavelets and statistics, , vol. 103 of Lecture Notes in Statistics (SpringerVerlag, 1995), pp. 120–150.
 [51] G. Strang and Truong Nguyen, Wavelets and filter banks (WellesleyCambridge Press, 1996), 2nd ed.
 [52] Raymond Chan, Tony Chan, Lixin Shen, and Zuowei Shen, “Wavelet algorithms for highresolution image reconstruction,” SIAM J. Sci. Comp. 24, 1408–1432 (2003).
 [53] I. Daubechies, M. Defrise, and C. D. Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Comm. Pure Appl. Math. 57, 1413–1457 (2004).
 [54] M. A. T. Figueiredo and R. Nowak, “An EM algorithm for waveletbased image restoration,” IEEE Trans. Image Proc. 12, 906–916 (2003).
 [55] S. Mallat, A wavelet tour of signal processing: the sparse way (Academic Press, 2008).
 [56] D. Lim, Kengye K. Chu, and J. Mertz, “Widefield fluorescence sectioning with hybrid speckle and uniformillumination microscopy,” Opt. Lett. 33, 1819–1821 (2008).
 [57] J. Mertz, “Optical sectioning microscopy with planar or structured illumination,” Nature Methods 8, 811–819 (2011).
 [58] D. Bhattacharya, V. R. Singh, Chen Zhi, P. T. C. So, P. Matsudaira, and G. Barbastathis, “Three dimensional HiLobased structured illumination for a digital scanned laser sheet microscopy (DSLM) in thick tissue imaging,” Opt. Express 20, 27337–27347 (2012).
 [59] Y. Yunhui, A. Shanker, Lei Tian, L. Waller, and G. Barbastathis, “Lownoise phase imaging by hybrid uniform and structured illumination transport of intensity equation,” Optics express 22, 26696–26711 (2014).

[60]
J. Pan, S. Liu, D. Sun, J. Zhang, Y. Liu, J. Ren, Z. Li, J. Tang, H. Lu, Y.W.
Tai et al., “Learning dual convolutional neural networks for
lowlevel vision,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
(2018), pp. 3070–3079.  [61] A. N. Tikhonov, “On the solution of illposed problems and the method of regularization,” Dokl. Acad. Nauk SSSR 151, 501–504 (1963).
 [62] A. N. Tikhonov, “On the regularization of illposed problems,” Dokl. Acad. Nauk SSSR 153, 49–52 (1963).
 [63] O. Russakovsky, Jia Deng, Hao Su, J. Krause, S. Satheesh, Sean Ma, Zhiheng Huang, A. Karpathy, A. Khosla, and M. Bernstein, “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis. 115, 211–252 (2015).
 [64] Y. LeCun, C. Cortes, and C. J. Burges, “MNIST handwritten digit database,” AT&T Labs [Online]. Available: http://yann. lecun. com/exdb/mnist 2 (2010).
 [65] Shuai Li, Mo Deng, Justin Lee, A. Sinha, and G. Barbastathis, “Imaging through glass diffusers using densely connected convolutional networks,” Optica 5, 803–813 (2018).
 [66] B. A. Olshausen and D. J. Field, “Emergence of simplecell receptive field properties by learning a sparse code for natural images,” Nature 381, 607–609 (1996).
 [67] B. A. Olshausen and D. J. Field, “Natural image statistics and efficient coding,” Network Comp. Neural Syst. 7, 333–339 (1996).
 [68] A. Van der Schaaf and J. H. van Hateren, “Modelling the power spectra of natural images: statistics and information,” Vision Research 36, 2759–2770 (1996).
 [69] B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: a strategy employed by V1?” Vision Research 37, 3311–3325 (1997).
 [70] M. S. Lewicki and B. A. Olshausen, “A probabilistic framework for the adaptation and comparison of image codes,” J. Opt. Soc. A 16, 1587–1601 (1999).
 [71] M. S. Lewicki and T. J. Sejnowski, “Learning overcomplete representations,” Neural Computation 12, 337–365 (2000).
 [72] M. Deng, S. Li, and G. Barbastathis, “Learning to synthesize: splitting and recombining low and high spatial frequencies for image recovery,” arXiv preprint arXiv:1811.07945 (2018).
 [73] O. Ronneberger, P.Fischer, and T. Brox, “Unet: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and ComputerAssisted Intervention (MICCAI), , vol. 9351 of LNCS (Springer, 2015), pp. 234–241. (available on arXiv:1505.04597 [cs.CV]).

[74]
L. Bottou, “Largescale machine learning with stochastic gradient descent,” in
Proceedings of COMPSTAT’2010, (Springer, 2010), pp. 177–186.  [75] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR abs/1412.6980 (2015).
 [76] G. E. Hinton, “Learning translation invariant recognition in massively parallel networks,” in Proceedings PARLE Conference on Parallel Architectures and Languages Europe, (SpringerVerlag, 1987), pp. 1–13.
 [77] C. M. Bishop, Pattern recognition and machine learning (Springer, 2006).
 [78] J. Johnson, A. Alahi, and Li FeiFei, “Perceptual losses for realtime style transfer and superresolution,” in European Conference on Computer Vision (ECCV) / Lecture Notes on Computer Science, , vol. 9906 B. Leide, J. Matas, N. Sebe, and M. Welling, eds. (2016), pp. 694–711.
 [79] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
 [80] C. Ledig, L. Theis, F. Huczar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Zehan Wang, and Wenshe Shi, “Photorealistic single image superresolution using a Generative Adversarial Network,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2017), pp. 4681–4690.
 [81] K. Pearson, “Contributions to the mathematical theory of evolution. note on reproductive selection,” Proc. Royal Society 59, 300–305 (1896).
 [82] J. L. Rodgers and W. A. Nicewander, “Thirteen ways to look at the correlation coefficient,” The American Statistician 42, 59–66 (1988).
 [83] S. Li, G. Barbastathis, and A. Goy, “Analysis of phaseextraction neural network (phenn) performance for lensless quantitative phase imaging,” in Quantitative Phase Imaging V, , vol. 10887 (International Society for Optics and Photonics, 2019), p. 108870T.
 [84] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The ThritySeventh Asilomar Conference on Signals, Systems & Computers, 2003, , vol. 2 (Ieee, 2003), pp. 1398–1402.
 [85] Zhou Wang, A. C. Bovik, H. R. Sheikh, , and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” ieeeip 13, 600–612 (2004).
 [86] A. Goy, K. Arthur, S. Li, and G. Barbastathis, “The importance of physical preprocessors for quantitative phase retrieval under extremely low photon counts,” in Quantitative Phase Imaging V, , vol. 10887 (International Society for Optics and Photonics, 2019), p. 108870S.
 [87] P. Gupta, P. Srivastava, S. Bhardwaj, and V. Bhateja, “A modified psnr metric based on hvs for quality assessment of color images,” in Communication and Industrial Application (ICCIA), 2011 International Conference on, (IEEE, 2011), pp. 1–4.
Comments
There are no comments yet.