The retrieval of phase from intensity is one of the most important and most challenging problems in Classical Optics. The utility of phase retrieval stems from the fact that it allows the shape of transparent objects, biological cells in particular, to be quantified in two and three spatial dimensions using visible light [1, 2]. In the x-ray band, quantitative phase imaging is also useful because phase contrast in tissue is orders of magnitude higher than attenuation contrast [3, 4]. The same argument can be made for identification of liquids  and semiconductor materials for integrated circuit characterization and inspection .
There are two well-known challenges in phase retrieval. Firstly, for the phase of the optical field to be well-defined, the illumination needs to be temporally and spatially coherent to a fairly good approximation; this is especially difficult with x-rays. One way to relax this requirement is to acknowledge that it is usually the phase delay through the object that is of interest, and seldom the phase of the optical field itself. The former can be obtained even from partially coherent light 
but requires the correlation function (mutual intensity) to be measured, which is often problematic because of low contrast. The second challenge is that, since only the intensity of a light beam is observable, the phase may only be inferred indirectly from intensity measurements. Computational approaches to this operation may be classified as interferometric/holographic[8, 9], where a reference beam is provided; and non-interferometric, or reference-less such as direct/iterative [10, 11] and ptychographic [12, 13], which are both nonlinear, and transport-based [14, 15], where the problem is linearized through a hydrodynamic approximation. Direct methods attempt to retrieve the phase from a single raw intensity image, whereas the transport and ptychographic methods implement axial and lateral scanning, respectively. What reference-less methods have in common is the need to obtain intensity measurements at some distance away from the conjugate plane of the object, i.e. with a small defocus. Direct measurement with defocus is the approach we take here.
All computational phase retrieval approaches, interferometric and non-interferometric, involve the solution of a nonlinear and highly ill-posed inverse problem. For direct phase imaging, which is a nonlinear problem—see Section 2.2.1—the classical Gerchberg-Saxton-Fienup (GSF) algorithm [10, 11, 16] and its variants 
are widely used. The main idea is to start with a random estimate for the unknown phase distribution and then to iteratively update it until the modulus-squared of its Fourier (or Fresnel) transform matches the observed intensity. For well-behaved phase fields, the iteration usually converges to the correct phase[18, 19]. Alternatively, the Wiener-Tikhonov functional minimization approach, described in Section 2.2.2, exploits prior knowledge about the class of phase objects being imaged to combat noise artifacts. A modern and popular implementation is compressive imaging [20, 21, 22, 23, 24] that utilizes sparsity priors, valid when the object is expressed in the appropriate set of basis functions. If the sparsifying set is unknown, it can be learned from examples according to the -SVD method or other dictionary approaches [25, 26, 27, 28].
In 2010, 
proposed a deep neural network in a recurrent scheme to learn the prior from examples, as an alternative to the dictionary approach. Subsequently, generator-discriminator competition was adopted as a more secure means of learning the statistical relationship between valid objects and their forward models; and the recursion was unfolded into a cascade for better numerical stability . In these schemes, the physical model of the measurement is taken explicitly into account as a projection operator applied to the reconstruction estimate repeatedly at each recursion or cascade stage. This generalization of dictionaries to deep learning has been successful in a number of linear inverse problems, most notably superresolution [32, 33, 34] and tomography [35, 36].
Recently, deep learning regression has been investigated for application to nonlinear inverse problems, in particular phase retrieval: direct [37, 38, 39], holographic [40, 41], and ptychographic [42, 43]. The idea, described briefly in Section 2.2.2, is to train a deep neural network (DNN) in supervised mode from examples of phase objects and their intensity images so that, after training, given an intensity image as input, the DNN outputs an estimate of the phase object. In this case, the physical model is either learned implicitly together with the prior from the examples [37, 38]; or incorporated as a pre-processor (“Approximant”) [39, 40, 41]. An interesting alternative method for the inverse problem, also nonlinear, of reconstructing the three-dimensional (3D) refractive index distribution from intensity projections, is to define the DNN architecture according to the strong scattering model and store the refractive index values as weights of the DNN after training . This “index-storing” DNN itself was subsequently used as Approximant to a traditional DNN for improving the estimates in 3D distributions with exceptionally small and high-contrast features or when the range of available angles of projection is severely limited . Extensive reviews of deep learning use for inverse problems can be found in [36, 46, 47].
Here, we propose a new DNN-based computational architecture for phase retrieval with the unique feature of processing low and high spatial frequency bands as separate channels with two corresponding DNNs trained from an original object database and a high-pass filtered version of the database, respectively. Subsequently, the outputs of the two channels are recombined using a third DNN also specifically trained for this task. The motivation for this new approach is an earlier observation  that nonlinearities in DNN training and execution algorithms tend to amplify imbalances in the spatial frequency content of the training database and in the way different spatial frequencies are treated as they propagate through the physical optical system; this amplified imbalance typically results in the lower spatial frequencies becoming dominant and ultimately limiting resolution of fine spatial features in the reconstructions. A more detailed overview of this phenomenon can be found in Section 2.2.3. Because the essential feature of our new proposed technique is the synthesis of the two spatial bands through a trained DNN, we refer to it as “learning to synthesize” (LS).
Splitting the spatial frequency content into several bands and processing the bands separately has a long history in signal processing [48, 49, 50, 51, 52, 53, 54, 55]. For image reconstruction, dual band processing has been used in fluorescence microscopy [56, 57, 58] and phase retrieval . However, these cases, unlike ours, required structured illumination. In the context of learning-based inversion, a dual channel method has been tried for superresolution  (to be understood as upsampling) albeit the two processed channels were combined as a simple convex sum to form the final image. By contrast, the LS method presented here uses a learnednonlinear filter, implemented as a third DNN trained to optimally recombine the two channels according to the spectral properties of the class of objects that the training database represents.
In addition to requiring a single raw image to retrieve the phase through a learned recombination of the spectral channels, the LS method presented here has the desirable property of resilience to noise, especially in the case of weak photon flux down to a single photon per pixel. We achieved that by using an Approximant filter  to pre-process the raw image before submitting it to the two spectral channels. The Approximant produces an inverse estimate that expressly uses the physical model (a single iteration of the GSF algorithm in  and here). For very noisy inputs, the Approximant is of very poor quality; yet, if the subsequent learning architecture is trained with this low-quality estimate as input, the final reconstruction results are significantly improved. The LS method with Approximant, as presented here, drastically improves over , especially in the reconstruction of fine detail, as  did not use separate spectral channels to rebalance frequency content.
The detailed implementation of the LS method is described in Section 3, where we also show how to optionally include the Approximant as initial estimate for input to the LS scheme. Results from experiments are presented in Section 4 with detailed characterization of the LS method’s behavior under different noise conditions. Conclusions and suggestions for future work are in Section 5.
2 Problem formulation
2.1 Phase retrieval
denote the complex transmittance of an optically thin object of modulus response and phase response , and let denote the coherent incident field of wavelength on the object plane. The noiseless intensity measurement (also referred to as noiseless raw image) is carried out on the detector plane located at a distance away from the object plane, and can be written as
where denotes the Fresnel (paraxial) propagation operator for distance , i.e. the convolution
and is the (nonlinear) noiseless forward operator. Alternatively, may be expressed in the spatial frequency domain as
denotes the 2D (spatial) Fourier transform operator andits inverse.
We are interested in weakly absorbing objects, i.e. we assume . In all the experiments described here, the illumination is also a normally incident plane wave . Therefore, to a good approximation, we may write
In practice, the measurement is subject to Poisson statistics due to the quantum nature of light; and to Gaussian thermal noise added by the photoelectric conversion process. We express the noisy measurement as
denotes a Poisson random variable with meanand
a Gaussian random variable with zero mean and variance. The photon flux in photons per pixel per frame is denoted as and the spatial average of the noiseless raw image in the denominator is necessary as normalization factor. The noisy forward operator is and the purpose of phase retrieval is to invert so as to recover as accurately as possible, despite the nonlinearity and randomness present in the measurements.
2.2 Solution of the inverse problem
The Wiener-Tikhonov approach to solving inverse problems of the form is to obtain the estimate of the inverse as
Here, is the fitness (or data-fidelity) term and is a distance operator. Under the assumption of solely additive Gaussian noise, it is appropriate to use the norm
Somewhat less rigorously, but conveniently for mathematical manipulations, the same
metric is used for the fitness term even for measurements strongly subject to non-Gaussian statistics, as we consider here. The topic of appropriate fitness metrics is beyond the scope of this paper; in any case, when machine learning is used to approximate (6
), the dilemma of choosing a metric shifts to the loss function for training a deep neural network. We will address this latter problem in some detail in Section3.3.1.
The second term in (6) is the regularizer, or prior knowledge term. Its purpose is to compete with the fitness term in the minimization so as to mitigate ill-posedness in the solution. That is, the regularizer penalizes solutions that are promoted by the noise in the forward problem, as in (5) for example, but does not meet general criteria known a priori for valid objects. The prior may be defined explicitly, e.g. as a minimum energy [61, 62] or sparsity [20, 21, 22, 23, 24] criterion; or learned from examples as a dictionary [25, 26, 27, 28] or through a deep learning scheme [29, 31, 32, 33, 34, 36, 35, 37, 38, 39, 40, 41, 42, 43].
where is the output of a deep neural network. In the End-to-End approach, the burden is on the DNN to learn from examples both the forward operator and the prior so as to execute, in one shot, an approximation to the ideal solution (6). Training takes place in supervised mode, with known pairs of phase objects and their raw intensity images generated on a phase spatial light modulator (SLM) and measured on a digital camera, respectively. Note that training is generally very slow, taking several hours or days if a few thousand examples are used. However, after training is complete, the execution of (8) or (9) is very fast as it only requires forward (non-iterative) computations. This is one significant advantage over the standard way of minimizing the Wiener-Tikhonov functional (6) iteratively for each image.
When the inverse problem becomes severely ill-posed or the noise is extremely strong, the learning burden on the DNN becomes too high; then, generally, better results are obtained by training the DNN to receive as input the Approximant instead of the raw measurement directly. The Approximant is obtained through an approximate inversion of the forward operator; for example, in 
it was implemented as a digital holographic backpropagation algorithm, whereas in it was the outcome of a single iteration of the Gerchberg-Saxton algorithm . While these Approximants generally do not look very good, especially in highly noisy situations , through training the DNN is able to learn a better association of with its corresponding true object than it can learn with the noisy raw measurement .
2.3 Spectral properties of training
The design of deep neural networks is an active field of research and a comprehensive review of methods and caveats is well beyond the scope of this paper. We refer the reader to [36, 46, 47] for more extensive background and references. Here, we discuss the influence on the quality of training of the spatial power spectral density of the database where example are drawn from.
) the examples determine the object class prior to be learned by the DNN. For instance, if we train on examples drawn from the ImageNet database of natural objects, the prior learned is weaker than if we train on the MNIST  database of hand-written digits or a database of integrated circuit (IC) segments because both latter shapes are more restricted than natural images and with stronger spatial correlations across. Recent work [39, 65] has shown that, unsurprisingly, training with stronger priors results in more robust reconstructions as long as the test objects are drawn from the same class.
In , we addressed the influence of the spatial power spectral density (PSD) of the example database on the quality of training. It is well known [66, 67, 68, 69, 70, 71] that two dimensional (2D) images of natural objects, such as those contained in ImageNet, follow the inverse quadratic PSD law
Other types of object classes of practical interest exhibit similar power-law decay, perhaps with slightly different exponents. This means that, if a neural network is trained on such an object class, higher spatial frequencies are presented less frequently to the DNN during the training stage. At face value, this is as it should be, since the relative popularity of different spatial frequencies in the database is precisely one of the priors that the DNN ideally should learn.
This understanding needs to be modified in the context of inverse problems because the representation of high spatial frequencies in the raw images is also uneven—typically, to the high spatial frequencies’ disadvantage. In the specific case of phase retrieval, higher spatial frequencies within the spatial bandwidth (as determined by the numerical aperture NA) have uniform transmission modulus but are more severely scrambled by the chirped oscillations of the transfer function (3). Thus, higher spatial frequencies suffer a double penalty : their recovery becomes more sensitive to noise due to the scrambling; and they are less popular due to the inverse-square (or similar) PSD law so they are presented less frequently to the DNN’s training process. Moreover, since the DNN itself and its training routine are both highly nonlinear, there is an acute risk that any unevenness in the treatment of different spatial frequency bands may be amplified in the final result, eventually causing the lower frequencies to dominate. Ref.  attributed the inability of the Phase Extraction Neural Network (PhENN)  to resolve spatial features well within its admitted spatial bandwidth to this unequal treatment of spatial frequencies. They showed that PhENN’s resolution is approximately doubled by pre-filtering the training examples as to flatten their PSD. That is, during the training, each example from the database was replaced with its filtered version
The transfer function was defined as the high-pass filter
exactly compensating for the inverse-quadratic dependence (10) and flattening the spectrum. Raw images for training were correspondingly filtered as
whereas, during the test, the un-filtered measurements (i.e., as received from the camera) were used to obtain the reconstructions. Unfortunately, with this implementation amplification of high spatial frequency features, especially of artifacts caused even by weak noise, was also evident in the reconstructions. This is not surprising, since technically (11) trades off violating the prior in return for finer spatial resolution. The LS approach that we describe in the next section is meant to fix this problem.
3 Methodology of Learning to Synthesize
3.1 Description of the LS-DNN system
In , we proposed a two-step approach to tackle the difficulty of restoring high frequency components without introducing significant artifacts and distortions. Here we describe the two steps for training and execution in detail, as well as the design of the DNNs involved in the LS system. For unity in notation, we denote the input to the entire LS system as , to be understood as
We will discuss the Approximant implementation in more detail in section 3.2 below.
The two training steps are shown in block-diagram form in Figure 1. The first step consists of training two separate DNNs in parallel, as follows:
DNN-L is trained to match unfiltered patterns at its input with the corresponding unfiltered example phase patterns as ground truth at its output (the superscript enumerates the examples).
DNN-H is trained to match unfiltered patterns at its input to corresponding spectrally filtered versions of the ground truth examples at its output. The filter’s transfer function is chosen more generally than (12) as
We have investigated implementations with spanning a broad range, as we discuss in more detail below.
The output of DNN-L for a general test input is denoted as . Assuming similar training conditions, matches the output of PhENN as presented in  in the End-to-End scheme or  in the Approximant scheme; that is, is expected to be fairly accurate at low spatial frequencies but missing fine detail.
The output of DNN-H is denoted as . Note that  required spatial pre-filtering the raw inputs ; here, we do not spatially pre-filter the input (i.e., or according to whether the End-to-End or Approximant scheme is used). We instead train DNN-H to produce the filtered output based on an unfiltered input. This leads to better generalization, because DNN-H is trained on the broadest set of possible images (whereas the training in  was on high-frequency containing images only). Moreover, using unfiltered inputs to DNN-H allows for the training process to be parallelized for better efficiency.
Depending on the value of , the PSD of the patterns training DNN-H will be flat or almost flat. The output of DNN-H is expected to have better fidelity at fine spatial features of the phase objects. However, spectral flattening may also generate artifacts due to over-learning the high spatial frequencies. Therefore, looks rather like a high-pass filtered version of the true object , which we found to be more beneficial for subsequent use in the LS scheme.
The second training step consists of combining the two partially accurate reconstructions and into a final estimate with uniform quality at all spatial frequencies, low and high up to the passband. To this end, we train the synthesizer DNN-S to receive and as inputs, and use the un-filtered examples as the output. To avoid any further damage to the high-spatial frequency content in , we bypass and present it intact to the last layer of DNN-S. By operating on alone, DNN-S learns how to treat the low frequency reconstruction so as to compensate for artifacts at all bands. The use of the synthesizer DNN-S also makes our results less sensitive to the choice of power in the transfer function (14). We found that can produce reconstructions of approximately equal quality, as will be presented in section 4.
After DNN-L, DNN-H and DNN-S have been trained, they are combined in the LS system and operated as shown in Figure 1(b). The input is passed to DNN-L and DNN-H in parallel fashion, and the respective outputs and are passed to DNN-S, which produces the final estimate . It is worth noting that it is not valid to lump the three networks in Figure 1(b) into a single network, due to their separate training schemes described above.
There is a wide variety of DNN structures one may choose to implement DNN-L, H and S. In this work, we use for DNN-L the same architecture in , a residual U-net architecture with skip connections . For simplicity, DNN-H and DNN-S, are also chosen to be structures similar to DNN-L. The details of the implementations are in the Supplementary Material. We made these choices to enable specific and fair comparisons with the earlier works; alternative architectures are certainly possible within the LS scheme, though we judged a full exploration to be outside the scope of the present paper.
Training a neural network is typically implemented as a stochastic optimization procedure [74, 75] where the neural network’s internal coefficients (weights) are adjusted so as to minimize a metric of the distance between the actual output of the neural network and its desired output to a given input (training example). This distance is called training loss function (TLF). In the context of training to solve an inverse problem, the TLF is defined as
where the superscript is again used to enumerate the examples, and the dilemma of choosing the appropriate metric operator emerges.
It is generally accepted [76, 77, 78, 79, 80, 37] that the metric (also referred to as mean square error, MSE) is a poor choice that does not generalize well; i.e., deep neural networks trained with MSE do not perform well when presented with examples outside their training set. For image classification tasks, and in some early work on phase retrieval , the metric (mean absolute error, MAE) was used instead. In direct analogy with compressive sensing, the metric promotes sparsity in the internal connectivity of the neural network, and that leads to better generalization. However,  found that in highly ill-posed problems this benefit is eclipsed by the inability of MAE and pixel-wise metrics more generally to learn spatial structure priors about the object class that are crucial for regularization.
where and are the spatial averages of the generic functions , . If and are uncorrelated, the expected value of is zero, whereas if they are identical then . Thus, training the neural network minimizes the TLF toward , where is the number of training examples.
The NPCC has been shown  to be more effective in recovering fine features than conventional loss functions such as the Mean Square Error (MSE), Mean Absolute Error (MAE) and the Structural Similarity (SSIM) index [84, 85]. However, the NPCC has the disadvantage that it is invariant to affine transformations to its arguments, i.e.
for arbitrary real numbers , , , . For quantitative phase retrieval, where the scale of phase difference matters, the affine ambiguity is resolved with a histogram equalization step after inversion .
3.2 Computation of the Approximant
It has been shown that, even under extreme noise conditions, just a single iteration of the Gerchberg-Saxton (GS) algorithm suffices as Approximant in scheme (9) for phase retrieval . We elected to use the same approach here for the LS-DNN architecture. More recently, a comparative study  showed that higher iterates or regularized versions of GS do improve the appearance of the Approximant result but do not yield significant improvement in the end output of the DNN. Similar conclusions hold for alternatives to GS, e.g. Gradient Descent. While these alternative schemes are interesting for the LS-DNN method, we chose to not pursue them here.
The general form of the GS iterate from the iterate is
where we have taken into account that . Accordingly, our Approximant is
where denotes the function that is uniformly equal to one within the frame .
Figure 2 compares the 2D (log-scale) Fourier spectrum magnitude of a ground truth image (from ImageNet ), Approximant (19) computed without noise, and Approximant (19) computed from an input subject to Poisson statistics corresponding to average flux of one photon per pixel. We can see that although the single photon Approximant (which we will later use as the input to the LS-DNN) has a large support in its spectrum, it is the noise that dominates the mid-to-high frequency range. Therefore, the learning process still bears the burden of restoring the correct high frequency contents and relying heavily on high frequency priors, as our DNN-H does, is justified.
4.1 Experimental Apparatus and Data Acquisition
In each experiment carried out to train and test different LS-DNN schemes, 10,000 image objects from ImageNet  were successively projected on a phase SLM as phase objects (i.e., with phase value at each pixel proportional to the intensity of the corresponding pixel in the original from the database) and their raw images were recorded by the EM-CCD camera at an out-of-focus plane. These 10,000 ground-truth phase images and their corresponding raw intensity images were split into a training set of 9,500 images, a validation set of 450 images and a test set of 50 images. The choice of ImageNet  is reasonable, since the low-frequency dominance in its spatial PSD is representative of the broader classes of objects of interest, and therefore, we anticipate our results will generalize well in practical applications.
The experiments were carried out using the apparatus described in Figure 3. The light source was a continuous wave Helium-Neon gas laser at 632.8nm. The laser beam first passed through a variable neutral density filter (VND) that served the purpose of adjusting the photon flux. The beam was then spatially filtered and expanded into a 18mm diameter collimated pencil and sent onto a transmission spatial light modulator (SLM) with pixels, each of size . Phase objects were projected on the SLM and imaged by a telescope (4F system) consisting of lenses L1 (focal length 230mm) and L2 (100mm). The reduction factor in the 4F system was designed to reduce the spatial extent of the defocused raw image to approximately fit the size of the camera. An aperture was placed in the Fourier plane to suppress higher diffraction orders due to the periodicity of the SLM pixels. The raw intensity images were captured by a Q-Imaging EM-CCD camera with square-shaped pixels of size 88m placed at a distance from the image plane of the 4F system. Additional details about the implementation of the optical apparatus and its numerical simulation with digital Fresnel transforms are in Supplementary Material.
The photon flux is quantified as the number of photons received by each pixel on average for an unmodulated beam, i.e. with no phase modulation driving the SLM. During an initial calibration procedure, for different positions of the VND filter, the photon level is measured using a calibrated silicon photodetector placed at the position of the camera. The quoted photon count is also corrected for the quantum efficiency of the CCD (60% at ), meaning that we refer to the number of photons actually detected and not the incident number of photons.
Here, we report results for two levels of photon flux % and %, respectively quoted in the text as “10” and “1” photons. The data acquisition, training and testing procedures of the entire LS-DNN architecture were repeated separately for each value of .
4.2 Reconstructions and Analysis
Figure 4 shows the reconstructions obtained by the LS-DNN method () and its components at fluxes 1 photon and 10 photons per pixel, as defined immediately above. As expected, the reconstructions by DNN-L have good fidelity at low spatial frequencies but lose fine details, as in ; whereas the reconstructions by DNN-H look like high-pass filtered versions of the true objects with some additional high-frequency artifacts due to the noise. The reconstructions by DNN-S preserve detail at both low and high frequencies, while significantly attenuating the artifacts. The improvement of over is more pronounced under severe noise, i.e. in the photon/pixel case. More examples of reconstructions (obtained with ) for the noisier case () are given in the Supplementary Material.
In Figures 5 and 6, we compare reconstructions by LS-DNN with different values of the pre-filtering parameter for photon and photons per pixel, respectively. The most detail at high frequencies in the DNN-S output is preserved in the range . At lower values of , the quality of reconstructions by DNN-S does not noticeably exceed that of DNN-L. This is expected, since in the limit , training DNN-H becomes identical to DNN-L. On the other hand, for values , the DNN-H output is dominated by high-frequency artifacts and again the quality of DNN-S reconstructions regresses to that of DNN-L, since the high-frequency channel is no longer contributing. These observations are valid for both values of , and even stronger for the most severely noise-limited case .
Similar trends are evident according to various quantitative metrics averaged over the entire set of test examples compared to the true phase signals , summarized in Table 1. For comparisons we used the PCC, defined as in (16) but without the minus sign; peak signal-to-noise ratio (PSNR) ; and structural similarity index metric (SSIM) [84, 85]. As we noted in Section 3.3.1, DNN-H is trained with spectrally pre-filtered version of the true object so a quantitative comparison of its output with the ground truth does not make sense.
It is noteworthy that in both visualization and quantitative comparisons of Figures Figures 5-6 and Table 1, respectively, the performance of DNN-S remains approximately the same within the range . This is desirable as it suggests that one need not prefilter exactly with the inverse of the PSD power law. This further suggests that for datasets that do not represent natural images and may obey power laws different than (10), not knowing the exact value of may not be catastrophic. We have not tested this hypothesis exhaustively, as it is beyond the scope of this paper.
In Table 2 and in Supplementary Material, we also analyze the case of a bigger DNN (denoted DNN-L-3) with computational capacity equal to the sum of DNN-L, H and S together, but trained with un-filtered examples, and show that DNN-L-3 cannot achieve reconstructions of equal quality. Therefore, the improvements over  resulted from the training procedure followed in the LS-DNN method and not simply by brute force due to the use of larger computational capacity.
|Average PCC std.dev||Average PSNR std.dev (dB)||Average SSIM std.dev|
|Approximant||0.149 0.067||0.181 0.082||8.082 3.979||8.098 3.986||0.220 0.105||0.222 0.106|
|DNN-L output||0.808 0.099||0.880 0.059||16.207 2.466||18.299 2.536||0.875 0.071||0.925 0.039|
|DNN-S output ()||0.848 0.075||0.892 0.053||17.511 2.317||18.426 2.321||0.905 0.054||0.929 0.048|
|DNN-S output ()||0.858 0.074||0.898 0.050||17.758 2.523||18.469 2.469||0.903 0.056||0.926 0.036|
|DNN-S output ()||0.859 0.071||0.896 0.050||17.769 2.420||18.487 2.313||0.912 0.048||0.930 0.043|
|DNN-S output ()||0.865 0.067||0.897 0.050||18.116 2.319||18.634 2.380||0.915 0.061||0.934 0.034|
|DNN-S output ()||0.866 0.069||0.899 0.048||17.940 2.281||18.902 2.443||0.915 0.051||0.937 0.034|
|DNN-S output ()||0.860 0.071||0.898 0.050||17.880 2.455||18.940 2.320||0.915 0.062||0.935 0.038|
|DNN-S output ()||0.859 0.088||0.892 0.051||17.827 2.140||18.487 2.311||0.911 0.070||0.927 0.047|
|DNN-S output ()||0.845 0.081||0.887 0.050||17.375 2.221||18.445 2.470||0.902 0.059||0.934 0.041|
|DNN-S output ()||0.825 0.111||0.883 0.052||16.991 2.008||18.484 2.363||0.894 0.075||0.934 0.042|
|DNN-S output ()||0.819 0.077||0.882 0.055||16.820 2.254||18.391 2.600||0.893 0.058||0.933 0.037|
|Average PCC std.dev||Average PSNR std.dev (dB)||Average SSIM std.dev|
|Approximant||0.149 0.067||0.181 0.082||8.082 3.979||8.098 3.986||0.220 0.105||0.222 0.106|
|DNN-L output||0.808 0.099||0.880 0.059||16.207 2.466||18.299 2.536||0.875 0.071||0.925 0.039|
|DNN-L-3 output||0.811 0.125||0.880 0.117||16.209 2.522||18.268 2.030||0.835 0.092||0.926 0.073|
|DNN-S output ()||0.866 0.069||0.899 0.048||17.940 2.281||18.902 2.443||0.915 0.051||0.937 0.034|
To further study the behavior of the LS components in the low and high spatial frequency bands, we studied the reconstructions in the Fourier domain. Figure 7 shows the spectra (2D Fourier transforms) of two randomly selected test examples. Figure 8 and Figure 5 in Supplementary Material show normalized diagonal cross-sections of the PSD averaged over all the 50 test images, for and photons per pixel, respectively. These plots illustrate that DNN-L and DNN-H’s outputs are depleted at the high and low frequencies, respectively, with the losses being more severe in the noisy case ; whereas DNN-S’s output mostly recovers the frequency content at both bands, albeit still with some minor loss at high frequencies.
Lastly, we experimentally characterized the spatial resolution of the LS-DNN reconstructions, i.e. the ability of DNN-S to resolve two pixels at nearby locations having phase delay higher than the rest of the signal. Similar analyses were carried out in [65, 38], where the methodology was also described in detail. In the work presented here, we carried out the analysis under ample illumination, i.e. not under strong Poisson statistics. We made that choice because spatial resolution under highly noisy conditions becomes non-trivially coupled to the noise statistics, and a complete investigation would have been outside the scope of the present investigation. The results are shown in section 6 in Supplementary Material.
5 Concluding Remarks
The LS-DNN reconstruction scheme  for quantitative phase retrieval has been shown to be resilient to highly noisy raw intensity inputs while preserving high spatial frequency detail better than . The reconstructions are also quite robust to variations of the pre-filtering power law around the value following from natural image statistics. Beyond the scope of the work reported here, further improvements may be obtained through modifying the architecture of the DNNs used to process and recombine the two spatial frequency bands. It is also possible that for highly restricted datasets, needs to be investigated further. However, our observations for natural images suggests that the LS-DNN approach is relatively insensitive to under fairly wide conditions.
Another obvious alternative strategy is to split the signals into more than two bands, then process and recombine these multiple bands with a synthesizer DNN, according to the LS scheme. While we did not investigate this approach in detail here, we expect it to present a trade-off between the improvements and the complexity of having to train multiple neural networks implying the need for more examples and the danger of poor generalization.
This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) grant No. FA8650-17-C-9113 and by the SenseTime company. I. Kang was supported in part by KFAS (Korea Foundation for Advanced Studies) scholarship. We are grateful to Kwabena Arthur for useful discussions and critique of the manuscript.
-  P. Marquet, B. Rappaz, P. J. Magistretti, E. Cuche, Y. Emery, T. Colomb, and C. Depeursinge, “Digital holographic microscopy: a noninvasive contrast imaging technique allowing quantitative visualization of living cells with subwavelength axial accuracy,” Opt. Lett. 30, 468–470 (2005).
-  G. Popescu, T. Ikeda, R. R. Dasari, and M. S. Feld, “Diffraction phase microscopy for quantifying cell structure and dynamics,” Optics letters 31, 775–777 (2006).
-  S. C. Mayo, T. J. Davis, T. E. Gureyev, P. R. Miller, D. Paganin, A. Pogany, A. W. Stevenson, and S. W. Wilkins, “X-ray phase-contrast microscopy and microtomography,” Opt. Express 11, 2289–2302 (2003).
-  F. Pfeiffer, T. Weitkamp, O. Bunk, and C. David, “Phase retrieval and differential phase-contrast imaging with low-brilliance X-ray sources,” Nature Physics 2, 258–261 (2006).
-  A. Pan, Ling Xu, J. C. Petruccelli, R. Gupta, B. Singh, and G. Barbastathis, “Contrast enhancement in x-ray phase contrast tomography,” Optics express 22, 18020–18026 (2014).
-  M. Holler, M. Guizar-Sicairos, E. H. R. Tsai, R. Dinapoli, O. Bunk, J. Raabe, and G. Aeppli, “High-resolution non-desctructive three-dimensional analysis of integrated circuits,” Nature 543, 402–406 (2017).
-  J. C. Petruccelli, L. Tian, and G. Barbastathis, “The transport of intensity equation for optical path length recovery using partially coherent illumination,” Opt. Exp. 21, 14430 (2013).
-  J. W. Goodman and R. Lawrence, “Digital image formation from electronically detected holograms,” Applied physics letters 11, 77–79 (1967).
-  K. Creath, “Phase-shifting speckle interferometry,” Applied Optics 24, 3053–3058 (1985).
-  R. W. Gerchberg and W. O. Saxton, “A practical algorithm for the determination of the phase from image and diffraction plane pictures,” Optik 35, 237–250 (1972).
-  J. R. Fienup, “Reconstruction of an object from the modulus of its Fourier transform,” Optics letters 3, 27–29 (1978).
-  G. Zheng, R. Horstmeyer, and C. Yang, “Wide-field, high-resolution fourier ptychographic microscopy,” Nature photonics 7, 739 (2013).
-  L. Tian, X. Li, K. Ramchandran, and L. Waller, “Multiplexed coded illumination for fourier ptychography with an led array microscope,” Biomedical optics express 5, 2376–2389 (2014).
-  M. R. Teague, “Deterministic phase retrieval: a Green’s function solution,” JOSA 73, 1434–1441 (1983).
-  N. Streibl, “Phase imaging by the transport-equation of intensity,” Opt. Commun. 49, 6–10 (1984).
-  J. R. Fienup, “Phase retrieval algorithms – a comparison,” Applied Optics 21, 2758–2769 (1982).
-  H. H. Bauschke, P. L. Combettes, and D. R. Luke, “Phase retrieval, error reduction algorithm, and fienup variants: a view from convex optimization,” JOSA A 19, 1334–1345 (2002).
-  R. W. Gerchberg, “The lock problem in the Gerchberg-Saxton algorithm for phase retrieval,” Optik 74, 91–93 (1986).
-  J. Fienup and C. Wackerman, “Phase-retrieval stagnation problems and solutions,” JOSA A 3, 1897–1907 (1986).
E. Candès and T. Tao, “Decoding by linear programming,” IEEE Trans. Inform. Theory51, 4203–4215 (2005).
-  E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on information theory 52, 489–509 (2006).
-  D. L. Donoho, “Compressed sensing,” IEEE Transactions on information theory 52, 1289–1306 (2006).
-  E. Candès, J. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Comm. Pure Appl. Math. 59, 1207–1223 (2006).
-  Y. C. Eldar and G. Kutyniok, Compressed Sensing: Theory and Applications (Cambridge University Press, 2012).
-  M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Transactions on Image processing 15, 3736–3745 (2006).
-  M. Aharon, M. Elad, and A. Bruckstein, “K-svd: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on signal processing 54, 4311 (2006).
-  R. Rubinstein, A. M. Bruckstein, and M. Elad, “Dictionaries for sparse representation modeling,” Proceedings of the IEEE 98, 1045–1057 (2010).
-  Changlong Bao, Hui Ji, Yuhui Quan, and Zuowei Shen, “Dictionary learning and sparse coding: algorithms and convergence analysis,” IEEE Tras. Patt. Anal. Mach. Intel. 38, 1356–1369 (2016).
-  K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the 27th International Conference on International Conference on Machine Learning, (Omnipress, 2010), pp. 399–406.
-  I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, Bing Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” in Neural Information Processing Systems (NIPS), , vol. 27 (2014).
-  M. Mardani, H. Monajemi, V. Papyan, S. Vasanawala, D. L. Donoho, and J. M. Pauly, “Recurrent generative adversarial networks for proximal learning and automated compressive image recovery,” CoRR (2017).
C.-Y. Yang, C. Ma, and M.-H. Yang, “Single-image super-resolution: A benchmark,” in
European Conference on Computer Vision,(Springer, 2014), pp. 372–386.
Chao Dong, Chen Loy, Kaiming He, and Xiaoou Tang, “Learning a deep convolutional neural network for image super-resolution,” inEuropean Conference on Computer Vision (ECCV) / Lecture Notes on Computer Science Part IV, , vol. 8692 (2014), pp. 184–199.
-  Chao Dong, Chen Loy, Kaiming He, and Xiaoou Tang, “Image super-resolution using deep convolutional networks,” pami 38, 295–307 (2015).
-  Kyong Hwan Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Trans. Image Process. 26, 4509–4522 (2017).
-  M. T. McCann, Kyong Hwan Jin, and M. Unser, “Convolutional neural networks for inverse problems in imaging,” IEEE Sig. Process. Mag. pp. 85–95 (2017).
-  A. Sinha, Justin Lee, Shuai Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica 4, 1117–1125 (2017).
-  Shuai Li and G. Barbastathis, “Spectral pre-modulation of training examples enhances the spatial resolution of the phase extraction neural network (PhENN),” Opt. Express 26, 29340–29352 (2018).
-  A. Goy, K. Arthur, Shuai Li, and G. Barbastathis, “Low photon count phase retrieval using deep learning,” Phys. Rev. Lett. 121, 243902 (2018).
-  Y. Rivenson, Yibo Zhang, H. Gunaydin, Da Teng, and A. Ozcan, “Phase recovery and holographic image reconstruction using deep learning in neural networks,” Light: Science and Applications 7, 17141 (2018).
-  Y. Wu, Y. Rivenson, Y. Zhang, Z. Wei, H. Günaydin, X. Lin, and A. Ozcan, “Extended depth-of-field in holographic imaging using deep-learning-based autofocusing and phase recovery,” Optica 5, 704–710 (2018).
-  T. Nguyen, Y. Xue, Y. Li, L. Tian, and G. Nehmetallah, “Deep learning approach for fourier ptychography microscopy,” Optics express 26, 26470–26484 (2018).
-  Y. Xue, S. Cheng, Y. Li, and L. Tian, “Reliable deep-learning-based phase imaging with uncertainty quantification,” Optica 6, 618–629 (2019).
-  U. S. Kamilov, I. N. Papadopoulos, M. H. Shoreh, A. Goy, C. Vonesch, M. Unser, and D. Psaltis, “Learning approach to optical tomography,” Optica 2, 517–522 (2015).
-  A. Goy, G. Rughoobur, Shuai Li, K. Arthur, A. Akinwande, and G. Barbastathis, “High-resolution limited-angle phase tomography of dense layered objects using deep neural networks,” Proc. Nat. Acad. Sci. ((accepted) 2019).
YoungJu Jo, Hyungjoo Cho, Sang Yun Lee, Gunho Choi, Geon Kim, Hyun-Seok Min, and YongKeun Park, “Quantitative phase imaging and artificial intelligence: A review,” IEEE J. Sel. Topics in Quant. Electronics25 (2019).
-  G. Barbastathis, A. Ozcan, and Guohai Situ, “On the use of deep learning for computational imaging,” Optica (2019).
-  I. Daubechies, “Orthonormal bases of compactly supported wavelets,” Comm. Pure Appl. Math. 41, 909–996 (1988).
-  I. Daubechies, Ten lectures on wavelets (Soc. Ind. Appl. Math., 1992).
-  R. Coifman and D. L. Donoho, “Translation invariant denoising,” in Wavelets and statistics, , vol. 103 of Lecture Notes in Statistics (Springer-Verlag, 1995), pp. 120–150.
-  G. Strang and Truong Nguyen, Wavelets and filter banks (Wellesley-Cambridge Press, 1996), 2nd ed.
-  Raymond Chan, Tony Chan, Lixin Shen, and Zuowei Shen, “Wavelet algorithms for high-resolution image reconstruction,” SIAM J. Sci. Comp. 24, 1408–1432 (2003).
-  I. Daubechies, M. Defrise, and C. D. Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Comm. Pure Appl. Math. 57, 1413–1457 (2004).
-  M. A. T. Figueiredo and R. Nowak, “An EM algorithm for wavelet-based image restoration,” IEEE Trans. Image Proc. 12, 906–916 (2003).
-  S. Mallat, A wavelet tour of signal processing: the sparse way (Academic Press, 2008).
-  D. Lim, Kengye K. Chu, and J. Mertz, “Wide-field fluorescence sectioning with hybrid speckle and uniform-illumination microscopy,” Opt. Lett. 33, 1819–1821 (2008).
-  J. Mertz, “Optical sectioning microscopy with planar or structured illumination,” Nature Methods 8, 811–819 (2011).
-  D. Bhattacharya, V. R. Singh, Chen Zhi, P. T. C. So, P. Matsudaira, and G. Barbastathis, “Three dimensional HiLo-based structured illumination for a digital scanned laser sheet microscopy (DSLM) in thick tissue imaging,” Opt. Express 20, 27337–27347 (2012).
-  Y. Yunhui, A. Shanker, Lei Tian, L. Waller, and G. Barbastathis, “Low-noise phase imaging by hybrid uniform and structured illumination transport of intensity equation,” Optics express 22, 26696–26711 (2014).
J. Pan, S. Liu, D. Sun, J. Zhang, Y. Liu, J. Ren, Z. Li, J. Tang, H. Lu, Y.-W.
Tai et al., “Learning dual convolutional neural networks for
low-level vision,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,(2018), pp. 3070–3079.
-  A. N. Tikhonov, “On the solution of ill-posed problems and the method of regularization,” Dokl. Acad. Nauk SSSR 151, 501–504 (1963).
-  A. N. Tikhonov, “On the regularization of ill-posed problems,” Dokl. Acad. Nauk SSSR 153, 49–52 (1963).
-  O. Russakovsky, Jia Deng, Hao Su, J. Krause, S. Satheesh, Sean Ma, Zhiheng Huang, A. Karpathy, A. Khosla, and M. Bernstein, “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis. 115, 211–252 (2015).
-  Y. LeCun, C. Cortes, and C. J. Burges, “MNIST handwritten digit database,” AT&T Labs [Online]. Available: http://yann. lecun. com/exdb/mnist 2 (2010).
-  Shuai Li, Mo Deng, Justin Lee, A. Sinha, and G. Barbastathis, “Imaging through glass diffusers using densely connected convolutional networks,” Optica 5, 803–813 (2018).
-  B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature 381, 607–609 (1996).
-  B. A. Olshausen and D. J. Field, “Natural image statistics and efficient coding,” Network Comp. Neural Syst. 7, 333–339 (1996).
-  A. Van der Schaaf and J. H. van Hateren, “Modelling the power spectra of natural images: statistics and information,” Vision Research 36, 2759–2770 (1996).
-  B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: a strategy employed by V1?” Vision Research 37, 3311–3325 (1997).
-  M. S. Lewicki and B. A. Olshausen, “A probabilistic framework for the adaptation and comparison of image codes,” J. Opt. Soc. A 16, 1587–1601 (1999).
-  M. S. Lewicki and T. J. Sejnowski, “Learning overcomplete representations,” Neural Computation 12, 337–365 (2000).
-  M. Deng, S. Li, and G. Barbastathis, “Learning to synthesize: splitting and recombining low and high spatial frequencies for image recovery,” arXiv preprint arXiv:1811.07945 (2018).
-  O. Ronneberger, P.Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), , vol. 9351 of LNCS (Springer, 2015), pp. 234–241. (available on arXiv:1505.04597 [cs.CV]).
L. Bottou, “Large-scale machine learning with stochastic gradient descent,” inProceedings of COMPSTAT’2010, (Springer, 2010), pp. 177–186.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR abs/1412.6980 (2015).
-  G. E. Hinton, “Learning translation invariant recognition in massively parallel networks,” in Proceedings PARLE Conference on Parallel Architectures and Languages Europe, (Springer-Verlag, 1987), pp. 1–13.
-  C. M. Bishop, Pattern recognition and machine learning (Springer, 2006).
-  J. Johnson, A. Alahi, and Li Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision (ECCV) / Lecture Notes on Computer Science, , vol. 9906 B. Leide, J. Matas, N. Sebe, and M. Welling, eds. (2016), pp. 694–711.
-  I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
-  C. Ledig, L. Theis, F. Huczar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Zehan Wang, and Wenshe Shi, “Photo-realistic single image super-resolution using a Generative Adversarial Network,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2017), pp. 4681–4690.
-  K. Pearson, “Contributions to the mathematical theory of evolution. note on reproductive selection,” Proc. Royal Society 59, 300–305 (1896).
-  J. L. Rodgers and W. A. Nicewander, “Thirteen ways to look at the correlation coefficient,” The American Statistician 42, 59–66 (1988).
-  S. Li, G. Barbastathis, and A. Goy, “Analysis of phase-extraction neural network (phenn) performance for lensless quantitative phase imaging,” in Quantitative Phase Imaging V, , vol. 10887 (International Society for Optics and Photonics, 2019), p. 108870T.
-  Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, , vol. 2 (Ieee, 2003), pp. 1398–1402.
-  Zhou Wang, A. C. Bovik, H. R. Sheikh, , and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” ieeeip 13, 600–612 (2004).
-  A. Goy, K. Arthur, S. Li, and G. Barbastathis, “The importance of physical pre-processors for quantitative phase retrieval under extremely low photon counts,” in Quantitative Phase Imaging V, , vol. 10887 (International Society for Optics and Photonics, 2019), p. 108870S.
-  P. Gupta, P. Srivastava, S. Bhardwaj, and V. Bhateja, “A modified psnr metric based on hvs for quality assessment of color images,” in Communication and Industrial Application (ICCIA), 2011 International Conference on, (IEEE, 2011), pp. 1–4.