1 Introduction
Fourier ptychographic microscopy (FPM) [1] circumvents optical spacebandwidth (SBP) limitations to achieve gigapixelscale quantitative phase images, having both wide fieldofview (FOV) and high resolution. The method combines ideas from synthetic aperture and translationaldiversity phase retrieval [2, 3], conveniently realized by replacing the light source of a microscope with an LED array, then capturing multiple images under different illumination angles. When LEDs illuminate the sample from angles smaller than that allowed by the objective’s numerical aperture (), brightfield images result. Conversely, when the illumination NA is larger than the objective NA, darkfield images result. Although darkfield images alone do not have higher resolution than the objective allows, they do contain information about subdiffractionlimit sized features, which occupy a shifted area of the sample’s Fourier space (assuming a thin sample). By collecting many images that cover a wide region of Fourier space and stitching them together coherently, one can achieve spatial resolution beyond the objective’s diffraction limit, corresponding to the sum of illumination and objective NAs (). FPM’s scanfree high SBP imaging capability has great potential for revolutionizing biomedical imaging, with applications in digital pathology [1, 4, 5, 6] and in vivo live cell imaging [7]. The original FPM method only applies to 2D thin objects, however, new models and reconstruction algorithms also enable 3D reconstruction of thick samples [8].
Multiple algorithms have been proposed for solving the nonlinear inverse FPM problem, which amounts to phase retrieval. Amongst these, there are the usual tradeoffs between accuracy, noise performance and computational complexity. In practice, however, we have found that a critical metric is an algorithm’s performance under model mismatch – when the experimental data is imperfect (e.g. due to misalignment). This is typical in computational imaging algorithms, which are often fragile and not robust enough to provide consistent highquality results. Unfortunately, model mismatch is difficult to quantify, since it is systematic, yet unpredictable. Here, we aim to compare algorithms directly, in order to identify error mechanisms and determine the most accurate and robust algorithm for our experiments.
The original FPM algorithm used a GerchbergSaxton approach [9], which is a type of alternating projections [10, 11, 12, 13], first developed for traditional ptychography [2, 14, 15, 16, 17, 3] and later for FPM [1, 18, 19]. Shifted support constraints (finite pupil size) are enforced in the Fourier domain as the corresponding amplitude constraints (measured images) are applied in the image domain, while letting the phase evolve as each image is stepped through sequentially. The GerchbergSaxton method, which is a type of gradient descent, represents a natural way to solve phase retrieval problems by trying to directly minimize some cost function that describes the differences between actual and predicted measurements. Unfortunately, these formulations are often nonconvex in nature and do not come with global convergence guarantees.
Recently, a class of gradient descent like updates, dubbed Wirtinger Flows [20], have been shown to have global convergence guarantees. This method has been successfully applied to FPM [21], though the actual implementation deviates from theory somewhat. In the Wirtinger Flow framework, the optimization procedure is similar to gradient descent, except that the step size and initial guess are carefully chosen for provable convergence.
Gradient descent and Wirtinger Flow are firstorder methods, in the sense that they only use the firstorder derivative of the cost function when updating the complexfield. It is also possible, and generally advantageous, to use higherorder derivatives in the updates. For example, secondorder methods (e.g. Newton’s method) use both the first and second derivative of the cost function, and have been shown to provide faster convergence rates [22]. In our studies, we also observe improved performance when using secondorder methods. For example, in the top row of Fig. 1, the GerchbergSaxton algorithm is a firstorder method, whereas the other three methods are secondorder (or approximate secondorder) methods. All results achieve a similar resolution, but the firstorder (GerchbergSaxton) result is corrupted by lowfrequency artifacts. While computing secondorder derivatives increases complexity, we find that it usually reduces the number of iterations needed, enabling fast overall run times.
The final class of algorithms that have been proposed are based on convex relaxations [23, 24, 25, 26, 27]. This class of phase retrieval algorithms, called PhaseLift, reframes the problem in higher dimensions such that it becomes convex, then aims to minimize the cost function between actual and predicted intensity via semidefinite programming. These algorithms come with the significant advantage of rigorous mathematical guarantees [28] and were successfully applied to FPM data [27]. The actual implementations of these algorithms, however, deviate from the provable case due to computational limitations.
Algorithms can be further classified as sequential or global, depending on whether the update is done for each image, one at a time (sequentially), or all at once with the full set of images (globally) for each iteration. Global methods are expected to perform better, at a cost of additional computational requirements. In our studies, results show little difference between the sequential and global implementation of any particular algorithm (see Fig. 1), suggesting that sequential procedures may be sufficient, allowing reduced computational requirements.
One seemingly unimportant classification of algorithms is whether their cost function minimizes differences in intensity or amplitude. Throughout this paper, we refer to algorithms that minimize intensity differences as intensitybased algorithms, and algorithms that minimize amplitude differences as amplitudebased algorithms. Since intensity is amplitude squared, both drive the optimization in the correct direction; hence, one might expect that the choice between the two is of little consequence. Surprisingly, however, we find that the cost function is the key predictor of experimental performance for our experimental dataset. Intensitybased algorithms suffer from strong artifacts (see Fig. 1), which we show to be due to noise and model mismatch errors. Hence, amplitudebased algorithms perform better on imperfect data, so are more robust. Our goal is to explain why this happens.
We will show that in order for a phase retrieval scheme to be robust to experimental imperfections, the choice of cost function is of crucial importance. One source of error in our experimental data is measurement noise, including Gaussian noise or Poisson shot noise. Another main source of error is model mismatch, caused by experimental imperfections such as aberrations and LED misalignment. A particular problem of FPM datasets is that they contain both brightfield and darkfield images, which have drastically different intensity levels (see Fig. 2). Brightfield images can have several orders of magnitude higher intensity than darkfield images; thus, the amount of Poisson noise will also be significantly higher. If this difference in the noise levels is not properly accounted for, the brightfield noise may drown out the darkfield signal. We will further show that aberrations and LED miscalibration  the two main model mismatch errors in our experiments  result also in intensitydependent errors. Thus, by carefully designing the the cost function, we can develop algorithms that are significantly more robust to both noise and model mismatch.
We develop a maximum likelihood theory which provides a flexible framework for formulating the FPM optimization problem with various noise models. In particular, we will focus on Gaussian and Poisson noise models. We find that amplitudebased algorithms effectively use a Poisson noise model, while intensitybased algorithms use a Gaussian noise model. To illustrate, we simulate four FPM datasets, three of which are contaminated with measurement errors (see Fig. 3): Poisson noise, aberrations, and LED misalignment. We compare the performance of various algorithms on these datasets to demonstrate that the imperfections in our experimental data are more consistent with a Poisson noise model. This explains our observations that amplitudebased algorithms are more experimentally robust than intensitybased algorithms.
2 Algorithm Formulation
2.1 Forward problem for Fourier ptychography
Consider a thin sample with transmission function , where represents the 2D spatial coordinates in the sample plane. Assuming that the LED array is sufficiently far from the sample, each LED will illuminate the sample by a plane wave from a different angle, defined by , where is the spatial frequency corresponding to the th LED, . After passing through the sample, the exit wave is the product of the sample and illumination complex fields,
. The tilted plane wave illumination means that the Fourier transform of this exit wave is just a shifted version of the Fourier spectrum of the object,
, where and is the 2D Fourier transform. This exit wave then passes through the objective lens, where it is lowpass filtered by the pupil function, , which is usually a circle with its size defined by . Finally, with being the 2D inverse Fourier transform, we can write the intensity at the image plane as [19](1) 
2.2 Possible noise and simulated dataset
Ideally, all algorithms based on the forward model above should give good reconstructions. However, noise and model mismatch errors cause deviations from our forward model. Thus, a noise model that accurately describes the error will be important for noise tolerance. Heuristically, we have identified three experimental nonidealities that cause error: Poisson noise, aberrations and LED misalignment. We aim to separate and analyze the artifacts caused by each through controlled simulations that incur only one type of error.
The simulated data (Fig. 3) uses the same parameters as our experimental setup, where a green LED array (central wavelength nm) is placed 77 mm above the sample. LEDs are nominally 4 mm apart from each other and only the central 293 LEDs are used, giving a maximum . Samples are imaged with a objective lens having .
Using our forward model, we simulate four datasets:

Ideal data: no noise is added. The object and pupil follow exactly the FPM forward model that is assumed in the algorithm.

Poisson noise data: the ideal data is corrupted by Poissondistributed noise at each pixel. To emphasize the effect and to emulate experiments with lowerperformance sensors, we simulate 20
more noise than is present in our experiments (details in Section 2.3). 
Aberrated data: simulated images are corrupted by imaging system aberrations, which are described by the aberrated complex pupil function shown in Fig. 3. The pupil function used in these simulations was obtained from experimental measurements.

LED misaligned data: the illumination angle of each LED is perturbed slightly (following a normal distribution with standard deviation
). The black and blue in Fig. 3 show the original and perturbed LED positions, respectively.
To deal with these experimental errors, in the next section we will discuss different noise models for formulating the FPM optimization problem.
2.3 Optimization problem based on different noise models
Most algorithms solve the FPM problem by minimizing the difference between the measured and estimated amplitude (or intensity), without assuming a noise model. Hence, the FPM problem can be formulated as the following optimization
(2) 
Since the cost function here, , aims to minimize the difference between the estimated amplitude and the measured amplitude, this is the amplitudebased cost function. By optimizing this cost function, the projectionbased algorithms for Fourier ptychography can be obtained [1, 18, 19], which treat each measurement as an amplitudebased suboptimization problem. The formulation is used in the traditional GerchbergSaxton phase retrieval approach.
If we have information about the statistics of the noise, we can use it in our optimization formulation via the maximum likelihood estimation framework [29]
. If we assume that our measured images suffer only from white Gaussian noise, then the probability of capturing the measured intensity
at each pixel, given the estimate of , can be expressed as(3) 
where and is the standard deviation of the Gaussian noise. and denote the estimated and measured intensity, respectively.
The likelihood function is the overall probability due to all the pixels in all the images and can be calculated as
, assuming measurements from all pixels are independent. In maximum likelihood estimation, the goal is to maximize the likelihood function. However, it is easier to solve this problem by turning the likelihood function into a negative loglikelihood function which can be minimized. The negative loglikelihood function associated with this probability distribution can be calculated as
(4) 
The next step is to minimize this negative loglikelihood function by estimating so that the overall probability is maximized. For white Gaussian noise, it is assumed that are the same across all pixels for all images (i.e. all measurements have the same amount of noise), though this will not be the case for FPM datasets. By making a Gaussian noise assumption, the first term in (4) is a constant and can be ignored. The optimization problem then reduces to
(5) 
We call this cost function, , the intensitybased cost function because it aims to minimize the difference between the estimated intensity and the measured intensity. It also implies that noise from each pixel is treated the same and independent of the measured intensity. It will be shown later that the previous implementations of PhaseLift [27] and Wirtinger flow algorithms [21] for FPM aimed to optimize this intensitybased cost function. However, both can be implemented instead with a Poisson likelihood cost function.
If we assume instead that our measured images suffer from Poisson shot noise, then the probability of the measured intensity, , given the estimate of can be expressed as
(6) 
Note that the Poisson distribution is used to describe the statistics of the incoming photons at each pixel, which is a discrete probability distribution. Here, we assume that the intensity is proportional to the photon count, so we can treat the distribution of the intensity as a Poisson distribution. When the expected value of the Poisson distribution is large, then this Poisson distribution will become more like a Gaussian distribution having a standard deviation proportional to the square root of the intensity,
, from the central limit theorem. This means that a large measured intensity at a particular pixel will imply large noise at that pixel. In the simulation, we impose Poisson noise on the measured intensity by distributing each pixel value with a Gaussian distribution and setting the standard deviation to
. The negative loglikelihood of the Poisson noise model can then be calculated; the optimization problem is formed by minimizing the negative loglikelihood function with estimation of ,(7) 
This cost function comes from the likelihood function of the Poisson distribution, so we call it the Poissonlikelihoodbased cost function. It implies that the pixels with larger measured intensity are weighted smaller because they suffer from more noise. Since the brightfield images have more largevalue pixels, they are assumed to be more noisy and thus are weighted smaller in the cost function. It is shown in the Appendix that the gradient of this cost function (37) is very similar to that of the amplitudebased cost function (34), which suggests that the amplitudebased cost function deals well with Poissonlike noise or model mismatch.
2.4 Vectorization Notation
For multivariate optimization problems such as (2) and (5
) , it is convenient to reformulate the problem using linear algebra. First, the functions need to be vectorized. Each of the captured images,
, having pixels, are rasterscanned into vectors, , with size . Since the estimated object transmission function will have higher spacebandwidth product than the raw images, the estimated object should have pixels, where . For convenience, we actually solve for the Fourier space of the object, , which is vectorized into a vector with size . Before multiplying the pupil function, the Fourier space of the object is downsampled by a matrix . The matrix transforms a vector into a vector by selecting values out of the original vector, so the entries of this matrix are either 1 or 0 and each row contains at most one nonzero element. The pupil function is vectorized into a vector with size . The 2D Fourier transform and inverse transform operator are matrices defined as and . , , , and are elementwise operators, and the operator puts the entries of a vector into the diagonal of a matrix.The second step is to rewrite the optimization in vector form using the new parameters. First, the forward model (1) can be vectorized as
(8) 
The amplitudebased cost function (2) can be vectorized as
(9) 
where the hyperscript denotes a Hermitian conjugate.
Likewise, the intensitybased cost function (5) can be vectorized as
(10) 
The Poisson likelihood cost function is more complicated to be expressed in vector form. First, we rewrite as
(11) 
where is a matrix with row vectors, , , and denotes the complex conjugate of vector . Then the likelihood function can be rewritten as
(12) 
To minimize (9), (10) or (12) using an iterative optimization algorithm, the gradients (and possibly Hessians) of the cost functions need to be calculated, both of which are shown in the Appendix. Since (9), (10) and (12) are all realvalued functions of a complex vector , that means that and should be treated independently in the derivative calculation, which is based on the CRcalculus discussed in [30] and the similar formulation for traditional ptychography discussed in [17].
3 Derivation of algorithms
The basic formulation of the optimization problem in FPM has been described in the last section and the derivative calculation has been done in the Appendix, so we now turn our attention to describing how each algorithm solves the optimization problem based on different cost functions. The key step will be in how each algorithm updates the estimate of the object at each iteration. We compare the existing algorithms for FPM and also implement a new secondorder global Newton’s method under different cost functions, for comparison. The initialization for all algorithms is the same  the amplitude of the image from the onaxis LED illumination.
3.1 Firstorder methods
3.1.1 Sequential gradient descent (GerchbergSaxton) [1, 18]
For the implementation in [1, 18], the algorithm aims to optimize the amplitudebased cost function (9). It is the simplest to implement and, in this case, equivalent to the GerchbergSaxton approach of simply replacing known information in real and Fourier space. Since the sequential strategy treats a single image as an optimization problem, the cost function for each problem is just one component of Eq. (9) and is defined as
(13) 
where denotes the index of each measurement.
The derivative of this cost function is thus a component of Eq. (34) and can be expressed as
(14) 
The update equation for this sequential amplitudebased algorithm is then a gradient descent with the descent direction given by Eq. (14) and step size :
(15) 
where indicates the iteration number, which goes to after running through all the measurements from to . This algorithm adopts the alternating projection phase retrieval approach. The first projection in the real domain is the amplitude replacement operation , and the second projection is to project the previous estimated Fourier region onto the updated Fourier region .
It is worth noting that the algorithm in [1] directly replaces in the Fourier domain at each subiteration. A similar algorithm in [18], introduced for simultaneous aberration recovery, has the same form as Eq. (15) that implements gradient descent in the Fourier domain. However, when there is no pupil estimation, then becomes a pure support function with one inside the support and zero outside. In this situation, these two algorithms become exactly the same, and thus we refer to both as sequential gradient descent or GerchbergSaxton algorithm.
3.1.2 Wirtinger flow algorithm [21, 20]
The Wirtinger flow optimization framework was originally proposed to iteratively solve the codedmask phase retrieval problem using nonlinear optimization [20]. It is a gradient descent method implemented with a special initialization and special step sizes. For the FPM implementation described in [21], the intensitybased cost function is used. Thus, the update equation for the object transmission function can be expressed as
(16) 
where the step size is calculated by
(17) 
where is the gradient of the intensitybased cost function calculated in (36), and and are userchosen parameters to calculate the step size.
In the previously proposed FPM implementation of Wirtinger flow [21], the algorithm deviates somewhat from the original theory proposed in [20]. First, there is an additional term in the cost function to deal with additive noise. Second, the initialization used in [21] is not the proposed one in [20], but rather a lowresolution captured image. So the algorithm in [21] is essentially a gradient descent method with the special step size based on the intensitybased cost function and is not guaranteed to converge to the global minimum.
The Wirtinger flow algorithm can be implemented with different cost functions simply by replacing the original intensitybased gradient with the other gradients derived in the Appendix. For comparison, we have implemented the Wirtinger flow algorithm using all three of the cost functions described here: amplitudebased, intensitybased and Poissonlikelihoodbased. The results are compared in Fig. 1 with experimental data and Section 4 with simulated data.
3.2 Secondorder methods
Beyond firstorder, a secondorder optimization method can improve the convergence speed and stability of the algorithm, especially for nonlinear and nonconvex problems. Secondorder methods (e.g. Newton’s method) use both the first and second derivatives (Hessian) of the cost function to create a better update at each iteration. As a result, they generally require fewer iterations and move more directly towards the solution. The difficulty of secondorder implementations is in computing the Hessian matrix, whose size scales quadratically with the size of the image. As a result, approximations to the Hessian are often used (known as quasiNewton methods) to trade performance for computational efficiency.
3.2.1 Sequential GaussNewton method [19]
First, we look at a GaussNewton method based on the amplitudebased cost function, which approximates the Hessian matrix as a multiplication of its Jacobian matrix:
where (See Appendix). Since the inversion of this Hessian matrix requires very high computational cost, we approximate the Hessian by dropping all the offdiagonal terms of the Hessian matrix. Further, the inversion of the Hessian matrix may be an illposed problem, so a constant regularizer is adopted. In the end, the approximated Hessian inversion becomes
(19) 
where is a constant vector with all the entries equal to a constant regularizer over all pixels.
By applying Newton’s update, Eq. (45), with this approximated Hessian inversion, the new estimate of can be expressed as
(20) 
where the part is the step size for this descent direction. Note that when is a constant having either 0 or 1 values, this method is reduced to the sequential gradient descent method with a tunable regularizer . In practice, however, we also simultaneously update (see Section 4.3), so the secondorder optimization procedure becomes more crucial.
3.2.2 New algorithm implementing a global Newton’s method
Since we expect secondorder methods to perform better than firstorder, and also global methods to be more stable than sequential, we propose a new global secondorder (Newton’s) method, and show the results compared against other methods. For completeness, we implement all three of amplitude, intensity, and Poissonlikelihoodbased cost functions, showing that the amplitude and Poissonlikelihoodbased cost functions indeed perform better. The difficult step in deriving a Newton’s method for this problem is in calculating the gradients and Hessians of the cost functions directly, without approximations. In the Appendix, we show our derivation, and in this section we use the results with a typical Newton’s update equation:
(21) 
The inverse of the Hessian matrix, , is solved efficiently by a conjugate gradient matrix inversion iterative solver as described in [31]. is determined by the backtracking line search algorithm at each iteration, as described in [22]. The exact form of the cost function and the Hessian depends on the algorithm used. For amplitudebased Newton’s algorithm, and ; for intensitybased Newton’s algorithm, and ; for Poissonlikelihoodbased Newton’s algorithm, and .
3.3 Convexbased methods
3.3.1 PhaseLift algorithm [27, 23, 24, 25, 26]
The PhaseLift formulation for phase retrieval is conceptually quite different than the previous methods described here. The idea is to lift the nonconvex problem into a higherdimensional space in which it is convex, thereby guaranteeing convergence to the global solution. To do this, the cost function of is reformulated into that of a rank1 matrix and the goal is to estimate instead of . The process of reformulation can be expressed as [27]
(22) 
where is an operator combining the inverse Fourier transform, pupil cropping, and the downsampling operation with row vectors denoted by .
Hence, the estimated intensity as a function of can be expressed
(23) 
where is a linear operator transforming into . In Section 2.3, we discussed three different cost functions. Only the intensitybased and Poissonlikelihoodbased cost functions are convex on the estimated intensity, , which is a component of . Thus, the intensitybased and Poissonlikelihoodbased cost functions can be turned into a convex function on through this transformation. For the implementation in [27], by defining , the intensitybased cost function can be expressed as
(24) 
Since is a rank1 matrix, we then minimize the rank of subject to . However, the rank minimization problem is NPhard. Therefore, a convex relaxation [23, 24, 25] is used instead to transform the problem into a trace minimization problem. Under this relaxation, the optimization problem becomes
(25) 
where is a regularization variable that depends on the noise level.
The problem with this new approach is that by increasing the dimensionality of the problem, the size of the matrix has become
, which is too large to store and calculate eigenvalue decomposition on a normal computer. To avoid these computational problems, we do not directly solve (
25), but rather apply a factorization to , where is an matrix. is a rank1 matrix so is set to be 1 ( becomes ). This new problem is then solved effectively using the augmented Lagrangian multiplier, by modifying the original cost function [27, 26](26) 
where , vector, is the Lagrangian multiplier, and is the augmented Lagrangian multiplier. Both are parameters that can be tuned to give a better reconstruction. By taking the derivative of this cost function with respect to and updating in each iteration, the optimization problem can then be solved [26]. Unfortunately, after these modifications, the problem becomes nonconvex because of the minimization with respect to instead of , and thus is no longer provable.
In order to provide a more familiar form for comparing the PhaseLift algorithm to the others discussed in this paper, we define , where is vector, so that the minimization problem in Eq. (26) becomes
(27) 
Now, we see that the PhaseLift implementation is essentially an intensitybased cost function with an additional constraint that may deal better with noise.
The corresponding derivative of the cost function is calculated as in the previous section:
(28) 
When is large compared to the component of and , the factorized PhaseLift formulation with rank1 is equivalent to the intensitybased optimization problem discussed in the previous section. To solve this optimization problem, a quasiNewton algorithm called LBFGS (Limitedmemory BroydenFletcherGoldfarbShanno) method [22], which is a secondorder method using an approximated Hessian inversion from previous gradients, is adopted.
We note that although the PhaseLift algorithm can also be implemented with the Poissonlikelihoodbased cost function, the algorithm in the rank1 case is equivalent to our global Newton’s method discussed in Section 3.2.2 for the same reason as in the above analysis.
4 Performance analysis of various algorithms
In this section, we compare the algorithms described in Section 3 using experimental data, as well as simulated data that mimics the experimental errors described in Section 2.2. We find that secondorder optimization generally performs better than firstorder, while global methods do not give significant improvement over sequential. Further, we explain why the cost function is a key consideration in choosing an algorithm by explaining the cause of the highfrequency artifacts that result from intensitybased algorithms. Interestingly, the two model mismatch errors (aberrations and LED misalignment) behave similarly to Poisson noise, in that they also give intensitydependent errors. Hence, the amplitude and Poisson likelihood algorithms are more robust not only to Poisson noise, but also to model mismatch errors.
4.1 Reconstruction of the simulated and experimental data
Next, we use each of the algorithms described in Section 3 to reconstruct amplitude and phase from the datasets simulated in Section 2.2, in order to quantify performance under various experimental error types by comparing against the ground truth input. Figures 4 and 5 show the reconstructed amplitude and phase, respectively. On the top left corner of each image we give the relative error of the reconstruction, defined as
(29) 
where and are the reconstructed and true images, respectively, in vector form. In order to ensure that all algorithms converge to their stable solutions, we use 200 iterations for each algorithm, except for Wirtinger flow, which requires 500 iterations. The tuning parameters for each algorithm are summarized in Table 1. We have attempted to optimize each parameter as fairly as possible; for example, we use a large in the PhaseLift algorithm to achieve a better reconstruction. Small trades resolution for flatter background artifacts.







N/A  N/A 

N/A 



PhaseLift 



N/A 
In analyzing results from the simulated datasets, we find that algorithms with the same cost function give similar reconstruction artifacts. For example, the intensitybased algorithms suffer from highfrequency artifacts and phase wrapping when the data is not perfect. Almost all algorithms give a satisfactory reconstruction when using the errorfree ideal dataset, except for intensitybased Wirtinger flow, which suffers some phaseamplitude leakage and phase blurring (see Figs. 45). When the dataset contains noise or model mismatch, we observe a distinct trend that amplitudebased and Poissonlikelihoodbased algorithms give a better result, compared with intensitybased algorithms. The exception to this trend is the GerchbergSaxton algorithm, which is somewhat unstable and gets stuck in local minima, so is not robust to any type of error.
The goal of our simulations was to determine the main error sources that cause artifacts in the experimental reconstructions of Fig. 1. Since the experiments contain combined errors from multiple sources, it is difficult to attribute artifacts to any particular type of error. We find, however, that all three of our main error sources cause similar artifacts, hence our experimental results may be corrupted by any of Poisson noise, aberration, or LED misalignment. For example, notice that our simulated errorcorrupted data all results in highfrequency artifacts when using intensitybased algorithms, similar to the experimental results. The GerchbergSaxton result also displays lowfrequency errors in simulation, as in experiment. The fact that both noise and model mismatch create similar artifacts is unexpected, since they are very different error mechanisms. We explain below why all three are intensitydependent errors, which is the reason why the cost function choice is so important for robustness. The consequence is that algorithms which use a more accurate noise model (amplitude and Poisson likelihoodbased) will not only be more robust to noise, but also to model mismatch errors.
To examine the convergence of each algorithm, Figure 6 plots the error for each iteration when using the aberrated dataset and LED misaligned dataset with different algorithms. The intensitybased algorithms (red curves) clearly do not converge to the correct solution and can incur large errors when the data is not perfect. Compared to PhaseLift and the intensitybased Newton’s method, the Wirtingerflow algorithm seems to have lower error; however, this is only due to its slow divergence. If run for many iterations, it will eventually settle on a similarly errorcorrupted result as the other two intensitybased algorithms (not shown). We also observe that amplitudebased (blue curves) and Poissonlikelihoodbased (black curve) algorithms converge to points with lower errors in a similar fashion. This behavior is well explained by the similarity of the algorithms in their use of gradients and Hessians (as shown in the Appendix). Again, the exception to the trend is the firstorder GerchbergSaxton algorithm, which recovers the object fairly well with aberrated data, but goes unstable in the case of LED misalignment. Note that, when there is no pupil estimation step, the only difference between the GerchbergSaxton and the sequential GaussNewton algorithm is the step size. Since the latter algorithm gives a good reconstruction, while the former diverges, we conclude that the GerchbergSaxton step size is too large for a stable update in this particular case.
Ideal data  Misaligned data  

Runtime (s) 

Runtime (s)  

4  2.22  diverges  diverges  

23  12.97  83  46.8  

13  100.49  20  154.6  

46  26.28  158  89.52  

28  211.68  77  582.1  

96  54.46  153  87.36  

1481  651.64  diverges  diverges  

67  386.28  diverges  diverges  

12  74.44  diverges  diverges 
The convergence speed of each algorithm can be determined from Figure 6 using two metrics: number of iterations required and total runtime. We choose the convergence curves from the cases of ideal data and LED misaligned data and compare their iteration numbers and runtimes in Table 2. All the algorithms were implemented in MATLAB on an Intel i7 2.8 GHz CPU computer with 16G DDR3 RAM under OS X operating system. We define convergence as the point when the relative phase error reaches its stable point. The comparison does not consider the divergent cases. In the ideal data case, we can see that the sequential methods outperform all the other algorithms in terms of runtime. The GerchbergSaxton algorithm is the fastest in terms of both iteration number and runtime for this perfect dataset. The global Newton’s method using intensitybased and amplitudebased cost functions also converge very fast in terms of iteration number. The Wirtinger flow algorithm takes much longer to reach convergence both in runtime and iteration number. For the case of the LED misaligned data, only five algorithms converge. In terms of iteration number, the amplitudebased Newton’s method converges much faster than the other four, as expected. However, the sequential GaussNewton algorithm converges much faster in terms of the runtime. Though the global Newton’s method is theoretically better than the others, it takes significant time to calculate the full Hessian matrix. Thus, the sequential GaussNewton method is our preferred algorithm in practice, because it provides excellent robustness while also enabling fast runtimes and reasonable computational complexity.
The main conclusions to be drawn from this section are that the FPM optimization algorithms which are formulated from amplitudebased and Poissonlikelihoodbased cost functions are more tolerant to imperfect datasets with both Poisson noise and physical deviations like model mismatch, which were represented by aberrations and LED misalignment here. In the next section, we will explain more about the causes for this trend.
4.2 Noise Model Analysis
The reason why amplitudebased and Poissonlikelihoodbased algorithms have superior tolerance to experimental errors is due to their Poisson noise model. Each of these algorithms makes an implicit or explicit assumption that the magnitude of the errors in the data scale with the measured intensity. This is obviously a good model for Poisson noise errors, which are defined as noise which scales with intensity. It is not as obvious that the model mismatch errors (aberrations and LED misalignment) scale with intensity as well. To demonstrate this, Fig. 7 shows the histogram of the difference between the deviated dataset and the ideal dataset, for the cases of both brightfield and darkfield images. The histograms show a similar trend  all of the brightfield errors are much larger than the darkfield errors, with a similar statistical variation. Thus, the errors from Poisson noise, aberrations and LED misalignment all scale with the measured intensity. In our experimental data, there are always aberrations in the objective lens, LED misalignment, and Poisson shot noise. Since the noise model for the amplitudebased and Poissonlikelihoodbased algorithms match the actual noise properties, these algorithms perform better than the intensitybased algorithms. And since the images captured by FPM have drastically different intensity values, this effect dominates the reconstruction artifacts. Note that these large variations in intensity values are specific to FPM and likely do not play a major role in other phase imaging schemes (e.g. phase from defocus or traditional ptychography), where images do not have such a wide range of intensity values. In our experiments, the Poisson noise is fairly low (due to use of a highperformance sCMOS sensor), but the model mismatch in the experimental data can cause effects similar to strong Poisson noise.
For further understanding, we look closer at the relationship between the noise model and the cost function. Our optimization algorithms are derived from three cost functions. Each of the cost functions makes a noise model assumption. The intensitybased cost function assumes that noise in the data follows a white Gaussian noise model, which means that the standard deviation of the noise is assumed to be the same across the brightfield and darkfield images. Recall that the standard deviation of a Gaussian noise probability model is related to the weight in the cost function for each pixel, as shown in Eq. 4. The larger the standard deviation (amount of noise) at any pixel in Fourier space, the smaller the weighting, since noisy pixels should be trusted less. In the Gaussian noise model, the weights in the cost function for largevalued pixels and smallvalue pixels are the same. However, the deviation for brightfield images is much larger than that for darkfield images, as shown in Fig. 7. Hence, the brightfield images will contribute more to the total cost function value if the weights are all the same, due to their high intensity. The result is that the intensitybased (Gaussian noise model) algorithms focus mostly on the brightfield images, which correspond to low spatial frequency information, and the darkfield images do not contribute much. The result is a failure in the highfrequency reconstruction, as we saw in Figs. 1, 4, 5, and loss of effective resolution since the darkfield images contain all the subdiffractionlimit information. To illustrate the dramatic difference in weights, Fig. 8 shows the gradient of the different cost functions. Obviously, the intensity cost function gives much higher weighting to low spatial frequencies, which causes the highfrequency artifacts.
Since the amplitudebased cost function shares a similar gradient and Hessian with the Poisson likelihood function, as shown in the Appendix and Fig. 8, it is not surprising that they both produce a similar quality reconstruction. Both of these cost functions assume the noise in the data follow a Poisson distribution, with the standard deviation scaling with the measured intensity. This assumption matches the actual error better than the white Gaussian assumption. The actual noise or deviations in the experiments for brightfield images have larger standard deviation, while that for darkfield images have smaller standard deviation. Under the Poisson noise model, the weight in the cost function is smaller for the noisy brightfield images and larger for the darkfield images. At the end, algorithms based on the Poisson noise model put more emphasis on the darkfield images and thus get a better reconstruction compared to the intensitybased algorithms. Figure 8 shows that the gradients for the amplitudebased and Poissonlikelihoodbased cost function are similar and are more uniform throughout the whole Fourier space.
4.3 Pupil estimation
There are already more sophisticated FPM extensions to correct for some model mismatch errors [18, 19], similar to the probe correction algorithms in traditional ptychography [16]. Both of the methods previously developed for Fourier ptychography are derived from the amplitudebased formulation. By taking the derivative of the cost function with respect to , the decent direction to estimate the pupil function can be calculated as
(30) 
By applying the pupil estimation step after each object estimation using this gradient or approximated Hessian, the sequential gradient descent [18] and the sequential GaussNewton method [19] including pupil estimation can be derived. Here we only consider the amplitudebased cost function, for simplicity.
We wish to investigate the improvements obtained by adding a pupil estimation step to both first and secondorder optimization algorithms. Figure 9 shows the reconstruction result from the sequential gradient descent (firstorder) and sequential GaussNewton (secondorder) algorithms, using the aberrated dataset from the previous simulations. The numbers at the top left corner are the relative error compared to the ground truth simulated image. As can be seen, adding the pupil estimation step gives a better complexfield reconstruction, and the secondorder (GaussNewton) method with pupil estimation provides the best result.
Surprisingly, however, the secondorder reconstruction without pupil estimation is better than the firstorder reconstruction with pupil estimation, for this case. This highlights the robustness to aberrations that a secondorder optimization scheme enables. The secondorder nature of the algorithm makes it faster in convergence, and also more stable. In terms of runtime, the pupil estimation step takes about the same time as the object reconstruction part, so the algorithm is two times slower when the pupil function step is incorporated.
4.4 Misalignment correction
Another possible correction scheme for model mismatch is that for LED misalignment. Since each LED position corresponds to a certain shift of the pupil function in the Fourier domain, this is similar to the shift of the probe function in traditional ptychography. There, iterative algorithms have been proposed to correct for the positioning error of the probe function [14, 32, 33, 34]. In [14, 34], a gradient of the cost function with respect to the shift of the probe function has been calculated and the conjugate gradient method has been applied to correct for the positioning error. In [32], a simulated annealing method is adopted to estimate the shift of the probe function. The simulated annealing method is also adopted to correct for the misalignment of the spatial light modulator in a overlapped Fourier coding system [35], analogous to FPM. In our experiments, we observe that the simulated annealing method can locate the LED positions more accurately than other methods. Thus, we only compare with the simulated annealing method.
Simulated annealing is a method of searching unknown variables over a finite space to minimize or maximize the function of merit  the cost function in our case. Instead of exhaustively testing all the possible states, simulated annealing iteratively approaches the optimal state. At the first iteration, the algorithm randomly searches several states in the space and selects the one with the smallest cost function value. The algorithm then starts at this state for the next iteration, slowly reducing the search range in the following iterations until convergence.
In our sequential algorithm, the whole optimization problem is divided into many suboptimization problems for different collected images. At each suboptimization problem, a gradient descent or GaussNewton method is applied to update that corresponding region in Fourier domain. To add a LED misalignment correction step, the simulated annealing algorithm can be incorporated into each subiteration to find an optimal shift of the pupil function. In each subiteration, the downsampling matrix, , which contains the information of the pupil shift, is tested according to the annealing process for several possible states corresponding to different shifts of the pupil. The state with the smallest cost function value is selected to update the old downsampling matrix. Then, the new downsampling matrix is used to update the corresponding region in the Fourier domain.
The simulated annealing method estimates the LED positions with good accuracy. Figure 10 shows the reconstruction result from the simulated LED misaligned dataset, both with and without the LED correction step. The result using the LED correction clearly shows better quality and smaller error, as seen in Fig. 10(a). Since the LED correction scheme also estimates the actual LED positions, which we intentionally perturbed in order to impose a known error, we can also compare the actual and recovered LED positions, shown in Fig. 10(b).
To complete the picture, we now show experimental reconstructions with and without the two correction schemes: pupil correction and LED misalignment corrections (see Fig. 11). Since we do not know ground truth for our experiments, we can only make qualitative observations. An incremental improvement is observed when adding the pupil estimation and then the LED correction steps  the background variation becomes flatter. Figure 11
(b) shows the corrected LED positions compared to the original ones, in angular coordinates. Corrected positions of LEDs in different regions share similar offset because the fabrication process of the LED array can cause unexpected position misalignment for each LED. Notice that the LEDs at the edges (corresponding to higher angles of illumination) incur more variation, since these are more sensitive to calibration. Also, many of the large deviations occur at the edges that are not along the horizontal and vertical axes. In these areas, the LED position recovery is poor because the object has very little information there (the resolution test target contains only square features) and so the data contains little information about these areas. However, any errors in LED positions in this area will also not significantly affect the reconstruction if they do not contribute much energy to the object spectrum. If the goal was not to correct the image results, but rather to find the LED positions accurately, then one should choose an object that contains uniformly distributed spatial frequencies (e.g. a random diffuser or speckle field). Although the simulated annealing further improves our reconstruction, we note that it is more than ten times slower to process the data because of the local search performed at each subiteration.
5 Conclusion
We formulated the Fourier ptychographic phase retrieval problem using maximum likelihood optimization theory. Under this framework, we reviewed the existing FPM algorithms and classified them based on their cost functions: amplitudebased algorithms (akin to a Poisson noise model) and intensitybased algorithms (akin to a white Gaussian noise model). We also derived a new algorithm based on the Poisson likelihood function, which is better suited for dealing with measurement imperfections. We compared the tolerance of these algorithms under errors due to experimental noise and model mismatch (aberrations and LED misalignment) using both simulated data and experimental data. Because the noise and model mismatch error for brightfield and darkfield images depend on the measured intensity, the amplitudebased and Poissonlikelihoodbased algorithms from the Poisson noise model are more robust than the intensitybased algorithms. This can be explained by the standard deviation of the noise model determining the weight of each image in the optimization. Hence, intensitybased algorithms overweight the brightfield images, resulting in poor highfrequency reconstruction.
We used existing pupil estimation algorithms and proposed a simulatedannealingbased LED correction algorithm to algorithmically fix the experimental deviations. We compared the performance of the pupil estimation algorithms and found that secondorder methods give the best results. We also showed the capability of the simulated annealing method to correct for misaligned LEDs and find their actual positions.
Based on our studies, we conclude that the global Newton’s method gives the best reconstruction, but may have high computational cost. Considering both robustness and computational efficiency, we find that sequential GaussNewton method provides the best tradeoffs for largescale applications. Its experimental robustness is verified in our recent timeseries in vitro experiments [7]. Our open source code for this algorithm can be downloaded at [36].
Acknowledgments
Funding was provided by the Gordon and Betty Moore Foundation’s DataDriven Discovery Initiative through Grant GBMF4562 to Laura Waller (UC Berkeley).
Appendix A: Gradient and Hessian calculation
Gradient:
Then, calculate the derivative of with respect to , and it can then be expressed as
(32) 
Using and
, two chain rule parts in (
32) are calculated as(33) 
if does not contain any zero entries for .
By plugging these two terms into (32), the gradient of with respect to becomes
(34) 
The gradient for can be calculated in the similar way, and the chain rule part of can be calculated as
(35) 
With (35), it is clear to express the gradient of as
(36) 
The calculation of gradient of with respect to is different from the other two. With the expression (12), the gradient of Poisson likelihood function can be calculated as
(37) 
This is equivalent to the gradient of the intensitybased cost function with added weight to the component from each image. In addition, this gradient is very similar to that from the amplitudebased cost function.
Since we have gradients for all cost functions, the updating equation for the gradient descent method can then be expressed as
(38) 
where denotes the iteration number, is the step size chosen by the line search algorithm, and can be either intensitybased or amplitudebased cost function.
Looking at , and , they all contain the term following by a residual term. The residual term basically finds the difference between the estimation and the measurement. This difference carries the information to update the previous estimation. Since each measurement carries the information for a specific region in the Fourier space, the term brings this updating information back to the right place corresponding to some spatial frequency. For , the first term in the residual shows the replacement of the amplitude in the real domain, which is the projection from the estimation to the modulus space. Thus, the gradient descent method using the amplitudebased cost function is similar to the projectionbased phase retrieval solver.
Hessian:
The secondorder Taylor expansion on an arbitrary real function with a complex vector at certain point can be written as [30]
(39) 
where the matrix is the Hessian of . For the case of a singlevalue function, the secondorder term in the Taylor expansion denotes the curvature of the function at that expansion point. Thus, this Hessian matrix similarly contains the curvature information of the original multivariate function.
If the Hessian is a diagonal matrix, each diagonal entry denotes the curvature in each corresponding dimension. If the Hessian is not diagonal, a coordinate transformation can be found to make the Hessian diagonal by using eigenvalue decomposition. For a convex problem, the Hessian is positive semidefinite. The curvatures of the cost function in different dimensions are always nonnegative. A standard optimization process can lead to a global minimum. However, if the problem is nonconvex, a standard optimization process will probably lead to a local minimum. Calculating the Hessian of a cost function is useful either to examine the optimization process or to speed up the convergence rate by using Newton’s method.
From [17, 30], the definition for the Hessian of a realvalue function with multiple complex variables is a matrix and can be expressed as
(40) 
where each component matrices can be further calculated as
(41) 
Similar to the calculation of the gradient, the components of the Hessians for amplitudebased, intensitybased, and Poissonlikelihoodbased cost functions can be calculated by taking an additional derivative on the gradient of the cost functions. The components of the Hessian for the amplitudebased cost function are
(42) 
where is the identity matrix.
Likewise, the Hessian of the intensitybased cost function is
(43) 
Finally, the Hessian of the Poisson likelihood cost function is