1 Introduction
In recent years, data-driven image reconstruction techniques based on machine learning, in particular deep learning (DL) [1], have achieved tremendous success in solving complex inverse problems [2], and can often provide results surpassing those of state-of-the-art model-based techniques. Traditionally, solving an inverse problem involves first explicitly formulating the imaging model and incorporating domain and prior knowledge (e.g. via regularization techniques), and then finding an analytical solution (e.g. through an optimization procedure) [3]. Unlike model-based approaches, the ‘end-to-end’ DL framework does not explicitly utilize any models or priors; instead, it relies on large datasets to ‘learn’ the underlying inverse problem. The outcome of this DL approach consists of two important components. First, the result of the training stage is a CNN that corresponds to a plausible underlying mapping function relating the measurement to the solution. Second, the trained CNN can be used to make ‘predictions’ when presented with new measurements that were not used in the training stage. This second part brings major practical benefits in computational cost and speed for typical image reconstruction problems, since the prediction simply involves the feedforward computation of the CNN, which typically takes no more than a few seconds on a consumer-grade GPU. In contrast, most modern model-based techniques rely on iterative algorithms [4, 5, 6] that incur much higher computational cost and longer running times; the same lengthy process must be repeated for every new measurement.

Here, we distinguish two classes of imaging problems: those involving independent datasets, often from static objects, and those dealing with sequential, temporally correlated datasets from dynamic objects. In independent
problems, CNNs have been demonstrated to provide superior performance in many challenging imaging problems, such as image super-resolution [7, 8], denoising [9, 10], segmentation [11], deconvolution [12, 13], compressive imaging [14, 15], tomography [16, 17], digital labeling [18], holography [19, 20], phase recovery [21], and imaging through diffusers [22, 23]. What is common in this class of problems is that independently prepared input-output pairs (i.e. measurement and solution), obtained by repeating the same imaging process, are presented to the CNN at the training stage to optimize the network’s parameters. In sequential problems, the temporal correlation of a dynamic process contains additional information, and is often recorded in video datasets. Various CNN frameworks have been proposed to learn this additional temporal information. For example, spatial super-resolution has been demonstrated by training a CNN on both the spatial and temporal dimensions of videos [24]. Temporal super-resolution of recurring processes has been achieved by learning the underlying temporal statistics [25]. The motion information of dynamic objects has been learned with an optical-flow-based CNN [26]. Motion artifacts can be removed by jointly learning the blurring point-spread-function (PSF) and the deconvolution operation [27, 28]. In all these cases, the CNNs are designed to process a video sequence in order to extract the temporal information. The downside is that the CNN architectures inevitably become more complicated and require more computational resources than those used in the independent problems. Fundamentally, the complication stems from the fact that no single frame from the imaging techniques used contains sufficient temporal statistical information.

In this work, we develop a CNN architecture to reconstruct video sequences of dynamic live cells captured with a computational microscopy technique based on Fourier ptychographic microscopy (FPM) [30, 29].
The unique feature of FPM is its ability to quantitatively reconstruct phase information with both a wide field-of-view (FOV) and high spatial resolution, i.e., a large space-bandwidth product (SBP). This is not possible with traditional techniques, which must trade spatial or temporal resolution for FOV. For live-cell imaging applications, this allows one to simultaneously image a large cell population (e.g. more than 3400 cells in a single frame in [29]). Cells of the same type undergo similar morphological changes during different cell states, which repeat over each cell cycle. If one records only a few cells at a time using conventional microscopy techniques [31], capturing the full dynamics would require a long sequence of measurements covering the entire cell cycle (typically ranging from a few hours to days). Our proposed technique is based on the observation that, in any live-cell experiment without precise cell synchronization [32], a large cell population contains, at any instant of time, samples covering all cell states. In other words, it is possible to gather sufficient temporal statistical information about a single cell by imaging a large spatial ensemble simultaneously. Based on this idea, we propose a CNN that is trained using only a single frame from the FPM. We then show that this trained CNN is able to reconstruct large-SBP phase videos with high fidelity from datasets taken in a time-series live-cell experiment.
Existing FPM techniques are limited by their long acquisition times, which stem from the FPM algorithms’ requirement of a minimum overlap in the Fourier coverage of the images captured from neighboring LEDs [30]. Several illumination multiplexing techniques have been demonstrated to improve the acquisition speed [33, 29]. However, the amount of data reduction is still limited by the Fourier overlap requirement. Here, we show that, similar to prior work on CNNs for FPM of static objects [34], our CNN can be sufficiently trained using far fewer images than needed by the model-based FPM algorithms for dynamic live-cell samples.
Distinct from computer vision applications, a particular challenge in applying DL to biomedical microscopy is the difficulty of gathering the ground truth data needed for training the network. Various strategies have been proposed, including synthetic data from simulations built on physical imaging models [35, 36, 37], semi-synthetic data that uses experimental data to guide simulations [36], experimental data captured with a different modality [8, 19], and experimental data captured with the same modality [36]. Here, we propose to use the traditional FPM reconstructed phase images as the ground truth for training. Since our technique requires only a single frame for training, this adds little overhead in data acquisition or computation. Experimental data used as the ground truth are inevitably contaminated with noise. In FPM, the quality of the phase reconstruction is limited by spatially variant aberrations, system misalignment, and intensity-dependent noise [38]. Robust learning from noisy labeled data has been demonstrated for image classification and segmentation [39, 40]. In essence, the CNN captures the invariants while filtering out the random fluctuations [41, 42]. Here, we show that our proposed CNN is also robust to phase noise in the ‘ground truth’ data when solving the inverse problem of FPM.

We build a CNN based on the conditional generative adversarial network (cGAN) framework, consisting of two sub-networks, the generator and the discriminator. The generator network uses the U-Net architecture [11] with densely connected convolutional blocks (DenseNet) [43] to output a high-resolution phase image. The discriminator network distinguishes whether its input is real or fake. We compare five variants of the network, which differ in their input measurements, corresponding to different illumination patterns and hence different Fourier coverages. As in traditional FPM, the dark-field measurements lead to improved spatial resolution in the reconstruction. To further refine the network, we introduce a mixed loss function that adds a weighted Fourier-domain loss to the standard image-domain loss for the generator and the adversarial loss for the discriminator.
We show that this novel weighted Fourier-domain loss leads to improved recovery of high-frequency information. We demonstrate our technique using live HeLa cell FPM video data from [29]. We quantitatively assess the performance of our CNN technique over time against traditional FPM results, and find that the ‘generalization’ degradation of the reconstructed phase is small over the entire time course (>4 hours).
The training is performed on a PC (Intel Core i7 CPU, 32 GB RAM, NVIDIA GeForce Titan Xp GPU) for 16 hours using the Keras/TensorFlow framework. Once the network is trained, reconstructing a 12800×10800-pixel phase image requires only 25 seconds, which is approximately 50× faster than the model-based FPM algorithm [29].

Our technique demonstrates a promising deep learning approach to continuously image large live-cell populations over extended time and gather spatial and temporal information with sub-cellular resolution. Compared to existing FPM [30, 29], this CNN approach significantly improves the overall throughput by reducing both the acquisition and computation times, with less data required. The CNN-reconstructed phase images provide high spatial resolution, wide FOV, and low noise-induced artifacts. We also show the flexibility of reconstructing other cell types using transfer learning, which makes our technique appealing for broad applications.
2 Method
2.1 Conditional generative adversarial network (cGAN)
Generally speaking, the proposed CNN-based FPM reconstruction algorithm takes a set of low-resolution intensity images as the network input and outputs a single high-resolution phase image. The intensity images are captured by illuminating the sample from different illumination angles (LEDs) [Fig. 1(a)], some of which are bright-field (BF) and the rest dark-field (DF) (Fig. 2). In the training stage, the ground truth phase image fed to the CNN is the high-resolution phase reconstructed by the FPM algorithm in [29] [Fig. 1(b)]. A key feature of FPM is that it reconstructs a high-resolution phase image from a set of low-resolution intensity images, providing a fixed resolution-enhancement factor in each dimension. Obtaining the ground truth requires capturing the full FPM dataset of 173 images [29]. Since our DL scheme requires training only on the first ‘FPM frame’, each subsequent frame requires only a subset of the images, which reduces the acquisition time, especially in a time-series experiment. We denote the set of low-resolution images as a 3D tensor (the images stacked along the channel dimension) and the corresponding ground truth as a single-channel tensor [Fig. 1(b)].

The proposed CNN that performs FPM video reconstruction [Fig. 1(c)] is based on the conditional generative adversarial network (cGAN) framework. It consists of two sub-networks, the generator and the discriminator (Fig. 2). The goal of the generator G is to be trained to predict a high-resolution phase image from the low-resolution input image set. To simplify the notation, we drop the subscript on the input, with the understanding that it always contains the low-resolution intensity images. The generator network contains a set of parameters θ (weights and biases), which are optimized through training. The optimal θ is learned by minimizing a loss function over the input-output training pairs:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{N}\sum_{n=1}^{N} \mathcal{L}\big(G_{\theta}(I^{(n)}),\,\phi^{(n)}\big) \quad (1)$$

where $I^{(n)}$ and $\phi^{(n)}$ denote the $n$-th input image set and ground-truth phase, respectively, and $N$ is the number of training pairs.
We emphasize that the choice of the loss function significantly affects the quality of the training. We propose a mixed loss function that takes the weighted sum of multiple elementary loss functions, which will be detailed in Subsection 2.2.
The generator adopts the general ‘encoder-decoder’ architecture used in U-Net [11] to facilitate efficient learning of pixel-to-pixel information. U-Net has been shown to increase the network’s performance by adapting to the high-complexity information in image datasets [44]. To enhance the efficiency of the training process, batch normalization (BN) is used to offset internal covariate shift [45]. In addition, dropout regularization [46] is employed to constrain the network’s adaptation to the data during training, avoiding overfitting and increasing the model’s accuracy. A known problem of training a CNN is that it can saturate when the network becomes too deep [47]. To mitigate this problem, the dense block (DB) proposed in the densely connected network is used [43]. A DB connects each layer to its subsequent layers in a feedforward fashion. The inputs to each layer are the feature maps of all preceding layers; the current layer’s own feature maps are inputs to all subsequent layers (see Fig. 2
). The DB has several advantages, including (a) mitigation of the vanishing-gradient problem during training; (b) reduction of the total number of parameters; and (c) enhancement of feature propagation and reuse. A typical $\ell$-layer DB is defined as follows:

$$x_{\ell} = H_{\ell}\big([x_0, x_1, \ldots, x_{\ell-1}]\big) \quad (2)$$

where $[\cdot]$ denotes the concatenation operation that connects the feature maps of all preceding layers in the block, and $H_{\ell}$ denotes the operations of the $\ell$-th layer. The output at the end of an $\ell$-layer DB has $k_0 + \ell k$ feature maps, where $k_0$ is the number of feature maps in the first layer and the hyperparameter $k$ is referred to as the growth rate. Within each layer inside the DB (ConvBlock), a series of operations is performed, including batch normalization (BN), nonlinear activation using the ReLU or LeakyReLU function [48], and convolution with filters of a fixed kernel size [Conv].
Our generator G contains a total of 11 DBs. The number of ConvBlock layers in each DB is marked in Fig. 2. In each ConvBlock layer, a stack of BN-ReLU-Conv-BN-ReLU-Conv operations is performed.
Between two consecutive DBs, a transition block facilitates the desired downsampling or upsampling operation. The downsampling transition block contains Conv-BN-ReLU-Conv(stride = 2); the upsampling transition block contains Conv-BN-ReLU-Deconv(stride = 2), where Deconv denotes the deconvolution (transpose convolution) layer [49]. The features of the input layer are extracted by an initial Conv-BN-ReLU block before being fed to the first DB. A final Conv layer performs the regression that generates the phase map.
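To make the dense-connectivity pattern concrete, the following is a minimal NumPy sketch of a dense block. Each ConvBlock is reduced to a toy ReLU ‘convolution’ (a 1×1 channel-mixing matrix, with BN omitted), and the layer count and growth rate are illustrative values, not the ones used in our network.

```python
import numpy as np

def conv_block(x, k, rng):
    """Toy stand-in for a ConvBlock: a random 1x1 'convolution'
    (channel-mixing matrix) followed by ReLU, producing k feature maps."""
    h, w, c_in = x.shape
    weights = rng.standard_normal((c_in, k)) * 0.1
    return np.maximum(x.reshape(-1, c_in) @ weights, 0.0).reshape(h, w, k)

def dense_block(x0, num_layers, growth_rate, rng):
    """Each layer sees the concatenation of all preceding feature maps:
    x_l = H_l([x_0, x_1, ..., x_{l-1}])."""
    features = [x0]
    for _ in range(num_layers):
        concat = np.concatenate(features, axis=-1)  # [x_0, ..., x_{l-1}]
        features.append(conv_block(concat, growth_rate, rng))
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8, 16))   # k0 = 16 input feature maps
out = dense_block(x0, num_layers=4, growth_rate=12, rng=rng)
print(out.shape)   # channels grow to k0 + l*k = 16 + 4*12 = 64
```

The channel count grows as $k_0 + \ell k$, which is exactly the bookkeeping described above for the DB output.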
The discriminator network D aims to distinguish whether the output from G is real or fake. Following [50] and [51], we define a conditional generative adversarial network (cGAN) to solve the following adversarial min-max problem:
$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{\phi}\big[\log D_{\theta_D}(\phi)\big] + \mathbb{E}_{I}\big[\log\big(1 - D_{\theta_D}(G_{\theta_G}(I))\big)\big] \quad (3)$$
The general idea behind this network is to train the generator G to ‘fool’ the discriminator D. Here, D is trained to distinguish whether the high-resolution phase image predicted by G represents a real phase image. GANs are in general hard to train and may fail when the generator collapses to a parameter setting where it always produces the same output. A successful strategy to avoid this failure is to let the discriminator perform minibatch discrimination [51, 52]. In this case, the discriminator decides whether the reconstructed phase image is real or fake by evaluating multiple sub-regions of the predicted image instead of the whole image.
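The patch-wise discrimination idea can be sketched as follows; the per-patch scoring function is a hypothetical stand-in for a trained discriminator head, and the patch size is illustrative.

```python
import numpy as np

def extract_patches(img, patch):
    """Tile an image into non-overlapping patch x patch sub-regions."""
    h, w = img.shape
    return [img[i:i + patch, j:j + patch]
            for i in range(0, h - patch + 1, patch)
            for j in range(0, w - patch + 1, patch)]

def patch_discriminator(img, score_fn, patch=16):
    """Score each sub-region independently and average the decisions,
    instead of emitting one scalar for the whole image."""
    scores = [score_fn(p) for p in extract_patches(img, patch)]
    return float(np.mean(scores))

# Hypothetical stand-in for a trained discriminator head: here the
# 'realness' score is just a sigmoid of the patch's local contrast.
toy_score = lambda p: 1.0 / (1.0 + np.exp(-p.std()))

rng = np.random.default_rng(1)
img = rng.standard_normal((64, 64))
print(round(patch_discriminator(img, toy_score), 3))
```

Averaging many local decisions gives the generator a gradient signal at every sub-region, which is what discourages the collapsed, everywhere-identical outputs mentioned above.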
2.2 Loss function
A motivation for using the discriminator network is that commonly used pixel-wise loss functions, such as the mean absolute error (MAE), mean square error (MSE), and structural similarity index (SSIM), may not be the most appropriate figures of merit, in particular when assessing a CNN’s ability to preserve the high-frequency content of reconstructed images. Minimizing these pixel-wise loss functions can lead to solutions that ignore high-frequency details in favor of smooth solutions with less perceptual quality [53]. With the cGAN approach, the generator can learn to create a solution that resembles a realistic high-resolution image with high-frequency details.
For this purpose, we define the ‘perceptual loss function’ as a weighted sum of multiple loss functions. This ensures that the model can learn the desired features containing both low-frequency and high-frequency information in the phase images. Specifically, our loss function consists of four components: the pixel-wise spatial-domain mean absolute error (MAE) loss, the pixel-wise Fourier-domain mean absolute error (FMAE) loss, the generator’s adversarial loss, and the weight regularization, in the following form:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{MAE}} + \lambda_2 \mathcal{L}_{\mathrm{FMAE}} + \lambda_3 \mathcal{L}_{\mathrm{adv}} + \lambda_4 \mathcal{L}_{\mathrm{reg}} \quad (4)$$
where

$$\mathcal{L}_{\mathrm{MAE}} = \frac{1}{N}\sum_{n=1}^{N} \big\| \phi^{(n)} - G_{\theta}(I^{(n)}) \big\|_1 \quad (5)$$

$$\mathcal{L}_{\mathrm{FMAE}} = \frac{1}{N}\sum_{n=1}^{N} \big\| \mathcal{F}\big(\phi^{(n)}\big) - \mathcal{F}\big(G_{\theta}(I^{(n)})\big) \big\|_1 \quad (6)$$

$$\mathcal{L}_{\mathrm{adv}} = -\frac{1}{N}\sum_{n=1}^{N} \log D\big(G_{\theta}(I^{(n)})\big) \quad (7)$$

$$\mathcal{L}_{\mathrm{reg}} = \|\theta\|_2^2 \quad (8)$$
where $\mathcal{F}$ denotes the 2D Fourier transform and $\|\cdot\|_1$ is the $\ell_1$ norm; $\lambda_1, \ldots, \lambda_4$ are hyperparameters that control the relative weights of the loss components. We found that the Fourier loss is sensitive to pixel-wise corruption during the early stage of the training process. As a result, we use it only to refine the outputs by enforcing similarity in the frequency domain [54] after initial training is done with the other three loss components (details in Subsection 2.4).

2.3 Data preparation
To test our CNN technique, we use the FPM video data from [29]. The time-series data were taken on HeLa cells at 2-min intervals over the course of 4 hours, containing several cell cycles. Each FPM dataset contains 173 low-resolution intensity images, of which 37 are bright-field (BF) and 136 are dark-field (DF). Each intensity image is 2560×2160 pixels in 16-bit grayscale.
To generate the training data, the FPM phase reconstructions from [29] are used as the ground truth. Each FPM-reconstructed phase image contains 12800×10800 pixels, which is 5× larger than a raw intensity image in each dimension.
To prepare the dataset for training, we use only the first FPM frame in the time-lapse as the training set. Specifically, to prepare the ground truth data, the full-FOV phase image is first divided into 4×4 subregions, each containing 3440×2760 pixels. To avoid edge artifacts during training and reconstruction, neighboring subregions are chosen to have 320-pixel and 80-pixel overlaps along the horizontal and vertical directions, respectively. The corresponding intensity images in each subregion have 688×552 pixels. The inputs to the CNN are BF and DF image patches cropped from random locations of each subregion image, each with 64×64 pixels. Each training input is formed by stacking the BF and DF image patches into a 64×64×C tensor, where C is the number of input images. To facilitate fast computation, the models are designed with a downsampling path and an upsampling path. Each input image patch is upsampled to 80×80 using bilinear interpolation. The spatial dimensions of the layers in the CNN are 80, 40, 20, 10, 20, 40, 80, 160, and 320, respectively. The corresponding ground truth data contain 320×320 pixels. Each raw BF image was preprocessed by the background subtraction procedure in [29]; each raw DF image was preprocessed to remove dark-current noise [29]. The same preprocessing steps are applied for training, validation, and testing.

2.4 Training, evaluation, and testing
To investigate the interplay between the illumination pattern and the performance of the CNN, we train our network using several different combinations of BF and DF images. The illumination patterns along with the CNN models used are shown in Fig. 3(a). Each illumination pattern is plotted in Fourier space, in which a yellow circle indicates the NA of the objective lens. Intensity images taken from the LEDs within the circle are BF, whereas those outside the circle are DF. The LEDs in use are marked in red. To systematically study the relation between the reconstructed resolution and the illumination’s angular coverage, we designed patterns with (P1) 13 BF only, with 0.2 illumination NA; (P2) 13 BF + 36 DF, with 0.6 illumination NA; (P3) 13 BF + 10 DF, with 0.25 illumination NA; and (P4) 9 BF + 20 DF, with 0.4 illumination NA. The following networks are investigated: P1 is trained on two networks, UB, which implements the U-Net without DBs as in [55], and UB-cGAN, which implements the U-Net in [51] within the cGAN architecture (i.e. with the discriminator network in Fig. 2); P2 is trained on the cGAN network in Fig. 2, DBD-cGAN; P3 is trained on the same cGAN architecture, also denoted DBD-cGAN; P4 is trained on the cGAN network without and with the Fourier loss function, denoted DBD-cGAN and DBDF-cGAN, respectively.
Each model was trained for 700–900 epochs. For the U-Net, the batch size was 16, whereas the batch size was 4 for the U-Net with DBs due to memory limitations. When the Fourier loss is not used, its weight coefficient is set to zero. When the Fourier loss is used, we first train the network without it for 700 epochs and then include it for the remaining epochs. We observed that the network’s parameters are unstable in the early stage of training; to stabilize the training process, we therefore added the Fourier loss only after 700 epochs. We used the ADAM optimizer [56], with the learning rate reduced by a factor of 0.5 every 10 epochs; each epoch contains 1000 iterations. In each iteration, the algorithm incrementally updates the model using a subset (set by the batch size) of the input. To fine-tune each network, as an optional step, we performed model validation using the FPM frame taken at 2 hours. The best models were selected based on the MAE metric calculated on the validation data.

Once the CNN is trained, which needs to be done only once using the first FPM frame taken at 0 min, it is applied to reconstruct high-SBP phase video frames (the testing step). To perform the reconstruction, the same data preprocessing steps are followed as in the training phase. The raw intensity images are first divided into 4×4 subregions. Within each subregion, image patches of the same size as the training patches (64×64) are used for reconstruction. Neighboring image patches have 15-pixel and 19-pixel overlaps in the horizontal and vertical directions, respectively. Each image patch is first upsampled to 80×80 pixels with bilinear interpolation. The predicted phase image contains 320×320 pixels. Once reconstructions are performed on all 2288 patches, an alpha-blending algorithm is used to form the full-FOV phase image containing 12800×10800 pixels. To reconstruct the video, we simply feed each FPM frame to the trained CNN to recover the high-SBP dynamic information from the time-series data.
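The staged use of the Fourier loss described above can be sketched as follows; the loss terms are computed with NumPy rather than as differentiable Keras tensors, and the weight values are placeholders, not the coefficients used in our experiments.

```python
import numpy as np

def mae_loss(pred, truth):
    """Pixel-wise spatial-domain mean absolute error (cf. Eq. (5))."""
    return np.abs(pred - truth).mean()

def fmae_loss(pred, truth):
    """Pixel-wise Fourier-domain mean absolute error (cf. Eq. (6))."""
    return np.abs(np.fft.fft2(pred) - np.fft.fft2(truth)).mean()

def mixed_loss(pred, truth, adv, reg, weights):
    """Weighted sum of the four loss components (cf. Eq. (4))."""
    l1, l2, l3, l4 = weights
    return (l1 * mae_loss(pred, truth) + l2 * fmae_loss(pred, truth)
            + l3 * adv + l4 * reg)

# Stage 1: Fourier weight set to zero for the first stage of training ...
warmup = (1.0, 0.0, 1e-2, 1e-4)   # placeholder weights
# Stage 2: ... then switched on to refine high-frequency content.
refine = (1.0, 1e-3, 1e-2, 1e-4)  # placeholder weights

rng = np.random.default_rng(0)
truth = rng.standard_normal((64, 64))
pred = truth + 0.05 * rng.standard_normal((64, 64))
print(mixed_loss(pred, truth, adv=0.7, reg=0.1, weights=warmup))
```

Switching from the first weight tuple to the second after the warm-up epochs reproduces the two-stage schedule: the unstable Fourier term contributes nothing early on and only refines the nearly converged model.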
The time for reconstructing each full-FOV, high-SBP image is 25.2 seconds using our cGAN network with the added Fourier loss, DBDF-cGAN, which is approximately 50× faster than the standard FPM algorithm (20 min per frame [29]). A detailed comparison of all the networks is given in Section 3 and Table 1.
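The final patch-stitching step can be illustrated with a simple weighted overlap-add; the separable linear-ramp (alpha) profile below is an assumption for illustration only, as the exact blending weights of the alpha-blending algorithm are not specified here.

```python
import numpy as np

def blend_patches(patches, positions, out_shape, patch=320):
    """Overlap-add patch reconstructions with a separable linear-ramp
    (alpha) weight so values in overlapping seams average smoothly."""
    ramp = np.minimum(np.arange(1, patch + 1), np.arange(patch, 0, -1))
    alpha = np.outer(ramp, ramp).astype(float)   # peak in the patch center
    acc = np.zeros(out_shape)
    norm = np.zeros(out_shape)
    for p, (r, c) in zip(patches, positions):
        acc[r:r + patch, c:c + patch] += alpha * p
        norm[r:r + patch, c:c + patch] += alpha
    return acc / np.maximum(norm, 1e-12)

# Two constant patches overlapping by 64 pixels blend back to a constant.
patch = 320
a = np.full((patch, patch), 2.0)
b = np.full((patch, patch), 2.0)
full = blend_patches([a, b], [(0, 0), (0, 256)], (320, 576), patch)
print(full.min(), full.max())
```

Because the per-pixel weights are normalized by their sum, consistent patch predictions pass through unchanged while small disagreements in the overlap regions are smoothly averaged rather than producing visible seams.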
3 Results and discussion
We discuss our results by presenting them in real space (Fig. 3), in Fourier space (Fig. 4), and over different time points (Fig. 5).
Figure 3(a) summarizes all the illumination patterns used for training and testing, along with the corresponding networks. All networks are applied to reconstruct the entire time-series experiment. A sample large-SBP phase reconstruction across the full (1.7 mm × 2.1 mm) FOV is shown in Fig. 3(b). In Fig. 3(c), we zoom in on a subregion to compare the results of the different networks in real space. For comparison, the raw low-resolution intensity image from the central BF illumination is shown, bilinearly upsampled to the same size as the network’s output.
The result from UB, which uses BF data only with a U-Net without DBs or cGAN and only the pixel-wise MAE loss, produces low-resolution phase images. The MAE loss function has been shown to lead to blurry results in image reconstruction problems [51] because it does not place sufficient weight on high-frequency content. To overcome this problem, we use a generative adversarial network with conditional input (cGAN) to reconstruct the phase image [50]. In UB-cGAN, the U-Net is accompanied by a discriminator network in order to better learn high-frequency information. The introduction of the cGAN architecture allows us to better reconstruct sub-cellular structures with more perceptual detail; however, the resolution still appears worse than the ground truth.
In order to further improve the resolution, as in FPM, DF images are needed, since they contain high spatial frequency information beyond the support of the optical transfer function (OTF). In addition, to handle the added data size, we also seek a more efficient network structure with higher representation power. The dense block (DB) structure has been shown to provide an efficient representation with a small number of model parameters [43]. We present results from three illumination patterns with different angular coverage, all reconstructed with the DenseNet (U-Net with DBs) and cGAN structure. In the 0.6-NA DBD-cGAN, we use 36 DF images covering up to 0.6 illumination NA. This leads to a moderate resolution improvement; however, the results are limited by the highly noisy data captured at very large NAs. In general, we observe that a higher illumination angle does not guarantee better resolution. The reason is that the DF data are subject to a much higher noise level than the BF data, and the noise level increases as the illumination angle increases [38]. When the signal-to-noise ratio (SNR) falls below a certain threshold, the inclusion of these DF data is no longer helpful. To confirm this, we first use a small amount of DF data from small angles (P3), which improves the resolution compared to the 0.6-NA result. It should be noted that the DF SNR could improve significantly if a dome-shaped LED array [57] were used instead of the planar array in [29]. Heuristically, we found that our CNN can reliably utilize DF data up to 0.4 illumination NA (P4). The reconstructions are further explored using two networks, DBD-cGAN and DBDF-cGAN.

A major limitation of an image-space-only loss function is that the metric still favors low-frequency information [53] and underweights high-frequency information. A recently proposed solution is to further include a Fourier loss component [54]. The result from this strategy is shown as DBDF-cGAN. Our reconstruction of the last frame of the HeLa cells is available at [58].
Method | MAE | PSNR | SSIM | FM | Time
UB | 0.0401 | 25.01 dB | 0.7575 | 0.0110 | 30.3 s
UB* | 0.0331 | 26.49 dB | 0.7790 | 0.0156 | 55.3 s
DBD* | 0.0339 | 26.17 dB | 0.7779 | 0.0146 | 25.2 s
DBD* | 0.0309 | 26.76 dB | 0.7966 | 0.0165 | 25.2 s
DBD* | 0.0308 | 26.87 dB | 0.7964 | 0.0169 | 25.2 s
DBDF* | 0.0318 | 26.19 dB | 0.7797 | 0.0211 | 25.2 s
FPM (GT) | 0 | – | 1.000 | 0.0389 | 20 min [29]
To better visualize the recovery of high-frequency information, Fig. 4 shows the Fourier transform of each image in Fig. 3(c). The spectrum of the on-axis BF image is mostly concentrated within the pupil region, i.e. the circular region with a radius of 1 NA, and extends up to the support of the OTF (i.e. 2 NA). It is well known that using only BF images can provide Fourier coverage up to the support of the OTF. As shown in the Fourier image of UB-cGAN, the network is able to fully recover this low-frequency information. The inclusion of DF images should lead to larger Fourier coverage; however, the improvement is not significant with an image-space-only loss function, as shown in the Fourier images of the DBD-cGAN variants. The introduction of the Fourier-domain loss significantly boosts the Fourier coverage up to the 0.4 illumination NA (< 0.6 NA in the ground truth), as shown in the Fourier image of DBDF-cGAN. We note that using the Fourier-domain loss in the training process generally enhances the sharpness of the results and the frequency measurement metric (FM) [59]; however, it may trade off image-space metrics, such as MAE, SSIM, and PSNR, due to the different metric weighting schemes involved (see Table 1).
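As an aside, a simple Fourier-based sharpness score conveys the intuition behind such a frequency metric; the definition below (the fraction of spectral magnitude beyond a radial cutoff) is a stand-in for illustration only, not the exact FM metric of [59].

```python
import numpy as np

def high_frequency_fraction(img, cutoff=0.25):
    """Proxy sharpness score: fraction of total spectral magnitude lying
    beyond `cutoff` (in normalized frequency) from the DC term."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = img.shape
    fy, fx = np.meshgrid(np.fft.fftshift(np.fft.fftfreq(h)),
                         np.fft.fftshift(np.fft.fftfreq(w)), indexing="ij")
    mask = np.sqrt(fy**2 + fx**2) > cutoff
    return spec[mask].sum() / spec.sum()

rng = np.random.default_rng(3)
sharp = rng.standard_normal((128, 128))
# Low-pass the image with a 5-point smoothing kernel (circular shifts).
blurred = (sharp + np.roll(sharp, 1, 0) + np.roll(sharp, -1, 0)
           + np.roll(sharp, 1, 1) + np.roll(sharp, -1, 1)) / 5.0
print(high_frequency_fraction(sharp) > high_frequency_fraction(blurred))
```

A reconstruction with a larger high-frequency fraction has recovered more of the dark-field spectral content, which is why the Fourier-domain loss raises this type of score even when pixel-wise metrics move slightly in the other direction.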
Further inspecting the results from the CNN and comparing them to the FPM-generated ‘ground truth’, we note that the ground truth image contains noisy structures, which are clearly visible in the background. All CNN-reconstructed results are free from these background artifacts, demonstrating the robustness of the training process to noisy ground truth data.
A unique feature of our technique is the ability to reconstruct high-SBP phase videos with training data drawn only from the first time point of a long time-series experiment. To demonstrate the effectiveness of this strategy, we show the CNN-predicted temporal frames over a course of more than 4 hours. During this period, considerable morphological (and hence phase distribution) changes occur due to cell division over several cell cycles. Figure 5(b) shows several frames (reconstructed with DBDF-cGAN) of a zoomed-in region, where one cell grows and divides into multiple cells, and another cell’s membrane fluctuates rapidly. More example videos are provided in Visualization 1. A more quantitative evaluation of the ‘generalization error’ over time is presented in Fig. 5(a), in which the MAE metrics of all the studied networks are plotted for every frame in the time-series experiment. The error is low at the beginning of the experiment and grows slowly as time progresses.
4 Transfer learning
Practically, it is difficult to train a single network that can handle all sample types, a main drawback of the DL approach compared to model-based methods. To mitigate this problem, we investigate transfer learning, in which our CNN pretrained on HeLa cells is fine-tuned for other cell types. The effectiveness of this strategy in addressing the sample-type generalization limitation has also been demonstrated in other biomedical imaging applications [60].
We used the DBDF-cGAN trained on HeLa cells to predict the phase reconstruction of two other cell types (MCF10A, U2OS), with and without staining. The data were captured with the same setup as in [29]. In Fig. 6, we compare two results. First, we directly apply the DBDF-cGAN network to the new data. To further refine the results, we use the transfer learning technique: we take the weights of the pretrained network and continue training with the new cell data as the training data for 30 min. Note that these new cell data contain significant intensity differences. By fine-tuning the model, the CNN is able to produce high-quality reconstructions. During the transfer learning, we did not use any validation data and evaluated the new CNN’s performance directly after the 30-min training. The results show that transfer learning provides a practical way to broaden the utility of our technique.
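Schematically, transfer learning amounts to warm-starting optimization from the pretrained weights. In the toy sketch below, a linear least-squares model stands in for the cGAN, and the learning rates and step counts are arbitrary placeholders rather than the settings used in our experiments.

```python
import numpy as np

def sgd_fit(w, X, y, lr, steps):
    """Plain gradient descent on a least-squares objective."""
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(4)
X_old = rng.standard_normal((200, 8))
w_true_old = rng.standard_normal(8)
y_old = X_old @ w_true_old

# "Pre-training" on the original cell type (HeLa stand-in) ...
w_pre = sgd_fit(np.zeros(8), X_old, y_old, lr=0.05, steps=400)

# ... then transfer: start from w_pre and briefly fine-tune on a
# related task (new cell type stand-in) with a small learning rate.
w_true_new = w_true_old + 0.1 * rng.standard_normal(8)
X_new = rng.standard_normal((50, 8))
y_new = X_new @ w_true_new
w_ft = sgd_fit(w_pre.copy(), X_new, y_new, lr=0.01, steps=50)

err_scratch = np.linalg.norm(sgd_fit(np.zeros(8), X_new, y_new, 0.01, 50)
                             - w_true_new)
err_transfer = np.linalg.norm(w_ft - w_true_new)
print(err_transfer < err_scratch)   # warm start lands closer
```

Because the new task is close to the original one, the same short training budget takes the warm-started model much nearer the new optimum than training from scratch, which mirrors why a 30-min fine-tuning run suffices for the new cell types.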
5 Conclusion
We have demonstrated a deep learning framework for Fourier ptychography video reconstruction. The proposed CNN architecture fully exploits the unique high-SBP imaging capability of FPM, so that it can be trained using a single frame and then generalized to a full time-series experiment. In addition, the CNN requires a reduced number of images for high-resolution phase recovery. The reconstruction of each high-SBP image takes less than 30 seconds. Overall, this technique significantly improves the imaging throughput of the FPM system by reducing both the acquisition and reconstruction time. The central idea of our technique is the observation that each FPM frame contains a large cell ensemble covering all the morphological states present throughout the time-series experiment. By the principle of ergodicity, the statistical information learned from this large spatial ensemble in a single frame is shown to be sufficient to predict temporal dynamics with high fidelity. In practice, we showed that our trained CNN can successfully reconstruct a high-SBP phase video of dynamic live-cell populations with reduced noise artifacts. Using the conditional generative adversarial network (cGAN) framework and a weighted Fourier loss function, the proposed CNN is able to more effectively learn the high-resolution information encoded in the dark-field data. The technique may find wide applications in in vitro live-cell imaging, gathering large-scale spatial and temporal information in a data- and computation-efficient manner. We also demonstrate that transfer learning is a practical approach to imaging a broad range of new cell samples, bypassing the need to train an entirely new CNN from scratch.
Acknowledgments
We would like to thank NVIDIA Corporation for supporting us with the GeForce Titan Xp through the GPU Grant Program.
Disclosures
The authors declare that there are no conflicts of interest related to this article.
References
 [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015).
 [2] A. Lucas, M. Iliadis, R. Molina, and A. K. Katsaggelos, “Using Deep Neural Networks for Inverse Problems in Imaging: Beyond Analytical Methods,” IEEE Signal Process. Mag. 35(1), 20–36 (2018).
 [3] M. Bertero and P. Boccacci, Introduction to inverse problems in imaging (IOP Publishing, 1998).
 [4] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, “Fast image recovery using variable splitting and constrained optimization,” IEEE Trans. Image Process. 19(9), 2345–2356 (2010).
 [5] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sciences 2(1), 183–202 (2009).
 [6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers,” Foundations and Trends® in Machine Learning 3(1), 1–122 (2011).
 [7] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 105–114.
 [8] Y. Rivenson, Z. Göröcs, H. Günaydin, Y. Zhang, H. Wang, and A. Ozcan, “Deep learning microscopy,” Optica 4(11), 1437–1443 (2017).
 [9] H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with BM3D?,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2012), pp. 2392–2399.
 [10] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising,” IEEE Trans. Image Process. 26(7), 3142–3155 (2017).
 [11] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” https://arxiv.org/abs/1505.04597.
 [12] L. Xu, J. S. Ren, C. Liu, and J. Jia, “Deep convolutional neural network for image deconvolution,” Advances in Neural Information Processing Systems (NIPS, 2014), pp. 1790–1798.
 [13] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2010), pp. 2528–2535.
 [14] H. Yao, F. Dai, D. Zhang, Y. Ma, S. Zhang, and Y. Zhang, “DR2-Net: Deep residual reconstruction network for image compressive sensing,” https://arxiv.org/abs/1702.05743.
 [15] K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok, “ReconNet: Non-iterative reconstruction of images from compressively sensed measurements,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 449–458.
 [16] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Trans. Image Process. 26(9), 4509–4522 (2017).
 [17] T. Nguyen, V. Bui, and G. Nehmetallah, “Computational optical tomography using 3D deep convolutional neural networks,” Opt. Eng. 57(4), 043111 (2018).
 [18] E. M. Christiansen, S. J. Yang, D. M. Ando, A. Javaherian, G. Skibinski, S. Lipnick, E. Mount, A. O’Neil, K. Shah, A. K. Lee, P. Goyal, W. Fedus, P. Ryan, A. Esteve, M. Berndl, L. L. Rubin, P. Nelson, and S. Finkbeiner, “In silico labeling: Predicting fluorescent labels in unlabeled images,” Cell 173(3), 792–803 (2018).
 [19] Y. Rivenson, Y. Zhang, H. Günaydın, D. Teng, and A. Ozcan, “Phase recovery and holographic image reconstruction using deep learning in neural networks,” Light Sci. Appl. 7(2), 17141 (2018).
 [20] Z. Ren, Z. Xu, and E. Y. Lam, “Learning-based nonparametric autofocusing for digital holography,” Optica 5(4), 337–344 (2018).
 [21] A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica 4(9), 1117–1125 (2017).
 [22] S. Li, M. Deng, J. Lee, A. Sinha, and G. Barbastathis, “Imaging through glass diffusers using densely connected convolutional networks,” Optica 5(7), 803–813 (2018).
 [23] Y. Li, Y. Xue, and L. Tian, “Deep speckle correlation: a deep learning approach towards scalable imaging through scattering media,” https://arxiv.org/abs/1806.04139.
 [24] A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos, “Video super-resolution with convolutional neural networks,” IEEE Trans. Comput. Imaging 2(2), 109–122 (2016).
 [25] O. Shahar, A. Faktor, and M. Irani, “Space-time super-resolution from a single video,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2011), pp. 3353–3360.
 [26] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “FlowNet: Learning optical flow with convolutional networks,” in Proceedings of IEEE International Conference on Computer Vision (IEEE, 2015), pp. 2758–2766.
 [27] S. Nah, T. H. Kim, and K. M. Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 257–265.
 [28] H. Chen, J. Gu, O. Gallo, M. Liu, A. Veeraraghavan, and J. Kautz, “Reblur2deblur: Deblurring videos via self-supervised learning,” in Proceedings of IEEE International Conference on Computational Photography (IEEE, 2018), pp. 1–9.
 [29] L. Tian, Z. Liu, L.-H. Yeh, M. Chen, J. Zhong, and L. Waller, “Computational illumination for high-speed in vitro Fourier ptychographic microscopy,” Optica 2(10), 904–911 (2015).
 [30] G. Zheng, R. Horstmeyer, and C. Yang, “Wide-field, high-resolution Fourier ptychographic microscopy,” Nat. Photonics 7(9), 739–745 (2013).
 [31] D. J. Stephens and V. J. Allan, “Light microscopy techniques for live cell imaging,” Science, 300(5616), 82–86 (2003).
 [32] T. Ashihara and R. Baserga, “Cell synchronization,” Methods in Enzymology 58, 248–262 (1979).
 [33] L. Tian, X. Li, K. Ramchandran, and L. Waller, “Multiplexed coded illumination for Fourier ptychography with an LED array microscope,” Biomed. Opt. Express 5(7), 2376–2389 (2014).
 [34] A. Kappeler, S. Ghosh, J. Holloway, O. Cossairt, and A. Katsaggelos, “PtychNet: CNN-based Fourier ptychography,” in Proceedings of IEEE International Conference on Image Processing (IEEE, 2017), pp. 1712–1716.
 [35] E. Nehme, L. E. Weiss, T. Michaeli, and Y. Shechtman, “Deep-STORM: super-resolution single-molecule microscopy by deep learning,” Optica 5(4), 458–464 (2018).
 [36] M. Weigert, U. Schmidt, T. Boothe, A. Muller, A. Dibrov, A. Jain, B. Wilhelm, D. Schmidt, C. Broaddus, S. Culley, M. Rocha-Martins, F. Segovia-Miranda, C. Norden, R. Henriques, M. Zerial, M. Solimena, P. Tomancak, L. Royer, F. Jug, and E. W. Myers, “Content-aware image restoration: Pushing the limits of fluorescence microscopy,” https://www.biorxiv.org/content/early/2017/12/19/236463.
 [37] N. Boyd, E. Jonas, H. P. Babcock, and B. Recht, “DeepLoco: Fast 3D localization microscopy using neural networks,” https://www.biorxiv.org/content/early/2018/02/16/267096.
 [38] L.-H. Yeh, J. Dong, J. Zhong, L. Tian, M. Chen, G. Tang, M. Soltanolkotabi, and L. Waller, “Experimental robustness of Fourier ptychography phase retrieval algorithms,” Opt. Express 23(26), 33214–33240 (2015).
 [39] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, “Learning from massive noisy labeled data for image classification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2015), pp. 2691–2699.
 [40] Z. Lu, Z. Fu, T. Xiang, P. Han, L. Wang, and X. Gao, “Learning from weak and noisy labels for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell. 39(3), 486–500 (2017).
 [41] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013).
 [42] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proceedings of European Conference on Computer Vision (Springer, 2014), pp. 818–833.
 [43] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 2261–2269.
 [44] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” https://arxiv.org/abs/1409.1556.
 [45] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” https://arxiv.org/abs/1502.03167.
 [46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res. 15(1), 1929–1958 (2014).
 [47] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 770–778.
 [48] F. Agostinelli, M. D. Hoffman, P. J. Sadowski, and P. Baldi, “Learning activation functions to improve deep neural networks,” https://arxiv.org/abs/1412.6830.
 [49] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” https://arxiv.org/abs/1603.07285.
 [50] I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (NIPS, 2014), pp. 2672–2680.
 [51] P. Isola, J. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 5967–5976.
 [52] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Advances in Neural Information Processing Systems (NIPS, 2016), pp. 2234–2242.
 [53] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process. 13(4), 600–612 (2004).
 [54] G. Yang, S. Yu, H. Dong, G. Slabaugh, P. L. Dragotti, X. Ye, F. Liu, S. Arridge, J. Keegan, Y. Guo, and D. Firmin, “DAGAN: deep de-aliasing generative adversarial networks for fast compressed sensing MRI reconstruction,” IEEE Trans. Med. Imaging 37(6), 1310–1321 (2018).
 [55] T. Nguyen, V. Bui, V. Lam, C. B. Raub, L.-C. Chang, and G. Nehmetallah, “Automatic phase aberration compensation for digital holographic microscopy based on deep learning background detection,” Opt. Express 25(13), 15043–15057 (2017).
 [56] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” https://arxiv.org/abs/1412.6980.
 [57] Z. F. Phillips, M. V. D’Ambrosio, L. Tian, J. J. Rulison, H. S. Patel, N. Sadras, A. V. Gande, N. A. Switz, D. A. Fletcher, and L. Waller, “Multi-contrast imaging and digital refocusing on a mobile microscope with a domed LED array,” PLoS ONE 10(5), e0124938 (2015).
 [58] T. Nguyen, Y. Xue, Y. Li, L. Tian, and G. Nehmetallah, “DeepLearningFourierPtychographicMircoscopy,” https://github.com/32nguyen/DeepLearningFourierPtychographicMircoscopy (2018). Accessed: 2018-07-21.
 [59] K. De and V. Masilamani, “Image sharpness measure for blurred images in frequency domain,” Procedia Eng. 64, 149–158 (2013).
 [60] Y. Rivenson, H. Wang, Z. Wei, Y. Zhang, H. Gunaydin, and A. Ozcan, “Deep learning-based virtual histology staining using autofluorescence of label-free tissue,” https://arxiv.org/abs/1803.11293.