Convolutional neural network for Fourier ptychography video reconstruction: learning temporal dynamics from spatial ensembles

by   Thanh Nguyen, et al.
The Catholic University of America

Convolutional neural networks (CNNs) have gained tremendous success in solving complex inverse problems, both for problems involving independent datasets from input-output pairs of static objects and for those involving sequential datasets from dynamic objects. In order to learn the underlying temporal statistics, a video sequence is typically used at the cost of network complexity and computation. The aim of this work is to develop a novel CNN framework to reconstruct video sequences of dynamic live cells captured using a computational microscopy technique, Fourier ptychographic microscopy (FPM). The unique feature of FPM is its capability to reconstruct images with both wide field-of-view (FOV) and high resolution, i.e. a large space-bandwidth-product (SBP), by taking a series of low-resolution intensity images. For live cell imaging, a single FPM frame contains thousands of cell samples with different morphological features. Our idea is to fully exploit the statistical information provided by this large spatial ensemble so as to learn temporal information in a sequential measurement, without using any additional temporal dataset. Specifically, we show that it is possible to reconstruct high-SBP dynamic cell videos by a CNN trained only on the first FPM dataset captured at the beginning of a time-series experiment. Our CNN approach reconstructs a 12800×10800 pixel phase image in only 25 seconds, a 50× speedup compared to the model-based FPM algorithm. In addition, the CNN further reduces the required number of images in each time frame by 6×. Overall, this significantly improves the imaging throughput by reducing both the acquisition and computational times. Our technique demonstrates a promising deep learning approach to continuously monitor large live-cell populations over an extended time and gather useful spatial and temporal information with sub-cellular resolution.



1 Introduction

In recent years, data-driven image reconstruction techniques based on machine learning, in particular deep learning (DL) [1], have gained tremendous success in solving complex inverse problems [2], and can often provide results surpassing those using state-of-the-art model-based techniques. Traditionally, solving an inverse problem involves first explicitly formulating the imaging model and incorporating domain and prior knowledge (e.g. via the use of regularization techniques), and then finding an analytical solution (e.g. through an optimization procedure) [3]. Unlike model-based approaches, the ‘end-to-end’ DL framework does not explicitly utilize any models or priors, and instead relies on large datasets to ‘learn’ the underlying inverse problem. The outcome of this DL approach consists of two important components. First, the result from the training stage is a CNN that corresponds to a plausible underlying mapping function relating the measurement to the solution. Second, the trained CNN can be used to make ‘predictions’ when it is presented with new measurements that were not used in the training stage. This second part comes with major practical benefits in computational cost and speed in typical image reconstruction problems, since the prediction process simply involves the feedforward computation of the CNN, which typically takes no more than a few seconds on a standard GPU. In contrast, most modern model-based techniques rely on iterative algorithms [4, 5, 6] that require much higher computational cost and longer running time; the same lengthy process needs to be repeated for each new measurement.

Here, we distinguish two classes of imaging problems: those involving independent datasets, often from static objects, and those dealing with sequential datasets that are temporally correlated, from dynamic objects. In independent problems, CNNs have been demonstrated to provide superior performance in solving many challenging imaging problems, such as image super-resolution [7, 8], denoising [9, 10], segmentation [11], deconvolution [12, 13], compressive imaging [14, 15], tomography [16, 17], digital labeling [18], holography [19, 20], phase recovery [21], and imaging through diffusers [22, 23]. What is common in this class of problems is that independently prepared input-output pairs (i.e. measurement and solution), obtained by repeating the same imaging process, are presented to the CNN at the training stage to optimize the network’s parameters. In sequential problems, the temporal correlation of a dynamic process contains additional information, and is often recorded in video datasets. Various CNN frameworks have been proposed to learn the additional temporal information. For example, spatial super-resolution has been demonstrated by training a CNN on both spatial and temporal dimensions of videos [24]. Temporal super-resolution on recurring processes is achieved by learning the underlying temporal statistics [25]. The motion information of dynamic objects is learned with an optical-flow based CNN [26]. Motion artifacts can be removed by jointly learning the blurring point-spread-function (PSF) and deconvolution operation [27, 28]. In all these cases, CNNs are designed to process a video sequence in order to extract the temporal information. The downside is that the CNN architectures inevitably become more complicated and require more computational resources than those used in the independent problems. Fundamentally, the complication stems from the fact that any single frame from the imaging techniques used does not contain sufficient temporal statistical information.

Figure 1: The workflow of the proposed deep learning based Fourier ptychography video reconstruction. (A) The intensity data is captured by illuminating the sample from different angles with an LED array. (B) Training the CNN to reconstruct high-resolution phase images. The inputs to the CNN are low-resolution intensity images; the target output of the CNN is the ground truth phase image reconstructed using the traditional FPM algorithm in [29]. The network is then trained by optimizing the network’s parameters to minimize a loss function calculated from the network’s predicted output and the ground truth. (C) The network is fully trained using the first dataset at 0 min, and can then be used to predict phase videos of dynamic cell samples frame by frame.

In this work, we develop a CNN architecture to reconstruct video sequences of dynamic live cells captured with a computational microscopy technique based on Fourier ptychographic microscopy (FPM) [30, 29]. The unique feature of FPM is its ability to quantitatively reconstruct phase information with both wide field-of-view (FOV) and high spatial resolution, i.e., a large space-bandwidth product (SBP). This is not possible for traditional techniques, which must trade spatial or temporal resolution for FOV. For live-cell imaging applications, this allows one to simultaneously image a large cell population (e.g. more than 3400 cells in a single frame in [29]). Cells of the same type undergo similar morphological changes during different cell states, which then repeat over each cell cycle. If one records only a few cells at a time using conventional microscopy techniques [31], capturing the full dynamics would require a long sequence of measurements to cover the entire cell cycle (typically ranging from a few hours to days). Our proposed technique is based on the observation that, in any live cell experiment without precise cell synchronization [32], at any instant of time, a large cell population would contain samples covering all cell states. In other words, it is possible to gather sufficient temporal statistical information about a single cell by imaging a large spatial ensemble simultaneously. Based on this idea, we propose a CNN that is trained using only a single frame from the FPM. We then show that this trained CNN is able to reconstruct large-SBP phase videos with high fidelity using datasets taken in a time-series live cell experiment.

Existing FPM techniques are limited by their long acquisition times, which stem from the FPM algorithms requiring a minimum overlap in the Fourier coverage of the images captured from neighboring LEDs [30]. Several illumination multiplexing techniques have been demonstrated to improve the acquisition speed [33, 29]. However, the amount of data reduction is still limited by the Fourier overlap requirement. Here, we show that, similar to prior work on CNNs for FPM on static objects [34], our CNN can be sufficiently trained using far fewer images than are needed by the model-based FPM algorithms for dynamic live-cell samples.

Distinct from computer vision applications, a particular challenge in applying DL to biomedical microscopy is the difficulty in gathering the ground truth data needed for training the network. Various strategies have been proposed, including synthetic data from simulations built with physical imaging models [35, 36, 37], semi-synthetic data that uses experimental data to guide simulations [36], experimental data captured with a different modality [8, 19], and experimental data captured with the same modality [36]. Here, we propose to use the traditional FPM reconstructed phase images as the ground truth for training. Since our technique requires only a single frame for training, this does not add much overhead in data acquisition or computation. Experimental data used as the ground truth is inevitably contaminated with noise. In FPM, the quality of the phase reconstruction is limited by spatially variant aberrations, system mis-alignment, and intensity-dependent noise [38]. Robust learning using noisy labelled data has been demonstrated for image classification and segmentation [39, 40]. In essence, a CNN captures the invariants while filtering out the random fluctuations [41, 42]. Here, we show that our proposed CNN is also robust to phase noise in the ‘ground truth’ data for solving the inverse problem of FPM.

We build a CNN based on the conditional generative adversarial network (cGAN) framework, consisting of two sub-networks, the generator and the discriminator. The generator network uses the UNet architecture [11] with densely connected convolutional blocks (DenseNet) [43] to output a high-resolution phase image. The discriminator network distinguishes whether the output is real or fake. We compare five variants of the network, which differ by the input measurements using different illumination patterns corresponding to different Fourier coverages. Similar to traditional FPM, the darkfield measurements lead to spatial resolution improvement in the reconstruction. To further refine the network, we introduce a mixed loss function that includes a weighted Fourier domain loss, in addition to the standard image domain loss for the generator and the adversarial loss for the discriminator. We show that this novel weighted Fourier domain loss leads to improved recovery of high-frequency information. We demonstrate our technique using live HeLa cell FPM video data from [29]. We quantitatively assess the performance of our CNN technique over time against the traditional FPM results, and find that the ‘generalization’ degradation of the reconstructed phase is small over the entire time course (>4 hours).

The training is performed on a PC with an Intel Core i7 CPU, 32 GB RAM, and an NVIDIA GeForce Titan Xp GPU for 16 hours using the Keras/TensorFlow framework. Once the network is trained, reconstructing a 12800×10800 pixel phase image requires only 25 seconds, which is approximately 50× faster than the model-based FPM algorithm [29].

Our technique demonstrates a promising deep learning approach to continuously image large live-cell populations over an extended time and gather spatial and temporal information with sub-cellular resolution. Compared to existing FPM [30, 29], this CNN approach significantly improves the overall throughput by reducing both the acquisition and computation times, and requires less data. The CNN reconstructed phase image provides high spatial resolution, wide FOV, and low noise-induced artifacts. We also show the flexibility in reconstructing other cell types using transfer learning, which makes our technique appealing for a broad range of applications.

2 Method

2.1 Conditional generative adversarial network (cGAN)

Generally speaking, the proposed CNN-based FPM reconstruction algorithm takes a set of low-resolution intensity images as the network input and outputs a single high-resolution phase image. The intensity images are captured by illuminating the sample from different illumination angles (LEDs) [Fig. 1(a)]; some of them are brightfield (BF) and the rest are darkfield (DF) (Fig. 2). In the training stage, the ground truth phase image fed into the CNN is the high-resolution phase reconstructed by the FPM algorithm in [29] [Fig. 1(b)]. A key feature of FPM is that it reconstructs a high-resolution phase image from a set of low-resolution intensity images. The resolution enhancement factor is 5 in each dimension. To obtain the ground truth, the full FPM dataset, containing 173 images, needs to be captured [29]. Since our DL scheme only requires training on the first ‘FPM frame’, the remaining frames only require the reduced set of BF and DF images, which shortens the acquisition time, especially in a time-series experiment. The set of low-resolution images forms the input tensor, and the corresponding ground truth forms the output tensor [Fig. 1(b)].

Figure 2: The proposed conditional generative adversarial network (cGAN) for FPM video reconstruction. The generator (top) and the discriminator (bottom) are constructed with the ConvBlocks BN-ReLU-Conv-BN-ReLU-Conv and Conv-BN-LeakyReLU, respectively. The generator output is the high-resolution phase. The discriminator tries to distinguish whether that output phase is fake or real. The generator uses the UNet architecture. For the discriminator, the generator-predicted phase or the ground truth phase is concatenated with the up-sampled intensity data as a conditional input to the discriminator network. The color legend in the figure marks the following elements: two block types describe the dense concatenation inside the dense blocks in the down-sampling and up-sampling paths, respectively; two further block types are the transition layers interleaved with the dense blocks in the generator; separate markers denote the convolutional layer, the batch-normalization with a nonlinear ReLU layer in the generator, and the batch-normalization with the leaky ReLU in the discriminator. In the last three layers of the discriminator, fully-connected layers are used for high-level feature reasoning, and a final layer is used at the end for binary classification. k#n#s# (# stands for some integer) denotes the filter size, number of channels, and stride of the convolution layer, respectively.

The proposed CNN that performs FPM video reconstruction [Fig. 1(c)] is based on the conditional generative adversarial network (cGAN) framework. It consists of two sub-networks, the generator and the discriminator (Fig. 2). Here, the generator is trained to predict a high-resolution phase image from the low-resolution image set given as input. To simplify the notation, we drop the subscript on the input, since it always contains low-resolution intensity images. The generator network has a set of parameters (weights and biases), which are optimized through the training. The optimal parameters are learned by minimizing a loss function over the input-output training pairs.
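The minimization described above can be written in the usual empirical-risk form. The notation below is assumed here for illustration and is not taken from the text: G_θ is the generator with parameters θ, I_n and φ_n are the n-th low-resolution image set and its ground truth phase, N is the number of training pairs, and ℓ is the loss function.

```latex
\hat{\theta} \;=\; \arg\min_{\theta} \; \frac{1}{N} \sum_{n=1}^{N}
  \ell\big( G_{\theta}(I_n),\, \phi_n \big)
```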


We emphasize that the choice of the loss function significantly affects the quality of the training. We propose a mixed loss function that takes the weighted sum of multiple elementary loss functions, which will be detailed in Subsection 2.2.

The generator adopts the general “encoder-decoder” architecture used in UNet [11] to facilitate efficient learning of pixel-to-pixel information. UNet has been shown to increase the network’s performance by adapting to the high-complexity information in image datasets [44]. To enhance the efficiency of the training process, batch-normalization (BN) is used to offset the internal covariate shift [45]. In addition, dropout regularization [46] is employed to constrain the network’s adaptation to the data during training, in order to avoid overfitting and increase the model accuracy. A known problem of training a CNN is that it can get saturated when the network becomes too deep [47]. To mitigate this problem, the dense block (DB) proposed in the densely connected network is used [43]. A DB connects each layer to its subsequent layers in a feed-forward fashion. The inputs to each layer are the feature maps of all preceding layers; the feature maps output by the current layer are inputs to all the subsequent layers (see Fig. 2). The DB has several advantages, including (a) mitigation of the vanishing-gradient problem in the training; (b) reduction of the total number of parameters; (c) enhancement of feature propagation and reuse. A typical multi-layer DB is defined as in [43].


Here, the concatenation operation connects the feature maps of all layers in the block. The number of feature maps at the output of each DB is determined by the number of layers, the number of feature maps in the first layer, and a hyper-parameter referred to as the growth rate. Within each layer inside the DB (ConvBlock), a series of operations is performed, including batch-normalization (BN), nonlinear activation using the ReLU or LeakyReLU function [48], and convolution (Conv).
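The channel bookkeeping of a dense block can be illustrated with a short NumPy sketch. It follows the DenseNet rule from [43], in which layer l receives the concatenation [x_0, ..., x_{l-1}] of all preceding feature maps. The helper names, the random 1×1 channel mixing that stands in for the real BN-ReLU-Conv stack, and the sizes (16 input maps, growth rate 12, 4 layers) are hypothetical. Under the convention used here, where the block output concatenates the input with every layer's output, a block with L layers, k_0 input maps, and growth rate k emits k_0 + L·k maps; the counting convention in the original figure may differ.

```python
import numpy as np

def conv_block(x, growth_rate, rng):
    """Stand-in for one ConvBlock: maps (H, W, C_in) -> (H, W, growth_rate).

    A random 1x1 channel mixing followed by ReLU replaces the real
    BN-ReLU-Conv stack; only the channel bookkeeping matters here.
    """
    c_in = x.shape[-1]
    w = rng.standard_normal((c_in, growth_rate)) / np.sqrt(c_in)
    return np.maximum(x @ w, 0.0)

def dense_block(x0, num_layers, growth_rate, seed=0):
    """Each layer sees the concatenation of all preceding feature maps."""
    rng = np.random.default_rng(seed)
    features = [x0]
    for _ in range(num_layers):
        layer_input = np.concatenate(features, axis=-1)
        features.append(conv_block(layer_input, growth_rate, rng))
    # Block output: the input maps plus the maps produced by every layer.
    return np.concatenate(features, axis=-1)

x0 = np.ones((8, 8, 16))  # k0 = 16 input feature maps (hypothetical)
out = dense_block(x0, num_layers=4, growth_rate=12)
print(out.shape)          # channels: 16 + 4 * 12
```

Because every layer adds only `growth_rate` new maps, the parameter count grows slowly with depth while earlier features stay directly reachable, which is the feature-reuse property listed above.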

Our generator contains a total of 11 DBs. The number of ConvBlock layers in each DB is marked in Fig. 2. In each ConvBlock layer, a stack of BN-ReLU-Conv-BN-ReLU-Conv operations is performed.

Between two consecutive DBs, a transition block is used to perform the desired down-sampling or up-sampling operation. The down-sampling transition block contains Conv-BN-ReLU-Conv (stride=2); the up-sampling transition block contains Conv-BN-ReLU-Deconv (stride=2), where Deconv denotes the deconvolution (transpose convolution) layer [49]. The features of the input layer are extracted by an initial Conv-BN-ReLU block before being fed to the first DB. A final Conv layer performs the regression that generates the phase map.

The discriminator network aims to distinguish whether the output from the generator is real or fake. Following [50] and [51], we define a conditional Generative Adversarial Network (cGAN) that solves an adversarial min-max problem.
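For reference, the standard conditional GAN objective can be written as follows. The notation is assumed here for illustration: I is the low-resolution intensity input used as the condition, φ is the ground truth phase, G is the generator, and D is the discriminator. The exact form and weighting used for this network may differ from this generic statement.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{(I,\phi)} \big[ \log D(I, \phi) \big]
  + \mathbb{E}_{I} \big[ \log\big( 1 - D(I, G(I)) \big) \big]
```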


The general idea behind this network is to train a generator that can ‘fool’ the discriminator. Here, the discriminator is trained to distinguish whether the high-resolution phase image predicted by the generator represents a real phase image. It has been observed that GANs in general are hard to train and may fail when the generator collapses to a parameter setting where it always gives the same output. A successful strategy to avoid this failure is to allow the discriminator to perform minibatch discrimination [51, 52]. In this case, the discriminator distinguishes whether the reconstructed phase image is real or fake by evaluating multiple sub-regions of the generator-predicted image instead of the whole.

2.2 Loss function

One motivation for using the discriminator network is that the commonly used pixel-wise loss functions, such as the mean absolute error (MAE), mean square error (MSE), and structural similarity index (SSIM), may not be the most appropriate figures of merit, in particular when assessing a CNN’s performance in preserving the high-frequency content of reconstructed images. Minimizing these pixel-wise loss functions can lead to solutions that ignore the high-frequency details and favor smooth solutions with lower perceptual quality [53]. With the cGAN approach, the generator can learn to create a solution that resembles realistic high-resolution images with high-frequency details.

For this purpose, we define the ‘perceptual’ loss function as a weighted sum of multiple loss functions. This ensures that the model can learn the desired features containing both low-frequency and high-frequency information in the phase images. Specifically, our loss function consists of four components: the pixel-wise spatial domain mean-absolute error (MAE) loss, the pixel-wise Fourier domain mean-absolute error (FMAE) loss, the generator’s adversarial loss, and the weight regularization.

Here, the Fourier domain loss is evaluated on the 2D Fourier transforms of the images, the errors are measured with a vector norm, and hyper-parameters control the relative weight of each loss component. We found that the Fourier loss function is sensitive to pixel-wise corruption during the early stage of the training process. As a result, we use it only to refine the outputs by enforcing similarity in the frequency domain [54] after initial training is done with the other three loss components (details in Subsection 2.4).
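The four-component loss described above can be sketched in NumPy as follows. Several choices are assumptions made for illustration rather than details from the text: the function name `mixed_loss`, the orthonormal FFT scaling, the adversarial term written as -log of the discriminator probability, the squared ℓ2 form of the weight regularization, and all numerical weights.

```python
import numpy as np

def mixed_loss(pred, target, d_prob, params, lambdas):
    """Weighted sum of four loss terms; returns (total, per-term dict).

    pred, target : 2D arrays (predicted and ground-truth phase)
    d_prob       : discriminator probabilities in (0, 1] for the prediction
    params       : list of generator weight arrays
    lambdas      : dict with keys "mae", "fmae", "adv", "reg" (hypothetical)
    """
    diff = pred - target
    terms = {
        # Pixel-wise mean absolute error in the image domain.
        "mae": np.mean(np.abs(diff)),
        # Mean absolute error between the 2D Fourier transforms; the FFT is
        # linear, so transforming the difference is equivalent.
        "fmae": np.mean(np.abs(np.fft.fft2(diff, norm="ortho"))),
        # Generator adversarial term: small when the discriminator is fooled.
        "adv": -np.mean(np.log(np.asarray(d_prob) + 1e-12)),
        # Squared l2 penalty on the generator weights (assumed form).
        "reg": sum(float(np.sum(p ** 2)) for p in params),
    }
    total = sum(lambdas[name] * value for name, value in terms.items())
    return total, terms

target = np.zeros((4, 4))
pred = target + 1.0
# Hypothetical weights; "fmae" is zero as in the initial training stage.
lambdas = {"mae": 1.0, "fmae": 0.0, "adv": 0.01, "reg": 1e-4}
total, terms = mixed_loss(pred, target, d_prob=[0.5],
                          params=[np.ones(3)], lambdas=lambdas)
print(terms["mae"], terms["fmae"])
```

Setting the `"fmae"` weight to zero reproduces the initial training stage; switching it to a positive value afterwards corresponds to the refinement stage.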

2.3 Data preparation

To test our CNN technique, we use FPM video data from [29]. The time-series data was taken on HeLa cells at 2 min intervals over the course of 4 hours, which covers several cell cycles. Each FPM dataset contains 173 low-resolution intensity images, of which 37 are brightfield (BF) and 136 are darkfield (DF). Each intensity image is 2560×2160 pixels in 16-bit grayscale.

To generate the data for training, FPM phase reconstructions from [29] are used as the ground truth. Each FPM reconstructed phase image contains 12800×10800 pixels, which is 5×5 larger than the raw intensity image.

To prepare the dataset for training, we use only the first FPM frame in the time-lapse as the training set. Specifically, to prepare the ground truth data, the full-FOV phase image is first divided into 4×4 sub-regions, each containing 3440×2760 pixels. To avoid edge artifacts during training and reconstruction, neighboring sub-regions are chosen to have 320-pixel and 80-pixel overlap along the horizontal and vertical directions, respectively. The corresponding intensity image in each sub-region has 688×552 pixels. The inputs to the CNN are BF and DF image patches that are cropped from random locations in each of the sub-region images, each with 64×64 pixels. Each training input is formed by stacking the BF and DF image patches into a tensor with 64×64 pixels per channel. To facilitate fast computation, the models are designed with a down-sampling path and an up-sampling path. Each input image is up-sampled to 80×80 using bilinear interpolation. The spatial dimensions of the layers in the CNN are 80, 40, 20, 10, 20, 40, 80, 160 and 320, respectively. The corresponding ground truth data contains 320×320 pixels. Each raw BF image was preprocessed by the background subtraction procedure in [29]; each raw DF image was preprocessed to remove the dark current noise [29]. The same preprocessing steps are applied for training, validation, and testing.
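The tiling numbers above are mutually consistent, which a few lines of Python can confirm, and the paired cropping of input and ground truth patches can be sketched alongside. The helper names are hypothetical, as is the assumption that a low-resolution patch at pixel (y, x) corresponds exactly to the ground truth patch at (5y, 5x). The toy arrays use 29 channels, matching pattern P4 (9 BF + 20 DF).

```python
import numpy as np

def tile_starts(full, tile, overlap, n):
    """Start indices of n tiles of width `tile` that overlap by `overlap`."""
    stride = tile - overlap
    starts = [i * stride for i in range(n)]
    assert starts[-1] + tile == full, "tiles do not span the full extent"
    return starts

# Ground-truth phase: 12800 x 10800 pixels, 4 x 4 sub-regions of 3440 x 2760.
gt_x = tile_starts(12800, 3440, 320, 4)
gt_y = tile_starts(10800, 2760, 80, 4)
# Raw intensity: 2560 x 2160 pixels, sub-regions of 688 x 552; the overlaps
# are the ground-truth overlaps divided by the 5x resolution factor.
lr_x = tile_starts(2560, 688, 320 // 5, 4)
lr_y = tile_starts(2160, 552, 80 // 5, 4)

def paired_crop(lr_stack, gt_phase, rng, patch=64, factor=5):
    """Random low-res patch stack and the matching ground-truth patch."""
    h, w = lr_stack.shape[:2]
    y = int(rng.integers(0, h - patch + 1))
    x = int(rng.integers(0, w - patch + 1))
    lr = lr_stack[y:y + patch, x:x + patch, :]
    gt = gt_phase[factor * y:factor * (y + patch),
                  factor * x:factor * (x + patch)]
    return lr, gt

rng = np.random.default_rng(0)
lr_stack = np.zeros((128, 128, 29), dtype=np.float32)  # toy sub-region
gt_phase = np.zeros((640, 640), dtype=np.float32)      # 5x larger
lr, gt = paired_crop(lr_stack, gt_phase, rng)
print(lr.shape, gt.shape)
```

The assertion inside `tile_starts` passes for all four calls, so the stated sub-region sizes and overlaps tile both the raw image and the reconstructed phase image exactly.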

2.4 Training, evaluation, and testing

To investigate the interplay between the illumination pattern and the performance of the CNN, we train our network using several different combinations of BF and DF images. The illumination patterns along with the CNN models used are shown in Fig. 3(a). Each illumination pattern is plotted in Fourier space, in which a yellow circle indicates the NA of the objective lens. Intensity images taken with the LEDs within the circle are BF, whereas those outside the circle are DF. The LEDs in use are marked in red. To systematically study the relation between the reconstructed resolution and the illumination’s angular coverage, we have designed patterns with (P1) 13 BF-only with 0.2 illumination NA, (P2) 13 BF + 36 DF with 0.6 illumination NA, (P3) 13 BF + 10 DF with 0.25 illumination NA, and (P4) 9 BF + 20 DF with 0.4 illumination NA. The following networks are investigated. P1 is used with two networks: U-B, which implements the UNet without DB in [55], and U-B-cGAN, which implements the UNet in [51] with the cGAN architecture (i.e. with the discriminator network in Fig. 2). P2 and P3 are each used with the cGAN network in Fig. 2, denoted D-BD-cGAN. P4 is used with the cGAN network without and with the Fourier loss function, denoted D-BD-cGAN and D-BD-F-cGAN, respectively.

Figure 3: (A) Summary of the illumination patterns and network structures investigated. The illumination angles (shown in Fourier space) in use are marked in red. The yellow circle indicates the NA of the imaging system. (B) A sample full-FOV high-SBP phase reconstruction (at 4 hours) predicted by the proposed network D-BD-F-cGAN. (C) The original intensity image, ground truth phase image, and the reconstructions from the CNN models for the zoomed-in area [marked by the red square in (B)].

Each model was trained for 700-900 epochs. For UNet, the batch size was 16, whereas the batch size was 4 for UNet with DB due to memory limitations. One set of weight coefficients is used when the Fourier loss is not used. When the Fourier loss is used, we first train the network without it for 700 epochs, and then with it for additional epochs. We observed that the network’s parameters are unstable in the early stage of training; to stabilize the training process, we therefore added the Fourier loss only after 700 epochs. We used the ADAM optimizer [56] with a preset initial learning rate and a dropout factor of 0.5 after every 10 epochs, in which each epoch contains 1000 iterations. In each iteration, the algorithm incrementally updates the model using a subset (set by the batch size) of the input. To fine-tune each network, as an optional step, we performed model validation using the FPM frame taken at 2 hours. The best models were selected based on the MAE metric calculated on the validation data.
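The two-stage use of the Fourier loss amounts to a simple schedule on the loss weights, sketched below. The function name and every numerical weight are hypothetical placeholders; only the switch after 700 epochs comes from the text.

```python
def loss_weights(epoch, switch_epoch=700):
    """Return loss weights for a given 0-based epoch (values hypothetical)."""
    weights = {
        "mae": 1.0,              # image-domain MAE
        "adversarial": 0.01,     # generator adversarial term
        "regularization": 1e-4,  # weight regularization
        "fourier": 0.0,          # Fourier-domain MAE, off during early training
    }
    if epoch >= switch_epoch:
        # Refinement stage: the Fourier loss is added once training is stable.
        weights["fourier"] = 0.1
    return weights

for epoch in (0, 699, 700, 899):
    print(epoch, loss_weights(epoch)["fourier"])
```

Keeping the schedule in one function makes it straightforward to train the variants with and without the Fourier loss from the same code path, for example by setting `switch_epoch` beyond the last epoch.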

Once the CNN is trained, which only needs to be performed once using the first FPM frame taken at 0 min, it is applied to reconstruct high-SBP phase video frames (i.e. the testing step). To perform the reconstruction, similar data preprocessing steps are followed as in the training phase. The raw intensity images are first divided into 4×4 sub-regions. Within each region, image patches having the same size as the training patches (64×64) are used for reconstruction. Neighboring image patches contain 15-pixel and 19-pixel overlap in the horizontal and vertical directions, respectively. Each image patch is first up-sampled to 80×80 pixels with bilinear interpolation. The predicted phase image contains 320×320 pixels. Once reconstructions are performed on all 2288 patches, the alpha blending algorithm is used to form the full-FOV phase image containing 12800×10800 pixels. To reconstruct the video, we simply feed each FPM frame to the trained CNN to reconstruct the high-SBP dynamic information from the time-series data. The time for reconstructing each full-FOV, high-SBP image is 25±2 seconds using our cGAN network with the added Fourier loss, D-BD-F-cGAN, which is approximately 50× faster than the standard FPM algorithm (which took 20 min for each frame [29]). A detailed comparison of all networks is given in Section 3 and Table 1.
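The stitching step can be illustrated with a minimal NumPy sketch of alpha blending. The text names the algorithm without specifying it, so the linear-ramp weights and the normalized weighted average below are one common variant chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def ramp_weights(size, overlap):
    """1D blending weights: linear ramps over the overlap, flat elsewhere."""
    w = np.ones(size)
    if overlap > 0:
        ramp = (np.arange(overlap) + 1.0) / (overlap + 1.0)
        w[:overlap] = ramp
        w[-overlap:] = ramp[::-1]
    return w

def blend_patches(patches, corners, out_shape, overlap):
    """Weighted average of overlapping patches placed at (row, col) corners."""
    acc = np.zeros(out_shape)
    norm = np.zeros(out_shape)
    for patch, (y, x) in zip(patches, corners):
        h, w = patch.shape
        weight = np.outer(ramp_weights(h, overlap), ramp_weights(w, overlap))
        acc[y:y + h, x:x + w] += weight * patch
        norm[y:y + h, x:x + w] += weight
    # Weights are strictly positive, so every covered pixel has norm > 0.
    return acc / norm

rng = np.random.default_rng(0)
image = rng.random((10, 16))
corners = [(0, 0), (0, 6)]  # two 10x10 patches with a 4-pixel overlap
patches = [image[y:y + 10, x:x + 10] for y, x in corners]
blended = blend_patches(patches, corners, image.shape, overlap=4)
print(np.allclose(blended, image))
```

When neighboring predictions disagree slightly in the overlap, the ramps cross-fade between them instead of leaving a visible seam; when they agree, as in this toy example, the original image is recovered.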

Figure 4: Fourier analysis of the CNN reconstructed phase images. We directly take the Fourier transform of the reconstructions in Fig. 3(c). They are compared with the raw intensity image from on-axis illumination and the ground truth from FPM. To illustrate the Fourier coverage in each model, we mark three circles in each image: the yellow circle corresponds to the support of the pupil function with a radius of 1NA, the green circle corresponds to the support of the optical transfer function with a radius of 2NA, and the orange circle is the support of the ground truth.
Figure 5: Reconstructed temporal dynamic information using the proposed CNN. (A) The MAE metric is evaluated for every frame of the time-series experiment on all the CNN models. (B) Several frames of the reconstructed high-SBP phase video (see Visualization 1 for more examples) from a zoom-in region, where significant morphological changes are observed over the course of 4 hours.

3 Results and discussion

We present our results in real space (Fig. 3), in Fourier space (Fig. 4), and over different time points (Fig. 5).

Figure 3(a) summarizes all the illumination patterns used for training and testing along with the corresponding networks used. All networks are applied to reconstruct the entire time-series experiment. A sample large-SBP phase reconstruction across the full (1.7 mm × 2.1 mm) FOV is shown in Fig. 3(b). In Fig. 3(c), we zoom in on a sub-region to compare the results from different networks in real space. For comparison, the raw low-resolution intensity image from the central BF illumination is shown, which was bilinearly up-sampled to have the same size as the network’s output.

The result from U-B, which uses BF data only with a UNet without DB or cGAN and only the pixel-wise MAE loss, is a low-resolution phase image. It has been shown that the MAE loss function can lead to blurry results when solving an image reconstruction problem [51] because it does not place sufficient weight on the high-frequency content. To overcome this problem, we use generative adversarial networks with conditional input (cGAN) [50] to reconstruct the phase image. In U-B-cGAN, the UNet is accompanied by a discriminator network in order to better learn high-frequency information. The introduction of the cGAN architecture allows us to better reconstruct sub-cellular structures with more perceptual details; however, the resolution still appears worse than that of the ground truth.

In order to further improve resolution, as in FPM, DF images are needed since they contain high-spatial-frequency information beyond the support of the optical transfer function (OTF). In addition, to deal with the added data size, we also seek a more efficient network structure with higher representation power. The dense block (DB) structure has been shown to provide an efficient representation with a small number of parameters in the model [43]. We present results from three illumination patterns with different angular coverage, all reconstructed with the DenseNet (UNet with DB) and cGAN structure. In D-BD-cGAN, we use 36 DF images covering up to 0.6 illumination NA. This leads to moderate resolution improvement; however, the results are limited by the highly noisy data captured at very large NAs. In general, we observe that a higher illumination angle is not guaranteed to lead to better resolution. The reason is that the DF data are subject to a much higher noise level than the BF data, and the noise level increases as the illumination angle increases [38]. When the signal-to-noise ratio (SNR) falls below a certain threshold, the inclusion of these DF data is no longer helpful. To confirm this, we first use a small amount of DF data from small angles in D-BD-cGAN. It leads to resolution improvement as compared to the one from D-BD-cGAN. It should be noted that the DF SNR can be significantly improved if a dome-shaped LED array [57] is used instead of the planar array in [29]. Heuristically, we found that our CNN can reliably utilize DF data up to 0.4 illumination NA (P4). The reconstructions are further explored using two networks, D-BD-cGAN and D-BD-F-cGAN.
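The relation between LED position and illumination NA can be sketched as follows for a planar array. An LED at lateral distance r from the optical axis and height h above the sample illuminates at NA = r / sqrt(r^2 + h^2). The array pitch, height, objective NA, and grid size below are hypothetical values chosen for illustration, not the parameters of the setup in [29]; only the 0.4 cutoff comes from the discussion above.

```python
import math

def illumination_na(ix, iy, pitch_mm, height_mm):
    """Illumination NA of the LED at grid index (ix, iy) on a planar array.

    The LED sits at lateral distance r from the optical axis and height_mm
    above the sample, so NA = sin(theta) = r / sqrt(r^2 + h^2).
    """
    r = pitch_mm * math.hypot(ix, iy)
    return r / math.hypot(r, height_mm)

def split_leds(half_width, pitch_mm, height_mm, na_obj, na_cutoff):
    """Split LEDs into brightfield (NA <= na_obj) and usable darkfield
    (na_obj < NA <= na_cutoff); LEDs beyond the cutoff are dropped."""
    bf, df = [], []
    for ix in range(-half_width, half_width + 1):
        for iy in range(-half_width, half_width + 1):
            na = illumination_na(ix, iy, pitch_mm, height_mm)
            if na <= na_obj:
                bf.append((ix, iy))
            elif na <= na_cutoff:
                df.append((ix, iy))
    return bf, df

# Hypothetical geometry (not taken from the paper): 4 mm pitch, 70 mm height,
# 0.2 NA objective, and the 0.4 illumination-NA cutoff discussed in the text.
bf_leds, df_leds = split_leds(half_width=10, pitch_mm=4.0, height_mm=70.0,
                              na_obj=0.2, na_cutoff=0.4)
print(len(bf_leds), "BF LEDs,", len(df_leds), "DF LEDs within the cutoff")
```

On a planar array the outer LEDs are both farther from the sample and more oblique, which is consistent with the lower DF SNR at large angles noted above.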

A major limitation of a loss function defined only in image space is that the metric still favors low-frequency information [53] and under-weights high-frequency information. A recently proposed solution is to further include a Fourier loss component [54]. The result from using this strategy is shown in D-BD-F-cGAN. Our reconstruction of the last frame on Hela cells is available at [58].
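One way to add a Fourier-domain term to an image-space loss is sketched below. It is a minimal illustration under assumptions: the Fourier term is taken as the mean magnitude of the difference between the two spectra, and the weight `beta` and the optional per-frequency `weight` array are hypothetical. The exact form used in [54] or in this work may differ.

```python
import numpy as np

def image_mae(pred, target):
    """Image-space mean absolute error."""
    return float(np.mean(np.abs(pred - target)))

def fourier_loss(pred, target, weight=None):
    """Mean magnitude of the spectral difference, optionally weighted
    per frequency (weight must broadcast to the image shape)."""
    diff = np.abs(np.fft.fft2(pred, norm="ortho")
                  - np.fft.fft2(target, norm="ortho"))
    if weight is not None:
        diff = diff * weight
    return float(np.mean(diff))

def combined_loss(pred, target, beta=0.1, weight=None):
    """Image-space MAE plus beta times the Fourier term (beta is hypothetical)."""
    return image_mae(pred, target) + beta * fourier_loss(pred, target, weight)

# Synthetic example: the target has fine detail that the prediction lacks.
yy, xx = np.mgrid[0:64, 0:64]
smooth = np.sin(2 * np.pi * xx / 64)            # low-frequency content
fine = 0.2 * np.sin(2 * np.pi * 20 * xx / 64)   # high-frequency detail
target = smooth + fine
blurry = smooth                                  # detail is missing

print(image_mae(blurry, target), fourier_loss(blurry, target))
```

A `weight` array that grows with spatial frequency would penalize missing high-frequency content more strongly than the unweighted version shown here.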

U-B        0.0401   25.01 dB   0.7575   0.0110   303s
U-B*       0.0331   26.49 dB   0.7790   0.0156   553s
D-BD*      0.0339   26.17 dB   0.7779   0.0146   252s
D-BD*      0.0309   26.76 dB   0.7966   0.0165   252s
D-BD*      0.0308   26.87 dB   0.7964   0.0169   252s
D-BD-F*    0.0318   26.19 dB   0.7797   0.0211   252s
FPM (GT)   0        1.00       0        0.0389   20 mins [29]
Table 1: Performance metrics evaluated on the full-FOV testing data [Legend: * stands for -cGAN, based on the region in Fig. 3(c), GT: ground truth]

To better visualize the recovery of high-frequency information, Fig. 4 shows the Fourier transform of each image in Fig. 3(c). The spectrum of the on-axis BF image is mostly concentrated within the pupil region, i.e. the circular region with a radius of 1NA, and extends up to the support of the OTF (i.e. 2NA). It is well known that using only BF images can provide Fourier coverage up to the support of the OTF. As shown in the Fourier image of U-B-cGAN, the network is able to fully recover this low-frequency information. The inclusion of DF images should lead to larger Fourier coverage; however, the improvement is not significant with a loss function defined only in image space, as shown in the Fourier images of the three D-BD-cGAN networks. The introduction of the Fourier domain loss significantly boosts the Fourier coverage up to the 0.4 illumination NA (<0.6 NA in the ground truth), as shown in the Fourier image of D-BD-F-cGAN. We note that using the Fourier domain loss in the training process generally enhances the sharpness of the results and the frequency measurement metric (FM) [59]; however, it may trade off image-space metrics, such as MAE, SSIM, and PSNR, due to the different metric weighting schemes involved (see Table 1).
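A frequency-domain sharpness measure of the kind referred to as FM can be sketched as follows. This assumes the thresholding rule usually associated with [59]: count the Fourier components whose magnitude exceeds 1/1000 of the peak magnitude and divide by the number of pixels. The `ratio` parameter, the helper `low_pass`, and the synthetic test images are assumptions for illustration only.

```python
import numpy as np

def frequency_measure(img, ratio=1000.0):
    """Fraction of Fourier components whose magnitude exceeds max/ratio."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    threshold = spectrum.max() / ratio
    return float(np.count_nonzero(spectrum > threshold)) / spectrum.size

def low_pass(img, cutoff):
    """Ideal low-pass filter; cutoff is in cycles per pixel (hypothetical helper)."""
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    mask = np.hypot(fy, fx) <= cutoff
    return np.real(np.fft.ifft2(np.fft.fft2(img) * mask))

# Synthetic stand-in images: broadband noise and a band-limited copy of it.
rng = np.random.default_rng(0)
sharp = rng.random((128, 128))
blurred = low_pass(sharp, cutoff=0.125)

fm_sharp = frequency_measure(sharp)
fm_blurred = frequency_measure(blurred)
print(fm_sharp, fm_blurred)
```

Because the measure counts occupied frequencies, a reconstruction with wider Fourier coverage scores higher even when its pixel-wise error is slightly worse, which matches the trade-off described above.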

Further inspecting the results from the CNN and comparing them to the FPM generated ‘ground truth’, we note that the ground truth image contains noisy structures, which are clearly visible in the background. All CNN reconstructed results are free from these background artifacts, demonstrating the robustness of the training process to noisy ground truth data.

A unique feature of our technique is the ability to reconstruct high-SBP phase videos with training data only from the first time point of a long time-series experiment. To demonstrate the effectiveness of this strategy, we show our CNN-predicted temporal frames over the course of more than 4 hours. During this process, a considerable amount of morphological (hence phase distribution) change occurs due to cell division over several cell cycles. Figure 5(b) shows several frames (reconstructed with D-BD-F-cGAN) of a zoomed-in region, where one cell is growing and dividing into multiple cells, and another cell has its membrane rapidly fluctuating. More example videos are provided in Visualization 1. A more quantitative evaluation of the ‘generalization error’ over time is presented in Fig. 5(a), in which the MAE metrics of all the networks studied are plotted for every frame in the time-series experiment. The error is low at the beginning of the experiment and grows slowly as time progresses.
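The per-frame error curve described above amounts to evaluating the MAE between each predicted frame and its reference. A minimal sketch is given below; the array layout (time, height, width) and the synthetic frames with slowly growing error are assumptions used only to exercise the function, not data from the experiment.

```python
import numpy as np

def per_frame_mae(pred_frames, gt_frames):
    """MAE of each frame; both inputs have shape (T, H, W)."""
    return np.mean(np.abs(pred_frames - gt_frames), axis=(1, 2))

# Synthetic stand-in data: the same error pattern, scaled up frame by frame.
rng = np.random.default_rng(0)
gt = rng.random((5, 32, 32))
noise = rng.standard_normal((32, 32))
scale = 0.01 * np.arange(1, 6)
pred = gt + scale[:, None, None] * noise

mae_curve = per_frame_mae(pred, gt)
print(mae_curve)
```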

Figure 6: Transfer learning using the CNN (D-BD-F-cGAN) pre-trained on Hela cells, which is then used to predict the phase of MCF10A and of stained and unstained U2OS cells. (a) The intensity images vary across different cell types and before/after staining. The image patches are taken from the same FOV region using the same illumination angle. (b) The regions used for testing and training to demonstrate transfer learning. Phase reconstructed (c1) by directly applying the pre-trained CNN to the new data and (c2) after 30 min of transfer learning. (c3) The ground truth from [29].

4 Transfer learning

In practice, it is difficult to train a single network that can handle all sample types, which is a main drawback of the DL approach compared to model-based methods. To mitigate this problem, we investigate transfer learning, in which our CNN pre-trained on Hela cells is fine-tuned for other cell types. The effectiveness of this strategy in addressing the limited generalization across sample types has also been demonstrated previously in other biomedical imaging applications [60].

We used D-BD-F-cGAN trained on Hela cells to predict the phase reconstruction of two other cell types (MCF10A, U2OS) with or without staining. The data were captured with the same setup as in [29]. In Fig. 6, we compare two results. First, we directly apply the D-BD-F-cGAN network to the new data. Second, to further refine the results, we use the transfer learning technique. Specifically, we take the weights from the pre-trained network and continue the training for 30 min with the new cell data as the training data. Note that these new cell data contain significant intensity differences. By fine-tuning the model, the CNN is able to produce high-quality reconstructions. During the transfer learning, we did not use any validation data and only evaluated the new CNN’s performance directly after the 30-min training. The results show that transfer learning provides a practical way to broaden the utility of our technique.
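The warm-start idea behind this procedure can be illustrated with a toy example. The sketch below is not the CNN used here: it replaces the network with a small linear model and the two cell types with two synthetic regression tasks whose true weights differ slightly. All names, sizes, and step counts are hypothetical. It only shows that continuing training from pre-trained weights reaches a lower error on the new data than training from scratch with the same short budget.

```python
import numpy as np

def mse(w, x, y):
    """Mean squared error of the linear model x @ w."""
    return float(np.mean((x @ w - y) ** 2))

def train(w, x, y, steps, lr=0.05):
    """Plain gradient descent on the MSE, starting from the given weights."""
    w = w.copy()
    for _ in range(steps):
        grad = 2.0 * x.T @ (x @ w - y) / len(y)
        w -= lr * grad
    return w

# Two related synthetic tasks: B's true weights are a small shift of A's.
rng = np.random.default_rng(0)
x_a = rng.standard_normal((256, 8))
x_b = rng.standard_normal((256, 8))
w_a = rng.standard_normal(8)
w_b = w_a + 0.05 * rng.standard_normal(8)
y_a, y_b = x_a @ w_a, x_b @ w_b

w_pre = train(np.zeros(8), x_a, y_a, steps=500)      # "pre-training" on task A
w_tuned = train(w_pre, x_b, y_b, steps=20)           # short fine-tuning on task B
w_scratch = train(np.zeros(8), x_b, y_b, steps=20)   # same budget, from scratch

print(mse(w_pre, x_b, y_b), mse(w_tuned, x_b, y_b), mse(w_scratch, x_b, y_b))
```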

5 Conclusion

We have demonstrated a deep learning framework for Fourier ptychography video reconstruction. The proposed CNN architecture fully exploits the unique high-SBP imaging capability of FPM so that it can be trained using a single frame and then generalized to a full time-series experiment. In addition, the CNN requires a reduced number of images for high-resolution phase recovery. The reconstruction of each high-SBP image takes less than 30 seconds. Overall, this technique significantly improves the imaging throughput of the FPM system by reducing both the acquisition and reconstruction times. The central idea of our technique is based on the observation that each FPM image contains a large cell ensemble covering all morphological information throughout the time-series experiment. By the principle of ergodicity, the statistical information learned from this large spatial ensemble in a single frame is shown to be sufficient to predict temporal dynamics with high fidelity. In practice, we showed that our trained CNN can successfully reconstruct a high-SBP phase video of dynamic live cell populations with reduced noise artifacts. Using the conditional generative adversarial network (cGAN) framework and a weighted Fourier loss function, the proposed CNN is able to learn the high-resolution information encoded in the darkfield data more effectively. The technique may find wide application in in vitro live cell imaging, gathering large-scale spatial and temporal information in a data- and computation-efficient manner. We also demonstrate that transfer learning is a practical approach to imaging a broad range of new cell samples, bypassing the need to train an entirely new CNN from scratch.


We would like to thank NVIDIA Corporation for supporting us with the GeForce Titan Xp through the GPU Grant Program.


The authors declare that there are no conflicts of interest related to this article.


  • [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015).
  • [2] A. Lucas, M. Iliadis, R. Molina, and A. K. Katsaggelos, “Using Deep Neural Networks for Inverse Problems in Imaging: Beyond Analytical Methods,” IEEE Signal Process. Mag. 35(1), 20–36 (2018).
  • [3] M. Bertero and P. Boccacci, Introduction to inverse problems in imaging (IOP Publishing, 1998).
  • [4] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, “Fast image recovery using variable splitting and constrained optimization,” IEEE Trans. Image Process. 19(9), 2345–2356 (2010).
  • [5] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sciences 2(1), 183–202 (2009).
  • [6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers,” Foundations and Trends® in Machine Learning 3(1), 1–122 (2011).
  • [7] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 105–144.
  • [8] Y. Rivenson, Z. Göröcs, H. Günaydin, Y. Zhang, H. Wang, and A. Ozcan, “Deep learning microscopy,” Optica 4(11), 1437–1443 (2017).
  • [9] H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with BM3D?,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2012), pp. 2392–2399.
  • [10] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Trans. Image Process. 26(7), 3142–3155 (2017).
  • [11] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,”
  • [12] L. Xu, J. S. Ren, C. Liu, and J. Jia, “Deep convolutional neural network for image deconvolution,” Advances in Neural Information Processing Systems (NIPS, 2014), pp. 1790–1798.
  • [13] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2010), pp. 2528–2535.
  • [14] H. Yao, F. Dai, D. Zhang, Y. Ma, S. Zhang, and Y. Zhang, “Dr2-net: Deep residual reconstruction network for image compressive sensing,”
  • [15] K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok, “Reconnet: Non-iterative reconstruction of images from compressively sensed measurements,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 449–458.
  • [16] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Trans. Image Process. 26(9), 4509–4522 (2017).
  • [17] T. Nguyen, V. Bui, and G. Nehmetallah, “Computational optical tomography using 3-D deep convolutional neural networks,” Opt. Eng. 57(4), 043111 (2018).
  • [18] E. M. Christiansen, S. J. Yang, D. M. Ando, A. Javaherian, G. Skibinski, S. Lipnick, E. Mount, A. O’Neil, K. Shah, A. K. Lee, P. Goyal, W. Fedus, P. Ryan, A. Esteve, M. Berndl, L. L. Rubin, P. Nelson, and S. Finkbeiner, “In silico labeling: Predicting fluorescent labels in unlabeled images,” Cell 173(3), 792–803 (2018).
  • [19] Y. Rivenson, Y. Zhang, H. Günaydın, D. Teng, and A. Ozcan, “Phase recovery and holographic image reconstruction using deep learning in neural networks,” Light Sci. Appl., 7(2), 17141 (2018).
  • [20] Z. Ren, Z. Xu, and E. Y. Lam, “Learning-based nonparametric autofocusing for digital holography,” Optica 5(4), 337–344 (2018).
  • [21] A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica 4(9), 1117–1125 (2017).
  • [22] S. Li, M. Deng, J. Lee, A. Sinha, and G. Barbastathis, “Imaging through glass diffusers using densely connected convolutional networks,” Optica 5, 803 (2018).
  • [23] Y. Li, Y. Xue, and L. Tian, “Deep speckle correlation: a deep learning approach towards scalable imaging through scattering media,”
  • [24] A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos, “Video super-resolution with convolutional neural networks,” IEEE Trans. Comput. Imaging 2(2), 109–122 (2016).
  • [25] O. Shahar, A. Faktor, and M. Irani, “Space-time super-resolution from a single video,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2011), pp. 3353–3360.
  • [26] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in Proceedings of IEEE International Conference on Computer Vision (IEEE, 2015), pp. 2758–2766.
  • [27] S. Nah, T. H. Kim, and K. M. Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 257–265.
  • [28] H. Chen, J. Gu, O. Gallo, M. Liu, A. Veeraraghavan, and J. Kautz, “Reblur2deblur: Deblurring videos via self-supervised learning,” in Proceedings of IEEE International Conference on Computational Photography (IEEE, 2018), pp. 1–9.
  • [29] L. Tian, Z. Liu, L.-H. Yeh, M. Chen, J. Zhong, and L. Waller, “Computational illumination for high-speed in vitro Fourier ptychographic microscopy,” Optica 2(10), 904–911 (2015).
  • [30] G. Zheng, R. Horstmeyer, and C. Yang, “Wide-field, high-resolution Fourier Ptychographic microscopy,” Nat. Photonics 7(9), 739–745 (2013).
  • [31] D. J. Stephens and V. J. Allan, “Light microscopy techniques for live cell imaging,” Science, 300(5616), 82–86 (2003).
  • [32] T. Ashihara and R. Baserga, “[20] cell synchronization,” Methods in Enzymology 8, 248–262 (1979).
  • [33] L. Tian, X. Li, K. Ramchandran, and L. Waller, “Multiplexed coded illumination for Fourier ptychography with an LED array microscope,” Biomed. Opt. Express 5(7), 2376–2389 (2014).
  • [34] A. Kappeler, S. Ghosh, J. Holloway, O. Cossairt, and A. Katsaggelos, “Ptychnet: Cnn based fourier ptychography,” in Proceedings of IEEE International Conference on Image Processing (IEEE, 2017), pp. 1712–1716.
  • [35] E. Nehme, L. E. Weiss, T. Michaeli, and Y. Shechtman, “Deep-storm: super-resolution single-molecule microscopy by deep learning,” Optica 5(4), 458–464 (2018).
  • [36] M. Weigert, U. Schmidt, T. Boothe, A. Muller, A. Dibrov, A. Jain, B. Wilhelm, D. Schmidt, C. Broaddus, S. Culley, M. Rocha-Martins, F. Segovia-Miranda, C. Norden, R. Henriques, M. Zerial, M. Solimena, P. Tomancak, L. Royer, F. Jug, and E. W. Myers, “Content-aware image restoration: Pushing the limits of fluorescence microscopy,”
  • [37] N. Boyd, E. Jonas, H. P. Babcock, and B. Recht, “Deeploco: Fast 3d localization microscopy using neural networks,”
  • [38] L.-H. Yeh, J. Dong, J. Zhong, L. Tian, M. Chen, G. Tang, M. Soltanolkotabi, and L. Waller, “Experimental robustness of Fourier ptychography phase retrieval algorithms,” Opt. Express 23(26), 33214–33240 (2015).
  • [39] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, “Learning from massive noisy labeled data for image classification,” IEEE Conference on Computer Vision and Pattern Recognition, 2691–2699 (2015).
  • [40] Z. Lu, Z. Fu, T. Xiang, P. Han, L. Wang, and X. Gao, “Learning from weak and noisy labels for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell. 39(3), 486–500 (2017).
  • [41] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013).
  • [42] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” European conference on computer vision, Springer, 818–833 (2014).
  • [43] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 2261–2269.
  • [44] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,”
  • [45] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,”
  • [46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res. 15(1), 1929–1958 (2014).
  • [47] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 770–778.
  • [48] F. Agostinelli, M. D. Hoffman, P. J. Sadowski, and P. Baldi, “Learning activation functions to improve deep neural networks,”
  • [49] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,”
  • [50] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (NIPS, 2014), pp. 2672–2680.
  • [51] P. Isola, J. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 5967–5976.
  • [52] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems (NIPS, 2016), pp. 2234–2242.
  • [53] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process. 13(4), 600–612 (2004).
  • [54] G. Yang, S. Yu, H. Dong, G. Slabaugh, P. L. Dragotti, X. Ye, F. Liu, S. Arridge, J. Keegan, Y. Guo, and D. Firmin, “Dagan: Deep de-aliasing generative adversarial networks for fast compressed sensing mri reconstruction,” IEEE Trans. Med. Imaging 37(6), 1310–1321 (2018).
  • [55] T. Nguyen, V. Bui, V. Lam, C. B. Raub, L.-C. Chang, and G. Nehmetallah, “Automatic phase aberration compensation for digital holographic microscopy based on deep learning background detection,” Opt. Express 25(13), 15043–15057 (2017).
  • [56] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
  • [57] Z. F. Phillips, M. V. D’Ambrosio, L. Tian, J. J. Rulison, H. S. Patel, N. Sadras, A. V. Gande, N. A. Switz, D. A. Fletcher, and L. Waller, “Multi-contrast imaging and digital refocusing on a mobile microscope with a domed led array,” PLoS ONE 10(5), e0124938 (2015).
  • [58] T. Nguyen, Y. Xue, Y. Li, L. Tian, and G. Nehmetallah, “DeepLearningFourierPtychographicMircoscopy,” (2018). Accessed: 2018-7-21.
  • [59] K. De and V. Masilamani, “Image sharpness measure for blurred images in frequency domain,” Procedia Eng. 64, 149–158 (2013).
  • [60] Y. Rivenson, H. Wang, Z. Wei, Y. Zhang, H. Gunaydin and A. Ozcan, “Deep learning-based virtual histology staining using auto-fluorescence of label-free tissue,”