Global Transformer U-Nets for Label-Free Prediction of Fluorescence Images

07/01/2019 ∙ by YI LIU, et al. ∙ 0

Visualizing the details of different cellular structures is of great importance to elucidate cellular functions. However, it is challenging to obtain high-quality images of different structures directly due to complex cellular environments. Fluorescence microscopy is a popular technique to label different structures but has several drawbacks. In particular, labeling is time consuming and may affect cell morphology, and the number of simultaneous labels is inherently limited. This raises the need for computational models that learn relationships between unlabeled images and labeled fluorescence images, and that infer fluorescent labels for other unlabeled images. We propose a novel deep model for fluorescence image prediction. We first propose a novel network layer, known as the global transformer layer, that fuses global information from inputs effectively. The proposed global transformer layer can generate outputs with arbitrary dimensions, and can be employed for regular, down-sampling, and up-sampling operators. We then incorporate our proposed global transformer layers and dense blocks to build a U-Net-like network. We believe such a design can promote feature reuse between layers. In addition, we propose a multi-scale input strategy to encourage networks to capture features at different scales. We conduct evaluations across various label-free prediction tasks to demonstrate the effectiveness of our approach. Both quantitative and qualitative results show that our method outperforms the state-of-the-art approach significantly. It is also shown that our proposed global transformer layer is useful to improve the fluorescence image prediction results.




1 Introduction

Capturing and visualizing the details of different sub-cellular structures is an important but challenging problem in cellular biology (Koho et al., 2016; Jo et al., 2019). Detailed information on the shapes and locations of cellular structures plays an important role in investigating cellular functions (Held et al., 2010; Glory and Murphy, 2007; Chou and Shen, 2008). The widely used transmitted light microscopy can only provide low-contrast images, and it is difficult to study certain structures or functional characteristics from such images (Bray et al., 2012; Buchser et al., 2014). One popular technique to overcome these limitations is fluorescence microscopy, which labels different structures with dyes or dye-conjugated antibodies (Ounkomol et al., 2018). For example, cell nuclei can be labeled and visualized after being stained with DAPI (Ounkomol et al., 2018; Christiansen et al., 2018). However, fluorescence labeling is time consuming, especially when cell structures are complex. In addition, due to spectral overlap, there is a limit on the number of fluorescent labels that can be applied simultaneously to the same microscopy image (Bastiaens and Squire, 1999; Wang et al., 2010). Furthermore, labeling may interfere with regular physiological processes in live cells, resulting in changes in cell morphology (Jo et al., 2019; Ounkomol et al., 2018). These limitations raise the need for advanced methods to label cellular structures more effectively and efficiently.

Figure 1: Diagram of our proposed global transformer (GT). The spatial sizes of the output are determined by the query tensor $\mathcal{Q}$. Generally speaking, GT can generate output of arbitrary sizes. In practice, three types of GTs are investigated. From left to right, Global Up Transformer (GUT) doubles the spatial sizes; Global Same Transformer (GST) keeps the spatial sizes; Global Down Transformer (GDT) halves the spatial sizes. For each case, the response at each position in the output is computed as a weighted summation of features at all positions in the value tensor $\mathcal{V}$, which is obtained directly from the input features. Thus, global context information of the input is captured by GT.

With the rapid development of deep learning methods, recent studies (Ounkomol et al., 2018; Christiansen et al., 2018; Yuan et al., 2018) propose to formulate such problems as image dense prediction tasks using deep neural networks. In such a dense prediction task, we wish to predict if each pixel on the input transmitted light image belongs to a fluorescent label or not. Given transmitted light images and corresponding fluorescence labeled images, the models are trained to capture the relationship between them. Then for any newly obtained transmitted light image, the fluorescence image can be predicted by the models based on the learned relationships. When considering multiple fluorescent labels for the same image, a multi-task image dense prediction problem can be used. In these cases, the models are trained to learn the underlying relationships between the raw images and multiple fluorescent labels. In the prediction phase, the models produce a prediction map for each fluorescent label. Such a problem formulation allows us to obtain multiple fluorescent labels simultaneously from transmitted light images without labeling.

The recent study in Christiansen et al. (2018) proposes to use convolutional neural networks (CNNs) (LeCun et al., 1998; Simonyan and Zisserman, 2014; He et al., 2016) for such a task and obtains promising results for prediction of fluorescence images. It stacks multiple convolutional layers to enlarge the receptive field and employs inception modules (Szegedy et al., 2015) to facilitate the training. However, only local operators, such as convolution, pooling, and deconvolution, are used in their model. Hence, the global information cannot be captured effectively and efficiently, while such information may be important to determine certain fluorescent labels. Meanwhile, another work (Ounkomol et al., 2018) employs a vanilla U-Net framework for prediction of fluorescence images. For each type of fluorescent label, it builds a model to learn the relationships between transmitted light images and the corresponding fluorescent label. However, such a design learns different fluorescent types separately, thereby ignoring important relationships among different fluorescent labels. In addition, it only employs local operators so that the global information cannot be effectively captured. Other studies on image missing modality prediction tasks employ similar network architectures (Cai et al., 2018; Zhang et al., 2015; Li et al., 2014; Chen et al., 2018).

In this work, we propose a novel deep learning model, known as the global transformer (GT) U-Nets, for fluorescent label predictions. As a radical departure from previous studies that invariably employ local operators, we develop a novel network layer, known as the global transformer layer, to fuse global information efficiently and effectively. The global transformer layer is inspired by the attention operators (Vaswani et al., 2017; Wang et al., 2018), and each position of the output in the global transformer layer fuses information from all input positions. In particular, our proposed layer can be flexibly generalized to produce outputs of any dimensions. We build a U-Net-like architecture based on our proposed global transformer layer. We further develop dense blocks in our network to promote feature reuse between layers. To capture both global contextual and local subtle features, we propose a multi-scale input strategy in our model to incorporate information at different scales. Importantly, our model is designed in a multi-task manner to predict several target fluorescent labels simultaneously. We conduct extensive experiments to evaluate our proposed approach across various fluorescent label prediction tasks. Both quantitative and qualitative results show that our model outperforms the existing approach (Christiansen et al., 2018) significantly. Our ablation analysis shows that the proposed global transformer layer is useful to improve model performance.

Figure 2: Overall pipeline of our method for prediction of fluorescence images. The network produces predictions for a cropped patch in the whole image. For multi-scale inputs, besides the cropped patch, two other patches centered at the same pixel, one larger and one smaller, are also cropped and re-scaled to the size of the first patch. The input to the network is the concatenation of these three patches. The U-like architecture includes an encoder part and a decoder part. In the encoder part, each dense block is followed by a GDT layer. Sizes of feature maps are reduced by the GDT layer, and numbers of feature maps are increased by the dense block. In the decoder, each GUT layer is followed by a dense block. GUT layers recover the spatial sizes and reduce the number of feature maps. In the bottom block of the U-like architecture, a GST layer follows a dense block to transmit information from the encoder to the decoder. A detailed diagram of a dense block with 3 layers is also shown. Each layer includes convolution, batch normalization, ReLU activation, and dropout. A convolution layer is added at the end to adjust the number of feature maps.

2 Background and Related Work

We describe the attention operator in this section. The inputs to an attention operator include three matrices: a query matrix $Q = [q_1, q_2, \ldots, q_m] \in \mathbb{R}^{d \times m}$ with each query vector $q_i \in \mathbb{R}^{d}$, a key matrix $K = [k_1, k_2, \ldots, k_n] \in \mathbb{R}^{d \times n}$ with each key vector $k_j \in \mathbb{R}^{d}$, and a value matrix $V = [v_1, v_2, \ldots, v_n] \in \mathbb{R}^{p \times n}$ with each value vector $v_j \in \mathbb{R}^{p}$. An attention operator computes the output at each position by performing a weighted sum over all value vectors in $V$, where the weights are acquired by attending the corresponding query vector to all key vectors in $K$. Formally, to compute a response at a position $i$, the attention operator first computes the weight vector $a_i$ as

$$a_i = \mathrm{Softmax}(K^T q_i) \in \mathbb{R}^{n}, \tag{1}$$

where $\mathrm{Softmax}(\cdot)$ ensures that the elements of $a_i$ sum to 1. Each element of $a_i$ measures the importance of the corresponding vector in $V$, and is obtained from the inner product between the matching key vector in $K$ and $q_i$. The response at position $i$ is then computed by using the weight vector $a_i$ to perform a weighted sum over all vectors in $V$ as

$$o_i = V a_i \in \mathbb{R}^{p}. \tag{2}$$

In this way, the response at position $i$ fuses the global information in $V$ by assigning an importance to each value vector with reference to $q_i$. For the response at each position, we follow the same procedure and obtain the outputs as

$$O = [o_1, o_2, \ldots, o_m] \in \mathbb{R}^{p \times m}. \tag{3}$$

We rewrite the outputs of an attention operator as

$$O = V \, \mathrm{Softmax}(K^T Q), \tag{4}$$

where $\mathrm{Softmax}(\cdot)$ denotes a column-wise softmax operator that ensures every column sums to 1. We can easily see that the number of vectors in the output matrix $O$ is determined by the number of vectors in the query matrix $Q$. In self-attention operators, we set $Q = K = V$. Thus, the response at a position is computed as the weighted average of features at all positions, thereby fusing global information from input feature maps. Note that a fully connected (FC) layer also fuses global information from whole receptive fields. However, the self-attention operator computes responses based on similarities between feature vectors at different positions, whereas an FC layer connects every neuron to compute responses using learnable weights. Moreover, a self-attention operator deals with inputs of variable sizes, while an FC layer needs the input sizes to be fixed.
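The attention operator described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the helper names `column_softmax` and `attention`, the matrix sizes, and the random inputs are assumptions made for the example, not part of the original work. Matrices follow the convention of one vector per column.

```python
import numpy as np


def column_softmax(scores):
    """Softmax applied to each column, so every column sums to 1."""
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=0, keepdims=True)


def attention(Q, K, V):
    """Compute O = V Softmax(K^T Q) for Q: (d, m), K: (d, n), V: (p, n)."""
    A = column_softmax(K.T @ Q)  # (n, m): column i holds the weight vector a_i
    return V @ A                 # (p, m): column i is the response o_i


# Illustrative sizes only (assumed for this example).
rng = np.random.default_rng(0)
d, p, m, n = 4, 3, 5, 7
Q = rng.normal(size=(d, m))
K = rng.normal(size=(d, n))
V = rng.normal(size=(p, n))
O = attention(Q, K, V)
print(O.shape)  # (3, 5): one p-dimensional response per query vector
```

The output has as many columns as the query matrix, which is the property the global transformer layer later relies on to change spatial sizes.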

3 Global Transformer U-Nets

In this section, we introduce a novel model for prediction of fluorescence images, known as the multi-scale global transformer U-Nets with dense blocks.

3.1 Global Transformers

Traditional deep learning models for dense prediction tasks contain several key operators, such as convolution, pooling, and deconvolution. These operators are all performed within a local neighborhood, restricting the capacity of networks to fuse global context information. To overcome this limitation, we propose a novel network layer, known as the global transformer (GT), which is based on the attention operator and captures dependencies between each position on outputs and all positions on inputs, thereby fusing global information from input feature maps. Unlike the self-attention operator that generates outputs with the same dimensions as the inputs, our proposed GT layer can generate output feature maps with arbitrary dimensions, and can be employed for regular, down-sampling, and up-sampling operators. Specifically, we investigate three types of global transformers, namely global down transformer (GDT), global up transformer (GUT), and global same transformer (GST). The spatial dimensions of feature maps are halved in GDT, doubled in GUT, and kept the same in GST.

Although the three types of global transformers generate outputs of different sizes, they share a similar structure and computational pipeline. An illustration of our proposed GT is provided in Figure 1. Let $\mathcal{X} \in \mathbb{R}^{h \times w \times c}$ denote the input of the GT layer. The first step is to compute the query tensor $\mathcal{Q} \in \mathbb{R}^{h_Q \times w_Q \times c_Q}$, key tensor $\mathcal{K} \in \mathbb{R}^{h_K \times w_K \times c_K}$, and value tensor $\mathcal{V} \in \mathbb{R}^{h_V \times w_V \times c_V}$ based on $\mathcal{X}$. We employ a generator layer to obtain the query tensor, and two convolution layers to obtain the key and value tensors as

$$\mathcal{Q} = \mathrm{Generator}(\mathcal{X}), \quad \mathcal{K} = \mathrm{Conv}^{c_K}(\mathcal{X}), \quad \mathcal{V} = \mathrm{Conv}^{c_V}(\mathcal{X}), \tag{5}$$

where Generator denotes a query generator layer, and $\mathrm{Conv}^{c}$ denotes a convolution layer with stride 1 and $c$ output feature maps. Hence, $h_K$ is equal to $h_V$ and $w_K$ is equal to $w_V$. The choice of the query generator depends on the type of global transformer. For GDT, we employ a convolutional layer with stride 2 to generate $\mathcal{Q}$. For GUT, we employ a deconvolutional layer with stride 2 to generate $\mathcal{Q}$. For GST, we employ a convolutional layer with stride 1 to generate $\mathcal{Q}$.

We then convert each of the third-order tensors into a matrix by unfolding along mode-3 (Kolda and Bader, 2009). In this way, tensor $\mathcal{Q}$ is converted into a matrix $Q \in \mathbb{R}^{c_Q \times h_Q w_Q}$. Similarly, $\mathcal{K}$ is converted into a matrix $K \in \mathbb{R}^{c_K \times h_K w_K}$ and $\mathcal{V}$ is converted into a matrix $V \in \mathbb{R}^{c_V \times h_V w_V}$. These three matrices serve as the query, key and value matrices in Eq. 4. To ensure that the attention operator is valid, we set $c_Q = c_K$. The output of the attention operator is computed as

$$O = V \, \mathrm{Softmax}(K^T Q) \in \mathbb{R}^{c_V \times h_Q w_Q}. \tag{6}$$

Finally, the output matrix $O$ is converted back to a third-order tensor $\mathcal{O} \in \mathbb{R}^{h_Q \times w_Q \times c_V}$, as the output of the GT layer. As a result, the feature at each position in the output tensor $\mathcal{O}$ is computed as a weighted sum of all feature vectors in $\mathcal{V}$, which is obtained directly from the input tensor $\mathcal{X}$. Thus, global information from input features is captured and fused to generate the output through our GT layers. In addition, the spatial sizes ($h_Q \times w_Q$) of output feature maps are determined by the spatial sizes of the query tensor $\mathcal{Q}$, while the number of output feature maps depends on the value tensor $\mathcal{V}$. Theoretically, our proposed GT layer can generate feature maps of arbitrary dimensions. In practice, the commonly used local operators either keep the spatial sizes of feature maps, or double the spatial sizes for up-sampling, or halve the spatial sizes for down-sampling. Hence, in this work, we propose to substitute these local operators by three types of global transformers.
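The pipeline above (generate query, key, and value tensors, unfold them into matrices, apply attention, fold back) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the key and value convolutions are replaced by per-position linear maps, and the query generators are replaced by simple stand-ins (sub-sampling for the stride-2 convolution, 2x2 repetition for the stride-2 deconvolution). The point is only to show how the query tensor fixes the output's spatial sizes.

```python
import numpy as np


def column_softmax(scores):
    """Softmax applied to each column, so every column sums to 1."""
    shifted = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=0, keepdims=True)


def pointwise(X, W):
    """Per-position linear map: (h, w, c_in) @ (c_in, c_out) -> (h, w, c_out)."""
    return X @ W


def global_transformer(X, W_q, W_k, W_v, mode="same"):
    """Sketch of a GT layer; mode is 'down' (GDT), 'same' (GST) or 'up' (GUT)."""
    if mode == "down":
        # Assumed stand-in for a stride-2 convolution: keep every second position.
        Q = pointwise(X[::2, ::2], W_q)
    elif mode == "up":
        # Assumed stand-in for a stride-2 deconvolution: repeat each position 2x2.
        Q = pointwise(X.repeat(2, axis=0).repeat(2, axis=1), W_q)
    else:
        Q = pointwise(X, W_q)
    K = pointwise(X, W_k)
    V = pointwise(X, W_v)
    h_q, w_q, c_q = Q.shape
    # Unfold each tensor into a matrix with one column per spatial position.
    Qm = Q.reshape(-1, c_q).T            # (c_Q, h_Q * w_Q)
    Km = K.reshape(-1, K.shape[-1]).T    # (c_K, h * w)
    Vm = V.reshape(-1, V.shape[-1]).T    # (c_V, h * w)
    Om = Vm @ column_softmax(Km.T @ Qm)  # (c_V, h_Q * w_Q)
    return Om.T.reshape(h_q, w_q, -1)    # fold back to (h_Q, w_Q, c_V)


# Illustrative sizes only (assumed): c = 6, c_Q = c_K = 4, c_V = 5.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 8, 6))
W_q = rng.normal(size=(6, 4))
W_k = rng.normal(size=(6, 4))
W_v = rng.normal(size=(6, 5))
for mode in ("down", "same", "up"):
    print(mode, global_transformer(X, W_q, W_k, W_v, mode).shape)
```

In all three modes every output position attends to all 64 input positions; only the number of query positions changes.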

The traditional local operators, such as max pooling and convolution with a stride of 2, may also capture global information by stacking the same operator many times. However, such stacking is not efficient. For example, to capture the global information in an $n \times n$ area, max pooling with a 2×2 window and stride 2 needs to be repeated about $\log_2 n$ times. In contrast, our proposed GT layers can capture global relationships between any two positions using only one layer. Therefore, our proposed methods are more efficient and effective compared to traditional local operators.

3.2 Global Transformer U-Nets

It is well-known that encoder-decoder architectures like U-Nets (Ronneberger et al., 2015) have achieved the state-of-the-art performance in various dense prediction tasks. However, these networks employ local operators like convolution, pooling and deconvolution, which cannot efficiently capture global information. Based on our GT layer, we propose a novel network for dense prediction tasks, known as the global transformer U-Nets (GT U-Nets).

In U-Nets, down-sampling layers are employed to reduce spatial sizes and obtain high-level features, while up-sampling layers are used to recover spatial dimensions. The commonly used convolution, pooling, and deconvolution operators are performed in local neighborhoods on feature maps. We propose to substitute these local operators with our proposed GT layers. By setting different sizes for the query tensor $\mathcal{Q}$, our proposed GT layers can be employed for both down-sampling and up-sampling, while considering global information to build output features. Suppose an input feature map has a spatial size of $h \times w$. For the down-sampling operator, a GDT layer halves the spatial sizes of input feature maps, which can be achieved by setting the spatial sizes of the query tensor to $h_Q = h/2$ and $w_Q = w/2$. For the up-sampling operator, the spatial sizes of feature maps are doubled by setting $h_Q = 2h$ and $w_Q = 2w$ in a GUT layer. In addition, the GST layers are employed to transmit information from the encoder to the decoder in the bottom block of U-Nets.

In addition, due to the multiple down-sampling and up-sampling operators in U-Nets, spatial information, such as the shapes and locations of cellular structures, is largely lost in the information flow. Since the decoder recovers the spatial sizes from high-level features, the prediction may not fully incorporate all spatial information, even though such information is important for dense prediction. Hence, we adopt the idea of building skip connections between the encoder and the decoder in U-Nets. Such connections are expected to enable the sharing of spatial information and high-level features between the encoder and decoder, and hence improve the performance of dense prediction.

| Condition | Cell Type | Fluorescent Label 1 and Modality | Fluorescent Label 2 and Modality | Fluorescent Label 3 and Modality | Training Images | Test Images | Image Size (pixels) | Laboratory |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A | human motor neurons | DAPI (Wide Field) | TuJ1 (Wide Field) | Islet1 (Wide Field) | 286 | 39 | 1900x2600 | Rubin |
| B | human motor neurons | DAPI (Confocal) | MAP2 (Confocal) | NFH (Confocal) | 273 | 52 | 4600x4600 | Finkbeiner |
| C | primary rat cortical cultures | DAPI (Confocal) | DEAD (Confocal) | - | 936 | 273 | 2400x2400 | Finkbeiner |
| D | primary rat cortical cultures | DAPI (Confocal) | MAP2 (Confocal) | NFH (Confocal) | 26 | 13 | 4600x4600 | Finkbeiner |
| E | human breast cancer line | DAPI (Confocal) | CellMask (Confocal) | - | 13 | 13 | 3500x3500 | Google |

Table 1: A description of the datasets used in our experiments. The datasets were created by Christiansen et al. (2018) under five conditions from three laboratories. Each set of 13 2D images is a z-stack of transmitted-light images collected from one 3D biomedical sample. In total, eight fluorescent labels are introduced for all the datasets.

3.3 Global Transformer U-Nets with Dense Blocks

To perform dense prediction on images, deep networks are usually required to extract high-level features. However, a known problem for training very deep CNNs is that gradient flow in deep networks is sometimes saturated. Residual connections have been shown to be effective to solve such a problem in various popular networks, such as ResNets (He et al., 2016) and DenseNets (Huang et al., 2017). In ResNets, residual connections are employed in residual blocks to share the different levels of features between the non-linear transformation of the input and the identity mapping. They benefit the convergence of very deep neural networks by providing a highway for the gradients to back-propagate. Recently, the residual U-Net (Quan et al., 2016; Fakhry et al., 2017) was proposed to inherit the benefits of both long-range skip connections and short-range residual connections. It is shown to obtain more precise results on dense prediction tasks without increasing parameters. Since DenseNets employ extreme residual connections, also known as dense connections, to build dense blocks and achieve state-of-the-art performance on image classification tasks, we follow a similar idea to use dense blocks in our proposed global transformer U-Nets.

The general structure of our model is shown in Figure 2. We combine the dense block and the GT layer to better incorporate dense connections. For the encoder part, each dense block is followed by a GDT layer, since the dense block retains the spatial sizes of the input while the GDT layer performs down-sampling. The reduction of spatial sizes is compensated by the growth in feature map number generated by the dense block. Correspondingly, each GUT layer in the decoder is followed by a dense block, and the GUT layer recovers the spatial sizes and reduces the number of feature maps.

For each dense block in our model, residual connections are employed to connect every layer and its subsequent layers. A typical $L$-layer dense block can be defined as

$$x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}]), \quad \ell = 1, \ldots, L, \tag{7}$$

where $x_0$ is the input to the dense block, $x_\ell$ is the output of the $\ell$-th layer, and $[\cdot]$ represents the concatenation operator. $H_\ell$ denotes a series of operators, including convolution, batch normalization (BN) (Ioffe and Szegedy, 2015), ReLU activation, and dropout (Srivastava et al., 2014). Each layer in a dense block generates $k$ new feature maps, and they are concatenated with previously generated feature maps. Note that $k$ is also called the growth rate of the dense block. Hence, the output of the dense block contains information regarding both the input feature maps and newly generated feature maps. A general illustration of our employed dense block is shown in Figure 2. Note that we add a convolution layer before the output to make the dense block more flexible so that the number of output feature maps can be controlled. Intuitively, a dense block encourages feature reuse between layers. In addition, compared with traditional networks of the same capacity, it can significantly reduce the number of parameters since each layer in a dense block only generates $k$ new feature maps.
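The concatenation pattern of a dense block can be illustrated with a short NumPy sketch. The function name `dense_block` is hypothetical, and each layer's series of operators is replaced here by an assumed stand-in (a per-position linear map followed by ReLU, without batch normalization or dropout); the final convolution that adjusts the number of output feature maps is omitted. The sketch only shows how the channel count grows by the growth rate at every layer.

```python
import numpy as np


def dense_block(x0, layer_weights):
    """Run a dense block on x0 of shape (h, w, c0).

    layer_weights[l] has shape (c0 + l * k, k): each layer sees the
    concatenation of all earlier feature maps and adds k new ones.
    """
    features = [x0]
    for W in layer_weights:
        stacked = np.concatenate(features, axis=-1)
        # Assumed stand-in for one layer: per-position linear map + ReLU.
        features.append(np.maximum(stacked @ W, 0.0))
    return np.concatenate(features, axis=-1)


# Illustrative setting: 32 input channels, growth rate 16, 2 layers.
rng = np.random.default_rng(0)
c0, k, num_layers = 32, 16, 2
x0 = rng.normal(size=(8, 8, c0))
weights = [rng.normal(size=(c0 + l * k, k)) for l in range(num_layers)]
out = dense_block(x0, weights)
print(out.shape)  # (8, 8, 64): c0 + num_layers * k channels
```

With a growth rate of 16, two layers turn 32 channels into 64, four layers turn 64 into 128, and eight layers turn 128 into 256, which agrees with the encoder channel counts listed in Table 2.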

3.4 Multi-Scale Input Strategy

One training strategy for dense prediction tasks is to feed the whole image as input and produce predictions for all input pixels. However, such a strategy requires excessive memory on training hardware. On modern hardware like GPUs, memory resource is always limited. This data feeding strategy becomes inefficient for large inputs, which is quite common for biological image processing tasks. One common solution is to crop small patches from the original image, and train the neural networks with these small image patches. To predict the whole image, an overlap-tile strategy can be used to allow continuous segmentation (Ronneberger et al., 2015). However, such a divide-and-conquer strategy imposes a natural constraint on networks. When predicting small patches, only the local information within these patches can be captured by the network, while the global information is ignored. Furthermore, the information in local subtle area may be ignored when the sizes of local area are relatively small compared with the patch sizes. To overcome these limitations, we propose a multi-scale input strategy to incorporate sufficient global and local information to perform prediction.

Assume that the size of image patches for network training is $s \times s$. For an image patch, let $c$ denote the center, and let $P_1$ be the $s \times s$ image patch cropped around $c$ for training. To incorporate global information, we crop another, larger image patch $P_2$ with the same center to provide a larger receptive field. This patch is re-scaled to $s \times s$ but contains more global information. This is particularly useful when the original patch contains pixels lying on incomplete edges. In addition, we crop another, smaller image patch $P_3$ with the same center to capture local subtle information. This patch is also re-scaled to $s \times s$. Compared with $P_1$, small subtle areas are up-scaled in $P_3$, which encourages the networks to capture important details. Then we concatenate $P_1$, $P_2$, and $P_3$ along the channel dimension and use them as the input of the networks. As the label of such an input, we use the fluorescence image corresponding to $P_1$. Intuitively, we incorporate information at different scales to make predictions for one particular area. Notably, we can flexibly generalize such an input strategy to multiple levels and incorporate information at different scales. Our proposed multi-scale input strategy is illustrated in the left part of Figure 2.
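A minimal NumPy sketch of this multi-scale input construction is given below. Several details are assumptions made for illustration only: the larger and smaller patches are taken to be twice and half the training patch size, re-scaling uses block averaging and nearest-neighbour repetition, and the helper names `crop_center` and `multi_scale_input` are hypothetical. The sketch also assumes a single-channel image, an even patch size, and a center far enough from the image border.

```python
import numpy as np


def crop_center(image, cy, cx, size):
    """Crop a size x size window of a 2D image centered at (cy, cx)."""
    half = size // 2
    return image[cy - half:cy + half, cx - half:cx + half]


def multi_scale_input(image, cy, cx, s):
    """Stack three same-center patches, all re-scaled to s x s, as channels."""
    p1 = crop_center(image, cy, cx, s)            # training-size patch
    large = crop_center(image, cy, cx, 2 * s)     # assumed size: 2s x 2s
    small = crop_center(image, cy, cx, s // 2)    # assumed size: s/2 x s/2
    # Down-scale the large patch by averaging each 2x2 block.
    p2 = large.reshape(s, 2, s, 2).mean(axis=(1, 3))
    # Up-scale the small patch by repeating each pixel 2x2.
    p3 = small.repeat(2, axis=0).repeat(2, axis=1)
    return np.stack([p1, p2, p3], axis=-1)        # (s, s, 3)


image = np.arange(64 * 64, dtype=float).reshape(64, 64)
x = multi_scale_input(image, cy=32, cx=32, s=16)
print(x.shape)  # (16, 16, 3)
```

The label for this input would be the fluorescence image at the location of the first channel only; the other two channels supply context and detail.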

Figure 3: Visualization of prediction results for cell nuclei, which are shown in blue. The first column is randomly cropped test microscopy images from the datasets in Table 1. The second column is the true fluorescence images for cell nuclei. The third and fourth columns are predicted fluorescence images produced by the baseline and our model, respectively.

4 Experimental Studies

We use both quantitative and qualitative evaluations to demonstrate the effectiveness of our proposed model. The dataset used for evaluation and the experimental settings are presented in Sections 4.1 and 4.2. We compare our experimental results with the existing approach (Christiansen et al., 2018) in Section 4.3. Finally, we provide an ablation analysis in Section 4.4.

4.1 Dataset

We use the dataset from the existing work (Christiansen et al., 2018). The dataset contains 2D high-resolution microscopy images collected under five conditions by three laboratories. Note that each set of such 2D microscopy images is originally a z-stack of transmitted-light images collected from one 3D biological sample (Christiansen et al., 2018). Specifically, the z-stack 2D images are collected from several planes at equidistant intervals along the z axis of a 3D sample. Christiansen et al. (2018) collected 13 2D images from each sample. Thus, all 13 2D images from the same set share the same fluorescence image for each fluorescent label. Different laboratories obtained the microscopy images under different conditions using different methods. Two imaging modalities, namely confocal and wide field, are used during imaging. In addition, three different types of cells are collected by different laboratories, including human motor neurons from induced pluripotent stem cells (iPSCs), primary rat cortical cultures, and a human breast cancer line. Detailed information on this dataset is given in Table 1.

| Block | Layers | Spatial Sizes | Channels |
| --- | --- | --- | --- |
| Multi-Scale Input | Input | Multi-Scaling Sizes | |
| | Multi-Scale Preprocessing | 128x128 | |
| | 1x1 Convolution | 128x128 | 32 |
| Encoder | DB(2 layers) + GDT | 64x64 | 64 |
| | DB(4 layers) + GDT | 32x32 | 128 |
| | DB(8 layers) + GDT | 16x16 | 256 |
| Bottom Block | DB(8 layers) + GST | 16x16 | 384 |
| Decoder | GUT + DB(4 layers) | 32x32 | 288 |
| | GUT + DB(2 layers) | 64x64 | 165 |
| | GUT + DB(1 layers) | 128x128 | 90 |
| Output | 1x1 Convolution | 128x128 | |

Table 2: Detailed architecture of the proposed model used in our experiments. The number of output channels depends on the number of classes. DB denotes a dense block.

| Model | Cell Nuclei (Condition A) | Cell Nuclei (Condition B) | Cell Nuclei (Condition C) | Cell Nuclei (Condition D) | Cell Viability | Cell Type |
| --- | --- | --- | --- | --- | --- | --- |
| Ours | 0.948 ± 0.0027 | 0.896 ± 0.0019 | 0.944 ± 0.0033 | 0.915 ± 0.0031 | 0.859 ± 0.0022 | 0.860 ± 0.0026 |

Table 3: Comparisons of Pearson’s correlations on three tasks. For the purpose of fair comparisons, we calculate the Pearson’s correlations for the baseline and our model on the same randomly sampled pixels. Each time we randomly sample one million pixels and calculate the Pearson’s correlations. The results are obtained by repeating the calculations 30 times, and we report the average and standard deviation.

| Model | Condition A | Condition B | Condition C | Condition D |
| --- | --- | --- | --- | --- |
| Baseline | 0.928 | 0.871 | 0.920 | 0.902 |
| Multi-scale U-Nets | 0.937 | 0.882 | 0.925 | 0.893 |
| Multi-scale U-Nets with DBs | 0.941 | 0.887 | 0.935 | 0.902 |
| Multi-scale GT U-Nets with DBs | 0.948 | 0.896 | 0.944 | 0.915 |

Table 4: Ablation analysis on prediction of cell nuclei by comparing Pearson’s correlations between different models. DB denotes dense block. All models are trained across all training samples and evaluated on one specific task. Details of the models are provided in Section 3.

4.2 Experimental Setup

The architecture of our model is shown in Table 2. It shows the changes of feature maps through the information flow in our networks. The growth rate of our dense blocks is set to 16. We employ three GDT layers with dense blocks in our encoder to perform down-sampling and extract high-level features. Correspondingly, there are three GUT layers with dense blocks to recover the spatial sizes. For the bottom block connecting the encoder and the decoder, we employ one GST layer and one dense block. Note that the depths of different dense blocks are different.

Training examples are obtained by randomly cropping from the raw images. Since we employ the multi-scale input strategy, we crop images at three different scales. The network predicts fluorescence maps with sizes equal to 128×128, as listed in Table 2. We train our proposed model across all target-related training examples in a multi-task learning manner. The number of output fluorescence maps equals the number of target fluorescent labels. In addition, for each pixel in the predicted maps, the network outputs a probability distribution over 256 pixel values, so the number of classes in Table 2 is 256. Cross-entropy loss is employed for network training. Note that there are at most three fluorescent labels available for a given input. The loss is calculated by only considering target labels while irrelevant labels are ignored. During training, we employ dropout with a rate of 0.5 in our dense blocks to avoid over-fitting. To optimize the model, we employ the Adam optimizer (Kingma and Ba, 2014) with a batch size of 4. During the prediction stage, test patches are cropped in a sliding-window fashion. We extract patches from test images with the same sizes as those in training by sliding a window with a constant step size. The step size is set to 64 in our experiments. Then we build predictions for the original test images based on predictions of small patches.
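The sliding-window prediction stage can be sketched as follows. The text does not specify how predictions of overlapping patches are combined, so averaging them is an assumption made for this example, as are the function name `sliding_window_predict`, the handling of the image border (the last window is shifted inward), and the use of a single-channel image at least as large as one patch. `predict_patch` is a placeholder for a trained model.

```python
import numpy as np


def sliding_window_predict(image, predict_patch, patch=128, step=64):
    """Tile a 2D image with overlapping patches and average the overlaps."""
    H, W = image.shape
    total = np.zeros((H, W), dtype=float)
    count = np.zeros((H, W), dtype=float)
    # Window origins; the last origin is shifted so the border is covered.
    ys = sorted(set(list(range(0, H - patch + 1, step)) + [H - patch]))
    xs = sorted(set(list(range(0, W - patch + 1, step)) + [W - patch]))
    for y in ys:
        for x in xs:
            window = image[y:y + patch, x:x + patch]
            total[y:y + patch, x:x + patch] += predict_patch(window)
            count[y:y + patch, x:x + patch] += 1.0
    return total / count  # assumed merge rule: mean over overlapping windows


# An identity "model" should reproduce the input image.
rng = np.random.default_rng(0)
test_image = rng.random((200, 300))
restored = sliding_window_predict(test_image, lambda p: p)
print(np.allclose(restored, test_image))  # True
```

With a patch size of 128 and a step of 64, interior pixels are covered by several windows, which is why some merge rule is needed.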

4.3 Comparison with the Baseline

We compare our approach with the existing model (Christiansen et al., 2018) as it achieves the state-of-the-art performance on the dataset we are using. To demonstrate the effectiveness of our proposed approach, we conduct comparisons with the baseline method for three different tasks:

Prediction of Cell Nuclei: Given an image, the task is to predict the nuclei of live cells. The nuclei of live cells are labeled using DAPI on both confocal and wide field modalities. Examples created under conditions A, B, C, and D have fluorescent labels to investigate the cell nuclei.

Prediction of Cell Viability: Given an image, this task predicts the dead cells with cell nuclei as visual background. Dead cells on images are labeled with propidium iodide (PI) on the confocal modality. These images are obtained under condition C.

Prediction of Cell Type: Given an image, this task predicts the neurons with cell nuclei as visual background. There may exist two other types of cells in the image, such as astrocytes and immature dividing cells. Neurons on images are labeled using TuJ1 under condition A.

We first compare our approach with the baseline method quantitatively, using Pearson’s correlation values calculated for each task. Specifically, one million pixels are randomly sampled from all the test images in a task, and we collect the predicted values for these pixels. These predicted results can be represented as a one million dimensional vector. Similarly, we can obtain another one million dimensional vector from the ground truth of these pixels. Then we calculate the Pearson’s correlation between these two vectors, which can indicate the similarity between them. In particular, higher Pearson’s correlation values imply that the predicted results are closer to the ground truth. The results are reported in Table 3. Note that for both our method and the baseline approach, we repeat the calculations 30 times and report the average and standard deviation. We can observe that the proposed model outperforms the baseline model significantly on all of the three tasks. These results indicate that the proposed model can better capture the relationships between transmitted light images and the corresponding fluorescent labels.
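The evaluation protocol above can be sketched with NumPy. The function name `sampled_pearson` and the synthetic data are assumptions made for illustration; sampling with replacement and the use of the population standard deviation are also assumptions, since the text does not specify them. Passing the same seed for two models evaluates them on the same sampled pixels.

```python
import numpy as np


def sampled_pearson(pred, truth, n_pixels, n_repeats, seed=0):
    """Mean and standard deviation of Pearson's correlation over random pixel samples."""
    rng = np.random.default_rng(seed)
    pred, truth = pred.ravel(), truth.ravel()
    scores = []
    for _ in range(n_repeats):
        # Assumed sampling scheme: pixel indices drawn with replacement.
        idx = rng.integers(0, pred.size, size=n_pixels)
        scores.append(np.corrcoef(pred[idx], truth[idx])[0, 1])
    return float(np.mean(scores)), float(np.std(scores))


# Synthetic stand-ins for a ground-truth image and a noisy prediction.
rng = np.random.default_rng(1)
truth = rng.random((256, 256))
pred = truth + 0.1 * rng.normal(size=truth.shape)
mean_r, std_r = sampled_pearson(pred, truth, n_pixels=10_000, n_repeats=30)
print(f"{mean_r:.3f} +/- {std_r:.4f}")
```

In the paper's setting, `n_pixels` would be one million and `n_repeats` would be 30.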

In addition, we compare the prediction results qualitatively. We present the prediction results for the cell nuclei task in Figure 3. Based on visual comparisons of the areas in white boxes, we observe that our model makes more accurate predictions for many small regions. These results demonstrate the capability of our model to capture detailed information. Furthermore, confusion matrices are reported for these images to allow visualization of true versus predicted pixel values in each bin. The pixel values are normalized and divided into 10 bins, with each bin containing the pixels whose values fall in the corresponding range. The overall accuracies (OAs) in the confusion matrices indicate how many pixels are classified into the same bin as the ground truth. The corresponding confusion matrices for Figure 3 are provided in the supplementary material. The results show that our model predicts more accurate pixel values than the baseline model. Similarly, we report the prediction results and the corresponding confusion matrices for the dead cell task in the supplementary material. The white boxes show that the baseline misclassifies dead cells as other labels, while our model makes correct predictions. We also show the results of the cell type task in the supplementary material, where our model distinguishes neurons from other types of cells more accurately. Finally, we report the prediction accuracies for different bins and the overall accuracies, which are provided in the supplementary material. For all three tasks, we obtain more accurate predictions. Overall, both qualitative and quantitative results indicate that our model performs significantly better than the baseline approach.
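The binned confusion matrix and overall accuracy can be illustrated with a short sketch. The text here does not state the normalization range or the bin boundaries, so the example assumes that pixel values lie in [0, 1] and that the 10 bins have equal width; the function name `binned_confusion` is likewise hypothetical.

```python
import numpy as np

def binned_confusion(pred, true, n_bins=10):
    """Confusion matrix (rows: true bin, columns: predicted bin) and OA.

    Assumptions (not stated in the text): values are normalized to
    [0, 1] and the bins are equal-width.
    """
    pred = np.clip(np.ravel(pred).astype(np.float64), 0.0, 1.0)
    true = np.clip(np.ravel(true).astype(np.float64), 0.0, 1.0)
    # A value v falls in bin floor(v * n_bins); v == 1.0 joins the last bin.
    pred_bin = np.minimum((pred * n_bins).astype(np.int64), n_bins - 1)
    true_bin = np.minimum((true * n_bins).astype(np.int64), n_bins - 1)

    cm = np.zeros((n_bins, n_bins), dtype=np.int64)
    # Unbuffered add so repeated (true, pred) pairs are all counted.
    np.add.at(cm, (true_bin, pred_bin), 1)
    # Overall accuracy: fraction of pixels whose predicted bin is the true bin.
    oa = np.trace(cm) / cm.sum()
    return cm, float(oa)
```

Under these assumptions, the diagonal of the returned matrix counts pixels predicted into the correct bin, and dividing each row by its sum gives the per-bin accuracies.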

4.4 Ablation Analysis

We conduct an ablation analysis on the cell nuclei prediction task to show the effectiveness of each proposed module. All models are trained under the same conditions and compared under fair settings. As shown in Table 4, when employing the multi-scale input strategy, even the classic U-Net achieves better results than the baseline approach. By adopting dense blocks, the performance is further improved. The best performance is achieved by incorporating all of our proposed modules. These results indicate that each of our proposed modules contributes to improved predictive performance.

5 Conclusions

Visualizing cellular structures is important for understanding cellular functions. Fluorescence microscopy is a popular technique but has key limitations. Here, we develop a novel deep learning model to directly predict labeled fluorescence images from unlabeled images. To fuse global information efficiently and effectively, we propose a novel global transformer layer and build a U-Net-like network by incorporating our proposed global transformer layer and dense blocks. A novel multi-scale input strategy is also proposed to combine both global and local features for more accurate predictions. Experimental results on various fluorescence image prediction tasks indicate that our model outperforms the baseline model significantly. In addition, the ablation study shows that all of our proposed modules are effective in improving performance.


This work was supported by the National Science Foundation [IIS-1633359, IIS-1615035, and DBI-1641223].

