In modern biology and life science, augmented microscopy attempts to improve the quality of microscope images to extract more information, for example by introducing fluorescent labels, increasing the signal-to-noise ratio (SNR), and performing super-resolution. Previous advances in microscopy have allowed the imaging of biological processes with higher and higher quality [1, 2, 3, 4, 5, 6, 7, 8]. However, these advanced augmented-microscopy techniques usually lead to high costs in terms of the microscopy hardware and experimental conditions, resulting in many practical limitations. In addition, specific concerns are raised when recording processes of live cells, tissues, and organisms; namely, the imaging process should neither significantly affect the biological processes nor substantially harm the sample’s health. For example, phototoxicity is a major problem in live fluorescence imaging [9, 10]. With these restrictions, high-quality microscope images are hard, expensive, and slow to obtain. While some microscope images, like transmitted-light images, can be collected at relatively low cost, they are not sufficient to provide accurate statistics and correct insights without augmentation. As a result, modern biologists and life scientists usually have to deal with the trade-offs between the quality of microscope images and the restrictions in the process of collecting them [12, 13, 14].
In recent years, the development of deep learning has pushed the boundaries of such trade-offs by enabling fast and inexpensive microscopy augmentation using computational approaches [16, 17, 18]. The augmented microscopy task is formulated as a biological image transformation problem in deep learning. Specifically, models composed of multi-layer artificial neural networks take low-quality microscope images as inputs, and transform them into high-quality ones through computational processes. Deep learning has led to success in various augmented microscopy applications, such as prediction of fluorescence signals from transmitted-light images [19, 20, 21, 22, 23, 24, 25, 8], virtual refocusing of fluorescence images, content-aware image restoration, fluorescence image super-resolution [28, 29], and axial under-sampling mitigation.
Among these successful applications of deep learning, U-Net based neural networks have been the mainstream models. The U-Net was first proposed for 2D electron microscopy image segmentation and later extended to other biological image transformation tasks, including cell detection and quantification. In the field of augmented microscopy, most deep learning models directly apply U-Net based neural networks by only changing the loss functions for training [21, 25, 27, 22, 23]. In general, the U-Net is an encoder-decoder framework of neural network architectures for image transformation. It consists of a down-sampling path to capture multi-scale and multi-resolution contextual information, and a corresponding up-sampling path to enable precise voxel-wise predictions. Recent studies have enhanced the U-Net by incorporating residual blocks [33, 34, 35, 36] and supporting 3D image transformation.
Despite the success of these U-Net based neural networks for augmented microscopy, we observe three intrinsic limitations caused by the fact that they implement the encoder-decoder path by stacking local operators like convolutions and transposed convolutions with small kernels. First, in local operators, the size of the receptive field (RF) of an output unit, determined by the kernel size, is usually small and does not aggregate information from the entire input (Fig. 1a). While stacking these local operators increases the size of the RF for the final output units, the size of the RF is still fixed given a specific neural network architecture. Each output unit follows a local path through the network and only has access to the information within its RF on the input image. Given a large input image, the network has to go deeper with more down-sampling and up-sampling operators to ensure each output unit receives information from the entire input image. Such an approach is not efficient in terms of the number of training parameters and computational expense. In addition, the local path tends to focus on local dependencies among units and fails to capture long-range dependencies [39, 40], which are crucial for accuracy and consistency in biological image transformation. Second, the fixed-size RF limits the model’s inference performance as well. The U-Net is usually trained with small patches of paired images (Fig. 1b), where cutting large images into small patches increases the amount of training data and stabilizes the training process by allowing large batch sizes. As the U-Net produces an output of the same spatial size as the input, it is common to feed in the entire image or patches of much larger spatial sizes than the training patches during the prediction procedure (Fig. 1b, Supplementary Fig. 1), in order to speed up the inference [31, 25, 27, 32]. However, with the fixed-size RF, the model fails to take advantage of the knowledge from the entire input if the spatial size of the input is larger than that of the RF, preventing a potential boost in inference performance. Third, all the local operators work with kernels whose weights are fixed after the training process, which means the importance of an input unit to an output unit is fixed and not input-dependent during the inference stage (Fig. 1a). This property is helpful in detecting and extracting local patterns. However, the model should be able to selectively use or ignore extracted information when transforming different input images, raising the need for operators that support input-dependent weights.
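The first limitation can be made concrete with the standard recurrence for the receptive field of stacked local operators. The following is a minimal sketch rather than the architecture of any model discussed here: the layer list `level` (two 3-wide convolutions followed by a 2-wide, stride-2 down-sampling step) is a hypothetical encoder level chosen only for illustration.

```python
def receptive_field(layers):
    """Receptive field, along one axis and in input units, of a stack of
    local operators given as (kernel_size, stride) pairs."""
    rf, jump = 1, 1  # jump: spacing between adjacent units, in input units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical encoder level: two 3-wide convolutions, then stride-2 down-sampling.
level = [(3, 1), (3, 1), (2, 2)]

for depth in (1, 2, 3, 4):
    print(depth, receptive_field(level * depth))
```

Under these assumptions the RF along one axis is 6, 16, 36, and 76 input units for depths 1 to 4. It grows with depth but stays a fixed property of the architecture, so an input that is wider than the RF is never seen in full by any single output unit.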
In this work, we argue that all three limitations above can be addressed by introducing the attention operator  into U-Net based neural networks. In order to demonstrate this point, we compare the attention operator with a typical local operator, i.e., convolution (Fig. 1a). There are essential differences between the convolution and the attention operator. On one hand, the convolution has a local RF determined by its kernel, where each output unit receives information from a local area of input units. Meanwhile, note that the kernel weights are fixed after training. In other words, the weights do not depend on inputs during the inference. On the other hand, the attention operator computes each output unit as a weighted sum of all input units, where the weights are obtained through interactions between different representations of the inputs (Methods). As a result, the attention operator is a non-local operator with a global receptive field, which can potentially overcome the first two limitations. In addition, the weights in the attention operator are input-dependent, addressing the third limitation.
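The contrast between the two operators can be illustrated with a small NumPy experiment. This is a sketch under stated assumptions, not the operator defined in Methods: the 1D setting, the projection matrices `wq`, `wk`, `wv`, and the smoothing `kernel` are all hypothetical choices made for illustration. One far-away input unit is perturbed, and we check whether output unit 0 reacts.

```python
import numpy as np


def conv1d_same(x, kernel):
    """Local operator: output i depends only on len(kernel) neighbouring inputs."""
    pad = len(kernel) // 2
    xp = np.pad(x, pad)  # zero padding keeps the output length equal to the input
    return np.array([xp[i:i + len(kernel)] @ kernel for i in range(len(x))])


def self_attention(x, wq, wk, wv):
    """Non-local operator on x of shape (n_units, n_channels)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ v, weights


n, d = 16, 4
x = np.linspace(0.0, 1.0, n * d).reshape(n, d)
wq = wk = 0.5 * np.eye(d)  # hypothetical projection weights
wv = np.eye(d)
kernel = np.array([0.25, 0.5, 0.25])

x_far = x.copy()
x_far[-1] += 1.0  # perturb only the unit farthest from unit 0

conv_changed = not np.isclose(conv1d_same(x[:, 0], kernel)[0],
                              conv1d_same(x_far[:, 0], kernel)[0])
out, w = self_attention(x, wq, wk, wv)
out_far, w_far = self_attention(x_far, wq, wk, wv)
attn_changed = not np.allclose(out[0], out_far[0])
weights_changed = not np.allclose(w[0], w_far[0])
print(conv_changed, attn_changed, weights_changed)
```

The convolution output at unit 0 is unaffected because the perturbed unit lies outside its receptive field, while the attention output at unit 0 changes, and so do the weights it uses, which is what global and input-dependent mean in this context.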
Based on this insight, we build a family of non-local operators upon the attention operator, namely global voxel transformer operators (GVTOs) (Fig. 1c, Methods). GVTOs organically combine local and non-local operators (Supplementary Fig. 11) and can capture both local and long-range dependencies. In particular, GVTOs extend the attention operator to serve as flexible building blocks in the U-Net framework. Specifically, we develop GVTOs to support not only size-preserving, but also down-sampling and up-sampling tensor processing, which covers all kinds of operators in the U-Net framework. It is worth noting that, while GVTOs are designed for the U-Net framework, they can also be used in other kinds of networks.
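One way an attention operator can change the spatial size of its output is to compute the queries at the target resolution while keys and values stay at the input resolution. The sketch below illustrates only that idea; it is an assumption made for illustration and should not be read as the GVTO definition, which is given in Methods. The helper `resample` is a hypothetical stand-in for whatever local operator produces the queries.

```python
import numpy as np


def attend(query, key, value):
    """Every query position aggregates all key/value positions."""
    scores = query @ key.T / np.sqrt(key.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ value  # one output row per query row


def resample(x, mode):
    """Hypothetical stand-in for the local operator that builds the queries."""
    if mode == "down":  # halve the length by averaging neighbouring pairs
        return x.reshape(-1, 2, x.shape[1]).mean(axis=1)
    if mode == "up":    # double the length by repeating each unit
        return np.repeat(x, 2, axis=0)
    return x            # size-preserving


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))  # 8 units, 4 channels
wq, wk, wv = (rng.normal(size=(4, 4)) for _ in range(3))

shapes = {}
for mode in ("same", "down", "up"):
    q = resample(x, mode) @ wq
    shapes[mode] = attend(q, x @ wk, x @ wv).shape
print(shapes)
```

In this sketch the number of output units equals the number of queries (8, 4, and 16 here), and each output unit still draws on all 8 input units regardless of the mode.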
With GVTOs, we propose global voxel transformer networks (GVTNets) (Fig. 1c, Methods), an advanced deep learning tool for augmented microscopy, in order to address the limitations and improve current U-Net based neural networks. GVTNets follow the same encoder-decoder framework as the U-Net while using GVTOs instead of local operators only. To be concrete, we force GVTNets to connect the down-sampling and up-sampling paths using the size-preserving GVTO at the bottom level, which separates GVTNets from the U-Net. In addition, we allow users to flexibly use more GVTOs to replace local operators in the U-Net framework.
In the following, we (1) demonstrate the power of the basic GVTNets where only one size-preserving GVTO at the bottom level is applied, (2) show the effectiveness of employing more GVTOs in GVTNets and point out how GVTNets improve the inference performance, (3) explore the use of GVTOs in more complex and composite models, and (4) investigate the generalization ability of GVTNets. All the experiments are conducted on publicly available datasets for augmented microscopy [25, 27, 18].
2.1 Global voxel transformer networks training and inference
Global voxel transformer networks (GVTNets) are trained end-to-end under a supervised learning setting through back-propagation (Methods). While the model aims at augmenting microscopy computationally, it still requires a relatively small number of augmented microscopy images to be collected for training. Specifically, the training data are registered pairs of biological images before and after augmentation. Once trained, the model can be used to augment microscope images in silico, without involving any expensive microscopy hardware or technique. Following previous studies, we crop the training images into patches of smaller spatial sizes to train GVTNets. However, during the inference procedure, we feed in the entire image for prediction (Fig. 1b). Note that GVTNets are able to handle inputs of any spatial size, and in particular, tend to perform better given inputs of larger spatial sizes due to their ability to utilize global information from the entire input. The power of GVTNets comes from the use of global voxel transformer operators (GVTOs), which are inherently different from local operators as well as the fully-connected layers in deep learning (Methods).
2.2 Label-free prediction of 3D fluorescence images from transmitted-light microscopy
We first ask whether basic GVTNets achieve improved performance over U-Net based neural networks. A basic GVTNet differs from the U-Net only at the bottom level, by using a size-preserving GVTO instead of convolutions. The replacement is crucial, giving each output unit access to information from the entire input image, regardless of the spatial size. We apply a basic GVTNet on the public dataset from C. Ounkomol et al., where the task is label-free prediction of 3D fluorescence images from transmitted-light microscopy (Fig. 2a).
The dataset is composed of 13 datasets corresponding to 13 different subcellular structures. All the images in the datasets are spatially registered and obtained from a database of images produced by the Allen Institute for Cell Science’s microscopy pipeline. The training and testing splits are provided by C. Ounkomol et al. and available in our published code. For each structure, the training data are 30 spatially registered pairs of 3D transmitted-light images and ground truth fluorescence images. The number of testing images is 18 for the cell membrane, 10 for the differential interference contrast (DIC) nuclear envelope, and 20 for the others.
We use the model proposed by C. Ounkomol et al. as the baseline model, which is the current state-of-the-art model on the 13 datasets. The baseline model is a U-Net based neural network of depth 5, while the basic GVTNet that we use is of depth 4 (Supplementary Fig. 2). As a result, the basic GVTNet has only a fraction of the training parameters of the baseline model. In addition, the computation becomes faster; that is, the GVTNet takes less time than the U-Net to make a prediction for one 3D image.
We quantify the model performance by computing the Pearson correlation coefficient on the testing data (Methods). On all of the 13 datasets, our basic GVTNet consistently outperforms the U-Net baseline. We perform one-tailed paired t-tests and obtain P values smaller than 0.05 for all datasets, showing the improvements are statistically significant (Fig. 2b). The visualization of predictions indicates that the GVTNet captures more details than the U-Net baseline due to its access to more information, and is able to use global information to avoid local inconsistency (Fig. 2c). The quantitative testing results in terms of Pearson correlation coefficients are provided in Supplementary Table 1. Examples of predictions on testing images for all 13 structures can be found in Supplementary Fig. 3. These experimental results indicate the effectiveness of only one size-preserving GVTO and the resulting basic GVTNets.
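For reference, the Pearson correlation coefficient between a predicted image and its ground truth can be computed over all voxels as follows. This is a minimal sketch of the textbook definition; the function name `pearson_r` is ours, and any preprocessing specified in Methods is not reproduced here.

```python
import numpy as np


def pearson_r(pred, target):
    """Pearson correlation coefficient computed over all voxels."""
    p = np.asarray(pred, dtype=np.float64).ravel()
    t = np.asarray(target, dtype=np.float64).ravel()
    p = p - p.mean()  # centre both images
    t = t - t.mean()
    return float((p @ t) / np.sqrt((p @ p) * (t @ t)))


target = np.arange(24, dtype=np.float64).reshape(2, 3, 4)  # toy 3D "image"
print(pearson_r(2.0 * target + 1.0, target))  # linearly related prediction
print(pearson_r(-target, target))             # inverted prediction
```

The coefficient is invariant to linear rescaling of the prediction, so it rewards correct spatial structure rather than matching absolute intensities.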
We note that both GVTNets and the U-Net baselines perform poorly on the datasets corresponding to the Golgi apparatus and desmosomes. According to C. Ounkomol et al., a possible explanation is that the correlations between the input transmitted-light microscope images and the target fluorescence images are weak in these two datasets. As most supervised deep learning models try to capture the correlations between inputs and outputs during training, the inference performance could be poor if the correlations are weak.
2.3 Content-aware 3D image denoising
Next, we explore the potential of GVTNets by applying more GVTOs. Specifically, we apply GVTNets with both size-preserving and up-sampling GVTOs on two independent content-aware 3D image denoising tasks (Fig. 3a); namely, improving the signal-to-noise ratio (SNR) of live-cell imaging of Planaria S. mediterranea and developing Tribolium castaneum embryos.
The datasets, published by M. Weigert et al., contain pairs of 3D low-SNR images and ground truth high-SNR images for training and testing. The training data are provided in the form of 17,005 and 14,725 small cropped patches for the Planaria and Tribolium datasets, respectively, while the testing data are 20 testing images and 6 testing images for the two datasets, respectively. In addition, the testing data come with three image conditions referring to three different SNR levels, leading to three degrees of denoising difficulty (Fig. 3b). Here, the image conditions refer to the laser power and exposure time during image collection. Generally, low laser power and short exposure time lead to low SNR levels. Concretely, in the Planaria dataset, four different laser-power/exposure-time conditions are used: GT (ground truth) and C1–C3, specifically 2.31 mW/30 ms (GT), 0.12 mW/20 ms (C1), 0.12 mW/10 ms (C2), and 0.05 mW/10 ms (C3). Similarly, in the Tribolium dataset, four different laser-power imaging conditions are used: GT and C1–C3, specifically 20 mW (GT), 0.5 mW (C1), 0.2 mW (C2), and 0.1 mW (C3). As a result, each ground truth high-SNR image in the testing dataset has three corresponding low-SNR images.
The baseline models in these experiments are the content-aware image restoration (CARE) networks, which are based on the 3D U-Net. The U-Net based CARE networks achieve the current best performance on these two datasets, serving as a strong baseline. We build a GVTNet by replacing the bottom convolutions and up-sampling operators with corresponding size-preserving and up-sampling GVTOs (Supplementary Fig. 4).
In order to quantify the model performance, we compute two evaluation metrics, i.e., the structural similarity index (SSIM) and the normalized root-mean-square error (NRMSE) (Methods). The models are evaluated under the three SNR levels individually. The visualization results demonstrate that the GVTNet can take advantage of long-range dependencies to recover more details in areas with weak signals than the U-Net (Fig. 3c). The quantitative results also indicate significant and consistent improvements of the GVTNet over the U-Net based CARE under all image conditions on both datasets, revealing the advantages of GVTNets with more GVTOs (Supplementary Fig. 5, Supplementary Table 2). More examples of predictions on testing images can be found in Supplementary Figs. 6 and 7.
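As a rough illustration of the second metric, the sketch below computes an NRMSE in which the root-mean-square error is divided by the intensity range of the ground truth. NRMSE has several conventions, and this particular normalization is an assumption made here for illustration; the exact definition used in the experiments, including any intensity normalization applied beforehand, is the one given in Methods.

```python
import numpy as np


def nrmse(pred, target):
    """RMSE divided by the ground-truth intensity range (one common convention)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    rmse = np.sqrt(np.mean((pred - target) ** 2))
    return float(rmse / (target.max() - target.min()))


target = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print(nrmse(target + 0.5, target))  # constant offset of 0.5 over a range of 4
print(nrmse(target, target))        # perfect prediction
```

Lower values are better for NRMSE, whereas higher values are better for SSIM, so the two metrics are read in opposite directions.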
In order to provide insights on how GVTNets improve the inference performance by utilizing global information, we conduct extra experiments by varying the spatial sizes of input images during the inference process. To be specific, as both GVTNets and the U-Net are able to handle inputs of any spatial size, we can either feed the entire image directly into the model or crop the image into small prediction patches and reconstruct the entire augmented image after prediction (Supplementary Fig. 1). Theoretically, since the size of the receptive field (RF) in the U-Net is fixed and bounded, the prediction results will be the same as long as the size of prediction patches is larger than that of the RF. On the other hand, the RF in GVTNets always covers the entire input image, allowing the use of more knowledge for better inference performance given large prediction patches. In order to verify this insight, we train the GVTNet and CARE on the Planaria dataset and compare prediction results in terms of SSIM when using prediction patches of increasing sizes, up to the entire image size. The results are summarized in Fig. 3d. The prediction results of the U-Net remain the same when increasing prediction patch sizes, forming a horizontal line (Supplementary Fig. 1). In contrast, significant improvements can be observed for the GVTNet. These results show that GVTNets are able to take advantage of larger prediction patches, which leads to a performance boost.
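The theoretical argument can be reproduced on a toy 1D signal. This is a sketch under stated assumptions: `local_op` (a 3-tap moving average) and `global_op` (a softmax-weighted sum over all inputs) are hypothetical stand-ins for a U-Net and a GVTNet, and `predict_in_patches` is an illustrative tiling routine, not the procedure of Supplementary Fig. 1.

```python
import numpy as np


def local_op(x):
    """3-tap moving average with zero padding; receptive field of 3 units."""
    xp = np.pad(x, 1)
    return (xp[:-2] + xp[1:-1] + xp[2:]) / 3.0


def global_op(x):
    """Toy global operator: input-dependent weighted sum over all inputs."""
    scores = np.outer(x, x)
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ x


def predict_in_patches(op, x, patch, margin):
    """Apply op to overlapping patches and keep only each patch's core."""
    out = np.empty_like(x)
    for start in range(0, len(x), patch):
        lo = max(0, start - margin)
        hi = min(len(x), start + patch + margin)
        y = op(x[lo:hi])
        out[start:start + patch] = y[start - lo:start - lo + patch]
    return out


x = np.linspace(0.0, 1.0, 32)

local_same = np.allclose(predict_in_patches(local_op, x, 8, 1), local_op(x))
global_same = np.allclose(predict_in_patches(global_op, x, 8, 1), global_op(x))
global_full = np.allclose(predict_in_patches(global_op, x, 32, 1), global_op(x))
print(local_same, global_same, global_full)
```

With a margin that covers the RF, the local operator gives identical results whether it sees patches or the whole signal, so larger patches cannot help it. The global operator's output depends on how much of the signal each patch contains and matches the whole-signal prediction only when the patch is the entire input.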
2.4 Content-aware 3D to 2D image projection
While we use GVTOs to build GVTNets, GVTOs are a family of operators that support any size-preserving, down-sampling and up-sampling tensor processing and can be used outside GVTNets. Therefore, we further examine the proposed GVTOs on more complicated and composite models. In particular, we apply GVTOs and GVTNets on the 3D Drosophila melanogaster Flywing surface projection task [43, 44] (Fig. 4a).
The model for this task is supposed to take a noisy 3D image as the input and project it into a denoised 2D surface image. The typical deep learning model involves two parts; namely, a network for 3D to 2D surface projection, followed by a network for 2D image denoising. For example, the current best model, CARE, uses a task-specific convolutional neural network (CNN) for projection and a 2D U-Net for denoising. The task-specific CNN is also composed of convolutions, down-sampling and up-sampling operators. We design our model based on CARE by applying GVTOs in the first CNN and replacing the 2D U-Net with a 2D GVTNet (Supplementary Fig. 8). The resulting composite model employs size-preserving and up-sampling GVTOs in both parts.
We compare our model with CARE on the Flywing dataset  in terms of SSIM and NRMSE. The dataset contains 16,891 pairs of small 3D noisy image patches and ground truth 2D surface image patches for training, and 26 complete images for testing.
The quantitative results indicate that the composite model augmented by GVTOs achieves significant improvements (Fig. 4c). We provide the detailed quantitative results in Supplementary Table 3. The visualization results show that the GVTOs have a stronger capability to recognize non-noisy objects at regions of lower SNR within an image, where the original model tends to fail (Fig. 4b). This is because the global information is of great importance to the projection tasks, especially along the Z-axis, where the projection happens. Specifically, for each (x, y) location in the 3D image, only one voxel along the Z-axis will be projected to the 2D surface. This restriction is only available when the model has the global information along the Z-axis. Therefore, plugging GVTOs into the projection process can effectively improve the overall performance. More examples of predictions on testing images can be found in Supplementary Fig. 9.
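The constraint that only one voxel along the Z-axis is projected for each (x, y) location can be written down directly. The sketch below is illustrative only: the per-voxel `scores` are a hypothetical input (here simply the intensities), whereas a trained projection network would have to predict them from the noisy volume.

```python
import numpy as np


def project_surface(volume, scores):
    """For each (y, x), keep the voxel with the highest score along Z.
    volume and scores both have shape (Z, Y, X)."""
    z_index = scores.argmax(axis=0)  # (Y, X): chosen depth per location
    surface = np.take_along_axis(volume, z_index[None], axis=0)[0]
    return surface, z_index


volume = np.zeros((3, 2, 2))
volume[1, 0, 0] = 5.0  # surface passes through z = 1 at (y=0, x=0)
volume[2, 1, 1] = 7.0  # and through z = 2 at (y=1, x=1)

surface, z_index = project_surface(volume, scores=volume)
print(surface)
print(z_index)
```

Choosing the maximum along Z requires comparing every voxel in the column, which is why an operator with access to the full Z extent is well suited to this step.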
2.5 Transfer learning ability of GVTNets
We have shown the effectiveness of GVTNets for augmented microscopy applications under a supervised learning setting. In the following, we further investigate the generalization ability of GVTNets under a simple transfer learning setting, where we train GVTNets on one dataset and perform testing on other datasets for the same task. In this case, the inconsistencies between the training and testing data often lead to the collapse of models based on local operators, such as the U-Net. One reasonable explanation is that the weights of kernels in local operators are fixed after training and independent of the inputs. This limits the ability to deal with the different data distributions in training and inference procedures.
As GVTOs achieve input-dependent weights, we hypothesize that GVTNets are more robust to such inconsistencies and have a better generalization ability. We conduct experiments to verify the hypothesis using the three datasets from M. Weigert et al.; namely, the Planaria, Tribolium and Flywing datasets. Note that all these datasets originally have 3D high-SNR ground truth images for the 3D denoising task. By applying PreMosa on the 3D ground truth images, we can obtain 2D ground truth images for the 3D to 2D projection task. Therefore, these datasets can be used in either task for both training and testing. The baseline models are still the U-Net based CARE networks in these experiments, and we use the same GVTNets as introduced above for comparison (Supplementary Fig. 4, Supplementary Fig. 8). In general, we train GVTNet and CARE on one of the three datasets, and compare their testing performance on the remaining two datasets, resulting in three sets of experiments. To be concrete, the first two experiments, where either the Planaria or Tribolium dataset is used for training, perform the 3D denoising task. The third experiment, where models are trained on the Flywing dataset, performs the 3D to 2D projection task.
The comparison results in terms of SSIM and NRMSE are shown in Fig. 5. The detailed quantitative results can be found in Supplementary Table 4. GVTNet obtains a more promising transfer learning performance than CARE, indicating a better generalization ability.
We have introduced GVTNets built on GVTOs, an advanced deep learning tool for augmented microscopy. Compared to the U-Net, GVTNets are more powerful models that are capable of capturing long-range dependencies and selectively aggregating global information for inputs of any spatial size. With GVTNets, various augmented microscopy tasks can be performed with significantly improved accuracy, such as predicting the fluorescence images of subcellular structures directly from transmitted-light images without using fluorescent labels, conducting content-aware image denoising, and projecting a 3D microscope image to a 2D surface for analysis. We have demonstrated the superiority of GVTNets and GVTOs on several publicly available datasets for augmented microscopy [25, 27, 18]. In particular, we have provided examples where GVTNets achieve better inference performance with inputs of larger spatial sizes, indicating the ability of utilizing global information. In addition, besides the supervised learning setting, GVTNets outperform the U-Net under a simple transfer learning setting, showing better generalization ability due to input-dependent weights.
We anticipate that our work will have an impact on biological image analysis in general and augmented microscopy specifically. Image analysis plays an indispensable role in biological research, where machine learning methods and tools have been widely used and have dramatically advanced biological research and discoveries. In particular, the past decade has witnessed revolutionary changes in machine learning with the rapid developments of deep learning. Recent studies [25, 24, 22, 28, 27, 53] have shown that deep learning allows biological research to transcend the limits imposed by imaging hardware, enabling discoveries at scales and resolutions that were previously impossible. We observe that most of these biological image analysis tasks can be formulated as biological image transformation problems. In such tasks, the U-Net [31, 32] is the most popular and successful deep model, achieving state-of-the-art performance [25, 24, 28, 27, 18]. Our proposed GVTNets can be directly used to replace the U-Net and boost the performance by addressing intrinsic limitations of the U-Net. Specifically, our experimental results have shown the superiority of GVTNets in various augmented microscopy tasks. These results are expected to have an immediate and strong impact on basic biology by enabling discoveries, observations, and measurements that were previously unobtainable. In addition, since the limitations of the U-Net are general and not task-specific, we anticipate that GVTNets will improve the U-Net in other biological image transformation tasks and potentially benefit a wider range of biological research based on image analysis. Last but not least, from the practical perspective, the deployment of solutions is as important as the development of new ones. To make GVTNets easy to use in various biological image transformation tasks, we publish our code as an open-source tool with detailed instructions (Supplementary Note 2). Our code may greatly benefit both biology and computer science research communities.
In the literature, there exist many other studies that attempt to improve the U-Net in various aspects [47, 48, 49, 50, 51]. Among them, some studies [47, 48, 51] explore a similar direction to our work, which is to allow the U-Net to capture long-range dependencies or global context information. They can be mainly divided into two categories. One is to add modules composed of dilated convolutions, like Zhang et al. and CE-NET. Dilated convolutions can expand the receptive field of convolutions to capture longer-range dependencies. However, they are still local operators in essence, sharing similar limitations. For example, they cannot collect global information when inputs become larger than the receptive field. The other category is to apply global pooling to extract global information and use it to facilitate local operators, such as RSGU-Net. However, important spatial information is lost during global pooling, which potentially limits the performance. Different from these two categories, we extend the attention operator to achieve the goal. To demonstrate the advantages of our method over previous methods, we compare GVTNets with representative models, i.e., RSGU-Net and CE-NET, on content-aware 3D image denoising tasks, as reported in Supplementary Table 5. Our method outperforms both methods significantly, with similar computational cost.
Other studies [49, 50] improve the U-Net in orthogonal directions. Oktay et al.  propose to add the gate mechanism to the skip connections, filtering out irrelevant information. It is worth noting that the gate mechanism and the attention mechanism are essentially different in terms of computation, functionality, and flexibility. The gate mechanism performs spatially element-wise filtering so that there is no explicit communication between spatial locations. On the contrary, the attention mechanism aggregates information from all spatial locations (Methods). Moreover, the gate mechanism can only be used for size-preserving tensor processing, while the attention mechanism can be extended for down-sampling and up-sampling tensor processing by our GVTOs. Zhou et al.  propose a nested U-Net architecture by adding dense skip connections. The nested architecture facilitates the training and yields better inference performance.
In terms of augmenting images with deep learning methods, the generative adversarial network (GAN) is a promising choice [53, 22, 18, 54]. We point out that GAN based methods are orthogonal to our GVTNets in the sense that they can be used together. Note that a GAN is composed of a generator and a discriminator. In GAN based image augmentation models, the generator is typically a U-Net, which we can improve with our GVTNets. We conduct experiments on content-aware 3D image denoising tasks. The results can be found in Supplementary Table 6. As indicated by the results, under the GAN framework, our GVTNets can improve the U-Net as well.
The key components of GVTNets are GVTOs. One concern about GVTOs is their efficiency. Given inputs of the same size, GVTOs usually require more time and take up more memory for computation than local operators like convolutions. This is due to the use of the self-attention operator. However, the high cost of GVTOs does not necessarily make GVTNets more expensive than the U-Net. By taking advantage of the more powerful GVTOs, the overall network architecture can be simpler, improving the efficiency. For example, in the label-free fluorescence image prediction experiments, we have shown that a GVTNet can outperform a U-Net based neural network with only a fraction of the training parameters and faster computation.
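A back-of-the-envelope count shows where the extra cost comes from and why it stays manageable at coarse resolutions. The feature-map sizes below are hypothetical and chosen only for illustration; the count covers weight applications per channel and ignores projections and constant factors.

```python
def conv_weight_applications(n_units, kernel_units):
    """Each output unit combines kernel_units inputs."""
    return n_units * kernel_units


def attention_weight_applications(n_query, n_key):
    """Each query unit is weighted against every key unit."""
    return n_query * n_key


# Hypothetical bottom-level 3D feature map of 4 x 32 x 32 units.
n_bottom = 4 * 32 * 32
conv_cost = conv_weight_applications(n_bottom, 3 * 3 * 3)
attn_cost = attention_weight_applications(n_bottom, n_bottom)
print(n_bottom, conv_cost, attn_cost, attn_cost / conv_cost)

# The same operator at a hypothetical full resolution of 32 x 256 x 256 units.
n_full = 32 * 256 * 256
print(attention_weight_applications(n_full, n_full))
```

Under these assumptions attention at the bottom level costs roughly 150 times more than a 3×3×3 convolution on the same map, yet it is many orders of magnitude cheaper than attention at full resolution. This is consistent with using a GVTO where the feature map is smallest.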
Another limitation of GVTNets is the shared disadvantage of current deep learning models [25, 27]. Models trained on one biological image transformation dataset can hardly be used for another dataset. Therefore, high-quality training data must be collected for each task, which is expensive and time-consuming. GVTNets have shown promising improvements under the simplest transfer learning setting without fine-tuning. We anticipate that the combination of GVTNets and recent advances of transfer learning  and meta learning  can greatly alleviate this limitation.
-  Gustafsson, M. G. Surpassing the lateral resolution limit by a factor of two using structured illumination microscopy. Journal of microscopy 198, 82–87 (2000).
-  Huisken, J., Swoger, J., Del Bene, F., Wittbrodt, J. & Stelzer, E. H. Optical sectioning deep inside live embryos by selective plane illumination microscopy. Science 305, 1007–1009 (2004).
-  Betzig, E. et al. Imaging intracellular fluorescent proteins at nanometer resolution. Science 313, 1642–1645 (2006).
-  Rust, M. J., Bates, M. & Zhuang, X. Sub-diffraction-limit imaging by stochastic optical reconstruction microscopy (storm). Nature methods 3, 793 (2006).
-  Heintzmann, R. & Gustafsson, M. G. Subdiffraction resolution in continuous samples. Nature photonics 3, 362 (2009).
-  Tomer, R., Khairy, K., Amat, F. & Keller, P. J. Quantitative high-speed imaging of entire developing embryos with simultaneous multiview light-sheet microscopy. Nature methods 9, 755 (2012).
-  Chen, B.-C. et al. Lattice light-sheet microscopy: imaging molecules to embryos at high spatiotemporal resolution. Science 346, 1257998 (2014).
-  Belthangady, C. & Royer, L. A. Applications, promises, and pitfalls of deep learning for fluorescence image reconstruction. Nature methods 1–11 (2019).
-  Laissue, P. P., Alghamdi, R. A., Tomancak, P., Reynaud, E. G. & Shroff, H. Assessing phototoxicity in live fluorescence imaging. Nature methods 14, 657 (2017).
-  Icha, J., Weber, M., Waters, J. C. & Norden, C. Phototoxicity in live fluorescence microscopy, and how to avoid it. BioEssays 39, 1700003 (2017).
-  Selinummi, J. et al. Bright field microscopy as an alternative to whole cell fluorescence in automated analysis of macrophage images. PloS one 4, e7497 (2009).
-  Pawley, J. B. Fundamental limits in confocal microscopy. In Handbook of biological confocal microscopy, 20–42 (Springer, 2006).
-  Scherf, N. & Huisken, J. The smart and gentle microscope. Nature biotechnology 33, 815 (2015).
-  Skylaki, S., Hilsenbeck, O. & Schroeder, T. Challenges in long-term imaging and quantification of single-cell dynamics. Nature biotechnology 34, 1137 (2016).
-  LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998).
-  Sullivan, D. P. & Lundberg, E. Seeing more: a future of augmented microscopy. Cell 173, 546–548 (2018).
-  Chen, P. et al. An augmented reality microscope with real-time artificial intelligence integration for cancer diagnosis. Nature medicine 25, 1453–1457 (2019).
-  Moen, E. et al. Deep learning for cellular image analysis. Nature methods 1–14 (2019).
-  Johnson, G. R., Donovan-Maiye, R. M. & Maleckar, M. M. Building a 3D integrated cell. bioRxiv 238378 (2017).
-  Ounkomol, C. et al. Three dimensional cross-modal image inference: label-free methods for subcellular structure prediction. bioRxiv 216606 (2017).
-  Osokin, A., Chessel, A., Carazo Salas, R. E. & Vaggi, F. GANs for biological image synthesis. In Proceedings of the IEEE International Conference on Computer Vision, 2233–2242 (2017).
-  Yuan, H. et al. Computational modeling of cellular structures using conditional deep generative networks. Bioinformatics 35, 2141–2149, DOI: 10.1093/bioinformatics/bty923 (2019).
-  Johnson, G., Donovan-Maiye, R., Ounkomol, C. & Maleckar, M. M. Studying stem cell organization using “label-free” methods and a novel generative adversarial model. Biophysical Journal 114, 43a (2018).
-  Christiansen, E. M. et al. In silico labeling: predicting fluorescent labels in unlabeled images. Cell 173, 792–803 (2018).
-  Ounkomol, C., Seshamani, S., Maleckar, M. M., Collman, F. & Johnson, G. R. Label-free prediction of three-dimensional fluorescence images from transmitted-light microscopy. Nature methods 15, 917 (2018).
-  Wu, Y. et al. Three-dimensional virtual refocusing of fluorescence microscopy images using deep learning. Nature methods 1–9 (2019).
-  Weigert, M. et al. Content-aware image restoration: pushing the limits of fluorescence microscopy. Nature methods 15, 1090 (2018).
-  Wang, H. et al. Deep learning achieves super-resolution in fluorescence microscopy. Biorxiv 309641 (2018).
-  Wang, H. et al. Deep learning enables cross-modality super-resolution in fluorescence microscopy. Nature Methods 16, 103–110 (2019).
-  Rivenson, Y. et al. Deep learning microscopy. Optica 4, 1437–1443 (2017).
-  Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International conference on medical image computing and computer-assisted intervention, 234–241 (Springer, 2015).
-  Falk, T. et al. U-net: deep learning for cell counting, detection, and morphometry. Nature methods 16, 67 (2019).
-  He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
-  He, K., Zhang, X., Ren, S. & Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, 630–645 (Springer, 2016).
-  Fakhry, A., Zeng, T. & Ji, S. Residual deconvolutional networks for brain electron microscopy image segmentation. IEEE transactions on medical imaging 36, 447–456 (2017).
-  Lee, K., Zung, J., Li, P., Jain, V. & Seung, H. S. Superhuman accuracy on the snemi3d connectomics challenge. arXiv preprint arXiv:1706.00120 (2017).
-  Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T. & Ronneberger, O. 3d u-net: learning dense volumetric segmentation from sparse annotation. In International conference on medical image computing and computer-assisted intervention, 424–432 (Springer, 2016).
-  Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
-  Vaswani, A. et al. Attention is all you need. In Advances in neural information processing systems, 5998–6008 (2017).
-  Wang, X., Girshick, R., Gupta, A. & He, K. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7794–7803 (2018).
-  Wilson, D. R. & Martinez, T. R. The general inefficiency of batch training for gradient descent learning. Neural networks 16, 1429–1451 (2003).
-  Wang, Z., Bovik, A. C., Sheikh, H. R., Simoncelli, E. P. et al. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13, 600–612 (2004).
-  Aigouy, B. et al. Cell flow reorients the axis of planar polarity in the wing epithelium of drosophila. Cell 142, 773–786 (2010).
-  Etournay, R. et al. Interplay of cell dynamics and epithelial tension during morphogenesis of the drosophila pupal wing. Elife 4, e07090 (2015).
-  Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 1345–1359 (2009).
-  Blasse, C. et al. Premosa: extracting 2d surfaces from 3d microscopy mosaics. Bioinformatics 33, 2563–2569 (2017).
-  Zhang, Q., Cui, Z., Niu, X., Geng, S. & Qiao, Y. Image segmentation with pyramid dilated convolution based on resnet and u-net. In International Conference on Neural Information Processing, 364–372 (Springer, 2017).
-  Huang, J. et al. Range scaling global u-net for perceptual image enhancement on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), 0–0 (2018).
-  Oktay, O. et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018).
-  Zhou, Z., Siddiquee, M. M. R., Tajbakhsh, N. & Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, 3–11 (Springer, 2018).
-  Gu, Z. et al. Ce-net: context encoder network for 2d medical image segmentation. IEEE transactions on medical imaging 38, 2281–2292 (2019).
-  Goodfellow, I. et al. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680 (2014).
-  Cai, L., Wang, Z., Gao, H., Shen, D. & Ji, S. Deep adversarial learning for multi-modality missing data completion. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1158–1166 (2018).
-  Rivenson, Y. et al. Virtual histological staining of unlabelled tissue-autofluorescence images via deep learning. Nature biomedical engineering 3, 466 (2019).
-  Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 1126–1135 (JMLR. org, 2017).
We thank the CARE and the Allen Institute for Cell Science teams for making their data and tools publicly available. This work was supported in part by National Science Foundation grants DBI-1922969, IIS-1908166, and IIS-1908220, National Institutes of Health grant 1R21NS102828, and Defense Advanced Research Projects Agency grant N66001-17-2-4031.
Z.W. and Y.X. shared first-authorship. S.J. conceived and initiated the research. Z.W. and S.J. designed the methods. Z.W. and Y.X. implemented the training and validation methods. Z.W. and Y.X. designed and developed the software package. S.J. supervised the project. Z.W., Y.X., and S.J. wrote the manuscript.
The authors declare no competing interests.
4.1 Network architecture
4.1.1 General framework
Global voxel transformer networks (GVTNets) follow the same encoder-decoder framework as the U-Net [31, 37, 32], which represents a family of deep neural networks for biological image transformations. An encoder takes the image to be transformed as the input and computes feature maps of gradually reduced spatial sizes, which encode multi-scale and multi-resolution information from the input image. Then a corresponding decoder uses these feature maps to produce the transformed image, during which feature maps of gradually increased spatial sizes are computed. GVTNets support both 2D and 3D biological image transformations. We use the 3D case to describe the architecture in detail (Fig. 1c).
In our GVTNets, the encoder starts with an initial convolution that transforms the input image into a chosen number of feature maps of the same spatial size, initializing the encoding. The encoding process is achieved by down-sampling operators interleaved with optional size-preserving operators. Each down-sampling operator halves the size along each spatial dimension of feature maps but doubles the channel dimension, i.e., the number of feature maps. To be specific, given a $D \times H \times W \times C$ tensor representing $C$ feature maps of the spatial size $D \times H \times W$ as inputs, a down-sampling operator will output a $\frac{D}{2} \times \frac{H}{2} \times \frac{W}{2} \times 2C$ tensor. Feature maps of the same spatial size are considered at the same level. As a result, the number of levels, also known as the depth of the network, is determined by the number of down-sampling operators in the encoder.
Correspondingly, the decoder is composed of the same number of up-sampling operators interleaved with optional size-preserving operators. The decoding process computes feature maps of increased spatial sizes in a level-by-level fashion, where each up-sampling operator doubles the size along each spatial dimension of feature maps but halves the channel dimension, the opposite of a down-sampling operator. Therefore, there is a one-to-one correspondence between down-sampling and up-sampling operators. The decoder ends with an output convolution that outputs the transformed image of the same spatial size as the input image.
The encoder and decoder are connected at each level. The bottom level contains the outputs of the encoder, which are feature maps of the smallest size in the U-Net framework. These feature maps, after optional size-preserving operators, serve as inputs to the decoder. In upper levels, there exist skip connections between the encoder and decoder. Concretely, the input feature maps to each down-sampling operator are concatenated or added to the output feature maps of the corresponding up-sampling operator. The skip connections allow the decoder to take advantage of encoded multi-scale and multi-resolution information, which increases the capability of the framework and facilitates the training process [31, 36].
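The level structure described above can be made concrete with a small shape calculation. The following is a minimal sketch rather than part of the released tools: the function name `encoder_shapes`, the parameter `num_down`, and the example sizes are hypothetical and chosen only for illustration.

```python
def encoder_shapes(spatial, channels, num_down):
    """Return the (D, H, W, C) feature-map shape at every encoder level.

    `num_down` is the number of down-sampling operators. Each one halves
    every spatial dimension and doubles the number of feature maps.
    """
    d, h, w = spatial
    shapes = [(d, h, w, channels)]
    for _ in range(num_down):
        if d % 2 or h % 2 or w % 2:
            raise ValueError("spatial sizes must be divisible by 2 at every level")
        d, h, w, channels = d // 2, h // 2, w // 2, channels * 2
        shapes.append((d, h, w, channels))
    return shapes


# Hypothetical example: a 32 x 64 x 64 patch with 32 initial feature maps.
encoder = encoder_shapes(spatial=(32, 64, 64), channels=32, num_down=3)

# The decoder visits the same levels in reverse order, so the output of each
# up-sampling operator has the same shape as the input of the corresponding
# down-sampling operator. This is what makes the skip connection well defined.
decoder = encoder[::-1]

for level, shape in enumerate(encoder):
    print(level, shape)
```

Because the shapes match level by level, a skip connection by addition leaves the shape unchanged, while a skip connection by concatenation doubles the channel dimension at that level.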
4.1.2 Global voxel transformer networks
The major difference between our GVTNets and the original U-Net lies in the choices of the size-preserving, down-sampling, and up-sampling operators. GVTNets are equipped with global voxel transformer operators (GVTOs), which can be flexibly used for size-preserving, down-sampling, or up-sampling tensor processing. In particular, GVTNets fix the size-preserving operator at the bottom level to be the size-preserving GVTO, ensuring that global information is encoded and aggregated before going through the decoder. The other size-preserving operators are set to pre-activation residual blocks, each consisting of two convolutions (Supplementary Fig. 10a). Down-sampling and up-sampling GVTOs can be used as the corresponding operators, depending on the datasets and tasks.
4.2 Global voxel transformer operators
As described above, the key components of our GVTNets are global voxel transformer operators (GVTOs), which are able to selectively use long-range information among input units. We take the 3D case to illustrate the size-preserving GVTO first, followed by the down-sampling and up-sampling GVTOs.
4.2.1 Size-preserving GVTO
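Given the input fourth-order tensor $\mathcal{X} \in \mathbb{R}^{D \times H \times W \times C}$ representing $C$ feature maps of the spatial size $D \times H \times W$, the size-preserving GVTO performs three independent convolutions on $\mathcal{X}$ and obtains three tensors, namely the query ($\mathcal{Q}$), key ($\mathcal{K}$), and value ($\mathcal{V}$) tensor, where $\mathcal{Q}, \mathcal{K}, \mathcal{V} \in \mathbb{R}^{D \times H \times W \times C}$. Afterwards, $\mathcal{Q}$, $\mathcal{K}$, $\mathcal{V}$ are unfolded along the channel dimension into matrices $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{C \times DHW}$. These matrices go through the attention operator defined as

$$\mathbf{Y} = \mathbf{V} \cdot \mathrm{Normalize}\left(\mathbf{K}^{\top}\mathbf{Q}\right),$$

where $\mathrm{Normalize}(\cdot)$ is a normalization function that normalizes each column of $\mathbf{K}^{\top}\mathbf{Q}$. Specifically, the size-preserving GVTO simply uses $1/N$ as the normalization function:

$$\mathbf{Y} = \frac{1}{N}\,\mathbf{V}\mathbf{K}^{\top}\mathbf{Q},$$

where $N = DHW$ is the second dimension of $\mathbf{K}$ and is subject to corresponding changes in the down-sampling and up-sampling GVTOs. After the attention operator, the matrix $\mathbf{Y} \in \mathbb{R}^{C \times DHW}$ is folded back to a tensor $\mathcal{Y} \in \mathbb{R}^{D \times H \times W \times C}$. The final output of the size-preserving GVTO is the summation of $\mathcal{X}$ and $\mathcal{Y}$, which means a residual connection from the inputs to the outputs. In particular, we use the pre-activation technique as well. As a result, the size-preserving GVTO preserves the dimension of the inputs (Supplementary Fig. 11e).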
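The computation of the size-preserving GVTO can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the released implementation: the three convolutions are modeled as per-voxel linear maps (kernel size 1) with hypothetical weight matrices `w_q`, `w_k`, `w_v`, the normalization is taken to be a division by the number of input locations, and the pre-activation (normalization and activation before the operator) is omitted.

```python
import numpy as np


def size_preserving_gvto(x, w_q, w_k, w_v):
    """Sketch of a size-preserving GVTO on a (D, H, W, C) array.

    w_q, w_k, w_v are (C, C) matrices standing in for the three
    convolutions (assumed here to act independently on each voxel).
    """
    d, h, w, c = x.shape
    n = d * h * w
    x_mat = x.reshape(n, c).T           # unfold along channels: (C, N)
    q = w_q @ x_mat                     # query matrix,  (C, N)
    k = w_k @ x_mat                     # key matrix,    (C, N)
    v = w_v @ x_mat                     # value matrix,  (C, N)
    weights = (k.T @ q) / n             # (N, N); column j weights all inputs for output j
    y = v @ weights                     # (C, N)
    y = y.T.reshape(d, h, w, c)         # fold back to a tensor
    return x + y                        # residual connection


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 5))
w_q, w_k, w_v = (rng.standard_normal((5, 5)) for _ in range(3))
out = size_preserving_gvto(x, w_q, w_k, w_v)
print(out.shape)
```

The intermediate `weights` matrix has one entry per pair of spatial locations, which is the source of the time and memory cost of attention compared with local operators.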
4.2.2 Down-sampling and up-sampling GVTOs
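The extension from the size-preserving GVTO to the down-sampling and up-sampling GVTOs is achieved by changing the convolutions that compute the query, key, and value tensors $\mathcal{Q}$, $\mathcal{K}$, $\mathcal{V}$. We take the down-sampling GVTO as an example for illustration. Given the same input tensor $\mathcal{X} \in \mathbb{R}^{D \times H \times W \times C}$, we use a convolution with stride 2 to obtain $\mathcal{Q} \in \mathbb{R}^{\frac{D}{2} \times \frac{H}{2} \times \frac{W}{2} \times 2C}$ and two independent convolutions to generate $\mathcal{K}, \mathcal{V} \in \mathbb{R}^{D \times H \times W \times 2C}$. The following computation is the same; that is, $\mathcal{Q}$, $\mathcal{K}$, $\mathcal{V}$ are unfolded along the channel dimension into matrices $\mathbf{Q} \in \mathbb{R}^{2C \times \frac{DHW}{8}}$ and $\mathbf{K}, \mathbf{V} \in \mathbb{R}^{2C \times DHW}$, which are fed into the same attention operator and output the matrix $\mathbf{Y} \in \mathbb{R}^{2C \times \frac{DHW}{8}}$. Folding it back results in a tensor $\mathcal{Y} \in \mathbb{R}^{\frac{D}{2} \times \frac{H}{2} \times \frac{W}{2} \times 2C}$. Comparing the dimensions of $\mathcal{Y}$ and $\mathcal{X}$, we achieve a down-sampling process that halves the size along each spatial dimension of feature maps but doubles the channel dimension. We complete the down-sampling GVTO by adding the residual connection in two ways, corresponding to two versions of the down-sampling GVTO (Supplementary Fig. 11a-b). One is to perform an extra convolution with stride 2 through the residual connection from $\mathcal{X}$ to $\mathcal{Y}$, in order to transform $\mathcal{X}$ to have the same dimension as $\mathcal{Y}$; the other is to directly add $\mathcal{Q}$ to $\mathcal{Y}$, based on the fact that $\mathcal{Q}$ is obtained from $\mathcal{X}$.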
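The up-sampling GVTO is dual to the down-sampling GVTO. Instead of using a convolution with stride 2, it uses a transposed convolution with stride 2 to obtain the query tensor $\mathcal{Q} \in \mathbb{R}^{2D \times 2H \times 2W \times \frac{C}{2}}$ from the input tensor $\mathcal{X} \in \mathbb{R}^{D \times H \times W \times C}$. In addition, the other two convolutions generate the key and value tensors $\mathcal{K}, \mathcal{V} \in \mathbb{R}^{D \times H \times W \times \frac{C}{2}}$. The up-sampling GVTO doubles the size along each spatial dimension of feature maps but halves the channel dimension and also has two versions corresponding to different residual connections (Supplementary Fig. 11c-d).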
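The shape changes in the down-sampling GVTO can be traced with the following NumPy sketch. It makes several simplifying assumptions that are not taken from the paper: the stride-2 convolution for the query is replaced by keeping every second voxel and applying a per-voxel linear map, the key and value convolutions are likewise per-voxel linear maps, the attention output is divided by the number of input locations, and only the residual version that adds the query to the output is shown. The weight names are hypothetical.

```python
import numpy as np


def attention(q, k, v):
    """Attention on unfolded matrices: q is (Cq, Nq), k is (Cq, N), v is (Cv, N)."""
    n = k.shape[1]
    return v @ (k.T @ q) / n            # (Cv, Nq)


def down_sampling_gvto(x, w_q, w_k, w_v):
    """Sketch of a down-sampling GVTO on a (D, H, W, C) array.

    w_q, w_k, w_v are (2C, C) matrices standing in for the convolutions.
    """
    c = x.shape[-1]
    # Stand-in for the stride-2 convolution: subsample, then mix channels.
    x_q = x[::2, ::2, ::2, :]
    dq, hq, wq, _ = x_q.shape
    q = w_q @ x_q.reshape(-1, c).T      # (2C, DHW/8)
    x_mat = x.reshape(-1, c).T          # (C, DHW)
    k = w_k @ x_mat                     # (2C, DHW)
    v = w_v @ x_mat                     # (2C, DHW)
    y = attention(q, k, v)              # (2C, DHW/8)
    # Residual connection that adds the query directly to the output.
    return (y + q).T.reshape(dq, hq, wq, -1)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 6, 3))
w_q, w_k, w_v = (rng.standard_normal((6, 3)) for _ in range(3))
out = down_sampling_gvto(x, w_q, w_k, w_v)
print(x.shape, "->", out.shape)
```

Every output location still attends to all input locations, because the keys and values are computed at full resolution; only the number of queries changes. An up-sampling variant would follow the same pattern with a query that has more locations and fewer channels than the input.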
4.2.3 Advantages of GVTOs
It is noteworthy that each spatial location in the output tensor of GVTOs has access to all the information in the input tensor, and is able to selectively use or ignore information. We illustrate this point by regarding the input tensor $\mathcal{X}$ as $DHW$ $C$-dimensional vectors, where each vector represents the information in a spatial location. In this view, each vector has a one-to-one correspondence to each column in the key matrix $\mathbf{K}$ and the value matrix $\mathbf{V}$ in GVTOs, respectively. Revisiting the attention operator, each column in the output matrix $\mathbf{Y}$ is a vector representation of each spatial location in the output tensor, and has a one-to-one correspondence to each column in the query matrix $\mathbf{Q}$. Moreover, each column in $\mathbf{Y}$ is computed as the weighted sum of columns in $\mathbf{V}$, whose weights are determined by the interaction between the corresponding column in $\mathbf{Q}$ and all columns in $\mathbf{K}$. The weights can be viewed as filters of the amount of information from each spatial location in the inputs to the outputs. In addition, as both $\mathbf{Q}$ and $\mathbf{K}$ are computed from the input tensor, the weights are input-dependent. Therefore, GVTOs achieve dynamic non-local information aggregation.
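Written column by column, the aggregation described above takes the following form. The notation is introduced here only for illustration: $\mathbf{q}_j$, $\mathbf{k}_i$, and $\mathbf{v}_i$ denote columns of the query, key, and value matrices, $\mathbf{y}_j$ is the output column for location $j$, $N$ is the number of input locations, and $w_{ij}$ is the normalized weight that input location $i$ receives in output location $j$.

```latex
\mathbf{y}_j = \sum_{i=1}^{N} w_{ij}\,\mathbf{v}_i,
\qquad
w_{ij} = \left[\mathrm{Normalize}\!\left(\mathbf{K}^{\top}\mathbf{Q}\right)\right]_{ij},
\qquad
\left[\mathbf{K}^{\top}\mathbf{Q}\right]_{ij} = \mathbf{k}_i^{\top}\mathbf{q}_j .
```

A weight $w_{ij}$ close to zero means that output location $j$ ignores input location $i$, regardless of how far apart the two locations are.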
4.2.4 Comparisons with Fully-Connected Layers
It is important to note that the proposed GVTOs are different from fully-connected (FC) layers in fundamental ways, although they both allow each output unit to use information from the entire input. Compared to FC layers, outputs in GVTOs are computed based on relations among inputs. Thus the weights are input-dependent, rather than learned and fixed during prediction as in FC layers. The only trainable parameters in GVTOs are the convolutions that compute the query, key, and value tensors $\mathcal{Q}$, $\mathcal{K}$, $\mathcal{V}$, whose sizes are independent of input and output sizes. As a consequence, GVTOs allow variable-size inputs, and the positional correspondence between inputs and outputs is preserved in GVTOs. In contrast, FC layers require fixed-size inputs and positional correspondence is lost.
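The contrast with FC layers can be illustrated numerically. In the sketch below, the same three weight matrices are applied to inputs of two different spatial sizes, and the parameter count is compared with that of a dense layer mapping the flattened input to an output of the same size. The per-voxel linear maps, the division by the number of locations, and all names and sizes are assumptions made for this illustration.

```python
import numpy as np


def gvto_core(x, w_q, w_k, w_v):
    """Attention part of a size-preserving GVTO on a (D, H, W, C) array (no residual)."""
    c = x.shape[-1]
    x_mat = x.reshape(-1, c).T                      # (C, N)
    q, k, v = w_q @ x_mat, w_k @ x_mat, w_v @ x_mat
    y = v @ (k.T @ q) / x_mat.shape[1]              # (C, N)
    return y.T.reshape(x.shape)


rng = np.random.default_rng(0)
c = 4
w_q, w_k, w_v = (rng.standard_normal((c, c)) for _ in range(3))

# The same parameters handle inputs of different spatial sizes.
small = rng.standard_normal((2, 4, 4, c))
large = rng.standard_normal((4, 8, 8, c))
out_small = gvto_core(small, w_q, w_k, w_v)
out_large = gvto_core(large, w_q, w_k, w_v)

# Parameter counts: fixed for the GVTO sketch, quadratic in input size for FC.
gvto_params = 3 * c * c
fc_params_small = small.size ** 2
fc_params_large = large.size ** 2
print(gvto_params, fc_params_small, fc_params_large)
```

An FC layer trained for the smaller input could not be applied to the larger one at all, whereas the attention weights in the sketch are recomputed from each input.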
4.3 Training loss
GVTNets are trained in an end-to-end fashion with two options for the loss function. One is the mean squared error (MSE):

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2,$$

where $y$ represents the ground truth image, $\hat{y}$ represents the model’s predicted image, and $N$ represents the total number of voxels in the image. The other is the mean absolute error (MAE):

$$\mathcal{L}_{\mathrm{MAE}} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|.$$

Both MSE and MAE measure the differences between the predicted image and the ground truth image. The training process applies the Adam optimizer with a user-chosen learning rate to minimize the loss.
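For reference, the two losses can be written in NumPy as follows. This is a minimal sketch of the standard definitions, averaged over all voxels; the function names and the toy arrays are illustrative.

```python
import numpy as np


def mse_loss(y_true, y_pred):
    """Mean squared error over all voxels."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.mean((y_true - y_pred) ** 2))


def mae_loss(y_true, y_pred):
    """Mean absolute error over all voxels."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.mean(np.abs(y_true - y_pred)))


y_true = np.array([[0.0, 1.0], [2.0, 3.0]])
y_pred = np.array([[0.0, 2.0], [2.0, 1.0]])
print(mse_loss(y_true, y_pred))  # 1.25
print(mae_loss(y_true, y_pred))  # 0.75
```

The example shows the usual difference between the two: the single error of size 2 contributes four times as much as the error of size 1 to the MSE, but only twice as much to the MAE.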
4.4 Evaluation metrics
4.4.1 Pearson correlation coefficient
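The Pearson correlation coefficient ($r$) between the ground truth image $y$ and the predicted image $\hat{y}$, each with $N$ voxels, is computed as

$$r = \frac{\sum_{i=1}^{N}\left(y_i - \mu_{y}\right)\left(\hat{y}_i - \mu_{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{N}\left(y_i - \mu_{y}\right)^2}\,\sqrt{\sum_{i=1}^{N}\left(\hat{y}_i - \mu_{\hat{y}}\right)^2}},$$

where $\mu_{y}$ and $\mu_{\hat{y}}$ are the means of voxel intensities in $y$ and $\hat{y}$, respectively.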
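A direct NumPy transcription of the Pearson correlation coefficient over all voxels is given below as a minimal sketch; the function name is illustrative.

```python
import numpy as np


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two images, over all voxels."""
    a = np.asarray(y_true, dtype=np.float64).ravel()
    b = np.asarray(y_pred, dtype=np.float64).ravel()
    a = a - a.mean()                    # center the ground truth intensities
    b = b - b.mean()                    # center the predicted intensities
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))


rng = np.random.default_rng(0)
truth = rng.standard_normal((4, 8, 8))
noisy = truth + 0.5 * rng.standard_normal(truth.shape)
print(pearson_r(truth, noisy))
```

Because both images are centered and scaled, the coefficient is unchanged when the prediction is multiplied by a positive constant or shifted by an offset, so it measures agreement in structure rather than in absolute intensity.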
4.4.2 Normalized root-mean-square error
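The root-mean-square error (RMSE) between the ground truth image $y$ and the predicted image $\hat{y}$, each with $N$ voxels, is computed as

$$\mathrm{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}.$$

The normalized root-mean-square error (NRMSE) simply adds a normalization function on $y$ and $\hat{y}$, respectively. In our tools and experiments, we apply the same percentile-based normalization and transformation as in M. Weigert et al. Concretely, the normalized root-mean-square error is defined by

$$\mathrm{NRMSE}(y, \hat{y}) = \mathrm{RMSE}\left(\mathcal{N}(y), \varphi(\hat{y})\right),$$

where $\mathcal{N}(\cdot)$ is the percentile-based normalization, and $\varphi(\cdot)$ denotes a transformation that scales and shifts $\hat{y}$. During the implementation, we let $\varphi(\hat{y}) = \alpha\hat{y} + \beta$ and choose $\alpha$ and $\beta$ so that the MSE between $\varphi(\hat{y})$ and $\mathcal{N}(y)$ is minimized.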
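One way to realize this metric is sketched below. The sketch assumes that the ground truth is normalized by percentiles and that the prediction is rescaled by the affine map that minimizes the MSE to the normalized ground truth, which has a closed-form least-squares solution. The percentile values `p_low` and `p_high`, the small constant `eps`, and all function names are hypothetical choices for illustration and are not taken from the text.

```python
import numpy as np


def percentile_normalize(y, p_low=0.1, p_high=99.9, eps=1e-20):
    """Map the p_low and p_high percentiles of y to 0 and 1 (assumed percentiles)."""
    lo, hi = np.percentile(y, [p_low, p_high])
    return (y - lo) / (hi - lo + eps)


def affine_fit(y_pred, target):
    """Return alpha, beta minimizing mean((alpha * y_pred + beta - target) ** 2)."""
    x = y_pred.ravel()
    t = target.ravel()
    cov = np.mean((x - x.mean()) * (t - t.mean()))
    alpha = cov / np.var(x)
    beta = t.mean() - alpha * x.mean()
    return alpha, beta


def nrmse(y_true, y_pred):
    """RMSE between the normalized ground truth and the rescaled prediction."""
    target = percentile_normalize(np.asarray(y_true, dtype=np.float64))
    y_pred = np.asarray(y_pred, dtype=np.float64)
    alpha, beta = affine_fit(y_pred, target)
    return float(np.sqrt(np.mean((alpha * y_pred + beta - target) ** 2)))


rng = np.random.default_rng(0)
truth = rng.random((4, 8, 8))
pred = truth + 0.1 * rng.standard_normal(truth.shape)
print(nrmse(truth, pred))
```

With this construction the metric does not change when the prediction is scaled or shifted, so models are not penalized for a global intensity offset.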
4.4.3 Structural similarity index
4.5 Task-specific configurations
The settings of our device are as follows. GPU: Nvidia GeForce RTX 2080 Ti 11GB; CPU: Intel Xeon Silver 4116 2.10GHz; OS: Ubuntu 16.04.3 LTS.
4.5.1 Label-free prediction of 3D fluorescence images from transmitted-light microscopy
The basic GVTNet used in the experiments of label-free prediction of 3D fluorescence images is illustrated in Supplementary Fig. 2. The network has depth 4, where the skip-connections add feature maps from the encoder to the decoder. In particular, the bottom block of the basic GVTNet is the size-preserving GVTO (Supplementary Fig. 11e). The number of feature maps after the initial convolution is set to 32. Batch normalization with the momentum of 0.997 and epsilon of 0.00001 is applied before each ReLU activation function.
The 13 subtasks, corresponding to 13 different subcellular structures, are performed separately and independently. To train the GVTNet, the 30 pairs of training images are randomly cropped into patches, and each training batch contains 16 pairs of patches. We minimize the MSE loss using the Adam optimizer with a learning rate of 0.001 for 70,000 to 100,000 minibatch iterations, depending on the subtask. The training procedure lasts approximately 11h15m to 15h45m for each of the 13 datasets.
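The random cropping of paired patches can be sketched as follows. The key point is that the input and target patches must be cut from the same window so that they stay aligned. The function names, the patch size, and the toy volumes are hypothetical; only the number of training pairs (30) and the batch size (16) follow the text.

```python
import numpy as np


def random_paired_crop(source, target, patch_size, rng):
    """Cut the same randomly placed window out of an aligned image pair."""
    if source.shape != target.shape:
        raise ValueError("source and target must have the same shape")
    starts = [int(rng.integers(0, dim - p + 1))
              for dim, p in zip(source.shape, patch_size)]
    window = tuple(slice(s, s + p) for s, p in zip(starts, patch_size))
    return source[window], target[window]


def sample_batch(pairs, patch_size, batch_size, rng):
    """Draw a batch of aligned patch pairs from a list of (source, target) images."""
    sources, targets = [], []
    for _ in range(batch_size):
        src, tgt = pairs[int(rng.integers(len(pairs)))]
        s, t = random_paired_crop(src, tgt, patch_size, rng)
        sources.append(s)
        targets.append(t)
    return np.stack(sources), np.stack(targets)


rng = np.random.default_rng(0)
# Toy stand-ins for 30 image pairs; each target is the source plus one.
volumes = [rng.random((8, 16, 16)) for _ in range(30)]
pairs = [(v, v + 1.0) for v in volumes]
batch_src, batch_tgt = sample_batch(pairs, patch_size=(4, 8, 8), batch_size=16, rng=rng)
print(batch_src.shape, batch_tgt.shape)
```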
4.5.2 Content-aware 3D image denoising
The GVTNet used in the image denoising tasks is illustrated in Supplementary Fig. 4. It follows a 3D U-Net framework of depth 3, i.e., with 2 down-sampling and 2 up-sampling operators. The skip connections merge feature maps from the encoder to the decoder by concatenation instead of addition. The bottom block is the size-preserving GVTO, and the two up-sampling operators are up-sampling GVTOs v2 (Supplementary Fig. 11d). The number of feature maps after the initial convolution is set to 32. No batch normalization is applied.
We use the MAE loss with the Bayesian deep learning technique (Supplementary Note 1) to train the GVTNet. We train the model with a batch size of 16 and a base learning rate of 0.0004, with a decay rate of 0.7 applied every 10,000 minibatch iterations. The training procedure takes 50 epochs and lasts about 5h45m and 4h50m for the Planaria and Tribolium datasets, respectively.
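The learning rate schedule can be written as a one-line function. The sketch below assumes a staircase schedule, in which the rate is multiplied by the decay rate once every 10,000 iterations; a continuous exponential decay would be an equally plausible reading of the description. The function name is illustrative.

```python
def learning_rate(step, base_lr=0.0004, decay_rate=0.7, decay_steps=10_000):
    """Staircase exponential decay (assumed): multiply by decay_rate every decay_steps."""
    return base_lr * decay_rate ** (step // decay_steps)


for step in (0, 9_999, 10_000, 25_000):
    print(step, learning_rate(step))
```

Under this assumption the rate stays at 0.0004 for the first 10,000 iterations, drops to 0.7 times that value for the next 10,000, and so on.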
4.5.3 Content-aware 3D to 2D image projection
The model for surface projection is composed of a 3D to 2D projection network and a 2D denoising network, as illustrated in Supplementary Fig. 8. The projection network predicts the probability of each voxel in the 3D input image belonging to the 2D surface, and uses summation weighted by the predicted probabilities along the Z-axis to finish the projection. The probabilities are estimated by a GVTO-augmented CNN. The following 2D denoising network is simply a 2D version of the GVTNet used in the image denoising tasks.
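The projection step itself, a probability-weighted summation along the Z-axis, can be sketched as follows. The sketch assumes that the network outputs one unnormalized score per voxel and that the scores are turned into probabilities with a softmax along Z; the text does not specify the normalization, so this is an illustrative choice, as are the names and the toy data.

```python
import numpy as np


def project_to_2d(volume, scores):
    """Project a (Z, Y, X) volume to (Y, X) using per-voxel scores of the same shape.

    Scores are normalized with a softmax along Z (an assumption), so that the
    weights of each (y, x) column sum to one.
    """
    scores = scores - scores.max(axis=0, keepdims=True)   # for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=0, keepdims=True)
    return (probs * volume).sum(axis=0)


volume = np.arange(24, dtype=np.float64).reshape(2, 3, 4)

# Uniform scores reduce the projection to an average over Z.
flat = project_to_2d(volume, np.zeros_like(volume))

# Strongly peaked scores select a single slice per column.
peaked_scores = np.zeros_like(volume)
peaked_scores[1] = 50.0
picked = project_to_2d(volume, peaked_scores)
print(flat.shape, picked.shape)
```

Because the weighted sum is differentiable, the projection network and the subsequent 2D denoising network can in principle be trained together, with the loss computed only on the final 2D output.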
Apart from the patch sizes of the 3D inputs and the 2D ground truths, the training settings are the same as those in the image denoising experiments, except that we do not use the Bayesian deep learning technique. The training procedure lasts 4h55m for the Flywing dataset.
Further information on research design can be found in the Nature Research Reporting Summary linked to this article.
Datasets for label-free prediction of 3D fluorescence images from transmitted-light microscopy can be downloaded from https://downloads.allencell.org/publication-data/label-free-prediction/index.html. Datasets for content-aware 3D image denoising and 3D to 2D image projection can be downloaded from https://publications.mpi-cbg.de/publications-sites/7207.
-  Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105 (2012).
-  Kolda, T. G. & Bader, B. W. Tensor decompositions and applications. SIAM review 51, 455–500 (2009).
-  Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd international conference on learning representations (2015).
-  Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, 448–456 (2015).
-  Kendall, A. & Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, 5574–5584 (2017).