Deep neural networks show considerable promise for the rapid analysis of data collected at scientific experiments, enabling tasks such as anomaly detection, image enhancement, and image reconstruction to be performed in a fraction of the time required by conventional methods. However, the substantial computational costs of deep models are an obstacle to their widespread adoption. The graphics processing unit (GPU) devices that are typically used to run these models are too expensive to be dedicated to individual instruments, while dispatching analysis tasks to shared data centers can require substantial data movement and incur large round-trip latencies.
Specialized “edge” inference devices such as the NVIDIA Jetson Tx2 GPU (henceforth TX2) and Google Edge TPU are potential solutions to this problem. (The Edge TPU is distributed by Coral Inc. in two configurations: Accelerator, which relies on a host machine, such as a PC or single-board Raspberry Pi, and Dev Board, which comes with a 64-bit ARM system as host.) These edge devices use techniques such as reduced-precision arithmetic to enable rapid execution of deep models at a low price point (and with low power consumption and a compact form factor) that makes it feasible to embed them within scientific instruments.
The question remains, however, as to whether these edge inference devices can execute the deep models used in science with sufficient speed and accuracy. Models originally developed to run on GPUs that support 32-bit floating point arithmetic must be adapted to run on edge devices that may support only lower-precision integer arithmetic, a process that typically employs a technique called model quantization [30, 14, 22]. Google and NVIDIA have developed implementations of such schemes, allowing inference with integer-only arithmetic on integer-only hardware [18, 27, 35]. They report benchmark results showing that edge devices can perform inference as rapidly as a powerful PC, at much lower cost [8, 28]. However, questions remain when it comes to using such devices in scientific settings. The resulting models will be more compact, but will they be sufficiently accurate? And will the edge device run the models rapidly enough to meet scientific goals?
We explore these questions here by studying how a specific scientific deep learning model, TomoGAN [26, 25] (code available at email@example.com:ramsesproject/TomoGAN.git), can be adapted for edge deployment. TomoGAN uses generative adversarial network (GAN) methods to enhance the quality of low-dose X-ray images via a denoising process. Although diverse object detection and classification applications have been implemented on edge devices, image restoration with a complex image generative model has not previously been attempted on them. We adapt TomoGAN to run on the Google Edge TPU (both Accelerator and Dev Board) and TX2, and compare the accuracy and computational performance of the resulting models with those of other implementations. We also describe how to mitigate accuracy loss in quantized models by applying a lightweight “fine-tuning” convolutional neural network to the results of the quantized TomoGAN.
The rest of this paper is organized as follows. In §II, we describe how we adapt a pre-trained deep learning model, TomoGAN, for the Edge TPU. Next, in §III, we present experiments used to evaluate edge computing performance and model accuracy, both with and without the fine-tuning component. In §IV we review related work, and in §V we summarize our results and outline directions for future work.
We next describe how we adapt the TomoGAN model for the Edge-TPU (i.e., Accelerator and Dev Board). Specifically, we describe the steps taken to improve the accuracy of the enhanced images, the datasets used, and our performance and accuracy evaluations.
We consider two approaches to quantizing the TomoGAN model for the Edge TPU: Post-Quantization and Quantization-Aware. In both methods, the first step is to design a non-quantized model that has the features required by the corresponding Post-Quantization or Quantization-Aware model.
II-A1 Post-Quantization-Based Inference Model
The steps followed to generate the Post-Quantization-based inference model are shown in Figure 1.
We first train a Post-Quantization-based model, which differs from the standard TomoGAN model only in the input tensor shape, which is 64×64×3 rather than 1024×1024×3. The partitioning of each 1024×1024 input image into multiple 64×64 subimages is needed because of limitations on the output size permitted by the Edge-TPU-Compatible model. See §III-A for details on training data. The average training time for 40,000 iterations was around 24 hours on a single NVIDIA V100 GPU.
II-A2 Quantization-Aware Based Inference Model
This second approach (§2) differs from the first only in the method used to generate the trained model. In order to attenuate the accuracy loss that may result from the quantization of trained weights in the inference stage, a more complex model is trained that introduces fake quantization layers to simulate the effect of quantization.
The major drawback of this methodology is that the introduction of fake quantization layers leads to a much longer training time, extending to dozens of days. We thus conclude that this method is not feasible for larger models like TomoGAN, and adopt the Post-Quantization approach for our TomoGAN-based image restoration system.
II-B Model Generation
There are specific model generation schemes to provide mobile-computing and edge-computing friendly models.
II-B1 Mobile-Compatible Model Generation
A mobile-compatible model accepts quantized unsigned int8 inputs and generates quantized unsigned int8 outputs. A quantized int8_value representation is related to the corresponding real_value as follows:
real_value = (int8_value − zero_point) × scale,
where zero_point and scale are parameters. Prior to mobile-compatible model generation, we process a representative dataset to estimate the value range of the data that are to be quantized, and choose appropriate values for these parameters.
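As an illustration of the parameter selection described above, the following Python sketch estimates scale and zero-point from a small representative dataset and then applies the usual affine mapping between real values and unsigned 8-bit values. The function names, the sample values, and the simple min/max range estimate are assumptions made for this example; they are not taken from the TomoGAN code or from any converter API.

```python
def choose_qparams(values, qmin=0, qmax=255):
    """Estimate (scale, zero_point) for unsigned 8-bit quantization.

    The range is widened to include 0.0 so that a real value of zero
    maps exactly onto an integer (the zero-point).
    """
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:  # degenerate case: all values are zero
        return 1.0, 0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(real_value, scale, zero_point, qmin=0, qmax=255):
    """Map a real value to its clamped unsigned 8-bit representation."""
    q = int(round(real_value / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(int8_value, scale, zero_point):
    """Recover the real value: (int8_value - zero_point) * scale."""
    return (int8_value - zero_point) * scale


# Hypothetical representative data standing in for real activations.
representative = [-1.0, -0.25, 0.0, 0.5, 3.0]
scale, zero_point = choose_qparams(representative)
restored = [
    dequantize(quantize(v, scale, zero_point), scale, zero_point)
    for v in representative
]
max_error = max(abs(a - b) for a, b in zip(representative, restored))
```

In this sketch the round-trip error for in-range values stays below one quantization step (`scale`), which is the accuracy loss that quantization introduces per value.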
II-B2 Edge-TPU Compatible Model Generation
In order to exploit the Accelerator and Dev Board, the quantized model must be converted into an Edge-TPU-Compatible model. This conversion is performed by a compiler distributed with the Edge TPU firmware libraries, which takes a Mobile-Compatible model as input and produces an Edge-TPU-Compatible model.
II-C Inference Workflow
We describe in turn the inference workflows used when running on a CPU, an Edge TPU, and an Edge GPU. The CPU experiments use the trained model without quantization; the Edge TPU and Edge GPU experiments use the trained model with quantization.
II-C1 CPU Inference
The CPU inference workflow, shown in Figure 3, uses the non-quantized model. We feed the required inputs, non-quantized model, and noisy image of size 1×1024×1024×3 to the CPU-based inference API, which returns a de-noised image with dimension 1024×1024.
II-C2 Edge TPU Inference
The Edge TPU inference workflow, shown in Figure 4, applies the Post-Quantization (II-A1) or Quantization-Aware (II-A2) models to preprocessed images. We use customized versions of the Accelerator and Accelerator BasicEngine inference API and TensorFlow Lite API for this purpose. Each input image has shape 1×1024×1024×3, where the final dimension of 3 results from grouping each image with its two adjacent images, as used in TomoGAN to improve output quality. Each image is partitioned into 256 subimages of shape 1×64×64×3, due to Edge-TPU-Compatible restrictions; after inference, processed subimages are buffered in memory and, once all have been processed, are stitched back together to form the de-noised output image.
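The partition, buffer, and stitch steps of this workflow can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the image is a single-channel list of rows rather than a 1×1024×1024×3 tensor, tiles are visited in row-major order, and `denoise_tile` is a hypothetical identity placeholder standing in for one Edge TPU inference call. None of these names come from the Edge TPU API or the TomoGAN code.

```python
def split_into_tiles(image, tile=64):
    """Split a 2-D image (list of rows) into row-major tile x tile subimages."""
    height, width = len(image), len(image[0])
    if height % tile or width % tile:
        raise ValueError("image size must be a multiple of the tile size")
    tiles = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            tiles.append([row[left:left + tile] for row in image[top:top + tile]])
    return tiles


def stitch_tiles(tiles, height, width, tile=64):
    """Reassemble row-major tiles into a height x width image."""
    per_row = width // tile
    image = [[None] * width for _ in range(height)]
    for index, sub in enumerate(tiles):
        top = (index // per_row) * tile
        left = (index % per_row) * tile
        for r, row in enumerate(sub):
            image[top + r][left:left + tile] = row
    return image


def denoise_tile(sub):
    """Hypothetical placeholder for one quantized-model inference call."""
    return sub


size = 1024
# Synthetic pixel values; a real workflow would load a noisy X-ray image.
noisy = [[(r * size + c) % 251 for c in range(size)] for r in range(size)]
tiles = split_into_tiles(noisy)               # 16 x 16 = 256 subimages
buffered = [denoise_tile(t) for t in tiles]   # buffer results in memory
restored = stitch_tiles(buffered, size, size)
```

Because the placeholder is the identity, `restored` equals `noisy` here, which makes the sketch a convenient check that the tiling and stitching order agree.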
II-C3 Edge-GPU Based Inference Workflow
The Edge GPU inference workflow, shown in Figure 5, is similar to the CPU inference workflow, except that it uses a GPU-specific quantized model. Each input image, with shape 1×1024×1024×3, is passed to the Edge GPU-specific quantized model (produced with TensorRT from the non-quantized model), which produces a de-noised image with shape 1024×1024 as output.
II-D Fine-Tuning Workflow
With Post-Quantization-enabled inference, some accuracy may be lost due to model quantization. We observed this effect in our preliminary results: the non-quantized model produced better output than the Post-Quantization Edge-TPU quantized model. To improve image quality in this case, we designed a shallow convolutional neural network (referred to as the Fine-Tune network in the rest of the paper) that is applied to the output of the quantized TomoGAN: see Figure 6. We train this network on output from the Edge-TPU-Compatible model (see §II-C2 and §III-A). The target labels are the target images corresponding to each inferred image from that portion of the training dataset. At the inference stage, we apply the Edge TPU inference workflow (see §II-C2) and use its output as input to the Fine-Tune model. We shall see in §III-C that this Fine-Tune network improves image quality to match that of the images generated by the CPU inference workflow.
In order to evaluate both computing throughput and model inference performance, we conducted a set of experiments on quantized inference and enhanced the output from direct inference with a shallow Fine-Tune network. We compared the performance and evaluated the image quality on CPU, GPU, and TPU devices individually.
We used two datasets for our experiments. Each dataset comprises 1024 pairs of 1024×1024 images, each pair being a noisy image and a corresponding ground-truth image, as described in Liu et al. Ground-truth images are obtained from normal-dose X-ray imaging and noisy images from low-dose X-ray imaging of the same sample. We used one dataset for training and the other for testing.
III-B Performance Evaluation
We evaluated both inference performance (i.e., throughput) on different hardware platforms and the quality of the resulting images. For inference performance, we studied a laptop CPU (§III-B1), the Accelerator and Dev Board Edge TPU (§III-B2), and the TX2 Edge GPU (§III-B3), applying for each the workflow of §II-C to a series of images and calculating the average inference latency.
III-B1 CPU Inference Performance Evaluation
Standard CPU-based experiments were conducted by using the non-quantized model on a personal computer with an Intel Core i7-6700HQ CPU @ 2.60GHz and 32GB RAM, running Ubuntu 16.04 LTS. The non-quantized model takes an average inference time of 1.537 seconds per image: see the CPU result in Figure 7.
III-B2 TPU Inference Performance Evaluation
We evaluated Edge TPU performance on two platforms with different configurations: the Accelerator with an Edge TPU coprocessor connected to the host machine (a laptop with Intel i7 CPU) via a USB 3.0 Type-C (data and power) interface, and the Dev Board with Edge TPU coprocessor and a 64-bit ARM CPU as host. Columns Accelerator and Dev Board of Table I provide timing breakdowns for these two devices. The first component is the time to run the quantized TomoGAN model: 0.435 and 0.512 seconds per image for Accelerator and Dev Board, respectively. The second component, “Stitching,” is due to an input image size limit of 64×64 imposed by the Edge TPU hardware and compiler that we used in this work. Processing a single 1024×1024 image thus requires processing 256 individual 64×64 images, which must then be stitched together to form the complete output image. This stitching operation takes an average of 0.12 and 0.049 seconds per image on the Dev Board and Accelerator, respectively.
The third component, “Fine-Tune,” is the quantized fine-tune network used to improve image quality to match that of the non-quantized model, as discussed in §III-C; this takes an average of 0.070 and 0.166 seconds per image on Accelerator and Dev Board, respectively. We note that model compilation limitations associated with the current Edge TPU hardware and software require us to run the quantized TomoGAN and Fine-Tune networks separately, which adds extra latency for data movement between host memory and Edge TPU. We expect to avoid this extra cost in the future by chaining TomoGAN and Fine-Tune to execute as one model. (While the quantized TomoGAN requires 301 billion operations to process a 1024×1024 image, Fine-Tune takes only 621 million: a negligible 0.2% of TomoGAN.)
| |Accelerator|Dev Board|Jetson Tx2|
|---|---|---|---|
|Quantized TomoGAN (s)|0.435|0.512|0.880|
|Power Consumption (W)|2|2|7.5|
|Peak Performance|4 TOPS|4 TOPS|1.3 TFLOPS|
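The per-image timings quoted in this section can be combined with a few lines of arithmetic. The Python sketch below is illustrative only: it uses the numbers stated in the text, while the dictionary layout and variable names are assumptions made for the example. It reports both end-to-end Edge TPU time (model, stitching, and Fine-Tune) and model-only speedups, without asserting which of the two a given comparison in the paper relies on.

```python
# Seconds per 1024x1024 image, as quoted in the text.
TIMINGS = {
    "Accelerator": {"tomogan": 0.435, "stitching": 0.049, "fine_tune": 0.070},
    "Dev Board": {"tomogan": 0.512, "stitching": 0.120, "fine_tune": 0.166},
}
CPU_SECONDS = 1.537   # non-quantized model on the laptop CPU
TX2_SECONDS = 0.880   # quantized model on the Jetson TX2


def end_to_end(device):
    """Total Edge TPU time: quantized TomoGAN + stitching + Fine-Tune."""
    return sum(TIMINGS[device].values())


totals = {device: end_to_end(device) for device in TIMINGS}

# Speedups over the CPU, with and without the stitching and Fine-Tune steps.
model_only_vs_cpu = {d: CPU_SECONDS / TIMINGS[d]["tomogan"] for d in TIMINGS}
end_to_end_vs_cpu = {d: CPU_SECONDS / totals[d] for d in TIMINGS}

# Accelerator against the TX2, model-only and end-to-end.
accelerator_vs_tx2_model = TX2_SECONDS / TIMINGS["Accelerator"]["tomogan"]
accelerator_vs_tx2_total = TX2_SECONDS / totals["Accelerator"]

# Fine-Tune cost relative to TomoGAN, in operations per image.
ops_ratio = 621e6 / 301e9

for device in TIMINGS:
    print(f"{device}: total {totals[device]:.3f} s, "
          f"model-only speedup vs CPU {model_only_vs_cpu[device]:.2f}x, "
          f"end-to-end speedup vs CPU {end_to_end_vs_cpu[device]:.2f}x")
print(f"Fine-Tune / TomoGAN operations: {ops_ratio:.1%}")
```

With these figures the end-to-end totals are 0.554 s for the Accelerator and 0.798 s for the Dev Board, so stitching and Fine-Tune together add roughly 0.12 s and 0.29 s per image, respectively.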
III-B3 Edge GPU Inference Performance Evaluation
The original TomoGAN can process a 1024×1024 pixel image in just 44 ms on an NVIDIA V100 GPU card. As our focus here is on edge devices, we evaluated TomoGAN performance on the TX2, which has a GPU and is designed for edge computing. Column Jetson Tx2 in Table I shows results. We see an average inference time per image of 0.88 seconds for the TX2. Figure 7 compares this time with the quantized TomoGAN Edge TPU times (not including stitching and fine tuning).
We note that in constructing the model for the TX2, we used NVIDIA’s TensorRT toolkit [27, 36] to optimize the operations of TomoGAN. We also experimented with 16-bit floating point, 32-bit floating-point, and unsigned int8, and observed similar performance for each, which we attribute to the lack of Tensor cores in the TX2’s NVIDIA Pascal architecture for accelerating multi-precision operations.
III-B4 Performance Discussion
We find that inference is significantly faster on the Edge TPU than on the CPU or TX2, and faster on TX2 than on the CPU. Accelerator is faster than Dev Board, because the former has a more powerful host (the laptop with i7) with better memory throughput than the latter (the 64-bit SoC ARM platform). These performance differences may appear small, but we should remember that a single light source experiment can generate thousands of images, each larger than 1024×1024, e.g., 2560×2560, and thus any acceleration is valuable.
III-C Image Quality Evaluation
We used the structural similarity index (SSIM) to evaluate image quality. We calculated this metric for images enhanced by the original TomoGAN, the quantized TomoGAN, and the quantized TomoGAN plus Fine-Tune network, with results shown in Figure 8. We observe that SSIM for the quantized TomoGAN+Fine-Tune is comparable to that of the original (non-quantized) TomoGAN.
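For readers unfamiliar with the metric, the following Python sketch computes a simplified, single-window form of SSIM over two equally sized pixel sequences. It is an assumption-laden illustration rather than the evaluation code used here: the standard formulation averages the same expression over local windows of the image, whereas this version treats the whole image as one window. The constants use the conventional defaults K1 = 0.01 and K2 = 0.03, and the function name and sample pixel values are hypothetical.

```python
def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two equally sized flat pixel sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Hypothetical pixel values: a reference image and a slightly degraded copy.
reference = [10, 50, 90, 130, 170, 210]
degraded = [12, 47, 95, 128, 160, 215]

ssim_identical = global_ssim(reference, reference)
ssim_degraded = global_ssim(reference, degraded)
```

An image compared with itself scores 1, and any degradation lowers the score, which is why a quantized model whose SSIM approaches that of the non-quantized model can be regarded as having recovered the lost quality.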
IV Related Work
The opportunities and challenges of edge computing have received much attention [9, 31]. Methods have been proposed for both co-locating computing resources with sensors, and for offloading computing from mobile devices to nearby edge computers [7, 5, 6].
Increasingly, researchers want to run deep neural networks on edge devices [4, 40, 24], leading to the need to adapt computationally expensive deep networks for resource-constrained environments. Quantization, as discussed above, is one such approach [18, 17, 20, 16]. Others include the use of neuromorphic hardware, specialized software, the distribution of deep networks over cloud, local computers, and edge computers, and mixed-precision computations.
Various deep networks have been developed or adapted for edge devices, including Mobilenet, VGG, and Resnet. However, that work focuses on image classification and object detection. In contrast, we are concerned with image translation and image-to-image mapping to provide an enhanced image. Also, we are applying our image restoration model on edge devices, an approach that has not been discussed in the literature.
Our use of a fine-tuning network to improve image quality is an important part of our solution, allowing us to avoid the excessive training time required for the quantization-aware model. We are not aware of prior work that has used such a fine-tuning network, although it is conceptually similar to the use of gradient boosting in ensemble learning.
We have reported on the adaptation for edge execution of TomoGAN, an image-denoising model based on generative adversarial networks developed for low-dose X-ray imaging. We ported TomoGAN to the Google Coral Edge TPU devices (Dev Board and Accelerator) and NVIDIA Jetson TX2 Edge GPU. Adapting TomoGAN for the Edge TPU requires quantization. We mitigate the resulting loss in image quality, as measured via the SSIM image quality metric, by applying a fine-tune step after inference, with negligible computing overhead. We find that Dev Board and Accelerator provide 3× faster inference than a CPU, and that Accelerator is 1.5× faster than TX2. We conclude that edge devices can provide fast response at low cost, enabling scientific image restoration anywhere.
The work reported here focused on image restoration. However, before images can be enhanced with TomoGAN, they must be reconstructed from the X-ray images, for example by using filtered back projection (FBP). FBP is not computationally intensive: processing the images considered here using the TomoPy implementation takes about 400 ms per image on a laptop with an Intel i7 CPU. Nevertheless, for a complete edge solution, we should also run FBP on the edge device. We will tackle that task in future work.
This work was supported in part by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This research was accomplished while V. Abeykoon was an intern at Argonne National Laboratory under the supervision of Z. Liu. We thank the JLSE management team at the Argonne Leadership Computing Facility and the Google Coral team for their assistance.
- [1] (1997) Arcing the edge. Technical Report 486, Statistics Department, University of California Berkeley.
- [2] (2019) Efficient hybrid network architectures for extremely quantized neural networks enabling intelligence at the edge. CoRR abs/1902.00460.
- [3] (2009) Anomaly detection: a survey. ACM Computing Surveys 41 (3), pp. 15.
- [4] (2019) Deep learning with edge computing: a review. Proceedings of the IEEE 107 (8), pp. 1655–1674.
- [5] (2018) Task offloading for mobile edge computing in software defined ultra-dense network. Journal on Selected Areas in Communications 36 (3), pp. 587–597.
- [6] (2015) Efficient multi-user computation offloading for mobile-edge cloud computing. IEEE/ACM Transactions on Networking 24 (5), pp. 2795–2808.
- [7] (2018) ThriftyEdge: resource-efficient edge computing for intelligent IoT applications. IEEE Network 32 (1), pp. 61–65.
- [8] (2019-Sept.) Edge TPU benchmark.
- [9] Edge-centric computing: vision and challenges. ACM SIGCOMM Computer Communication Review 45 (5), pp. 37–42.
- [10] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- [11] (2019-07) Google Coral.
- [12] (2014) TomoPy: a framework for the analysis of synchrotron tomographic data. Journal of Synchrotron Radiation 21 (5), pp. 1188–1193.
- [13] Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- [14] (2018) Loss-aware weight quantization of deep networks. arXiv preprint arXiv:1802.08635.
- [15] (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
- [16] (2016) Binarized neural networks. In Advances in Neural Information Processing Systems, pp. 4107–4115.
- [17] Quantized neural networks: training neural networks with low precision weights and activations. Journal of Machine Learning Research 18 (1), pp. 6869–6898.
- [18] (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713.
- [19] (2002) Principles of computerized tomographic imaging. Medical Physics 29 (1), pp. 107–107.
- [20] (2016) Bitwise neural networks. arXiv preprint arXiv:1601.06071.
- [21] (2019) Neuromemristive circuits for edge computing: a review. IEEE Transactions on Neural Networks and Learning Systems.
- [22] (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342.
- [23] (2016) DeepX: a software accelerator for low-power deep learning inference on mobile devices. In 15th International Conference on Information Processing in Sensor Networks, pp. 23.
- [24] (2015) Can deep learning revolutionize mobile sensing? In 16th International Workshop on Mobile Computing Systems and Applications, pp. 117–122.
- [25] (2019) Deep learning accelerated light source experiments. In Proceedings of the IEEE/ACM Third Workshop on Deep Learning on Supercomputers (DLS).
- [26] (2019) TomoGAN: low-dose x-ray tomography with generative adversarial networks. arXiv preprint arXiv:1902.07582.
- [27] (2017) 8-bit inference with TensorRT. In GPU Technology Conference.
- [28] (2019-Sept.) NVIDIA benchmarks on deep learning.
- [29] (2019-Sept.) (Website).
- [30] (2018) Value-aware quantization for training and inference of neural networks. In European Conference on Computer Vision, pp. 580–595.
- [31] (2017) The emergence of edge computing. Computer 50 (1), pp. 30–39.
- [32] (2017) A deep cascade of convolutional neural networks for dynamic MR image reconstruction. IEEE Transactions on Medical Imaging 37 (2), pp. 491–503.
- [33] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- [34] (2017) Distributed deep neural networks over the cloud, the edge and end devices. In 37th International Conference on Distributed Computing Systems, pp. 328–339.
- [35] (2019-Sept.) TensorFlowLite for mobile based deep learning.
- [36] (2019-07) TensorRT.
- [37] (2004-04) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
- [38] (2018) Scaling for edge inference of deep neural networks. Nature Electronics 1 (4), pp. 216.
- [39] (2018) Low-dose CT image denoising using a generative adversarial network with Wasserstein distance and perceptual loss. IEEE Transactions on Medical Imaging 37 (6), pp. 1348–1357.
- [40] Edge intelligence: paving the last mile of artificial intelligence with edge computing. arXiv preprint arXiv:1905.10083.
The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. http://energy.gov/downloads/doe-public-access-plan.