## 1 Introduction

Ever since Wilhelm Röntgen shocked the world with a ghostly photograph of his wife’s hand in 1896, the imaging power of X-rays has been exploited to help see the unseen. Their penetrating power allows us to view the internal structure of many objects. Because of this, X-ray sources are widely used in multiple imaging and microscopy experiments, e.g. in Computed Tomography (CT), or simply tomography. A tomography experiment measures a transmission absorption image (called radiograph) of a sample at multiple rotation angles. From 2D absorption images we can reconstruct a stack of slices (tomos in Greek) perpendicular to the radiographs measured, containing the 3D volumetric structure of the sample. Tomography is used in a variety of fields such as medical imaging, semiconductor technology, biology and materials science. Modern tomography instruments using synchrotron-based light sources can achieve measurement speeds of over 200 volumes per second using 40 kHz frame rate detectors [garcia2019]. Tomography can also be combined with microscopy techniques to achieve resolutions down to a single atom using electrons [scott2012electron]. Its experimental versatility has also been exploited by combining it with spectroscopic techniques, to provide chemical, magnetic or even atomic orbital information about the sample.

Nowadays, tomographic analysis software faces three main challenges. 1) The volume of the data is constantly increasing as X-ray sources become brighter and newer generation detectors increase their resolution and acquisition frame rate. 2) Instruments from different facilities (or even from the same one) present a variety of experimental settings that can be exclusive to said instrument, such as the geometry of the measurements, the data layout and format, noise levels, etc. 3) New experimental use cases and algorithms are frequently explored and tested to accommodate new science requisites. These three requirements strongly force tomography analysis software to be HPC and flexible, both in terms of modularity and interfaces, as well as in hardware portability. Currently, TomoPy [Tomopy_Gursoy] and ASTRA [ASTRA_Aarle]

are the most popular solutions for tomographic reconstruction at multiple synchrotron and tabletop instruments. TomoPy is a Python-based open source framework optimized for performance using a C backend that can process a variety of data formats and algorithms. ASTRA is a tomography toolbox accelerated using both GPU and CPU computing and it is also available through TomoPy

[pelt2016integration]. Although both solutions are highly optimized at different levels, they do not provide the level of flexibility required to be easily extendable by third parties regarding solver modifications or accessing specific operators.In this work we present a novel framework that focuses on providing multi-CPU and GPU acceleration with flexible operators and interfaces for both 1-step and iterative tomography reconstruction. The solution is based on Python 3 and relies on CuPy, mpi4py, SciPy and NumPy to provide transparent CPU/GPU computing and innocuous multiprocessing through MPI. The idea is to provide easy HPC support without compromising the solution lightweight so that development, integration and deployment is streamlined. The current operators are based on sparse matrix-vector multiplication (SpMV) computation which benefit from preexisting fast implementations on both CuPy and SciPy and provide faster reconstruction time than direct dense computation [jackson1991selection]. By minimizing code complexity, we can efficiently implement advanced iterative techniques [venkatakrishnan2013plug, mohan2014model] that are not normally implemented for production, also due to their computational complexity; prior implementations could take up to a full day of a supercomputer to reconstruct a single tomogram [venkatakrishnan2016making]. The high level technologies and modular design employed in this project permits the proposed solution to be remarkably flexible, both for exploratory uses (algorithm development or new experimental settings), and also in terms of hardware: we can scale the reconstruction from a single CPU, to a workstation using multiple CPU and GPU processors, to large distributed memory systems. The experimental results demonstrate how the proposed solution can reconstruct datasets of 68 GB in less than seconds, even surpassing the performance of TomoPy’s fastest reconstruction engine by X. All the code of this project is also open source and available at [xpack].

The paper is structured as follows: Section 2 overviews the main concepts regarding tomography reconstruction. Section 3 presents the proposed implementation with a detailed description of the challenges behind its design and the techniques employed, and Section 4 assesses its performance through experimental results. The last section summarizes this work.

## 2 Tomography

Tomography is an imaging technique based on measuring a series of 2D radiographs of an object rotated at different angles relative to the direction of an X-ray beam (Fig. 1). A radiograph of an object at a given angle is made up of line integrals (or projections). The collection of projections from different angles at the same slice of the object is called sinogram (2D); and the final reconstructed volume is called tomogram (3D), which is generally assembled from the independent reconstruction of each measured sinogram.

Physically, the collected data measures attenuation, which is the loss of flux through a medium. When the X-ray beams are parallel along the optical axis, the beam intensity impinging on the detector is given by:

where is the attenuation coefficient as a function of position in the sample, is the input intensity collected without a sample, is the projection after rotating around the axis, and are the coordinates on the detector that sample the data onto detector pixels. The negative log of the normalized data provides the projection, also known as X-ray transform:

with element-wise log and division. The Radon transform (by H.A. Lorentz [bockwinkel1906propagation]) at a fixed is then given by the set of projections for a series of angles :

### 2.1 Iterative Reconstruction Techniques

The tomography inverse problem can be expressed as follows:

The pseudo-inverse described below provides the fastest solution to this problem, and it is typically known as *Filtered Back Projection* in the literature. When implemented in Fourier space, the algorithm is referred to as *non-uniform inverse FFT* or *gridrec*.

The inverse problem can be under-determined and ill-conditioned when the number of angles is small. The equivalent least squares problem is:

where is a preconditioning matrix, with a diagonal matrix, and

denotes a 1D Fourier transform. Note that

does not need to be computed when using the Fubini-Radon operator (see below). The model-based problem is:where is a weighted norm to account for the noise model, may incorporate streak noise removal [maia2010compressive] as well as preconditioning, is a regularization term such as the Total Variation norm to account for prior knowledge about the sample, and is a scalar parameter to balance the noise and prior models.

Many algorithms have been proposed over the years including Filtered Back Projection (FBP), Simultaneous Iterative Reconstruction Technique (SIRT), Conjugate Gradient Least Squares (CGLS), and Total Variation (TV) [bouman1993generalized]. FBP can be viewed as the first step of the preconditioned steepest descent when starting from 0. To solve any of these problems, one needs to compute a Radon transform and its (preconditioned) adjoint operation multiple times.

### 2.2 The Fubini-Radon transform

One of the most efficient ways to perform Radon() is to use the Fourier central slice theorem by Fubini [grunbaum]. It consists on performing first a 2D Fourier transform, denoted by of , and then interpolating the transform onto a polar grid, to finally 1D inverse Fourier transforming the points along the radial lines:

(1) |

We will refer to this approach as the *Fubini-Radon* transform (Fig. 2). Numerically, the slicing interpolation between the Cartesian and polar grid in the Fubini-Radon transform is the key step in the procedure. It can be carried out using a *gridding* algorithm that maintains the desired accuracy with low computational complexity. The *gridding* algorithm essentially allows us to perform a *non-uniform FFT*. The projection operations require arithmetic operations when computed directly using discretization of the line integrals, while the Fubini-Radon version requires + operations, where the first term is due to the slicing operation and the second term is due to the two dimensional FFT. For sufficiently large

, the Fubini-Radon transform requires fewer arithmetic operations than the standard Radon transform using projections. Early implementations on GPUs used ad hoc kernels to deal with atomic operations and load-balancing of the highly non-uniform distribution of the polar sampling points

[maia2010compressive], but became obsolete with new compute architectures. In this work we implement the slicing interpolation using a sparse matrix-vector multiplication. The SpMV and SpMM operations are level 2 and level 3 BLAS functions which have been heavily optimized (see e.g. [yelik]) on numerous architectures for both CPUs and GPUs.## 3 Radon Transform by Sparse Matrix Multiplication

The gridding operation requires the convolution between regular samples and a kernel to be calculated at irregular sample positions, and vice versa for the inverse gridding operation. In order to maintain high numerical accuracy and minimize the number of arithmetic operations, we want to limit the width of the convolution kernel. Small kernel width can be achieved by exploiting the finite sample dimensions ( in the field of view) using a pair of functions so that if , 0 otherwise. By the convolution theorem:

where is the convolution operator, denotes the Hadamard or elementwise product, is called the convolution *kernel*
and the *deapodization* factor.
We choose with finite width , and the *deapodization* factor can be pre-computed as if , 0 otherwise}. Several kernel functions have been proposed and employed in the literature, including truncated Gaussian, Kaiser-Bessel, or an interpolation kernel to minimize the worst-case
approximation error over all signals of unit norm [fessler2003nonuniform].

The Fubini-Radon transform operator and its pseudo-inverse iRadon can be expressed using a sparse matrix (Fig. 3) to perform the interpolation (see also [Ou:EECS-2017-90]):

(2) |

where bold indicates 2D vectors such as , , and is a diagonal matrix to account for the density of sampling points in the polar grid.

Fourier transforms and multiplication of with a sinogram in vector form , , (sparse matrix-vector multiplication or SpMV) produces a tomogram of dimension ; multiplication with a stack of sinograms (sparse matrix-matrix multiplication or SpMM) produces the 3D . The diagonal matrix can incorporate standard filters such as the Ram-Lak ramp, Shepp-Logan, Hamming, or a minimum residual filter based on the data itself [mrfbp]. We can also employ the density filter solution that minimizes the difference with the impulse response (a constant in Fourier space) as , with as the vector of the diagonal elements of . Note that in this case, the matrix can be incorporated directly into for better performance.

The row indices and values of the sparse matrix are related to the coordinates where the kernel windows are added up on the output 2D image as , with kernel window stencil , and , (for ) . The column index is simply given by the consecutive sequence of natural numbers , repeated times, and the row index and value are given by:

where is the broadcasting sum with the window stencil reshaped to dimensions , is the lexicographical mapping from 2D to 1D index, is the kernel function and is the decimal part, and represents the Kronecker product with the unit vector , for a window of width (see Fig. 3). has at most non-zero elements, and the sparsity ratio is at most , or when the kernel is set to 0 at the borders. We account for a possible shift of the rotation axis and avoid FFTshifts in the tomogram and radon spaces by applying a phase ramp as , with when the projected rotation axis matches the central column of the detector. For better performance, FFTshifts in Fourier space are incorporated in the sparse matrix by applying an FFTshift of the coordinate, and by using a checkerboard pattern in the deapodization factor .

### 3.1 Parallel workflow

The Fubini-Radon transform operates independently on each tomo and sinogram, so we can aggregate sinograms into chunks and distribute them over multiple processes operating in parallel. Denoising methods that operate across multiple slices can be handled using halos with negligible reduction in the final Signal-to-Noise-Ratio (SNR), while reducing or avoiding MPI neighborhood communication.

Pairs of sinograms are combined into a complex sinogram which is processed simultaneously, by means of complex arithmetic operations, and is split back at the end of the reconstruction. We can limit the amount of chunks assigned to each process in order to avoid memory constraints. Then, when the data has more slices than what can be handled by all processes, it is divided up ensuring that each process operates on similar size chunks of data and all processes loop through the data. When the number of slices cannot be distributed equally to all processes, only the last loop chunk is split unequally with the last MPI ranks receiving one less slice than the first ones.

The setup stage uses the experimental parameters of the data (number of pixels, slices and angles) and the choice of filters and kernels to compute the sparse matrix entries, deapodization factors and slice distribution across MPI ranks. During the setup stage, the output tomogram is initialized as either a memory mapped file, a shared memory window (whenever possible) or an array in rank-0 to gather all the results.

Several matrix formats and conversion routines exist to perform the SpMV operation efficiently. In our implementation, the sparse matrix entries are first computed in Coordinate list (COO) format which contains a list of (row, column, value) tuples. Zero-valued and out-of-bound entries are removed, and then the sparse matrix is converted to compressed sparse row (CSR) format, where the entries are sorted by column and row, and the row index is replaced by a compressed pointer. The sparse matrix and its transpose are stored separately and incorporate preconditioning filters and phase ramps to avoid all FFTshifts. The CSR matrix entries are saved in a cache file for reuse, with a hash function derived from the experimental parameters and filters to identify the corresponding sparse matrix from file. The FFT plans are computed at the first application and stored in memory until the reconstruction is restarted. When the data is loaded from file and/or the results are saved to disk, parallel processes pre-load the input to memory or flush the output from a double buffer as the next section of the data is processed.

Our implementation uses cuSPARSE and MKL libraries for the SpMV and FFT operations, MPI for distributed parallelism through shared memory when available, or scatterv/gatherv and non-blocking double buffers for I/O and MPI operations. All these libraries are accessed through Python, NumPy, CuPy, SciPy and mpi4py; we also rely on h5py or tifffile modules to interface with data files. This framework also provides the capability to call TomoPy and ASTRA solvers on distributed architectures using MPI.

We used this framework to implement the most popular algorithms described in Sec. 2.1, namely FBP, SIRT, CGLS, and TV. To achieve high throughput, our implementation of SIRT uses the Hamming preconditioning and the BB-step acceleration [bb-step], which provides 10-fold convergence rate speedup and makes it comparable to the conjugate gradient method but with fewer reductions and lower memory footprint. The CGLS implementation is based on the conjugate gradient squared method [cgs], and the TV denoising employs the split-Bregman [goldstein2009split] technique.

## 4 Experiments and Results

The experimental evaluation presented herein is two-fold. We assess the performance of our implementation on both shared and distributed memory systems and on CPU and GPU architectures, and we also study how it compares to TomoPy, the state-of-the-art solution on X-ray sources, in terms of running time and quality of reconstruction.

We employ two different datasets for this analysis. The first one is a simulated Shepp-Logan phantom generated using TomoPy, with varying sizes to analyze the performance and scalability of the solution. The second one is an experimental dataset generated at Lawrence Berkeley National Laboratory’s Advanced Light Source during an outreach program with local schools out of a bread-crumb inserted at the micro-tomography beamline 8.3.2. The specifics of the experiments were: 25 keV X-rays, pixel size 0.65 microns, 200ms per image and 1313 angles over 180 degrees. The detector consisted of 20 micron LuAG:Ce scintillator and Optique Peter lens system with Olympus 10x lens, and PCO.edge sCMOS detector. The total experiment time, including camera readout/overhead, was around 6 minutes, generating a sinogram stack of dimension = .

We use two different systems for this evaluation. The first is the Cori supercomputer (Cori.nersc.gov), a Cray XC40 system comprised of 2,388 nodes containing two 2.3 GHz 16-core Intel Haswell processors and 128 GB DDR4 2133 MHz memory, and 9,688 nodes containing a single 68-core 1.4 GHz Intel Xeon Phi 7250 (Knights Landing) processor and 96 GB DDR4 2400 GHz memory. Cori also provides 18 GPU nodes, where each node contains two sockets of 20-core Intel Xeon Gold 6148 2.40 GHz, 384 GB DDR4 memory, 8 NVIDIA V100 GPUs (each with 16GB HBM2 memory). For our experiments, we use the Haswell processor and the GPU nodes.^{1}^{1}1Cori configuration page: https://docs.nersc.gov/systems/cori/
The second system employed is CAM, a single node dual socket Intel Xeon CPU E5-2683 v4 @ 2.10GHz with 16 cores 32 threads each, 128 GB DDR4 and 4 NVIDIA K80 (dual GPU with 12 GB of GDDR5 memory each).

The first experiment reports the performance results and scaling studies of our iRadon implementation and of TomoPy-Gridrec, when executed on both Cori and CAM, over the simulated dataset. The primary objective is to compare their scalability using both CPUs and GPUs. We executed both algorithms at varying levels of concurrency using a simulation size of . On Cori, we used up to 8 Haswell nodes in a distributed fashion, only using physical cores in each node. On CAM, we run all the experiments on a single node, dual socket. The speedup plots are shown in Fig. 4. The reported speedup is defined as where is the time it takes to run the parallel algorithm on processes with an input size of , and is the time for the best serial algorithm on the same input.

First, we notice that the iRadon algorithm running on GPU has a super-linear speedup on both platforms. This is unusual in general, however possible in some cases. One known reason is the cache effect, i.e. the number of GPU changes, and so does the size of accumulated caches from different GPUs. Specifically, in a multi-GPU implementation, super-linear speedup can happen due to configurable cache memory. In the CPU case, we see a close to linear speedup. On CAM, the performance decreases by the use of hyperthreads when running on 64 cores. Finally, there is a clear difference in speedup results compared to TomoPy-Gridrec implementation. We believe that the main difference here is due to the fact that TomoPy only uses a multithreaded implementation with OpenMP, while our implementation relies on MPI.

We also evaluate our implementation with different simulation sizes maintaining the number of angles and rays () and varying the number of slices (), running using 64 CPUs and 8 GPUs. Performance results in slices per second are shown in Fig. 5. One can notice that the GPU implementation of iRadon presents an increase in performance when the number of GPU increases. This is a known behaviour for GPU performance when the problem is too small compared to the capabilities of the GPU, and the device is not completely saturated with data, not taking full advantage of the parallelized computations. For both platforms, our CPU implementation of iRadon performs significantly better than TomoPy.

In terms of raw execution time, TomoPy-Gridrec outperforms our iRadon implementation by a factor of X when running on a single CPU on Cori. On the other hand, the iRadon execution time using a 256 CPU cores on Cori is seconds, outperforming TomoPy by a factor of X. Our GPU implementation of iRadon runs in seconds using 16 V100 GPUs, which improves the CPU implementation (1 core) by a factor of X, and runs X faster compared with 256 CPU cores.

The last experiment focuses on the analysis of the different algorithms implemented in this work, in terms of execution time and reconstructions quality. Fig. 7 shows the reconstruction of slices of the bread-crumb experimental dataset on CAM (32 CPUs and 8 GPUs), for 3 different implemented algorithms: iRadon, SIRT, and TV, and also for TomoPy-Gridrec and TomoPy-SIRT. All iterative implementations (SIRT and TV) run for 10 iterations. Our iRadon implementation presents the best execution time for CPU ( sec), while on GPU, it runs in sec. Our SIRT implementation outperforms TomoPy’s by a factor of X. We report the SNR values (and corresponding execution times) of our implemented algorithms in Table 7, using a simulation dataset of size (256, 1024, 1024). We can observe how both SIRT and TV present the best results in terms of reconstruction quality.

Fig. 8 shows a reconstructed slice of the bread-crumb data using the iRadon and the TV algorithms, along with a zoomed-in region of the same slice. The difference in reconstruction quality is minor in this case due to the dataset presenting high contrast and a large number of angles. Still, in the zoomed-out image we can appreciate higher contrast fine features on the TV reconstruction. Sparser datasets would be analyzed in the future to assess the performance of TV and iterative solutions on more challenging scenarios.

It is important to remark that all the execution times presented in this section refer to the solver portion of the calculations. When running the TV algorithm on the complete bread-crumb data using 8 GPUs on CAM, for example, the solver time takes approximately of the total execution time ( minutes). Most of the remaining time is taken by I/O () and gather ().

## 5 Conclusions

This paper presents a novel solution for tomography analysis based on fast SpMV operators. The proposed software is implemented in Python relying on CuPy, SciPy and MPI for high performance and flexible CPU and GPU reconstruction. As opposed to existing solutions, the software presented tackles the main requirements existing in tomography analysis: it can run over most hardware setups and can be easily adapted and extended into new solvers and techniques, while greatly simplifying deployment at new beamlines. The experimental results of this work demonstrate the remarkable performance of the solution, being able to iteratively reconstruct datasets of 68 GB in less than seconds using 256 cores and in less than seconds using 16 GPUs. For the simulated datasets analyzed, the proposed software outperforms the reference tomography solution by a factor of up to X, while running on CPU. When reconstructing the experimental data, our implementation of the SIRT algorithm outperforms TomoPy by a factor of X running on CPU. The code of this project is also open source and available at [xpack].

As future work, we will employ CPU and GPU co-processing, Block Compressed Row (BSR) format and sparse matrix-dense matrix multiplication (SpMM) to enhance the throughout of the solution. We will also explore the Toeplitz approach [Ou:EECS-2017-90]

, which permits combining the Radon transform with its adjoint into a single operation, while also avoiding the forward and backward 1D FFTs. Half-precision arithmetic is also probably sufficient to deal with experimental data with photon counting noise obtained with 16 bits detectors and can further improve performance by up to an order of magnitude using tensor cores. Generalization to cone-beam, fan beam or helical scan geometries using generalized Fourier slice methods

[cone-beam] will also be subject of future work. We will also explore the implementation of advanced denoising schemes based on wavelets or BM3D [bm3d], combining the operators presented in this work.## Acknowledgments

Work by S. M. was supported by Sigray, Inc. A. T. work was in part sponsored by Sustainable Research Pathways of the Sustainable Horizons Institute. P. E. was funded through the Center for Applied Mathematics for Energy Research Applications. T. P. is supported by the grant “Scalable Data-Computing Convergence and Scientific Knowledge Discovery”, program manager Dr. Laura Biven. D. P. is supported by the Advanced Light Source. This research used resources of the National Energy Research Scientific Computing Center (NERSC) and the Advanced Light Source, U.S. DOE Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.

Comments

There are no comments yet.