Magnetic resonance imaging (MRI) is a powerful medical imaging modality for visualizing tissue contrast, and it is especially attractive for pediatric imaging as it does not involve ionizing radiation. Nonetheless, imaging speed is a major limitation of MRI. For example, MRI scans are orders of magnitude longer compared to computed tomography. A typical MRI exam consists of several sequences, each providing different diagnostic image contrast. As each sequence takes several minutes, the entire MRI exam can take 30 minutes to one hour. This drives the cost up as only a small number of patients can be accommodated into the daily schedule. In pediatrics, patients can become uncooperative over the course of the exam, necessitating the use of anesthesia .
Several advances to MRI acquisition have led to markedly faster scanning. Parallel Imaging (PI) uses multiple spatially sensitive receiver coils placed around the anatomy and exploits the redundancy in the receive channels to reduce the number of measurements required for reconstruction [2, 3]. As the scan time is directly proportional to the number of acquired samples, the speedup is proportional to the number of channels. An additional avenue for faster scanning is through Compressed Sensing (CS) [4, 5]. In CS, signal sparsity is used to reduce the number of samples required to perform reconstruction. Sparsity typically increases with the number of dimensions, making CS well-suited to MRI experiments involving image or signal dynamics. Since PI and CS exploit fundamentally different properties of the image and imaging system to reduce sampling requirements, they can be combined to realize order-of-magnitude reductions in scan time [6, 7]. We refer to their synergistic combination as PICS.
T2 signal relaxation is an intrinsic MRI property of tissue, providing invaluable diagnostic image contrast. Volumetric T2-weighted imaging is typically obtained using a 3D fast spin-echo (3D-FSE) MRI acquisition [8, 9], but the acquisition is lengthy and often results in image blurring due to the signal relaxation during the data collection . T2 Shuffling  is a particular PICS-based technique that mitigates these effects by modifying the 3D-FSE acquisition. From the data, a time-series of 3D images are reconstructed, representing the temporal signal relaxation during the FSE acquisition. As a result, the contrast in the time series of images transitions from proton-density weighting to heavy T2 weighting while not suffering from 3D-FSE blurring. In particular, this technique is very promising for reducing the exam time in routine imaging applications, since the images can be reformatted and sliced into arbitrary orientations as well as displayed at different levels of T2 contrast. This is shown in Figure 1. For example, T2 Shuffling has been applied to pediatric knee MRI , where small structures in the knee require oblique reformats at a high resolution. Other clinical applications are also leveraging 4D PICS-based approaches with promising initial results [13, 14, 15].
T2 Shuffling and other PICS approaches require iterative solvers for reconstruction . Therefore, the reduction in scan time is offset by the increased computational complexity. As the acquisition and reconstruction methods grow in complexity, the computational requirements continue to grow, leading to reconstructions that can take tens of minutes to several hours [16, 17, 14]. In order for these approaches to gain clinical traction, the image reconstruction pipeline should maintain a low latency, so that the images are available at the scanner before the patient leaves the table. In our experience, conservative latency requirements for clinical integration are about two minutes.
Motivated by the need for fast PICS reconstructions, a number of general-purpose software packages have been built as a mechanism to translate the work to the clinic [18, 19, 20, 21, 22, 23, 24]. As a result, several applications of PICS have seen success in a clinical research setting [25, 26]. Initial efforts focused on accelerating 3D reconstructions using CPUs and GPUs, achieving reconstruction times on the order of one minute [27, 28, 29, 30]. Extensions to cloud-based processing  have enabled reconstructions that take advantage of scalable compute. More recently, the focus has shifted to 2D and 3D dynamic imaging [32, 15, 33, 13]. The latter often involves significantly higher memory/computation requirements.
In this work, we build a distributed PICS reconstruction using multiple high-performance computers and deploy it in a clinical setting. We specialize the reconstruction to T2 Shuffling, which despite its demonstrated benefits, has seen slow adoption in practice due to its prohibitively long reconstruction time [12, 34]. We holistically analyze the entire reconstruction pipeline and the system requirements for successful deployment at the Lucile Packard Children’s Hospital. We describe the approaches we took to optimize the computational components, as well as analyze the scalability of the distributed computing. Consequently, we reduce the end-to-end reconstruction time for T2 Shuffling from eight minutes on a single shared-memory system to under 90 seconds on a four-node distributed system. As a result of the fast processing, we are able to improve image quality robustness by incorporating additional signal processing steps into the reconstruction pipeline [35, 36, 37] while maintaining a suitable latency. We present results for pediatric knee MRI, where the T2 Shuffling acquisition and distributed reconstruction is used to replace multiple conventional 2D FSE scans, enabling short exams that can be accommodated within the existing schedule . Furthermore, we analyze the multiple-scanner setup in the hospital by simulating reconstruction events using real scanner schedule data. We present different configurations and sizes needed to support a target processing latency for our institution.
Ii MRI Reconstruction
The MRI reconstruction problem is often formulated as an unconstrained inverse problem using a prescribed forward model that describes the imaging experiment.
Ii-a Parallel Imaging and Compressed Sensing
A forward model for the MR sensing operator is that an image is multiplied by the local spatial sensitivity profile of each coil, Fourier transformed, and discretely sampled[2, 38]. Let
represent a vectorized 3D image withvoxels. The data formation process can be described for each coil as
where the diagonal matrix is the local spatial sensitivity profile of the coil, is the number of coils, is the 2D discrete Fourier transform, and the truncated diagonal matrix is a binary mask that is active at the locations of acquired k-space samples. The measurements from the coil are given by . Complex white Gaussian noise is added to each measurement. For convenience, we represent , , and as -dimensional, with zeros inserted in the locations that were not sampled. The full forward model is thus described by , where , , and . It is possible to incorporate into the forward model prior knowledge about the system operating conditions. For example, a “soft-SENSE” formulation  can be used to improve robustness to motion and to an under-prescribed field of view – two realities that are seen in clinical practice . In the soft-SENSE formulation, sets of sensitivity maps are represented and image components are reconstructed. Typically, the first image set captures the relevant image information, while the second (and more) image set captures data inconsistencies. In this case, , , is the image component, and is the set of sensitivity maps.
where is a regularization functional designed to promote sparsity. The optimization problem (2) can be iteratively solved with the Fast Iterative Soft-Thresholding Algorithm (FISTA) . Two main operations are sequentially applied in each iteration. In the first, the gradient of the data consistency term is computed and applied, where represents the complex Hermitian transpose (adjoint) of the matrix . In the second, the proximal operator of is applied. When is an -like functional, this corresponds to soft thresholding. The iteration complexity is dominated by these two components.
For 3D Cartesian imaging, it is possible to decouple the acquired data into independent 2D slices by performing an inverse Fourier transform in the fully sampled readout direction . Then, each slice is independently reconstructed and later rejoined into a 3D volume. Although this approach reduces the effective SNR of the reconstruction, it enables parallel processing of each slice, say through multi-CPU or multi-GPU architectures.
Ii-B T2 Shuffling
In T2 Shuffling, the temporal signal dynamics due to T2 relaxation during a 3D-FSE acquisition are incorporated into the imaging formulation . Specifically, it aims to recover the T2 relaxation curve at each voxel, resulting in a set of 3D images. The time series of images are represented by , where is the timepoint (called an echo time, TE) and is the number of TEs (called the echo train length). Each TE image is independently passed through the forward model and independently sampled, producing the measurement vector , where represents the measurements from the TE and coil.
As the relaxation curves are smooth and correlated, the temporal signal dynamics are well approximated by a low-dimensional subspace . Thus, the temporal signal evolution of the voxel, , is approximated as
where, is a matrix consisting of the basis vectors of the subspace of size and are the basis coefficients of . The vector represents the basis images consisting of voxels each. Thus, images are jointly reconstructed. The forward model, accounting for the subspace representation, is shown in Figure 2.
T2 Shuffling further exploits spatio-temporal correlations between neighboring voxels as locally lower rank (LLR) [42, 43]. This is accomplished by imposing a nuclear norm constraint  on spatio-temporal blocks extracted from the basis coefficient images . More precisely, let
where extracts a block of voxels centered around the voxel from each coefficient image and reshapes it into a matrix, and
is the regularization parameter. The nuclear norm proximal map for each patch is solved via singular value soft thresholding, and can be computed independently for each matrix. For practical reasons, the coefficient images are randomly shifted each iteration and non-overlapping blocks are extracted; this stochastic approach reduces blocking artifacts due to boundary conditions without computing fully overlapping blocks.
where the operators are defined in a manner consistent with (1). After solving (5), a spatial wavelet soft-threshold is applied to each basis coefficient image for additional spatial denoising. The time-series of images are then recovered through (3). The T2 Shuffling reconstruction has also been extended to include partial Fourier sampling [45, 46], in which approximate conjugate symmetry is used to reduce the number of acquired measurements. Following the reconstruction, a Homodyne filter  that accounts for low-frequency phase is applied to .
Ii-C Clinical Integration
To run T2 Shuffling in a clinical workflow, it is necessary to integrate (5
) into a full reconstruction pipeline. This includes processing the vendor-formatted raw data, computing the basis based on the scan parameters, estimating coil sensitivity maps, applying gradient non-linearity corrections, converting the result to DICOM images, and transferring the images to the Picture Archiving and Communication System (PACS). The full pipeline must maintain a low latency so that the images are available at the scanner console before the patient leaves the table. The pre-processing and post-processing overhead is a non-negligible portion of the overall latency; these aspects are often not accounted for when comparing algorithm runtimes.
Iii Baseline Implementation
In what follows, we describe the main computational steps of the baseline implementation of T2 Shuffling, with which we later compare our optimized solution in Section VI. The baseline implementation used the Berkeley Advanced Reconstruction Toolbox (BART) [22, 48], a software programming library intended for high-dimensional PICS applications. We describe the pipeline in three sections: (A) a pre-processing stage that computes all the information needed for the iterative slice-by-slice reconstruction; (B) the iterative reconstruction stage that computes the basis coefficient images; and (C) post-processing stage that processes the reconstruction and feeds the resulting DICOM images back to the scanner console and to PACS.
Conversion to BART Format. As a first step the acquired k-space data, consisting of an array of interleaved single precision complex floats, are converted from the vendor-proprietary format to the BART format. The measurements are stored as a sparse array consisting of 3D spatial frequency index, TE index, and coil channel index.
Re-sort and Average. From the sparse array, the measurements corresponding to the central 24-by-24-by-24 window are extracted from each channel, averaged across the TEs, and reformatted as a 4D array. We refer to this low-resolution k-space as the calibration data.
Coil Compression. From the calibration data, a coil compression matrix is computed and applied to the raw data . The purpose of coil compression is to reduce the computational cost of large receive arrays in the reconstruction by linearly combining the data from multiple coils into a smaller number of virtual coils. This process is done with little to no loss of information.
Scale Factor. As the receive gain changes from scan to scan, a scaling normalization factor is computed and applied. The calibration data are inverse Fourier transformed and root-sum-of-square coil-combined to form a low-resolution image. The scaling is computed as the percentile of the coil-combined image.
ESPIRiT Maps. From the low-res data, soft-SENSE coil sensitivity profiles are estimated using ESPIRiT , an auto-calibrating PI method.
Create Basis. The RF refocusing flip angle schedule is used to compute a temporal basis that matches the particular acquisition . The basis is computed by simulating an ensemble of signal evolution curves with the extended phase graph algorithm 
, performing a singular value decomposition (SVD) on the matrix of simulated signals, and retaining thelargest singular vectors.
Re-Sort and Project Data. Using the basis, the sparse-array data are re-sorted and projected onto the basis, yielding a 5D data set of size , where is the 3D image matrix size, is the number of coil-compressed channels, and is the subspace size.
Compute Mask. As a last step, the spatial sampling patterns are computed from the sparse array and stored as a zero-filled, one-zero mask for each time point.
Iii-B PICS Reconstruction
Prior to the iterative reconstruction, a 1D unitary inverse Fourier transform is applied in the readout () direction of the 5D projected k-space data to decouple the readout slices. The reconstruction is then performed independently on each slice, and multiple slices are reconstructed in parallel based on the number of available CPU cores.
Linear Recon. As the baseline implementation resulted in latency times exceeding 5 minutes, a fast and low-resolution linear reconstruction (with Tikhonov regularization in place of LLR regularization) is first solved to provide image quality feedback to the technologist before releasing the patient.
Detailed Recon. The full iterative reconstruction is performed to solve the optimization problem in (5) using FISTA. Following the iterative reconstruction, the slices are re-joined into a 5D volume for post-processing.
Process TE. In the post-processing step, the 5D reconstructed volume is denoised with a single iteration of 3D wavelet soft-thresholding to reduce noise amplification. After, the reconstructed images are multiplied by the sensitivity maps and Fourier transformed to synthesize fully reconstructed k-space data. The data are back-projected with the basis at each desired echo time point, and Homodyne filtering is applied. The result is corrected for gradient non-linearities, converted into the DICOM format, and transferred back to the scanner and to the hospital PACS. In the baseline implementation, three time points are sequentially processed, corresponding to a proton-density weighted image (TE = 21 ms), intermediate weighted image (TE = 50 ms), and T2 weighted image (TE = 90 ms).
In this section, we describe strategies for speeding up the baseline implementation of T2 shuffling, as well as improvements to image quality which are enabled by the speedups. We identify and parallelize several serial bottlenecks in the BART toolbox (e.g. FFT phase modulation, normalization, and basis generation). We also exploit inter-operator fusion to improve cache behavior and eliminate costly transpose operations. Some computations are also completely removed from the pipeline using pre-computation and/or code refactoring. We also distribute the computation across multiple machines by exploiting parallelism across image slices in the PICS portion, and exploiting parallelism across time points during post-processing. This improved performance allows us to add additional signal processing steps to the pipeline, which improves image quality.
Iv-a Improving Multi-core Parallelism
FFTmod. Before running FFT, the image data must be un-centered, either through a circular shift or through a phase modulation, multiplying each data element to add a shifting phase that is relative to its position in the multi-dimensional array. While this operation is embarrassingly parallel, its parallelization requires several modulo divisions for the computation of each element’s index. Fortunately, when the dimensions of the array are even, for consecutive elements the phase alternates between and , or and . As an optimization, we pre-select even-sized dimensions for images and optimize the phase modulation by parallelizing over blocks of consecutive elements. For each block, we compute the position and phase for the initial element, and alternate the phase for the remaining elements.
Sort. The baseline code in BART used Quicksort in the process of estimating scaling normalization factors. We replace the call to Quicksort with GNU parallel sort.
Basis Generation. As the basis is computed by simulating an ensemble of signal time courses, it is embarrassingly parallel across the different signal simulations. We take advantage of this fact during pre-processing to simulate each signal evolution on a single CPU core, and parallelize the simulations across all the cores of the machine. In each simulation, the signal value is propagated through the time points by repeatedly applying matrix operations that represent the acquisition dynamics . We further reduce the simulation time by pre-computing common matrix terms that do not change during the simulation.
Multi-dimensional Operations. We also looked at BART itself to find opportunities for performance improvement through more efficient uses of parallelism. From the baseline BART implementation, we modified the parallel md_nary operation, which applies an arbitrary function to many elements in a multi-dimensional array, to be a flat parallel loop instead of a nested for loop. We also added a configuration capability for the loop chunk size and tuned it for coil compression and data re-sorting.
Process TE. We reduce the post-processing time by operating directly on the coil-compressed data, reducing coil-by-coil processing (e.g. Homodyne filter, FFT). Additionally, we process the three time points in parallel, allocating one third of the machine’s cores to each TE.
Iv-B Fusing Forward and Adjoint Linear Operators
Due to the nature of FISTA, the forward and adjoint operators are often applied in sequence (i.e. ). We can identify several optimizations that are not otherwise possible when applying these operators naively.
The first optimization which takes advantage of this is already included in the baseline . Noticing that operates across time and that and operate across space, the operators commute, i.e. Thus, the forward and adjoint operation is equivalent to Defining as the space-time kernel matrix associated with the spatial frequency, we pre-compute and store as a dense array. In this work, we further exploit the property that each is real-valued and symmetric. We modify the storage to include only the real part of the matrix which reduces memory loads of this matrix by a factor of two. We also take advantage of the symmetry by loading only half of the matrix, cutting down on memory accesses by an additional factor of two.
We also take advantage of the structure of the joint application of the forward 2D Fourier transform followed by inverse 2D Fourier transform (i.e. ). This operation is shown visually in Figure 3. The 2D Fourier transforms are applied typically by doing a set of 1D FFTs on the rows, followed by a matrix transposition, then a set of 1D FFTs again on the rows, followed by a matrix transposition to return the data to its original format. The lines in the figure correspond to the matrix ordering after each operation (ie row or column). Since the application of is an independent computation across voxels, it is agnostic to the ordering. Therefore, we can leave the argument in transposed form in between the forward and inverse Fourier transform computations, thus avoiding two of four transpose operations. This optimization is shown as the shortcut path in the figure.
We also fuse and reorder loops in the forward and adjoint operators in order to increase reuse in cache and reduce traffic to and from memory. This is shown in Algorithm 1. We improve reuse in cache by fusing the sensitivity map application of each image row to the FFT computations in both the forward (lines ) and adjoint (lines ) phases. We also move the loop over sensitivity maps to the outermost position (line ). This means that instead of computing intermediate vectors, we take one sensitivity map and compute only intermediate vectors at a time, reducing this portion of the working set by a factor of which allows for better use of cache.
Finally, we parallelize the new optimized forward and adjoint operators across image panels (lines , , and ), which are groups of roughly consecutive rows assigned to a single thread. This creates two levels of coarse-grained parallelism: (1) the volume split across slices, and (2) across panels within a slice. The second level of parallelism wasn’t initially necessary due to the amount of parallelism available across slices, however it became useful for two reasons. First, by increasing the number of cores applied to each problem, we increase the likelihood that intermediate results will fit in on-chip memory. The benefit of this effect is shown in more detail in Section VI. Secondly, this allows us to scale to more cores than the number of slices, which in our experiments is 288.
Iv-C Removing Unnecessary and/or Redundant Computations
Fast Linear Reconstruction. The original pipeline included a fast linear reconstruction to provide quick feedback to the technologist. As the overall latency of the detailed reconstruction became nearer to that of the linear reconstruction due to our optimizations, we removed the linear reconstruction altogether.
Basis and Space-time Kernel Caching. The basis operator is determined by the target scan anatomy (e.g. knee), the refocusing RF pulse flip angle schedule, and the associated echo spacing between the pulses during the data acquisition. As it does not depend on the particular patient scan, it can be stored and re-used for future scans that have the same acquisition parameters. To accomplish this, we calculate a SHA-1 hash of the list of flip angles, echo spacing, and target anatomy. Before computing the basis, we search for the hash in a look-up table to determine if the basis has been previously computed. If it is not found, we compute and store the matrix in the look-up table. In a similar fashion, we pre-compute and store the space-time kernel in a separate look-up table. The SHA-1 hash of consists of the SHA-1 of the basis and of the sampling pattern. Since the basis and the space-time kernel are 7.1 KB and 7.7 MB, respectively, we found the load times to be negligible.
Single-pass Locally Lower Rank Regularizer (LLR). Recall from Section II that T2 Shuffling uses LLR regularization to exploit spatio-temporal correlation within the basis coefficient images. The LLR operates on patches of the coefficient images, which are size and non-overlapping in our implementation. In each solver iteration, the coefficient images are first circular-shifted randomly in both dimensions by a small amount less than the patch size. Then, the patches are extracted from the coefficient images and rearranged into a 2D matrix per patch with one column per coefficient image and rows. Common factors are extracted across coefficient images using the SVD, and any singular values below a threshold are set to zero before regenerating the coefficient images back from the factors.
We modified this procedure work in a single pass through the coefficient images in order to reduce memory traffic and unnecessary copies. Each thread allocates a temporary buffer of size ahead of time. Then we pass over the image, extracting data according to the parameters of the circular shift, without explicitly performing the shift. We perform the SVD threshold operation in the thread-local buffer, and scatter the data back to the original coefficient image when all the work for that patch is complete.
Iv-D Image Quality Improvements
The reduction in processing time enabled additional signal processing steps to be added with the aim of improving robustness across a diverse set of pediatric patients. Specifically, we used Auto-ESPIRiT , a parameter-free ESPIRiT implementation in which the internal parameters are chosen based on Stein’s Unbiased Risk Estimate (SURE). Auto-ESPIRiT produces more consistent sensitivity maps in cases with a high degree of data inconsistency, e.g. due to motion or noise. To use Auto-ESPIRiT, an accurate noise estimate is needed. To this end, we incorporated channel noise pre-whitening [35, 36] as a step in pre-processing. As the SURE-based parameter estimate can be computed for each slice independently, it is performed immediately before the PICS reconstruction.
Iv-E Distributed Implementation
|Spatial matrix size||288260240|
|Coil compressed channels||7|
|Echo train length||80|
|Basis coefficient images||4|
|LLR patch size||1212|
|LLR regularization parameter||0.0125|
|ESPIRiT calibration size||2882424|
Figure 4 depicts our implementation of the distributed reconstruction pipeline. We distribute The PICS portion and the post-processing over multiple machines in the cluster, while the pre-processing happens on a single machine.
The PICS portion is divided across slices with approximately equal numbers of slices sent to all machines. In order to do this, we must send the pre-processed SENSE maps, k-space data, sampling pattern, basis matrix, and other files to each machine. To do this, the files are loaded into Python from /dev/shm, communicated using the mpi4py library, and written to /dev/shm on the destination machines. We use /dev/shm extensively as the system uses the filesystem to store intermediate results from multiple calls to BART. The Broadcast primitive is used for all data, but it is is not always needed in cases where a scatter would suffice, such as for the k-space data and sensitivity maps. However we use broadcast for convenience, and to avoid local volume slicing overheads. Once the data is distributed, the PICS computation proceeds in a hybrid fashion with MPI parallelism across slices and OpenMP parallelism within each slice.
The post-processing portion is distributed over only three machines using mpi4py, as there are currently three time points being post-processed. In order to do this, the machines each need all the k-space data and the updated SENSE maps to proceed with post-processing generating the final images. We use mpi4py again to achieve this data transfer. We use the Allgather primitive to collect the data to these three machines. It is not strictly necessary to use Allgather, as only three machines need all the data. However we do this for convenience and find that the cost of this primitive is relatively small in comparison to the entire reconstruction runtime for the small cluster configurations we are considering.
Pre-processing is left to run on a single machine so as to not over-complicate the implementation. Our plan is to expand this distributed framework to other reconstruction methods with a similar parallelism profile. Such a framework would be complicated by excessive use of distributed primitives within areas such as pre-processing that we’d like to make easily customizable by other MRI reconstruction experts.
V Experimental Setup
With institutional review board approval and informed consent/assent, pediatric patients presenting with indications of knee pain were scanned using the T2 Shuffling sequence on a 3 Tesla MRI scanner (GE Healthcare, Waukesha) at Lucile Packard Children’s Hospital over a one month period. A representative case from the scans was used to carry out the design and analysis.
V-a Scan Sequence and Reconstruction Parameters
The acquisition and reconstruction dimensions and sizes are shown in Table I.
The T2 Shuffling sequence modified a 3D-FSE acquisition to randomly shuffle and re-acquire phase encode ( and ) measurements during the echo trains. Scans were performed sagittally at 0.6 mm isotropic resolution with a 16-channel GEM-Flex receive coil array. The pulse repetition time and echo spacing were 1200 ms and 6 ms, respectively, and the echo train length was 83. The data from the first three echoes covered the central portion of k-space and were used to perform ESPIRiT calibration . The remaining echoes were used in the iterative reconstruction. Each TE was under-sampled with a variable density Poisson Disc distribution, at an acceleration factor of 139 per time point. Collectively, this represented a relative acceleration of 1.7 and apparent acceleration of 6.9, as defined in Table 1 of . A partial Fourier acceleration of 0.65 was used in the slice direction . The acquisition 3D array was , and the total scan time was 7 minutes and 2 seconds.
The distributed reconstruction was controlled through a Python script, using mpi4py for distributed communication. Each underlying component of the reconstruction was implemented using BART. In the reconstruction, soft-SENSE maps were used and subspace coefficient images were reconstructed. The data were coil compressed from channels to virtual channels. The fully sampled readout direction () was inverse Fourier transformed, and each slice was independently solved with iterations of FISTA. The data were normalized based on the maximum signal value of the raw data and the LLR regularization parameter was held fixed at across all scans. Homodyne filtering was applied to each reconstructed virtual coil image and coil-combined with root sum-of-squares. Gradient non-linearity correction and DICOM generation was performed using the Orchestra reconstruction library (GE Healthcare) and the Ox-BART post-processing code, available at http://github.com/mrirecon/ox-bart.
V-B Experimental Compute Cluster
Each compute node has two (dual-socket) Intel Xeon111Intel and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Other names and brands may be claimed as the property of others. Platinum 8180 processors 2.5 GHz. with 28 cores and 38.5 MB total L3 cache per socket. Each node contains 192 GB DDR4 RAM, arranged as 12 DIMMs of 16 GB each, in order to fully utilize the system’s available memory channels. The nodes are fully-connected dual-rail (one per socket) Intel Omni-Path Architecture (OPA) 100-gigabit switch.
Our code is compiled in two parts. Part 1, the BART portion, is compiled with the GNU compiler (gcc 4.8.5), uses FFTW version 3.3.6-pl2 compiled with AVX2 support for optimized transforms, and the Intel Math Kernel Library (MKL) version 18.0.0 for optimized linear algebra routines. Part 2, our optimized kernels described in Section IV, are compiled with the Intel compiler (icc) with the following flags -O3 -ipo -no-prec-div -fp-model fast=2 -xCORE-AVX512 -mkl -qopenmp -std=c99 -vec-threshold0 -static. We use MKL for optimized FFT and optimized linear algebra routines. These two parts are linked together with icc.
We run experiments using /dev/shm as temporary storage space for data between multiple invocations of the BART library. For the timing experiments, we pre-load the input data into /dev/shm and do not count this transfer time in the runtime results. Similarly, we write the output DICOM files to /dev/shm instead of the NFS in order to remove the consideration of network transfer time from our experiments as we observed NFS transfer times to vary widely depending on other jobs running in the cluster.
|Conversion to BART Format||2||1||1||1|
|Re-sort and Average||8||4||4||4|
|Re-sort and Project Data||4||1||1||1|
|Linear Recon. Process DICOM||23||-||-||-|
|Slice and Detailed Recon.||271||-||-||-|
|Slice and ESPIRiT Maps||-||13||24||9|
|No Slice and Detailed Recon.||-||94||106||28|
|Join Reconstructed Slices||-||2||6||1|
|Process All TEs||70||19||18||9|
V-C Hospital Compute Cluster
A four-node system was deployed locally at the hospital, with each node matching the specifications and compilations used in the experimental compute cluster. The compute cluster is located in a room adjacent to the scanner and connected to the hospital network and MRI scanners with a 1 Gigabit network link. The cluster itself used a 10 Gigabit switch in place of the OPA switch for communication between compute nodes during reconstruction.
Vi-a Single-machine Optimization
In Table II, we quantify222Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.,333Performance results are based on testing as of March 31st, 2018, and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. the benefits of applying the optimizations described in Section IV, averaged over five identical reconstructions. The main improvements to the pre-processing step are a result of the changes to FFTmod and to parallelizing and caching the basis. Accounting for data slicing and joining, the detailed PICS reconstruction is sped up from 275 seconds to 109 seconds on one machine. The post-processing is reduced from 70 to 18 seconds, benefiting from parallel processing. Overall, the optimizations led to a 3x speedup. Due to the processing gain, we were able to add reconstruction quality improvements at an additional cost of 30 seconds on a single machine.
Figure 5 shows the effect of increasing the number of cores per problem while running many independent PICS slice reconstructions in parallel. All cores on the machine are fully occupied for each point on the graph. In the case of one core per problem, there are 56 independent problems being solved simultaneously. For two cores per problem there are 28 independent simultaneous problems, and so on. As we increase the number of cores per problem, it becomes more likely that the working set derived from processing the slices fits in L2 cache, which is 1 MB/core for our configuration. The Forward/Adjoint operation per-iteration runtime, LLR per-iteration runtime, and the full solver per-iteration runtime is shown. The Forward/Adjoint runtime and full solver runtimes improve performance with increasing cores up to four due to beneficial caching effects from decreasing the number of simultaneous problems running on the system. Beyond four cores, the overall throughput decreases due to increasing overheads of parallelism as well as the impact of serial portions elsewhere in the solver. As a result, we use four cores per problem in the optimized implementation for the remainder of the paper.
Vi-B Multi-machine Scaling
Figure 6 shows the runtime of T2 Shuffling from one to 12 machines. The PICS portion is the only part that is distributed, so therefore this portion scales with the number of machines added. However, the computation is currently limited to one machine for pre-processing and three machines for post-processing so these components quickly become the bottleneck as we increase the number of machines. We didn’t choose to distribute the pre and post-processing further so as to hide the distributed computing details from future users of the software who are more productive in the shared-memory BART environment. The four-node configuration, which we deployed in the hospital, provides a total compute time of roughly 85 seconds.
Since the collectives are implemented in Python using mpi4py, we expect that much of the runtime is being spent loading data into Python from /dev/shm and preparing it for the collective operation, not in the communication itself. This is consistent with the experimental results, in which we see the total time spent in the distributed collective operations Broadcast and Allgather stays roughly constant as we increase the number of machines from two to 12.
We also see an improvement in the post-processing time from one to three nodes, as we distribute the post-processing to three machines. There is also a sharp improvement in PICS runtime between 10 and 11 nodes. This is the result of load balancing. Since we are doing 14 slices at a time per machine and 288 total slices, 10 nodes gives us exactly 140 slices executed at once on the cluster which completes in two steps, followed by one extra step with one slices on 8 of the 10 machines. At 11 nodes, there are no slices left over after two steps, so the performance improves sharply due to the improved overall load balance.
Vi-C Hospital Cluster Statistics
Over the one-month deployment period, 35 pediatric knee T2 Shuffling scans were performed. The mean, standard deviation, and maximum times for the main steps of the processing pipeline across these scans are shown in TableIII. The increase in runtime relative to that reported in Table II is attributed to latency associated with reading the raw data and writing the DICOM images across the 1 Gigabit network. Using 10 Gigabit, as opposed to 100 Gigabit Omnipath for inter-node communication, added about one second to both the Broadcast and the Allgather, which is a small portion of the overall runtime. There is an additional overhead of 10-20 seconds while writing to PACS, which is common to all offline reconstruction implementations.
|Mean (s)||Std. Deviation (s)||Max (s)|
Vi-D Multi-Scanner Simulation
Our current hospital cluster deployment serves four scanners with the 4-node cluster, and is used for T2 Shuffling knee exams. Here we consider the possible tradeoffs involved with serving many exams from many scanners using compute clusters, if we were to use our current distributed system and reconstruction method for all exams from these scanners. This serves as an upper bound on the amount of computation over the course of the day.
Figure 8 shows the schedule from one day out of the one-month study at our institution. Each row represents an MRI scanner, and each entry shows the duration of a scan. The dotted box shows a zoomed-in portion from the schedule to better visualize the individual scans. With the exception of one scanner, there is high utilization across the scanners.
Using this data, we can simulate the case where one or more compute clusters are reconstructing images from the scanners. To do so, we first use the image sizes to estimate the runtimes if they were all to employ PICS-based acquisitions and reconstructions. We ignore the issues with transferring data to the clusters and only look at cluster computation time. We also estimate the cost of distributed primitives in the cluster by scaling the measured values from our experiments by the total size of the volume compared to our experimental data. For simplicity, we assumed that the pre-processing and post-processing times scaled as , and that the PICS time scaled as ; that is, the processing is dominated by the FFT factors and linearly scales with the number of concurrent slices. We fit the constants for each stage to match the processing time for the T2 Shuffling knee dataset (Table II).
We used the simpy discrete event simulation library to model contention on the cluster. Our model has one process representing the hospital, which creates a new reconstruction process for each scan reconstruction task as they arrive. Each scan reconstruction task tries to access a priority resource representing the clusters. Earlier scans are given higher priority. Once a cluster resource is acquired, the reconstruction process holds the resource for a number of time units equal to our runtime estimates for that particular scan. The time spent waiting and reconstructing are logged for each task throughout the simulated day.
Figure 9 shows the result of the simulation. The average wait time is the average time that each reconstruction task spends waiting for the cluster resource. Average total time is the average compute time plus the wait time. The average wait time is reduced significantly when using two clusters rather than one, but there are diminishing returns to adding more clusters after this. We can also examine how to choose the best cluster configuration for a fixed number of machines. Assuming we have eight machines, it makes the most sense, in terms of average and max total reconstruction time, to configure these as two clusters with four nodes per cluster. For this configuration, we can achieve clinically feasible average times (¡2min), average wait times near zero, with slightly higher max time of roughly 200 seconds due to a few scans with large reconstruction times.
Developments to PICS-based acquisitions have shown great promise at reducing MRI scan times, but the reconstruction times have remained a barrier to clinical adoption. In this work we developed a distributed reconstruction architecture for T2 Shuffling and achieved latency times that make its clinical use feasible. We have already deployed the system at Lucile Packard Children’s Hospital for pediatric knee imaging with great initial success. Our approach is well-suited for other PICS acquisitions, in which the slices can be independently reconstructed, and it presents an avenue for other reconstruction formulations to benefit from the distributed optimizations.
Our optimization of the pipeline led to an overall improvement in performance by after speedups and adding additional operations to improve image quality. A speedup came from single-node improvements and a speedup came from distributed parallelization on a 4-node cluster. By enabling clinically feasible reconstruction times, a larger number of clinical protocols could potentially leverage PICS acquisitions, leading to shorter overall exam times. Thus, we expect better scanner utilization as more patients can be accommodated into the same schedule. Dedicated compute clusters could support this increased utilization without the need for transferring data outside the hospital network.
-  S. S. Vasanawala, M. T. Alley, B. A. Hargreaves, R. A. Barth, J. M. Pauly, and M. Lustig, “Improved pediatric mr imaging with compressed sensing.” Radiology, vol. 256, no. 2, pp. 607–16, 8 2010.
-  K. P. Pruessmann, M. Weiger, M. B. Scheidegger, and P. Boesiger, “Sense: sensitivity encoding for fast mri.” Magn Reson Med, vol. 42, no. 5, pp. 952–62, 11 1999.
-  M. A. Griswold, P. M. Jakob, R. M. Heidemann, M. Nittka, V. Jellus, J. Wang, B. Kiefer, and A. Haase, “Generalized autocalibrating partially parallel acquisitions (grappa).” Magn Reson Med, vol. 47, no. 6, pp. 1202–10, 6 2002.
-  M. Lustig, D. Donoho, and J. M. Pauly, “Sparse mri: The application of compressed sensing for rapid mr imaging.” Magn Reson Med, vol. 58, no. 6, pp. 1182–95, 12 2007.
-  M. Lustig, D. L. Donoho, J. M. Santos, and J. M. Pauly, “Compressed sensing mri,” IEEE Signal Proc. Mag., vol. 25, no. 2, pp. 72–82, 3 2008.
-  K. G. Hollingsworth, “Reducing acquisition time in clinical mri by data undersampling and compressed sensing reconstruction.” Phys Med Biol, vol. 60, no. 21, pp. R297–322, 11 2015.
-  A. C. Yang, M. Kretzler, S. Sudarski, V. Gulani, and N. Seiberlich, “Sparse reconstruction techniques in magnetic resonance imaging: Methods, applications, and challenges to clinical adoption.” Invest Radiol, vol. 51, no. 6, pp. 349–64, 2016.
-  J. Hennig, A. Nauerth, and H. Friedburg, “RARE imaging: A fast imaging method for clinical mr,” Magn Reson Med, vol. 3, no. 6, pp. 823–833, 12 1986.
-  J. P. Mugler, “Optimized three-dimensional fast-spin-echo mri,” J Magn Reson Imaging, vol. 39, no. 4, pp. 745–767, 1 2014.
-  R. F. Busse, A. C. S. Brau, A. Vu, C. R. Michelich, E. Bayram, R. Kijowski, S. B. Reeder, and H. A. Rowley, “Effects of refocusing flip angle modulation and view ordering in 3d fast spin echo.” Magn Reson Med, vol. 60, no. 3, pp. 640–9, 9 2008.
-  J. I. Tamir, M. Uecker, W. Chen, P. Lai, M. T. Alley, S. S. Vasanawala, and M. Lustig, “T2 shuffling: Sharp, multicontrast, volumetric fast spin-echo imaging,” Magn Reson Med, 1 2016.
-  S. Bao, J. I. Tamir, J. L. Young, U. Tariq, M. Uecker, P. Lai, W. Chen, M. Lustig, and S. S. Vasanawala, “Fast comprehensive single-sequence four-dimensional pediatric knee mri with t2 shuffling,” Journal of Magnetic Resonance Imaging, 2016.
-  T. Zhang, S. Chowdhury, M. Lustig, R. Barth, M. Alley, T. Grafendorfer, P. Calderon, F. Robb, J. Pauly, and S. Vasanawala, “Clinical performance of contrast enhanced abdominal pediatric mri with fast combined parallel imaging compressed sensing reconstruction.” J Magn Reson Imaging, vol. 40, no. 1, pp. 13–25, 7 2014.
-  J. Y. Cheng, K. Hanneman, T. Zhang, M. T. Alley, P. Lai, J. I. Tamir, M. Uecker, J. M. Pauly, M. Lustig, and S. S. Vasanawala, “Comprehensive motion-compensated highly accelerated 4d flow mri with ferumoxytol enhancement for pediatric congenital heart disease.” J Magn Reson Imaging, vol. 43, no. 6, pp. 1355–68, 2016.
-  L. Feng, L. Axel, H. Chandarana, K. T. Block, D. K. Sodickson, and R. Otazo, “Xd-grasp: Golden-angle radial mri with reconstruction of extra motion-state dimensions using compressed sensing.” Magn Reson Med, vol. 75, no. 2, pp. 775–88, 2 2016.
-  Y. Wang and L. Ying, “Compressed sensing dynamic cardiac cine mri using learned spatiotemporal dictionary.” IEEE Trans Biomed Eng, vol. 61, no. 4, pp. 1109–20, 4 2014.
-  R. M. Lebel, J. Jones, J.-C. C. Ferre, M. Law, and K. S. Nayak, “Highly accelerated dynamic contrast enhanced imaging.” Magn Reson Med, vol. 71, no. 2, pp. 635–44, 2 2014.
-  M. Uecker, P. Virtue, F. Ong, M. J. Murphy, M. T. Alley, S. S. Vasanawala, and M. Lustig, “Software toolbox and programming library for compressed sensing and parallel imaging,” in ISMRM Scient. Workshop on Data Sampling and Image Recon., Sedona, 2013.
-  J. Gai, N. Obeid, J. L. Holtrop, X.-L. Wu, F. Lam, M. Fu, J. P. Haldar, W.-m. W. Hwu, Z.-P. Liang, and B. P. Sutton, “More IMPATIENT: A gridding-accelerated toeplitz-based strategy for non-cartesian high-resolution 3d mri on gpus,” Journal of Parallel and Distributed Computing, vol. 73, no. 5, pp. 686–697, 5 2013.
-  M. S. Hansen and T. S. Sorensen, “Gadgetron: an open source framework for medical image reconstruction.” Magn Reson Med, vol. 69, no. 6, pp. 1768–76, 6 2013.
-  F. Knoll, A. Schwarzl, C. Diwoky, and D. Sodickson, “gpunufft: An open source gpu library for 3d regridding with direct matlab interface,” in Proc. Intl. Soc. Mag. Reson. Med. 22, Milan, 2014, p. 4297.
-  M. Uecker, F. Ong, J. I. Tamir, D. Bahri, P. Virtue, J. Y. Cheng, T. Zhang, and M. Lustig, “Berkeley advanced reconstruction toolbox,” in Proc. Intl. Soc. Mag. Reson. Med. 23, Toronto, 2015, p. 2486.
-  T. Ou, F. Ong, M. Uecker, L. Waller, and M. Lustig, “Nufft: Fast auto-tuned gpu-based library,” in Proc. Intl. Soc. Mag. Reson. Med. 25, Honolulu, Hawaii, 2017, p. 3807.
-  J.-M. . M. Lin, “Python non-uniform fast fourier transform (pynufft): An accelerated non-cartesian mri package on a heterogeneous platform (cpu/gpu),” Journal of Imaging, vol. 4, no. 3, p. 51, 3 2018.
-  K. King, “Ge signa pulse of mr,” pp. 48–51, 2016.
-  C. Forman, J. Wetzl, C. Hayes, and M. Schmidt, “Magnetom flash: The magazine of mri,” pp. 8–13, 3 2016.
-  S. S. Stone, J. P. Haldar, S. C. Tsao, W.-M. W. M. Hwu, B. P. Sutton, and Z.-P. P. Liang, “Accelerating advanced mri reconstructions on gpus.” J Parallel Distrib Comput, vol. 68, no. 10, pp. 1307–1318, 10 2008.
-  D. Kim, J. Trzasko, M. Smelyanskiy, C. Haider, P. Dubey, and A. Manduca, “High-performance 3d compressive sensing mri reconstruction using many-core architectures,” Int J of Biomed Imaging, vol. 473128, 6 2011.
-  M. Murphy, M. Alley, J. Demmel, K. Keutzer, S. Vasanawala, and M. Lustig, “Fast l-spirit compressed sensing parallel imaging mri: scalable parallel implementation and clinically feasible runtime.” IEEE Trans Med Imaging, vol. 31, no. 6, pp. 1250–62, 6 2012.
-  P. Després and X. Jia, “A review of gpu-based medical image reconstruction,” Physica Medica: European Journal of Medical Physics, vol. 42, pp. 76–92, 10 2017.
-  H. Xue, S. Inati, T. S. Sorensen, P. Kellman, and M. S. Hansen, “Distributed mri reconstruction using gadgetron-based cloud computing.” Magn Reson Med, vol. 73, no. 3, pp. 1015–25, 3 2015.
-  G. T. Kowalik, J. A. Steeden, and V. Muthurangu, “Implementation of a generalized heterogeneous image reconstruction system for clinical magnetic resonance,” Concurrency and Computation: Practice and Experience, vol. 27, no. 6, pp. 1603–1611, 4 2015.
-  S. T. Ting, R. Ahmad, N. Jin, J. Craft, J. S. d. Silveira, H. Xue, and O. P. Simonetti, “Fast implementation for compressive recovery of highly accelerated cardiac cine mri using the balanced sparse model,” Magnetic Resonance in Medicine, vol. 77, no. 4, pp. 1505–1515, 4 2017.
-  J. I. Tamir, M. Lustig, V. Taviani, M. T. Alley, B. Perkins, L. Hard, D. Mortensen, and S. S. Vasanawala, “Targeted rapid knee mri exam using t2 shuffling,” in Proc. Intl. Soc. Mag. Reson. Med. 25, Honolulu, Hawaii, 2017, p. 0231.
-  P. B. Roemer, W. A. Edelstein, C. E. Hayes, S. P. Souza, and O. M. Mueller, “The nmr phased array,” Magnetic Resonance in Medicine, vol. 16, no. 2, pp. 192–225, 10 1989.
-  M. S. Hansen and P. Kellman, “Image reconstruction: an overview for clinicians.” J Magn Reson Imaging, vol. 41, no. 3, pp. 573–85, 3 2015.
-  S. S. Iyer, F. Ong, and M. Lustig, “Towards a parameter free espirit: Soft weighting for robust coil sensitivity estimation,” in Proc. Intl. Soc. Mag. Reson. Med. 24, Singapore, 2016, p. 1093.
M. Uecker, P. Lai, M. J. Murphy, P. Virtue, M. Elad, J. M. Pauly, S. S. Vasanawala, and M. Lustig, “Espirit-an eigenvalue approach to autocalibrating parallel mri: Where sense meets grappa.”Magn Reson Med, vol. 71, no. 3, pp. 990–1001, 3 2014.
-  R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. of the Royal Statistical Society, Series B, vol. 58, pp. 267–288, 1994.
-  A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 1 2009.
-  C. Huang, C. G. Graff, E. W. Clarkson, A. Bilgin, and M. I. Altbach, “T2 mapping from highly undersampled data by reconstruction of principal component coefficient maps using compressed sensing.” Magn Reson Med, vol. 67, no. 5, pp. 1355–66, 5 2012.
-  J. D. Trzasko and A. Manduca, “Local versus global low-rank promotion in dynamic mri series reconstruction,” in Proc. Int. Symp. Magn. Reson. Med 19, Montreal, 2011, p. 4371.
-  T. Zhang, J. M. Pauly, and I. R. Levesque, “Accelerating parameter mapping with a locally low rank constraint.” Magn Reson Med, vol. 73, no. 2, pp. 655–61, 2 2015.
-  J.-F. . F. Cai, E. J. Candes, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
-  C. Santa-Marta, J. Lafuente, J. Vaquero, P. Garcia-Barreno, and M. Desco, “Resolution recovery in turbo spin echo using segmented half fourier acquisition.” Magn Reson Imaging, vol. 22, no. 3, pp. 369–78, 4 2004.
-  J. I. Tamir, M. Uecker, S. S. Vasanawala, and M. Lustig, “T2 shuffling with partial fourier acquisition and reconstruction,” in ISMRM Scientific Workshop on Data Sampling and Image Recon., Sedona, 2016.
-  D. C. Noll, D. G. Nishimura, and A. Macovski, “Homodyne detection in magnetic resonance imaging.” IEEE Trans Med Imaging, vol. 10, no. 2, pp. 154–63, 1991.
-  M. Uecker and J. I. Tamir, “Bart: version 0.4.02,” 11 2017.
-  T. Zhang, J. M. Pauly, S. S. Vasanawala, and M. Lustig, “Coil compression for accelerated imaging with cartesian sampling.” Magn Reson Med, vol. 69, no. 2, pp. 571–82, 2 2013.
-  M. Weigel, “Extended phase graphs: dephasing, rf pulses, and echoes - pure and simple.” J Magn Reson Imaging, vol. 41, no. 2, pp. 266–95, 2 2015.