In the last decades, Medicine has benefited enormously from medical imaging techniques. Starting with radiography and echography, and more recently X-ray tomography, nuclear medicine and magnetic resonance imaging, these modalities are now of widespread use in most hospitals all over the world. Since the advent of these techniques in clinical practice, huge effort has been carried out by scientists worldwide to make them faster, easier to use for the practitioner, more accurate and, overall, safer and more convenient for the patient. However, this comes at the price of higher computational costs to the extent that processing power is often the bottleneck.
Graphics Processing Units (GPUs), when used as general-purpose computing devices, may overcome these computational needs, as several software packages (see section 2 for a number of examples) have shown. Nevertheless, software development over GPU devices is not straightforward: parallelization itself poses a significant paradigm change from traditional (single-threaded) programming but, remarkably as well, GPUs are autonomous devices and, as such, work asynchronously from the CPU. Hence, communication and synchronization issues arise likewise, hindering the use of GPUs to develop new medical imaging algorithms, especially in the earliest research stages, in spite of their unquestionable value.
In addition, the sake of simplicity often causes developers to focus on only one device type (e.g. CPU vs. GPU), which, in turn, tends to result in augmented effort if multiple device types need to be supported afterwards. In this regard, it should be stressed that interesting processing devices beyond CPUs and GPUs are available, namely DSPs, FPGAs, or many-core systems such as Intel Phi. OpenCL comes up here as a natural answer to the problem of supporting multiple device types but, when compared to its direct competitor in the GPU field, nVidia CUDA, it tends to fall behind in terms of simplicity and maturity.
The aforementioned scenario, combined with the need to set up a complex working environment which is inherent to asynchronous programming, give rise to more particular, although closely related, issues: device selection and initialization, data streaming to/from the computing device, synchronization and communication between threads, memory management and others; those isssues bring forth an additional overhead that prevents researchers and practitioners from focusing on the algorithm development itself (the so-called kernels).
This paper presents a C++ programming framework which intends to solve the forenamed issues while being:
Device agnostic: being OpenCL-based, no knowledge about the final computing device type is needed at development time.
Straightforward to use: device discovery and initialization, data transfers to and from the device and the file system, kernel loading and compiling, etc. are handled automatically. Therefore, the developer focuses on kernel development while housekeeping chores are kept to a minimum.
Coherent in code writing and data manipulation: data structures need to be defined only once, independently of the final computing device types. Accordingly, host and device code is unique for all device types.
Easy to integrate in existing projects: device-dependent traits are confined in a class specifically designed for this purpose, while already existing classes for data processing and storage are sufficiently generic so as to support manipulation of many ordinary data types —such as real and complex N-dimensional arrays, among others—, but also easy to override and to derive from, if needed.
The rest of this paper is organized as follows: section 2 reviews the main existing approaches to the problem of using dedicated devices for medical image processing, with an special focus on image reconstruction; section 3 describes the architecture of ; section 4 shows a case study where is used to reconstruct an MRI image from fully sampled -space; we also include performance comparison of with two other well-known packages as well as a comparison with OpenMP; section 5 concludes the paper.
2 Related Work
The use of dedicated devices for medical image processing has drawn notable attention in recent years, especially with the evolution of GPUs towards general-purpose computing devices (GPGPU). This section briefly describes the most remarkable works in this direction we are aware of.
The Berkeley Advanced Reconstruction Toolbox (BART)  is a programming library as well as a collection of command line tools to carry out MR image reconstruction . In the paper, the authors refer to this toolbox as a “framework for iterative image reconstruction”, so the terms library and framework are used interchangeably albeit differences exist between both. This toolbox has a twofold orientation, namely, it serves for rapid algorithm prototyping and testing and it also facilitates the integration of these algorithms into the data acquisition and reconstruction pipeline. The library also provides support for parallel computation using multiple CPUs or, for some of the available algorithms, GPUs. CUDA is, however, the only supported API for the latter option, so dedicated processing is limited to GPU-type devices from nVidia.
Another interesting and very popular framework is “Gadgetron”; this framework allows the user to construct streaming data processing pipelines, where data is processsed by a series of modules (the gadgets). The pipeline is configured by means of an XML description file. Gadgetron is based on a client-server architecture, in which the latter is responsible for performing the actual reconstruction computations while the former is in charge of sending raw data to the server and receiving reconstructed images from it, either for visualization or storage. Client-server communication works over TCP/IP, using an ad-hoc protocol which typically conveys the input or output data and the XML description file. This XML file contains a reader and a writer definition as well as a list of gadgets which conform the processing pipeline. Gadgets are specified by their name, i.e. the library in which they are located and the actual class that implements it; properties can also be defined for parameter setup within each gadget. Gadgetron uses Linux container technology  to provide its network services, and is distributed through the Docker platform . Since it is conceived as a network service with a client-server approach, Gadgetron deployment should be carefully planned in advance. However, Gadgetron allows the practitioner to build gadgets as standalone applications as well, for cases where no client/server approach or pipeline specification is needed. Parallel computing is supported in CPU as well as in GPU , again, via CUDA.
Other well-known software packages for MR acquisition and reconstruction are Yarra ,
CS_MoCo_LAB , Codeare , MIRT  and GPUNufft . However, we do not mean to be exhaustive since our aim is not to overcome these platforms in terms of performance, but we intend to solve the issues referred to in the introduction while complying with relevant requirements for frameworks (also itemized in that section) and maintaining performance comparable to other state of the art frameworks.
Apart from these initiatives, general-purpose software packages are available, such as Matlab or Python, which have been adapted to GPU computing by adding a CUDA interface (by means of the Parallel Computing Toolbox in Matlab, and of the PyCUDA package for Python). For the latter, PyOpenCL  wraps the OpenCL API so it can be invoked from Python.
With some similarities with the former libraries, GPI  can also be mentioned. It is a Python-based graphical programming interface used for image reconstruction. The interface is typically used to create workflows for data manipulation and algorithm testing; built-in functionality can be extended by means of simple code interfaces. Recently, some functionality of BART seems to be integrated within a GPI node library.
Overall, Gadgetron seems a nice tool with a similar motivation as BART. i.e., rapid algorithm prototyping and testing, although Gadgetron is more easily seen as a framework while BART may be thought of as a library or a collection of tools. In this sense, has been designed as a framework, with development of new algorithms in mind. On the other hand, computation on dedicated devices does not seem to be at the core of either Gadgetron or BART. Alas, being CUDA-based, algorithm development differs substantially depending on the final computing device (which can only be the CPU or a nVidia GPU). Hence, both BART and Gadgetron include non-interchangeable versions of the same algorithm written for CPU and GPU. This limitation lies deeply within the CUDA fundamentals and is, therefore, difficult to overcome. Typically, it manifests itself as the need of two different data structures to hold the same data: one for host data (for CPU computing) and another for device data (for GPU computing). Gadgetron, for instance, has hoNDArray for host and cuNDArray for GPU, as does Matlab with its GPU-specific GPUArray. BART, on the other hand, relies on conditional compiling to select the desired variant at compile-time. The use of OpenCL overcomes this limitation, as the same source code and data structures serve their purpose independently of the final computing device (which is chosen at run-time), but this particular piece of freedom comes at the price of complexity, due to the broader range of circumstances which must be taken into account: multiple vendors with their own extensions and limitations, multiple device types, multiple device instances, intercommunication between devices and the host, choosing the optimum device for a particular task, and so on. takes care of most of them.
OpenCL is a sound, versatile and efficient platform to support computing on dedicated devices, and is an industry standard supported by the Khronos Group . However, it is more complex and less mature than CUDA, two facts that hinder its widespread usage in the medical image computing community, despite its advantages. has been designed to fill in this gap by managing the complexity referred to above and offering developers direct interaction with device kernels, host processes and data so that they can focus on algorithm development exclusively.
3 Package architecture
This section describes the main features and the internal architecture of . The complete source code is available at  as a git repository and as a downloadable archive.
3.1 Features of
is designed as a set of C++ classes which provide three prominent services to the developer: computing device management, data storage and manipulation, and algorithm handling.
3.1.1 Computing device management
Straightforward device selection
One of the most visible drawbacks of OpenCL is that the computing device is not handled automatically, while it is the case with CUDA. OpenCL also introduces the concept of platforms to address the problem of supporting different hardware vendors. It is the developer’s job to retrieve and traverse the list of available platforms and devices, and then choose the most appropriate one. With OpenCLIPER, the desired device is selected according to a combination of criteria (e.g. device class, vendor, supported OpenCL version, etc) in a single call.
Widespread computing device support
OpenCLIPER targets OpenCL version 1.2, which is widely adopted by most significant vendors as of today. Although more recent versions include interesting features, such as shared virtual memory, we plan to continue supporting 1.2 at a minimum.
3.1.2 Data storage and manipulation
Automatic use of mapped/pinned host memory
Data transfers to/from the computing device automatically use mapped host memory (also called pinned memory ), so the transfer speed is maximized. Data objects are transferred in a single call.
Support for highly-heterogeneous data
Other image processing and reconstruction approaches, such as Gadgetron, typically provide a data structure to contain a single N-dimensional array of a given data type (integer, float, etc) the transfer of which, in addition, should be carried out component-wise manually. However, many application fields manage data which does not fit comfortably in that abstraction. Consider, for example, several 3D+t non-uniformly sampled volumes coming from several sensors, each one with various synchronization signals of their own. If this was the case, one would have to create their own ad-hoc data structures and handle data transfers component-wise manually as well. OpenCLIPER, however, is agnostic about the internal data organization, so arbitrarily complex data can be transferred to/from the device in a single call. Both OpenCL buffers and images are supported.
Predictable data layout in device memory
OpenCLIPER keeps track of data positions and sizes automatically, in both the host and the computing device. A single data set is always aligned and contiguous, even though it is highly heterogeneous. Data can be processed in batches because the starting position and the size of each component is known in advance and it is readily available from OpenCL kernels.
Common data formats supported out-of-the-box
OpenCLIPER supports many image formats as well as volumes in Matlab’s .mat format. For unsupported data formats, new readers and writers may be added by deriving from the appropriate class. In addition, OpenCLIPER supports volumes in raw data format as well.
3.1.3 Algorithm handling
Automatic kernel loading and compiling
In OpenCLIPER, kernel loading and compiling, as well as error checking and reporting, is automatically done in a single call, even for kernels scattered among multiple source files. It also keeps track of kernels at run time so the developer has them readily available by name.
Standard interface to kernel launching
OpenCLIPER is designed to ease the job of launching kernels; to this end, the concept of process is introduced. Processes have a customizable but standard calling interface, so that any kernel is treated by means of the three following actions: set its input and output data sets, set its parameters, and launch. Since it is quite common that kernels need some kind of initialization before doing the actual job, and this initialization may be costly, OpenCLIPER separates kernel initialization from kernel launching, so that performance is not compromised. Processes can be chained at no cost (setting outputs from a stage as inputs for the next one is zero-copy).
Single source code
Being OpenCL-based, the same source code is valid for every computing device (as long as there is an OpenCL implementation for it), as opposed to CUDA-based approaches.
3.2 Description of
CLapp is the main class of . It acts as an interface to the OpenCL device and, as such, stores information about the current OpenCL platforms and devices, their associated command queues and so on. It also contains the list of data objects to be processed in the computing device. CLapp also deals with memory management of the computing device, as well as with data transfers to/from it.
Data is an abstract class. Objects of its derived classes contain a set of images, volumes or -dimensional data in the most general case. Each (derived from) Data object consists of one or more NDArray objects, which need not be equal in sizes or dimensions. This way, a single acquisition containing heterogeneous data may be stored in a single object.
Data is meant to be specialized for the problem to be solved, so final implementations may contain as many as necessary data types, be them real (such as for echo or radiograph data) or complex (such as for -space data from an MRI scanner). provides two general-purpose specializations out-of-the-box called XData (for data with a direct physical interpretation) and KData (for -space data). Readers and writers are provided for the most commonly-used image formats (via the DevIL library ), as well as for Matlab’s .mat format.
NDArray represents a signal, image, volume or any -dimensional data structure. It is defined as an abstract class because it does not know the specific (machine) data type in which data are stored. Thus, it is limited to common attributes and methods for all possible data types, and to creation of objects of the actual container class, which is ConcreteNDArray. This way, implementors of new Data specializations do not have to deal with the deep details of data storage and class coupling is kept to a minimum.
ConcreteNDArray is a templated class which stores the actual data. It is not meant to be used by users but just to contain raw data and details which depend on the machine data type.
Process is an interface to algorithms which process data. As such, it is an abstract class which developers should derive from to implement their own processes. Its purpose is to provide a standard front-end to algorithms, so that no prior knowledge about their internals is needed to start working with them.
Overall, is designed to simplify the handling of computing device, data and algorithms while still allowing fast processing of data. To this end, pointers to raw data are always available, both in host memory and in device memory, and process launching is designed to be as immediate as possible.
While having three classes involved in data storage (Data, NDArray and ConcreteNDArray) may seem redundant, it has a number of advantages: first, data sets containing information from different sources may be easily built; second, low-level particularities related to machine data representation are isolated from the user (e.g. user classes do not need to be templated), thus enhancing class decoupling and, third, support for new data types may be more easily added without affecting the rest of the code.
3.3 Usage of
A typical usage of follows these simple steps (see listing 1):
Create a CLapp object.
Initialize the computing device. The user may impose a specific platform, device or device class, according to several criteria, or leave the choice to the framework.
Load and compile OpenCL kernels. This is done in a single call; the framework takes care of creating program objects, indexing kernels, compiling and linking for the chosen device and making them available to the user. If compilation fails, the error log is automatically at user disposal.
Create an input Data object. Data may be loaded from files (JPEG, TIFF, PNG, and other usual image formats are supported, as well as Matlab’s .mat format) or just be allocated empty in memory.
Create an output Data object.
Register input and output in the CLapp object. This also sends the data to the computing device.
Create process object(s) and set their input/output to some of the previously registered Data objects.
Initialize and launch processes. As previously discussed, initialization is separated from process launching. Parameters in each call may be varied.
Get data from the computing device back to the host device.
Save processed data. As when loading input data, commonly used image formats and Matlab’s .mat files are supported.
Clean up allocated resources.
Listings 1, 2, 3 and 4 show a simple intensity inverting filter built with (using namespace directives are used for brevity). It can be seen that most of the described steps translate in one or two lines of code in listing 1, and that C++ exceptions are used to ease error handling.
Listings 2 and 3 define the process class while listing 4 describes the actual computation to be executed on each image pixel, in this case, an inversion of contrast. As can be observed, ’s main goal is to allow developers to focus on their core work, i.e. kernel development, while housekeeping is handled by the framework in a fast and simple manner.
4 Case study: MRI reconstruction
In what follows, all the calculations have been carried out on a Intel(R) Corporation Core(TM) i7-4790 CPU @ 3.60GHz with 16 GB of RAM), running Ubuntu 16.04 LTS, host compiler gcc 5.4, OpenCL version 1.2. The GPU used is a GeForce GTX 970, CUDA 7.5 and CUDA compiler nvcc 7.5.17.
4.1 Sensitivity-weighted multicoil image reconstruction
MRI reconstruction is a prominent application of medical imaging nowadays. Briefly outlined, data coming from the MRI scanner belongs in the so-called
-space, which is the Fourier-transformed space of the real image space (the-space from now on). In addition, current scanners have several antennas and may carry out parallel acquisition and reconstruction, so a single acquisition consists of -space images , one per antenna. Each antenna is most sensitive to a particular part of its surrounding space, which is defined by a sensitivity map . Therefore, the process of reconstructing an -space image from the scanner data can be described as:
where stands for the complex conjugate of .
Translating expression 1 into a C++ program using is very similar in essence to the example shown in section 3.3 (notice that listing 1 in that section is the equivalent to listing 5 in this section and the same goes for listings 3 and 6). Now, complementarily, we show some other capabilities of :
Platform and device traits: selection of the computing device may be left up to as in listing 1, or be specified by the user through various criteria such as OpenCL version, device type, platform/device vendor, etc.
Reading/writing files from Matlab: Data objects may be constructed right away from Matlab’s .mat files and, accordingly, saved to as well.
Multiple CL program loading: CL kernels may reside in multiple input files. They are automatically compiled and readily available to the user (indexed by name) in a single step.
Error log from CL compiler: If kernel compilation fails, the error log is immediately available in the description of the generated exception.
For space reasons, the kernels are not shown here (kernels would be allocated in files complexElementProd.cl and xImageSum.cl as indicated in listing 5). Details on the actual operations can be found in . Note also that the Fourier transform uses the clFFT library  and is included in as a process which depends on it.
In summary, equation 1 translates into a process SimpleMRIRecon, which calls subprocesses that do the inverse FFT, multiply the result by the complex conjugate of the sensitivity maps and sum it all to give the final result (there is zero data copying between processes). Therefore, processes can be thought of as mathematical operators. This metaphore is intentional, and allows developers to think of their algorithms as the mathematical expressions they originate from.
Listings 5 and 6 also show how a process can (and should) be split in what needs to be done only once (such as initialization work) and what is to be done in each call to the process (the real work). To illustrate this, let us focus on the FFT process (which internally uses the clFFT library). Like many FFT implementations, clFFT needs an initialization step (called plan baking) which takes much longer than the FFT calculation itself. To reflect this, the init() method of the FFT process (which is called once) does the plan baking, while the launch() method simply performs the FFT calculation itself.
4.2 Performance comparison with other frameworks: Sum of Squares
Our purpose is to show that the benefits of described in this paper are accompanied by a performance that is comparable to other state of the art frameworks. To this end, we have carried out another type of reconstruction for multicoil images; in this case, we have used the so-called Sum of Squares, i.e., we have added the squared modulus of the image in each coil (referred to hereafter as RSS, from root sum of squares). The frameworks used for comparison have been BART  and Gadgetron . These frameworks have been downloaded from their respective websites; then, they have been compiled on the linux server mentioned above with no host compiler optimization (device compiler does optimize to its maximum).
The application domain selected for this experiment is cardiac MRI, namely, 2D cine images ; the number of images to be reconstructed is 16, with size and the number of coils is 8. We depart from Cartesian fully sampled raw data in -space. Table 1 shows performance figures of the three frameworks for an average of one hundred reconstruction executions on a CPU; table 2 shows the corresponding results for execution on a GPU (in both, quantities are expressed in miliseconds). As for the FFT, execution times seem substantially worse for than for the other frameworks; however, we should keep in mind that, as for this operation, we are actually comparing library performance as opposed to framework performance; specifically, performance results mainly depend of how efficiently the FFT is coded in each external library (these libraries are the highly, assembly-level optimized cuFFT and FFTW for BART and Gadgetron in GPU and CPU, respectively, and OpenCL’s equivalent clFFT for , both in GPU and CPU). RSS, on the other hand, is directly coded as kernels in the three alternatives, and it is a sufficiently simple operation so as to render the effect of internal compiler optimization negligible. Therefore, as the tables reveal, ease of use of does not bring forth a performance loss but results are pretty much comparable with those obtained from the other two platforms.
4.3 Performance comparison with multithread options
As a final comparison, we intend to show the benefits of our framework , with respect to multithread computing. To this end, we have resorted to a simple exercise of adding two matrices and we have measured processing time with respect to the matrices size. We have used as baseline a single thread implementation of the operation. Then four options are compared, namely, the same operation using OpenMP (referred to as Series 1 in figure 2, blue color), using as device CPU (Series 2, red color, in figure 2), using as device GPU (Series 3, yellow color, in the same figure) and CUDA (Series 4, green color); in the four cases the magnitude shown is the speed up with respect to the baseline, i.e., the ratio of the processing time for the baseline and the time for each of the other four schemes (higher is faster). Results are shown, as indicated, in figure 2 for an average of one hundred executions; clearly, multithreading using OpenMP is not competitive with the GPU alternatives, and it shows comparable results with when CPU is used as the device. With respect to the two GPU implementations, results are very similar, with random variations along the five matrix sizes.
5 Conclusion and further work
The main contribution of OpenCL to the programming of dedicated computing devices is to provide a single language and API that covers every device type. Hence, the programs presented here should run unmodified in CPUs, GPUS, DSPs, FPGAs, etc. from any vendor, as long as an OpenCL implementation exists. This is true even when the host program is in binary form since OpenCL kernels are compiled at run-time (when the computing device to use is known). In contrast, CUDA-based programs run exclusively on nVidia GPUs. They cannot even run in the host CPU unless the program includes two versions of the algorithms as, for instance, BART does.
With , every algorithm (process) looks similar to each other by following the 11-point path we have described in section 3.3. The code separates nicely device administration tasks from algorithmic tasks in an easy-to-trace structured code flow. The programmer should therefore concentrate on the kernel specifics, i.e., on the algorithmic-related issues, while low level tasks are dealt with by the framework. Data structures are defined only once, irrespective of the targeted device; initializations and function execution are also clearly separated for the sake of efficiency. The framework has been satisfactorily tested on CPUs and GPUs from two different vendors. No performance loss is observed when is compared with other CUDA-based reconstruction frameworks.
In terms of limitations, at the current stage, does not support heterogeneous concurrent computation nor on-the-fly device selection and task partitioning for performance optimization. This is future work we intend to develop.
This work is supported by MINECO under grants TEC-2014-57428 and TEC2017-82408-R, as well as by Junta de Castilla y León under grant VA069U16.
-  Advanced Micro Devices, Inc. clFFT. https://github.com/clMathLibraries/clFFT.
-  Yarra Team at CAI2R. http://yarra.rocks.
-  Canonical, Ltd. Linux containers. https://linuxcontainers.org/.
-  Docker, Inc. Docker. https://www.docker.com/.
-  Jeff Fessler. Michigan image reconstruction toolbox. http://web.eecs.umich.edu/~fessler/code/index.html.
-  M Hansen, D. Atkinson, T. Schaeffter, and T. Sorensen. Real-time reconstruction of sensitivity encoded radial magnetic resonance imaging using a graphics processing unit. IEEE Trans. on Medical Imaging, 28(12):1974–1985, 2009.
-  M. Hansen and T. Sorensen. Gadgetron: an open source framework for medical image reconstruction. Magnetic Resonance in Medicine, 69(6):1768–1776, 2013. Available at http://gadgetron.github.io/.
-  Andreas Klöckner. PyOpenCL. https://mathema.tician.de/software/pyopencl/.
-  nVidia Corporation. OpenCL Best Practices Guide, 2011.
Javier Royuela-del Val, Lucilio Cordero-Grande, Federico Simmross-Wattenberg,
Marcos Martín-Fernández, and Carlos Alberola-López.
Nonrigid groupwise registration for motion estimation and compensation in compressed sensing reconstruction of breath-hold cardiac cine MRI.Magnetic Resonance in Medicine, 75(4):1525–1536, 2016.
-  Andreas Schwarzl. gpunufft. https://github.com/andyschwarzl/gpuNUFFT.
-  Federico Simmross-Wattenberg, Manuel Rodríguez-Cayetano, Javier Royuela-del Val, Elena Martín-González, Elisa Moya-Sáez, Marcos Martín-Fernández, and Carlos Alberola-López. OpenCLIPER. http://opencliper.lpi.tel.uva.es.
-  J. Tamir, F. Ong, J. Cheng, M. Uecker, and M. Lustig. Generalized magnetic resonance image reconstruction using the berkeley advanced reconstruction toolbox. In ISMRM Workshop on Data Sampling and Image Reconstruction, Sedona, Arizona, USA, 2016.
-  The Khronos Group, Inc. The khronos group. https://www.khronos.org/.
-  Martin Uecker, Frank Ong, Jonathan I. Tamir, Dara Bahri, Patrick Virtue, Joseph Y. Cheng, Tao Zhang, and Michael Lustig. Berkeley Advanced Reconstruction Toolbox. In Annual Meeting of the International Society for Magnetic Resonance in Medicine, number 23, page 2486, Toronto, Canada, 2015. Available at https://mrirecon.github.io/bart/.
-  Kaveh Vahedipour. Codeare. http://codeare.org/.
-  Denton Woods. Developer’s image library (DevIL). http://openil.sourceforge.net/docs/DevIL%20Manual.pdf.
-  Nicholas Zwart, Jim Pipe, and Ashley Anderson. Graphical programming interface: A development environment for MRI methods. Magnetic Resonance in Medicine, 74(5):1449–1460, 2015. Available at http://gpilab.com/.