High-performance computing hardware is progressing rapidly towards increasingly heterogeneous platforms. This makes it difficult for researchers and developers of scientific software to keep up with the latest chip designs and to explore emerging hardware technologies, e.g., hybrid CPU-FPGA devices, without spending a large part of their time reimplementing the same algorithms and core functionalities over and over again for different compute devices. Software packages that explicitly aim to support heterogeneous platforms, e.g., multi-core CPU systems combined with many-core accelerators like GPUs, dedicated vector processors, and/or FPGA expansion cards, suffer even more from the rapid changes of the hardware market, since their developers need to keep the implementations of an algorithm for the different hardware platforms at the same level of maturity and functionality.
Over the last two decades, the trend in high-performance computing (HPC) for practical large-scale applications has moved away from hand-optimized application codes towards compiler-based code generation and automated performance tuning [1, 3].
It is a long-lived myth that the use of expression template meta-programming techniques automagically leads to efficient computer code. However, specialized expression template libraries (ETLs) like Armadillo, ArrayFire, Blaze, Eigen, VexCL, and ViennaCL, which provide hardware-optimized linear algebra routines for vectors as well as dense and sparse matrices, allow one to write concise source code at an abstract mathematical level that is compiled into executables exploiting the hardware capabilities to the extent implemented by the authors of the linear algebra back-ends. None of them, of course, supports all target hardware platforms, and the provided functionality and foreseen use-case scenarios differ largely from one library to the other. Despite all differences, the common aim of expression template libraries is to provide mechanisms to formulate mathematical vector expressions like y = 0.5*sin(x+y) and to evaluate them in a single loop over the vector entries rather than creating temporaries for sub-expressions.
The Fluid Dynamics Building Blocks (FDBB) project attempts to develop a unified wrapper interface for the core functionality of all of the aforementioned linear algebra back-ends and provides an extendible set of expression templates for developing fluid dynamics applications, with the focus placed on compressible flows. These expressions include transformations between conservative, primitive, and characteristic variables as well as Riemann invariants, different types of equations of state (EOS), and inviscid fluxes together with their flux Jacobians. FDBB is a header-only C++11/14 library, designed to leave the underlying linear algebra back-ends largely unmodified so that applications automatically benefit from improvements in the ETLs, provided that their API does not change between versions. The low-level API of our package is released as standalone software, the Unified Expression Template Library Interface (UETLI).
The overall structure of our software package is depicted in Figure 1.
2.1 Low-level API: UETLI
The different linear algebra back-ends vary largely in functionality, maturity, performance, calling conventions, and in the way they evaluate expressions on the target hardware. Most CPU back-ends employ a delayed (lazy) evaluation approach based on recursive templates and template meta-programming to combine several operations into one, thereby reducing (or eliminating) the need for temporaries. In contrast, the multi-device back-ends ArrayFire, VexCL and ViennaCL utilize just-in-time (JIT) compilation techniques to convert automatically generated source code into executable code.
Consider the code snippet depicted in Figure 2 that computes the element-wise sine of the vector sum x + y, scales the result by the constant 0.5 and assigns it to y. The tag<ID>(expression) function optionally assigns a unique ID tag to the expression, which helps the VexCL back-end not to pass the same expression as multiple arguments to the device kernel. As a general design principle of our software, functionality that is only supported by some ETLs reduces to no-ops in the other cases, which is realized by template specialization. The CONSTANT(value, expression) macro ensures that the data type of the constant equals that of the expression result and, moreover, enables further optimizations if provided by the back-end. The unary operation elem_sin(expression) is one example of more than 30 element-wise arithmetic operations that can be applied to vectors, dense and sparse matrices, and block expressions, the latter being discussed below.
Figure 3 shows the OpenCL source code that has been auto-generated by the VexCL library from the above vector expression and can be further processed into CPU or GPU code by the OpenCL subsystem. VexCL also provides code generation engines for CUDA and OpenMP as well as experimental support for Maxeler's dataflow computing platform, which aims at making field-programmable gate arrays (FPGAs) usable as next-generation accelerator devices. Maxeler Technologies provides a software development kit consisting of a Java-like programming language, MAXJ, as well as compilers and libraries to synthesize the high-level compute kernels into bitstreams that reconfigure the FPGAs at runtime. In this case, JIT compilation can take up to several hours or even days but, still, the fully automated generation of FPGA bitstreams from mathematical expressions is an attractive feature. Switching to another back-end is realized by changing lines 1–2 of Figure 2, leaving the actual mathematical expression in lines 3–4 unmodified.
The library makes extensive use of rvalue references, move semantics, and perfect forwarding. Wrapper functions are implemented based on the design pattern depicted in Figure 4. If type A requires special treatment in back-end BACKEND other than calling sin(std::forward<A>(a)), the get_element_sin_impl<A> trait needs to be specialized to hold EnumETL::BACKEND in its attribute value. The functionality is then provided by the specialized function elem_sin_impl<A,EnumETL::BACKEND>::eval(A&& a).
This approach enables the minimally invasive addition of back-ends and fine-grained control over specialized treatment at the level of argument types in unary and binary operators.
The elem_sin(a) function automatically returns an expression object, whose type depends on the adopted linear algebra back-end. It can either be assigned (= evaluated) to a vector or matrix object, e.g., y = elem_sin(x), or passed as an argument to another function, e.g., y = elem_sin(elem_sin(x)). The latter is used in Section 2.2 to compose expressions for the inviscid fluxes and the flux Jacobians from smaller modular building blocks.
However, not all back-ends support the construction of expressions from sub-expressions: they are unable to store temporarily created sub-expressions as objects within enclosing expressions and instead store only references, which become invalid once the underlying objects go out of scope. As a remedy, we have implemented a caching mechanism, which is itself a lightweight ETL with full UETLI functionality support. Cache expressions encapsulate the temporary expression objects generated by the back-end and pass references to them transparently to the back-end functions.
Figure 5 illustrates the general use of the caching mechanism for the Armadillo back-end. The two column vectors are encapsulated by CacheExpr<Tag,ExprType> objects, which require unique ID tags next to the expression types. All unary and binary operations themselves return cache-type objects that hold the sub-expression objects internally. The underlying expression object can be obtained using the get() function, which even makes it possible to combine the caching mechanism with native expression template code:
x.get() = y.get() + arma::randu(10);
It should be noted that line 5 in the above code snippet does not trigger evaluation of the cached expression but only assigns the unevaluated, encapsulated expression to the object E. Evaluation happens during the assignment to y in line 6. In practice, lines 5–6 are typically fused unless E serves as a sub-expression in multiple further expressions.
The caching mechanism is, moreover, a helpful debugging tool: E.pretty_print(os) yields a human-readable rendering of the expression.
An extensive dump of the entire expression tree is produced by E.print_debug(os).
Lastly, UETLI provides a framework for working with block expressions, that is, expressions that are composed from block matrices and vectors of fixed block size. As an example, consider the following block matrix-vector multiplication
where M is a 2-by-2 block matrix assembled from the sparse scalar matrices A, B, C, and D, and Y is a block column vector with two components. With the aid of block expressions, this can be implemented as shown in the code snippet depicted in Figure 6. Here, M and Y (lines 9–12) are a block matrix view and a block column vector, respectively, and E (lines 13–14) is a block expression consisting of 2 rows and 1 column. Block matrices and vectors can only store objects of the same type, whereas block expressions accept an arbitrary combination of types, as is required to handle the different expression objects returned from the element-wise sine and cosine functions. View objects, in contrast to their non-view counterparts, only store references to the arguments passed, which implies that the sparse scalar matrices A, B, C, and D are not duplicated and copied into M (line 9), as would have been the case if M were of BlockMatrix type. Similarly, the move constructor of the block column vector is used in lines 10–12 so that, again, no data duplication takes place.
makeBlockExpr(exprs…) creates an object of type BlockExpr<nrow,ncol,Exprs…>, which is the most flexible block container since it supports the mixing of back-ends by design. However, most linear algebra back-ends do not (yet) support mixed expressions, e.g., matrix-vector multiplication between a sparse Eigen matrix and a Blaze vector. The aforementioned block types support all unary and binary operations and element-wise functions that can be applied to a scalar matrix or vector object, provided they make mathematical sense.
The idx-th sub-item of a block object is accessible via fdbb::utils::get<idx>(obj), whereby all block types adopt row-major storage ordering by default. The latter can be adjusted by passing StorageOrder::ColMajor as template parameter to the block objects.
2.2 High-level API: FDBB
On top of UETLI, we created the expression template library FDBB, which provides the main building blocks for developing fluid dynamics applications.
Secondary variables can be computed from the primary ones with only a few lines of code. Let U denote the state vector of conservative variables in 3D and assume a perfect gas with adiabatic index γ. Then the absolute pressure p is computed in a single line as shown in the code snippet depicted in Figure 7. By making the dimension ('3') and the variable type ('EnumForm::conservative') template parameters, it is even possible to write dimension- and formulation-independent application code. This approach is most effective in combination with factory-based object creation.
Instead of passing the scalar variables one by one, it is handy to collect them in a block expression that can be passed as single parameter (see paragraph ’Passing of Variables’ below for details). The scalar variables can be accessed via fdbb::utils::get<idx>(U).
A further advantage of using block expressions is the easy conversion between state vectors. Let the vector of conservative values U be defined as in line 1 of Figure 8. Then the conversion from conservative to primitive variables and vice versa can be realized elegantly as illustrated in the following code snippet. Assuming that well-designed linear algebra back-ends will not perform copy operations if source and destination vectors are the same, no memory bandwidth is lost on unnecessary transfers of the density variable.
Equations of state.
User-defined equations of state of the form f(p, V, T) = 0, with absolute pressure p, volume V, and absolute temperature T, can be specified by implementing a derived class that follows the prototype EOS_pVT depicted in Figure 10. In a forthcoming release of FDBB, experimental support for the open-source thermophysical property library CoolProp will be enabled, which provides a large collection of equations of state for (pseudo-)pure fluids. Since CoolProp does not accept vector expressions as arguments, its use is currently limited by the generation of temporary objects. This shortcoming could be overcome by extending CoolProp to accept generic vector arguments and to perform computations based on expression templates rather than directly on data.
The d-dimensional tensor of inviscid fluxes F(U) is implemented as a ready-to-use block expression, cf. Figure 11, whereby it is assumed that eos and varU are defined as in lines 1–4 of Figure 7 and U is the state vector of conservative variables defined in line 1 of Figure 8. It should be noted that F only holds the expressions; their evaluation takes place upon assignment to the block matrix f.
The implementation of flux-Jacobian matrices for the inviscid fluxes is not yet finished and will be enabled in a forthcoming release of the FDBB library.
Passing of Variables.
FDBB has been designed with utmost flexibility in mind. With few exceptions, all functions exhibit the same generic interface shown in Figure 12,
which allows the user to pass any combination of arguments and leaves the mapping to variables to an extra trait that is passed to the variable type, say varU, as an additional template parameter. The default behavior is a perfectly forwarding 1-to-1 map from the parameter pack vars… to the variables, whereby the mapping from the (variable, dimension, formulation) triple to the argument index is realized via an extensive specialization of the MapVar2Arg<Var, dim, Form> trait. The default behavior can be changed by providing a user-defined mapping that follows the structure depicted in Figure 13. Additional traits that support the passing of state variables as block objects, as illustrated in Figures 8 and 9, are provided and can be further adjusted to the needs of the user.
3 Performance Analysis
An extensive performance analysis of all possible combinations of linear algebra back-ends and FDBB features on all supported hardware platforms is beyond the scope of this paper. We restrict ourselves to a synthetic micro-benchmark to measure the computational overhead introduced by the extra FDBB layer and one mini-application.
The kinetic energy, or a multiple thereof, occurs frequently as a sub-expression in fluid dynamics applications. We therefore chose the calculation of the squared velocity magnitude |v|² = |ρv|²/ρ² from the conservative state vector U in 3D, i.e., varU::v_mag2(U), as micro-benchmark.
All tests were run under CentOS Linux 6.7 with thread pinning (likwid-pin -c N:0-15) on a dual-socket workstation (Intel E5-2670 @ 2.6 GHz, 20MB cache) with 64GB main memory. The ArrayFire and VexCL back-ends were tested in CUDA-mode on an NVIDIA Tesla K20Xm GPU with 6GB memory and ECC turned off. The exact compiler versions were gcc 5.3.0 and nvcc 7.5.17 with CUDA driver version 352.93.
Figure 14 shows the compute performance (left) and the memory bandwidth (right) measured for a wide range of problem sizes for the element-wise micro-benchmark expression introduced above.
Remarkably, no measurable performance loss is observed between the back-end specific implementations (straight lines) and the FDBB-enabled generic ones (symbols).
To estimate the performance of FDBB in real-life scenarios we implemented a mini-app, in which the conservative variables are initialized once with physical values and then used to evaluate the inviscid fluxes multiple times. The computing times are given in Table 1. Columns 2–4 show the wall-clock times measured for the three linear algebra back-ends Blaze, Eigen, and VexCL. Though all back-ends run in CPU mode and employ OpenMP parallelization, the VexCL back-end is 1.5x faster than the slowest one for the largest problem size, which consumes 1.25 GB of main memory.
The same mini-app has been run on an IBM Power S822LC server, which consists of 128 cores running at 4.02 GHz and features NVIDIA's NVLink interconnect to communicate with four NVIDIA P100 Pascal GPUs. The results are given in columns 5–7 of Table 1. Surprisingly, the performance of the two CPU back-ends differs by 5x. Using a single GPU reduces the computing times by a factor of up to 30.
(Table 1 column groups: 2x Intel E5-2670 @ 2.6 GHz; POWER8NVL @ 4.02 GHz + GP100GL.)
This paper described the main features and design principles of the FDBB project (https://gitlab.com/mmoelle1/FDBB) and provided a brief performance analysis. From the results obtained with a synthetic micro-benchmark we conclude that the additional abstraction layer introduced by FDBB does not cause any performance penalty. Preliminary timings for the more realistic mini-app support our claim that it is possible to write generic codes for heterogeneous HPC systems without sacrificing efficiency. It is, however, necessary to implement mechanisms to choose the optimal back-end for the platform at hand, since the performance of the different back-ends can differ by orders of magnitude.
The author would like to thank Denis Demidov, Peter Gottschling, Klaus Iglberger, Karl Rupp, and Conrad Sanderson for their support on integrating the different linear algebra back-ends into FDBB. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 678727.
-  Basu, P., Williams, S., Van Straalen, B., Oliker, L., Colella, Ph., and Hall, M. Compiler-Based Code Generation and Autotuning for Geometric Multigrid on GPU-Accelerated Supercomputers. Parallel Computing (PARCO) (2017). DOI: 10.1016/j.parco.2017.04.002.
-  Bell, I. H., Wronski, J., Quoilin, S., and Lemort, V. Pure and Pseudo-pure Fluid Thermophysical Property Evaluation and the Open-Source Thermophysical Property Library CoolProp. Industrial & Engineering Chemistry Research (2014) 53(6):2498–2508. DOI: 10.1021/ie4033999.
-  Christen, M., Schenk, O. and Burkhart, H. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures. Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International. DOI: 10.1109/IPDPS.2011.70.
-  Demidov, D., Ahnert, K. Rupp, K. and Gottschling, P. Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries. SIAM Journal on Scientific Computing (2013) 35(5):C453-C472. DOI: 10.1137/120903683.
-  Gamma, E., Helm, R., Johnson, R., and Vlissides, J. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley (1994).
-  Guennebaud, G., Jacob, B. et al. Eigen v3. (2010), http://eigen.tuxfamily.org.
-  Iglberger, K., Hager, G., Treibig, J. and Rüde, U. Expression Templates Revisited: A Performance Analysis of Current Methodologies. SIAM Journal on Scientific Computing (2012) 34(2):C42–C69. DOI: 10.1137/110830125.
-  Möller, M. and Jaeschke, A. FDBB: Fluid Dynamics Building Blocks. (2018), Retrieved from https://gitlab.com/mmoelle1/FDBB.
-  Möller, M. UETLI: Unified Expression Template Library Interface. (2018), Retrieved from https://gitlab.com/mmoelle1/uetli.
-  Pell, O. and Averbukh, V. Maximum Performance Computing with Dataflow Engines. Computing in Science & Engineering (2012) 14(4):98–103. DOI: 10.1109/MCSE.2012.78.
-  Rupp, K., Tillet, Ph., Rudolf, F., Weinbub, J., Morhammer, A., Grasser, T., Jüngel, A., and Selberherr, S. ViennaCL - Linear Algebra Library for Multi- and Many-Core Architectures. SIAM Journal on Scientific Computing (2016) 38:412–439. DOI: 10.1137/15M1026419.
-  Sanderson, C. and Curtin, R. Armadillo: a template-based C++ library for linear algebra. Journal of Open Source Software (2016) 1(2):26. DOI: 10.21105/joss.00026
-  Yalamanchili, P., Arshad, U., Mohammed, Z., Garigipati, P., Entschev, P., Kloppenborg, B., Malcolm, J., and Melonakos, J. ArrayFire - A High Performance Software Library for Parallel Computing with an Easy-to-Use API. (2015), Atlanta: AccelerEyes. Retrieved from https://github.com/arrayfire/arrayfire.
-  Gottschling, P. and Lumsdaine, A. The Matrix Template Library 4. Retrieved from www.osl.iu.edu/research/mtl/mtl4/.