Domain Specific Languages (DSLs) (Van Deursen et al., 2000) have become increasingly popular for several reasons. They help to hide domain specific details so that external developers can use the capabilities of the domain without special knowledge. SIMD operations are a good example for an DSL. Libraries such as Vc (Kretz, 2015) provide the capabilities to use the SIMD instructions of an arbitrary processor architecture, the user can focus on writing the application code.
DSLs for SIMD optimization are often used in software on HPC clusters. An example can be the simulation of the fluid dynamics around a wing body. If this wing body needs to be modified such that e.g. the drag is reduced a shape optimization is necessary. In order to perform the shape optimization accurate sensitivity information needs to be available. A general concept, for the semi automatic generation of sensitivity information, is the application of Algorithmic Differentiation (AD) (Griewank and Walther, 2008) to the simulation software. AD describes how a computer program can be interpreted as the concatenation of several millions of elemental operators like , , and
. The chain rule and the directional derivative is applied on this large concatenation of functions. This yields the forward mode of AD. The reverse mode of AD is introduced by applying the discrete adjoint calculus(Dunford and Schwartz, 1958; Apostol, 1969) on the forward mode.
The reverse mode of AD has to store a certain set of information for each elemental operation that is evaluated. If this is an operation of an DSL there are two concepts how this operation can be treated. Traditional operator overloading AD tools like ADOL-c (Walther and Griewank, 2009) or CoDiPack (Sagebaum et al., 2017) insert their AD type into the library for the DSL. Every operation inside the library is then evaluated with the type from the AD tool. For a simple linear system solve with a shifted right hand side
these tools would store the information for several thousands of elemental operations. An example implementation of the above equation in CoDiPack requires 62000 bytes for a vector dimension of 10.
The second concept is to treat the elemental operation as an external function and provide specialized code for the sensitivity computation. This process is quite involved, error prone and has to be repeated for every new application. An external function introduces an overhead in time and memory which makes it only feasible for, in this case, large linear systems. For small dimensions it would not be efficient.
The approach proposed in this paper adds the functionality of a DSL directly to the set of elemental functions that the AD tool can handle. This eliminates the overhead from the external function approach and does not need information from intermediate statements. If this is done for the above example the required memory would be 1200 bytes which is a reduction by the factor of 51.
However, difficulties arise when implementing this idea. Instead of handling just the floating point type of the programming language, the AD tool needs to handle all types introduced by the DSL. Furthermore, the implementation of the elemental functions is problematic. There are several thousands of DLSs already available and an AD developer can not handle all of them. The AD developer typically has not the knowledge about the DSL to implement the sensitivity computation. On the other hand, the developer of the DSL has not the knowledge to implement an AD tool for his language. Therefore, we want to develop a generalized AD tool in this paper. The developer of the DSL has to provide the specification for the types and operations of the DSL, and the derivative computation for each operation. The generalized AD tool will then be generated with this information and can be used to compute sensitivities.
This paper will first give an introduction to AD and explain very briefly how operator overloading tools are implemented. Afterwards the concept of AD for DSL is explained in more detail. The design goals and implementation details for the generalized AD tool are shown in the following chapters. The performance of the implementation is compared using examples with and without DSLs.
2. Algorithmic Differentiation
2.1. Basic theory
In this paper we are only providing a very brief introduction to AD, for a complete derivation of the reverse AD mode see (Griewank and Walther, 2008; Naumann, 2012). AD will be applied on the function
with and . The reverse AD mode for this equation is, following the derivation in (Griewank and Walther, 2008),
with and and describes what the reverse AD mode computes. The adjoint solution is computed in the background by evaluating
with , , and for every elemental operation that computes . can consist of several millions of elemental operations with and which are evaluated from to . In order to evaluate equation (3), AD stores information for every during the primal evaluation. Afterwards, AD evaluates equation (3) for from to with the stored information. The sensitivity information is propagated in reverse order from the output variables to the input variables. An example elemental function can be with . The reverse AD evaluation would then be
2.2. Theory for extended operators
The definition of in (3) assumes that an elemental function can only have one output element. This restriction can be lifted as described in (Griewank and Walther, 2008). Let be a function with output variables with . can be written as . to are elemental operators in the context of AD and can be treated with the theory presented above. The only difference is the special dependencies of the operators to which can be exploited in the implementation for domain specific languages.
3. Operator overloading AD implementation
AD can be applied to a source code in two ways. The source transformation approach generates a new source code that adds additional statements for the reverse AD mode. This approach is mostly applied via Tapenade (Hascoët and Pascual, 2013) or ADIC (Bischof et al., 1996) to Fortran codes. The second approach introduces AD through the operator overloading capabilities of a language like C++. All operators for a new computation class are overloaded and store the required information for the reverse AD mode in the background. After the program is evaluated the stored information is interpreted in the reverse order and the sensitivities are propagated from the output values of the program to the input values. For a detailed description of an AD tool implementation see (Sagebaum et al., 2017; Hogan, 2014).
In general each AD tool implements a tape structure where the information is stored. The tape consists of several global data vectors and data streams. In this paper we are mainly considering a primal value taping approach. Here, the input values in equation (3) are stored. For the adjoint variable identification (e.g. ) a reuse index management scheme is used. This means that the identifier of a destructed variable can be assigned to a different variable after the destruction. A primal value taping approach with a reuse index management scheme requires two global vectors. The first one is the primal value vector and holds the primal values of all variables that are currently used in the program. The second vector is the adjoint vector which holds all corresponding adjoint values for each primal value (e.g. for ). These vectors are accessed in a random access pattern and are stored in random access memory (RAM). The taping scheme requires furthermore six data stream. For each elemental operator (statement) the identifier of the output value (left hand side (lhs)), the old value of the left hand side, the function handle for the evaluation of equation (3) and the number of active arguments are required. For each argument of an elemental operator (right hand side (rhs)) the identifier is required. The sixth data stream consists of the data for all constant arguments. For details on the required data see (Sagebaum et al., 2018). Figure 1 shows the data for the taping scheme in a graphical way and highlights the required data for a sample elemental operation . It also shows the index manager which holds all indices that are currently not used. The index manager can be queried for new indices where describes the index that corresponds to and can be used to access the adjoint .
4. AD for domain specific languages
Each AD tool provides an AD type that is then used in the application. Here we assume that the type is called Real. A general use would then be:
Any DSL can be differentiated with AD by exchanging the calculation type of the implementation. If the DSL is templated this can be archived by changing the template argument. That is, Vector<double> would become Vector<Real>. The introduction showed that this approach can be rather inefficient since all intermediate steps are recorded by the AD tool. It can also not be applied if the DSL is defined in language intrinsics like the F64vec4 type for a four element double vector from the AVX instruction set (Firasta et al., 2008).
Lets assume that the DSL is named and consists of a finite set of data objects and a finite set of operations where describes the ordered set of arguments for with for all in . The language is then specified via . The traditional approach would now modify the objects and operations by using the AD type for the implementation. This generates the language . The new approach of this paper takes all data objects and creates a new set of objects where is an identifier for a data object of type . For each operator a new set of operators is created by changing the arguments of the operator to the corresponding data objects in . This is done for each subset of arguments such that data objects and data objects are intermixed. The mathematical formulation is now
where describes a subset with elements. The set of the arguments is then defined as
The new differentiated language is defined as and is based on the language . In terms of software engineering the old process can be described as defining the type Real as
and transforming the language by changing Vector<double> to Vector<Real>. The new process defines the AD type
and then creates a new language by using DSLReal<Vector<double> >. That is, instead of putting the AD type inside the structure, the AD type is wrapped around the structure.
5. Designing AD for domain specific languages
The implementation of the new language from the language should not be done for only one sample DSL. There are several thousand DSLs available and new ones are created regularly. An AD implementation for DSLs needs to have the following properties:
Independent of the DSL
Easy to adapt to new DSLs
No AD implementation knowledge of the DSL developer/user required
The first property is already motivated above. The second one describes the general usability of the implementation. For a library developer it should be quite simple to describe his DSL to the AD tool implementation. The third property is important to guarantee the usability of the tool. The developer of an DSL, who wants to provide an AD capability of his language, should only provide the derivative information and nothing more. The AD specific data management is then handled by the AD tool which is presented in this paper.
The global vectors from figure 1 need to be addressed first. If the design is left as it is then all data objects of the language need to be converted into the data format of this vector which can introduce a significant overhead. A further problem is the memory alignment of the data. SIMD types for example should not be instantiated at arbitrary memory locations. The design of one global vector is therefore droped. Each data object will have its own global vectors which can then be configured for the requirements of the data object. Since the index manager translates identifiers into positions of the global vectors, there needs to be one index manager for each data object.
The second design decision concerns the data streams of the tape in figure 1. From the six required streams the left hand side old data and the constant value data stream have to hold object specific data. If the same approach is used as for the global vectors it would introduce two data stream for each data object of the language . The synchronization and management is then much more involved and can currently not be handled with the library for the stream management. Therefore, the stream structure is not changed and all data objects need to be converted to and from the streams. These two decisions lead to the new tape layout in figure 2. The design consists now of the global tape and secondary tapes that hold the information for each data object.
This covers the basic design of the new tape for the language . It is now important to specify the information for the reverse evaluation. The reverse AD equation (3) from section 2.1 is reformulated in the context of the language . Let be in then the primal evaluation is
with and . takes in this case the place of the elemental operator . The reverse AD equation can now be formulated in the same context as
The implementation of the AD tool will provide all values for the inputs , the output and the adjoint variables. The missing information is the computation of the derivatives . Since this is a DSL specific information, the user needs to provide it. If only the derivative is made available the type of the object is not clear. If the computation of is provided the result is a data object of the same type as and the computation might be implemented in an optimized way. Therefore, the third design decision is that the user has to provide functions for the computation of for all and the evaluation of the transposed derivative multiplied by a data object that is for all .
6. Implementing AD for DSL
The data format for the specification of the language L and the additional derivative information is the most critical point. The best option would be some kind of pseudo code or real code that can be parsed and used for the generation. Since this is the first implementation we are using the format of the code generator. The code generation is done with the gsl library (iMatix Corporation, 2018) which allows a very dynamic mixing of generated code and generation specific logic. The data format for gsl is xml (Bray et al., 1997) and therefore the specification for the language is also done in xml.
A new data object is created with
This will trigger the generation of the type ActiveMatrix<T>. Currently the underlying type is just defined as a template but can be specified directly in future implementations. The class itself contains the storage for the type T and the identifier. Furthermore, it provides getter and setter functions for the members as well as some AD specific forwards to the implementation of the global tape from figure 2. The xml code will also trigger the generation of a MatrixExpression type which is used by AD to create expression templates in order to improve the recording process. For an overview about expression templates see (Veldhuizen, 1995; Aubert et al., 2001) and for an AD implementation with expression templates see (Hogan, 2014; Sagebaum et al., 2017).
The operators are defined via a xml structure:
<name> provides the name of the operator. <structure name> has to be the name of the structures defined as shown in listing 1. <arg name> is the name of an argument. The element <arg> can be specified multiple times and the nested <reverse> element specifies the code for the evaluation of as discussed above. The <primal> element specifies the computation of . The <function> element can either be specified inside a <structure> element which will generate a member function or in the global xml body which will generate a function. An example for the multiplication of a matrix and a vector would look like:
The generated function and expression object is then:
The code in the code sections is included in the generated functions. Each argument has the type of the original data object. The return value is provided as an argument with the name r and for member functions the class itself is provided as an argument with the name t. For all involved arguments the corresponding bar value is provided with the same type as the corresponding primal value . The name is extended by _b. The example shows only the generation for the case where both arguments are active (AA). The generation of the cases with one passive argument (PA and AP) are generated accordingly.
7.1. Coupled Burgers equations
The coupled Burgers equations (Biazar and Aminikhah, 2009) are chosen as a first test case. They are used to compare the performance of the new AD tool to some existing AD tools. This ensures that now major performance killers are used in the implementation.
The coupled Burgers equations
are discretised with an upwind finite difference scheme. The initial conditions are
and the exact solution is after (Biazar and Aminikhah, 2009),
The computational domain is the unit square and the boundary conditions are taken from the exact solution. For the test runs a grid size of 601x601 grid points and 32 discrete time steps are used. The computations are performed on the Elwetritsch cluster of the TU Kaiserslautern and the test case is evaluated on one node of the cluster which consists of two Intel E5-2640v3 processors. Two load cases are considered. For the first case, only one process is run. For the second, the sequential program is run on each of the 16 cores simultaneously. This simulates an environment where the memory bandwidth of the node is fully utilized. The two cases are called ‘sequential’ and ‘bandwidth limited’ in the further analysis. As a comparison for the DSL AD tool we have chosen CoDiPack (Sagebaum et al., 2017) since the basic infrastructure from there is used for the new AD tool.
The results for the time Measurements are presented in figure 3. They show, that the added complexity for the handling of the DSL operators and types, does not increase the required time. For the recording time in the sequential case the DSL AD tool is even slightly faster than CoDiPack. The DSL AD tool is currently not as heavily templated as CoDiPack which might provide the compiler with opportunities for optimization. The memory for both tools is also identical since they store the same data.
7.2. Spline evaluation
For the DSL specific test case the spline interpolation example from the Vc library(Kretz, 2015) is chosen. The interpolation is performed on a unit square where each dimension of the square is divided into regions. The splines for the interpolation in each sub region are third order splines in both dimensions. The vectorization is done over the nodes of the splines. They are three dimensional and handled with SIMD vectors of size four. The example is implemented to run with single precision and therefore the AD tools are also changed to single precision.
The computations are performed on the Elwetritsch cluster of the TU Kaiserslautern and the test case is evaluated on one node of the cluster which consists of two Intel E5-2640v3 processors. Here we currently consider only the load case where one thread is executed. This means that the test is not memory bandwidth limited. Each test consists of the evaluation of samples points. These points are generated randomly which is also true for the coefficients of the spline interpolation. The time results are averaged over 20 runs. For the comparison CoDiPack is again used as a baseline value, in addition the DSL AD tool is once run with the scalar implementation of the interpolation and once run with the vectorized implementation. All evaluations of the sample points are recorded in one large tape and evaluated accordingly. This reduces the management overhead for the tape in the provided time measurements.
The recording time for CoDiPack is roughly 1.5 seconds slower during the recording of the tape than the AD DSL tool as shown in figure 4. If the evaluation times are considered then the relation is reversed. Here, the DSL AD tool is one second slower than CoDiPack. This discrepancy is in contrast to the Burgers test case and might come from the increased amount of constant values which is introduced by the interpolation nodes of the spline. CoDiPack can use the constant data directly from the stream since all arguments of the expression have the same type. This is not possible for the DSL AD tool as it needs to restore the constant data to type specific vectors such that the static evaluation of the expressions can access the data from these vectors.
The comparison of the scalar and vectorized implementation with the DSL AD tool yields the expected results. The recording time is slightly decreased since less statements are evaluated and stored. Overall, the general overhead from the tape handling still dominates the recording process. For the reverse evaluation of the tape the situation is the same. A greater decrease in time is seen here since the SIMD operations also increase the performance.
For CoDiPack and the scalar DSL AD implementation the memory is the same and around 4.9 Gb. The vectorized version requires only 3.7 Gb which is a reduction of 24%.
The paper demonstrates a new approach for the handling of Domain Specific Languages in the context of Algorithmic Differentiation. Instead of inserting the AD tool into the library of the DSL, a new AD type is generated for every type of the DSL. The results show that the presented approach is as fast as state of the art AD tools for general applications. In an applications that uses SIMD operations the performance of the new approach is increased since the SIMD optimizations can still be used. Furthermore, this leads to a reduced amount of memory.
Since the results are very promising the AD tool for DSLs will be developed further. The code will be cleaned up and documentation will be written. It is also necessary to include the handling of multiple output values. For the test cases in this paper it was not necessary and therefore omitted in the current implementation. Further steps are then to add additional helper methods for the convenience of the user.
- Apostol (1969) T. M. Apostol. 1969. Calculus, Volume 2. John Wiley & Sons, Inc.
- Aubert et al. (2001) P. Aubert, N. Di Césaré, and O. Pironneau. 2001. Automatic differentiation in C++ using expression templates and application to a flow control problem. Computing and Visualization in Science 3, 4 (2001), 197–208. https://doi.org/10.1007/s007910000048
- Biazar and Aminikhah (2009) J. Biazar and H. Aminikhah. 2009. Exact and numerical solutions for non-linear Burger’s equation by VIM. Mathematical and Computer Modelling 49, 7 (2009), 1394–1400.
- Bischof et al. (1996) C. H. Bischof, A. Carle, P. Khademi, and A. Mauer. 1996. ADIFOR 2.0: Automatic Differentiation of Fortran 77 Programs. IEEE Computational Science & Engineering 3, 3 (1996), 18–32.
- Bray et al. (1997) T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau. 1997. Extensible markup language (XML). World Wide Web Journal 2, 4 (1997), 27–66.
- Dunford and Schwartz (1958) N. Dunford and J.T. Schwartz. 1958. Linear Operators: General theory. Interscience Publishers. https://books.google.de/books?id=i2EPAQAAMAAJ
- Firasta et al. (2008) N. Firasta, M. Buxton, P. Jinbo, K. Nasri, and S. Kuo. 2008. Intel AVX: New frontiers in performance improvements and energy efficiency. Intel white paper 19 (2008), 20.
- Griewank and Walther (2008) A. Griewank and A. Walther. 2008. Evaluating Derivatives, second edition. SIAM.
- Hascoët and Pascual (2013) L. Hascoët and V. Pascual. 2013. The Tapenade Automatic Differentiation tool: Principles, model, and specification. ACM TOMS 39, 3 (2013). http://dx.doi.org/10.1145/2450153.2450158
- Hogan (2014) R. J. Hogan. 2014. Fast reverse-mode Automatic Differentiation using expression templates in C++. ACM TOMS 40, 4 (2014), 26:1–26:24. http://doi.acm.org/10.1145/2560359
- iMatix Corporation (2018) iMatix Corporation. 2018. GSL. https://github.com/imatix/gsl. (2018). [Online; accessed 10-January-2018].
- Kretz (2015) M. Kretz. 2015. Extending C++ for explicit data-parallel programming via SIMD vector types. Ph.D. Dissertation.
- Naumann (2012) U. Naumann. 2012. The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation. SIAM. https://books.google.de/books?id=VpheBMaDf4kC
- Sagebaum et al. (2017) M. Sagebaum, T. Albring, and N. R. Gauger. 2017. High-Performance Derivative Computations using CoDiPack. arXiv preprint arXiv:1709.07229 (2017). https://arxiv.org/abs/1709.07229
- Sagebaum et al. (2018) M. Sagebaum, T. Albring, and N. R. Gauger. 2018. Expression templates for primal value taping in the reverse mode of Algorithmic Differentiation. submitted to Optimization Methods & Software (2018).
- Van Deursen et al. (2000) A. Van Deursen, P. Klint, and J. Visser. 2000. Domain-specific languages: An annotated bibliography. ACM Sigplan Notices 35, 6 (2000), 26–36.
- Veldhuizen (1995) T. Veldhuizen. 1995. Expression templates. C++ Report 7 (1995), 26–31.
- Walther and Griewank (2009) A. Walther and A. Griewank. 2009. Getting started with ADOL-c.. In Combinatorial scientific computing. 181–202.