Fast gradient evaluation in C++ based on Expression Templates.
There are several AD tools available, which all implement different strategies for the reverse mode of AD. The major strategies are primal value taping (implemented e.g. by ADOL-c) and Jacobi taping (implemented e.g. by adept and dco/c++). Especially for Jacobi taping, recent advances by using expression templates make this approach very attractive for large scale software. The current implementations are either closed source or miss essential features and flexibility. Therefore, we present the new AD tool CoDiPack (Code Differentiation Package) in this paper. It is specifically designed for a minimal memory consumption and optimal runtime, such that it can be used for the differentiation of large scale software. An essential part of the design of CoDiPack is the modular layout and the recursive data structures, which do not only allow the efficient implementation of the Jacobi taping approach, but will also enable other approaches like the primal value taping or new research ideas. We will also present the performance value of CoDiPack on a generic PDE example and on the SU2 code.READ FULL TEXT VIEW PDF
This paper presents reverse-mode algorithmic differentiation (AD) based ...
We present the new software OpDiLib, a universal add-on for classical
We present ForwardDiff, a Julia package for forward-mode automatic
Automatic differentiation (AD) is an important family of algorithms whic...
In this work we present useful techniques and possible enhancements when...
With more efficient structures, last trends in aeronautics have witnesse...
No single Automatic Differentiation (AD) system is the optimal choice fo...
Fast gradient evaluation in C++ based on Expression Templates.
Algorithmic Differentiation (AD) describes the mathematical theory how a computer program can be differentiated. A basic introduction to AD will be given in the second section of this paper. However, we already state here the fundamental result of the reverse mode of AD, namely, that for each statement with in the software the corresponding adjoint statement
must be evaluated. When the software is written in Fortan, source-code transformation [Griewank (2000)] is the method of choice to automate the generation of code to compute (1). The application of source-code transformation to C++ code is limited due to the complex language structure. Instead, Operator Overloading (OO) has been proven to be more appropriate. All of the available tools that implement AD based on the OO approach, have the problem of storing the data and evaluating equation (1) in a number of different ways depending on the programming language, field of application or just the personal preference. However, they all have in common, that since the flow of data is reversed, it is required to either store the values of or to directly store the Jacobian during the evaluation. This corresponds to what we refer to as the Primal taping or the Jacobian taping method, respectively.
The most well known representative that implements the former method is ADOL-C [Walther and Griewank (2012)]. It is released under the EPL and GPL2 license and is widely used in science for small and medium sized applications. Its shortcomings are the runtime and memory overhead when applied to large-scale problems. Still, its level of maturity and robustness makes it an excellent aid for validation and verification, when developing new operator overloading AD tools.
The Jacobian taping method is quite popular in combination with expression templates (ET) [Veldhuizen (1995), Aubert et al. (2001)] to increase the runtime efficiency. Two examples of tools that make use of this approach are for example adept [Hogan (2014)] and dco [Leppkes et al. (2016)]. adept
is released as open-source and offers high performance and a relatively low memory footprint, but it lacks important features like higher-order differentiation, a vector-mode or a (tapeless) forward mode implementation. Althoughdco offers all these features, we think that its proprietary license makes it unattractive for individuals and organizations in science and industry to use it in their in-house or open-source software. Furthermore, in our opinion, both tools have in common that their structure is quite inflexible so that they are not easily extensible. These were the initial reasons for starting the development of the Code Differentiation Package (CoDiPack). Table 1 shows a comparison in terms of taping methods, implementation and license of the tools mentioned above.
Today, the main focus of CoDiPack is the application of AD to high-performance computing (HPC) and industrial-grade software, while still maintaining its open-source philosophy. Hence, the following properties are always considered during the development:
One of the top priorities is to write an efficient tool for the application of AD and to reduce the runtime overhead of the differentiated software. Similarly, the memory should be used as efficient as possible and the layout of the memory access should conform to modern HPC standards. CoDiPack offers a number of different ways such that the characteristics of the code base on which it is applied can be maintained. This will improve the performance of the differentiated software. Examples are the support for c-like memory operations (e.g. memcpy) or a data layout that is optimized for running several reverse sweeps in a row (see section 2).
For the tool to be accepted by the users it is important that the time to get familiar with the API is quite short. Thus first results should be obtained in a minimal amount of time. This is accomplished by having distinct and well documented user interfaces as well as simple tutorials. Furthermore, to apply CoDiPack usually no major changes in the software are required.
To increase the pace in AD related research, CoDiPack is designed in a modular way such that new features (e.g. new taping strategies), high-level operations or the memory layout can be easily modified or added.
This paper serves the purpose to highlight in more detail the structure of the implementation and show how it affects the resulting performance. First an introduction to AD is given in section 2, which is followed by an explanation of expression templates in section 3. The modular approach of the CoDiPack implementation is then described in section 4. Here, the basic idea for the efficient application of AD through operator overloading is described and the design choices for the efficient and extensible memory structures are explained. The performance is investigated by applying CoDiPack to a generic PDE example and to the open-source computational fluid dynamics suite SU2 [Economon et al. (2015)] in section 5.
The part in each software for which the derivatives are needed can be described in all cases as a function with the header void func(const double x, double y). In most cases the input variables and the output variables will be not straight arrays but collections of several variables. These variables can also be members of structures that are in a local or global scope. In an extreme case the header for func will be just void func(void), but internally func accesses millions of input and output variables from the global scope. Nevertheless, func can always be described as a the mathematical function with
We assume that the numerical evaluation of can be represented as a sequence of statements . Each of this statements represents a local evaluation procedure with an arbitrary complex right-hand side, for example
It is then possible to write any numerical evaluation in an arbitrary computer program using the general procedure shown in Table 1.
We have , where the precedence relation means that variable depends directly on variable . That is, the vector is the concatenation of the on which depends. It is important to note, that the mathematical representation of and the implementation func is not a one to one relation. represents all the statements that are evaluated during one call of func which implies, that e.g. loops are unrolled in . For large applications the number of statements can become quite large and can vary for different configurations.
The representation of can also be written it as the composition
where the state transformation sets to and keeps all other components for unchanged. is the state space of the evaluation of func. It contains the input values , the intermediate values and the output values as the last values of the intermediate values. and are the matrices that project an -vector onto its first and last
components, respectively. By applying the chain rule for differentiation we get
where . Using the analytic representation of [Griewank (2000)], it is possible to write the tangent relation in equation (5) as the evaluation procedure shown in table 2. The matrix vector products are calculated in the same order as for the evaluation procedure in table 1 and can be computed alongside the primal evaluation. It is then possible to compute the matrix-vector product of the Jacobian and an arbitrary direction , i.e. equation (5), by evaluating the tangent interpretation.
The tangent relation in equation (5) gives also the alternative representation of the Jacobian of which can be written as
By transposing the product we obtain the adjoint relation
We can again use the analytic representation of [Griewank (2000)] to write the adjoint relation as the evaluation procedure shown in table 3. All matrix-vector products are calculated for thus we have to go in reverse through the sequence of statements in Table 1. We can then get the matrix-vector product involving the transposed of the Jacobian and an arbitrary seed vector , i.e. equation (7), by evaluating this adjoint interpretation.
The Jacobi taping approach, that we want to implement in this paper, will store the gradient of each statement. An efficient handling of the statements and how the gradient can be computed will be discussed in the next section. How the data is stored, is then explained in the CoDiPack implementation.
One big issue for an efficient operator overloading AD tool implementation is that C++ only supports the overloading of unary and binary operators. A regular implementation that follows the layout where Real is the overloaded type would split the statement
which needs to store 7 Jacobies and 4 statements in total. But the original statement would only need 4 Jacobies with 1 statement. It is therefore much more efficient to treat the whole statement at once.
In order to achieve this, the expression template technique is used. We change the layout of the operator to where is some class that stores information about the operation that is performed. It is very important how this information is stored. If a regular inheritance scheme is used, then virtual functions calls would be required in every statement of the program. When virtual function calls are used in high level data structures, their performance impact can be neglected, because they will occur not very often. The situation changes, when they are used in every statement, then they are called millions of times in a second and the cpu is just loading the addresses for the virtual function. Hence, we use expression templates to avoid the need of virtual functions.
The basic idea of expression templates is to define a template class that gets the extending class as a template argument. Figure 1 shows this in an example.
The base class provides a cast method in line 3 to return a reference to the extending class. Every method in the base class uses this method to call the implementation of the extending class. In line 8 the structure MULT is implemented and the Expression interface is used as a base class. If the type for the variable is called Real then the statement from (8) is represented by the structure
It represents the full statement and AD can use it to perform arbitrary computation on the whole statement.
For a Jacobi taping approach it is now necessary to compute the derivatives of the statement and the Expression interface from figure 1 needs to be extended in order to provide this information. The implementation will be the AD reverse mode of the value method in the MULT expression example. The getValue method can be viewed as the algorithm
where and represent the expressions of the two arguments from the multiplication. and represent the variables that are used in the expressions. If reverse AD is applied to this algorithm, then the resulting algorithm is
The reverse algorithm states, that the Jacobies with respect to the arguments need to be computed and then the adjoint values are given recursively to the arguments. Therefore, we can extend the expression interface with a new method calcGradient that implements this algorithm. Figure 2 shows a general implementation for an arbitrary binary operator. The two methods, derivativeA and derivativeB, calculate the derivatives with respect to the first and second argument, they evaluate the first two lines in algorithm (10). For the last two lines, it is now necessary to do recursive calls of calcGradient on the arguments of the expression. The recursion is terminated in the variables of the expression, there the multiplier argument of the method will contain the derivative of the whole expression with respect to this argument.
For the example statement (8) a call of calcGradient(1.0) on the expression template will evaluate the code in figure 3. This should also be the code, that is generated by the compiler, after everything is inlined.
“generate” the code in figure 3, because the compiler can see and inline all the functions of the expressions.
The remaining calls to calcGradient are then implemented such that they access the tape and store the data, which will be covered in the next section.
As it is stated in the introduction, one of the main goals of CoDiPack is to be as fast and memory efficient as possible in order to be able to use CoDiPack in HPC environments. Therefore, it needs to be carefully considered which data is stored and how this is done.
Which data is required, can be analyzed with the help of the reverse AD equation for a statement, as it is shown in table 3:
In order to evaluate the Jacobi taping approach for this equation, the required data are the Jacobi and the means to access the adjoint variables for and .
The adjoint variables could be directly stored in the overloaded type, however, this is not a feasible solution since this information would be deleted when they run out of scope. It is a common practice to use the identification, that is provided by the AD theory. When the function is separated into the elemental operations , every operation gets an index that runs from one to the maximum number of operations . Each variable can therefore be identified with the index of the elemental operation. This index is implemented as a global counter, which is incremented each time a statement is stored in the AD tool and the current value of the counter is used as the index for the value on the left hand side of the assignment.
The memory for the data of one elemental operation with input values is therefore double values for the Jacobi, indices for the arguments and one index for the output value. In addition one byte is needed to store the number of arguments. This assumes that a statement has no more than 255 arguments, which is a reasonable assumption for normal code. The total memory requirement for each statement is then bytes, but this not yet optimal.
The indexing scheme that we use increases the index of the left hand side by one for each statement and the index is stored directly for the reverse evaluation. During the reverse evaluation all statements are evaluated in the exact same order but just reversed. The index of the left hand can therefore be computed by decrementing it one by one and it is no longer necessary to store the index of the left hand side. This reduces the memory requirement by bytes, which is then for each statement. We call this indexing scheme “linear indexing”, which is also used by dco [Leppkes et al. (2016)].
The next subsections will now introduce the efficient computation and storing of the required data for the Jacobi taping.
The computation of the Jacobi for the elemental operation is described in the expression template implementation. The CoDiPack interface for them is shown in figure 4.
The cast method and the getValue method are the same as in figure 1, but an additional template parameter for the class is introduced. It removes the constraint that only double types can be used in the computation, such that for example higher order derivatives can be computed. The calcGradient method is also extended by a template parameter, which defines some user data that can be used in the computations. Furthermore, two versions of calcGradient are defined, a two argument version and a version with one argument which assumes that the multiplier is equal to . This addition to the interface is made in order to give the compiler as much information as possible and to prevent unnecessary multiplications with 1.0.
The implementation of the interface for the unary operations is very straight forward and can use the code from the figures 1 and 2. For each operation only the function derivativeA is different and therefore a general template file unaryExpression.tpp is written. The template file expects, that the macros NAME, FUNCTION and PRIMAL_FUNCTION are defined. They provide the names of the structure, the operator and a function that calls the operator respectively. From the name of the structure, the file derives the name gradNAME for the function that computes the derivative. In order to make it more convenient to use the template file multiple times, the three macros are undefined at the end of the file. The implementation of an unary operation is shown in figure 5 for the sine.
For binary operators, the same principle can be used, but it requires more predefined functions. binaryExpressions.tpp expects the same predefined macros but also expects the functions derv(11|11M|10|10M|01|01M)_NAME to be defined. They cover all the cases how a binary operator can be called: with a constant as the first argument, with a constant as the second argument and for all cases when the multiplier is or different. This requires more effort to implement a new binary operator but gives the developer all possible options for optimizations. An example implementation for the multiplication is show in figure 6.
Macro definitions are avoided in the implementation as much as possible in order to ease debugging. With the chosen implementation most of the code has a specific line and is not defined in a multi line macro.
The missing ingredients are now the implementation of the calculation types, which the users can use in their software and the tapes that store the Jacobi data.
The calculation type is called ActiveReal, which represents the lvalues in a program. This type has to implement the Expression interface and acts therefore as a termination point for the expression templates. The question is now how much logic the ActiveReal structure should contain. If it contains some logic that is special for a specific tape implementation, then an extra ActiveReal implementation is needed for each tape, which would make it quite involved to add new tapes. We opted for removing all logic from the ActiveReal implementation. It stores only the primal values for the computation, the specific data for the tape and all function calls are forwarded to the tape interface. Because of this choice no extra interface needs to be defined for the ActiveReal. In addition it defines the regular convenience functions to retrieve the data.
The interfaces for the tapes are separated into one that is called from the ActiveReal and an interface for the user interaction. The interaction with ActiveReals requires functions for the creation and destruction of the tape specific data and functions that trigger the storing of the statements. A statement is stored with the function store, here, a tape implementation can access the data from the expression and perform the necessary actions for AD. The design principle behind CoDiPack is, that the store method calls the calcGradient method of the expression. This causes a recursive call of the calcGradient method until the ActiveReal types are reached. As discussed above, we did not want to add any logic to the ActiveReal implementation and therefore the call of calcGradient is forwarded to the pushJacobi method on the tape. The call hierarchy is then:
which shows that the tape calls itself, through the calcGradient method. In the pushJacobi method the tape implementation can therefore gather information about the Jacobian of the expression.
The user interface is more involved and is only required for the reverse mode. The most important functions are registerInput, registerOutput, evaluate, setActive and setPassive. They are needed in order to declare what are the inputs and outputs of the area that is taped. evaluate performs the evaluation of equation (11) for all the statements in the area that is marked with setActive and setPassive.
The implementation of the forward AD mode described in table 2 is now very simple. The full logic is shown in figure 7 with a few abbreviations. All assignment operators will call the store method in line 8 from the tape interface. This method initializes a zero gradient and calls the calcGradient method on the expression in order to compute the gradient for the whole statement. The recursive nature of the calcGradient implementation will cause a call of pushJacobi for each argument of the statement, which contains the gradient of the statement with respect to this variable. According to the forward mode in table 2, we only need to multiply this gradient value with the dot value and add it to the dot value of the left hand side, which is implemented in line 18 of figure 7. Because no external resources are used in the method, the compiler is able to optimize the code as aggressive as possible, which should yield optimal performance results.
The real challenge is the efficient implementation of the reverse AD mode. We identified three different data items at the beginning of the section, that we need to store. There are the number of arguments for each statement, the Jacobi and the index for each argument. This data is represented in three different data streams, that have two different running indices. The number of arguments are written per statement and the Jacobi and index are written per argument which makes the these streams run faster than the first one. In addition we add two data stream for user defined functions and the position where the user defined function needs to be evaluated. These two streams have a third running index that is even slower, than the one of the arguments per statement. The management of five different data streams with three different running indices is already quite involved and we anticipate, that other tape developments will require additional data streams with other running indices. Therefore, we implemented a generalized solution in chunkVector.hpp, that can handle an arbitrary set of data streams with different running indices.
The implementation in chunkVector.hpp is a recursive data structure which is called chunk vector. It has a template argument for a child vector and uses the child vector’s position type to define its own position, therefore every time the position of the parent is queried, the position of the child is also returned. The same is true when the position of the parent is set (e.g. reset), then the same method is called on the child with the child’s position. This has the advantage, that all data streams can be handled as one data object. All operations are only applied to the root vector and recursively evaluated on all child vectors. If a new data stream needs to be added, then only a new vector is inserted in the recursive structure and no additional logic needs to be implemented.
The data streams are now stored in chunks of data. Each chunk can have multiple entries like the double
for the Jacobi and the corresponding index, which makes it possible to handle data streams with the same running index in one chunk vector. Because no array of structures is required by this implementation, the memory layout will be optimal for caching and no dead memory is generated because of padding bytes. This will yield optimal performance on HPC clusters that are optimized for continuous memory access.
With these implementations the five arrays can be easily defined. The user functions and the positions for them are always accessed at the same time, therefore they have the same running index. The chunk is instantiated as Chunk2<ExtFuncData, ChildPosition> where ExtFuncData is the data for the user defined functions and the ChildPosition is the position type of the child vector. The Jacobi entries and indices for the arguments have also the same running index and can therefore be represented by the chunk Chunk2<double, int>, which creates separate vectors for the Jacobies and indices. The fifth array is the number of arguments for each statement. It can be represented by the chunk Chunk1<uint8_t>. With these three chunk definitions the corresponding three chunk vectors User Function Vector, Argument Vector and Statement Vector are created and linked together as
with the linear index as the terminator of the dependency chain. The linear index just defines a counter for the current statement and has therefore no associated array. The dependency is now illustrated in figure 8. It illustrates the data chunks, the vector implementation and the recursive nature. Each chunk boundary is illustrated by a blue bar. The lowest lane contains the user function data and has the slowest running index. Whereas, the mid lane contains the statement data and has a faster running index and the top lane contains the data for each argument and is written therefore most often and has the fastest running index. The linear index contains no data, it is just used for position calculations.
When the data is evaluated from position a to position b, then 6 chunk boundaries are crossed. At these points, the interpretation is stopped, the next chunk is loaded and then the interpretation is continued. Because of the chunk vector design, the chunk management is fully automatic.
There are now several implementations of the reverse AD mode. The most general one is the RealReverse type, which uses a configuration of the chunk vectors that allocates new memory on the fly. Because of the required bounds checking, this implementation requires at least two if-statements that are evaluated when an operation is stored on the tape.
There is also the RealReverseUnchecked type, that is configured such that no bounds checks are performed. This has the advantage, that no if-statements are required when an operation is stored on the tape and should yield faster code. The disadvantage of this type is, that the user has to allocate the required memory in advance.
For both types holds, that they are compatible with c-like memory operations (e.g. memcpy). They can also be configured by various global parameters, that enable or disable additional checks or optimizations. A short overview of the parameters is given in the results section of this paper.
At the start of the chapter the linear indexing model was introduced. This model allows the types of the AD tool to be compatible with c-like memory operations, but it produces large adjoint vectors where most of the memory is used just once. Several things have to be considered for an implementation that reuses indices.
The ActiveReal types require now a destructor in order to release the index. Therefore, an index manager is implemented, that stores the freed indices in a list and tracks the maximum amount of live indices. A best practice for the index handling has yet to be determined (e.g. use a sorted list, store ranges of free indices), the current implementation simply uses a stack for the index management in order to be as fast as possible. If e.g. a sorting logic is used in the index handling, then it impacts the evaluation of each statement which has a huge performance impact on the runtime.
Furthermore, because the indices are reused, every ActiveReal needs a distinct index. If two ActiveReal’s would have the same index, it would be released twice and can then be given to two different variables which would use the same memory location in the reverse evaluation. With this in mind, the ActiveReals with an index reuse technique can no longer support c-like memory operations.
The advantage of the index reuse is the reduced size of the adjoint vector. Instead of having the size of the total amount of variables, it shrinks to the size of the maximum amount of used variables. Because of the reduced size, the reverse evaluation is usually faster.
The CoDiPack types with an index reuse have a Index suffix in there type names. This results into the two new types RealReverseIndexUnchecked and RealReverseIndex.
The coupled burgers equation [Biazar and Aminikhah (2009), Bahadır (2003), Zhu et al. (2010)] is chosen as lightweight test, that can be used to do a rapid evaluation how some changes in CoDiPack affect the performance.
are discretised with an upwind finite difference scheme. The initial conditions are:
and the exact solution is ([Biazar and Aminikhah (2009)])
The computational domain is the unit square and the boundary conditions are taken from the exact solution. As far as the differentiation is concerned, we chose as the input parameters the initial solution of the time stepping scheme and as the output parameter we take the norm of the final solution.
For the implementation of the program, all methods are programmed in such a way that they can be inlined and the requested memory is allocated and initialized before the time measurement starts. Especially the required memory for the tape is computed beforehand and allocated. This yields a very stable time measurements which are run on one node of the Elwetritsch cluster at the TU Kaiserslautern, which consists of two Intel E5-2670 CPU’s with a total of 16 cores and 128 GB of Memory. The general setup calculates the burgers equations on a 601x601 grid with 32 iterations and the time measurement is repeated 10 times. For the time measurements two different configurations are tested:
The multi test configuration runs the same process on each of the 16 cores. This setup simulates a use case where the full node is used for computation and every core uses the memory bandwidth of the socket.
The single test configuration runs just one process on the whole node. This eliminates the memory bandwidth limitations and provides a better view on the computational performance.
Both test configurations will be evaluated with the default CoDiPack settings and then with special configurations options in order to see how these options affect the performance of CoDiPack.
An important application where derivatives are nowadays frequently needed is numerical optimization. When constraints are defined using partial differential equations (PDE), this usually requires efficient adjoint methods. In that case AD facilitates the development of those solvers in the discrete setting. This is especially valuable in computational fluid dynamics, where the analysis, i.e. the evaluation of the constraining PDE, is achieved by highly complex algorithms.
Recently, CoDiPack was applied by the authors to the open-source framework SU2 [Economon et al. (2015)]. The latter is a collection of tools for the analysis and optimization of internal and external aerodynamic problems using a Finite-Volume method. The differentiated code was used to generate a flexible and robust discrete adjoint solver for the Reynolds Averaged Navier-Stokes equations [Albring et al. (2015)] optionally coupled with the Ffowcs-Williams-Hawkings equation [Zhou et al. (2016)] for aeroacoustic optimizations. The adjoint solver is based on the fixed-point formulation of the underlying solver for the discretized state equation [Christianson (1994)]. That is, we assume that feasible solutions of the state equation
are computed by the iteration for . represents the design variables and some (pseudo) time-stepping scheme like the explicit or implicit Euler method. By applying the first order necessary conditions on the optimization problem to minimize some scalar objective function with the constraint that the state equation (18) is fulfilled, we end up with the following fixed-point equation for the adjoint state :
This equation can be solved using the iterative scheme
The above scheme can be easily constructed by applying AD to the code that computes and . It is important to note, that since we need the right hand side of equation (20) repeatedly at the fixed-point , we only need to tape the Jacobians of the statements once at the last fixed-point iteration. Hence the time required for the interpretation determines the overall run-time. In contrast to other implementations of adjoint solvers, where AD is only applied to certain part of the code, we replaced all occurences of the default computation type with the AD-type provided by CoDi. Although this leads to a small overhead in run time (one if-statement per expression), the maintainability and on-the-fly differentation of new features typically outweigh this disadvantage.
As the testcase we consider the inviscid flow over the LM1021 supersonic aircraft. The computational mesh consists of interior elements and the aircraft is discretized using boundary elements. For the spatial integration we use a central scheme with artifical dissipation in combination with an explicit Euler method for the pseudo-time stepping.
The first analysis discusses the general timings of the implementation and the comparison between the Intel and gcc compiler. The recording times for the coupled burgers equation are always averaged over all processors and runs and are shown in figure 9. Every circle in the plot shows the averaged time and the bars show the minimum and maximum times. The results are shown for the Intel and gcc compiler in the single and multi configuration. For each compiler and configuration the four different available tapes are checked. These correspond to the RealReverseUnchecked (Unchecked), the RealReverseIndexUnchecked (UncheckIndex), the RealReverse (Chunk) and the RealReverseIndex (ChunkIndex) types of CoDiPack.
As we can see, there is quite some discrepancy between the singe and the multi variant of the burgers test case. The single variant runs approximately 80% faster than the multi variant, because of the memory bandwidth limitation of the computing node. In the single test case only one process uses the full bandwidth, while in the multi test case 16 processes share the memory bandwidth. This has some interesting additional effects on the run time for the Intel and gcc compiler, as they differ for the single configuration, which is no longer seen in the multi case. This shows that a certain extend of compiler and implementation differences, can be hidden in the memory bandwidth.
It can also be seen that for the single test case the required recording time increases with the complexity of the tape implementation. The Unchecked tape is the simplest one and only a small increase is seen for the UncheckIndex version, that additionally uses an index handler. The jump from the Unchecked variants to the Chunk variants, which allocate memory on the fly and perform bound checking, is around 10% computation time. Interestingly, the index reuse is faster than the linear indexing in that case, probably due to caching effects. In the multi test case all these effects are no longer visible. The UncheckedIndex and ChunkIndex types require more taping time because they need to store additional data.
The interpretation time for the burgers equation is shown in figure 10. Here, the difference between the Unchecked and Chunk types is quite small, because the code for both tape types is nearly the same. Only the index reuse types show a different timing. They require less memory in the reverse interpretation and therefore they are faster in the interpretation. The Intel compiler is again a little bit slower in the single test case, which is no longer seen in the multi case with the memory bandwidth limitation. Figure 10 also shows, that the reduced memory of the Index tapes have a large impact on the reverse interpretation time. We can therefore reach the conclusion, that everything which saves memory can also improve the required time for the AD process.
As was already shown in [Hogan (2014)], the performance of AD not only depends on the structure of the code and its memory footprint, but also on the number of transcendental functions. Usually the compiler can aggressively optimize code that contains only non-transcendental functions, but the application of AD interferes in general with this optimizations. Hence the more transcendental functions occur in the code, the better the relative performance of recording and interpretation. Since the algorithm for the Burgers equation represents an extreme case with no transcendental functions at all, we can except relatively high numbers when comparing the runtime of the derivative computation with the runtime of the primal code (indeed we get a factor and for the single and mult case, respectively).
The next analysis considers the default block size for the chunk tapes. The data in table 6.1 shows that the recording of the tape is not influenced by the size of the blocks. This is however different when the evaluation of the tape is considered. Here, the multi test configuration shows only an overhead for very small blocks. On the contrary, for the single case a large influence is seen on the block size up to 0.5 million entries, which is equivalent to a size of 6 MB for 12 byte entries. A small increase in time is also seen for blocks with more than 33 million entries. The reason why the influence is only seen in the tape interpretation of the single test case is that for the other cases the memory bandwidth is the dominating factor. In order to have an optimal block size for all cases the default is set in CoDiPack to 2 million entries.
The implementation of CoDiPack contains several switches that can be used to fine tune the tapes to the needs of the user. The data in table 6.1 shows the switches separated into the ones that influence the recording of the tape and the ones that influence the interpretation of the tape:
With the Check expression arguments switch enabled, CoDiPack performs checks if the arguments for the elemental functions are in the differentiable domain.
Ignore invalid Jacobians and Ignore zero Jacobians prevent zero and invalid Jacobians to be recorded on the tape.
Check tape activity enables the test if the tape is currently active and should therefore record statements.
The reverse switch Skip zero adjoints prevents CoDiPack from performing the update in the reverse interpretation.
Because the test case is highly optimized, only the ignoring of zero Jacobies changes the tape, but to margin, that can be ignored. The memory bandwidth limited case shows only a small increase in the recording and evaluation time when all switches are enabled because most of the logic can be hidden in the memory latency. Therefore, for large cases all checks can be enabled and the additional time will be minimal. The single case shows an increase of 10% in the computation time for the recording and evaluation of the tape. This shows, that if CoDiPack is used on very small portion of a code, where the tape size is very small, it can improve the performance when specific checks are disabled.
Figure 11 shows the relative time and memory factors for 100 iterations of equation (20) with respect to one evaluation of equation (18). Here we also distinguish between the serial run and the parallel run utilizing a full compute node with 16 cores and the Chunk and ChunkIndex tapes. Due to its structure, CoDi allows the direct access to the Jacobian values stored on the tape. This enables the use of advanced AD techniques like the preaccumulation of Jacobians [Utke (2005)] where parts of the tape are already interpreted during the recording. In SU2 this method is employed in certain parts of straight-line code to significantly reduce the memory and interpretation time at the cost of an increased recording time. We include this approach in the discussion, since it represents the common setting when using the adjoint solver.
First of all there is no performance degradation noticeable for the parallel computations compared to the serial computations. This indicates that the latter is already bandwidth limited. In fact, the parallel case even consistently shows a slightly lower factor. Although the majority of operations in SU2 are multiplications and additions, in some parts of the code also transcendental functions are used which intervene with compiler optimizations. Indeed, unlike the test case presented in the previous subsection, we get relative run-time factors between 3.1 to 4.47 for the recording and between 0.4 to 0.86 for the interpretation of one flow iteration without preaccumulation. Again we emphasize that this represents a "black-box" differentiation that contains all routines to compute the flow update in equation (18), even if they do not contribute to the final gradient and can therefore be considered as passive. Hence little knowledge of the code is required in that case. The application of preaccumulation requires slightly more knowledge of the code to identify inputs/outputs of certain regions. Still, once identified, the modifications to apply the method are minor. Figure 11 also shows the effect of preaccumulation on the run-time and memory. Although the time for taping increases by roughly , the interpretion time reduces by almost , which in turn reduces the overall run-time of the adjoint solver by nearly . In addition the memory consumption is also reduced by a great amount.
In contrast to the Burgers test case in section 6.1, here the non-linear indexing scheme (represented by the type ChunkIndex) does not offer a better performance compared to the Chunk type with linear indexing. A detailed analysis showed that this is due to a high amount of "active copies" in SU2, i.e. copy/assignment operations with just one active type on the right hand side. For the linear indexing a simple copy including the index value is sufficient in that case. For the non-linear indexing type a more complex treatment is necessary since each index needs to be unique in the application. Therefore a statement has to be written for each copy operation.
In this paper we demonstrated the efficient implementation of a novel AD tool that employs Jacobi taping with expression templates. Unlike other tools the layout of CodiPack is designed in such a way that other taping approaches can be easily added to further enhance its capabilities, while still maintaining an easy to use interface and high performance. Furthermore, the recursive layout of the data streams allow for a simple addition of new streams and several design choices such as the avoidance of preprocessor macros make the code easy to understand. CoDiPack provides a useful basis that hopefully encourages people to join the currently stagnating research in implementation strategies for AD.
The performance of the Expression Template approach heavily relies on the capability of the compiler to inline certain routines. Hence to assure that CoDiPack offers a compiler independent behavior we first compared the performance between the Intel and gcc compiler. Here it was shown that both compilers give indeed similar performance, though the gcc compiler was slightly faster when just a single process is used. This difference diminishes however when occupying all available cores of a node because the memory bandwidth becomes the limiting factor. Similar results were reported for the indexing and memory management schemes for both compilers. Using the same compiler, it was shown that there is only a moderate performance degradation during the recording when the more complex chunk memory management implementation is used compared to the simple management without any bound checking. The former offers a more user friendly behavior without the need to know the tape size a-priori. Furthermore the types that use an index handler need less run-time for the interpretation due to the reduced tape size.
The application to the CFD software SU2 certified CoDiPack’s high performance even for the case of a "black-box" differentiation of complex code and in parallel. High-level tuning using preaccumulation techniques can help to further decrease the memory consumption.
Future research is devoted to the improvement of the performance using novel indexing and memory management schemes. We will also expand the research to the field of analysis of codes, to provide guidelines which CoDi type is optimal for what kind of code structure. Additionally the user interface in CoDiPack is constantly extended to provide convenient routines to easily handle user differentiated functions or preaccumulation.