Event-Based Automatic Differentiation of OpenMP with OpDiLib

02/23/2021, by Johannes Blühdorn, et al.

We present the new software OpDiLib, a universal add-on for classical operator overloading AD tools that enables the automatic differentiation (AD) of OpenMP parallelized code. With it, we establish support for OpenMP features in a reverse mode operator overloading AD tool to an extent that was previously only reported for source transformation tools. We achieve this with an event-based implementation ansatz that is unprecedented in AD. Combined with modern OpenMP features around OMPT, we demonstrate how it can be used to achieve differentiation without any additional modifications of the source code; neither do we impose a priori restrictions on the data access patterns, which makes OpDiLib highly applicable. For further performance optimizations, safeguards such as atomic updates on the adjoint variables can be lifted in a fine-grained manner for any parts of the code. OpDiLib can also be applied in a semi-automatic fashion via a macro interface, which supports compilers that do not implement OMPT. In a detailed performance study, we demonstrate the applicability of OpDiLib for a pure operator overloading approach in a hybrid parallel environment. We quantify the cost of atomic updates on the adjoint vector and showcase the speedup and scaling that can be achieved with the different configurations of OpDiLib in both the forward and the reverse pass.

1 Introduction

Automatic differentiation (AD) deals with the evaluation of derivatives of computer codes, in a machine accurate manner and for a computational cost that is often only a small multiple of the cost of the original program. AD has repeatedly proven its value in scientific computing, especially for large simulation codes where symbolic derivatives are not feasible and finite differences produce numerically unstable and unreliable results. Areas of application include discrete adjoint solvers [32, 1], shape optimization [11, 24], inverse problems [2], and machine learning [13]. A comprehensive introduction to AD is given in [12].

We view a computer program with input variables $x \in \mathbb{R}^n$ and output variables $y \in \mathbb{R}^m$ as a mathematical function $F\colon \mathbb{R}^n \to \mathbb{R}^m$, $y = F(x)$. While an execution of this program for a specific input might involve millions of code statements, it always amounts to a sequence of elementary operations like sums, products, or standard math library functions like $\sin$ or $\exp$. For all these operations, the derivatives are known; hence, the derivative of the full code in $x$ can be obtained by a rigorous application of the chain rule.

The reverse mode of AD is a specific strategy for the structured traversal of the chain rule. The code is executed in its original structure in a forward pass; throughout, it is recorded in a suitable representation on a stack-like structure called tape. In the subsequent reverse pass, the tape is evaluated and derivative information is propagated from the outputs to the inputs in reverse direction compared to the original data flow. To formalize this, we introduce for each variable $v$ of the code, whether input, output, or intermediate variable, an adjoint variable $\bar{v}$ that fulfills the relation $\bar{v} = \left(\frac{\partial y}{\partial v}\right)^T \bar{y}$. For the adjoint inputs $\bar{x}$, we obtain

$\bar{x} = \left(\frac{dF}{dx}(x)\right)^T \bar{y}$,

so that suitable choices of $\bar{y}$ yield various derivatives of the computer code. Choosing $\bar{y}$ is known as seeding. In an implementation, it is practical to assemble the adjoint variables in an incremental fashion on the statement level. Let a code statement be given by

$w = \varphi(u)$, $\quad u = (u_1, \ldots, u_k)$,   (1)

where $w$ is an intermediate variable or an output, $u$ consists of inputs and intermediate variables from previous statements, and $\varphi$ is composed of one or multiple elementary operations. The adjoint updates for the statement read

$\bar{u}_i \leftarrow \bar{u}_i + \frac{\partial \varphi}{\partial u_i}(u)\,\bar{w}$, $\quad i = 1, \ldots, k$.   (2)

All information required for the reversal of a statement must be stored on the tape, that is, identification for the adjoint variables $\bar{w}$ and $\bar{u}_i$ that have to be used [29], and information for the computation of the terms $\frac{\partial \varphi}{\partial u_i}(u)$ [17, 27]. The order in which all statements (1) occurred in the forward code is preserved in their recording on the tape, and unwinding the tape from end to start produces the corresponding adjoint updates (2) in reverse order.
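To illustrate (1) and (2) with a concrete statement (a generic textbook example, not taken from a specific code), consider $w = u_1 \sin(u_2)$. The corresponding adjoint updates in the reverse pass are

$\bar{u}_1 \leftarrow \bar{u}_1 + \sin(u_2)\,\bar{w}$ and $\bar{u}_2 \leftarrow \bar{u}_2 + u_1 \cos(u_2)\,\bar{w}$,

so the tape entry for this statement has to identify $\bar{w}$, $\bar{u}_1$ and $\bar{u}_2$ and, depending on the AD tool, stores either the partial derivatives $\sin(u_2)$ and $u_1 \cos(u_2)$ or the values required to recompute them.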

A key objective of AD tools is the automatic augmentation of the original source code by additional statements required for the recording of tapes and the computation of derivatives. The two main approaches for this are source transformation and operator overloading. Source transformation tools are essentially compilers that take the original source code and transform it to an augmented source code. Besides the original computations, the augmented source code also contains statements for the computation of derivatives. TAPENADE [14] is an example of a source transformation tool. Operator overloading tools rely on a replacement of the floating point data type by a custom AD type. All elementary operations are overloaded for the AD type. From the user’s point of view, the AD type behaves like a floating point type; internally, it performs anything that is required for differentiation, for example the recording in the reverse mode of AD. Examples of operator overloading AD tools are ADOL-C [33] and CoDiPack [28].
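The operator overloading principle can be illustrated with a strongly simplified sketch of an AD type that records multiplications on a global tape; this is hypothetical illustration code and not the implementation of ADOL-C or CoDiPack.

#include <vector>

// Strongly simplified sketch of a reverse mode operator overloading AD type
// (hypothetical illustration, not the implementation of an actual AD tool).
struct TapeEntry {
  int lhs;             // adjoint identifier of the statement's result
  int rhs[2];          // adjoint identifiers of the arguments
  double partials[2];  // partial derivatives of the statement
};

static std::vector<TapeEntry> tape;   // recording of the forward pass
static std::vector<double> adjoints;  // one adjoint value per recorded variable

struct ADouble {
  double value;  // primal value
  int index;     // identifies the associated adjoint variable

  explicit ADouble(double v = 0.0) : value(v), index(static_cast<int>(adjoints.size())) {
    adjoints.push_back(0.0);
  }
};

// Overloaded multiplication: compute the primal result and record the statement.
inline ADouble operator*(ADouble const& a, ADouble const& b) {
  ADouble result(a.value * b.value);
  tape.push_back({result.index, {a.index, b.index}, {b.value, a.value}});
  return result;
}

// Reverse pass: traverse the tape from end to start and propagate adjoints.
inline void evaluateTape() {
  for (auto it = tape.rbegin(); it != tape.rend(); ++it) {
    for (int i = 0; i < 2; ++i) {
      adjoints[it->rhs[i]] += it->partials[i] * adjoints[it->lhs];
    }
  }
}

Replacing double by such a type lets the recording happen as a side effect of executing the original statements, which is the mechanism that the remainder of this paper builds on.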

In order to achieve maximum performance on today’s multicore architectures, simulation codes typically employ shared memory parallelism, for example by means of OpenMP [23]. Hence, AD tools should be capable of differentiating codes that contain OpenMP directives. Research on combining AD and OpenMP started in the realm of source transformation tools when [6, 7] suggested the generation of OpenMP parallel code for the simultaneous computation of multiple derivatives of an otherwise serial code with the forward vector mode. This is extended in [8] to the case of an already parallel code by using nested parallelism. First reports on source-to-source reverse mode differentiation of an OpenMP parallel simulation code are given in [15, 16]. Subsequently, support for reverse mode AD for OpenMP parallel codes was also established in operator overloading tools, for example ADOL-C [19, 5]. However, the fact that OpenMP parallelism is specified in large parts via #pragma directives posed a serious challenge for operator overloading tools. Such directives are outside the scope of the programming language and inherently inaccessible to overloading techniques. This limited the supported set of OpenMP features in operator overloading to parallel regions and parallel loops. Source transformation tools, on the other hand, have never been subject to this limitation. Transformation rules for many additional OpenMP directives were established in [10].

We summarize some key insights from previous research. Transitioning from the reverse mode of AD in a serial context as described above to parallel codes, we face the problem that taping is an inherently sequential process. Therefore, computations that are performed simultaneously in different threads have to be recorded on different tapes. Furthermore, it was observed that the parallelization of the forward pass can be used to deduce a corresponding parallelism for the reverse pass. However, this is — except for special cases — not as simple as evaluating the tapes in parallel. First, any additional synchronization that occurred in the forward pass has to be transformed to corresponding reverse synchronization; second, while simultaneous read access to the same variable from different threads is fine in the forward pass, the corresponding adjoint updates (2) then require simultaneous write access to the same adjoint variable, which introduces data races in the reverse pass. Given the incremental nature of the adjoint updates (2), it was shown in [10] that using only atomic updates is a general solution for this. However, atomic updates are associated with a high performance cost and previous research has therefore identified forward data access patterns that do not require reversal with atomic updates [19, 10, 18].
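The reverse-pass data race caused by shared reading can be sketched as follows; this is illustration code (not OpDiLib's implementation) for a forward loop in which every iteration reads the same variable x via y[i] = a[i] * x, so that every thread increments the single adjoint of x during reversal.

// Hypothetical reverse pass of a loop whose iterations all read the shared variable x
// in the forward pass. Every thread increments the same adjoint value; the atomic
// directive makes these concurrent increments safe, at a runtime cost.
void reverseSharedRead(double& x_adjoint, double const* a, double const* y_adjoint, int n) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    // adjoint update corresponding to the forward statement y[i] = a[i] * x
    #pragma omp atomic update
    x_adjoint += a[i] * y_adjoint[i];
  }
}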

In this paper, we present a novel event-based approach for the reverse mode automatic differentiation of OpenMP parallel codes with operator overloading tools and its implementation in the new open source tool OpDiLib (https://www.scicomp.uni-kl.de/software/opdi/). Starting with version 5.0 of the OpenMP specification [23], internal parts of the OpenMP runtime are formalized by execution model events. Furthermore, OMPT, an API that allows tool developers to observe and react to these events, becomes part of the specification. OMPT was originally designed to provide a standardized interface for the performance monitoring of OpenMP applications [9]; more generally, it allows for the coupling of the OpenMP runtime with tools that execute in the address space of the original program [23], which is precisely the case for operator overloading AD tools. In this work, we identify and demonstrate the applicability of OMPT for automatic differentiation. Specifically, it allows us to implement augmentations required for AD handling as side effects of OpenMP constructs, without any additional modifications of the original source code. We start in Section 2 with an overview of OpDiLib’s architecture. It has evolved from the OMPT core concept to a point where it supports all the different configurations and use cases presented in this paper.

In order to support OpenMP in an operator overloading AD tool, two kinds of adaptations must be made. First, the AD tool itself must be thread-safe. This is an AD tool specific property that requires an individual implementation. Second, an appropriate handling for OpenMP mechanisms must be provided. While certain OpenMP support has been crafted into some operator overloading AD tools by now, we demonstrate with OpDiLib for the first time that the second part can be implemented independently of a specific operator overloading AD tool. Moreover, this implementation is based on a limited set of features that many AD tools already have, which allows us to design OpDiLib as a universal add-on for classical operator overloading AD tools. Support for OpenMP in such a classical operator overloading tool is then achieved by making it safe to use in an OpenMP environment in a first step, and coupling it with OpDiLib in a second step. This continues a philosophy of AD tool agnostic support for parallelism in operator overloading that was previously established for MPI (https://www.mpi-forum.org/) in the tools AMPI [30] and MeDiPack (https://www.scicomp.uni-kl.de/software/medi/). The coupling is described in Section 3. OpDiLib bindings are already implemented for CoDiPack [28]. We hope that this paper is also a useful starting point for coupling OpDiLib with other operator overloading AD tools in the future.

OpDiLib’s event-based differentiation logic is presented in Section 4. It is formulated in terms of AD events, and the AD handling is implemented as reactions to these events. This corresponds to an AD-centered point of view that allows for a unified handling of similar OpenMP mechanisms and hides many OpenMP and OMPT specific details that are not relevant for AD. Furthermore, the differentiation logic itself does not depend on OMPT at all. While OMPT execution model events can be used to generate AD events, the differentiation logic can also be reused together with other AD event generation mechanisms. With this approach, we establish in Section 4.1 AD support for many OpenMP constructs in operator overloading, including several that were previously only treated by source transformation tools.

From a user’s point of view, OpenMP is very easy to apply. Introducing parallelism to a code can be as straightforward as supplying a loop with a #pragma omp parallel for directive. To achieve maximum parallel performance, on the other hand, additional restructuring of the code and the application of more advanced OpenMP features might be required. OpDiLib follows the same philosophy. With OMPT, a first differentiated version of an OpenMP parallel code can be achieved without any additional code changes and without restrictions on the data access patterns. We achieve this by defaulting to atomic updates on adjoint variables. Starting from there, a user may employ additional optimizations. An important part of improving the performance of the parallel reverse pass is the elimination of atomic updates, which might only be possible after a revision of the data access patterns. Then, OpDiLib’s adjoint access control tools which we present in Section 4.2 can be used to disable atomic updates for selected parts of the code.

Section 5 presents two different mechanisms for the generation of AD events. Section 5.1 details the generation of AD events via OMPT execution model events. OpDiLib is designed to be used with these modern OMPT features. Nonetheless, compilers are still in the process of establishing support for OMPT. As many projects rely on older compiler versions or compilers without OMPT features, we provide a second backend that does not require OMPT. It is based on an alternative interface to OpenMP functionality that consists of replacement macros for OpenMP #pragma directives and clauses as well as replacements for OpenMP’s runtime functions. The only additional step required for its application is the replacement of OpenMP constructs. As the macro interface is fully compatible with the OMPT backend, these code changes do not have to be revisited for a transition from the macro backend to the OMPT backend. Section 5.2 provides an overview of the macro interface and highlights some implementation details of the macro backend.

In Section 6, we conduct a performance study of OpDiLib applied to an OpenMP-MPI hybrid parallel code. We use CoDiPack (https://www.scicomp.uni-kl.de/software/codi/) [28] as the underlying AD tool and MeDiPack for the differentiation of MPI. While a mixed approach between operator overloading and source transformation was established for hybrid parallel codes in [31], we demonstrate with OpDiLib that they can also be handled by pure operator overloading, which admits a simplified toolchain without additional source transformation compilers and does not involve interfaces between different AD paradigms. We compare OpDiLib’s backends and different compilers, quantify the cost of atomic updates on the adjoint vector, and demonstrate the impact of OpDiLib on the parallel performance of the forward and reverse pass. We conclude our work in Section 7.

2 Architecture

OpDiLib is divided into three parts: backend, logic, and tool. These layers are illustrated together with their relations in Figure 1.

Figure 1: Layers of OpDiLib’s architecture.

The tool layer couples OpDiLib with a classical operator overloading AD tool. It provides access to AD tool features required by OpDiLib while abstracting away the complexity of the AD tool. OpDiLib’s logic layer implements an OpenMP differentiation logic that is formulated in terms of AD events. These occur in pairs that mark the beginning and end of OpenMP constructs, respectively. On the one hand, the event handling consists of augmentations of the forward pass, for example preparations of the parallel recording environment. On the other hand, the reverse OpenMP logic is embedded into the tapes. For both parts, the logic layer relies on the AD tool features exposed by the tool layer. OpDiLib’s backend layer is responsible for the generation of AD events alongside the OpenMP workflow, for example at the beginning or end of a parallel region, or upon encountering or leaving a synchronization barrier. Furthermore, the backend layer is responsible for AD data exchange between matching begin and end events. Any data registered by the logic layer for a begin event is also provided to the matching end event. The backend layer also provides low-level information such as identifiers of locks or data associated with the current parallel region. We discuss these layers in detail in Sections 3, 4, and 5, respectively.

The intention of the design is that each layer can easily be exchanged without affecting the others, thus increasing the reusability of all parts of OpDiLib. For each type of layer, an interface is defined that guides its implementation. Coupling OpDiLib with another operator overloading AD tool, for example, is realized via a new implementation of the tool layer interface. In Section 5, we present two backend implementations that employ different mechanisms of AD event generation. The only part of OpDiLib that depends on OMPT is the specific realization of the backend layer presented in Section 5.1.

While the backend layer is compiled into the application, both tool and logic layer can also be exchanged at runtime. For the tool layer, this is important if different AD types are used in different parts of the application. Since the tapes used by the logic layer usually depend on the AD type, this use case also requires an exchange of the logic layer. Specifically for the logic layer, we could imagine the use case of an optimized implementation tailored to a specific way of using OpenMP; to improve the performance, one would switch to this specialized logic layer for such parts of the code.

3 Tool

In this section, we present the assumptions on the AD workflow that guide the implementation of OpDiLib and collect the properties and features that an operator overloading AD tool must have in order to be coupled with OpDiLib. We showcase the interface that has to be implemented for the coupling and close with the discussion of some advanced use cases. To begin with, we assume that an AD workflow such as depicted in Listing 1 is used, and that the AD tool supports the classical operations involved.

// within serial code
AD::masterTape->registerInputs(inputs);
AD::masterTape->setActive(); // start recording

// function to be differentiated
// involves parallel code
f(inputs, outputs);

AD::masterTape->setPassive(); // stop recording
AD::masterTape->registerOutputs(outputs);
AD::masterTape->seedOutputs(seeds);
AD::masterTape->evaluate(); // reverse pass
AD::masterTape->reset();
Listing 1: AD workflow.

Particularly, we assume that the recording is controlled by the initial implicit task, that is, serial parts of the code. In Listing 1, all parallel regions are assumed to be contained in f. Hence, when transitioning to parallel code, the general AD workflow set up for the serial version can be kept and, as before, the user interacts only with the tape of the initial implicit task, to which we refer as the master tape. Any additional treatment required for parallel code is embedded by OpDiLib into the master tape in a fully automatic fashion. In order to be coupled with OpDiLib, an AD tool must additionally

  • allow for creation and destruction of tapes at runtime,

  • treat tapes as thread-local objects that can be exchanged at runtime, and on which the classical operations such as recording or evaluation can be performed individually,

  • provide access to tape positions and support positional evaluation and positional reset of tapes,

  • have capabilities to embed custom function calls into the tape evaluation, commonly referred to as external functions [19].

Furthermore, the AD tool must be thread-safe for use in an OpenMP parallel environment.

  • Recordings that are performed in parallel on different tapes must be thread-safe. Particularly, the simultaneous distribution of indices to active variables in multiple threads must be safe.

  • The AD tool must support atomic operations on the adjoint vector. It is sufficient if each evaluation uses atomic access by default. However, to allow for additional performance optimizations, it is preferred that the adjoint access mode can be specified for each individual positional tape evaluation. The most obvious optimization is that serial parts of the code are not evaluated with atomic updates.

The full interface that has to be implemented for coupling OpDiLib with an operator overloading AD tool is shown in Listing 2. It reflects the AD tool requirements identified above. Note that any specific types used by the AD tool for tapes and tape positions are abstracted away by the tool layer, hence the various void* pointers in the interface.

struct ToolInterface {
  public:
    virtual ~ToolInterface() {}

    virtual void init() = 0;
    virtual void finalize() = 0;

    // tape creation and deletion
    virtual void* createTape() = 0;
    virtual void deleteTape(void* tape) = 0;

    // management of thread local tape of caller
    virtual void* getThreadLocalTape() = 0;
    virtual void setThreadLocalTape(void* tape) = 0;

    // position handling
    virtual void* allocPosition() = 0;
    virtual void freePosition(void* position) = 0;
    virtual size_t getPositionSize() = 0;
    virtual std::string positionToString(void* position) = 0;
    virtual void getTapePosition(void* tape, void* position) = 0;

    // tape handling
    virtual bool isActive(void* tape) = 0;
    virtual void setActive(void* tape, bool active) = 0;
    virtual void evaluate(void* tape, void* start, void* end,
        bool useAtomics = true) = 0;
    virtual void reset(void* tape, bool clearAdjoints = true) = 0;
    virtual void reset(void* tape, void* position,
        bool clearAdjoints = true) = 0;
    virtual void pushExternalFunction(void* tape, Handle const* handle) = 0;

    // tape editing
    virtual void erase(void* tape, void* start, void* end) = 0;
    virtual void append(void* dstTape, void* srcTape, void* start,
        void* end) = 0;
};
Listing 2: Interface to AD tool features.
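To illustrate how the tool layer couples this interface to an AD tool, the following sketch implements a subset of the functions for an imaginary tape type; MyTape, MyPosition and their members are assumptions made for this example and do not belong to any specific AD tool. A real coupling would derive from ToolInterface in Listing 2 and implement all of its functions in the same spirit.

#include <cstddef>

// Imaginary AD tool used for illustration only.
struct MyTape {
  static thread_local MyTape* current;  // tape that records the statements of the calling thread
  bool active = false;
};
thread_local MyTape* MyTape::current = nullptr;

struct MyPosition { size_t entry = 0; };

// Partial sketch of a tool layer for MyTape; the remaining functions of the
// interface in Listing 2 follow the same pattern.
struct MyToolLayer {
  void* createTape() { return new MyTape(); }
  void deleteTape(void* tape) { delete static_cast<MyTape*>(tape); }

  void* getThreadLocalTape() { return MyTape::current; }
  void setThreadLocalTape(void* tape) { MyTape::current = static_cast<MyTape*>(tape); }

  void* allocPosition() { return new MyPosition(); }
  void freePosition(void* position) { delete static_cast<MyPosition*>(position); }
  size_t getPositionSize() { return sizeof(MyPosition); }

  bool isActive(void* tape) { return static_cast<MyTape*>(tape)->active; }
  void setActive(void* tape, bool active) { static_cast<MyTape*>(tape)->active = active; }
};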

In the remaining parts of this section, we discuss some refinements of the AD workflow from Listing 1 for advanced use cases.

Multiple recordings and evaluations.

The AD tool may support multiple recordings on the same tape and multiple evaluations of one recorded tape, but OpDiLib does not depend on these features. However, it differentiates them correctly if they are present.

Positional evaluation of the master tape.

If the AD workflow in the serial part of the code involves positional evaluations of the master tape, additional small modifications of the workflow are required in order to get correct results with OpDiLib. As detailed in Section 4, OpDiLib maintains internal states for the differentiation. State changes are tracked alongside the recording on the classical tapes, and reverted accordingly during evaluation. Hence, if parts of the master tape are evaluated starting from a position that does not match OpDiLib’s current state, the state corresponding to that position must be restored first. To that end, OpDiLib’s logic layer offers functions for the export and recovery of internal states. These have to be used alongside the export of tape positions and prior to positional evaluation, respectively. The resulting workflow is depicted in Listing 3.

AD::Position startPosition = AD::masterTape->getPosition();

AD::masterTape->startRecording();

// function to be differentiated
// involves parallel code
f(inputs, outputs);

AD::masterTape->endRecording();

AD::Position endPosition = AD::masterTape->getPosition();
OpDiLib::State endState = OpDiLib::logic->exportState();

... // other code

OpDiLib::logic->recoverState(endState);
AD::masterTape->evaluate(endPosition, startPosition);
Listing 3: Modified workflow for positional evaluation of the master tape.

Tape activity changes within parallel code.

There are some edge cases that, contrary to our assumption in Listing 1, require changes of the tape activity during the recording phase. External functions that are executed on the AD type instead of a scalar type, for example, require an interruption of the recording. OpDiLib handles AD events only if the tape of the thread that encounters them is active. Hence, care must be taken that changing the tape activity does not produce incorrect results. For example, a parallel region that is encountered by a thread with inactive tape does not prepare a parallel recording environment. Recording may only be performed inside parallel regions that were already encountered by a recording thread. Also, AD events occur in pairs of which either both or none should be handled. The tape activity at any begin event must be identical to the tape activity at the matching end event. Even then, care must be taken that no important synchronization information is lost due to OpenMP directives encountered during interruptions of the recording. An experienced user, on the other hand, can take advantage of this and eliminate synchronization from the tapes that is not required during the reverse pass. To give an example, the following code showcases a barrier that does not result in a barrier in the reverse pass.

AD::threadLocalTape->setPassive();
#pragma omp barrier
AD::threadLocalTape->setActive();

External functions involving parallelism.

While OpDiLib uses external functions to embed state changes and reverse parallelization into classical tapes, they are originally designed as a tool for users to provide custom derivatives for code, say a function g, where blackbox differentiation is inaccurate or costly in terms of memory or runtime [19]. A typical example for this is the differentiation through a solver for a linear system of equations [26]. Instead of the statement-level recording, user-provided code for the differentiation of g is embedded into the tape. Transitioning to the parallel case, the user-provided derivative code does not only replace the derivative computation but also the reverse parallelism that would otherwise be deduced automatically by OpDiLib. This gives experienced users another opportunity to fine-tune the performance. OpDiLib is not aware whether the tapes it evaluates contain usual statement-level recordings or external functions. Hence, support for external functions involving parallelism is mostly an issue of the underlying AD tool. Particularly, its external function tools must be thread-safe, which involves the declaration of inputs and outputs, tape activity changes, the registration of the external function on the tapes, and where necessary, barrier synchronization between these steps. From OpDiLib’s point of view, each thread that registered the external function on its tape in the forward pass results in a thread that participates in the evaluation of the custom derivative code in the reverse pass. In typical use cases for external functions, the same degree of parallelism is used in the forward and reverse pass. Nonetheless, OpDiLib does not prevent alternative patterns here.

4 Logic

As outlined in the introduction, previous research on differentiation of parallel code made the observation that the parallelization of the forward code can be used to deduce a corresponding parallelization of the reverse pass. In Section 4.1, we collect previous results on the transformation of OpenMP constructs to their reverse counterparts, transfer them — where necessary — from the realm of source transformation to the realm of operator overloading, extend them to a larger class of OpenMP constructs, and detail their event-based implementations in OpDiLib.

OpDiLib achieves a thread-safe parallel evaluation by defaulting to atomic updates on the adjoint variables. It depends on the specific data access patterns whether this can be relaxed, but doing so can have a significant impact on the performance, see e. g. [18]. We defer the performance discussion to Section 6. In Section 4.2, we present the tools provided by OpDiLib with which an experienced user can disable atomic evaluation where appropriate.

4.1 Handling of OpenMP Constructs

In OpDiLib, the AD treatment of OpenMP constructs is implemented as reactions to events that indicate the beginning and end of OpenMP constructs alongside the OpenMP workflow. This is inspired by the execution model events defined in the OpenMP 5.0 specification [23]. However, there is a multitude of execution model events, and a lot of internal information is passed to the callbacks that can be registered for these events. Not all of it is relevant for differentiation. For the purpose of AD, we derive from the execution model events a reduced set of events with correspondingly adapted data, to which we refer as AD events in the following. Another advantage of the AD event formulation is that it does not depend on OMPT and can also be used together with other event generation mechanisms, as we will discuss in Section 5. The logic layer implements handling for the AD events. For each AD event, this may consist of a forward action and a reverse action that are accompanied by appropriate AD data. The forward action contains additional code for the forward pass and the reverse action contains reverse OpenMP logic in the shape of an external function. Upon encountering the event, the forward action is executed and the reverse action is pushed onto the thread’s tape. AD data encompasses any data required in the forward and reverse actions of matching begin and end events. It is usually created in the forward action of a begin event, passed to the matching end event, and provided to the reverse actions of either of both events as external function data. These patterns will become apparent in the examples throughout this section.

Parallel regions and implicit tasks.

Support for OpenMP in an operator overloading tool was already established on the level of parallel regions in [19], which led to the implementation in ADOL-C [5]. The core idea is that concurrent recordings are performed on different tapes that are then also evaluated in parallel. In OpDiLib, the handling of the #pragma omp parallel directive is split into events for parallel regions and implicit tasks. The resulting event family is presented in Figure 2. The corresponding forward and reverse actions are detailed in Figure 3.

Figure 2: Parallel region related AD events in the intuitive ordering with respect to each other and the OpenMP workflow, as well as corresponding AD data flow. Data set according to OpenMP runtime information is tagged with (omp).
Figure 3: Forward and reverse actions of the parallel region AD event family. Actions without description are either empty (white) or contain management tasks like data allocation, data initialization, and the reverse action external function push (grey). The corresponding AD data is displayed in Figure 2.

The ParallelBegin event is received by the thread that encounters the OpenMP directive. It receives the requested number of threads, e. g. due to a num_threads clause. Since the actual number of threads is bounded from above by the number of requested threads [23], it can be used to prepare the parallel recording environment with the possibility of a small overallocation. Each thread that participates in the parallel region receives an ImplicitTaskBegin event before it executes the parallel region’s structured block. They receive the actual number of threads and their index in the team of threads that execute the parallel region. An ImplicitTaskEnd event is received by each thread after its execution of the structured block. The key objective of the ImplicitTask events is that each thread is equipped with a dedicated tape for the extent of the parallel region, which is realized via their forward actions, see Figure 3. The ParallelEnd event is again received by the thread that encountered the parallel region, after it received its ImplicitTaskEnd event. Its reverse action contains the parallel evaluation of the tapes that were recorded in the forward pass.

We do not want to create new tapes for every parallel region. Therefore, the tapes are acquired from and released to a tape pool and reused again in later parallel regions. In order to evaluate the correct parts of the tapes in the reverse parallel regions, we must remember the positions of the tapes at the boundaries of the structured block in the forward pass. This is implemented as part of the ImplicitTask event handling. As an advancement over what is reported in [19], our implementation also supports nested parallelism. This requires additional care in the tape management. To see this, consider the following example.

#pragma omp parallel num_threads(2)
{
  workA();
  #pragma omp parallel num_threads(1)
  {
    workB();
  }
}

Suppose the workload workA is imbalanced for the two threads, so that the first thread is finished with both workA and the inner parallel region before the second thread finishes workA. Then, the tape that thread one used in the inner parallel region is available again and might be acquired by thread two so that both inner parallel regions are recorded on the same tape. In the reverse pass, however, both inner parallel regions are most likely evaluated in parallel. If no indices are reused, this is not a problem. Otherwise, the same index pool is used for temporaries throughout workB in both threads, which creates data races in the reverse pass. To avoid this, it must be ensured that no tape is acquired twice within a parallel region, including all levels of nested parallelism. The resulting hierarchical arrangement of tapes can be seen as an extension of the nested taping described in [19].

Figure 2 displays the intuitive ordering of the parallel region related AD events with respect to each other and with respect to the AD workflow, which is sufficient for a concise presentation of the differentiation logic. In practice, however, we have to support all orderings of the corresponding OpenMP execution model events that are admissible according to the specification [23]. We revisit this in Section 5.1.

Sync regions.

The first class of OpenMP constructs that were only addressed by source transformation tools are sync regions, with the explicit #pragma omp barrier directive as a prominent example. According to [10], its reverse counterpart is, again, a barrier directive. This is handled in OpDiLib by a SyncRegionBegin event that occurs upon encountering the barrier, and a SyncRegionEnd event that occurs upon leaving it. No AD data is required, and the reverse action of either of both events may contain the reverse barrier. In addition to the explicit barrier directive, this logic covers also other types of barriers, for example implicit barriers due to OpenMP worksharing directives that do not specify a nowait clause, as we explain in the next paragraph.
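The essence of this handling can be sketched with a few lines of illustration code; the tape is modelled here as a per-thread stack of reverse actions, whereas OpDiLib embeds the reverse action into the tape of the underlying AD tool as an external function.

#include <functional>
#include <vector>

// Per-thread stand-in for the tape: a stack of reverse actions (illustration only).
static thread_local std::vector<std::function<void()>> reverseActions;

// Handling of SyncRegionBegin (or SyncRegionEnd): the forward action only records
// a reverse action that re-creates the barrier during the evaluation of the tape.
void onSyncRegion() {
  reverseActions.push_back([]() {
    #pragma omp barrier
  });
}

// Reverse pass of this thread: execute the recorded reverse actions from end to start.
void evaluateReverseActions() {
  for (auto it = reverseActions.rbegin(); it != reverseActions.rend(); ++it) {
    (*it)();
  }
}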

Worksharing.

It is common to all worksharing directives that they define independent units of work that are scheduled for execution by specific threads at runtime. Since each thread is equipped with its own tape already, the specific worksharing is reflected in the recordings performed on the different tapes; without any additional treatment at all, the same worksharing is already achieved in the reverse pass simply by evaluating the tapes in parallel. This covers the #pragma omp for, #pragma omp single and #pragma omp sections directives. Nonetheless, care must be taken that synchronization due to the implicit barrier at the end of such worksharing regions results in corresponding reverse barrier synchronization, as explained in the previous paragraph. Then, if the forward synchronization is skipped with a nowait clause, the reverse synchronization is omitted as well. Additional augmentations in the source transformation rules for worksharing directives in [10] are either specific to source transformation or part of a so-called joint reversal approach, where the memory footprint is reduced by checkpointing techniques already as part of the transformation. As is common in operator overloading tools, a user has to employ such techniques manually, and the same holds true if OpDiLib is applied. Still, a WorkBegin and a WorkEnd event are reserved for future augmentations of worksharing directives.

Mutexes.

With respect to reversal, there are two kinds of OpenMP constructs. The first kind does not require awareness of the ordering between threads, that is, upon encountering an OpenMP construct, it is sufficient to add a corresponding reverse construct to the tape. A barrier, for example, may be encountered by the team of threads in any order, and the same holds true for the corresponding reverse barrier. All examples we have discussed so far are of this kind. The second kind, however, requires that the order in which threads execute the OpenMP construct is also reversed, that is, the reverse of the order in which the threads executed the OpenMP construct in the forward pass must be identical to the order in which the reverse constructs are executed in the reverse pass. The #pragma omp critical directive is an example of this, and the issue is reflected in the source transformation AD approach that is developed for it in [10]. Reformulated in terms of operator overloading, a global counter is associated with the critical region. A thread that enters the critical region increments the counter and remembers its subsequent value. In the reverse pass, a thread busy-waits until the counter matches the remembered value, then it evaluates the recording of the protected forward code, and decrements the counter afterwards. We make the observation that this concept is not limited to critical regions. More generally, it can be used to differentiate any synchronization of mutex type, where entering the critical region corresponds to acquiring the mutex and leaving the critical region corresponds to releasing it. We apply this technique for critical regions, locks, nested locks, ordered clauses and reduction clauses. Counters are then associated with mutexes via the mutex kind and a unique mutex id that is provided by the backend layer as discussed in Section 5. The current values of all counters are part of OpDiLib’s state that we mentioned before in Section 3 in connection with Listing 3. The event-based implementation consists of a MutexAcquired event that is encountered after acquiring a mutex, and a MutexReleased event that occurs after releasing it. This is displayed for the example of a critical region together with the corresponding AD data flow in Figure 4. Again, precise event ordering constraints, in particular with respect to acquisition and release of the same mutex from different threads, are inherited from the corresponding OpenMP execution model events. The forward and reverse actions of the differentiation logic discussed above are displayed in Figure 5. We perform the increment of the counter as part of the MutexAcquired forward action because it is — unlike the MutexReleased forward action — protected by the corresponding mutex.
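A sketch of this counter mechanism is given below as illustration code; OpDiLib keeps one such counter per mutex kind and id, stores the remembered value as AD data, and restores the counters as part of its state handling (see Listing 3).

#include <atomic>
#include <thread>

// One counter per mutex; its value at the end of the recording is part of the
// exported state (illustration only, a single global counter is shown here).
static std::atomic<int> mutexCounter{0};

// Forward action of MutexAcquired, executed while the mutex is still held:
// increment the counter and remember its new value as AD data for the reverse pass.
int onMutexAcquiredForward() {
  return ++mutexCounter;
}

// Reverse action: busy-wait until the counter matches the remembered value, evaluate
// the recording of the protected forward code, and decrement the counter afterwards.
void reverseMutexAction(int rememberedValue) {
  while (mutexCounter.load() != rememberedValue) {
    std::this_thread::yield();
  }
  // ... evaluate the part of the tape that was recorded while the mutex was held ...
  --mutexCounter;
}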

Figure 4: Mutex related AD events in the intuitive ordering with respect to the OpenMP workflow, as well as corresponding AD data flow. counter is an abbreviation for global data that depends on kind and id.
Figure 5: Forward and reverse actions of the mutex AD events. The omitted action (grey) contains only the external function push for the corresponding reverse action. The corresponding AD data is displayed in Figure 4.

Reductions.

Since the OpenMP 4.0 specification [22], user-defined reductions are supported. This allows reduction clauses involving data types that are used in operator overloading AD tools. However, a declaration of a custom reduction alone is not sufficient to obtain correct reverse mode derivatives. As described in detail in the OpenMP 5.0 specification [23], reductions are performed in OpenMP in two stages. First, each thread that participates in the reduction accumulates its results in a private variable. At the end of the OpenMP construct with the reduction clause, the threads’ individual results are then accumulated into the shared reduction target, usually involving several threads and shared intermediate variables, for example in a tree-like reduction. From an AD point of view, the operations in the course of the reduction are recorded just as any other operations on the tapes of the threads that execute them. However, access to shared intermediate variables or the final reduction target requires synchronization, which must be turned into reverse synchronization again. To that end, we require that a thread receives a MutexAcquired event before contributing to a shared variable, and a MutexReleased event afterwards; just as if access to the shared variable was safeguarded by a mutex, regardless of the actual synchronization mechanism that is used by the OpenMP runtime. We must further ensure that no thread begins with the reversal of the structured block of the OpenMP construct before the reversal of the second reduction stage has progressed to a point where all contributions have been propagated to the private reduction variable of that thread. The implementational details of this are discussed in the course of AD event generation in Section 5.

Directives that do not require treatment.

There are several directives that do not require any additional treatment for AD in the framework presented above. The #pragma omp master directive, for example, does not involve any synchronization; it only declares that the code will be executed by a specific thread, and hence, recorded on a specific tape. As it is only recorded by one thread, it will also be evaluated by only one thread. This is in line with the results on source transformation from [19]. The same holds true for the #pragma omp section directive. The essential augmentations are already performed by the handling of parallel regions, implicit tasks and the sections worksharing directive. The section directive then only marks parts of the code that can be executed in parallel. The specific assignment of sections to threads, and hence tapes, depends on the scheduling in the forward pass and is automatically preserved for the reverse pass. Furthermore, OpenMP allows combined constructs, usually a parallel directive combined with a worksharing construct, for example #pragma omp parallel for. Such combined constructs are supported automatically by the logic presented above as long as all involved mechanisms generate their corresponding events. Finally, we do not have to add special treatment for data-sharing attribute clauses such as private or firstprivate or data copying clauses such as copyprivate. The resulting copy operations are recorded in the same fashion as any other operations. Source transformation tools, on the other hand, have to implement treatment for such clauses [10, 18].

Atomics.

In principle, synchronization due to atomics could also be handled by mutex AD events; in fact, atomics generate mutex-acquired and mutex-released execution model events. However, the OpenMP specifications only allow #pragma omp atomic directives for scalar types, which excludes the custom types that are used for active variables in an operator overloading code. Therefore, we do not add AD support for atomic directives. Atomics can still be used in parts of the code that are not differentiated and otherwise for inactive variables. This does not affect AD in typical use cases like incrementing a counter from within a worksharing loop where the synchronization due to the atomics has no impact on the correctness of the derivative. In general, the missing AD handling implies that any synchronization due to atomics is not transformed to corresponding reverse synchronization. Therefore, custom synchronization patterns that rely on atomics, for example busy-waits, are not captured correctly. If OpDiLib is used, synchronization via atomics has to be replaced by one of the mutex AD event generating constructs described above, or by barriers. Alternatively, the user may generate mutex AD events manually for custom synchronization mechanisms. Expanding on the example of a busy-wait, this has the advantage that mutex events do not have to be generated for each individual atomic statement, which would quickly blow up the mutex counters. Specifically in the reverse synchronization in Figure 5, one MutexAcquired event could be emitted upon a completed busy-wait, with the corresponding MutexReleased event placed after decrementing the counter.

We remark that missing AD support for atomics does not prevent us from using them in the reverse pass, for example for atomic updates on the adjoint variables or for synchronization as displayed in Figure 5. The reason is that the tape is not in recording mode during reversal.

Flush directive.

The #pragma omp flush directive enforces a consistent view of the threads on the memory, either in general or with respect to specific variables. As can be seen e. g. in the examples in the OpenMP 2.5 specification [21], it can be used to implement rather irregular synchronization patterns. Like with atomics in the previous paragraph, this is not supported in OpDiLib. Also, we see no way to handle the flush directive without turning it implicitly into a stronger kind of synchronization like atomics, mutexes, or barriers, effectively eliminating its advantages. Regarding the reverse pass, no issues with inconsistent view of memory should occur if atomic updates are used on adjoint variables. Otherwise, we offer a reverse flush that the user may push to the tapes manually, in the same spirit as the reverse barrier described in Section 4.2.

Higher order derivatives.

There are different strategies for the computation of higher order derivatives involving the reverse mode of AD. In the adjoints of tangents approach, the forward mode of AD is used for all derivative orders except the outermost one. This approach is not only favoured for its efficiency properties [12] but also supported by the differentiation logic presented above. We recommend using this strategy with OpDiLib for higher order derivatives. The adjoints of adjoints approach, on the other hand, applies the reverse mode to the reverse mode. This requires recordings during reversal and in order to support it, a reverse mode AD tool must have the closure property that it can be applied to the code it produces [10]. In the differentiation logic presented above, this leads to conflicts. On the one hand, there is no AD support for atomics. On the other hand, atomics are used during reversal for updates on adjoint variables (see Section 4.2) and synchronization (see Figure 5). Hence, this strategy cannot be supported without suitable adaptations of the differentiation logic. It also gives rise to further implementational issues in the backend layer that are discussed in Section 5.

4.2 Shared Reading and Atomic Adjoints

In typical parallel applications, data is read from different threads simultaneously. As we explained already in Section 1, active variables that are read from multiple threads simultaneously in the forward pass lead to data races in the reverse pass and require additional synchronization, which is for example achieved by atomic updates on the corresponding adjoint variables. However, atomic updates come at a performance cost, especially if they are used for all adjoint variables [18]. The challenge is hence to identify the variables that need atomic updates. Source transformation tools have similarities to compilers and may attempt to address this issue with static program analysis techniques like the exclusive read analysis introduced in [10]. In the research on operator overloading in [19] and the implementation in ADOL-C [5], a compromise is made. At the beginning of a parallel region, threadprivate copies of all active variables are made. This way, simultaneous read access is prevented by design and no atomic updates are required. However, no write access to variables that already served as inputs to the parallel region is allowed from within the parallel region, a variable can be written by at most one thread, and during reversal, an additional step is needed to reduce the contributions to the adjoint input values from the different threads. In [18], a pattern of so-called self-adjoint memory access is discussed. It admits reversal of certain loops without atomics; the user has to identify and mark such loops so that they can be handled appropriately by a source transformation tool.

In order to make OpDiLib as applicable as possible, we do not want to impose any restrictions on the data access patterns and we do not want to require users to restructure their code to get a first working version. Therefore, we use atomic updates on the adjoint variables by default. On the other hand, a user should have as much flexibility as possible to disable atomic adjoints where they are not needed. To that end, OpDiLib tracks an adjoint access mode per thread alongside the recording. The available options are Atomic, which is the default for OpDiLib, or Classical, which is the non-atomic counterpart. The user may control the adjoint access mode by calls to the setAdjointAccessMode routine of the logic layer at arbitrary points in the code. The recording that is performed after the call until the next change of the adjoint access mode in that thread is then evaluated in the reverse pass according to the chosen mode. This means that whole parallel regions can be marked as safe for evaluation without atomics, for example if the user decides to make thread-private copies of all inputs to the parallel region as in [19, 5], or if a parallel loop has self-adjoint memory access [18]. In addition, one may isolate any parts of large parallel regions as safe for evaluation without atomics in a fine-grained manner, even across the boundaries of parallel regions. The following example demonstrates that additional synchronization might be required during the reverse pass.

 1  ADType shared_data[2] = ...;
 2  ADType result[2] = ...;
 3  #pragma omp parallel num_threads(2)
 4  {
 5    int i = omp_get_thread_num();
 6    OpDiLib::logic->setAdjointAccessMode(OpDiLib::Classical);
 7
 8    // part without shared reading
 9    result[i] *= shared_data[i];
10
11    OpDiLib::logic->addReverseBarrier();
12    OpDiLib::logic->setAdjointAccessMode(OpDiLib::Atomic);
13
14    // part with shared reading
15    result[i] += shared_data[0] * shared_data[1];
16  }

The change of the adjoint access mode in line 12 separates adjacent parts of the code with and without shared reading. During the forward pass, shared reading is not a problem and no synchronization is required. During the reverse pass, however, one thread might finish first with the atomic evaluation of the recording for line 15 and begin with unsafe updates on the adjoint of shared_data[0] while the other thread is still busy with the safe evaluation of its recording for line 15, resulting in simultaneous safe and unsafe writes to the same variable. This is prevented by introducing a barrier exclusively in the reverse pass by the additional call in line 11. It pushes an external function containing a #pragma omp barrier statement to the tape, just as in the handling of the SyncRegion events in Section 4.1. Hence, one must make sure to follow the rules for OpenMP barriers, that is, either all threads of a team encounter a reverse barrier or none [23]. When using reverse barriers in nested parallel regions, one must be aware that they affect only the team of threads of the corresponding nested reverse parallel region.

5 Backend

The AD events introduced in Section 4.1 have to be generated alongside the OpenMP workflow. In principle, the required augmentation of the source code by AD events could be performed by hand. The objective of the backend layer is to automate this tedious task, up to the point where no changes to the original source code are required. This is achieved with the OMPT backend that is presented in Section 5.1. It makes use of modern OpenMP features around OMPT that allow for an observation and augmentation of the OpenMP workflow. This way, we solve the principal problem that OpenMP #pragma directives are inaccessible to overloading techniques, which previously limited support for OpenMP in operator overloading tools and required manual annotations of the source code [19, 5].

As a fallback for compilers without OMPT support, we provide a second backend that hides the augmentations of the source code inside replacement macros for OpenMP #pragma directives and clauses as well as replacements for OpenMP’s runtime functions. They form an alternative interface to OpenMP functionality which we present in Section 5.2. It provides access to the same set of AD features as the OMPT backend. Clearly, the macro backend requires the rewriting of all OpenMP directives and runtime function calls. In principle, given the structural similarities of the original constructs and their replacements, this task could be performed by a source transformation preprocessing script in an automatic fashion. Code that is written with the replacements is also fully compatible with the OMPT backend. This means that no additional rewriting is required for a transition to the modern features. As long as the OMPT backend is used, original OpenMP constructs and their respective replacements can be used in the same code.
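To convey the flavour of the macro interface, the following sketch shows how a replacement for the barrier directive could expand to the original #pragma surrounded by AD event generation. The macro and function names are invented for this illustration and do not necessarily match OpDiLib's actual macro interface.

// AD event generation entry points assumed to exist for this sketch.
void onSyncRegionBegin();
void onSyncRegionEnd();

// Hypothetical replacement macro for "#pragma omp barrier". The _Pragma operator
// allows the directive to be emitted from within a macro expansion.
#define OPDI_BARRIER_EXAMPLE   \
  {                            \
    onSyncRegionBegin();       \
    _Pragma("omp barrier")     \
    onSyncRegionEnd();         \
  }

Code written with such replacements can also be used together with the OMPT backend, as described above.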

5.1 OMPT Backend

The OpenMP 5.0 specification [23] formalizes the OpenMP workflow by defining execution model events. For each event, a custom callback function can be registered that is invoked by the OpenMP runtime at each occurrence of the corresponding event in the respective thread. The callback function receives detailed information about the OpenMP mechanisms and objects involved. In many cases, it also offers possibilities to associate custom data with events. This allows developers to build their own tools on top of OpenMP and integrate them seamlessly into the workflow of the OpenMP runtime. We use these features for the generation of AD events without any modifications of the original source code. Since the AD events introduced in Section 4.1 are derived from the execution model events, a key part of a callback implementation is always the transformation to the corresponding AD event. Additionally, OMPT is powerful enough to fulfill further requirements identified throughout Section 4, for example AD data transport, mutex identifier generation or general access to parallel AD data.
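Before any callbacks are invoked, the tool has to make itself known to the OpenMP runtime. The following sketch outlines this registration as defined by the OpenMP 5.0 specification [23]; error handling is omitted and only the callbacks from Listings 4 and 5 further below are registered, whereas OpDiLib's backend registers all callbacks listed in Table 1.

#include <omp-tools.h>

// Callbacks as implemented in Listings 4 and 5 below.
void on_parallel_begin(ompt_data_t*, const ompt_frame_t*, ompt_data_t*, unsigned int,
                       int, void const*);
void on_parallel_end(ompt_data_t*, ompt_data_t*, int, void const*);
void on_implicit_task(ompt_scope_endpoint_t, ompt_data_t*, ompt_data_t*, unsigned int,
                      unsigned int, int);

static int ompt_initialize(ompt_function_lookup_t lookup, int /*initialDeviceNum*/,
                           ompt_data_t* /*toolData*/) {
  auto setCallback = (ompt_set_callback_t)lookup("ompt_set_callback");
  setCallback(ompt_callback_parallel_begin, (ompt_callback_t)on_parallel_begin);
  setCallback(ompt_callback_parallel_end, (ompt_callback_t)on_parallel_end);
  setCallback(ompt_callback_implicit_task, (ompt_callback_t)on_implicit_task);
  // ... register the remaining callbacks from Table 1 ...
  return 1;  // nonzero return keeps the tool active
}

static void ompt_finalize(ompt_data_t* /*toolData*/) {}

// The OpenMP runtime searches for this symbol in the address space of the program
// and, if found, calls it to obtain the tool's initializer and finalizer.
extern "C" ompt_start_tool_result_t* ompt_start_tool(unsigned int /*ompVersion*/,
                                                     char const* /*runtimeVersion*/) {
  static ompt_start_tool_result_t result = {ompt_initialize, ompt_finalize, {0}};
  return &result;
}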

We illustrate how a backend can be implemented with functionality provided by OMPT using the example of a parallel region and its implicit tasks. The OpenMP 5.0 specification [23] defines the following order of events for a parallel region.

  • A thread that encounters a parallel construct receives a parallel-begin event before any implicit task associated with the parallel region is created.

  • Guided by the requested degree of parallelism, implicit tasks are generated, all of which execute the structured block of the parallel construct. Each implicit task is executed by a dedicated thread. The thread that encountered the parallel region executes the implicit task with index zero.

  • Each thread that participates in the parallel region receives an implicit-task-begin event before it starts executing the structured block. After it has finished executing the structured block and has received events associated with the implicit barrier at the end of the parallel region, it receives an implicit-task-end event.

  • After the thread that encountered the parallel region has received its implicit-task-end event, it receives a parallel-end event. After that, it resumes execution of the task in which it encountered the parallel region.

This order is inherited by the corresponding AD events. We remark that no ordering is specified for the parallel-end event and implicit-task-end events of implicit tasks other than the one with index zero. In the extreme case, they are deferred until immediately prior to the next implicit-task-begin event in that thread, in particular until after the next parallel-begin event. We observed this in the LLVM OpenMP runtime. The implementation of the differentiation logic also has to be correct for these extreme cases. Listing 4 displays implementations of the callbacks for parallel-begin and parallel-end events that match the ompt_callback_parallel_begin_t and ompt_callback_parallel_end_t signatures as defined by the OpenMP 5.0 specification [23].

void on_parallel_begin(ompt_data_t*, const ompt_frame_t*,
    ompt_data_t* parallelData, unsigned int requested,
    int, void const*) {
  parallelData->ptr = OpDiLib::logic->onParallelBegin(requested);
}

void on_parallel_end(ompt_data_t* parallelData, ompt_data_t*, int,
    void const*) {
  OpDiLib::logic->onParallelEnd(parallelData->ptr);
}
Listing 4: Callbacks for parallel-begin and parallel-end events.

We omitted the names of arguments that we do not use. The callback for the parallel-begin event receives information about the requested parallelism, that is, the number of threads specified with the num_threads clause if present, or the default number of threads otherwise. As explained in Section 4.1, this information is needed to prepare the recording environment and is hence forwarded to the AD event. Furthermore, the OpenMP runtime allocates a parallel data object of type ompt_data_t that is associated with the parallel region. It has a member void* ptr that can be used to attach custom data. We use it to store the parallel AD data object returned by the logic layer. The callback for the corresponding parallel-end event receives the same OMPT data object with the ptr member as it was assigned in the parallel-begin callback. Hence, we can use it to transport the AD data between the matching begin and end events as displayed in Figure 1 and discussed in Section 4.1. OMPT’s data object associated with the current parallel region can also be queried at arbitrary points in the code via the ompt_get_parallel_info runtime entry point. Thus, we implement the access to the parallel AD data object of the current parallel region that is needed in Section 4.2 by querying OMPT’s parallel data object and extracting the parallel AD data object from it in the same fashion as in Listing 4. Listing 5 displays the implementation of the callback for implicit-task-begin and implicit-task-end events. It has the ompt_callback_implicit_task_t signature as required by the OpenMP 5.0 specification [23].

void on_implicit_task(ompt_scope_endpoint_t endpoint,
  ompt_data_t* parallelData, ompt_data_t* taskData,
  unsigned int actual, unsigned int index, int flags) {

  if (flags & ompt_task_initial) {
    return;
  }

  if (ompt_scope_begin == endpoint) {
    taskData->ptr = OpDiLib::logic->onImplicitTaskBegin(
      actual, index, parallelData->ptr);
  }
  else {
    OpDiLib::logic->onImplicitTaskEnd(taskData->ptr);
  }
}
Listing 5: Callback for implicit-task-begin and implicit-task-end events.

It filters out the special case of the initial implicit task, that is, the task that executes the serial parts of the program, which does not require additional AD treatment. The callback receives all parameters from the OpenMP runtime that are needed for the implicit task AD events, that is, the actual degree of parallelism (as opposed to the requested one), the index in the team of threads, and the parallel AD data object. The task data object provided by OMPT is again used to exchange AD data with the corresponding end event.
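The callbacks in Listings 4 and 5 have to be registered with the OpenMP runtime when the tool is initialized. The following minimal sketch, with illustrative names and only a subset of the callbacks, indicates how this could look; the lookup function also provides runtime entry points such as ompt_get_parallel_info, which we use for the access to the parallel AD data described above.

#include <omp-tools.h>

static ompt_get_parallel_info_t opdi_get_parallel_info = nullptr;

static int opdi_initialize(ompt_function_lookup_t lookup, int, ompt_data_t*) {
  auto setCallback = (ompt_set_callback_t) lookup("ompt_set_callback");
  opdi_get_parallel_info = (ompt_get_parallel_info_t) lookup("ompt_get_parallel_info");

  setCallback(ompt_callback_parallel_begin, (ompt_callback_t) on_parallel_begin);
  setCallback(ompt_callback_parallel_end, (ompt_callback_t) on_parallel_end);
  setCallback(ompt_callback_implicit_task, (ompt_callback_t) on_implicit_task);
  // ... further callbacks according to Table 1 ...
  return 1;  // nonzero keeps the tool active
}

static void opdi_finalize(ompt_data_t*) {}

extern "C" ompt_start_tool_result_t* ompt_start_tool(unsigned int, const char*) {
  static ompt_start_tool_result_t result = {opdi_initialize, opdi_finalize, {}};
  return &result;
}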

Table 1 provides an overview of all callbacks that we use to generate the AD events with OpDiLib’s OMPT backend.

OMPT callback type                    AD events
ompt_callback_parallel_begin          ParallelBegin
ompt_callback_parallel_end            ParallelEnd
ompt_callback_implicit_task           ImplicitTaskBegin, ImplicitTaskEnd
ompt_callback_mutex_acquired          MutexAcquired
ompt_callback_mutex_released          MutexReleased
ompt_callback_sync_region             SyncRegionBegin, SyncRegionEnd
ompt_callback_work                    WorkBegin, WorkEnd
ompt_callback_reduction               MutexAcquired, MutexReleased
Table 1: Mapping between OMPT callbacks and AD events.

Not every execution model event has a dedicated callback function. On the contrary, a single callback often handles a large number of events. For example, the critical-acquired event due to a #pragma omp critical directive and the lock-acquired event caused by the runtime function omp_set_lock both result in invocations of the ompt_callback_mutex_acquired callback. Likewise, an ompt_callback_sync_region callback handles (amongst others) explicit barrier directives and implicit barriers at the end of, e. g., worksharing directives that do not specify a nowait clause. It is beneficial for AD that similar types of events are collected in one callback, for two reasons. First, this often coincides with the fact that all execution model events collected by the same callback also require the same treatment from an AD point of view and can be forwarded to the same module of the logic layer. For example, this is the case for mutex type events. Second, we do not have to distinguish whether the execution model event is caused by a #pragma directive, the presence or absence of clauses, a runtime function call, or other mechanisms. This allows us to conveniently establish AD support for a large range of OpenMP features.
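As an illustration, a single callback with the ompt_callback_mutex_t signature could cover all mutex-acquired events as follows; the helper translateKind, which maps the ompt_mutex_t kind to OpDiLib’s mutex kinds, is a hypothetical name.

void on_mutex_acquired(ompt_mutex_t kind, ompt_wait_id_t waitId, void const*) {
  // critical regions, user locks and ordered regions all arrive here
  OpDiLib::logic->onMutexAcquired(translateKind(kind), waitId);
}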

In connection with what we discussed in Section 4.1, the OpenMP runtime already passes unique mutex identifiers to the mutex callbacks, which we can reuse immediately to identify and track mutexes in the logic layer. We remark that it is not guaranteed that pending mutex-released events in one thread occur prior to mutex-acquired events for the same mutex in another thread. Unlike the handling of the mutex-acquired event, the handling of the mutex-released event is not protected by the corresponding mutex. The implementation of the differentiation logic has to support this.

The reduction callback is special among the cases in Table 1 in the sense that it produces mutex AD events. According to the OpenMP 5.0 specification [23], the reduction-begin event is received before the task begins to perform loads and stores associated with the reduction, and the reduction-end event is received after these are completed. The reduction is usually implemented in a tree-like fashion so that the reduction events might mark updates both to intermediate shared variables and to the final reduction target. By mapping the reduction events to AD events for an imaginary mutex, we record the reduction operations as if they were performed in an equivalent serial manner, that is, we choose a total order for all reduction events that preserves the partial order indicated by the reduction tree. Reversal is then performed according to this total order. Conveniently, the LLVM OpenMP runtime produces sync region events for a so-called implementation specific barrier that separates the execution of the structured block and the accumulation in private reduction variables from the part of the reduction that actually involves synchronization. These events are detected and handled by OpDiLib like any other sync region. Hence, when using this OpenMP runtime, it is ensured that no thread begins with the reversal of the structured block before all reduction operations are reverted. This is important for threads that did not participate in the reduction and did not receive mutex events. This implementation specific barrier is not guaranteed by the OpenMP specification. Besides its usefulness for AD, it provides insights into the reduction process that should also be of interest to performance monitoring tools. Therefore, we think that it is a desirable feature in general and would like to see it in future versions of the OpenMP specification. If OpDiLib is used together with other OMPT implementations, it might be necessary to add a reverse barrier by hand, or to use semi-automatic techniques similar to those in the macro backend. We close this section with a discussion of additional issues due to the use of OMPT.

Execution model events due to internal synchronization.

OpDiLib uses OpenMP’s lock facilities for internal synchronization purposes. From the OpenMP runtime’s point of view, these are not any different from locks in user code and hence produce the same execution model events and cause invocations of the same callbacks. Besides the unnecessary performance cost of differentiating the logic layer’s internals, there is also a more immediate issue. If the user successfully sets a lock, an ompt_callback_mutex_acquired callback is dispatched and generates a MutexAcquired AD event. The handling of this event, however, itself requires the acquisition of an internal lock that invokes the same callback, resulting in an infinite recursion. To fix this, the logic layer keeps track of inactive mutexes. These are then filtered out in the AD event handling. An experienced user can also use these mechanisms to mark mutexes from user code as inactive by a call to OpDiLib::logic->registerInactiveMutex(mutexKind, mutexId). These are then disregarded for the differentiation. This is also recommended for mutexes that are introduced in the underlying AD tool for the purpose of thread safety.
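For illustration, a user lock that only guards diagnostic output and carries no AD-relevant data flow could be excluded from the differentiation as follows; the identifier is obtained from the backend in the same way as in Listing 6.

omp_lock_t logLock;
omp_init_lock(&logLock);
// no adjoint dependencies are protected by this lock, so it can be ignored by OpDiLib
OpDiLib::logic->registerInactiveMutex(OpDiLib::Lock,
    OpDiLib::backend->getLockIdentifier(&logLock));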

Execution model events during reversal.

Since OpDiLib embeds OpenMP directives into the tape, execution model events are also produced during the reverse pass. However, the tapes are not in recording mode at this point. This is detected in the logic layer and the AD handling is skipped. We remark — despite the limited practical relevance — that execution model events during reversal might have an application for higher order derivatives that are computed in an adjoints of adjoints fashion, see the previous discussion at the end of Section 4.1. In an application of OpDiLib to itself, the execution model events during an evaluation for the higher order type could be used to perform the recording for the lower order type. This relates also to the next topic of discussion.

Multiple OMPT tools.

Suppose we want to apply OpDiLib to an OpenMP parallelized code that already uses an OMPT tool, for example for performance monitoring purposes. The simultaneous use of multiple OMPT tools is beyond the scope of the OpenMP 5.0 specification [23]. To overcome this limitation, there has been work on the multiplexing of OMPT tools [25], which is now part of Clang. We will conduct further tests to see whether it can be used for our purposes. Otherwise, the macro backend may serve as a fallback solution.

5.2 Macro Backend

Without OMPT, we are not aware of any way to change the behaviour of the original OpenMP constructs in the scope of the program. Instead, we provide an alternative interface to OpenMP functionality. Code that is written according to this interface can then also be differentiated if a compiler without OMPT support is used. Specifically, we introduce macros that have to be used instead of the usual OpenMP directives and clauses, as well as functions that have to be used instead of the usual OpenMP runtime functions. This provides a semi-automatic way of differentiating OpenMP parallel code with operator overloading. The replacements always contain the original OpenMP construct but embed it into additional code that generates the required AD events; apart from that, they mostly follow the same structure as the original OpenMP constructs.

Runtime functions.

For each runtime function omp_*, we introduce a replacement opdi_*. For example, omp_set_lock is replaced by opdi_set_lock. The replacement has the same signature as the original function. Its implementation always forwards the call to the original runtime function, but surrounds it with appropriate generation of AD events, as showcased in Listing 6.

void opdi_set_lock(omp_lock_t* lock) {
  omp_set_lock(lock);
  OpDiLib::logic->onMutexAcquired(OpDiLib::Lock,
      OpDiLib::backend->getLockIdentifier(lock));
}
Listing 6: omp_set_lock runtime function replacement.

As can be seen in Listing 6, the generation of mutex identifiers that was previously performed by the OMPT runtime now requires a dedicated custom implementation in the backend layer. For reasons of consistency, we offer replacements also for runtime functions that do not require AD treatment. A user code that should compile both with and without OpDiLib can then use a global macro to decide whether the omp_ or opdi_ prefix is used for the runtime function calls, without keeping track of the necessity of this switch for each individual runtime function.
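A minimal sketch of such a switch could look as follows; USE_OPDILIB and OMP_CALL are illustrative names and not part of OpDiLib.

#ifdef USE_OPDILIB
  #define OMP_CALL(name) opdi_##name
#else
  #define OMP_CALL(name) omp_##name
#endif

// expands to opdi_set_lock(&lock) or omp_set_lock(&lock), respectively
OMP_CALL(set_lock)(&lock);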

Directives.

We replace each OpenMP directive #pragma omp directive ... by a custom macro OPDI_DIRECTIVE(...). The idea is that the same clauses and hints that were passed to the original directive can also be passed to the macro, possibly again in the shape of their replacements. The macro expands to the original OpenMP directive surrounded by corresponding AD treatment. This is showcased for the barrier directive in Listing 7.

// _Pragma expects a string literal; a stringification helper is required (names illustrative)
#define OPDI_STR(args) #args
#define OPDI_PRAGMA(args) _Pragma(OPDI_STR(args))

#define OPDI_BARRIER(...) \
  OpDiLib::logic->onSyncRegionBegin(OpDiLib::Barrier); \
  OPDI_PRAGMA(omp barrier __VA_ARGS__) \
  OpDiLib::logic->onSyncRegionEnd(OpDiLib::Barrier);
Listing 7: Barrier replacement macro.

The situation is more involved for OpenMP directives that are followed by a structured block, for example #pragma omp for. There are two reasons for this. First, it might be necessary to generate AD events within the structured block. Second, it might be necessary to spawn AD events after the structured block. We make heavy use of a privatization technique that was already applied to address these issues for the handling of parallel regions in ADOL-C [19, 5]. Augmentations at the beginning and end of the structured block can be generated as side effects of constructors and destructors of objects that are privatized via data-sharing clauses. These privatizations can be generated as part of both replacement macros and replacement clauses. However, this technique is not applicable or practical in all situations. First, it covers only OpenMP directives that support data-sharing clauses. Furthermore, the OpenMP specification leaves the respective order of constructor and destructor calls unspecified, does not support multiple privatizations of the same variable on one directive, and restricts subsequent privatization in nested directives to some extent [23]. Privatization is the only approach we are aware of to design replacement clauses with side effects on the recording environment. Thus, its limitations give rise to some special cases which we discuss below. As a remedy for directives, we complement each OPDI_DIRECTIVE(...) replacement that is followed by a structured block with a corresponding OPDI_END_DIRECTIVE macro that has to be placed after the structured block. This is showcased in Listing 8 and resembles the Fortran way of using OpenMP directives rather than the C++ one.

OPDI_DIRECTIVE(...)
{
  ...
}
OPDI_END_DIRECTIVE
Listing 8: Schematic application of replacement macro and the corresponding end macro.
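As a concrete illustration of this pattern, a parallel region with a worksharing loop might be written as follows; the clauses and variable names are illustrative, and x and y are assumed to be arrays of the overloaded AD type.

OPDI_PARALLEL(num_threads(4))
{
  OPDI_FOR(schedule(static))
  for (int i = 0; i < n; ++i) {
    y[i] = x[i] * x[i] + 3.0 * x[i];  // operations on the AD type are recorded as usual
  }
  OPDI_END_FOR
}
OPDI_END_PARALLEL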

End macros also provide important flexibility in our implementation with respect to future adaptations. Table 2 provides an overview of all replacement directives and the corresponding end macros.

OpenMP directive                     replacement macro              end macro
#pragma omp barrier ...              OPDI_BARRIER(...)              (none)
#pragma omp parallel ...             OPDI_PARALLEL(...)             OPDI_END_PARALLEL
#pragma omp for ...                  OPDI_FOR(...)                  OPDI_END_FOR
#pragma omp sections ...             OPDI_SECTIONS(...)             OPDI_END_SECTIONS
#pragma omp single ...               OPDI_SINGLE(...)               OPDI_END_SINGLE
#pragma omp single nowait ...        OPDI_SINGLE_NOWAIT(...)        OPDI_END_SINGLE
#pragma omp critical ...             OPDI_CRITICAL(...)             OPDI_END_CRITICAL
#pragma omp critical (name) ...      OPDI_CRITICAL_NAME(name, ...)  OPDI_END_CRITICAL
#pragma omp ordered ...              OPDI_ORDERED(...)              OPDI_END_ORDERED
#pragma omp master ...               OPDI_MASTER(...)               OPDI_END_MASTER
#pragma omp section ...              OPDI_SECTION(...)              OPDI_END_SECTION
Table 2: Replacement macros and corresponding end macros.
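To make the privatization technique described above more concrete, the following sketch shows the basic idea with illustrative names; OpDiLib’s actual guard types and logic layer calls may differ. The copy constructor and the destructor of a guard object emit the AD events, and privatizing the guard on the directive lets every thread in the team produce them at the beginning and at the end of the structured block.

struct OpdiWorkGuard {
  bool isPrivateCopy = false;
  OpdiWorkGuard() {}  // original object, emits no events
  OpdiWorkGuard(OpdiWorkGuard const&) : isPrivateCopy(true) {
    OpDiLib::logic->onWorkBegin();  // hypothetical logic layer call
  }
  ~OpdiWorkGuard() {
    if (isPrivateCopy) {
      OpDiLib::logic->onWorkEnd();  // hypothetical logic layer call
    }
  }
};

// schematic expansion of a loop replacement: each thread's private copy
// generates the begin and end events around the structured block
OpdiWorkGuard opdiGuard;
#pragma omp for firstprivate(opdiGuard)
for (int i = 0; i < n; ++i) {
  // ...
}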

Among the replacement directives, there are two special cases. The first are named critical regions. The critical directive does not support data-sharing clauses; however, the supplied name is essential for the generation of a unique mutex identifier, so that it has to be provided via a dedicated macro. Second, while the single directive supports data-sharing clauses, these apply only to the single thread that executes the structured block, which is not sufficient for the treatment of the implicit barrier that involves all threads. Hence, a dedicated macro for single directives with a nowait clause is introduced. Note that the special replacements still use the same respective end macro.

Clauses.

Most clauses and hints do not require replacements and are passed to the replacement directives in their original form. Besides the special replacement directives discussed above, we provide an OPDI_NOWAIT replacement clause that has to be used instead of nowait for a correct treatment of the corresponding implicit barrier. Furthermore, any directive that specifies reduction(...) clauses has to specify the OPDI_REDUCTION macro once alongside the otherwise unmodified reduction clauses. It provides the reverse barrier that separates the threadprivate reduction phase from the shared, tree-like reduction phase. We remark that this macro has to be provided as a supplement instead of a replacement because multiple reduction clauses on one directive are allowed, whereas multiple privatizations of the same variable on one directive are forbidden [23].
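For example, a worksharing loop without the implicit barrier at its end could be written as follows under the macro backend; the surrounding parallel region and variable names are assumed.

OPDI_FOR(OPDI_NOWAIT)
for (int i = 0; i < n; ++i) {
  y[i] = 2.0 * x[i];
}
OPDI_END_FOR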

Reductions.

While the OPDI_REDUCTION macro handles the required reduction barrier, it does not handle the mapping of reductions to mutex AD events. In the macro backend, we associate reductions with actual nested locks so that the required mutex AD events are produced by setting and unsetting these locks. This logic is hidden inside a replacement macro for the declaration of custom reductions. Since any reduction involving AD types is of custom type, this covers all the relevant cases. The replacement macro reads OPDI_DECLARE_REDUCTION(OP_NAME, TYPE, OP, INIT) and expands to code as displayed in Listing 9.

1#pragma omp declare \
2  reduction(OP_NAME : TYPE : \
3    OpDiLib::Reducer(omp_out) \
4      = OpDiLib::Reducer(omp_out) OP OpDiLib::Reducer(omp_in)) \
5  initializer(omp_priv = INIT)
Listing 9: Schematic reduction declaration.

It involves casts of omp_in and omp_out to a Reducer object that is associated with a global nested lock. Note that a nested lock has the distinguished property that it can be set multiple times from within the same thread without causing deadlocks. We implement the Reducer such that the nested lock is set in constructor calls, which happens three times in the course of one reduction statement. The three set operations are complemented by three unset operations as part of the overload of operator = for the Reducer. This ensures that the whole reduction statement is exclusively performed by one thread at a time. This produces a serialization of the reduction tree in the forward pass that is preserved by the mutex AD handling for the reverse pass. In addition, we ensure via templating techniques that different custom reductions use different Reducer objects and hence, different nested locks.
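A usage sketch under the macro backend, assuming CoDiPack's codi::RealReverse as the AD type and an illustrative reduction name opdi_add, could look as follows.

// declare a custom sum reduction for the AD type
OPDI_DECLARE_REDUCTION(opdi_add, codi::RealReverse, +, codi::RealReverse(0.0))

codi::RealReverse sum = 0.0;
OPDI_PARALLEL(num_threads(4))
{
  OPDI_FOR(reduction(opdi_add : sum) OPDI_REDUCTION)
  for (int i = 0; i < n; ++i) {
    sum += x[i] * x[i];
  }
  OPDI_END_FOR
}
OPDI_END_PARALLEL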

6 Performance Study

To study the performance of OpDiLib, we apply it to a PDE solver for the coupled Burgers' equations

$$u_t + u u_x + v u_y = \frac{1}{Re}\,(u_{xx} + u_{yy}), \qquad v_t + u v_x + v v_y = \frac{1}{Re}\,(v_{xx} + v_{yy}) \qquad (3)$$

[3, 4, 34] that was previously used for the performance evaluation of various aspects of automatic differentiation in [27, 28, 29]. We solve (3) on $\Omega \times (0, T]$, where $\Omega$ is a scaling of the unit square $[0,1]^2$, together with boundary conditions and initial conditions at $t = 0$ which are taken from the exact solution

$$u(x, y, t) = \tfrac{3}{4} - \frac{1}{4\left(1 + e^{(-4x + 4y - t)\,Re/32}\right)}, \qquad v(x, y, t) = \tfrac{3}{4} + \frac{1}{4\left(1 + e^{(-4x + 4y - t)\,Re/32}\right)}$$

that is given in [4]. We solve the equations with an upwind finite difference scheme, for which we discretize $\Omega$ by an equidistant grid and advance in time steps of fixed size $\Delta t$.

Figure 6: Decomposition of the computational domain. Column splitting for MPI processes and row splitting for OpenMP threads. Shared reading might occur at block boundaries between adjacent OpenMP threads in (a). This is avoided in (b) by two separate OpenMP sweeps.
Figure 7: Pattern of subsequent memory access with the five-point stencil in a sweep along a row of the solution field, starting at the left column boundary. Showcases the read-write pattern in the forward pass as well as close and distant memory locations with a row-major memory layout. The reverse pass follows an analogous pattern with reversed order of access and exchanged read/write roles.

The solver is implemented in a hybrid parallel manner. The spatial domain is first decomposed into columns of equal width, one for each MPI process. Halo exchange via MPI communication is required at the column boundaries. Second, each MPI process works on its column with multiple OpenMP threads. To that end, the column is further subdivided into rows of equal height, resulting in one block for each OpenMP thread. This is visualized in Figure 6(a). Each MPI process stores its part of the solution field in row-major order and an OpenMP thread performs row-wise sweeps with the finite difference stencil. This is illustrated in Figure 7. We use distributed memory parallelization between NUMA domains and employ shared memory parallelization within these. To that end, we bind MPI processes to NUMA domains with hwloc (https://www.open-mpi.org/projects/hwloc/) and tie OpenMP threads to cores using OpenMP environment variables [23].

We use the reverse mode of AD to differentiate the norm of the solution after the final time step with respect to the initial data. CoDiPack (https://www.scicomp.uni-kl.de/software/codi/) [28] is used as the underlying AD tool, configured for Jacobi taping [17] with a reuse index handler [29]. For the differentiation of OpenMP, we couple it with OpDiLib as explained in Section 3. CoDiPack is also coupled with MeDiPack (https://www.scicomp.uni-kl.de/software/medi/) for the differentiation of MPI.

In the splitting strategy displayed in Figure 6(a), shared reading might occur at the boundary between two adjacent OpenMP threads. In order to evaluate the impact of atomic updates on the performance of the reverse pass, we include a configuration that prevents shared reading. This is achieved by doubling the number of blocks in the OpenMP decomposition, see Figure 6(b). In the first sweep, the blocks with even index are processed in parallel by the OpenMP threads; in the second sweep, the ones with odd index are treated. This way, no adjacent blocks are processed in parallel and no data is read simultaneously with the finite difference stencils. Atomic updates are then disabled as explained in Section 4.2, and a reverse barrier is required between the two sweeps.

We display results obtained with the compilers clang++ 10.0.1 (https://releases.llvm.org/) and g++ 10.1 (https://gcc.gnu.org/gcc-10/). The latter does not support OMPT yet and is hence an important use case for the macro backend presented in Section 5.2.

The simulations in this paper were carried out with , , on a grid with cells and for 20 time steps of width . The computations were performed on a dual-socket node of the Elwetritsch cluster at TU Kaiserslautern with two AMD EPYC 7351 processors. Each processor consists of 4 NUMA domains with 4 cores each, giving a total of 32 cores. The node is equipped with 128 GB of RAM. We remark that we confirmed the observed trends on a second type of node of the Elwetritsch cluster with two Intel Xeon Gold 6126 processors.

We begin with a comparison of absolute performance values of different AD configurations within one NUMA domain, for which we execute the simulation with either one or four OpenMP threads. The results are displayed in Figure 8.

Figure 8: Forward pass (a) and reverse pass (b) performance comparison of different parallel configurations within one NUMA domain. Includes setups with classical AD, with the macro backend and with the OMPT backend, compiled with g++ and clang++. In (a), we display also primal performance values. Numbers denote averages of 20 runs after five discarded warm-up runs.
Figure 9: Memory high water marks of different parallel configurations within one NUMA domain. Includes setups with classical AD, with the macro backend and with the OMPT backend, compiled with g++ and clang++.

The different configurations are denoted on the second axis, whereas the timings obtained with different compilers and AD setups are denoted on the first axis. In the forward pass, we also display the timings obtained with the primal, undifferentiated simulation. We begin with some general observations. If we compare for each configuration the impact of the different compilers (squares versus circles) and the impact of the backend choice (blue versus green), we see that neither has a significant influence on the performance, with the mild exception of a parallel reverse pass with atomic updates on adjoint variables, which we revisit as we look into the individual configurations in greater detail. Furthermore, the shared reading optimization does not have an impact on the performance of the forward pass.

In the first configuration (“no OpenMP”), we compile without OpenMP, execute no OpDiLib logic and do not use atomic updates on the adjoint vector. Since the OpDiLib setups use the thread-safe version of CoDiPack, this configuration allows us to evaluate the impact of the CoDiPack modifications on the performance of the standard CoDiPack type. We see that it is mostly preserved, especially in the reverse pass. The second configuration compiles with OpenMP but fixes the number of threads to one (“1 OpenMP thread”). Compared to the first configuration, we see a certain performance overhead in the forward pass that is not visible in the primal performance values. The major part of this is due to using a thread-safe CoDiPack type, with minor parts due to using OpenMP and OpDiLib. The performance overhead in the reverse pass, on the other hand, is significant. The third configuration does not require any shared reading in the forward pass and is hence reversed without atomic updates on the adjoint vector (“1 OpenMP thread, no shared reading”). Given that the reverse performance of the classical serial runs is fully recovered, we conclude that the performance overhead of the second configuration in the reverse pass is solely due to atomic updates on the adjoint vector, which demonstrates the cost of atomic updates in practice and underlines the importance of eliminating them for the performance of the reverse pass.

Compared to the configurations with one OpenMP thread, the ones with four OpenMP threads yield a speedup in the forward pass in the expected order of magnitude of slightly less than four. In terms of AD, this is also an improvement of more than a factor of two compared to the classical serial versions. While shared reading has no impact on the forward pass with multiple threads, the effect on the reverse pass is, again, significant. A parallel reverse pass with atomic updates (“4 OpenMP threads”) is on average notably slower than the serial reverse run, and otherwise exhibits a huge variation in the timings, resulting in the most unreliable performance among all configurations. Even if this effect is reduced with the OMPT backend, the comparison is still in favour of serial reverse runs. Without shared reading (“4 OpenMP threads, no shared reading”), on the other hand, we obtain the expected speedup of slightly less than four with respect to the serial configurations also in the reverse pass.

Figure 9 displays memory high water marks for the different AD setups from Figure 8. There is almost no difference between all setups and configurations. Compared to the classical serial configuration, any additional memory consumption due to OpDiLib’s differentiation logic and backend is negligible.

Next, we investigate the strong scaling of the different AD setups up to the full node in an OpenMP-MPI hybrid parallel manner. Given the hardware specifics, we fix the number of OpenMP threads to four per MPI process and subsequently increase the number of MPI processes until the full node is used. For each setup, we use the respective configuration of one MPI process with one OpenMP thread as the baseline for the strong scaling. Note that the scaling plots cannot be used to compare absolute performance values. The results for the forward and the reverse pass are displayed in Figure 10.

Figure 10: Scaling results of OpDiLib for the forward pass (a) and the reverse pass (b). Includes for comparison primal scaling results in the forward pass and the ideal scaling curves.

The results for the forward pass include also the scaling results for the primal, undifferentiated code, and we see that all AD setups achieve better scaling than the primal ones. This is often observed in AD and is attributed to the fact that AD improves the computation to memory ratio in the forward pass [20]. Neither the backend choice nor the shared reading optimization have any influence on the scaling in the forward pass. Among the AD setups, the best scaling is achieved with clang++.

In the reverse pass, all AD configurations achieve good scaling. With clang++, configurations without atomic updates outscale their shared reading counterparts in the reverse pass and even exhibit superlinear behaviour. With g++, on the other hand, the reverse pass scaling is better if atomic updates are used. To understand how this is possible, we observe that there are two ways in which atomics influence the performance of the parallel reverse pass. First, there is the absolute performance impact of performing operations atomically. Second, the performance may decrease due to additional serialization in the reverse pass, which should be the most significant factor of influence on the scaling curve. In the splitting strategy displayed in Figure 6, however, shared reading — and hence serialization in the reverse pass — can only occur at the cells at block boundaries between adjacent OpenMP threads. The number of these cells is furthermore invariant with respect to the number of MPI processes, so that we cannot observe serialization at all in our scaling plots. This explains why we achieve good scaling despite atomic updates. Furthermore, the fact that configurations without atomics are much faster in terms of absolute performance implies that they operate closer to the memory bandwidth of the socket, which hinders scaling. We attribute the overall good scaling results also to the splitting strategy displayed in Figure 6. Increasing the number of MPI processes corresponds to decreasing the width of the column for each process. This, in turn, implies that the distant memory locations that are accessed during one application of the finite difference stencil move closer to each other, see Figure 7. This applies analogously to the corresponding pattern of adjoint variable access in the reverse pass. Hence, memory locality improves with an increasing number of MPI processes.

To summarize, parallel performance is — as always — driven by multiple factors. For now, the most important observation is that the application of OpDiLib does not, in principle, prevent very good parallel performance in practice.

7 Summary and Conclusion

With OpDiLib, we have designed a tool for the automatic differentiation of OpenMP parallel codes that is fast, flexible, reusable and easy to apply. Our interpretation of AD augmentations as events alongside the OpenMP workflow allowed us to cover an unprecedented extent of OpenMP mechanisms in an operator overloading AD tool. We provide this functionality in the new open source software OpDiLib (https://www.scicomp.uni-kl.de/software/opdi/) as a universal add-on for operator overloading AD tools. We publish it together with bindings for CoDiPack and hope that it is also a useful addition to other operator overloading AD tools. In order to couple OpDiLib to another AD tool, that tool must be made thread-safe in a first step. The coupling performed in the second step provides capabilities for handling OpenMP constructs, including an automatic deduction of a parallel reverse pass. With the OMPT backend, we achieve full automation, that is, we do not require any additional code changes. We do not restrict the data access patterns either. Hence, if a serial code that was already differentiated with the classical AD tool is parallelized by means of OpenMP, an initial parallel differentiated code is obtained immediately. While we apply the conservative strategy of atomic updates on the adjoint variables, we do not prevent further optimization and equip the user with tools that can disable atomic evaluation for any parts of the code. We complemented the OMPT features by an alternative macro backend, which serves as a fallback solution and covers also compilers without OMPT support. OpDiLib’s flexible design allowed us to support the different use cases and modes of operation presented in this paper and will likely benefit future modifications and applications.

We have demonstrated in the performance tests that all configurations of OpDiLib preserve the serial AD memory footprint and that very good parallel performance can be achieved in practice. We will enhance and improve OpDiLib in the future, and work on additional features and configurations is in progress. OpDiLib’s OpenMP support as presented in this paper covers all directives, clauses and runtime functions of the OpenMP 2.5 specification [21] with the exception of atomics and the flush directive, as we explained in Section 4.1. This resembles the core feature set of OpenMP for classical CPU architectures. We demonstrated additionally OpDiLib’s applicability for the pure operator overloading automatic differentiation of OpenMP-MPI hybrid parallel codes. Hence, already now, OpDiLib should be of interest to many simulation codes executed on CPU computing clusters.

References

  • Albring et al. [2015] T. Albring, M. Sagebaum, and N. R. Gauger. Development of a Consistent Discrete Adjoint Solver in an Evolving Aerodynamic Design Framework. In 16th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, AIAA Aviation. American Institute of Aeronautics and Astronautics, 2015. doi: 10.2514/6.2015-3240.
  • Auroux and Groza [2017] D. Auroux and V. Groza. Optimal parameters identification and sensitivity study for abrasive waterjet milling model. Inverse Problems in Science and Engineering, 25(11):1560–1576, 2017. doi: 10.1080/17415977.2016.1273916.
  • Bahadır [2003] A. Bahadır. A fully implicit finite-difference scheme for two-dimensional Burgers’ equations. Applied Mathematics and Computation, 137(1):131–137, 2003. doi: 10.1016/S0096-3003(02)00091-7.
  • Biazar and Aminikhah [2009] J. Biazar and H. Aminikhah. Exact and numerical solutions for non-linear Burger’s equation by VIM. Mathematical and Computer Modelling, 49(7):1394–1400, 2009. doi: 10.1016/j.mcm.2008.12.006.
  • Bischof et al. [2008] C. Bischof, N. Guertler, A. Kowarz, and A. Walther. Parallel Reverse Mode Automatic Differentiation for OpenMP Programs with ADOL-C. In C. H. Bischof, H. M. Bücker, P. Hovland, U. Naumann, and J. Utke, editors, Advances in Automatic Differentiation, pages 163–173. Springer Berlin Heidelberg, 2008. doi: 10.1007/978-3-540-68942-3_15.
  • Bücker et al. [2001] H. M. Bücker, B. Lang, D. an Mey, and C. H. Bischof. Bringing Together Automatic Differentiation and OpenMP. In Proceedings of the 15th International Conference on Supercomputing, ICS ’01, pages 246–251, New York, NY, USA, 2001. Association for Computing Machinery. doi: 10.1145/377792.377842.
  • Bücker et al. [2002] H. M. Bücker, B. Lang, A. Rasch, C. H. Bischof, and D. an Mey. Explicit loop scheduling in OpenMP for parallel automatic differentiation. In Proceedings 16th Annual International Symposium on High Performance Computing Systems and Applications, pages 121–126. IEEE, 2002. doi: 10.1109/HPCSA.2002.1019144.
  • Bücker et al. [2004] H. M. Bücker, A. Rasch, and A. Wolf. A Class of OpenMP Applications Involving Nested Parallelism. In Proceedings of the 2004 ACM Symposium on Applied Computing, SAC ’04, pages 220–224, New York, NY, USA, 2004. Association for Computing Machinery. doi: 10.1145/967900.967948.
  • Eichenberger et al. [2013] A. E. Eichenberger, J. Mellor-Crummey, M. Schulz, M. Wong, N. Copty, R. Dietrich, X. Liu, E. Loh, D. Lorenz, et al. OMPT: An OpenMP Tools Application Programming Interface for Performance Analysis. In A. P. Rendell, B. M. Chapman, and M. S. Müller, editors, OpenMP in the Era of Low Power Devices and Accelerators, pages 171–185. Springer Berlin Heidelberg, 2013. doi: 10.1007/978-3-642-40698-0_13.
  • Förster [2014] M. Förster. Algorithmic Differentiation of Pragma-Defined Parallel Regions. Springer Vieweg, 2014. doi: 10.1007/978-3-658-07597-2. PhD thesis, RWTH Aachen University.
  • Gauger et al. [2008] N. R. Gauger, A. Walther, C. Moldenhauer, and M. Widhalm. Automatic Differentiation of an Entire Design Chain for Aerodynamic Shape Optimization. In C. Tropea, S. Jakirlic, H.-J. Heinemann, R. Henke, and H. Hönlinger, editors, New Results in Numerical and Experimental Fluid Mechanics VI, pages 454–461. Springer Berlin Heidelberg, 2008. doi: 10.1007/978-3-540-74460-3_56.
  • Griewank and Walther [2008] A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, volume 105. Siam, 2008. doi: 10.1137/1.9780898717761.
  • Günther et al. [2020] S. Günther, L. Ruthotto, J. B. Schroder, E. C. Cyr, and N. R. Gauger. Layer-Parallel Training of Deep Residual Neural Networks. SIAM Journal on Mathematics of Data Science, 2(1):1–23, 2020. doi: 10.1137/19M1247620.
  • Hascoët and Pascual [2013] L. Hascoët and V. Pascual. The Tapenade Automatic Differentiation Tool: Principles, Model, and Specification. ACM Trans. Math. Softw., 39(3), 2013. doi: 10.1145/2450153.2450158.
  • Heimbach et al. [2002] P. Heimbach, C. Hill, and R. Giering. Automatic Generation of Efficient Adjoint Code for a Parallel Navier-Stokes Solver. In P. M. A. Sloot, A. G. Hoekstra, C. J. K. Tan, and J. J. Dongarra, editors, Computational Science — ICCS 2002, pages 1019–1028. Springer Berlin Heidelberg, 2002. doi: 10.1007/3-540-46080-2_107.
  • Heimbach et al. [2005] P. Heimbach, C. Hill, and R. Giering. An efficient exact adjoint of the parallel MIT General Circulation Model, generated via automatic differentiation. Future Generation Computer Systems, 21(8):1356–1371, 2005. doi: 10.1016/j.future.2004.11.010.
  • Hogan [2014] R. J. Hogan. Fast Reverse-Mode Automatic Differentiation Using Expression Templates in C++. ACM Trans. Math. Softw., 40(4), 2014. doi: 10.1145/2560359.
  • Hückelheim et al. [2019] J. Hückelheim, P. Hovland, M. M. Strout, and J.-D. Müller. Reverse-mode algorithmic differentiation of an OpenMP-parallel compressible flow solver. The International Journal of High Performance Computing Applications, 33(1):140–154, 2019. doi: 10.1177/1094342017712060.
  • Kowarz [2008] A. Kowarz. Advanced concepts for automatic differentiation based on operator overloading. PhD thesis, Techn. Univ. Dresden, 2008. URL https://nbn-resolving.org/urn:nbn:de:bsz:14-ds-1206719130404-22306.
  • Nemili et al. [2016] A. Nemili, E. Özkaya, N. R. Gauger, F. Kramer, and F. Thiele. Discrete Adjoint Based Optimal Active Control of Separation on a Realistic High-Lift Configuration. In A. Dillmann, G. Heller, E. Krämer, C. Wagner, and C. Breitsamter, editors, New Results in Numerical and Experimental Fluid Mechanics X, pages 237–246, Cham, 2016. Springer International Publishing. doi: 10.1007/978-3-319-27279-5_21.
  • OpenMP Architecture Review Board [2005] OpenMP Architecture Review Board. OpenMP Application Program Interface Version 2.5, May 2005. URL https://www.openmp.org/specifications.
  • OpenMP Architecture Review Board [2013] OpenMP Architecture Review Board. OpenMP Application Program Interface Version 4.0, July 2013. URL https://www.openmp.org/specifications.
  • OpenMP Architecture Review Board [2018] OpenMP Architecture Review Board. OpenMP Application Programming Interface Version 5.0, November 2018. URL https://www.openmp.org/specifications.
  • Özkaya and Gauger [2010] E. Özkaya and N. R. Gauger. Automatic transition from simulation to one-shot shape optimization with Navier-Stokes equations. GAMM-Mitteilungen, 33(2):133–147, 2010. doi: 10.1002/gamm.201010011.
  • Protze et al. [2019] J. Protze, T. Cramer, S. Convent, and M. S. Müller. OMPT-Multiplex: Nesting of OMPT Tools. In C. Niethammer, M. M. Resch, W. E. Nagel, H. Brunst, and H. Mix, editors, Tools for High Performance Computing 2017, pages 73–83, Cham, 2019. Springer International Publishing. doi: 10.1007/978-3-030-11987-4_5.
  • Sagebaum et al. [2013] M. Sagebaum, N. R. Gauger, U. Naumann, J. Lotz, and K. Leppkes. Algorithmic Differentiation of a Complex C++ Code with Underlying Libraries. Procedia Computer Science, 18:208–217, 2013. ISSN 1877-0509. doi: 10.1016/j.procs.2013.05.184. 2013 International Conference on Computational Science.
  • Sagebaum et al. [2018] M. Sagebaum, T. Albring, and N. R. Gauger. Expression templates for primal value taping in the reverse mode of algorithmic differentiation. Optimization Methods and Software, 33(4-6):1207–1231, 2018. doi: 10.1080/10556788.2018.1471140.
  • Sagebaum et al. [2019] M. Sagebaum, T. Albring, and N. R. Gauger. High-Performance Derivative Computations using CoDiPack. ACM Trans. Math. Softw., 45(4), 2019. doi: 10.1145/3356900.
  • Sagebaum et al. [2020] M. Sagebaum, J. Blühdorn, and N. R. Gauger. Index handling and assign optimization for Algorithmic Differentiation reuse index managers, 2020. URL https://arxiv.org/abs/2006.12992. Preprint arXiv:2006.12992.
  • Schanen et al. [2010] M. Schanen, U. Naumann, L. Hascoët, and J. Utke. Interpretative adjoints for numerical simulation codes using MPI. Procedia Computer Science, 1(1):1825–1833, 2010. doi: 10.1016/j.procs.2010.04.204. ICCS 2010.
  • Schanen et al. [2012] M. Schanen, M. Förster, J. Lotz, K. Leppkes, and U. Naumann. Adjoining Hybrid Parallel Code. In B. H. V. Topping, editor, Proceedings of the Eighth International Conference on Engineering Computational Technology, volume 100 of Civil-Comp Proceedings, Kippen, Stirlingshire, 2012. Civil-Comp Press. doi: 10.4203/ccp.100.7.
  • Schlenkrich et al. [2008] S. Schlenkrich, A. Walther, N. R. Gauger, and R. Heinrich. Differentiating Fixed Point Iterations with ADOL-C: Gradient Calculation for Fluid Dynamics. In H. G. Bock, E. Kostina, H. X. Phu, and R. Rannacher, editors, Modeling, Simulation and Optimization of Complex Processes, pages 499–508. Springer Berlin Heidelberg, 2008. doi: 10.1007/978-3-540-79409-7_36.
  • Walther [2009] A. Walther. Getting Started with ADOL-C. In U. Naumann, O. Schenk, H. D. Simon, and S. Toledo, editors, Combinatorial Scientific Computing, number 09061 in Dagstuhl Seminar Proceedings, Dagstuhl, Germany, 2009. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Germany. URL https://drops.dagstuhl.de/opus/volltexte/2009/2084.
  • Zhu et al. [2010] H. Zhu, H. Shu, and M. Ding. Numerical solutions of two-dimensional Burgers’ equations by discrete Adomian decomposition method. Computers & Mathematics with Applications, 60(3):840–848, 2010. doi: 10.1016/j.camwa.2010.05.031.