1 Introduction
The adjoint method has proved to be an indispensable tool in computational modelling and optimization, enabling, for example, sensitivity analysis, goal-oriented error estimation and data assimilation (see [9]). It is used to compute the derivative of an objective function with respect to parameters of interest, at a cost independent of the number of parameters. Algorithmic differentiation (AD), as a traditional engineering approach, has been developed to automatically produce an adjoint code that computes the derivatives. AD tools take as input a forward model that users implement in a low-level language, and derive the adjoint model in a line-by-line fashion, through source-to-source transformations, operator overloading or a combination thereof. While this black-box approach gives the highest degree of automation and requires the least knowledge of the mathematical models, it suffers from many low-level, implementation-specific difficulties such as memory allocation, pointers, I/O and parallel communication (e.g. MPI and OpenMP). Recently, new high-level AD tools/libraries such as libadjoint [6], FATODE [17] and the TSAdjoint component in PETSc [1, 16] have been developed to operate on high-level systems, so that concerns about low-level implementations can be avoided.
In the high-level adjoint approach, one has to solve an adjoint equation that is derived either before or after discretization, corresponding to the continuous adjoint method and the discrete adjoint method, respectively. The derivation is fully automated in libadjoint if one specifies the forward model in a high-level language similar to mathematical notation. In contrast, FATODE and PETSc TSAdjoint provide a built-in implementation of the discrete adjoint model, derived from the time stepping algorithms used to solve ordinary differential equations (ODEs); the simulation of time-dependent PDEs is abstracted as a sequence of time steps, and the libraries essentially differentiate each time step.
Nevertheless, the core task of these high-level AD tools is to assemble user-supplied callback functions that calculate function derivatives (a.k.a. Jacobians). The Jacobian functions may be written manually, approximated by finite differences, or generated with an AD tool.
Writing an analytical Jacobian matrix evaluation function is an error-prone, tedious and time-consuming task. PETSc offers tools to calculate a finite difference approximation of the Jacobian matrix suitable for some classes of problems. With the aim of developing a framework for performing automatic and efficient large-scale adjoint calculation, the present work extends PETSc's functionality to automate the process of Jacobian computation, by applying ADOL-C [15] within the high-level time stepping solver. Although we focus on the discrete adjoint, the Jacobian automation is also applicable and useful for the continuous adjoint approach.
Previous efforts have been made to use source-to-source transformation AD for the numerical solution of PDEs using PETSc (see [10]). The present work goes further, with a focus on adjoint problems, and computes the required Jacobian transpose. Furthermore, we exploit the reverse mode of AD to compute the matrix-free Jacobian transpose vector product.
The paper is organized as follows. We begin by introducing the AD tool ADOL-C. In Section 3, we briefly introduce the PETSc adjoint solver. In Section 4, we describe an exemplar adjoint problem based on a nonlinear PDE. Section 5 presents our strategies to automate the Jacobian computation in PETSc. Numerical results with the benchmark problem are shown and discussed in Section 6. Conclusions and future work are given in Section 7.
2 ADOL-C
ADOL-C is an operator overloading AD tool intended for the differentiation of C programs using C++ objects. Other notable operator overloading tools include CoDiPack [14], Sacado [13] and dco [12]. Based on how the chain rule is applied, AD tools generally offer two different modes, namely the forward mode and the reverse mode.
Consider a program P which evaluates a mathematical function $F: \mathbb{R}^n \to \mathbb{R}^m$, $y = F(x)$, with Jacobian $J = \partial F / \partial x$. By decomposing $F$ into a sequence of elementary operations, the forward mode of AD differentiates P by repeated application of the chain rule. Propagating a seed vector $\dot{x}$ through the differentiated program yields the Jacobian vector product (see (1)); appropriate selection of seed vectors (i.e. a seed matrix) enables computation of the Jacobian matrix itself. The reverse mode of AD, on the other hand, propagates a seed vector $\bar{y}$ backwards, yielding the Jacobian transpose vector product $\bar{x} = J^T \bar{y}$.
(1)  $\dot{y} = J \dot{x}$
Application of ADOL-C to an existing program requires identifying the section of code to be differentiated and marking it as an 'active' section. The user changes independent, dependent and intermediate variables of type double within the active section into variables of the special ADOL-C type, adouble; that is, these variables are marked as 'active variables'. Operations performed on active variables are recorded to a tape, which can be read again by ADOL-C. Upon reading this tape for some input vector (of the same dimension as the number of independent variables), ADOL-C computes derivatives by overloading the operations recorded on the tape with their associated derivatives and invoking the chain rule appropriately.
3 PETSc Adjoint
The adjoint approach used by PETSc works in a similar way to the reverse mode of AD, with the main difference being that the primitive operation is a time step instead of a line of source code. A comprehensive reference for the discrete adjoint approach in PETSc can be found in [16]. Without loss of generality, we denote a multistage time stepping algorithm (e.g. Runge-Kutta methods) as
(2)  $u_{n+1} = \mathcal{N}(u_n), \quad n = 0, \dots, N-1,$
where $u_0$ contains the initial values and the solution at the end of the simulation is given by $u_N$. The typical goal is to compute the sensitivities of an objective function $\psi(u_N)$ with respect to the initial values $u_0$ or system parameters $p$.
By initializing the adjoint variable with
(3)  $\lambda_N = \left( \partial \psi / \partial u_N \right)^T$
and solving the discrete adjoint model
(4)  $\lambda_n = \left( \partial \mathcal{N}(u_n) / \partial u_n \right)^T \lambda_{n+1}, \quad n = N-1, \dots, 0,$
backward in time, we obtain the gradient $d\psi / du_0 = \lambda_0$. Take the simplest implicit method, backward Euler, for example. The forward propagation equation for an autonomous ODE system is
(5)  $u_{n+1} = u_n + h_n f(u_{n+1}),$
where $h_n$ is the step size at the $n$th time step and $f$ is the ODE right-hand side function. The associated adjoint model is
(6)  $\lambda_n = \left( I - h_n f_u(u_{n+1}) \right)^{-T} \lambda_{n+1}.$
It can be seen that the essential ingredient in the adjoint computation is the Jacobian transpose. When solving the original PDE problem, users need to supply Jacobian functions for solving the nonlinear system arising at each time step, or use the finite difference approximations constructed by PETSc; these Jacobians are reused in the adjoint computation as much as possible. Extra Jacobians may be required by the adjoint solver, depending on the problem settings (see [16]). The PETSc adjoint solver assembles the adjoint calculation from the Jacobians at the time stepping level, thus avoiding a full differentiation of the whole library code, which consists of complicated data structures, iterative linear solvers and parallel infrastructure. However, it can still be a significant burden for users to provide callback functions that efficiently evaluate the Jacobians, especially when nonlinear PDE problems with complicated boundary conditions are involved.
4 Benchmark Problem
For the purposes of numerical experimentation and illustration of the automatic differentiation process we now introduce a nonlinear, timedependent PDE.
The Gray-Scott model describes the diffusion and reaction of two chemical species with concentrations $u$ and $v$, as in [8]. The model may be expressed as a coupled system of scalar, nonlinear, time-dependent PDEs in two spatial dimensions [11]:
(7)  $\partial u / \partial t = D_u \nabla^2 u - u v^2 + \gamma (1 - u), \qquad \partial v / \partial t = D_v \nabla^2 v + u v^2 - (\gamma + \kappa) v.$
Here $D_u$ and $D_v$ are diffusion constants, while $\gamma$ and $\kappa$ are reactivity constants.
Consider a doubly periodic, square spatial domain $[0, 2.5]^2$ and a time interval $[0, T]$. As in ([11], pp. 21-22), consider the initial conditions
(8)  $u(x, y, 0) = 1 - 2 v(x, y, 0)$
(9)  $v(x, y, 0) = \tfrac{1}{4} \sin^2(4\pi x) \sin^2(4\pi y)$ for $(x, y) \in [1, 1.5]^2$, and $v(x, y, 0) = 0$ elsewhere.
For the adjoint analysis, we seek the sensitivity of the final solution at the domain centre with respect to the initial conditions given by (8)-(9). Note that, given the nonlinear nature of the problem, the adjoint problem depends directly on the forward solution at every time step.
A numerical solution is obtained using a centred finite difference discretization on a uniform quadrilateral mesh and Crank-Nicolson time integration [5].
One could argue that, due to the simplicity of Equation (7), the Jacobian may be easily derived by hand, rendering AD unnecessary. However, since we intend to construct a fully automated framework, we take advantage of this opportunity and also derive the analytic Jacobian, to serve as a verification tool.
5 Automating Jacobian Computation in PETSc
5.1 Differentiated Program
We consider an application of ADOL-C to generate the Jacobian of the discretized PDE residual at each timestep. The discretized residual code used by PETSc in solving the Gray-Scott model (7) is given in Listing 1. Given an array of Field structs containing the independent variables, $u$ and $v$, at some timestep, this function applies the spatial finite difference approximation in order to give the associated residual.
In the PETSc TS solver, data is stored using higher level structures such as matrices (Mat) and vectors (Vec). Data may be accessed from these objects as arrays through the DM data management framework. In the case of Equation (7), we have a uniform, structured grid and so use the distributed array, DMDA. In Listing 1, arrays of PetscScalars (i.e. doubles) are extracted from the DMDALocalInfo and modified according to the residual evaluation.
From the user perspective, all that is required in order to apply AD to an existing PETSc code is to establish an 'active version' of the residual, wherein active variables are defined and marked as independent or dependent, as appropriate. Consider now Listing 2, which illustrates the active version of the code given in Listing 1. The calls to trace_on and trace_off delimit the active section of the code, which is to be differentiated and is identified by the global integer tag. The marking of variables as independent and dependent occurs at the beginning and end of the active section, respectively.¹
¹ The use of dummy variables in the marking of dependent variables corresponds to ghost points. Computations involving ghost points are required for parallelism, but these points should not be marked as dependent, as this would contribute additional nonzeros to the Jacobian.
In between, the loops which evaluate the residual take exactly the same form as in Listing 1, except that the double variables involved in the residual evaluation are replaced by active variables of adouble type. For each instance in which the Jacobian is to be evaluated, ADOL-C overloads the operations on the tape and propagates seed vectors through the forward (or reverse) mode of AD, thereby computing derivatives. In the case of (7), the same tape can be used for all timesteps, since the form of the discretized residual does not explicitly depend on the time level. As such, the active version of the residual is evaluated once, before the time integration begins, and the original, 'passive' version is used thereafter. The computational cost is thus reduced by generating the tape only once. In the general case, the 'active' version of the residual would be used at every function evaluation.
In the following two subsections, we investigate efficiency considerations for the Jacobian computation using ADOL-C. In the implementations of both Subsections 5.2 and 5.3, Listing 2 is used to generate the tape. The difference between these methods lies in how the tape is used, not how it is generated.
5.2 Jacobian Assembly using ADOL-C
Let $n$ be the number of degrees of freedom (DOFs) at each timestep of the discretization of problem (7). Assembling the Jacobian using ADOL-C requires a seed matrix $S \in \mathbb{R}^{n \times p}$, where $p \leq n$. The 'natural' approach is to choose the identity matrix, $S = I_n$ (whence $p = n$). Propagating the entire identity matrix through the forward mode of AD is typically both prohibitively expensive and unnecessary, as illustrated in Section 6. Selecting a seed matrix with second dimension $p \ll n$ yields great computational savings.
If the Jacobian sparsity pattern is known, reductions in $p$ can be achieved by coloring [7]: columns of the Jacobian may be assigned the same color if there exist no rows in which both have nonzero entries. By this color identification, we are able to losslessly compress the Jacobian, reducing the number of columns from $n$ to $p$. A seed matrix $S$ is chosen with $p$ columns corresponding to the colors used, entries of unity where the row index corresponds to an accordingly colored column in the Jacobian, and zeros elsewhere. Propagating $S$ through the forward mode of AD yields a compressed Jacobian, $JS$. This equates to propagating $p$, as opposed to $n$, vectors, thereby reducing the cost of the propagation. The computational savings of this compressed approach are inherently problem and discretization dependent: the smaller the stencil, the smaller $p$ can be chosen.
The Jacobian can be recovered by lossless decompression, as first observed in [2]. Nonnegative entries of the resulting 'recovery matrix', $R$, hold column indices for the corresponding entries in $JS$, whilst negative entries correspond to zeros. For $p \ll n$, the additional computational cost associated with applying the index mapping of $R$ to assemble $J$ is assumed to be much less than the savings of the propagation step.
Algorithm 1 outlines the compressed Jacobian computation procedure. The first step is performed by the graph coloring package ColPack [7]. Given the boundary conditions and stencil used, PETSc is able to generate a coloring for the second step. Forward propagation of the seed matrix $S$ in the third step is performed by ADOL-C, using what was written to tape. The final step amounts to decompressing $JS$ to give the Jacobian, by application of the index mapping associated with $R$.
Jacobian evaluation for the approach described in this subsection is achieved in this work using a problem-independent PETSc-ADOL-C utility function, which performs the required propagation and recovery steps and assembles the associated Mat object.
5.3 Jacobian-Free Implementation using ADOL-C
For large problems, (compressed or uncompressed) Jacobian assembly, as described in Subsection 5.2, can become memory intensive. Furthermore, the Jacobian may only be used a few times, in which case its assembly might be seen as a waste of computational resources.
In the numerical solution of time-dependent, nonlinear PDEs, a nonlinear solve is performed at each timestep, which in turn relies on the solution of linear systems involving the Jacobian. For Krylov subspace methods, the Jacobian is not explicitly needed, only its action upon a vector. This motivates matrix-free methods, wherein the seed vector is chosen as the vector upon which the Jacobian acts. For the adjoint solve, the Jacobian transpose vector product can be provided by the reverse mode of AD.
The implementation of the matrix-free Jacobian evaluation differs from that described in Subsection 5.2. The PETSc MATSHELL matrix type is used to endow the Mat object with additional context information, whereupon we may overload the MatMult operations associated with Jacobian vector products so that they apply the forward mode of AD directly to the given input vector. Similarly, the MatMultTranspose operations associated with Jacobian transpose vector products are overloaded so as to directly apply the reverse mode of AD.
6 Experimental Results
This section compares the strategies discussed above by means of performance analysis. Due to the difficulty of effectively preconditioning an AD-generated matrix-free algorithm, preconditioning has not been used in any of the following results. The adjoint problem is solved using checkpointing to memory.



Figure 1(a) illustrates that, even for the smallest grid considered, using ADOL-C to propagate the entire identity matrix through the forward mode of AD (i.e. 'uncompressed' Jacobian assembly) is prohibitively expensive, although performance is greatly improved by parallelism. For higher resolutions, the approach is completely infeasible, since the number of active variables becomes too large. As such, it is imperative that the approaches of Subsections 5.2-5.3 are considered.
Applying coloring techniques can significantly reduce this cost. In Fig. 1, the compressed Jacobian approach exhibits strong scalability. The runtime of this approach is less than twice that of the hand-coded approach, and this relative performance actually improves as problem size increases. Since the assembled Jacobian approaches share the same workflow, we may directly compare the runtimes of their constituent operations. Figure 2 highlights Jacobian evaluation as the computational bottleneck, particularly in the cases where AD is applied. As such, it is important that the Jacobian is computed as efficiently as possible.




As revealed in Subsection 5.2, there are just two differences between the analytic and compressed approaches. Firstly, in the latter, the sparsity pattern, seed matrix and recovery matrix must be established before the time integration begins. Secondly, whenever the Jacobian is to be assembled, the seed matrix is to be propagated through the forward mode of AD and then decompressed. As such, the discrepancies in absolute runtime illustrated in Fig. 1 and proportions of runtime spent in Jacobian evaluation illustrated in Fig. 2 may be explained as being due to the AD propagation and compression algorithm.



Figure 3 further breaks down the runtimes of the compressed Jacobian evaluation steps, indicating that the costs of generating the sparsity pattern and seed matrix are negligible. This is to be expected, since they are generated only once, whilst forward propagation and Jacobian recovery occur at every timestep. We may deduce from Fig. 3 that the discrepancies between the analytic and compressed approaches shown in Fig. 1 are almost entirely due to the propagation and recovery steps of Algorithm 1. The simple recovery routine used for the final step proves more costly than the forward propagation itself, indicating potential inefficiencies in this routine; the discrepancy between the analytic and compressed approaches would be reduced by implementing a more efficient recovery technique. Nonetheless, the performance shown in Fig. 1 illustrates that the inefficient recovery technique does not pose a serious problem in this particular application.
On the smallest grid, the matrix-free approach performs almost as well as assembling the analytic Jacobian. For higher grid resolutions, Subfigures 1(b)-1(c) indicate that this relative performance diminishes somewhat. Table 1 shows linear solver iterations growing with spatial resolution, indicating that the problem becomes increasingly ill-conditioned. Variations in the number of linear solver iterations displayed in each column of Table 1 can be explained as being due to round-off error, the influence of which is essentially unpredictable. Whilst Jacobian assembly occurs at each iteration of the nonlinear solver, the matrix-free approach applies AD at each iteration of the linear solver.
Proc.  Assembled  Matrix-free    Proc.  Assembled  Matrix-free  Assembled  Matrix-free
1      1763       1763           4      8320       8113         17350      15953
2      1763       1766           16     8322       8165         17255      15960
4      1763       1762           64     8323       8167         17251      16053
8      1763       1762
For the compressed approach and for each grid resolution, there are two nonlinear solver iterations at each timestep, each involving the forward propagation of five vectors (corresponding to the five colors used). This gives ten forward propagations per timestep. For the matrix-free approach, on the other hand, a single vector is propagated forwards whenever PETSc's MatMult function is invoked, corresponding to the computation of a Jacobian vector product. This occurs, on average, 36 and 76 times per timestep on the two finer grids, respectively. Similar logic can be applied to the reverse propagations in the adjoint steps.
Given that the application of AD is the computational bottleneck, these remarks indicate why, despite the matrix-free approach often using fewer Krylov iterations, its runtimes are longer than for the compressed Jacobian. A preconditioning strategy which ensures that the number of linear solver iterations is independent of mesh resolution would help to alleviate this disadvantage. A key observation is that the compressed approach is preferable for nonlinear problems wherein the Jacobian needs to be updated frequently. Nevertheless, Figure 1 shows that both the compressed and matrix-free approaches have runtimes of the same order as an analytic Jacobian, and are therefore both valuable strategies for automatic Jacobian computation.
7 Conclusion
We have illustrated an application of ADOL-C to automatically provide Jacobian transforms for time-dependent, nonlinear PDEs and their adjoints, solved using PETSc. Whilst the Gray-Scott model (7) considered here is fairly simple, the differentiation procedure may be reused for more complex problems.
Efficiency considerations have been addressed, including compressed and matrix-free methods. These enable ADOL-C to automatically generate Jacobian transforms for larger problems, within a timeframe of the same order as the case where the Jacobian is given. We illustrate the effectiveness of the compressed approach for a finite difference problem with a relatively small stencil. The compression and propagation algorithm used exhibits strong scalability for the numbers of processors and grid sizes considered. Whilst the Jacobian-free approach performs promisingly at low resolution, it suffers when applied to larger, poorly conditioned problems and would greatly benefit from an effective preconditioning strategy.
In the case of Jacobian assembly using ADOL-C, the task of preconditioning is only as difficult as for an analytically derived Jacobian; a wide variety of preconditioners are already implemented in PETSc for this purpose. One downside of the matrix-free approach in PETSc is that none of the built-in preconditioners can be used, meaning users must provide their own, using the PCSHELL preconditioner type. Furthermore, effective preconditioning for the transposed solve in the adjoint model is challenging and remains an open research problem.
Suppose we choose to implement Jacobi preconditioning, one of the simplest forms of preconditioning. In a matrix-free approach, it is not possible to directly compute the action of the (inverse) Jacobian diagonal upon a seed vector using the same trace as for the Jacobian application. Assembling the Jacobian diagonal alone using a compressed method requires the same number of seed vectors to be propagated through AD as would be required to assemble the Jacobian itself; assembling the diagonal in this way would of course defeat the purpose of going matrix-free. However, one advantage of Jacobi preconditioning is that the preconditioner can be reused for the adjoint time integration. Depending on the problem at hand, it may be possible to make a separate trace of the PDE residual, with different independent and dependent variables, in order to implement a matrix-free preconditioner with ADOL-C.
Whilst Jacobian compression is more performant for the Gray-Scott model with low-order centred finite differences, this would not necessarily be the case for PDEs and discretizations which have larger stencils and therefore require more colors. Additionally, larger stencils would make the suboptimal Jacobian recovery routine highlighted in Figure 3 more problematic. Replacing this routine with one provided by ColPack would likely improve performance.
Note that the PDE residual code illustrated in Listings 1 and 2 involves doubly nested loops which operate on the components of arrays directly. In some applications, it is beneficial to avoid accessing these array components directly and instead perform operations on the higher level PETSc Vec or Mat structures. Future work might consider raising the abstraction level (see [4]) in order to differentiate such expressions, thereby simplifying the process of applying AD to complex PETSc codes. When raising the abstraction level, further computational savings can be attained through awareness of the PDE discretization; appropriate data abstraction allows the preservation of efficiency choices made in the problem discretization.
Acknowledgements
Many thanks to Argonne’s Autodiff research group and PETSc development team, particularly Sri Hari Krishna Narayanan, Michel Schanen, Jan Hückelheim, Barry Smith and Navjot Kukreja.
References
[1] Abhyankar, S., Brown, J., Constantinescu, E.M., Ghosh, D., Smith, B.F., Zhang, H.: PETSc/TS: A Modern Scalable ODE/DAE Solver Library. arXiv preprint arXiv:1806.01437 (2018)
[2] Averick, B.M., Moré, J.J., Bischof, C.H., Carle, A., Griewank, A.: Computing Large Sparse Jacobian Matrices using Automatic Differentiation. SIAM Journal on Scientific Computing 15(2), 285–294 (1994)
[3] Balay, S., Abhyankar, S., Adams, M.F., Brown, J., Brune, P., Buschelman, K., Dalcin, L., Dener, A., Eijkhout, V., Gropp, W.D., Kaushik, D., Knepley, M.G., May, D.A., McInnes, L.C., Mills, R.T., Munson, T., Rupp, K., Sanan, P., Smith, B.F., Zampini, S., Zhang, H., Zhang, H.: PETSc Users Manual. Tech. Rep. ANL-95/11 - Revision 3.10, Argonne National Laboratory (2018), http://www.mcs.anl.gov/petsc
[4] Bischof, C.H., Haghighat, M.R., et al.: Hierarchical Approaches to Automatic Differentiation. Computational Differentiation: Techniques, Applications, and Tools, pp. 83–94 (1996)
[5] Crank, J., Nicolson, P.: A Practical Method for Numerical Evaluation of Solutions of Partial Differential Equations of the Heat-Conduction Type. In: Mathematical Proceedings of the Cambridge Philosophical Society, vol. 43, pp. 50–67. Cambridge University Press (1947)
[6] Farrell, P., Ham, D., Funke, S., Rognes, M.: Automated Derivation of the Adjoint of High-Level Transient Finite Element Programs. SIAM Journal on Scientific Computing 35(4), C369–C393 (2013). https://doi.org/10.1137/120873558
[7] Gebremedhin, A.H., Nguyen, D., Patwary, M.M.A., Pothen, A.: ColPack: Software for graph coloring and related problems in scientific computing. ACM Transactions on Mathematical Software (TOMS) 40(1), 1 (2013)
[8] Gray, P., Scott, S.: Autocatalytic Reactions in the Isothermal, Continuous Stirred Tank Reactor: Isolas and Other Forms of Multistability. Chemical Engineering Science 38(1), 29–43 (1983)
[9] Gunzburger, M.D.: Perspectives in Flow Control and Optimization, vol. 5. SIAM (2003)
[10] Hovland, P., Norris, B., Smith, B.: Making automatic differentiation truly automatic: coupling PETSc with ADIC. Future Generation Computer Systems 21(8), 1426–1438 (2005). https://doi.org/10.1016/j.future.2004.11.008
[11] Hundsdorfer, W., Verwer, J.G.: Numerical Solution of Time-Dependent Advection-Diffusion-Reaction Equations, vol. 33. Springer Berlin Heidelberg (2003)
[12] Lotz, J., Naumann, U., Ungermann, J.: Hierarchical Algorithmic Differentiation: a Case Study. In: Recent Advances in Algorithmic Differentiation, pp. 187–196. Springer (2012)
[13] Phipps, E.T., Bartlett, R.A., Gay, D.M., Hoekstra, R.J.: Large-Scale Transient Sensitivity Analysis of a Radiation-Damaged Bipolar Junction Transistor via Automatic Differentiation. In: Advances in Automatic Differentiation, pp. 351–362. Springer (2008)
[14] Sagebaum, M., Albring, T., Gauger, N.R.: High-Performance Derivative Computations using CoDiPack. arXiv preprint arXiv:1709.07229 (2017)
[15] Walther, A., Griewank, A.: Getting Started with ADOL-C. In: Combinatorial Scientific Computing, pp. 181–202 (2009)
[16] Zhang, H., Abhyankar, S., Constantinescu, E., Anitescu, M.: Discrete Adjoint Sensitivity Analysis of Hybrid Dynamical Systems With Switching. IEEE Transactions on Circuits and Systems I: Regular Papers (2017). https://doi.org/10.1109/TCSI.2017.2651683
[17] Zhang, H., Sandu, A.: FATODE: a Library for Forward, Adjoint, and Tangent Linear Integration of ODEs. SIAM Journal on Scientific Computing 36(5), C504–C523 (2014). https://doi.org/10.1137/130912335