ACORNS: An Easy-To-Use Code Generator for Gradients and Hessians

07/09/2020 · by Deshana Desai et al.

The computation of first and second-order derivatives is a staple in many computing applications, ranging from machine learning to scientific computing. We propose an algorithm to automatically differentiate algorithms written in a subset of C99 code and its efficient implementation as a Python script. We demonstrate that our algorithm enables automatic, reliable, and efficient differentiation of common algorithms used in physical simulation and geometry processing.


1. Introduction

The calculation of derivatives is needed in many applications, including machine learning, scientific computing, geometry processing, computer vision, and natural language processing. For these applications, differentiating large and complex computational graphs is a computational bottleneck, so the efficient computation of derivatives remains an area of active research. Extensive research and engineering effort has been spent in recent years to support applications in deep learning, leading to the development of libraries such as TensorFlow and PyTorch, which target applications with large, dense tensors. The availability of easy-to-use differentiation libraries is one of the reasons for the massive advancement in deep learning: they allow researchers to focus on the design of new network architectures and loss functions, while automating and ensuring correctness in the low-level tasks required to minimize these functionals. However, these libraries are specialized for computing first derivatives only, and they are optimized for large, dense tensors, which are uncommon in applications outside of deep learning. Their novelty lies not in the automatic differentiation technique used, but in the ease of use, cross-platform availability, and ease of integration into new research projects.

Our goal is to provide a similar solution for scientific applications requiring first and second derivatives of expressions involving small to medium tensors, using an approach based on symbolic differentiation. When paired with the optimization algorithms available in modern compilers, our approach generates code around an order of magnitude faster than existing methods while allowing users to write their functions directly in the C language. The algorithm is easy to integrate into existing build systems and produces dependency-free code.

The core parts of the library are implemented in Python using a C99 parser (Bendersky, 2020). We perform symbolic differentiation with unrolled derivatives, stored as expression trees, and generate an efficient kernel to evaluate them using an existing C99 compiler, addressing the potentially long compilation times with a file-splitting approach. State-of-the-art compilers are capable of sophisticated optimizations such as vectorization of loops and expression trees, elimination of redundancy in expressions, and reuse of computations performed in previous loop iterations. Our approach benefits from these features, and the performance of our method will likely increase as compiler technology progresses. The expression unrolling also allows us to trivially parallelize computations, not only across different data points but also for a single data point over multiple CPU cores, which are common in modern architectures. Finally, we provide support for efficient differentiation of algorithms with nested loops, arrays, and functions. A key difference with existing libraries is that we create kernels to efficiently evaluate the expression itself, its gradient, and its Hessian, all optionally supporting parallel evaluation.

Our open-source implementation makes the integration of the algorithm straightforward in modern C and C++ applications, thanks to its ability to directly differentiate code and to produce standard, dependency-free C99 code that can be easily used in existing applications.

To demonstrate the practical impact of our approach, we integrated it into two applications, one in geometry processing and one in scientific computing. We show that our library provides important speedups in the overall running time of mesh parametrization algorithms and in the quasi-static simulation of non-linear elastic deformations.

2. Related Work

Automatic (or algorithmic) differentiation was introduced in the pioneering works of Wengert (1964) and Speelpenning (1980). We briefly review the most relevant work and refer readers to various books and surveys for further detail (Naumann, 2012; Griewank and Walther, 2008; Bischof and Bücker, 2000; Margossian, 2019) and to www.autodiff.org for a collection of implementations. We provide comparisons against representative methods for operator overloading and source transformation, for both gradients and higher-order derivatives, in Section 4.

2.1. Operator Overloading

The most common way of implementing automatic differentiation is to define a new object class that associates gradient information with the data. In forward mode, the gradient information is updated along the trace of the computation; in reverse mode, a tape of the computation is recorded, and the gradient information is then back-propagated along the recorded tape. Due to their versatility and the freedom they leave for prototyping, operator overloading approaches have gained a lot of traction; notable examples include (Paszke et al., 2019; Paszke et al., 2017; Abadi et al., 2016; Bell, 2012; Griewank et al., 1996; Jakob, 2010). However, the efficiency of the code often suffers from the additional runtime overhead, which is especially noticeable when the same computation is executed repeatedly.
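
To illustrate the idea behind forward-mode differentiation with derivative information attached to the data, the following minimal sketch pairs each scalar value with its derivative and propagates both through every operation. It is written in plain C with explicit helper functions standing in for the overloaded operators a C++ library would provide; all names are illustrative and do not belong to any of the cited libraries.

#include <math.h>
#include <stdio.h>

/* A value together with its derivative with respect to one input. */
typedef struct { double val, der; } dual;

static dual dual_add(dual a, dual b) {           /* sum rule */
    return (dual){ a.val + b.val, a.der + b.der };
}
static dual dual_mul(dual a, dual b) {           /* product rule */
    return (dual){ a.val * b.val, a.der * b.val + a.val * b.der };
}
static dual dual_sin(dual a) {                   /* chain rule for sin */
    return (dual){ sin(a.val), cos(a.val) * a.der };
}

int main(void) {
    dual x = { 2.0, 1.0 };                            /* seed dx/dx = 1 */
    dual y = dual_add(dual_mul(x, x), dual_sin(x));   /* y = x*x + sin(x) */
    printf("y = %f, dy/dx = %f\n", y.val, y.der);     /* dy/dx = 2x + cos(x) */
    return 0;
}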

2.2. Source Transformation

Source transformation approaches parse the input source code (most commonly C and Fortran (Hascoet and Pascual, 2013; Voßbeck et al., 2008)) and generate a new algorithm that computes the derivatives of the input function. Compared to operator overloading approaches, having the source code available a priori opens the door to global data-flow analysis and optimization, thus reducing the running time. Additionally, this approach enables developers to directly debug the code for derivative computation, a crucial feature to identify and address potential issues with numerical round-off or overflow. The reliance on compilers to compile the generated source code makes these approaches future-proof, in the sense that the generated code will benefit from progress in compiler design and optimization.

A limitation of source transformation is that, since the control flow is assumed fixed to allow for the above optimization, it is challenging to support conditionals or unbounded loops. In our algorithm, we restrict the input to a subset of C99 covering common use cases in scientific computing, finite element analysis, and computer graphics.

A minor, yet practically relevant, drawback of source code transformation is the presence of the additional source files, posing challenges for the build system and version control of evolving software. We show that our library, thanks to its permissive open-source license, minimal dependencies, and availability in the Conda package system, is effortless to integrate into a CMake-based build system (Martin and Hoffman, 2010). We show how to automatically generate the required files as part of the build process, making the automatic differentiation transparent to the user (Section 3.5).

2.3. Higher Order Derivatives

Although valuable in scientific computing and numerical optimization, computing higher-order derivatives is challenging due to the need for memory management and sparsity pattern detection (Gower and Mello, 2010; Walther, 2008). Therefore, only a few existing libraries support the computation of second or higher-order derivatives. Our method supports dense Hessian computation, taking advantage of the symmetric structure to reduce computation times.

3. Method

The input of our algorithm is a program written in a subset of C99 (Ritchie et al., 1988): we support arrays, for loops, nested loops, binary assignments, functions, and variable declarations. The output is a set of self-contained, multi-threaded C99 functions to evaluate the function in the input file, its first derivative, and its second derivative. We designed our library to become part of a build system, seamlessly generating the derivative code as part of the build procedure of a software package (Section 3.5).

We overview our algorithm in Section 3.1, and show a complete execution on a prototypical expression in Section 3.2. We then discuss how to parallelize the evaluation in Section 3.3 and how to make our approach scale to large expressions in Section 3.4. We show the performance of the algorithm on synthetic expressions (Section 4), comparing it with state-of-the-art differentiation libraries, and integrate it into a physical simulation library for a realistic benchmark.

3.1. Overview

Primer on Automatic Differentiation.

Given a target function f : R^n -> R^m, the corresponding Jacobian matrix J has entries J_ij = ∂f_i/∂x_j. We use the pycparser library (Bendersky, 2020) to parse the input C99 code and produce an Abstract Syntax Tree (AST) of the input algorithm. The leaves of the tree are either numerical constants or variables, and the internal nodes follow a hierarchical structure built from the code. Consider, as an example, a composite function f(x) = g(h(x)): by the chain rule, its Jacobian is the product of the Jacobians of its parts, J_f(x) = J_g(h(x)) J_h(x).

From this example we see that the auto-differentiation of a function breaks down the action of the Jacobian matrix on a vector into simple components, which are evaluated sequentially. We store the operations required for the calculation of the derivative of each of these pieces as a string. The final derivative is the accumulation of the piece-wise derivatives of the sub-operations following the chain rule.

For computing second-order derivatives, we differentiate the tree twice. We further reduce the number of calculations by taking advantage of the symmetry of the Hessian: we compute only the lower triangular part of the matrix and copy it over the upper triangular half.
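
As a minimal illustration of how the Hessian symmetry is exploited (the storage layout and the function name below are hypothetical, not the exact code emitted by ACORNS), only the entries with column index at most the row index need to be computed; the remaining entries are obtained by mirroring:

/* hess is a dense n-by-n Hessian stored in row-major order, with the
   lower triangle (col <= row) already filled in. */
void mirror_lower_to_upper(double *hess, int n) {
    for (int row = 0; row < n; ++row)
        for (int col = 0; col < row; ++col)
            /* H[col][row] = H[row][col] because the Hessian is symmetric. */
            hess[col * n + row] = hess[row * n + col];
}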

Implementation Specific Details.

The derivative formulas are unrolled into a sequence of assignments (no loops or conditionals are used) and saved to a C file. The file is then compiled with full code optimization (-O3 with gcc (Stallman, 2002) and clang (Lattner and Adve, 2004)). This leverages the compiler's ability to eliminate redundancy in the expression trees, reuse computations (especially memory loads and stores), and automatically vectorize the code when appropriate.

If loops or conditional statements are present in the program, we perform an additional parsing step to expand them. We unroll loops to produce a plain sequence of operations and eliminate the instructions that control the loop. Conditional statements are removed by evaluating their condition during parsing. The output of this phase is a list of assignment instructions, stored in an intermediate binary format. This file is then parsed to create the AST, the AST is differentiated, and the differentiated AST is exported as a C file.
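
As a small, hypothetical illustration of these two steps (the input program is made up for this example, and the exact transformation performed by ACORNS may differ), a branch whose condition is known at parse time is replaced by the taken side while the loop is unrolled:

/* Input fragment: */
double f(const double *x) {
    double s = 0;
    int use_square = 1;             /* condition known at parse time */
    for (int i = 0; i < 2; i++) {
        if (use_square)
            s = s + x[i] * x[i];
        else
            s = s + x[i];
    }
    return s;
}

/* After loop unrolling and conditional elimination: */
double f_unrolled(const double *x) {
    double s = 0;
    s = (s) + ((x[0]) * (x[0]));
    s = (s) + ((x[1]) * (x[1]));
    return s;
}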

3.2. Example

To further illustrate the mechanisms of ACORNS, we consider a typical example from statistics: computing the gradient of the cross-entropy loss between two probability distributions a and b. The loss function can be written as the following piece of C99 code:

double cross_entropy(const double **a, const double **b){
    double loss = 0;
    for(int i = 0; i < 2; i++){
        for(int j = 0; j < 2; j++ ){
            loss = loss - (b[i][j] * log(a[i][j] + 0.00001));
        }
    }
    return loss;
}

The function is unrolled to produce the following code:

double cross_entropy(const double **a, const double **b){
    double loss = 0;
    loss = (loss) - ((b[0][0]) * (log((a[0][0]) + (0.00001))));
    loss = (loss) - ((b[0][1]) * (log((a[0][1]) + (0.00001))));
    loss = (loss) - ((b[1][0]) * (log((a[1][0]) + (0.00001))));
    loss = (loss) - ((b[1][1]) * (log((a[1][1]) + (0.00001))));
    return loss;
}

An Abstract Syntax Tree (Figure 1) is constructed from the unrolled code.

Figure 1. Expression graph for cross entropy loss function

The syntax tree is differentiated using back-propagation to construct the derivative expressions, producing unrolled derivative code (excerpted below) that is finally compiled by a C compiler with full code optimization enabled.

for(int p = 0; p < num_points; ++p)
{
    ders[i*2+0]+= ((((((((0) - (((log((a[0][0] + 0.00001))) * (b[0][0]) + b[0][0] *
    ((1/((a[0][0] + 0.00001))*0)))))) - (((log((a[0][1] + 0.00001))) * (b[0][1]) +
    b[0][1] * ((1/((a[0][1] + 0.00001))*0)))))) - (((log((a[1][0] + ...
}

3.3. Parallelization

To accelerate the evaluation of large gradients or Hessians, we observe that the evaluation of the different entries is independent and can thus be trivially parallelized using OpenMP (Chandra et al., 2001) directives. We experimentally study the scaling of our generated code in Section 4.
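
A minimal sketch of such a parallel evaluation over many data points is shown below; the function name, the argument layout, and the two gradient expressions are illustrative placeholders rather than code actually emitted by ACORNS:

#include <omp.h>

/* The outer loop over evaluation points is embarrassingly parallel,
   so a single OpenMP directive distributes the points across threads. */
void compute_gradients(const double *vals, int num_points,
                       int num_vars, double *ders) {
    #pragma omp parallel for
    for (int p = 0; p < num_points; ++p) {
        const double *x = vals + p * num_vars;  /* inputs of point p   */
        double *g = ders + p * num_vars;        /* gradient of point p */
        /* The unrolled derivative assignments for point p go here, e.g.: */
        g[0] = 2.0 * x[0];
        g[1] = 2.0 * x[1];
    }
}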

Figure 2. Compile times (seconds) versus number of files for Hessians of functions of 12, 30, and 60 parameters.

3.4. Large Expressions

As the complexity of the expressions and/or the number of outputs grows, the size of the generated C file grows accordingly. While this opens more opportunities for the compilers to optimize the evaluation, it also increases memory consumption and compilation time. For large expressions, such as the Hessian of a cubic Lagrangian element, this might become impractical. We propose a simple, yet effective and practical, strategy to overcome this issue: we split the output into multiple C files that are compiled independently. This has the downside of reducing the optimization opportunities of compilers. However, in our experiments, the performance drop is negligible and we thus use this option aggressively to reduce compilation times.
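
A hedged sketch of the splitting strategy is shown below; the file names, function names, and entries are hypothetical, but they illustrate how each generated file holds a chunk of the unrolled assignments while a small driver calls every chunk:

/* der_part_0.c -- first chunk of unrolled Hessian entries */
void compute_hessian_part_0(const double *x, double *hess) {
    hess[0] = 12.0 * x[0] * x[0];
    /* ... more entries ... */
}

/* der_part_1.c -- second chunk, compiled independently */
void compute_hessian_part_1(const double *x, double *hess) {
    hess[1] = 2.0 * x[1];
    /* ... more entries ... */
}

/* der.c -- driver that invokes every part */
void compute_hessian_part_0(const double *x, double *hess);
void compute_hessian_part_1(const double *x, double *hess);

void compute_hessian(const double *x, double *hess) {
    compute_hessian_part_0(x, hess);
    compute_hessian_part_1(x, hess);
}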

To evaluate the effect of the split on performance, we consider a function calculating the Neo-Hookean elasticity energy (Rivlin, 1948) for a linear, quadratic, and cubic Lagrange element (12, 30, and 60 variables, respectively). The code generated for these functions is large (29 GB for the function with 60 variables), and its compilation with gcc9 does not terminate within 160 hours.

We tested splitting into different numbers of files, measuring compilation time, evaluation time, and binary size after compilation.

As expected, the compilation time decreases as the files become smaller (Figure 2), while the size of the generated binary is not noticeably affected by the split (Figure 3). Surprisingly, the evaluation time of the generated code is also not noticeably affected by the split (Figure 4). As a default, we thus split so that each file has a size of around 16 MB, which is a good compromise between the number of files and compilation time.

Figure 3. Size (MB) of the generated binary files (.o) versus number of files for Hessians of functions of 12, 30, and 60 variables.

Figure 4. Run times (seconds) versus number of files for Hessians of functions of 12, 30, and 60 variables.

3.5. Integration in a CMake Build System

One of the key design goals of our system is easy and fast deployment in numerical methods. We show how ACORNS can be integrated within a CMake build system for quick and easy deployment. In the following we describe the process, and we refer to our GitHub repository for the complete example: https://github.com/deshanadesai/acorns. Let us consider a simple problem: we want to minimize the following function (saved in the file function_0.c) using gradient descent.

// The local variable 'energy' holds the quantity to differentiate; it is
// selected on the acorns_autodiff command line below. The return value is unused.
int function_0(double x){
    double energy = pow(x, 4) - 3*pow(x, 3) + 2;
    return 0;
}

To use ACORNS to compute its derivative, we need to add the code generation to the CMake project file:

cmake_minimum_required(VERSION 3.1)
project(SimpleProject)
file(MAKE_DIRECTORY ${PROJECT_SOURCE_DIR}/ders)
add_custom_command(
OUTPUT
    ${PROJECT_SOURCE_DIR}/ders/der_0.c
    ${PROJECT_SOURCE_DIR}/ders/der_0.h
COMMAND
    acorns_autodiff
        ${PROJECT_SOURCE_DIR}/functions/function_0.c energy
        --vars x --func function_0
        --output_filename ${PROJECT_SOURCE_DIR}/ders/der_0
DEPENDS
    ${PROJECT_SOURCE_DIR}/functions/function_0.c
)
add_library(derivative "ders/der_0.c")
target_include_directories(derivative PUBLIC ders)
add_executable(main_target src/example.c)
# Link target derivative to main_executable
target_link_libraries(main_target PUBLIC derivative)

This will generate the der_0.h and der_0.c files, which will be directly compiled by CMake and linked with the other files in the project. To use the derivatives, it is then sufficient to include the generated header file. We show an example that uses the computed derivative to find a local minimum of the function with gradient descent.

#include <der_0.h>
#include <stdio.h>
#include <math.h>
int main(int argc, char** argv) {
    double vals[1] = {6.0};
    double ders[1] = {0.0};
    int iteration;
    for (iteration = 0; iteration < 10000; iteration++) {
        // calculate derivative using ACORNS
        compute(vals, 1, ders);
        float step = 0.01 * ders[0];
        vals[0] -= step;
        if (fabs(step) <= 1e-5)
            break;
    }
    printf("Minimum:%f\n", vals[0]);
    printf("Iterations:%d\n", iteration);
    return 0;
}

4. Results

Our algorithm is implemented in Python. We run our experiments on a 2.35 GHz AMD EPYC 7452 running Ubuntu 19.10 (GNU/Linux 5.3.0-29-generic x86_64). The reference implementation of ACORNS is freely available under the MIT license and can be easily obtained through a conda-forge package (it can be installed with conda install acorns -c conda-forge). The source code and the scripts to generate the experiments are available at https://github.com/deshanadesai/acorns.

We compare our generated code against four state-of-the-art differentiation tools:

  1. Tapenade (Hascoet and Pascual, 2013) shares many similarities with our system. It converts C code into a C program computing its derivatives and Hessians. (Hessian computation in Tapenade is not officially supported; we used the semi-manual construction proposed in the Q&A section of the documentation for the comparison.)

  2. PyTorch (Paszke et al., 2019) is a reverse-mode automatic differentiation library tailored for machine learning. It provides both a CPU and a GPU version; we compare against the CPU version only.

  3. Mitsuba Autodiff (Jakob, 2010; Grinspun et al., 2003) computes automatic differentiation using operator overloading and depends on Eigen (Guennebaud et al., 2010) for linear algebra and storage. We run all our comparisons using only static stack-allocated matrices, since the dynamic mode is slower and unnecessary for the expressions we use for benchmarking.

  4. Enoki (Jakob, 2019) is a new library developed specifically to support differentiable rendering. Similarly to PyTorch, it supports both CPU and GPU evaluation; we compare only with the CPU version and only for gradients, since it does not support Hessian computation.

Expression Types

We use three expressions to evaluate different realistic scenarios where automatic differentiation is used: a high-order polynomial, a non-polynomial (trigonometric) function, and a procedural expression with a large number of variables:

  1. A long, high-order polynomial:

    (1)
  2. A scalar trigonometric function:

    (2)
  3. A vector-valued polynomial whose number of variables can be varied:

    (3)

For each expression, we randomly generate evaluation points and plot the average running times over 10 runs. ACORNS generates C code, while Mitsuba and Enoki both run in C++. All the C and C++ code is compiled with gcc9/g++9 with the flags -O3, -ffast-math, and -flto, plus -std=c++11 for C++.

Figure 5. Average runtimes (seconds, log scale) versus number of evaluations for each method: long polynomial in (1), trigonometric function in (2), and polynomial in (3).

Gradient

In Figure 5 we report the timings, for different numbers of evaluation points, for the three functions. For the two polynomial functions we are roughly 5 times faster than Mitsuba and Enoki and 2.9 times faster than PyTorch. For the trigonometric function, due to the higher cost of evaluating trigonometric functions, our advantage is slightly reduced. For gradients, our approach is very similar to Tapenade, and we obtain comparable timings.

Figure 6. Runtimes (seconds) with respect to the number of variables for the polynomial in (3), for the different methods: gradient (left) and Hessian (right).

In Figure 6 (left) we fix the number of evaluations and change the number of variables in (3), to evaluate how the methods scale with the number of variables. As expected, all methods scale linearly, but our method is still 3.97, 7.95, and 3 times faster than Mitsuba, Enoki, and PyTorch, respectively. As in Figure 5, Tapenade's code has a running time very similar to ours.

Hessian

We repeat the same experiments as in the previous section to compare ACORNS with Mitsuba, PyTorch, and Tapenade when computing Hessians. Figure 7 shows that our code is very efficient, being faster than both Mitsuba and PyTorch (which is not optimized to evaluate Hessians). Tapenade is faster than the other alternatives, but it is still around 5 times slower than our code. Note also that while the Hessian generation is completely automatic with ACORNS, it requires manual interaction with Tapenade, since Hessian computation is not fully supported.

The scaling with respect to the number of variables for Hessian evaluation (Figure 6 right) has a similar trend as for the gradients.

Figure 7. Runtimes (seconds) versus number of evaluations for computing the Hessian of the different test functions: long polynomial in (1), trigonometric function in (2), and polynomial in (3).

Parallel Evaluation

We ran the function in (3) and computed the gradient for 100000 evaluation points, allowing OpenMP to use different numbers of threads. We observe an almost linear scaling up to 15 threads (Figure 8, right) on our workstation; the curve then flattens as more threads are added. We speculate that this is due to saturation of the memory bandwidth of the workstation, since our task is inherently massively parallelizable.

Figure 8. Time and speedup of our method versus the number of threads for computing the Hessian of (3) for 100000 points.

5. Applications

We integrated ACORNS into two applications, one in geometry processing and one in physical simulation, to evaluate the performance gain in a realistic setting. In both cases, the algorithm needs to assemble a dense gradient and a sparse Hessian: ACORNS is used to compute small dense blocks, which are then assembled into a larger sparse Hessian matrix. Directly supporting the automatic construction of the sparse Hessian is an interesting avenue for future work.

5.1. Parametrization

Given a surface embedded in three-dimensional space, an important problem in geometry processing is to assign parametric values to it, i.e., to compute a flattening of the surface onto a plane. We refer the readers to (Floater and Hormann, 2005; Sheffer et al., 2007) for a comprehensive survey. In our application, we minimize the non-convex symmetric Dirichlet energy (Smith and Schaefer, 2015) with a projected Newton's method (Teran et al., 2005): when the Hessian is not positive (semi-)definite, possibly at saddle points, we project it to the closest positive definite matrix through polar decomposition. The algorithm involves computing the full Jacobians and Hessians per element: our original implementation used Mitsuba, and we replaced it with ACORNS. We summarize the statistics in Table 1. The autodiff time is reduced considerably, but in this application it is not the bottleneck, thus resulting in only a small reduction of the overall running time.
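
For a symmetric per-element Hessian H with eigendecomposition H = Q Λ Qᵀ, this projection amounts to clamping the eigenvalues from below (a standard identity, not specific to ACORNS): zeroing the negative eigenvalues yields the nearest positive semi-definite matrix in the Frobenius norm, and a small ε > 0 is commonly used in place of zero when strict positive definiteness is required,

    H_{\mathrm{proj}} = Q \, \mathrm{diag}\big(\max(\lambda_1, \varepsilon), \ldots, \max(\lambda_n, \varepsilon)\big) \, Q^{\top}.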

              Autodiff Time             Total Time
      Ours     Mitsuba   Ratio    Ours     Mitsuba
      0.23s    0.40s     1.75     3.66s    4.27s
Table 1. Optimizing the parameterization of a surface mesh.

5.2. Finite Element Simulation

The finite element method aims at solving partial differential equations (PDEs) describing physical phenomena. A typical example is to simulate the deformation of an elastic body subject to external forces (boundary conditions); the algorithm involves minimizing an elastic energy, and requires computing its derivatives. We integrated ACORNS into PolyFEM (Schneider et al., 2019) and compared, for different finite element discretization orders (ranging from linear to cubic), the timings of a Neo-Hookean simulation using Mitsuba autodiff and using ACORNS to compute the Hessian and gradient of the energy. In the finite element method, all computations are performed locally on each tetrahedron. Different orders have different numbers of local degrees of freedom (which determine the size of the gradient and Hessian): a linear tetrahedron has 12, a quadratic one has 30, and a cubic one has 60. Our method provides an overall speedup of 2 to 4 times, depending on the degree, over the entire simulation time (Table 2).

                   Hessian Time                   Total Time
Degree    Ours     Mitsuba    Ratio    Ours       Mitsuba      Ratio
1         0.14s    0.98s      7.1      12.43s     55.92s       4.5
2         1.18s    5.91s      5.0      110.36s    310.61s      2.8
3         8.14s    34.84s     4.2      955.71s    2076.08s     2.2
Table 2. Runtime of a Neo-Hookean elastic simulation for different FE degrees.

6. Concluding Remarks

We introduced ACORNS, a software library for automatic differentiation that generates efficient kernels for computing both gradients and Hessians and can be easily integrated into existing C or C++ projects. The core idea of our algorithm is to fully unroll the computational graph and rely on modern compilers to optimize the resulting code. Compared with the alternatives, our algorithm is faster at evaluation time, but slower during compilation: we believe that this tradeoff is very interesting for many scientific computing applications where the evaluation of small dense Hessians is repeated millions of times. The compilation time could be reduced by performing symbolic simplification on the code before generating the C files. In addition, we could avoid the parsing time of the compiler by directly generating bytecode (for example, for the LLVM compiler) and compiling it in memory, avoiding unnecessary disk access. Conditionals on variables are not supported by ACORNS. Support for them could be added, but performance would likely decrease, due to the reduced optimization options and more complex parallelization.

We would like to extend ACORNS to directly support the generation of sparse Hessians (Gebremedhin et al., 2009; Walther, 2008) and also extend the code generation to target GPU accelerators.

Our open-source reference implementation and a set of examples on how to use it are available at https://github.com/deshanadesai/acorns, and ACORNS is also released as a Conda package on conda-forge to allow easy installation on all major operating systems.

References

  • Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
  • Bell (2012) Bradley M Bell. 2012. CppAD: a package for C++ algorithmic differentiation. Computational Infrastructure for Operations Research 57, 10 (2012).
  • Bendersky (2020) Eli Bendersky. 2020. pycparser. https://github.com/eliben/pycparser.
  • Bischof and Bücker (2000) Christian H Bischof and H Martin Bücker. 2000. Computing derivatives of computer programs. Technical Report. Argonne National Lab., IL (US).
  • Chandra et al. (2001) Rohit Chandra, Leo Dagum, David Kohr, Ramesh Menon, Dror Maydan, and Jeff McDonald. 2001. Parallel programming in OpenMP. Morgan Kaufmann.
  • Floater and Hormann (2005) Michael S Floater and Kai Hormann. 2005. Surface parameterization: a tutorial and survey. In Advances in multiresolution for geometric modelling. Springer, 157–186.
  • Gebremedhin et al. (2009) Assefaw H Gebremedhin, Arijit Tarafdar, Alex Pothen, and Andrea Walther. 2009. Efficient computation of sparse Hessians using coloring and automatic differentiation. INFORMS Journal on Computing 21, 2 (2009), 209–223.
  • Gower and Mello (2010) Robert Mansel Gower and Margarida P Mello. 2010. Hessian matrices via automatic differentiation. Universidade Estadual de Campinas, Instituto de Matemática, Estatística e ….
  • Griewank et al. (1996) Andreas Griewank, David Juedes, and Jean Utke. 1996. Algorithm 755: ADOL-C: a package for the automatic differentiation of algorithms written in C/C++. ACM Transactions on Mathematical Software (TOMS) 22, 2 (1996), 131–167.
  • Griewank and Walther (2008) Andreas Griewank and Andrea Walther. 2008. Evaluating derivatives: principles and techniques of algorithmic differentiation. Vol. 105. SIAM.
  • Grinspun et al. (2003) Eitan Grinspun, Anil N Hirani, Mathieu Desbrun, and Peter Schröder. 2003. Discrete shells. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics symposium on Computer animation. Citeseer, 62–67.
  • Guennebaud et al. (2010) Gaël Guennebaud, Benoît Jacob, et al. 2010. Eigen v3. http://eigen.tuxfamily.org.
  • Hascoet and Pascual (2013) Laurent Hascoet and Valérie Pascual. 2013. The Tapenade automatic differentiation tool: Principles, model, and specification. ACM Transactions on Mathematical Software (TOMS) 39, 3 (2013), 1–43.
  • Jakob (2010) Wenzel Jakob. 2010. Mitsuba renderer. http://www.mitsuba-renderer.org.
  • Jakob (2019) Wenzel Jakob. 2019. Enoki: structured vectorization and differentiation on modern processor architectures. https://github.com/mitsuba-renderer/enoki.
  • Lattner and Adve (2004) Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization, 2004. CGO 2004. IEEE, 75–86.
  • Margossian (2019) Charles C Margossian. 2019. A review of automatic differentiation and its efficient implementation. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9, 4 (2019), e1305.
  • Martin and Hoffman (2010) Ken Martin and Bill Hoffman. 2010. Mastering CMake: a cross-platform build system. Kitware.
  • Naumann (2012) Uwe Naumann. 2012. The art of differentiating computer programs: an introduction to algorithmic differentiation. Vol. 24. SIAM.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. (2017).
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems. 8024–8035.
  • Ritchie et al. (1988) Dennis M Ritchie, Brian W Kernighan, and Michael E Lesk. 1988. The C programming language. Prentice Hall Englewood Cliffs.
  • Rivlin (1948) Ronald S Rivlin. 1948. Large elastic deformations of isotropic materials. 1. Fundamental concepts. Philosophical Transactions of the Royal Society of London Series a-Mathematical and Physical Sciences 240, 822 (1948), 459–508.
  • Schneider et al. (2019) Teseo Schneider, Jérémie Dumas, Xifeng Gao, Denis Zorin, and Daniele Panozzo. 2019. PolyFEM. https://polyfem.github.io/.
  • Sheffer et al. (2007) Alla Sheffer, K Hormann, B Levy, M Desbrun, K Zhou, E Praun, and H Hoppe. 2007. Mesh parameterization: Theory and practice. ACM SIGGRAPH course notes 10, 1281500.1281510 (2007).
  • Smith and Schaefer (2015) Jason Smith and Scott Schaefer. 2015. Bijective parameterization with free boundaries. ACM Transactions on Graphics (TOG) 34, 4 (2015), 1–9.
  • Speelpenning (1980) Bert Speelpenning. 1980. Compiling fast partial derivatives of functions given by algorithms. Technical Report. Illinois Univ., Urbana (USA). Dept. of Computer Science.
  • Stallman (2002) Richard M Stallman. 2002. GNU compiler collection internals. Free Software Foundation (2002).
  • Teran et al. (2005) Joseph Teran, Eftychios Sifakis, Geoffrey Irving, and Ronald Fedkiw. 2005. Robust quasistatic finite elements and flesh simulation. In Proceedings of the 2005 ACM SIGGRAPH/Eurographics symposium on Computer animation. 181–190.
  • Voßbeck et al. (2008) Michael Voßbeck, Ralf Giering, and Thomas Kaminski. 2008. Development and first applications of TAC++. In Advances in Automatic Differentiation. Springer, 187–197.
  • Walther (2008) Andrea Walther. 2008. Computing sparse Hessians with automatic differentiation. ACM Transactions on Mathematical Software (TOMS) 34, 1 (2008), 1–15.
  • Wengert (1964) Robert Edwin Wengert. 1964. A simple automatic derivative evaluation program. Commun. ACM 7, 8 (1964), 463–464.