I Introduction
Adjointbased optimization problems typically consist of a simulation that is run forward in simulation time, producing data that is used in reverse order by a subsequent adjoint computation that is run backwards in simulation time. Figure 2 shows the resulting data flow. Many important applications in science and engineering have this structure, including seismic inversion.
In seismic inversion, the propagation of seismic waves through the earth’s subsurface is simulated and compared with data from field measurements. The model of the subsurface is iteratively improved by minimizing the misfit between simulated data and field measurement in an adjoint optimization problem. Figure 1 shows the setup of the field experiment that produces the measured data [18]. Storing all intermediate results as required for the data flow shown in Figure 2 is not feasible, since this would require tens of terabytes of memory, which is not commonly available on the machines used to solve these problems. To solve this problem, a number of strategies are used that are discussed in the following paragraphs. While this paper uses seismic inversion as a test case, most of these ideas, including the one presented in this work, are also applicable to other adjoint optimization problems.
Domain decomposition is often used not only to distribute the computational workload across more processors, but also to utilize the large amount of memory available in distributed systems. While this strategy is very powerful, the number of compute nodes and therefore the amount of memory that can be used efficiently is limited, for example by communication overheads that start to dominate as the domain is split into increasingly small pieces [22]. Secondly, this method is broadly used on conventional clusters but can be drastically more complicated to seup and slower due to the communication in a cloudbased setup. Another common strategy in seismic inversion is to only store values at the boundaries of the domain at each timestep, and reconstruct the rest of the wavefield when required [6, 24] with time reversal of the wave equation. However, this method is not applicable for wave equations that are not time reversible when for example physical attenuation is included.
Checkpointing is yet another strategy to reduce the memory overhead. Only a subset of the program states during the forward pass is stored (and the rest discarded). The discarded data is recomputed when needed by restarting the forward pass from the last available stored state. The Revolve algorithm [11] provides an answer to the question of which states should be stored and which states should be recomputed to minimize the total amount of recomputation work. Other authors have subsequently developed extensions to Revolve that are optimal under different assumptions [21, 23, 4, 19].
Data compression is also increasingly used to reduce the memory footprint of scientific applications. General purpose data compression algorithms like Zlib (which is a part of gzip) [9], and compression algorithms for video and image data such as JPEG2000 [20] have been presented in previous work. More recently, special purpose compression algorithms for floatingpoint scientific data have been developed, such as ZFP or SZ [12, 14, 10].
Lossless algorithms guarantee that the exact original data can be recovered during decompression, whereas lossy algorithms introduce an error, but often guarantee that the error does not exceed certain absolute or relative error metrics. Typically, lossy compression is more effective in reducing the data size. Most popular compression packages offer various settings that allow a tradeoff between compression ratio, accuracy, and compression and decompression time.
It is worth noting that another data reduction strategy is to typecast values into a lower precision format, for example, from double precision to single precision. This can be seen as a computationally cheap lossy compression algorithm with a compression ratio of .
Perhaps counterintuitively, compression can not only reduce the memory footprint, but also speed up an application. Previous work has observed that the compression and decompression time can be less than the time saved from the reduction in data that needs to be communicated across MPI nodes or between a GPU and a host computer [17].
Another way of using compression to speed up adjointbased methods is to use it instead of checkpointing. If the compression ratio is sufficient to fit the entire data in memory, checkpointing is no longer necessary. Previous work has discussed this in the context of computational fluid dynamics [7] and seismic inversion using compression algorithms specifically designed for wavefields [8, 5].
In this paper, we extend the previous studies by combining checkpointing and compression. This is obviously useful when the data does not fit in the available memory even after compression, for example for very large adjoint problems, or for problems where the required accuracy limits the achievable compression ratios.
Compared to the use of only checkpointing without compression, our combined method often improves performance. This is a consequence of the reduced size of stored program states, allowing more program states to be stored during the forward computation. This in turn reduces the amount of recomputation that needs to be performed. On the other hand, the compression and decompression itself takes time. We therefore provide a comprehensive performance model that predicts whether the combined method is beneficial, that is, whether the time spent compressing and decompressing is less than the time saved by reduced recomputations.
For benchmarking we used a dualsocket Intel(R) Xeon(R) Platinum 8180M @ 2.50 Ghz (28 cores each), henceforth called Skylake. We discuss performance results for varying error tolerances for the lossy ZFP compression algorithm ranging from to , and different types of forward and adjoint computations that vary in their ratio between compute cost and state size.
Our aim is to make this discussion independent of the hardware, application and compression algorithm. We therefore do not discuss whether or not the accuracy of the decompressed data is sufficient for our application, or whether other algorithms might achieve a better accuracy/compression tradeoff.
Ii Test case using Devito and pyRevolve
to solve forward and adjoint wave equation problems. Devito is a domainspecific language that enables the rapid development of finitedifference solvers from a highlevel description of partial differential equations. The simplest version of the seismic wave equation is the acoustic isotropic wave equation defined as:
(1) 
where is the squared slowness, the spatially dependent speed of sound, is the pressure wavefield, denotes the laplacian of the wavefield and is a source term. Some of our experiments in Section V are performed using a more accurate and more complex version of this equation called Tilted Transverse Isotropy (TTI) [25] that takes into account the anisotopic propagation of waves in the earth subsurface (directional dependency of the speed of sound). We leave the TTI equations out of this paper for brevity.
The solution to equation 1 forms the forward problem. The seismic inversion problem minimizes the misfit between simulated and observed signal given by:
(2) 
This optimization problem is usually solved using gradient based methods such as steepest descent, where the gradient is computed using the adjointstate method that involves the dataflow pattern from Figure 2.
The values of used in this work are derived from the Overthrust model [3] over a grid of points, including an absorbing layer of 40 points on each side. The grid spacing is in space. The propagation time is that corresponds to 2500 timesteps. The wave field at the final time is shown in Figure 3, and only this field was used for the compression experiments in this paper. The uncompressed size of this single time step field is just under 900MB.
To implement Revolve with Devito, we use pyRevolve [13] which is a python wrapper for the Revolve algorithm. The performance model in section IV assumes that the implementation is similar to pyRevolve, which stores a checkpoint by copying a portion of the operator’s working memory to the checkpointing memory and similarly loads a checkpoint by copying from the checkpointing memory to the operator’s working memory. Although a perfect implementation of checkpointing may be able to avoid these copies, the overhead attached to these copies can be ignored for an operator that is sufficiently computationally intensive. However, we include the overheads in the model to verify this assumption.
Iii Compression algorithms
Iiia Lossless
We use the python package blosc [1], which includes implementations for six different lossless compression algorithms, namely ZLIB, ZSTD, BLOSCLZ, LZ4, LZ4HC and Snappy. An exhaustive search over all available parameters resulted in a maximum compression ratio of 1.18x. We therefore focus on lossy compression in the remainder of this work.
IiiB Lossy
We use the lossy compression package ZFP [14] developed in C. To use ZFP from python, we developed a python wrapper for the reference implementation of ZFP.
ZFP supports three compression modes, namely fixedtolerance, fixedprecision and fixedrate. The fixedtolerance mode limits the absolute error, while the fixedprecision mode limits the relative error to some userspecified value. The fixedrate mode achieves a guaranteed compression ratio requested by the user, but does not provide any bounds on accuracy loss. Figure 5 shows compression ratios for a variety of settings for the three modes.
The compression ratio achieved per unit time spent during compressing and decompressing was seen to be highest in the fixedtolerance mode albeit with highly unpredictable compression ratios. Figure 4 shows the spatial distribution of the error after compression and decompression, compared to the original field, using fixedtolerance mode.
In our experiments using Devito, the time required to compress a checkpoint was in the same order of magnitude as the time taken to compute a timestep.
Iv Performance model for a combination of Revolve and compression
We assume that the computation of a single forward time step takes the same wall time as the computation of a single reverse time step. We denote this time as . For a simulation with timesteps, the minimum wall time required for the full forwardadjoint evaluation is given by
(3) 
If the size of a single timestep in memory is given by , this requires a memory of at least size . If sufficient memory is available, no checkpointing or compression is needed.
If the memory is smaller than , Revolve provides a strategy to solve for the adjoint field by storing a subset of the total checkpoints and recomputing the remaining ones. The overhead introduced by this method can be broken down into the recomputation overhead and the storage overhead . The recomputation overhead is the amount of time spent in recomputation, given by
(4) 
where is the minimum number of recomputed steps from [11], reproduced here in equation 5.
(5) 
In equation 5, M is the number of checkpoints that can be stored in memory. Note that for , would be zero. For , grows rapidly as M is reduced relative to N.
In an ideal implementation, the storage overhead might be zero, since the computation could be done “inplace”, but in practice, checkpoints are generally stored in a separate section of memory and they need to be transferred to a “computational” section of the memory where the compute is performed, and then the results copied back to the checkpointing memory. This copying is a common feature of checkpointing implementations, and might pose a nontrivial overhead in some scenarios. This storage overhead is given by
(6) 
where is the total number of times Revolve writes checkpoints for a single run, is the number of times checkpoints are read, and is the memory bandwidth of the target system. The total time to solution becomes
(7) 
By using compression, the size of each checkpoint is reduced and therefore the number of checkpoints available is increased ( in equation 5). This reduces the recomputation overhead , while at the same time adding overheads related to compression and decompression in . To be beneficial, the reduction in must offset the increase in , leading to an overall decrease in the time to solution .
Our performance model assumes that the compression algorithm behaves uniformly across the different time steps of the simulation, i.e. that we get the same compression ratio, compression time and decompression time, no matter which of the possible checkpoints we try to compress/decompress. The storage overhead now becomes
(8) 
where is the compression ratio (i.e. the ratio between the uncompressed and compressed checkpoint), and and are compression and decompression times, respectively. At the same time, the recomputation overhead decreases because times more checkpoints are now available.
V Results
We can distinguish three different scenarios, depending on the amount of available memory.

If the memory is insufficient even with compression to store the entire trajectory, one can either use checkpointing only, or combine checkpointing with compression.

If the available memory is not sufficient to store the uncompressed trajectory, but large enough to store the entire compressed trajectory, we study the two possible strategies: Either use compression only, or use checkpointing only.

If the available system memory is large enough to hold the entire uncompressed trajectory, neither compression nor checkpointing is necessary.
All three scenarios can be seen in Figure 6. The second scenario was studied in previous work [7], while the combined method is also applicable to the first scenario, for which previous work has only used checkpointing without compression.
We can identify a number of factors that make compression more likely to be beneficial compared to pure checkpointing: A very small system memory size and a large number of time steps lead to a rapidly increasing recompute factor, and compression can substantially reduce this recompute factor. This can be seen in Figures 6 and 8.
The extent to which the recompute factor affects the overall runtime also depends on the cost to compute each individual time step. If the compute cost per time step is large compared to the compression and decompression cost, then compression is also likely to be beneficial, as shown in Figure 7. As the time per time step increases and the compression cost becomes negligible, we observe that the ratio between the runtime of the combined method and that of pure checkpointing is only determined by the difference in recompute factors.
Vi Conclusions and Future work
We use lossy compression to reduce the computational overhead of checkpointing in an adjoint computation used in full waveform inversion, a common method in seismic imaging applications whose memory footprint commonly exceeds the available memory size in high performance computing systems. We also developed a performance model that computes whether or not the combination of compression and checkpointing will outperform pure checkpointing or pure compression in a variety of scenarios, depending on the available memory size, computational intensity of the application, and compression ratio and throughput of the compression algorithm. Our current result has several limitations that we plan to address in future work:

We do not discuss the accuracy of the results after decompression. This depends on the application, compression algorithm, and affects the achievable compression ratios. Our performance model only requires knowledge of the compression time and ratio, and it is up to the user of this model to determine what accuracy is needed and thus what compression ratio is realistic for their application.

We only present results using the ZFP compression library. It would be interesting to try a greater variety of compression algorithms.

ZFP only supports serial decompression. If ZFP supported parallel decompression, our experiments would likely show a geater region in which the combined method is faster than pure checkpointing. Furthermore, ZFP only supports fields with up to three dimensions, while exploiting similarities between fields at different time steps may yield a better compression ratio.

Our performance model is based on uniform compression ratios and times. However, many applications, including FWI, are likely to have initial conditions that contain little information and are easily compressed, and the compression ratio gradually declines as the field becomes more complex. We based our experiments on the final wave field, which is presumably difficult to compress.

In comparing pure compression with pure checkpointing, we assume that every checkpoint is compressed and decompressed. However, if the available memory is only slightly less than the required memory, an implementation that compresses only a subset of the checkpoints might outperform the expectations of our model.

We do not discuss multilevel checkpointing, where some checkpoints are stored on a slower, larger device. We expect compression to be beneficial in these scenarios due to reduced data transfer sizes.
Acknowledgment
This work was funded by the Intel Parallel Computing Centre at Imperial College London and EPSRC EP/R029423/1. This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics and Computer Science programs under contract number DEAC0206CH11357. We would also like to acknowledge the support from the SINBAD II project and the member organizations of the SINBAD Consortium.
We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory.
References
 [1] Blosc (http://blosc.org).
 [2] Earth’s physical resources: petroleum  detection, exploration and evaluation (continued) (http://www.open.edu/openlearn/sciencemathstechnology/science/environmentalscience/earthsphysicalresourcespetroleum/contentsection3.2.1).
 [3] Fred Aminzadeh, N Burkhard, J Long, Tim Kunz, and P Duclos. Three dimensional SEG/EAGE models—an update. The Leading Edge, 15(2):131–134, 1996.
 [4] Guillaume Aupy, Julien Herrmann, Paul Hovland, and Yves Robert. Optimal multistage algorithm for adjoint computation. SIAM Journal on Scientific Computing, 38(3):C232–C255, 2016.
 [5] Christian Boehm, Mauricio Hanzich, Josep de la Puente, and Andreas Fichtner. Wavefield compression for adjoint methods in fullwaveform inversion. Geophysics, 81(6):R385–R397, 2016.
 [6] Robert G Clapp. Reverse time migration with random boundaries. In Seg technical program expanded abstracts 2009, pages 2809–2813. Society of Exploration Geophysicists, 2009.
 [7] ERIC C Cyr, JN Shadid, and T Wildey. Towards efficient backwardintime adjoint computations using data compression techniques. Computer Methods in Applied Mechanics and Engineering, 288:24–44, 2015.
 [8] F Rubio Dalmau, M Hanzich, J de la Puente, and N Gutiérrez. Lossy data compression with DCT transforms. In EAGE Workshop on High Performance Computing for Upstream, 2014.
 [9] Peter Deutsch and JeanLoup Gailly. Zlib compressed data format specification version 3.3. Technical report, 1996.
 [10] Sheng Di, Dingwen Tao, Xin Liang, and Franck Cappello. Efficient lossy compression for scientific data based on pointwise relative error bound. IEEE Transactions on Parallel and Distributed Systems, 2018.
 [11] Andreas Griewank and Andrea Walther. Algorithm 799: Revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Transactions on Mathematical Software (TOMS), 26(1):19–45, 2000.
 [12] Christos Kaklamanis, Theodore Papatheodorou, PaulG. Spirakis, Jeremy Iverson, Chandrika Kamath, and George Karypis. Lecture Notes in Computer Science, volume 7484, chapter Fast and Effective Lossy Compression Algorithms for Scientific Datasets, pages 843–856. Springer Berlin Heidelberg, 2012.
 [13] Navjot Kukreja, Jan Hückelheim, Michael Lange, Mathias Louboutin, Andrea Walther, Simon W Funke, and Gerard Gorman. Highlevel python abstractions for optimal checkpointing in inversion problems. arXiv preprint arXiv:1802.02474, 2018.
 [14] Peter Lindstrom. Fixedrate compressed floatingpoint arrays. IEEE transactions on visualization and computer graphics, 20(12):2674–2683, 2014.
 [15] M. Louboutin, M. Lange, F. Luporini, N. Kukreja, P. A. Witte, F. J. Herrmann, P. Velesko, and G. J. Gorman. Devito: an embedded domainspecific language for finite differences and geophysical exploration. CoRR, abs/1808.01995, Aug 2018.
 [16] F. Luporini, M. Lange, M. Louboutin, N. Kukreja, J. Hückelheim, C. Yount, P. Witte, P. H. J. Kelly, G. J. Gorman, and F. J. Herrmann. Architecture and performance of Devito, a system for automated stencil computation. CoRR, abs/1807.03032, jul 2018.
 [17] Molly A. O’Neil and Martin Burtscher. Floatingpoint data compression at 75 gb/s on a GPU. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units. ACM, 2011.
 [18] RE Plessix. A review of the adjointstate method for computing the gradient of a functional with geophysical applications. Geophysical Journal International, 167(2):495–503, 2006.
 [19] Michel Schanen, Oana Marin, Hong Zhang, and Mihai Anitescu. Asynchronous twolevel checkpointing scheme for largescale adjoints in the spectralelement solver nek5000. In ICCS, pages 1147–1158, 2016.
 [20] Athanassios Skodras, Charilaos Christopoulos, and Touradj Ebrahimi. The JPEG 2000 still image compression standard. IEEE Signal processing magazine, 18(5):36–58, 2001.
 [21] Philipp Stumm and Andrea Walther. Multistage approaches for optimal offline checkpointing. SIAM Journal on Scientific Computing, 31(3):1946–1967, 2009.
 [22] Jean Virieux, Stéphane Operto, Hafedh BenHadjAli, Romain Brossier, Vincent Etienne, Florent Sourbier, Luc Giraud, and Azzam Haidar. Seismic wave modeling for seismic imaging. The Leading Edge, 28(5):538–544, 2009.
 [23] Qiqi Wang, Parviz Moin, and Gianluca Iaccarino. Minimal repetition dynamic checkpointing algorithm for unsteady adjoint calculation. SIAM Journal on Scientific Computing, 31(4):2549–2567, 2009.
 [24] Pengliang Yang, Jinghuai Gao, and Baoli Wang. RTM using effective boundary saving: A staggered grid GPU implementation. Computers & Geosciences, 68:64–72, 2014.
 [25] Yu Zhang, Houzhu Zhang, and Guanquan Zhang. A stable TTI reverse time migration and its implementation. Geophysics, 76(3):WA3–WA11, 2011.
Comments
There are no comments yet.