This abstract describes a technical talk for an audience with little to no prior knowledge of cloud computing, but basic knowledge of reverse-time migration and HPC. The talk is designed for people who are interested in how the cloud can be adapted for large-scale seismic imaging and inversion in both research and production environments.
Seismic imaging and parameter estimation are among the most computationally challenging problems in scientific computing and thus require access to high-performance computing (HPC) clusters for working on relevant problem sizes as encountered in today’s oil and gas (O&G) industry. Some companies such as BP, PGS and ExxonMobil operate private HPC clusters with maximum achievable performance on the order of petaflops [13, 1], while some companies are even moving towards exascale computing. However, the high upfront and maintenance costs of on-premise HPC clusters make this option financially viable only in a production environment where computational resources constantly operate close to maximum capacity. Many small and medium-sized O&G companies, academic institutions and service companies have a highly varying demand for access to compute and/or are financially not in a position to purchase on-premise HPC resources. Furthermore, researchers in seismic inverse problems and machine learning oftentimes require access to a variety of application-dependent hardware, such as graphics processing units (GPUs) or memory-optimized compute nodes for reverse-time migration (RTM).
Cloud computing thus offers a valuable alternative to on-premise HPC clusters, as it provides a large variety of (theoretically) unlimited computational resources without any upfront cost. Access to resources in the cloud is based on a pay-as-you-go pricing model, making it ideal for providing temporary access to compute or for supplementing on-premise HPC resources to meet short-term increases in computing demands. However, on-premise HPC clusters and the cloud differ fundamentally in their hardware and in how computational resources are exposed to users. While cloud providers are increasingly investing in HPC technology, the majority of existing hardware is not HPC-optimized and networks are conventionally based on Ethernet. Additionally, the pay-as-you-go pricing model is usage-based, which means users are charged for instances for as long as they run. This can result in very high operating costs if instances sit idle for extended periods of time, which is common in standard RTM workflows based on a client-server model in which the master process distributes the workload to the parallel workers. In this work, we demonstrate a serverless approach to seismic imaging on Microsoft Azure, which does not rely on a cluster of permanently running virtual machines (VMs). Instead, expensive compute instances are automatically launched and scaled by the cloud environment, thus preventing instances from sitting idle. For solving the underlying forward and adjoint wave equations, we use a domain-specific language compiler called Devito [8, 10], which combines a symbolic user interface with automated performance optimization for generating fast and parallel C code using just-in-time compilation.
The separation of concerns between the wave equation solver and the serverless workflow implementation leads to a seismic imaging framework that scales to large-scale problem sizes and allows reducing the operating cost in the cloud up to a factor of , as demonstrated in our subsequent RTM case study.
III Current State of the Art
The cloud is increasingly adopted by O&G companies for general purpose computing, marketing, data storage and analysis, but utilizing the cloud for HPC applications such as (least-squares) RTM and full-waveform inversion (FWI) remains challenging. A wide range of performance benchmarks on various cloud platforms find that the cloud generally cannot provide the same performance in terms of latency, bandwidth and resilience as conventional on-premise HPC clusters [12, 4, 11], or only at considerable cost. Recently, cloud providers such as Amazon Web Services (AWS) and Azure have extended their HPC capabilities and improved their networks, but HPC instances are oftentimes considerably more expensive than standard cloud VMs. On the other hand, the cloud offers a range of novel technologies such as massively parallel object storage, containerized batch computing and event-driven computations that allow addressing computational bottlenecks in novel ways. Adapting these technologies requires restructuring seismic inversion codes, rather than running legacy codes on a virtual cluster of permanently running cloud instances (lift and shift). Companies that have taken steps towards the development of cloud-native technology include S-Cube, whose FWI workflow for AWS utilizes object storage, but is still based on a master-worker scheme. Another example is Osokey, a company offering fully cloud-native and serverless software for seismic data visualization and interpretation. In a previous publication, we adapted these concepts for seismic imaging and introduced a fully cloud-native workflow for serverless imaging on AWS. Here, we describe the implementation of this approach on Azure and present a 3D imaging case study.
IV Methods and Key Results
The key contribution of this talk is a serverless implementation of an RTM workflow on Azure. The two main steps of a (generic) RTM workflow are the parallel computation of individual images for each source location and the subsequent summation of all components into a single image, which can be interpreted as an instance of a MapReduce program. Rather than running RTM on a cluster of permanently running VMs, we utilize a combination of high-throughput batch processing and event-driven computations to compute images for separate source locations as an embarrassingly parallel workflow (Figures 1 and 2). The parallel computation of RTM images for separate source locations is implemented with Azure Batch, a service for scheduling and running containerized workloads. The image of each respective source location is processed by Azure Batch as a separate job, each of which can be executed on a single VM or on multiple VMs (i.e. using MPI-based domain decomposition). Azure Batch accesses computational resources from a batch pool and automatically adds VMs to and removes VMs from the pool based on the number of pending jobs, thus mitigating idle instances. The software for solving the underlying forward and adjoint wave equations is deployed to the batch workers through Docker containers and Devito’s compiler automatically performs a series of performance optimizations to generate optimized C code for solving the PDEs (Figure 2).
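The MapReduce structure described above can be illustrated with a minimal Python sketch. It is not the Azure Batch implementation: `image_single_source`, `sum_images` and `rtm_mapreduce` are hypothetical names, a local thread pool stands in for the batch pool, and each toy "image" is a short list of constants rather than the result of solving wave equations.

```python
from concurrent.futures import ThreadPoolExecutor


def image_single_source(source_id, n=4):
    """Hypothetical stand-in for one RTM job (one source location)."""
    # A real job would solve the forward and adjoint wave equations.
    # Here each source contributes a constant-valued toy image of length n.
    return [float(source_id)] * n


def sum_images(images):
    """Reduce step: element-wise summation of all per-source images."""
    return [sum(values) for values in zip(*images)]


def rtm_mapreduce(source_ids, max_workers=4):
    """Map independent per-source jobs in parallel, then reduce by summation."""
    # Map step: the jobs do not communicate, so they can run in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        images = list(pool.map(image_single_source, source_ids))
    return sum_images(images)


final_image = rtm_mapreduce(range(1, 6))
print(final_image)  # [15.0, 15.0, 15.0, 15.0]
```

Because the per-source jobs are independent, the map step needs no communication between workers; only the final summation combines their results.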
As communication between individual jobs is not possible, we separately implement the reduce part of RTM (i.e. the summation of all images into a single data cube) using Azure Functions. These event-driven functions are automatically invoked when a batch worker writes its computed image to the object storage system (blob storage) and sends the corresponding object identifier to a message queue, which collects the IDs of all results (Figure 3). As soon as object IDs are added to the queue, Azure Functions that sum up to 10 images from the queue are automatically invoked by the cloud environment. Each function writes its summed image back to the storage and the process is repeated recursively until all images have been summed into a single volume. As such, the summation process is both asynchronous and parallel, as the summation is started as soon as the first images are available and multiple Azure Functions can be invoked at the same time.
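The queue-driven summation can be sketched as a small sequential simulation. This is an illustration under stated assumptions, not the Azure Functions code: a `deque` stands in for the message queue, a dictionary for blob storage, and `reduce_function` and `run_reduction` are hypothetical names. The sketch also assumes that each invocation puts the ID of its partial sum back on the queue, which is one way to realize the recursive repetition. In the cloud, several invocations would run concurrently, whereas here they run one after another.

```python
from collections import deque

MAX_IMAGES_PER_CALL = 10  # from the text: each function sums up to 10 images


def reduce_function(queue, storage, new_id):
    """Hypothetical stand-in for a single event-driven function invocation."""
    # Collect up to 10 object IDs from the queue.
    count = min(MAX_IMAGES_PER_CALL, len(queue))
    ids = [queue.popleft() for _ in range(count)]
    # Fetch (and remove) the images from storage and sum them element-wise.
    images = [storage.pop(object_id) for object_id in ids]
    summed = [sum(values) for values in zip(*images)]
    # Write the partial sum back and enqueue its ID (assumed mechanism).
    storage[new_id] = summed
    queue.append(new_id)


def run_reduction(images):
    """Sum a non-empty list of images; returns (final_image, invocations)."""
    storage = {f"image-{k}": img for k, img in enumerate(images)}
    queue = deque(storage)  # queue of object IDs
    calls = 0
    while len(queue) > 1:  # repeat until a single summed image remains
        reduce_function(queue, storage, f"partial-{calls}")
        calls += 1
    return storage[queue[0]], calls


result, calls = run_reduction([[1.0, 2.0]] * 25)
print(result, calls)  # [25.0, 50.0] 3
```

With 25 input images, three invocations suffice: two sum 10 images each, and the third sums the remaining 5 images together with the two partial sums.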
IV-B RTM Case Study
For our 3D RTM case study on Azure, we use a synthetic velocity model derived from the 3D SEG Salt and Overthrust models, with dimensions of km. We discretize the model using a m grid, which results in grid points. We generate data at Hz peak frequency for a randomized seismic acquisition geometry, with data being recorded by receivers that are randomly distributed along the ocean floor. The source vessel fires the seismic source on a dense regular grid, consisting of source locations ( in total). For imaging, we assume source-receiver reciprocity, which means that sources and receivers are interchangeable and data can be sorted into shot records with receivers each. We model wave propagation for generating the seismic data with an anisotropic pseudo-acoustic TTI wave equation and implement discretized versions of the forward and adjoint (linearized) equations with Devito, as presented in .
For the computations, we use Azure’s memory-optimized E and Es VMs, which have GB of memory, vCPUs and a GHz Intel Xeon E processor. To fit the forward wavefields in memory, we utilize two VMs per source and use MPI-based domain decomposition for solving the wave equations. The time-to-solution of each individual image as a function of the source location is plotted in Figure (a), with the average container runtime being minutes per image. The on-demand price of the E/Es instances is per hour, which results in a cumulative cost of for the full experiment and a total runtime of approximately hours using VMs. Figure (b) shows the cumulative idle time for computing the workload from Figure (a) on a fixed cluster as a function of the number of parallel VMs. We model the idle time by assuming that all jobs (one job per source location) are distributed to the parallel workers on a first-come, first-served basis and that a master worker collects all results. The idle time on a fixed VM cluster results from the fact that all cluster workers have to wait for the final worker to finish its computations, while Azure Batch automatically scales down the cluster, thus preventing instances from sitting idle. While VM clusters on Azure in principle support auto-scaling as well, this is not possible if MPI is used to distribute the data/sources to the workers. Thus, performing RTM on a fixed cluster of VMs results in additional costs due to idle time up to a factor of . By utilizing low-priority instances, it is possible to further reduce the operating cost by a factor of (i.e. up to a factor in total).
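The idle-time model for a fixed cluster can be sketched as follows. This is a minimal illustration of the stated assumptions (first-come, first-served job assignment, with all workers kept until the last one finishes), not the code used to produce the figure. The function name `cumulative_idle_time` and the example runtimes are hypothetical, and each "worker" is treated as a single unit regardless of how many VMs it comprises.

```python
import heapq


def cumulative_idle_time(runtimes, num_workers):
    """Total idle time of a fixed cluster under first-come, first-served scheduling."""
    # Min-heap holding the time at which each worker becomes free.
    free_at = [0.0] * num_workers
    heapq.heapify(free_at)
    for runtime in runtimes:
        # The next job in line goes to whichever worker is free first.
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + runtime)
    # The cluster is kept until the slowest worker has finished.
    makespan = max(free_at)
    # Each worker idles from its own finish time until the makespan.
    return sum(makespan - finish for finish in free_at)


# Hypothetical job runtimes (arbitrary time units).
print(cumulative_idle_time([5, 1, 1], 2))  # 3.0
print(cumulative_idle_time([5, 1, 1], 3))  # 8.0
```

In this toy example, adding a third worker increases the cumulative idle time from 3 to 8 units, because two workers finish early and then wait for the longest job. Multiplying the idle time by the instance price gives the corresponding extra cost on a fixed cluster.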
Adapting the cloud using serverless workflows, in contrast to lift and shift, allows us to leverage cloud services such as batch computing and reduce the operating cost of RTM by a factor of . This transition is made possible through abstract user interfaces and an automatic code generation framework, which is highly flexible, but provides the necessary performance to work on industry-scale problems.
VI Presenter Bio
P. Witte, M. Louboutin and F. J. Herrmann are part of the core Devito development team and have closely collaborated with multiple cloud companies to develop serverless implementations of RTM. They believe that the computational cost and complexity of seismic inversion can only be managed through abstractions, automatic code generation and software based on a separation of concerns. C. Jones is the head of development at Osokey, a company specialized in cloud-native software for seismic data analysis.
We would like to acknowledge S. Brandsberg-Dahl, A. Morris, K. Umay, H. Shel, S. Roach and E. Burness (all with Microsoft Azure) for collaborating with us on this project. Many thanks also to H. Modzelewski (University of British Columbia) and J. Selvage (Osokey). This research was funded by the Georgia Research Alliance.
- [1] (2019) A close-up look at the world’s largest HPC system for commercial research. HPC Wire. https://www.hpcwire.com/2018/01/14/close-look-worlds-largest-hpc-system-commercial-research/
- [2] (2019) Sharing learnings: the methodology, optimisation and benefits of moving subsurface data to the public cloud. In 81st Annual International Meeting, EAGE, Expanded Abstracts.
- [3] (2019) AWS enterprise customer success stories. Amazon Web Services Case Studies.
- [4] (2013) Performance issues and performance analysis tools for HPC cloud applications: a survey. Computing 95 (2), pp. 89–108.
- [5] (2008) MapReduce: simplified data processing on large clusters. Communications of the Association for Computing Machinery (ACM) 51 (1), pp. 107–113.
- [6] (2019) DUG McCloud. DownUnder GeoSolutions. https://dug.com/dug-mccloud/
- [7] (2019) Linux virtual machines pricing. Microsoft Azure.
- [8] (2019) Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration. Geoscientific Model Development 12 (3), pp. 1165–1187.
- [9] (2018) Effects of wrong adjoints for RTM in TTI media. In 88th Annual International Meeting, SEG, Expanded Abstracts, pp. 331–335.
- [10] (2018) Architecture and performance of Devito, a system for automated stencil computation. Computing Research Repository (arXiv CoRR). https://arxiv.org/abs/1807.03032
- [11] (2016) Performance evaluation of Amazon Elastic Compute Cloud for NASA high-performance computing applications. Concurrency and Computation: Practice and Experience 28 (4), pp. 1041–1055.
- [12] (2012) Evaluating interconnect and virtualization performance for high performance computing. Association for Computing Machinery (ACM) SIGMETRICS Performance Evaluation Review 40 (2), pp. 55–60.
- [13] (2019) Seismic processing and imaging. PGS.
- [14] (2019) Event-driven workflows for large-scale seismic imaging in the cloud. In 89th Annual International Meeting, SEG, Expanded Abstracts, pp. 3984–3988.
- [15] (2019) XWI on AWS: revolutionary earth model building on the cloud. Microsoft Azure Documentation. https://www.s-cube.com/xwi-on-the-cloud/