Log In Sign Up

Massively Scaling Seismic Processing on Sunway TaihuLight Supercomputer

by   Yongmin Hu, et al.
Beihang University

Common Midpoint (CMP) and Common Reflection Surface (CRS) are widely used methods for improving the signal-to-noise ratio in the field of seismic processing. These methods are computationally intensive and require high performance computing. This paper optimizes these methods on the Sunway many-core architecture and implements large-scale seismic processing on the Sunway Taihulight supercomputer. We propose the following three optimization techniques: 1) we propose a software cache method to reduce the overhead of memory accesses, and share data among CPEs via the register communication; 2) we re-design the semblance calculation procedure to further reduce the overhead of memory accesses; 3) we propose a vectorization method to improve the performance when processing the small volume of data within short loops. The experimental results show that our implementations of CMP and CRS methods on Sunway achieve 3.50x and 3.01x speedup on average compared to the-state-of-the-art implementations on CPU. In addition, our implementation is capable to run on more than one million cores of Sunway TaihuLight with good scalability.


To Use or Not to Use: CPUs' Cache Optimization Techniques on GPGPUs

General Purpose Graphic Processing Unit(GPGPU) is used widely for achiev...

GraphCage: Cache Aware Graph Processing on GPUs

Efficient Graph processing is challenging because of the irregularity of...

Shared-Memory Parallel Maximal Clique Enumeration

We present shared-memory parallel methods for Maximal Clique Enumeration...

PRIMME_SVDS: A High-Performance Preconditioned SVD Solver for Accurate Large-Scale Computations

The increasing number of applications requiring the solution of large sc...

High performance computing and energy efficiency: focus on OpenFOAM

High performance calculation is increasingly used within society. Previo...

Eigenvector Component Calculation Speedup over NumPy for High-Performance Computing

Applications related to artificial intelligence, machine learning, and s...

Enhanced AGCM3D: A Highly Scalable Dynamical Core of Atmospheric General Circulation Model Based on Leap-Format

The finite-difference dynamical core based on the equal-interval latitud...

1. Introduction

Seismic processing techniques refine seismic data to evaluate the design of different models with cross-section images. These techniques help geologists to build models of the interested areas, which can be used to identify oil and gas reservoirs beneath the earth surface (Yilmaz, 2001). Common Midpoint (CMP) method (Ottolini and Claerbout, 1984) and Common Reflection Surface (CRS) method (Jäger et al., 2001) are widely used seismic processing techniques. The general idea of the CMP method is to acquire a series of traces (gather) that are reflected from the same mid-point under the surface. The traces are then stacked horizontally with auto-correction so that it improves the quality of the seismic data with high signal-to-noise ratio. The fold of the stack is determined by the number of traces in the CMP gather. Different from the CMP method, the CRS method is based on the ray theory, especially the paraxial ray theory. The CRS method treats the corresponding ray at the specific reflection point in the underground as the central ray. The other rays in the neighborhood of the reflection point are regarded as the paraxial rays. All the paraxial rays determine a stacking surface. The energy stacked on the same stacking surface results in a stacked profile with high signal-to-noise ratio.

The computation demand of seismic processing is tremendous. The GPUs are commonly used as acceleration devices in seismic processing to achieve high performance. The GPUs are used to accelerate the 3D output imaging scheme (CRS-OIS) in seismic processing (Ni and Yang, 2012), which utilizes the many-core architecture of GPU and achieves a good performance speedup on datasets with high computational intensity. The OpenCL is also used to implement the computation of semblance and traveltime (Marchetti et al., 2011), that accelerates the CRS method. The existing work (Lashgar and Baniasadi, 2016) also demonstrates the ability of OpenACC on improving the performance of seismic processing. Compared to the unoptimized OpenACC implementation, the fine-tuning technique can obtain a significant speedup. In addition to the GPU, there is also research work (Marchetti et al., 2010) attempts to optimize seismic processing on dedicated accelerating device such as FPGA.

The Sunway TaihuLight is the first supercomputer with a peak performance of over 100 PFlops. It was ranked the first place in Top500 in June 2016. The Sunway TaihuLight uses China homemade Sunway SW26010 processor. Each Sunway processor contains four Core Groups (CGs), and each CG consists of one Management Processing Element (MPE) and 64 Computing Processing Elements (CPEs). The many-core architecture design of Sunway processor has the great potential for high-performance computing. After built in place, the Sunway processor has demonstrated its success in various scientific applications for high performance. Especially, the atmospheric dynamics (Yang et al., 2016) and earth-quake simulation (Fu et al., 2017) running on the full system of Sunway TaihuLight for large-scale computation won the ACM Gordon Bell prize. Moreover, the optimization of various computation kernels, such as SpMV (Liu et al., 2018) and stencil (Ao et al., 2017), also demonstrates the unique performance advantage of Sunway architecture. In addition to the traditional scientific applications, the Sunway processor has also shown its potential to support emerging applications. For instance, swDNN (Fang et al., 2017)

is a highly optimized library to accelerate deep learning applications on Sunway, and swCaffe

(Li et al., 2018a) is a deep learning framework supports large-scale training on Sunway TaihuLight.

Although existing works have explored different architectures to optimize seismic processing, it is impossible to naively adopt the existing works to Sunway due to its unique architecture design. Specifically, the following challenges need to be addressed in order to achieve good performance for the CMP and CRS methods on Sunway. First, unlike the traditional x86 processor, the design of the CPEs does not contain a cache, but a 64KB user-controlled scratch pad memory (SPM), which means without careful management, the frequent accesses to main memory could lead to severe performance degradation. Secondly, in order to achieve the ideal memory bandwidth on Sunway, the DMA transfers issued from the CPEs must contain at least 1024B data. However in the CMP and CRS methods, only a tiny data ranging from 4B to 76B is required during each computation step, which is prohibitive to achieve optimal the performance on Sunway. Moreover, the operations applied to the tiny data within short loops make it difficult to take advantage of the vector units on Sunway processor.

In order to solve the above challenges, this paper proposes a re-design of the CMP and CRS methods on Sunway processor. In addition, several optimization techniques are also proposed to adapt to the architecture features of Sunway efficiently. The experiment results demonstrate our implementation of seismic processing achieves significant speedup when scaling to massive number of Sunway cores. Specifically, this paper makes the following contributions:

  • We propose a software cache method for seismic processing on Sunway CPEs. This method utilizes the architecture features of DMA and LDM on Sunway. When the memory access occurs, the CPE sends the data request to the software cache. After receiving the data request, the software cache retrieves data from the memory through DMA, and then send the data back to the CPE. After that, the data is buffered in the software cache to effectively alleviate the long memory access delay.

  • We re-design the Common Depth Point (CDP) procedure that dominates the performance of CMP and CRS methods to adapt to the Sunway architecture. Specifically, we combine multiple search processes onto a single CPE, and synchronize across search processes by buffering the intermediate results from each computation step. In addition, we combine the data to be accessed at each step of the search processes, and thus reduce the number of DMA accesses.

  • We propose a vectorization method to improve the computation efficiency when processing the tiny data within short loops. We first convert the global reduction operations into several independent element-wise vector operations, and then use the vector array to perform element-wise vector operations with the ending element processed separately.

The rest of this paper is organized as follows: In Section 2, we introduce the background of the CMP and CRS methods, as well as the Sunway architecture. Section 3 presents our design and optimization of seismic processing on Sunway to achieve massively scaling. The evaluation results are given in Section 4. Section 5 discusses the related work and Section 6 concludes this paper.

2. Background

2.1. Common Midpoint Method

The fundamental idea of the CMP method is shown in Figure 1(a). The sound source placed at the source point is excited. After the sound wave is reflected by the underground reflection point , the receiver on the surface receives the signal at point . Each data record captured is a seismic trace and a group of seismic traces that share the same midpoint is called a CMP gather. When the reflection surface is horizontal and the speed does not change horizontally, the CMP gathers are equivalent to Common Depth Point (CDP) gathers. This is the seismic data this paper deals with, therefore we use the term CMP and CDP interchangeably.

Figure 1. The illustration of the common midpoint method (CMP).

In the CMP method, traces belonging to the same CMP gather are corrected and stacked, which generates a stacked trace. As shown in Figure 1(b), before the traces are stacked together, a Normal Moveout (NMO) correction is applied to the reflection traveltimes according to the distances between their sources and receivers, which groups signals that are produced by the same reflectors. The quality of the stacked trace depends on the quality of the NMO correction. The NMO in the CMP method is to correct the hyperbolic curve (also known as traveltime curve), which depends on the distance between the source and the receiver as well as the average velocity in which the wave propagated during the seismic data acquisition. Although the distance is known in advance, the velocity is usually unknown. Therefore, it is necessary to find the best stacking velocity.

To find the best stacking velocity, the CMP method enumerates through different velocities. For each of enumerated velocities, it computes the semblance, a coherence metric that indicates whether the traveltime curve defined by a given velocity would produce a good stacking. The semblance computation is performed over a traveltime curve that intersects seismic traces. Considering that the traces are represented by discrete samples, some points of the intersections may not align with the actual elements in the dataset. Therefore, we use the interpolation of nearby samples to estimate the seismic amplitude at that point. The Equation 

1 defines the computation for semblance. There are M traces in a single CDP, represents the sample of the trace, and the intersection of the traveltime curve of the trace is . The semblance calculation is performed in a window of length , which walks through the traces of the current CDP and access samples in each intersection. The value of is determined by the sampling interval during data acquisition. In the CMP method, there is no dependency between the computation of individual CDPs, therefore they can be computed in parallel.


2.2. Common Reflection Surface Method

As shown in Figure 2(a), the ray from the exciting point to the receiving point is the central ray, whereas the ray from the exciting point to the receiving point is the paraxial ray. The central points and belong to CDP0 and CDP1 respectively. According to the paraxial ray theory, when processing the central ray , it requires the data of the paraxial ray . Therefore, when computing the CDP0, it requires the data from CDP1. In Figure 2(b), the orange point represents the central point, and the neighbors within the radius include four CDPs () that are represented with blue dots, and the remaining green dots are not in the neighborhood of . It also means, when processing CDP0, the data from CDP1, CDP2, CDP3 and CDP4 is required. The semblance computation in the CMP method can be easily extended to the CRS method. The only difference is to obtain the trace data of the CDPs in its neighborhood when processing the central CDP, as well as change the NMO curve to a curved surface. For large-scale processing, we partition the two-dimensional coordinates of the middle points of each CDP using grid, and map the grids to different CGs of the Sunway processor. The adjacent grids exchange data through asynchronous MPI communication.

Figure 2. The illustration of the common reflection surface method (CRS).

2.3. The Sunway Many-core Architecture

The Sunway TaihuLight supercomputer provides a theoretical peak performance of 125PFlops. It consists of 40,960 Sunway SW26010 processors with 1.4PB memory and an aggregated bandwidth of 4,473.16TB/s. The architecture of the Sunway processor is shown in Figure 3, which is composed of four core groups (CGs). Each CG contains a Management Processing Element (MPE) and 64 Computing Processing Elements (CPE), and each CG is attached of 8GB DDR3 memory. The 8GB attached memory can be accessed by both MPE and CPEs with the bandwidth of approximately 136GB/s. The MPE has 32KB L1 instruction cache and 32KB L1 data cache, in addition to 256KB L2 cache for both instruction and data. Each CPE has its own 16KB L1 instruction cache but no data cache. However, there is 64KB local device memory (LDM) on each CPE that is explicitly managed by software. The CPEs can initiate a direct memory access (DMA) operation that reads data from memory to the LDM, or writes data from the LDM to memory. The CPEs in the same row or column of the CG can communicate with each other through register communication. Each CPE has a vector unit that supports 256-bit wide vector floating-point operation. The survey paper (Fu et al., 2016) has shown that the memory bandwidth of Sunway processor is quite limited compared to the massive computation power. Therefore, the most effective optimization techniques on Sunway include the rational use of LDM, data transfer through register communication between CPEs, computation acceleration through vector units and data access through DMA for higher bandwidth.

Figure 3. The architecture of Sunway SW26010 processor.

2.4. Challenges on Sunway Architecture

As the dominant computation of the seismic processing with both CMP and CRS methods, accelerating the procedure of semblance calculation is critical to achieve satisfactory performance on Sunway. However, the data access pattern during the semblance calculation is prohibitive to obtain good performance on Sunway for two reasons: 1) the data accesses are random, which leads to high memory access latency due to the lack of data cache on CPEs; 2) the volume of data accesses is quite small, which is unable to fully utilize the precious DMA bandwidth as well as the vector units for performance acceleration. The specific challenges are as follows:

  • The random data accesses during the semblance calculation deteriorate the performance of seismic processing on Sunway. Due to the lack of data cache on CPEs, a software cache method is necessary to buffer the data accessed from main memory by using the limited LDM on CPEs effectively.

  • The semblance calculation only accesses a small volume of data, which is hard to fully utilize the DMA bandwidth. Therefore, it is necessary to re-design the process of semblance calculation by combining the computations and buffering the intermediate results on each CPE in order to improve bandwidth utilization.

  • In addition to the low bandwidth utilization, the small volume of data during the semblance calculation also prohibits the exploration of vectorization. Therefore, to utilize the vector units on Sunway, it requires revealing the vectorization potential by adjusting the computation pattern of semblance calculation.

3. Re-designing the Seismic Processing for Massively Scaling

3.1. Design Overview

Figure 4 shows the design and optimization of the CDP computation of the CMP and CRS methods on Sunway architecture. Firstly, the MPE on each CG reads the partitioned seismic data. Seismic data consists of several CDPs, and each CDP contains several traces, each of which is composed of samples. For the CMP method, the computation on each CDP is independent from the rest. Whereas for the CRS method, the computation of the central CDP requires data from the surrounding CDPs. In such case, the MPE calculates the two-dimensional coordinates of the middle point of each CDP, and then divides the inner and outer regions according to the two-dimensional coordinates. The calculation of the outer region involves the region of the adjacent mesh, which requires the CDPs in the outer region to be sent to the adjacent mesh. As shown in Figure 4 step 1⃝, the data transfer of the outer region and the calculation of the inner region is performed simultaneously through asynchronous MPI. After the central CDP receives the data from surrounding CDPs, the CDP computation is the same in both CRS and CMP methods. Therefore, we take the computation of a single CDP for an example to illustrate the optimizations we have applied on each Sunway CG.

The CDP computation involves the semblance calculation that walks through the traces of the current CDP. In order to improve the performance of the CDP computation on Sunway, we propose several techniques to re-design and optimize the computation procedures. First, we use the master CPE and worker CPEs collaboratively to implement a software cache to eliminate random data accesses at the intersection (step 2⃝). Second, we propose a vectorization method to improve the computation efficiency when processing the tiny data within short loops (step 3⃝). Third, we re-design the calculation process so that each worker CPE can process multiple sample-NMO velocity pairs simultaneously to further improve bandwidth utilization (step 4⃝).

To better illustrate how our proposed techniques work together, we take the processing of one CDP as shown in the upper part of Figure 4 for example. There are two adjacent traces ( and ) in a single CDP stored in continuous memory region, and each trace contains three (sample, NMO velocity) pairs (e.g., , , ). When memory access occurs, the software cache first takes into effect (step 2⃝.) Every two adjacent CPEs in each row of the CPE mesh are organized into a group, with one of them serving as the master CPE and the other one serving as the worker CPE. The worker CPE first sends a data request to the master CPE in its own group through register communication. After the master CPE receives the request, it retrieves the data from the memory through DMA, then sends the requested data back to the worker CPE. The requested data is buffered in the LDM of the master CPE. The vectorization method (step 3⃝a and 3⃝b) converts the reduction operation into independent element-wise vector operations, then uses the vector array to perform element-wise vector operation with the ending elements processed separately. The re-designed calculation process (step 4⃝) synchronizes the processing of , and in sequence on , and buffers their intermediate results. Then, the CPE group continues to process next trace (). The above steps are repeated until the last trace is processed. After all the (sample, NMO velocity) pairs have been processed, the computation of a single CDP completes.

Figure 4. The design overview of CDP computation on Sunway architecture.

3.2. Improving Parallelism within a CG

Since the CDP computation dominates the execution time of seismic processing, the optimization of the CDP on a CG is critical to fully exploit the performance of Sunway processor. The maximal number of traces in a CDP is the of the dataset and the total number of CDPs in a dataset is denoted as . Each seismic trace is represented by an array, where each element is a sample. We assume that the seismic traces have the same number of samples () across all CDPs, which is widely accepted in literature (Gimenes et al., 2018). Figure 5(a) shows that a CDP contains 4 traces, each of which contains several samples. In the same CDP, two adjacent traces are stored in continuous memory region. In addition, the samples of a single trace are also stored continuously. As shown in Figure 5(a), the center of the four colored boxes is the intersection of the traveltime curve with the four traces. The semblance is computed within a window of width , which also represents the number of samples in each color boxes.

Figure 5. The data structure of CDP (a), and the mapping of (sample, NMO velocity) pair search to CPEs (b).

In order to achieve parallel processing of the CDP on a single CG, we chose a grid-based search method. The entire computation is divided into three phases, including initialization, calculation and result writing back. The initialization phase is performed on the MPE. Firstly, the CDP data is accessed, and the NMO velocity array is generated according to the upper and lower bounds of the NMO velocity. Then the halfpoints are computed that are necessary for calculating the traveltime curve. The NMO velocity is stored in an array of size , whereas the halfpoints are stored in another array, each element of which corresponds to a trace in the CDP. The MPE then creates a semblance matrix with size of in memory, which is used in the semblance computation to find the most coherent NMO speed. At the same time, the MPE also creates an array with size of in memory to store the most coherent NMO velocity.

The calculation phase involves finding the most coherent NMO velocity for each sample from the NMO velocity array. For each sample, we enumerate the elements in the NMO velocity array, and compute the semblance of each (sample, NMO velocity) pair by walking through the traces of the current CDP according to the traveltime curve. Each (sample, NMO velocity) pair is independent from each other and can be processed in parallel. Figure 5(b) shows how the computing grid is divided among the CPEs and how the results are written back to the semblance matrix. Each point in the computing grid represents a (sample, NMO velocity) pair. In this example, there are 6 NMO speeds () and 5 samples (). The enlarged area shows how the points in the computing grid are mapped to CPEs, which means each CPE is responsible for computing a (sample, NMO velocity) pair. After filling in the entire semblance matrix, we need to find the NMO velocity with biggest semblance value for each sample. This velocity is the best coherent velocity required.

The computation procedure of a single CDP is shown in Algorithm 1. For each CDP, it enumerates the sample-NMO velocity pairs (line 2), and then finds the intersection of the traveltime curve and traces. At each intersection, it first obtains the halfpoint of the current trace (line 9-11), then accesses the data with size of (line 12-13), and finally retrieves the data computed in a window of width (line 14-19). Each trace has its own corresponding halfpoints, therefore the accesses to halfpoints are continuous when walking through the traces sequentially. Based on this observation, we can reserve a space of size on LDM to prefetch the halfpoints in advance and save them in . We can control the amount of prefetched data in LDM by adjusting .

1:function CDP(sample-NMO pairs)
2:       for each sample-NMO pair  do
3:             for  do
5:             end for
8:             for  do
9:                    if  then
10:                          prefetch halfpoints through DMA
11:                    end if
12:                    calculate index of random data access
13:                    get data of size by through DMA
14:                    for  do
19:                    end for
20:             end for
21:       end for
22:end function
Algorithm 1 The computation of a single CDP on a CG

3.3. Eliminating Random Memory Access

3.3.1. Software Cache within a CG

Due to the limited memory bandwidth on Sunway, we propose a software cache to alleviate the long memory access delay caused by random data access. We design a software cache, that is, two adjacent CPEs in each row of the CPE mesh are organized into a group, and one CPE in each group is selected to act as the software cache of the group. The selected CPE for caching is the master CPE, and the other one is the worker CPE. When memory access occurs, the master CPE accesses the data through DMA and distributes the data to the worker CPE through register communication. Existing research (Xu et al., 2017) reveals that when the accumulative data size of the DMA accesses from the 64 CPEs within a CG is less than 1024B, the achievable DMA bandwidth is proportional to the size of data accesses. In both CMP and CRS methods, the maximum size of data access is 76B. Therefore, the proposed software cache is capable to combine multiple DMA accesses, which not only reduces the number of memory accesses, but also increases the achievable DMA bandwidth.

When designing the software cache, we also consider the computation characteristics of CDP. The memory accesses at the software cache during the CDP computation are shown in Figure 6. We denote the processing of a trace as a phase. The master and worker CPEs calculate their corresponding memory region of data access, and the worker CPE sends the requested memory region to the master CPE. After the master CPE receives the request, it identifies the minimum and maximum memory address among the regions, and then copies the data between the minimum and the maximum address to the LDM of the master CPE. The master CPE sends back the data to the worker CPE based on the requested memory region. Then both the master and worker CPEs start their corresponding calculations. When the master and worker CPEs finish processing current trace, they proceed to the next trace.

Figure 6. The memory accesses at the software cache for the CDP computation.

We implement a synchronization-free mechanism to reduce synchronization overhead for the communication between the master and worker CPEs. As shown in Figure 7, a request signal is sent to the master CPE from the worker CPE. After obtaining the data from the memory, the master CPE sends the data back to the worker CPE. We name the above procedure as a round of communication. After multiple rounds of communication, the worker CPE finishes the calculation and sends a signal to the master CPE. After the master CPE receives the signal, it also sends a signal to the worker CPE, and releases the software cache. Then the master CPE enters the calculation mode to complete the remaining calculations. Since we assign more tasks to the master CPE than the worker CPE, the master CPE always completes its calculation later than the worker CPE. It is clear that with the above mechanism, no synchronization is required for the communication between the master and worker CPEs.

Figure 7. The synchronization-free mechanism for the communication between the master and worker CPEs.

3.3.2. Re-designing the Computation for Semblance

Due to the poor data locality in the semblance calculation, we re-design the semblance calculation procedure and the data access pattern in order to reduce the number of memory accesses and improve the data reuse. The original procedure of the semblance calculation is shown in Figure 8(a). For each trace, the computing grid contains points. To perform the search, there are intersections randomly distributed in the trace, which leads to a large number of random memory accesses. As shown in Figure 8(b), after the calculation re-design, we process all the intersections in a trace continuously, and save the intermediate results from the computation of each intersection before moving on to the next trace. The above procedure is repeated until the last trace is processed. After the re-design, the calculation of next trace can reuse the intermediate results from the previous trace in the LDM. In addition to the calculation re-design, we also re-design the data access pattern. Each CPE has intersections of a trace. Before the re-design, each data access happens in a different time period, with no opportunity to merge data accesses or reuse the data. However, after the re-design, the intersections on each CPE are processed continuously. Based on this property, we identify the minimum () and maximum () memory regions for all the samples within a trace, and prefetch the data between the memory region and before processing the trace.

Algorithm 2 presents the re-designed procedure of the semblance calculation. Multiple sample-NMO velocity pairs are processed simultaneously on a single CPE. For each sample-NMO velocity pair, the , and variables used during the computation have been expanded with one more dimension respectively for buffering data (line 28), compared to Algorithm 1. After initialization, the traces in a CDP are processed in sequence (line 9) and the data halfpoints is prefetched before a new trace is processed (line 1012). For the current trace, the memory addresses of the data accesses are calculated for each sample-NMO velocity pair and kept in the array (line 1315). Then, the maximum and minimum memory address in array is identified (line 1618) and used to determine the memory range () of data accesses (line 19). The data within the memory range is copied to LDM at one time through DMA operation (line 20). Finally, the calculation is performed in a window size of for each sample-NMO velocity pair and the intermediate results are kept in the , and arrays (line 2129). After processing the current trace, the algorithm continues to process the next trace until all traces in the CDP are processed. The final results are stored in the , and arrays.

Compared to the CMP method, the CRS method is more computationally intensive. Due to the limited LDM on each CPE, we cannot prefetch the data of all merged intersections at once. Therefore it is necessary to tile the merged intersections in order to assign the computation to multiple tasks. For instance, if the original loop size is to process the merged intersections. After tiling, the loop is divided into two tightly nested loops. The inner loop size is and the outer loop size is . The LDM space occupied by the merged intersections is proportional to the other than the . Therefore, the tile operation allows the program to effectively control the usage of LDM by the merged intersections .

function re-design(sample-NMO pairs)
2:       for each sample-NMO pair  do
             for  do
             end for
8:       end for
       for  do
10:             if  then
                    prefetch halfpoints through DMA
12:             end if
             for each sample-NMO pair  do
14:                    calculate index of random data access
             end for
16:             for each sample-NMO pair  do
                    find the min and max val in
18:             end for
20:             get data of size by through DMA
             for each sample-NMO pair  do
                    for  do
28:                    end for
             end for
30:       end for
end function
Algorithm 2 Re-designing the computation of semblance
Figure 8. The original (a) and re-designed (b) computation procedure of semblance calculation.

3.4. Exploiting Vectorization

We further exploit the opportunity for vectorization after the re-design of semblance calculation. As shown in Algorithm 1, each sample-NMO velocity pair maintains the corresponding array and , variables. In the innermost loop, the element-wise vector calculations are applied to the array, whereas the reduction calculations are applied to and variables. As shown in Figure 9, the random accessed data is only a small portion of the samples from each trace, which is recorded as . To vectorize the above calculations on Sunway, two challenges need to be addressed. Firstly, the window size may not be a multiple of four, which is the width of the vector unit on Sunway. Considering the reduction operation, if we directly vectorize the innermost loop, then for each , the ending data cannot be effectively vectorized which requires additional processing. In particular, if is small, which means the innermost loop is a short loop, then the overhead of processing the ending data outweighs the benefit of vectorization. Secondly, may not be 32B aligned in LDM due to the random data access. On Sunway, the unaligned SIMD load/store throws an exception and then is split into several normal load/store instructions, which fails to exploit the computation capability of the vector unit.

Figure 9 shows an example on how the vectorization method is applied to a single CDP. In order to load the unaligned into the vector register, we use the instruction that can load four unrelated float variables into the variable, without requiring these four variables to be 32B aligned. However, compared to the standard instruction, it requires multiple LDM accesses. For element-wise vector operations, we use the vector array that consists of the vector variables with the length of . Taking Figure 9 as an example, when equals to 11, the contains 11 elements of , and the vector array contains three variables of , , and , altogether representing the . Three instructions are required to load into the vector array in order to perform element-wise vector calculations. For the reduction calculation, the vector array consists of two variables and . The and are first loaded into sequentially, and then the vector calculation is performed. After that, the are loaded into for the vector calculation. When the calculation of the current trace () completes, it proceeds to the next trace (a). After all the traces in a CDP are processed, the vector array contains the results of element-wise vector operations on all of the CDP. To derive the results of reduction calculations, the four elements in and the first three elements in need to be accumulated.

In Figure 9, the data in is invalid, and thus the result in this corresponding position is also invalid for both element-wise vector calculation and reduction calculation. Although it seems to consume extra space and computing resources, such design can effectively reduce the overhead of processing the data at the end of in the short loop. With the re-design of semblance calculation, the intermediate results of processing multiple sample-NMO search pairs need to be buffered on the same CPE. The intermediate results of element-wise vector calculations and reduction calculations including vector arrays , and are also need to be buffered.

Figure 9. An illustrative example for applying the vectorization method on a single CDP.

3.5. Asynchronous Parallel Processing among CGs

For both CMP and CRS methods, the semblance calculation for a single CDP is the similar, however the calculation among CDPs is quite different. For the CMP method, there is no dependency among different CDPs. Therefore, for large-scale processing, we use the CDP as the granularity of a task. We divide the data into many partitions, and each MPE reads a separate data partition and processes the CDPs within the partition by assigning the CDP computation to the CPEs. After the processing of current CDP, the intermediate results are buffered before proceeding to the next CDP. In the CRS method, each CDP calculates the two-dimensional coordinates of the middle point according to the coordinates of the source point and the receiving point. As shown in Figure 10, each CDP draws a circle based on its two-dimensional coordinates of the middle point. The data of all CDPs in this circle is collected by the central CDP to be processed. Therefore, for the CRS method we take the computation of the central CDP as the task granularity. We divide the coordinate grid by the coordinate of the middle point. The CDP at the boundary of the grid needs to obtain data from its neighbors, whereas the rest of the CDPs only need to obtain data within the grid.

Specifically, we pre-process the CDP data, during which the CDP data belonging to the same grid is written to the same data partition. Each MPE reads a data partition, aggregates the traces into corresponding CDPs, and then calculates the middle coordinates for each CDP. According to the radius used in the CRS method, the MPE identifies the boundary of each grid, denoted as outer region, and leaves the rest as the inner region. As shown in Figure 10, the gray points belong to the outer region, whereas the blue points belong to the inner region. The MPE packs the CDP data of the outer region and sends it to the neighboring grids in the four directions, as well as receives the data from the neighboring grids. Since the calculation of the inner region does not require data from other grids, the MPE assigns the calculation of the inner region to the CPEs for parallelization. The calculation of the inner region and the data transfer of the outer region can be performed simultaneously bye using the MPI asynchronous communication. After the MPE asynchronously sends data through MPI, it calls CPEs to process the inner region in parallel. After the inner region is processed, the MPE checks whether the asynchronous communication finishes. After each grid receives the outer region data sent by its neighboring grids, each MPE proceeds to process the outer region of its own grid.

Figure 10. The parallel processing among CGs using asynchronous MPI communication.

4. Evaluation

4.1. Experimental Setup

In the experiments, we use the Sunway SW26010 processor for performance evaluation. For comparison, we use the-state-of-the-art implementations (Gimenes et al., 2018) of the CMP and CRS methods running on 2 Intel E5-2680 V4 processors with 28 physical cores and Nvidia K40 GPU. We use the -O3 and -DNDEBUG options to compile the program. We also turn on the auto-vectorization during the compilation. We generate 8 diverse seismic datasets with detailed properties shown in Table 1. In general, the number of CDPs () is proportional to the size of the dataset. Our synthesized datasets contain the number of CDPs ranging from 61,628 to 2,648,430. For a single CDP, the describes the number of traces contained in a CDP, the describes the number of samples in each trace, and the determines the number of data per random data access. Since there are no public seismic datasets available, our datasets are synthesized with diverse properties that we believe to be representative. The performance metric used in the evaluation is , which equals to the number of intersections produced by all semblance calculations divided by the total execution time.

In the field of seismic processing, single precision floating point is accurate enough to derive valid results (Gimenes et al., 2018). Hence, all evaluation results presented in this paper are in single precision floating point. In order to verify whether our approach affects the accuracy of CMP and CRS method, we provide the relative error of the results compared to the executions on CPU. In addition, we compare the relative error of our optimized parallel implementations on CPEs, the sequential implementations on MPE as well as the parallel implementations on GPU. Since the trend of the relative error is almost the same between CMP and CRS method, we only provide the relative error of CRS in Table 2. It is clear that the relative error of CRS running on Sunway is much smaller compared to running on GPU. In addition, the relative error of the parallel implementation on CPEs is almost the same compared to the sequential implementation on MPE. This demonstrates our approach hardly affects the accuracy of the CMP and CRS method.

Seismic Dataset fold ns dt ncdps (large scale)
data1 60 550 220 2,648,430
data2 60 550 240 2,648,430
data3 60 1,650 220 1,000,000
data4 60 1,650 240 1,000,000
data5 1,000 550 220 202,500
data6 1,000 550 240 202,500
data7 1,000 1,650 220 61,628
data8 1,000 1,650 240 61,628
Table 1. The detailed properties of the seismic datasets.
Seismic Dataset GPU
data1 5.88e-02 3.29e-05 3.29e-05
data2 7.44e-02 2.79e-05 2.79e-05
data3 7.42e-02 1.22e-04 1.22e-04
data4 7.76e-02 1.35e-04 1.35e-04
data5 3.31e-02 1.43e-04 7.81e-03
data6 4.40e-02 7.81e-03 7.81e-03
data7 3.34e-02 6.7e-310 6.9e-310
data8 5.24e-02 7.81e-03 7.81e-03
Table 2. The relative error of parallel implementation on GPU, sequential implementation on MPE and parallel implementation on CPEs of CRS method compared to CPU.

4.2. Single Node Evaluation

The performance comparison of the CMP and CRS implementations on one Sunway processor, dual CPU and GPU K40 is shown in Figure 11. We scale down the of all datasets in Table 1 to 8 in order to fit the resources on a single node across all architectures. The performance on dual CPU is chosen as the baseline. We also show the performance impact after applying our optimization techniques such as software cache, calculation re-design and vectorization (simd). As shown in Figure 11, the naive implementations on Sunway are limited by the memory bandwidth and cannot fully utilize the computation power of CPEs. It is also clear that our optimization techniques are quite effective to mitigate random memory accesses as well as exploit the vectorization for improving the performance of seismic processing on Sunway. After applying all our optimization techniques on Sunway, the CMP and CRS method achieves 3.50 and 3.01 speedup on average respectively across all datasets compared to the baseline. We notice that the CMP method achieves better performance in all eight datasets compared to GPU, whereas the CRS method is slightly worse than GPU on three datasets (data2, data3 and data4). This is mainly due to the limited memory bandwidth of Sunway processor (90.4GB/s), whereas the memory bandwidth of GPU K40 is higher by an order of magnitude (288GB/s).

Figure 11. The performance comparison of the CMP and CRS implementations on one Sunway processor, dual CPU and GPU K40. The speedup on the y axis is normalized to the baseline performance on CPU.

4.3. Roofline Model Analysis

We use the roofline model analysis to better understand the performance impact of our proposed optimization techniques on Sunway. Due to the similar computation pattern between CRS and CMP, we only provide the roofline model analysis of CRS for conciseness. We analyze the performance results of CRS implementation on data1 dataset. Other evaluation results show the similar tendency. As shown in Figure 12, the operational intensity of the original program is 1.52 FLOPS/byte. In addition, the roofline model of a Sunway CG reveals that in order to fully utilize its performance, 33.84FLOPS calculations should be performed when accessing one byte data in memory. As shown in Figure 12, after applying our software cache, the operational intensity is doubled due to the data access by different intersections can be used by each other. In addition to the software cache, after re-designing the procedure of semblance calculation, the operational intensity can be derived using Equation 2. For a particular dataset, is a constant, the size of the tile is mainly determined by the size of the LDM, and refers to the size of the data accessed by a DMA operation on a CPE. The more intersections processed by a single CPE at a time, the more data is overlapped and can be reused for latter calculation. The operational intensity after applying the calculation re-design increases to 16.96 FLOPS/byte. The roofline model analysis demonstrates our optimization techniques are effective to improve the performance of seismic processing on Sunway.

Figure 12. The roofline model of the CRS implementation on Sunway.

4.4. Scalability

We evaluate both the strong and weak scalability of the CMP and CRS methods on Sunway. The performance is measured by the execution time of both methods including the I/O time. For strong scalability, the number of CGs used for computing scales from 1,024 to 16,384 with the input dataset unchanged. For weak scalability, when the number of CGs doubles, the size of the input dataset also doubles. Note that each CG contains 65 Sunway cores, therefore the number of cores used in the experiments ranges from 66,560 () to 1,064,960 (). We use the performance when running on 1,024 CGs as the baseline. The evaluation results for strong scalability is shown in Figure 13. Since the CMP method does not exchange data between processes, it maintains good scalability in general. Whereas the CRS method exchanges the boundary data between processes, its scalability is poorer than CMP method in most of the cases. Especially, with certain datasets (e.g., data3, data4, data7 and data8), the performance of CRS method actually decreases when more CGs are utilized. The reason for the decreased performance of CRS method is that with data7 and data8, each process receives less number of CDPs (Table 1). Since the size of the dataset is constant, each CDP contains a large number of traces, which generates dominate communication overhead to exchange data with neighboring processes that severely deteriorates the performance. However, the reason for the decreased performance of CRS method with data3 and data4 requires further investigation.

Figure 14 shows the evaluation results of weak scalability. We assign each process 400MB data at maximum to calculate, which is the largest amount of data that the CRS method can run on a single CG. Similar to the strong scalability experiments, the CMP method achieves better scalability compared to CRS in most of the cases. For both strong and weak scalability experiments, we notice that when the number of CGs scales beyond 2,048, the scalability drops significantly. This is because every 256 compute nodes (each compute node contains two Sunway processors) in Sunway TaihuLight are organized into a supernode. The compute nodes within a supernode (totally ) are connected via a single-switch with full bisection bandwidth. Whereas, all supernodes are connected via a fat-tree with 1/4 of the full bisection bandwidth. The limited bandwidth among supernodes constrains the scalability of our implementations when beyond 2,048 CGs. This is in accordance with existing research (Lin et al., 2018). Note that the largest scalability experiment actually occupies more than one million Sunway cores ()! The scalability results demonstrate our implementations of seismic processing are capable to run in large scale on Sunway TaihuLight supercomputer.

Figure 13. Strong scalability to process 310GB seismic data using both CMP and CRS methods on Sunway. The x axis indicates the number of Sunway CGs, and the y axis indicates the performance in terms of giga-semblance-trace/sec (log scaled).
Figure 14. Weak scalability to compute 400MB data per process using both CMP and CRS methods on Sunway. The x axis indicates the number of Sunway CGs, the y axis indicates the performance in terms of giga-semblance-trace/sec (log scaled).

4.5. Portability

Although the proposed optimization techniques are targeting the Sunway TaihuLight supercomputer, these techniques are also applicable on other systems (de Dinechin et al., 2013) that adopt the similar many-core cache-less architecture. Specifically, 1) the re-design of semblance calculation procedure increases the computing intensity of seismic processing significantly as shown in the roofline model analysis. This technique is effective to improve the performance of seismic processing on systems that lack L2 cache or with limited L2 cache; 2) the vectorization method improves the computation efficiency when processing the tiny data within short loops. This technique is necessary for seismic processing to exploit the powerful vector units with ever-increasing width on emerging architectures (e.g., AVX512 on Intel KNL).

5. Related Work

5.1. Performance Optimization of Seismic Processing

There has been a lot of work trying to improve the CMP method. Silva et al.  (Da Silva et al., 2016) evaluate the performance of the CMP method on different platforms. The CMP method is implemented using the SYCL programming model and compared with the implementations using OpenCL and OpenMP. However, the evaluated platforms have high memory bandwidth, which dose not suffer the performance problem on Sunway due to the limited memory bandwidth. Zeng et al.  (Zeng and Thurber, 2016) explore a different signal-to-noise ratio optimizer with the time-frequency domain-phase weighted stacking. They implement their method using the FFTW C library and the cuFFT CUDA library with significant performance improvement. However, these high performance CUDA libraries do not exist on the emerging architectures such as Sunway. Lawrens et al.  (Lawrens et al., 2015) analyze the characteristics of the CRS algorithm and applies the NUMA parallel computation scheme to optimize the CRS-Stack computation. Due to the unique architecture design of Sunway processor, existing optimization techniques of seismic processing are difficult to directly apply to Sunway. The above reasons motivate our work to re-design and optimize the CMP and CRS method to adapt to the Sunway architecture, so that they can fully exploit the massive computation power of Sunway TaihuLight.

5.2. Performance Optimization on Sunway

A large number of applications have been optimized on Sunway Taihulight supercomputer. Duan et al.  (Duan et al., 2018) have realized large-scale simulation of molecular dynamics, which fully exploits the architecture advantages of Sunway with the design of complex software cache. There are also research works devoted to optimize of the computation kernels on Sunway. For instance, Liu et al.  (Liu et al., 2018) implement the efficient Sparse Matrix-Vector Multiplication (SpMV) on Sunway, which uses register communication to implement a complex communication mechanism, and thus achieves efficient mapping of SpMV algorithm to the hardware resources. Li et al.  (Li et al., 2018b) implement an efficient multi-role based SpTRSV algorithm on Sunway. It leverages the unique register communication mechanism to address memory bandwidth limitations. Chen et al.  (Chen et al., 2018) re-design the earthquake simulation algorithm to reduce memory access costs tailored for the heterogeneous many-core architecture of Sunway. All the above optimization works on Sunway have inspired our re-design and optimization techniques for the CMP and CRS method on Sunway. To the best of our knowledge, this paper is the first work to implement large-scale seismic data processing on the Sunway TaihuLight supercomputer with highly optimized CMP and CRS implementations targeting the Sunway architecture.

6. Conclusion

In this paper, we propose efficient implementations of seismic processing using both the CMP and CRS methods on the Sunway TaihuLight supercomputer for massively scaling. Specifically, we propose a software cache to alleviate the random memory accesses during the computation. We re-design the semblance calculation procedure to improve the bandwidth utilization by combining the search processes and buffering the intermediate results on each CPE. Moreover, we propose a vectorization method to improve the computation efficiency when processing tiny data within short loops. The experimental results show that our implementations of the CMP and CRS method on one Sunway processor achieve 3.50 and 3.01 speedup on average respectively than the-state-of-the-art implementations on CPU. Moreover, our approach is able to scale to more than one million Sunway cores with good scalability.


  • (1)
  • Ao et al. (2017) Yulong Ao, Chao Yang, Xinliang Wang, Wei Xue, Haohuan Fu, Fangfang Liu, Lin Gan, Ping Xu, and Wenjing Ma. 2017. 26 pflops stencil computations for atmospheric modeling on sunway taihulight. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 535–544.
  • Chen et al. (2018) Bingwei Chen, Haohuan Fu, Yanwen Wei, Conghui He, Wenqiang Zhang, Yuxuan Li, Wubin Wan, Wei Zhang, Lin Gan, Wei Zhang, et al. 2018. Simulating the Wenchuan earthquake with accurate surface topography on Sunway TaihuLight. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. IEEE Press, 40.
  • Da Silva et al. (2016) Hércules Cardoso Da Silva, Flávia Pisani, and Edson Borin. 2016. A comparative study of SYCL, OpenCL, and OpenMP. In 2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW). IEEE, 61–66.
  • de Dinechin et al. (2013) Benoıt Dupont de Dinechin, Renaud Ayrignac, Pierre-Edouard Beaucamps, Patrice Couvert, Benoıt Ganne, Pierre Guironnet de Massas, François Jacquet, Samuel Jones, Nicolas Morey Chaisemartin, Frédéric Riss, et al. 2013. A clustered manycore processor architecture for embedded and accelerated applications. In 2013 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–6.
  • Duan et al. (2018) Xiaohui Duan, Ping Gao, Tingjian Zhang, Meng Zhang, Weiguo Liu, Wusheng Zhang, Wei Xue, Haohuan Fu, Lin Gan, Dexun Chen, et al. 2018. Redesigning LAMMPS for peta-scale and hundred-billion-atom simulation on Sunway TaihuLight. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 148–159.
  • Fang et al. (2017) Jiarui Fang, Haohuan Fu, Wenlai Zhao, Bingwei Chen, Weijie Zheng, and Guangwen Yang. 2017. swdnn: A library for accelerating deep learning applications on sunway taihulight. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 615–624.
  • Fu et al. (2017) Haohuan Fu, Conghui He, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang Zhang, Tingjian Zhang, Wei Xue, Weiguo Liu, Wanwang Yin, et al. 2017. 18.9-Pflops nonlinear earthquake simulation on Sunway TaihuLight: enabling depiction of 18-Hz and 8-meter scenarios. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2.
  • Fu et al. (2016) Haohuan Fu, Junfeng Liao, Jinzhe Yang, Lanning Wang, Zhenya Song, Xiaomeng Huang, Chao Yang, Wei Xue, Fangfang Liu, Fangli Qiao, et al. 2016. The Sunway TaihuLight supercomputer: system and applications. Science China Information Sciences 59, 7 (2016), 072001.
  • Gimenes et al. (2018) Tiago Lobato Gimenes, Flávia Pisani, and Edson Borin. 2018. Evaluating the Performance and Cost of Accelerating Seismic Processing with CUDA, OpenCL, OpenACC, and OpenMP. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 399–408.
  • Jäger et al. (2001) Rainer Jäger, Jürgen Mann, German Höcht, and Peter Hubral. 2001. Common-reflection-surface stack: Image and attributes. Geophysics 66, 1 (2001), 97–109.
  • Lashgar and Baniasadi (2016) Ahmad Lashgar and Amirali Baniasadi. 2016. Openacc cache directive: Opportunities and optimizations. In 2016 Third Workshop on Accelerator Programming Using Directives (WACCPD). IEEE, 46–56.
  • Lawrens et al. (2015) Fernando Lawrens, Aditya Jiwandono, and Rachmat Sule. 2015. Implementation of non uniform memory address (NUMA) parallel computation in order to speed up the common reflection surface (CRS) stack optimization process. In Proceedings of the 12th SEGJ International Symposium, Tokyo, Japan, 18-20 November 2015. Society of Exploration Geophysicists and Society of Exploration …, 48–51.
  • Li et al. (2018a) Liandeng Li, Jiarui Fang, Haohuan Fu, Jinlei Jiang, Wenlai Zhao, Conghui He, Xin You, and Guangwen Yang. 2018a. swCaffe: A Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight. In 2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 413–422.
  • Li et al. (2018b) Mingzhen Li, Yi Liu, Hailong Yang, Zhongzhi Luan, and Depei Qian. 2018b. Multi-role SpTRSV on Sunway Many-Core Architecture. In

    20th IEEE International Conference on High Performance Computing and Communications; 16th IEEE International Conference on Smart City; 4th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018, Exeter, United Kingdom, June 28-30, 2018

    . 594–601.
  • Lin et al. (2018) Heng Lin, Xiaowei Zhu, Bowen Yu, Xiongchao Tang, Wei Xue, Wenguang Chen, Lufei Zhang, Torsten Hoefler, Xiaosong Ma, Xin Liu, et al. 2018. ShenTu: processing multi-trillion edge graphs on millions of cores in seconds. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. IEEE Press, 56.
  • Liu et al. (2018) Changxi Liu, Biwei Xie, Xin Liu, Wei Xue, Hailong Yang, and Xu Liu. 2018. Towards efficient spmv on sunway manycore architectures. In Proceedings of the 2018 International Conference on Supercomputing. ACM, 363–373.
  • Marchetti et al. (2010) P Marchetti, D Oriato, O Pell, AM Cristini, and D Theis. 2010. Fast 3D ZO CRS Stack–An FPGA Implementation of an Optimization Based on the Simultaneous Estimate of Eight Parameters. In 72nd EAGE Conference and Exhibition incorporating SPE EUROPEC 2010.
  • Marchetti et al. (2011) Paolo Marchetti, Alessandro Prandi, Bruno Stefanizzi, Herve Chevanne, Ernesto Bonomi, and Antonio Cristini. 2011. OpenCL implementation of the 3D CRS optimization algorithm. In SEG Technical Program Expanded Abstracts 2011. Society of Exploration Geophysicists, 3475–3479.
  • Ni and Yang (2012) Yao Ni and Kai Yang. 2012. A GPU Based 3D Common Reflection Surface Stack Algorithm with the Output Imaging Scheme (3D-CRS-OIS). In SEG Technical Program Expanded Abstracts 2012. Society of Exploration Geophysicists, 1–5.
  • Ottolini and Claerbout (1984) Richard Ottolini and Jon F Claerbout. 1984. The migration of common midpoint slant stacks. Geophysics 49, 3 (1984), 237–249.
  • Xu et al. (2017) Zhigeng Xu, James Lin, and Satoshi Matsuoka. 2017. Benchmarking sw26010 many-core processor. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 743–752.
  • Yang et al. (2016) Chao Yang, Wei Xue, Haohuan Fu, Hongtao You, Xinliang Wang, Yulong Ao, Fangfang Liu, Lin Gan, Ping Xu, Lanning Wang, et al. 2016. 10M-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, 6.
  • Yilmaz (2001) Öz Yilmaz. 2001. Seismic data analysis: Processing, inversion, and interpretation of seismic data. Society of exploration geophysicists.
  • Zeng and Thurber (2016) Xiangfang Zeng and Clifford H Thurber. 2016. A graphics processing unit implementation for time–frequency phase-weighted stacking. Seismological Research Letters 87, 2A (2016), 358–362.