    # The Fast and Free Memory Method for the efficient computation of convolution kernels

We introduce the Fast Free Memory method (FFM), a new fast method for the numerical evaluation of convolution products. Inheriting from the Fast Multipole Method, the FFM is a descent-only and kernel-independent algorithm. We give the complete algorithm and the relevant complexity analysis. While dense matrices arise normally in such computations, the linear storage complexity and the quasi-linear computational complexity enable the evaluation of convolution products featuring up to one billion entries. We show how we are able to solve complex scattering problems using Boundary Integral Equations with dozen of millions of unknowns. Our implementation is made freely available within the Gypsilab framework under the GPL 3.0 license.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The numerical computation of convolution products is a crucial issue arising in many domains like the filtering, the computation of boundary integral operators, optimal control, etc. In a continuous framework, a convolution product is of the form

 v(x)=∫ΩG(x,y)u(y)dΩy (1)

where is some domain of integration in , some function. The bivariate function is some convolution kernel of the form

 G(x,y)=G(x−y) (2)

where is some distance which can be the classical euclidean distance . Of course, eq. (1) does not admit an analytical expression in the general case and the integral is computed numerically using, for instance, a quadrature rule. Assuming we want to evaluate on a finite set of nodes , we have

 v(xi)≈NY∑j=1ωjG(xi,yj)u(yj) (3)

where and are respectively the weights and nodes of such a quadrature. The discrete convolution

may be recast as a simple matrix-vector product

 v=G⋅W⋅u (4)

where , , and . Obviously, the matrix is dense. Therefore, the storage and computational cost grow quadratically. The computation of (4) is constrained to smaller problems ( a few thousand) on personal computers and smaller servers, and to for industrial servers.

The current approach is to perform the computation approximately up to a given tolerance (accuracy). In the past thirty years, multiple so-called acceleration methods have been proposed. The entries of can be seen as the description of an interaction between a source set of nodes and a target set of nodes . Thus all blocks of describe the interaction between a source subset of and a target subset of . Theses interactions may be compressible, i.e. it admits a low-rank representation. This is the case, for example, when two subsets are far enough following an admissibility criterion

. The methods mentioned in the following propose different alternatives on the way the interactions are characterized and computed. The standard way is probably the

Fast Multipole Method (FMM) developed by L. Greengard and V. Rokhlin (see [greengardFMM]), initially introduced for the computation of the gravitational potential of a cloud of particles. Later versions feature the support of oscillatory kernels like the Helmholtz Green kernel. One major drawback is that the implementations are mostly kernel-specific despite recent advances in the domain [fong2009]. We refer to [chengFMM] for more details. In 1999, a new approach named Hierarchical matrices (-matrices) was introduced by S. Börm, L. Grasedyck and W. Hackbusch. This method is based on the representation of the matrix by a quadtree whose leaves are low-rank or full-rank submatrices. A strong advantage in favor of hierarchical matrices is that a complete algebra has been created: addition, multiplication, LU-decomposition, etc. Unfortunately, -matrices become less effective for strongly oscillating kernels because the rank of the compressible blocks increases with the frequency of the oscillations. For more details and a complete mathematical analysis, we refer to [hackbusch2015, borm2015]. One of the most recent compression methods, to our knowledge, is the Sparse Cardinal Sine Decomposition (SCSD) proposed by F. Alouges and M. Aussal in 2015 [alouges2015]. It is based on a representation of the Green kernel in the Fourier domain using the integral representation of the cardinal sine function. One major advantage is that the matrix-vector product is performed without partitioning of the space. However, there is no corresponding algebra. All the aforementioned methods aim at having a quasi-linear storage and computational complexity.

We introduce here the Fast Free Memory method (FFM) which blends together multiple features of the existing methods in order to have a minimalist storage requirement. It is designed for the computation of massive matrix-vector products, up to billions of nodes, where methods featuring quasi-linear storage complexity fail because of the non-linear part. The FFM relies heavily on the FMM algorithm, in particular on the octree-based space partitioning, and introduces compression techniques featured in the -matrices and the SCSD. In particular, the FFM is a kernel-independent method for the computation of the matrix-vector product (3). A powerful feature is that the required storage is minimalist with linear complexity and that the computational complexity is quasi-linear. This enables the computation of matrix-vector products with hundred of millions of nodes on the source and target sets with laboratory-sized servers. This paper is divided into three main parts. In the first part, we develop the FFM algorithm for standard kernels (non-oscillating kernels and the Helmholtz Green kernel). In the second part, we prove the storage and computational complexities. In the last part, we illustrate the proven complexities on academic and industrially-sized problems with millions of entries up to one billion.

## 2 The FFM algorithm

The FFM is a divide-and-conquer method recursively implemented. It is based on two partitioning trees (one for the source set and one for the target set) strongly related to the octree used in the classical FFM. This inheritance and the particularities of the FFM are explained in the first part of this section where we also introduce the notations which will be used in this paper. In the second part, we develop the different steps of the algorithm for the case of a general convolution and we optimize it for the case of oscillatory kernels.

### 2.1 The FFM and the FMM

The basic idea in the FMM is that the interaction between subsets of nodes sufficiently far one from another admit a low-rank representation. The space is therefore partitioned using two octrees (see Figure 1) obtained by successive refinements of the bounding boxes of the initial source and target set of nodes. At each level of refinement, boxes far one from the other (following a given criterion) correspond to compressible interactions. The other boxes are further subdivided and yield in turn low-rank and non-compressible interactions. When the matrices corresponding to the non-compressible interactions are small-enough, a full computation is performed. For more details, we refer to the bibliography on the topic, for example [greengardFMM, chengFMM, darve1999, darve2000]. In the FMM, the low-rank approximation is performed using a multipole expansion of the convolution kernel, see [greengardHuang]. As this will be illustrated in the following section 2.2, the FFM is more versatile and such expansions will be used only for the oscillating kernels. Figure 1: Representation of the octree used in the FFM with three levels of refinement.

The differences between the two methods find their origin in the way the partitioning octrees are built. Classically, the root bounding-box for each set of nodes is tight in the sense that its edge dimensions fit tightly the expansion of the set of nodes in each direction. Consequently, the initial bounding boxes for the source and target sets have different sizes and shapes in general. In the FFM, both initial bounding boxes are cubic with the same

edge length. In particular, it means that for a given level of refinement all the boxes in both octrees are the same up to a translation. The direct consequence of this choice is that there is no need to interpolate data between the boxes of the different trees because the discretization in the quadrature spaces are the same. Another difference is that the FFM is a

descent-only algorithm. This last particularity enables the linear storage complexity, see section 3.

### 2.2 The FFM algorithm

In this subsection, we describe the kernel-independent compression method for the matrix-vector product. We first describe the initialization. Then, we explain how the kernel-independent matrix-vector product is computed using a well-known Lagrange-interpolation method. We show how the user can choose the compression method when dealing with the particular case of the Helmholtz Green kernel. Finally, a stopping criterion is proposed. In the following, we reuse the previous notations and we introduce as the depth of the octree. The case corresponds to the root and is the maximum allowed depth.

#### 2.2.1 Initialization

The initialization () is performed easily by computing the bounding boxes for and . Let and be the maximum edge dimension of , resp. , then the initial dimension for both bounding boxes is

 d0=max(dX,max,dY,max). (5)

The initial bounding box for each set is simply a cube enclosing , resp. , with edge length . At the depth in the octree, the dimension of the bounding boxes is simply .

#### 2.2.2 The kernel-independent FFM

We suppose that the current refinement level is . We consider only one interaction between a box of the source tree and a box of the target tree. This interaction is low-rank if the distance between their centers is greater than two-times the edge length of a box. Let the number of nodes in the target box and the number of nodes in the source box, the compressed product is performed using a classical bivariate Lagrange interpolation, see for example [cambier2017] or the chapters related to the -matrices in [hackbusch2015]. The principle of the Lagrange interpolation is to approximate the convolution kernel like

 G(x,y)≈rX∑i=1Li(x)rY∑j=1G(xc,i,yc,j)Lj(y), (6)

where

• and are the target and source control nodes for the Lagrange polynomials. Following [mastroianni2001, cambier2017], the best interpolation nodes are the Chebyshev nodes.

• , resp.

are the Lagrange polynomials defined as the tensorization of the one-dimensional Lagrange polynomials in each direction of the space and localized at the

control nodes.

• and are the ranks of the interpolation for the target and source variables. If are the rank of the interpolation in the direction , then and . For a prescribed accuracy on the interpolation, these ranks depend solely on the size of the interpolation domain (in other words: the size of the bounding boxes) and on the regularity of the kernel being interpolated. Consequently, we have

 rX≡rY≡r≡(r1)3 (7)

in the FFM framework.

The interpolated matrix-vector product can be therefore recast as

 v≈LTX⋅(T⋅(LY⋅u)) (8)

where

• is the source vector whose entries are localized at the nodes contained within the source box.

• , resp. , is the interpolation matrix whose entries are the Lagrange polynomials localized at the source control nodes evaluated at the source nodes.

• is the transfer-matrix whose entries are the kernel evaluated for each possible couple of target and source control node.

However, is not necessarily low and can be further compressed using the Adaptive Cross Approximation, see [bebendorf], such that

 T≈A⋅BT (9)

where and such that . Finally, the Lagrange interpolation consists in four successive matrix-vector products such that

 v≈LTX⋅(A⋅(BT⋅(LY⋅u))). (10)

As the rank depends on the kernel, it may increase unacceptably when dealing with oscillatory kernels because the polynomial order must be high to fit the oscillations in the sub-domains. In that case, it is beneficial to use kernel-specific compression method as illustrated in the next section 2.2.3.

#### 2.2.3 The FFM for oscillatory kernels

In this section, we illustrate the change of compression method for the special case of the Helmholtz Green kernel

 G(x,y)=14πeik|x−y||x−y| (11)

where is the wavenumber. A test is performed on the value of to determine wether the low-rank interaction is oscillating. For example, the evaluation of the Helmholtz kernel (11) for two nodes and sufficiently close can be detected as non-oscillating because the value of is small. In this case, we use the Lagrange interpolation described in subsubsection 2.2.2 instead of a specific low-frequency FMM (see [greengard1998] or [darve2004]). In the other case, we approximate the kernel using its Gegenbauer-series expansion like in the FMM.

Let , resp. , be the center of the target, resp. source, box, then

 x−y=(x−x0)+(x0−y0)+(y0−y) (12)

which can be reformulated like

 r=r0+rxy (13)

where . Let also be the unit sphere in , then following for example [darve1999]

 eikrr=iklimL→∞∫S2eik^s⋅rxyTL,r0(^s)d^s (14)

where and is the Gegenbauer series such that

 (15)

where is the spherical Hankel function of the first kind and of order , is the Legendre polynomial of order . In practice, (15) is truncated at the rank

 L=⌊|k|⋅√3⋅dl−log(ε)⌋ (16)

where is the size of the edge of the bounding box and is the prescribed accuracy on the matrix-vector product, see [darve2000]. The integral (14) is computed using a spherical quadrature such that, for two nodes and ,

 eik|xi−yj||xi−yj|≈iknQ∑q=1eikxi⋅^sq((ωqTL,r0(^sq)eikr0⋅^sq)e−ik^sq⋅yj). (17)

The computation is therefore recast as three successive matrix vector products

 v≈F^s→X⋅(TL,r0⋅(FY→^s⋅u)) (18)

where

• is the dense

matrix representing the discrete non-uniform forward Fourier transform from the

source set to the Fourier domain such that

 ~uq=n∑j=1e−ik^sq⋅yjuj. (19)
• is the transfer diagonal matrix whose entries contain the value of Gegenbauer series evaluated at such that

 ~vq=ikωq(TL,r0(^sq)eikr0⋅^sq)~uq. (20)
• is the dense matrix representing the discrete non-uniform backward Fourier transform from the Fourier domain to the target set such that

 vi=nQ∑q=1eikxi⋅^sq~vq. (21)

Of course, the Fourier-matrices are never assembled and the Fourier transforms are computed using the corresponding Non-Uniform Fast Fourier Transform (NUFFT) introduced by A. Dutt and V. Rokhlin [duttNufft1993], later improved by Greengard [greengardNufft]. We refer to these papers for a complete analysis of the algorithm.

###### Remark 1.

In the classical FMM, the Fast Fourier Transforms are computed at the deepest level in the trees, forcing a back-propagation in the octree. In the FFM, they are performed on-the-fly enabling a descent-only algorithm.

#### 2.2.4 Stopping criterion

The recursive partitioning is stopped whenever one of the following is verified:

• any compressible interaction has been computed,

• the average number of nodes in the boxes is below some value.

The remaining non-compressible interactions, if any, are computed as a dense matrix-vector product.

## 3 Complexity analysis

In this section, we prove the storage complexity and the computational complexity of the FFM, where . To that purpose, we assume that each set or

consists in an uniform distribution of nodes in a cube. The case of surface node-distributions is eventually tackled as a particular case.

### 3.1 Complexity of an octree

We give here some general well-known results on space partitioning trees. Assuming is the length of the edge of the root bounding box, the bounding boxes at level have the dimension . They contain (in average) nodes. There are at most non-empty boxes. Consequently, the maximum depth of an octree is . The construction of the tree itself requires operations. In general, the required storage is also but it is in the FFM framework because we store only the data needed at the current depth. For the sake of simplicity, we assume that the boxes contain exactly

nodes as it does not modify the overall estimate.

##### The particular case of surface distributions

We emphasize the fact that using an octree to subdivide evenly distributed nodes on a surface amounts to consider a plane surface partitioned using a quadtree as illustrated on Figure 2. Figure 2: Illustration of the subdivision of a surface using an octree.

Consequently, we replace the by in the aforementioned results: the maximum depth is , etc.

###### Remark 2.

For the sake of simplicity, the subscript for the indicating the type of the logarithm is omitted whenever it is not required for the comprehension.

### 3.2 Complexity estimates for the kernel-independent FFM

We prove here the storage and computational complexity for the kernel-independent version. Since the FFM is a descent only algorithm, we can discard data stored at the parent level in the tree. This minimal storage requirement leads to the following proposition.

###### Proposition 1.

The storage complexity of the FFM is and the computational complexity is .

###### Proof.

Before we prove the estimates, we make some preliminary remarks. We first emphasize that the interpolation step for each of the source set, resp. target, is performed only once for all the corresponding subsets. Second, we drive the attention to the fact that, while there should be many interpolations to compute, most of the transfers are the same. On Figure 3, the two transfers (arrows) represented between the boxes from the source tree (full lines) and the target tree (dotted) have the same transfer matrix which is computed only once.

In fact, the amount of different translations is uniquely determined by the initial configuration of the root bounding boxes as illustrated in 2-dimension on Figure 4. The left picture corresponds to two overlapping quadtrees. If we enumerate all the possible low-rank interactions for the filled box at the depth 2, we find . The filled bounding box interacts at most with the three outer layers minus the closest layer of boxes meaning there are at most possibilities per box at all levels of refinement. All the other boxes are children of boxes involved in low-rank interactions at the previous level of refinement. On the right configuration, the trees are shifted and the amount of possible interactions is now . This number corresponds to the total number of transfer matrices which needs to be computed at the current depth ; it does not depend on . Figure 4: Space configuration of two three-leveled quadtrees in 2-dimension. Left: the quadtrees overlap. Right: the quadtree are shifted. Very thick line: level-0, thick line: level-1, thin line: level-2. Fully colored cell: bounding box for which we want to determine the low-rank interactions.
1. Let the number of control nodes for the Lagrange interpolation in each direction of space, then is bounded above by a value independent on . Let and , then the storage requirement for the first interpolation is per interpolation matrix. Since there are as many matrices as there are boxes in the tree at level , the storage requirement is then . The interpolation consists in matrix-vector products, each of them of size . The computational complexity of the complete Lagrange interpolation of the source set for any is therefore .

2. Let be the maximum number of unique transfers (see again Figure 4), the storage and the computation of these matrices are each as the number and the dimensions of the transfer matrices do not depend on nor . Each interpolated source vector is then multiplied most -times by a transfer matrix whose size is independent on or . Consequently, this transfer step has complexity for the storage and the computational complexity.

3. Finally, the last interpolation is performed in two steps:

1. The construction of the interpolation matrices for the target set has the same complexity as for the source set, i.e. it is .

2. The transfer to the target set of interpolations consist in computing the "interpolation matrix"-"transferred vector" for each of the "transferred vector". This step should be performed at most times for each of the target i.e. -times.

Therefore, the second interpolation step has the same complexity as the step 1, i.e. it is in storage and computational complexity.

To these costs, we must add the worst cost of the construction of the octrees which is and of the non-compressible interactions which is linear because the maximum number of close interactions for one node is bounded above by the maximum number of nodes allowed in the leaves of the tree.

We conclude that the storage requirement for the FFM is . Regarding the computational requirement, the worst case consist in performing the interpolation -times. We conclude that the computational cost for the complete FFM product is . ∎

###### Remark 3.

These complexity estimates do not depend on wether the nodes are evenly scattered in a volume (octree-based partitioning) or on a surface (equivalent quadtree-based partitioning).

### 3.3 Complexity estimates for the oscillating kernels

In the previous subsection 3.2, we proved very general estimates. When dealing with oscillating kernels, one must introduce the notion of wavelength and of "discretization" per wavelength. Let be the average distance between two nodes, it is generally chosen such that

 k⋅Δx=C,C≈1, (22)

where we recall that is the wavenumber. In other words, there are approximately six nodes per wavelength. This is fundamentally different from the generic case as the number of nodes in a given volume depends explicitly on the wavenumber . We assume in the following that is accordingly adjusted as a function of .

We suppose again that the nodes are scattered evenly in a volume. The special case of surfaces is tackled at the end of this subsection.

###### Proposition 2.

The storage complexity for the oscillating FFM is and the computational complexity is .

###### Proof.

The proof is very similar to the proof of Proposition 1. We first remark that the NUFFT is based on the Fast Fourier Transform which has a storage requirement and a computational complexity. These estimates remain valid for the NUFFT but with a bigger multiplicative constant, see [greengardNufft]. The main difficulty is to estimate the dependency of to . To that purpose, we assume that for some we have . In our choice of spherical quadrature, we have

 nQ=2⋅L⋅(L+1) (23)

where is given by (16). By substituting and expanding , we obtain

 nQ=A(k2l)2+B(k2l)+C (24)

where are constants. Following the hypothesis , we have that

 N≈(d0Δx)3⇔k∼N1/3. (25)

Consequently,

 nQ=AN2/34l+BN1/32l+C (26)

meaning that . In the particular case of surface distributions of nodes, we have and the polynomial (25) becomes

 nQ=AN4l+BN1/22l+C. (27)

This means that for surface distributions of nodes, . Since the coefficients are all positive, we conclude that we always have .

We detail the complexity of each step below:

1. The NUFFT has the same complexity as the classical FFT. Therefore, the storage requirement for one NUFFT is meaning that the total storage requirement is . The total number of NUFFTs is and each has the computational complexity . Since , we conclude that the storage complexity of the first step is and that the computational complexity is .

2. The second step is a simple multiplication in the Fourier domain which is performed -times on vectors of length , see the proof of Proposition 1. By remarking that , we conclude that this step has a linear storage and computational complexity.

3. The last step is the backward step of the first one. Consequently, we obtain exactly the same complexities.

By including the complexity of the tree and the computation of the non-compressible interactions (see the end of the proof of Proposition 1), we conclude that the worst storage requirement is still and that the worst computational complexity is . ∎

###### Remark 4.

In this section, we proved that the storage requirement of the FFM is always linear while the computational complexity is in the general case (but with eventually big multiplicative constants) and for the particular case of oscillating kernels. These estimates are not worse than the existing compression methods and they are illustrated in the next section.

## 4 Numerical Examples

In this section, we give examples of application of the FFM. Our implementation is written in the MATLAB language. First we illustrate the estimates proven in the previous section 3. In a second time, we show how one can use the FFM to solve Boundary Integral Equations iteratively and we solve two publicly available benchmarks in acoustic and electromagnetic scattering.

### 4.1 Scalability of the FFM

The scalability of the FFM is illustrated by computing the convolution product where the kernel is the Laplace Green kernel given below,

 G(x,y)=14π1|x−y|. (28)

Physically, it amounts to compute the gravitational potential generated by masses located at the source nodes at the target nodes. To that purpose, we pick randomly source nodes and target nodes on the unit sphere centered at the origin as illustrated on Figure 5.

Then, we simply compute the convolution product (4) with the FFM and we measure the memory requirement and the computation time. The computation111The convolution product between two sets of nodes may be computed using the ffmProduct() function, available within the open-source Gypsilab framework [gypsiHub] in the ./openFfm directory. An example is provided by running the nrtFfmBuilder.m script. was performed on a computer with 12 cores at 2.9 GHz, 256 GBytes of RAM and Matlab R2017a using single precision. The results are gathered in Table 1 for a prescribed accuracy on the matrix-vector product. The linear storage requirement is confirmed. We observe furthermore that the computational scalability is close to linear. More importantly we are able to achieve a matrix-vector product with one billion of nodes in each of the source and target set in less than four hours. On the required 100 GBytes, approximately 40 GBytes are required for the storage of the coordinates of the nodes, the input vector and the output vector. This implies that the multiplicative constant in the estimate is almost .

The same experiment is repeated for the Helmholtz Green kernel. The results are given in Table 2 with the corresponding maximum value.

### 4.2 Boundary Integral Equations and the FFM

Boundary Integral Equations can be obtained from certain equations describing, for example, physical phenomenons like the propagation of an acoustic or electromagnetic wave in a homogeneous medium. We refer to [nedelec] starting p. 110, or [coltonKress] starting p. 66, for more details on how they are obtained. In order to explain how the FFM is used to solve such equations, we consider the scattering of an acoustic wave propagating in an infinite medium by a scatterer with boundary on which we apply a Dirichlet boundary condition. This problem can be tackled by solving the following Boundary Integral Equation

 ∫ΓG(x,y)λ(y)dγy=−ui(x),x∈Γ (29)

where is some unknown, is the Helmholtz Green kernel (see (11)) and is the incident wave. This equation may be solved using different method. Here we present shortly the Boundary Element Method based on a Galerkin formulation. We introduce a test function to obtain the Galerkin formulation of eq. (29)

 ∫Γ×Γλ⋆(x)G(x,y)λ(y)dγydγx=−∫Γui(x)λ⋆(x)for allλ⋆. (30)

We further introduce the discrete approximation spaces and such that

 λ(y) =N∑j=1uj⋅λj(y), (31) λ⋆(y) =N∑i=1vi⋅λ⋆i(x). (32)

There are multiple ways to deal with this singular integral. Here we integrate the double integral using a Gauss-Legendre quadrature. The resulting inaccurate integration of the singularity is tackled later. Let and be the weight and nodes of quadrature, the eq. (30) now reads

 ∫Γ×Γλ⋆(x)G(x,y)λ(y)dγydγx≈N∑i=1ving∑k=1λ⋆i(xg,k)ωg,kng∑l=1G(xg,k,yg,l)N∑j=1ωg,lλj(yg,l)uj. (33)

Therefore, eq. (29) is rewritten as linear system of equations,

 S⋅λ=Ui. (34)

The Galerkin matrix can be recast as the product of three matrices such that

 S=(Λ⋆)T⋅(G⋅Λ) (35)

where is the matrix "transporting" the basis functions to the quadrature nodes, is the matrix "quadrature-to-test-functions" and is the matrix such that . While the matrices and are sparse and can be stored with linear complexity, is full. In the process of an iterative inversion algorithm such as GMRES, see [saadSchultz], one or more matrix-vector products are required,

1. . This product has linear complexity.

2. . This product is compressed using the FFM.

3. . This product also has linear complexity.

In general, is closely related to the number of elements in the discretization of . Assuming for example that there are three quadrature nodes per element, then meaning that the size of may be in fact much higher than the actual size of the linear system. The singular integral is computed independently using a semi-analytical method. It takes the form of an additional sparse matrix which "removes" the singularity integrated numerically in and adds the "exact" integration of the kernel.

Our FFM library is interfaced with the Gypsilab software. Gypsilab is an open-source (GPL3.0) Finite Element framework entirely written in the Matlab language aiming at assembling easily the matrices related to the variational formulations arising in the Finite Element Method or in the Boundary Element Method. Among other things, it features a complete -matrix algebra (sum, product, LU decomposition, …) compatible with the native matrix types of Matlab. For more details on the capabilities of Gypsilab we refer to [gypsilab, gypsiHub]. In the context of this paper, we use it to manage the computation of the matrices , and the right-hand-side 222The high-level interface with the FFM is available as an other overloaded integral() function within Gypsilab. An example is provided by running the nrtFfmBuilderFem.m script..

We present here two examples of application. The first one corresponds to the scattering of an underwater acoustic wave by a submarine and the second one is the scattering of an electromagnetic wave by a perfectly electric conductor rocket launcher.

#### 4.2.1 Acoustic scattering by a submarine

We solve the acoustic scattering by a submarine with Neumann Boundary condition. This example is based on the BeTSSi benchmark [betssi]. The mesh is provided by ESI Group and it is remeshed using the open-source Mmg Platform [mmg]. We could solve this problem using the following equation

 ∫Γ∂2G∂nx∂ny(x,y)μ(y)dγy=∂ui∂nx (36)

whose variational formulation is

 k2∫Γ×Γ(μ⋆(x)⋅G(x,y)⋅μ(y)⋅(nx⋅ny))dγxdγy−∫Γ×Γ(rotΓμ⋆(x)⋅G(x,y)⋅rotΓμ(y))dγxdγy=∫Γμ⋆(x)⋅∂ui% ∂nx(x)dγx (37)

where is the unknown, is the test function, the wavenumber, is the outbound normal vector at position , and is the normal derivative with respect to the variable . Eq. (36) is very ill-conditioned and is also ill-posed for some frequencies. The integral operator is called the hypersingular boundary integral operator. The scattering problem could also be solved using another boundary integral equation

 −μ(x)2+∫Γ∂G∂nx(x,y)μ(y)dγy=−ui(x) (38)

which is better conditioned but which is also ill-posed for some frequencies and less accurate in practice. The boundary integral operator in eq. (38) is the adjoint of the double layer operator. To circumvent these issues, we use a linear combination of the integral operators involved in eq. (36) and (38) called the Brakhage-Werner formulation, see [kress]. This formulation is better conditioned and always well-posed. It is important to note that each iteration in the GMRES algorithm requires in fact FFM-products in our implementation: three for the part with the scalar product of the normal vectors, three for the part with the scalar product of the surface rotational, and three for the adjoint of the double layer operator.

The submarine is m long and the frequency is set at kHz. Assuming, the celerity of sound in water is m.s, the wavelength is m meaning that there are approximately wavelengths along the submarine, or differently written we have . The mesh features triangles. Since the numerical integration is performed using a quadrature rule with 3 nodes per triangle, the size of one FFM-product is . The problem is discretized with -elements implying that the total amount of (nodal) unknowns is . It is solved using a preconditioned GMRES algorithm without restart on a 32 cores server at 3.0 GHz with 512 GBytes RAM. The tolerance for both the GMRES and the FFM product is set to . Convergence is achieved in 12 iterations and 50000 seconds ( 13 hours and 50 minutes) including the assembling of the preconditioner and the regularization matrix. Each iteration requires approximately seconds. The radiated field on the surface is represented on Figure 6. This computation is also performed using the FFM. The memory peak is measured at approximately 200 GBytes when assembling the preconditioner and the regularization matrix. Figure 6: Absolute value of μ on the surface of the submarine

#### 4.2.2 Perfectly electric rocket launcher

This example is a slightly modified version of a test case extracted from the Workshop EM-ISAE 2018, see [workshop2018]. We study the scattering of an electromagnetic plane wave by a perfectly electric launcher. This problem is solved the Combined Field Integral Equation (CFIE) which is a linear combination of the Electric Field Integral Equation (EFIE) and the Magnetic Field Integral Equation (MFIE). For the construction of these equations, we refer once again to [nedelec] starting p. 234, or [coltonKress] starting p. 108. The EFIE reads

 1k2∇Γ∫ΓG(x,y)∇Γ⋅J(y)dγy+(∫ΓG(x,y)J(y)dγy)T=−(Ei)TikZ (39)

where is the tangential trace of the magnetic field, is the incident electromagnetic wave, is an impedance, and is the tangential trace operator. The MFIE reads

 (40)

where is the incident magnetic field. The CFIE then reads

 CFIE=α⋅EFIE+(1−α)⋅Z⋅MFIE,α∈]0,1[. (41)

We use the classical Raviart-Thomas finite element space of order 0 (RT). The launcher is 60 m long and 12 m in diameter (including the boosters). The electromagnetic wave propagates at GHz meaning that the wavelength is 0.15 m, assuming a celerity of light equals to m.s. Therefore, there are wavelength along the launcher. The total number of unknowns is for approximately triangles. Consequently, the size of a FFM product is . We use the same server as in subsubsection 4.2.1. One GMRES matrix-vector product involves now ( for the EFIE, for the MFIE) FFM products and lasts approximately hours. The real part of the solution is represented on Figure 7. Figure 7: Real part of J on the whole launcher (top) and detail (bottom).

### 4.3 The FFM compared to the H-matrices

We want to stress that for lower problems sizes (), the FFM performs, in general, worse than other methods like the -matrices or the FMM. Indeed, everything is computed anew at every call to the method in order to save memory. When compared to the -matrices, the assembling of the -matrix costs more than a few FFM products but the -matrix-vector product is then almost free. This is illustrated in Table 3 for the Laplace Green kernel. The computation is performed on a laptop and a single core at 2.6 GHz is used. The total available RAM is 16 GBytes. We use the -matrix library featured in Gypsilab. However, the FFM enables the computation of massive convolution products as illustrated in subsections 4.1 and 4.2, even for oscillating kernels, on small servers, thus avoiding the use of huge infrastructures.

## 5 Conclusion

We have proposed a powerful and scalable alternative to the existing compression methods when dealing with convolution featuring millions, or eventually billions, of nodes. The two main ingredients of the FFM are: a descent-only tree traversal, and cubic bounding boxes with the same edge size for the two octrees. It enables a linear storage complexity and a quasi-linear computational complexity which are numerically highlighted. The FFM is kernel-independent but its versatility enables optimization for strongly oscillating convolution kernels. Moreover, the implementation effort is small as only a few hundred Matlab lines are required. From a practical point of view, the FFM gives the possibility to compute realistic problems to users who do not have access to a computer cluster. The code is now freely available in the git repository of Gypsilab at [gypsiHub] under the GPL 3.0 license.

## Acknowledgements

This work is funded by the Direction Générale de l’Armement. We would like to thank François Alouges for his invaluable help in the development of this work and Leslie Greengard for providing the NUFFT code used for the computation of the numerical results.