
Transformer Meets Boundary Value Inverse Problems

by   Ruchi Guo, et al.

A Transformer-based deep direct sampling method is proposed for solving a class of boundary value inverse problems. A real-time reconstruction is achieved by evaluating the learned inverse operator between carefully designed data and the reconstructed images. An effort is made to give a case study for a fundamental and critical question: whether and how one can benefit from the theoretical structure of a mathematical problem to develop task-oriented and structure-conforming deep neural networks. Inspired by direct sampling methods for inverse problems, the 1D boundary data are preprocessed by a partial differential equation-based feature map to yield 2D harmonic extensions in different frequency input channels. Then, by introducing learnable non-local kernels, the approximation of direct sampling is recast as a modified attention mechanism. The proposed method is then applied to electrical impedance tomography, a well-known severely ill-posed nonlinear inverse problem. The new method achieves superior accuracy over its predecessors and contemporary operator learners, and shows robustness with respect to noise. This research strengthens the insight that the attention mechanism, despite being invented for natural language processing tasks, offers great flexibility to be modified in conformity with a priori mathematical knowledge, which ultimately leads to the design of more physics-compatible neural architectures.





1 Introduction

Boundary value inverse problems aim to recover the internal structure or distribution of multiple media inside an object (a 2D reconstruction) based only on data available on the boundary (a 1D signal input). They arise in many imaging techniques, e.g., electrical impedance tomography (EIT) (Holder, 2004), diffuse optical tomography (DOT) (Culver et al., 2003), and magnetic induction tomography (MIT) (Griffiths et al., 1999). Because no internal data are needed, these techniques are generally non-invasive, safe, and cheap, and thus quite suitable for monitoring applications.

Consider a nonlinear operator associated with a physical model governed by certain partial differential equations (PDEs) on a bounded domain. For the measurement on the boundary and the coefficient of the underlying PDEs to be recovered, the forward PDE model is


where the noise follows a certain distribution. Compared with classic linear inverse problems in computer vision and signal processing, e.g., (Marroquin et al., 1987), which recover a signal from a noisy measurement, the fundamental difference is that the forward operator is highly nonlinear, and the underlying Hilbert spaces are usually infinite dimensional. In many situations, even though the inverse operator is theoretically well-defined, as the infinite dimensional space is practically approximated by a finite dimensional (sub)space, the problem of seeking the approximated operator is usually highly ill-posed (not having a well-defined unique output) and poses great challenges to the reconstruction algorithms.
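In generic notation (the symbols below are placeholders of our choosing, since the inline notation did not survive extraction), the two formulations can be contrasted as:

```latex
% Nonlinear boundary value inverse problem (placeholder symbols):
%   F : \mathcal{A} \to \mathcal{Y} nonlinear forward operator,
%   a  PDE coefficient to recover, y boundary measurement, \varepsilon noise.
y \;=\; F(a) + \varepsilon
\qquad \text{vs.} \qquad
y \;=\; A x + \varepsilon
% Left: the boundary value inverse problem, with \mathcal{A}, \mathcal{Y}
% infinite-dimensional Hilbert spaces. Right: the classic linear model
% (A a linear operator, e.g., a blur), posed in finite dimensions in practice.
```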

2 Background, related work, and contributions

Classical iterative methods.

There are in general two types of methodology for solving inverse problems. The first is a large family of iterative or optimization-based methods (Dobson & Santosa, 1994; Martin & Idier, 1997; Chan & Tai, 2003; Vauhkonen et al., 1999; Guo et al., 2018; Rondi & Santosa, 2001; Chen et al., 2020). One usually looks for the desired coefficient by solving a minimization problem


where the regularization term alleviates the ill-posedness, and its design plays a critical role in a successful reconstruction (Tarvainen et al., 2008; Tehrani et al., 2012; Wang et al., 2012). For almost all these iterative methods, due to the ill-posedness, the computation generally takes numerous iterations to converge, and the reconstruction is highly sensitive to noise. Besides, the forward operator needs to be evaluated at each iteration, which is itself expensive as it requires solving the forward PDE model.
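With the same placeholder notation as above, this is the familiar regularized output least-squares functional:

```latex
\min_{a}\;\; \tfrac{1}{2}\,\big\| F(a) - y^{\delta} \big\|_{\mathcal{Y}}^{2}
\;+\; \lambda\, R(a)
% R(a): regularization term (e.g., Tikhonov \|a\|^2 or total variation),
% \lambda > 0: regularization weight, y^{\delta}: noisy measurement.
% Each evaluation of F(a) inside an iterative solver costs one forward
% PDE solve, which is the dominant expense noted above.
```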

Direct methods.

The second methodology is to develop a well-defined parametrized mapping, empirically constructed to approximate the inverse map itself:


This methodology is referred to as non-iterative or direct methods in the literature. Distinguished from iterative approaches, direct methods are in general highly problem-specific, as they are designed on an ad hoc basis from the mathematical structure of the various inverse operators. For instance, methods in EIT and DOT include factorization methods (Kirsch & Grinberg, 2007; Azzouz et al., 2007; Brühl, 2001; Hanke & Brühl, 2003), MUSIC-type algorithms (Cheney, 2001; Ammari & Kang, 2004, 2007; Lee et al., 2011), and the D-bar methods (Knudsen et al., 2007, 2009) based on a Fredholm integral equation (Nachman, 1996), among which the direct sampling methods (DSM) (Chow et al., 2014, 2015; Kang et al., 2018; Ito et al., 2013; Ji et al., 2019; Harris & Kleefeld, 2019; Ahn et al., 2020) are our focus in this work. These methods generally have a closed-form expression approximating the inverse operator, and their parameters have clear mathematical meaning. For each fixed parameter choice, this procedure is usually much more stable than iterative approaches with respect to the input data. Furthermore, the evaluation for each given measurement is distinctly fast, as no iterative optimization is needed. However, such a simple closed form admitting efficient execution may not be available in practice, since some mathematical assumptions and derivations may not hold.

Boundary value inverse problems.

For most boundary value inverse problems in 2D, one key difference from, e.g., an image inverse problem is that data are only available on 1D manifolds, which are used to reconstruct 2D targets. The boundary data themselves generally involve a certain input-output structure, which adds more complexity. In Adler & Guardo (1994); Fernández-Fuentes et al. (2018); Feng et al. (2018), boundary measurements are collected and directly input into feedforward fully connected networks. As the data reside on different manifolds, special treatments are made to the input data, such as employing pre-reconstruction stages to generate rough 2D input to CNNs (Ben Yedder et al., 2018; Ren et al., 2020).

Operator learning and inverse problems.

Solving an inverse problem is essentially approximating the inverse operator, but based on finitely many data. The emerging deep learning (DL) techniques make it possible to directly emulate operators. The concept greatly resembles the aforementioned classical direct methodology, but operator learners by DNNs are generally considered black boxes. A natural question to ask is whether and how existing innovative neural architectures, usually designed for other purposes, can be modified to conform with the mathematical nature of the underlying problem, ultimately leading to structure-conforming DNNs. Inspired by classical direct methods, we try to improve the DNN reconstruction pipeline from the architectural perspective. In this regard, the proposed study provides a positive example toward a hopefully definitive answer to this question, bridging deep learning and conventional tasks in the physical sciences.

Operator learners.

Operator learning has become an active research field for inverse problems in recent years, especially in image reconstruction where CNNs play a central role, see e.g., Kłosowski & Rymarczyk (2017); Nguyen et al. (2018); Tan et al. (2018); Jin et al. (2017); Kang et al. (2017); Barbastathis et al. (2019); Latif et al. (2019); Zhu et al. (2018); Chen et al. (2021); Coxson et al. (2022). Some efforts have been made to couple classical reconstruction methods and CNNs. Notable examples include Hamilton et al. (2019); Hamilton & Hauptmann (2018), where a CNN post-processes images obtained by the classical D-bar methods; Fan et al. (2019); Fan & Ying (2020), who develop BCR-Net to mimic the pseudo-differential operators appearing in many inverse problems; and the deep direct sampling method proposed in Guo & Jiang (2020); Jiang et al. (2021), which learns local convolutional kernels mimicking the gradient operator of DSM. Another example is radial basis function neural networks, seen in Hrabuska et al. (2018); Michalikova et al. (2014); Wang et al. (2021a). Nevertheless, convolutions in CNNs use locally supported kernels whose receptive field involves only a small neighborhood of a pixel; thus, layer-wise speaking, CNNs do not align well with the non-local nature of inverse problems. More recently, the learning of PDE-related forward problems using global kernels has gained traction, most notably the Fourier Neural Operator (FNO) (Li et al., 2021). FNO takes unique advantage of the low-rank nature of certain problems and proposes to train a kernel that is local in the frequency domain yet global in the spatial-temporal domain, mimicking the solution's kernel integral form. Others include DeepONets (Lu et al., 2021; Wang et al., 2021b; Jin et al., 2022) and Transformers (Kissas et al., 2022; Li et al., 2022; Cao, 2021).

Related studies on Transformers.

Attention mechanism-based models have become the state of the art in many areas since Vaswani et al. (2017), such as language tasks and computer vision, e.g., Ma et al. (2021); Dosovitskiy et al. (2021). One of the most important and attractive aspects of the attention mechanism is its unparalleled capability to model long-range interactions (Tay et al., 2021) through many efficient variants, see e.g., Katharopoulos et al. (2020); Choromanski et al. (2021); Nguyen et al. (2021). The relation of attention to kernel learning was first studied in Tsai et al. (2019) and later connected with random features (Peng et al., 2021). Connecting non-PDE-based integral kernels and the attention mechanism has been seen in Hutchinson et al. (2021); Guibas et al. (2022). Among inverse problems, Transformers have been applied in medical imaging applications including segmentation problems (Zhou et al., 2021; Hatamizadeh et al., 2022; Petit et al., 2021), X-ray (Tanzi et al., 2022), magnetic resonance imaging (MRI) (He et al., 2022), ultrasound (Perera et al., 2021), optical coherence tomography (OCT) (Song et al., 2021), etc. To the best of our knowledge, no work in the literature connects the attention mechanism in Transformers with the mathematical structure of PDE-based inverse problems.

2.1 Contributions


  • A structure-conforming network architecture. Inspired by the classical DSM, we decompose the inverse operator into a harmonic extension and an integral operator with learnable non-local kernels that has an attention-like structure. Additionally, the attention architecture is reinterpreted through a Fredholm integral operator to rationalize the application of the Transformer to the boundary value inverse problem.

  • Theoretical and experimental justification for the advantage of Transformers. We have proved that, in Transformers, a modified attention is able to represent target functions exhibiting higher-frequency content from lower-frequency input features. In the experiments, a comparative study further demonstrates a favorable match between the Transformer and the problem structure.

3 Interplay between mathematics and neural architectures

In this section, we use EIT, a classical boundary value inverse problem, as a prominent example to demonstrate that the triple tensor product in the attention mechanism matches extremely well the solution representation in the inverse operator theory of EIT. The proposed method is motivated by, and seeks to provide example answers to, the following questions:


  • (Q1) What is an appropriate finite dimensional data format of the boundary measurements as input to the neural network?

  • (Q2) Is there a suitable neural network in (3) taking advantage of the mathematical structure?

3.1 From EIT to Operator Learning

The forward model of EIT is given by the following second order elliptic partial differential equation


Given the domain, the goal is to identify the unknown inclusion buried in it. The coefficient values inside and outside the inclusion are two (approximately) known constants, discontinuous across the inclusion boundary; without loss of generality, they are normalized. Then, the coefficient of (4) to be recovered can be described by a characteristic function of the inclusion defined at each point, and the reconstruction target is the vector of its values at the grid points (see Figure 2 in Appendix B for an example discretization).

In the application of EIT, the boundary measurement space contains information on how the electric potential behaves on the boundary. By injecting a current on the boundary, solving (4) for a unique solution with this specific boundary information (Neumann boundary condition) reveals the potential's behavior on the whole domain. However, only the voltages on the boundary can be measured. This procedure is called the Neumann-to-Dirichlet (NtD) mapping:


The duality pairing of Sobolev spaces here is formal, and we refer the readers to Appendix A for more detailed descriptions of the function spaces. One measurement is a single current-voltage data pair, and all the measurements form an infinite-dimensional space:

If the NtD mapping is known, the theoretical uniqueness of the inclusion can be established (Brühl, 2001; Hanke & Brühl, 2003; Astala & Päivärinta, 2006; Nachman, 1996). Mathematically, this means that if one is able to measure all the current-to-voltage pairs over a set of basis functions of the Hilbert space, then the inverse operator is well-defined and admits an explicit form (Brühl, 2001).

We shall illustrate how the mathematical setup above is realized at the discrete level. For each fixed coefficient, the NtD mapping in (5) can be expressed as


where the inputs and outputs are tensors containing boundary data at grid points on the boundary. If the coefficient is known, then (6) is obtained through a typical PDE solver. For an unknown coefficient, using a significantly large number of current-voltage pairs, one can obtain a huge matrix approximating the NtD map on a reasonably fine grid. The mechanism here essentially results in a tensor2tensor mapping/operator from the boundary data to the images of the coefficient. In this case, the operator can be learned through a large number of data pairs. In particular, the BCR-Net (Fan & Ying, 2020) is a DNN approximation that falls into this category.
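A minimal numerical illustration of the discrete NtD matrix, for the one case with a closed form: the unit disk with homogeneous conductivity, where the Neumann datum cos(kθ) is mapped to cos(kθ)/k. In the paper's setting the conductivity contains an unknown inclusion and each column of the matrix would instead come from a numerical forward PDE solve; this analytic stand-in (all names are ours) only shows the tensor2tensor data format.

```python
import numpy as np

def ntd_matrix(n_bdry, n_modes=20):
    """Discrete NtD matrix on n_bdry equispaced points of the unit circle,
    for homogeneous conductivity: eigenmode cos/sin(k*theta), eigenvalue 1/k."""
    theta = 2 * np.pi * np.arange(n_bdry) / n_bdry
    L = np.zeros((n_bdry, n_bdry))
    for k in range(1, n_modes + 1):
        for basis in (np.cos, np.sin):
            f = basis(k * theta)
            f = f / np.linalg.norm(f)      # orthonormalize on the grid
            L += (1.0 / k) * np.outer(f, f)
    return theta, L

theta, L = ntd_matrix(n_bdry=64)
f = np.cos(2 * theta)                       # injected current pattern
g = L @ f                                   # measured voltage pattern
print(np.max(np.abs(g - np.cos(2 * theta) / 2)))   # closed form: cos(2*theta)/2
```

The decaying eigenvalues 1/k of this toy operator also mirror the eigenvalue decay invoked before Theorem 1: high-frequency current patterns produce exponentially weaker voltage responses, which is the root of the ill-posedness.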

However, as infinitely many, or even a large number of, boundary data pairs are not attainable in practice, the problem of more practical interest is to use only a few data pairs for reconstruction, too few to obtain any reasonably good approximation of the NtD map in (6). Let the data subspace be spanned by these few pairs. The highly nonlinear forward operator for finitely many pairs is still trivially well-defined. However, it is a long-standing open theoretical problem whether its inverse is well-defined; namely, it is unclear whether the inclusion can be exactly identified by finitely many measurements on the boundary.

Under our normalization assumption, the relevant operator is positive definite and has a decaying sequence of eigenvalues (Cheng et al., 1989). The following theorem justifies the practice of approximating the inverse operator when the number of measurements is large enough. Roughly speaking, as the inverse is well defined on the full data space, we can define an approximation to it from finitely many eigenfunctions.

Theorem 1.

Suppose that the 1D boundary data are the eigenfunctions of the operator corresponding to its leading eigenvalues, and let the 2D data functions be obtained by solving (7) with these boundary data. Then, for any prescribed tolerance, there exists a sufficiently large number of data functions such that the inverse operator is approximated within that tolerance.

The proof of Theorem 1 can be found in Appendix C. Nevertheless, the constructive proof of existence still relies on the entire NtD mapping, which again resorts to infiniteness and is thus inaccessible in real applications. Another valuable corollary of Theorem 1 is the justification of using the data functions as the input to neural networks; see Section 3.3 for more details.

Operator learning problems for EIT.

Many works have made efforts to derive a well-defined operator that estimates the inclusion, instead of producing an exact or accurate reconstruction (Brühl, 2001; Chow et al., 2014; Dunlop & Stuart, 2015), even from few measurements. Such effort could benefit modern operator learning methods. To this end, we first introduce several attainable approximations of infinite dimensional spaces by finite dimensional counterparts for the proposed method.

  1. Spatial discretization. Let a mesh of the domain with a given mesh spacing be given, and let the set of grid points represent the 2D discretization of continuous signals. Then a function defined almost everywhere in the domain can be approximated by the vector of its values at the grid points.

  2. Sampling of the coefficient. Samples of the inclusion with different shapes and locations, following a certain distribution, are used to approximate the coefficient space; the number of samples is usually large enough to represent field applications of interest. For each sample, a discretization of the coefficient on the grid and an associated data function are obtained.

  3. Sampling of the NtD mapping. The most highlighted approximation is that, for each sample of the coefficient, there are finitely many input-output data pairs sampling the NtD mapping, which are used to generate the channels of an input image. Despite the "sufficiently large" number needed in Theorem 1, in practice this number can be chosen to be very small and still yield satisfactory results.

Our task is to find a parameterized mapping approximating the inverse operator by minimizing


These parameters affect the finite approximation of the infinite-dimensional problem in the following way: the mesh spacing determines the resolution of the spatial approximation; the number of coefficient samples affects the representativity of the training data set; and the number of data pairs decides how much of the infinite spectral information of the NtD map can be accessed.
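In placeholder notation (the sample index, channel count, and network symbols below are ours, chosen to match the description above), the minimization (8) is an empirical risk over the sampled coefficients:

```latex
\min_{\theta}\;\; \frac{1}{M} \sum_{m=1}^{M}
\Big\| \mathcal{N}_{\theta}\big( U^{(m)} \big) - \chi^{(m)} \Big\|^{2}
% U^{(m)}: stack of N_f harmonic-extension channels for the m-th sample,
% \chi^{(m)}: discretized characteristic function of the m-th inclusion,
% \theta: network parameters; M: number of coefficient samples.
```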

3.2 From channels in attention to basis in integral transform

In this subsection, a modified attention mechanism is proposed as the basic block in the tensor2tensor-type mapping introduced in the next two subsections. Its reformulation conforms with one of the most used tools in applied mathematics: the integral transform. Solutions to many physical problems have this form, as it aggregates the interactions of a class of functions at different locations to characterize how the operator mapping behaves between infinite-dimensional spaces. More interestingly, in most applications, the interaction (kernel) does not have any explicit form, which meshes well with DL methodology philosophically. In fact, this is precisely the situation of the considered problem.

For simplicity, let the input of an encoder attention block have a given number of channels; then the query, key, and value are generated by three learnable projection matrices, with an expanded number of channels for the latent interactions. A modified dot-product attention is proposed as follows:


where two learnable normalizations are applied. Different from Nguyen & Salazar (2019); Xiong et al. (2020), this pre-inner-product normalization is applied right before the matrix multiplication of query and key, taking inspiration from the normalization in the index map kernel integral (13) and (22); see also Boyd (2001), where the normalization for orthogonal bases essentially uses the (pseudo)inverse of the Gram matrices. In practice, a cheap alternative is layer normalization (Ba et al., 2016) or batch normalization (Ioffe & Szegedy, 2015). A mesh-based weight is included so that the summation becomes an approximation to an integral quadrature.

To elaborate on these rationales, note that each entry of the attention matrix is the inner product of a normalized query row and a normalized key row. Thus, applying this to every column, attention (9) becomes a basis expansion representation:


Here, the attention matrix contains the coefficients for the linear combination of the rows of the value matrix; this set forms the value's row space, and it further forms each row of the output through multiplication. The kernel in (10) stands for the attention kernel, which aggregates the pixel-wise feature maps to measure how the input latent representations interact. Moreover, the latent representation in an encoder layer is spanned by the row space of the value matrix, and is nonlinearly updated layer-wise.
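A minimal single-head sketch of the modified dot-product attention (9), under our own naming and simplifications: layer normalization as the cheap stand-in for the learnable normalizations, no softmax, and a scalar quadrature weight w = h² so that each output row approximates a kernel integral over the 2D domain.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def modified_attention(X, Wq, Wk, Wv, h):
    """X: (n, d) flattened pixel features; h: mesh spacing; returns (n, d_c)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    Qn, Kn = layer_norm(Q), layer_norm(K)   # pre-inner-product normalization
    A = Qn @ Kn.T                            # non-local kernel matrix, no softmax
    w = h ** 2                               # 2D quadrature weight
    return w * (A @ V)                       # discrete kernel integral

rng = np.random.default_rng(0)
n, d, dc = 16 * 16, 8, 8                     # 16x16 grid flattened, 8 channels
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, dc)) for _ in range(3))
out = modified_attention(X, Wq, Wk, Wv, h=1.0 / 16)
print(out.shape)   # (256, 8)
```

Note the design choice this mirrors: without softmax the kernel matrix A can take negative values, so the row-wise weighted sum is a genuine quadrature of a signed kernel rather than a convex average.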

Further assume that there exists a set of feature maps for the normalized query and key, as well as the value, see e.g., Choromanski et al. (2021). Then a kernel can be defined by


Now the discrete kernel with vectorial input is rewritten as an integral kernel, and thus the dot-product attention is expressed as a nonlinear integral transform for each channel:


After plugging into a certain minimization such as (8), backpropagation updates the weights, which further leads to a new set of latent representations. This procedure can be viewed as an iterative method that updates the basis residing in each channel by solving a Fredholm integral equation of the first kind.


To connect attention with inverse problems, the multiplicative structure in the kernel integral form of attention (12) is particularly useful. This falls into the category of Pincherle-Goursat (degenerate) kernels (Kress, 1999, Chapter 11), and its approximability depends on the number of expanded channels, which decides the number of learned basis functions in expansion (10), a subset of which eventually forms a set of basis functions. Here we show the following theorem; heuristically, it says that given enough (but finitely many) channels of latent representations, the attention kernel integral is able to "bootstrap" in the frequency domain, that is, to generate an output representation with higher frequencies than the input. A similar approximation result is impossible for a CNN under the usual framelet/wavelet interpretation (Ye et al., 2018). The full proof, with a more rigorous setting, is in Appendix D.

Theorem 2 (Frequency bootstrapping).

Suppose there exists a channel such that the current finite-channel sum kernel approximates a non-separable kernel (e.g., a Fourier-type kernel) to a certain error under a certain norm. Then, there exists a set of weights such that a certain channel in the output of (10) approximates a higher-frequency target, with an error of the same order under the same norm.

The considered inverse problem is essentially to recover high-frequency eigenpairs based on low-frequency data, see e.g., Figure 1. The operator, together with all its spectral information, is determined by the recovered inclusion shape; thus the existence result in Theorem 2 justifies the advantages of the attention mechanism for the considered problem.

3.3 From harmonic extension to tensor-to-tensor

To establish the connection between the problem of interest and the attention used in Transformers, we begin with the case of a single measurement. In this setting, it is possible to derive an explicit and simple formula approximating the inverse, inspired by direct sampling methods (DSM) (Chow et al., 2014, 2015; Kang et al., 2018; Ito et al., 2013; Ji et al., 2019; Harris & Kleefeld, 2019; Ahn et al., 2020) for a larger family of inverse problems. For instance, an approximation modified from the one in Chow et al. (2014) reads as


where the boundary data measure the difference of NtD mappings. In particular, the harmonic extension is the solution to (7) with a single datum but with certain noise on the boundary. The vector function is called the probing direction and can be chosen empirically. Lastly, the probing function is the solution to:


where the space restricted to the boundary is equipped with a semi-norm, and the right-hand side is a function associated with the sampling point. Both the harmonic extension and the probing function can be computed efficiently by traditional fast PDE solvers, such as finite difference or finite element methods based on the discretization in Section 3.1. Essentially, (13) suggests that the derivative of the harmonic extension along a probing direction, scaled by a certain quantity, approximates the characteristic function, where the direction and scaling are based on empirical choices. Indeed, the reconstruction accuracy is greatly limited by the single measurement, the nonparametric ansatz, and the empirical choices. These restrictions leave room for DL methodology.

Despite being restrictive, the formulation in (13) offers inspiration for a potential answer to (Q1): the harmonic extension of the boundary data can be used as the input to a tensor2tensor-type DNN. Constructing a harmonic extension (2D features) from boundary data (a 1D signal input with limited embedding dimension) contributes to the desired high-quality reconstruction. First, harmonic functions are highly smooth away from the boundary; by PDE theory (Gilbarg & Trudinger, 2001, Chapter 8), the solution automatically smooths out the noise on the boundary, which makes the reconstruction highly robust with respect to noise (e.g., see Figure 1). Second, in terms of using a backbone network to generate features for downstream tasks, harmonic extensions can be understood as a problem-specific way to design higher dimensional feature maps (Álvarez et al., 2012), which renders samples more separable in a higher dimensional data manifold than with merely the boundary data.
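A self-contained toy of the harmonic extension feature map and its smoothing effect. For simplicity we impose the noisy boundary data as Dirichlet values on a unit-square grid and run Jacobi sweeps for Laplace's equation; the paper's feature map instead solves a Neumann problem with a fast PDE solver, and all names below are ours.

```python
import numpy as np

def harmonic_extension(bdry, n, iters=2000):
    """bdry: dict with 'top','bottom','left','right' arrays of length n.
    Jacobi iteration for the discrete Laplace equation on an n-by-n grid."""
    u = np.zeros((n, n))
    u[0, :], u[-1, :] = bdry["top"], bdry["bottom"]
    u[:, 0], u[:, -1] = bdry["left"], bdry["right"]
    for _ in range(iters):
        u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                                + u[1:-1, :-2] + u[1:-1, 2:])
    return u

n = 32
x = np.linspace(0, 1, n)
noise = 0.2 * np.random.default_rng(1).standard_normal(n)   # 20% noise level
bdry = {"top": np.sin(np.pi * x) + noise,    # noisy 1D measurement on one edge
        "bottom": np.zeros(n), "left": np.zeros(n), "right": np.zeros(n)}
u = harmonic_extension(bdry, n)
# discrete maximum principle: interior values stay within the boundary range,
# and the interior field is smooth even though the boundary datum is noisy
print(u[1:-1, 1:-1].max() <= bdry["top"].max())
```

The interior rows of `u` are convex averages of boundary values, which is exactly why high-frequency boundary noise decays rapidly away from the edge.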

Figure 1: Left plot: a decomposition of the inverse operator into a harmonic extension operator and a neural network: the 1D boundary data are preprocessed by extension into the domain interior to generate 2D feature maps, which are then input for reconstruction. Right two plots: harmonic extensions without and with 20% boundary noise, showing that the harmonic extension is robust to boundary noise.

The information of the inclusion is deeply hidden in the harmonic extension. As shown in Figure 1, one cannot observe any pattern of the inclusion directly from the data function; for more examples see Appendix B. This is different from, and much more difficult than, the inverse problems studied in Bhattacharya et al. (2021); Khoo et al. (2021), which aim to reconstruct 2D targets from much more informative 2D internal data.

As shown in Figure 1, the inverse operator can be nicely decoupled into the composition of a parametrized neural network operator and a non-learnable harmonic extension feature map.

3.4 From index map integral to Transformer

In this last subsection, the probing direction, the inner product, and the norm in (13) are used as ingredients to form a certain non-local learnable kernel integration. This non-localness is a fundamental trait of many inverse problems, in that the reconstruction at each point depends on the entire data function. Then, the discretization of the modified index function is shown to match the multiplicative structure of the modified attention mechanism in (9).

In the forthcoming derivations, the probing direction and a self-adjoint positive definite linear operator are shown to yield the emblematic query-key-value structure of attention. To this end, we make the following modifications and assumptions to the original index map in (13).


  • The reformulation of the index function is motivated by the heuristic that the global information of the data should be used to locate a point.


    In the degenerate case, (15) reverts to the original one in (13).

  • Similarly, the probing direction is reasonably assumed to have a global dependence on the data function.

  • In the quantity in (13), the key factor is assumed to have the following form:


    In Chow et al. (2014), it is shown that if the operator induces a kernel with a sharply peaked, Gaussian-like distribution, the index function in (13) achieves maximum values for points inside the inclusion.

Based on the assumptions (15) to (17), we derive a matrix representation approximating the new index function on a grid, which accords well with an attention-like architecture. Denote by


  • the vector that interpolates the data function at the grid points;

  • the discrete Laplacian from a finite element/finite difference discretization, coupled with the Neumann boundary condition and the zero-integral normalization condition;

  • the matrix that projects a vector defined at the grid points to one defined at the nodes on the boundary.

We shall discretize the variable by grid points in (15) and obtain an approximation to the integral:


We then consider (16) and focus on one component of the probing direction. With a suitable quadrature rule for the integral, it can be rewritten as


Next, we proceed to express the probing function by discretizing the variational form in (14) with a linear finite element method. Let the set of basis functions be given, and let a vector approximate the probing function for each fixed sampling point. Denote the vector approximating the data function and its restriction to the boundary nodes. Then, the finite element discretization yields the linear systems:


Note that the self-adjoint positive definite operator can be parameterized by a symmetric positive definite (SPD) matrix. We can then write the approximation


where the square-root factor exists as the matrix is SPD. The resulting quantity can be considered another learnable vector, since it comes from the learnable matrix. Putting (18), (19) and (17) into (15), we obtain


Now, using the notation from Section 3.2, we denote the learnable kernel matrices and an input vector as follows:


Then, we are able to rewrite (22) as


where the scaling is a constant, and both the square root and the reciprocal are taken element-wise. Here, we may define the three factors as the values, keys, and query. We can see that the right matrix multiplications in (9) are low-rank approximations to the ones above in the attention mechanism. Hence, based on (24), we essentially need to find a function resulting in a vector approximation to the true characteristic function


Thus, the expressions in (24) and (25) reveal that a Transformer is able to capture the classical formula (13) equipped with non-local learnable kernels. Moreover, when there are multiple data pairs, the data functions are generated by computing their harmonic extensions as in (7). Each harmonic extension is then treated as a channel of the input.

The derivation above suggests that the attention mechanism nicely fits the underlying mathematical structure. In summary, we propose a Transformer-based deep direct sampling method, and subsequently show that it performs significantly better than the CNN-based U-Net widely used in linear inverse problems (Guo & Jiang, 2020) and modern operator learners such as the Fourier Neural Operator (Li et al., 2021). In this regard, we provide a potential answer to question (Q2): the attention-based Transformer is better suited as it conforms more with the underlying mathematical structure, and both enjoy a global kernel formulation of the input data, which matches the long-range dependence nature of inverse problems.

4 Experiments

In this section we present experimental results showing the quality of the reconstruction using a single channel of the 2D harmonic extension feature from the 1D signal input. There are two baseline models for comparison: one is the CNN-based U-Net (Ronneberger et al., 2015); the other is the state-of-the-art operator learner, the Fourier Neural Operator (FNO) (Li et al., 2021). The Transformer model of interest is a drop-in replacement of the baseline U-Net, named the U-Integral Transformer (UIT). UIT uses the kernel-integral-inspired attention (9); we also compare UIT with the linear attention-based Hybrid U-Transformer of Gao et al. (2021), as well as a Hadamard product-based cross-attention U-Transformer of Wang et al. (2022). An ablation study is also performed by replacing the convolution layers in the U-Net with attention (9) on various mesh grid sizes, e.g., a U-Net with an attention block added at the coarsest level. For more details of the data setup, training, and evaluation in all experiments, please refer to Appendix B.

The comparison results can be found in Table 1. Because FNO keeps only the modes at the lower end of the spectra, it performs relatively poorly on this EIT benchmark, which requires recovering traits consisting of higher modes (sharp boundary edges of the inclusions) from lower modes (smooth harmonic extensions). Thanks to Theorem 2, attention-based models are capable of recovering a “high frequency target from low frequency data”, and in general outperform the CNN-based U-Net despite having only a fraction of the parameters. Another highlight is that, thanks to the unique PDE-based feature map through harmonic extension, the proposed models are extremely robust to noise: they can recover the buried domain under moderately large noise (5%) and even an extreme amount of noise (20%), which can be disastrous for many classical methods.

Model                | Relative error        | Position-wise cross entropy | Dice coefficient      | # params
U-Net (baseline)     |                       |                             |                       | 7.7m
U-Net + Coarse Attn  |                       |                             |                       | 8.4m
U-Net big            |                       |                             |                       | 31.0m
FNO2d (baseline)     |                       |                             |                       | 10.4m
FNO2d big            |                       |                             |                       | 33.6m
Cross-Attention UT   |                       |                             |                       | 11.4m
UIT+Softmax (ours)   | 0.159 / 0.261 / 0.269 | 0.0551 / 0.0969 / 0.0977    | 0.903 / 0.862 / 0.848 | 11.1m
UIT (ours)           | 0.163 / 0.261 / 0.272 | 0.0564 / 0.0967 / 0.0981    | 0.897 / 0.858 / 0.845 | 11.4m
UIT+ (ours)          | 0.147 / 0.250 / 0.254 | 0.0471 / 0.0882 / 0.0900    | 0.914 / 0.891 / 0.880 | 11.4m
Table 1: Evaluation metrics of the EIT benchmark tests; each metric is reported at three noise levels (: the normalized relative strength of the noise added to the boundary data before the harmonic extension, see Appendix B for details). Relative error and cross entropy: the closer to 0 the better; Dice coefficient: the closer to 1 the better.
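The metrics reported in Table 1 can be computed as in the following sketch (the exact normalization used in the paper may differ):

```python
import numpy as np

def relative_error(pred, truth):
    """Normalized L2 discrepancy between prediction and ground truth."""
    return np.linalg.norm(pred - truth) / np.linalg.norm(truth)

def dice_coefficient(pred, truth, threshold=0.5):
    """Dice = 2|A n B| / (|A| + |B|) of the thresholded binary masks;
    1 means a perfect overlap with the ground-truth inclusion."""
    a, b = pred > threshold, truth > threshold
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())
```

The Dice coefficient is the natural companion to the cross entropy here, since the target index map is position-wise binary.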

5 Conclusion

For a boundary value inverse problem, we propose a novel operator learner based on the mathematical structure of the inverse operator and the Transformer. The proposed architecture consists of two components: the first is a harmonic extension of the boundary data (a PDE-based feature map), and the second is a modified attention mechanism derived from the classical DSM by introducing learnable non-local integral kernels. The evaluation accuracy on the benchmark problems surpasses that of the widely used CNN-based U-Net and the best operator learner, FNO. This research strengthens the insight that attention is an adaptable neural architecture that can incorporate a priori mathematical knowledge to design more physics-compatible DNN architectures. However, we acknowledge some limitations: as with other operator learners, the data manifold on which the operator is learned is assumed to exhibit certain low-dimensional attributes that can be reasonably well approximated by a fixed number of bases.

Reproducibility Statement

This paper is reproducible. Experimental details for all empirical results described in this paper are provided in Appendix B. Additionally, we provide the PyTorch (Paszke et al., 2019) code for reproducing them in the supplemental material. The dataset used in this paper is publicly available. Formal proofs of all our theoretical results under a rigorous setting are provided in Appendices C-D.


This work is supported in part by National Science Foundation grants DMS-2012465 and DMS-2136075. No additional revenue is related to this work.


  • Adler & Guardo (1994) Andy Adler and Robert Guardo. A neural network image reconstruction technique for electrical impedance tomography. IEEE Trans Med Imaging, 13(4):594–600, 1994.
  • Ahn et al. (2020) Chi Young Ahn, Taeyoung Ha, and Won-Kwang Park. Direct sampling method for identifying magnetic inhomogeneities in limited-aperture inverse scattering problem. Computers & Mathematics with Applications, 80(12):2811–2829, 2020. ISSN 0898-1221. doi: URL
  • Álvarez et al. (2012) Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for Vector-Valued Functions: A Review. 2012.
  • Ammari & Kang (2004) Habib Ammari and Hyeonbae Kang. Reconstruction of Small Inhomogeneities from Boundary Measurements. Berlin: Springer, 2004.
  • Ammari & Kang (2007) Habib Ammari and Hyeonbae Kang. Polarization and Moment Tensors: With Applications to Inverse Problems and Effective Medium Theory. New York: Springer, 2007.
  • Astala & Päivärinta (2006) Kari Astala and Lassi Päivärinta. Calderón’s inverse conductivity problem in the plane. Ann. of Math., pp. 265–299, 2006.
  • Azzouz et al. (2007) Mustapha Azzouz, Martin Hanke, Chantal Oesterlein, and Karl Schilcher. The factorization method for electrical impedance tomography data from a new planar device. International journal of biomedical imaging, 2007:83016–83016, 2007. doi: 10.1155/2007/83016. URL
  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016. URL
  • Barbastathis et al. (2019) George Barbastathis, Aydogan Ozcan, and Guohai Situ. On the use of deep learning for computational imaging. Optica, 6(8):921–943, Aug 2019. doi: 10.1364/OPTICA.6.000921. URL
  • Ben Yedder et al. (2018) Hanene Ben Yedder, Aïcha BenTaieb, Majid Shokoufi, Amir Zahiremami, Farid Golnaraghi, and Ghassan Hamarneh. Deep learning based image reconstruction for diffuse optical tomography. In Florian Knoll, Andreas Maier, and Daniel Rueckert (eds.), Machine Learning for Medical Image Reconstruction, pp. 112–119, Cham, 2018. Springer International Publishing. ISBN 978-3-030-00129-2.
  • Bhattacharya et al. (2021) Kaushik Bhattacharya, Bamdad Hosseini, Nikola B. Kovachki, and Andrew M. Stuart. Model reduction and neural networks for parametric pdes. The SMAI journal of computational mathematics, 7, 2021.
  • Boyd (2001) John P Boyd. Chebyshev and Fourier spectral methods. Courier Corporation, 2001.
  • Brühl (2001) Martin Brühl. Explicit characterization of inclusions in electrical impedance tomography. SIAM Journal on Mathematical Analysis, 32(6):1327–1341, 2001. doi: 10.1137/S003614100036656X. URL
  • Cao (2021) Shuhao Cao. Choose a Transformer: Fourier or Galerkin. In Thirty-Fifth Conference on Neural Information Processing Systems (NeurIPS 2021), 2021. URL
  • Chan & Tai (2003) Tony F. Chan and Xue-Cheng Tai. Identification of discontinuous coefficients in elliptic problems using total variation regularization. SIAM J. Sci. Comput, 25(3):881–904, 2003.
  • Chen et al. (2021) Dongdong Chen, Julián Tachella, and Mike E Davies. Equivariant imaging: Learning beyond the range space. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4379–4388, 2021.
  • Chen et al. (2020) Junqing Chen, Ying Liang, and Jun Zou. Mathematical and numerical study of a three-dimensional inverse eddy current problem. SIAM J. on Appl. Math., 80(3):1467–1492, 2020.
  • Cheney (2001) Margaret Cheney. The linear sampling method and the MUSIC algorithm. Inverse Probl., 17(4):591–595, jul 2001. doi: 10.1088/0266-5611/17/4/301. URL
  • Cheng et al. (1989) Kuo-Sheng Cheng, David Isaacson, JC Newell, and David G Gisser. Electrode models for electric current computed tomography. IEEE. Trans. Biomed. Eng., 36(9):918–924, 1989.
  • Choromanski et al. (2021) Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with Performers. In International Conference on Learning Representations (ICLR), 2021. URL
  • Chow et al. (2014) Yat Tin Chow, Kazufumi Ito, and Jun Zou. A direct sampling method for electrical impedance tomography. Inverse Probl., 30(9):095003, 2014.
  • Chow et al. (2015) Yat Tin Chow, Kazufumi Ito, Keji Liu, and Jun Zou. Direct sampling method for diffusive optical tomography. SIAM J. Sci. Comput., 37(4):A1658–A1684, 2015.
  • Cover (1999) Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
  • Coxson et al. (2022) Adam Coxson, Ivo Mihov, Ziwei Wang, Vasil Avramov, Frederik Brooke Barnes, Sergey Slizovskiy, Ciaran Mullan, Ivan Timokhin, David Sanderson, Andrey Kretinin, et al. Machine learning enhanced electrical impedance tomography for 2d materials. Inverse Problems, 38(8):085007, 2022.
  • Culver et al. (2003) J. P. Culver, R. Choe, M. J. Holboke, L. Zubkov, T. Durduran, A. Slemp, V. Ntziachristos, B. Chance, and A. G. Yodh. Three-dimensional diffuse optical tomography in the parallel plane transmission geometry: Evaluation of a hybrid frequency domain/continuous wave clinical system for breast imaging. Medical Physics, 30(2):235–247, 2003. doi: URL
  • Dobson & Santosa (1994) David C Dobson and Fadil Santosa. An image-enhancement technique for electrical impedance tomography. Inverse Probl., 10(2):317, 1994.
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR 2021), 2021. URL
  • Dunlop & Stuart (2015) Matthew M. Dunlop and Andrew M. Stuart. The bayesian formulation of eit: Analysis and algorithms. arXiv:1508.04106v2, 2015.
  • Fan & Ying (2020) Yuwei Fan and Lexing Ying. Solving electrical impedance tomography with deep learning. J. Comput. Phys., 404:109119, 2020.
  • Fan et al. (2019) Yuwei Fan, Cindy Orozco Bohorquez, and Lexing Ying. BCR-Net: A neural network based on the nonstandard wavelet form. J. Comput. Phys., 384:1–15, 2019. ISSN 0021-9991. doi: URL
  • Feng et al. (2018) Jinchao Feng, Qiuwan Sun, Zhe Li, Zhonghua Sun, and Kebin Jia. Back-propagation neural network-based reconstruction algorithm for diffuse optical tomography. Journal of Biomedical Optics, 24(5):1 – 12, 2018. doi: 10.1117/1.JBO.24.5.051407. URL
  • Fernández-Fuentes et al. (2018) Xosé Fernández-Fuentes, David Mera, Andrés Gómez, and Ignacio Vidal-Franco. Towards a fast and accurate eit inverse problem solver: A machine learning approach. Electronics, 7(12), 2018. ISSN 2079-9292. doi: 10.3390/electronics7120422. URL
  • Gao et al. (2021) Yunhe Gao, Mu Zhou, and Dimitris N Metaxas. Utnet: a hybrid transformer architecture for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 61–71. Springer, 2021.
  • Gilbarg & Trudinger (2001) David Gilbarg and Neil S. Trudinger. Elliptic partial differential equations of second order, volume 224. Springer, New York, 2 edition, 2001.
  • Griffiths et al. (1999) H Griffiths, WR Stewart, and W Gough. Magnetic induction tomography. a measuring system for biological tissues. Ann N Y Acad Sci., 20(873), 1999.
  • Guibas et al. (2022) John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, and Bryan Catanzaro. Efficient token mixing for transformers via adaptive fourier neural operators. In International Conference on Learning Representations, 2022. URL
  • Guo & Jiang (2020) R. Guo and J. Jiang. Construct deep neural networks based on direct sampling methods for solving electrical impedance tomography. SIAM J. Sci. Comput., 43(3):B678–B711, 2020.
  • Guo et al. (2018) Ruchi Guo, Tao Lin, and Yanping Lin. A fixed mesh method with immersed finite elements for solving interface inverse problems. J. Sci. Comput., 79(1):148–175, 2018.
  • Hamilton et al. (2019) Sarah J Hamilton, Asko Hänninen, Andreas Hauptmann, and Ville Kolehmainen. Beltrami-net: domain-independent deep d-bar learning for absolute imaging with electrical impedance tomography (a-eit). Physiol Meas., 40(7):074002, 2019.
  • Hamilton & Hauptmann (2018) Sarah Jane Hamilton and Andreas Hauptmann. Deep d-bar: Real-time electrical impedance tomography imaging with deep neural networks. IEEE Trans Med Imaging, 37(10):2367–2377, 2018.
  • Hanke & Brühl (2003) Martin Hanke and Martin Brühl. Recent progress in electrical impedance tomography. Inverse Problems, 19(6):S65–S90, nov 2003. doi: 10.1088/0266-5611/19/6/055.
  • Harris & Kleefeld (2019) Isaac Harris and Andreas Kleefeld. Analysis of new direct sampling indicators for far-field measurements. Inverse Problems, 35(5):054002, apr 2019. doi: 10.1088/1361-6420/ab08be. URL
  • Hatamizadeh et al. (2022) A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Landman, H. R. Roth, and D. Xu. Unetr: Transformers for 3d medical image segmentation. In 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1748–1758, Los Alamitos, CA, USA, jan 2022. IEEE Computer Society. doi: 10.1109/WACV51458.2022.00181. URL
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • He et al. (2022) Sheng He, P. Ellen Grant, and Yangming Ou. Global-local transformer for brain age estimation. IEEE Transactions on Medical Imaging, 41(1):213–224, 2022. doi: 10.1109/TMI.2021.3108910.
  • Holder (2004) David S Holder. Electrical impedance tomography: methods, history and applications. CRC Press, 2004.
  • Hrabuska et al. (2018) Radek Hrabuska, Michal Prauzek, Marketa Venclikova, and Jaromir Konecny. Image reconstruction for electrical impedance tomography: Experimental comparison of radial basis neural network and gauss – newton method. IFAC-PapersOnLine, 51(6):438–443, 2018. ISSN 2405-8963. doi: URL 15th IFAC Conference on Programmable Devices and Embedded Systems PDeS 2018.
  • Hutchinson et al. (2021) Michael J Hutchinson, Charline Le Lan, Sheheryar Zaidi, Emilien Dupont, Yee Whye Teh, and Hyunjik Kim. Lietransformer: Equivariant self-attention for lie groups. In International Conference on Machine Learning, pp. 4533–4543. PMLR, 2021.
  • Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. PMLR, 2015.
  • Ito et al. (2013) Kazufumi Ito, Bangti Jin, and Jun Zou. A direct sampling method for inverse electromagnetic medium scattering. Inverse Probl., 29(9):095018, sep 2013. doi: 10.1088/0266-5611/29/9/095018. URL
  • Ji et al. (2019) Xia Ji, Xiaodong Liu, and Bo Zhang. Phaseless inverse source scattering problem: Phase retrieval, uniqueness and direct sampling methods. Journal of Computational Physics: X, 1:100003, 2019. ISSN 2590-0552. doi: URL
  • Jiang et al. (2021) Jiahua Jiang, Yi Li, and Ruchi Guo. Learn an index operator by cnn for solving diffusive optical tomography: a deep direct sampling method. SIAM J. Sci. Comput., 2021.
  • Jin et al. (2017) Kyong Hwan Jin, Michael T McCann, Emmanuel Froustey, and Michael Unser. Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing, 26(9):4509–4522, 2017.
  • Jin et al. (2022) Pengzhan Jin, Shuai Meng, and Lu Lu. Mionet: Learning multiple-input operators via tensor product. arXiv preprint arXiv:2202.06137, 2022.
  • Kang et al. (2017) Eunhee Kang, Junhong Min, and Jong Chul Ye. A deep convolutional neural network using directional wavelets for low-dose x-ray ct reconstruction. Med Phys, 44(10):e360–e375, Oct 2017. ISSN 2473-4209 (Electronic); 0094-2405 (Linking). doi: 10.1002/mp.12344.
  • Kang et al. (2018) Sangwoo Kang, Marc Lambert, and Won-Kwang Park. Direct sampling method for imaging small dielectric inhomogeneities: analysis and improvement. Inverse Probl., 34(9):095005, jul 2018. doi: 10.1088/1361-6420/aacf1d. URL
  • Katharopoulos et al. (2020) Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156–5165. PMLR, 2020.
  • Khoo et al. (2021) Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving parametric pde problems with artificial neural networks. European Journal of Applied Mathematics, 32(3):421–435, 2021.
  • Kirsch & Grinberg (2007) Andreas Kirsch and Natalia Grinberg. The factorization method for inverse problems, volume 36. OUP Oxford, 2007.
  • Kissas et al. (2022) Georgios Kissas, Jacob Seidman, Leonardo Ferreira Guilhoto, Victor M Preciado, George J Pappas, and Paris Perdikaris. Learning operators with coupled attention. arXiv preprint arXiv:2201.01032, 2022.
  • Kłosowski & Rymarczyk (2017) Grzegorz Kłosowski and Tomasz Rymarczyk. Using neural networks and deep learning algorithms in electrical impedance tomography. Informatyka, Automatyka, Pomiary w Gospodarce i Ochronie Środowiska, 7(3), 2017.
  • Knudsen et al. (2007) Kim Knudsen, Matti Lassas, Jennifer L. Mueller, and Samuli Siltanen. D‐bar method for electrical impedance tomography with discontinuous conductivities. SIAM Journal on Applied Mathematics, 67(3):893–913, 2007. doi: 10.1137/060656930. URL
  • Knudsen et al. (2009) Kim Knudsen, Matti Lassas, Jennifer L. Mueller, and Samuli Siltanen. Regularized d-bar method for the inverse conductivity problem. Inverse Probl. & Imaging, 3(4):599–624, 2009.
  • Kress (1999) Rainer Kress. Linear Integral Equations. Springer New York, 1999. doi: 10.1007/978-1-4612-0559-3. URL
  • Latif et al. (2019) Jahanzaib Latif, Chuangbai Xiao, Azhar Imran, and Shanshan Tu. Medical imaging using machine learning and deep learning algorithms: A review. In 2019 2nd International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), pp. 1–5, 2019. doi: 10.1109/ICOMET.2019.8673502.
  • Lee et al. (2011) Okkyun Lee, Jongmin Kim, Yoram Bresler, and Jong Chul Ye. Diffuse optical tomography using generalized music algorithm. In 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pp. 1142–1145, 2011. doi: 10.1109/ISBI.2011.5872603.
  • Li et al. (2022) Zijie Li, Kazem Meidani, and Amir Barati Farimani. Transformer for partial differential equations’ operator learning. arXiv preprint arXiv:2205.13671, 2022.
  • Li et al. (2021) Zongyi Li, Nikola Borislavov Kovachki, Kamyar Azizzadenesheli, Burigede liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations, 2021. URL
  • Lu et al. (2021) Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George E. Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021. doi: 10.1038/s42256-021-00302-5. URL
  • Ma et al. (2021) Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, and Furu Wei. Deltalm: Encoder-decoder pre-training for language generation and translation by augmenting pretrained multilingual encoders. arXiv preprint arXiv:2106.13736, 2021.
  • Marroquin et al. (1987) Jose Marroquin, Sanjoy Mitter, and Tomaso Poggio. Probabilistic solution of ill-posed problems in computational vision. Journal of the american statistical association, 82(397):76–89, 1987.
  • Martin & Idier (1997) Thierry Martin and Jérôme Idier. A FEM-based nonlinear map estimator in electrical impedance tomography. In Proceedings of ICIP, volume 2, pp. 684–687. IEEE, 1997.
  • Michalikova et al. (2014) Marketa Michalikova, Rawia Abed, Michal Prauzek, and Jiri Koziorek. Image reconstruction in electrical impedance tomography using neural network. In 2014 Cairo International Biomedical Engineering Conference (CIBEC), pp. 39–42. IEEE, 2014.
  • Nachman (1996) Adrian I. Nachman. Global uniqueness for a two-dimensional inverse boundary value problem. Annals of Mathematics, 143(1):71–96, 1996.
  • Nguyen et al. (2021) Tan M. Nguyen, Vai Suliafu, Stanley J. Osher, Long Chen, and Bao Wang. FMMformer: Efficient and Flexible Transformer via Decomposed Near-field and Far-field Attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  • Nguyen et al. (2018) Thanh C. Nguyen, Vy Bui, and George Nehmetallah. Computational optical tomography using 3-D deep convolutional neural networks. Optical Engineering, 57(4):1 – 11, 2018. doi: 10.1117/1.OE.57.4.043111. URL
  • Nguyen & Salazar (2019) Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention. In Proceedings of the 16th International Conference on Spoken Language Translation, Hong Kong, November 2-3 2019. Association for Computational Linguistics. URL
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 8024–8035. 2019. URL
  • Peng et al. (2021) Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. Random feature attention. In International Conference on Learning Representations, 2021. URL
  • Perera et al. (2021) Shehan Perera, Srikar Adhikari, and Alper Yilmaz. Pocformer: A lightweight transformer architecture for detection of covid-19 using point of care ultrasound. In ICIP, 2021.
  • Petit et al. (2021) Olivier Petit, Nicolas Thome, Clement Rambour, Loic Themyr, Toby Collins, and Luc Soler. U-net transformer: Self and cross attention for medical image segmentation. In Machine Learning in Medical Imaging: 12th International Workshop, MLMI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, September 27, 2021, Proceedings, pp. 267–276, Berlin, Heidelberg, 2021. Springer-Verlag. ISBN 978-3-030-87588-6. doi: 10.1007/978-3-030-87589-3_28. URL
  • Ren et al. (2020) Shangjie Ren, Kai Sun, Chao Tan, and Feng Dong. A two-stage deep learning method for robust shape reconstruction with electrical impedance tomography. IEEE Transactions on Instrumentation and Measurement, 69(7):4887–4897, 2020. doi: 10.1109/TIM.2019.2954722.
  • Rondi & Santosa (2001) Luca Rondi and Fadil Santosa. Enhanced electrical impedance tomography via the mumford-shah functional. ESAIM: COCV, 6:517–538, 2001. doi: 10.1051/cocv:2001121. URL
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Springer, 2015.
  • Smith & Topin (2019) Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, pp. 1100612. International Society for Optics and Photonics, 2019.
  • Song et al. (2021) Diping Song, Bin Fu, Fei Li, Jian Xiong, Junjun He, Xiulan Zhang, and Yu Qiao. Deep relation transformer for diagnosing glaucoma with optical coherence tomography and visual field function. IEEE Transactions on Medical Imaging, 40(9):2392–2402, 2021. doi: 10.1109/TMI.2021.3077484.
  • Tan et al. (2018) Chao Tan, Shuhua Lv, Feng Dong, and Masahiro Takei. Image reconstruction based on convolutional neural network for electrical resistance tomography. IEEE Sensors Journal, 19(1):196–204, 2018.
  • Tanzi et al. (2022) Leonardo Tanzi, Andrea Audisio, Giansalvo Cirrincione, Alessandro Aprato, and Enrico Vezzetti. Vision transformer for femur fracture classification. Injury, 2022. ISSN 0020-1383. doi: URL
  • Tarvainen et al. (2008) T. Tarvainen, M. Vauhkonen, and S.R. Arridge. Gauss–newton reconstruction method for optical tomography using the finite element solution of the radiative transfer equation. Journal of Quantitative Spectroscopy and Radiative Transfer, 109(17):2767–2778, 2008. ISSN 0022-4073. doi: URL
  • Tay et al. (2021) Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena : A benchmark for efficient transformers. In International Conference on Learning Representations (ICLR 2021), 2021. URL
  • Tehrani et al. (2012) J. Nasehi Tehrani, A. McEwan, C. Jin, and A. van Schaik. L1 regularization method in electrical impedance tomography by using the l1-curve (pareto frontier curve). Applied Mathematical Modelling, 36(3):1095–1105, 2012. ISSN 0307-904X. doi: URL
  • Tsai et al. (2019) Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4344–4353, Hong Kong, China, November 2019. doi: 10.18653/v1/D19-1443. URL
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS 2017), volume 30, 2017.
  • Vauhkonen et al. (1999) P.J. Vauhkonen, M. Vauhkonen, T. Savolainen, and J.P. Kaipio. Three-dimensional electrical impedance tomography based on the complete electrode model. IEEE Trans. Biomedical Engrg., 46(9):1150–1160, 1999. doi: 10.1109/10.784147.
  • Wang et al. (2022) Hongyi Wang, Shiao Xie, Lanfen Lin, Yutaro Iwamoto, Xian-Hua Han, Yen-Wei Chen, and Ruofeng Tong. Mixed transformer u-net for medical image segmentation. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2390–2394. IEEE, 2022.
  • Wang et al. (2021a) Huan Wang, Kai Liu, Yang Wu, Song Wang, Zheng Zhang, Fang Li, and Jiafeng Yao. Image reconstruction for electrical impedance tomography using radial basis function neural network based on hybrid particle swarm optimization algorithm. IEEE Sensors Journal, 21(2):1926–1934, 2021a. doi: 10.1109/JSEN.2020.3019309.
  • Wang et al. (2012) Qi Wang, Huaxiang Wang, Ronghua Zhang, Jinhai Wang, Yu Zheng, Ziqiang Cui, and Chengyi Yang. Image reconstruction based on l1 regularization and projection methods for electrical impedance tomography. Review of Scientific Instruments, 83(10):104707, 2012. doi: 10.1063/1.4760253. URL
  • Wang et al. (2021b) Sifan Wang, Hanwen Wang, and Paris Perdikaris. Learning the solution operator of parametric partial differential equations with physics-informed deeponets. Science advances, 7(40):eabi8605, 2021b.
  • Xiong et al. (2020) Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pp. 10524–10533. PMLR, 2020.
  • Ye et al. (2018) Jong Chul Ye, Yoseob Han, and Eunju Cha. Deep convolutional framelets: A general deep learning framework for inverse problems. SIAM Journal on Imaging Sciences, 11(2):991–1048, 2018.
  • Zhou et al. (2021) Hong-Yu Zhou, Jiansen Guo, Yinghao Zhang, Lequan Yu, Liansheng Wang, and Yizhou Yu. nnformer: Volumetric medical image segmentation via a 3d transformer. arXiv:2109.03201v6, 2021.
  • Zhu et al. (2018) Bo Zhu, Jeremiah Z Liu, Stephen F Cauley, Bruce R Rosen, and Matthew S Rosen. Image reconstruction by domain-transform manifold learning. Nature, 555(7697):487–492, Mar 2018. doi: 10.1038/nature25988.

Appendix A Table of Commonly Used Notations

Notation Meaning
, admissible Sobolev spaces for the inverse problems, which are
and for EIT with data pairs
the gradient vector of a function,
the delta function such that , .
normal derivative of , measures the rate of change along the direction of
NtD mapping from Neumann data
(how fast the solution changes toward the outward normal direction) to
Dirichlet data (the solution’s value along the tangential direction)
an underlying spatial domain in
a subdomain in (not necessarily topologically-connected)
, ’s and ’s boundary, -dimensional manifolds
, the Sobolev space of functions
, the bounded linear functional defined on
all such that ’s integral on vanishes
the seminorm defined for functions in
Table 2: Notations, listed approximately in order of appearance, and their meanings in this work.

Appendix B Experiment Set-up

b.1 Data generation and training

In the numerical examples, the data generation mainly follows what was featured in (Fan & Ying, 2020; Guo & Jiang, 2020). For examples, please refer to Figure 2. The computational domain is set to be , and the two media with different conductivities are with (inclusion) and (background). The inclusions are generated by four random ellipses, with the lengths of the semi-major and semi-minor axes sampled from and , respectively, and the rotation angle sampled from . There are 10800 samples in the training set, of which 20% are reserved for validation, and 2000 samples in the testing set for evaluation.
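A minimal sketch of this sampling procedure is below; the semi-axis ranges, center distribution, and grid size are illustrative placeholders for the elided values in the paper.

```python
import numpy as np

def random_ellipse_mask(n=64, n_ellipses=4, a_range=(0.1, 0.25),
                        b_range=(0.1, 0.25), seed=0):
    """Rasterize the union of randomly placed, rotated ellipses on an n x n
    grid over [-1, 1]^2, producing a binary inclusion map."""
    rng = np.random.default_rng(seed)
    xs = np.linspace(-1.0, 1.0, n)
    X, Y = np.meshgrid(xs, xs)
    mask = np.zeros((n, n), dtype=bool)
    for _ in range(n_ellipses):
        a, b = rng.uniform(*a_range), rng.uniform(*b_range)  # semi-axes
        cx, cy = rng.uniform(-0.5, 0.5, size=2)              # center
        theta = rng.uniform(0.0, np.pi)                      # rotation angle
        Xr = (X - cx) * np.cos(theta) + (Y - cy) * np.sin(theta)
        Yr = -(X - cx) * np.sin(theta) + (Y - cy) * np.cos(theta)
        mask |= (Xr / a) ** 2 + (Yr / b) ** 2 <= 1.0
    return mask
```

The binary mask is then mapped to the two conductivity values (inclusion vs. background) to obtain the coefficient field on the grid.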

Figure 2: Randomly selected samples of elliptic inclusions representing the coefficient used for training (left 1-4), and a Cartesian mesh with a grid point (right). In computation, the discretization of consists of values taken as at the mesh points inside and at the others.

The noise below (13) is assumed to be

where specifies the percentage of noise, and is a standard Gaussian distribution independent with respect to . As the noise is imposed merely pointwise, the boundary data can be highly rough even though the true data are smooth.
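One plausible reading of the (elided) noise formula is pointwise Gaussian noise rescaled so that a parameter delta controls the relative noise strength; the exact scaling in the paper may differ.

```python
import numpy as np

def add_relative_noise(f, delta, rng):
    """Add pointwise i.i.d. Gaussian noise, rescaled so that the relative
    perturbation ||f_noisy - f|| / ||f|| equals delta by construction."""
    noise = rng.standard_normal(f.shape)
    return f + delta * np.linalg.norm(f) / np.linalg.norm(noise) * noise
```

Because the noise is drawn independently at every grid point, the perturbed boundary data are rough even when the clean data are smooth, matching the remark above.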

Thanks to the position-wise binary nature of , another choice of loss function during training is the binary cross entropy , applied to a function in , which measures the distance between the ground truth and the network's prediction

Thanks to Pinsker's inequality (e.g., see (Cover, 1999, Section 11.6)), serves as a good upper bound for the square of the total variation, which itself bounds the -error from above given the boundedness of the position-wise values.
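For two Bernoulli distributions, Pinsker's inequality reads KL(p||q) >= 2(p - q)^2, i.e., the position-wise cross entropy (which equals KL up to the entropy of the target) controls the total variation. A small numpy sketch checks the bound numerically (the function name is ours):

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """Position-wise KL divergence between Bernoulli(p) and Bernoulli(q);
    the binary cross entropy minus the entropy of p equals this quantity."""
    p = np.clip(p, eps, 1.0 - eps)
    q = np.clip(q, eps, 1.0 - eps)
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))
```

Here |p - q| is the total variation distance between the two Bernoulli laws, so Pinsker's bound is 2|p - q|^2 <= KL(p||q).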

The training uses the 1cycle learning rate strategy (Smith & Topin, 2019) with a warm-up phase. Mini-batch Adam iterations are run for a total of 50 epochs with no extra regularization such as weight decay. The learning rate starts and ends at , and reaches its maximum of at the end of the -th epoch. The results demonstrated are obtained with a fixed random number generator seed; some testing results can be seen in Figure 5. All models are trained on an RTX 3090 or an A4000. The code to replicate the experiments is open-source and publicly available.
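The shape of such a warm-up-then-anneal schedule can be sketched as a piecewise-linear function; the rates and warm-up fraction below are placeholders, not the paper's values, and PyTorch's built-in OneCycleLR uses a slightly different (cosine) annealing by default.

```python
def one_cycle_lr(step, total_steps, lr_min=1e-5, lr_max=1e-3, pct_warmup=0.3):
    """Piecewise-linear 1cycle schedule: warm up from lr_min to lr_max over
    the first pct_warmup fraction of training, then anneal back to lr_min."""
    warmup_steps = int(pct_warmup * total_steps)
    if step < warmup_steps:
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    t = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return lr_max - (lr_max - lr_min) * t
```

The large peak rate early in training acts as a regularizer, which is consistent with training without weight decay as described above.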

Figure 3: Left: the training/testing pixel-wise binary cross entropy convergence for the CNN-based U-Net with 31 million parameters; a clear overfitting pattern is shown. Right: the training/testing convergence for the attention-based U-Transformer with 11.4 million parameters.
Figure 4: The harmonic extension feature maps (left 1-3, serving as different channels' inputs to the neural network) corresponding to a randomly chosen sample's inclusion map (right). No visible relevance to the ground truth is shown.
Figure 5: The neural network evaluation result for the inclusion in Figure 4 using various models. Top from left to right: U-Net baseline (7.7m) prediction with 1 channel (hidden channels on the finest grid, coarsest ); U-Net big (31m) prediction with 3 channels (hidden channels on the finest grid, coarsest ); Fourier Neural Operator (10.4m) prediction with 1 channel (48 hidden channels with 14 retained modes). Bottom from left to right: Fourier Neural Operator big (33m) prediction with 1 channel (64 hidden channels with 16 retained Fourier modes); UIT (11.4m) prediction with 1 channel (hidden channels on the finest grid, coarsest ); UIT (11.4m) prediction with 3 channels.

b.2 Network architecture

The U-Transformer architecture is a drop-in replacement of the standard CNN-based U-Net baseline model (7.7m) in Table 1. For a high level encoder-decoder schematics please see Figure 7.

Figure 6: Detailed flow of the modified 2D attention-based encoder layer using (9). : the number of channels in the input, : the number of expanded channels (for the basis expansion interpretation in Theorem D).

Positional embedding.

At each resolution, the 2D Euclidean coordinates of a regular grid undergo a channel expansion through a learnable linear layer and are then added to each latent representation.
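A minimal sketch of this positional embedding, assuming PyTorch; the class and argument names are illustrative, not the authors' implementation:

```python
import torch
from torch import nn

class CoordEmbedding2D(nn.Module):
    """Lift 2D grid coordinates to the latent channel dimension and add them."""

    def __init__(self, channels: int):
        super().__init__()
        # learnable channel expansion of the (x, y) coordinates
        self.lift = nn.Linear(2, channels)

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        # latent: (batch, channels, H, W)
        _, _, h, w = latent.shape
        ys, xs = torch.meshgrid(
            torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
        coords = torch.stack([xs, ys], dim=-1)       # (H, W, 2) regular grid
        pos = self.lift(coords).permute(2, 0, 1)     # (channels, H, W)
        return latent + pos.unsqueeze(0)             # broadcast over the batch

# usage: add coordinates to an 8-channel latent representation on a 16x16 grid
emb = CoordEmbedding2D(channels=8)
out = emb(torch.randn(2, 8, 16, 16))
```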

Double convolution block.

The double convolution block is modified from the one commonly seen in Computer Vision (CV) models, such as ResNet (He et al., 2016). We modify this block such that, when used in an attention block, the batch normalization (Ioffe & Szegedy, 2015) is replaced by the layer normalization (Ba et al., 2016), which can be understood as a learnable diagonal approximation to the inverse of the Gram matrices.
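The modification above can be sketched as follows; this is an illustrative PyTorch version, with the `use_layernorm` switch standing in for the attention-block variant (names are assumptions, not the authors' code):

```python
import torch
from torch import nn

class DoubleConv(nn.Module):
    """Two (3x3 conv -> normalization -> ReLU) layers.

    When `use_layernorm` is True (e.g., inside an attention block),
    batch normalization is swapped for layer normalization over the
    channel and spatial dimensions.
    """

    def __init__(self, in_ch: int, out_ch: int, size: int, use_layernorm: bool = False):
        super().__init__()

        def norm(c):
            return (nn.LayerNorm([c, size, size]) if use_layernorm
                    else nn.BatchNorm2d(c))

        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), norm(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), norm(out_ch), nn.ReLU(),
        )

    def forward(self, x):
        return self.block(x)

# usage: the layer-norm variant on a 16x16 grid
blk = DoubleConv(3, 8, size=16, use_layernorm=True)
y = blk(torch.randn(2, 3, 16, 16))
```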

Mesh-normalized attention.

The scaled dot-product attention in the network is chosen to be the softmax-free integral kernel attention in (9) with a mesh-based normalization. For a diagram of a single attention head, please refer to Figure 6.
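A single head of this softmax-free attention might look as follows; the mesh weight $h \approx 1/n$ replacing the softmax is the key point, while the class and variable names are illustrative assumptions:

```python
import torch
from torch import nn

class MeshKernelAttention(nn.Module):
    """Softmax-free integral-kernel attention with mesh normalization.

    The softmax is dropped; instead, the Q-K product is scaled by the
    mesh size h ~ 1/n so that the attention acts like a quadrature
    approximation of a kernel integral over the grid.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, dim), n = number of grid points
        n = x.shape[1]
        q, k, v = self.q(x), self.k(x), self.v(x)
        kernel = torch.bmm(q, k.transpose(1, 2)) / n  # learnable non-local kernel
        return torch.bmm(kernel, v) / n               # quadrature over the mesh

# usage: 64 grid points, 8 latent channels
attn = MeshKernelAttention(dim=8)
out = attn(torch.randn(2, 64, 8))
```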

Coarse-fine interaction in up blocks.

The convolution layer on the coarsest level is replaced by the attention with pre-inner-product normalization in Section 3.2. The skip connections from the encoder latent representations to those in the decoder are generated using an architecture similar to the cross attention used in (Petit et al., 2021): and are generated from the latent representation functions on the same coarser grid, such that the attention kernel measuring the interaction between different channels is built on the coarse grid, while is associated with a finer grid. The major differences between ours and the one in Petit et al. (2021) are that ours is inspired by the kernel integral for a PDE problem; thus, the modified attention in our method has (1) no softmax normalization and (2) no Hadamard product-type skip connection.
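One way to sketch this coarse-fine cross attention in PyTorch is below: the channel-channel kernel is assembled from the coarse-grid features and applied to the fine-grid features, with no softmax. The exact placement of the query/key/value maps and the additive skip are assumptions for illustration, not the authors' implementation:

```python
import torch
from torch import nn

class CoarseFineCrossAttention(nn.Module):
    """Cross attention from a coarse grid to a fine grid (softmax-free).

    The kernel measuring interaction between channels is built from
    coarse-grid features; the values live on the finer grid.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # fine: (batch, n_f, dim); coarse: (batch, n_c, dim)
        n_c = coarse.shape[1]
        q, k = self.q(coarse), self.k(coarse)
        v = self.v(fine)
        # channel-channel kernel assembled on the coarse grid, mesh-normalized
        kernel = torch.bmm(q.transpose(1, 2), k) / n_c  # (batch, dim, dim)
        # apply the kernel to the fine-grid values; additive (not Hadamard) skip
        return fine + torch.bmm(v, kernel)

# usage: 256 fine-grid points, 64 coarse-grid points, 8 channels
ca = CoarseFineCrossAttention(dim=8)
out = ca(torch.randn(2, 256, 8), torch.randn(2, 64, 8))
```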

Figure 7: The schematics of the U-Transformer that follows the standard U-Net. The input is the concatenation of discretizations of and . The output is the approximation to the index map . The numbers of latent basis functions (channels) are annotated below each representation.


convolution + ReLU;

: layer normalization or batch normalization;

: bilinear interpolation from the fine grid to the coarse grid;

: cross attention from the coarse grid to the fine grid;

: input and output discretized functions in certain Hilbert spaces.

Appendix C Proof of Theorem 1

Theorem 1 (A finite dimensional approximation of the index map).

Suppose the boundary data is the eigenfunction of corresponding to the -th eigenvalue , and let be the data functions generated by harmonic extensions


where . Define the space on (the spatial dimension ):


Then, for any , there exists a sufficiently large such that


Let be any probing direction in (Brühl, 2001; Hanke & Brühl, 2003). Define a function


By Theorem 4.1 in (Guo & Jiang, 2020), we can show that


As , it is increasing with respect to . Then, there is a constant such that , . Given any , there is an integer such that , . Define


Note the fundamental inequality , . Then, if , there holds

if , there holds