1 Introduction
Boundary value inverse problems aim to recover the internal structure or distribution of multiple media inside an object (a 2D reconstruction) from only the data available on the boundary (a 1D signal input). Such problems arise in many imaging techniques, e.g., electrical impedance tomography (EIT) (Holder, 2004), diffuse optical tomography (DOT) (Culver et al., 2003), and magnetic induction tomography (MIT) (Griffiths et al., 1999). Not needing any internal data renders these techniques generally noninvasive, safe, and cheap, and thus well suited for monitoring applications.
Let $\mathcal{F}$ be a nonlinear operator associated with a physical model governed by certain partial differential equations (PDEs) on a bounded domain $\Omega \subset \mathbb{R}^2$. For a measurement $g$ on the boundary $\partial\Omega$ and the coefficient $a$ of the underlying PDEs to be recovered, the forward PDE model is

(1)  $g = \mathcal{F}(a) + \epsilon$,
where $\epsilon$ is noise following a certain distribution. Compared with the classic linear inverse problems in computer vision and signal processing, e.g., image restoration (Marroquin et al., 1987), which recover a signal from a measurement corrupted by noise, the fundamental difference is that the forward operator here is highly nonlinear, and the underlying Hilbert spaces are usually infinite dimensional. In many situations, even though the inverse operator is theoretically well-defined, as the infinite-dimensional space is practically approximated by a finite-dimensional (sub)space, the problem of seeking the approximated inverse operator is usually highly ill-posed (not having a well-defined unique output) and poses great challenges to reconstruction algorithms.
2 Background, related work, and contributions
Classical iterative methods.
There are in general two types of methodology for solving inverse problems. The first is a large family of iterative or optimization-based methods (Dobson & Santosa, 1994; Martin & Idier, 1997; Chan & Tai, 2003; Vauhkonen et al., 1999; Guo et al., 2018; Rondi & Santosa, 2001; Chen et al., 2020). One usually looks for the desired coefficient $a$ by solving a minimization problem

(2)  $\min_{a} \; \|\mathcal{F}(a) - g\|^2 + \lambda R(a),$

where $\mathcal{F}$ is the forward map, $g$ the measurement, and $R(a)$ a regularization term to alleviate the ill-posedness; the design of $R$ plays a critical role in a successful reconstruction (Tarvainen et al., 2008; Tehrani et al., 2012; Wang et al., 2012). For almost all these iterative methods, due to the ill-posedness, the computation generally takes numerous iterations to converge, and the reconstruction is highly sensitive to noise. Besides, the forward map needs to be evaluated at each iteration, which is itself expensive as it requires solving forward PDE models.
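To make the role of the regularizer concrete, the following toy sketch contrasts a naive least-squares inversion with a Tikhonov-regularized one ($R(a) = \|a\|^2$) on a synthetic, ill-conditioned *linear* forward map; the map, the noise level, and the weight `lam` are illustrative assumptions, not quantities from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
# Synthetic ill-posed forward map: singular values decay over 8 orders of magnitude.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 10.0 ** (-np.linspace(0, 8, n))
F = U @ (s[:, None] * V.T)

a_true = np.sin(2 * np.pi * np.linspace(0, 1, n))
g = F @ a_true + 1e-4 * rng.standard_normal(n)   # noisy measurement

# Unregularized inversion amplifies the noise through the small singular values.
a_naive = np.linalg.solve(F, g)
# Tikhonov-regularized solution: argmin ||F a - g||^2 + lam ||a||^2.
lam = 1e-6
a_tik = np.linalg.solve(F.T @ F + lam * np.eye(n), F.T @ g)

err_naive = np.linalg.norm(a_naive - a_true) / np.linalg.norm(a_true)
err_tik = np.linalg.norm(a_tik - a_true) / np.linalg.norm(a_true)
```

Even a crude choice of `lam` reduces the reconstruction error by orders of magnitude here, which is the stabilizing effect the text describes; choosing the regularizer well is the hard part in practice.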
Direct methods.
The second methodology is to develop a well-defined mapping $\mathcal{G}_\theta$, parametrized with $\theta$ and empirically constructed to approximate the inverse map itself:

(3)  $\mathcal{G}_\theta \approx \mathcal{F}^{-1}.$

This approach is referred to as non-iterative or direct methods in the literature. Distinguished from iterative approaches, direct methods are in general highly problem-specific, as they are designed on an ad hoc basis from the mathematical structure of the various inverse operators. For instance, methods in EIT and DOT include factorization methods (Kirsch & Grinberg, 2007; Azzouz et al., 2007; Brühl, 2001; Hanke & Brühl, 2003), MUSIC-type algorithms (Cheney, 2001; Ammari & Kang, 2004, 2007; Lee et al., 2011), and the D-bar methods (Knudsen et al., 2007, 2009) based on a Fredholm integral equation (Nachman, 1996), among which are the direct sampling methods (DSM) (Chow et al., 2014, 2015; Kang et al., 2018; Ito et al., 2013; Ji et al., 2019; Harris & Kleefeld, 2019; Ahn et al., 2020), our focus in this work. These methods generally have a closed-form expression approximating the inverse map, and the parameters have clear mathematical meaning. For each fixed parameter choice, this procedure is usually much more stable than iterative approaches with respect to the input data. Furthermore, the evaluation for each given datum is distinctly fast as no iterative optimization is needed. However, such a simple closed form admitting efficient execution may not be available in practice, since some mathematical assumptions and derivations may not hold.
Boundary value inverse problems.
For most boundary value inverse problems in 2D, one key difference from, e.g., an image inverse problem is that the data are only available on 1D manifolds, yet are used to reconstruct 2D targets. The boundary data themselves generally involve a certain input-output structure, which adds more complexity. In Adler & Guardo (1994); Fernández-Fuentes et al. (2018); Feng et al. (2018), boundary measurements are collected and directly input into feedforward fully connected networks. As the data reside on different manifolds, special treatments are applied to the input data, such as employing pre-reconstruction stages to generate rough 2D inputs to CNNs (Ben Yedder et al., 2018; Ren et al., 2020).
Operator learning and inverse problems.
Solving an inverse problem is essentially approximating the inverse operator, but based on finitely many data. The emerging deep learning (DL) techniques make it possible to directly emulate operators. The concept greatly resembles the aforementioned classical direct methodology, but operator learners based on DNNs are generally considered black boxes. A natural question to ask is whether and how existing innovative neural architectures, usually designed for other purposes, can be modified to conform with the mathematical nature of the underlying problem, ultimately leading to structure-conforming DNNs. Inspired by classical direct methods, we try to improve the DNN reconstruction pipeline from the architectural perspective. In this regard, the proposed study provides a positive example toward a hopefully definitive answer to this question, bridging deep learning and conventional tasks in the physical sciences.
Operator learners.
Operator learning has become an active research field for inverse problems in recent years, especially in image reconstruction, where the CNN plays a central role; see, e.g., Kłosowski & Rymarczyk (2017); Nguyen et al. (2018); Tan et al. (2018); Jin et al. (2017); Kang et al. (2017); Barbastathis et al. (2019); Latif et al. (2019); Zhu et al. (2018); Chen et al. (2021); Coxson et al. (2022). Some efforts have been made to couple classical reconstruction methods and CNNs. Notable examples include Hamilton et al. (2019); Hamilton & Hauptmann (2018), where a CNN post-processes images obtained by the classical D-bar methods; Fan et al. (2019); Fan & Ying (2020), who develop BCR-Net to mimic the pseudodifferential operators appearing in many inverse problems; and the deep direct sampling method proposed in Guo & Jiang (2020); Jiang et al. (2021), which learns local convolutional kernels mimicking the gradient operator of the DSM. Another example is the radial basis function neural networks seen in Hrabuska et al. (2018); Michalikova et al. (2014); Wang et al. (2021a). Nevertheless, convolutions in CNNs use locally supported kernels whose receptive fields involve only a small neighborhood of a pixel; thus, layer-wise, the CNN does not align well with the nonlocal nature of inverse problems. More recently, the learning of PDE-related forward problems using global kernels has gained traction, most notably the Fourier Neural Operator (FNO) (Li et al., 2021). FNO takes unique advantage of the low-rank nature of certain problems and proposes to train a kernel that is local in the frequency domain yet global in the spatial-temporal domain, mimicking the solution's kernel integral form. Others include DeepONets (Lu et al., 2021; Wang et al., 2021b; Jin et al., 2022) and Transformers (Kissas et al., 2022; Li et al., 2022; Cao, 2021).
Related studies on Transformers.
Attention mechanism-based models have become the state of the art in many areas since Vaswani et al. (2017), such as language tasks and computer vision; see, e.g., Ma et al. (2021); Dosovitskiy et al. (2021). One of the most important and attractive aspects of the attention mechanism is its unparalleled capability to model long-range interactions (Tay et al., 2021) through many efficient variants; see, e.g., Katharopoulos et al. (2020); Choromanski et al. (2021); Nguyen et al. (2021). The relation of attention to kernel learning was first studied in Tsai et al. (2019) and later connected with random features (Peng et al., 2021). Connecting non-PDE-based integral kernels and the attention mechanism has been seen in Hutchinson et al. (2021); Guibas et al. (2022). Among inverse problems, Transformers have been applied in medical imaging applications including segmentation problems (Zhou et al., 2021; Hatamizadeh et al., 2022; Petit et al., 2021), X-ray (Tanzi et al., 2022), magnetic resonance imaging (MRI) (He et al., 2022), ultrasound (Perera et al., 2021), optical coherence tomography (OCT) (Song et al., 2021), etc. To the best of our knowledge, no work in the literature connects the attention mechanism in the Transformer with the mathematical structure of PDE-based inverse problems.
2.1 Contributions


A structure-conforming network architecture. Inspired by the classical DSM, we decompose the inverse operator into a harmonic extension and an integral operator with learnable nonlocal kernels that has an attention-like structure. Additionally, the attention architecture is reinterpreted through a Fredholm integral operator to rationalize the application of the Transformer to the boundary value inverse problem.

Theoretical and experimental justification of the advantage of Transformers. We prove that a modified attention in Transformers is able to represent target functions exhibiting higher-frequency natures from lower-frequency input features. In the experiments, a comparative study further demonstrates a favorable match between the Transformer and the problem structure.
3 Interplay between mathematics and neural architectures
In this section, we use EIT, a classical boundary value inverse problem, as a prominent example to walk through how the triple tensor product in the attention mechanism matches extremely well with representing a solution in the inverse operator theory of EIT. The proposed method is motivated by, and in pursuit of providing examples to answer, the following questions:


(Q1) What is an appropriate finite-dimensional data format of the boundary measurements as input to the neural network?

(Q2) Is there a suitable neural network in (3) taking advantage of the mathematical structure?
3.1 From EIT to operator learning
The forward model of EIT is given by a second-order elliptic partial differential equation with a Neumann boundary condition:

(4)  $-\nabla \cdot (a \nabla u) = 0 \ \text{in } \Omega, \qquad a\,\frac{\partial u}{\partial n} = g \ \text{on } \partial\Omega.$
Given the domain $\Omega$, the goal is to identify the unknown inclusion $D$ buried in $\Omega$. The coefficient takes two (approximately) known constant values that are discontinuous across the inclusion boundary $\partial D$. Without loss of generality, we assume the background value is $1$. Then, the coefficient of (4) to be recovered can be described by a characteristic function $\chi_D$ defined for each point $x \in \Omega$, taking one constant value for $x \in D$ and another for $x \notin D$; the unknown is then a vector of values at the grid points approximating $\chi_D$ (see Figure 2 in Appendix B for an example discretization).
In the application of EIT, the boundary measurement space contains the information of how the electric potential behaves on the boundary. By exerting a current on the boundary, solving (4) for a unique solution with this specific boundary information (Neumann boundary condition) reveals the potential's behavior on the whole domain $\Omega$. However, what can be measured is only the voltage on the boundary. This procedure is called the Neumann-to-Dirichlet (NtD) mapping:
(5)  $\Lambda_a : g \mapsto u|_{\partial\Omega}.$
The duality pairing for Sobolev spaces here is for formality, and we refer the readers to Appendix A for more detailed descriptions of the function spaces. One measurement is a single current-voltage data pair, and all the measurements form an infinite-dimensional space.
If the NtD mapping is known, the theoretical uniqueness of the inclusion can be established (Brühl, 2001; Hanke & Brühl, 2003; Astala & Päivärinta, 2006; Nachman, 1996). Mathematically, it means that if one is able to measure all the current-to-voltage pairs for a set of basis functions of the underlying Hilbert space, then the inverse map, which essentially recovers $\chi_D$, is well-defined and admits an explicit form (Brühl, 2001).
We shall illustrate how the mathematical setup above is realized at a discrete level. For each fixed coefficient, the NtD mapping in (5) can be expressed as
(6) 
where $G$ and $U$ are tensors containing the boundary current and voltage data at grid points on the boundary. If the coefficient is known, then (6) is obtained through a typical PDE solver. For an unknown coefficient, using a significantly large number of $(G, U)$ pairs, one can obtain a huge matrix approximating the NtD map on a reasonably fine grid. The mechanism here essentially results in a tensor2tensor mapping/operator from the boundary data to the image of the coefficient. In this case, the operator can be learned through a large number of data pairs. In particular, the BCR-Net (Fan & Ying, 2020) is a DNN approximation that falls into this category.
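As an illustration of how such current-voltage pairs can be generated, the sketch below assembles a crude finite-difference Neumann Laplacian on a square grid (unit conductivity, i.e., no inclusion) and samples the discrete NtD map with a few boundary current patterns; the grid size, solver, and current basis are all illustrative choices, not the discretization used in the paper.

```python
import numpy as np

m = 8                       # coarse grid purely for illustration
h = 1.0 / (m - 1)

def neumann_second_difference(m, h):
    # 1D second-difference matrix with reflecting (homogeneous Neumann) ends
    T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    T[0, 0] = T[-1, -1] = 1.0
    return T / h**2

T = neumann_second_difference(m, h)
I = np.eye(m)
A = np.kron(T, I) + np.kron(I, T)    # 2D Neumann Laplacian (singular: constants)

idx = np.arange(m * m).reshape(m, m)
# boundary node indices, ordered once around the square (no duplicates)
bdry = np.concatenate([idx[0, :], idx[1:, -1], idx[-1, -2::-1], idx[-2:0:-1, 0]])
nb = len(bdry)

def ntd(g):
    """Map a zero-mean boundary current g to the boundary voltage."""
    rhs = np.zeros(m * m)
    rhs[bdry] = g / h                # Neumann flux enters the right-hand side
    u, *_ = np.linalg.lstsq(A, rhs, rcond=None)  # min-norm solve fixes the constant
    u -= u.mean()
    return u[bdry]

# Sample the NtD map with a few low-frequency current patterns (3 channels).
t = 2.0 * np.pi * np.arange(nb) / nb
currents = np.stack([np.cos((k + 1) * t) for k in range(3)])
voltages = np.stack([ntd(g) for g in currents])
```

Stacking the sampled pairs column by column is exactly how a matrix approximation of the NtD map would be assembled when many pairs are available; the discrete map is symmetric and positive on zero-mean currents, which the assertions below check.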
However, as an infinite, or even large, number of boundary data pairs is not attainable in practice, the problem of more practical interest is to use only a few data pairs for reconstruction, with the number of pairs being far too small to obtain any reasonably good approximation of the NtD map in (6). Consider the subspace spanned by the finitely many boundary currents. Here, the highly nonlinear forward operator restricted to finitely many pairs is still trivially well-defined. However, it is a longstanding open theoretical problem whether its inverse is well-defined; namely, it is unclear whether the inclusion can be exactly identified by finitely many measurements on the boundary.
As we assume the inclusion contrast is positive, the operator given by the difference of NtD maps is positive definite, and it has a sequence of eigenvalues accumulating at zero (Cheng et al., 1989). The following theorem justifies the practice of approximating the inclusion when the number of measurements $K$ is large enough. Roughly speaking, as the inverse is well-defined in the continuum limit, when $K \to \infty$ we can define an approximation that converges to the true characteristic function.
Theorem 1.
Suppose that the 1D boundary data $g_k$ is the eigenfunction of the difference operator corresponding to the $k$-th eigenvalue, and let the 2D data functions be obtained by solving

(7)

with Neumann data determined by $g_k$. Define the resulting approximation to the characteristic function accordingly. Then, for any $\varepsilon > 0$, there exists a sufficiently large $K$ such that the approximation error is less than $\varepsilon$.
The proof of Theorem 1 can be found in Appendix C. Nevertheless, the constructive proof of the existence still relies on the entire NtD mapping, which again resorts to infinite data and is thus inaccessible in real applications. Another valuable corollary of Theorem 1 is that it justifies using the data functions as the input to neural networks; see Section 3.3 for more details.
Operator learning problems for EIT.
Many works have made efforts to derive a well-defined operator that estimates the inclusion instead of an exact or accurate reconstruction (Brühl, 2001; Chow et al., 2014; Dunlop & Stuart, 2015), even for a single measurement. Such efforts could benefit modern operator learning methods. To this end, we first introduce several attainable approximations of infinite-dimensional spaces by finite-dimensional counterparts for the proposed method.

(1) Spatial discretization. Let a mesh of $\Omega$ with mesh spacing $h$ be given, and let the set of grid points represent the 2D discretization for continuous signals. Then a function defined almost everywhere in $\Omega$ can be approximated by a vector of its grid values.

(2) Sampling of the coefficient, or equivalently the inclusion. $N$ samples of inclusions with different shapes and locations, following a certain distribution, are used to approximate the coefficient space; $N$ is usually large enough to represent the field applications of interest. For each sample of the inclusion, we take its discretization on the grid, and accordingly a data function is associated with it.

(3) Sampling of the NtD mapping. The most highlighted approximation is that, for each sample of the coefficient, there are $K$ input-output data pairs sampling the NtD mapping, which are used to generate the channels of an input image. Despite the "sufficiently large" $K$ needed in Theorem 1, in practice $K$ can be chosen to be very small and still yield satisfactory results.
Our task is to find a parameterized mapping $\mathcal{G}_\theta$ to approximate the inverse operator by minimizing the empirical loss
(8) 
The parameters affect the finite approximation to the infinite-dimensional problem in the following way: the mesh spacing $h$ determines the resolution of the approximation; $N$ affects the representativity of the training data set; $K$ decides how large a finite portion of the infinite spectral information of the NtD mapping can be accessed.
3.2 From channels in attention to basis in integral transform
In this subsection, a modified attention mechanism is proposed as the basic block in the tensor2tensor-type mapping introduced in the next two subsections. Its reformulation conforms with one of the most used tools in applied mathematics: the integral transform. Solutions to many physical problems take this form, as it aggregates the interactions of a class of functions at different locations to characterize how the operator behaves between infinite-dimensional spaces. More interestingly, in most applications, the interaction (kernel) does not have any explicit form, which meshes well with the DL methodology philosophically. In fact, this is precisely the situation of the considered problem.
For simplicity, let the input of an encoder attention block be $X \in \mathbb{R}^{n \times d}$ with $d$ channels; then the query $Q$, key $K$, and value $V$ are generated by three learnable projection matrices $W_Q, W_K, W_V \in \mathbb{R}^{d \times d'}$: $Q = X W_Q$, $K = X W_K$, $V = X W_V$. Here $d'$ is the number of expanded channels for the latent interactions. A modified dot-product attention is proposed as follows:

(9)  $\mathrm{Attn}(X) := \delta x \, \widetilde{Q} \widetilde{K}^{\top} V, \qquad \widetilde{Q} := L_Q(Q), \quad \widetilde{K} := L_K(K),$

where $L_Q$ and $L_K$ are two learnable normalizations. Different from Nguyen & Salazar (2019); Xiong et al. (2020), this pre-inner-product normalization is applied right before the matrix multiplication of query and key, which takes inspiration from the normalization in the index map kernel integrals (13) and (22); see also Boyd (2001), where the normalization for orthogonal bases essentially uses the (pseudo)inverse of the Gram matrices. In practice, its cheap alternative is chosen to be layer normalization (Ba et al., 2016) rather than batch normalization (Ioffe & Szegedy, 2015). The factor $\delta x$ is a mesh-based weight such that the summation becomes an approximation to an integral quadrature. To elaborate on these rationales, the $(i,j)$-th entry of $\widetilde{Q}\widetilde{K}^{\top}$ is the inner product $\langle \widetilde{q}_i, \widetilde{k}_j \rangle$ of the $i$-th normalized query row and the $j$-th normalized key row. Thus, applying this to every row index $i$, attention (9) becomes a basis expansion representation for the $i$-th output row $z_i$:

(10)  $z_i = \sum_{j=1}^{n} \delta x \, \langle \widetilde{q}_i, \widetilde{k}_j \rangle \, v_j.$

Here, the quantities $\delta x \langle \widetilde{q}_i, \widetilde{k}_j \rangle$ are the coefficients of the linear combination of the value rows $\{v_j\}_{j=1}^{n}$. This set forms the row space of $V$, and it further forms each row of the output through the linear combination above. The inner product in (10) stands for the attention kernel, which aggregates the pixel-wise feature maps to measure how the input latent representations interact. Moreover, the latent representation in an encoder layer is spanned by the row space of $V$, and is nonlinearly updated layer-wise.
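A minimal NumPy sketch of the modified attention, assuming layer normalization for both learnable normalizations and a uniform quadrature weight $\delta x = 1/n$; the shapes and the random weights are placeholders, not trained parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each row (feature vector) to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    sig = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sig + eps)

def modified_attention(X, Wq, Wk, Wv, h):
    # X: (n, d) pixel/token features; h: quadrature weight (mesh size) so that
    # the n-term sum approximates an integral. Softmax-free: the normalization
    # is applied *before* the query-key product, as in (9).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return h * (layer_norm(Q) @ layer_norm(K).T) @ V

rng = np.random.default_rng(0)
n, d, dk = 64, 16, 16
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, dk)) for _ in range(3))
Z = modified_attention(X, Wq, Wk, Wv, h=1.0 / n)
```

Each output row is exactly a weighted combination of the value rows, i.e., the basis expansion (10), which the test below verifies against an explicit sum.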
Assume further that there exists a set of feature maps for the normalized query and key, as well as for the value; see, e.g., Choromanski et al. (2021). For a feature map $\phi$ that lifts the normalized rows, a kernel can be defined by

(11)  $\kappa(\xi, \eta) := \langle \phi(\widetilde{q}(\xi)), \phi(\widetilde{k}(\eta)) \rangle.$

Now the discrete kernel with vectorial input is rewritten as an integral kernel $\kappa(\cdot,\cdot)$; thus the dot-product attention can be expressed as a nonlinear integral transform for the $c$-th channel:

(12)  $z_c(\xi) = \int_{\Omega} \kappa(\xi, \eta)\, v_c(\eta)\, \mathrm{d}\eta.$
After plugging this into a certain minimization such as (8), backpropagation updates the weights, which in turn leads to a new set of latent representations. This procedure can be viewed as an iterative method to update the basis residing in each channel by solving the Fredholm integral equation of the first kind in (12).
To connect attention with inverse problems, the multiplicative structure in the kernel integral form of attention (12) is particularly useful. It falls into the category of Pincherle-Goursat (degenerate) kernels (Kress, 1999, Chapter 11), and its approximability depends on the number of expanded channels, which decides the number of learned basis functions in expansion (10), a subset of which eventually forms a set of basis functions. Here we show the following theorem; heuristically, it says that, given enough but finitely many channels of latent representations, the attention kernel integral is able to "bootstrap" in the frequency domain, that is, generate an output representation with higher frequencies than the input. A similar approximation result is impossible for a CNN if one opts for the usual framelet/wavelet interpretation (Ye et al., 2018). The full proof with a more rigorous setting is in Appendix D.
Theorem 2 (Frequency bootstrapping).
Suppose there exists a channel in the latent representation such that, for some target mode, the current finite-channel sum kernel approximates a non-separable kernel (e.g., a Fourier-type kernel) to a given error under a certain norm. Then, there exists a set of projection weights such that a certain channel in the output of (10) approximates the corresponding higher-frequency target, with an error of the same order under the same norm.
The considered inverse problem is essentially to recover high-frequency eigenpairs of the NtD operator from low-frequency data; see, e.g., Figure 1. The operator, together with all its spectral information, can be determined by the recovered inclusion shape, so the existence result in Theorem 2 justifies the advantages of the attention mechanism for the considered problem.
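The "bootstrapping" phenomenon can be sanity-checked numerically: a pointwise product of signals, the multiplicative structure attention provides, creates Fourier modes absent from the input, which no fixed linear filter acting on the same input can do. The frequencies below are illustrative.

```python
import numpy as np

n = 128
x = np.arange(n) / n
u = np.sin(2 * np.pi * 3 * x)            # input contains mode 3 only

spec_in = np.abs(np.fft.rfft(u))
spec_prod = np.abs(np.fft.rfft(u * u))   # sin^2(t) = (1 - cos(2t)) / 2

# the product contains mode 6, which is entirely absent from the input
mode6_in, mode6_out = spec_in[6], spec_prod[6]
```

A convolution (a fixed multiplier per Fourier mode) can only rescale mode 6, which is zero in the input; the multiplicative interaction generates it from mode 3.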
3.3 From harmonic extension to tensortotensor
To establish the connection between the problem of interest and the attention used in Transformers, we begin with the case of a single measurement. In this setting, it is possible to derive an explicit and simple formula approximating the inverse operator, inspired by direct sampling methods (DSM) (Chow et al., 2014, 2015; Kang et al., 2018; Ito et al., 2013; Ji et al., 2019; Harris & Kleefeld, 2019; Ahn et al., 2020) for a larger family of inverse problems. For instance, an approximation modified from the one in Chow et al. (2014) reads as
(13) 
where the boundary data measures the difference of the NtD mappings. In particular, the data function is the solution to (7) but with certain noise on the boundary, i.e., the exact data is replaced by its noisy version; this solution is called the harmonic extension. The vector function is called the probing direction and can be chosen empirically. Lastly, the auxiliary function is the solution to:
(14) 
where the auxiliary function restricted to the boundary is equipped with a seminorm, and the source term is a function associated with the point being probed. Both the harmonic extension and the auxiliary functions can be computed efficiently by traditional fast PDE solvers, such as finite difference or finite element methods, based on the discretization in Section 3.1. Essentially, (13) suggests that the derivative of the data function along a probing direction, scaled by a suitable quantity, approximates the characteristic function, where the direction and the scaling are based on empirical choices. Indeed, the reconstruction accuracy is much limited by the single measurement, the nonparametric ansatz, and these empirical choices. These restrictions leave room for the DL methodology.
Despite being restrictive, the formulation in (13) offers inspiration for a potential answer to (Q1): the harmonic extension of the boundary data can be used as the input to a tensor2tensor-type DNN. Constructing the harmonic extension (2D features) from boundary data (a 1D signal input with limited embedding dimensions) can contribute to the desired high-quality reconstruction. First, harmonic functions are highly smooth away from the boundary: by PDE theory (Gilbarg & Trudinger, 2001, Chapter 8), the solution automatically smooths out the noise on the boundary, and thus enables the reconstruction to be highly robust with respect to noise (e.g., see Figure 1). Second, in terms of using a backbone network to generate features for downstream tasks, harmonic extensions can be understood as a problem-specific way to design higher-dimensional feature maps (Álvarez et al., 2012), which renders samples more separable on a higher-dimensional data manifold than the one with merely the boundary data.
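A sketch of the harmonic extension feature map on a square grid, illustrating the smoothing property: noisy 1D boundary data is extended into the 2D interior by solving Laplace's equation, and the interior values are automatically denoised. For brevity this uses Dirichlet boundary values rather than the Neumann data of the paper's setting; the grid size and noise level are arbitrary.

```python
import numpy as np

m = 32
rng = np.random.default_rng(0)
s = np.linspace(0.0, 1.0, m)

# Boundary values: a smooth signal plus noise on the bottom edge, noise on top.
u = np.zeros((m, m))
u[0, :] = np.sin(2 * np.pi * s) + 0.2 * rng.standard_normal(m)
u[-1, :] = 0.2 * rng.standard_normal(m)

# Assemble 5-point Laplacian rows for the interior unknowns.
idx = np.arange(m * m).reshape(m, m)
interior = idx[1:-1, 1:-1].ravel()
A = np.zeros((m * m, m * m))
for p in interior:
    i, j = divmod(p, m)
    A[p, p] = 4.0
    for q in (idx[i - 1, j], idx[i + 1, j], idx[i, j - 1], idx[i, j + 1]):
        A[p, q] = -1.0

# Solve Laplace's equation in the interior with the boundary values fixed.
ufull = u.ravel().copy()
b = -(A[interior] @ ufull)                       # boundary-neighbor contributions
ufull[interior] = np.linalg.solve(A[np.ix_(interior, interior)], b)
U = ufull.reshape(m, m)
```

The discrete maximum principle bounds the interior by the boundary values, and the oscillation of a mid-domain slice is far smaller than that of the noisy boundary row, which is the robustness-to-noise property used in the text.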
The information of the inclusion is deeply hidden in the data functions. As shown in Figure 1, one cannot observe any pattern of the inclusion directly from the data function; for more examples see Appendix B. This is different from, and much more difficult than, the inverse problems studied in Bhattacharya et al. (2021); Khoo et al. (2021), which aim to reconstruct 2D targets from the much more informative 2D internal data.
As shown in Figure 1, the approximated inverse operator can be nicely decoupled into a composition of a parametrized neural network operator and a non-learnable harmonic extension feature map.
3.4 From index map integral to Transformer
In this last subsection, the probing direction, the inner product, and the norm in (13) are used as ingredients to form a certain nonlocal learnable kernel integration. This nonlocalness is a fundamental trait of many inverse problems, in that the reconstruction at each point depends on the entire data function. Then, the discretization of the modified index function is shown to match the multiplicative structure of the modified attention mechanism in (9).
In the forthcoming derivations, the learnable ingredients, including a self-adjoint positive definite linear operator, are shown to yield the emblematic query-key-value structure of attention. To this end, we make the following modifications and assumptions to the original index map in (13).


Similarly, the probing direction is reasonably assumed to have a global dependence on the data function:
(16)
Based on the assumptions from (15) to (17), we derive a matrix representation approximating the new index function on a grid, which accords well with an attention-like architecture. Denote by


: the discrete Laplacian on the grid, from a finite element/finite difference discretization coupled with the Neumann boundary condition and the zero-integral normalization condition.

: the matrix that projects a vector defined at all grid points to one defined at the nodes on the boundary.
We shall discretize the variable by grid points in (15) and obtain an approximation to the integral:
(18) 
We then consider (16) and focus on one component of . With a suitable quadrature rule to compute the integral, it can be rewritten as
(19) 
Next, we proceed to express the auxiliary functions by discretizing the variational form in (14) with a linear finite element method. Let the set of nodal basis functions be given, and let the corresponding coefficient vector approximate the auxiliary function for each fixed probing point. Denote the vector approximating the data function and its restriction to the boundary nodes. Then, the finite element discretization yields the linear systems:
(20) 
Note that the self-adjoint positive definite operator can be parameterized by a symmetric positive definite (SPD) matrix. We can then approximate the corresponding term as
(21) 
where the square-root factor exists as the matrix is SPD. The resulting vector can be considered as another learnable quantity, since it comes from the learnable SPD matrix. Putting (18), (19) and (17) into (15), we obtain
(22) 
Now, using the notation from Section 3.2, we denote the learnable kernel matrices and an input vector as follows:
(23) 
Then, we are able to rewrite (22) as
(24) 
where the remaining factor is a constant, and both the nonlinearity and the norm are taken elementwise. Here, we may define the corresponding matrices as the values, keys, and queries. We can see that the right matrix multiplications in (9) are low-rank approximations to the ones above in the attention mechanism. Hence, based on (24), we essentially need to find a function resulting in a vector approximation to the true characteristic function
(25) 
Thus, the expressions in (24) and (25) reveal that a Transformer is able to capture the classical formula in (13) equipped with nonlocal learnable kernels. Moreover, when there are multiple data pairs, the data functions are generated by computing their harmonic extensions as in (7), and each harmonic extension is then treated as a channel of the input.
The derivation above suggests that the attention mechanism nicely fits the underlying mathematical structure. In summary, we propose to use a Transformer-based deep direct sampling method, and subsequently show that it performs significantly better than the CNN-based U-Net (Guo & Jiang, 2020) widely used in linear inverse problems, as well as a modern operator learner, the Fourier Neural Operator (Li et al., 2021). In this regard, we provide a potential answer to question (Q2): the attention-based Transformer is better suited as it conforms more with the underlying mathematical structure, and both enjoy a global kernel formulation using the input data, matching the long-range-dependence nature of inverse problems.
4 Experiments
In this section, we present experimental results to show the quality of the reconstruction using a single channel of the 2D harmonic extension feature computed from the 1D signal input. There are two baseline models to compare: one is the CNN-based U-Net (Ronneberger et al., 2015); the other is the state-of-the-art operator learner Fourier Neural Operator (FNO) (Li et al., 2021). The Transformer model of interest is a drop-in replacement of the baseline U-Net, and it is named the U-Integral Transformer (UIT). UIT uses the kernel-integral-inspired attention (9), and we also compare UIT with the linear-attention-based Hybrid U-Transformer in Gao et al. (2021), as well as a Hadamard-product-based cross-attention U-Transformer in Wang et al. (2022). An ablation study is also performed by replacing the convolution layers in the U-Net with attention (9) on various mesh grid sizes, e.g., a U-Net with an attention block added at the coarsest level. For more details of the data setup, training, and evaluation in all experiments, please refer to Appendix B.
The comparison results can be found in Table 1. Because FNO keeps only the modes in the lower end of the spectrum, it performs relatively poorly on this EIT benchmark, which requires recovering traits consisting of higher modes (sharp boundary edges of inclusions) from lower modes (smooth harmonic extensions). Thanks to Theorem 2, attention-based models are capable of recovering "high-frequency targets from low-frequency data", and in general outperform the CNN-based U-Net despite having only a fraction of the parameters. Another highlight is that, thanks to the unique PDE-based feature map through the harmonic extension, the proposed models are extremely robust to noise: they can recover the buried inclusion under a moderately large amount of noise (5%) and an extreme amount of noise (20%), which can be disastrous for many classical methods.
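The low-mode truncation that limits FNO here can be seen in a one-dimensional sketch of a spectral convolution layer: the learnable kernel acts only on the lowest Fourier modes, so the output can never contain frequencies beyond the cutoff. The weights below are random stand-ins, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, modes = 64, 8
x = np.linspace(0.0, 1.0, n, endpoint=False)
u = np.sin(2 * np.pi * x) + 0.3 * np.cos(6 * np.pi * x)   # modes 1 and 3 only

# Per-mode complex weights: a kernel that is diagonal in Fourier space.
W = rng.standard_normal(modes) + 1j * rng.standard_normal(modes)

def spectral_conv(u, W, modes):
    """Multiply the lowest Fourier modes by learnable weights; zero the rest."""
    u_hat = np.fft.rfft(u)
    out_hat = np.zeros_like(u_hat)
    out_hat[:modes] = W * u_hat[:modes]
    return np.fft.irfft(out_hat, n=len(u))

v = spectral_conv(u, W, modes)
```

The kernel is global in physical space (every output point depends on every input point), matching the nonlocality of the problem, but any spectral content above the cutoff, e.g., the sharp edges of an inclusion, is unreachable by such a layer alone.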
Table 1: Comparison of models. Each metric is reported at three noise levels; entries for the baseline rows were lost in extraction and are left blank.

Model                 | Relative error          | Position-wise cross entropy | Dice coefficient        | # params
U-Net (baseline)      |                         |                             |                         | 7.7m
U-Net + Coarse Attn   |                         |                             |                         | 8.4m
U-Net big             |                         |                             |                         | 31.0m
FNO2d (baseline)      |                         |                             |                         | 10.4m
FNO2d big             |                         |                             |                         | 33.6m
Cross-Attention UT    |                         |                             |                         | 11.4m
UIT + Softmax (ours)  | 0.159 / 0.261 / 0.269   | 0.0551 / 0.0969 / 0.0977    | 0.903 / 0.862 / 0.848   | 11.1m
UIT (ours)            | 0.163 / 0.261 / 0.272   | 0.0564 / 0.0967 / 0.0981    | 0.897 / 0.858 / 0.845   | 11.4m
UIT+ (ours)           | 0.147 / 0.250 / 0.254   | 0.0471 / 0.0882 / 0.0900    | 0.914 / 0.891 / 0.880   | 11.4m
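For reference, the three metrics in Table 1 can be computed as follows; this is our reading of the standard definitions, and the masks and probability map are toy examples.

```python
import numpy as np

def relative_error(pred, true):
    # relative L2 error between the predicted and true coefficient fields
    return np.linalg.norm(pred - true) / np.linalg.norm(true)

def positionwise_cross_entropy(prob, target, eps=1e-8):
    # target: 0/1 inclusion mask; prob: predicted probability map, averaged per pixel
    return float(-np.mean(target * np.log(prob + eps)
                          + (1.0 - target) * np.log(1.0 - prob + eps)))

def dice(pred_mask, true_mask, eps=1e-8):
    # overlap metric for the thresholded inclusion masks
    inter = np.logical_and(pred_mask, true_mask).sum()
    return 2.0 * inter / (pred_mask.sum() + true_mask.sum() + eps)

true = np.zeros((8, 8)); true[2:5, 2:5] = 1.0
prob = np.clip(true + 0.1, 0.0, 1.0)          # a slightly "uncertain" prediction
d = dice(prob > 0.5, true > 0.5)
ce = positionwise_cross_entropy(prob, true)
```

Lower is better for the first two metrics, higher for the Dice coefficient, consistent with the ordering of the UIT rows in the table.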
5 Conclusion
For a boundary value inverse problem, we propose a novel operator learner based on the mathematical structure of the inverse operator and the Transformer. The proposed architecture consists of two components: the first is a harmonic extension of the boundary data (a PDE-based feature map), and the second is a modified attention mechanism derived from the classical DSM by introducing learnable nonlocal integral kernels. The evaluation accuracy on the benchmark problems surpasses the widely used CNN-based U-Net and the state-of-the-art operator learner FNO. This research strengthens the insight that attention is an adaptable neural architecture that can incorporate a priori mathematical knowledge to design more physics-compatible DNN architectures. However, we acknowledge some limitations: similar to other operator learners, the data manifold on which the operator is learned is assumed to exhibit certain low-dimensional attributes that can be reasonably well approximated by a fixed number of bases.
Reproducibility Statement
This paper is reproducible. Experimental details for all empirical results described in this paper are provided in Appendix B. Additionally, we provide the PyTorch (Paszke et al., 2019) code for reproduction in the supplemental material. The dataset used in this paper is available at https://www.kaggle.com/datasets/scaomath/eittransformer. Formal proofs under a rigorous setting of all our theoretical results are provided in Appendices C and D.
Acknowledgments
This work is supported in part by National Science Foundation grants DMS-2012465 and DMS-2136075. No additional revenues are related to this work.
References
 Adler & Guardo (1994) Andy Adler and Robert Guardo. A neural network image reconstruction technique for electrical impedance tomography. IEEE Trans Med Imaging, 13(4):594–600, 1994.
 Ahn et al. (2020) Chi Young Ahn, Taeyoung Ha, and WonKwang Park. Direct sampling method for identifying magnetic inhomogeneities in limitedaperture inverse scattering problem. Computers & Mathematics with Applications, 80(12):2811–2829, 2020. ISSN 08981221. doi: https://doi.org/10.1016/j.camwa.2020.10.009. URL https://www.sciencedirect.com/science/article/pii/S0898122120304089.
 Álvarez et al. (2012) Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for VectorValued Functions: A Review. 2012.
 Ammari & Kang (2004) Habib Ammari and Hyeonbae Kang. Reconstruction of Small Inhomogeneities from Boundary Measurements. Berlin: Springer, 2004.

 Ammari & Kang (2007) Habib Ammari and Hyeonbae Kang. Polarization and Moment Tensors: With Applications to Inverse Problems and Effective Medium Theory. New York: Springer, 2007.
 Astala & Päivärinta (2006) Kari Astala and Lassi Päivärinta. Calderón’s inverse conductivity problem in the plane. Ann. of Math., pp. 265–299, 2006.
 Azzouz et al. (2007) Mustapha Azzouz, Martin Hanke, Chantal Oesterlein, and Karl Schilcher. The factorization method for electrical impedance tomography data from a new planar device. International journal of biomedical imaging, 2007:83016–83016, 2007. doi: 10.1155/2007/83016. URL https://pubmed.ncbi.nlm.nih.gov/18350126.
 Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016. URL https://arxiv.org/abs/1607.06450.
 Barbastathis et al. (2019) George Barbastathis, Aydogan Ozcan, and Guohai Situ. On the use of deep learning for computational imaging. Optica, 6(8):921–943, Aug 2019. doi: 10.1364/OPTICA.6.000921. URL http://www.osapublishing.org/optica/abstract.cfm?URI=optica68921.
 Ben Yedder et al. (2018) Hanene Ben Yedder, Aïcha BenTaieb, Majid Shokoufi, Amir Zahiremami, Farid Golnaraghi, and Ghassan Hamarneh. Deep learning based image reconstruction for diffuse optical tomography. In Florian Knoll, Andreas Maier, and Daniel Rueckert (eds.), Machine Learning for Medical Image Reconstruction, pp. 112–119, Cham, 2018. Springer International Publishing. ISBN 9783030001292.
 Bhattacharya et al. (2021) Kaushik Bhattacharya, Bamdad Hosseini, Nikola B. Kovachki, and Andrew M. Stuart. Model reduction and neural networks for parametric pdes. The SMAI journal of computational mathematics, 7, 2021.
 Boyd (2001) John P Boyd. Chebyshev and Fourier spectral methods. Courier Corporation, 2001.
 Brühl (2001) Martin Brühl. Explicit characterization of inclusions in electrical impedance tomography. SIAM Journal on Mathematical Analysis, 32(6):1327–1341, 2001. doi: 10.1137/S003614100036656X. URL https://doi.org/10.1137/S003614100036656X.
 Cao (2021) Shuhao Cao. Choose a Transformer: Fourier or Galerkin. In ThirtyFifth Conference on Neural Information Processing Systems (NeurIPS 2021), 2021. URL https://openreview.net/forum?id=ssohLcmn4r.
 Chan & Tai (2003) Tony F. Chan and XueCheng Tai. Identification of discontinuous coefficients in elliptic problems using total variation regularization. SIAM J. Sci. Comput, 25(3):881–904, 2003.
 Chen et al. (2021) Dongdong Chen, Julián Tachella, and Mike E Davies. Equivariant imaging: Learning beyond the range space. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4379–4388, 2021.
 Chen et al. (2020) Junqing Chen, Ying Liang, and Jun Zou. Mathematical and numerical study of a threedimensional inverse eddy current problem. SIAM J. on Appl. Math., 80(3):1467–1492, 2020.
 Cheney (2001) Margaret Cheney. The linear sampling method and the MUSIC algorithm. Inverse Probl., 17(4):591–595, jul 2001. doi: 10.1088/02665611/17/4/301. URL https://doi.org/10.1088/02665611/17/4/301.
 Cheng et al. (1989) KuoSheng Cheng, David Isaacson, JC Newell, and David G Gisser. Electrode models for electric current computed tomography. IEEE. Trans. Biomed. Eng., 36(9):918–924, 1989.
 Choromanski et al. (2021) Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with Performers. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=Ua6zuk0WRH.
 Chow et al. (2014) Yat Tin Chow, Kazufumi Ito, and Jun Zou. A direct sampling method for electrical impedance tomography. Inverse Probl., 30(9):095003, 2014.
 Chow et al. (2015) Yat Tin Chow, Kazufumi Ito, Keji Liu, and Jun Zou. Direct sampling method for diffusive optical tomography. SIAM J. Sci. Comput., 37(4):A1658–A1684, 2015.
 Cover (1999) Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
 Coxson et al. (2022) Adam Coxson, Ivo Mihov, Ziwei Wang, Vasil Avramov, Frederik Brooke Barnes, Sergey Slizovskiy, Ciaran Mullan, Ivan Timokhin, David Sanderson, Andrey Kretinin, et al. Machine learning enhanced electrical impedance tomography for 2d materials. Inverse Problems, 38(8):085007, 2022.
 Culver et al. (2003) J. P. Culver, R. Choe, M. J. Holboke, L. Zubkov, T. Durduran, A. Slemp, V. Ntziachristos, B. Chance, and A. G. Yodh. Threedimensional diffuse optical tomography in the parallel plane transmission geometry: Evaluation of a hybrid frequency domain/continuous wave clinical system for breast imaging. Medical Physics, 30(2):235–247, 2003. doi: https://doi.org/10.1118/1.1534109. URL https://aapm.onlinelibrary.wiley.com/doi/abs/10.1118/1.1534109.
 Dobson & Santosa (1994) David C Dobson and Fadil Santosa. An imageenhancement technique for electrical impedance tomography. Inverse Probl., 10(2):317, 1994.
 Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR 2021), 2021. URL https://openreview.net/forum?id=YicbFdNTTy.
 Dunlop & Stuart (2015) Matthew M. Dunlop and Andrew M. Stuart. The bayesian formulation of eit: Analysis and algorithms. arXiv:1508.04106v2, 2015.
 Fan & Ying (2020) Yuwei Fan and Lexing Ying. Solving electrical impedance tomography with deep learning. J. Comput. Phys., 404:109119, 2020.
 Fan et al. (2019) Yuwei Fan, Cindy Orozco Bohorquez, and Lexing Ying. BCRNet: A neural network based on the nonstandard wavelet form. J. Comput. Phys., 384:1–15, 2019. ISSN 00219991. doi: https://doi.org/10.1016/j.jcp.2019.02.002. URL https://www.sciencedirect.com/science/article/pii/S0021999119300762.
 Feng et al. (2018) Jinchao Feng, Qiuwan Sun, Zhe Li, Zhonghua Sun, and Kebin Jia. Backpropagation neural networkbased reconstruction algorithm for diffuse optical tomography. Journal of Biomedical Optics, 24(5):1 – 12, 2018. doi: 10.1117/1.JBO.24.5.051407. URL https://doi.org/10.1117/1.JBO.24.5.051407.
 FernándezFuentes et al. (2018) Xosé FernándezFuentes, David Mera, Andrés Gómez, and Ignacio VidalFranco. Towards a fast and accurate eit inverse problem solver: A machine learning approach. Electronics, 7(12), 2018. ISSN 20799292. doi: 10.3390/electronics7120422. URL https://www.mdpi.com/20799292/7/12/422.
 Gao et al. (2021) Yunhe Gao, Mu Zhou, and Dimitris N Metaxas. Utnet: a hybrid transformer architecture for medical image segmentation. In International Conference on Medical Image Computing and ComputerAssisted Intervention, pp. 61–71. Springer, 2021.
 Gilbarg & Trudinger (2001) David Gilbarg and Neil S. Trudinger. Elliptic partial differential equations of second order, volume 224. Springer, New York, 2 edition, 2001.
 Griffiths et al. (1999) H Griffiths, WR Stewart, and W Gough. Magnetic induction tomography. a measuring system for biological tissues. Ann N Y Acad Sci., 20(873), 1999.
 Guibas et al. (2022) John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, and Bryan Catanzaro. Efficient token mixing for transformers via adaptive fourier neural operators. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=EXHGA3jlM.
 Guo & Jiang (2020) R. Guo and J. Jiang. Construct deep neural networks based on direct sampling methods for solving electrical impedance tomography. SIAM J. Sci. Comput., 43(3):B678–B711, 2020.
 Guo et al. (2018) Ruchi Guo, Tao Lin, and Yanping Lin. A fixed mesh method with immersed finite elements for solving interface inverse problems. J. Sci. Comput., 79(1):148–175, 2018.
 Hamilton et al. (2019) Sarah J Hamilton, Asko Hänninen, Andreas Hauptmann, and Ville Kolehmainen. Beltraminet: domainindependent deep dbar learning for absolute imaging with electrical impedance tomography (aeit). Physiol Meas., 40(7):074002, 2019.
 Hamilton & Hauptmann (2018) Sarah Jane Hamilton and Andreas Hauptmann. Deep dbar: Realtime electrical impedance tomography imaging with deep neural networks. IEEE Trans Med Imaging, 37(10):2367–2377, 2018.
 Hanke & Brühl (2003) Martin Hanke and Martin Brühl. Recent progress in electrical impedance tomography. Inverse Probl., 19(6):S65–S90, nov 2003. doi: 10.1088/02665611/19/6/055. URL https://doi.org/10.1088/02665611/19/6/055.
 Harris & Kleefeld (2019) Isaac Harris and Andreas Kleefeld. Analysis of new direct sampling indicators for farfield measurements. Inverse Problems, 35(5):054002, apr 2019. doi: 10.1088/13616420/ab08be. URL https://doi.org/10.1088/13616420/ab08be.
 Hatamizadeh et al. (2022) A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Landman, H. R. Roth, and D. Xu. Unetr: Transformers for 3d medical image segmentation. In 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1748–1758, Los Alamitos, CA, USA, jan 2022. IEEE Computer Society. doi: 10.1109/WACV51458.2022.00181. URL https://doi.ieeecomputersociety.org/10.1109/WACV51458.2022.00181.

 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 He et al. (2022) Sheng He, P. Ellen Grant, and Yangming Ou. Global-local transformer for brain age estimation. IEEE Transactions on Medical Imaging, 41(1):213–224, 2022. doi: 10.1109/TMI.2021.3108910.
 Holder (2004) David S Holder. Electrical impedance tomography: methods, history and applications. CRC Press, 2004.
 Hrabuska et al. (2018) Radek Hrabuska, Michal Prauzek, Marketa Venclikova, and Jaromir Konecny. Image reconstruction for electrical impedance tomography: Experimental comparison of radial basis neural network and gauss – newton method. IFACPapersOnLine, 51(6):438–443, 2018. ISSN 24058963. doi: https://doi.org/10.1016/j.ifacol.2018.07.114. URL https://www.sciencedirect.com/science/article/pii/S2405896318308589. 15th IFAC Conference on Programmable Devices and Embedded Systems PDeS 2018.
 Hutchinson et al. (2021) Michael J Hutchinson, Charline Le Lan, Sheheryar Zaidi, Emilien Dupont, Yee Whye Teh, and Hyunjik Kim. Lietransformer: Equivariant selfattention for lie groups. In International Conference on Machine Learning, pp. 4533–4543. PMLR, 2021.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. PMLR, 2015.
 Ito et al. (2013) Kazufumi Ito, Bangti Jin, and Jun Zou. A direct sampling method for inverse electromagnetic medium scattering. Inverse Probl., 29(9):095018, sep 2013. doi: 10.1088/02665611/29/9/095018. URL https://doi.org/10.1088/02665611/29/9/095018.
 Ji et al. (2019) Xia Ji, Xiaodong Liu, and Bo Zhang. Phaseless inverse source scattering problem: Phase retrieval, uniqueness and direct sampling methods. Journal of Computational Physics: X, 1:100003, 2019. ISSN 25900552. doi: https://doi.org/10.1016/j.jcpx.2019.100003. URL https://www.sciencedirect.com/science/article/pii/S2590055219300022.
 Jiang et al. (2021) Jiahua Jiang, Yi Li, and Ruchi Guo. Learn an index operator by cnn for solving diffusive optical tomography: a deep direct sampling method. SIAM J. Sci. Comput., 2021.

 Jin et al. (2017) Kyong Hwan Jin, Michael T McCann, Emmanuel Froustey, and Michael Unser. Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing, 26(9):4509–4522, 2017.
 Jin et al. (2022) Pengzhan Jin, Shuai Meng, and Lu Lu. MIONet: Learning multiple-input operators via tensor product. arXiv preprint arXiv:2202.06137, 2022.
 Kang et al. (2017) Eunhee Kang, Junhong Min, and Jong Chul Ye. A deep convolutional neural network using directional wavelets for lowdose xray ct reconstruction. Med Phys, 44(10):e360–e375, Oct 2017. ISSN 24734209 (Electronic); 00942405 (Linking). doi: 10.1002/mp.12344.
 Kang et al. (2018) Sangwoo Kang, Marc Lambert, and WonKwang Park. Direct sampling method for imaging small dielectric inhomogeneities: analysis and improvement. Inverse Probl., 34(9):095005, jul 2018. doi: 10.1088/13616420/aacf1d. URL https://doi.org/10.1088/13616420/aacf1d.
 Katharopoulos et al. (2020) Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156–5165. PMLR, 2020.
 Khoo et al. (2021) Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving parametric pde problems with artificial neural networks. European Journal of Applied Mathematics, 32(3):421–435, 2021.
 Kirsch & Grinberg (2007) Andreas Kirsch and Natalia Grinberg. The factorization method for inverse problems, volume 36. OUP Oxford, 2007.
 Kissas et al. (2022) Georgios Kissas, Jacob Seidman, Leonardo Ferreira Guilhoto, Victor M Preciado, George J Pappas, and Paris Perdikaris. Learning operators with coupled attention. arXiv preprint arXiv:2201.01032, 2022.
 Kłosowski & Rymarczyk (2017) Grzegorz Kłosowski and Tomasz Rymarczyk. Using neural networks and deep learning algorithms in electrical impedance tomography. Informatyka, Automatyka, Pomiary w Gospodarce i Ochronie Środowiska, 7(3), 2017.
 Knudsen et al. (2007) Kim Knudsen, Matti Lassas, Jennifer L. Mueller, and Samuli Siltanen. D‐bar method for electrical impedance tomography with discontinuous conductivities. SIAM Journal on Applied Mathematics, 67(3):893–913, 2007. doi: 10.1137/060656930. URL https://doi.org/10.1137/060656930.
 Knudsen et al. (2009) Kim Knudsen, Matti Lassas, Jennifer L. Mueller, and Samuli Siltanen. Regularized dbar method for the inverse conductivity problem. Inverse Probl. & Imaging, 3(4):599–624, 2009.
 Kress (1999) Rainer Kress. Linear Integral Equations. Springer New York, 1999. doi: 10.1007/9781461205593. URL https://doi.org/10.1007/9781461205593.
 Latif et al. (2019) Jahanzaib Latif, Chuangbai Xiao, Azhar Imran, and Shanshan Tu. Medical imaging using machine learning and deep learning algorithms: A review. In 2019 2nd International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), pp. 1–5, 2019. doi: 10.1109/ICOMET.2019.8673502.
 Lee et al. (2011) Okkyun Lee, Jongmin Kim, Yoram Bresler, and Jong Chul Ye. Diffuse optical tomography using generalized music algorithm. In 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pp. 1142–1145, 2011. doi: 10.1109/ISBI.2011.5872603.
 Li et al. (2022) Zijie Li, Kazem Meidani, and Amir Barati Farimani. Transformer for partial differential equations’ operator learning. arXiv preprint arXiv:2205.13671, 2022.
 Li et al. (2021) Zongyi Li, Nikola Borislavov Kovachki, Kamyar Azizzadenesheli, Burigede liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=c8P9NQVtmnO.
 Lu et al. (2021) Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George E. Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021. doi: 10.1038/s42256021003025. URL https://doi.org/10.1038/s42256021003025.
 Ma et al. (2021) Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, and Furu Wei. Deltalm: Encoderdecoder pretraining for language generation and translation by augmenting pretrained multilingual encoders. arXiv preprint arXiv:2106.13736, 2021.
 Marroquin et al. (1987) Jose Marroquin, Sanjoy Mitter, and Tomaso Poggio. Probabilistic solution of illposed problems in computational vision. Journal of the american statistical association, 82(397):76–89, 1987.
 Martin & Idier (1997) Thierry Martin and Jérôme Idier. A FEMbased nonlinear map estimator in electrical impedance tomography. In Proceedings of ICIP, volume 2, pp. 684–687. IEEE, 1997.
 Michalikova et al. (2014) Marketa Michalikova, Rawia Abed, Michal Prauzek, and Jiri Koziorek. Image reconstruction in electrical impedance tomography using neural network. In 2014 Cairo International Biomedical Engineering Conference (CIBEC), pp. 39–42. IEEE, 2014.
 Nachman (1996) Adrian I. Nachman. Global uniqueness for a twodimensional inverse boundary value problem. Annals of Mathematics, 143(1):71–96, 1996.
 Nguyen et al. (2021) Tan M. Nguyen, Vai Suliafu, Stanley J. Osher, Long Chen, and Bao Wang. FMMformer: Efficient and Flexible Transformer via Decomposed Nearfield and Farfield Attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
 Nguyen et al. (2018) Thanh C. Nguyen, Vy Bui, and George Nehmetallah. Computational optical tomography using 3D deep convolutional neural networks. Optical Engineering, 57(4):1 – 11, 2018. doi: 10.1117/1.OE.57.4.043111. URL https://doi.org/10.1117/1.OE.57.4.043111.
 Nguyen & Salazar (2019) Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of selfattention. In Proceedings of the 16th International Conference on Spoken Language Translation, Hong Kong, November 23 2019. Association for Computational Linguistics. URL https://aclanthology.org/2019.iwslt1.17.
 Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, highperformance deep learning library. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 8024–8035. 2019. URL http://papers.neurips.cc/paper/9015pytorchanimperativestylehighperformancedeeplearninglibrary.pdf.
 Peng et al. (2021) Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. Random feature attention. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=QtTKTdVrFBB.
 Perera et al. (2021) Shehan Perera, Srikar Adhikari, and Alper Yilmaz. Pocformer: A lightweight transformer architecture for detection of covid19 using point of care ultrasound. In ICIP, 2021.
 Petit et al. (2021) Olivier Petit, Nicolas Thome, Clement Rambour, Loic Themyr, Toby Collins, and Luc Soler. Unet transformer: Self and cross attention for medical image segmentation. In Machine Learning in Medical Imaging: 12th International Workshop, MLMI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, September 27, 2021, Proceedings, pp. 267–276, Berlin, Heidelberg, 2021. SpringerVerlag. ISBN 9783030875886. doi: 10.1007/9783030875893_28. URL https://doi.org/10.1007/9783030875893_28.
 Ren et al. (2020) Shangjie Ren, Kai Sun, Chao Tan, and Feng Dong. A twostage deep learning method for robust shape reconstruction with electrical impedance tomography. IEEE Transactions on Instrumentation and Measurement, 69(7):4887–4897, 2020. doi: 10.1109/TIM.2019.2954722.
 Rondi & Santosa (2001) Luca Rondi and Fadil Santosa. Enhanced electrical impedance tomography via the mumfordshah functional. ESAIM: COCV, 6:517–538, 2001. doi: 10.1051/cocv:2001121. URL https://doi.org/10.1051/cocv:2001121.
 Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. Unet: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computerassisted intervention, pp. 234–241. Springer, 2015.
 Smith & Topin (2019) Leslie N Smith and Nicholay Topin. Superconvergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for MultiDomain Operations Applications, volume 11006, pp. 1100612. International Society for Optics and Photonics, 2019.
 Song et al. (2021) Diping Song, Bin Fu, Fei Li, Jian Xiong, Junjun He, Xiulan Zhang, and Yu Qiao. Deep relation transformer for diagnosing glaucoma with optical coherence tomography and visual field function. IEEE Transactions on Medical Imaging, 40(9):2392–2402, 2021. doi: 10.1109/TMI.2021.3077484.
 Tan et al. (2018) Chao Tan, Shuhua Lv, Feng Dong, and Masahiro Takei. Image reconstruction based on convolutional neural network for electrical resistance tomography. IEEE Sensors Journal, 19(1):196–204, 2018.
 Tanzi et al. (2022) Leonardo Tanzi, Andrea Audisio, Giansalvo Cirrincione, Alessandro Aprato, and Enrico Vezzetti. Vision transformer for femur fracture classification. Injury, 2022. ISSN 00201383. doi: https://doi.org/10.1016/j.injury.2022.04.013. URL https://www.sciencedirect.com/science/article/pii/S0020138322002868.
 Tarvainen et al. (2008) T. Tarvainen, M. Vauhkonen, and S.R. Arridge. Gauss–newton reconstruction method for optical tomography using the finite element solution of the radiative transfer equation. Journal of Quantitative Spectroscopy and Radiative Transfer, 109(17):2767–2778, 2008. ISSN 00224073. doi: https://doi.org/10.1016/j.jqsrt.2008.08.006. URL https://www.sciencedirect.com/science/article/pii/S0022407308001854.
 Tay et al. (2021) Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena : A benchmark for efficient transformers. In International Conference on Learning Representations (ICLR 2021), 2021. URL https://openreview.net/forum?id=qVyeWgrC2k.
 Tehrani et al. (2012) J. Nasehi Tehrani, A. McEwan, C. Jin, and A. van Schaik. L1 regularization method in electrical impedance tomography by using the l1curve (pareto frontier curve). Applied Mathematical Modelling, 36(3):1095–1105, 2012. ISSN 0307904X. doi: https://doi.org/10.1016/j.apm.2011.07.055. URL https://www.sciencedirect.com/science/article/pii/S0307904X11004537.
 Tsai et al. (2019) YaoHung Hubert Tsai, Shaojie Bai, Makoto Yamada, LouisPhilippe Morency, and Ruslan Salakhutdinov. Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), pp. 4344–4353, Hong Kong, China, November 2019. doi: 10.18653/v1/D191443. URL https://www.aclweb.org/anthology/D191443.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS 2017), volume 30, 2017.
 Vauhkonen et al. (1999) P.J. Vauhkonen, M. Vauhkonen, T. Savolainen, and J.P. Kaipio. Threedimensional electrical impedance tomography based on the complete electrode model. IEEE Trans. Biomedical Engrg., 46(9):1150–1160, 1999. doi: 10.1109/10.784147.
 Wang et al. (2022) Hongyi Wang, Shiao Xie, Lanfen Lin, Yutaro Iwamoto, XianHua Han, YenWei Chen, and Ruofeng Tong. Mixed transformer unet for medical image segmentation. In ICASSP 20222022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2390–2394. IEEE, 2022.

 Wang et al. (2021a) Huan Wang, Kai Liu, Yang Wu, Song Wang, Zheng Zhang, Fang Li, and Jiafeng Yao. Image reconstruction for electrical impedance tomography using radial basis function neural network based on hybrid particle swarm optimization algorithm. IEEE Sensors Journal, 21(2):1926–1934, 2021a. doi: 10.1109/JSEN.2020.3019309.
 Wang et al. (2012) Qi Wang, Huaxiang Wang, Ronghua Zhang, Jinhai Wang, Yu Zheng, Ziqiang Cui, and Chengyi Yang. Image reconstruction based on l1 regularization and projection methods for electrical impedance tomography. Review of Scientific Instruments, 83(10):104707, 2012. doi: 10.1063/1.4760253. URL https://doi.org/10.1063/1.4760253.
 Wang et al. (2021b) Sifan Wang, Hanwen Wang, and Paris Perdikaris. Learning the solution operator of parametric partial differential equations with physicsinformed deeponets. Science advances, 7(40):eabi8605, 2021b.
 Xiong et al. (2020) Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pp. 10524–10533. PMLR, 2020.
 Ye et al. (2018) Jong Chul Ye, Yoseob Han, and Eunju Cha. Deep convolutional framelets: A general deep learning framework for inverse problems. SIAM Journal on Imaging Sciences, 11(2):991–1048, 2018.
 Zhou et al. (2021) HongYu Zhou, Jiansen Guo, Yinghao Zhang, Lequan Yu, Liansheng Wang, and Yizhou Yu. nnformer: Volumetric medical image segmentation via a 3d transformer. arXiv:2109.03201v6, 2021.
 Zhu et al. (2018) Bo Zhu, Jeremiah Z Liu, Stephen F Cauley, Bruce R Rosen, and Matthew S Rosen. Image reconstruction by domaintransform manifold learning. Nature, 555(7697):487–492, Mar 2018. doi: 10.1038/nature25988.
Appendix A Table of Commonly Used Notations
Notation  Meaning 
,  admissible Sobolev spaces for the inverse problems, which are 
and for EIT with data pairs  
the gradient vector of a function,  
the delta function such that , .  
normal derivative of , measures the rate of change along the direction of  
NtD mapping from Neumann data  
(how fast the solution changes toward the outward normal direction) to  
Dirichlet data (the solution’s value along the tangential direction)  
an underlying spatial domain in  
a subdomain in (not necessarily topologicallyconnected)  
,  ’s and ’s boundary, dimensional manifolds 
,  the Sobolev space of functions 
,  the bounded linear functional defined on 
all such that ’s integral on vanishes  
the seminorm defined for functions in 
Appendix B Experiment Setup
B.1 Data generation and training
In the numerical examples, the data generation mainly follows that in (Fan & Ying, 2020; Guo & Jiang, 2020). For examples, please refer to Figure 2. The computational domain is set to be , and the two media with the different conductivities are with (inclusion) and (background). The inclusions are generated by four random ellipses with the lengths of the semi-major axis and semi-minor axis sampled from and , respectively, and the rotation angle sampled from . There are 10800 samples in the training set, of which 20% are reserved for validation, and 2000 in the testing set for evaluation.
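As a rough illustration of this sampling procedure, the following sketch rasterizes four random ellipses into a binary inclusion map on a regular grid. The function name, grid size, and axis ranges are illustrative stand-ins, since the exact values used in the paper are not fully recoverable from the text above.

```python
import numpy as np

def random_ellipse_inclusion(grid_n=64, n_ellipses=4, a_range=(0.1, 0.3),
                             b_range=(0.05, 0.15), seed=None):
    """Rasterize a binary inclusion map from randomly placed ellipses.

    A sketch of the data-generation step described above; axis and
    center ranges are assumptions, not the paper's exact values.
    """
    rng = np.random.default_rng(seed)
    xs = np.linspace(-1.0, 1.0, grid_n)
    X, Y = np.meshgrid(xs, xs, indexing="xy")
    mask = np.zeros((grid_n, grid_n), dtype=bool)
    for _ in range(n_ellipses):
        cx, cy = rng.uniform(-0.5, 0.5, size=2)   # ellipse center
        a = rng.uniform(*a_range)                 # semi-major axis
        b = rng.uniform(*b_range)                 # semi-minor axis
        theta = rng.uniform(0.0, np.pi)           # rotation angle
        Xc, Yc = X - cx, Y - cy
        # rotate coordinates into the ellipse's principal frame
        U = Xc * np.cos(theta) + Yc * np.sin(theta)
        V = -Xc * np.sin(theta) + Yc * np.cos(theta)
        mask |= (U / a) ** 2 + (V / b) ** 2 <= 1.0
    return mask.astype(np.float32)  # 1 = inclusion, 0 = background
```

The union of ellipses naturally allows non-topologically-connected inclusions, matching the subdomain assumption in Appendix A.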
The noise below (13) is assumed to be
(26) 
where specifies the percentage of noise, and is a standard Gaussian distribution independent of . As the noise is imposed merely pointwise, the boundary data can be highly rough even though the true data is smooth. Thanks to the positionwise binary nature of , another choice of the loss function during training is the binary cross entropy, applied to a function in , to measure the distance between the ground truth and the network's prediction
(27) 
Thanks to Pinsker's inequality (e.g., see (Cover, 1999, Section 11.6)), serves as a good upper bound for the square of the total variation, which in turn can be bounded from below by the error, given the boundedness of the positionwise values.
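A minimal sketch of this loss in PyTorch, assuming the network outputs logits over the grid points of the binary inclusion map (the function name is ours):

```python
import torch
import torch.nn.functional as F

def inclusion_bce(pred_logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Binary cross entropy between a predicted inclusion map and the
    binary ground truth, averaged over grid points.

    `pred_logits` is assumed to be the network output before a sigmoid;
    the log-sum-exp trick inside the fused call keeps it numerically
    stable for large logits.
    """
    return F.binary_cross_entropy_with_logits(pred_logits, target)

# By Pinsker's inequality, driving the cross entropy (hence the KL
# divergence) to zero forces the total variation between the predicted
# and true pointwise distributions to zero as well.
```

A near-perfect prediction (large positive logits on the inclusion) yields a loss close to zero, while a confidently wrong one is heavily penalized.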
The training uses the 1-cycle learning rate strategy (Smith & Topin, 2019) with a warm-up phase. Mini-batch Adam iterations are run for a total of 50 epochs with no extra regularization such as weight decay. The learning rate starts and ends with , and reaches its maximum of at the end of the th epoch. The results demonstrated are obtained with a fixed random number generator seed. Some testing results can be seen in Figure 5. All models are trained on an RTX 3090 or an A4000. The code to replicate the experiments is open-source and publicly available at https://github.com/scaomath/eittransformer.
B.2 Network architecture
The U-Transformer architecture is a drop-in replacement for the standard CNN-based U-Net baseline model (7.7m parameters) in Table 1. For a high-level encoder-decoder schematic, see Figure 7.
Positional embedding.
At each resolution, the 2D Euclidean coordinates of a regular grid undergo a channel expansion through a learnable linear layer and are then added to each latent representation.
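A PyTorch sketch of what such a coordinate-based embedding could look like. The module name, channels-last layout, and the [0, 1] coordinate range are our assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class CoordinateEmbedding(nn.Module):
    """Learnable positional embedding from 2D Euclidean coordinates.

    The (x, y) grid coordinates are expanded to `channels` dimensions by
    a learnable linear layer and added to the latent representation at
    the current resolution.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.expand = nn.Linear(2, channels)  # channel expansion of (x, y)

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        # latent: (batch, height, width, channels), channels-last
        _, h, w, _ = latent.shape
        ys, xs = torch.meshgrid(
            torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
        coords = torch.stack([xs, ys], dim=-1)  # (h, w, 2) regular grid
        return latent + self.expand(coords)     # broadcast over batch
```

Because the coordinates are recomputed from the latent tensor's shape, the same module serves every resolution level of the encoder-decoder.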
Double convolution block.
The double convolution block is modified from that commonly seen in Computer Vision (CV) models, such as ResNet (He et al., 2016). We modify this block so that, when it feeds into an attention block, the batch normalization (Ioffe & Szegedy, 2015) is replaced by layer normalization (Ba et al., 2016), which can be understood as approximating the inverse of the Gram matrix by a learnable diagonal matrix.
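A minimal sketch of this modification, with the normalization chosen by a hypothetical `for_attention` flag (the class names are ours, not the paper's):

```python
import torch
import torch.nn as nn

class ChannelLayerNorm(nn.Module):
    """LayerNorm over the channel dimension of an NCHW tensor."""
    def __init__(self, channels: int):
        super().__init__()
        self.ln = nn.LayerNorm(channels)

    def forward(self, x):
        # (N, C, H, W) -> channels-last, normalize, back to NCHW
        return self.ln(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

class DoubleConv(nn.Module):
    """Two conv-norm-ReLU stages, as in common CV backbones.

    When the block feeds an attention layer, batch normalization is
    replaced by layer normalization over channels, i.e. a learnable
    diagonal rescaling applied independently at each grid point.
    """
    def __init__(self, c_in: int, c_out: int, for_attention: bool = False):
        super().__init__()
        norm = (lambda: ChannelLayerNorm(c_out)) if for_attention \
            else (lambda: nn.BatchNorm2d(c_out))
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), norm(), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1), norm(), nn.ReLU())

    def forward(self, x):
        return self.block(x)
```

The per-position layer normalization is what makes the "diagonal approximation to the inverse Gram matrix" reading natural: each channel is rescaled by one learnable weight, independently of the spatial location.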
Mesh-normalized attention.
Coarse-fine interaction in up blocks.
The convolution layer on the coarsest level is replaced by the attention with pre-inner-product normalization in Section 3.2. The skip connections from the encoder latent representations to those in the decoder are generated using an architecture similar to the cross attention used in (Petit et al., 2021). and are generated from the latent representation functions on the same coarser grid, so that the attention kernel measuring the interaction between different channels is built on the coarse grid, while is associated with a finer grid. The major differences from Petit et al. (2021) are that ours is inspired by the kernel integral for a PDE problem; thus, the modified attention in our method has (1) no softmax normalization and (2) no Hadamard-product-type skip connection.
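A softmax-free cross attention in this spirit can be sketched as below, with keys and values drawn from the coarse encoder grid and queries from the finer decoder grid; the class name, shapes, and the normalization by the number of coarse grid points are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class KernelCrossAttention(nn.Module):
    """Softmax-free cross attention between coarse- and fine-grid latents.

    Keys and values come from the coarse-grid representation, queries
    from the finer grid, and the product is treated as a discretized
    nonlocal integral kernel (a quadrature rule over coarse grid points)
    rather than a softmax-normalized probability.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # fine: (batch, n_fine, dim); coarse: (batch, n_coarse, dim)
        q = self.to_q(fine)
        k, v = self.to_k(coarse), self.to_v(coarse)
        n_coarse = coarse.shape[1]
        # (q k^T) v / n_coarse approximates a learnable kernel integral
        # over the coarse grid; no softmax, no Hadamard-product skip.
        return torch.bmm(torch.bmm(q, k.transpose(1, 2)), v) / n_coarse
```

Dividing by the number of coarse points mimics the mesh weight of a quadrature rule, so the output scale stays stable under grid refinement.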
Appendix C Proof of Theorem 1
Theorem 1 (A finite dimensional approximation of the index map).
Suppose the boundary data is the eigenfunction of corresponding to the th eigenvalue , and let be the data functions generated by harmonic extensions
(28) 
where . Define the space on (the spatial dimension ):
(29) 
Then, for any , there exists a sufficiently large such that
(30) 
Proof.
Let be any probing direction in (Brühl, 2001; Hanke & Brühl, 2003). Define a function
(31) 
By Theorem 4.1 in (Guo & Jiang, 2020), we can show that
(32) 
As , it is increasing with respect to . Then, there is a constant such that , . Given any , there is an integer such that , . Define
(33) 
Note the fundamental inequality , . Then, if , there holds
if , there holds