Learning operators between Hilbert spaces provides a natural framework for the application of tools from supervised learning to the development of surrogate models that accelerate scientific computation by approximating existing expensive models, and to the discovery of new models consistent with observed data when no model exists. In order to develop a deeper understanding of operator learning, this paper is concerned with nonparametric regression under random design on a separable infinite-dimensional Hilbert space $H$. We consider the problem of learning $L^\dagger$, an unknown (possibly unbounded and in general densely defined) self-adjoint linear operator on $H$, from data pairs $\{(x_n, y_n)\}_{n=1}^{N}$ related by
$$ y_n = L^\dagger x_n + \eta_n, \qquad n = 1, \dots, N. \tag{1} $$
Here the $\eta_n$ represent noise and $N$ is referred to as the sample size.
The estimation of $L^\dagger$ from the data Eq. 1 is in general an ill-posed linear inverse problem [vito2005learning]. We say that an operator $L_1$ is more difficult to learn than another operator $L_2$ if the estimation of $L_1$ requires larger sample sizes to achieve a fixed error tolerance relative to that of $L_2$; that is, $L_1$ has worse sample complexity. Broadly, our work aims to provide an answer to the question:
What is the relative difficulty of learning different types of linear operators?
The case in which $H$ comprises (e.g., real-valued) functions over a domain is a particular focus; for example, it is of interest to understand the relative difficulty of learning forward and inverse operators arising from partial differential equations (PDEs). Indeed, there is an emerging body of work on supervised learning between Banach spaces focused primarily on forward, typically nonlinear, PDE solution operators [adcock2020deep, bhattacharya2020model, korolev2021two, li2020neural, lu2019deeponet, nelsen2020random, o2020derivative, schwab2019deep]. In the context of dynamical systems, there is also literature focused on learning the Koopman operator or its generator, both linear operators, from data generated by the underlying dynamics for system identification or forecasting [brunton2016discovering, giannakis2019data, klus2020data, klus2020eigendecompositions, patel2020physics]. Finally, there is interest in using surrogate forward models to speed up (e.g., Bayesian) inversion techniques [li2020neural] and in directly learning regularizers for inversion, or even the regularized inverse solution operator itself [arridge2019solving].
The study of linear function-to-function models within functional data analysis (FDA) [ramsay2005fda] is also a well-established area; see [crambes2013asymptotics, hormann2015note, reimherr2015functional, wang2020functional1] and references therein. Much of this work concerns the setting of functional responses regressed on functional covariates and linear models based on kernel integral operators under colored noise. The operator estimation is then reduced to learning the kernel function, usually in a reproducing kernel Hilbert space (RKHS) framework. Linear operator learning has also been considered in machine learning [abernethy2009new], particularly in the context of conditional expectation operators [mollenhauer2020nonparametric] and conditional mean embeddings in RKHS [grunewalder2012conditional, klebanov2020rigorous, song2009hilbert].
The authors in [hormann2015note, reimherr2015functional] study functional linear regression between two Hilbert spaces and work directly with a spectral representation of the estimator, which allows them to obtain consistency of the prediction error assuming only boundedness of the true operator [hormann2015note], rather than compactness as assumed in much of the FDA literature; convergence rates are established in [reimherr2015functional]. While unbounded operators are not considered in these works, their approaches could likely be modified to handle them. Relatedly, the authors of [tabaghi2019learning] and [boulle2021learning] have motivations similar to ours; the former establishes sample complexities for learning Schatten-class compact operators (motivated by inverse problem solution maps), while the latter does so for learning compact operators associated to Green's functions of elliptic PDEs (motivated by PDE discovery). Our theory also treats these types of operators but goes further by proving sample complexities for the direct learning of unbounded operators, which are of primary interest in these papers (the inverse map in the former and the partial differential operator in the latter).
We consider Eq. 1 in an idealized setting in which the eigenbases of the target ground truth operator and data covariance operator are assumed known. Although quite strong, these assumptions may be realized in practice when, for example, prior knowledge is available that the covariance commutes with the target (hence the two are simultaneously diagonalizable in the same eigenbasis) or that the target obeys certain physical principles (e.g., commutes with translation operators). This allows us to work in coordinates with respect to the eigenbasis and hence reduce inference of the operator to that of its eigenvalue sequence; similar ideas were recently applied to learn a differential operator arising in an advection-diffusion model [portone2021bayesian]. Our proof techniques in this linear diagonal setting follow the program set forth in [knapik2011bayesian], which is concerned with the Bayesian approach to ill-posed inverse problems in a diagonalized problem setting. Although the authors of [caponnetto2007optimal, rastogi2020convergence] established optimal convergence rates for regularized least squares algorithms in statistical direct and inverse learning problems with both infinite-dimensional input and output spaces, these results do not apply to our simpler linear operator regression setting Eq. 1. This is because their requirement that the appropriately defined point evaluation map (here acting on linear operators) is Hilbert–Schmidt never holds when the output space is infinite-dimensional, since the implied operator-valued kernel is not trace-class; the limitations of this strong assumption were also acknowledged in more general RKHS settings [grunewalder2012conditional, mollenhauer2020nonparametric].
1.1 Our Contributions
It is clear that much of the work regarding Eq. 1 has focused on bounded or compact $L^\dagger$. Instead, we provide a unified framework accommodating the learning of compact, bounded, and unbounded operators that allows us to compare the difficulty of learning operators defined by both forward and inverse problems. In particular, we show that unbounded linear operators can be learned in a stable and consistent manner, but with associated convergence rates in the large data limit that are worse relative to those of their continuous (bounded or compact) counterparts (see Figure 1).
The primary contributions we make in this paper are now listed:
(i) we formulate linear operator learning both as a nonparametric Bayesian inverse problem with non-compact forward map and as a statistical learning optimization problem;
(ii) in the large sample limit, we prove convergence of the posterior estimator using dimension/discretization-independent asymptotic upper and lower error bounds in a family of suitable norms, in expectation and with high probability over input data;
(iii) our theory demonstrates that, as a point estimator, the posterior mean exhibits no loss in statistical performance when compared to the full posterior solution;
(iv) thus, as a byproduct of our posterior mean analysis, we analogously establish asymptotic convergence rates of the excess risk and generalization gap;
(v) we perform numerical experiments to illustrate the learning of compact, bounded, and unbounded linear operators from noisy data, and the results both support the theory and confirm our conclusions beyond the confines of our idealized theorem setting.
The remainder of the paper is organized as follows. In Section 2, we describe the theoretical setting in which we work; the formulation in (i) is addressed in Sections 2.2 and 2.3 and Proposition 1. Our main theoretical results are in Section 3, where we provide asymptotic convergence rates of the inverse problem solution to the truth; contributions (ii)-(iii) are found in Theorems 1, 2, 3, and 4 and (iv) in Theorems 5, 6, and 7. We discuss the theory in Section 4, and numerical results (v) that illustrate, support, and extend beyond the convergence theory are provided in Section 5. Our concluding remarks follow in Section 6. Appendix A is devoted to proofs of the main results, with supporting lemmas in Appendix B.
2 Problem Formulation
After overviewing some notation in Section 2.1, we recast Eq. 1 as a random Bayesian inverse problem in Section 2.2. Section 2.3 gives an optimization perspective and defines expected risk and generalization gap in the infinite-dimensional setting. In Section 2.4, we describe our main assumptions and explicitly characterize the conditional posterior probability measure under these assumptions.
Let $H$ be a real, separable, infinite-dimensional Hilbert space. For any self-adjoint positive definite linear operator $A$ on $H$, we define the weighted Hilbert space $H_A := A^{1/2}H$ with inner product $\langle u, v \rangle_{H_A} := \langle A^{-1/2} u, A^{-1/2} v \rangle$.
The set $\mathcal{L}(H_1; H_2)$ is the space of bounded linear operators mapping Hilbert space $H_1$ into $H_2$, and when $H_1 = H_2 = H$, we write $\mathcal{L}(H)$. Similarly, the separable Hilbert space of Hilbert–Schmidt operators from $H_1$ to $H_2$ is denoted by $\mathrm{HS}(H_1; H_2)$, and when $H_1 = H_2 = H$, we write $\mathrm{HS}(H)$. For any $u, v \in H$, $u \otimes v$ denotes the outer product defined by $(u \otimes v)w := \langle v, w \rangle u$ for any $w \in H$, and we also use the shorthand $u^{\otimes 2} := u \otimes u$. Additionally, when dealing with a possibly unbounded linear operator $L$ on $H$, we denote its domain by the subspace $D(L) \subseteq H$.
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a complete probability space. We use Bochner integrals throughout the article and, when no confusion is caused by doing so, use the symbol $\mathbb{E}$ with no other superscripts to denote averaging over all sources of randomness. We implicitly justify all exchanges of expectation and series with the Fubini–Tonelli theorem. The chi-square distribution with $k$ degrees of freedom is denoted by $\chi^2(k)$. We primarily work with centered Gaussian measures $\mathcal{N}(0, \Lambda)$ on measurable space $(H, \mathcal{B}(H))$, where $\Lambda$ is the symmetric, nonnegative covariance operator and $\mathcal{B}(H)$ is the standard Borel $\sigma$-algebra on $H$, understood to be defined with respect to the common probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Such a $\Lambda$ admits a sequence $\{\varphi_j\}_{j \in \mathbb{N}}$ of eigenvectors forming an orthonormal basis of $H$ and nonnegative eigenvalues $\{\lambda_j\}_{j \in \mathbb{N}}$. Without loss of generality, we may assume that the eigenvalues are ordered to be nonincreasing; the operator $\Lambda$ is necessarily trace-class on $H$, hence compact, so that $\{\lambda_j\}$ is summable. A sample $x \sim \mathcal{N}(0, \Lambda)$ may be obtained from the Karhunen–Loève (KL) expansion $x = \sum_{j=1}^{\infty} \sqrt{\lambda_j}\, \xi_j \varphi_j$, where $\{\xi_j\}$ is an independent and identically distributed (i.i.d.) sequence with $\xi_1 \sim \mathcal{N}(0, 1)$ [steinwart2019convergence, stuart2010inverse]. Thus, the coordinates $\langle x, \varphi_j \rangle$ of $x$ in eigenbasis $\{\varphi_j\}$ are independent and distributed as $\mathcal{N}(0, \lambda_j)$. We also consider generalizations of Gaussian measures defined on spaces larger than $H$ and on spaces of operators.
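For illustration, the KL expansion above can be simulated by truncating the series; the quadratic eigenvalue decay $\lambda_j = j^{-2}$ chosen below is hypothetical, and the sanity check on $\mathbb{E}\|x\|^2 = \operatorname{tr}\Lambda$ is standard.

```python
import numpy as np

def kl_sample(eigvals, rng):
    """Coordinates of x ~ N(0, Lambda) in the eigenbasis of Lambda via the
    truncated Karhunen-Loeve expansion x = sum_j sqrt(lambda_j) * xi_j * phi_j."""
    xi = rng.standard_normal(len(eigvals))  # i.i.d. coefficients xi_j ~ N(0, 1)
    return np.sqrt(eigvals) * xi            # j-th coordinate is N(0, lambda_j)

rng = np.random.default_rng(0)
J = 1000                                            # truncation level
eigvals = np.arange(1, J + 1, dtype=float) ** -2.0  # summable, so Lambda is trace-class
x = kl_sample(eigvals, rng)
# Sanity check: averaging ||x||^2 over many samples approximates trace(Lambda)
```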
For two sequences $\{a_j\}$ and $\{b_j\}$ of nonnegative real numbers, we write $a_j \asymp b_j$ if the sequence $\{a_j / b_j\}$ is bounded away from zero and infinity (uniformly over the index $j$), $a_j \lesssim b_j$ if $\{a_j / b_j\}$ is bounded, and $a_j \ll b_j$ if $a_j / b_j \to 0$. Similarly, we use standard asymptotic notation, writing $f = O(g)$ as $N \to \infty$ if there exist $C > 0$ and $N_0 \in \mathbb{N}$ such that $|f(N)| \leq C g(N)$ for all $N \geq N_0$, $f = o(g)$ as $N \to \infty$ if $f(N)/g(N) \to 0$, $f \asymp g$ as $N \to \infty$ if both $f = O(g)$ and $g = O(f)$, and $f \sim g$ as $N \to \infty$ if $f(N)/g(N) \to 1$. We make frequent use of the Sobolev-like sequence Hilbert spaces
$$ \mathcal{H}^{s} := \Big\{ u = \{u_j\}_{j \in \mathbb{N}} : \|u\|_{\mathcal{H}^s}^2 := \sum_{j=1}^{\infty} j^{2s} u_j^2 < \infty \Big\} $$
for any $s \in \mathbb{R}$, equipped with the natural $j^{2s}$-weighted inner product and norm.
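Membership in these sequence spaces can be probed numerically on truncations; the $j^{2s}$ weighting below matches the definition above, while the test sequence $u_j = 1/j$ is a hypothetical example (it lies in $\mathcal{H}^s$ exactly when $s < 1/2$).

```python
import numpy as np

def hs_seq_norm(u, s):
    """Truncated Sobolev-like sequence norm: ||u||_{H^s}^2 = sum_j j^{2s} u_j^2."""
    j = np.arange(1, len(u) + 1, dtype=float)
    return np.sqrt(np.sum(j ** (2 * s) * u ** 2))

J = 10_000
u = np.arange(1, J + 1, dtype=float) ** -1.0  # u_j = 1/j
norm_s0 = hs_seq_norm(u, 0.0)    # converges (to pi/sqrt(6)) as J grows: u is in H^0
norm_s75 = hs_seq_norm(u, 0.75)  # grows without bound in J: u is not in H^{0.75}
```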
2.2 Bayesian Inversion
Consider the linear operator equation
$$ y = L^\dagger x + \eta, \tag{2} $$
where $L^\dagger$ is a densely defined self-adjoint linear operator on $H$, $x \sim \mathcal{N}(0, \Lambda)$, and $\eta$ is a random variable modeling observational noise. Rather than the standard inverse problem setting in which $x$ is viewed as the unknown, we instead view $x$ itself as inducing a given pointwise evaluation linear map on operators: $L \mapsto Lx$; thus Eq. 2, or multiple such identities, defines an equation for the operator unknown $L^\dagger$. We consider the setting where $x$ and $\eta$ are Gaussian and independent, and draw $N$ pairs $\{(x_n, y_n)\}_{n=1}^{N}$ to define a likelihood; we then assume that $L^\dagger$ is a priori Gaussian, which makes the likelihood conjugate. To this end, it is useful to denote the $N$-fold product of $\mathcal{N}(0, \Lambda)$ by $\mathcal{N}(0, \Lambda)^{\otimes N}$ and define the space $H_K := K^{1/2}H$, where $K$ is symmetric, positive definite, and trace-class on $H$; equipped with the inner product $\langle u, v \rangle_{H_K} := \langle K^{-1/2} u, K^{-1/2} v \rangle$, $H_K$ is a RKHS and may also be viewed as the Cameron–Martin space of $\mathcal{N}(0, K)$ [stuart2010inverse]. In our setting, an important role of $K$ is to ensure that the support of the prior on $L^\dagger$ is large enough to encompass unbounded operators. We make the following assumptions about Eq. 2.
It holds that $\{x_n\}_{n=1}^{N}$, comprised of i.i.d. samples from $\mathcal{N}(0, \Lambda)$, and $\{\eta_n\}_{n=1}^{N}$, comprised of i.i.d. samples from $\mathcal{N}(0, \Gamma)$, are independent centered Gaussians. The data covariance $\Lambda$ is a symmetric, positive definite linear operator that is trace-class on $H$, and the noise covariance operator $\Gamma$ is symmetric and positive definite. Furthermore, for some $K$ symmetric, positive definite, and trace-class on $H$, there exists a Borel measurable linear operator $L^\dagger$ that is self-adjoint with domain dense in $H$, commutes with $\Lambda$ on $D(L^\dagger)$, and generates the data
$$ y_n = L^\dagger x_n + \eta_n, \qquad n = 1, \dots, N. \tag{3} $$
We note that, although $x_n$ is a Gaussian random variable in $H$ (since $\Lambda$ is assumed trace-class on $H$), $y_n$ need not be (since $\Gamma$ is not assumed trace-class on $H$) and must be interpreted weakly (see [agapiou2014analysis, Sec. 5.1], [agapiou2018posterior, cavalier2008nonparametric, knapik2011bayesian]). Viewing $L^\dagger$ as a random variable, for our prior model we place a centered Gaussian prior $\mu_0 = \mathcal{N}(0, \Sigma)$ on $L^\dagger$ independent of $\{x_n\}$ and $\{\eta_n\}$. The sense in which $\mu_0$ is a proper Gaussian measure requires some care. The most natural Hilbert space of linear operators on $H$ is the set of Hilbert–Schmidt operators $\mathrm{HS}(H)$, and one could try to interpret $\mu_0$ as a measure supported on $\mathrm{HS}(H)$. However, this is unsatisfactory as it does not allow for prior knowledge that $L^\dagger$ may fail to be continuous and thus not be an element of $\mathrm{HS}(H)$. Instead, we take $\mu_0$ as a proper Gaussian measure on the larger space $\mathrm{HS}(H_K; H)$, so that the prior covariance $\Sigma$ is assumed symmetric, positive definite, and trace-class on $\mathrm{HS}(H_K; H)$, but not necessarily trace-class on $\mathrm{HS}(H)$.
Recovery of the ground truth element $L^\dagger$ from the given data Eq. 3 may be formulated in the notation of a traditional statistical inverse problem as follows. Concatenating on the index $n$, we write $Y := (y_1, \dots, y_N)$ and equip the product space $H^N$ with the inner product $\langle Y, Y' \rangle_{H^N} := \sum_{n=1}^{N} \langle y_n, y'_n \rangle$ for any $Y, Y' \in H^N$; this makes $H^N$ a Hilbert space. Next, for $X := (x_1, \dots, x_N) \sim \mathcal{N}(0, \Lambda)^{\otimes N}$ (the $N$-fold copy of $\mathcal{N}(0, \Lambda)$), we define the linear map $\mathsf{X}$ by $L \mapsto \mathsf{X}L := (Lx_1, \dots, Lx_N)$. We thus have the given data Eq. 3 in the standard form
$$ Y = \mathsf{X} L^\dagger + E, \tag{4} $$
where $E := (\eta_1, \dots, \eta_N)$. The map $\mathsf{X}$ is analogous to the forward map in the terminology of inverse problems, except here it is a random variable. The following proposition shows that although this forward map is injective, contrary to usual inverse problems it is not compact.
The map $\mathsf{X}$ is injective $\mathbb{P}$-almost surely for any $N \in \mathbb{N}$; however, $\mathsf{X}$ is not compact.
For any nonzero $L$, we have $L \Lambda^{1/2} \neq 0$, as $\Lambda$ is positive definite by Section 2.2; so, since the $x_n$ are i.i.d. $\mathcal{N}(0, \Lambda)$, each $L x_n$ is a nondegenerate Gaussian random variable. But then $\mathbb{P}(\mathsf{X}L = 0) = \mathbb{P}(L x_1 = 0)^N = 0$, which proves the first assertion.
For the second, notice that $\mathrm{HS}(H_K; H)$ is a vector-valued RKHS [micchelli2005learning] with an operator-valued kernel built from the identity operator on $H$. However, the identity is not compact on the infinite-dimensional space $H$, which implies that the pointwise evaluation $L \mapsto Lx$ is not compact either. Extending the argument from $N = 1$ to arbitrary $N \in \mathbb{N}$ completes the proof.
We assume that $L$, $\mathsf{X}$, and $E$ are a priori independent and consider the Bayesian inverse problem of finding the distribution of $L$, given $\mathsf{X}$ and $Y$, assuming that the random variables are related by Eq. 4 (with $L^\dagger$ replaced by $L$). A rigorous justification of the posterior formulae to follow, namely that it is Gaussian with mean and covariance Eq. 6 and that it is absolutely continuous with respect to the prior $\mu_0$, with Radon–Nikodym derivative Eq. 7, may be obtained by application of [dashti2017, Theorems 32, 13, 37]. Because $L$ and $\mathsf{X}$ are a priori independent, the posterior measure on $L$ given $Y$ and $\mathsf{X}$, denoted by $\mu^{Y,\mathsf{X}}$, is the same as that obtained when $L$ is conditioned on $Y$ with $\mathsf{X}$ fixed, almost surely (a.s.); this measure is Gaussian:
$$ \mu^{Y,\mathsf{X}} = \mathcal{N}(\bar{L}, \mathcal{C}). \tag{5} $$
The mean $\bar{L}$ is the linear operator and $\mathcal{C}$ is the covariance operator. If we define $\Gamma_N := \mathrm{Id}_N \otimes \Gamma$ to be the Kronecker product of $\mathrm{Id}_N$ and $\Gamma$, which is an $N$-by-$N$ block diagonal operator with diagonal entries $\Gamma$, then the posterior mean and covariance are given by
$$ \bar{L} = \Sigma \mathsf{X}^* \big( \mathsf{X} \Sigma \mathsf{X}^* + \Gamma_N \big)^{-1} Y, \tag{6a} $$
$$ \mathcal{C} = \Sigma - \Sigma \mathsf{X}^* \big( \mathsf{X} \Sigma \mathsf{X}^* + \Gamma_N \big)^{-1} \mathsf{X} \Sigma \tag{6b} $$
in our infinite-dimensional setting (see [knapik2011bayesian, Prop. 3.1], [lehtinen1989linear, mandelbaum1984linear]), and additionally,
$$ \frac{d\mu^{Y,\mathsf{X}}}{d\mu_0}(L) = \frac{1}{Z(Y, \mathsf{X})} \exp\bigg( \sum_{n=1}^{N} \big\langle \Gamma^{-1} y_n, L x_n \big\rangle - \frac{1}{2} \sum_{n=1}^{N} \big\| \Gamma^{-1/2} L x_n \big\|^2 \bigg), \tag{7} $$
where $Z(Y, \mathsf{X})$ is the required normalization constant. The likelihood formula Eq. 7 motivates the learning theory formulation that follows in the next subsection.
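In any finite-dimensional truncation, Eq. 6 reduces to the standard conjugate Gaussian update for the vectorized unknown. The sketch below (with hypothetical dimensions, prior, and noise covariance) assembles the block-diagonal noise covariance and computes the posterior mean in the equivalent information form.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 5, 200
L_true = np.diag(np.arange(1.0, d + 1))  # hypothetical diagonal ground truth operator

Sigma = np.eye(d * d)    # prior covariance on vec(L) (hypothetical)
Gamma = 0.01 * np.eye(d) # noise covariance (hypothetical)

X = rng.standard_normal((N, d))  # inputs x_n ~ N(0, Id)
Y = X @ L_true.T + rng.multivariate_normal(np.zeros(d), Gamma, size=N)

# Design matrix acting on vec(L) (row-major): y_n = (Id kron x_n^T) vec(L) + noise
A = np.vstack([np.kron(np.eye(d), x[None, :]) for x in X])  # shape (N*d, d*d)
Gamma_N = np.kron(np.eye(N), Gamma)  # N-by-N block diagonal with blocks Gamma

# Information form of the conjugate Gaussian posterior:
#   C = (Sigma^{-1} + A^T Gamma_N^{-1} A)^{-1},  mean = C A^T Gamma_N^{-1} vec(Y)
precision = np.linalg.inv(Sigma) + A.T @ np.linalg.inv(Gamma_N) @ A
C = np.linalg.inv(precision)
L_bar = (C @ A.T @ np.linalg.inv(Gamma_N) @ Y.reshape(-1)).reshape(d, d)
```

With enough data relative to the noise level, the posterior mean `L_bar` recovers the truth closely; the shrinkage toward the prior mean (zero) is negligible here because the likelihood precision dominates.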
Our analysis is not limited to the assumption that the input data distribution is a Gaussian measure on $H$; the results in this paper would be virtually unchanged if the input draws were instead assumed to be given by a Karhunen–Loève random series expansion with non-Gaussian (but still independent) coefficients (see, e.g., [steinwart2019convergence]), provided they are sub-Gaussian. Indeed, since the input data enters Eq. 4 only through the forward operator $\mathsf{X}$, the inverse problem would still be linear Gaussian and hence the posterior would have the same form. However, the concentration inequalities and moment bounds appearing in Lemmas 5, 6, and 7 within Appendix B are specific to the chi-square distribution, which arises because the input data is Gaussian; to generalize, these would need to be replaced with results pertaining to more general sub-exponential distributions.
We now define two natural statistical approximations to $L^\dagger$: the posterior mean estimator $\bar{L}$ and the posterior sample estimator $L \sim \mu^{Y,\mathsf{X}}$. The sense in which these estimators are close to $L^\dagger$, as a function of data volume $N$, measures the amount of data required for a given level of estimation accuracy; this is the focus of our analysis. The usual norms (e.g., operator, Hilbert–Schmidt, and trace norms) do not usefully metrize estimation errors if $L^\dagger$ is unbounded. Instead, the following weaker Bochner norm is proposed to measure such error. Let $\nu'$ be a centered probability measure supported on $H$ with finite fourth moment and denote its covariance operator by $\Lambda'$. The identity $\mathbb{E}^{x' \sim \nu'} \|T x'\|^2 = \operatorname{tr}(T \Lambda' T^*)$ for any linear $T$ defined on the support of $\nu'$ yields
$$ \mathbb{E}^{x' \sim \nu'} \|T x'\|^2 = \big\| T (\Lambda')^{1/2} \big\|_{\mathrm{HS}(H)}^2, \tag{8} $$
assuming $T \Lambda' T^*$ is trace-class. This motivates the next fact.
Under Section 2.2, $\|L (\Lambda')^{1/2}\|_{\mathrm{HS}(H)} < \infty$ for $\mu_0$-almost every $L$. In particular, the set of linear operators with finite Bochner norm Eq. 8 has full measure under $\mu_0$.
By assumption, $K$ is trace-class on $H$ and $H_K$ is (strictly) contained in $H$, so we immediately have $\mu_0(\mathrm{HS}(H_K; H)) = 1$. As samples from a Gaussian measure lie in its support a.s. [stuart2010inverse, Lem. 6.15], it follows that $\|L K^{1/2}\|_{\mathrm{HS}(H)} < \infty$ for $\mu_0$-almost every $L$, where $\{\varphi'_j\}$ denotes the orthonormal eigenbasis of $\Lambda'$ corresponding to eigenvalues $\{\lambda'_j\}$. The facts that $L$ is self-adjoint and that $\{\varphi'_j\}$ forms an orthonormal basis of $H$ yield
$$ \big\| L (\Lambda')^{1/2} \big\|_{\mathrm{HS}(H)}^2 = \sum_{j=1}^{\infty} \lambda'_j \, \|L \varphi'_j\|^2 < \infty \quad \mu_0\text{-a.s.} $$
Therefore, by Eq. 8, $\mathbb{E}^{x' \sim \nu'} \|L x'\|^2 < \infty$ $\mu_0$-a.s. for $L \sim \mu_0$, that is, the set of operators with finite Bochner norm has full measure under $\mu_0$.
Our use of the weighted Hilbert–Schmidt space and its norm Eq. 8 is closely related to the notion of $\nu'$-measurable linear operators [mandelbaum1984linear], and similar considerations were made in [knapik2011bayesian, Prop. 3.2] for unbounded linear functionals. Since $\Lambda'$ is compact, the norm in Eq. 8 is weak in the sense that it is dominated by the usual Hilbert–Schmidt norm [gawarecki2010stochastic, Sec. 2.2]. We further note that, if $L$ commutes with the self-adjoint operator $\Lambda'$, then the $\Lambda'$-weighted Hilbert–Schmidt norm of $L$ corresponds to a weighted norm on the eigenvalues of $L$, and we use this fact to perform explicit calculations in Section 2.4. Aligned with the notion of test error from machine learning, the Bochner norm leads to the following definition.
Definition 1 (Test Error: Bayesian)
The test error of the posterior sample estimator $L \sim \mu^{Y,\mathsf{X}}$ is
$$ \mathbb{E}^{\mathsf{X}} \mathbb{E}^{Y \mid \mathsf{X}} \, \mathbb{E}^{L \sim \mu^{Y,\mathsf{X}}} \big\| (L - L^\dagger) (\Lambda')^{1/2} \big\|_{\mathrm{HS}(H)}^2. \tag{9} $$
The two outer expectations are with respect to the data, and the inner expectation is with respect to the Bayesian posterior. The definition of test error for the posterior mean is similar.
Definition 2 (Test Error: Point)
The test error of the posterior mean estimator $\bar{L}$ is
$$ \mathbb{E}^{\mathsf{X}} \mathbb{E}^{Y \mid \mathsf{X}} \big\| (\bar{L} - L^\dagger) (\Lambda')^{1/2} \big\|_{\mathrm{HS}(H)}^2. \tag{10} $$
We say that Eq. 9 or Eq. 10 tests in-distribution if $\Lambda' = \Lambda$ and out-of-distribution otherwise; these quantities are also referred to as prediction error [cai2006prediction]. In Section 3, we study the asymptotic performance of the estimators $\bar{L}$ and $L \sim \mu^{Y,\mathsf{X}}$ as $N \to \infty$, using the notion of posterior contraction with respect to the error, leading to analysis of Eq. 9 and Eq. 10.
2.3 Statistical Learning
We now adopt a statistical learning theory perspective for the operator regression problem. Let $\pi$ denote the joint probability measure implied by data model Eq. 2 and Section 2.2. The given data is then i.i.d., $(x_n, y_n) \sim \pi$. Since regression is the focus, it is natural to work with the square loss and define the expected risk and empirical risk, respectively, with minimizers of the latter viewed as point estimators. These two standard definitions need careful interpretation, however, since under the infinite-dimensional model with $\Gamma$ not trace-class on $H$, the cylinder measure $\mathcal{N}(0, \Gamma)$ is not a proper Gaussian measure on $H$, so that $y_n \notin H$ a.s.; as a consequence, both risks are infinite a.s. No generality is lost by directly working in this setting, as pre-whitening yields the transformed equation $\Gamma^{-1/2} y = \big(\Gamma^{-1/2} L^\dagger\big) x + \Gamma^{-1/2} \eta$, where $\Gamma^{-1/2} \eta \sim \mathcal{N}(0, \mathrm{Id}_H)$
is Gaussian white noise (and $\mathrm{Id}_H$ is not trace-class on $H$); the learning can then proceed as usual with data $\{(x_n, \Gamma^{-1/2} y_n)\}$ and unknown (in general, unbounded) operator $\Gamma^{-1/2} L^\dagger$. In fact, we assume more: $\Gamma$ commutes with $L^\dagger$ on $D(L^\dagger)$. Under Section 2.3, pre-whitening may then be applied to the data pairs themselves, $(x_n, y_n) \mapsto \big(\Gamma^{-1/2} x_n, \Gamma^{-1/2} y_n\big)$, while the unknown $L^\dagger$ remains unchanged. Henceforth, it suffices to only consider the white noise case, and we impose this explicitly: the Gaussian noise covariance operator is white, $\Gamma = \mathrm{Id}_H$.
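The pre-whitening step can be checked in a finite-dimensional diagonal model, where $\Gamma^{-1/2}$ acts coordinate-wise; the diagonal matrices below are hypothetical and chosen so that the truth commutes with the noise covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 4, 3
gamma = np.array([4.0, 1.0, 0.25, 0.0625])  # eigenvalues of diagonal noise covariance
L_dag = np.diag([1.0, 2.0, 3.0, 4.0])       # diagonal truth: commutes with Gamma

X = rng.standard_normal((N, d))
E = np.sqrt(gamma) * rng.standard_normal((N, d))  # noise ~ N(0, diag(gamma))
Y = X @ L_dag.T + E

# Pre-whitening: apply Gamma^{-1/2} to the data pairs themselves; because L_dag
# and Gamma commute, the unknown operator is unchanged and the noise is whitened.
Xw = X / np.sqrt(gamma)
Yw = Y / np.sqrt(gamma)

residual = Yw - Xw @ L_dag.T  # should equal the whitened noise
white = E / np.sqrt(gamma)
```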
The Bayesian formula Eq. 7 provides inspiration for an alternative definition of risk in infinite dimensions, namely, to redefine it as the negative log likelihood of the data [alberti2021learning, nickl2020convergence].
Definition 3 (Expected Risk)
The expected risk (or population risk) is
$$ \mathcal{R}(L) := \mathbb{E}^{(x, y) \sim \pi} \Big[ \tfrac{1}{2} \|L x\|^2 - \langle y, L x \rangle \Big]. \tag{11} $$
Definition 4 (Empirical Risk)
The empirical risk is
$$ \widehat{\mathcal{R}}_N(L) := \frac{1}{N} \sum_{n=1}^{N} \Big[ \tfrac{1}{2} \|L x_n\|^2 - \langle y_n, L x_n \rangle \Big], \tag{12} $$
and the regularized empirical risk is
$$ \widehat{\mathcal{R}}_N^{\lambda}(L) := \widehat{\mathcal{R}}_N(L) + \frac{1}{2N} \big\| P L K^{1/2} \big\|_{\mathrm{HS}(H)}^2, \tag{13} $$
where $P$ is a linear weighting operator and $K$ is as in Section 2.2.
Equations 11 and 12 are indeed finite: the infinite "constant" $\tfrac{1}{2}\|y\|^2$ is subtracted out and the inner product contributions are finite a.s. even though $y \notin H$ a.s., as may be deduced from a limiting procedure [stuart2010inverse, Thm. 6.14]. The second term in Eq. 13 acts like a Tikhonov penalty on the least squares functional, but generalized to operators.
Learning an input-output model for $L^\dagger$ given finite data requires a suitable hypothesis class of linear operators on $H$, henceforth denoted by $\mathcal{H}$, and leads to the optimization problems
$$ \min_{L \in \mathcal{H}} \widehat{\mathcal{R}}_N(L) \qquad \text{and} \qquad \min_{L \in \mathcal{H}} \widehat{\mathcal{R}}_N^{\lambda}(L). \tag{14} $$
For now, we take $\mathcal{H}$ to be the separable Hilbert space $\mathrm{HS}(H_K; H)$ from Section 2.2. The first problem in Eq. 14 is known as empirical risk minimization [mathieu2021excess, tabaghi2019learning]; the second is its regularized counterpart, and its minimizer, denoted by $\widehat{L}$, is the focus of our discussion. These problems are meant to approximate the best linear estimator given infinite data, denoted by $L^\star$, which is a solution to $\min_{L \in \mathcal{H}} \mathcal{R}(L)$; this agrees with minimizers of the expected square loss since the two objectives differ only by a constant independent of $L$. Indeed, it is well known [mandelbaum1984linear, Thm. 2] that
$$ L^\star x = \mathbb{E}[\, y \mid x \,] \quad \text{for } \pi\text{-almost every } x. \tag{15} $$
Relatedly, formal computations reveal that $\widehat{L}$ is essentially a regularized empirical approximation to the right hand side of Eq. 15 and additionally may be identified as the posterior mean $\bar{L}$ from Eq. 5 whenever the regularization is induced by $\Sigma$ (the prior covariance). In this case the quadratic penalty in Eq. 13 is the Cameron–Martin norm of the prior $\mu_0$, and the maximum a posteriori estimator (which equals the posterior mean in our linear Gaussian setting) must minimize the Onsager–Machlup functional Eq. 13 [dashti2013map]; henceforth we write $\widehat{L} = \bar{L}$. To quantify the performance of $\bar{L}$, we employ the following well known error functionals.
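The identification of the regularized minimizer with the posterior mean is transparent in diagonal coordinates, where each eigenvalue solves a scalar ridge regression; the decay rates below are hypothetical, and the scalar conjugate posterior mean formula $(\sigma_j^{-2} + \sum_n x_{nj}^2)^{-1} \sum_n x_{nj} y_{nj}$ is standard.

```python
import numpy as np

rng = np.random.default_rng(3)
N, J = 500, 50
j = np.arange(1, J + 1, dtype=float)
l_true = j ** 0.5    # eigenvalues of an unbounded-type diagonal truth (illustrative)
lam = j ** -2.0      # input covariance eigenvalues (trace-class, illustrative)
sigma2 = j ** -1.0   # prior variances on the eigenvalues (illustrative)

Xc = np.sqrt(lam) * rng.standard_normal((N, J))  # input coordinates x_{nj}
Yc = l_true * Xc + rng.standard_normal((N, J))   # white observational noise

# Ridge / regularized empirical risk minimizer, coordinate by coordinate:
#   min_l  sum_n (y_nj - l * x_nj)^2 + l^2 / sigma_j^2
sxx = np.sum(Xc ** 2, axis=0)
sxy = np.sum(Xc * Yc, axis=0)
l_ridge = sxy / (sxx + 1.0 / sigma2)

# Conjugate Gaussian posterior mean with prior l_j ~ N(0, sigma_j^2), unit noise:
l_post = sxy / (1.0 / sigma2 + sxx)  # identical to the ridge solution
```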
Definition 5 (Excess Risk)
The excess risk of the posterior mean $\bar{L}$ is
$$ \mathbb{E} \big[ \mathcal{R}(\bar{L}) \big] - \mathcal{R}(L^\star). \tag{16} $$
Definition 6 (Generalization Gap)
The generalization gap of the posterior mean $\bar{L}$ is
$$ \mathbb{E} \big[ \mathcal{R}(\bar{L}) - \widehat{\mathcal{R}}_N(\bar{L}) \big]. \tag{17} $$
The excess risk is always nonnegative and provides a notion of consistency for the estimator $\bar{L}$, while the generalization gap can take any sign and is defined with respect to the unregularized empirical risk. Subtracting the generalization gap from the excess risk gives $\mathbb{E}\big[\widehat{\mathcal{R}}_N(\bar{L})\big] - \mathcal{R}(L^\star)$ which, if known, enables estimation of the expected risk $\mathcal{R}(L^\star)$ of the best linear estimator because the empirical risk $\widehat{\mathcal{R}}_N(\bar{L})$ is computable. We now elaborate on Definitions 5 and 6.
2.3.1 Excess Risk
Under Section 2.2, it holds that
$$ \mathcal{R}(L) - \mathcal{R}(L^\star) = \tfrac{1}{2} \big\| (L - L^\star) \Lambda^{1/2} \big\|_{\mathrm{HS}(H)}^2. $$
The excess risk of the posterior mean estimator is thus $\tfrac{1}{2}\, \mathbb{E} \big\| (\bar{L} - L^\dagger) \Lambda^{1/2} \big\|_{\mathrm{HS}(H)}^2$, one half of the in-distribution test error.
2.3.2 Generalization Gap
2.4 Theorem Setting
We now describe the mathematical setting under which our theorems are developed; this setting is not as general as that outlined in the previous two subsections, but it is more amenable to analysis. To this end, let $\{\varphi_j\}_{j \in \mathbb{N}}$ be an orthonormal basis of $H$. We specialize our hypothesis class of linear operators from Section 2.3 to the set
$$ \mathcal{H} := \Big\{ L = \sum_{j=1}^{\infty} l_j\, \varphi_j \otimes \varphi_j : \{l_j\}_{j \in \mathbb{N}} \text{ such that } \sum_{j=1}^{\infty} k_j l_j^2 < \infty \Big\}, \tag{21} $$
where $K$ defines the input Hilbert space $H_K$ for $\mathrm{HS}(H_K; H)$ and has eigenvectors $\{\varphi_j\}$ and corresponding eigenvalues $\{k_j\}$ with property $k_j \asymp j^{-2\alpha}$ for some $\alpha > 1/2$; thus $K$ is trace-class. The series representation of $L \in \mathcal{H}$ converges in $\mathrm{HS}(H_K; H)$, which is equivalent to convergence of the eigenvalue sequence $\{l_j\}$ in $\mathcal{H}^{-\alpha}$: $\|L\|_{\mathrm{HS}(H_K; H)}^2 = \sum_j k_j l_j^2 \asymp \|\{l_j\}\|_{\mathcal{H}^{-\alpha}}^2$. Although taking $K$ with $\alpha \leq 1/2$ violates the imposed trace-class assumption, the norm on $\mathcal{H}$ is still well-defined, and hence we abuse notation and allow any $\alpha \in \mathbb{R}$ to determine $\mathcal{H}$.
In Eq. 21, we view any $L \in \mathcal{H}$ as a densely defined operator with
$$ D(L) := \Big\{ u \in H : \sum_{j=1}^{\infty} l_j^2 \langle u, \varphi_j \rangle^2 < \infty \Big\}. \tag{22} $$
It is straightforward to check that any such $L$ equipped with domain Eq. 22 is self-adjoint on $H$. In words, $\mathcal{H}$ consists of densely defined, self-adjoint linear operators diagonal in the basis $\{\varphi_j\}$ with controlled growth or decay of their eigenvalues $l_j$, $j \in \mathbb{N}$. In particular, $\mathcal{H}$ contains unbounded operators, for example, any operator with eigenvalues $|l_j| \to \infty$ such that $\sum_j k_j l_j^2 < \infty$. However, self-adjoint linear operators with nonempty continuous spectra fall outside the scope of our framework and require more advanced techniques to handle (see, e.g., [colbrook2021computing]).
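A truncated numerical example of such an operator, with the hypothetical eigenvalue choice $l_j = \sqrt{j}$: the largest eigenvalue (the truncated operator norm) diverges with the truncation level, while a weighted Hilbert–Schmidt norm stays finite when the test covariance eigenvalues (here $\lambda'_j = j^{-4}$, also hypothetical) decay fast enough.

```python
import numpy as np

def diag_apply(l, u):
    """Apply the diagonal operator L = sum_j l_j (phi_j ⊗ phi_j) in coordinates."""
    return l * u

for J in (10, 100, 1000):
    j = np.arange(1, J + 1, dtype=float)
    l = np.sqrt(j)        # unbounded: sup_j |l_j| = sqrt(J) -> infinity as J grows
    lam_test = j ** -4.0  # test covariance eigenvalues (hypothetical fast decay)
    op_norm = np.max(np.abs(l))
    # ||L (Lambda')^{1/2}||_HS^2 = sum_j l_j^2 * lambda'_j = sum_j j^{-3}: finite
    weighted_hs = np.sqrt(np.sum(l ** 2 * lam_test))
```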
2.4.1 Main Assumptions
Recall that $\mathcal{N}(0, \Lambda)$ is the input training data probability distribution, $\mathcal{N}(0, \mathrm{Id}_H)$ is the observational noise distribution, and $\mu_0 = \mathcal{N}(0, \Sigma)$ is the prior. As in Definition 1, $\nu'$ is an arbitrary test data distribution, which we restrict to be the Gaussian measure $\mathcal{N}(0, \Lambda')$ for simplicity. We impose the following main assumptions: the assumptions of Sections 2.2 and 2.3 hold; the truth is an element of the hypothesis class, that is, $L^\dagger \in \mathcal{H}$; $\Lambda$ and $\Lambda'$ are simultaneously diagonalizable in the same orthonormal basis of eigenvectors, denoted by $\{\varphi_j\}$, with eigenvalue sequences $\{\lambda_j\}$ and $\{\lambda'_j\}$, respectively; as a mapping on operators, $\Sigma$ is "doubly diagonal" in the same basis, in the sense that its eigenvectors are the rank-one operators $\varphi_i \otimes \varphi_j$; the eigenvalue sequences of the truth, the prior, and the covariances grow or decay algebraically in $j$; and, finally, the relevant trace-class conditions on $\Lambda$, $\Lambda'$, and $\Sigma$ hold.
Remark 2 (Interpretation of Section 2.4.1)
Since $L^\dagger \in \mathcal{H}$, the class $\mathcal{H}$ does not introduce any approximation error, and hence we need only control the estimation error due to finite data; this is the well-specified setting. The simultaneous diagonalizability condition is provided for convenience, while the doubly diagonal condition guarantees that samples from the prior are actually diagonal in the basis $\{\varphi_j\}$, as a KL expansion shows. The algebraic growth or decay of the sequences closely aligns our setting with that in [knapik2011bayesian], although natural extensions to exponential rates are also possible, as in [agapiou2014bayesian]. Additionally, the regularity parameter of the truth is introduced as the "largest" exponent such that $L^\dagger$ belongs to the corresponding Sobolev-like sequence space (and such an exponent is generally unknown a priori), but this lacks precision at the boundary for certain sequences, as noted in [knapik2011bayesian]. One remedy would be to consider slowly varying functions as in that paper, but we do not pursue this. The final conditions restate part of Section 2.2 in diagonal form and ensure that samples from the prior are trace-class on the relevant weighted operator spaces a.s.
The prior and test measure "smoothness" exponents are viewed as tunable, while those of the truth and the training data are fixed by the data. Furthermore, by the Feldman–Hájek theorem [dashti2017, Thm. 37], the test measure $\mathcal{N}(0, \Lambda')$ is singular with respect to the training measure $\mathcal{N}(0, \Lambda)$ whenever their eigenvalue decay rates differ.
Under Section 2.4.1, we may work entirely in coordinates with respect to the orthonormal bases $\{\varphi_j\}$ of $H$ and $\{\varphi_i \otimes \varphi_j\}$ of operator space, leading to a prior and posterior on sequences. Many papers have studied statistical inverse problems in a similar sequence model form and under related commuting assumptions [agapiou2018posterior, agapiou2014bayesian, cavalier2008nonparametric, knapik2011bayesian]. However, our Gaussian diagonal setting is different: not only do the covariance operators commute, but so do the (whitened, per Section 2.3) noise covariance and the truth $L^\dagger$; there is no analog of this additional layer of diagonalizability in the previous works due to the unique linear operator structure of the unknown here. Also, the complete theory in [knapik2011bayesian] is not easy to apply in our setting since it requires an eigenbasis for the forward map.
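In this diagonal sequence form, the Bayesian inverse problem decouples into independent scalar conjugate Gaussian regressions, one per eigenvalue. The sketch below (all decay rates illustrative) computes the per-coordinate posterior and shows the $\Lambda$-weighted estimation error shrinking as the sample size grows.

```python
import numpy as np

def posterior_eigendata(l_true, lam, sigma2, N, rng):
    """Per-coordinate conjugate posterior for the eigenvalue sequence:
    data y_nj = l_j x_nj + eta_nj with x_nj ~ N(0, lam_j), eta_nj ~ N(0, 1),
    and prior l_j ~ N(0, sigma2_j). Returns posterior means and variances."""
    J = len(l_true)
    Xc = np.sqrt(lam) * rng.standard_normal((N, J))
    Yc = l_true * Xc + rng.standard_normal((N, J))
    prec = 1.0 / sigma2 + np.sum(Xc ** 2, axis=0)  # posterior precision per mode
    mean = np.sum(Xc * Yc, axis=0) / prec          # posterior mean per mode
    return mean, 1.0 / prec

rng = np.random.default_rng(4)
J = 200
jj = np.arange(1, J + 1, dtype=float)
l_true = jj ** 0.25  # illustrative (unbounded-type) truth
lam = jj ** -2.0     # input covariance eigenvalues
sigma2 = jj ** -1.0  # prior variances

errs = []
for N in (10, 100, 1000):
    mean, var = posterior_eigendata(l_true, lam, sigma2, N, rng)
    errs.append(np.sqrt(np.sum(lam * (mean - l_true) ** 2)))  # Lambda-weighted error
```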
We write the prior in sequence form, identifying $L \sim \mu_0$ with its eigenvalue sequence $\{l_j\}$. The subscript "seq" is used here, and in what follows below, to denote the sequence space representation of a measure that can be diagonalized. If $\Sigma$ and $K$ commute (and hence with $\Lambda$ also), then without loss of generality $\Sigma(\varphi_j \otimes \varphi_j) = \sigma_j^2\, \varphi_j \otimes \varphi_j$ for all $j \in \mathbb{N}$, so that the action of $\mu_0$ on $L$ in coordinates is given by