Objects of interest in many problems in machine learning and inverse problems are best modeled as vectors in infinite-dimensional function spaces. This is the case in collaborative filtering in machine learning, non-linear inverse problems for the wave equation, and general regularized solutions to inverse problems. Relationships between these objects are often well-modeled by linear operators.
Among linear operators, compact operators are a natural target for learning because they are stable and appear commonly in applications. In infinite dimension, boundedness alone does not guarantee learnability (cf. Remark 4.2). Unbounded operators are poor targets for learning since they are not stable. A classical example from regularization of inverse problems is that if the direct operator is compact (for example, the Radon transform in computed tomography), then the inverse is unbounded, so we replace it by a compact approximation.
When addressing inverse and learning problems numerically, we need to discretize the involved operators. The fundamental properties of operator fitting, however, are governed by the continuous structure, and it is interesting to study properties of these continuous objects.
Learning continuous operators from samples puts forward two important questions:
Can a given class of operators be learned from samples? How many samples are needed to guarantee that we learn the “best” possible operator? Here we assume that the input samples and the output samples are drawn from some joint probability measure which admits a “best” that maps to , and we ask whether can be approximated from samples from .
What practical algorithms exist to learn operators belonging to certain classes of compact operators, given that those classes are infinite-dimensional, and in fact non-compact?
In this paper we address these questions for compact operators known as $p$-Schatten–von Neumann operators, whose singular value sequences have finite $\ell^p$ norms. These operators find applications in a number of inverse problems, in particular those related to scattering. For the first question, we show that the class is learnable in the probably-approximately-correct sense for all $p < \infty$, and we prove the dependence of sample complexity on $p$. We work within the Vapnik–Chervonenkis framework of statistical learning theory, and bound the sample complexity by computing the Rademacher complexity of the class as a function of $p$.
For the second question, we adapt the results of Abernethy et al., who showed that infinite-dimensional learning problems similar to ours can be transformed into finite-dimensional optimization problems. In our proofs, we explicitly account for the non-compactness of the involved hypothesis classes.
1.1 Related work
The closest work to ours is that of Maurer on sample complexity for multitask learning [3, 4]. He computes the sample complexity of finite-rank operators , where is a Hilbert space. More generally, there is a substantial body of work on the complexity of finite-dimensional classes. Kakade et al. study the generalization properties of scalar-valued linear regression on Hilbert spaces. A survey of statistical bounds for estimation and classification in terms of Rademacher complexity is available in [6, 7]. To the best of our knowledge, this is the first paper to look at the sample complexity of learning infinite-dimensional operators, where the hypothesis class is non-compact.
On the algorithmic side, Abernethy et al. propose learning algorithms for a problem related to ours. They show how, in the context of collaborative filtering, a number of existing algorithms can be abstractly modeled as learning compact operators, and derive a representer theorem which casts the problem as optimization over matrices for general losses and regularizers.
Our work falls under the purview of “machine learning for inverse problems”, an expanding field concerned with learning inverse operators and regularizers from data, spurred in particular by the empirical successes of deep neural networks. Machine learning has been successfully applied to problems in computed tomography [8, 9], inverse scattering, and compressive sensing [bora:2017compressed], to name a few. A number of parallels between statistical learning and inverse problems are pointed out in .
2.1 Regularized Inverses in Imaging
Consider a linear operator equation
where is a Schatten-von Neumann operator. This is the case in a variety of “mildly” ill-posed inverse problems such as computed tomography, where is the Radon transform. It is well known that even when is injective, the compactness of makes the problem ill-posed in the sense that is not a continuous linear operator from . This becomes problematic whenever instead of we get to measure some perturbed data , as we always do. It also makes a poor target for learning since it does not make much sense to learn unstable operators.
A classical regularization technique is then to solve
which can be interpreted as a maximum a posteriori solution under Gaussian noise and signal priors. The resulting solution operator can formally be written as
which can be shown to be a Schatten-von Neumann operator. ( denotes the adjoint of .) If the forward operator is completely or partially unknown, or the noise and prior distributions are different from isotropic Gaussian, it is of interest to learn the best regularized linear operator from samples. Thus one of the central questions becomes that of sample complexity and generalization error.
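The stabilizing effect of such a regularized solution operator can be checked numerically. The following is a minimal sketch under stated assumptions: the penalty is taken to be a squared-norm (Tikhonov) term with weight `lam`, which matches the isotropic Gaussian interpretation above, and the forward operator is a hypothetical finite matrix `A` with rapidly decaying singular values standing in for a compact operator. The names `tikhonov_inverse`, `A`, `sigma` and `lam` are illustrative and do not come from the text.

```python
import numpy as np

def tikhonov_inverse(A, lam):
    """Matrix of the map y -> argmin_x ||y - A x||^2 + lam * ||x||^2."""
    n = A.shape[1]
    # Normal equations: (A^T A + lam I) x = A^T y
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T)

rng = np.random.default_rng(0)

# Hypothetical discretized forward operator: orthogonal factors and
# singular values decaying like 1/k^2, mimicking a compact operator.
U, _ = np.linalg.qr(rng.standard_normal((50, 50)))
V, _ = np.linalg.qr(rng.standard_normal((50, 50)))
sigma = 1.0 / np.arange(1, 51) ** 2
A = U @ np.diag(sigma) @ V.T

lam = 1e-3
R = tikhonov_inverse(A, lam)

# Singular values of the regularized inverse are sigma / (sigma^2 + lam),
# which never exceed 1 / (2 sqrt(lam)); the plain inverse has 1 / sigma.
sv_R = np.linalg.svd(R, compute_uv=False)
expected = np.sort(sigma / (sigma ** 2 + lam))[::-1]

print("largest singular value, regularized inverse:", sv_R.max())
print("largest singular value, plain inverse:      ", (1.0 / sigma).max())
```

In this toy setting the singular values of the plain inverse grow without bound as the discretization is refined, while those of the regularized inverse stay bounded and decay, which is the behavior that makes the regularized operator a reasonable learning target.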
2.2 Collaborative Filtering
In collaborative filtering the goal is to predict rankings of objects belonging to some class by users belonging to . Abernethy et al. show that both the users and the objects are conveniently modeled as belonging to infinite-dimensional reproducing kernel Hilbert spaces and . The rankings can then be modeled by the following functional on ,
Given a training sample consisting of users and rankings embedded in their respective spaces, the “best” is estimated by minimizing regularized empirical risk. The regularizers used by Abernethy et al. are of the form , where are the singular values of , and are non-decreasing penalty functions.
2.3 Schatten–von Neumann Operators in Wave Problems
Schatten–von Neumann operators play an important role in non-linear inverse problems associated with the wave equation. In particular, in inverse scattering approaches via boundary control and scattering control [13, 14], the reconstruction algorithms are given as recursive procedures with data operators that belong to the Schatten–von Neumann class. Under conditions that the inverse problem yields a unique solution, the question whether the inverse map is learnable may be analyzed by studying whether a Schatten–von Neumann operator is learnable. While the overall inverse maps in these cases are nonlinear, we see the results presented here as a gateway to studying the learning-theoretic aspects of these general nonlinear inverse problems.
3 Problem Statement and Main Result
The main goal of this paper is to study the learnability of the Schatten–von Neumann class of compact operators. We use a model-free or agnostic approach, in the sense that we do not require our training samples to satisfy for any putative , and the optimal risk can be nonzero. Instead, we are looking for an operator that provides the best fit to a given training set generated from an arbitrary distribution.
As usual, the training set consists of i.i.d. samples , where . The samples are generated from an unknown probability measure , where is the set of all probability measures defined on . We take to be the Cartesian product of input and output Hilbert spaces, namely . The hypothesis space is the set of all $p$-Schatten–von Neumann operators:
[Schatten–von Neumann] If the sequence of singular values of a compact linear operator is $p$-summable, for , then is said to belong to the $p$-Schatten–von Neumann class, denoted .
For a given sample and hypothesis , we measure the goodness of fit, or the loss, as . The risk of a hypothesis is defined as its average loss with respect to the data-generating measure, .
Ideally, we would then like to find an operator that has the minimum risk among the hypothesis class. While we do not have access to the true data-generating distribution, we get to observe it through a finite number of random samples, , . Given a training set , a central question is whether we can guarantee that the hypothesis learned from training samples will do well on other samples as well. Put differently, we want to show that the risk of the hypothesis learned from a finite number of training samples is close to that of the hypothesis estimated with the knowledge of the true distribution, with high probability.
We show that the class is indeed learnable in the probably-approximately-correct (PAC) sense by showing that the risk achieved by the empirical risk minimization (ERM) converges to the true risk as the sample size grows. Concretely, assuming that and are bounded almost surely, we show that
with high probability over training samples, where
is a random minimizer of the empirical risk. (Existence of the empirical risk minimizer follows from Section 5.) We thus show that the class of $p$-Schatten–von Neumann operators is PAC-learnable, since ERM produces a hypothesis whose risk converges to the best possible. The rate of convergence determines the number of samples required to guarantee a given target risk with high probability.
The result makes intuitive sense. To see this, note that the inclusion
for , implies that the smaller the $p$, the easier the class is to learn, as predicted by the theorem.
The minimization for involves optimization over an infinite-dimensional, non-compact set. In Section 5 we show that, in fact, the ERM can be carried out by a well-defined finite-dimensional optimization.
4 Learnability Theorem
In order to state and prove the learnability theorem, we first recall some facts about linear operators in Hilbert spaces. Then, we connect PAC-learnability of the $p$-Schatten class of operators with its Rademacher complexity in Section 4.2. Finally, we establish the learnability claim by showing that this Rademacher complexity vanishes as the number of measured samples grows, Section 4.2.
Let be complex Hilbert spaces, a compact operator, and let denote its adjoint operator. Let be the eigenvectors of a compact, self-adjoint, and non-negative operator, and be the corresponding eigenvalues. We define the absolute value of as . The eigenvalues of , , are called the singular values of and denoted by . We always assume, without loss of generality, that the sequence of singular values is non-increasing.
Recall the definition of Schatten–von Neumann operators, Section 3. For , the class becomes a Banach space equipped with the norm ,
Specifically, is the set of Hilbert-Schmidt operators while is the algebra of trace class operators. Let , or simply , and be any orthonormal basis for (which exists because is a Hilbert space). The series
is well-defined and called the trace of . Finally, is the class of bounded operators endowed with the operator norm .
An important fact about the set of bounded Schatten operators is that it is not compact, which requires a bit of care when talking about the various minimizers. For , is a complete space with quasi-norm of .
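For intuition, the Schatten (quasi-)norms can be computed explicitly in finite dimensions. The sketch below assumes a finite matrix `T` as a stand-in for a compact operator, so that the norm is the $\ell^p$ (quasi-)norm of its singular values; the helper name `schatten_norm` is illustrative. It shows the trace, Hilbert–Schmidt and operator norms as the special cases $p = 1$, $p = 2$ and $p = \infty$.

```python
import numpy as np

def schatten_norm(T, p):
    """Schatten p-(quasi-)norm of a matrix: l^p norm of its singular values."""
    s = np.linalg.svd(T, compute_uv=False)
    if np.isinf(p):
        return s.max()          # operator norm
    return (s ** p).sum() ** (1.0 / p)

rng = np.random.default_rng(0)
T = rng.standard_normal((6, 4))

quasi_norm = schatten_norm(T, 0.5)    # p < 1: only a quasi-norm
trace_norm = schatten_norm(T, 1)      # trace class norm
hs_norm = schatten_norm(T, 2)         # Hilbert-Schmidt (Frobenius) norm
op_norm = schatten_norm(T, np.inf)    # operator norm

print(quasi_norm, trace_norm, hs_norm, op_norm)
```

The values are non-increasing in $p$, which is the finite-dimensional counterpart of the fact that the classes grow with $p$.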
4.2 Rademacher Complexity and Learnability of the Hypothesis Class
In this section, we show that the ERM algorithm PAC-learns the class of $p$-Schatten–von Neumann operators. The proofs of all formal results are given in Section 6.
Informally, ERM works if we have enough measurements to dismiss all bad near-minimizers of the empirical risk. In other words, generalization ability of the ERM minimizer is negatively affected by the complexity of the hypothesis class. A common complexity measure for classes of real-valued functions is the Rademacher complexity [17, 18].
[Rademacher Complexity] Let be a probability distribution on a set and suppose that are independent samples selected according to . Let be a class of functions mapping from to . Given samples , the conditional Rademacher complexity of is defined as
are i.i.d. Rademacher random variables. The Rademacher complexity of is .
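A conditional Rademacher complexity can be estimated by Monte Carlo once the supremum over the class is computable. The sketch below is illustrative only and makes two assumptions that are not taken from the text: it uses the convention $\mathbb{E}_\varepsilon \sup_f \frac{1}{n}\sum_i \varepsilon_i f(x_i)$ (conventions differ by constant factors and absolute values), and it uses a simple stand-in class of linear functionals $x \mapsto \langle w, x\rangle$ with $\|w\| \le B$ rather than the loss class studied here. For that class the supremum has the closed form $\frac{B}{n}\|\sum_i \varepsilon_i x_i\|$.

```python
import numpy as np

def empirical_rademacher_linear(X, B, num_draws=2000, seed=0):
    """Monte Carlo estimate of the conditional Rademacher complexity of
    {x -> <w, x> : ||w|| <= B} given the samples in the rows of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Each row of eps is one draw of n i.i.d. Rademacher signs.
    eps = rng.choice([-1.0, 1.0], size=(num_draws, n))
    # sup over ||w|| <= B of (1/n) sum_i eps_i <w, x_i>
    #   = (B / n) * || sum_i eps_i x_i ||
    sups = B * np.linalg.norm(eps @ X, axis=1) / n
    return sups.mean()

def unit_norm_samples(n, d, seed):
    """Hypothetical bounded inputs: n random points on the unit sphere."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

B = 1.0
r_100 = empirical_rademacher_linear(unit_norm_samples(100, 5, seed=1), B)
r_400 = empirical_rademacher_linear(unit_norm_samples(400, 5, seed=2), B)

print("n = 100:", r_100, " Jensen bound:", B / np.sqrt(100))
print("n = 400:", r_400, " Jensen bound:", B / np.sqrt(400))
```

For unit-norm inputs, Jensen's inequality gives the bound $B/\sqrt{n}$, so the estimate shrinks as the sample size grows, which is the qualitative behavior the learnability argument relies on.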
The following result connects PAC-learnability of our operators with the Rademacher complexity of the hypothesis class. It shows that a class is PAC-learnable when its Rademacher complexity converges to zero as the size of the training dataset grows.
[Learnability] Let be independently selected according to the probability measure . Moreover, assume and are bounded random variables, and almost surely. Then, for any , with probability at least over samples of length , we have
where , , and , and .
The proof is standard; we adapt it for the case of Schatten–von Neumann operators.
Our main result is the following bound on the Rademacher complexity of the loss class induced by .
[Vanishing Rademacher Complexity] Let be independently selected according to the probability measure and . Moreover, assume and almost surely. Then, we have
where , , and .
As already mentioned, the bound becomes worse as grows large, and it breaks down for . This makes intuitive sense: only tells us that the singular values are bounded. It includes situations where all singular values are equal to and thus approximating any finite number of singular vectors leaves room for arbitrarily large errors.
The bound saturates for and does not improve for . While this is likely an artifact of our proof technique, in terms of generalization error it is order optimal, due to the last term in Theorem 4.2.
Our proof of Theorem 4.2 uses properties of certain random finite-rank operators constructed from training data.
For notational convenience, let us define the operator such that for every , we have . To prove Theorem 4.2, we have to show that the operators and , with being i.i.d. Rademacher random variables, are well-behaved in the sense that they do not grow too fast with the number of training samples . This is the subject of the following lemma.
Let be independently selected according to the probability measure and . Define random operators and . We have
These growth rates then imply the stated bounds on the Rademacher complexity.
5 The Learning Algorithm
In the developments so far we have not discussed the practicalities of learning infinite-dimensional operators. The ERM for assumes we can optimize over an infinite-dimensional, non-compact class. We address this problem by adapting the results of Abernethy et al., who show for a different loss function that ERM for compact operators can be transformed into an optimization over finite-rank operators.
Denote by the linear span of , and by the linear span of . Let further be a projection onto and analogously for . Then, we have the following simple result:
Let and be projection operators. Then, we have for any operator and all .
Let . We have
Section 5 implies that for any
where is the empirical risk of .
We use this simple result to show that the in ERM is well-defined and can be achieved by solving a finite-dimensional convex program over a compact set.
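The effect of projecting an operator onto the spans of the training data can be checked numerically in finite dimensions. The following sketch assumes the squared loss $\|y_i - T x_i\|^2$ and uses random matrices in place of Hilbert-space operators; the helper names `span_projector`, `empirical_risk` and `schatten_norm` are illustrative. It compares a generic `T` with its compression `P_Y T P_X`.

```python
import numpy as np

def schatten_norm(T, p):
    """l^p (quasi-)norm of the singular values of T, for finite p > 0."""
    s = np.linalg.svd(T, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

def span_projector(V):
    """Orthogonal projector onto the column span of V."""
    Q, _ = np.linalg.qr(V)      # reduced QR: orthonormal basis of the span
    return Q @ Q.T

def empirical_risk(T, X, Y):
    """(1/n) sum_i ||y_i - T x_i||^2, with samples stored as columns."""
    return np.mean(np.sum((Y - T @ X) ** 2, axis=0))

rng = np.random.default_rng(0)
d_x, d_y, n = 30, 20, 5
X = rng.standard_normal((d_x, n))     # inputs x_1, ..., x_n as columns
Y = rng.standard_normal((d_y, n))     # outputs y_1, ..., y_n as columns
T = rng.standard_normal((d_y, d_x))   # an arbitrary candidate operator

P_X, P_Y = span_projector(X), span_projector(Y)
T_proj = P_Y @ T @ P_X

for p in (0.5, 1, 2, 4):
    print(p, schatten_norm(T_proj, p), "<=", schatten_norm(T, p))
print(empirical_risk(T_proj, X, Y), "<=", empirical_risk(T, X, Y))
```

In this experiment the compressed operator has rank at most `n`, a Schatten norm no larger than that of `T`, and an empirical risk no larger than that of `T`, so nothing is lost by restricting the search to operators of this form.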
By definition of , for any there exists such that
Let us choose for a sequence such that . By construction, we have
From Section 5,
On the other hand, for a compact operator , we have 
so that is feasible for the ERM.
Consider an optimization over the set of finite rank operators . From (4), we have
Since is a finite-dimensional set isometric to matrices, and a closed ball in finite dimension is compact, this optimization is over a compact set and the minimum is achieved. In summary,
From the definition , any can be written as
where and are two sets of complete orthonormal bases for linear subspaces and , and . Hence, we can identify with a matrix . Similarly, and can be represented as vectors and . With this notation, we can implement ERM via a finite-dimensional optimization in Algorithm 1.
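The listing of Algorithm 1 is not reproduced in this text, so the following is only a hedged sketch of what such a finite-dimensional ERM could look like, not the authors' algorithm. It assumes the squared loss, the Hilbert–Schmidt case $p = 2$ (so the constraint set is a Frobenius ball), linearly independent inputs, and plain projected gradient descent as the solver. All names (`erm_hilbert_schmidt`, `radius`, `steps`) are hypothetical.

```python
import numpy as np

def empirical_risk(T, X, Y):
    """(1/n) sum_i ||y_i - T x_i||^2, with samples stored as columns."""
    return np.mean(np.sum((Y - T @ X) ** 2, axis=0))

def erm_hilbert_schmidt(X, Y, radius, steps=500):
    """Projected gradient ERM over {M : ||M||_F <= radius} in reduced
    coordinates. X is (d_x, n), Y is (d_y, n); columns are samples."""
    n = X.shape[1]
    # Orthonormal bases of span{x_i} and span{y_i}; the columns of Rx and
    # Ry are the coordinate vectors of the samples in those bases.
    Qx, Rx = np.linalg.qr(X)
    Qy, Ry = np.linalg.qr(Y)
    # Step size 1/L, with L the Lipschitz constant of the gradient.
    L = 2.0 * np.linalg.norm(Rx, 2) ** 2 / n
    M = np.zeros((Ry.shape[0], Rx.shape[0]))
    for _ in range(steps):
        grad = 2.0 * (M @ Rx - Ry) @ Rx.T / n
        M = M - grad / L
        norm = np.linalg.norm(M)        # Frobenius = Schatten 2-norm
        if norm > radius:
            M = M * (radius / norm)     # Euclidean projection onto the ball
    # Lift the matrix back to an operator between the ambient spaces.
    return Qy @ M @ Qx.T, M

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
Y = rng.standard_normal((20, 5))

T_ball, M_ball = erm_hilbert_schmidt(X, Y, radius=1.0)
T_free, M_free = erm_hilbert_schmidt(X, Y, radius=1e6)

print("risk with radius 1:  ", empirical_risk(T_ball, X, Y))
print("risk with radius 1e6:", empirical_risk(T_free, X, Y))
```

Because the lifting maps are isometries, the empirical risk and the Frobenius norm are the same in the reduced coordinates and in the ambient spaces. For $p \neq 2$ the projection step would have to be replaced by a projection onto the corresponding Schatten ball, which this sketch does not attempt.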
6 Proofs of Formal Results
6.1 Proof of Theorem 4.2
We define the function class as
where , and adopt the following simplifying notations:
As we mentioned before, the existence of the minimizer in the definition of follows from the discussion in Section 5.
The following known result bounds the empirical risk in terms of a quantity called the uniform deviation , which measures maximum discrepancy between empirical and generalized risks among the elements of the hypothesis class.
The empirical risk minimization (ERM) algorithm satisfies the following inequality:
where is the uniform deviation for a function class and given sequence of random samples, , .
The proof is standard up to a modification due to the noncompactness of our . Let be such that . We have,
since the first and the third bracketed terms are bounded by the uniform deviation by definition, and the second term is non-positive because is minimized by . The conclusion follows from the fact that we can find such that for any and that
We want to find a probabilistic upper bound for the uniform deviation. This can be achieved by showing has the bounded difference property and using McDiarmid’s inequality. Let be a probability measure on a set and suppose that and are independent samples selected according to . Define and . Then,
for the function class in eq. 5.
where (a) and (b) are due to triangle inequality. Finally, supremum in is achieved by
which gives us inequality (c). ∎
6.2 Proof of Theorem 4.2
A simple analysis gives us an upper bound for .
The equality (a) simply follows from the definition of expected Rademacher complexity of , see Section 4.2, and (b) from triangle inequality and subadditivity property of supremum. In (c), we used Jensen’s inequality along with which leads to .
The following results explain the remaining inequalities.
Let and with bounded norms, and be a bounded operator. Then,
Define and . Let , be a set of orthonormal basis for and . Then,
Finally, proves the inequality .
We want to bound the expected Schatten -norms of random operators and . From Section 4.2,
Since and are bounded random variables, we conclude
6.3 Proof of Lemma 4.2
To prove Lemma 4.2, we start by stating two results about compact operators. Let denote the spectrum of an operator . Then we have:
 Let be convex. Then the functional is convex on the set and is the set of finite rank, self-adjoint operators.
The following lemma is a standard application of Hölder’s inequality.
Let be a non-negative, compact linear operator of rank at most . Then,
Let be the sequence of singular values of (with multiplicities), and . Then,
We now proceed to prove Lemma 4.2. If , we have: