# Learning Schatten--Von Neumann Operators

We study the learnability of a class of compact operators known as Schatten–von Neumann operators. These operators between infinite-dimensional function spaces play a central role in a variety of applications in learning theory and inverse problems. We address the question of the sample complexity of learning Schatten–von Neumann operators and provide an upper bound on the number of measurements required for the empirical risk minimizer to generalize with arbitrary precision and probability, as a function of the class parameter $p$. Our results give generalization guarantees for regression of infinite-dimensional signals from infinite-dimensional data. Next, we adapt the representer theorem of Abernethy et al. to show that empirical risk minimization over an a priori infinite-dimensional, non-compact set can be converted to a convex finite-dimensional optimization problem over a compact set. In summary, the class of $p$-Schatten–von Neumann operators is probably approximately correct (PAC)-learnable via a practical convex program for any $p < \infty$.


## 1 Introduction

Objects of interest in many problems in machine learning and inverse problems are best modeled as vectors in infinite-dimensional function spaces. This is the case in collaborative filtering in machine learning [1], non-linear inverse problems for the wave equation, and general regularized solutions to inverse problems [2]. Relationships between these objects are often well-modeled by linear operators.

Among linear operators, compact operators are a natural target for learning because they are stable and appear commonly in applications. In infinite dimension, boundedness alone does not guarantee learnability (cf. Remark 4.2). Unbounded operators are poor targets for learning since they are not stable. A classical example from regularization of inverse problems is that if the direct operator is compact (for example, the Radon transform in computed tomography), then the inverse is unbounded, so we replace it by a compact approximation.

When addressing inverse and learning problems numerically, we need to discretize the involved operators. The fundamental properties of operator fitting, however, are governed by the continuous structure, and it is interesting to study properties of these continuous objects.

Learning continuous operators from samples puts forward two important questions:

1. Can a given class of operators be learned from samples? How many samples are needed to guarantee that we learn the “best” possible operator? Here we assume that the input samples $x$ and the output samples $y$ are drawn from some joint probability measure $P$ which admits a “best” $T$ that maps $x$ to $y$, and we ask whether $T$ can be approximated from samples from $P$.

2. What practical algorithms exist to learn operators belonging to certain classes of compact operators, given that those classes are infinite-dimensional, and in fact non-compact?

In this paper we address these questions for compact operators known as $p$-Schatten–von Neumann operators, whose singular value sequences have finite $\ell^p$ norms. These operators find applications in a number of inverse problems, in particular those related to scattering. For the first question, we show that the class is learnable in the probably-approximately-correct sense for all $p < \infty$, and we prove the dependence of sample complexity on $p$. We work within the Vapnik–Chervonenkis framework of statistical learning theory, and bound the sample complexity by computing the Rademacher complexity of the class as a function of $p$.

For the second question, we adapt the results of [1], who showed that infinite-dimensional learning problems similar to ours can be transformed into finite-dimensional optimization problems. In our proofs, we explicitly account for the non-compactness of the involved hypothesis classes.

### 1.1 Related work

The closest work to ours is that of Maurer on sample complexity for multitask learning [3, 4]. He computes the sample complexity of finite-rank operators defined on a Hilbert space. In general, there is quite a bit of work on the complexity of finite-dimensional classes. Kakade et al. [5] study the generalization properties of scalar-valued linear regression on Hilbert spaces. A survey of statistical bounds for estimation and classification in terms of Rademacher complexity is available in [6, 7]. To the best of our knowledge, this is the first paper to look at the sample complexity of learning infinite-dimensional operators, where the hypothesis class is non-compact.

On the algorithmic side, Abernethy et al. [1] propose learning algorithms for a problem related to ours. They show how, in the context of collaborative filtering, a number of existing algorithms can be abstractly modeled as learning compact operators, and derive a representer theorem which casts the problem as optimization over matrices for general losses and regularizers.

Our work falls under the purview of “machine learning for inverse problems”, an expanding field of learning inverse operators and regularizers from data, especially with the empirical successes of deep neural networks. Machine learning has been successfully applied to problems in computed tomography [8, 9], inverse scattering [10], and compressive sensing [bora:2017compressed], to name a few. A number of parallels between statistical learning and inverse problems are pointed out in [11].

## 2 Motivation

### 2.1 Regularized Inverses in Imaging

Consider a linear operator equation

$$y = Ax,$$

where $A$ is a Schatten–von Neumann operator. This is the case in a variety of “mildly” ill-posed inverse problems such as computed tomography, where $A$ is the Radon transform. It is well known that even when $A$ is injective, the compactness of $A$ makes the problem ill-posed in the sense that $A^{-1}$ is not a continuous linear operator on the range of $A$. This becomes problematic whenever instead of $y$ we get to measure some perturbed data $y^\delta$, as we always do. It also makes $A^{-1}$ a poor target for learning since it does not make much sense to learn unstable operators.

A classical regularization technique is then to solve

$$\widehat{x} = \arg\min_x \|y^\delta - Ax\|^2 + \lambda \|x\|^2,$$

which can be interpreted as a maximum a posteriori solution under Gaussian noise and signal priors. The resulting solution operator can formally be written as

$$R_\lambda = (A^* A + \lambda \cdot \mathrm{Id})^{-1} A^*,$$

which can be shown to be a Schatten–von Neumann operator. ($A^*$ denotes the adjoint of $A$.) If the forward operator $A$ is completely or partially unknown, or the noise and prior distributions are different from isotropic Gaussian, it is of interest to learn the best regularized linear operator from samples. Thus one of the central questions becomes that of sample complexity and generalization error.
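The regularized solution operator can be illustrated in finite dimensions. The following is a minimal sketch, assuming a matrix stands in for the compact forward operator; the sizes, the singular value decay, the noise level, and the value of `lam` are hypothetical choices made only for illustration. It checks that the singular values of $R_\lambda$ are $s_k / (s_k^2 + \lambda)$ and that $R_\lambda y^\delta$ satisfies the first-order optimality condition of the regularized least-squares problem.

```python
import numpy as np

def tikhonov_operator(A, lam):
    """Return R_lam = (A^* A + lam * Id)^{-1} A^* for a matrix A."""
    n = A.shape[1]
    return np.linalg.solve(A.conj().T @ A + lam * np.eye(n), A.conj().T)

rng = np.random.default_rng(0)

# Hypothetical forward operator with rapidly decaying singular values,
# a finite-dimensional stand-in for a compact operator.
U, _ = np.linalg.qr(rng.standard_normal((40, 40)))
V, _ = np.linalg.qr(rng.standard_normal((30, 30)))
s = 1.0 / np.arange(1, 31) ** 2
A = U[:, :30] @ np.diag(s) @ V.T

lam = 1e-3
R = tikhonov_operator(A, lam)

# Singular values of R_lam are s / (s^2 + lam), hence bounded by
# 1 / (2 sqrt(lam)), unlike the values 1 / s of the unregularized inverse.
sv_R = np.linalg.svd(R, compute_uv=False)
expected = np.sort(s / (s ** 2 + lam))[::-1]

# R_lam applied to noisy data solves the regularized least-squares problem:
# the gradient A^T (A x - y) + lam x vanishes at x_hat.
x_true = rng.standard_normal(30)
y_delta = A @ x_true + 1e-3 * rng.standard_normal(40)
x_hat = R @ y_delta
grad = A.T @ (A @ x_hat - y_delta) + lam * x_hat
```

The bounded singular values are what make $R_\lambda$ a stable target, in contrast to the unbounded inverse.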

### 2.2 Collaborative Filtering

In collaborative filtering the goal is to predict rankings of objects by users. [1] show that both the users and the objects are conveniently modeled as elements of infinite-dimensional reproducing kernel Hilbert spaces. The ranking of an object $y$ by a user $x$ can then be modeled by the following functional,

$$[\text{ranking of } y \text{ by } x] = \langle x, Fy \rangle.$$

Given a training sample consisting of users and rankings embedded in their respective spaces, the “best” $F$ is estimated by minimizing regularized empirical risk. The regularizers used by [1] are spectral penalties, that is, sums of non-decreasing penalty functions applied to the singular values of $F$.

### 2.3 Schatten–von Neumann Operators in Wave Problems

Schatten–von Neumann operators play an important role in non-linear inverse problems associated with the wave equation. In particular, in inverse scattering approaches via boundary control [12] and scattering control [13, 14], the reconstruction algorithms are given as recursive procedures with data operators that belong to the Schatten–von Neumann class. Under conditions that the inverse problem yields a unique solution, the question of whether the inverse map is learnable may be analyzed by studying whether a Schatten–von Neumann operator is learnable. While the overall inverse maps in these cases are nonlinear, we see the results presented here as a gateway to studying the learning-theoretic aspects of these general nonlinear inverse problems.

## 3 Problem Statement and Main Result

The main goal of this paper is to study the learnability of the Schatten–von Neumann class of compact operators. We use a model-free or agnostic approach [15], in the sense that we do not require our training samples to satisfy $y = Tx$ for any putative $T$, and the optimal risk can be nonzero. Instead, we are looking for an operator that provides the best fit to a given training set generated from an arbitrary distribution.

As usual, the training set consists of $N$ i.i.d. samples $z_1, \dots, z_N$, where $z_n = (x_n, y_n)$. The samples are generated from an unknown probability measure $P \in \mathcal{P}(\mathcal{Z})$, where $\mathcal{P}(\mathcal{Z})$ is the set of all probability measures defined on $\mathcal{Z}$. We take $\mathcal{Z}$ to be the Cartesian product of input and output Hilbert spaces, namely $\mathcal{Z} = \mathcal{H}_x \times \mathcal{H}_y$. The hypothesis space $\mathcal{T}_p$ is the set of all $p$-Schatten–von Neumann operators:

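Definition [Schatten–von Neumann]. If the sequence $(s_k(T))_{k \ge 1}$ of singular values of a compact linear operator $T : \mathcal{H}_x \to \mathcal{H}_y$ is $p$-summable, for $0 < p < \infty$, then $T$ is said to belong to the $p$-Schatten–von Neumann class, denoted $S_p(\mathcal{H}_x, \mathcal{H}_y)$.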
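In finite dimensions the definition reduces to an $\ell^p$ norm of the singular values of a matrix. The following is a minimal sketch under that assumption; the helper name `schatten_norm` and the random test matrix are illustrative and not part of the paper.

```python
import numpy as np

def schatten_norm(T, p):
    """Schatten p-(quasi-)norm of a matrix: the l^p norm of its singular values."""
    s = np.linalg.svd(T, compute_uv=False)
    if np.isinf(p):
        return float(s.max())
    return float(np.sum(s ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
T = rng.standard_normal((6, 4))

nuclear = schatten_norm(T, 1)          # trace-class norm (p = 1)
hilbert_schmidt = schatten_norm(T, 2)  # Hilbert-Schmidt norm (p = 2)
operator = schatten_norm(T, np.inf)    # operator norm (p = infinity)
```

Because the norms are non-increasing in $p$, a bound on the $p_1$-norm implies the same bound on the $p_2$-norm whenever $p_1 \le p_2$, which is the nesting of the classes used later in this section.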

For a given sample $(x, y)$ and hypothesis $T$, we measure the goodness of fit, or the loss, as $L(y, Tx) = \|y - Tx\|^2$. The risk of a hypothesis is defined as its average loss with respect to the data-generating measure, $\mathbb{E}_P L(y, Tx)$.

Ideally, we would then like to find an operator that has the minimum risk among the hypothesis class. While we do not have access to the true data-generating distribution, we get to observe it through a finite number of random samples, $(x_n, y_n)$, $n = 1, \dots, N$. Given a training set $\{(x_n, y_n)\}_{n=1}^N$, a central question is whether we can guarantee that the hypothesis learned from training samples will do well on other samples as well. Put differently, we want to show that the risk of the hypothesis learned from a finite number of training samples is close to that of the hypothesis estimated with the knowledge of the true distribution, with high probability.

We show that the class $\mathcal{T}_p$ is indeed learnable in the probably-approximately-correct (PAC) sense by showing that the risk achieved by empirical risk minimization (ERM) converges to the best achievable risk as the sample size grows. Concretely, assuming that $\|x\|$ and $\|y\|$ are bounded almost surely, we show that

$$\mathbb{E}_P L(y, \widehat{T}_N x) \le \inf_{T \in \mathcal{T}_p} \mathbb{E}_P L(y, Tx) + O\left(N^{-\min\{1/p,\, 1/2\}}\right) \tag{1}$$

with high probability over training samples, where

$$\widehat{T}_N = \arg\min_{T \in \mathcal{T}_p} \frac{1}{N} \sum_{n=1}^{N} L(y_n, T x_n)$$

is a random minimizer of the empirical risk. (Existence of the empirical risk minimizer follows from Section 5.) We thus show that the class of $p$-Schatten–von Neumann operators is PAC-learnable, since ERM produces a hypothesis whose risk converges to the best possible. The rate of convergence determines the number of samples required to guarantee a given target risk with high probability.
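To make the link between the rate and the sample size concrete: if the excess risk is at most $C N^{-\min\{1/p, 1/2\}}$, then a target excess risk $\epsilon$ is reached once $N \ge (C/\epsilon)^{\max\{p, 2\}}$. The sketch below evaluates this; the constant `C` hidden in the $O(\cdot)$ term is not given explicitly here, so its default value is a purely hypothetical placeholder.

```python
import math

def rate_exponent(p):
    """Exponent min(1/p, 1/2) of the convergence rate in eq. (1)."""
    return min(1.0 / p, 0.5)

def samples_needed(eps, p, C=1.0):
    """Smallest integer N with C * N^(-min(1/p, 1/2)) <= eps.

    C is a hypothetical constant standing in for the O(.) term.
    """
    return math.ceil((C / eps) ** max(p, 2.0))

# The requirement is flat for p <= 2 and grows quickly with p beyond that.
n_p1 = samples_needed(0.1, p=1)
n_p2 = samples_needed(0.1, p=2)
n_p4 = samples_needed(0.1, p=4)
```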

The result makes intuitive sense. To see this, note that the inclusion

$$\mathcal{T}_{p_1} \subset \mathcal{T}_{p_2},$$

for $p_1 \le p_2$, implies that the smaller the $p$, the easier it is to learn $\mathcal{T}_p$, as predicted by the theorem.

The minimization for $\widehat{T}_N$ involves optimization over an infinite-dimensional, non-compact set. In Section 5 we show that, in fact, the ERM can be carried out by a well-defined finite-dimensional optimization.

## 4 Learnability Theorem

In order to state and prove the learnability theorem, we first recall some facts about linear operators in Hilbert spaces [16]. Then, we connect PAC-learnability of the $p$-Schatten class of operators with its Rademacher complexity in Section 4.2. Finally, we establish the learnability claim by showing that this Rademacher complexity vanishes with the number of measured samples, also in Section 4.2.

### 4.1 Preliminaries

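Let $\mathcal{H}_x, \mathcal{H}_y$ be complex Hilbert spaces, $T : \mathcal{H}_x \to \mathcal{H}_y$ a compact operator, and let $T^* : \mathcal{H}_y \to \mathcal{H}_x$ denote its adjoint operator. Let $(\psi_k)_{k \ge 1}$ be the eigenvectors of the compact, self-adjoint, and non-negative operator $T^* T$, and $(\lambda_k)_{k \ge 1}$ be the corresponding eigenvalues. We define the absolute value of $T$ as $|T| = (T^* T)^{1/2}$. The eigenvalues of $|T|$, $\sqrt{\lambda_k}$, are called the singular values of $T$ and denoted by $s_k(T)$. We always assume, without loss of generality, that the sequence of singular values is non-increasing.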

Recall the definition of Schatten–von Neumann operators from Section 3. For $1 \le p < \infty$, the class $S_p$ becomes a Banach space equipped with the norm $\|\cdot\|_{S_p}$,

$$\|T\|_{S_p} = \left( \sum_{k=1}^{\infty} s_k(T)^p \right)^{1/p}. \tag{2}$$

Specifically, $S_2$ is the set of Hilbert–Schmidt operators while $S_1$ is the algebra of trace class operators. Let $T \in S_1(\mathcal{H}, \mathcal{H})$, or simply $S_1(\mathcal{H})$, and $(\psi_k)_{k \ge 1}$ be any orthonormal basis for $\mathcal{H}$ (which exists because $\mathcal{H}$ is a Hilbert space). The series

$$\mathrm{Tr}(T) = \sum_{k \ge 1} \langle T \psi_k, \psi_k \rangle \tag{3}$$

is well-defined and called the trace of $T$. Finally, $S_\infty$ is the class of bounded operators endowed with the operator norm.

An important fact about the set of norm-bounded Schatten operators is that it is not compact, which requires a bit of care when talking about the various minimizers. For $0 < p < 1$, $S_p$ is a complete space with the quasi-norm $\|\cdot\|_{S_p}$.

### 4.2 Rademacher Complexity and Learnability of the Hypothesis Class

In this section, we show that the ERM algorithm PAC-learns the class of $p$-Schatten–von Neumann operators. The proofs of all formal results are given in Section 6.

Informally, ERM works if we have enough measurements to dismiss all bad near-minimizers of the empirical risk. In other words, generalization ability of the ERM minimizer is negatively affected by the complexity of the hypothesis class. A common complexity measure for classes of real-valued functions is the Rademacher complexity [17, 18].

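Definition [Rademacher complexity]. Let $P$ be a probability distribution on a set $\mathcal{Z}$ and suppose that $Z_1, \dots, Z_N$ are independent samples selected according to $P$. Let $\mathcal{F}$ be a class of functions mapping from $\mathcal{Z}$ to $\mathbb{R}$. Given samples $Z^N = (Z_1, \dots, Z_N)$, the conditional Rademacher complexity of $\mathcal{F}$ is defined as

$$R_N(\mathcal{F} \mid Z^N) = \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left| \frac{1}{N} \sum_{n=1}^{N} \sigma_n f(Z_n) \right| \,\Big|\, Z^N \right],$$

where $\sigma_1, \dots, \sigma_N$ are i.i.d. Rademacher random variables, that is, independent and uniform on $\{-1, +1\}$. The Rademacher complexity of $\mathcal{F}$ is $R_N(\mathcal{F}) = \mathbb{E}\, R_N(\mathcal{F} \mid Z^N)$.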
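The conditional Rademacher complexity can be estimated by Monte Carlo when the supremum has a closed form. The sketch below does this for a deliberately simple class that is not the operator class of this paper: linear functionals $z \mapsto \langle w, z \rangle$ with $\|w\|_2 \le B$ on $\mathbb{R}^d$, for which the supremum equals $(B/N) \|\sum_n \sigma_n z_n\|_2$. The function name, sample sizes, and number of draws are illustrative assumptions.

```python
import numpy as np

def rademacher_linear_ball(Z, B, n_draws=2000, seed=0):
    """Monte Carlo estimate of R_N(F | Z^N) for F = {z -> <w, z> : ||w||_2 <= B}.

    Z has shape (N, d), one sample per row. For this class the supremum over w
    is attained in closed form: (B / N) * || sum_n sigma_n z_n ||_2.
    """
    rng = np.random.default_rng(seed)
    N = Z.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, N))  # Rademacher signs
    sups = B / N * np.linalg.norm(sigma @ Z, axis=1)
    return float(sups.mean())

rng = np.random.default_rng(1)
Z_small = rng.uniform(-1.0, 1.0, size=(50, 5))
Z_large = rng.uniform(-1.0, 1.0, size=(800, 5))

r_small = rademacher_linear_ball(Z_small, B=2.0)
r_large = rademacher_linear_ball(Z_large, B=2.0)  # shrinks roughly like 1/sqrt(N)
```

By Jensen's inequality the exact quantity is at most $(B/N) \sqrt{\sum_n \|z_n\|^2}$, which is the same mechanism that produces the $1/\sqrt{N}$ terms in the bounds of this section.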

The following result connects PAC-learnability of our operators with the Rademacher complexity of the hypothesis class. It shows that a class is PAC-learnable when its Rademacher complexity converges to zero as the size of the training dataset grows.

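Theorem [Learnability]. Let $z_1, \dots, z_N$, with $z_n = (x_n, y_n)$, be independently selected according to the probability measure $P$. Moreover, assume $x$ and $y$ are bounded random variables, $\|x\| \le C_x$ and $\|y\| \le C_y$ almost surely. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$ over samples of length $N$, we have

$$\mathbb{E} L(y, \widehat{T}_N x) \le \inf_{T \in \mathcal{T}_p} \mathbb{E} L(y, Tx) + 4 R_N(L \circ \mathcal{T}_p) + 2 (C_y + B C_x)^2 \sqrt{\frac{1}{N} \log \frac{1}{\delta}},$$

where $\mathcal{T}_p = \{T \in S_p : \|T\|_{S_p} \le B\}$, $L \circ \mathcal{T}_p$ is the class of losses $z \mapsto L(y, Tx)$ with $T \in \mathcal{T}_p$, and $\widehat{T}_N$ is the empirical risk minimizer.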

The proof is standard; we adapt it for the case of Schatten–von Neumann operators.

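Our main result is the following bound on the Rademacher complexity of the loss class induced by $\mathcal{T}_p$.

Theorem [Vanishing Rademacher Complexity]. Let $z_1, \dots, z_N$ be independently selected according to the probability measure $P$ and $1 \le p < \infty$. Moreover, assume $\|x\| \le C_x$ and $\|y\| \le C_y$ almost surely. Then, we have

$$R_N(L \circ \mathcal{T}_p) \le \frac{C_y^2}{\sqrt{N}} + N^{-\min\{1/2,\, 1/p\}} \left( B^2 C_x^2 + 2 B C_x C_y \right),$$

where $\mathcal{T}_p = \{T \in S_p : \|T\|_{S_p} \le B\}$ and $L \circ \mathcal{T}_p$ is the induced loss class.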

As already mentioned, the bound becomes worse as $p$ grows large, and it breaks down for $p = \infty$. This makes intuitive sense: $T \in S_\infty$ only tells us that the singular values are bounded. It includes situations where all singular values are equal to the same constant, and thus approximating any finite number of singular vectors leaves room for arbitrarily large errors.

The bound saturates for $p = 2$ and does not improve for $p < 2$. While this is likely an artifact of our proof technique, in terms of generalization error it is order optimal, due to the last, $O(N^{-1/2})$, term in the learnability bound of Theorem 4.2.

The bounds from Theorem 4.2 are illustrated in Figure 1.

Our proof of Theorem 4.2 uses properties of certain random finite-rank operators constructed from training data.

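For notational convenience, let us define the operator $y x^*$ such that for every $u \in \mathcal{H}_x$, we have $(y x^*) u = \langle u, x \rangle y$. To prove Theorem 4.2, we have to show that the operators $\sum_{n=1}^{N} \sigma_n x_n x_n^*$ and $\sum_{n=1}^{N} \sigma_n y_n x_n^*$, with $\sigma_n$ being i.i.d. Rademacher random variables, are well-behaved in the sense that they do not grow too fast with the number of training samples $N$. This is the subject of the following lemma.

Lemma. Let $z_1, \dots, z_N$ be independently selected according to the probability measure $P$ and $1 \le q \le \infty$. Define random operators $T_{xx} = \sum_{n=1}^{N} \sigma_n x_n x_n^*$ and $T_{yx} = \sum_{n=1}^{N} \sigma_n y_n x_n^*$. We have

$$\mathbb{E} \|T_{xx}\|_{S_q} \le N^{\max\{1/2,\, 1/q\}} \sqrt{\mathbb{E} \|x\|^4}, \qquad \mathbb{E} \|T_{yx}\|_{S_q} \le N^{\max\{1/2,\, 1/q\}} \sqrt{\mathbb{E} \|x\|^2 \|y\|^2}.$$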

These growth rates then imply the stated bounds on the Rademacher complexity.
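The growth rate in the lemma can be checked numerically in a finite-dimensional toy setting. The sketch below is a sanity check under stated assumptions, not a proof: inputs are drawn uniformly from the unit sphere of $\mathbb{R}^d$ (so $\sqrt{\mathbb{E}\|x\|^4} = 1$), and the dimension, sample size, and number of trials are hypothetical choices.

```python
import numpy as np

def schatten_norm(T, q):
    """l^q norm of the singular values of T (operator norm for q = inf)."""
    s = np.linalg.svd(T, compute_uv=False)
    return float(s.max()) if np.isinf(q) else float(np.sum(s ** q) ** (1.0 / q))

def mean_norm_Txx(N, d, q, n_trials=2000, seed=0):
    """Monte Carlo estimate of E || sum_n sigma_n x_n x_n^T ||_{S_q}."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_trials):
        X = rng.standard_normal((N, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # rows x_n with ||x_n|| = 1
        sigma = rng.choice([-1.0, 1.0], size=N)        # Rademacher signs
        T_xx = (X.T * sigma) @ X                       # sum_n sigma_n x_n x_n^T
        total += schatten_norm(T_xx, q)
    return total / n_trials

def lemma_bound(N, q, fourth_moment=1.0):
    """Right-hand side N^{max(1/2, 1/q)} * sqrt(E ||x||^4) of the lemma."""
    inv_q = 0.0 if np.isinf(q) else 1.0 / q
    return N ** max(0.5, inv_q) * np.sqrt(fourth_moment)

estimates = {q: mean_norm_Txx(N=50, d=3, q=q) for q in (1.0, 2.0, 4.0)}
bounds = {q: lemma_bound(50, q) for q in (1.0, 2.0, 4.0)}
```

In this toy setting the operators have rank at most $d = 3$, so the bound is far from tight for $q = 1$; the comparison is only meant to show the direction of the inequality.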

## 5 The Learning Algorithm

In the developments so far we have not discussed the practicalities of learning infinite-dimensional operators. The ERM for $\widehat{T}_N$ assumes we can optimize over an infinite-dimensional, non-compact class. We address this problem by adapting the results of [1], who show for a different loss function that ERM for compact operators can be transformed into an optimization over rank-$N$ operators.

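Denote by $X_N$ the linear span of $\{x_n\}_{n=1}^{N}$, and by $Y_N$ the linear span of $\{y_n\}_{n=1}^{N}$. Let further $\Pi_X$ be the orthogonal projection onto $X_N$, and analogously $\Pi_Y$ for $Y_N$. Then, we have the following simple result:

Lemma. Let $\Pi_X$ and $\Pi_Y$ be the projection operators defined above. Then, we have $L(y_n, \Pi_Y T \Pi_X x_n) \le L(y_n, T x_n)$ for any operator $T$ and all $n = 1, \dots, N$.

###### Proof.

Let $(x, y)$ be any of the training samples $(x_n, y_n)$, so that $\Pi_X x = x$ and $\Pi_Y y = y$. We have

$$L(y, \Pi_Y T \Pi_X x) = \|y - \Pi_Y T \Pi_X x\|^2 = \|\Pi_Y (y - Tx)\|^2 \le L(y, Tx).$$

∎

The lemma implies that for any $T \in \mathcal{T}_p$,

$$J(\Pi_Y T \Pi_X) \le J(T),$$

where $J(T) = \frac{1}{N} \sum_{n=1}^{N} L(y_n, T x_n)$ is the empirical risk of $T$.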
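The effect of the projections can be observed directly in a finite-dimensional stand-in. The sketch below assumes real matrices, samples stored as columns, and sample matrices of full column rank; all sizes are hypothetical. It compares the empirical risk and the singular values of a random operator with those of its projected version.

```python
import numpy as np

def span_projector(V):
    """Orthogonal projector onto the column span of V (assumed full column rank)."""
    Q, _ = np.linalg.qr(V)
    return Q @ Q.T

def empirical_risk(T, X, Y):
    """J(T) = (1/N) sum_n ||y_n - T x_n||^2, with samples stored as columns."""
    return float(np.mean(np.sum((Y - T @ X) ** 2, axis=0)))

rng = np.random.default_rng(0)
dx, dy, N = 30, 25, 5
X = rng.standard_normal((dx, N))   # inputs x_1, ..., x_N
Y = rng.standard_normal((dy, N))   # outputs y_1, ..., y_N
T = rng.standard_normal((dy, dx))  # an arbitrary operator

Pi_X, Pi_Y = span_projector(X), span_projector(Y)
T_proj = Pi_Y @ T @ Pi_X           # has rank at most N

risk_full = empirical_risk(T, X, Y)
risk_proj = empirical_risk(T_proj, X, Y)
sv_full = np.linalg.svd(T, compute_uv=False)
sv_proj = np.linalg.svd(T_proj, compute_uv=False)
```

Neither the risk nor any singular value increases under the projection, which is why restricting ERM to operators supported on the sample spans loses nothing.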

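We use this simple result to show that the $\arg\min$ in ERM is well-defined and can be achieved by solving a finite-dimensional convex program over a compact set.

By definition of the infimum, for any $\epsilon > 0$ there exists $\widehat{T}_\epsilon \in \mathcal{T}_p$ such that

$$\inf_{T \in \mathcal{T}_p} J(T) \le J(\widehat{T}_\epsilon) \le \inf_{T \in \mathcal{T}_p} J(T) + \epsilon.$$

Let us choose for $\epsilon$ a sequence $(\epsilon_n)_{n \ge 1}$ such that $\epsilon_n \to 0$. By construction, we have

$$\lim_{n \to +\infty} J(\widehat{T}_{\epsilon_n}) = \inf_{T \in \mathcal{T}_p} J(T).$$

From the lemma above,

$$\lim_{n \to +\infty} J(\Pi_Y \widehat{T}_{\epsilon_n} \Pi_X) \le \inf_{T \in \mathcal{T}_p} J(T). \tag{4}$$

On the other hand, for a compact operator $T$, we have [1]

$$\sigma_n(\Pi_Y T \Pi_X) \le \sigma_n(T), \quad \forall n \in \mathbb{N}.$$

Therefore,

$$\|\Pi_Y \widehat{T}_{\epsilon_n} \Pi_X\|_{S_p} \le \|\widehat{T}_{\epsilon_n}\|_{S_p} \le B,$$

so that $\Pi_Y \widehat{T}_{\epsilon_n} \Pi_X$ is feasible for the ERM.

Consider an optimization over the set of finite rank operators $\mathcal{T}_N := \{\Pi_Y T \Pi_X : T \in \mathcal{T}_p\}$. From (4), we have

$$\inf_{T \in \mathcal{T}_N} J(T) \le \lim_{n \to +\infty} J(\Pi_Y \widehat{T}_{\epsilon_n} \Pi_X) \le \inf_{T \in \mathcal{T}_p} J(T).$$

Since $\mathcal{T}_N$ is a finite-dimensional set isometric to a closed ball of matrices, and a closed ball in finite dimension is compact, this optimization is over a compact set and the minimum is achieved. Because $\mathcal{T}_N \subset \mathcal{T}_p$, the reverse inequality holds as well. In summary,

$$\min_{T \in \mathcal{T}_N} J(T) = \inf_{T \in \mathcal{T}_p} J(T).$$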

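From the definition of $\mathcal{T}_N$, any $T \in \mathcal{T}_N$ can be written as

$$T = \sum_{i=1}^{N_y} \sum_{j=1}^{N_x} \alpha_{i,j} v_i u_j^*,$$

where $\{u_j\}_{j=1}^{N_x}$ and $\{v_i\}_{i=1}^{N_y}$ are orthonormal bases for the linear subspaces $\mathrm{span}\{x_n\}_{n=1}^{N}$ and $\mathrm{span}\{y_n\}_{n=1}^{N}$, and $N_x, N_y \le N$. Hence, we can identify $T$ with the $N_y \times N_x$ matrix of coefficients $(\alpha_{i,j})$. Similarly, $x_n$ and $y_n$ can be represented as vectors of their coefficients in the bases $\{u_j\}$ and $\{v_i\}$. With this notation, we can implement ERM via a finite-dimensional optimization in Algorithm 1.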
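Algorithm 1 itself is not reproduced in this text, so the following is only a hedged sketch of what such a finite-dimensional ERM could look like, not the authors' algorithm. It assumes the special case $p = 2$, where the constraint is a Frobenius-norm ball and the projection is a simple rescaling, and it assumes real data with linearly independent samples; the function name, solver (projected gradient descent), and iteration count are illustrative choices.

```python
import numpy as np

def erm_schatten2(X, Y, B, n_iter=500):
    """Projected-gradient sketch of ERM over {T : ||T||_{S_2} <= B}.

    X (dx, N) and Y (dy, N) hold the samples as columns and are assumed to
    have full column rank. Returns the operator T_hat and its coefficient
    matrix A in the orthonormal bases of the sample spans.
    """
    N = X.shape[1]
    U, _ = np.linalg.qr(X)              # orthonormal basis of span{x_n}
    V, _ = np.linalg.qr(Y)              # orthonormal basis of span{y_n}
    Xc, Yc = U.T @ X, V.T @ Y           # coefficients of the samples
    # Step size 1/L, with L the Lipschitz constant of the gradient.
    step = N / (2.0 * np.linalg.norm(Xc, 2) ** 2)
    A = np.zeros((V.shape[1], U.shape[1]))
    for _ in range(n_iter):
        grad = (2.0 / N) * (A @ Xc - Yc) @ Xc.T
        A = A - step * grad
        nrm = np.linalg.norm(A, 'fro')
        if nrm > B:                     # project back onto the Frobenius ball
            A *= B / nrm
    return V @ A @ U.T, A

def empirical_risk(T, X, Y):
    """J(T) = (1/N) sum_n ||y_n - T x_n||^2."""
    return float(np.mean(np.sum((Y - T @ X) ** 2, axis=0)))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 6))
Y = rng.standard_normal((15, 6))
T_hat, A = erm_schatten2(X, Y, B=1.0)
```

For other values of $p$ the same structure applies, but the projection step would act on the singular values of the coefficient matrix rather than rescale the whole matrix.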

## 6 Proofs of Formal Results

### 6.1 Proof of Theorem 4.2 (Learnability)

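We define the function class $\mathcal{F}$ as

$$\mathcal{F} := \{ f : \mathcal{H}_x \times \mathcal{H}_y \to \mathbb{R}_+ : f(z) = L(y, Tx),\ T \in \mathcal{T}_p \}, \tag{5}$$

where $z = (x, y)$, and adopt the following simplifying notations:

$$\begin{aligned}
f(z) &:= L(y, Tx) \\
P(f) &:= \mathbb{E}_P[f] \\
\widehat{f}_N &:= \arg\min_{f \in \mathcal{F}} P_N(f) := \arg\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{n} f(z_n) \\
L^*(\mathcal{F}) &:= \inf_{f \in \mathcal{F}} P(f) \\
\|P - P'\|_{\mathcal{F}} &:= \sup_{f \in \mathcal{F}} |P(f) - P'(f)|.
\end{aligned}$$

As we mentioned before, the existence of the minimizer in the definition of $\widehat{f}_N$ follows from the discussion in Section 5.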

The following known result bounds the risk of the empirical risk minimizer in terms of a quantity called the uniform deviation $\Delta_N(z^N)$, which measures the maximum discrepancy between empirical and true risks among the elements of the hypothesis class.

Lemma. The empirical risk minimization (ERM) algorithm satisfies the following inequality:

$$P(\widehat{f}_N) \le L^*(\mathcal{F}) + 2 \Delta_N(z^N),$$

where $\Delta_N(z^N) := \|P_N - P\|_{\mathcal{F}}$ is the uniform deviation for the function class $\mathcal{F}$ and the given sequence of random samples $z^N = (z_1, \dots, z_N)$ [17].

###### Proof.

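The proof is standard up to a modification due to the noncompactness of our $\mathcal{F}$. Let $f^*_\epsilon \in \mathcal{F}$ be such that $P(f^*_\epsilon) \le L^*(\mathcal{F}) + \epsilon$. We have

$$\begin{aligned}
P(\widehat{f}_N) - L^*(\mathcal{F}) &= \big[ P(\widehat{f}_N) - P_N(\widehat{f}_N) \big] + \big[ P_N(\widehat{f}_N) - P_N(f^*_\epsilon) \big] \\
&\quad + \big[ P_N(f^*_\epsilon) - P(f^*_\epsilon) \big] + P(f^*_\epsilon) - L^*(\mathcal{F}) \\
&\le 2 \Delta_N(z^N) + P(f^*_\epsilon) - L^*(\mathcal{F}),
\end{aligned}$$

since the first and the third bracketed terms are bounded by $\Delta_N(z^N)$ by definition, and the second bracketed term is non-positive because $P_N$ is minimized by $\widehat{f}_N$. The conclusion follows from the fact that we can find such an $f^*_\epsilon$ for any $\epsilon > 0$ and that

$$P(f^*_\epsilon) - L^*(\mathcal{F}) \xrightarrow{\ \epsilon \to 0\ } 0.$$

∎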

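We want to find a probabilistic upper bound for the uniform deviation. This can be achieved by showing that $\Delta_N$ has the bounded difference property and using McDiarmid’s inequality.

Lemma. Let $P$ be a probability measure on a set $\mathcal{Z}$ and suppose that $z_1, \dots, z_N$ and $\bar{z}_i$ are independent samples selected according to $P$. Define $z^N = (z_1, \dots, z_i, \dots, z_N)$ and $z^N_{\bar{i}} = (z_1, \dots, \bar{z}_i, \dots, z_N)$. Then,

$$|\Delta_N(z^N) - \Delta_N(z^N_{\bar{i}})| \le \frac{1}{N} (C_y + B C_x)^2$$

for the function class $\mathcal{F}$ in eq. 5.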

###### Proof.

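Define $P_N^{\bar{i}}$ as the empirical measure of $z^N_{\bar{i}}$, that is, $P_N^{\bar{i}}(f) := \frac{1}{N} \big( \sum_{n \ne i} f(z_n) + f(\bar{z}_i) \big)$.

$$\begin{aligned}
|\Delta_N(z^N) - \Delta_N(z^N_{\bar{i}})| &\overset{\text{def.}}{=} \Big| \|P_N - P\|_{\mathcal{F}} - \|P_N^{\bar{i}} - P\|_{\mathcal{F}} \Big| \\
&= \Big| \big\|P_N^{\bar{i}} + \tfrac{1}{N} f(z_i) - \tfrac{1}{N} f(\bar{z}_i) - P \big\|_{\mathcal{F}} - \|P_N^{\bar{i}} - P\|_{\mathcal{F}} \Big| \\
&\overset{(a)}{\le} \frac{1}{N} \|f(\bar{z}_i) - f(z_i)\|_{\mathcal{F}} \\
&\overset{(b)}{\le} \frac{1}{N} \max\Big\{ \sup_{T \in \mathcal{T}_p} L(\bar{y}_i, T \bar{x}_i),\ \sup_{T \in \mathcal{T}_p} L(y_i, T x_i) \Big\} \\
&\overset{(c)}{\le} \frac{1}{N} (C_y + B C_x)^2,
\end{aligned}$$

where (a) is due to the triangle inequality and (b) holds because the loss is non-negative, so that $|f(\bar{z}_i) - f(z_i)| \le \max\{f(\bar{z}_i), f(z_i)\}$. Finally, the supremum of $L(y, Tx)$ over $T \in \mathcal{T}_p$ is achieved by

$$\widehat{T} = -\frac{B}{\|y\| \|x\|}\, y x^* \in \mathcal{T}_p,$$

which gives $\sup_{T \in \mathcal{T}_p} L(y, Tx) = (\|y\| + B \|x\|)^2 \le (C_y + B C_x)^2$, that is, inequality (c). ∎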

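Now, McDiarmid’s inequality gives us a bound for the tail probability of $\Delta_N(z^N)$ as

$$\mathbb{P}\big( \Delta_N(z^N) \ge \mathbb{E} \Delta_N(z^N) + t \big) \le \exp\left\{ -\frac{N t^2}{(C_y + B C_x)^4} \right\}.$$

Finally, we conclude the proof by choosing

$$t = (C_y + B C_x)^2 \sqrt{\frac{1}{N} \log \frac{1}{\delta}},$$

and using the following theorem:

Theorem [19, 20]. Fix a space $\mathcal{Z}$ and let $\mathcal{F}$ be a class of functions on $\mathcal{Z}$. Then for any probability measure $P$ on $\mathcal{Z}$,

$$\mathbb{E} \Delta_N(z^N) \le 2 R_N(\mathcal{F}).$$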
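For reference, the three ingredients can be chained explicitly. The following is a sketch of the bookkeeping only, using the ERM lemma, McDiarmid's inequality with the chosen $t$ (for which the tail bound equals $\delta$), and the symmetrization theorem; it introduces no new assumptions.

```latex
\begin{aligned}
P(\widehat{f}_N)
  &\le L^*(\mathcal{F}) + 2\,\Delta_N(z^N)
     && \text{(ERM lemma)} \\
  &\le L^*(\mathcal{F}) + 2\,\mathbb{E}\,\Delta_N(z^N) + 2t
     && \text{(McDiarmid, with probability at least } 1-\delta\text{)} \\
  &\le L^*(\mathcal{F}) + 4\,R_N(\mathcal{F})
       + 2\,(C_y + B C_x)^2 \sqrt{\tfrac{1}{N}\log\tfrac{1}{\delta}}
     && \text{(symmetrization)},
\end{aligned}
\qquad\text{since}\qquad
\exp\!\left\{-\frac{N t^2}{(C_y + B C_x)^4}\right\} = \delta .
```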

### 6.2 Proof of Theorem 4.2 (Vanishing Rademacher Complexity)

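A simple analysis gives us an upper bound for $R_N(L \circ \mathcal{T}_p)$.

$$\begin{aligned}
R_N(L \circ \mathcal{T}_p) &\overset{(a)}{=} \mathbb{E} \sup_{T \in \mathcal{T}_p} \Big| \frac{1}{N} \sum_{n=1}^{N} \sigma_n \|y_n - T x_n\|^2 \Big| \\
&\overset{(b)}{\le} \frac{1}{N} \mathbb{E} \Big| \sum_{n=1}^{N} \sigma_n \|y_n\|^2 \Big| + \frac{1}{N} \mathbb{E} \sup_{T \in \mathcal{T}_p} \Big| \sum_{n=1}^{N} \sigma_n \|T x_n\|^2 \Big| + \frac{2}{N} \mathbb{E} \sup_{T \in \mathcal{T}_p} \Big| \sum_{n=1}^{N} \sigma_n \mathrm{Re} \langle T x_n, y_n \rangle \Big| \\
&\overset{(c)}{\le} \frac{1}{\sqrt{N}} \sqrt{\mathbb{E} \|y\|^4} + \frac{1}{N} \mathbb{E} \sup_{T \in \mathcal{T}_p} \Big| \mathrm{Tr}\Big( T^* T \sum_{n=1}^{N} \sigma_n x_n x_n^* \Big) \Big| + \frac{2}{N} \mathbb{E} \sup_{T \in \mathcal{T}_p} \Big| \mathrm{Tr}\Big( T \sum_{n=1}^{N} \sigma_n x_n y_n^* \Big) \Big| \\
&\overset{(d)}{\le} \frac{1}{\sqrt{N}} \sqrt{\mathbb{E} \|y\|^4} + \frac{1}{N} \mathbb{E} \sup_{T \in \mathcal{T}_p} \|T^* T\|_{S_p} \Big\| \sum_{n=1}^{N} \sigma_n x_n x_n^* \Big\|_{S_q} + \frac{2}{N} \mathbb{E} \sup_{T \in \mathcal{T}_p} \|T\|_{S_p} \Big\| \sum_{n=1}^{N} \sigma_n y_n x_n^* \Big\|_{S_q} \\
&\overset{(e)}{\le} \frac{1}{\sqrt{N}} \sqrt{\mathbb{E} \|y\|^4} + \frac{B^2}{N} \mathbb{E} \Big\| \sum_{n=1}^{N} \sigma_n x_n x_n^* \Big\|_{S_q} + \frac{2B}{N} \mathbb{E} \Big\| \sum_{n=1}^{N} \sigma_n y_n x_n^* \Big\|_{S_q}.
\end{aligned}$$

The equality (a) simply follows from the definition of the expected Rademacher complexity of $L \circ \mathcal{T}_p$, see Section 4.2, and (b) from the triangle inequality and the subadditivity of the supremum. In (c), we used Jensen’s inequality along with $\mathbb{E}\, \sigma_n \sigma_m = \delta_{nm}$, which leads to $\mathbb{E} \big| \sum_{n} \sigma_n \|y_n\|^2 \big| \le \sqrt{N\, \mathbb{E} \|y\|^4}$.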

The following results explain the remaining inequalities.

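Lemma. Let $x \in \mathcal{H}_x$ and $y \in \mathcal{H}_y$ with bounded norms, and $T : \mathcal{H}_x \to \mathcal{H}_y$ be a bounded operator. Then,

• $\|Tx\|^2 = \mathrm{Tr}(T^* T x x^*)$,

• $\langle Tx, y \rangle = \mathrm{Tr}(T x y^*)$.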

###### Proof.

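Define $e_x = x / \|x\|$ and $e_y = y / \|y\|$. Let $\{e_x\} \cup \{e_{x,n}\}_n$ and $\{e_y\} \cup \{e_{y,n}\}_n$ be orthonormal bases for $\mathcal{H}_x$ and $\mathcal{H}_y$. Then,

$$\begin{aligned}
\mathrm{Tr}(T^* T x x^*) &= \sum_n \langle T^* T x x^* e_{x,n}, e_{x,n} \rangle + \langle T^* T x x^* e_x, e_x \rangle \\
&= \langle T^* T x \|x\|, e_x \rangle = \langle T^* T x, x \rangle = \|Tx\|^2,
\end{aligned}$$

and

$$\begin{aligned}
\mathrm{Tr}(T x y^*) &= \sum_n \langle T x y^* e_{y,n}, e_{y,n} \rangle + \langle T x y^* e_y, e_y \rangle \\
&= \langle T x \|y\|, e_y \rangle = \langle Tx, y \rangle.
\end{aligned}$$

∎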
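Both trace identities are easy to confirm numerically for matrices. The sketch below assumes finite-dimensional complex vectors and the convention that the inner product is linear in its first argument; the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dy = 7, 5
T = rng.standard_normal((dy, dx)) + 1j * rng.standard_normal((dy, dx))
x = rng.standard_normal(dx) + 1j * rng.standard_normal(dx)
y = rng.standard_normal(dy) + 1j * rng.standard_normal(dy)

def outer(a, b):
    """Rank-one operator a b^*, acting as u -> <u, b> a."""
    return np.outer(a, b.conj())

# First identity: ||Tx||^2 = Tr(T^* T x x^*)
lhs1 = np.trace(T.conj().T @ T @ outer(x, x))
rhs1 = np.linalg.norm(T @ x) ** 2

# Second identity: <Tx, y> = Tr(T x y^*); np.vdot conjugates its first argument.
lhs2 = np.trace(T @ outer(x, y))
rhs2 = np.vdot(y, T @ x)
```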

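The second part of (c) is now a direct consequence of the lemma above. Note that $|\mathrm{Re}(a)| \le |a|$ for any complex number $a$ and that the trace is a linear operator. The following theorem explains inequality (d).

Theorem [21]. Let $1 \le p, q \le \infty$ and $1/p + 1/q = 1$. If $T_1 \in S_p(\mathcal{H}_x, \mathcal{H}_y)$ and $T_2 \in S_q(\mathcal{H}_x, \mathcal{H}_y)$ then

$$T_1^* T_2 \in S_1(\mathcal{H}_x)$$

and

$$|\mathrm{Tr}(T_1^* T_2)| \le \|T_1^* T_2\|_{S_1} \le \|T_1\|_{S_p} \|T_2\|_{S_q}.$$

Finally, $\|T^* T\|_{S_p} = \|T\|_{S_{2p}}^2 \le \|T\|_{S_p}^2 \le B^2$ proves the inequality (e).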

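We want to bound the expected Schatten $q$-norms of the random operators $T_{xx} = \sum_{n=1}^{N} \sigma_n x_n x_n^*$ and $T_{yx} = \sum_{n=1}^{N} \sigma_n y_n x_n^*$. From the lemma in Section 4.2,

$$\mathbb{E} \|T_{xx}\|_{S_q} \le N^{\max\{1/2,\, 1/q\}} \sqrt{\mathbb{E} \|x\|^4},$$

$$\mathbb{E} \|T_{yx}\|_{S_q} \le N^{\max\{1/2,\, 1/q\}} \sqrt{\mathbb{E} \|x\|^2 \|y\|^2}.$$

Since $\|x\| \le C_x$ and $\|y\| \le C_y$ almost surely, and $\max\{1/2, 1/q\} - 1 = -\min\{1/2, 1/p\}$ for $1/p + 1/q = 1$, we conclude

$$R_N(L \circ \mathcal{T}_p) \le \frac{C_y^2}{\sqrt{N}} + N^{-\min\{1/2,\, 1/p\}} \left( B^2 C_x^2 + 2 B C_x C_y \right).$$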

### 6.3 Proof of Lemma 4.2

To prove Lemma 4.2, we start by stating two results about compact operators. Let $\sigma(T)$ denote the spectrum of an operator $T$. Then we have:

Theorem [22]. Let $f$ be a convex function on an interval $I \subseteq \mathbb{R}$. Then the functional $T \mapsto \mathrm{Tr} f(T)$ is convex on the set of finite-rank, self-adjoint operators $T$ with $\sigma(T) \subset I$.

The following lemma is a standard application of Hölder’s inequality.

Lemma. Let $T$ be a non-negative, compact linear operator of rank at most $N$. Then,

$$\|T\|_{S_p} \le N^{p^{-1} - q^{-1}} \|T\|_{S_q},$$

where $0 < p \le q$.

###### Proof.

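Let $(s_n(T))_{n=1}^{N}$ be the sequence of singular values of $T$ (with multiplicities), and $p < q$. Then,

$$\|T\|_{S_p} = \left( \sum_{n=1}^{N} s_n(T)^p \right)^{\frac{1}{p}} \overset{\text{Hölder ineq.}}{\le} \left( \left( \sum_{n=1}^{N} s_n(T)^q \right)^{\frac{p}{q}} \left( \sum_{n=1}^{N} 1^{\frac{q}{q-p}} \right)^{1 - \frac{p}{q}} \right)^{\frac{1}{p}} = N^{p^{-1} - q^{-1}} \|T\|_{S_q}.$$

∎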
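The norm comparison for low-rank operators can be checked on matrices, including the equality case in which all nonzero singular values coincide. This is a minimal sketch with hypothetical sizes and exponents; `schatten_norm` is an illustrative helper.

```python
import numpy as np

def schatten_norm(T, p):
    """l^p norm of the singular values of T."""
    s = np.linalg.svd(T, compute_uv=False)
    return float(np.sum(s ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
N = 8
G = rng.standard_normal((20, N))
T = G @ G.T                      # non-negative operator of rank at most N

p, q = 1.0, 3.0
lhs = schatten_norm(T, p)
rhs = N ** (1.0 / p - 1.0 / q) * schatten_norm(T, q)

# Equality case: an orthogonal projector of rank N has N singular values equal to 1.
Q, _ = np.linalg.qr(G)
P_N = Q @ Q.T
lhs_eq = schatten_norm(P_N, p)
rhs_eq = N ** (1.0 / p - 1.0 / q) * schatten_norm(P_N, q)
```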

We now proceed to prove Lemma 4.2. If , we have:

 E∥Txx∥q (a)=ExEσ(Tr((TxxT∗xx)q/2))1/q (c)=Ex(Tr((N∑n=1∥