Let $D$ be a distribution over a domain $\mathcal{X}$. Informally speaking, in the statistical query (SQ) model, one learns about $D$ as follows. Given a query function $\phi \colon \mathcal{X} \to [-1, 1]$, the SQ oracle with tolerance $\tau$ reports $\mathbb{E}_{x \sim D}[\phi(x)]$ perturbed by an error of scale roughly $\tau$. The SQ model was introduced in [Kea98] as a way to capture “learning algorithms that construct a hypothesis based on statistical properties of large samples rather than on the idiosyncrasies of a particular sample.”
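To make the oracle model concrete, here is a minimal numerical sketch (in Python with NumPy; the function name `stat_oracle` and the random-perturbation model are our illustrative assumptions, not notation from the literature):

```python
import numpy as np

def stat_oracle(samples, query, tau, rng=np.random.default_rng(0)):
    """Illustrative STAT oracle: answers the expectation of a [-1, 1]-valued
    query up to an error of magnitude at most tau (here: a random
    perturbation within the tolerance)."""
    true_mean = np.mean([query(x) for x in samples])
    return true_mean + rng.uniform(-tau, tau)

# Example: query the mean of the first coordinate of a toy distribution.
rng = np.random.default_rng(1)
samples = rng.uniform(-1, 1, size=(10_000, 3))
ans = stat_oracle(samples, lambda x: x[0], tau=0.05)
assert abs(ans - samples[:, 0].mean()) <= 0.05
```

Note that any answer within the tolerance is allowed; a true oracle may choose the perturbation adversarially, which is exactly what SQ lower bounds exploit.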
The original motivation for the SQ framework was to provide evidence of the computational hardness of various learning problems (beyond sample complexity) by proving lower bounds on their SQ complexity. Indeed, many learning algorithms (see [Fel16b] for an overview) can be captured by the SQ framework, and, furthermore, the only known technique that gives a polynomial-time algorithm for a learning problem with exponential SQ complexity [Kea98] is Gaussian elimination over finite fields, whose utility for learning is currently extremely limited. This reasoning suggests the following heuristic:
If solving a learning problem to a given accuracy requires either super-polynomially many SQ queries or queries with super-polynomially small tolerance, then the problem is unlikely to be solvable in polynomial time by any algorithm.
This heuristic, together with the respective SQ lower bounds, provided strong evidence of hardness for many problems, such as: learning parity with noise [Kea98], learning intersections of half-spaces [KS07], the planted clique problem [FGR13a], robust estimation of high-dimensional Gaussians and non-Gaussian component analysis [DKS17], learning a small neural network [SVWX17], adversarial learning [BPR18], and robust linear regression [DKS19], among others.
However, over time, the SQ model has generated significant intrinsic interest [Fel16a], in part due to its connections to distributed learning [SVW16] and local differential privacy [KLN11]. In particular, a newer goal is to understand the trade-off between the number and the tolerance of SQ queries and the accuracy of the resulting solution for various learning problems, which is more nuanced than what is necessary for the above “crude” heuristic. In a paper by Feldman, Guzman, and Vempala [FGV17], this was done for perhaps the most basic learning problem, mean estimation, which is formulated as follows.
[Mean estimation using statistical queries] Let $D$ be a distribution over the unit ball of a normed space $X = (\mathbb{R}^d, \|\cdot\|_X)$, and suppose we are allowed statistical queries with tolerance $\tau$. What is the smallest error $\varepsilon$ for which we can always recover a point $\hat{x}$ such that $\|\hat{x} - \mathbb{E}_{x \sim D}[x]\|_X \le \varepsilon$ holds w.h.p.? Clearly $\varepsilon \gtrsim \tau$, and, as [FGV17] showed, $\varepsilon \lesssim \sqrt{d} \cdot \tau$ for every norm. We say that a norm over $\mathbb{R}^d$ is tractable if one can achieve $\varepsilon \le \mathrm{polylog}(d) \cdot \tau$ (with $\mathrm{poly}(d)$ queries of tolerance $\tau$). The main result of [FGV17] can be stated as follows. [[FGV17]] The $\ell_p$ norm over $\mathbb{R}^d$ is tractable if and only if $2 \le p \le \infty$. The fact that the $\ell_\infty$ norm is tractable is trivial, since we can estimate each coordinate of the mean separately. However, the corresponding algorithm for $\ell_p$ norms for $2 \le p < \infty$ is more delicate and is based on random rotations, while the naïve coordinate-by-coordinate estimator merely gives $\varepsilon \lesssim d^{1/p} \cdot \tau$. [FGV17] raise several intriguing open problems, among them the following two:
Characterize tractable norms beyond the $\ell_p$ norms;
Solve Problem 1 for the spectral norm and other Schatten-$p$ norms of matrices.
In this paper, we make progress towards solving the first problem and completely resolve the second one.
1.1 Our results
Our first result gives a complete characterization of tractable symmetric norms. A norm is symmetric if it is invariant under all permutations of the coordinates and under sign flips (for many examples beyond $\ell_p$ norms, see [ANN17]). Recently, there has been substantial progress in understanding various algorithmic tasks for general symmetric norms [BBC17, ANN17, SWZ18, ALS18]. In this paper, we significantly extend Theorem 1 to all symmetric norms. To formulate our result, we need to define the type-2 constant of a normed space, which is one of the standard bi-Lipschitz invariants ([Woj96]).
For a normed space $X$, the type-2 constant of $X$, denoted by $T_2(X)$, is defined as the smallest constant $T$ such that the following holds. For every finite sequence of vectors $v_1, \dots, v_n \in X$ and for uniformly random signs $\varepsilon_1, \dots, \varepsilon_n \in \{-1, +1\}$, one has:
$$\mathbb{E}_{\varepsilon}\Big\| \sum_{i=1}^n \varepsilon_i v_i \Big\|_X^2 \ \le\ T^2 \sum_{i=1}^n \| v_i \|_X^2 .$$
We are now ready to state our result. A symmetric normed space $X$ over $\mathbb{R}^d$ is tractable iff $T_2(X) \le \mathrm{polylog}(d)$. Theorem 1.1 easily implies Theorem 1, since for $2 \le p \le \infty$ one has $T_2(\ell_p^d) \lesssim \sqrt{\min(p, \log d)}$, while for $1 \le p < 2$ one has $T_2(\ell_p^d) \asymp d^{1/p - 1/2}$ ([BCL94]). For a quantitative version of Theorem 1.1, see Theorem 3.1 and Theorem 3.2.
Recall that for a matrix $A$, the Schatten-$p$ norm of $A$ is the $\ell_p$ norm of the vector of singular values of $A$. In particular, the Schatten-$\infty$ norm of $A$ is simply the spectral norm of $A$, and the Schatten-$2$ norm corresponds to the Frobenius norm. Such norms are very well-studied and arise naturally in many applications in learning and probability theory. Our second main result settles the tractability of Schatten-$p$ norms, resolving a question of [FGV17]. The Schatten-$p$ norm is tractable iff $p = 2$. For a quantitative version of Theorem 1.2, see Theorem 4. Theorem 1.2 shows that one cannot remove “symmetric” from Theorem 1.1, since the type-2 constants of Schatten-$p$ spaces are essentially the same as those of the corresponding $\ell_p$ spaces ([BCL94]). Specifically, for $p > 2$, Schatten-$p$ spaces have small type-2 constant, but are nevertheless intractable. In particular, we show that the best mean estimation algorithm for Schatten-$p$ can be obtained by embedding the space into $\ell_p$ (via the identity map on matrix entries) and then using the $\ell_p$ estimation algorithm from [FGV17].
The main technical tool underlying the algorithm for mean estimation in symmetric norms is the following geometric statement. For any symmetric norm $\|\cdot\|$, consider the level-$i$ ring, i.e., the set of all points whose non-zero coordinates have absolute values lying in a fixed common range, and consider the smallest radius for which the inclusions recorded in (2) hold: the ring is contained in an intersection of suitably scaled $\ell_\infty$ and $\ell_2$ balls, which is in turn contained in a scaling of the unit ball of $\|\cdot\|$.
Given the above geometric statement, which generalizes the analogous statement for $\ell_p$ norms from [FGV17], we extend the algorithm from [FGV17] to the symmetric-norm setting. Specifically, we split the distribution into pieces, each supported on a level-$i$ ring of $\|\cdot\|$, so that the sum of the mean estimates of the pieces is a good estimate for the mean of the original distribution. By the first inclusion in (2), we may run the mean estimation algorithms for $\ell_\infty$ and $\ell_2$ on each ring after an appropriate scaling. Running these two algorithms, we obtain an approximation to the mean of the distribution on each ring that is accurate in both $\ell_\infty$ and $\ell_2$. Via the second inclusion in (2), this yields a good estimate in $\|\cdot\|$ provided the resulting error parameter is small.
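The ring decomposition described above can be sketched numerically; the dyadic choice of levels below is an illustrative assumption (the paper's parameters may differ), and `level_rings` is our hypothetical helper name:

```python
import numpy as np

def level_rings(x, levels=20):
    """Split x into vectors supported on 'level-j rings': the coordinates
    whose absolute value lies in (2^{-(j+1)}, 2^{-j}]. Together the rings
    partition all coordinates above the truncation threshold 2^{-levels}."""
    parts = []
    for j in range(levels):
        mask = (np.abs(x) > 2.0 ** -(j + 1)) & (np.abs(x) <= 2.0 ** -j)
        parts.append(np.where(mask, x, 0.0))
    return parts

x = np.array([0.9, 0.3, 0.26, 0.01, 0.0])
parts = level_rings(x)
# The ring vectors sum back to x, up to the truncated tiny coordinates.
assert np.allclose(sum(parts), np.where(np.abs(x) > 2.0 ** -20, x, 0.0))
```

Within each ring all non-zero coordinates have comparable magnitude, which is what makes the $\ell_\infty$/$\ell_2$ estimates on a ring translate into an estimate in the symmetric norm.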
The lower bound for norms with large type-2 constant generalizes the corresponding result in [FGV17]; in particular, the hard distributions for $\ell_p$ with $p < 2$ from [FGV17] are supported on scaled standard basis vectors, which are exactly the vectors witnessing the type-2 constant in (1). For a general norm, we consider analogous distributions supported on an arbitrary set of vectors witnessing the type-2 constant in (1); however, the fact that we have much less control over these vectors necessitates additional care.
The Schatten-$p$ norms for $p > 2$ do satisfy $T_2(S_p) = O(\sqrt{p})$, so new ideas are required to prove the lower bound. We prove the lower bound for carefully crafted hard distributions, using hypercontractivity to establish concentration of the result of an arbitrary statistical query.
2 Preliminaries
Here we introduce some basic notions about normed spaces and statistical algorithms. We will use boldfaced letters for random variables, and the notation $\mathbf{x} \sim S$ will mean that $\mathbf{x}$ is a random vector chosen uniformly from the set $S$.
For any vector $x \in \mathbb{R}^d$, we let $|x|$ be the vector with each coordinate replaced by its absolute value, and we let $x^*$ be the vector obtained by sorting the coordinates of $|x|$ in non-increasing order. A normed space $X$ over $\mathbb{R}^d$ is symmetric if $\|x\|_X = \|x^*\|_X$ holds for every $x \in \mathbb{R}^d$.
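A quick numerical illustration of these operations (the helper name `abs_sorted` is ours):

```python
import numpy as np

def abs_sorted(x):
    """Return x*: the coordinates of |x| sorted in non-increasing order."""
    return np.sort(np.abs(x))[::-1]

# A symmetric norm is unchanged by sign flips and coordinate permutations;
# in particular every l_p norm satisfies ||x|| = ||x*||.
x = np.array([-3.0, 1.0, -2.0])
assert np.allclose(abs_sorted(x), [3.0, 2.0, 1.0])
for p in (1, 2, np.inf):
    assert np.isclose(np.linalg.norm(x, p), np.linalg.norm(abs_sorted(x), p))
```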
We recall that $\ell_p^d$ is the normed space over $\mathbb{R}^d$ with the norm of a vector $x$ given by $\|x\|_p = (\sum_{i=1}^d |x_i|^p)^{1/p}$ for finite $p$, and $\|x\|_\infty = \max_i |x_i|$. The Schatten-$p$ space $S_p^n$ is defined over $n \times n$ matrices with real entries, and the norm of a matrix is defined as the $\ell_p$ norm of the vector of its singular values. We omit the superscript and just write $\ell_p$ and $S_p$ when this does not cause confusion.
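These definitions are easy to check numerically; a minimal sketch (with `schatten_norm` our illustrative helper name):

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten-p norm: the l_p norm of the singular values of A.
    p = np.inf gives the spectral norm, p = 2 the Frobenius norm."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.linalg.norm(s, ord=p)

A = np.array([[3.0, 0.0], [0.0, 4.0]])          # singular values: 4, 3
assert np.isclose(schatten_norm(A, np.inf), 4.0)  # spectral norm
assert np.isclose(schatten_norm(A, 2), 5.0)       # Frobenius norm
assert np.isclose(schatten_norm(A, 1), 7.0)       # nuclear norm
```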
For a normed space $X$, let $B_X$ be the unit ball of the norm $\|\cdot\|_X$. Furthermore, for $q \ge 1$, we let $\ell_q(X)$ be the normed space over finite sequences $(v_1, \dots, v_n)$ of vectors in $X$, where $\|(v_1, \dots, v_n)\|_{\ell_q(X)} = (\sum_{i=1}^n \|v_i\|_X^q)^{1/q}$.
Next we define the type of a normed space. Let $X$ be a normed space, $p \in [1, 2]$, and let $\varepsilon_1, \dots, \varepsilon_n \in \{-1, +1\}$ be uniformly random signs. Let $T$ be the infimum over all constants such that:
$$\Big(\mathbb{E}_{\varepsilon}\Big\| \sum_{i=1}^n \varepsilon_i v_i \Big\|_X^2\Big)^{1/2} \ \le\ T \Big( \sum_{i=1}^n \| v_i \|_X^p \Big)^{1/p}$$
for all finite sequences $v_1, \dots, v_n \in X$. We let $T_p(X)$ denote this infimum, and say that $X$ has type $p$ with constant $T_p(X)$. Note that, by the parallelogram identity, the Euclidean space has type 2 with constant $1$, and in fact the inequality becomes an equality. Together with John's theorem, this implies that any $d$-dimensional normed space has type 2 with constant at most $\sqrt{d}$. However, we are typically interested in spaces that have type 2 with a constant independent of the dimension. It follows from the results in [BCL94] that for $2 \le p < \infty$, $\ell_p$ has type 2 with constant $O(\sqrt{p})$, and for $1 \le p \le 2$, $\ell_p$ has type $p$ with a dimension-independent constant; at the same time, considering the standard basis of $\ell_p^d$ shows that for $1 \le p < 2$, the type-2 constant of $\ell_p^d$ goes to infinity with the dimension $d$. Moreover, these results also hold for Schatten-$p$ spaces.
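To make the standard-basis computation explicit in the simplest case $p = 1$: every sign pattern gives $\|\sum_i \varepsilon_i e_i\|_1 = d$, so the type-2 inequality forces

```latex
\mathbb{E}_{\varepsilon}\Big\| \sum_{i=1}^{d} \varepsilon_i e_i \Big\|_{1}^{2} = d^{2}
\qquad\text{and}\qquad
\sum_{i=1}^{d} \| e_i \|_{1}^{2} = d,
\qquad\text{hence}\qquad
T_2\big(\ell_1^d\big) \ \ge\ \sqrt{d}.
```

The same computation with the $\ell_p$ norm, $1 \le p < 2$, gives the lower bound $T_2(\ell_p^d) \ge d^{1/p - 1/2}$.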
Finally, we formally define statistical algorithms and the $\mathrm{STAT}$ and $\mathrm{VSTAT}$ oracles. We follow the definitions from [FGR13b]. Let $D$ be a distribution supported on a domain $\mathcal{X}$. For a tolerance parameter $\tau > 0$, the $\mathrm{STAT}(\tau)$ oracle takes a query function $\phi \colon \mathcal{X} \to [-1, 1]$ and returns some value $v$ satisfying $|v - \mathbb{E}_{x \sim D}[\phi(x)]| \le \tau$. For a sample size parameter $t$, the $\mathrm{VSTAT}(t)$ oracle takes a query function $\phi \colon \mathcal{X} \to [0, 1]$ and returns some value $v$ such that $|v - p| \le \max\{1/t, \sqrt{p(1 - p)/t}\}$, where $p = \mathbb{E}_{x \sim D}[\phi(x)]$.
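A numerical sketch of the $\mathrm{VSTAT}$ guarantee (the function name and the random-perturbation model are our illustrative assumptions; a true oracle may perturb adversarially within the tolerance):

```python
import numpy as np

def vstat_oracle(samples, query, t, rng=np.random.default_rng(0)):
    """Illustrative VSTAT(t) oracle for a [0,1]-valued query: returns
    p = E[query(x)] perturbed by at most max(1/t, sqrt(p(1-p)/t)),
    mimicking the deviation of an empirical mean over t i.i.d. samples."""
    p = np.mean([query(x) for x in samples])
    tol = max(1.0 / t, np.sqrt(p * (1.0 - p) / t))
    return float(np.clip(p + rng.uniform(-tol, tol), 0.0, 1.0))

rng = np.random.default_rng(4)
samples = rng.uniform(0, 1, size=10_000)
t = 100
ans = vstat_oracle(samples, lambda x: float(x > 0.5), t)
p = np.mean(samples > 0.5)
assert abs(ans - p) <= max(1.0 / t, np.sqrt(p * (1 - p) / t))
```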
We call an algorithm that accesses the distribution only via one of the above oracles a statistical algorithm. Clearly, $\mathrm{VSTAT}(t)$ is at least as strong as $\mathrm{STAT}(1/\sqrt{t})$ and no stronger than $\mathrm{STAT}(1/t)$, up to constant factors. The lower bounds presented will follow the framework of [FPV18]. The discrimination norm for a reference distribution $D_0$ supported on $\mathcal{X}$ and a set $\mathcal{D}$ of distributions supported on $\mathcal{X}$ is given by:
$$\kappa_2(\mathcal{D}, D_0) := \max_{h \colon \mathbb{E}_{D_0}[h^2] \le 1} \ \mathbb{E}_{\mathbf{D} \sim \mathcal{D}}\big[\,\big| \mathbb{E}_{\mathbf{D}}[h] - \mathbb{E}_{D_0}[h] \big|\,\big],$$
where $\mathbf{D}$ is sampled uniformly at random from $\mathcal{D}$. The decision problem $\mathcal{B}(\mathcal{D}, D_0)$ is the problem of distinguishing whether an unknown distribution equals $D_0$ or is sampled uniformly from $\mathcal{D}$. The statistical dimension with discrimination norm $\kappa$, denoted $\mathrm{SDN}(\mathcal{B}(\mathcal{D}, D_0), \kappa)$, is the largest integer $m$ such that, for some finite subset $\mathcal{D}_0 \subseteq \mathcal{D}$, any subset $\mathcal{D}' \subseteq \mathcal{D}_0$ of size at least $|\mathcal{D}_0| / m$ satisfies $\kappa_2(\mathcal{D}', D_0) \le \kappa$.
[Theorem 7.1 in [FPV18]] For $\kappa > 0$, let $m = \mathrm{SDN}(\mathcal{B}(\mathcal{D}, D_0), \kappa)$ for a distribution $D_0$ and a set of distributions $\mathcal{D}$ supported on a domain $\mathcal{X}$. Any randomized statistical algorithm that solves $\mathcal{B}(\mathcal{D}, D_0)$ with probability at least $2/3$ requires $\Omega(m)$ calls to $\mathrm{VSTAT}(O(1/\kappa^2))$.
3 Symmetric norms
3.1 Mean estimation using SQ for type-2 symmetric norms
Let $\|\cdot\|$ be any symmetric norm with a bounded type-2 constant. For $\alpha > 0$, let $c(\alpha)$ be the maximum number of coordinates set to $\alpha$ in a vector within the unit ball of $\|\cdot\|$, i.e.,
$$c(\alpha) := \max\big\{ k \;:\; \big\| \alpha \cdot (\underbrace{1, \dots, 1}_{k}, 0, \dots, 0) \big\| \le 1 \big\},$$
and let $\lambda(\alpha)$ be the maximum $\ell_2$ norm of a vector within the unit ball of $\|\cdot\|$ with all non-zero coordinates set to $\alpha$, i.e.,
$$\lambda(\alpha) := \alpha \sqrt{c(\alpha)}.$$
The following is the main lemma needed for the statistical query algorithm for type-2 symmetric norms. The lemma generalizes Lemma 3.12 of [FGV17] from $\ell_p$ norms (with $2 \le p < \infty$) to arbitrary type-2 symmetric norms; it bounds the norm $\|v\|$ of an arbitrary vector $v$, given corresponding bounds on its $\ell_\infty$ and $\ell_2$ norms.
Let be a symmetric norm with type- constant . Fix any , and let satisfy and . Then, .
Given the vector , consider the sets for given by
and let be the vector given by letting the first coordinates be , and the remaining coordinates be 0. Because is symmetric with respect to changing the sign of any coordinate of , the triangle inequality easily implies that is monotone with respect to for any . Then, by the triangle inequality and the fact that is symmetric with , ; thus, it remains to bound for every .
We then have , where, in the first inequality, we used the fact that , and, in the second inequality, we used the definition of . As a result, we have , so consider partitioning the non-zero coordinates of into at most groups, each of size at most , and let be the vectors so . We have
where the equality uses the symmetry of with respect to changing signs of coordinates, the inequality (a) uses the definition of type constants, and the inequality (b) follows from the definition of . We obtain the desired lemma by summing over all , for . ∎
With this structural result, we now show: Let be a symmetric norm with type-2 constant normalized so . There exists an algorithm for mean estimation over making queries to , where the accuracy satisfies
For , and , let be the level vector of , i.e., . For any fixed distribution supported on the unit ball of , we may consider the distribution given by where . Denote and , so that distributions satisfy . As a result, the sum of -approximations of would result in an -approximation of .
The algorithm proceeds by estimating the mean of each distribution and then taking the sum of all estimates:
For each , we consider as the distribution given by where , and as the distribution given by . Note that is supported on , and is supported on .
Perform the mean estimation algorithms for $\ell_\infty$ and $\ell_2$ as given in [FGV17] with error parameter where to obtain vectors , and let and where
Find one vector where and , and return as an estimate for .
Given estimates for all , output .
We note that the inequalities in (3) follow from the fact that and are good approximations for the ring means (in $\ell_\infty$ and in $\ell_2$, respectively), and that
In order to see that is a good estimate for , let be the error vector in the approximation. From the triangle inequality, and the definition of , we have and , so that Lemma 3.1 implies . ∎
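The level-ring estimator described above can be sketched end-to-end as follows. This is an illustrative simulation, not the paper's exact algorithm: it uses dyadic rings, simulates one coordinate-wise STAT answer per ring (each within tolerance `tau`), and omits the $\ell_2$ estimates and the combination step via Lemma 3.1:

```python
import numpy as np

def estimate_mean_by_rings(samples, tau, levels=10,
                           rng=np.random.default_rng(0)):
    """Split each sample into dyadic level rings, estimate each ring's mean
    coordinate-wise (each coordinate simulating a STAT query answered
    within tau), and sum the per-ring estimates."""
    d = samples.shape[1]
    total = np.zeros(d)
    for j in range(levels):
        lo, hi = 2.0 ** -(j + 1), 2.0 ** -j
        ring = np.where((np.abs(samples) > lo) & (np.abs(samples) <= hi),
                        samples, 0.0)
        # Simulated per-coordinate STAT answers for this ring's mean.
        total += ring.mean(axis=0) + rng.uniform(-tau, tau, size=d)
    return total

rng = np.random.default_rng(3)
samples = rng.uniform(-1, 1, size=(20_000, 5))
est = estimate_mean_by_rings(samples, tau=1e-3)
# Error is at most (number of rings) * tau plus the truncated tiny mass.
assert np.max(np.abs(est - samples.mean(axis=0))) <= 15 * 1e-3 + 2.0 ** -9
```

In the actual algorithm, each ring's mean is additionally estimated in $\ell_2$, and the two estimates are combined via Lemma 3.1 to control the error in the symmetric norm itself; the simulation above only tracks the coordinate-wise ($\ell_\infty$) part.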
3.2 Lower bounds for normed spaces with large type-2 constants
We now give a lower bound for normed spaces which have large type-2 constant.
Let be a normed space with type-2 constant . There exists an such that any statistical algorithm for mean estimation in with error making queries to must make
such queries. An immediate corollary of Theorem 3.2 shows that the upper bound from Theorem 3.1 is tight up to poly-logarithmic factors. Let be a normed space with type-2 constant . Any algorithm for mean estimation in making queries to must have
We set up some notation and basic observations leading to a proof of Theorem 3.2.
Let $X$ be a normed space with type-2 constant $T_2(X)$. Then, for any , there exists some , as well as a sequence of vectors , where for every , and
with for an absolute constant .
Since , there exists a sequence such that . A well-known comparison inequality between Rademacher and Gaussian averages (see, e.g., Lemma 4.5 in [LT11]) gives that for a sequence of independent standard Gaussian random variables , . Let us assume, without loss of generality, that for every . For any , define the sequence to consist of copies of and a single copy of , and note for every and . Observe also that, if are independent standard Gaussian random variables, then is distributed identically to , and, moreover, . Therefore, we have By the Gaussian version of the Khintchine-Kahane inequalities (Corollary 3.2 in [LT11]) and the Paley-Zygmund inequality, we have that for some absolute constant , with probability at least , .
We define the sequence to contain copies of each vector , for some large enough integer. By the central limit theorem, the suitably normalized sum converges in distribution to the corresponding Gaussian average. Then, for a large enough number of copies, the desired inequality holds with at least constant probability. The lemma follows, since the left-hand side above is always non-negative. ∎
Description of the lower bound instance
In this section we describe the instance which achieves the lower bound in Theorem 3.2.
Fix a sequence satisfying (4), guaranteed to exist by Lemma 3.2, and let the sequence be defined by . In the language of [FGV17], let be the reference distribution supported on given by sampling where for all ,
so that . We will let be so that . For , let be the distribution supported on given by sampling where for all ,
Consider the distribution on distributions which is uniform over all where . Then, we have (here and in the rest of the paper, we write $a \lesssim b$ to mean that there exists an absolute constant $C$, independent of all other parameters, such that $a \le C\,b$, and, analogously, $a \gtrsim b$ to mean $b \lesssim a$):
where (9) and (10) follow from the Khintchine-Kahane inequalities and the definition of . By the Paley-Zygmund inequality, , for some . The preceding discussion thus yields the following lemma. Suppose there exists a statistical algorithm for mean estimation over with error making queries to ; then, for the distribution as in (5) and the set as in (6), the decision problem has a statistical algorithm making queries of accuracy which succeeds with constant probability.
We now turn to computing the statistical dimension of , as described in Definition 2.
Let be any function with . Note that
so that by the Hoeffding inequality, any satisfies