Agnostic Learning of a Single Neuron with Gradient Descent

05/29/2020, by Spencer Frei et al.

We consider the problem of learning the best-fitting single neuron as measured by the expected squared loss E_{(x,y)∼D}[(σ(w⊤x) − y)²] over an unknown joint distribution of the features and labels by using gradient descent on the empirical risk induced by a set of i.i.d. samples S ∼ Dⁿ. The activation function σ is an arbitrary Lipschitz and non-decreasing function, making the optimization problem nonconvex and nonsmooth in general, and covers typical neural network activation functions and inverse link functions in the generalized linear model setting. In the agnostic PAC learning setting, where no assumption on the relationship between the labels y and the features x is made, if the population risk minimizer v has risk OPT, we show that gradient descent achieves population risk O(OPT^{1/2}) + ε in polynomial time and sample complexity. When labels take the form y = σ(v⊤x) + ξ for zero-mean sub-Gaussian noise ξ, we show that gradient descent achieves population risk OPT + ε. Our sample complexity and runtime guarantees are (almost) dimension independent, and when σ is strictly increasing and Lipschitz, require no distributional assumptions beyond boundedness. For ReLU, we show the same results under a nondegeneracy assumption for the marginal distribution of the features. To the best of our knowledge, this is the first result for agnostic learning of a single neuron using gradient descent.


1 Introduction

In this paper, we describe the properties of gradient descent for learning the best possible single neuron that captures the relationship between features x ∈ ℝ^d and labels y ∈ ℝ as measured by the expected squared loss over some unknown joint distribution D. In particular, for a given activation function σ : ℝ → ℝ, we define the population risk associated with a set of weights w as

F(w) := E_{(x,y)∼D}[(σ(w⊤x) − y)²].   (1.1)

The activation function σ is assumed to be non-decreasing and Lipschitz, which covers nearly all activation functions used in neural networks, such as the rectified linear unit (ReLU), sigmoid, tanh, and so on. In the agnostic PAC learning setting of Kearns et al. (1994), no structural assumption is made regarding the relationship of the features and the labels, and so the best-fitting neuron could, in the worst case, have nontrivial population risk. Concretely, if we denote

v := argmin_w F(w),  OPT := F(v),   (1.2)

then the goal of a learning algorithm is to (efficiently) return weights w such that the population risk F(w) is close to the best possible risk OPT. The agnostic learning framework stands in contrast to the realizable PAC learning setting, where one assumes OPT = 0, so that there is some v such that the labels are given as y = σ(v⊤x).

The learning algorithm we use in this paper is vanilla gradient descent. We assume we have access to a set of i.i.d. samples S = {(x_i, y_i)}_{i=1}^n ∼ Dⁿ, and we run gradient descent with a fixed step size on the empirical risk induced by the empirical distribution of the samples.

Surprisingly little is known about gradient descent-trained neural networks in the agnostic PAC learning framework. We are aware of two works in the improper agnostic learning setting, where the goal is to return a hypothesis that achieves population risk close to OPT′, where OPT′ is the smallest possible population risk achieved by a disjoint set of hypotheses (Allen-Zhu et al., 2019; Allen-Zhu and Li, 2019). Another work considered the random features setting where only the final layer of the network is trained and the marginal distribution over the features is uniform on the unit sphere (Vempala and Wilmes, 2019). But none of these address the simplest possible neural network: that of a single neuron x ↦ σ(w⊤x). We believe a full characterization of what we can (or cannot) guarantee for gradient descent in the single neuron setting will help us understand what is possible in the more complicated deep neural network setting. Indeed, two of the most common hurdles in the analysis of deep neural networks trained by gradient descent—nonconvexity and nonsmoothness—are also present in the case of the single neuron. We hope that our analysis in this relatively simple setup will be suggestive of what is possible in more complicated neural network models.

Our main contributions can be summarized as follows.

  1. Agnostic setting. Without any assumptions on the relationship between x and y, and assuming only boundedness of the marginal distributions of x and y, we show that for any ε > 0, gradient descent finds a point with population risk O(√OPT) + ε when σ is strictly increasing and Lipschitz. We can show the same result for ReLU when the marginal distribution of x satisfies a marginal spread condition (Assumption 3.2). The sample complexity and runtime are polynomial in ε⁻¹, with both complexities independent of the input dimension.

  2. Noisy teacher network setting. When y = σ(v⊤x) + ξ, where ξ is mean zero and sub-Gaussian (and possibly dependent on x), we demonstrate that gradient descent finds weights satisfying population risk OPT + ε for activation functions that are strictly increasing and Lipschitz, assuming only boundedness of the marginal distribution over x. The same result holds for ReLU under the marginal spread assumption given below in Assumption 3.2. The runtime and sample complexity are polynomial in ε⁻¹, with logarithmic dependence on the input dimension. When the noise is bounded, our guarantees are dimension independent. If we further know OPT = 0, i.e. the learning problem is in the realizable rather than agnostic setting, we can improve the complexity guarantees by using online stochastic gradient descent.

2 Related work

Below, we provide a high-level summary of related literature in the agnostic learning and teacher network settings. Detailed comparisons with the most related works will appear after we present our main theorems in Sections 3 and 4. In Appendix A, we provide tables that describe the assumptions and complexity guarantees of our work in comparison to related works.

Agnostic learning: The simplest version of the agnostic regression problem is that of finding a hypothesis that matches the performance of the best linear predictor. In our setting, this corresponds to σ being the identity function. This problem is essentially completely characterized: Shamir (2015) proved lower bounds on the risk achievable by any algorithm that returns a linear predictor when the labels are bounded and the features are supported on the unit ball, matching upper bounds proved by Srebro et al. (2010) using mirror descent.

When σ is not the identity, related works are scarce. The only work on agnostic learning of a single neuron that we are aware of is Goel et al. (2019), where the authors considered the problem of learning a single ReLU when the features are standard Gaussians. In this setting, they showed that learning up to risk OPT + ε in polynomial time is as hard as the problem of learning sparse parities with noise, long believed to be computationally intractable. By reducing the problem of learning a ReLU to one of learning a halfspace, they use an algorithm of Awasthi et al. (2017) to show learnability up to O(OPT^{2/3}) + ε. In a related but incomparable set of results, Allen-Zhu et al. (2019) and Allen-Zhu and Li (2019) studied improper agnostic learnability for neural networks in the multilayer setting when the labels are generated by some multilayer network with a smooth activation function and the hypothesis class is a deep ReLU network. Vempala and Wilmes (2019) studied agnostic learning of a one-hidden-layer neural network when the first layer is fixed at its (random) initial values and the second layer is trained.

Teacher network: The literature refers to the case of y = σ(v⊤x) + ξ for some mean zero noise ξ variously as the “noisy teacher network” or “generalized linear model” (GLM) setting; it is related to the probabilistic concepts model introduced by Kearns and Schapire (1994). In the GLM setting, σ plays the role of the inverse link function; in the case of logistic regression, σ is the sigmoid.

The results in the teacher network setting can be broadly characterized by (1) whether they cover arbitrary distributions over the features and (2) the presence of noise (or lack thereof). The GLMTron algorithm proposed by Kakade et al. (2011), itself a modification of the Isotron algorithm of Kalai and Sastry (2009), is known to learn a noisy teacher network up to risk OPT + ε for any Lipschitz and non-decreasing σ and any distribution with bounded marginals over x. Mei et al. (2018) showed that regularized gradient descent learns the noisy teacher network under a smoothness assumption on the activation function for a large class of distributions. Foster et al. (2018) provided a meta-algorithm for translating ε-stationary points of the empirical risk to minimal points of the population risk under certain conditions, and showed that such conditions are satisfied by regularized gradient descent. A recent work by Mukherjee and Muthukumar (2020) develops a modified SGD algorithm for learning a ReLU with bounded noise on distributions where the features are bounded.

Of course, any guarantee that holds for a neural network with a single fully connected hidden layer of arbitrary width holds for the single neuron, so in a sense our work connects to a larger body of work on the analysis of gradient descent used for learning neural networks. The majority of such works are restricted to particular distributions of the features, such as Gaussian or uniform distributions (Soltanolkotabi, 2017; Tian, 2017; Soltanolkotabi et al., 2019; Zhang et al., 2019; Goel et al., 2018; Cao and Gu, 2019). Du et al. (2018) showed that in the noiseless (a.k.a. realizable) setting, a single neuron can be learned with SGD if the feature distribution satisfies a certain subspace eigenvalue property. Yehudai and Shamir (2020) studied the properties of learning a single neuron for a variety of increasing and Lipschitz activation functions using gradient descent, as we do in this paper, although their analysis was restricted to the noiseless setting.

3 Agnostic setting

We begin our analysis by assuming there is no a priori relationship between x and y, and so the risk OPT of the population risk minimizer v defined in 1.2 may, in general, be a large quantity. If OPT = 0, then y = σ(v⊤x) a.s., and so we are in the realizable PAC learning setting. In this case, we can use a modified proof technique to achieve population risk ε with improved sample and runtime complexity by using online stochastic gradient descent; see Appendix D for the complete theorems and proofs in this setting. In what follows, we will therefore assume without loss of generality that OPT > 0.

The gradient descent method we use in this paper is as follows. We assume we have samples S = {(x_i, y_i)}_{i=1}^n, and define the empirical risk for a weight w by

F̂(w) := (1/n) Σ_{i=1}^n (σ(w⊤x_i) − y_i)².

We perform full-batch gradient updates on the empirical risk using a fixed step size η,

w_{t+1} := w_t − η ∇F̂(w_t).   (3.1)

After running T updates, the algorithm outputs the iterates w_1, …, w_T.
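As a concrete illustration, below is a minimal numpy sketch of this procedure; the activation shown and the values of the step size and iteration count are placeholder choices rather than the ones prescribed by our theorems.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # Subgradient convention sigma'(0) = 0, matching standard autodiff packages.
    return (z > 0.0).astype(float)

def gradient_descent(X, y, sigma, sigma_grad, eta=0.1, T=1000):
    """Full-batch gradient descent on the empirical risk
    (1/n) * sum_i (sigma(w^T x_i) - y_i)^2, initialized at w_0 = 0."""
    n, d = X.shape
    w = np.zeros(d)
    iterates = [w.copy()]
    for _ in range(T):
        z = X @ w
        # Gradient of the empirical risk at the current iterate.
        grad = (2.0 / n) * X.T @ ((sigma(z) - y) * sigma_grad(z))
        w = w - eta * grad
        iterates.append(w.copy())
    return iterates  # the analysis selects a good iterate among w_1, ..., w_T
```

The guarantees below are for some iterate along the trajectory rather than the last one, which is why the sketch keeps the whole trajectory.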

We begin by describing one set of activation functions under consideration in this paper.

Assumption 3.1.
  1. σ is continuous, non-decreasing, and differentiable almost everywhere.

  2. For any R > 0, there exists γ = γ(R) > 0 such that σ′(z) ≥ γ for all |z| ≤ R at which σ is differentiable. If σ is not differentiable at z, assume that every subgradient of σ on the interval [−R, R] satisfies σ′(z) ≥ γ.

  3. σ is L-Lipschitz, i.e. |σ(z) − σ(z′)| ≤ L|z − z′| for all z, z′ ∈ ℝ.
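To make the derivative lower bound in Assumption 3.1 concrete, the following sketch numerically estimates γ(R) = inf_{|z| ≤ R} σ′(z) for a few common activations; the choice R = 2 and the grid resolution here are arbitrary.

```python
import numpy as np

def gamma(sigma_grad, R, num=100001):
    # Numerically estimate gamma(R) = inf over |z| <= R of sigma'(z).
    z = np.linspace(-R, R, num)
    return sigma_grad(z).min()

sigmoid_grad    = lambda z: np.exp(-np.abs(z)) / (1 + np.exp(-np.abs(z)))**2  # symmetric form, overflow-safe
tanh_grad       = lambda z: 1 - np.tanh(z)**2
leaky_relu_grad = lambda z: np.where(z > 0, 1.0, 0.01)
relu_grad       = lambda z: (z > 0).astype(float)

for name, g in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad),
                ("leaky ReLU", leaky_relu_grad), ("ReLU", relu_grad)]:
    print(f"{name}: gamma(2) = {gamma(g, R=2):.4f}")

# sigmoid, tanh, and leaky ReLU all give gamma(R) > 0 for finite R, while
# ReLU gives gamma(R) = 0: its derivative vanishes on (-R, 0), which is why
# ReLU requires the separate treatment via Assumption 3.2 below.
```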

We note that if σ is strictly increasing with a continuous, nonvanishing derivative, then σ satisfies Assumption 3.1(b), since its derivative attains a positive minimum on any bounded interval. In particular, the assumption covers the typical activation functions in neural networks like leaky ReLU, softplus, sigmoid, tanh, etc., but excludes ReLU. Yehudai and Shamir (2020) recently showed that when σ is ReLU, there exists a distribution supported on the unit ball and a unit length target neuron v such that even in the realizable case of y = σ(v⊤x), if the weights are initialized randomly using a product distribution, then there exists a constant c > 0 such that with high probability, the population risk remains at least c throughout the trajectory of gradient descent. This suggests that gradient-based methods for learning ReLUs are likely to fail without additional assumptions. Because of this, they introduced the following marginal spread assumption to allow for convergence guarantees.

Assumption 3.2.

There exist constants α, β > 0 such that the following holds. For any two linearly independent vectors u, w ∈ ℝ^d, denote by D_{u,w} the marginal distribution of x on the subspace span(u, w), viewed as a distribution over ℝ², and let p_{u,w} be its density function. Then inf_{‖z‖ ≤ α} p_{u,w}(z) ≥ β.

This assumption covers, for instance, standard Gaussian distributions and centered uniform distributions, with α and β absolute constants, and holds for any distribution mixed with some Gaussian or uniform noise. We note that a similar assumption was used in recent work by Diakonikolas et al. (2020) on learning halfspaces with Massart noise. We will use this assumption for all of our results when σ is ReLU. Additionally, although the ReLU is not differentiable at the origin, we will denote by σ′ its subgradient, with the convention that σ′(0) = 0. Such a convention is consistent with the implementation of ReLUs in modern deep learning software packages.

With the above in hand, we can describe our main theorem.

Theorem 3.3.

Suppose the marginals of D satisfy ‖x‖ ≤ B_X a.s. and |y| ≤ B_Y a.s. Assume gradient descent is initialized at w₀ = 0 and fix a step size η ≤ (2L²B_X²)⁻¹. If σ satisfies Assumption 3.1, let γ = γ(R) be the constant corresponding to the scale R of the problem. For any ε, δ ∈ (0, 1), with probability at least 1 − δ, gradient descent run for T = poly(ε⁻¹, log(1/δ)) iterations on a sample of size n = poly(ε⁻¹, log(1/δ)) finds weights w_t, t < T, such that

F(w_t) ≤ C √OPT + ε,   (3.2)

where C is a constant depending on γ⁻¹, L, B_X, B_Y, and ‖v‖.

When σ is ReLU, further assume that D_x satisfies Assumption 3.2 for constants α, β > 0. Then 3.2 holds with the dependence of C on γ⁻¹ replaced by a dependence on α, β, and B_X.

In comparison to recent work, Goel et al. (2019) considered the agnostic setting for the ReLU activation when the marginal distribution over x is a standard Gaussian and showed that learning up to risk OPT + ε is as hard as learning sparse parities with noise, long believed to be computationally intractable. By using an approximation algorithm of Awasthi et al. (2017), they were able to show that one can learn up to O(OPT^{2/3}) + ε with polynomial runtime and sample complexity. By contrast, we use gradient descent to learn up to a (weaker) risk of O(√OPT) + ε, but for any joint distribution with bounded marginals when σ satisfies Assumption 3.1. In the case of ReLU, our guarantee holds for the class of distributions over x with bounded support that satisfy the marginal spread condition of Assumption 3.2, and for all activation functions we consider, the runtime and sample complexity guarantees do not have (explicit) dependence on the dimension. (We note that for some distributions, the O(·) notation may hide an implicit dependence on the dimension d; more detailed comments on this are given in Appendix A.) Moreover, we shall see in the next section that if the data is known to come from a noisy teacher network, the guarantees of gradient descent improve from O(√OPT) + ε to OPT + ε.

In the remainder of this section we will prove Theorem 3.3. Our proof relies upon the following two auxiliary errors for the true risk F:

G(w) := E_{(x,y)∼D}[σ′(w⊤x)(σ(w⊤x) − σ(v⊤x))²],  H(w) := E_{(x,y)∼D}[(σ(w⊤x) − σ(v⊤x))²].   (3.3)

We will denote the corresponding empirical risks by Ĝ(w) and Ĥ(w). We first note that H trivially upper bounds F up to OPT: this follows by a simple application of Young’s inequality and, when E[y | x] = σ(v⊤x), by using iterated expectations.

Claim 3.4.

For any joint distribution D, any vector w, and any continuous activation function σ,

F(w) ≤ 2H(w) + 2·OPT.

If additionally we know that E[y | x] = σ(v⊤x), we have F(w) = H(w) + OPT.
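To spell out the first bound: by Young’s inequality, (a − c)² ≤ 2(a − b)² + 2(b − c)² for any reals a, b, c, and hence

F(w) = E[(σ(w⊤x) − y)²] ≤ 2E[(σ(w⊤x) − σ(v⊤x))²] + 2E[(σ(v⊤x) − y)²] = 2H(w) + 2·OPT.

When E[y | x] = σ(v⊤x), expanding the square instead produces the cross term E[(σ(w⊤x) − σ(v⊤x))(σ(v⊤x) − y)], which vanishes by iterated expectations, so that F(w) = H(w) + OPT exactly.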

To see that G can control H, it is easy to see that if σ′(z) ≥ γ > 0 for all z ∈ ℝ, then G(w) ≤ ε immediately implies H(w) ≤ γ⁻¹ε. However, the only typical activation function that is covered by such an assumption is the leaky ReLU. Fortunately, when σ satisfies Assumption 3.1, or when σ is ReLU and D_x satisfies Assumption 3.2, Lemma 3.5 below shows that G (up to constants) still upper bounds H. The proof is left for Appendix B.

Lemma 3.5.

If σ satisfies Assumption 3.1, ‖x‖ ≤ B_X a.s., and ‖w‖ ≤ W, then for γ = γ(R) corresponding to R = W B_X, G(w) ≤ ε implies H(w) ≤ γ⁻¹ε. If σ is ReLU and D_x satisfies Assumption 3.2 for some constants α, β > 0, and ‖x‖ ≤ B_X a.s. for some B_X > 0, then G(w) ≤ ε implies that H(w) ≤ Cε holds, where C depends only on α, β, and B_X.

We can now focus on showing that gradient descent finds a point where G is small. In Lemma 3.6 below, we show that Ĝ(w_t) is a natural quantity of the gradient descent algorithm that in a sense tells us how good of a direction the gradient is pointing at time t, and that it can be as small as O(√OPT) + ε. Our proof technique is similar to that of Kakade et al. (2011), who studied the GLMTron algorithm in the (non-agnostic) noisy teacher network setup.

Lemma 3.6.

Suppose that ‖x‖ ≤ B_X and |y| ≤ B_Y hold a.s. under the empirical distribution D̂. Suppose σ is non-decreasing and L-Lipschitz. Assume η ≤ (2L²B_X²)⁻¹. Gradient descent run with fixed step size η from initialization w₀ = 0 finds weights satisfying Ĝ(w_t) ≤ C(√(F̂(v)) + ε) within T = ‖v‖²/(2ηε) iterations, where C depends on L, B_X, B_Y, and ‖v‖, with ‖w_t − v‖ ≤ ‖v‖ for each t ≤ T.

Before beginning the proof, we first note the following simple fact, which allows us to connect terms that appear in the gradient to the squared loss.

Fact 3.7.

If σ is non-decreasing and L-Lipschitz, then for any a, b in the domain of σ,

(σ(a) − σ(b))² ≤ L(σ(a) − σ(b))(a − b).
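Indeed, since σ is non-decreasing, σ(a) − σ(b) and a − b have the same sign, and Lipschitzness gives |σ(a) − σ(b)| ≤ L|a − b|, so that

(σ(a) − σ(b))² = |σ(a) − σ(b)|·|σ(a) − σ(b)| ≤ L|a − b|·|σ(a) − σ(b)| = L(σ(a) − σ(b))(a − b).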

Proof of Lemma 3.6.

The proof comes from the following induction statement. We claim that for every t, either (a) Ĝ(w_s) ≤ C(√(F̂(v)) + ε) for some s ≤ t, or (b) ‖w_{t+1} − v‖² ≤ ‖w_t − v‖² − 2ηε holds. If this claim is true, then at every iteration of gradient descent, we either have already found a point with small Ĝ or we decrease the squared distance to v by at least 2ηε. Since ‖w₀ − v‖² = ‖v‖², this means there can be at most ‖v‖²/(2ηε) iterations until we reach a point with Ĝ(w_t) ≤ C(√(F̂(v)) + ε). This shows the induction statement implies the theorem.

We begin with the proof by supposing the induction hypothesis holds for t − 1, and want to consider the case t. If (a) holds, then we are done. So now consider the case that for every s ≤ t, we have Ĝ(w_s) > C(√(F̂(v)) + ε). Since (a) does not hold, (b) holds for each s < t, and so the monotone decrease of the distances implies

‖w_t − v‖ ≤ ‖w₀ − v‖ = ‖v‖.   (3.4)

We can therefore bound

⟨∇F̂(w_t), w_t − v⟩ = (1/n) Σ_i (σ(w_t⊤x_i) − y_i) σ′(w_t⊤x_i) (w_t − v)⊤x_i
 ≥ L⁻¹ Ĝ(w_t) − (1/n) Σ_i |σ(v⊤x_i) − y_i| σ′(w_t⊤x_i) |(w_t − v)⊤x_i|   (3.5)
 ≥ L⁻¹ Ĝ(w_t) − L B_X ‖v‖ √(F̂(v)).   (3.6)

In the first inequality, we have used Fact 3.7 and that σ′ ≥ 0 for the first term, after splitting y_i into σ(v⊤x_i) and y_i − σ(v⊤x_i). For the second term, we use Cauchy–Schwarz. The last inequality is a consequence of 3.4, Cauchy–Schwarz, and that ‖x_i‖ ≤ B_X and σ′ ≤ L. As for the gradient upper bound at w_t, we have

‖∇F̂(w_t)‖² ≤ 2L B_X² Ĝ(w_t) + 2L² B_X² F̂(v).   (3.7)

Here Young’s inequality splits the residual into σ(w_t⊤x_i) − σ(v⊤x_i) and σ(v⊤x_i) − y_i, and the Cauchy–Schwarz inequality together with σ′ ≤ L and ‖x_i‖ ≤ B_X bounds each piece. Putting 3.6 and 3.7 together with the identity ‖w_{t+1} − v‖² = ‖w_t − v‖² − 2η⟨∇F̂(w_t), w_t − v⟩ + η²‖∇F̂(w_t)‖², the choice of η ensures

‖w_{t+1} − v‖² ≤ ‖w_t − v‖² − ηL⁻¹Ĝ(w_t) + 2ηL B_X ‖v‖ √(F̂(v)) + η F̂(v) ≤ ‖w_t − v‖² − 2ηε.   (3.8)

The last line comes from the induction hypothesis that Ĝ(w_t) > C(√(F̂(v)) + ε) and the choice of C. This completes the proof. ∎

Since the auxiliary error Ĝ(w_t) is controlled by √(F̂(v)) up to an additive ε, we need to bound F̂(v), which we can do by demonstrating a bound on |F̂(v) − F(v)|. Since the marginals of D are bounded, Lemma 3.8 below shows that F̂(v) concentrates around OPT at rate n^{−1/2} by Hoeffding’s inequality; for completeness, the proof is given in Appendix E.

Lemma 3.8.

If ‖x‖ ≤ B_X and |y| ≤ B_Y a.s. under D_x and D_y respectively, and if σ is non-decreasing, then for any δ ∈ (0, 1), we have with probability at least 1 − δ,

F̂(v) ≤ OPT + C √(log(1/δ)/n),

where C depends on B_X, B_Y, L, and ‖v‖.
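A quick numerical illustration of this concentration, with an arbitrary bounded distribution and target v (all constants here are placeholder choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
relu = lambda z: np.maximum(z, 0.0)

def empirical_risk_at_v(n):
    # Bounded features (unit sphere) and bounded labels.
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = np.clip(relu(X @ v) + 0.2 * rng.standard_normal(n), -1.0, 1.0)
    return np.mean((relu(X @ v) - y) ** 2)

for n in [100, 1_000, 10_000, 100_000]:
    print(n, empirical_risk_at_v(n))
# The fluctuations of F_hat(v) around its mean shrink at the n^{-1/2}
# rate that Hoeffding's inequality gives for bounded random variables.
```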

The final ingredient to the proof is translating the bounds for the empirical risk to ones for the population risk. Since σ is bounded on the relevant domain and since we showed in Lemma 3.6 that ‖w_t − v‖ ≤ ‖v‖ throughout the gradient descent trajectory, we can use standard properties of Rademacher complexity to translate the training loss bound to one for the test loss. The proof for Lemma 3.9 can be found in Appendix E.

Lemma 3.9.

For a training set S, let Rad_S(F) denote the empirical Rademacher complexity of a class of functions F, and suppose σ is L-Lipschitz. Suppose ‖x‖ ≤ B_X a.s. Denote W := {w : ‖w‖ ≤ 2‖v‖} and

F_W := {x ↦ σ(w⊤x) : w ∈ W}.

Then Rad_S(F_W) ≤ 2 L B_X ‖v‖ n^{−1/2}.

With Lemmas 3.6, 3.8 and 3.9 in hand, the bound for the population risk follows in a straightforward manner.

Proof of Theorem 3.3.

By Lemma 3.6, there exists some t < T with ‖w_t − v‖ ≤ ‖v‖, and hence ‖w_t‖ ≤ 2‖v‖, such that Ĝ(w_t) ≤ C(√(F̂(v)) + ε). For σ satisfying Assumption 3.1, Lemmas 3.5 and 3.8 imply that with probability at least 1 − δ,

Ĥ(w_t) ≤ γ⁻¹ Ĝ(w_t) ≤ γ⁻¹ C (√OPT + 2ε)   (3.9)

once n is sufficiently large. Since ‖w_t‖ ≤ 2‖v‖ implies that σ(w_t⊤x) is bounded, standard results from Rademacher complexity imply (e.g. Theorem 26.5 of Shalev-Shwartz and Ben-David (2014)) that with probability at least 1 − δ,

H(w_t) ≤ Ĥ(w_t) + 2·Rad_S(ℓ ∘ F_W) + C′√(log(1/δ)/n),

where ℓ is the squared loss and F_W is the function class defined in Lemma 3.9. For the second term above, Lemma 3.9 and rescaling by the Lipschitz constant of the loss yields that

Rad_S(ℓ ∘ F_W) = O(n^{−1/2}).

This shows that H(w_t) ≤ O(√OPT) + ε. By Claim 3.4,

F(w_t) ≤ 2H(w_t) + 2·OPT = O(√OPT) + ε,

completing the proof when σ is strictly increasing.

When σ is ReLU, the proof has one technical difference: Lemma 3.5 must be applied at the population rather than the empirical level, since Assumption 3.2 concerns the population marginal. Although Lemma 3.9 applies to the loss function (σ(w⊤x) − σ(v⊤x))², the same results hold for the loss function σ′(w⊤x)(σ(w⊤x) − σ(v⊤x))² defining G, since for ReLU σ′(z) ∈ {0, 1} for a.e. z and so the loss is still Lipschitz over the class in Lemma 3.9. We thus have

G(w_t) ≤ Ĝ(w_t) + O(√(log(1/δ)/n)).   (3.10)

With this in hand, the proof is essentially identical: by Lemmas 3.6 and 3.8,

Ĝ(w_t) ≤ C(√OPT + 2ε)   (3.11)

for n sufficiently large, so that we have

G(w_t) ≤ C(√OPT + 3ε).   (3.12)

Since D_x satisfies Assumption 3.2 and ‖x‖ ≤ B_X a.s., Lemma 3.5 yields H(w_t) = O(√OPT) + ε. Then Claim 3.4 completes the proof. ∎

Remark 3.10.

An inspection of the proof of Theorem 3.3 shows that when σ satisfies Assumption 3.1, any initialization with ‖w₀‖ bounded by a universal constant will suffice. In particular, if we use isotropic Gaussian initialization for w₀, then by concentration of the chi-squared distribution the theorem holds with (exponentially) high probability over the random initialization. For ReLU, initialization at the origin greatly simplifies the proof, since Lemma 3.6 shows that ‖w_t − v‖ ≤ ‖v‖ for all t. When w₀ = 0, this implies that ‖w_t‖ ≤ 2‖v‖ throughout the trajectory of gradient descent, and thus allows for an easy application of Lemma 3.5. For isotropic Gaussian initialization, one can show that ‖w₀ − v‖ ≤ ‖v‖ holds with probability approaching 1/2 provided the initialization variance is sufficiently small (see e.g. Lemma 5.1 of Yehudai and Shamir (2020)). In this case, the theorem will hold with constant probability over the random initialization.

4 Noisy teacher network setting

We now assume the joint distribution D of (x, y) is given by a target neuron σ(v⊤x) (with ‖v‖ ≤ W) plus zero-mean, sub-Gaussian noise ξ:

y = σ(v⊤x) + ξ,  E[ξ | x] = 0.

We assume throughout this section that the noise is not identically zero; we deal with the realizable setting separately (and achieve improved sample complexity) in Appendix D. We note that this is precisely the setup of the generalized linear model with (inverse) link function σ. We further note that we only assume that E[ξ | x] = 0, i.e., the noise is not assumed to be independent of the features x, and thus the problem falls into the probabilistic concept learning model of Kearns and Schapire (1994).

With the additional structural assumption of a noisy teacher, we can improve the agnostic result from O(√OPT) + ε to exactly OPT + ε, as well as improve the order of the sample complexity. The key difference with the agnostic proof is that when trying to show the gradient points in a good direction as in 3.5, since we know E[ξ | x] = 0, the average of terms of the form ξ_i σ′(w⊤x_i)(w − v)⊤x_i will concentrate around zero provided the ‖x_i‖ are bounded. This allows for us to improve the lower bound 3.6 by replacing the √(F̂(v)) term with one that vanishes as n grows. The full proof of Theorem 4.1 is given in Appendix C.
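As a sanity check on this setting, the following sketch simulates a noisy teacher and runs the gradient_descent routine from the sketch in Section 3; the distribution, noise level, step size, and iteration count are all arbitrary choices rather than the ones from Theorem 4.1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 10

# Bounded features: rows uniform on the unit sphere.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Unit-norm teacher and a strictly increasing Lipschitz activation.
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_grad = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

# Noisy teacher labels: y = sigma(v^T x) + xi with E[xi | x] = 0.
y = sigmoid(X @ v) + 0.1 * rng.standard_normal(n)

iterates = gradient_descent(X, y, sigmoid, sigmoid_grad, eta=1.0, T=5000)
w = iterates[-1]
print("||w - v|| =", np.linalg.norm(w - v))  # should be small relative to ||v|| = 1
```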

Theorem 4.1.

Suppose D_x satisfies ‖x‖ ≤ B_X a.s. and that ‖v‖ ≤ W for some W > 0. Assume that the noise ξ is ς-sub-Gaussian. Assume gradient descent is initialized at w₀ = 0 and fix a step size η ≤ c(L²B_X²)⁻¹ for a sufficiently small absolute constant c > 0. If σ satisfies Assumption 3.1, let γ = γ(R) be the constant corresponding to R = 2W B_X. For any δ ∈ (0, 1), with probability at least 1 − δ, gradient descent run for T = poly(ε⁻¹, γ⁻¹, L, B_X, W, ς, log(1/δ)) iterations finds weights w_t, t < T, satisfying

F(w_t) ≤ OPT + ε,   (4.1)

with a sample complexity polynomial in the same quantities and with logarithmic dependence on the input dimension. When σ is ReLU, further assume that D_x satisfies Assumption 3.2 for constants α, β > 0. Then 4.1 holds with the dependence on γ⁻¹ replaced by a dependence on α, β, and B_X.

We first note that although the guarantee of Theorem 4.1 carries a logarithmic dependence on the input dimension, this dependence can be removed if we assume that the noise is bounded rather than sub-Gaussian; details for this are given in Appendix C. As mentioned previously, if we are in the realizable setting, i.e. y = σ(v⊤x), we can further improve the sample and runtime complexity by using online SGD and a martingale Bernstein inequality. For details on the realizable case, see Appendix D.

In comparison with recent literature, Kakade et al. (2011) proposed GLMTron to show learnability of the noisy teacher network for any non-decreasing and Lipschitz activation when the noise is bounded. (A close inspection of their proof shows that sub-Gaussian noise can be handled with the same norm sub-Gaussian concentration that we use for our results.) In GLMTron, updates take the form w_{t+1} = w_t − η(1/n)Σ_i(σ(w_t⊤x_i) − y_i)x_i, while in gradient descent, the updates take the form w_{t+1} = w_t − η(1/n)Σ_i(σ(w_t⊤x_i) − y_i)σ′(w_t⊤x_i)x_i. Intuitively, when the weights stay in a bounded region and σ is strictly increasing and Lipschitz, the derivative satisfies γ ≤ σ′(w⊤x) ≤ L, and so the additional σ′ factor should not substantially affect the algorithm. For ReLU this is more complicated, as the gradient could in the worst case be zero in a large region of the input space, preventing effective learnability using gradient-based optimization, as was demonstrated in the negative result of Yehudai and Shamir (2020). For this reason, a type of nondegeneracy condition like our Assumption 3.2 is natural for gradient descent on ReLUs.
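Schematically, the two updates differ only in the σ′ factor. A minimal side-by-side sketch (the step sizes are placeholders, and the factor 2 from differentiating the square can be absorbed into η):

```python
import numpy as np

def glmtron_step(w, X, y, sigma, eta=1.0):
    # GLMTron-style update (Kakade et al., 2011): no derivative factor.
    n = len(y)
    return w - eta * (1.0 / n) * X.T @ (sigma(X @ w) - y)

def gd_step(w, X, y, sigma, sigma_grad, eta=1.0):
    # Gradient descent on the empirical squared loss:
    # identical except for the extra sigma'(w^T x_i) factor.
    n = len(y)
    z = X @ w
    return w - eta * (2.0 / n) * X.T @ ((sigma(z) - y) * sigma_grad(z))
```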

In terms of other results for ReLU, recent work by Mukherjee and Muthukumar (2020) introduced another modified version of SGD, in which the σ′ factor in the update is replaced by an indicator that thresholds the residual against an upper bound for the noise term. Using this modified SGD, they showed learnability of the ReLU in the noisy teacher network setting with bounded noise under the nondegeneracy condition that a certain second-moment matrix of the features, restricted to the region where the target neuron is active, is positive definite. A similar assumption was used by Du et al. (2018) in the realizable setting.

Our GLM result is also comparable to recent work by Foster et al. (2018), where the authors provide a meta-algorithm for translating guarantees for ε-stationary points of the empirical risk to guarantees for the population risk under Polyak–Łojasiewicz-like (PL-like) conditions on the population risk, provided the algorithm can guarantee that the weights remain bounded (see their Proposition 3). By considering GLMs with bounded, strictly increasing, Lipschitz activations, they show the PL-type condition holds, and that any algorithm that can find a stationary point of a regularized empirical risk objective is guaranteed a population risk bound. In contrast, our result concretely shows that vanilla gradient descent learns the GLM, even in the ReLU setting.

5 Conclusion and remaining open problems

In this work, we showed that gradient descent can achieve population risk O(√OPT) + ε in the agnostic setting, and OPT + ε in the noisy teacher network and realizable settings, for the most common activation functions used in practice. Is it possible to show stronger results for gradient descent, or are there distributions for which gradient descent cannot learn better than O(√OPT) without further assumptions? This question remains open for neural networks with one or more hidden layers as well. Additionally, we focused on the regression problem with real-valued labels. Understanding the properties of gradient descent for the agnostic learning of halfspaces generated by single neurons remains an interesting open problem.

Appendix A Detailed comparisons with related work

Algorithm | Activations | Pop. risk | Distribution | Sample complexity
Halfspace reduction (Goel et al., 2019) | ReLU | O(OPT^{2/3}) + ε | standard Gaussian | …
Gradient Descent (This paper) | strictly increasing + Lipschitz | O(√OPT) + ε | bounded | …
Gradient Descent (This paper) | ReLU | O(√OPT) + ε | bounded + marginal spread | …

Table 1: Comparison of results in the agnostic setting
Algorithm | Activations | Distribution | Sample complexity
GLMTron (Kakade et al., 2011) | increasing + Lipschitz | bounded | …
Modified Stochastic Gradient Descent (Mukherjee and Muthukumar, 2020) | ReLU | bounded + subspace eigenvalue | …
Meta-algorithm (Foster et al., 2018) | strictly increasing + Lipschitz + Lipschitz | bounded | …
Gradient Descent (Mei et al., 2018) | strictly increasing + diff’ble + Lipschitz + Lipschitz + Lipschitz | centered + sub-Gaussian + … | …
Gradient Descent (This paper) | strictly increasing + Lipschitz | bounded | …
Gradient Descent (This paper) | ReLU | bounded + marginal spread | …

Table 2: Comparison of results in the noisy teacher network setting
Algorithm | Activations | Distribution | Complexity
Stochastic Gradient Descent (Du et al., 2018) | ReLU | bounded + subspace eigenvalue | …
Projected Regularized Gradient Descent (Soltanolkotabi, 2017) | ReLU | standard Gaussian | …
Population Gradient Descent (Yehudai and Shamir, 2020) | leaky ReLU | bounded + … | …
Population Gradient Descent (Yehudai and Shamir, 2020) | … + Lipschitz | bounded + marginal spread | …
Population Gradient Flow (Yehudai and Shamir, 2020) | ReLU | marginal spread + spherical symmetry | …
Stochastic Gradient Descent (Yehudai and Shamir, 2020) | … + Lipschitz | bounded + marginal spread | …
Population Gradient Descent + Stochastic Gradient Descent (This paper) | strictly increasing + Lipschitz | bounded | …
Population Gradient Descent + Stochastic Gradient Descent (This paper) | ReLU | bounded + marginal spread | …

Table 3: Comparison of results in the realizable setting

Here, we describe comparisons of our results to those in the literature and give detailed comments on the specific rates we achieve. In Table 1, we compare our agnostic learning result with that of Goel et al. (2019). We note the guarantees for the population risk, the marginal distributions over x for which the bounds hold, and the sample complexity required to reach the specified level of risk plus an additive ε. Our results in this setting come from Theorem 3.3. The Big-O notation hides constants that may depend on the parameters of the distribution or activation function, but does not hide explicit dependence on the dimension d. However, the parameters of the distribution itself may have implicit dependence on the dimension. In particular, for bounded distributions that satisfy ‖x‖ ≤ B_X a.s., the O(·) hides multiplicative factors that depend on B_X. This means that if B_X depends on d, so will our bounds. For non-ReLU activations, the worst-case activation functions under consideration in Assumption 3.1 (e.g. the sigmoid) can have γ(R) decaying exponentially in R, making the runtime and sample complexity exponential in B_X, in which case it is better to assume that B_X is a constant independent of the dimension.

In Table 2, we provide comparisons of our noisy teacher network (also known as the generalized linear model or the probabilistic concepts model) results. Our results in this setting come from Theorem 4.1. The complexity column here denotes the sample complexity required to reach population risk OPT + ε. The subspace eigenvalue assumption given by Mukherjee and Muthukumar (2020) is that a certain second-moment matrix of the features, restricted to the region where the target neuron is active, is positive definite. Of course, any result that holds for the agnostic setting also holds in the generalized linear model setting, but for all results we consider, the agnostic population risk guarantee is strictly worse than what is achieved in the noisy teacher network setting.

Finally, in Table 3, we provide comparisons with results in the realizable setting. (Our results in this setting are given in Theorem D.1 in Appendix D.) For G.D. and S.G.D., the complexity column denotes the sample complexity required to reach population risk ε. For G.D. or gradient flow on the population risk (‘Pop. G.D.’), it refers to the runtime complexity only, as there are no samples in this setting. For Du et al. (2018), the subspace eigenvalue assumption is that for any nonzero w and the target neuron v, the second-moment matrix of the features restricted to the region where both neurons are active has smallest eigenvalue bounded away from zero. This is a nondegeneracy assumption that is related to the marginal spread condition given in Assumption 3.2, in the sense that it allows one to show that G(w) is an upper bound for H(w) up to constants. Finally, we note that any result in the agnostic or noisy teacher network settings applies in the realizable setting as well.

Appendix B Proof of Lemma 3.5

To prove Lemma 3.5, we use the following result of Yehudai and Shamir (2020).

Lemma B.1 (Lemma B.1, Yehudai and Shamir).

Under Assumption 3.2, for any two nonzero vectors w, u ∈ ℝ^d, it holds that

E[𝟙(w⊤x > 0) 𝟙(u⊤x > 0) ((w − u)⊤x)²] ≥ c(α, β) ‖w − u‖²,

where c(α, β) > 0 depends only on the constants α, β in Assumption 3.2.

Proof of Lemma 3.5.

By assumption,

G(w) = E[σ′(w⊤x)(σ(w⊤x) − σ(v⊤x))²] ≤ ε.

We first consider the case when σ satisfies Assumption 3.1. Since the term in the expectation is nonnegative, restricting the integral to a smaller set only decreases its value, so that

ε ≥ E[𝟙(|w⊤x| ≤ R) σ′(w⊤x)(σ(w⊤x) − σ(v⊤x))²] ≥ γ E[𝟙(|w⊤x| ≤ R)(σ(w⊤x) − σ(v⊤x))²].   (B.1)

For ‖w‖ ≤ W and ‖x‖ ≤ B_X a.s., since |w⊤x| ≤ W B_X = R, the inclusion {‖x‖ ≤ B_X} ⊂ {|w⊤x| ≤ R} holds. We thus have ε ≥ γ H(w). Dividing both sides by γ completes this part of the proof.

For ReLU, denote the event A := {w⊤x > 0 and v⊤x > 0} and define