Learning a Single Neuron with Adversarial Label Noise via Gradient Descent
We study the fundamental problem of learning a single neuron, i.e., a function of the form 𝐱↦σ(𝐰·𝐱) for monotone activations σ:ℝ→ℝ, with respect to the L_2^2-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution D on (𝐱, y)∈ℝ^d×ℝ such that there exists 𝐰^∗∈ℝ^d achieving F(𝐰^∗)=ϵ, where F(𝐰)=𝐄_(𝐱,y)∼D[(σ(𝐰·𝐱)-y)^2]. The goal of the learner is to output a hypothesis vector 𝐰 such that F(𝐰)≤Cϵ with high probability, where C>1 is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions. Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries: For the logistic activation, we obtain the first polynomial-time constant-factor approximation (even under the Gaussian distribution). Our algorithm has sample complexity Õ(d/ϵ), which is tight within polylogarithmic factors. For the ReLU activation, we give an efficient algorithm with sample complexity Õ(d·polylog(1/ϵ)). Prior to our work, the best known constant-factor approximate learner had sample complexity Ω̃(d/ϵ). In both of these settings, our algorithms are simple, performing gradient descent on the (regularized) L_2^2-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.
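The algorithm described above is simple to state. Below is a minimal sketch of plain gradient descent on the empirical L_2^2-loss for a ReLU neuron, written in Python with NumPy. This is illustrative only: the regularization, step-size schedule, and initialization analyzed in the paper are omitted, and the concrete values of `lr` and `steps` are assumptions made for the sketch, not the paper's tuned parameters.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def squared_loss(w, X, y):
    # Empirical L_2^2-loss: average of (sigma(w . x_i) - y_i)^2 over the sample.
    return np.mean((relu(X @ w) - y) ** 2)

def gd_single_neuron(X, y, w0, lr=0.05, steps=1000):
    """Plain gradient descent on the (non-convex) empirical squared loss
    for a ReLU neuron. Step size and iteration count are illustrative
    choices, not the tuned parameters from the paper."""
    n = X.shape[0]
    w = w0.astype(float).copy()
    for _ in range(steps):
        z = X @ w
        # (Sub)gradient of the loss: (2/n) * sum_i (relu(z_i) - y_i) * 1{z_i > 0} * x_i
        grad = 2.0 * (((relu(z) - y) * (z > 0.0)) @ X) / n
        w -= lr * grad
    return w
```

On clean realizable data with isotropic Gaussian 𝐱, a positively correlated starting point such as the moment estimate (1/n)Σ_i y_i 𝐱_i (which concentrates around 𝐰^∗/2 for the ReLU) lets this iteration drive the empirical loss to near zero; with adversarial label noise one can only hope for F(𝐰)≤Cϵ, which is what the paper's structural results on stationary points guarantee.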