On the Sample Complexity of Two-Layer Networks: Lipschitz vs. Element-Wise Lipschitz Activation
We investigate the sample complexity of bounded two-layer neural networks under different activation functions. In particular, we consider the class ℋ = {x ↦ ⟨v, σ∘Wx + b⟩ : b ∈ ℝ^T, W ∈ ℝ^{T×d}, v ∈ ℝ^T}, where the spectral norm of W and the Euclidean norm of v are bounded by O(1), the Frobenius distance of W from its initialization is bounded by R > 0, and σ is a Lipschitz activation function. We prove that if σ is applied element-wise, then the sample complexity of ℋ is width-independent, and that this bound is tight. Moreover, we show that the element-wise property of σ is essential for a width-independent bound, in the sense that there exist non-element-wise Lipschitz activation functions whose sample complexity is provably width-dependent. For the upper bound, we use the recent approach to norm-based bounds called Approximate Description Length (ADL), introduced in arXiv:1910.05697. We further develop new techniques and tools for this approach, which we hope will prove useful in future work.
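To make the definition of ℋ concrete, here is a minimal NumPy sketch (not code from the paper) of one member of the class. The dimensions d and T, the radius R, the initialization W0, and the activations shown are illustrative assumptions; the sorting map is included only as an example of a Lipschitz but non-element-wise activation, not as the paper's construction.

```python
import numpy as np

# Hypothetical sizes: input dimension d, width T, Frobenius radius R.
d, T, R = 8, 512, 1.0

rng = np.random.default_rng(0)

# Hypothetical initialization W0, rescaled so its spectral norm is 1 (i.e. O(1)).
W0 = rng.normal(size=(T, d))
W0 /= np.linalg.norm(W0, 2)

# W stays within Frobenius distance R of its initialization W0.
delta = rng.normal(size=(T, d))
W = W0 + R * delta / np.linalg.norm(delta, "fro")

# v has Euclidean norm 1 (i.e. O(1)); b is a bias vector in R^T.
v = rng.normal(size=T)
v /= np.linalg.norm(v)
b = np.zeros(T)

def relu(z):
    # Element-wise 1-Lipschitz activation.
    return np.maximum(z, 0.0)

def sort_act(z):
    # Lipschitz but NOT element-wise activation (illustrative only).
    return np.sort(z)

def h(x, sigma=relu):
    """Evaluate x -> <v, sigma(W x) + b>, one member of the class H."""
    return float(v @ (sigma(W @ x) + b))

# Sanity-check the assumed norm constraints.
assert np.linalg.norm(W, 2) <= 1.0 + R + 1e-9        # spectral norm O(1)
assert np.linalg.norm(v) <= 1.0 + 1e-9                # Euclidean norm O(1)
assert np.linalg.norm(W - W0, "fro") <= R + 1e-9      # distance from init <= R

x = rng.normal(size=d)
print(h(x, relu), h(x, sort_act))
```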