A Study of Neural Training with Non-Gradient and Noise Assisted Gradient Methods

05/08/2020 · by Anirbit Mukherjee, et al. · Johns Hopkins University

In this work we demonstrate provable guarantees on the training of depth-2 neural networks in regimes not previously explored. (1) We start with a simple stochastic algorithm that can train a ReLU gate in the realizable setting with significantly milder conditions on the data distribution than previous results. Leveraging some additional distributional assumptions we also show near-optimal guarantees for training a ReLU gate when an adversary is allowed to corrupt the true labels. (2) Next we analyze the behaviour of noise-assisted gradient descent on a ReLU gate in the realizable setting. While making no further distributional assumptions, we locate a ball centered at the origin such that all the iterates remain inside it with high probability. (3) Lastly we demonstrate a non-gradient iterative algorithm for which we give near-optimal guarantees for training a class of depth-2 neural networks in the presence of an adversary who is additively corrupting the true labels. This analysis brings to light the advantage of having a large width for the network while defending against an adversary. We demonstrate that, faced with data-poisoning attacks of the kind we instantiate, for our chosen class of nets the accuracy achieved by the algorithm in recovering the ground-truth parameters scales inversely with the width.


I A review of provable neural training

In this paper we prove results about the trainability of certain depth-2 neural nets under more general settings than previously analyzed. Specialized to neural nets, learning theory seeks to solve the following function optimization/risk minimization problem,

\min_{f \in \mathcal{F}} \; \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \left[ \ell\big(f(\mathbf{x}), y\big) \right]     (1)

where $\ell$ is some lower-bounded non-negative loss function, the members of $\mathcal{F}$ are continuous piecewise linear functions representable by some chosen neural net architecture, and we only have sample access to the distribution $\mathcal{D}$. This reduces to the empirical risk minimization question when $\mathcal{D}$ is a uniform distribution on a finite set of points.
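For concreteness, here is a minimal LaTeX rendering of that empirical-risk specialization, assuming a finite training set $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$; the set $S$ and the squared-loss instance are only illustrative and not notation from the paper:

\min_{f \in \mathcal{F}} \; \hat{L}_S(f) \;:=\; \frac{1}{m} \sum_{i=1}^{m} \ell\big(f(\mathbf{x}_i), y_i\big),
\qquad \text{e.g. with } \ell(\hat{y}, y) = (\hat{y} - y)^2 .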

To the best of our knowledge, given the current state of the art in deep learning, either of these two optimization problems is typically solvable only in one of the following two mutually exclusive scenarios:

Scenario 1: Semi-Realizable Data, i.e. the data comes as tuples $(\mathbf{x}, y)$ with $y$ being the noise-corrupted output of a net (of known architecture) when given $\mathbf{x}$ as an input. Scenario 2: Semi-Agnostic Data, i.e. the data comes as tuples $(\mathbf{x}, y)$ with no obvious functional relationship between $\mathbf{x}$ and $y$, but there could be geometrical or statistical assumptions about the $\mathbf{x}$ and the $y$.


We note that it is not very interesting to work in the fully agnostic setting, since there training even a single ReLU gate can be SPN-hard (i.e. as hard as learning sparse parities with noise), as shown in [goel2016reliably]. On the other hand, the simplifications that happen for infinitely large networks have been discussed since [neal1996priors], and this theme has had a recent resurgence in works like [chizat2018global, jacot2018neural]. Eventually this led to an explosion of literature obtaining linear-time training guarantees for various kinds of neural nets when their width is a high-degree polynomial in the training set size, inverse accuracy and inverse confidence parameters (a somewhat unrealistic regime), [lee2017deep, wu2019global, du2018gradient, su2019learning, kawaguchi2019gradient, huang2019dynamics, allen2019convergenceDNN, allen2019learning, allen2019convergenceRNN, du2018power, zou2018stochastic, zou2019improved, arora2019exact, arora2019harnessing, li2019enhanced, arora2019fine]. The essential proximity of this regime to kernel methods has been studied separately in works like [allen2019can, wei2019regularization].


Even in the wake of this progress, it remains unclear how any of this can help establish rigorous guarantees for smaller neural networks or, more pertinently, for constant-size neural nets, which is a regime closer to what is implemented in the real world. Thus motivated, we can summarize what is open about training depth-2 nets into the following two questions,


  1. Question 1: Can any algorithm train a ReLU gate to $\epsilon$ accuracy in $\mathrm{poly}(1/\epsilon)$ time using neither symmetry nor compact-support assumptions on the data distribution?

    • In particular: Can a single ReLU gate be trained using (S)GD with (a) random/arbitrary initialization and (b) a weakly constrained data distribution - at least allowing it to be non-Gaussian and preferably non-compactly supported?

  2. Question 2: Can a neural net training algorithm work with the following naturally desirable properties being simultaneously true?

    1. Nets of depth 2 with a constant/small number of gates.

    2. The training data instances (and maybe also the noise) would have non-Gaussian non-compactly supported distributions.

    3. Fewer structural assumptions on the weight matrices than being of the single filter convolutional type.

    4. $\epsilon$-approximate answers be obtainable in at most $\mathrm{poly}(1/\epsilon)$ time.

II A summary of our results

We make progress on some of the above fronts by drawing inspiration from two distinct streams of literature, often generalizing and blending techniques from them. The first of them are the different avatars of the iterative stochastic non-gradient "Tron" algorithms analyzed in the past, e.g. [rosenblatt1958perceptron, Pal1992MultilayerPF, freund1999large, kakade2011efficient, klivans2017learning, goel2017learning, goel2018learning]. The second kind of historical precedent that motivates us are the different works which have shown how some of the desired theorems about gradient descent can be proven if designed noise is injected into the algorithm in judicious ways, [raginsky2017non, xu2018global, zhang2017hitting, durmus2019analysis, lee2019online, jin2018local, mou2018generalization, li2019generalization].

We will be working exclusively with the function class of neural nets whose activation function is the $\mathrm{ReLU}$, so that a single gate maps $\mathbf{x} \mapsto \max\{0, \mathbf{w}^\top \mathbf{x}\}$ for $\mathbf{w}$ being its weight vector. Hence for such nets the corresponding empirical or population risk is neither convex nor smooth in how it depends on the weights. Thus, to the best of our knowledge, none of the convergence results among the provable noise-assisted algorithms cited above can be directly applied to our neural nets, because those proofs crucially leverage either convexity or very strong smoothness assumptions on the optimization objective.
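As an illustration of the non-convexity and non-smoothness just mentioned, here is a small numpy sketch (ours, not from the paper) that evaluates the empirical squared loss of a ReLU gate and one valid subgradient at a given weight vector; the ReLU's kink makes the loss piecewise quadratic in $\mathbf{w}$ with non-differentiable boundaries along the hyperplanes where some $\mathbf{w}^\top \mathbf{x}_i = 0$.

import numpy as np

def relu_gate_loss_and_subgrad(w, X, y):
    """Empirical squared loss of a ReLU gate and one valid subgradient.

    X: (m, n) data matrix, y: (m,) labels, w: (n,) weight vector.
    The loss (1/m) * sum_i (ReLU(w.x_i) - y_i)^2 is piecewise quadratic in w:
    non-convex and non-smooth across the hyperplanes {w : w.x_i = 0}.
    """
    pre = X @ w                      # pre-activations w.x_i
    act = np.maximum(pre, 0.0)       # ReLU outputs
    res = act - y                    # residuals
    loss = np.mean(res ** 2)
    # Subgradient: the ReLU contributes an indicator 1{w.x_i > 0}.
    grad = (2.0 / len(y)) * (X * ((pre > 0).astype(float) * res)[:, None]).sum(axis=0)
    return loss, grad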

In our work we focus on the following class of depth-2 nets,

Definition 1 (Single Filter Neural Nets of Depth 2).

Given a set of $k$ matrices $\{\mathbf{A}_1, \ldots, \mathbf{A}_k\}$ with each $\mathbf{A}_i \in \mathbb{R}^{r \times n}$, and an activation function $\sigma : \mathbb{R} \to \mathbb{R}$, we call the following depth-$2$, width-$k$ neural net to be a single filter neural net defined by the matrices $\{\mathbf{A}_i\}$,

$\mathbb{R}^n \ni \mathbf{x} \mapsto f_{\mathbf{w}}(\mathbf{x}) = \frac{1}{k} \sum_{i=1}^{k} \sigma\big(\mathbf{w}^\top \mathbf{A}_i \mathbf{x}\big) \in \mathbb{R},$

where $\mathbf{w} \in \mathbb{R}^r$ is the shared filter and $\sigma$ is the Leaky-$\mathrm{ReLU}$ which maps as $\sigma(y) = y\,\mathbb{1}_{\{y \geq 0\}} + \alpha\, y\,\mathbb{1}_{\{y < 0\}}$ for some $\alpha \in [0, 1]$.

Note that the above class of nets includes any single ReLU gate (for $k = 1$, $\mathbf{A}_1 = \mathbf{I}$ and $\alpha = 0$) and it also includes any depth-$2$ convolutional neural net with a single filter, obtained by setting the $\mathbf{A}_i$ to be $0/1$ matrices such that each row has exactly one $1$ and each column has at most one $1$.
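A minimal numpy sketch of the single filter net of Definition 1 as reconstructed above; the $1/k$ averaging and the exact Leaky-ReLU parameterization are our reading of the definition and are only illustrative.

import numpy as np

def leaky_relu(z, alpha):
    # Leaky-ReLU: identity on the positive part, slope alpha on the negative part.
    return np.where(z >= 0, z, alpha * z)

def single_filter_net(w, A_list, x, alpha=0.0):
    """Depth-2, width-k single filter net: f(x) = (1/k) * sum_i sigma(w^T A_i x).

    w      : (r,) shared filter
    A_list : list of k matrices, each of shape (r, n)
    x      : (n,) input
    alpha  : Leaky-ReLU slope; alpha = 0 recovers a ReLU gate when k = 1 and A_1 = I.
    """
    pre = np.array([w @ (A @ x) for A in A_list])
    return float(np.mean(leaky_relu(pre, alpha)))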

Our results in this work can be said to be of two types:  

II-A Guarantees without the assumption of parity symmetry on the data distribution


We show three kinds of results in this category. In Section III we first give a very simple iterative stochastic algorithm to recover the underlying parameter $\mathbf{w}_*$ of a ReLU gate when the realizable data, which is allowed to be sampled online, is of the form $(\mathbf{x}, \mathrm{ReLU}(\mathbf{w}_*^\top \mathbf{x}))$. The distributional condition required is very mild and essentially just captures the intuition that enough of our samples $\mathbf{x}$ are such that $\mathbf{w}_*^\top \mathbf{x} > 0$.

To the best of our knowledge, all previous attempts at this problem in the exactly realizable case have solved it only for specific data distributions like the Gaussian distribution [soltanolkotabi2017learning, kalan2019fitting]. And results like [goel2018learning] have contained as special cases a solution to this problem, but only under symmetry assumptions on the distribution. In contrast, our assumptions on the distribution are significantly milder than in all previous attempts.

More importantly, by making some extra distributional assumptions, the theorem in Section III also encompasses the case when, during training, the true labels are additively distorted by the oracle by a bounded perturbation. In this case we show that the accuracy of the algorithm in recovering $\mathbf{w}_*$ is nearly optimal. To the best of our knowledge this is the first guarantee on training a ReLU gate in such an adversarial setting.

In Section LABEL:sec:GDNoiseReLU we give a first-of-its-kind analysis of gradient descent on a ReLU gate, albeit when assisted with the injection of a certain kind of noise. We assume that the labels in the data are realizable but we make no assumptions about the distribution on the domain. We make progress by showing that such noise-assisted GD in this situation has a diffusive behaviour about the global minima, i.e. after $T$ steps of the algorithm starting from anywhere, w.h.p. all the iterates of the algorithm remain within a bounded distance of the global minima of the function. The key idea here is a coupling argument which shows that from the iterates of noise-injected gradient descent on the squared loss of a ReLU gate one can construct a discrete submartingale.

Remark.

We would like to emphasize to the reader that in such a distribution-free regime as above, there are no algorithms known yet which can provably train a ReLU gate. Also note that the result is parametric in the magnitude of the added noise, and hence the algorithm can be made arbitrarily close to a pure gradient descent.
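Since the precise noise model and step-size schedule of Section LABEL:sec:GDNoiseReLU are not reproduced here, the following is only a generic numpy sketch of the kind of noise-assisted full-batch gradient descent being discussed: zero-mean Gaussian noise of a chosen scale is added to each gradient step on the squared loss of a ReLU gate, and all iterates are recorded so their distances from a reference point can be inspected.

import numpy as np

def noise_assisted_gd_relu(X, y, T, eta, noise_scale, rng=None):
    """Gradient descent on the ReLU-gate squared loss with injected noise.

    Illustrative sketch only: the paper's exact noise distribution and
    step-size schedule are not reproduced; we simply add zero-mean Gaussian
    noise of scale `noise_scale` to every full-batch gradient step.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = X.shape
    w = rng.normal(size=n)                      # arbitrary initialization
    iterates = [w.copy()]
    for _ in range(T):
        pre = X @ w
        res = np.maximum(pre, 0.0) - y
        grad = (2.0 / m) * X.T @ ((pre > 0) * res)
        w = w - eta * grad + noise_scale * rng.normal(size=n)  # noise injection
        iterates.append(w.copy())
    return np.array(iterates)                   # e.g. inspect max distance to a reference point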


In the short Section LABEL:sec:GLMTron we do a quick re-analysis of a known algorithm called GLM-Tron, under more general conditions than previously considered, to show how well it can do (empirical) risk minimization on any Lipschitz gate (in particular a ReLU gate) in the noisily realizable setting, while no assumptions are made on the distribution of the noise beyond its boundedness - hence the noise can be adversarial. We also point out how the result can be improved under some assumptions making the noise more benign. Note that, in contrast to the training result in Section III which used a stochastic algorithm, here we use full-batch iterative updates to gain these extra abilities to deal with more general gates while having essentially no distributional assumptions on the training data.
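For reference, the classical full-batch GLM-Tron update of [kakade2011efficient], which is the iteration being re-analyzed in that section, can be sketched in numpy as follows; the step scaling eta is our addition, the classical algorithm uses eta = 1 and starts from the origin.

import numpy as np

def glmtron(X, y, sigma, T, eta=1.0):
    """Full-batch GLM-Tron iterations (Kakade et al., 2011 style update).

    X: (m, n) training inputs, y: (m,) possibly noise-corrupted labels,
    sigma: elementwise activation, e.g. sigma = lambda z: np.maximum(z, 0.0),
    eta: step scaling, equal to 1.0 in the classical statement of the algorithm.
    """
    m, n = X.shape
    w = np.zeros(n)                              # GLM-Tron starts from the origin
    for _ in range(T):
        w = w + (eta / m) * X.T @ (y - sigma(X @ w))
    return w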

II-B Guarantees with the assumption of parity symmetry on the data distribution


In Section LABEL:sec:NeuroTronD2FBSingle we allow ourselves a symmetry assumption on the data in order to show a generalization of the GLM-Tron algorithm (which we call Neuro-Tron) that can provably do approximate parameter recovery for our single filter nets of depth 2 with Leaky-ReLU activation (in particular with ReLU activations). This multi-gate form of GLM-Tron turns out to be significantly more complicated to analyze mathematically, all the more so because we allow for adversarial label corruption. Working in a setting that allows for attacks on the labels brings to light the advantages of a large network width and shows how width helps the training guarantees. We give a data-dependent threshold value of the width such that, if we consider nets wider than this constant value, then we can get better accuracies of recovery while also allowing for non-trivial attacks on the labels.

We would like to emphasize the following two salient points here. Firstly, it is to be noted that while we leverage parity symmetry of the training data in the above, we are also able to write the proof for finite-width depth-2 nets of the type that we focus on. Along the way, the theorem in Section LABEL:sec:NeuroTronD2FBSingle also shows that the ground-truth parameters of the finitely large nets that we consider can be recovered using a finite training set, with only a symmetry condition being imposed on it. We are not aware of this result being implied by any previous result.

Secondly, we would like to emphasize to the reader that [gao2019convergence] is the only previous result we are aware of about provable training of a neural net under any kind of model of adversarial attack. But that result, as it stands, only applies when the net is asymptotically large, i.e. in the NTK regime. In contrast, for the class of nets that we consider, we give an accuracy-width trade-off for all widths above a constant threshold value, showing that once our adversary is fixed, arbitrarily good accuracy of recovery can be achieved by using appropriately wide nets. Also, as opposed to that previous result, the adversarial setups in Sections III, LABEL:sec:GLMTron and LABEL:sec:NeuroTronD2FBSingle give guarantees while training with a data-poisoning adversary [wang2018data, zhang2019online, koh2018stronger] - the attack is allowed to happen on the training data in an online fashion. To the best of our knowledge, this work (and our companion paper [AnirbitRam2020Neuro1]) are the first such guarantees about training a net in the presence of a data-poisoning attack.
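To make the attack model concrete, the following is a small illustrative sketch (our own notation: the corruption bound theta, the adversary callback and the net callback are hypothetical names, not from the paper) of an online label-poisoning oracle that returns, for each queried input, the clean label additively corrupted by an adversarially chosen but bounded amount.

import numpy as np

def poisoned_label_oracle(x, w_star, theta, adversary, net=None):
    """Online data-poisoning oracle with bounded additive label corruption.

    net(w_star, x) computes the clean label (a single ReLU gate if net is None);
    adversary(x, clean) may return any perturbation, which is clipped to
    [-theta, theta] so the returned label is always within theta of the clean one.
    """
    clean = max(float(np.dot(w_star, x)), 0.0) if net is None else net(w_star, x)
    xi = np.clip(adversary(x, clean), -theta, theta)   # bounded adversarial noise
    return clean + xi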

III Learning a ReLU gate in the realizable setting and with adversarial attacks on labels

Given a distribution $\mathcal{D}$ on $\mathbb{R}^n$ from which $\mathbf{x}$ is sampled, suppose the corresponding true labels are generated as $y = \mathrm{ReLU}(\mathbf{w}_*^\top \mathbf{x})$ for some unknown $\mathbf{w}_* \in \mathbb{R}^n$. We assume that we have sampling access to $\mathcal{D}$ and an adversarial oracle that returns labels corrupted by additive adversarial noise of magnitude at most $\theta_*$. To learn the true labeling function in this noisily realizable setting we try to solve the following optimization problem,

$\min_{\mathbf{w} \in \mathbb{R}^n} \; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \Big[ \big( \mathrm{ReLU}(\mathbf{w}^\top \mathbf{x}) - y \big)^2 \Big].$

In contrast to previous work, we show that the simple algorithm given below solves this learning problem by leveraging the intuition that if we see enough labels with $y > 0$ (equivalently, samples with $\mathbf{w}_*^\top \mathbf{x} > 0$), then solving the linear regression problem on this subset of samples gives a $\mathbf{w}$ which is close to $\mathbf{w}_*$. In the situation with adversarial corruption ($\theta_* > 0$) we show in subsection LABEL:opt:dadush that our recovery guarantee is nearly optimal, and in the realizable case ($\theta_* = 0$) our setup learns the true filter $\mathbf{w}_*$ to arbitrary accuracy under much milder distributional constraints than previous such results that we are aware of.

1:Input: Sampling access to a distribution $\mathcal{D}$ on $\mathbb{R}^n$.
2:Input: Oracle access to labels $y \in \mathbb{R}$ when queried with some $\mathbf{x} \in \mathbb{R}^n$.
3:Input: An arbitrarily chosen starting point $\mathbf{w}_1 \in \mathbb{R}^n$ and a constant step size $\eta > 0$.
4:for $t = 1, \ldots, T$ do
5:     Sample $\mathbf{x}_t \sim \mathcal{D}$ and query the (adversarial) oracle with it.
6:     The oracle replies back with $y_t$ such that $y_t = \xi_t + \mathrm{ReLU}(\mathbf{w}_*^\top \mathbf{x}_t)$ with $|\xi_t| \leq \theta_*$.
7:     Form the gradient (proxy), $\mathbf{g}_t := -\mathbb{1}_{\{y_t > 0\}} \big( y_t - \mathbf{w}_t^\top \mathbf{x}_t \big) \mathbf{x}_t$.
8:     $\mathbf{w}_{t+1} := \mathbf{w}_t - \eta\, \mathbf{g}_t$.
9:end for
Algorithm 1 Modified SGD for a ReLU gate with an adversarial oracle
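A minimal numpy rendering of Algorithm 1 as reconstructed above; the indicator-thresholded gradient proxy in line 7 is our reading of the garbled listing and should be checked against the paper's statement.

import numpy as np

def modified_sgd_relu(sample_x, oracle_y, n, T, eta, rng=None):
    """Modified SGD for a ReLU gate with a (possibly adversarial) label oracle.

    sample_x(): draws one x from the data distribution on R^n,
    oracle_y(x): returns a label y = xi + ReLU(w_*. x) with |xi| bounded,
    eta: constant step size; the iterate only moves on samples with y > 0,
    matching the "linear regression on the positively-labelled subset" intuition.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = rng.normal(size=n)                     # arbitrary starting point
    for _ in range(T):
        x = sample_x()
        y = oracle_y(x)
        g = -float(y > 0) * (y - np.dot(w, x)) * x   # gradient proxy (our reconstruction)
        w = w - eta * g
    return w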
Theorem III.1.

Case I: Realizable setting, $\theta_* = 0$.

Suppose (a) the mean and the covariance matrix of $\mathcal{D}$ exist, and (b) $\mathcal{D}$ is such that the matrix $\mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\big[\mathbb{1}_{\{\mathbf{w}_*^\top \mathbf{x} > 0\}}\, \mathbf{x} \mathbf{x}^\top\big]$ is positive definite, and let $\lambda_{\min}$ and $\lambda_{\max}$ denote its smallest and largest eigenvalues. If Algorithm 1 is run with a suitably small constant step size $\eta$ starting from $\mathbf{w}_1$, then $\mathbb{E}\big[\|\mathbf{w}_{T+1} - \mathbf{w}_*\|^2\big]$ decays geometrically with $T$.

Case II: With bounded adversarial corruption of the true labels, $\theta_* > 0$.

Suppose additionally that the distribution $\mathcal{D}$ and the true filter $\mathbf{w}_*$ are such that (a) the expectations entering the step-size choice below are finite and nonzero, and (b) the distribution satisfies the condition in (2) below.

Then if we choose the step size in Algorithm 1 appropriately, there exists a threshold such that, for all noise bounds $\theta_*$, accuracies $\epsilon$, confidences $\delta$, and distributions $\mathcal{D}$ such that,

(2)