The Reflectron: Exploiting geometry for learning generalized linear models

06/15/2020 · Nicholas M. Boffi, et al. · MIT · Harvard University

Generalized linear models (GLMs) extend linear regression by generating the dependent variables through a nonlinear link function of a linear predictor in a Reproducing Kernel Hilbert Space. Despite nonconvexity of the underlying optimization problem, the GLM-tron algorithm of Kakade et al. (2011) provably learns GLMs with guarantees of computational and statistical efficiency. We present an extension of the GLM-tron to a mirror descent or natural gradient-like setting, which we call the Reflectron. The Reflectron enjoys the same statistical guarantees as the GLM-tron for any choice of the convex potential function ψ used to define mirror descent. Central to our algorithm, ψ can be chosen to implicitly regularize the learned model when there are multiple hypotheses consistent with the data. Our results extend to the case of multiple outputs with or without weight sharing. We perform our analysis in continuous-time, leading to simple and intuitive derivations, with discrete-time implementations obtained by discretization of the continuous-time dynamics. We supplement our theoretical analysis with simulations on real and synthetic datasets demonstrating the validity of our theoretical results.


1 Introduction

Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables are assumed to be given as a known nonlinear “link” function of a linear predictor of the covariates, $\mathbb{E}[y \mid \mathbf{x}] = u(\langle \mathbf{w}, \mathbf{x}\rangle)$, for some fixed vector of parameters $\mathbf{w}$. GLMs are readily kernelizable, which captures the more flexible setting $\mathbb{E}[y \mid \mathbf{x}] = u(\langle \mathbf{w}, \phi(\mathbf{x})\rangle)$, where $\phi$ is a feature map in a Reproducing Kernel Hilbert Space (RKHS) and $\langle\cdot,\cdot\rangle$ denotes the RKHS inner product. A prominent example of a GLM that arises in practice is logistic regression, which has wide-reaching applications in the natural, social, and medical sciences (Sur and Candès, 2019). Similarly, an immediate example of a kernel-based GLM is kernel logistic regression. Extensive details on GLMs can be found in the standard reference (McCullagh and Nelder, 1989).

The GLM-tron of Kakade et al. (2011) is the first computationally and statistically efficient algorithm for learning GLMs. Inspired by the Isotron of Kalai and Sastry (2009), it is a simple and intuitive Perceptron-like algorithm applicable to learning arbitrary GLMs with a nondecreasing and Lipschitz link function. In this work, we revisit the GLM-tron from a new perspective, leveraging recent developments in continuous-time optimization and adaptive control theory (Boffi and Slotine, 2019). We consider the continuous-time limit of the GLM-tron, and generalize the resulting continuous-time dynamics to a mirror descent-like (Beck and Teboulle, 2003; Krichene et al., 2015) setting, which we call the Reflectron. By their equivalence in continuous-time, our analysis also applies to natural gradient variants of the GLM-tron (Amari, 1998; Pascanu and Bengio, 2013). We prove non-asymptotic generalization error bounds for the resulting family of continuous-time dynamics – parameterized by the choice of potential function $\psi$ – and we further prove convergence rates in the realizable setting. Our continuous-time Reflectron immediately gives rise to a wealth of discrete-time algorithms by choice of discretization method, allowing us to leverage the vast body of work in numerical analysis (Butcher, 2001) and the widespread availability of off-the-shelf black-box ordinary differential equation solvers.

We connect the Reflectron algorithm with the growing body of literature on the implicit bias of optimization algorithms by applying a recent continuous-time limit (Boffi and Slotine, 2019) of a simple proof methodology for analyzing the implicit regularization of stochastic mirror descent (Azizan et al., 2019; Azizan and Hassibi, 2019). We show that, in the realizable setting, the choice of potential function $\psi$ implicitly biases the learned parameters to minimize $\psi$ over the set of interpolating parameters. We extend our results to a vector-valued setting which allows for weight sharing between output components, extending the Euclidean variant with independent weights first considered by Foster et al. (2020). We prove that convergence, implicit regularization, and similar generalization error bounds hold in this setting.

1.1 Related work and significance

The GLM-tron has recently seen impressive applications in both statistical learning and learning-based control theory. The original work applied the GLM-tron to efficiently learn Single Index Models (SIMs) (Kakade et al., 2011). A recent extension (the BregmanTron) uses Bregman divergences to obtain improved guarantees for learning SIMs, though their use of Bregman divergences is different from ours (Nock and Menon, 2020). Foster et al. (2020) utilized the GLM-tron to develop an adaptive control law for stochastic, nonlinear, and discrete-time dynamical systems. Goel and Klivans (2017) use the kernelized GLM-tron, the Alphatron, to provably learn two-hidden-layer neural networks, while Goel et al. (2018) generalize the Alphatron to the Convotron for provable learning of one-hidden-layer convolutional neural networks. Orthogonally, but much like Foster et al. (2020), GLM-tron-like update laws have been developed in the adaptive control literature (Tyukin et al., 2007), along with mirror descent and momentum-like variants (Boffi and Slotine, 2019), where they can be used for provable control of unknown and nonlinearly parameterized dynamical systems. Our work extends this line of research by allowing for the incorporation of local geometry into the GLM-tron update for regularization of the learned parameters.

Similarly, continuous-time approaches in machine learning and optimization have become increasingly fruitful tools for analysis. Su et al. (2016) derive a continuous-time limit of Nesterov’s celebrated accelerated gradient method (Nesterov, 1983), and show that this limit enables intuitive proofs via Lyapunov stability theory. Krichene et al. (2015) perform a similar analysis for mirror descent, while Zhang et al. (2018) show that using standard Runge-Kutta integrators on the second-order dynamics of Su et al. (2016) preserves acceleration. Lee et al. (2016) show via dynamical systems theory that saddle points are almost surely avoided by gradient descent with a random initialization. Boffi and Slotine (2020) use a continuous-time view of distributed stochastic gradient descent methods to analyze the effect of distributed coupling on SGD noise. Remarkably, Wibisono et al. (2016), Betancourt et al. (2018), and Wilson et al. (2016) show that many accelerated optimization algorithms can be generated by discretizing the Euler-Lagrange equations of a certain functional known as the Bregman Lagrangian. For deep learning, Chen et al. (2018) show that residual networks can be interpreted as a forward-Euler discretization of a continuous-time dynamical system, and use higher-order integrators to arrive at alternative architectures. Our work continues this promising line of recent work, and highlights that continuous-time analysis offers clean and intuitive proofs that can later be discretized for guarantees on discrete-time algorithms.

As exemplified by the field of deep learning, modern machine learning frequently takes place in a high-dimensional regime with more parameters than examples. It is now well-known that deep networks will interpolate noisy data, yet exhibit low generalization error when trained on meaningful data (Zhang et al., 2016). Defying classical statistical wisdom, an explanation for this apparent paradox has been given in the implicit bias of optimization algorithms and the double-descent curve (Belkin et al., 2019a). The notion of implicit bias captures the preference of an optimization algorithm to converge to a particular kind of interpolating solution – such as a minimum norm solution – when many options exist. Surprisingly, similar “harmless” or “benign” interpolation phenomena have been observed even in much simpler systems such as overparameterized linear regression (Bartlett et al., 2019; Muthukumar et al., 2019; Hastie et al., 2019) and random feature models (Belkin et al., 2019b; Mei and Montanari, 2019). Understanding implicit bias has thus become an important area of research, with applications ranging from modern deep learning to pure statistics.

Implicit bias has been characterized for separable classification problems (Soudry et al., 2018; Nacson et al., 2018), regression problems using mirror descent (Gunasekar et al., 2018b), and multilayer models (Gunasekar et al., 2018a; Woodworth et al., 2020; Gunasekar et al., 2017). Approximate results and empirical evidence are also available for nonlinear deep networks (Azizan et al., 2019). Our work contributes to the understanding of implicit bias in a practically relevant class of nonconvex learning problems, where proofs of convergence and bounds on the generalization error are attainable: GLM regression. Our algorithms have applications in recovering the weights of an unknown recurrent neural network, and may be useful for learning single-layer neural network models (Bai et al., 2019).

2 Problem setting and background

Our setup follows the original work of Kakade et al. (2011). We assume the dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ is sampled i.i.d. from a distribution $\mathcal{D}$ supported on $\mathcal{X} \times [0, 1]$, where $\mathbb{E}[y \mid \mathbf{x}] = u(\langle \boldsymbol{\theta}, \phi(\mathbf{x})\rangle)$ for a finite-dimensional feature map $\phi$ with associated kernel $k$ and a fixed, unknown vector of parameters $\boldsymbol{\theta}$. Here $u$ is a known, nondecreasing, and $L$-Lipschitz link function. We assume that $\|\boldsymbol{\theta}\| \leq W$ for some fixed bound $W$ and that $\|\phi(\mathbf{x})\| \leq X$ for all $\mathbf{x} \in \mathcal{X}$ for some fixed bound $X$. Our goal is to approximate $\mathbb{E}[y \mid \mathbf{x}]$ as measured by the expected squared loss. To this end, for a hypothesis $h$ we define the quantities

$$\mathrm{err}(h) = \mathbb{E}_{(\mathbf{x}, y)\sim\mathcal{D}}\left[(h(\mathbf{x}) - y)^2\right], \tag{1}$$
$$\varepsilon(h) = \mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\left(h(\mathbf{x}) - u(\langle\boldsymbol{\theta}, \phi(\mathbf{x})\rangle)\right)^2\right], \tag{2}$$

and we denote their empirical counterparts over the dataset as $\widehat{\mathrm{err}}(h)$ and $\widehat{\varepsilon}(h)$. Above, $\mathrm{err}(h)$ measures the generalization error of $h$, while $\varepsilon(h)$ measures the excess risk compared to the Bayes-optimal predictor. Towards minimizing $\varepsilon$, we present a family of mirror descent-like algorithms for minimizing $\widehat{\varepsilon}$ over parametric hypotheses of the form $h(\mathbf{x}) = u(\langle\hat{\boldsymbol{\theta}}, \phi(\mathbf{x})\rangle)$. Via standard statistical bounds (Bartlett and Mendelson, 2002), we transfer our guarantees on $\widehat{\varepsilon}$ to $\varepsilon$, which in turn implies a small $\mathrm{err}$. The starting point of our analysis is the GLM-tron of Kakade et al. (2011). The GLM-tron is an iterative update law of the form

$$\hat{\boldsymbol{\theta}}_{t+1} = \hat{\boldsymbol{\theta}}_t + \frac{1}{n}\sum_{i=1}^n \left(y_i - u(\langle\hat{\boldsymbol{\theta}}_t, \phi(\mathbf{x}_i)\rangle)\right)\phi(\mathbf{x}_i) \tag{3}$$

with initialization $\hat{\boldsymbol{\theta}}_0 = \mathbf{0}$. (3) is a gradient-like update law, obtained from gradient descent on the square loss by dropping the derivative of $u$. It admits a natural continuous-time limit,

$$\frac{d}{dt}\hat{\boldsymbol{\theta}}(t) = \frac{1}{n}\sum_{i=1}^n \left(y_i - u(\langle\hat{\boldsymbol{\theta}}(t), \phi(\mathbf{x}_i)\rangle)\right)\phi(\mathbf{x}_i), \tag{4}$$

where (3) is recovered from (4) via a forward-Euler discretization with a unit timestep. Throughout this paper, we will use the notations $\hat{\boldsymbol{\theta}}_t$ and $\hat{\boldsymbol{\theta}}(t)$ interchangeably for any time-dependent signal.
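For illustration, a minimal NumPy sketch of (3) is given below. This is our own illustrative sketch, not the paper's reference implementation; `X`, `y`, and `u` are placeholder names for the stacked features $\phi(\mathbf{x}_i)$, the labels, and the link function.

```python
import numpy as np

def glmtron(X, y, u, iters=500):
    """GLM-tron iteration (3): gradient descent on the square loss
    with the derivative of the link u dropped."""
    theta = np.zeros(X.shape[1])  # initialization theta_0 = 0
    for _ in range(iters):
        # a unit-timestep forward-Euler step of the continuous-time limit (4)
        theta += (X.T @ (y - u(X @ theta))) / X.shape[0]
    return theta
```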

3 The Reflectron Algorithm

We define the Reflectron algorithm in continuous-time as the mirror descent-like dynamics

$$\frac{d}{dt}\nabla\psi(\hat{\boldsymbol{\theta}}(t)) = \frac{1}{n}\sum_{i=1}^n \left(y_i - u(\langle\hat{\boldsymbol{\theta}}(t), \phi(\mathbf{x}_i)\rangle)\right)\phi(\mathbf{x}_i) \tag{5}$$

for $\psi$ a convex function. The parameters $\hat{\boldsymbol{\theta}}(t)$ of the hypothesis at time $t$ are obtained by applying the inverse gradient $(\nabla\psi)^{-1}$ of $\psi$ to the output of the Reflectron.
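For concreteness, the sketch below (our own, not from the paper) instantiates (5) with the potential $\psi(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{\theta}\|_q^2$, whose gradient and inverse gradient are available in closed form: the inverse is the gradient of the conjugate potential $\frac{1}{2}\|\cdot\|_p^2$ with $1/p + 1/q = 1$. The stepsize `dt` and iteration count are assumed placeholders; a simple forward-Euler discretization is used.

```python
import numpy as np

def grad_psi(w, q):
    """Mirror map for psi(w) = 0.5 * ||w||_q^2."""
    nrm = np.linalg.norm(w, q)
    if nrm == 0.0:
        return np.zeros_like(w)
    return nrm ** (2.0 - q) * np.sign(w) * np.abs(w) ** (q - 1.0)

def grad_psi_inv(z, q):
    """Inverse mirror map: gradient of the conjugate 0.5 * ||.||_p^2, 1/p + 1/q = 1."""
    return grad_psi(z, q / (q - 1.0))

def reflectron(X, y, u, q=2.0, dt=0.1, steps=5000):
    """Forward-Euler discretization of the Reflectron dynamics (5).
    q = 2 makes both mirror maps the identity and recovers the GLM-tron."""
    z = np.zeros(X.shape[1])  # mirror variable, with grad_psi(theta_hat(0)) = 0
    for _ in range(steps):
        theta = grad_psi_inv(z, q)
        z += dt * (X.T @ (y - u(X @ theta))) / X.shape[0]
    return grad_psi_inv(z, q)
```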

3.1 Statistical guarantees

The following theorem gives our statistical guarantees for the Reflectron. It implies that for any choice of strongly convex potential function $\psi$, the Reflectron finds a nearly Bayes-optimal predictor if it is allowed to run for a sufficiently long time.

Theorem 3.1 (Statistical guarantees for the Reflectron).

Suppose that $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ are drawn i.i.d. from a distribution $\mathcal{D}$ supported on $\mathcal{X}\times[0,1]$ with $\mathbb{E}[y \mid \mathbf{x}] = u(\langle\boldsymbol{\theta}, \phi(\mathbf{x})\rangle)$ for a known nondecreasing and $L$-Lipschitz link function $u$, a kernel function $k$ with corresponding finite-dimensional feature map $\phi$, and an unknown vector of parameters $\boldsymbol{\theta}$. Assume that $\psi(\boldsymbol{\theta}) \leq W$ where $\psi$ is $\sigma$-strongly convex with respect to $\|\cdot\|_2$ and $W$ is a constant, and that $\|\phi(\mathbf{x})\|_2 \leq X$ for all $\mathbf{x}\in\mathcal{X}$ where $X$ is a constant. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the draws of the $(\mathbf{x}_i, y_i)$, there exists some time $t < \infty$ such that the hypothesis $h_t(\mathbf{x}) = u(\langle\hat{\boldsymbol{\theta}}(t), \phi(\mathbf{x})\rangle)$ satisfies

$$\varepsilon(h_t) \leq O\!\left(LX\sqrt{\frac{W}{\sigma}}\sqrt{\frac{\log(1/\delta)}{n}}\right),$$

where $\hat{\boldsymbol{\theta}}(t)$ is output by the Reflectron at time $t$ with $\nabla\psi(\hat{\boldsymbol{\theta}}(0)) = \mathbf{0}$.

Proof.

Consider the rate of change of the Bregman divergence (Bregman, 1967)

$$d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) = \psi(\boldsymbol{\theta}) - \psi(\hat{\boldsymbol{\theta}}) - \langle\nabla\psi(\hat{\boldsymbol{\theta}}), \boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\rangle$$

between the parameters $\boldsymbol{\theta}$ for the Bayes-optimal predictor and $\hat{\boldsymbol{\theta}} = \hat{\boldsymbol{\theta}}(t)$. Note that we have the equality $\frac{d}{dt}d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) = -\langle\frac{d}{dt}\nabla\psi(\hat{\boldsymbol{\theta}}), \boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\rangle$, and hence we find that

$$\frac{d}{dt}d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) = -\frac{1}{n}\sum_{i=1}^n\left(y_i - u(\langle\hat{\boldsymbol{\theta}}, \phi(\mathbf{x}_i)\rangle)\right)\langle\phi(\mathbf{x}_i), \boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\rangle.$$

Using that $u$ is $L$-Lipschitz and nondecreasing,

$$\frac{d}{dt}d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) \leq -\frac{1}{L}\widehat{\varepsilon}(h_t) + \left\|\frac{1}{n}\sum_{i=1}^n\boldsymbol{\xi}_i\right\|\,\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|, \tag{6}$$

where $\boldsymbol{\xi}_i = \left(y_i - u(\langle\boldsymbol{\theta}, \phi(\mathbf{x}_i)\rangle)\right)\phi(\mathbf{x}_i)$. Now, note that each $\boldsymbol{\xi}_i$ is a zero-mean i.i.d. random variable with norm bounded by $X$ almost surely. Then, by Lemma C.1, $\left\|\frac{1}{n}\sum_i\boldsymbol{\xi}_i\right\| \leq \eta$ with probability at least $1-\delta$, where $\eta = O\!\left(X\sqrt{\log(1/\delta)/n}\right)$. Assuming that $\left\|\frac{1}{n}\sum_i\boldsymbol{\xi}_i\right\| \leq \eta$ at time $t$, we conclude that

$$\frac{d}{dt}d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) \leq -\frac{1}{L}\widehat{\varepsilon}(h_t) + \eta\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|.$$

Hence, either $\widehat{\varepsilon}(h_t) \leq 2L\eta\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|$, or $\frac{d}{dt}d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) < -\frac{1}{2L}\widehat{\varepsilon}(h_t)$. In the latter case, the Bregman divergence is strictly decreasing, so that $d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(t)) \leq d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(0)) \leq \psi(\boldsymbol{\theta}) \leq W$ by our assumptions, and hence $\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\| \leq \sqrt{2W/\sigma}$ by $\sigma$-strong convexity of $\psi$ with respect to $\|\cdot\|_2$. Furthermore, the strict decrease can persist for at most a time of order $\sqrt{W\sigma}/\eta$ before violating $d_\psi \geq 0$. Hence there is some $t < \infty$ such that $\widehat{\varepsilon}(h_t) \leq 2L\eta\sqrt{2W/\sigma}$. To transfer this bound on $\widehat{\varepsilon}(h_t)$ to $\varepsilon(h_t)$, we need to bound the quantity $|\varepsilon(h_t) - \widehat{\varepsilon}(h_t)|$. Combining Theorems C.1 and C.2 gives us a bound on the Rademacher complexity of the class of linear predictors traced out by the Reflectron, and clearly $h_t$ is the composition of $u$ with a member of this class. Application of Theorem C.3 to the square loss (note that while the square loss is not bounded or Lipschitz in general, it is both over the domain at hand, with bound and Lipschitz constant depending only on $L$, $W$, $\sigma$, and $X$) immediately implies $|\varepsilon(h_t) - \widehat{\varepsilon}(h_t)| = O\!\left(LX\sqrt{W/\sigma}\sqrt{\log(1/\delta)/n}\right)$ with probability at least $1 - \delta$. The conclusion of the theorem follows by a union bound. ∎

Because $\mathrm{err}(h) = \varepsilon(h)$ up to a constant, we can find a good predictor by using a hold-out set to estimate $\mathrm{err}(h_t)$ and by taking the best predictor on the hold-out set. Our proof of Theorem 3.1 is similar to the corresponding proofs for the GLM-tron (Kakade et al., 2011) and the Alphatron (Goel and Klivans, 2017), but has two primary modifications. First, we consider the Bregman divergence under $\psi$ between the Bayes-optimal parameters and the current parameters output by the Reflectron rather than the squared Euclidean distance. Second, rather than analyzing the iteration as in the discrete-time case, we analyze the time derivative of the Bregman divergence. Taking $\psi(\cdot) = \frac{1}{2}\|\cdot\|_2^2$ recovers the guarantees of the Alphatron (our setting corresponds to a special case in the notation of Goel and Klivans (2017)) and of the GLM-tron, up to forward Euler discretization-specific details. As our analysis applies in the continuous-time setting, many algorithmic variants may be obtained by choice of discretization method, as sketched below. Because the Reflectron operates directly on the parameters $\hat{\boldsymbol{\theta}}$ rather than on the predictions $\langle\hat{\boldsymbol{\theta}}, \phi(\mathbf{x})\rangle$, we require the feature map to be finite-dimensional.
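As one example of such a variant (a sketch under our own naming; `grad_psi_inv`, `X`, `y`, and `u` are as in the sketch above), the right-hand side of (5) can be handed to any black-box solver from scipy, with the choice of `method` selecting the discretization. The `args` keyword assumes a recent scipy version.

```python
import numpy as np
from scipy.integrate import solve_ivp

def reflectron_field(t, z, X, y, u, q):
    """Right-hand side of (5) in the mirror variable z = grad_psi(theta_hat)."""
    theta = grad_psi_inv(z, q)
    return (X.T @ (y - u(X @ theta))) / X.shape[0]

# Swapping method among "RK45", "RK23", "DOP853", ... yields distinct
# discrete-time algorithms that all track the same continuous-time dynamics.
sol = solve_ivp(reflectron_field, (0.0, 50.0), np.zeros(X.shape[1]),
                args=(X, y, u, 1.1), method="DOP853")
theta_hat = grad_psi_inv(sol.y[:, -1], 1.1)
```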

3.2 Implicit regularization

We now consider an alternative setting, and probe how the choice of $\psi$ impacts the parameters learned by the Reflectron. We require the following assumption.

Assumption 3.1.

The dataset is realizable. That is, there exists some fixed parameter vector $\boldsymbol{\theta}$ such that $y_i = u(\langle\boldsymbol{\theta}, \phi(\mathbf{x}_i)\rangle)$ for all $i = 1, \ldots, n$.

Assumption 3.1 allows us to understand both overfitting and interpolation by the Reflectron. In many cases, even a noisy dataset as in Section 3.1 may satisfy Assumption 3.1. We begin by proving convergence of the Reflectron in the realizable setting.

Lemma 3.1 (Convergence of the Reflectron for a realizable dataset).

Suppose that $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ are drawn i.i.d. from a distribution $\mathcal{D}$ supported on $\mathcal{X}\times[0,1]$ and that Assumption 3.1 is satisfied, where $u$ is a known, nondecreasing, and $L$-Lipschitz function. Let $\psi$ be any convex function with invertible Hessian over the trajectory $\hat{\boldsymbol{\theta}}(t)$. Then $\widehat{\varepsilon}(h_t)\to 0$ as $t\to\infty$, where $h_t$ is the hypothesis with parameters $\hat{\boldsymbol{\theta}}(t)$ output by the Reflectron at time $t$ with $\hat{\boldsymbol{\theta}}(0)$ arbitrary. Furthermore, $\inf_{0\leq s\leq t}\widehat{\varepsilon}(h_s) \leq L\, d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(0))/t$.

Proof.

Under the assumptions of the lemma, (6) shows that

$$\frac{d}{dt}d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) \leq -\frac{1}{L}\widehat{\varepsilon}(h_t).$$

Integrating both sides of the above gives the bound

$$\int_0^\infty \widehat{\varepsilon}(h_s)\,ds \leq L\, d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(0)) < \infty.$$

Explicit computation shows that $\frac{d}{dt}\widehat{\varepsilon}(h_t)$ is bounded, so that $\widehat{\varepsilon}(h_t)$ is uniformly continuous in $t$. By Barbalat's Lemma (Lemma C.2), this implies that $\widehat{\varepsilon}(h_t)\to 0$ as $t\to\infty$.

Furthermore, this simple analysis immediately gives us a convergence rate. Indeed, one can write

$$t\inf_{0\leq s\leq t}\widehat{\varepsilon}(h_s) \leq \int_0^t\widehat{\varepsilon}(h_s)\,ds \leq L\, d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(0)),$$

so that $\inf_{0\leq s\leq t}\widehat{\varepsilon}(h_s) \leq L\, d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(0))/t$. ∎

Lemma 3.1 shows that the Reflectron will converge to an interpolating solution in the realizable setting and that the best hypothesis up to time $t$ does so at an $O(1/t)$ rate. It also implies that the Bregman divergence $d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(t))$ remains bounded, which in many cases implies boundedness of $\hat{\boldsymbol{\theta}}(t)$ itself. In turn, for a realizable dataset, the strong convexity requirement of Theorem 3.1 can be relaxed to a requirement that $\nabla^2\psi$ remains invertible over the trajectory $\hat{\boldsymbol{\theta}}(t)$. Note that Lemma 3.1 allows for arbitrary initialization, while Theorem 3.1 requires $\nabla\psi(\hat{\boldsymbol{\theta}}(0)) = \mathbf{0}$.

In general, there may be many possible vectors consistent with the data. The following theorem provides insight into the parameters learned by the Reflectron.

Theorem 3.2 (Implicit regularization of the Reflectron).

Consider the setting of Lemma 3.1. Let $\mathcal{A} = \{\boldsymbol{\theta}' : u(\langle\boldsymbol{\theta}', \phi(\mathbf{x}_i)\rangle) = y_i,\ i = 1, \ldots, n\}$ be the set of parameters that interpolate the data, and assume that $\hat{\boldsymbol{\theta}}(t)\to\hat{\boldsymbol{\theta}}_\infty\in\mathcal{A}$. Further assume that $u$ is invertible. Then $\hat{\boldsymbol{\theta}}_\infty = \arg\min_{\boldsymbol{\theta}'\in\mathcal{A}} d_\psi(\boldsymbol{\theta}', \hat{\boldsymbol{\theta}}(0))$. In particular, if $\nabla\psi(\hat{\boldsymbol{\theta}}(0)) = \mathbf{0}$, then $\hat{\boldsymbol{\theta}}_\infty = \arg\min_{\boldsymbol{\theta}'\in\mathcal{A}}\psi(\boldsymbol{\theta}')$.

Proof.

Let $\boldsymbol{\theta}'\in\mathcal{A}$ be arbitrary. Then,

$$\frac{d}{dt}d_\psi(\boldsymbol{\theta}', \hat{\boldsymbol{\theta}}) = -\left\langle\frac{d}{dt}\nabla\psi(\hat{\boldsymbol{\theta}}), \boldsymbol{\theta}' - \hat{\boldsymbol{\theta}}\right\rangle = \frac{1}{n}\sum_{i=1}^n\left(u(\langle\hat{\boldsymbol{\theta}}, \phi(\mathbf{x}_i)\rangle) - y_i\right)\left(u^{-1}(y_i) - \langle\hat{\boldsymbol{\theta}}, \phi(\mathbf{x}_i)\rangle\right).$$

Above, we used that $u(\langle\boldsymbol{\theta}', \phi(\mathbf{x}_i)\rangle) = y_i$ and that $u$ is invertible, so that $\boldsymbol{\theta}'\in\mathcal{A}$ implies that $\langle\boldsymbol{\theta}', \phi(\mathbf{x}_i)\rangle = u^{-1}(y_i)$. For clarity, define the error on example $i$ as $e_i(t) = u(\langle\hat{\boldsymbol{\theta}}(t), \phi(\mathbf{x}_i)\rangle) - y_i$. Integrating both sides of the above from $t = 0$ to $t = \infty$, we find that

$$d_\psi(\boldsymbol{\theta}', \hat{\boldsymbol{\theta}}_\infty) - d_\psi(\boldsymbol{\theta}', \hat{\boldsymbol{\theta}}(0)) = \frac{1}{n}\sum_{i=1}^n\int_0^\infty e_i(t)\left(u^{-1}(y_i) - \langle\hat{\boldsymbol{\theta}}(t), \phi(\mathbf{x}_i)\rangle\right)dt.$$

The above relation is true for any $\boldsymbol{\theta}'\in\mathcal{A}$. Furthermore, the integral on the right-hand side is independent of $\boldsymbol{\theta}'$. Hence the difference of the two Bregman divergences must be equal for every $\boldsymbol{\theta}'\in\mathcal{A}$, which shows that $\arg\min_{\boldsymbol{\theta}'\in\mathcal{A}} d_\psi(\boldsymbol{\theta}', \hat{\boldsymbol{\theta}}(0)) = \arg\min_{\boldsymbol{\theta}'\in\mathcal{A}} d_\psi(\boldsymbol{\theta}', \hat{\boldsymbol{\theta}}_\infty) = \hat{\boldsymbol{\theta}}_\infty$, since $\hat{\boldsymbol{\theta}}_\infty\in\mathcal{A}$. Choosing $\hat{\boldsymbol{\theta}}(0)$ such that $\nabla\psi(\hat{\boldsymbol{\theta}}(0)) = \mathbf{0}$ completes the proof. ∎

Theorem 3.2 elucidates the implicit bias of the Reflectron. Out of all possible interpolating parameters, the Reflectron finds those that minimize the Bregman divergence between the manifold of interpolating parameters and the initialization.
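A quick numerical check of this prediction (our own sketch, reusing the hypothetical `reflectron` from the sketch in Section 3) is to build an overparameterized, realizable problem and compare the $\ell_1$ norms of the interpolants found under the two potentials used in the simulations of Section 5.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                               # fewer examples than parameters
X = rng.standard_normal((n, d)) / np.sqrt(d)
theta_star = np.zeros(d)
theta_star[:5] = 2.0                         # sparse ground-truth parameters
u = lambda s: 1.0 / (1.0 + np.exp(-s))       # sigmoid link
y = u(X @ theta_star)                        # noiseless labels: realizable

th_11 = reflectron(X, y, u, q=1.1, dt=1.0, steps=50000)
th_2 = reflectron(X, y, u, q=2.0, dt=1.0, steps=50000)
# Theorem 3.2 predicts the q = 1.1 interpolant minimizes (approximately) the
# l1-like potential, so its l1 norm should be smaller than the Euclidean one's.
print(np.linalg.norm(th_11, 1), np.linalg.norm(th_2, 1))
```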

4 Vector-valued GLMs

In this section, we consider an extension to the case of vector-valued target variables $\mathbf{y}\in\mathbb{R}^m$. We assume that $\mathbb{E}[\mathbf{y}\mid\mathbf{x}] = \mathbf{u}(M(\mathbf{x})\boldsymbol{\theta})$, where each component $u_j$ of $\mathbf{u}$ is $L$-Lipschitz and nondecreasing in its argument. We define the expected and empirical error measures in this setting by replacing squared terms with squared Euclidean norms in the definitions (1) and (2).

In many cases, it is desirable to allow for weight sharing between output variables in a model. For instance, if a vector-valued GLM estimation problem originates in a control or system identification context, parameters can have physical meaning and may appear in multiple equations. Similarly, convolutional neural networks exploit weight sharing as a beneficial prior for imposing translation equivariance. We can provably learn and implicitly regularize weight-shared GLMs via a vector-valued Reflectron. Define the dynamics

$$\frac{d}{dt}\nabla\psi(\hat{\boldsymbol{\theta}}(t)) = \frac{1}{n}\sum_{i=1}^n M(\mathbf{x}_i)^\top\left(\mathbf{y}_i - \mathbf{u}(M(\mathbf{x}_i)\hat{\boldsymbol{\theta}}(t))\right) \tag{7}$$

with $M(\mathbf{x})\in\mathbb{R}^{m\times d}$ known, $\hat{\boldsymbol{\theta}}\in\mathbb{R}^d$, and $\psi$ convex. Note that (7) encompasses models of the form $\mathbf{u}(\Theta\phi(\mathbf{x}))$ with $\Theta\in\mathbb{R}^{m\times p}$ and independent weights by unraveling $\Theta$ into a vector of size $mp$ and defining $M(\mathbf{x})$ appropriately in terms of $\phi(\mathbf{x})$. Appendix D discusses this case explicitly, where tighter bounds are attainable and matrix-valued regularizers can be used. Our work thus generalizes the model of Foster et al. (2020) to the case of shared parameters and mirror descent. It is similar in spirit to the Convotron of Goel et al. (2018), but exploits mirror descent and applies to vector-valued outputs. (7) could in principle be extended to provably learn regularized single-layer convolutional networks with multiple outputs via the distributional assumptions of Goel et al. (2018), which allow an application of average pooling.
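The following sketch (our own, under the form of (7) stated above; `Ms` stacks the known matrices $M(\mathbf{x}_i)$ and `u_vec` is the elementwise link) implements the shared-weight dynamics with forward Euler, reusing the hypothetical `grad_psi_inv` from Section 3.

```python
import numpy as np

def vector_reflectron(Ms, Y, u_vec, q=2.0, dt=0.1, steps=5000):
    """Forward-Euler sketch of the weight-shared dynamics (7).

    Ms: (n, m, d) array of known matrices M(x_i); Y: (n, m) targets;
    u_vec: elementwise link applied to vectors in R^m."""
    n, m, d = Ms.shape
    z = np.zeros(d)  # mirror variable, shared across all m outputs
    for _ in range(steps):
        theta = grad_psi_inv(z, q)
        resid = Y - u_vec(np.einsum("imd,d->im", Ms, theta))  # y_i - u(M(x_i) theta)
        z += dt * np.einsum("imd,im->d", Ms, resid) / n       # avg of M(x_i)^T resid_i
    return grad_psi_inv(z, q)
```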

We can state guarantees analogous to those of the scalar-valued case. We begin with convergence.

Lemma 4.1 (Convergence of the vector-valued Reflectron for a realizable dataset).

Suppose that $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^n$ are drawn i.i.d. from a distribution $\mathcal{D}$ supported on $\mathcal{X}\times[0,1]^m$ and that Assumption 3.1 is satisfied, where the component functions $u_j$ are known, nondecreasing, and $L$-Lipschitz. Let $\psi$ be any convex function with invertible Hessian over the trajectory $\hat{\boldsymbol{\theta}}(t)$. Then $\widehat{\varepsilon}(h_t)\to 0$ as $t\to\infty$, where $h_t$ is the hypothesis with parameters $\hat{\boldsymbol{\theta}}(t)$ output by the vector-valued Reflectron (7) at time $t$ with $\hat{\boldsymbol{\theta}}(0)$ arbitrary. Furthermore, $\inf_{0\leq s\leq t}\widehat{\varepsilon}(h_s) \leq L\, d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(0))/t$.

The proof is given in Appendix A.1. As in the scalar-valued case, the choice of $\psi$ implicitly biases the learned parameters $\hat{\boldsymbol{\theta}}$.

Theorem 4.1 (Implicit regularization of the vector-valued Reflectron).

Consider the setting of Lemma 4.1. Assume that $\hat{\boldsymbol{\theta}}(t)\to\hat{\boldsymbol{\theta}}_\infty$ where $\hat{\boldsymbol{\theta}}_\infty$ interpolates the data, and assume that $\mathbf{u}$ is invertible. Then $\hat{\boldsymbol{\theta}}_\infty = \arg\min_{\boldsymbol{\theta}'\in\mathcal{A}} d_\psi(\boldsymbol{\theta}', \hat{\boldsymbol{\theta}}(0))$, where $\mathcal{A}$ is defined analogously as in Theorem 3.2. In particular, if $\nabla\psi(\hat{\boldsymbol{\theta}}(0)) = \mathbf{0}$, then $\hat{\boldsymbol{\theta}}_\infty = \arg\min_{\boldsymbol{\theta}'\in\mathcal{A}}\psi(\boldsymbol{\theta}')$.

The proof is given in Appendix A.2. Finally, we may also state a statistical guarantee for (7).

Theorem 4.2 (Statistical guarantees for the vector-valued Reflectron).

Suppose that $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^n$ are drawn i.i.d. from a distribution $\mathcal{D}$ supported on $\mathcal{X}\times[0,1]^m$ with $\mathbb{E}[\mathbf{y}\mid\mathbf{x}] = \mathbf{u}(M(\mathbf{x})\boldsymbol{\theta})$ for a known function $\mathbf{u}$ and an unknown vector of parameters $\boldsymbol{\theta}$. Assume that each component $u_j$ of $\mathbf{u}$ is $L$-Lipschitz and nondecreasing in its argument. Assume that $M(\mathbf{x})$ is a known matrix with $\|M(\mathbf{x})\| \leq X$ for all $\mathbf{x}\in\mathcal{X}$. Further assume that $\|\mathbf{m}_j(\mathbf{x})\| \leq X$ for $j = 1, \ldots, m$, where $\mathbf{m}_j(\mathbf{x})$ is the $j^{\text{th}}$ row of $M(\mathbf{x})$. Let $\psi(\boldsymbol{\theta}) \leq W$ where $\psi$ is $\sigma$-strongly convex with respect to $\|\cdot\|_2$ and $W$ is a constant. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$ over the draws of $\{(\mathbf{x}_i, \mathbf{y}_i)\}$, there exists some time $t < \infty$ such that the hypothesis $h_t(\mathbf{x}) = \mathbf{u}(M(\mathbf{x})\hat{\boldsymbol{\theta}}(t))$ satisfies

$$\varepsilon(h_t) \leq O\!\left(mLX\sqrt{\frac{W}{\sigma}}\sqrt{\frac{\log(1/\delta)}{n}}\right),$$

where $\hat{\boldsymbol{\theta}}(t)$ is output by the vector-valued Reflectron (7) at time $t$ with $\nabla\psi(\hat{\boldsymbol{\theta}}(0)) = \mathbf{0}$.

The proof is given in Appendix A.3.

5 Simulations

As a simple illustration of our theoretical results, we perform classification on the MNIST dataset using a single-layer multiclass classification model. Details of the simulation setup can be found in Appendix B.1. In Figure 1A we show the empirical risk and generalization error trajectories for the Reflectron with $\psi(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{\theta}\|_{1.1}^2$ and $\psi(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{\theta}\|_2^2$. The latter option reduces to the GLM-tron, while the first, following Theorem 3.2, approximates the $\ell_1$ norm and imposes sparsity. The dynamics are integrated with the dop853 integrator from scipy.integrate.ode. Both choices converge to similar values of $\widehat{\mathrm{err}}$ and $\widehat{\varepsilon}$. In Figure 1B, we show the training and test set accuracy. Both choices of norm converge to similar accuracy values. In Figure 1C, we show the empirical risk and generalization error curves with $\psi(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{\theta}\|_{1.1}^2$ for four integration methods. The curves lie directly on top of each other and are shifted arbitrarily for clarity. This agreement highlights the validity of our continuous-time analysis, and shows that many possible discrete-time algorithms are captured by the continuous-time dynamics.

[Figure 1 graphics: (A) empirical risk and generalization error comparison; (B) classification accuracy; (C) integration method comparison.]

Figure 1: (A) Trajectories of the empirical risk and generalization error; the empirical risk is shown solid while the generalization error is shown dashed. (B) Trajectories of the training and test set accuracy. Training is shown solid while testing is shown dashed. (C) A comparison of the empirical risk and generalization error dynamics as a function of the integration method with $\psi(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{\theta}\|_{1.1}^2$. All solid curves and all dashed curves lie directly on top of each other and are shifted for visual clarity.

In Figure 2, we show histograms of the final parameter matrices learned by the Reflectron with $\psi(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{\theta}\|_{1.1}^2$ and $\psi(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{\theta}\|_2^2$. The histograms validate the prediction of Theorem 3.2. A sparse parameter matrix is found for the $\|\cdot\|_{1.1}$ potential, which obtains accuracy values similar to the $\|\cdot\|_2$ potential, as seen in Fig. 1B. A significantly larger fraction of the parameters have near-zero magnitude under the $\|\cdot\|_{1.1}$ potential than under the $\|\cdot\|_2$ potential. Future work will apply the Reflectron to models that combine a fixed expressive representation of the data with sparse learning, such as the scattering transform (Mallat, 2011; Bruna and Mallat, 2012; Talmon et al., 2015; Oyallon et al., 2019; Zarka et al., 2019).

[Figure 2 graphics: (A) and (B), histograms of learned parameter values for the two potentials.]

Figure 2: Histograms of parameter values for (A) the parameters found by the Reflectron with $\psi(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{\theta}\|_{1.1}^2$, and (B) the parameters found by the Reflectron with $\psi(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{\theta}\|_2^2$; the latter choice reduces to the GLM-tron. A significantly larger fraction of parameters have small magnitude with the $\|\cdot\|_{1.1}$ potential than with the $\|\cdot\|_2$ potential. Lowering the magnitude threshold shows that the regularized solution has many near-zero parameters, while the $\|\cdot\|_2$ solution is diffuse.

6 Broader Impact

In this work, we developed mirror descent-like variants of the GLM-tron algorithm of Kakade et al. (2011). We proved guarantees on convergence and generalization, and characterized the implicit bias of our algorithms in terms of the potential function $\psi$. We generalized our results to the case of vector-valued target variables while allowing for the possibility of weight sharing between outputs. Our algorithms have applications in several settings. Using the techniques in Foster et al. (2020), they may be generalized to the adaptive control context for provably regularized online learning and control of stochastic dynamical systems. Applications in control will advance automation, but may have negative downstream consequences for those working in areas that can be replaced by adaptive control or robotic systems. Our algorithms can also be used for recovering the weights of a continuous- or discrete-time recurrent neural network from online data, which may have applications in recurrent network pruning (via sparsity-promoting biases such as an $\ell_1$ approximation), or in computational neuroscience. We thank Stephen Tu for many helpful discussions.

References

Appendix A Omitted proofs

A.1 Proof of Lemma 4.1

Proof.

The proof is analogous to that of Lemma 3.1. We have that

$$\frac{d}{dt}d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) = -\frac{1}{n}\sum_{i=1}^n\left\langle\mathbf{y}_i - \mathbf{u}(M(\mathbf{x}_i)\hat{\boldsymbol{\theta}}),\, M(\mathbf{x}_i)(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right\rangle,$$

so that, using that each component of $\mathbf{u}$ is nondecreasing and $L$-Lipschitz,

$$\frac{d}{dt}d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) \leq -\frac{1}{L}\widehat{\varepsilon}(h_t).$$

The conclusions of the lemma follow identically by the machinery of the proof of Lemma 3.1. ∎

A.2 Proof of Theorem 4.1

Proof.

The proof follows the same structure as in the scalar-valued case. Let $\boldsymbol{\theta}'\in\mathcal{A}$ be arbitrary. Then,

$$\frac{d}{dt}d_\psi(\boldsymbol{\theta}', \hat{\boldsymbol{\theta}}) = \frac{1}{n}\sum_{i=1}^n\left\langle\mathbf{u}(M(\mathbf{x}_i)\hat{\boldsymbol{\theta}}) - \mathbf{y}_i,\, \mathbf{u}^{-1}(\mathbf{y}_i) - M(\mathbf{x}_i)\hat{\boldsymbol{\theta}}\right\rangle.$$

In the derivation above, we have replaced $M(\mathbf{x}_i)\boldsymbol{\theta}'$ by $\mathbf{u}^{-1}(\mathbf{y}_i)$ following our assumptions that $\boldsymbol{\theta}'$ interpolates the data and that $\mathbf{u}$ is invertible. Integrating both sides of the above from $t = 0$ to $t = \infty$, we find that

$$d_\psi(\boldsymbol{\theta}', \hat{\boldsymbol{\theta}}_\infty) - d_\psi(\boldsymbol{\theta}', \hat{\boldsymbol{\theta}}(0)) = \frac{1}{n}\sum_{i=1}^n\int_0^\infty\left\langle\mathbf{u}(M(\mathbf{x}_i)\hat{\boldsymbol{\theta}}(t)) - \mathbf{y}_i,\, \mathbf{u}^{-1}(\mathbf{y}_i) - M(\mathbf{x}_i)\hat{\boldsymbol{\theta}}(t)\right\rangle dt.$$

The above relation is true for any $\boldsymbol{\theta}'\in\mathcal{A}$. Furthermore, the integral on the right-hand side is independent of $\boldsymbol{\theta}'$. Hence the difference of the two Bregman divergences must be equal for every $\boldsymbol{\theta}'\in\mathcal{A}$, which shows that

$$\hat{\boldsymbol{\theta}}_\infty = \arg\min_{\boldsymbol{\theta}'\in\mathcal{A}} d_\psi(\boldsymbol{\theta}', \hat{\boldsymbol{\theta}}(0)).$$

Initializing $\hat{\boldsymbol{\theta}}(0)$ such that $\nabla\psi(\hat{\boldsymbol{\theta}}(0)) = \mathbf{0}$ completes the proof. ∎

A.3 Proof of Theorem 4.2

Proof.

Consider the rate of change of the Bregman divergence $d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}})$ between the parameters $\boldsymbol{\theta}$ for the Bayes-optimal predictor and the parameters $\hat{\boldsymbol{\theta}}(t)$ produced by the Reflectron at time $t$. By the same method as in Lemma 4.1, we immediately have

$$\frac{d}{dt}d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) \leq -\frac{1}{L}\widehat{\varepsilon}(h_t) + \left\|\frac{1}{n}\sum_{i=1}^n\boldsymbol{\xi}_i\right\|\,\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|, \qquad \boldsymbol{\xi}_i = M(\mathbf{x}_i)^\top\left(\mathbf{y}_i - \mathbf{u}(M(\mathbf{x}_i)\boldsymbol{\theta})\right).$$

Now, note that each $\boldsymbol{\xi}_i$ is a zero-mean i.i.d. random variable with Euclidean norm bounded by $\sqrt{m}X$ almost surely. Then, by Lemma C.1, $\left\|\frac{1}{n}\sum_i\boldsymbol{\xi}_i\right\| \leq \eta$ with probability at least $1-\delta$, where $\eta = O\!\left(\sqrt{m}X\sqrt{\log(1/\delta)/n}\right)$. Assuming that $\left\|\frac{1}{n}\sum_i\boldsymbol{\xi}_i\right\| \leq \eta$ at time $t$, we conclude that

$$\frac{d}{dt}d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) \leq -\frac{1}{L}\widehat{\varepsilon}(h_t) + \eta\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|.$$

Hence, either $\widehat{\varepsilon}(h_t) \leq 2L\eta\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|$, or $\frac{d}{dt}d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) < -\frac{1}{2L}\widehat{\varepsilon}(h_t)$. In the latter case, the Bregman divergence is strictly decreasing, so that $d_\psi(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(t)) \leq \psi(\boldsymbol{\theta}) \leq W$ by our assumptions, and hence $\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\| \leq \sqrt{2W/\sigma}$ by $\sigma$-strong convexity of $\psi$ with respect to $\|\cdot\|_2$. Furthermore, the strict decrease can persist for at most a time of order $\sqrt{W\sigma}/\eta$ before violating $d_\psi \geq 0$. Hence there is some $t < \infty$ such that

$$\widehat{\varepsilon}(h_t) \leq 2L\eta\sqrt{2W/\sigma}.$$

Combining Theorems C.1 and C.2, we obtain a bound on the Rademacher complexity of the function class traced out by each component of the hypothesis. Application of Theorem C.5 to bound the generalization error, noting that the square loss is Lipschitz and bounded over the domain at hand (with constants depending only on $L$, $W$, $\sigma$, and $X$), gives us the bound

$$\varepsilon(h_t) \leq O\!\left(mLX\sqrt{\frac{W}{\sigma}}\sqrt{\frac{\log(1/\delta)}{n}}\right),$$

which completes the proof. ∎

Appendix B Further simulation details and results

B.1 MNIST simulation details

We implement the vector-valued Reflectron without weight sharing. Rather than implement the continuous-time dynamics (7), we directly utilize the dynamics (9), as they are more efficient without weight sharing. In Section 5, we show results for the mirror descent-like dynamics; Appendix B.2 shows similar results for the natural gradient-like dynamics. Our hypothesis at time $t$ is given by

$$h_t(\mathbf{x}) = \mathbf{u}(\hat{\Theta}(t)\mathbf{x}),$$

where $\mathbf{u}$ is an elementwise sigmoid, $\mathbf{x}\in\mathbb{R}^{784}$ is a flattened image from the MNIST dataset, and $\hat{\Theta}(t)\in\mathbb{R}^{10\times 784}$ is a matrix of parameters to be learned. We use one-hot encoding on the class labels, and the predicted class is obtained by taking an argmax over the components of $h_t(\mathbf{x})$. A training set and a test set are both randomly selected from the overall dataset; the model has $10\times 784 = 7840$ parameters. The training data is pre-processed to have zero mean and unit variance. The testing data is shifted by the mean of the training data and normalized by the standard deviation of the training data. The same nominal timestep is used for all cases, up to adaptive timestepping performed by the black-box ODE solvers. Convergence speed can be different for different choices of $\psi$. To ensure convergence over similar timescales, we use a learning rate in the continuous-time dynamics with $\psi(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{\theta}\|_{1.1}^2$. Because we use an elementwise sigmoid, the output of our model is not required to be a probability distribution over the class labels.
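For reference, a minimal version of this setup (our own sketch; `X_train`, `Y_onehot`, and `X_test` are hypothetical preprocessed arrays, and `grad_psi_inv` is the $q$-norm inverse mirror map from the sketch in Section 3) can be driven by the dop853 integrator from scipy.integrate.ode, as described in Section 5.

```python
import numpy as np
from scipy.integrate import ode

m, d = 10, 784  # classes x flattened pixels
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

def field(t, z, X, Y, q):
    """No-weight-sharing dynamics on the flattened parameter matrix."""
    Theta = grad_psi_inv(z, q).reshape(m, d)
    resid = Y - sigmoid(X @ Theta.T)           # (n, m) residuals
    return (resid.T @ X).ravel() / X.shape[0]  # averaged update, flattened

solver = ode(field).set_integrator("dop853")
solver.set_initial_value(np.zeros(m * d), 0.0)
solver.set_f_params(X_train, Y_onehot, 1.1)    # q = 1.1 potential
while solver.successful() and solver.t < 100.0:
    solver.integrate(solver.t + 1.0)           # dop853 adapts its internal steps
Theta_hat = grad_psi_inv(solver.y, 1.1).reshape(m, d)
pred = np.argmax(sigmoid(X_test @ Theta_hat.T), axis=1)
```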