1 Introduction
Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variable is assumed to be given as a known nonlinear “link” function of a linear predictor of the covariates, $\mathbb{E}[y \mid x] = u(\langle \theta, x \rangle)$, for some fixed vector of parameters $\theta$. GLMs are readily kernelizable, which captures the more flexible setting where $x \mapsto \phi(x)$ is a feature map into a Reproducing Kernel Hilbert Space (RKHS) and $\langle \cdot, \cdot \rangle$ denotes the RKHS inner product. A prominent example of a GLM that arises in practice is logistic regression, which has wide-reaching applications in the natural, social, and medical sciences
(Sur and Candès, 2019). Similarly, an immediate example of a kernel-based GLM is kernel logistic regression. Extensive details on GLMs can be found in the standard reference (McCullagh and Nelder, 1989).

The GLMtron of Kakade et al. (2011) is the first computationally and statistically efficient algorithm for learning GLMs. Inspired by the Isotron of Kalai and Sastry (2009), it is a simple and intuitive Perceptron-like algorithm applicable for learning arbitrary GLMs with a nondecreasing and Lipschitz link function. In this work, we revisit the GLMtron from a new perspective, leveraging recent developments in continuous-time optimization and adaptive control theory (Boffi and Slotine, 2019). We consider the continuous-time limit of the GLMtron, and generalize the resulting continuous-time dynamics to a mirror descent-like (Beck and Teboulle, 2003; Krichene et al., 2015) setting, which we call the Reflectron. By their equivalence in continuous time, our analysis also applies to natural gradient variants of the GLMtron (Amari, 1998; Pascanu and Bengio, 2013). We prove nonasymptotic generalization error bounds for the resulting family of continuous-time dynamics – parameterized by the choice of potential function – and we further prove convergence rates in the realizable setting. Our continuous-time Reflectron immediately gives rise to a wealth of discrete-time algorithms by choice of discretization method, allowing us to leverage the vast body of work in numerical analysis (Butcher, 2001) and the widespread availability of off-the-shelf black-box ordinary differential equation solvers.
We connect the Reflectron algorithm with the growing body of literature on the implicit bias of optimization algorithms by applying a recent continuous-time limit (Boffi and Slotine, 2019) of a simple proof methodology for analyzing the implicit regularization of stochastic mirror descent (Azizan et al., 2019; Azizan and Hassibi, 2019). We show that, in the realizable setting, the choice of potential function $\psi$ implicitly biases the learned parameters to minimize $\psi$ over the set of interpolating parameters. We extend our results to a vector-valued setting which allows for weight sharing between output components, extending the Euclidean variant with independent weights first considered by Foster et al. (2020). We prove that convergence, implicit regularization, and similar generalization error bounds hold in this setting.
1.1 Related work and significance
The GLMtron has recently seen impressive applications in both statistical learning and learning-based control theory. The original work applied the GLMtron to efficiently learn Single Index Models (SIMs) (Kakade et al., 2011). A recent extension (the BregmanTron) uses Bregman divergences to obtain improved guarantees for learning SIMs, though their use of Bregman divergences is different from ours (Nock and Menon, 2020). Foster et al. (2020) utilized the GLMtron to develop an adaptive control law for stochastic, nonlinear, and discrete-time dynamical systems. Goel and Klivans (2017) use the kernelized GLMtron, the Alphatron, to provably learn two hidden layer neural networks, while Goel et al. (2018) generalize the Alphatron to the Convotron for provable learning of one hidden layer convolutional neural networks. Orthogonally, but much like Foster et al. (2020), GLMtron-like update laws have been developed in the adaptive control literature (Tyukin et al., 2007), along with mirror descent and momentum-like variants (Boffi and Slotine, 2019), where they can be used for provable control of unknown and nonlinearly parameterized dynamical systems. Our work extends this line of research by allowing for the incorporation of local geometry into the GLMtron update for regularization of the learned parameters.

Similarly, continuous-time approaches in machine learning and optimization have become increasingly fruitful tools for analysis.
Su et al. (2016) derive a continuous-time limit of Nesterov’s celebrated accelerated gradient method (Nesterov, 1983), and show that this limit enables intuitive proofs via Lyapunov stability theory. Krichene et al. (2015) perform a similar analysis for mirror descent, while Zhang et al. (2018) show that using standard Runge–Kutta integrators on the second-order dynamics of Su et al. (2016) preserves acceleration. Lee et al. (2016) show via dynamical systems theory that saddle points are almost surely avoided by gradient descent with a random initialization. Boffi and Slotine (2020) use a continuous-time view of distributed stochastic gradient descent methods to analyze the effect of distributed coupling on SGD noise. Remarkably, Wibisono et al. (2016), Betancourt et al. (2018), and Wilson et al. (2016) show that many accelerated optimization algorithms can be generated by discretizing the Euler–Lagrange equations of a certain functional known as the Bregman Lagrangian. For deep learning, Chen et al. (2018) show that residual networks can be interpreted as a forward-Euler discretization of a continuous-time dynamical system, and use higher-order integrators to arrive at alternative architectures. Our work continues this promising line of recent work, and highlights that continuous-time analysis offers clean and intuitive proofs that can later be discretized for guarantees on discrete-time algorithms.

As exemplified by the field of deep learning, modern machine learning frequently takes place in a high-dimensional regime with more parameters than examples. It is now well-known that deep networks will interpolate noisy data, yet exhibit low generalization error
despite interpolation when trained on meaningful data (Zhang et al., 2016). Defying classical statistical wisdom, an explanation for this apparent paradox has been given in the implicit bias of optimization algorithms and the double-descent curve (Belkin et al., 2019a). The notion of implicit bias captures the preference of an optimization algorithm to converge to a particular kind of interpolating solution – such as a minimum norm solution – when many options exist. Surprisingly, similar “harmless” or “benign” interpolation phenomena have been observed even in much simpler systems such as overparameterized linear regression (Bartlett et al., 2019; Muthukumar et al., 2019; Hastie et al., 2019) and random feature models (Belkin et al., 2019b; Mei and Montanari, 2019). Understanding implicit bias has thus become an important area of research, with applications ranging from modern deep learning to pure statistics.

Implicit bias has been categorized for separable classification problems (Soudry et al., 2018; Nacson et al., 2018), regression problems using mirror descent (Gunasekar et al., 2018b), and multilayer models (Gunasekar et al., 2018a; Woodworth et al., 2020; Gunasekar et al., 2017). Approximate results and empirical evidence are also available for nonlinear deep networks (Azizan et al., 2019). Our work contributes to the understanding of implicit bias in a practically relevant class of nonconvex learning problems, where proofs of convergence and bounds on the generalization error are attainable: GLM regression. Our algorithms have applications in recovering the weights of an unknown recurrent neural network, and may be useful for learning single-layer neural network models (Bai et al., 2019).

2 Problem setting and background
Our setup follows the original work of Kakade et al. (2011). We assume the dataset $\{(x_i, y_i)\}_{i=1}^n$ is sampled i.i.d. from a distribution $\mathcal{D}$ supported on $\mathcal{X} \times [0, 1]$, where $\mathbb{E}[y \mid x] = u(\langle \theta, \phi(x) \rangle)$ for a finite-dimensional feature map $\phi$ with associated kernel $k(x, x') = \langle \phi(x), \phi(x') \rangle$ and a fixed, unknown vector of parameters $\theta$. Here $u$ is a known, nondecreasing, and $L$-Lipschitz link function. We assume that $\|\theta\| \leq W$ for some fixed bound $W$ and that $\|\phi(x)\| \leq B$ for all $x$ for some fixed bound $B$. Our goal is to approximate $\mathbb{E}[y \mid x]$ as measured by the expected squared loss. To this end, for a hypothesis $h$ we define the quantities
(1) $\mathrm{err}(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[(h(x) - y)^2\right]$,
(2) $\varepsilon(h) = \mathrm{err}(h) - \mathrm{err}(\mathbb{E}[y \mid \cdot]) = \mathbb{E}_{x}\left[\left(h(x) - u(\langle \theta, \phi(x) \rangle)\right)^2\right]$,
and we denote their empirical counterparts over the dataset as $\widehat{\mathrm{err}}(h)$ and $\widehat{\varepsilon}(h)$. Above, $\mathrm{err}(h)$ measures the generalization error of $h$, while $\varepsilon(h)$ measures the excess risk compared to the Bayes-optimal predictor. Towards minimizing $\varepsilon(h)$, we present a family of mirror descent-like algorithms for minimizing $\widehat{\varepsilon}(h)$ over parametric hypotheses of the form $h(x) = u(\langle \hat{\theta}, \phi(x) \rangle)$. Via standard statistical bounds (Bartlett and Mendelson, 2002), we transfer our guarantees on $\widehat{\varepsilon}(h)$ to $\varepsilon(h)$, which in turn implies a small $\mathrm{err}(h)$. The starting point of our analysis is the GLMtron of Kakade et al. (2011). The GLMtron is an iterative update law of the form
(3) $\hat{\theta}_{t+1} = \hat{\theta}_t - \frac{\eta}{n} \sum_{i=1}^n \left(u(\langle \hat{\theta}_t, \phi(x_i) \rangle) - y_i\right) \phi(x_i)$
with initialization $\hat{\theta}_0 = 0$. (3) is a gradient-like update law, obtained from gradient descent on the square loss by dropping the derivative of $u$. It admits a natural continuous-time limit,
(4) $\frac{d\hat{\theta}}{dt} = -\frac{1}{n} \sum_{i=1}^n \left(u(\langle \hat{\theta}, \phi(x_i) \rangle) - y_i\right) \phi(x_i)$,
where (3) is recovered from (4) via a forward-Euler discretization with a timestep $\eta$. Throughout this paper, we will use the notation $\dot{x}$ and $\frac{dx}{dt}$ interchangeably for any time-dependent signal $x(t)$.
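As a concrete illustration, the continuous-time dynamics (4) can be simulated with the forward-Euler scheme (3). The sketch below is a minimal example under assumptions: the sigmoid link, the synthetic realizable data, and the step size are hypothetical choices for illustration, not the paper's experimental setup.

```python
import numpy as np

def glmtron_field(theta, X, y, u):
    """Right-hand side of the continuous-time GLMtron dynamics (4):
    theta_dot = -(1/n) * sum_i (u(<theta, x_i>) - y_i) x_i."""
    return -X.T @ (u(X @ theta) - y) / len(y)

# Toy realizable problem with a sigmoid link (illustrative only).
rng = np.random.default_rng(0)
u = lambda z: 1.0 / (1.0 + np.exp(-z))
X = rng.normal(size=(200, 5))
theta_star = rng.normal(size=5)
y = u(X @ theta_star)

# Forward-Euler integration with timestep eta recovers the discrete GLMtron (3).
theta, eta = np.zeros(5), 0.5
for _ in range(5000):
    theta = theta + eta * glmtron_field(theta, X, y, u)

emp_err = np.mean((u(X @ theta) - y) ** 2)  # empirical squared error
assert emp_err < 1e-2
```

Note that the vector field drops the derivative of the link, so each step is gradient-like rather than a true gradient step on the square loss.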
3 The Reflectron Algorithm
We define the Reflectron algorithm in continuous time as the mirror descent-like dynamics
(5) $\frac{d}{dt} \nabla \psi(\hat{\theta}) = -\frac{1}{n} \sum_{i=1}^n \left(u(\langle \hat{\theta}, \phi(x_i) \rangle) - y_i\right) \phi(x_i)$
for a convex function $\psi$. The parameters of the hypothesis at time $t$ are obtained by applying the inverse gradient $(\nabla \psi)^{-1}$ to the output of the Reflectron.
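A common family of potentials is $\psi(\theta) = \frac{1}{2}\|\theta\|_p^2$, whose gradient map has the closed form used below and whose inverse is the gradient map of $\frac{1}{2}\|\cdot\|_q^2$ with $1/p + 1/q = 1$. The sketch below is a hedged forward-Euler realization of (5) under these assumed choices; the function names and hyperparameters are illustrative, not the paper's implementation.

```python
import numpy as np

def grad_psi(v, p):
    """Mirror map for psi(v) = 0.5 * ||v||_p^2:
    grad_psi(v)_i = sign(v_i) |v_i|^{p-1} / ||v||_p^{p-2}."""
    norm = np.linalg.norm(v, ord=p)
    if norm == 0.0:
        return np.zeros_like(v)
    return np.sign(v) * np.abs(v) ** (p - 1) / norm ** (p - 2)

def reflectron_euler(X, y, u, p, eta=0.5, steps=4000):
    """Forward-Euler discretization of (5): the mirror variable z = grad_psi(theta)
    is integrated, and theta is recovered through the inverse map, which for
    psi = 0.5 ||.||_p^2 is the mirror map of 0.5 ||.||_q^2 with 1/p + 1/q = 1."""
    q = p / (p - 1.0)
    z = np.zeros(X.shape[1])
    for _ in range(steps):
        theta = grad_psi(z, q)  # inverse gradient (nabla psi)^{-1}
        z = z - eta * X.T @ (u(X @ theta) - y) / len(y)
    return grad_psi(z, q)

# Sanity checks: p = 2 gives the identity map, and the dual map inverts grad_psi.
v = np.array([0.5, -1.2, 3.0])
assert np.allclose(grad_psi(v, 2.0), v)
assert np.allclose(grad_psi(grad_psi(v, 1.5), 3.0), v)

# Toy realizable run with p = 2, which reduces to the GLMtron (illustrative only).
rng = np.random.default_rng(1)
u = lambda z: 1.0 / (1.0 + np.exp(-z))
X = rng.normal(size=(200, 5))
y = u(X @ rng.normal(size=5))
theta_hat = reflectron_euler(X, y, u, p=2.0)
assert np.mean((u(X @ theta_hat) - y) ** 2) < 1e-2
```

Choosing $p$ closer to 1 biases the recovered parameters toward sparsity, as formalized in Section 3.2.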
3.1 Statistical guarantees
The following theorem gives our statistical guarantees for the Reflectron. It implies that for any choice of strongly convex function $\psi$, the Reflectron finds a nearly Bayes-optimal predictor if it is allowed to run for a sufficiently long time.
Theorem 3.1 (Statistical guarantees for the Reflectron).
Suppose that $\{(x_i, y_i)\}_{i=1}^n$ are drawn i.i.d. from a distribution supported on $\mathcal{X} \times [0, 1]$ with $\mathbb{E}[y \mid x] = u(\langle \theta, \phi(x) \rangle)$ for a known nondecreasing and Lipschitz link function $u$, a kernel function with corresponding finite-dimensional feature map $\phi$, and an unknown vector of parameters $\theta$. Assume that $\psi(\theta) \leq \mathcal{K}$ where $\psi$ is $\sigma$-strongly convex with respect to a norm $\|\cdot\|$ and $\mathcal{K}$ is a constant, and that $\|\phi(x)\| \leq B$ for all $x$ where $B$ is a constant. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the draws of the $(x_i, y_i)$, there exists some time $t$ such that the hypothesis $h_t(x) = u(\langle \hat{\theta}(t), \phi(x) \rangle)$ attains near-Bayes-optimal excess risk, where $\hat{\theta}(t)$ is output by the Reflectron at time $t$ with $\nabla \psi(\hat{\theta}(0)) = 0$.
Proof.
Consider the rate of change of the Bregman divergence (Bregman, 1967) $d_\psi(\theta, \hat{\theta}) = \psi(\theta) - \psi(\hat{\theta}) - \langle \nabla \psi(\hat{\theta}), \theta - \hat{\theta} \rangle$ between the parameters $\theta$ for the Bayes-optimal predictor and $\hat{\theta}$. Note that we have the equality $\frac{d}{dt} d_\psi(\theta, \hat{\theta}) = -\langle \frac{d}{dt} \nabla \psi(\hat{\theta}), \theta - \hat{\theta} \rangle$, and hence we find that
$\frac{d}{dt} d_\psi(\theta, \hat{\theta}) = \frac{1}{n} \sum_{i=1}^n \left(u(\langle \hat{\theta}, \phi(x_i) \rangle) - y_i\right) \langle \phi(x_i), \theta - \hat{\theta} \rangle.$
Using that $u$ is Lipschitz and nondecreasing,
(6) $\frac{d}{dt} d_\psi(\theta, \hat{\theta}) \leq -\frac{1}{L} \widehat{\varepsilon}(h_t) + \left\langle \frac{1}{n} \sum_{i=1}^n \xi_i \phi(x_i),\ \hat{\theta} - \theta \right\rangle$, where $\xi_i = y_i - u(\langle \theta, \phi(x_i) \rangle)$.
Now, note that each $\xi_i \phi(x_i)$ is a zero-mean i.i.d. random variable with norm bounded almost surely. Then, by Lemma C.1, the noise term in (6) is uniformly bounded with high probability. On this event, at any time $t$, either $\widehat{\varepsilon}(h_t)$ is already below the noise level, or $\frac{d}{dt} d_\psi(\theta, \hat{\theta}) < 0$ and the Bregman divergence strictly decreases. In the first case, our result is proven. In the second, by our assumptions $d_\psi(\theta, \hat{\theta}(0))$ is bounded at initialization, and the divergence remains nonnegative by strong convexity of $\psi$ with respect to $\|\cdot\|$. Thus the divergence can only decrease for a finite time before $\widehat{\varepsilon}(h_t)$ drops below the noise level. Hence there is some time $t$ such that $\widehat{\varepsilon}(h_t)$ is small. To transfer this bound on $\widehat{\varepsilon}$ to $\varepsilon$, we need to bound the complexity of the hypothesis class. Combining Theorems C.1 and C.2 gives us a bound on the Rademacher complexity of the class of predictors traced out by the Reflectron. Application of Theorem C.3 to the square loss – which, while not bounded or Lipschitz in general, is both over the bounded domain considered here – immediately implies that $\varepsilon(h_t)$ is close to $\widehat{\varepsilon}(h_t)$ with high probability. The conclusion of the theorem follows by a union bound. ∎
Because $\mathrm{err}(h) = \varepsilon(h) + \mathrm{err}(\mathbb{E}[y \mid \cdot])$, so that the two measures agree up to a constant, we can find a good predictor by using a holdout set to estimate $\mathrm{err}$ and by taking the best predictor on the holdout set. Our proof of Theorem 3.1 is similar to the corresponding proofs for the GLMtron (Kakade et al., 2011) and the Alphatron (Goel and Klivans, 2017), but has two primary modifications. First, we consider the Bregman divergence under $\psi$ between the Bayes-optimal parameters and the current parameters output by the Reflectron rather than the squared Euclidean distance. Second, rather than analyzing the iteration as in the discrete-time case, we analyze the time derivative of the Bregman divergence. Taking $\psi = \frac{1}{2}\|\cdot\|_2^2$ recovers the guarantees of the Alphatron (our setting corresponds to a special case in the notation of Goel and Klivans (2017)) and of the GLMtron, up to forward-Euler discretization-specific details. As our analysis applies in the continuous-time setting, many algorithmic variants may be obtained by choice of discretization method. Because the Reflectron operates directly on the parameters $\hat{\theta}$ rather than on a kernel expansion of the predictor, we require the feature map $\phi$ to be finite-dimensional.

3.2 Implicit regularization
We now consider an alternative setting, and probe how the choice of $\psi$ impacts the parameters learned by the Reflectron. We require the following assumption.
Assumption 3.1.
The dataset is realizable. That is, there exists some fixed parameter vector $\theta$ such that $y_i = u(\langle \theta, \phi(x_i) \rangle)$ for all $i$.
Assumption 3.1 allows us to understand both overfitting and interpolation by the Reflectron. In many cases, even the noisy dataset of Section 3.1 may satisfy Assumption 3.1. We begin by proving convergence of the Reflectron in the realizable setting.
Lemma 3.1 (Convergence of the Reflectron for a realizable dataset).
Suppose that $\{(x_i, y_i)\}_{i=1}^n$ are drawn i.i.d. from a distribution supported on $\mathcal{X} \times [0, 1]$ and that Assumption 3.1 is satisfied, where $u$ is a known, nondecreasing, and Lipschitz function. Let $\psi$ be any convex function with invertible Hessian over the trajectory $\hat{\theta}(t)$. Then $\widehat{\varepsilon}(h_t) \to 0$ as $t \to \infty$, where $h_t$ is the hypothesis with parameters $\hat{\theta}(t)$ output by the Reflectron at time $t$ with $\hat{\theta}(0)$ arbitrary. Furthermore, $\min_{s \leq t} \widehat{\varepsilon}(h_s) = O(1/t)$.
Proof.
Under the assumptions of the lemma, (6) shows that
$\frac{d}{dt} d_\psi(\theta, \hat{\theta}) \leq -\frac{1}{L} \widehat{\varepsilon}(h_t).$
Integrating both sides of the above gives the bound
$\frac{1}{L} \int_0^t \widehat{\varepsilon}(h_s)\, ds \leq d_\psi(\theta, \hat{\theta}(0)).$
Explicit computation shows that $\frac{d}{dt} \widehat{\varepsilon}(h_t)$ is bounded, so that $\widehat{\varepsilon}(h_t)$ is uniformly continuous in $t$. By Barbalat’s Lemma (Lemma C.2), this implies that $\widehat{\varepsilon}(h_t) \to 0$ as $t \to \infty$.
Furthermore, this simple analysis immediately gives us a convergence rate. Indeed, one can write
$t \min_{s \leq t} \widehat{\varepsilon}(h_s) \leq \int_0^t \widehat{\varepsilon}(h_s)\, ds \leq L\, d_\psi(\theta, \hat{\theta}(0)),$
so that $\min_{s \leq t} \widehat{\varepsilon}(h_s) \leq \frac{L\, d_\psi(\theta, \hat{\theta}(0))}{t}$. ∎
Lemma 3.1 shows that the Reflectron will converge to an interpolating solution in the realizable setting, and that the best hypothesis up to time $t$ does so at an $O(1/t)$ rate. It also implies that the Bregman divergence $d_\psi(\theta, \hat{\theta}(t))$ remains bounded, which in many cases implies boundedness of $\hat{\theta}(t)$ itself. In turn, for a realizable dataset, the strong convexity requirement of Theorem 3.1 can be relaxed to a requirement that $\nabla^2 \psi$ remains invertible over the trajectory $\hat{\theta}(t)$. Note that Lemma 3.1 allows for arbitrary initialization, while Theorem 3.1 requires $\nabla \psi(\hat{\theta}(0)) = 0$.
In general, there may be many possible parameter vectors consistent with the data. The following theorem provides insight into the parameters learned by the Reflectron.
Theorem 3.2 (Implicit regularization of the Reflectron).
Consider the setting of Lemma 3.1. Let $\mathcal{A} = \{\theta' : y_i = u(\langle \theta', \phi(x_i) \rangle)\ \text{for all}\ i\}$ be the set of parameters that interpolate the data, and assume that $\hat{\theta}(t) \to \hat{\theta}(\infty) \in \mathcal{A}$. Further assume that $u$ is invertible. Then $\hat{\theta}(\infty) = \arg\min_{\theta' \in \mathcal{A}} d_\psi(\theta', \hat{\theta}(0))$. In particular, if $\nabla \psi(\hat{\theta}(0)) = 0$, then $\hat{\theta}(\infty) = \arg\min_{\theta' \in \mathcal{A}} \psi(\theta')$.
Proof.
Let $\theta' \in \mathcal{A}$ be arbitrary. Then,
$\frac{d}{dt} d_\psi(\theta', \hat{\theta}) = \frac{1}{n} \sum_{i=1}^n \left(u(\langle \hat{\theta}, \phi(x_i) \rangle) - y_i\right) \left(u^{-1}(y_i) - \langle \phi(x_i), \hat{\theta} \rangle\right).$
Above, we used that $\theta' \in \mathcal{A}$ and that $u$ is invertible, so that $u(\langle \theta', \phi(x_i) \rangle) = y_i$ implies that $\langle \theta', \phi(x_i) \rangle = u^{-1}(y_i)$, which is identical for every element of $\mathcal{A}$. For clarity, define the error on example $i$ as $e_i(t) = u(\langle \hat{\theta}(t), \phi(x_i) \rangle) - y_i$. Integrating both sides of the above from $t = 0$ to $t = \infty$, we find that
$d_\psi(\theta', \hat{\theta}(\infty)) - d_\psi(\theta', \hat{\theta}(0)) = \frac{1}{n} \sum_{i=1}^n \int_0^\infty e_i(t) \left(u^{-1}(y_i) - \langle \phi(x_i), \hat{\theta}(t) \rangle\right) dt.$
The above relation is true for any $\theta' \in \mathcal{A}$. Furthermore, the integral on the right-hand side is independent of $\theta'$. Hence the minimizers over $\mathcal{A}$ of the two Bregman divergences must be equal, which shows that $\hat{\theta}(\infty) = \arg\min_{\theta' \in \mathcal{A}} d_\psi(\theta', \hat{\theta}(0))$. Choosing $\hat{\theta}(0)$ such that $\nabla \psi(\hat{\theta}(0)) = 0$ completes the proof. ∎
Theorem 3.2 elucidates the implicit bias of the Reflectron. Out of all possible interpolating parameters, the Reflectron finds those that minimize the Bregman divergence between the manifold of interpolating parameters and the initialization.
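This implicit bias can be observed numerically in its simplest special case: identity link and Euclidean potential $\psi = \frac{1}{2}\|\cdot\|_2^2$, for which the Reflectron reduces to gradient flow on underdetermined least squares and Theorem 3.2 predicts convergence to the minimum-$\ell_2$-norm interpolant. The dimensions, step size, and data below are hypothetical choices for a quick sanity check.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 50))   # underdetermined: many interpolating parameters
y = rng.normal(size=10)

# Forward-Euler on the Euclidean dynamics with identity link, initialized at zero.
theta = np.zeros(50)
for _ in range(20000):
    theta -= 0.01 * X.T @ (X @ theta - y) / len(y)

# Theorem 3.2 with psi = 0.5||.||_2^2 and theta_hat(0) = 0 predicts the
# minimum-norm interpolant, which is the pseudoinverse solution.
theta_min_norm = np.linalg.pinv(X) @ y
assert np.allclose(X @ theta, y, atol=1e-6)           # interpolation
assert np.allclose(theta, theta_min_norm, atol=1e-6)  # implicit l2 bias
```

Replacing the Euclidean potential with $\frac{1}{2}\|\cdot\|_p^2$ for $p$ near 1 instead biases the interpolant toward sparsity, which is the regime explored in the MNIST experiments of Section 5.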
4 Vectorvalued GLMs
In this section, we consider an extension to the case of vector-valued target variables $y \in \mathbb{R}^k$. We assume that each component $u_j$ of the link function $u$ is Lipschitz and nondecreasing in its argument. We define the expected and empirical error measures in this setting by replacing squared terms with squared Euclidean norms in the definitions (1) and (2).
In many cases, it is desirable to allow for weight sharing between output variables in a model. For instance, if a vector-valued GLM estimation problem originates in a control or system identification context, parameters can have physical meaning and may appear in multiple equations. Similarly, convolutional neural networks exploit weight sharing as a beneficial prior for imposing translation equivariance. We can provably learn and implicitly regularize weight-shared GLMs via a vector-valued Reflectron. Define the dynamics
(7) $\frac{d}{dt} \nabla \psi(\hat{\theta}) = -\frac{1}{n} \sum_{i=1}^n \Phi(x_i)^\top \left(u(\Phi(x_i) \hat{\theta}) - y_i\right)$
with $\Phi(x)$ a known matrix-valued feature map encoding the weight sharing, $\hat{\theta}$ a shared parameter vector, and $\psi$ convex. Note that (7) encompasses models of the form $h(x) = u(M \phi(x))$ with a parameter matrix $M$ and independent rows by unraveling $M$ into a vector and defining $\Phi$ appropriately in terms of $\phi$. Appendix D discusses this case explicitly, where tighter bounds are attainable and matrix-valued regularizers can be used. Our work thus generalizes the model of Foster et al. (2020) to the case of shared parameters and mirror descent. It is similar in spirit to the Convotron of Goel et al. (2018), but exploits mirror descent and applies to vector-valued outputs. (7) could in principle be extended to provably learn regularized single-layer convolutional networks with multiple outputs via the distributional assumptions of Goel et al. (2018), which allow an application of average pooling.
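A hedged sketch of one way to realize the shared-parameter dynamics in code follows. The Euclidean potential is assumed (so the mirror variable equals $\hat{\theta}$), the random matrices `As` stand in for whatever known weight-sharing structure the model imposes, and all sizes, links, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def vv_reflectron_euler(As, Y, u, eta=0.05, steps=8000):
    """Forward-Euler sketch of the vector-valued dynamics with psi = 0.5||.||_2^2.
    Each A_i is a known k x d matrix encoding the weight sharing; the model is
    h(x_i) = u(A_i @ theta) with a single shared parameter vector theta."""
    d = As[0].shape[1]
    theta = np.zeros(d)
    for _ in range(steps):
        grad = sum(A.T @ (u(A @ theta) - y) for A, y in zip(As, Y)) / len(Y)
        theta = theta - eta * grad
    return theta

# Toy shared-parameter problem: k = 3 outputs, d = 4 shared weights (illustrative).
rng = np.random.default_rng(2)
u = lambda z: 1.0 / (1.0 + np.exp(-z))  # elementwise sigmoid link
theta_star = rng.normal(size=4)
As = [rng.normal(size=(3, 4)) for _ in range(100)]
Y = [u(A @ theta_star) for A in As]

theta_hat = vv_reflectron_euler(As, Y, u)
err = np.mean([np.sum((u(A @ theta_hat) - y) ** 2) for A, y in zip(As, Y)])
assert err < 1e-2
```

Because the same $\theta$ enters every output component through the known matrices, the update aggregates error feedback from all outputs into a single shared parameter estimate.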
We can state analogous guarantees as in the scalarvalued case. We begin with convergence.
Lemma 4.1 (Convergence of the vector-valued Reflectron for a realizable dataset).
Suppose that $\{(x_i, y_i)\}_{i=1}^n$ are drawn i.i.d. from a distribution supported on $\mathcal{X} \times [0, 1]^k$ and that Assumption 3.1 is satisfied, where the component functions $u_j$ are known, nondecreasing, and Lipschitz. Let $\psi$ be any convex function with invertible Hessian over the trajectory $\hat{\theta}(t)$. Then $\widehat{\varepsilon}(h_t) \to 0$, where $h_t$ is the hypothesis with parameters $\hat{\theta}(t)$ output by the vector-valued Reflectron (7) at time $t$ with $\hat{\theta}(0)$ arbitrary. Furthermore, $\min_{s \leq t} \widehat{\varepsilon}(h_s) = O(1/t)$.
The proof is given in Appendix A.1. As in the scalar-valued case, the choice of $\psi$ implicitly biases the learned parameters.
Theorem 4.1 (Implicit regularization of the vector-valued Reflectron).
Theorem 4.2 (Statistical guarantees for the vector-valued Reflectron).
Suppose that $\{(x_i, y_i)\}_{i=1}^n$ are drawn i.i.d. from a distribution supported on $\mathcal{X} \times [0, 1]^k$ with $\mathbb{E}[y \mid x] = u(\Phi(x)\theta)$ for a known function $u$ and an unknown vector of parameters $\theta$. Assume that each component $u_j$ of $u$ is Lipschitz and nondecreasing in its argument. Assume that $\Phi(x)$ is a known matrix for all $x$. Further assume that $\|\Phi_j(x)\| \leq B$ for $j = 1, \ldots, k$, where $\Phi_j(x)$ is the $j$th row of $\Phi(x)$. Let $\psi(\theta) \leq \mathcal{K}$ where $\psi$ is $\sigma$-strongly convex with respect to a norm $\|\cdot\|$ and $\mathcal{K}$ is a constant. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the draws of $\{(x_i, y_i)\}$, there exists some time $t$ such that the hypothesis $h_t$ attains near-Bayes-optimal excess risk, where $\hat{\theta}(t)$ is output by the vector-valued Reflectron (7) at time $t$ with $\nabla \psi(\hat{\theta}(0)) = 0$.
The proof is given in Appendix A.3.
5 Simulations
As a simple illustration of our theoretical results, we perform classification on the MNIST dataset using a single-layer multiclass classification model. Details of the simulation setup can be found in Appendix B.1. In Figure 1A we show the empirical risk and generalization error trajectories for the Reflectron with $\psi = \frac{1}{2}\|\cdot\|_{1.1}^2$ and $\psi = \frac{1}{2}\|\cdot\|_2^2$. The latter option reduces to the GLMtron, while the former, following Theorem 3.2, approximates the $\ell_1$ norm and imposes sparsity. The dynamics are integrated with the dop853 integrator from scipy.integrate.ode. Both choices converge to similar values of $\widehat{\mathrm{err}}$ and $\widehat{\varepsilon}$. In Figure 1B, we show the training and test set accuracy. Both choices of norm converge to similar accuracy values. In Figure 1C, we show the error curves for four integration methods. The curves lie directly on top of each other and are shifted arbitrarily for clarity. This agreement highlights the validity of our continuous-time analysis, and shows that many possible discrete-time algorithms are captured by the continuous-time dynamics.
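The integrator-agreement observation can be reproduced in miniature: different black-box solvers applied to the same vector field (4) land on numerically identical parameters. The sketch below is an assumption-laden stand-in for the Figure 1C comparison — it uses a toy sigmoid problem and scipy's `solve_ivp` front end rather than the paper's `scipy.integrate.ode` driver.

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(3)
u = lambda z: 1.0 / (1.0 + np.exp(-z))
X = rng.normal(size=(100, 5))
y = u(X @ rng.normal(size=5))

def field(t, theta):
    # Continuous-time GLMtron vector field (4); t is unused but required by solve_ivp.
    return -X.T @ (u(X @ theta) - y) / len(y)

# Integrate the same ODE with two different black-box Runge-Kutta methods.
T = 50.0
sol_rk = solve_ivp(field, (0.0, T), np.zeros(5), method="RK45", rtol=1e-8, atol=1e-10)
sol_dop = solve_ivp(field, (0.0, T), np.zeros(5), method="DOP853", rtol=1e-8, atol=1e-10)

# Different discretizations of the same flow agree on the learned parameters.
assert np.allclose(sol_rk.y[:, -1], sol_dop.y[:, -1], atol=1e-4)
```

Any sufficiently accurate integrator traces the same continuous-time trajectory, which is why the guarantees proved for the flow transfer to a family of discrete-time algorithms.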
[Figure 1: (A) empirical risk and generalization error trajectories; (B) training and test set accuracy; (C) comparison of integration methods.]
In Figure 2, we show histograms of the final parameter matrices learned by the Reflectron with $\psi = \frac{1}{2}\|\cdot\|_{1.1}^2$ and $\psi = \frac{1}{2}\|\cdot\|_2^2$. The histograms validate the prediction of Theorem 3.2. A sparse parameter vector is found for $p = 1.1$, which obtains similar accuracy values to $p = 2$, as seen in Fig. 1C. A much larger fraction of the parameters have near-zero magnitude for $p = 1.1$ than for $p = 2$. Future work will apply the Reflectron to models that combine a fixed expressive representation of the data with sparse learning, such as the scattering transform (Mallat, 2011; Bruna and Mallat, 2012; Talmon et al., 2015; Oyallon et al., 2019; Zarka et al., 2019).
[Figure 2: histograms of the learned parameters for $p = 1.1$ (left) and $p = 2$ (right).]
6 Broader Impact
In this work, we developed mirror descent-like variants of the GLMtron algorithm of Kakade et al. (2011). We proved guarantees on convergence and generalization, and characterized the implicit bias of our algorithms in terms of the potential function $\psi$. We generalized our results to the case of vector-valued target variables while allowing for the possibility of weight sharing between outputs. Our algorithms have applications in several settings. Using the techniques in Foster et al. (2020), they may be generalized to the adaptive control context for provably regularized online learning and control of stochastic dynamical systems. Applications in control will advance automation, but may have negative downstream consequences for those working in areas that can be replaced by adaptive control or robotic systems. Our algorithms can also be used for recovering the weights of a continuous- or discrete-time recurrent neural network from online data, which may have applications in recurrent network pruning (via sparsity-promoting biases such as an $\ell_1$ approximation), or in computational neuroscience. We thank Stephen Tu for many helpful discussions.
References
 Amari (1998) Amari, S.i. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276.
 Azizan and Hassibi (2019) Azizan, N. and Hassibi, B. (2019). Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. In International Conference on Learning Representations.
 Azizan et al. (2019) Azizan, N., Lale, S., and Hassibi, B. (2019). Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization. arXiv:1906.03830.
 Bai et al. (2019) Bai, S., Kolter, J. Z., and Koltun, V. (2019). Deep equilibrium models. arXiv:1909.01377.
 Bartlett et al. (2019) Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2019). Benign overfitting in linear regression. arXiv:1906.11300.
 Bartlett and Mendelson (2002) Bartlett, P. L. and Mendelson, S. (2002). Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482.
 Beck and Teboulle (2003) Beck, A. and Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167 – 175.

 Belkin et al. (2019a) Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019a). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854.
 Belkin et al. (2019b) Belkin, M., Hsu, D., and Xu, J. (2019b). Two models of double descent for weak features. arXiv:1903.07571.
 Betancourt et al. (2018) Betancourt, M., Jordan, M. I., and Wilson, A. C. (2018). On symplectic optimization. arXiv:1802.03653.
 Boffi and Slotine (2019) Boffi, N. M. and Slotine, J.J. E. (2019). Higherorder algorithms and implicit regularization for nonlinearly parameterized adaptive control. arXiv:1912.13154.
 Boffi and Slotine (2020) Boffi, N. M. and Slotine, J.J. E. (2020). A continuoustime analysis of distributed stochastic gradient. Neural Computation, 32(1):36–96.
 Bregman (1967) Bregman, L. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200 – 217.
 Bruna and Mallat (2012) Bruna, J. and Mallat, S. (2012). Invariant scattering convolution networks. arXiv:1203.1513.
 Butcher (2001) Butcher, J. (2001). Numerical methods for ordinary differential equations in the 20th century.
 Chen et al. (2018) Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. (2018). Neural ordinary differential equations. arXiv:1806.07366.
 Foster et al. (2020) Foster, D. J., Rakhlin, A., and Sarkar, T. (2020). Learning nonlinear dynamical systems from a single trajectory. arXiv:2004.14681.
 Goel and Klivans (2017) Goel, S. and Klivans, A. (2017). Learning neural networks with two nonlinear layers in polynomial time. arXiv:1709.06010.
 Goel et al. (2018) Goel, S., Klivans, A., and Meka, R. (2018). Learning one convolutional layer with overlapping patches. arXiv:1802.02547.
 Gunasekar et al. (2018a) Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. (2018a). Characterizing implicit bias in terms of optimization geometry. arXiv:1802.08246.
 Gunasekar et al. (2018b) Gunasekar, S., Lee, J. D., Soudry, D., and Srebro, N. (2018b). Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems 31, pages 9461–9471. Curran Associates, Inc.
 Gunasekar et al. (2017) Gunasekar, S., Woodworth, B., Bhojanapalli, S., Neyshabur, B., and Srebro, N. (2017). Implicit regularization in matrix factorization. arXiv:1705.09280.
 Hastie et al. (2019) Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2019). Surprises in highdimensional ridgeless least squares interpolation. arXiv:1903.08560.
 Kakade et al. (2011) Kakade, S., Kalai, A. T., Kanade, V., and Shamir, O. (2011). Efficient learning of generalized linear and single index models with isotonic regression. arXiv:1104.2018.
 Kakade et al. (2009) Kakade, S. M., Sridharan, K., and Tewari, A. (2009). On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 793–800. Curran Associates, Inc.
 Kalai and Sastry (2009) Kalai, A. T. and Sastry, R. (2009). The isotron algorithm: Highdimensional isotonic regression. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.
 Krichene et al. (2015) Krichene, W., Bayen, A., and Bartlett, P. L. (2015). Accelerated mirror descent in continuous and discrete time. In Advances in Neural Information Processing Systems 28, pages 2845–2853. Curran Associates, Inc.
 Lee et al. (2016) Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. (2016). Gradient descent converges to minimizers. arXiv:1602.04915.
 Mallat (2011) Mallat, S. (2011). Group invariant scattering. arXiv:1101.2286.
 Maurer (2016) Maurer, A. (2016). A vectorcontraction inequality for rademacher complexities. arXiv:1605.00251.
 McCullagh and Nelder (1989) McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, Second Edition. CRC Press.
 Mei and Montanari (2019) Mei, S. and Montanari, A. (2019). The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv:1908.05355.
 Muthukumar et al. (2019) Muthukumar, V., Vodrahalli, K., and Sahai, A. (2019). Harmless interpolation of noisy data in regression. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 2299–2303.
 Nacson et al. (2018) Nacson, M. S., Lee, J. D., Gunasekar, S., Savarese, P. H. P., Srebro, N., and Soudry, D. (2018). Convergence of gradient descent on separable data. arXiv:1803.01905.
 Nesterov (1983) Nesterov, Y. (1983). A Method for Solving a Convex Programming Problem with Convergence Rate $O(1/k^2)$. Soviet Mathematics Doklady, 26:367–372.
 Nock and Menon (2020) Nock, R. and Menon, A. K. (2020). Supervised learning: No loss no cry. arXiv:2002.03555.
 Oyallon et al. (2019) Oyallon, E., Zagoruyko, S., Huang, G., Komodakis, N., LacosteJulien, S., Blaschko, M., and Belilovsky, E. (2019). Scattering networks for hybrid representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2208–2221.
 Pascanu and Bengio (2013) Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. arXiv:1301.3584.
 Slotine and Li (1991) Slotine, J.J. and Li, W. (1991). Applied Nonlinear Control. Prentice Hall.
 Soudry et al. (2018) Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. (2018). The implicit bias of gradient descent on separable data. J. Mach. Learn. Res., 19(1):2822–2878.
 Su et al. (2016) Su, W., Boyd, S., and Candès, E. J. (2016). A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153):1–43.
 Sur and Candès (2019) Sur, P. and Candès, E. J. (2019). A modern maximumlikelihood theory for highdimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29):14516–14525.
 Talmon et al. (2015) Talmon, R., Mallat, S., Zaveri, H., and Coifman, R. R. (2015). Manifold learning for latent variable inference in dynamical systems. IEEE Transactions on Signal Processing, 63(15):3843–3856.
 Tyukin et al. (2007) Tyukin, I. Y., Prokhorov, D. V., and van Leeuwen, C. (2007). Adaptation and parameter estimation in systems with unstable target dynamics and nonlinear parametrization. IEEE Transactions on Automatic Control, 52(9):1543–1559.
 Wainwright (2019) Wainwright, M. J. (2019). HighDimensional Statistics: A NonAsymptotic Viewpoint. Cambridge University Press.
 Wibisono et al. (2016) Wibisono, A., Wilson, A. C., and Jordan, M. I. (2016). A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358.
 Wilson et al. (2016) Wilson, A. C., Recht, B., and Jordan, M. I. (2016). A lyapunov analysis of momentum methods in optimization. arXiv:1611.02635.
 Woodworth et al. (2020) Woodworth, B., Gunasekar, S., Lee, J. D., Moroshko, E., Savarese, P., Golan, I., Soudry, D., and Srebro, N. (2020). Kernel and rich regimes in overparametrized models. arXiv:2002.09277.
 Zarka et al. (2019) Zarka, J., Thiry, L., Angles, T., and Mallat, S. (2019). Deep network classification by scattering and homotopy dictionary learning. arXiv:1910.03561.
 Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv:1611.03530.
 Zhang et al. (2018) Zhang, J., Mokhtari, A., Sra, S., and Jadbabaie, A. (2018). Direct rungekutta discretization achieves acceleration. arXiv:1805.00521.
Appendix A Omitted proofs
A.1 Proof of Lemma 4.1
A.2 Proof of Theorem 4.1
Proof.
The proof follows the same structure as in the scalar-valued case. Let $\theta' \in \mathcal{A}$ be arbitrary. Then,
In the derivation above, we have replaced $\Phi(x_i)\theta'$ by $u^{-1}(y_i)$, following our assumptions that $\theta' \in \mathcal{A}$ and $u$ is invertible. Integrating both sides of the above from $t = 0$ to $t = \infty$, we find that
The above relation is true for any $\theta' \in \mathcal{A}$. Furthermore, the integral on the right-hand side is independent of $\theta'$. Hence the minimizers over $\mathcal{A}$ of the two Bregman divergences must be equal, which shows that
Initializing $\hat{\theta}(0)$ such that $\nabla \psi(\hat{\theta}(0)) = 0$ completes the proof. ∎
A.3 Proof of Theorem 4.2
Proof.
Consider the rate of change of the Bregman divergence between the parameters $\theta$ for the Bayes-optimal predictor and the parameters $\hat{\theta}(t)$ produced by the Reflectron at time $t$. By the same method as in Lemma 4.1, we immediately have
Now, note that each noise term is a zero-mean i.i.d. random variable with Euclidean norm bounded almost surely. Then, by Lemma C.1, the noise contribution is uniformly bounded with high probability. On this event, at any time $t$, we conclude that
either the Bregman divergence is strictly decreasing, or $\widehat{\varepsilon}(h_t)$ is already below the noise level. In the latter case, our result is proven. In the former, by our assumptions the divergence is bounded at initialization, and it remains nonnegative by strong convexity of $\psi$ with respect to $\|\cdot\|$. Thus it can take at most a finite time
until $\widehat{\varepsilon}(h_t)$ drops below the noise level. Hence there is some time $t$ such that
Combining Theorems C.1 and C.2, we obtain a bound on the Rademacher complexity of the function class for each component of the hypothesis,
Application of Theorem C.5 to bound the generalization error, noting that the square loss is Lipschitz and bounded over the domain in question, gives us the bound
which completes the proof. ∎
Appendix B Further simulation details and results
B.1 MNIST simulation details
We implement the vector-valued Reflectron without weight sharing. Rather than implement the continuous-time dynamics (7), we directly utilize the dynamics (9), as they are more efficient without weight sharing. In Section 5, we show results for the mirror descent-like dynamics. Appendix B.2 shows similar results for the natural gradient-like dynamics. Our hypothesis at time $t$ is given by
$h_t(x) = u(\hat{\Theta}(t) x)$, where $u$ is an elementwise sigmoid, $x$ is an image from the MNIST dataset, and
$\hat{\Theta}$ is a matrix of parameters to be learned. We use one-hot encoding on the class labels, and the predicted class is obtained by taking an argmax
over the components of $h_t(x)$. A training set and a test set are both randomly selected from the overall dataset, so that the number of training examples is smaller than the number of parameters. The training data is preprocessed to have zero mean and unit variance. The testing data is shifted by the mean of the training data and normalized by the standard deviation of the training data. A fixed timestep
is used for all cases, up to adaptive timestepping performed by the black-box ODE solvers. Convergence speed can differ for different choices of $\psi$. To ensure convergence over similar timescales, we use a different learning rate in the continuous-time dynamics for $\psi = \frac{1}{2}\|\cdot\|_{1.1}^2$. Because we use an elementwise sigmoid, the output of our model is not required to be a