Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variable is assumed to be given by a known nonlinear “link” function applied to a linear predictor of the covariates, for some fixed vector of parameters. GLMs are readily kernelizable, which captures the more flexible setting where the covariates are mapped into a Reproducing Kernel Hilbert Space (RKHS) by a feature map and the linear predictor is given by the RKHS inner product. A prominent example of a GLM that arises in practice is logistic regression, which has wide-reaching applications in the natural, social, and medical sciences (Sur and Candès, 2019). Similarly, an immediate example of a kernel-based GLM is kernel logistic regression. Extensive details on GLMs can be found in the standard reference (McCullagh and Nelder, 1989).
Introduced by Kakade et al. (2011), the GLM-tron is a simple and intuitive Perceptron-like algorithm applicable for learning arbitrary GLMs with a nondecreasing and Lipschitz link function. In this work, we revisit the GLM-tron from a new perspective, leveraging recent developments in continuous-time optimization and adaptive control theory (Boffi and Slotine, 2019). We consider the continuous-time limit of the GLM-tron, and generalize the resulting continuous-time dynamics to a mirror descent-like setting (Beck and Teboulle, 2003; Krichene et al., 2015), which we call the Reflectron. By their equivalence in continuous time, our analysis also applies to natural gradient variants of the GLM-tron (Amari, 1998; Pascanu and Bengio, 2013). We prove non-asymptotic generalization error bounds for the resulting family of continuous-time dynamics, parameterized by the choice of potential function, and we further prove convergence rates in the realizable setting. Our continuous-time Reflectron immediately gives rise to a wealth of discrete-time algorithms by choice of discretization method, allowing us to leverage the vast body of work in numerical analysis (Butcher, 2001) and the widespread availability of off-the-shelf black-box ordinary differential equation solvers.
We connect the Reflectron algorithm with the growing body of literature on the implicit bias of optimization algorithms by applying a recent continuous-time limit (Boffi and Slotine, 2019) of a simple proof methodology for analyzing the implicit regularization of stochastic mirror descent (Azizan et al., 2019; Azizan and Hassibi, 2019). We show that, in the realizable setting, the choice of potential function implicitly biases the learned parameters toward the interpolating solution that minimizes the Bregman divergence to the initialization. We extend our results to a vector-valued setting which allows for weight sharing between output components, extending the Euclidean variant with independent weights first considered by Foster et al. (2020). We prove that convergence, implicit regularization, and similar generalization error bounds hold in this setting.
1.1 Related work and significance
The GLM-tron has recently seen impressive applications in both statistical learning and learning-based control theory. The original work applied the GLM-tron to efficiently learn Single Index Models (SIMs) (Kakade et al., 2011). A recent extension, the BregmanTron, uses Bregman divergences to obtain improved guarantees for learning SIMs, though its use of Bregman divergences is different from ours (Nock and Menon, 2020). Foster et al. (2020) utilized the GLM-tron to develop an adaptive control law for stochastic, nonlinear, discrete-time dynamical systems. Goel and Klivans (2017) use the kernelized GLM-tron, the Alphatron, to provably learn two hidden layer neural networks, while Goel et al. (2018) generalize the Alphatron to the Convotron for provable learning of one hidden layer convolutional neural networks. Orthogonally, but much like Foster et al. (2020), GLM-tron-like update laws have been developed in the adaptive control literature (Tyukin et al., 2007), along with mirror descent and momentum-like variants (Boffi and Slotine, 2019), where they can be used for provable control of unknown and nonlinearly parameterized dynamical systems. Our work extends this line of research by incorporating local geometry into the GLM-tron update, which regularizes the learned parameters.
Similarly, continuous-time approaches in machine learning and optimization have become increasingly fruitful tools for analysis. Su et al. (2016) derive a continuous-time limit of Nesterov’s celebrated accelerated gradient method (Nesterov, 1983), and show that this limit enables intuitive proofs via Lyapunov stability theory. Krichene et al. (2015) perform a similar analysis for mirror descent, while Zhang et al. (2018) show that applying standard Runge-Kutta integrators to the second-order dynamics of Su et al. (2016) preserves acceleration. Lee et al. (2016) show via dynamical systems theory that gradient descent with a random initialization almost surely avoids saddle points. Boffi and Slotine (2020) use a continuous-time view of distributed stochastic gradient descent methods to analyze the effect of distributed coupling on SGD noise. Remarkably, Wibisono et al. (2016), Betancourt et al. (2018), and Wilson et al. (2016) show that many accelerated optimization algorithms can be generated by discretizing the Euler-Lagrange equations of a certain functional known as the Bregman Lagrangian. For deep learning, Chen et al. (2018) show that residual networks can be interpreted as a forward-Euler discretization of a continuous-time dynamical system, and use higher-order integrators to arrive at alternative architectures. Our work continues this promising line of research, and highlights that continuous time offers clean and intuitive proofs that can later be discretized to obtain guarantees for discrete-time algorithms.
As exemplified by the field of deep learning, modern machine learning frequently takes place in a high-dimensional regime with more parameters than examples. It is now well-known that deep networks will interpolate noisy data, yet exhibit low generalization error despite interpolation when trained on meaningful data (Zhang et al., 2016). Defying classical statistical wisdom, an explanation for this apparent paradox has been given in the implicit bias of optimization algorithms and the double-descent curve (Belkin et al., 2019a). The notion of implicit bias captures the preference of an optimization algorithm to converge to a particular kind of interpolating solution – such as a minimum norm solution – when many options exist. Surprisingly, similar “harmless” or “benign” interpolation phenomena have been observed even in much simpler systems such as overparameterized linear regression (Bartlett et al., 2019; Muthukumar et al., 2019; Hastie et al., 2019) and random feature models (Belkin et al., 2019b; Mei and Montanari, 2019). Understanding implicit bias has thus become an important area of research, with applications ranging from modern deep learning to pure statistics.
Implicit bias has been characterized for separable classification problems (Soudry et al., 2018; Nacson et al., 2018), regression problems using mirror descent (Gunasekar et al., 2018b), and multilayer models (Gunasekar et al., 2018a; Woodworth et al., 2020; Gunasekar et al., 2017). Approximate results and empirical evidence are also available for nonlinear deep networks (Azizan et al., 2019). Our work contributes to the understanding of implicit bias in a practically relevant class of nonconvex learning problems, where proofs of convergence and bounds on the generalization error are attainable: GLM regression. Our algorithms have applications in recovering the weights of an unknown recurrent neural network, and may be useful for learning single-layer neural network models (Bai et al., 2019).
2 Problem setting and background
Our setup follows the original work of Kakade et al. (2011). We assume the dataset is sampled i.i.d. from a distribution under which the labels are generated by a known, nondecreasing, and Lipschitz link function applied to the inner product of a fixed, unknown vector of parameters with a finite-dimensional feature map with associated kernel. We assume that the parameter vector and the features are bounded in norm by fixed constants. Our goal is to approximate the regression function as measured by the expected squared loss. To this end, for a hypothesis we define the quantities
and we denote their empirical counterparts over the dataset with hats. Above, the first quantity measures the generalization error of the hypothesis, while the second measures the excess risk compared to the Bayes-optimal predictor. Towards minimizing the excess risk, we present a family of mirror descent-like algorithms for minimizing the empirical error over parametric hypotheses. Via standard statistical bounds (Bartlett and Mendelson, 2002), we transfer our guarantees on the empirical error to the population error, which in turn implies a small excess risk. The starting point of our analysis is the GLM-tron of Kakade et al. (2011). The GLM-tron is an iterative update law of the form
with initialization at the origin. The update (3) is gradient-like, obtained from gradient descent on the square loss by dropping the derivative of the link function. It admits a natural continuous-time limit.
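To make the update concrete, the iteration above can be sketched in a few lines of NumPy. This is a minimal reconstruction of the standard GLM-tron update (zero initialization, unit step size), not the paper's implementation; the sigmoid link used in the example is one admissible nondecreasing, Lipschitz choice.

```python
import numpy as np

def glm_tron(X, y, u, n_iters=1000):
    """Minimal GLM-tron sketch: a Perceptron-like update obtained from
    gradient descent on the square loss by dropping the link derivative.

    X : (n, d) array of features, y : (n,) array of targets,
    u : nondecreasing, Lipschitz link function applied elementwise.
    """
    n, d = X.shape
    theta = np.zeros(d)  # zero initialization, as in the original algorithm
    for _ in range(n_iters):
        residual = y - u(X @ theta)        # prediction errors
        theta = theta + residual @ X / n   # gradient-like step (no u' factor)
    return theta
```

On a realizable dataset with a sigmoid link, the empirical square loss of the returned hypothesis decreases toward zero, in line with the realizable-setting analysis.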
3 The Reflectron Algorithm
We define the Reflectron algorithm in continuous-time as the mirror descent-like dynamics
for a convex potential function. The parameters of the hypothesis at any given time are obtained by applying the inverse gradient of the potential to the output of the Reflectron.
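A minimal forward-Euler sketch makes the mirror descent structure concrete: the dynamics evolve the mirror variable z = grad psi(theta), and the parameters are recovered through the inverse gradient of psi. The hyperbolic potential psi(theta) = sum_j cosh(theta_j), whose elementwise gradient sinh has the closed-form inverse arcsinh, is our own illustrative choice of strongly convex potential, not one used in the paper.

```python
import numpy as np

def reflectron(X, y, u, grad_psi_inv, dt=1.0, n_steps=1000):
    """Forward-Euler sketch of the Reflectron dynamics: the mirror variable
    z = grad psi(theta) follows the GLM-tron vector field, and theta is
    recovered via the inverse gradient of the potential psi."""
    n, d = X.shape
    z = np.zeros(d)  # mirror variable; z = 0 corresponds to theta = 0 here
    for _ in range(n_steps):
        theta = grad_psi_inv(z)
        z = z + dt * (y - u(X @ theta)) @ X / n
    return grad_psi_inv(z)

# Illustrative potential psi(theta) = sum_j cosh(theta_j):
# grad psi = sinh (elementwise), with closed-form inverse arcsinh.
grad_psi_inv = np.arcsinh
```

Any other strongly convex potential with a computable inverse gradient slots into the same loop, which is how the choice of potential parameterizes the family of algorithms.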
3.1 Statistical guarantees
The following theorem gives our statistical guarantees for the Reflectron. It implies that for any choice of strongly convex potential, the Reflectron finds a nearly Bayes-optimal predictor if it is allowed to run for a sufficiently long time.
Theorem 3.1 (Statistical guarantees for the Reflectron).
Suppose that the data are drawn i.i.d. from a distribution supported on a bounded domain, with labels generated by a known nondecreasing and Lipschitz link function, a kernel function with corresponding finite-dimensional feature map, and an unknown vector of parameters. Assume that the potential is strongly convex with respect to the relevant norm and bounded by a constant, and that the features are bounded for all inputs. Then, with high probability over the draws of the data, there exists some time such that the hypothesis satisfies the bound stated in the theorem, where the parameter vector is output by the Reflectron at that time with initialization at the origin.
Consider the rate of change of the Bregman divergence (Bregman, 1967) between the parameters of the Bayes-optimal predictor and the parameters output by the Reflectron. Noting the equality furnished by the definition of the Bregman divergence, we find that
Using that the link function is Lipschitz and nondecreasing,
Now, note that each summand in the remaining empirical term is a zero-mean i.i.d. random variable with almost surely bounded norm. Then, by Lemma C.1, the empirical average concentrates with high probability. Assuming that the empirical error is not yet small at a given time, we conclude that
Hence, at each instant, either the Bregman divergence decreases at a uniform rate, or the empirical error already satisfies the desired bound. In the latter case, our result is proven. In the former, our assumptions bound the Bregman divergence at initialization, and strong convexity of the potential ensures it remains nonnegative, so the uniform decrease can persist only for a bounded time. Hence there is some bounded time at which the empirical error satisfies the desired bound. To transfer this bound on the empirical error to the population error, we bound the complexity of the hypothesis class. Combining Theorems C.1 and C.2 gives a bound on the Rademacher complexity. Application of Theorem C.3 to the square loss (which, while not bounded or Lipschitz in general, is both over the bounded domain under consideration) immediately implies the population bound with high probability. The conclusion of the theorem follows by a union bound. ∎
Because the excess risk equals the generalization error up to a constant, we can find a good predictor by using a hold-out set to estimate the error and taking the best predictor on the hold-out set. Our proof of Theorem 3.1 is similar to the corresponding proofs for the GLM-tron (Kakade et al., 2011) and the Alphatron (Goel and Klivans, 2017), but has two primary modifications. First, we consider the Bregman divergence under the potential between the Bayes-optimal parameters and the current parameters output by the Reflectron, rather than the squared Euclidean distance. Second, rather than analyzing the discrete-time iteration as in prior work, we analyze the time derivative of the Bregman divergence. One choice of potential recovers the guarantees of the Alphatron (our setting corresponds to a special case of the setting of Goel and Klivans (2017)), while the Euclidean potential recovers the guarantees of the GLM-tron, up to forward Euler discretization-specific details. As our analysis applies in the continuous-time setting, many algorithmic variants may be obtained by choice of discretization method. Because the Reflectron operates directly on the parameter vector rather than on a kernel expansion, we require the feature map to be finite-dimensional.
3.2 Implicit regularization
We now consider an alternative setting, and probe how the choice of potential impacts the parameters learned by the Reflectron. We require the following assumption.
Assumption 3.1. The dataset is realizable. That is, there exists some fixed parameter vector whose predictions match every example in the dataset exactly.
Assumption 3.1 allows us to understand both overfitting and interpolation by the Reflectron. In many cases, even the noisy dataset of Section 3.1 may satisfy Assumption 3.1. We begin by proving convergence of the Reflectron in the realizable setting.
Lemma 3.1 (Convergence of the Reflectron for a realizable dataset).
Suppose that the data are drawn i.i.d. from a distribution supported on a bounded domain and that Assumption 3.1 is satisfied with a known, nondecreasing, and Lipschitz link function. Let the potential be any convex function with invertible Hessian over the trajectory of the dynamics. Then the empirical error of the hypothesis output by the Reflectron tends to zero, for an arbitrary initialization. Furthermore, the best empirical error up to a given time decays at an explicit rate.
Under the assumptions of the lemma, (6) shows that
Integrating both sides of the above gives the bound
Explicit computation shows that the time derivative of the empirical error is bounded, so that the empirical error is uniformly continuous in time. By Barbalat’s Lemma (Lemma C.2), this implies that the empirical error tends to zero.
Furthermore, this simple analysis immediately gives us a convergence rate. Indeed, one can write
which yields the stated rate for the best iterate. ∎
Lemma 3.1 shows that the Reflectron will converge to an interpolating solution in the realizable setting, and that the best hypothesis up to a given time does so at an explicit rate. It also implies that the Bregman divergence remains bounded, which in many cases implies boundedness of the parameters themselves. In turn, for a realizable dataset, the strong convexity requirement of Theorem 3.1 can be relaxed to the requirement that the Hessian of the potential remain invertible over the trajectory. Note that Lemma 3.1 allows for arbitrary initialization, while Theorem 3.1 requires initialization at the origin.
In general, there may be many possible vectors consistent with the data. The following theorem provides insight into the parameters learned by the Reflectron.
Theorem 3.2 (Implicit regularization of the Reflectron).
Consider the setting of Lemma 3.1. Let the interpolating set be the set of parameters consistent with the data, and assume it is nonempty. Further assume that the Hessian of the potential is invertible. Then the Reflectron converges to the interpolating parameters that minimize the Bregman divergence to the initialization. In particular, if the initialization minimizes the potential, then the learned parameters minimize the potential over the interpolating set.
Let an arbitrary element of the interpolating set be given. Then,
Above, we used the realizability assumption and the invertibility of the Hessian of the potential. For clarity, define the error on each example as the difference between its label and the hypothesis prediction. Integrating both sides of the above from zero to infinity, we find that
The above relation is true for any element of the interpolating set, and the integral on the right-hand side is independent of that element. Hence the difference of the two Bregman divergences must be the same for every such element, which establishes the claimed minimization property. Choosing the initialization as in the theorem statement completes the proof. ∎
Theorem 3.2 elucidates the implicit bias of the Reflectron. Out of all possible interpolating parameters, the Reflectron finds those that minimize the Bregman divergence between the manifold of interpolating parameters and the initialization.
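The Euclidean special case of this statement can be checked numerically in a few lines. The snippet below is our own sanity check under the stated assumptions (identity link, quadratic potential, zero initialization), not an experiment from the paper: gradient flow on an underdetermined least squares problem should converge to the minimum-norm interpolator, i.e. the pseudoinverse solution.

```python
import numpy as np

# Sanity check of the Euclidean special case: with an identity link and
# psi(theta) = ||theta||^2 / 2, the Reflectron reduces to gradient flow on
# least squares. From theta(0) = 0 on an underdetermined problem, it should
# converge to the minimum-Euclidean-norm interpolating solution.

rng = np.random.default_rng(1)
n, d = 20, 50                                 # more parameters than examples
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = X @ rng.normal(size=d)                    # realizable by construction

theta = np.zeros(d)                           # initialization at the origin
for _ in range(20000):                        # forward-Euler discretization
    theta += 2.0 * (y - X @ theta) @ X / n

theta_min_norm = np.linalg.pinv(X) @ y        # minimum-norm interpolator
```

Because the iterates never leave the row space of the data matrix when started from the origin, the limiting interpolator is forced to be the minimum-norm one, which is exactly the implicit bias the theorem describes.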
4 Vector-valued GLMs
In this section, we consider an extension to the case of vector-valued target variables. We assume that the link function acts elementwise, where each component function is Lipschitz and nondecreasing in its argument. We define the expected and empirical error measures in this setting by replacing squared terms with squared Euclidean norms in the definitions (1) and (2).
In many cases, it is desirable to allow for weight sharing between output variables in a model. For instance, if a vector-valued GLM estimation problem originates in a control or system identification context, parameters can have physical meaning and may appear in multiple equations. Similarly, convolutional neural networks exploit weight sharing as a beneficial prior for imposing translation equivariance. We can provably learn and implicitly regularize weight-shared GLMs via a vector-valued Reflectron. Define the dynamics
with appropriate initialization, a fixed weight-sharing matrix, and a convex potential. Note that (7) encompasses models with an independent parameter matrix by unraveling the matrix into a vector and defining the weight-sharing structure appropriately. Appendix D discusses this case explicitly, where tighter bounds are attainable and matrix-valued regularizers can be used. Our work thus generalizes the model of Foster et al. (2020) to the case of shared parameters and mirror descent. It is similar in spirit to the Convotron of Goel et al. (2018), but exploits mirror descent and applies to vector-valued outputs. (7) could in principle be extended to provably learn regularized single-layer convolutional networks with multiple outputs via the distributional assumptions of Goel et al. (2018), which allow an application of average pooling.
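Without weight sharing, these dynamics take a particularly simple matrix form. The sketch below is our own minimal forward-Euler illustration of the Euclidean, independent-weights special case, not the weight-shared model itself.

```python
import numpy as np

def vector_glm_tron(X, Y, u, dt=1.0, n_steps=1000):
    """Euclidean, independent-weights special case of the vector-valued
    dynamics: each output dimension runs its own GLM-tron-like update on a
    parameter matrix W, so predictions are u(W x) applied elementwise."""
    n, d = X.shape
    k = Y.shape[1]
    W = np.zeros((k, d))                   # one row of parameters per output
    for _ in range(n_steps):
        residual = Y - u(X @ W.T)          # (n, k) prediction errors
        W = W + dt * residual.T @ X / n    # matrix-valued gradient-like step
    return W
```

Introducing a weight-sharing matrix amounts to constraining the rows of W to be linear functions of a common parameter vector, which is the structure the dynamics (7) handle directly.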
We can state analogous guarantees as in the scalar-valued case. We begin with convergence.
Lemma 4.1 (Convergence of the vector-valued Reflectron for a realizable dataset).
Suppose that the data are drawn i.i.d. from a distribution supported on a bounded domain and that Assumption 3.1 is satisfied, where the component link functions are known, nondecreasing, and Lipschitz. Let the potential be any convex function with invertible Hessian over the trajectory of the dynamics. Then the empirical error of the hypothesis output by the vector-valued Reflectron (7) tends to zero, for an arbitrary initialization. Furthermore, the best empirical error up to a given time decays at an explicit rate.
The proof is given in Appendix A.1. As in the scalar-valued case, the choice of potential implicitly biases the learned parameters.
Theorem 4.1 (Implicit regularization of the vector-valued Reflectron).
The statement parallels Theorem 3.2; the proof is given in Appendix A.2.
Theorem 4.2 (Statistical guarantees for the vector-valued Reflectron).
Suppose that the data are drawn i.i.d. from a distribution supported on a bounded domain, with labels generated by a known link function and an unknown vector of parameters. Assume that the link acts elementwise with each component Lipschitz and nondecreasing in its argument. Assume that the weight-sharing matrix is known and bounded, and that the features seen by each row of the matrix are bounded. Let the potential be strongly convex with respect to the relevant norm and bounded by a constant. Then, with high probability over the draws of the data, there exists some time such that the hypothesis satisfies the bound stated in the theorem,
where the parameters are output by the vector-valued Reflectron (7) at that time with initialization at the origin.
The proof is given in Appendix A.3.
5 Simulations
As a simple illustration of our theoretical results, we perform classification on the MNIST dataset using a single-layer multiclass classification model. Details of the simulation setup can be found in Appendix B.1. In Figure 1A, we show the empirical risk and generalization error trajectories for the Reflectron with two choices of potential. The second choice reduces to the GLM-tron, while the first, following Theorem 3.2, approximates the ℓ1 norm and imposes sparsity. The dynamics are integrated with the dop853 integrator from scipy.integrate.ode. Both choices converge to similar values of the empirical and generalization errors. In Figure 1B, we show the training and test set accuracy. Both choices of potential converge to similar accuracy values. In Figure 1C, we show the error curves for four integration methods. The curves lie directly on top of each other and are shifted arbitrarily for clarity. This agreement highlights the validity of our continuous-time analysis, and shows that many possible discrete-time algorithms are captured by the continuous-time dynamics.
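The continuous-time dynamics can indeed be handed directly to a black-box solver. The snippet below integrates the Euclidean (GLM-tron) vector field with the dop853 integrator from scipy.integrate.ode; the small synthetic problem and sigmoid link are our own illustrative stand-ins for the MNIST setup.

```python
import numpy as np
from scipy.integrate import ode

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4)) / 2.0
theta_star = rng.normal(size=4)
y = sigmoid(X @ theta_star)               # realizable synthetic data

def vector_field(t, theta):
    # Continuous-time GLM-tron dynamics: average residual-weighted features.
    return (y - sigmoid(X @ theta)) @ X / X.shape[0]

solver = ode(vector_field).set_integrator("dop853")
solver.set_initial_value(np.zeros(4), 0.0)
theta_T = solver.integrate(1000.0)        # run the flow to a large time T
```

Swapping "dop853" for any other integrator supported by scipy.integrate.ode (e.g. "dopri5" or "lsoda") yields a different discrete-time algorithm for the same continuous-time dynamics, which is the agreement Figure 1C illustrates.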
In Figure 2, we show histograms of the final parameter matrices learned by the Reflectron under the two choices of potential. The histograms validate the prediction of Theorem 3.2: the sparsity-promoting potential finds a sparse parameter vector that obtains accuracy values similar to the Euclidean potential, as seen in Fig. 1C, with a much larger fraction of its parameters having negligible magnitude. Future work will apply the Reflectron to models that combine a fixed expressive representation of the data with sparse learning, such as the scattering transform (Mallat, 2011; Bruna and Mallat, 2012; Talmon et al., 2015; Oyallon et al., 2019; Zarka et al., 2019).
6 Broader Impact
In this work, we developed mirror descent-like variants of the GLM-tron algorithm of Kakade et al. (2011). We proved guarantees on convergence and generalization, and characterized the implicit bias of our algorithms in terms of the potential function. We generalized our results to the case of vector-valued target variables while allowing for the possibility of weight sharing between outputs. Our algorithms have applications in several settings. Using the techniques in Foster et al. (2020), they may be generalized to the adaptive control context for provably regularized online learning and control of stochastic dynamical systems. Applications in control will advance automation, but may have negative downstream consequences for those working in areas that can be replaced by adaptive control or robotic systems. Our algorithms can also be used for recovering the weights of a continuous- or discrete-time recurrent neural network from online data, which may have applications in recurrent network pruning (via sparsity-promoting biases such as an approximate ℓ1 potential), or in computational neuroscience. We thank Stephen Tu for many helpful discussions.
- Amari (1998) Amari, S.-i. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276.
- Azizan and Hassibi (2019) Azizan, N. and Hassibi, B. (2019). Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. In International Conference on Learning Representations.
- Azizan et al. (2019) Azizan, N., Lale, S., and Hassibi, B. (2019). Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization. arXiv:1906.03830.
- Bai et al. (2019) Bai, S., Kolter, J. Z., and Koltun, V. (2019). Deep equilibrium models. arXiv:1909.01377.
- Bartlett et al. (2019) Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2019). Benign overfitting in linear regression. arXiv:1906.11300.
- Bartlett and Mendelson (2002) Bartlett, P. L. and Mendelson, S. (2002). Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482.
- Beck and Teboulle (2003) Beck, A. and Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167 – 175.
- Belkin et al. (2019a) Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019a). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854.
- Belkin et al. (2019b) Belkin, M., Hsu, D., and Xu, J. (2019b). Two models of double descent for weak features. arXiv:1903.07571.
- Betancourt et al. (2018) Betancourt, M., Jordan, M. I., and Wilson, A. C. (2018). On symplectic optimization. arXiv:1802.03653.
- Boffi and Slotine (2019) Boffi, N. M. and Slotine, J.-J. E. (2019). Higher-order algorithms and implicit regularization for nonlinearly parameterized adaptive control. arXiv:1912.13154.
- Boffi and Slotine (2020) Boffi, N. M. and Slotine, J.-J. E. (2020). A continuous-time analysis of distributed stochastic gradient. Neural Computation, 32(1):36–96.
- Bregman (1967) Bregman, L. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200 – 217.
- Bruna and Mallat (2012) Bruna, J. and Mallat, S. (2012). Invariant scattering convolution networks. arXiv:1203.1513.
- Butcher (2001) Butcher, J. (2001). Numerical methods for ordinary differential equations in the 20th century.
- Chen et al. (2018) Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. (2018). Neural ordinary differential equations. arXiv:1806.07366.
- Foster et al. (2020) Foster, D. J., Rakhlin, A., and Sarkar, T. (2020). Learning nonlinear dynamical systems from a single trajectory. arXiv:2004.14681.
- Goel and Klivans (2017) Goel, S. and Klivans, A. (2017). Learning neural networks with two nonlinear layers in polynomial time. arXiv:1709.06010.
- Goel et al. (2018) Goel, S., Klivans, A., and Meka, R. (2018). Learning one convolutional layer with overlapping patches. arXiv:1802.02547.
- Gunasekar et al. (2018a) Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. (2018a). Characterizing implicit bias in terms of optimization geometry. arXiv:1802.08246.
- Gunasekar et al. (2018b) Gunasekar, S., Lee, J. D., Soudry, D., and Srebro, N. (2018b). Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems 31, pages 9461–9471. Curran Associates, Inc.
- Gunasekar et al. (2017) Gunasekar, S., Woodworth, B., Bhojanapalli, S., Neyshabur, B., and Srebro, N. (2017). Implicit regularization in matrix factorization. arXiv:1705.09280.
- Hastie et al. (2019) Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2019). Surprises in high-dimensional ridgeless least squares interpolation. arXiv:1903.08560.
- Kakade et al. (2011) Kakade, S., Kalai, A. T., Kanade, V., and Shamir, O. (2011). Efficient learning of generalized linear and single index models with isotonic regression. arXiv:1104.2018.
- Kakade et al. (2009) Kakade, S. M., Sridharan, K., and Tewari, A. (2009). On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 793–800. Curran Associates, Inc.
- Kalai and Sastry (2009) Kalai, A. T. and Sastry, R. (2009). The isotron algorithm: High-dimensional isotonic regression. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.
- Krichene et al. (2015) Krichene, W., Bayen, A., and Bartlett, P. L. (2015). Accelerated mirror descent in continuous and discrete time. In Advances in Neural Information Processing Systems 28, pages 2845–2853. Curran Associates, Inc.
- Lee et al. (2016) Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. (2016). Gradient descent converges to minimizers. arXiv:1602.04915.
- Mallat (2011) Mallat, S. (2011). Group invariant scattering. arXiv:1101.2286.
- Maurer (2016) Maurer, A. (2016). A vector-contraction inequality for rademacher complexities. arXiv:1605.00251.
- McCullagh and Nelder (1989) McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, Second Edition. CRC Press.
- Mei and Montanari (2019) Mei, S. and Montanari, A. (2019). The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv:1908.05355.
- Muthukumar et al. (2019) Muthukumar, V., Vodrahalli, K., and Sahai, A. (2019). Harmless interpolation of noisy data in regression. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 2299–2303.
- Nacson et al. (2018) Nacson, M. S., Lee, J. D., Gunasekar, S., Savarese, P. H. P., Srebro, N., and Soudry, D. (2018). Convergence of gradient descent on separable data. arXiv:1803.01905.
- Nesterov (1983) Nesterov, Y. (1983). A Method for Solving a Convex Programming Problem with Convergence Rate O(1/k^2). Soviet Mathematics Doklady, 26:367–372.
- Nock and Menon (2020) Nock, R. and Menon, A. K. (2020). Supervised learning: No loss no cry. arXiv:2002.03555.
- Oyallon et al. (2019) Oyallon, E., Zagoruyko, S., Huang, G., Komodakis, N., Lacoste-Julien, S., Blaschko, M., and Belilovsky, E. (2019). Scattering networks for hybrid representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2208–2221.
- Pascanu and Bengio (2013) Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. arXiv:1301.3584.
- Slotine and Li (1991) Slotine, J.-J. and Li, W. (1991). Applied Nonlinear Control. Prentice Hall.
- Soudry et al. (2018) Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. (2018). The implicit bias of gradient descent on separable data. J. Mach. Learn. Res., 19(1):2822–2878.
- Su et al. (2016) Su, W., Boyd, S., and Candès, E. J. (2016). A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153):1–43.
- Sur and Candès (2019) Sur, P. and Candès, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29):14516–14525.
- Talmon et al. (2015) Talmon, R., Mallat, S., Zaveri, H., and Coifman, R. R. (2015). Manifold learning for latent variable inference in dynamical systems. IEEE Transactions on Signal Processing, 63(15):3843–3856.
- Tyukin et al. (2007) Tyukin, I. Y., Prokhorov, D. V., and van Leeuwen, C. (2007). Adaptation and parameter estimation in systems with unstable target dynamics and nonlinear parametrization. IEEE Transactions on Automatic Control, 52(9):1543–1559.
- Wainwright (2019) Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
- Wibisono et al. (2016) Wibisono, A., Wilson, A. C., and Jordan, M. I. (2016). A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358.
- Wilson et al. (2016) Wilson, A. C., Recht, B., and Jordan, M. I. (2016). A lyapunov analysis of momentum methods in optimization. arXiv:1611.02635.
- Woodworth et al. (2020) Woodworth, B., Gunasekar, S., Lee, J. D., Moroshko, E., Savarese, P., Golan, I., Soudry, D., and Srebro, N. (2020). Kernel and rich regimes in overparametrized models. arXiv:2002.09277.
- Zarka et al. (2019) Zarka, J., Thiry, L., Angles, T., and Mallat, S. (2019). Deep network classification by scattering and homotopy dictionary learning. arXiv:1910.03561.
- Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv:1611.03530.
- Zhang et al. (2018) Zhang, J., Mokhtari, A., Sra, S., and Jadbabaie, A. (2018). Direct runge-kutta discretization achieves acceleration. arXiv:1805.00521.
Appendix A Omitted proofs
A.1 Proof of Lemma 4.1
A.2 Proof of Theorem 4.1
The proof follows the same structure as in the scalar-valued case. Let an arbitrary element of the interpolating set be given. Then,
In the derivation above, we have used the realizability assumption and the invertibility of the Hessian of the potential. Integrating both sides of the above from zero to infinity, we find that
The above relation is true for any element of the interpolating set, and the integral on the right-hand side is independent of that element. Hence the difference of the two Bregman divergences must be the same for every such element, which shows that
Initializing as in the theorem statement completes the proof. ∎
A.3 Proof of Theorem 4.2
Consider the rate of change of the Bregman divergence between the parameters of the Bayes-optimal predictor and the parameters produced by the Reflectron at a given time. By the same method as in Lemma 4.1, we immediately have
Now, note that each summand is a zero-mean i.i.d. random variable with almost surely bounded Euclidean norm. Then, by Lemma C.1, the empirical average concentrates with high probability. Assuming that the empirical error is not yet small at a given time, we conclude that
Hence, at each instant, either the Bregman divergence decreases at a uniform rate, or the empirical error already satisfies the desired bound. In the latter case, our result is proven. In the former, our assumptions bound the Bregman divergence at initialization, and strong convexity of the potential ensures it remains nonnegative, so the uniform decrease can persist only for a bounded time. Hence there is some bounded time at which the empirical error satisfies the desired bound.
Application of Theorem C.5 to bound the generalization error, noting that the square loss is Lipschitz and bounded over the domain under consideration, gives us the bound
which completes the proof. ∎
Appendix B Further simulation details and results
B.1 MNIST simulation details
We implement the vector-valued Reflectron without weight sharing. Rather than implement the continuous-time dynamics (7), we directly utilize the dynamics (9), as they are more efficient without weight sharing. In Section 5, we show results for the mirror descent-like dynamics; Appendix B.2 shows similar results for the natural gradient-like dynamics. Our hypothesis at a given time is given by
where the link is an elementwise sigmoid applied to the product of a matrix of parameters to be learned with an image from the MNIST dataset. We use one-hot encoding on the class labels, and the predicted class is obtained by taking an argmax over the components of the hypothesis output. A training set and a test set are both randomly selected from the overall dataset. The training data is pre-processed to have zero mean and unit variance. The testing data is shifted by the mean of the training data and normalized by the standard deviation of the training data. A fixed timestep is used for all cases, up to adaptive timestepping performed by the black-box ODE solvers. Convergence speed can be different for different choices of potential. To ensure convergence over similar timescales, we use an adjusted learning rate in the continuous-time dynamics for the sparsity-promoting potential.
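The standardization convention described above, in which the test set reuses the training-set statistics, can be sketched generically; this is our own snippet, not the paper's code, and the epsilon guard against constant features is an added assumption.

```python
import numpy as np

def standardize(train, test, eps=1e-8):
    """Standardize using training-set statistics only: the test data is
    shifted by the training mean and scaled by the training standard
    deviation, avoiding information leakage from the test set."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0) + eps   # eps guards against constant features
    return (train - mu) / sigma, (test - mu) / sigma
```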