In recent years we have witnessed remarkable advances in many areas of artificial intelligence. In large part this progress has been due to the success of machine learning methods, notably deep neural networks, applied to very large datasets. These networks are typically trained using variants of stochastic gradient descent (SGD), allowing training on large data using modern hardware. Despite intense recent research and significant progress toward understanding SGD and deep architectures, it has not been easy to understand the underlying causes of that success. Broadly speaking, it can be attributed to (a) the structure of the function space represented by the network or (b) the properties of the optimization algorithms used. While these two aspects of learning are intertwined, they are distinct and there is hope that they may be disentangled.
As learning in deep neural networks is still largely resistant to theoretical analysis, progress both in theory and practice can be made by exploring the limits of shallow methods on large datasets. Shallow methods, such as kernel methods, are a subject of an extensive and diverse literature. Theoretically, kernel machines are known to be universal learners, capable of learning nearly arbitrary functions given a sufficient number of examples [STC04, SC08]. Kernel methods are easily implementable and show state-of-the-art performance on smaller datasets (see [CK11, HAS14, DXH14, LML14, MGL17] for some comparisons with DNN’s). On the other hand, there has been significantly less progress in applying these methods to large modern data111However, see [HAS14, MGL17] for some notable successes.. The goal of this work is to make a step toward understanding the subtle interplay between architecture and optimization for shallow algorithms and to take practical steps to improve performance of kernel methods on large data.
The paper consists of two main parts. First, we identify a basic underlying limitation of using gradient descent-based methods in conjunction with smooth kernels typically used in machine learning. We show that only very smooth functions can be well-approximated after polynomially many steps of gradient descent. On the other hand, a less smooth target function cannot be approximated within using any polynomial number steps of gradient descent for kernel regression. This phenomenon is a result of the fast spectral decay of smooth kernels and can be readily understood in terms of the spectral structure of the gradient descent operator in the least square regression/classification setting, which is the focus of our discussion. Note the marked contrast with the standard analysis of gradient descent for convex optimizations problems, requiring at most steps to get an -approximation of a minimum. The difference is due to the infinite (or, in practice, very high) dimensionality of the target space and the fact that the minimizer of a convex functional is not generally an element of the same space.
A direct consequence of this theoretical analysis is slow convergence of gradient descent methods for high-dimensional regression, resulting in severe over-regularization/underfitting and suboptimal performance for less smooth functions. These functions are arguably very common in practice, at least in the classification setting, where we expect sharp transitions or even discontinuities near the class boundaries. We give some examples on real data showing that the number of steps of gradient descent needed to obtain near-optimal classification is indeed very large even for smaller problems. This shortcoming of gradient descent is purely algorithmic and is not related to the sample complexity of the data, persisting even in the limit of infinite data. It is also not an intrinsic flaw of kernel architectures, which are capable of approximating arbitrary functions but potentially require a very large number of gradient descent steps. The issue is particularly serious for large data, where direct second order methods cannot be used due to the computational constraints. Indeed, even for a dataset with only data points, practical direct solvers require cubic, on the order of
operations, weeks of computational time for a fast processor/GPU. While many approximate second-order methods are available, they rely on low-rank approximations and, as we discuss below, also lead to over-regularization as important information is contained in eigenvectors with very small eigenvalues typically discarded in such approximations.
In the second part of the paper we address this problem by proposing EigenPro iteration (see github.com/EigenPro for the code), a direct and simple method to alleviate slow convergence resulting from fast eigen-decay for kernel (and covariance) matrices. EigenPro is a preconditioning scheme based on approximately computing a small number of top eigenvectors to modify the spectrum of these matrices. It can also be viewed as constructing a new kernel, specifically optimized for gradient descent. While EigenPro uses approximate second-order information, it is only employed to modify first-order gradient descent leading to the same mathematical solution (without introducing a bias). Moreover, only one second-order problem at the start of the iteration needs to be solved. EigenPro requires only a small overhead per iteration compared to standard gradient descent and is also fully compatible with SGD. We analyze the step size in the SGD setting and provide a range of experimental results for different kernels and parameter settings showing consistent acceleration by a factor from five to over thirty over the standard methods, such as Pegasos [SSSSC11] for a range of datasets and settings. For large datasets, when the computational budget is limited, that acceleration translates into significantly improved accuracy and/or computational efficiency. It also obviates the need for complex computational resources such as supercomputer nodes or AWS clusters typically used with other methods. In particular, we are able to improve or match the state-of-the-art recent results for large datasets in the kernel literature at a small fraction of their reported computational budget, using a single GPU.
Finally, we note that in the large data setting, we are limited to a small number of iterations of gradient descent and certain approximate second-order computations. Thus, investigations of algorithms based on the space of functions that can be approximated within a fixed computational budget of these operations (defined in terms of the input data size) reflect the realities of modern large-scale machine learning more accurately than the more traditional analyses of convergence rates. Moreover, many aspects of modern inference are best reflected by an infinite dimensional optimization problem whose properties are sometimes different from the standard finite-dimensional results. Developing careful analyses and insights into these issues will no doubt result in significant pay-offs both in theory and in practice.
2 Gradient descent for shallow methods
Shallow methods. In the context of this paper, shallow methods denote the family of algorithms consisting of a (linear or non-linear) feature map to a (finite or infinite-dimensional) Hilbert space
followed by a linear regression/classification algorithm. This is a simple yet powerful setting amenable to theoretical analysis. In particular, it includes the class of kernel methods, where the feature map typically takes us from finite dimensional input to an infinite dimensional Reproducing Kernel Hilbert Space (RKHS). In what follows we will employ the square loss which significantly simplifies the analysis and leads to efficient and competitive algorithms.
Linear regression. Consider labeled data points . To simplify the notation let us assume that the feature map has already been applied to the data, i.e.,
. Least square linear regression aims to recover the parameter vectorthat minimize the empirical loss as follows:
When is not uniquely defined, we can choose the smallest norm solution. We do not include the typical regularization term, for reasons which will become clear shortly222We will argue that explicit regularization is rarely needed when using kernel methods for large data as available computational methods tend to over-regularize even without additional regularization..
Minimizing the empirical loss is related to solving a linear system of equations. Define the data matrix333We will take some liberties with infinite dimensional objects by sometimes treating them as vectors/matrices and writing as . and the label vector , as well as the (non-centralized) covariance matrix/operator,
Rewrite the loss as . Since , minimizing is equivalent to solving the linear system
with . When , the time complexity of solving the linear system in Eq. 4 directly (using Gaussian elimination or other methods typically employed in practice) is .
For kernel methods we frequently have . Instead of solving Eq. 4, one solves the dual system where is the kernel matrix corresponding to the kernel function . The solution can be written as . A direct solution requires operations.
Gradient descent (GD). While linear systems of equations can be solved by direct methods, their computational demands make them impractical for large data. On the other hand, gradient descent-type iterative methods hold the promise of a small number of matrix-vector multiplications, a much more manageable task. Moreover, these methods can typically be used in a stochastic setting, reducing computational requirements and allowing for very efficient GPU implementations. These schemes are adopted in popular kernel methods implementations such as NORMA [KSW04], SDCA [HCL08], Pegasos [SSSSC11], and DSGD [DXH14].
For linear systems of equations gradient descent takes a particularly simple form known as Richardson iteration [Ric11]. It is given by
We see that
It is easy to see that for convergence of to as we need to ensure444In general is chosen as a function of . However, in the least squares setting can be chosen to be a constant as the Hessian matrix does not change. that . It follows that .
When is finite dimensional the inequality has to be strict. In infinite dimension convergence is possible even if as long as each eigenvalue of is strictly smaller than one in absolute value. That will be the case for kernel integral operators.
It is now easy to describe the computational reach of gradient descent , i.e. the set of vectors which can be -approximated by gradient descent after steps
It is important to note that any cannot be -approximated by gradient descent in less than iterations.
We typically care about the quality of the solution
, rather than the error estimating the parameter vectorwhere . Thus (noticing that and commute), we get in the definition.
Remark 2.4 (Initialization).
We assume that the initialization . Choosing a different starting point will not significantly change the analysis unless second order information is incorporated in the initialization conditions555This is different for non-convex methods where different initializations may result in convergence to different local minima.. We also note that if is not full rank, gradient descent will converge to the minimum norm solution of Eq. 4.
Remark 2.5 (Infinite dimensionality).
Some care needs to be taken when is infinite-dimensional. In particular, the space of parameters and are very different spaces when is an integral operator. The space of parameters is in fact a space of distributions (generalized functions). Sometimes that can be addressed by using instead of , as .
To get a better idea of the space consider the eigendecomposition of . Let be its eigenvalues and
the corresponding eigenvectors/eigenfunctions. We have
Writing Eq. 6 in terms of eigendirection yields
and hence, putting ,
Recalling that and using the fact that , we see that a necessary condition for
This is a convenient characterization, we will denote
Another convenient necessary condition for , is that
Applying logarithm and noting that results in the following inequality that must hold for all (assuming ):
The standard result (see, e.g., [BV04]) that the number of iterations necessary for uniform convergence is of the order of the condition number follows immediately. However, we are primarily interested in the case when is infinite or very large. The corresponding operators/matrices are extremely ill-conditioned with infinite or approaching infinity condition number. In that case instead of a single condition number, one should consider a sequence of “condition numbers” along each eigen-direction.
2.1 Gradient descent, smoothness, and kernel methods.
We now proceed to analyze the computational reach for kernel methods. We will start by discussing the case of infinite data (the population case). It is both easier to analyze and allows us to demonstrate the purely computational (non-statistical) nature of limitations of gradient descent.
We will show that when the kernel is smooth, the reach of gradient descent is limited to very smooth, at least infinitely differentiable functions. Moreover, to approximate a function with less smoothness within some accuracy in the norm one needs a super-polynomial (or even exponential) in number of iterations of gradient descent.
Let the data be sampled from a probability with densityon a compact domain . In the case of infinite data becomes an integral operator corresponding to a positive definite kernel . We have
This is a compact self-adjoint operator with an infinite positive spectrum with .
We start by stating some results on the decay of eigenvalues of .
If is an infinitely differentiable kernel, the rate of eigenvalue decay is super-polynomial, i.e.
Moreover, if is an infinitely differentiable radial kernel (e.g., a Gaussian kernel), there exist constants such that for large enough ,
Interestingly, while eigenvalue decay is nearly exponential, it becomes milder as the dimension increases, leading to an unexpected “blessing of dimensionality" for gradient-descent type methods in high dimension. On the other hand, while not reflected in Theorem 1, this depends on the intrinsic dimension of the data, moderating the effect.
The computational reach of gradient descent in kernel methods. Consider now the eigenfunctions of , , which form an orthonormal basis for by the Mercer’s theorem. We can write a function as . We have .
We can now describe the reach of kernel methods with smooth kernel (in the infinite data setting). Specifically, functions which can be approximated in a polynomial number of iterations must have super-polynomial coefficient decay in the basis of kernel eigenfunctions.
Suppose is such that it can be approximated within using a polynomial in number of gradient descent iterations, i.e., for some . Then for any and large enough .
Note that . We have and hence for any , such that . From Theorem 1 we have . Thus this inequality holds whenever for some . Writing in terms of yields for sufficiently large. ∎
It is easy to see that the eigenfunctions corresponding to an infinitely differentiable kernel are also infinitely differentiable. Suppose in addition that their derivatives grow at most polynomially in , i.e. , where is some polynomial and is a Sobolev space. Then by differentiating the expansion of in terms of eigenfunctions we have the following
Any that for any can be -approximated with polynomial in number of steps of gradient descent is infinitely differentiable. Thus, if is not infinitely differentiable it cannot be -approximated in using a polynomial number of gradient descent steps.
We contrast Theorem 2 showing extremely slow convergence of gradient descent with the analysis of gradient descent for convex objective functions. The standard analysis (e.g. [B15]) indicates that steps of gradient descent is sufficient to recover the optimal value with accuracy and at first glance seems to apply in general to infinite dimensional Hilbert spaces. However in this case the standard analysis cannot be used is that the sequence diverges in as the optimal solution is not generally a function666It can be viewed as a generalized function in the space . in . This is a consequence of the fact that the infinite-dimensional operator is unbounded (see Appendix D for some experimental results).
Note that in finite dimension every non-degenerate operator, e.g. an inverse of a kernel matrix, is bounded. Nevertheless, infinite dimensional unboundedness manifests itself as the norm of the optimal solution can increase very rapidly with the size of the kernel matrix.
Gradient descent for periodic functions on . Let us now consider a simple but important special case, where gradient descent and its reach can be analyzed very explicitly. Let be a circle with the uniform measure, or, equivalently, consider periodic functions on the interval . Let be the heat kernel on the circle [Ros97]. This kernel is very close to the Gaussian kernel . The eigenfunctions of the integral operator corresponding to are simply the Fourier harmonics777We use for the index to avoid confusion with the complex number . and . The corresponding eigenvalues are and the kernel can be written as
Given a function on , we can write its Fourier series . A direct computation shows that for any , we have . We see that the space is “frozen” as grows extremely slowly when the number of iterations increases. As a simple example consider the Heaviside step function (on a circle), taking and values for and , respectively. The step function can be written as . From the analysis above, we need iterations of gradient descent to obtain an -approximation to the function. It is important to note that the Heaviside function is a rather natural example in the classification setting, where it represents the simplest two-class classification problem.
In contrast, a direct computation of inner products yields exact function recovery for any function in using the amount of computation equivalent to just one step888Applying an integral operator, i.e. infinite dimensional matrix multiplication, is roughly equivalent to a countable number of inner products of gradient descent999Of course, direct computation of inner products requires knowing the basis explicitly and in advance. In higher dimensions it also incurs a cost exponential in the dimension of the space.. Thus, we see that the gradient descent is an extremely inefficient way to recover Fourier series for a general periodic function. See Figure 1 for an illustration of this phenomenon. We see that the approximation for the Heaviside function is only marginally improved by going from to iterations of gradient descent. On the other hand, just Fourier harmonics provide a far more accurate reconstruction.
Things are not much better for functions with more smoothness unless they happen to be extremely smooth with exponential Fourier component decay. Thus in the classification case we expect nearly exponential increase in computational requirements as the margin between classes decreases.
The situation is only mildly improved in dimension , where the span of at most eigenfunctions of a Gaussian kernel or eigenfunctions of an arbitrary -differentiable kernel can be approximated in iterations. The discussion above shows that the gradient descent with a smooth kernel can be viewed as a heavy regularization of the target function. It is essentially a band-limited approximation with Fourier harmonics for some . While regularization is often desirable from a generalization/finite sample point of view in machine learning, especially when the number of data points is small, the bias resulting from the application of the gradient descent algorithm cannot be overcome in a realistic number of iterations unless the target functions are extremely smooth or the kernel itself is not infinitely differentiable.
Remark 2.9 (Rate of convergence vs statistical fit).
Note that we can improve convergence by changing the shape parameter of the kernel, i.e. making it more “peaked” (e.g., decreasing the bandwidth in the definition of the Gaussian kernel) While that does not change the exponential nature of the asymptotics of the eigenvalues, it slows their decay. Unfortunately improved convergence comes at the price of overfitting. In particular, for finite data, using a very narrow Gaussian kernel results in an approximation to the
-NN classifier, a suboptimal method which is up to a factor of two inferior to the Bayes optimal classifier in the binary classification case asymptotically. See AppendixH for some empirical results on the bandwidth selection for Gaussian kernels. Another possibility is to use a kernel, such as the Laplace kernel, which is not differentiable at zero. However, it also seems to consistently under-perform more smooth kernels on real data, see Appendix E for some experiments.
Finite sample effects, regularization and early stopping. So far we have discussed the effects of the infinite-data version of gradient descent. We will now discuss issues related to the finite sample setting we encounter in practical machine learning. It is well known (e.g., [B05, RBV10]) that the top eigenvalues of kernel matrices approximate the eigenvalues of the underlying integral operators. Therefore computational obstructions encountered in the infinite case persist whenever the data set is large enough.
Note that for a kernel method, iterations of gradient descent for data points require operations. Thus, gradient descent is computationally pointless unless . That would allow us to fit only about eigenvectors. In practice we would like to have to be much smaller than , probably a reasonably small constant.
At this point we should contrast our conclusions with the important analysis of early stopping for gradient descent provided in [YRC07] (see also [RWY14, CARR16]). The authors analyze gradient descent for kernel methods obtaining the optimal number of iterations of the form . That seems to contradict our conclusion that a very large, potentially exponential, number of iterations may be needed to guarantee convergence. The apparent contradiction stems from the assumption in [YRC07] and other works that the regression function belongs to the range of some power of the kernel operator . For an infinitely differentiable kernel, that implies super-polynomial spectral decay ( for any ). In particular, it implies that belongs to any Sobolev space. We do not typically expect such high degree of smoothness in practice, particularly in classification problems. In general, we expect sharp transitions of label probabilities across class boundaries to be typical for many classifications datasets. The Heaviside step function seems to be a simple but reasonable model for that behavior in one dimension. These areas of near-discontinuity101010Interestingly these sharp transitions can lead to lower sample complexity for optimal classifiers (cf. Tsybakov margin condition [Tsy04]). will result in slow decay of Fourier coefficients of and a mismatch with any infinitely differentiable kernel. Thus a reasonable approximation of would require a large number of gradient descent iterations.
To illustrate this point with a real data example, consider the results in the table on the right. We show the results of gradient descent for two subsets of points from the MNIST and HINT-M datasets (see Section 6 for the description) respectively. We see that the regression error on the training set is roughly inverse to the number of iterations, i.e. every extra bit of precision requires twice the number of iterations for the previous bit. For comparison, as we are primarily interested in the generalization properties of the solution, we see that the minimum regression () error on both test sets is achieved at over iterations. This results in at least cubic computational complexity equivalent to that of a direct method. While HINT-M is a regression dataset, the optimal classification accuracy for MNIST is also achieved at about iterations.
Regularization “by impatience”/explicit regularization terms. The above discussion suggests that gradient descent applied to kernel methods would typically result in underfitting for most larger datasets. Indeed, even iterations of gradient descent is prohibitive when data size is more than . As we will see in the experimental results section this is indeed the case. SGD ameliorates the problem mildly by allowing us to take approximate steps much faster but even so running standard gradient descent methods to optimality is often impractical. In most cases we observe little need for explicit early stopping rules. Regularization is a result of computational constraints (cf. [YRC07, RWY14, CARR16]) and can be termed regularization “by impatience” as we run out of time/computational budget allotted to the task.
Note that typical forms of regularization, result a large bias along eigenvectors with small eigenvalues . For example, adding a term of the form
(Tikhonov regularization/ridge regression) replacesby . While this improves the condition number and hence the speed of convergence, it comes at a high cost in terms of over-regularization/under-fitting as it essentially discards information along eigenvectors with eigenvalues smaller than . In the Fourier series analysis example, introducing this is similar to considering band-limited functions with Fourier components. Even for (machine precision for double floats) and the kernel parameter we can only fit about Fourier components! We argue that in most cases there is little need for explicit regularization in the big data regimes as our primary concern is underfitting.
Remark 2.10 (Stochastic gradient descent).
Our discussion so far has been centered entirely on gradient descent. In practice stochastic gradient descent is often used for large data. In our setting, for fixed , using SGD results in the same expected step size in each eigendirection as gradient descent. Hence, using SGD does not expand the algorithmic reach of gradient descent, although it speeds up convergence in practice. On the other hand, SGD introduces a number of interesting algorithmic and statistical subtleties. We will address some of them below.
3 EigenPro iteration: extending the reach of gradient descent
We will now propose some practical measures to alleviate the issues related to over-regularization of linear regression by gradient descent. As seen above, one of the key shortcomings of shallow learning methods based on smooth kernels (and their approximations, e.g. Fourier and RBF features) is their fast spectral decay. That observation suggests modifying the corresponding matrix by decreasing its top eigenvalues. This “partial whitening” enables the algorithm to approximate more target functions in a fixed number of iterations.
It turns out that accurate approximations of the top eigenvectors can be obtained from small subsamples of the data with modest computational expenditure. Moreover, “partially whitened" iteration can be done in a way compatible with stochastic gradient descent thus obviating the need to materialize full covariance/kernel matrices in memory. Combining these observations we construct a low overhead preconditioned Richardson iteration which we call EigenPro iteration.
Preconditioned (stochastic) gradient descent. We will modify the linear system in Eq. 4
with an invertible matrix, called a left preconditioner.
It is easy to see that as long as it converges to , the solution of the original linear system.
Preconditioned SGD can be defined similarly by
where we define and using , a sampled mini-batch of size . This preconditioned iteration also converges to with properly chosen [Mur98].
Notice that the preconditioned covariance matrix does not in general have to be symmetric. It is sometimes convenient to consider the closely related iteration
Here is a symmetric matrix. We see that .
Preconditioning as a linear feature map. It is easy to see that preconditioned iteration in Eq. 17 is in fact equivalent to the standard Richardson iteration in Eq. 5 on a dataset transformed with the linear feature map,
This is a convenient point of view as the transformed data can be stored for future use. It also shows that preconditioning is compatible with most computational methods both in practice and, potentially, in terms of analysis.
3.1 Linear EigenPro
We will now discuss properties desired to make preconditioned GD/SGD methods effective on large scale problems. Thus for the modified iteration in Eq. 15 we would like to choose to meet the following targets:
Acceleration. The algorithm should provide high accuracy in a small number of iterations.
Initial cost. The preconditioning matrix should be accurately computable without materializing the full covariance matrix.
Cost per iteration. Preconditioning by should be efficient per iteration in terms of computation and memory.
The relative approximation error along the eigenvector for gradient descent after iterations is . Minimizing the error suggests choosing the preconditioner to maximize the ratio for each . We see that modifying the top eigenvalues of makes the most difference in convergence. For example, decreasing improves convergence along all directions, while decreasing any other eigenvalue only speeds up convergence in that direction . However, decreasing below does not help unless is decreased as well. Therefore it is natural to decrease the top eigenvalues to the maximum amount, i.e. to , leading to the preconditioner
In fact it can be readily seen that is the optimal preconditioner of the form , where is a low rank matrix. We will see that -preconditioned iteration accelerates convergence by approximately a factor of .
However, exact construction of involves computing the eigendecomposition of the matrix , which is not feasible for large data size. To avoid this, we use subsample randomized SVD [HMT11] to obtain an approximate preconditioner, defined as
where algorithm RSVD (see Appendix A) computes the approximate top eigenvectors and eigenvalues and for subsample covariance matrix . Alternatively, a Nyström method based SVD (see Appendix A) can be applied to obtain eigenvectors (slightly less accurate although with little impact on training in practice) through a highly efficient implementation for GPU.
Additionally, we introduce the parameter to counter the effect of approximate top eigenvectors “spilling” into the span of the remaining eigensystem. Using is preferable to the obvious alternative of decreasing the step size as it does not decrease the step size in the directions nearly orthogonal to the span of . That allows the iteration to converge faster in those directions. In particular, are computed exactly, the step size in other eigendirections will not be affected by the choice of .
3.2 Kernel EigenPro
While EigenPro iteration can be applied to any linear regression problem, it is particularly useful in conjunction with smooth kernels which have fast eigenvalue decay. We will now discuss modifications needed to work directly in the RKHS (primal) setting.
In this setting, a reproducing kernel implies a feature map from to an RKHS space (typically) of infinite dimension. The feature map can be written as . This feature map leads to the (shallow) learning problem
Using properties of RKHS, EigenPro iteration in becomes where covariance operator and . The top eigensystem of forms the preconditioner
Notice that by the Representer theorem [Aro50], admits a representation of the form . Parameterizing the above iteration accordingly and applying some linear algebra lead to the following iteration in a finite-dimensional vector space,
where is the kernel matrix and EigenPro preconditioner is defined using the top eigensystem of (assume ),
This preconditioner differs from that for the linear case (Eq. 19) with an extra factor of due to the difference between the parameter space of and the RKHS space. Table 2 details the SGD version of this iteration.
EigenPro as kernel learning. Another way to view EigenPro is in terms of kernel learning. Assuming that the preconditioner is computed exactly, we see that in the population case EigenPro is equivalent to computing the (distribution-dependent) kernel
Notice that the RKHS spaces corresponding to and contain the same functions but have different norms. The norm in is a finite rank modification of the norm in the RKHS corresponding to , a setting reminiscent of [SNB05]
where unlabeled data was used to “warp” the norm for semi-supervised learning. However, in our paper the “warping" is purely for computational efficiency.
3.3 Costs and Benefits
We will now discuss the acceleration provided by EigenPro and the overhead associated with the algorithm.
Acceleration. Assuming that the preconditioner can be computed exactly, EigenPro computes the solution exactly in the span of the top eigenvectors. For EigenPro provides the acceleration factor of along the th eigendirection. Assuming that , a simple calculation shows an acceleration factor of at least over the standard gradient descent. Note that this assumes full gradient descent and exact computation of the preconditioner. See below for an acceleration analysis in the SGD setting resulting in a potentially somewhat smaller acceleration factor.
Initial cost. To construct the preconditioner , we perform RSVD (Appendix A) to compute the approximate top eigensystem of covariance . Algorithm RSVD has time complexity (see [HMT11]). The subsample size can be much smaller than the data size while still preserving the accuracy of estimation for top eigenvectors. In addition, we need extra memory to store the top- eigenvectors.
Cost per iteration. For standard SGD with features (or kernel centers) and mini-batch of size , the computational cost per iteration is . In addition, applying the preconditioner in EigenPro requires left multiplication by a matrix of rank . That involves vector-vector dot products for vectors of length , resulting in operations per iteration. Thus EigenPro using top- eigen-directions needs operations per iteration. Note that these can be implemented efficiently on a GPU. See Section 6 for actual overhead per iteration achieved in practice.
4 Step Size Selection for EigenPro Preconditioned Methods
We will now discuss the key issue of the step size selection for EigenPro iteration. For iteration involving covariance matrix , results in optimal (within a factor of ) convergence.
This suggests choosing the corresponding step size . However, in practice this will lead to divergence due to (1) approximate computation of eigenvectors (2) the randomness inherent in SGD. One possibility would be to compute
at every step. That, however, is costly, requiring computing the top singular value for every mini-batch. As the mini-batch can be assumed to be chosen at random, we propose using a lower bound on(with high probability) as the step size to guarantee convergence at each iteration, which works well in practice.
Linear EigenPro. Consider the EigenPro preconditioned SGD in Eq. 16. For this analysis assume that is formed by the exact eigenvectors111111 Approximate preconditioner with instead of can also be analyzed using results from [HMT11]. of . Interpreting as a linear feature map as in Section 2, makes a random subsample on the dataset . Now applying Lemma 3 (Appendix C) results in
If for any and , has following upper bound with probability at least ,
Kernel EigenPro. For EigenPro iteration in RKHS space, we can bound with a similar theorem where is the subsample covariance operator and is the corresponding EigenPro preconditioner operator. Since is infinite, we introduce intrinsic dimension from [Tro15] where is an arbitrary operator. It can be seen as a measure of the number of dimensions where has significant spectral content. Let . Then by Lemma 4 (Appendix C), we have
If for any and , with probability at least we have,
Choice of the step size. In both spectral norm bounds Eq. 21 and Eq. 22, is the dominant term when the mini-batch size is large. However, in most large-scale settings, is small, and becomes the dominant term. This suggests choosing step size leading to acceleration on the order of over the standard (unpreconditioned) SGD. That choice works well in practice.
5 EigenPro and Related Work
Recall that the setting of large scale machine learning imposes some fairly specific requirements on the optimization methods. In particular, the computational budget allocated to the problem must not significantly exceed operations, i.e., a small number of matrix-vector multiplications. That restriction rules out most direct second order methods which require operations. Approximate second order methods are far more effective computationally. However, they typically rely on low rank matrix approximation, a strategy which underperforms in conjunction with smooth kernels as information along important eigen-directions with small eigenvalues is discarded. Similarly, regularization improves conditioning and convergence but also discards information by biasing small eigen-directions.
While first order methods preserve important information, they, as discussed in this paper, are too slow to converge along eigenvectors with small eigenvalues. It is clear that an effective method must thus be a hybrid approach using approximate second order information in a first order method.
EigenPro is an example of such an approach as the second order information is used in conjunction with an iterative first order method. The things that make EigenPro effective are the following:
1. The second order information (eigenvalues and eigenvectors) is computed efficiently from a subsample of the data. Due to the quadratic loss function, that computation needs to be conducted only once. Moreover, the step size can be fixed throughout the iteration.
2. Preconditioned Richardson iteration is efficient and has a natural stochastic version. Preconditioning by a low rank modification of the identity matrix results in low overhead per iteration. The preconditioned update is computed on the fly without a need to materialize the full preconditioned covariance.
3. EigenPro iteration converges (mathematically) to the same result independently of the preconditioning matrix121212We note, however, that convergence will be slow if is poorly approximated.. That makes EigenPro relatively robust to errors in the second order preconditioning term , in contrast to most approximate second order methods.
We will now discuss some related literature and connections.
First order optimization methods. Gradient based methods, such as gradient descent (GD), stochastic gradient descent (SGD), are classic textbook methods [She94, DJS96, BV04, Bis07]. Recent renaissance of neural networks had drawn significant attention to improving and accelerating these methods, especially, the highly scalable mini-batch SGD. Methods like SAGA [RSB12] and SVRG [JZ13]
improve stochastic gradient by periodically evaluated full gradient to achieve variance reduction. Another set of approaches[DHS11, TH12, KB14] compute adaptive step size for each gradient coordinate every iteration. The step size is normally chosen to minimize certain regret bound of the loss function. Most of these methods introduce affordable computation and memory overhead.
Interpreting EigenPro iteration as a linear “partial whitening" feature map, followed by Richardson iteration, we see that most of these first order methods are compatible with EigenPro. Moreover, many convergence bounds for these methods [BV04, RSB12, JZ13] involve the condition number . EigenPro iteration generically improves such bounds by (potentially) reducing the condition number to .
Second order/hybrid optimization methods. Second order methods use the inverse of the Hessian matrix or its approximation to accelerate convergence [SYG07, BBG09, MNJ16, BHNS16, ABH16]. A limitations of many of these methods is the need to compute the full gradient instead of the stochastic gradient every iteration [LN89, EM15, ABH16] making them harder to scale to large data.
We note the work [EM15] which analyzed a hybrid first/second order method for general convex optimization with a rescaling term based on the top eigenvectors of the Hessian. That can be viewed as preconditioning the Hessian at every iteration of gradient descent. A related recent work [GOSS16] analyses a hybrid method designed to accelerate SGD convergence for linear regression with ridge regularization. The data are preconditioned by preprocessing (rescaling) all data points along the top singular vectors of the data matrix. The authors provide a detailed analysis of the algorithm depending on the regularization parameter. Another recent second order method PCG [ACW16] accelerates the convergence of conjugate gradient on large kernel ridge regression using a novel preconditioner. The preconditioner is the inverse of an approximate covariance generated with random Fourier features. By controlling the number of random features, this method strikes a balance between preconditioning effect and computational cost. [TRVR16] achieves similar preconditioning effects by solving a linear system involving a subsampled kernel matrix every iteration. While not strictly a preconditioner Nyström with gradient descent(NYTRO) [CARR16] also improves the condition number. Compared to many of these methods EigenPro directly addresses the underlying issues of slow convergence without introducing a bias in directions with small eigenvalues and incurring only a small overhead per iteration both in memory and computation.
Finally, limited memory BFGS (L-BFGS) [LN89] and its variants [SYG07, MNJ16, BHNS16] are among the most effective second order methods for unconstrained nonlinear optimization problems. Unfortunately, they can introduce prohibitive memory and computation overhead for large multi-class problems.
Scalable kernel methods. There is a significant literature on scalable kernel methods including [KSW04, HCL08, SSSSC11, TBRS13, DXH14]. Most of these are first order optimization methods. To avoid the computation and memory requirement typically involved in constructing the kernel matrix, they often adopt approximations like RBF feature [WS01, QB16, TRVR16] or random Fourier features [RR07, DXH14, TRVR16], which reduces such requirement to . Exploiting properties of random matrices and the Hadamard transform, [LSS13] further reduces the requirement to computation and memory, respectively.
Remark 5.2 (Fourier and other feature maps).
As discussed above, most scalable kernel methods suffer from limited computational reach when used with Gaussian and other smooth kernels. Feature maps, such as Random Fourier Features [RR07]
, are non-linear transformations and are agnostic with respect to the optimization methods. Still they can be viewed as approximations of smooth kernels and thus suffer from the fast decay of eigenvalues.
Preconditioned linear systems. There is a vast literature on preconditioned linear systems with a number of recent papers focusing on preconditioning kernel matrices, such as for low-rank approximation [FM12, CARR16] and faster convergence [COCF16, ACW16]. In particular, we note [FM12] which suggests approximations using top eigenvectors of the kernel matrix as a preconditioner, an idea closely related to EigenPro.
6 Experimental Results
In this section, we will present a number of experimental results to evaluate EigenPro iteration on a range of datasets.
Computing Resource. All experiments were run on a single workstation equipped with 128GB main memory, two Intel Xeon(R) E5-2620 processors, and one Nvidia GTX Titan X (Maxwell) GPU.
Datasets. The table on the right summarizes the datasets used in experiments. For image datasets (MNIST [LBBH98], CIFAR-10 [KH09], and SVHN [NWC11]), color images are first transformed to grayscale images. We then rescale the range of each feature to . For other datasets (HINT-S, HINT-M [HYWW13], TIMIT [GLF93], SUSY [BSW14]
), we normalize each feature by z-score. In addition, all multiclass labels are mapped to multiple binary labels.
Metrics. For datasets with multiclass or binary labels, we measure the training result by classification error (c-error), the percentage of predicted labels that are incorrect; for datasets with real valued labels, we adopt the mean squared error (mse).
Kernel methods. For smaller datasets exact solution of kernel regularized least squares (KRLS) gives the error close to optimal for kernel methods with the specific kernel parameters. To handle large dataset, we adopt primal space method, Pegasos [SSSSC11] using the square loss and stochastic gradient. For even larger dataset, we combine SGD and Random Fourier Features [RR07] (RF, see Appendix B) as in [DXH14, TRVR16]. The results of these two methods are presented as the baseline. Then we apply EigenPro to Pegasos and RF as described in Section 3. In addition, we compare the state-of-the-art results of other kernel methods to that of EigenPro in this section.
Hyperparameters. For consistent comparison, all iterative methods use mini-batch of size . EigenPro preconditioner is constructed using the top eigenvectors of a subsampled dataset of size . For EigenPro iteration with random features, we set the damping factor . For primal EigenPro .
Overhead of EigenPro iteration. The right side figure shows that the computational overhead of EigenPro iteration over the standard SGD ranged between and . For which is the default setting in all other experiments, EigenPro overhead is approximately .
Convergence acceleration by EigenPro for different kernels. Table 3
presents the number of epochs needed by EigenPro and Pegasos to reach the error of the optimal kernel classifier (computed by a direct method on these smaller datasets). The actual error can be found in AppendixE. We see that EigenPro provides acceleration of to times in terms of the number of epochs required without any loss of accuracy. The actual acceleration is about less due to the overhead of maintaining and applying a preconditioner.
Kernel bandwidth selection. We have investigated the impact of kernel bandwidth selection over convergence and performance for Gaussian kernel. As expected, kernel matrix with smaller bandwidth has slower eigenvalue decay, which in turn accelerates convergence of gradient descent. However, selecting smaller bandwidth also decreases test set performance. When the bandwidth is very small, the Gaussian classifier converges to 1-nearest neighbor method, something which we observe in practice. While 1-NN classifier provides reasonable performance, it has up to twice the error of the optimal Bayes classifier in theory and far from the carefully selected Gaussian kernel classifier in practice. See Appendix H for detailed results.
Comparisons on large datasets. On datasets involving up to a few million points, EigenPro consistently outperforms Pegasos/SGD-RF by a large margin when training with the same number of epochs (Table 4).
Comparisons to the state-of-the-art. In Table 5 we provide a comparison to state-of-the-art results for large datasets recently reported in the kernel literature. All of them use significant computational resources and sometimes complex training procedures. We see that EigenPro improves or matches performance performance on each dataset typically at a small fraction of the computational budget. We notice that the very recent work [MGL17]
achieves a better 30.9% error rate on TIMIT (using an AWS cluster). It is not directly comparable to our result as it employs kernel features generated using a new supervised feature selection method. EigenPro can plausibly further improve the training error or decrease computational requirements using this new feature set.
7 Conclusion and perspective
In this paper we have considered a subtle trade-off between smoothness and computation for gradient-descent based method. While smooth output functions, such as those produced by kernel algorithms with smooth kernels, are often desirable and help generalization (as, for example, encoded in the notion of algorithmic stability [BE02]) there appears to be a hidden but very significant computational cost when the smoothness of the kernel is mismatched with that of the target function. We argue and provide experimental evidence that these mismatches are common in standard classification problems. In particular, we view effectiveness of EigenPro as another piece of supporting evidence.
An important direction of future work is to understand whether this is a universal phenomenon encompassing a range of learning methods or something pertaining to the kernel setting. Specifically, the implications of this idea for deep neural networks need to be explored. Indeed, there is a body of evidence indicating that training neural networks results in highly non-smooth functions. For one thing, they can easily fit data even when the labels are randomly assigned [ZBH16]. Moreover, the pervasiveness of adversarial [SZS13] and even universal adversarial examples common to different networks [MDFFF16] suggests that there are many directions non-smoothness in the neighborhood of nearly any data point. Why neural networks show generalization despite this evident non-smoothness, remains a key open question.
Finally, we have seen that training of kernel methods on large data can be significantly improved by simple algorithmic modifications of first order iterative algorithms using limited second order information. It appears that purely second order methods cannot provide major improvements as low-rank approximations needed for dealing large data discard information corresponding to the higher frequency components present in the data. Better understanding of the computational and statistical issues and the trade-offs inherent in training, would no doubt result in even better shallow algorithms making them more competitive with deep networks on a given computational budget.
We thank Adam Stiff and Eric Fosler-Lussier for preprocessing and providing the TIMIT dataset. We are also grateful to Jitong Chen and Deliang Wang for providing the HINT dataset. We thank the National Science Foundation for financial support (IIS 1550757 and CCF 1422830) Part of this work was completed while the second author visited the Simons Institute at Berkeley.
- [ABH16] N. Agarwal, B. Bullins, and E. Hazan. Second order stochastic optimization in linear time. arXiv preprint arXiv:1602.03943, 2016.
- [ACW16] H. Avron, K. Clarkson, and D. Woodruff. Faster kernel ridge regression using sketching and preconditioning. arXiv preprint arXiv:1611.03220, 2016.
- [Aro50] N. Aronszajn. Theory of reproducing kernels. Transactions of the American mathematical society, 68(3):337–404, 1950.
- [B05] M. L. Braun et al. Spectral properties of the kernel matrix and their relation to kernel methods in machine learning. PhD thesis, University of Bonn, 2005.
- [B15] S. Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
- [BBG09] A. Bordes, L. Bottou, and P. Gallinari. SGD-QN: Careful quasi-newton stochastic gradient descent. JMLR, 10:1737–1754, 2009.
- [BE02] O. Bousquet and A. Elisseeff. Stability and generalization. JMLR, 2:499–526, 2002.
- [BHNS16] R. H. Byrd, S. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
- [Bis07] C. Bishop. Pattern recognition and machine learning. Springer, New York, 2007.
- [BSW14] P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature communications, 5, 2014.
- [BV04] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
- [CARR16] R. Camoriano, T. Angles, A. Rudi, and L. Rosasco. NYTRO: When subsampling meets early stopping. In AISTATS, pages 1403–1411, 2016.
- [CAS16] J. Chen, H. Avron, and V. Sindhwani. Hierarchically compositional kernels for scalable nonparametric learning. arXiv preprint arXiv:1608.00860, 2016.
- [CK11] C.-C. Cheng and B. Kingsbury. Arccosine kernels: Acoustic modeling with infinite neural networks. In ICASSP, pages 5200–5203. IEEE, 2011.
- [COCF16] K. Cutajar, M. Osborne, J. Cunningham, and M. Filippone. Preconditioning kernel matrices. In ICML, pages 2529–2538, 2016.
- [DHS11] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159, 2011.
- [DJS96] J. E. Dennis Jr and R. B. Schnabel. Numerical methods for unconstrained optimization and nonlinear equations. SIAM, 1996.
- [DXH14] B. Dai, B. Xie, N. He, Y. Liang, A. Raj, M. Balcan, and L. Song. Scalable kernel methods via doubly stochastic gradients. In NIPS, pages 3041–3049, 2014.
- [EM15] M. A. Erdogdu and A. Montanari. Convergence rates of sub-sampled newton methods. In NIPS, pages 3052–3060, 2015.
- [FM12] G. Fasshauer and M. McCourt. SIAM Journal on Scientific Computing, 34(2):A737–A762, 2012.
- [GLF93] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett. Darpa timit acoustic-phonetic continous speech corpus cd-rom. NIST speech disc, 1-1.1, 1993.
- [GOSS16] A. Gonen, F. Orabona, and S. Shalev-Shwartz. Solving ridge regression using sketched preconditioned svrg. In ICML, pages 1397–1405, 2016.
- [HAS14] P.-S. Huang, H. Avron, T. N. Sainath, V. Sindhwani, and B. Ramabhadran. Kernel methods match deep neural networks on timit. In ICASSP, pages 205–209. IEEE, 2014.
- [HCL08] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear svm. In ICML, pages 408–415, 2008.
- [HMT11] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
- [HYWW13] E. W. Healy, S. E. Yoho, Y. Wang, and D. Wang. An algorithm to improve speech recognition in noise for hearing-impaired listeners. The Journal of the Acoustical Society of America, 134(4):3029–3038, 2013.
- [JZ13] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
- [KB14] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [KH09] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto, 2009.
- [KSW04] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. Signal Processing, IEEE Transactions on, 52(8):2165–2176, 2004.
- [Küh87] T. Kühn. Eigenvalues of integral operators with smooth positive definite kernels. Archiv der Mathematik, 49(6):525–534, 1987.
- [LBBH98] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86, pages 2278–2324, 1998.
- [LML14] Z. Lu, A. May, K. Liu, A. B. Garakani, D. Guo, A. Bellet, L. Fan, M. Collins, B. Kingsbury, M. Picheny, and F. Sha. How to scale up kernel methods to be as good as deep neural nets. arXiv preprint arXiv:1411.4000, 2014.
- [LN89] D. C. Liu and J. Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503–528, 1989.
- [LSS13] Q. Le, T. Sarlós, and A. Smola. Fastfood-approximating kernel expansions in loglinear time. In ICML, 2013.
- [MDFFF16] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. arXiv preprint arXiv:1610.08401, 2016.
- [MGL17] A. May, A. B. Garakani, Z. Lu, D. Guo, K. Liu, A. Bellet, L. Fan, M. Collins, D. Hsu, B. Kingsbury, et al. Kernel approximation methods for speech recognition. arXiv preprint arXiv:1701.03577, 2017.
- [Min17] S. Minsker. On some extensions of bernstein’s inequality for self-adjoint operators. Statistics & Probability Letters, 2017.
- [MNJ16] P. Moritz, R. Nishihara, and M. Jordan. A linearly-convergent stochastic l-bfgs algorithm. In AISTATS, pages 249–258, 2016.
- [Mur98] N. Murata. A statistical study of on-line learning. Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, pages 63–92, 1998.
- [NWC11] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop, volume 2011, page 4, 2011.
- [PGB11] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al. The kaldi speech recognition toolkit. In ASRU, 2011.
Q. Que and M. Belkin.
Back to the future: Radial basis function networks revisited.In AISTATS, pages 1375–1383, 2016.
- [RBV10] L. Rosasco, M. Belkin, and E. D. Vito. On learning with integral operators. JMLR, 11(Feb):905–934, 2010.
- [Ric11] L. F. Richardson. The approximate arithmetical solution by finite differences of physical problems involving differential equations, with an application to the stresses in a masonry dam. Philosophical Transactions of the Royal Society of London. Series A, 210:307–357, 1911.
- [Ros97] S. Rosenberg. The Laplacian on a Riemannian manifold: an introduction to analysis on manifolds. Cambridge University Press, 1997.
- [RR07] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, pages 1177–1184, 2007.
- [RSB12] N. L. Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence _rate for finite training sets. In NIPS, pages 2663–2671, 2012.
- [RWY14] G. Raskutti, M. Wainwright, and B. Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. JMLR, 15(1):335–366, 2014.
- [SC08] I. Steinwart and A. Christmann. Support vector machines. Springer Science & Business Media, 2008.
- [She94] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical report, Pittsburgh, PA, USA, 1994.
- [SNB05] V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In ICML, pages 824–831, 2005.
- [SS16] G. Santin and R. Schaback. Approximation of eigenfunctions in kernel-based spaces. Advances in Computational Mathematics, 42(4):973–993, 2016.
- [SSSSC11] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical programming, 127(1):3–30, 2011.
- [STC04] J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge university press, 2004.
- [SYG07] N. N. Schraudolph, J. Yu, and S. Günter. A stochastic quasi-newton method for online convex optimization. In AISTATS, pages 436–443, 2007.
- [SZS13] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- [TBRS13] M. Takác, A. S. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In ICML (3), pages 1022–1030, 2013.
T. Tieleman and G. Hinton.
Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude.COURSERA: Neural Networks for Machine Learning, 4:2, 2012.
- [Tro15] J. A. Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.
- [TRVR16] S. Tu, R. Roelofs, S. Venkataraman, and B. Recht. Large scale kernel learning using block coordinate descent. arXiv preprint arXiv:1602.05310, 2016.
- [Tsy04] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, pages 135–166, 2004.
- [WS01] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, pages 682–688, 2001.
- [YRC07] Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
- [ZBH16] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.