
Every Local Minimum is a Global Minimum of an Induced Model
For nonconvex optimization in machine learning, this paper proves that every local minimum achieves the global optimality of the perturbable gradient basis model at any differentiable point. As a result, nonconvex machine learning is theoretically as well supported as convex machine learning with a handcrafted basis in terms of the loss at differentiable local minima, except when a preference is given to the handcrafted basis over the perturbable gradient basis. The proofs of these results are derived under mild assumptions. Accordingly, the proven results are directly applicable to many machine learning models, including practical deep neural networks, without any modification of practical methods. Furthermore, as special cases of our general results, this paper improves or complements several state-of-the-art theoretical results in the literature with a simple and unified proof technique.
04/07/2019 ∙ by Kenji Kawaguchi, et al.

A Stochastic First-Order Method for Ordered Empirical Risk Minimization
We propose a new stochastic first-order method for empirical risk minimization problems such as those that arise in machine learning. Traditional approaches, such as (mini-batch) stochastic gradient descent (SGD), utilize an unbiased estimator of the gradient of the empirical average loss. In contrast, we develop a computationally efficient method to construct a gradient estimator that is purposely biased toward observations with higher current losses, and that is itself an unbiased gradient estimator of an ordered modification of the empirical average loss. On the theory side, we show that the proposed algorithm is guaranteed to converge at a sublinear rate to a global optimum for convex losses and to a critical point for nonconvex losses. Furthermore, we prove a new generalization bound for the proposed algorithm. On the empirical side, we present extensive numerical experiments in which our proposed method consistently improves the test error compared with standard mini-batch SGD across a variety of models, including SVMs, logistic regression, and (nonconvex) deep learning problems.
07/09/2019 ∙ by Kenji Kawaguchi, et al.
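The ordered gradient estimator described above can be illustrated with a small sketch (an illustrative toy, not the paper's exact algorithm: the least-squares problem, the choice of keeping the q highest-loss samples per batch, and all hyperparameters here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: loss_i(w) = 0.5 * (x_i @ w - y_i)^2
X = rng.normal(size=(128, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=128)

def ordered_minibatch_grad(w, Xb, yb, q):
    """Average the gradients of only the q highest-loss samples in the batch,
    deliberately biasing the update toward the currently worst-fit points."""
    residuals = Xb @ w - yb
    losses = 0.5 * residuals ** 2
    top = np.argsort(losses)[-q:]          # indices of the q largest losses
    return Xb[top].T @ residuals[top] / q  # gradient of their average loss

w = np.zeros(5)
lr, batch, q = 0.05, 32, 8
for step in range(500):
    idx = rng.choice(128, size=batch, replace=False)
    w -= lr * ordered_minibatch_grad(w, X[idx], y[idx], q)

final_loss = 0.5 * np.mean((X @ w - y) ** 2)
print(final_loss)
```

Because the selected samples change as the parameters move, each update is an unbiased gradient of the ordered (worst-q) modification of the loss, even though it is biased with respect to the plain empirical average.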

Gradient Descent Finds Global Minima for Generalizable Deep Neural Networks of Practical Sizes
In this paper, we theoretically prove that gradient descent can find a global minimum for nonlinear deep neural networks of sizes commonly encountered in practice. The theory developed in this paper requires the number of trainable parameters to increase only linearly as the number of training samples increases. This allows the size of the deep neural networks to be several orders of magnitude smaller than that required by previous theories. Moreover, we prove that this linear growth of network size is the optimal rate and cannot be improved, except by a logarithmic factor. Furthermore, deep neural networks with this trainability guarantee are shown to generalize well to unseen test samples on a natural dataset, but not on a random dataset.
08/05/2019 ∙ by Kenji Kawaguchi, et al.

Streaming Normalization: Towards Simpler and More Biologically-Plausible Normalizations for Online and Recurrent Learning
We systematically explore a spectrum of normalization algorithms related to Batch Normalization (BN) and propose a generalized formulation that simultaneously addresses two major limitations of BN: (1) online learning and (2) recurrent learning. Our proposal is simpler and more biologically plausible. Unlike previous approaches, our technique can be applied out of the box to all learning scenarios (e.g., online learning, batch learning, fully-connected, convolutional, feedforward, recurrent, and mixed recurrent-convolutional settings) and compares favorably with existing approaches. We also propose Lp Normalization for normalizing by different orders of statistical moments. In particular, L1 normalization performs well, is simple to implement, fast to compute, and more biologically plausible, and is thus ideal for GPU or hardware implementations.
10/19/2016 ∙ by Qianli Liao, et al.
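The Lp idea above can be sketched in a few lines (a minimal batch-mode illustration, assuming per-feature normalization by the p-th root of the mean p-th absolute moment; the streaming and recurrent machinery of the paper is omitted):

```python
import numpy as np

def lp_normalize(x, p=1, eps=1e-5):
    """Normalize activations by the p-th root of the mean p-th absolute moment.

    x: array of shape (batch, features); normalization is per feature.
    p=1 divides by the mean absolute value (the cheap L1 variant);
    p=2 recovers a second-moment (variance-like) normalization.
    """
    moment = np.mean(np.abs(x) ** p, axis=0)
    return x / (moment ** (1.0 / p) + eps)

rng = np.random.default_rng(0)
acts = 3.0 * rng.normal(size=(256, 4)) + 1.0
out = lp_normalize(acts, p=1)
print(np.mean(np.abs(out), axis=0))  # each feature's mean |value| is ~1
```

The appeal of p=1 is visible here: it needs only absolute values and a mean, with no squares or square roots, which is what makes it attractive for hardware implementations.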

Deep Semi-Random Features for Nonlinear Function Approximation
We propose semi-random features for nonlinear function approximation. The flexibility of semi-random features lies between that of the fully adjustable units in deep learning and the random features used in kernel methods. For one-hidden-layer models with semi-random features, we prove, with no unrealistic assumptions, that the model classes contain an arbitrarily good function as the width increases (universality), and that despite nonconvexity, we can find such a good function (optimization theory) that generalizes to unseen new data (generalization bound). For deep models, again with no unrealistic assumptions, we prove universal approximation ability, a lower bound on the approximation error, a partial optimization guarantee, and a generalization bound. Depending on the problem, the generalization bound of deep semi-random features can be exponentially better than the known bounds for deep ReLU nets; our generalization error bound can be independent of the depth, the number of trainable weights, and the input dimensionality. In experiments, we show that semi-random features can match the performance of neural networks while using slightly more units, and that they outperform random features while using significantly fewer units. Moreover, we introduce a new implicit ensemble method based on semi-random features.
02/28/2017 ∙ by Kenji Kawaguchi, et al.
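A one-hidden-layer version of the idea can be sketched as follows (a minimal sketch, assuming the simplest kind of semi-random unit: a fixed random hyperplane gate 1{r.x > 0} multiplying a trainable linear response; the target function, widths, and least-squares fit are illustrative choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def semi_random_features(Xa, R):
    """Each unit gates a trainable linear map with a fixed random hyperplane:
    unit k contributes 1{x @ R[:, k] > 0} * x, so only the per-unit linear
    weights (fit below) are adjustable while the gates R stay random.
    Xa: (n, d) inputs with a bias column appended; R: (d, k) gate directions.
    Returns an (n, d * k) feature matrix."""
    gates = (Xa @ R > 0).astype(float)
    return (gates[:, :, None] * Xa[:, None, :]).reshape(len(Xa), -1)

# Fit a nonlinear target by least squares on the semi-random features.
X = rng.uniform(-1, 1, size=(500, 2))
y = np.abs(X[:, 0]) + np.sin(3 * X[:, 1])    # nonlinear target
Xa = np.hstack([X, np.ones((len(X), 1))])    # bias column gives gates offsets
R = rng.normal(size=(3, 64))                 # fixed, non-trainable gate directions
Phi = semi_random_features(Xa, R)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
mse = np.mean((Phi @ w - y) ** 2)
print(mse)
```

Because the nonlinearity (the gate pattern) is fixed at random, fitting the adjustable weights is a convex least-squares problem, yet the model is more expressive than purely random features, which randomize the linear response as well.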

Depth Creates No Bad Local Minima
In deep learning, depth, as well as nonlinearity, creates nonconvex loss surfaces. Does depth alone, then, create bad local minima? In this paper, we prove that without nonlinearity, depth alone does not create bad local minima, although it does induce a nonconvex loss surface. Using this insight, we greatly simplify a recently proposed proof that all of the local minima of feedforward deep linear neural networks are global minima. Our theoretical results generalize previous results with fewer assumptions, and this analysis provides a method for showing similar results beyond the square loss in deep linear models.
02/27/2017 ∙ by Haihao Lu, et al.

Global Continuous Optimization with Error Bound and Fast Convergence
This paper considers global optimization with an unknown black-box objective function that can be nonconvex and nondifferentiable. Such difficult optimization problems arise in many real-world applications, such as parameter tuning in machine learning, engineering design problems, and planning with a complex physics simulator. This paper proposes a new global optimization algorithm, called Locally Oriented Global Optimization (LOGO), that aims for both fast convergence in practice and a finite-time error bound in theory. The advantages and usage of the new algorithm are illustrated via theoretical analysis and an experiment conducted with 11 benchmark test functions. Further, we modify the LOGO algorithm to solve a planning problem via policy search with a continuous state/action space and a long time horizon while maintaining its finite-time error bound. We apply the proposed planning method to accident management of a nuclear power plant. The results of this application study demonstrate the practical utility of our method.
07/17/2016 ∙ by Kenji Kawaguchi, et al.

Bounded Optimal Exploration in MDP
Within the framework of probably approximately correct Markov decision processes (PAC-MDP), much theoretical work has focused on methods to attain near-optimality after a relatively long period of learning and exploration. However, practical concerns require the attainment of satisfactory behavior within a short period of time. In this paper, we relax the PAC-MDP conditions to reconcile theoretically driven exploration methods with practical needs. We propose simple algorithms for discrete and continuous state spaces, and illustrate the benefits of our proposed relaxation via theoretical analyses and numerical examples. Our algorithms also maintain anytime error bounds and average loss bounds. Our approach accommodates both Bayesian and non-Bayesian methods.
04/05/2016 ∙ by Kenji Kawaguchi, et al.

Generalization in Deep Learning
This paper explains why deep learning can generalize well despite its large capacity and possible algorithmic instability, non-robustness, and sharp minima, effectively addressing an open problem in the literature. Based on our theoretical insight, this paper also proposes a family of new regularization methods. Its simplest member was empirically shown to improve base models and achieve state-of-the-art performance on the MNIST and CIFAR-10 benchmarks. Moreover, this paper presents both data-dependent and data-independent generalization guarantees with improved convergence rates. Our results suggest several new open areas of research.
10/16/2017 ∙ by Kenji Kawaguchi, et al.

Deep Learning without Poor Local Minima
In this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. With no unrealistic assumptions, we first prove the following statements for the squared loss function of deep linear neural networks with any depth and any widths: 1) the function is nonconvex and nonconcave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) there exist "bad" saddle points (where the Hessian has no negative eigenvalue) for deeper networks (with more than three layers), whereas there are no bad saddle points for shallow networks (with three layers). Moreover, for deep nonlinear neural networks, we prove the same four statements via a reduction to a deep linear model under the independence assumption adopted from recent work. As a result, we present an instance for which we can answer the following question: how difficult is it to directly train a deep model in theory? It is more difficult than classical machine learning models (because of the nonconvexity), but not too difficult (because of the nonexistence of poor local minima). Furthermore, the mathematically proven existence of bad saddle points for deeper models suggests a possible open problem. We note that even though we have advanced the theoretical foundations of deep learning and nonconvex optimization, there is still a gap between theory and practice.
05/23/2016 ∙ by Kenji Kawaguchi, et al.
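The flavor of the four statements above can be checked by hand on the smallest deep linear model, a two-layer scalar network y = a*b*x with squared loss on the single example (x, y) = (1, 1), giving f(a, b) = (ab - 1)^2 (an illustrative special case, not the paper's general proof):

```python
import numpy as np

def f(a, b):
    """Squared loss of the scalar two-layer linear net y = a*b*x on (x, y) = (1, 1)."""
    return (a * b - 1.0) ** 2

def grad(a, b):
    """Gradient of f: (2b(ab - 1), 2a(ab - 1))."""
    return np.array([2 * b * (a * b - 1), 2 * a * (a * b - 1)])

# 1) Nonconvexity: convexity fails on the segment from (1, 1) to (-1, -1).
midpoint_val = f(0.0, 0.0)                   # = 1
endpoint_avg = 0.5 * (f(1, 1) + f(-1, -1))   # = 0
nonconvex = midpoint_val > endpoint_avg

# 2)-3) The origin is a critical point, but a saddle rather than a bad local
# minimum: f decreases along a = b = t and increases along a = t, b = -t.
grad_origin = grad(0.0, 0.0)
down = f(0.1, 0.1)    # (0.01 - 1)^2 = 0.9801 < f(0, 0) = 1
up = f(0.1, -0.1)     # (-0.01 - 1)^2 = 1.0201 > 1

# 4) Every local minimum is global: (1, 1) attains the global value 0.
print(nonconvex, grad_origin, down < 1.0 < up, f(1, 1))
```

This shallow (effectively three-layer in the paper's counting) example also matches statement 4: at the saddle (0, 0) the Hessian has a negative eigenvalue, so there is a strict descent direction; the "bad" saddles with no negative Hessian eigenvalue only appear in deeper networks.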

Bayesian Optimization with Exponential Convergence
This paper presents a Bayesian optimization method with exponential convergence that requires neither auxiliary optimization nor delta-cover sampling. Most Bayesian optimization methods require auxiliary optimization: an additional nonconvex global optimization problem, which can be time-consuming and hard to implement in practice. Also, the existing Bayesian optimization method with exponential convergence requires access to delta-cover sampling, which has been considered impractical. Our approach eliminates both requirements and achieves an exponential convergence rate.
04/05/2016 ∙ by Kenji Kawaguchi, et al.