
Every Local Minimum is a Global Minimum of an Induced Model
For nonconvex optimization in machine learning, this paper proves that every local minimum achieves the global optimality of the perturbable gradient basis model at any differentiable point. As a result, nonconvex machine learning is theoretically as supported as convex machine learning with a handcrafted basis in terms of the loss at differentiable local minima, except in the case when a preference is given to the handcrafted basis over the perturbable gradient basis. The proofs of these results are derived under mild assumptions. Accordingly, the proven results are directly applicable to many machine learning models, including practical deep neural networks, without any modification of practical methods. Furthermore, as special cases of our general results, this paper improves or complements several stateoftheart theoretical results in the literature with a simple and unified proof technique.
04/07/2019 ∙ by Kenji Kawaguchi, et al. ∙ 16 ∙ shareread it

Streaming Normalization: Towards Simpler and More Biologicallyplausible Normalizations for Online and Recurrent Learning
We systematically explored a spectrum of normalization algorithms related to Batch Normalization (BN) and propose a generalized formulation that simultaneously solves two major limitations of BN: (1) online learning and (2) recurrent learning. Our proposal is simpler and more biologicallyplausible. Unlike previous approaches, our technique can be applied out of the box to all learning scenarios (e.g., online learning, batch learning, fullyconnected, convolutional, feedforward, recurrent and mixed  recurrent and convolutional) and compare favorably with existing approaches. We also propose Lp Normalization for normalizing by different orders of statistical moments. In particular, L1 normalization is wellperforming, simple to implement, fast to compute, more biologicallyplausible and thus ideal for GPU or hardware implementations.
10/19/2016 ∙ by Qianli Liao, et al. ∙ 0 ∙ shareread it

Deep SemiRandom Features for Nonlinear Function Approximation
We propose semirandom features for nonlinear function approximation. The flexibility of semirandom feature lies between the fully adjustable units in deep learning and the random features used in kernel methods. For one hidden layer models with semirandom features, we prove with no unrealistic assumptions that the model classes contain an arbitrarily good function as the width increases (universality), and despite nonconvexity, we can find such a good function (optimization theory) that generalizes to unseen new data (generalization bound). For deep models, with no unrealistic assumptions, we prove universal approximation ability, a lower bound on approximation error, a partial optimization guarantee, and a generalization bound. Depending on the problems, the generalization bound of deep semirandom features can be exponentially better than the known bounds of deep ReLU nets; our generalization error bound can be independent of the depth, the number of trainable weights as well as the input dimensionality. In experiments, we show that semirandom features can match the performance of neural networks by using slightly more units, and it outperforms random features by using significantly fewer units. Moreover, we introduce a new implicit ensemble method by using semirandom features.
02/28/2017 ∙ by Kenji Kawaguchi, et al. ∙ 0 ∙ shareread it

Depth Creates No Bad Local Minima
In deep learning, depth, as well as nonlinearity, create nonconvex loss surfaces. Then, does depth alone create bad local minima? In this paper, we prove that without nonlinearity, depth alone does not create bad local minima, although it induces nonconvex loss surface. Using this insight, we greatly simplify a recently proposed proof to show that all of the local minima of feedforward deep linear neural networks are global minima. Our theoretical results generalize previous results with fewer assumptions, and this analysis provides a method to show similar results beyond square loss in deep linear models.
02/27/2017 ∙ by Haihao Lu, et al. ∙ 0 ∙ shareread it

Global Continuous Optimization with Error Bound and Fast Convergence
This paper considers global optimization with a blackbox unknown objective function that can be nonconvex and nondifferentiable. Such a difficult optimization problem arises in many realworld applications, such as parameter tuning in machine learning, engineering design problem, and planning with a complex physics simulator. This paper proposes a new global optimization algorithm, called Locally Oriented Global Optimization (LOGO), to aim for both fast convergence in practice and finitetime error bound in theory. The advantage and usage of the new algorithm are illustrated via theoretical analysis and an experiment conducted with 11 benchmark test functions. Further, we modify the LOGO algorithm to specifically solve a planning problem via policy search with continuous state/action space and long time horizon while maintaining its finitetime error bound. We apply the proposed planning method to accident management of a nuclear power plant. The result of the application study demonstrates the practical utility of our method.
07/17/2016 ∙ by Kenji Kawaguchi, et al. ∙ 0 ∙ shareread it

Bounded Optimal Exploration in MDP
Within the framework of probably approximately correct Markov decision processes (PACMDP), much theoretical work has focused on methods to attain near optimality after a relatively long period of learning and exploration. However, practical concerns require the attainment of satisfactory behavior within a short period of time. In this paper, we relax the PACMDP conditions to reconcile theoretically driven exploration methods and practical needs. We propose simple algorithms for discrete and continuous state spaces, and illustrate the benefits of our proposed relaxation via theoretical analyses and numerical examples. Our algorithms also maintain anytime error bounds and average loss bounds. Our approach accommodates both Bayesian and nonBayesian methods.
04/05/2016 ∙ by Kenji Kawaguchi, et al. ∙ 0 ∙ shareread it

Generalization in Deep Learning
This paper explains why deep learning can generalize well, despite large capacity and possible algorithmic instability, nonrobustness, and sharp minima, effectively addressing an open problem in the literature. Based on our theoretical insight, this paper also proposes a family of new regularization methods. Its simplest member was empirically shown to improve base models and achieve stateoftheart performance on MNIST and CIFAR10 benchmarks. Moreover, this paper presents both datadependent and dataindependent generalization guarantees with improved convergence rates. Our results suggest several new open areas of research.
10/16/2017 ∙ by Kenji Kawaguchi, et al. ∙ 0 ∙ shareread it

Deep Learning without Poor Local Minima
In this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. With no unrealistic assumption, we first prove the following statements for the squared loss function of deep linear neural networks with any depth and any widths: 1) the function is nonconvex and nonconcave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) there exist "bad" saddle points (where the Hessian has no negative eigenvalue) for the deeper networks (with more than three layers), whereas there is no bad saddle point for the shallow networks (with three layers). Moreover, for deep nonlinear neural networks, we prove the same four statements via a reduction to a deep linear model under the independence assumption adopted from recent work. As a result, we present an instance, for which we can answer the following question: how difficult is it to directly train a deep model in theory? It is more difficult than the classical machine learning models (because of the nonconvexity), but not too difficult (because of the nonexistence of poor local minima). Furthermore, the mathematically proven existence of bad saddle points for deeper models would suggest a possible open problem. We note that even though we have advanced the theoretical foundations of deep learning and nonconvex optimization, there is still a gap between theory and practice.
05/23/2016 ∙ by Kenji Kawaguchi, et al. ∙ 0 ∙ shareread it

Bayesian Optimization with Exponential Convergence
This paper presents a Bayesian optimization method with exponential convergence without the need of auxiliary optimization and without the deltacover sampling. Most Bayesian optimization methods require auxiliary optimization: an additional nonconvex global optimization problem, which can be timeconsuming and hard to implement in practice. Also, the existing Bayesian optimization method with exponential convergence requires access to the deltacover sampling, which was considered to be impractical. Our approach eliminates both requirements and achieves an exponential convergence rate.
04/05/2016 ∙ by Kenji Kawaguchi, et al. ∙ 0 ∙ shareread it

A Greedy Approximation of Bayesian Reinforcement Learning with Probably Optimistic Transition Model
Bayesian Reinforcement Learning (RL) is capable of not only incorporating domain knowledge, but also solving the explorationexploitation dilemma in a natural way. As Bayesian RL is intractable except for special cases, previous work has proposed several approximation methods. However, these methods are usually too sensitive to parameter values, and finding an acceptable parameter setting is practically impossible in many applications. In this paper, we propose a new algorithm that greedily approximates Bayesian RL to achieve robustness in parameter space. We show that for a desired learning behavior, our proposed algorithm has a polynomial sample complexity that is lower than those of existing algorithms. We also demonstrate that the proposed algorithm naturally outperforms other existing algorithms when the prior distributions are not significantly misleading. On the other hand, the proposed algorithm cannot handle greatly misspecified priors as well as the other algorithms can. This is a natural consequence of the fact that the proposed algorithm is greedier than the other algorithms. Accordingly, we discuss a way to select an appropriate algorithm for different tasks based on the algorithms' greediness. We also introduce a new way of simplifying Bayesian planning, based on which future work would be able to derive new algorithms.
03/13/2013 ∙ by Kenji Kawaguchi, et al. ∙ 0 ∙ shareread it

Theory of Deep Learning III: explaining the nonoverfitting puzzle
A main puzzle of deep networks revolves around the absence of overfitting despite overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamical systems associated with gradient descent minimization of nonlinear networks behave near zero stable minima of the empirical error as gradient system in a quadratic potential with degenerate Hessian. The proposition is supported by theoretical and numerical results, under the assumption of stable minima of the gradient. Our proposition provides the extension to deep networks of key properties of gradient descent methods for linear networks, that as, suggested in (1), can be the key to understand generalization. Gradient descent enforces a form of implicit regularization controlled by the number of iterations, and asymptotically converging to the minimum norm solution. This implies that there is usually an optimum early stopping that avoids overfitting of the loss (this is relevant mainly for regression). For classification, the asymptotic convergence to the minimum norm solution implies convergence to the maximum margin solution which guarantees good classification error for "low noise" datasets. The implied robustness to overparametrization has suggestive implications for the robustness of deep hierarchically local networks to variations of the architecture with respect to the curse of dimensionality.
12/30/2017 ∙ by Tomaso Poggio, et al. ∙ 0 ∙ shareread it