I Eliminating all bad local minima
Take a loss function $L(\theta)$, with parameters $\theta \in \mathbb{R}^n$, and with a global minimum $\min_\theta L(\theta) = 0$. Consider the modified loss function

$$\tilde{L}(\theta, a, b) = L(\theta)\left(1 + \left(a \exp(b) - 1\right)^2\right) + \lambda a^2, \tag{1}$$

where $a, b \in \mathbb{R}$ are auxiliary parameters, and $\lambda \in \mathbb{R}^+$ is a regularization hyperparameter. The specific form of Equation 1 was chosen to emphasize the similarity to the approach in Liang et al. [2] and Kawaguchi and Kaelbling [1], but without involving auxiliary units.

As can be seen by inspection, the gradient with respect to the auxiliary parameters $a, b$ is only zero for finite $b$ when $L(\theta) = 0$ and $a = 0$. Otherwise, $a$ will tend to shrink towards zero to satisfy the regularizer, $b$ will tend to grow towards infinity so that $a \exp(b)$ can remain approximately 1, and no fixed point will be achieved for finite $b$. Thus, all non-global local minima of $L(\theta)$ are transformed into minima at $b \to \infty$ of $\tilde{L}(\theta, a, b)$. Recall that minima at infinity do not qualify as local minima in $\mathbb{R}^n$. Therefore, any local minimum of $\tilde{L}$ is a global minimum of $L(\theta)$, and $\tilde{L}$ has no bad local minima.
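The behaviour described above can be checked numerically. The following is a minimal sketch, assuming the modified loss has the form $\tilde{L} = L(\theta)(1 + (a \exp(b) - 1)^2) + \lambda a^2$. The function names, the value $\lambda = 0.1$, the step size, and the starting point are illustrative choices, not taken from the paper. It freezes $\theta$ at a point with $L(\theta) = 1$, standing in for a bad local minimum of $L$, and runs plain gradient descent on $a$ and $b$ only.

```python
import math

LAM = 0.1  # regularization hyperparameter lambda > 0 (illustrative value)


def modified_loss(loss_value, a, b, lam=LAM):
    """Modified loss for a given value loss_value = L(theta)."""
    return loss_value * (1.0 + (a * math.exp(b) - 1.0) ** 2) + lam * a ** 2


def aux_gradient(loss_value, a, b, lam=LAM):
    """Analytic partial derivatives of the modified loss w.r.t. a and b."""
    r = a * math.exp(b) - 1.0
    d_a = 2.0 * loss_value * r * math.exp(b) + 2.0 * lam * a
    d_b = 2.0 * loss_value * r * a * math.exp(b)
    return d_a, d_b


def descend(loss_value, a=1.0, b=0.0, lr=0.005, steps=20000):
    """Gradient descent on (a, b) with theta frozen, so L(theta) is constant."""
    for _ in range(steps):
        d_a, d_b = aux_gradient(loss_value, a, b)
        a -= lr * d_a
        b -= lr * d_b
    return a, b


# theta frozen at a hypothetical bad local minimum with L(theta) = 1.
a_end, b_end = descend(loss_value=1.0)
print("a, b after descent:", a_end, b_end)
print("gradient at that point:", aux_gradient(1.0, a_end, b_end))

# At a global minimum (L = 0) with a = 0, the gradient vanishes for any b.
print("gradient at L = 0, a = 0:", aux_gradient(0.0, 0.0, 3.0))
```

In this sketch $a$ shrinks and $b$ keeps growing, slowly, for as long as the descent is run, and the gradient with respect to $b$ never reaches zero while $L(\theta) > 0$. Running more steps moves $b$ further out rather than converging.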
II Is this significant?
By eliminating the auxiliary neurons which play a central role in Kawaguchi and Kaelbling [1] and Liang et al. [2], we hope to provide more clarity into the mechanism by which bad local minima are removed from the augmented loss. We leave it to the reader to judge whether removing local minima in this fashion is trivial, deep, or both.

We also note that there is extensive discussion in Section 5 of Kawaguchi and Kaelbling [1] of situations in which their auxiliary variable (which plays a qualitatively similar role to $b$ in Section I above) diverges to infinity. So, it has been previously observed that pathologies can continue to exist in loss landscapes modified in a fashion similar to that above.
Acknowledgments
We thank Leslie Kaelbling, Andrey Zhmoginov, and Hossein Mobahi for feedback on a draft of the manuscript.
References
- [1] Kenji Kawaguchi and Leslie Pack Kaelbling. Elimination of all bad local minima in deep learning. arXiv preprint arXiv:1901.00279, 2019.
- [2] Shiyu Liang, Ruoyu Sun, Jason D Lee, and R Srikant. Adding one neuron can eliminate all bad local minima. Neural Information Processing Systems, 2018.
Appendix A All critical points of $\tilde{L}$ are global minima of $L$
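At critical points of $\tilde{L}$, $\partial \tilde{L} / \partial a = 0$ and $\partial \tilde{L} / \partial b = 0$, which together imply that $a = 0$. Substituting in $a = 0$, we must have $\partial \tilde{L} / \partial a = -2 L(\theta) \exp(b) = 0$ at any critical point of $\tilde{L}$ with respect to $a, b$. Since $\exp(b) > 0$ for any finite $b$, this can only be satisfied by $L(\theta) = 0$. Therefore at every critical point of $\tilde{L}$ (including every local minimum), $L(\theta) = 0$, and thus $\theta$ is a global minimum of $L(\theta)$.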
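For reference, the step from the two stationarity conditions to $a = 0$ can be written out explicitly. This is a sketch that assumes the modified loss has the form $\tilde{L}(\theta, a, b) = L(\theta)(1 + (a \exp(b) - 1)^2) + \lambda a^2$ with $\lambda > 0$:

```latex
\begin{align}
\frac{\partial \tilde{L}}{\partial a}
  &= 2 L(\theta)\,\bigl(a e^{b} - 1\bigr)\, e^{b} + 2 \lambda a, \\
\frac{\partial \tilde{L}}{\partial b}
  &= 2 L(\theta)\,\bigl(a e^{b} - 1\bigr)\, a e^{b}, \\
a\,\frac{\partial \tilde{L}}{\partial a} - \frac{\partial \tilde{L}}{\partial b}
  &= 2 \lambda a^{2}.
\end{align}
```

If both partial derivatives vanish, the third line gives $2 \lambda a^2 = 0$, so $a = 0$ because $\lambda > 0$. With $a = 0$ the first line reduces to $-2 L(\theta) e^{b}$, which is zero for finite $b$ only if $L(\theta) = 0$.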