
Maximum Principle Based Algorithms for Deep Learning
The continuous dynamical system approach to deep learning is explored in order to devise alternative frameworks for training algorithms. Training is recast as a control problem, which allows us to formulate necessary optimality conditions in continuous time using Pontryagin's maximum principle (PMP). A modification of the method of successive approximations is then used to solve the PMP, giving rise to an alternative training algorithm for deep learning. This approach has the advantage that rigorous error estimates and convergence results can be established. We also show that it may avoid some pitfalls of gradient-based methods, such as slow convergence on flat landscapes near saddle points. Furthermore, we demonstrate that it obtains a favorable initial convergence rate per iteration, provided Hamiltonian maximization can be carried out efficiently, a step which is still in need of improvement. Overall, the approach opens up new avenues to attack problems associated with deep learning, such as trapping in slow manifolds and the inapplicability of gradient-based methods for discrete trainable variables.
10/26/2017 ∙ by Qianxiao Li, et al.
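The abstract above gives no code, so as a rough, hedged illustration of the maximum-principle viewpoint only, the sketch below runs a basic method of successive approximations (MSA) on a toy one-dimensional "network" x_{t+1} = tanh(θ_t x_t). Everything here is an assumption for illustration: the scalar dynamics, the terminal cost, and the grid search over a small trust region, which crudely stands in for the penalized maximization step of the modified MSA; it is not the authors' algorithm.

```python
import numpy as np

# Toy control formulation of training: state x_{t+1} = tanh(theta_t * x_t),
# terminal cost Phi(x_T) = 0.5 * (x_T - target)^2, Hamiltonian
# H(x, p, theta) = p * tanh(theta * x).  Basic MSA alternates a forward
# state pass, a backward costate pass, and a layer-wise Hamiltonian
# maximization over a bounded set of controls.

T, x0, target = 5, 0.5, 0.8
thetas = np.full(T, 0.5)                      # initial controls ("weights")

for sweep in range(30):
    # forward pass: propagate the state through the layers
    x = np.empty(T + 1)
    x[0] = x0
    for t in range(T):
        x[t + 1] = np.tanh(thetas[t] * x[t])
    # backward pass: costates, with p_T = -dPhi/dx at the terminal state
    p = np.empty(T + 1)
    p[T] = -(x[T] - target)
    for t in reversed(range(T)):
        p[t] = p[t + 1] * (1.0 - x[t + 1] ** 2) * thetas[t]   # dH/dx
    # maximization step: argmax of H over a small grid around each control
    for t in range(T):
        cand = thetas[t] + np.linspace(-0.2, 0.2, 21)
        thetas[t] = cand[np.argmax(p[t + 1] * np.tanh(cand * x[t]))]

# terminal error, recomputed with the final controls
xT = x0
for t in range(T):
    xT = np.tanh(thetas[t] * xT)
err = abs(xT - target)
print(err)
```

Note that no gradient step with respect to the trainable variables appears: each sweep only evaluates the dynamics forward, the costates backward, and an argmax over candidate controls, which is why the same template also applies to discrete control sets.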

Stochastic modified equations and adaptive stochastic gradient algorithms
We develop the method of stochastic modified equations (SME), in which stochastic gradient algorithms are approximated in the weak sense by continuous-time stochastic differential equations. We exploit the continuous formulation together with optimal control theory to derive novel adaptive hyperparameter adjustment policies. Our algorithms have competitive performance with the added benefit of being robust to varying models and datasets. This provides a general methodology for the analysis and design of stochastic gradient algorithms.
11/19/2015 ∙ by Qianxiao Li, et al.
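As a minimal sketch of the SME idea (not the authors' adaptive policies): for the quadratic objective f(x) = x²/2 with additive gradient noise, SGD with learning rate η is weakly approximated by the Ornstein-Uhlenbeck SDE dX = -X dt + ση^{1/2} dW, so a statistic such as E[x_k] should track the SDE mean x₀e^{-ηk}. The constants below are illustrative assumptions.

```python
import numpy as np

# SGD on f(x) = x^2 / 2 with noisy gradients g_k = x_k + sigma * xi_k:
#   x_{k+1} = (1 - eta) * x_k - eta * sigma * xi_k
# SME approximation (weak sense): dX = -X dt + sigma * sqrt(eta) dW,
# whose mean at time t = k * eta is x0 * exp(-t).

rng = np.random.default_rng(0)
eta, sigma, x0, steps, runs = 0.01, 0.5, 1.0, 200, 20000

x = np.full(runs, x0)                   # many independent SGD runs
for _ in range(steps):
    x = (1.0 - eta) * x - eta * sigma * rng.standard_normal(runs)

mean_sgd = x.mean()                     # Monte Carlo estimate of E[x_k]
mean_sme = x0 * np.exp(-eta * steps)    # SDE mean at t = eta * steps
print(mean_sgd, mean_sme)
```

The two means agree to a few parts in a thousand here; the "weak sense" in the abstract means exactly this kind of agreement of expectations of observables, not pathwise closeness of individual runs.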

Convolutional neural networks with lowrank regularization
Large CNNs have delivered impressive performance in various computer vision applications, but their storage and computation requirements make it problematic to deploy these models on mobile devices. Recently, tensor decompositions have been used to speed up CNNs. In this paper, we further develop the tensor decomposition technique. We propose a new algorithm for computing the low-rank tensor decomposition for removing the redundancy in the convolution kernels. The algorithm finds the exact global optimizer of the decomposition and is more effective than iterative methods. Based on the decomposition, we further propose a new method for training low-rank constrained CNNs from scratch. Interestingly, while achieving a significant speedup, the low-rank constrained CNNs sometimes deliver significantly better performance than their non-constrained counterparts. On the CIFAR-10 dataset, the proposed low-rank NIN model achieves 91.31% accuracy (without data augmentation), which also improves upon the state-of-the-art result. We evaluated the proposed method on the CIFAR-10 and ILSVRC-12 datasets for a variety of modern CNNs, including AlexNet, NIN, VGG and GoogLeNet, with success. For example, the forward time of VGG-16 is reduced by half while the performance remains comparable. This empirical success suggests that low-rank tensor decompositions can be a very useful tool for speeding up large CNNs.
11/19/2015 ∙ by Cheng Tai, et al.
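The paper's exact decomposition scheme is not reproduced here; as a generic sketch of the mechanics only, the snippet below flattens a 4-D convolution kernel into a matrix and truncates its SVD, which by the Eckart-Young theorem is the exact global optimizer of the rank-constrained approximation, and then counts the parameter savings. The shapes and rank are made up, and the random kernel used here lacks the redundancy that makes small ranks work on trained networks.

```python
import numpy as np

# A conv kernel with N output channels, C input channels, d x d spatial size.
rng = np.random.default_rng(0)
N, C, d, r = 32, 16, 3, 8
W = rng.standard_normal((N, C, d, d))

# Flatten each filter and take the best rank-r approximation via SVD.
# SVD truncation is the exact global optimizer in Frobenius norm,
# as opposed to iterative alternating schemes.
M = W.reshape(N, C * d * d)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_r = (U[:, :r] * s[:r]) @ Vt[:r]
rel_err = np.linalg.norm(M - M_r) / np.linalg.norm(M)

# Parameter counts: the two rank-r factors replace the dense kernel,
# which is what produces the inference-time speedup.
full_params = N * C * d * d            # dense kernel
lowrank_params = r * (N + C * d * d)   # rank-r factors
print(rel_err, full_params, lowrank_params)
```

In a real network the two factors would be realized as a pair of smaller convolution layers applied in sequence, so the parameter ratio above translates directly into fewer multiply-accumulates per forward pass.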

Multiscale Adaptive Representation of Signals: I. The Basic Framework
We introduce a framework for designing multiscale, adaptive, shift-invariant frames and bi-frames for representing signals. The new framework, called AdaFrame, improves over dictionary-learning-based techniques in terms of computational efficiency at inference time, and improves on classical multiscale bases such as wavelet frames in terms of coding efficiency. It provides an attractive alternative to dictionary-learning-based techniques for low-level signal processing tasks, such as compression and denoising, as well as high-level tasks, such as feature extraction for object recognition. Connections with deep convolutional networks are also discussed. In particular, the proposed framework reveals a drawback in the commonly used approach for visualizing the activations of the intermediate layers of convolutional networks, and suggests a natural alternative.
07/17/2015 ∙ by Cheng Tai, et al.
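For context only, here is the classical one-level Haar wavelet analysis/synthesis pair, an example of the fixed multiscale bases the abstract says AdaFrame improves upon, showing the perfect-reconstruction property that any such frame design must preserve. This is standard wavelet material, not the AdaFrame construction.

```python
import numpy as np

# One level of the orthonormal Haar transform: a lowpass subband of pairwise
# averages and a highpass subband of pairwise differences, then exact synthesis.
x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 3.0])

a = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # coarse / lowpass coefficients
d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # detail / highpass coefficients

y = np.empty_like(x)                     # synthesis: invert the transform
y[0::2] = (a + d) / np.sqrt(2.0)
y[1::2] = (a - d) / np.sqrt(2.0)

print(np.allclose(x, y))                 # perfect reconstruction
```

An adaptive frame replaces the fixed average/difference filters with filters learned from data while keeping this invertibility, which is what makes it usable for compression and denoising.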

Understanding and Enhancing the Transferability of Adversarial Examples
State-of-the-art deep neural networks are known to be vulnerable to adversarial examples, formed by applying small but malicious perturbations to the original inputs. Moreover, the perturbations can transfer across models: adversarial examples generated for a specific model will often mislead other unseen models. Consequently, an adversary can leverage this to attack deployed systems without any queries, which severely hinders the application of deep learning, especially in areas where security is crucial. In this work, we systematically study two classes of factors that might influence the transferability of adversarial examples. One concerns model-specific factors, including network architecture, model capacity and test accuracy. The other is the local smoothness of the loss function used for constructing adversarial examples. Based on this understanding, we propose a simple but effective strategy to enhance transferability. We call it the variance-reduced attack, since it utilizes the variance-reduced gradient to generate adversarial examples. Its effectiveness is confirmed by a variety of experiments on both the CIFAR-10 and ImageNet datasets.
02/27/2018 ∙ by Lei Wu, et al.
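A toy sketch of the variance-reduced gradient idea: average input-gradients over small random perturbations of the input (smoothing the local roughness of the loss surface) before taking an FGSM-style sign step. A hand-written logistic model is used so the gradients are analytic; the model, ε, and smoothing parameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: logistic loss of a fixed linear classifier w on input x, label y.
w = rng.standard_normal(10)
x, y = rng.standard_normal(10), 1.0

def loss(x):
    return np.log1p(np.exp(-y * (w @ x)))

def grad_x(x):
    # analytic gradient of the logistic loss with respect to the input
    return -y * w / (1.0 + np.exp(y * (w @ x)))

# Variance-reduced gradient: average input-gradients over m Gaussian-smoothed
# copies of x, then take one FGSM-style sign step of size eps.
eps, m, sigma = 0.1, 32, 0.05
g = np.mean([grad_x(x + sigma * rng.standard_normal(10)) for _ in range(m)],
            axis=0)
x_adv = x + eps * np.sign(g)

print(loss(x), loss(x_adv))   # the perturbation increases the loss
```

The averaging leaves the attack white-box on the surrogate model; the claim studied in the paper is that the resulting example transfers to unseen models more reliably than one built from a single raw gradient.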

Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations
We develop the mathematical foundations of the stochastic modified equations (SME) framework for analyzing the dynamics of stochastic gradient algorithms, where the latter are approximated by a class of stochastic differential equations with small noise parameters. We prove that this approximation can be understood mathematically as a weak approximation, which leads to a number of precise and useful results on the approximation of stochastic gradient descent (SGD), momentum SGD and stochastic Nesterov's accelerated gradient method in the general setting of stochastic objectives. We also demonstrate through explicit calculations that this continuous-time approach can uncover important analytical insights into the stochastic gradient algorithms under consideration that may not be easy to obtain in a purely discrete-time setting.
11/05/2018 ∙ by Qianxiao Li, et al.
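A purely arithmetic check of the first-order weak accuracy that an SME analysis predicts, in a case where everything is in closed form: for SGD on f(x) = x²/2 with x₀ = 1, the first moment is E[x_k] = (1-η)^k while the SME mean is e^{-ηk}, and halving η at a fixed time t = ηk roughly halves the gap. Only the first moment of this simplest quadratic is checked; the example is illustrative, not taken from the paper.

```python
import numpy as np

# First-moment weak error of the SME on f(x) = x^2 / 2, closed form:
#   SGD:  E[x_k] = (1 - eta)^k      (x0 = 1)
#   SME:  E[X_t] = exp(-t),  compared at t = eta * k
def weak_err(eta, t=1.0):
    k = int(round(t / eta))
    return abs((1.0 - eta) ** k - np.exp(-t))

e1 = weak_err(0.10)      # coarse learning rate
e2 = weak_err(0.05)      # halved learning rate
print(e1, e2, e1 / e2)   # the error ratio is close to 2, i.e. O(eta)
```

This O(η) scaling of expectations of observables is exactly the sense in which the discrete algorithm and its modified equation agree, even though individual trajectories of the two processes look nothing alike.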