
ODE Analysis of Stochastic Gradient Methods with Optimism and Anchoring for Minimax Problems and GANs
Despite remarkable empirical success, the training dynamics of generative adversarial networks (GANs), which involve solving a minimax game with stochastic gradients, remain poorly understood. In this work, we analyze last-iterate convergence of simultaneous gradient descent (simGD) and its variants under the assumption of convex-concavity, guided by a continuous-time analysis with differential equations. First, we show that simGD, as is, converges with stochastic subgradients under strict convexity in the primal variable. Second, we generalize optimistic simGD to accommodate an optimism rate separate from the learning rate and show its convergence with full gradients. Finally, we present anchored simGD, a new method, and show its convergence with stochastic subgradients.
05/26/2019 ∙ by Ernest K. Ryu, et al.
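The optimistic update with a separate optimism rate can be sketched on a toy bilinear game, f(x, y) = xy, where plain simGD is known to cycle. The game, the parameter values, and the function name below are illustrative assumptions, not the paper's exact algorithm or experiments:

```python
def optimistic_simgd(x, y, lr=0.1, opt=0.1, iters=2000):
    """Optimistic simultaneous GD on the bilinear toy game f(x, y) = x * y.

    lr is the learning rate and opt the (separate) optimism rate; with
    opt = 0 this reduces to plain simGD, which cycles on this game, and
    with opt = lr it recovers the standard optimistic update.
    """
    gx_prev, gy_prev = y, x           # gradients at the previous iterate
    for _ in range(iters):
        gx, gy = y, x                 # grad_x f = y, grad_y f = x
        x, y = (x - lr * gx - opt * (gx - gx_prev),   # descent in x
                y + lr * gy + opt * (gy - gy_prev))   # ascent in y
        gx_prev, gy_prev = gx, gy
    return x, y
```

Starting from (1, 1), the iterates spiral into the saddle point (0, 0), whereas setting `opt=0` leaves them orbiting at roughly constant distance.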

Plug-and-Play Methods Provably Converge with Properly Trained Denoisers
Plug-and-play (PnP) is a nonconvex framework that integrates modern denoising priors, such as BM3D or deep learning-based denoisers, into ADMM or other proximal algorithms. An advantage of PnP is that one can use pretrained denoisers when there is not sufficient data for end-to-end training. Although PnP has recently been studied extensively with great empirical success, theoretical analysis addressing even the most basic question of convergence has been insufficient. In this paper, we theoretically establish convergence of PnP-FBS and PnP-ADMM, without using diminishing step sizes, under a certain Lipschitz condition on the denoisers. We then propose real spectral normalization, a technique for training deep learning-based denoisers to satisfy the proposed Lipschitz condition. Finally, we present experimental results validating the theory.
05/14/2019 ∙ by Ernest K. Ryu, et al.
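The PnP-FBS scheme alternates a gradient step on the data-fidelity term with an application of the denoiser. A minimal sketch, with a hypothetical moving-average smoother standing in for a trained denoiser (the denoiser, problem, and parameter values are assumptions for illustration):

```python
import numpy as np

def smooth_denoiser(x):
    # Hypothetical stand-in for a trained denoiser: a periodic
    # moving-average smoother, which is nonexpansive.
    return 0.25 * np.roll(x, 1) + 0.5 * x + 0.25 * np.roll(x, -1)

def pnp_fbs(A, b, denoiser, step=0.1, iters=300):
    """PnP-FBS: gradient step on f(x) = 0.5||Ax - b||^2, then denoise."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = denoiser(x - step * (A.T @ (A @ x - b)))  # forward step + plug-in prior
    return x
```

When the composed map is contractive, the iterates settle at a fixed point that balances data fidelity against the denoiser's implicit prior; the Lipschitz condition in the paper is what makes such a contraction argument possible.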

On Markov Chain Gradient Descent
Stochastic gradient methods are the workhorse algorithms of large-scale optimization problems in machine learning, signal processing, and other computational sciences and engineering. This paper studies Markov chain gradient descent, a variant of stochastic gradient descent in which the random samples are taken along the trajectory of a Markov chain. Existing results for this method assume convex objectives and a reversible Markov chain and thus have their limitations. We establish new nonergodic convergence under wider step sizes, for nonconvex problems, and for nonreversible finite-state Markov chains. Nonconvexity makes our method applicable to broader problem classes. Nonreversible finite-state Markov chains, on the other hand, can mix substantially faster. To obtain these results, we introduce a new technique that varies the mixing levels of the Markov chains. The reported numerical results validate our contributions.
09/12/2018 ∙ by Tao Sun, et al.
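The sampling scheme can be sketched on a toy scalar problem: minimize the average of (x - a_i)^2 / 2, but draw the index i along a Markov chain trajectory rather than i.i.d. Everything below (the objective, the 3-state nonreversible chain, the step-size choice) is an illustrative assumption, not the paper's setting:

```python
import numpy as np

def mcgd(a, P, steps=20000, seed=0):
    """Markov chain gradient descent on F(x) = mean_i (x - a[i])^2 / 2.

    The sample index follows a trajectory of the Markov chain with
    transition matrix P instead of being drawn i.i.d.; with step size
    1/k this reduces to a running average of a[i] along the trajectory.
    """
    rng = np.random.default_rng(seed)
    x, i = 0.0, 0
    for k in range(1, steps + 1):
        x -= (x - a[i]) / k                 # stochastic gradient step
        i = rng.choice(len(P), p=P[i])      # advance the chain one step
    return x

# Nonreversible random walk on a 3-cycle: doubly stochastic, so the
# stationary distribution is uniform and F's minimizer is mean(a).
P = np.array([[0.0, 0.7, 0.3],
              [0.3, 0.0, 0.7],
              [0.7, 0.3, 0.0]])
```

Because the chain's stationary distribution is uniform, the trajectory average, and hence the iterate, approaches mean(a) even though consecutive samples are correlated.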

Theoretical Linear Convergence of Unfolded ISTA and its Practical Weights and Thresholds
In recent years, unfolding iterative algorithms as neural networks has become an empirical success in solving sparse recovery problems. However, its theoretical understanding is still immature, which prevents us from fully utilizing the power of neural networks. In this work, we study unfolded ISTA (Iterative Shrinkage-Thresholding Algorithm) for sparse signal recovery. We introduce a weight structure that is necessary for asymptotic convergence to the true sparse signal. With this structure, unfolded ISTA can attain linear convergence, which is better than the sublinear convergence of ISTA/FISTA in general cases. Furthermore, we propose to incorporate thresholding in the network to perform support selection, which is easy to implement and able to boost the convergence rate both theoretically and empirically. Extensive simulations, including sparse vector recovery and a compressive sensing experiment on real image data, corroborate our theoretical results and demonstrate their practical usefulness.
08/29/2018 ∙ by Xiaohan Chen, et al.
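The classical ISTA recursion being unfolded looks as follows; unfolding replaces the fixed matrices A.T/L and threshold lam/L with learned per-layer weights and thresholds. This is a sketch of plain ISTA only, not the paper's learned network:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, iters=500):
    """ISTA for the lasso problem min_x 0.5 * ||Ax - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        # gradient step on the quadratic, then proximal (shrinkage) step
        x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
    return x
```

In unfolded (learned) ISTA, each of a fixed number of such iterations becomes a network layer whose weight matrices and threshold are trained from data, which is what enables the linear convergence rate studied in the paper.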

Online Convolutional Dictionary Learning
Convolutional sparse representations are a form of sparse representation with a structured, translation-invariant dictionary. Most convolutional dictionary learning algorithms to date operate in batch mode, requiring simultaneous access to all training images during the learning process, which results in very high memory usage and severely limits the training data that can be used. Very recently, however, a number of authors have considered the design of online convolutional dictionary learning algorithms that offer far better scaling of memory and computational cost with training set size than batch methods. This paper extends our prior work by improving a number of aspects of our previous algorithm; proposing an entirely new algorithm, with better performance, that supports the inclusion of a spatial mask for learning from incomplete data; and providing a rigorous theoretical analysis of these methods.
08/31/2017 ∙ by Jialin Liu, et al.

A Primer on Coordinate Descent Algorithms
This monograph presents a class of algorithms called coordinate descent algorithms for mathematicians, statisticians, and engineers outside the field of optimization. This class of algorithms has recently gained popularity due to its effectiveness in solving large-scale optimization problems in machine learning, compressed sensing, image processing, and computational statistics. Coordinate descent algorithms solve optimization problems by successively minimizing along each coordinate or coordinate hyperplane, which is ideal for parallelized and distributed computing. Avoiding detailed technicalities and proofs, this monograph gives relevant theory and examples for practitioners to effectively apply coordinate descent to modern problems in data science and engineering.
09/30/2016 ∙ by Hao-Jun Michael Shi, et al.
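The core idea, successively minimizing along one coordinate at a time, can be sketched on a strongly convex quadratic, for which each coordinate minimization has a closed form (this toy instance is an assumption for illustration, not from the monograph):

```python
import numpy as np

def coordinate_descent(Q, b, sweeps=100):
    """Cyclic exact coordinate minimization of f(x) = 0.5 x'Qx - b'x, Q SPD.

    Each inner step minimizes f over a single coordinate with the others
    held fixed; for this quadratic the closed-form update coincides with
    the Gauss-Seidel iteration for Qx = b.
    """
    x = np.zeros(len(b))
    for _ in range(sweeps):
        for i in range(len(b)):
            # argmin over x[i] alone: zero the i-th partial derivative
            x[i] = (b[i] - Q[i] @ x + Q[i, i] * x[i]) / Q[i, i]
    return x
```

Each update touches only one coordinate and one row of Q, which is why such methods parallelize and distribute well on large problems.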

Coordinate Friendly Structures, Algorithms and Applications
This paper focuses on coordinate update methods, which are useful for solving problems involving large or high-dimensional datasets. They decompose a problem into simple subproblems, where each updates one, or a small block of, variables while fixing the others. These methods can deal with linear and nonlinear mappings, smooth and nonsmooth functions, as well as convex and nonconvex problems. In addition, they are easy to parallelize. The great performance of coordinate update methods depends on solving simple subproblems. To derive simple subproblems for several new classes of applications, this paper systematically studies coordinate-friendly operators that perform low-cost coordinate updates. Based on the discovered coordinate-friendly operators, as well as operator splitting techniques, we obtain new coordinate update algorithms for a variety of problems in machine learning, image processing, as well as subareas of optimization. Several problems are treated with coordinate update for the first time. The obtained algorithms are scalable to large instances through parallel and even asynchronous computing. We present numerical examples to illustrate how effective these algorithms are.
01/05/2016 ∙ by Zhimin Peng, et al.
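What "coordinate friendly" means can be sketched concretely: for least squares, caching the residual r = Ax - b lets one coordinate gradient step cost O(m) instead of the O(mn) of recomputing Ax from scratch. The function name and this toy instance are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def cf_coordinate_step(A, x, r, i, step):
    """One coordinate gradient step on f(x) = 0.5||Ax - b||^2 that keeps
    the cached residual r = Ax - b valid, at O(m) cost per step."""
    g_i = A[:, i] @ r          # partial derivative df/dx_i from the cache
    d = -step * g_i
    x[i] += d
    r += d * A[:, i]           # update the cache instead of recomputing Ax - b
    return x, r
```

Maintaining such cheap auxiliary quantities is exactly the kind of structure the paper's coordinate-friendly operators formalize.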

ARock: an Algorithmic Framework for Asynchronous Parallel Coordinate Updates
Finding a fixed point of a nonexpansive operator, i.e., x^*=Tx^*, abstracts many problems in numerical linear algebra, optimization, and other areas of scientific computing. To solve fixed-point problems, we propose ARock, an algorithmic framework in which multiple agents (machines, processors, or cores) update x in an asynchronous parallel fashion. Asynchrony is crucial to parallel computing since it reduces synchronization wait, relaxes communication bottlenecks, and thus speeds up computing significantly. At each step of ARock, an agent updates a randomly selected coordinate x_i based on possibly out-of-date information on x. The agents share x through either global memory or communication. If writing x_i is atomic, the agents can read and write x without memory locks. Theoretically, we show that if the nonexpansive operator T has a fixed point, then with probability one, ARock generates a sequence that converges to a fixed point of T. Our conditions on T and the step sizes are weaker than those of comparable work. Linear convergence is also obtained. We propose special cases of ARock for linear systems, convex optimization, machine learning, as well as distributed and decentralized consensus problems. Numerical experiments on sparse logistic regression problems are presented.
06/08/2015 ∙ by Zhimin Peng, et al.
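The per-agent update rule can be sketched sequentially (a single-threaded simulation with no staleness, so this illustrates only the random coordinate relaxation, not ARock's asynchrony): each step relaxes one random coordinate toward (Tx)_i. The operator T = x - alpha*(Qx - b), whose fixed point solves Qx = b, and all parameter values are illustrative assumptions:

```python
import numpy as np

def random_coordinate_km(Q, b, alpha=0.2, eta=0.5, steps=5000, seed=0):
    """Random coordinate Krasnosel'skii-Mann iteration for x = Tx with
    T(x) = x - alpha*(Qx - b), nonexpansive for small alpha when Q is SPD."""
    rng = np.random.default_rng(seed)
    x = np.zeros(len(b))
    for _ in range(steps):
        i = rng.integers(len(b))                   # agent picks a random coordinate
        Tx_i = x[i] - alpha * (Q[i] @ x - b[i])    # i-th entry of Tx
        x[i] += eta * (Tx_i - x[i])                # relaxed coordinate update
    return x
```

In ARock proper, many agents run this loop concurrently, each reading a possibly stale copy of x; the analysis shows convergence survives that staleness under mild step-size conditions.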

Sparse Recovery via Differential Inclusions
In this paper, we recover sparse signals from their noisy linear measurements by solving nonlinear differential inclusions, based on the notion of inverse scale space (ISS) developed in applied mathematics. Our goal here is to bring this idea to a challenging problem in statistics: finding the oracle estimator, which is unbiased and sign-consistent, using dynamics. We call our dynamics Bregman ISS and Linearized Bregman ISS. A well-known shortcoming of LASSO and of any convex regularization approach lies in the bias of the estimators. However, we show that under proper conditions, there exists a bias-free and sign-consistent point on the solution paths of such dynamics, which corresponds to a signal that is the unbiased estimate of the true signal and whose entries have the same signs as those of the true signal, i.e., the oracle estimator. Therefore, their solution paths are regularization paths better than the LASSO regularization path, since the points on the latter path are biased when sign-consistency is reached. We also show how to efficiently compute their solution paths in both continuous and discretized settings: the full solution paths can be exactly computed piece by piece, and a discretization leads to the Linearized Bregman iteration, which is a simple iterative thresholding rule and easy to parallelize. Theoretical guarantees such as sign-consistency and minimax optimal l_2-error bounds are established in both continuous and discrete settings for specific points on the paths. Early-stopping rules for identifying these points are given. The key treatment relies on the development of differential inequalities for differential inclusions and their discretizations, which extends previous results and leads to exponentially fast recovery of sparse signals before wrong ones are selected.
06/30/2014 ∙ by Stanley Osher, et al.
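The Linearized Bregman iteration mentioned above is indeed a simple thresholding rule: a gradient step in an auxiliary variable z followed by a shrinkage to produce x. A minimal sketch under assumed toy parameter values (delta, tau, and the trivial A = I test problem are illustrative, not from the paper):

```python
import numpy as np

def linearized_bregman(A, b, delta=5.0, tau=0.3, iters=300):
    """Linearized Bregman iteration for
    min ||x||_1 + ||x||^2 / (2*delta)  subject to  Ax = b.

    Each step is a gradient update of the dual-like variable z followed
    by soft-thresholding; convergence needs tau * delta * ||A||^2 < 2.
    """
    z = np.zeros(A.shape[1])
    shrink = lambda v: delta * np.sign(v) * np.maximum(np.abs(v) - 1.0, 0.0)
    for _ in range(iters):
        x = shrink(z)                   # iterative thresholding step
        z -= tau * (A.T @ (A @ x - b))  # gradient step on the constraint residual
    return shrink(z)
```

Both operations are entrywise or matrix-vector products, which is why the iteration parallelizes so easily.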

A fast patch-dictionary method for whole image recovery
Various algorithms have been proposed for dictionary learning. Among those for image processing, many use image patches to form dictionaries. This paper focuses on whole-image recovery from corrupted linear measurements. We address the open issue of representing an image by overlapping patches: the overlapping leads to an excessive number of dictionary coefficients to determine. With very few exceptions, this issue has limited the applications of image-patch methods to local tasks such as denoising, inpainting, cartoon-texture decomposition, super-resolution, and image deblurring, for which one can process a few patches at a time. Our focus is global imaging tasks such as compressive sensing and medical image recovery, where the whole image is encoded together, making it either impossible or very ineffective to update a few patches at a time. Our strategy is to divide the sparse recovery into multiple subproblems, each of which handles a subset of nonoverlapping patches, and then average the results of the subproblems to yield the final recovery. This simple strategy is surprisingly effective in terms of both quality and speed. In addition, we accelerate computation of the learned dictionary by applying a recent block proximal-gradient method, which not only has a lower per-iteration complexity but also takes fewer iterations to converge, compared to the current state-of-the-art. We also establish that our algorithm globally converges to a stationary point. Numerical results on synthetic data demonstrate that our algorithm can recover a more faithful dictionary than two state-of-the-art methods. Combining our whole-image recovery and dictionary-learning methods, we numerically simulate image inpainting, compressive sensing recovery, and deblurring. Our recovery is more faithful than those of a total variation method and a method based on overlapping patches.
08/16/2014 ∙ by Yangyang Xu, et al.

Video Compressive Sensing for Dynamic MRI
We present a video compressive sensing framework, termed kt-CSLDS, to accelerate the image acquisition process of dynamic magnetic resonance imaging (MRI). We are inspired by a state-of-the-art model for video compressive sensing that utilizes a linear dynamical system (LDS) to model the motion manifold. Given compressive measurements, the state sequence of an LDS can first be estimated using system identification techniques. We then reconstruct the observation matrix using a joint structured sparsity assumption. In particular, we minimize an objective function with a mixture of wavelet sparsity and joint sparsity within the observation matrix. We derive an efficient convex optimization algorithm through the alternating direction method of multipliers (ADMM), and provide a theoretical guarantee for global convergence. We demonstrate the performance of our approach for video compressive sensing in terms of reconstruction accuracy. We also investigate the impact of various sampling strategies. We apply this framework to accelerate the acquisition process of dynamic MRI and show that it achieves the best reconstruction accuracy with the least computational time compared with existing algorithms in the literature.
01/30/2014 ∙ by Jianing V. Shi, et al.
Wotao Yin
Applied mathematician and professor in the Mathematics department at the University of California, Los Angeles