
Deep Factors for Forecasting
Producing probabilistic forecasts for large collections of similar and/or dependent time series is a practically relevant and challenging task. Classical time series models fail to capture complex patterns in the data, and multivariate techniques struggle to scale to large problem sizes. Their reliance on strong structural assumptions makes them dataefficient, and allows them to provide uncertainty estimates. The converse is true for models based on deep neural networks, which can learn complex patterns and dependencies given enough data. In this paper, we propose a hybrid model that incorporates the benefits of both approaches. Our new method is datadriven and scalable via a latent, global, deep component. It also handles uncertainty through a local classical model. We provide both theoretical and empirical evidence for the soundness of our approach through a necessary and sufficient decomposition of exchangeable time series into a global and a local part. Our experiments demonstrate the advantages of our model both in term of data efficiency, accuracy and computational complexity.
05/28/2019 ∙ by Yuyang Wang, et al. ∙ 6 ∙ shareread it

Efficient Multitask Feature and Relationship Learning
In this paper we propose a multiconvex framework for multitask learning that improves predictions by learning relationships both between tasks and between features. Our framework is a generalization of related methods in multitask learning, that either learn task relationships, or feature relationships, but not both. We start with a hierarchical Bayesian model, and use the empirical Bayes method to transform the underlying inference problem into a multiconvex optimization problem. We propose a coordinatewise minimization algorithm that has a closed form solution for each block subproblem. Naively these solutions would be expensive to compute, but by using the theory of doubly stochastic matrices, we are able to reduce the underlying matrix optimization subproblem into a minimum weight perfect matching problem on a complete bipartite graph, and solve it analytically and efficiently. To solve the weight learning subproblem, we propose three different strategies, including a gradient descent method with linear convergence guarantee when the instances are not shared by multiple tasks, and a numerical solution based on Sylvester equation when instances are shared. We demonstrate the efficiency of our method on both synthetic datasets and realworld datasets. Experiments show that the proposed optimization method is orders of magnitude faster than an offtheshelf projected gradient method, and our model is able to exploit the correlation structures among multiple tasks and features.
02/14/2017 ∙ by Han Zhao, et al. ∙ 0 ∙ shareread it

Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy
We propose a method to optimize the representation and distinguishability of samples from two probability distributions, by maximizing the estimated power of a statistical test based on the maximum mean discrepancy (MMD). This optimized MMD is applied to the setting of unsupervised learning by generative adversarial networks (GAN), in which a model attempts to generate realistic samples, and a discriminator attempts to tell these apart from data samples. In this context, the MMD may be used in two roles: first, as a discriminator, either directly on the samples, or on features of the samples. Second, the MMD can be used to evaluate the performance of a generative model, by testing the model's samples against a reference data set. In the latter role, the optimized MMD is particularly helpful, as it gives an interpretable indication of how the model and data distributions differ, even in cases where individual model samples are not easily distinguished either by eye or by classifier.
11/14/2016 ∙ by Dougal J. Sutherland, et al. ∙ 0 ∙ shareread it

AIDE: Fast and Communication Efficient Distributed Optimization
In this paper, we present two new communicationefficient methods for distributed minimization of an average of functions. The first algorithm is an inexact variant of the DANE algorithm that allows any local algorithm to return an approximate solution to a local subproblem. We show that such a strategy does not affect the theoretical guarantees of DANE significantly. In fact, our approach can be viewed as a robustification strategy since the method is substantially better behaved than DANE on data partition arising in practice. It is well known that DANE algorithm does not match the communication complexity lower bounds. To bridge this gap, we propose an accelerated variant of the first method, called AIDE, that not only matches the communication lower bounds but can also be implemented using a purely firstorder oracle. Our empirical results show that AIDE is superior to other communication efficient algorithms in settings that naturally arise in machine learning applications.
08/24/2016 ∙ by Sashank J Reddi, et al. ∙ 0 ∙ shareread it

Stochastic FrankWolfe Methods for Nonconvex Optimization
We study FrankWolfe methods for nonconvex stochastic and finitesum optimization problems. FrankWolfe methods (in the convex case) have gained tremendous recent interest in machine learning and optimization communities due to their projectionfree property and their ability to exploit structured constraints. However, our understanding of these algorithms in the nonconvex setting is fairly limited. In this paper, we propose nonconvex stochastic FrankWolfe methods and analyze their convergence properties. For objective functions that decompose into a finitesum, we leverage ideas from variance reduction techniques for convex optimization to obtain new variance reduced nonconvex FrankWolfe methods that have provably faster convergence than the classical FrankWolfe method. Finally, we show that the faster convergence rates of our variance reduced methods also translate into improved convergence rates for the stochastic setting.
07/27/2016 ∙ by Sashank J Reddi, et al. ∙ 0 ∙ shareread it

Fast Stochastic Methods for Nonsmooth Nonconvex Optimization
We analyze stochastic algorithms for optimizing nonconvex, nonsmooth finitesum problems, where the nonconvex part is smooth and the nonsmooth part is convex. Surprisingly, unlike the smooth case, our knowledge of this fundamental problem is very limited. For example, it is not known whether the proximal stochastic gradient method with constant minibatch converges to a stationary point. To tackle this issue, we develop fast stochastic algorithms that provably converge to a stationary point for constant minibatches. Furthermore, using a variant of these algorithms, we show provably faster convergence than batch proximal gradient descent. Finally, we prove global linear convergence rate for an interesting subclass of nonsmooth nonconvex functions, that subsumes several recent works. This paper builds upon our recent series of papers on fast stochastic methods for smooth nonconvex optimization [22, 23], with a novel analysis for nonconvex and nonsmooth functions.
05/23/2016 ∙ by Sashank J Reddi, et al. ∙ 0 ∙ shareread it

Stochastic Variance Reduction for Nonconvex Optimization
We study nonconvex finitesum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove nonasymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to minibatch variants of SVRG, showing (theoretical) linear speedup due to minibatching in parallel settings.
03/19/2016 ∙ by Sashank J Reddi, et al. ∙ 0 ∙ shareread it

Fast Incremental Method for Nonconvex Optimization
We analyze a fast incremental aggregated gradient method for optimizing nonconvex problems of the form _x ∑_i f_i(x). Specifically, we analyze the SAGA algorithm within an Incremental Firstorder Oracle framework, and show that it converges to a stationary point provably faster than both gradient descent and stochastic gradient descent. We also discuss a Polyak's special class of nonconvex problems for which SAGA converges at a linear rate to the global optimum. Finally, we analyze the practically valuable regularized and minibatch variants of SAGA. To our knowledge, this paper presents the first analysis of fast convergence for an incremental aggregated gradient method for nonconvex problems.
03/19/2016 ∙ by Sashank J Reddi, et al. ∙ 0 ∙ shareread it

Data Driven Resource Allocation for Distributed Learning
In distributed machine learning, data is dispatched to multiple machines for processing. Motivated by the fact that similar data points often belong to the same or similar classes, and more generally, classification rules of high accuracy tend to be "locally simple but globally complex" (Vapnik & Bottou 1993), we propose data dependent dispatching that takes advantage of such structure. We present an indepth analysis of this model, providing new algorithms with provable worstcase guarantees, analysis proving existing scalable heuristics perform well in natural non worstcase conditions, and techniques for extending a dispatching rule from a small sample to the entire distribution. We overcome novel technical challenges to satisfy important conditions for accurate distributed learning, including fault tolerance and balancedness. We empirically compare our approach with baselines based on random partitioning, balanced partition trees, and locality sensitive hashing, showing that we achieve significantly higher accuracy on both synthetic and real world image and advertising datasets. We also demonstrate that our technique strongly scales with the available computing power.
12/15/2015 ∙ by Travis Dick, et al. ∙ 0 ∙ shareread it

On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants
We study optimization algorithms based on variance reduction for stochastic gradient descent (SGD). Remarkable recent progress has been made in this direction through development of algorithms like SAG, SVRG, SAGA. These algorithms have been shown to outperform SGD, both theoretically and empirically. However, asynchronous versions of these algorithmsa crucial requirement for modern largescale applicationshave not been studied. We bridge this gap by presenting a unifying framework for many variance reduction techniques. Subsequently, we propose an asynchronous algorithm grounded in our framework, and prove its fast convergence. An important consequence of our general approach is that it yields asynchronous versions of variance reduction algorithms such as SVRG and SAGA as a byproduct. Our method achieves near linear speedup in sparse settings common to machine learning. We demonstrate the empirical performance of our method through a concrete realization of asynchronous SVRG.
06/23/2015 ∙ by Sashank J Reddi, et al. ∙ 0 ∙ shareread it

Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo
We consider the problem of Bayesian learning on sensitive datasets and present two simple but somewhat surprising results that connect Bayesian learning to "differential privacy:, a cryptographic approach to protect individuallevel privacy while permiting databaselevel utility. Specifically, we show that that under standard assumptions, getting one single sample from a posterior distribution is differentially private "for free". We will see that estimator is statistically consistent, near optimal and computationally tractable whenever the Bayesian model of interest is consistent, optimal and tractable. Similarly but separately, we show that a recent line of works that use stochastic gradient for Hybrid Monte Carlo (HMC) sampling also preserve differentially privacy with minor or no modifications of the algorithmic procedure at all, these observations lead to an "anytime" algorithm for Bayesian learning under privacy constraint. We demonstrate that it performs much better than the stateoftheart differential private methods on synthetic and real datasets.
02/26/2015 ∙ by YuXiang Wang, et al. ∙ 0 ∙ shareread it
Alex Smola
is this you? claim profile
Director, Machine Learning at Amazon, Professor at Carnegie Mellon University