Gaussian Process Inference Using Mini-batch Stochastic Gradient Descent: Convergence Guarantees and Empirical Benefits

by   Hao Chen, et al.

Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for large-scale machine learning problems with independent samples due to their generalization performance and intrinsic computational advantage. However, the fact that the stochastic gradient is a biased estimator of the full gradient with correlated samples has led to the lack of theoretical understanding of how SGD behaves under correlated settings and hindered its use in such cases. In this paper, we focus on hyperparameter estimation for the Gaussian process (GP) and take a step forward towards breaking the barrier by proving minibatch SGD converges to a critical point of the full log-likelihood loss function, and recovers model hyperparameters with rate O(1/K) for K iterations, up to a statistical error term depending on the minibatch size. Our theoretical guarantees hold provided that the kernel functions exhibit exponential or polynomial eigendecay which is satisfied by a wide range of kernels commonly used in GPs. Numerical studies on both simulated and real datasets demonstrate that minibatch SGD has better generalization over state-of-the-art GP methods while reducing the computational burden and opening a new, previously unexplored, data size regime for GPs.



There are no comments yet.


page 1

page 2

page 3

page 4


The Impact of the Mini-batch Size on the Variance of Gradients in Stochastic Gradient Descent

The mini-batch stochastic gradient descent (SGD) algorithm is widely use...

On the Performance of Preconditioned Stochastic Gradient Descent

This paper studies the performance of preconditioned stochastic gradient...

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

Stochastic Gradient Descent (SGD) with small mini-batch is a key compone...

Optimizing Spectral Sums using Randomized Chebyshev Expansions

The trace of matrix functions, often called spectral sums, e.g., rank, l...

Stochastic Gradient Descent Meets Distribution Regression

Stochastic gradient descent (SGD) provides a simple and efficient way to...

A Stochastic Interpretation of Stochastic Mirror Descent: Risk-Sensitive Optimality

Stochastic mirror descent (SMD) is a fairly new family of algorithms tha...

Reducing Runtime by Recycling Samples

Contrary to the situation with stochastic gradient descent, we argue tha...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.