Fast and Sample Near-Optimal Algorithms for Learning Multidimensional Histograms
We study the problem of robustly learning multi-dimensional histograms. A d-dimensional function h : D → R is called a k-histogram if there exists a partition of the domain D ⊆ R^d into k axis-aligned rectangles such that h is constant within each such rectangle. Let f : D → R be a d-dimensional probability density function and suppose that f is OPT-close, in L_1-distance, to an unknown k-histogram (with unknown partition). Our goal is to output a hypothesis that is (O(OPT) + ϵ)-close to f in L_1-distance. We give an algorithm for this learning problem that uses n = Õ_d(k/ϵ^2) samples and runs in time Õ_d(n). For any fixed dimension, our algorithm has optimal sample complexity, up to logarithmic factors, and runs in near-linear time. Prior to our work, the time complexity of the d = 1 case was well understood, but significant gaps in our understanding remained even for d = 2.
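To make the objects in the problem statement concrete, the following minimal Python sketch represents a d-dimensional k-histogram as k axis-aligned rectangles, each carrying a constant value, and estimates the L_1 distance that serves as the error metric. This illustrates the definitions only, not the paper's learning algorithm; the names make_histogram, evaluate, and l1_distance_mc are hypothetical helpers introduced here.

```python
import numpy as np

# A d-dimensional k-histogram, represented as a list of
# (lo, hi, value) triples: [lo, hi) is an axis-aligned rectangle
# and `value` is the constant the function takes on it.

def make_histogram(pieces):
    """pieces: list of (lo, hi, value), with lo and hi length-d arrays."""
    return [(np.asarray(lo, float), np.asarray(hi, float), float(v))
            for lo, hi, v in pieces]

def evaluate(hist, x):
    """Return h(x) for a k-histogram h; 0.0 outside every rectangle."""
    x = np.asarray(x, float)
    for lo, hi, v in hist:
        if np.all(lo <= x) and np.all(x < hi):
            return v
    return 0.0

def l1_distance_mc(h1, h2, lo, hi, n=100_000, seed=0):
    """Monte-Carlo estimate of ||h1 - h2||_1 over the box [lo, hi),
    illustrating the L_1 error metric from the problem statement."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    pts = rng.uniform(lo, hi, size=(n, lo.size))
    diffs = [abs(evaluate(h1, p) - evaluate(h2, p)) for p in pts]
    return float(np.prod(hi - lo) * np.mean(diffs))

# Example: a 2-dimensional 2-histogram (a density) on [0, 1)^2.
h = make_histogram([
    ([0.0, 0.0], [0.5, 1.0], 1.6),  # left half: density 1.6
    ([0.5, 0.0], [1.0, 1.0], 0.4),  # right half: density 0.4
])
print(evaluate(h, [0.25, 0.5]))  # 1.6
```

In this representation, the "unknown partition" of the abstract corresponds to the rectangle boundaries themselves being unknown to the learner; only i.i.d. samples from f are observed.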