Tail bounds for volume sampled linear regression
The n × d design matrix in a linear regression problem is given, but the response for each point is hidden unless explicitly requested. The goal is to observe only a small number k ≪ n of the responses, and then produce a weight vector whose sum of squared losses over all points is at most 1+ϵ times the minimum. A standard approach to this problem is i.i.d. leverage score sampling, but this approach is known to perform poorly when k is small (e.g., k = d); in such cases, it is dominated by volume sampling, a joint sampling method that explicitly promotes diversity. How these methods compare for larger k was not previously understood. We prove that volume sampling can have poor behavior for large k, indeed worse than leverage score sampling. We also show how to repair volume sampling using a new padding technique. We prove that padded volume sampling has at least as good a tail bound as leverage score sampling: sample size k = O(d log d + d/ϵ) suffices to guarantee total loss at most 1+ϵ times the minimum with high probability. The main technical challenge is proving tail bounds for the sums of dependent random matrices that arise from volume sampling.
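To make the baseline concrete, here is a minimal sketch (not from the paper) of i.i.d. leverage score sampling for subsampled least-squares regression: sample k rows with probabilities proportional to their leverage scores, query only those responses, and solve the importance-reweighted problem. The helper `y_oracle` and all other names are hypothetical illustration choices, not the authors' code.

```python
import numpy as np

def leverage_scores(X):
    """Leverage score of row i is the i-th diagonal entry of the hat matrix,
    computed here as the squared row norms of Q from a thin QR factorization."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

def leverage_sampled_regression(X, y_oracle, k, seed=None):
    """Sample k rows i.i.d. proportional to leverage scores, query only those
    k responses via y_oracle(indices), and solve the reweighted least squares."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    p = leverage_scores(X)
    p = p / p.sum()                     # sampling distribution over rows (sums to 1)
    idx = rng.choice(n, size=k, p=p)    # i.i.d. draws, with replacement
    w = 1.0 / np.sqrt(k * p[idx])       # importance weights correcting the sampling bias
    Xs = w[:, None] * X[idx]            # reweighted subsampled design
    ys = w * y_oracle(idx)              # only these k responses are ever requested
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta
```

This is the i.i.d. baseline discussed in the abstract; the paper's contribution concerns jointly sampled (volume sampled) subsets and their tail bounds, which this sketch does not implement.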