Outlier-robust sparse/low-rank least-squares regression and robust matrix completion
We consider high-dimensional least-squares regression when a fraction ϵ of the labels are contaminated by an arbitrary adversary. We analyze this problem in the statistical learning framework with a subgaussian distribution and a linear hypothesis class on the space of d_1×d_2 matrices; in particular, the noise is allowed to be heterogeneous. This framework includes sparse linear regression and low-rank trace regression. For a p-dimensional s-sparse parameter, we show that a convex regularized M-estimator using a sorted Huber-type loss achieves the near-optimal subgaussian rate √(s log(ep/s)/n) + √(log(1/δ)/n) + ϵ log(1/ϵ), with probability at least 1-δ. For a (d_1×d_2)-dimensional parameter of rank r, a nuclear-norm regularized M-estimator using the same sorted Huber-type loss achieves the subgaussian rate √(rd_1/n) + √(rd_2/n) + √(log(1/δ)/n) + ϵ log(1/ϵ), again optimal up to a log factor. In a second part, we study the trace-regression problem when the parameter is the sum of a rank-r matrix and an s-sparse matrix satisfying a "low-spikeness" condition. Unlike the multivariate regression setting studied in previous work, the design in trace regression lacks positive definiteness in high dimensions. Still, we show that a regularized least-squares estimator achieves the subgaussian rate √(rd_1/n) + √(rd_2/n) + √(s log(d_1 d_2)/n) + √(log(1/δ)/n). Lastly, we consider noisy matrix completion with non-uniform sampling when a fraction ϵ of the samples of the low-rank matrix is corrupted by outliers. If only the low-rank matrix is of interest, we show that a nuclear-norm regularized Huber-type estimator achieves, up to log factors, the optimal rate adaptively to the corruption level. The above-mentioned rates require no information on (s, r, ϵ).
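As a rough illustration of the kind of estimator the abstract refers to, the following NumPy sketch runs proximal-gradient descent on an ℓ1-regularized objective built from a sorted Huber-type loss, i.e. (1/n) Σ_i ω_i ρ_τ(|y - Xb|_(i)) + λ‖b‖_1, where |y - Xb|_(1) ≥ … ≥ |y - Xb|_(n) are the sorted absolute residuals. The weight sequence ω, the Huber threshold τ, the step size, and all function names below are illustrative assumptions, not the paper's algorithm or tuning.

import numpy as np

def huber_grad(r, tau):
    # Derivative of the Huber function rho_tau at residuals r (1-Lipschitz).
    return np.clip(r, -tau, tau)

def sorted_huber_grad(residuals, weights, tau):
    # Gradient of sum_i weights[i] * rho_tau(|residuals|_(i)) w.r.t. residuals:
    # the k-th largest decreasing weight multiplies the k-th largest |residual|.
    order = np.argsort(-np.abs(residuals))
    w = np.empty_like(weights)
    w[order] = weights
    return w * huber_grad(residuals, tau)

def soft_threshold(b, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def sorted_huber_lasso(X, y, lam, tau=1.0, n_iter=500):
    # Proximal gradient for (1/n) sum_i w_i rho_tau(|y - Xb|_(i)) + lam * ||b||_1.
    n, p = X.shape
    weights = np.log(2.0 * n / (np.arange(n) + 1.0))   # illustrative decreasing weights
    step = n / (np.max(weights) * np.linalg.norm(X, 2) ** 2)
    b = np.zeros(p)
    for _ in range(n_iter):
        r = y - X @ b
        grad = -X.T @ sorted_huber_grad(r, weights, tau) / n
        b = soft_threshold(b - step * grad, step * lam)
    return b

For the low-rank results described above, one would replace the ℓ1 penalty and the soft-thresholding step by the nuclear norm and singular-value soft-thresholding, keeping the same sorted Huber-type data-fitting term.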