Robust covariance estimation with missing values and cell-wise contamination
Large datasets are often affected by cell-wise outliers in the form of missing or erroneous data. However, discarding any samples containing outliers may result in a dataset that is too small to accurately estimate the covariance matrix. Moreover, most robust procedures designed to address this problem are not effective on high-dimensional data as they rely crucially on invertibility of the covariance operator. In this paper, we propose an unbiased estimator for the covariance in the presence of missing values that does not require any imputation step and still achieves minimax statistical accuracy with the operator norm. We also advocate for its use in combination with cell-wise outlier detection methods to tackle cell-wise contamination in a high-dimensional and low-rank setting, where state-of-the-art methods may suffer from numerical instability and long computation times. To complement our theoretical findings, we conducted an experimental study which demonstrates the superiority of our approach over the state of the art both in low and high dimension settings.
READ FULL TEXT