Efficient and Adaptive Linear Regression in Semi-Supervised Settings

01/17/2017
by Abhishek Chakrabortty, et al.

We consider the linear regression problem under semi-supervised settings, wherein the available data typically consist of: (i) a small or moderate-sized 'labeled' data set, and (ii) a much larger 'unlabeled' data set. Such data arise naturally in settings where the outcome, unlike the covariates, is expensive to obtain, a frequent scenario in modern studies involving large databases such as electronic medical records (EMR). Supervised estimators like the ordinary least squares (OLS) estimator use only the labeled data. It is often of interest to investigate if and when the unlabeled data can be exploited to improve estimation of the regression parameter in the adopted linear model. In this paper, we propose a class of 'Efficient and Adaptive Semi-Supervised Estimators' (EASE) to improve estimation efficiency. The EASE are two-step estimators that adapt to model mis-specification: they achieve improved (optimal in some cases) efficiency when the linear model is mis-specified, and equal (optimal) efficiency when the linear model holds. This adaptive property, often unaddressed in the existing literature, is crucial for advocating 'safe' use of the unlabeled data. The construction of EASE primarily involves a flexible 'semi-non-parametric' imputation, including a smoothing step that works well even when the number of covariates is not small, and a follow-up 'refitting' step along with a cross-validation (CV) strategy, both of which have useful practical and theoretical implications for addressing two important issues: under-smoothing and over-fitting. We establish asymptotic results including consistency, asymptotic normality and the adaptive properties of EASE. We also provide influence function expansions and a 'double' CV strategy for inference. The results are further validated through extensive simulations, followed by an application to an EMR study on auto-immunity.
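
To make the impute-then-refit idea in the abstract more concrete, here is a minimal Python sketch. It is not the authors' EASE construction: it assumes a plain Gaussian-kernel (Nadaraya-Watson) smoother on the raw covariates as the imputation step, a K-fold cross-validated residual regression as the 'refitting' step, and OLS on the imputed outcomes of the unlabeled data as the final step. The function names (`kernel_smooth`, `semi_supervised_ols`), the bandwidth, the fold count and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones to the design matrix."""
    return np.column_stack([np.ones(len(X)), X])

def ols(X, y):
    """Least-squares coefficients (intercept first)."""
    return np.linalg.lstsq(add_intercept(X), y, rcond=None)[0]

def kernel_smooth(X_train, y_train, X_eval, bandwidth):
    """Nadaraya-Watson smoother with a Gaussian kernel (an assumed choice)."""
    sq_dist = ((X_eval[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    weights = np.exp(-0.5 * sq_dist / bandwidth**2)
    return (weights @ y_train) / np.maximum(weights.sum(axis=1), 1e-12)

def semi_supervised_ols(X_lab, y_lab, X_unlab, bandwidth=0.5, n_folds=5, seed=0):
    """Illustrative two-step estimator: smooth, refit, then regress."""
    rng = np.random.default_rng(seed)
    n = len(y_lab)
    folds = np.array_split(rng.permutation(n), n_folds)

    # Smoothing with cross-validation: each labeled point is predicted
    # by a smoother that was trained without its own fold.
    m_cv = np.empty(n)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        m_cv[held_out] = kernel_smooth(
            X_lab[train], y_lab[train], X_lab[held_out], bandwidth
        )

    # Refitting: regress the out-of-fold residuals on the covariates so the
    # linear part of the smoother's error is corrected.
    gamma = ols(X_lab, y_lab - m_cv)

    # Impute outcomes for the unlabeled covariates, then run OLS on them.
    y_imputed = (
        kernel_smooth(X_lab, y_lab, X_unlab, bandwidth)
        + add_intercept(X_unlab) @ gamma
    )
    return ols(X_unlab, y_imputed)

# Simulated example (hypothetical data): the linear model holds here, so the
# estimate should land near beta_true, much like labeled-only OLS would.
rng = np.random.default_rng(1)
beta_true = np.array([1.0, 2.0, -1.0])
X_lab = rng.normal(size=(300, 2))
X_unlab = rng.normal(size=(3000, 2))
y_lab = add_intercept(X_lab) @ beta_true + 0.5 * rng.normal(size=300)

theta_hat = semi_supervised_ols(X_lab, y_lab, X_unlab)
print(theta_hat)
```

In this sketch the refitting step is what keeps the estimator anchored to labeled-data OLS when the smoother is poor, which loosely mirrors the 'safe use' motivation in the abstract; the paper's actual smoothing, refitting and 'double' CV procedures are more involved and are described in the full text.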

research
01/25/2022

Semi-Supervised Quantile Estimation: Robust and Efficient Inference in High Dimensional Settings

We consider quantile estimation in a semi-supervised setting, characteri...
research
11/28/2020

Optimal Semi-supervised Estimation and Inference for High-dimensional Linear Regression

There are many scenarios such as the electronic health records where the...
research
10/19/2020

Efficient Estimation and Evaluation of Prediction Rules in Semi-Supervised Settings under Stratified Sampling

In many contemporary applications, large amounts of unlabeled data are r...
research
11/15/2017

Semi-Supervised Approaches to Efficient Evaluation of Model Prediction Performance

In many modern machine learning applications, the outcome is expensive o...
research
08/16/2022

Semi-supervised Transfer Learning for Evaluation of Model Classification Performance

In modern machine learning applications, frequent encounters of covariat...
research
04/14/2021

Double Robust Semi-Supervised Inference for the Mean: Selection Bias under MAR Labeling with Decaying Overlap

Semi-supervised (SS) inference has received much attention in recent yea...
research
03/04/2022

Adaptive Semi-Supervised Inference for Optimal Treatment Decisions with Electronic Medical Record Data

A treatment regime is a rule that assigns a treatment to patients based ...