Learning Individual Models for Imputation (Technical Report)

04/07/2020
by Aoqian Zhang, et al.

Missing numerical values are prevalent, e.g., owing to unreliable sensor readings and to collection and transmission among heterogeneous sources. Unlike imputation of categorical data over a limited domain, imputation of numerical values suffers from two issues: (1) the sparsity problem: because the domain is (almost) infinite, an incomplete tuple may not have enough complete neighbors sharing the same or similar values; (2) the heterogeneity problem: different tuples may not fit the same (regression) model. In this study, inspired by conditional dependencies, which hold over certain tuples rather than over the whole relation, we propose to learn a regression model individually for each complete tuple together with its neighbors. Our IIM, Imputation via Individual Models, thus no longer relies on the k complete neighbors sharing similar values with the incomplete tuple, but instead uses the regression results of their learned individual (not necessarily identical) models. Remarkably, we show that some existing methods are special cases of IIM under extreme settings of the number l of learning neighbors considered in individual learning. A proper number l of neighbors is therefore essential for learning the individual models, to avoid over-fitting or under-fitting. We propose to learn the individual models adaptively, with a different number l of neighbors for each complete tuple. By devising efficient incremental computation, we reduce the time complexity of learning a model from linear to constant. Experiments on real data demonstrate that IIM with adaptive learning achieves higher imputation accuracy than existing approaches.
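The two phases described in the abstract (learn one model per complete tuple from its l neighbors, then impute from the models of the k nearest complete neighbors) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the Euclidean distance, the ordinary least-squares fit, and the plain averaging of the k candidate predictions are all assumptions made for illustration, and l is fixed here rather than learned adaptively.

```python
import numpy as np


def learn_individual_models(X, y, l):
    """Fit one linear model per complete tuple on its l nearest complete tuples.

    X: (n, d) complete attributes, y: (n,) attribute to be imputed later.
    Returns an (n, d + 1) array of coefficients, intercept first.
    Assumption: Euclidean distance; each tuple counts as its own neighbor.
    """
    n, d = X.shape
    A = np.hstack([np.ones((n, 1)), X])  # design matrix with intercept column
    models = np.empty((n, d + 1))
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        idx = np.argsort(dist)[:l]  # the l learning neighbors of tuple i
        # Least-squares fit on the neighborhood (minimum-norm if rank-deficient).
        models[i] = np.linalg.lstsq(A[idx], y[idx], rcond=None)[0]
    return models


def impute(x, X, models, k):
    """Impute the missing value of a tuple with complete attributes x.

    Each of the k nearest complete tuples proposes a candidate through its
    own model; the candidates are averaged (an assumed combination rule).
    """
    dist = np.linalg.norm(X - x, axis=1)
    idx = np.argsort(dist)[:k]  # the k imputation neighbors
    a = np.concatenate([[1.0], x])
    candidates = models[idx] @ a
    return float(candidates.mean())


# Hypothetical heterogeneous data: two regimes that follow different models.
x_low = np.linspace(0.0, 1.0, 5)
x_high = np.linspace(10.0, 11.0, 5)
X_complete = np.concatenate([x_low, x_high]).reshape(-1, 1)
y_complete = np.concatenate([2.0 * x_low, 5.0 - x_high])

models = learn_individual_models(X_complete, y_complete, l=5)
value_low = impute(np.array([0.5]), X_complete, models, k=3)
value_high = impute(np.array([10.5]), X_complete, models, k=3)

# With l equal to the number of complete tuples, every tuple learns the same
# global regression, which illustrates one extreme setting of l.
global_models = learn_individual_models(X_complete, y_complete, l=len(y_complete))
```

In this toy data each neighborhood of size 5 lies entirely within one regime, so the individual models recover the local relationship, whereas the single global model shared by all tuples cannot fit both regimes at once.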

research
11/23/2020

Distance-based Data Cleaning: A Survey (Technical Report)

With the rapid development of the internet technology, dirty data are co...
research
06/01/2021

What's a good imputation to predict with missing values?

How to learn a good predictor on data with missing values? Most efforts ...
research
11/17/2015

Optimized Linear Imputation

Often in real-world datasets, especially in high dimensional data, some ...
research
06/03/2022

Estimation of Over-parameterized Models via Fitting to Future Observations

From a model-building perspective, in this paper we propose a paradigm s...
research
08/13/2016

An approach to dealing with missing values in heterogeneous data using k-nearest neighbors

Techniques such as clusterization, neural networks and decision making u...
research
04/08/2022

Controllable Missingness from Uncontrollable Missingness: Joint Learning Measurement Policy and Imputation

Due to the cost or interference of measurement, we need to control measu...
research
09/09/2022

Boosting Sensitivity of Large-scale Online Experimentation via Dropout Buyer Imputation

Metrics provide strong evidence to support hypotheses in online experime...
