Generating Synthetic Data with The Nearest Neighbors Algorithm

10/03/2022
by Ali Furkan Kalay, et al.

The k-nearest neighbor algorithm (kNN) is one of the most popular nonparametric methods, used for purposes such as treatment effect estimation, missing value imputation, classification, and clustering. The main advantage of kNN is that its hyperparameters are simple to optimize, so it often produces favorable results with minimal effort. This paper proposes a generic semiparametric (or, if required, nonparametric) approach named Local Resampler (LR). LR uses kNN to create subsamples from the original sample and then generates synthetic values drawn from locally estimated distributions. LR can accurately create synthetic samples even if the original sample has a non-convex distribution. Moreover, when used with parametric distributional assumptions, LR performs as well as or better than other popular synthetic data methods while requiring minimal model optimization.
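
The idea described in the abstract (kNN subsamples followed by draws from locally estimated distributions) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `local_resample_sketch`, the choice of anchor points by uniform sampling with replacement, the Euclidean distance, and the local multivariate Gaussian fit are all assumptions made here for illustration; the paper may define each of these steps differently.

```python
import numpy as np


def local_resample_sketch(X, k=10, n_synthetic=None, seed=0):
    """Illustrative kNN-based local resampling (hypothetical, not the paper's LR).

    For each synthetic row: pick an anchor row at random, collect its k nearest
    neighbors (Euclidean distance, anchor included), fit a Gaussian to that
    subsample, and draw one value from the fitted Gaussian.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError("X must be a 2-D array of shape (n_samples, n_features)")
    n, d = X.shape
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    if n_synthetic is None:
        n_synthetic = n

    # Assumption: anchors are sampled uniformly with replacement.
    anchors = rng.integers(0, n, size=n_synthetic)
    synthetic = np.empty((n_synthetic, d))

    for i, a in enumerate(anchors):
        # Distances from the anchor to every row of the original sample.
        dist = np.linalg.norm(X - X[a], axis=1)
        # Indices of the k smallest distances (unordered within the k).
        idx = np.argpartition(dist, k - 1)[:k]
        local = X[idx]

        # Assumption: the local distribution is modeled as a Gaussian.
        mean = local.mean(axis=0)
        cov = np.atleast_2d(np.cov(local, rowvar=False))
        cov = cov + 1e-9 * np.eye(d)  # small ridge for numerical stability
        synthetic[i] = rng.multivariate_normal(mean, cov)

    return synthetic
```

Because each draw depends only on a small neighborhood, a sketch like this can follow a ring-shaped or otherwise non-convex sample more closely than a single global Gaussian fit would. The value of `k` controls how local each fit is.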

research
06/30/2014

Rates of Convergence for Nearest Neighbor Classification

Nearest neighbor methods are a popular class of nonparametric estimators...
research
06/29/2023

Numerical Data Imputation for Multimodal Data Sets: A Probabilistic Nearest-Neighbor Kernel Density Approach

Numerical data imputation algorithms replace missing values by estimates...
research
02/17/2015

Nonparametric Nearest Neighbor Descent Clustering based on Delaunay Triangulation

In our physically inspired in-tree (IT) based clustering algorithm and t...
research
01/28/2022

Heterogeneous Treatment Effect Estimation based on a Partially Linear Nonparametric Bayes Model

Recently, conditional average treatment effect (CATE) estimation has bee...
research
08/25/2018

DNN: A Two-Scale Distributional Tale of Heterogeneous Treatment Effect Inference

Heterogeneous treatment effects are the center of gravity in many modern...
research
08/03/2023

Minimax Optimal Q Learning with Nearest Neighbors

Q learning is a popular model free reinforcement learning method. Most o...
research
07/12/2020

Multiple Imputation and Synthetic Data Generation with the R package NPBayesImputeCat

In many contexts, missing data and disclosure control are ubiquitous and...
