Statistical matching of non-Gaussian data

03/29/2019
by Daniel Ahfock, et al.

The statistical matching problem is a data integration problem with structured missing data. The general form involves the analysis of multiple datasets in which only a strict subset of the variables is jointly observed across all datasets. The simplest version involves two datasets, labelled A and B, with three variables of interest X, Y and Z. Variables X and Y are observed in dataset A and variables X and Z are observed in dataset B. Statistical inference is complicated by the absence of joint (Y, Z) observations. Parametric modelling can be challenging due to identifiability issues and the difficulty of parameter estimation. We develop computationally feasible procedures for the statistical matching of non-Gaussian data using suitable data augmentation schemes and identifiability constraints. Nearest-neighbour imputation is a common alternative technique due to its ease of use and generality. Nearest-neighbour matching is based on a conditional independence assumption, namely that Y and Z are independent given X, which may be inappropriate for non-Gaussian data. Violation of the conditional independence assumption can lead to improper imputations. We compare model-based approaches to nearest-neighbour imputation on a number of flow cytometry datasets and find that the model-based approach can address some of the weaknesses of the nonparametric nearest-neighbour technique.
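The two-dataset setup and the nearest-neighbour baseline described above can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the procedure used in the paper: the function names (`nn_match`, `simulate`, `partial_corr`), the Euclidean distance on X, the single-donor rule and the simulated data-generating process are all hypothetical choices made for illustration.

```python
import numpy as np


def nn_match(x_a, x_b, z_b):
    """For each record in A, donate Z from the record in B whose X is closest.

    Hypothetical helper: single nearest donor, Euclidean distance on X.
    """
    x_a = np.asarray(x_a, dtype=float).reshape(len(x_a), -1)
    x_b = np.asarray(x_b, dtype=float).reshape(len(x_b), -1)
    # Pairwise squared distances between A records (rows) and B records (columns).
    d2 = ((x_a[:, None, :] - x_b[None, :, :]) ** 2).sum(axis=2)
    donors = d2.argmin(axis=1)
    return np.asarray(z_b)[donors], donors


def simulate(n, rng):
    """Assumed toy model: Y and Z share noise e, so they are dependent given X."""
    x = rng.normal(size=n)
    e = rng.normal(size=n)
    y = x + e + 0.5 * rng.normal(size=n)
    z = x + e + 0.5 * rng.normal(size=n)
    return x, y, z


def partial_corr(x, y, z):
    """Correlation of Y and Z after removing a linear effect of X from each."""
    ry = y - np.polyval(np.polyfit(x, y, 1), x)
    rz = z - np.polyval(np.polyfit(x, z, 1), x)
    return np.corrcoef(ry, rz)[0, 1]


rng = np.random.default_rng(0)
n = 500
x_a, y_a, z_a_true = simulate(n, rng)  # dataset A: Z exists but is never observed
x_b, _, z_b = simulate(n, rng)         # dataset B: Y is never observed

z_a_imp, donors = nn_match(x_a, x_b, z_b)

true_pc = partial_corr(x_a, y_a, z_a_true)  # dependence of (Y, Z) given X in the truth
imp_pc = partial_corr(x_a, y_a, z_a_imp)    # dependence after nearest-neighbour matching
print(f"partial corr (true Z): {true_pc:.2f}, (imputed Z): {imp_pc:.2f}")
```

In this toy model the donated Z values depend on the recipient only through X, so the matched file shows almost no residual association between Y and Z even though the simulated truth has a strong one. That is the sense in which matching on X alone builds in the conditional independence assumption.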

