Regularization for Shuffled Data Problems via Exponential Family Priors on the Permutation Group

11/02/2021
by Zhenbang Wang, et al.

In the analysis of data sets consisting of (X, Y)-pairs, a tacit assumption is that each pair corresponds to the same observation unit. If, however, such pairs are obtained via record linkage of two files, this assumption can be violated as a result of mismatch error rooted, for example, in the lack of reliable identifiers in the two files. Recently, there has been a surge of interest in this setting under the term "shuffled data", in which the underlying correct pairing of (X, Y)-pairs is represented via an unknown index permutation. Explicit modeling of the permutation tends to be associated with substantial overfitting, prompting the need for suitable methods of regularization. In this paper, we propose a flexible exponential family prior on the permutation group for this purpose that can be used to integrate various structures such as sparse and locally constrained shuffling. This prior turns out to be conjugate for canonical shuffled data problems in which the likelihood conditional on a fixed permutation can be expressed as a product over the corresponding (X, Y)-pairs. Inference is based on the EM algorithm, in which the intractable E-step is approximated by the Fisher-Yates algorithm. The M-step is shown to admit a significant reduction from n^2 to n terms if the likelihood of (X, Y)-pairs has exponential family form, as in the case of generalized linear models. Experiments on synthetic and real data show that the proposed approach compares favorably to competing methods.
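
As a rough illustration of two ingredients named in the abstract, the sketch below shows the textbook Fisher-Yates shuffle, which draws a uniformly random permutation, and an unnormalized exponential family log-prior on permutations. The sufficient statistic used here (the number of indices a permutation moves) and the names `fisher_yates`, `log_prior_unnormalized`, and `gamma` are assumptions chosen for illustration of sparse shuffling; this is not the paper's sampler or its exact prior.

```python
import random


def fisher_yates(n, rng):
    """Return a uniformly random permutation of range(n) (textbook Fisher-Yates)."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def log_prior_unnormalized(perm, gamma):
    """Hypothetical sparse-shuffling prior: log p(perm) = -gamma * (#moved indices) + const."""
    moved = sum(1 for i, p in enumerate(perm) if p != i)
    return -gamma * moved


rng = random.Random(0)
perm = fisher_yates(6, rng)
print(perm, log_prior_unnormalized(perm, gamma=2.0))
```

With `gamma > 0`, permutations close to the identity receive higher prior mass, which is one way to encode the belief that only a few pairs are mismatched.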

research
04/24/2019

Normalizers and permutational isomorphisms in simply-exponential time

We show that normalizers and permutational isomorphisms of permutation g...
research
01/13/2019

On the method of likelihood-induced priors

We demonstrate that the functional form of the likelihood contains a suf...
research
10/01/2020

Estimation in exponential family Regression based on linked data contaminated by mismatch error

Identification of matching records in multiple files can be a challengin...
research
06/01/2023

A General Framework for Regression with Mismatched Data Based on Mixture Modeling

Data sets obtained from linking multiple files are frequently affected b...
research
07/16/2019

A Two-Stage Approach to Multivariate Linear Regression with Sparsely Mismatched Data

A tacit assumption in linear regression is that (response, predictor)-pa...
research
06/23/2022

Regression with Label Permutation in Generalized Linear Model

The assumption that response and predictor belong to the same statistica...
research
10/03/2019

A Pseudo-Likelihood Approach to Linear Regression with Partially Shuffled Data

Recently, there has been significant interest in linear regression in th...