ANOVA exemplars for understanding data drift

06/24/2020
by Sinead A. Williamson, et al.

The distributions underlying complex datasets, such as images, text or tabular data, are often difficult to visualize in terms of summary statistics such as the mean or the marginal standard deviations. Instead, a small set of exemplars or prototypes (real or synthetic data points that are in some sense representative of the entire distribution) can provide a human-interpretable summary of the distribution. In many situations, we are interested in understanding the difference between two distributions. For example, we may want to identify and characterize data drift over time, or the difference between two related datasets. While exemplars are often more easily understood than high-dimensional summary statistics, they are harder to compare. To solve this problem, we introduce ANOVA exemplars. Rather than independently finding exemplars S_X and S_Y for two datasets X and Y, we aim to find exemplars that are representative of X and Y, respectively, and that maximize the overlap |S_X ∩ S_Y| between the two sets of exemplars. We can then use the differences between the two sets of exemplars to describe the difference between the distributions of X and Y in a concise, interpretable manner.
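To make the idea of overlapping exemplar sets concrete, the following is a minimal sketch and not the authors' algorithm, which the abstract does not specify. It assumes that representativeness is measured by a squared maximum mean discrepancy (MMD) under an RBF kernel, that exemplars are drawn from the pooled data, and that overlap is encouraged by penalizing exemplars that appear in only one set. The function names, the greedy search, and the `penalty` and `gamma` parameters are all hypothetical choices made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    # Pairwise RBF kernel between the rows of a and the rows of b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def mmd2(K, data_idx, ex_idx):
    # Biased squared MMD between a dataset and a uniformly weighted
    # exemplar set, both given as indices into the pooled kernel matrix K.
    if len(ex_idx) == 0:
        return float("inf")  # an empty exemplar set represents nothing
    ex = sorted(ex_idx)
    return (K[np.ix_(data_idx, data_idx)].mean()
            - 2.0 * K[np.ix_(data_idx, ex)].mean()
            + K[np.ix_(ex, ex)].mean())

def objective(K, x_idx, y_idx, s_x, s_y, penalty):
    # Fit of each exemplar set to its own dataset, plus a cost for every
    # exemplar that is not shared (size of the symmetric difference).
    return (mmd2(K, x_idx, s_x) + mmd2(K, y_idx, s_y)
            + penalty * len(s_x ^ s_y))

def greedy_overlapping_exemplars(X, Y, n_steps=4, penalty=0.02, gamma=0.5):
    # Hypothetical greedy search: each step adds one pooled point to
    # S_X, to S_Y, or to both, whichever gives the lowest objective.
    pool = np.vstack([X, Y])
    K = rbf_kernel(pool, pool, gamma)
    x_idx = list(range(len(X)))
    y_idx = list(range(len(X), len(pool)))
    s_x, s_y = set(), set()
    for _ in range(n_steps):
        best = None
        for c in range(len(pool)):
            moves = ((s_x | {c}, s_y), (s_x, s_y | {c}), (s_x | {c}, s_y | {c}))
            for new_x, new_y in moves:
                if new_x == s_x and new_y == s_y:
                    continue  # this move changes nothing
                score = objective(K, x_idx, y_idx, new_x, new_y, penalty)
                if best is None or score < best[0]:
                    best = (score, new_x, new_y)
        if best is None:
            break  # every pooled point is already in both sets
        _, s_x, s_y = best
    return pool, s_x, s_y

def rows(pool, idx):
    # Look up pooled points for a (possibly empty) set of indices.
    return pool[np.array(sorted(idx), dtype=int)]

# Toy data: both datasets share a cluster near the origin,
# and each has one cluster the other lacks.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(4, 0.3, (15, 2))])
Y = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(-4, 0.3, (15, 2))])

pool, s_x, s_y = greedy_overlapping_exemplars(X, Y)
shared = s_x & s_y
print("shared exemplars:\n", rows(pool, shared))
print("exemplars only in S_X:\n", rows(pool, s_x - s_y))
print("exemplars only in S_Y:\n", rows(pool, s_y - s_x))
```

In this sketch, the shared exemplars summarize what the two datasets have in common, while the exemplars unique to S_X or S_Y point at where the distributions differ. A larger `penalty` pushes the two sets toward greater overlap at the cost of a poorer fit to each dataset.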


