Compressing Tabular Data via Latent Variable Estimation

02/20/2023
by   Andrea Montanari, et al.
0

Data used for analytics and machine learning often take the form of tables with categorical entries. We introduce a family of lossless compression algorithms for such data that proceed in four steps: (i) Estimate latent variables associated to rows and columns; (ii) Partition the table in blocks according to the row/column latents; (iii) Apply a sequential (e.g. Lempel-Ziv) coder to each of the blocks; (iv) Append a compressed encoding of the latents. We evaluate it on several benchmark datasets, and study optimal compression in a probabilistic model for that tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. We prove that the model has a well defined entropy rate and satisfies an asymptotic equipartition property. We also prove that classical compression schemes such as Lempel-Ziv and finite-state encoders do not achieve this rate. On the other hand, the latent estimation strategy outlined above achieves the optimal rate.

READ FULL TEXT

page 7

page 20

page 21

research
09/08/2018

Typed Table Transformations

Spreadsheet tables are often labeled, and these labels effectively const...
research
01/16/2023

Ae^2I: A Double Autoencoder for Imputation of Missing Values

The most common strategy of imputing missing values in a table is to stu...
research
01/15/2019

Practical Lossless Compression with Latent Variables using Bits Back Coding

Deep latent variable models have seen recent success in many data domain...
research
05/19/2022

TransTab: Learning Transferable Tabular Transformers Across Tables

Tabular data (or tables) are the most widely used data format in machine...
research
01/13/2020

Identifying Table Structure in Documents using Conditional Generative Adversarial Networks

In many industries, as well as in academic research, information is prim...
research
05/09/2023

Multiscale Augmented Normalizing Flows for Image Compression

Most learning-based image compression methods lack efficiency for high i...
research
09/10/2021

Neural Networks for Latent Budget Analysis of Compositional Data

Compositional data are non-negative data collected in a rectangular matr...

Please sign up or login with your details

Forgot password? Click here to reset