Value-Compressed Sparse Column (VCSC): Sparse Matrix Storage for Redundant Data

09/08/2023
by Skyler Ruiter, et al.

Compressed Sparse Column (CSC) and Coordinate (COO) are popular compression formats for sparse matrices. However, both CSC and COO are general purpose and cannot take advantage of any of the properties of the data other than sparsity, such as data redundancy. Highly redundant sparse data is common in many machine learning applications, such as genomics, and is often too large for in-core computation using conventional sparse storage formats. In this paper, we present two extensions to CSC: (1) Value-Compressed Sparse Column (VCSC) and (2) Index- and Value-Compressed Sparse Column (IVCSC). VCSC takes advantage of high redundancy within a column to further compress data up to 3-fold over COO and 2.25-fold over CSC, without significant negative impact to performance characteristics. IVCSC extends VCSC by compressing index arrays through delta encoding and byte-packing, achieving a 10-fold decrease in memory usage over COO and 7.5-fold decrease over CSC. Our benchmarks on simulated and real data show that VCSC and IVCSC can be read in compressed form with little added computational cost. These two novel compression formats offer a broadly useful solution to encoding and reading redundant sparse data.
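
The abstract names three ideas without giving a byte-level layout: grouping a column's nonzeros by value, delta-encoding the row indices, and byte-packing them. The following is a minimal Python sketch of those ideas on a single column, not the authors' implementation. Every function name here is hypothetical, and the layout (a dictionary from value to indices, with deltas computed within each value's run and a fixed byte width per run) is an assumption made for illustration.

```python
from collections import defaultdict


def value_compress_column(dense_col):
    """VCSC-like idea: store each distinct nonzero value once,
    followed by the (ascending) row indices where it occurs."""
    groups = defaultdict(list)
    for row, value in enumerate(dense_col):
        if value != 0:
            groups[value].append(row)
    return dict(groups)


def delta_encode(rows):
    """Replace ascending indices by gaps from the previous index (first gap is from 0)."""
    deltas, prev = [], 0
    for r in rows:
        deltas.append(r - prev)
        prev = r
    return deltas


def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the gaps."""
    rows, total = [], 0
    for d in deltas:
        total += d
        rows.append(total)
    return rows


def byte_pack(numbers):
    """Store every number in the smallest fixed byte width that fits the largest one."""
    width = max(1, (max(numbers).bit_length() + 7) // 8)
    payload = b"".join(n.to_bytes(width, "little") for n in numbers)
    return width, payload


def byte_unpack(width, payload):
    """Inverse of byte_pack."""
    return [int.from_bytes(payload[i:i + width], "little")
            for i in range(0, len(payload), width)]


def index_value_compress_column(dense_col):
    """IVCSC-like idea (assumed layout): value -> (byte width, packed index deltas)."""
    return {value: byte_pack(delta_encode(rows))
            for value, rows in value_compress_column(dense_col).items()}


def decompress_column(encoded, n_rows):
    """Rebuild the dense column from the assumed IVCSC-like encoding."""
    col = [0] * n_rows
    for value, (width, payload) in encoded.items():
        for row in delta_decode(byte_unpack(width, payload)):
            col[row] = value
    return col


# A redundant column: only two distinct nonzero values.
column = [0, 2, 0, 2, 1, 0, 2, 1, 0, 0]
vcsc_like = value_compress_column(column)         # {2: [1, 3, 6], 1: [4, 7]}
ivcsc_like = index_value_compress_column(column)  # one byte per index delta here
restored = decompress_column(ivcsc_like, len(column))
```

In this toy column, plain CSC would store five values and five row indices, while the value-grouped form stores two values plus the same five indices. The saving grows with the number of repeats per value, which is consistent with the abstract's point that the formats target highly redundant data.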


