Pareto Optimal Compression of Genomic Dictionaries, with or without Random Access in Main Memory

12/06/2022
by Raffaele Giancarlo, et al.

Motivation: A Genomic Dictionary, i.e., the set of the k-mers appearing in a genome, is a fundamental source of genomic information: its collection is the first step in strategic computational methods ranging from assembly to sequence comparison and phylogeny. Unfortunately, it is costly to store. This has motivated recent studies on the compression of such k-mer sets. However, this area does not have the maturity of genomic compression: it lacks a homogeneous and methodologically sound experimental foundation that allows a fair comparison of the relative merits of the available solutions and that also takes into account the rich choice of compression methods that can be used.

Results: We provide such a foundation here, supporting it with an extensive set of experiments that use reference datasets and a carefully selected set of representative data compressors. Our results highlight the spectrum of compressor choices one has in terms of Pareto Optimality of compression vs. post-processing, the latter being important when the Dictionary needs to be decompressed many times. In addition to the useful indications, not available elsewhere, that this study offers to researchers interested in storing k-mer dictionaries in compressed form, we also provide a software system that can be readily used to explore the Pareto Optimal solutions available for a given Dictionary.

Availability: The software system is available at https://github.com/GenGrim76/Pareto-Optimal-GDC, together with user manuals and installation instructions.

Contact: raffaele.giancarlo@unipa.it

Supplementary information: Additional data are available in the Supplementary Material.
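The two notions the abstract relies on, a k-mer dictionary and Pareto Optimality over compression vs. post-processing, can be illustrated with a minimal Python sketch. It is not the authors' software: the names `kmer_set`, `Candidate`, and `pareto_front` are hypothetical, the two criteria (compressed size and decompression time, both to be minimized) are an assumed reading of "compression vs. post-processing", and the numbers are made up for illustration only.

```python
from typing import List, NamedTuple, Set


class Candidate(NamedTuple):
    """A hypothetical compressor evaluated on one dictionary."""
    name: str                  # illustrative label, not a real tool
    compressed_bytes: int      # size of the compressed dictionary
    decompress_seconds: float  # post-processing (decompression) time


def kmer_set(sequence: str, k: int) -> Set[str]:
    """Return the set of distinct length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both criteria and better on one."""
    no_worse = (a.compressed_bytes <= b.compressed_bytes
                and a.decompress_seconds <= b.decompress_seconds)
    better = (a.compressed_bytes < b.compressed_bytes
              or a.decompress_seconds < b.decompress_seconds)
    return no_worse and better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up measurements, for illustration only.
candidates = [
    Candidate("A", 100, 5.0),
    Candidate("B", 150, 2.0),
    Candidate("C", 160, 3.0),  # larger and slower than B, so dominated
    Candidate("D", 90, 9.0),
]

front_names = [c.name for c in pareto_front(candidates)]
dictionary = kmer_set("ACGTACG", 3)
print(front_names)
print(sorted(dictionary))
```

In this toy setting, a candidate stays on the front whenever improving one criterion would require giving up something on the other, which is the kind of trade-off a user would weigh when a Dictionary must be decompressed many times.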


research
04/17/2023

Lossy Compressor preserving variant calling through Extended BWT

A standard format used for storing the output of high-throughput sequenc...
research
04/23/2022

LitMind Dictionary: An Open-Source Online Dictionary

Dictionaries can help language learners to learn vocabulary by providing...
research
12/22/2015

A Novel Approach to Compress Centralized Text Data using Indexed Dictionary

Data compression is very important feature in terms of saving the memory...
research
02/08/2022

The Weights can be Harmful: Pareto Search versus Weighted Search in Multi-Objective Search-Based Software Engineering

In presence of multiple objectives to be optimized in Search-Based Softw...
research
10/18/2017

A complete characterization of optimal dictionaries for least squares representation

Dictionaries are collections of vectors used for representations of elem...
research
02/01/2017

Dominance Move: A Measure of Comparing Solution Sets in Multiobjective Optimization

One of the most common approaches for multiobjective optimization is to ...
