Fast ordered sampling of DNA sequence variants

11/16/2017
by Anthony J. Greenberg, et al.

Explosive growth in the amount of genomic data is matched by increasing power of consumer-grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory-efficient implementation of an on-line sampling method developed for tape drives 30 years ago. Focusing on genotype files, I test the performance of this technique on modern solid-state and spinning hard drives, and show that it performs well compared to a simple sampling scheme. I illustrate its utility by developing a method to quickly estimate genome-wide patterns of linkage disequilibrium (LD) decay with distance. I provide open-source software that samples loci from several variant format files, a separate program that performs LD decay estimates, and a C++ library that lets developers incorporate these methods into their own projects.
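To make the idea of ordered on-line sampling concrete, here is a minimal sketch of simple selection sampling (often cited as Knuth's Algorithm S). It is a stand-in chosen for brevity, not the faster tape-drive method the abstract refers to, and it is not taken from the author's software. The function names `ordered_sample_indices` and `sample_records` are hypothetical. The sketch assumes the total number of records is known in advance, for example the number of loci in a genotype file.

```python
import random


def ordered_sample_indices(n_total, n_sample, rng=None):
    """Pick n_sample distinct indices from range(n_total), in increasing order.

    Each index is visited once, in order, and is kept with probability
    (samples still needed) / (records still unseen). This yields exactly
    n_sample indices, and every subset of that size is equally likely.
    """
    if not 0 <= n_sample <= n_total:
        raise ValueError("n_sample must be between 0 and n_total")
    rng = rng or random.Random()
    chosen = []
    for t in range(n_total):
        needed = n_sample - len(chosen)
        if needed == 0:
            break  # sample is complete; no need to look at the rest
        # Keep record t with probability needed / (n_total - t).
        if rng.random() * (n_total - t) < needed:
            chosen.append(t)
    return chosen


def sample_records(records, n_total, n_sample, rng=None):
    """Return sampled records from an iterable of known length, preserving order."""
    keep = set(ordered_sample_indices(n_total, n_sample, rng))
    return [rec for i, rec in enumerate(records) if i in keep]


if __name__ == "__main__":
    # Hypothetical locus names standing in for lines of a variant file.
    loci = [f"locus_{i}" for i in range(20)]
    print(sample_records(loci, len(loci), 5, random.Random(42)))
```

This version draws one random number per record it examines, so its cost grows with the size of the file rather than the size of the sample. Methods that instead draw the length of the gap to the next selected record can skip over unselected records, which is the kind of saving that matters when reading sequentially from a large file.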


