Clustering genomic words in human DNA using peaks and trends of distributions

08/13/2018
by Ana Helena Tavares, et al.

In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the "trend"), and a sparse vector of detrended data captures the peak structure. A simulation study demonstrates the effectiveness of the clustering procedure in grouping distributions with similar peak behavior and/or baseline features. The procedure is applied to investigate similarities between the distribution patterns of genomic words of lengths 3 and 5 in the human genome. These experiments demonstrate the potential of the new method for identifying words with similar distance patterns.
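
The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not name the outlier-robust fitting method, so a running median stands in for it here, and peaks are flagged with a median-absolute-deviation threshold. The function names (`lag_counts`, `running_median`, `split_trend_and_peaks`) and the parameters `half_window` and `k` are hypothetical, chosen only for illustration.

```python
from statistics import median


def lag_counts(sequence, word, max_lag):
    """Histogram of distances between consecutive occurrences of `word`.

    Overlapping occurrences are counted; lags above `max_lag` are dropped.
    """
    positions = [i for i in range(len(sequence) - len(word) + 1)
                 if sequence.startswith(word, i)]
    counts = [0] * (max_lag + 1)
    for a, b in zip(positions, positions[1:]):
        if b - a <= max_lag:
            counts[b - a] += 1
    return counts


def running_median(values, half_window):
    """Assumed stand-in for the robust baseline fit: a centered running median."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        out.append(median(values[lo:hi]))
    return out


def split_trend_and_peaks(values, half_window=3, k=3.0):
    """Split a histogram into a baseline ('trend') and a sparse peak vector.

    A residual is kept as a peak only if it exceeds k scaled MADs of the
    residuals; all other entries of the peak vector are zero.
    """
    trend = running_median(values, half_window)
    residuals = [v - t for v, t in zip(values, trend)]
    mad = median([abs(r) for r in residuals])
    threshold = k * 1.4826 * mad  # 1.4826 scales the MAD to a normal std. dev.
    peaks = [r if r > threshold else 0.0 for r in residuals]
    return trend, peaks


# Toy inputs, not real genomic data.
toy_counts = lag_counts("ACGACGTTACG", "ACG", max_lag=6)
toy_trend, toy_peaks = split_trend_and_peaks(
    [10, 10, 10, 10, 50, 10, 10, 10, 10], half_window=2)
```

Under these assumptions, each word would be represented by its trend and its sparse peak vector, and words could then be clustered on either representation or on both.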

research
10/06/2017

Comparing reverse complementary genomic words based on their distance distributions and frequencies

In this work we study reverse complementary genomic word pairs in the hu...
research
08/01/2020

Theta palindromes in theta conjugates

A DNA string is a Watson-Crick (WK) palindrome when the complement of it...
research
07/31/2016

Identification of repeats in DNA sequences using nucleotide distribution uniformity

Repetitive elements are important in genomic structures, functions and r...
research
01/20/2016

Semantic Word Clusters Using Signed Normalized Graph Cuts

Vector space representations of words capture many aspects of word simil...
research
05/11/2021

The explicit formula of the distributions of the nonoverlapping words and its applications to statistical tests for random numbers

Bassino et al. 2010 and Regnier et al. 1998 showed the generating functi...
research
04/08/2018

Dimensionality's Blessing: Clustering Images by Underlying Distribution

Many high dimensional vector distances tend to a constant. This is typic...
research
02/20/2018

Unsupervised Phase Mapping of X-ray Diffraction Data by Nonnegative Matrix Factorization Integrated with Custom Clustering

Analyzing large X-ray diffraction (XRD) datasets is a key step in high-t...
