Computing the rearrangement distance of natural genomes

01/07/2020
by   Leonard Bohnenkämper, et al.
0

The computation of genomic distances has been a very active field of computational comparative genomics over the last 25 years. Substantial results include the polynomial-time computability of the inversion distance by Hannenhalli and Pevzner in 1995 and the introduction of the double-cut and join (DCJ) distance by Yancopoulos et al. in 2005. Both results, however, rely on the assumption that the genomes under comparison contain the same set of unique markers (syntenic genomic regions, sometimes also referred to as genes). In 2015, Shao, Lin and Moret relax this condition by allowing for duplicate markers in the analysis. This generalized version of the genomic distance problem is NP-hard, and they give an ILP solution that is efficient enough to be applied to real-world datasets. A restriction of their approach is that it can be applied only to balanced genomes, that have equal numbers of duplicates of any marker. Therefore it still needs a delicate preprocessing of the input data in which excessive copies of unbalanced markers have to be removed. In this paper we present an algorithm solving the genomic distance problem for natural genomes, in which any marker may occur an arbitrary number of times. Our method is based on a new graph data structure, the multi-relational diagram, that allows an elegant extension of the ILP by Shao, Lin and Moret to count runs of markers that are under- or over-represented in one genome with respect to the other and need to be inserted or deleted, respectively. With this extension, previous restrictions on the genome configurations are lifted, for the first time enabling an uncompromising rearrangement analysis. Any marker sequence can directly be used for the distance calculation. The evaluation of our approach shows that it can be used to analyze genomes with up to a few ten thousand markers, which we demonstrate on simulated and real data.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/07/2020

Natural family-free genomic distance

A classical problem in comparative genomics is to compute the rearrangem...
research
06/12/2019

The Tandem Duplication Distance is NP-hard

In computational biology, tandem duplication is an important biological ...
research
09/27/2019

Computing the Inversion-Indel Distance

The inversion distance, that is the distance between two unichromosomal ...
research
03/07/2023

Investigating the complexity of the double distance problems

Two genomes over the same set of gene families form a canonical pair whe...
research
05/03/2023

Complexity and Enumeration in Models of Genome Rearrangement

In this paper, we examine the computational complexity of enumeration in...
research
02/21/2018

A framework for cost-constrained genome rearrangement under Double Cut and Join

The study of genome rearrangement has many flavours, but they all are so...
research
04/29/2020

An Almost Exact Linear Complexity Algorithm of the Shortest Transformation of Chain-Cycle Graphs

A "genome structure" is a labeled directed graph with vertices of degree...

Please sign up or login with your details

Forgot password? Click here to reset