Seven clusters in genomic triplet distributions

05/29/2003
by Alexander N. Gorban, et al.

Several recent papers have proposed gene-detection algorithms that find protein-coding regions without requiring a training dataset of already known genes. The fact that unsupervised gene detection is possible is closely connected to the existence of a cluster structure in oligomer frequency distributions. In this paper we study the cluster structure of several genomes in the space of their triplet frequencies, using a pure data-exploration strategy. Several complete genomic sequences were analyzed by visualizing tables of triplet frequencies computed in a sliding window. The distribution of the 64-dimensional vectors of triplet frequencies displays a clearly detectable cluster structure. The structure was found to consist of seven clusters: six corresponding to protein-coding information in the three possible phases on each of the two complementary strands, and one corresponding to the non-coding regions, with high accuracy (higher than 90% [...]). [...] allows the performance of different gene-prediction tools to be analyzed effectively. Since the method does not require extraction of ORFs, it can be applied even to unassembled genomes. The information content of the triplet distributions and the validity of the mean-field models are also analyzed.
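
The sliding-window step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The window width, the step, the use of non-overlapping triplets read from the window start, and the function names `triplet_frequencies` and `sliding_window_vectors` are all assumptions made here for illustration.

```python
from itertools import product

ALPHABET = "ACGT"
# All 64 possible triplets, in a fixed order that defines the vector axes.
TRIPLETS = ["".join(p) for p in product(ALPHABET, repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def triplet_frequencies(window):
    """Return 64 normalized frequencies of non-overlapping triplets.

    Triplets are read from the first base of the window (an assumption);
    triplets containing symbols other than A, C, G, T are skipped.
    """
    counts = [0] * 64
    total = 0
    for i in range(0, len(window) - 2, 3):
        idx = INDEX.get(window[i:i + 3].upper())
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * 64
    return [c / total for c in counts]


def sliding_window_vectors(sequence, width=300, step=1):
    """Return a list of (window_center, 64-dimensional vector) pairs.

    `width` and `step` are illustrative defaults, not values taken
    from the paper.
    """
    vectors = []
    for start in range(0, len(sequence) - width + 1, step):
        window = sequence[start:start + width]
        vectors.append((start + width // 2, triplet_frequencies(window)))
    return vectors


if __name__ == "__main__":
    toy = "ATG" * 101  # a toy periodic sequence, 303 bases long
    for center, vec in sliding_window_vectors(toy):
        dominant = TRIPLETS[vec.index(max(vec))]
        print(center, dominant)
```

On the toy sequence, shifting the window start by one base changes which triplet dominates the vector. This is the same kind of phase dependence that would separate coding regions into distinct clusters by reading phase. The resulting 64-dimensional vectors could then be passed to any projection or clustering method for exploration.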
