A word recurrence based algorithm to extract genomic dictionaries

by Vincenzo Bonnici, et al.

Genomes can be analyzed from an information-theoretic viewpoint as very long strings containing functional elements of variable length that have been assembled by evolution. This work proposes an algorithm, based on information theory, that extracts significant and relatively small dictionaries of genomic words. Conceptual analyses are combined with empirical studies to develop a methodology for extracting variable-length dictionaries from genomic sequences, based on the information content of some factors. Applied to human chromosomes, the method highlights a novel inter-chromosomal similarity in terms of factor distributions.
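The abstract does not spell out how word recurrence is measured, so the following is only a minimal sketch of one common reading: for every word of a fixed length k, record the distances between its consecutive occurrences in the sequence. The function name `recurrence_distances`, the fixed word length `k`, and the toy sequence are assumptions made for illustration; the paper's actual algorithm works with variable-length words and an information-based selection criterion that is not reproduced here.

```python
from collections import defaultdict


def recurrence_distances(sequence, k):
    """Map each k-length word to the distances between its consecutive occurrences.

    Hypothetical helper for illustration only; not the authors' algorithm.
    Words that occur only once have no recurrence distance and are omitted.
    """
    last_seen = {}                  # word -> start position of its latest occurrence
    distances = defaultdict(list)   # word -> list of gaps between consecutive starts
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if word in last_seen:
            distances[word].append(i - last_seen[word])
        last_seen[word] = i
    return dict(distances)


# Toy example: "ACG" starts at positions 0 and 4, so its only recurrence distance is 4.
toy = recurrence_distances("ACGTACGA", 3)
print(toy)
```

Words with at least one recurrence distance are the repeated ones; a dictionary extraction method could then rank or filter them by some property of these distance distributions, though which property the authors use is not stated in the abstract.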


Extracting Synonyms from Bilingual Dictionaries

We present our progress in developing a novel algorithm to extract synon...

Rice-Marlin Codes: Tiny and Efficient Variable-to-Fixed Codes

Marlin is a Variable-to-Fixed (VF) codec optimized for high decoding spe...

Hinted Dictionaries: Efficient Functional Ordered Sets and Maps

This article introduces hinted dictionaries for expressing efficient ord...

Look It Up: Bilingual and Monolingual Dictionaries Improve Neural Machine Translation

Despite advances in neural machine translation (NMT) quality, rare words...

Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection

Many important forms of data are stored digitally in XML format. Errors ...

Emergence of functional information from multivariate correlations

The information content of symbolic sequences (such as nucleic- or amino...

Generating Information Extraction Patterns from Overlapping and Variable Length Annotations using Sequence Alignment

Sequence alignments are used to capture patterns composed of elements re...