A word recurrence based algorithm to extract genomic dictionaries

09/22/2020
by Vincenzo Bonnici et al.

Genomes may be analyzed from an information viewpoint as very long strings that contain functional elements of variable length, assembled by evolution. This work proposes a novel information-theoretic algorithm for extracting significant, relatively small dictionaries of genomic words. Conceptual analyses are combined with empirical studies to develop a methodology for extracting variable-length dictionaries from genomic sequences, based on the information content of certain factors. Applied to human chromosomes, the method highlights a novel inter-chromosomal similarity in factor distributions.
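
To make the idea of word recurrence concrete, the following is a minimal sketch, not the authors' algorithm. It assumes that "recurrence" means the gap between consecutive occurrences of the same word, and it uses "occurs at least twice" as a placeholder for the information-based selection criterion, which the abstract does not specify. The function names `recurrence_distances` and `repeated_words` are hypothetical.

```python
from collections import defaultdict


def recurrence_distances(sequence, k):
    """Map each repeated k-mer to the gaps between its consecutive occurrences."""
    last_seen = {}            # most recent start position of each k-mer
    gaps = defaultdict(list)  # k-mer -> list of recurrence distances
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if word in last_seen:
            gaps[word].append(i - last_seen[word])
        last_seen[word] = i
    return dict(gaps)


def repeated_words(sequence, k_min, k_max):
    """Build a variable-length dictionary of words that occur at least twice.

    The 'occurs at least twice' rule is an assumed stand-in for a real
    information-based filter.
    """
    dictionary = {}
    for k in range(k_min, k_max + 1):
        dictionary.update(recurrence_distances(sequence, k))
    return dictionary


if __name__ == "__main__":
    print(repeated_words("ACGTACGA", 2, 3))
```

On a real chromosome the number of candidate words grows quickly with the length range, so a practical method needs a much stricter filter than this placeholder to keep the dictionary small.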

12/01/2020

Extracting Synonyms from Bilingual Dictionaries

We present our progress in developing a novel algorithm to extract synon...
11/14/2018

Rice-Marlin Codes: Tiny and Efficient Variable-to-Fixed Codes

Marlin is a Variable-to-Fixed (VF) codec optimized for high decoding spe...
06/09/2022

Hinted Dictionaries: Efficient Functional Ordered Sets and Maps

This article introduces hinted dictionaries for expressing efficient ord...
10/12/2020

Look It Up: Bilingual and Monolingual Dictionaries Improve Neural Machine Translation

Despite advances in neural machine translation (NMT) quality, rare words...
02/25/2016

Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection

Many important forms of data are stored digitally in XML format. Errors ...
09/16/2021

Emergence of functional information from multivariate correlations

The information content of symbolic sequences (such as nucleic- or amino...
08/09/2019

Generating Information Extraction Patterns from Overlapping and Variable Length Annotations using Sequence Alignment

Sequence alignments are used to capture patterns composed of elements re...