Learning quantitative sequence-function relationships from massively parallel experiments

05/30/2015
by Gurinder S. Atwal, et al.

A fundamental aspect of biological information processing is the ubiquity of sequence-function relationships -- functions that map the sequence of DNA, RNA, or protein to a biochemically relevant activity. Most sequence-function relationships in biology are quantitative, but only recently have experimental techniques for effectively measuring these relationships been developed. The advent of such "massively parallel" experiments presents an exciting opportunity for the concepts and methods of statistical physics to inform the study of biological systems. After reviewing these recent experimental advances, we focus on the problem of how to infer parametric models of sequence-function relationships from the data produced by these experiments. Specifically, we retrace and extend recent theoretical work showing that inference based on mutual information, not the standard likelihood-based approach, is often necessary for accurately learning the parameters of these models. Closely connected with this result is the emergence of "diffeomorphic modes" -- directions in parameter space that are far less constrained by data than likelihood-based inference would suggest. Analogous to Goldstone modes in physics, diffeomorphic modes arise from an arbitrarily broken symmetry of the inference problem. An analytically tractable model of a massively parallel experiment is then described, providing an explicit demonstration of these fundamental aspects of statistical inference. This paper concludes with an outlook on the theoretical and computational challenges currently facing studies of quantitative sequence-function relationships.
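As a rough illustration of the mutual-information objective mentioned in the abstract, the Python sketch below computes a plug-in estimate of the mutual information between a model's predicted activities and discrete measurements (for example, the bins into which sequences are sorted). It is not taken from the paper: the function name `mutual_information`, the rank-based discretization, and the parameter `n_pred_bins` are assumptions made for this example.

```python
import numpy as np

def mutual_information(predictions, bins, n_pred_bins=10):
    """Plug-in estimate, in bits, of I(prediction; bin) from paired samples.

    Hypothetical helper for illustration only.
    """
    predictions = np.asarray(predictions, dtype=float).ravel()
    bins = np.asarray(bins).ravel()
    n = len(predictions)

    # Discretize predictions by rank into equally populated groups, so the
    # estimate depends only on the ordering of the predictions.
    ranks = np.argsort(np.argsort(predictions, kind="stable"), kind="stable")
    pred_idx = (ranks * n_pred_bins) // n

    # Map arbitrary measurement labels to 0..K-1.
    _, bin_idx = np.unique(bins, return_inverse=True)
    bin_idx = bin_idx.ravel()

    # Joint histogram of (prediction group, measurement bin).
    counts = np.zeros((n_pred_bins, bin_idx.max() + 1))
    np.add.at(counts, (pred_idx, bin_idx), 1)
    joint = counts / counts.sum()

    # Marginals and the product distribution they imply.
    p_pred = joint.sum(axis=1, keepdims=True)
    p_bin = joint.sum(axis=0, keepdims=True)
    product = p_pred @ p_bin

    # Sum only over occupied cells; empty cells contribute zero.
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log2(joint[nonzero] / product[nonzero])))
```

Because the predictions enter only through their ranks, this estimate is unchanged when the predictions are passed through any strictly increasing function. That is a simplified view of why some directions in parameter space, those that merely reparameterize the predicted activity, are left unconstrained by a mutual-information objective. A plug-in estimator of this kind is biased upward for small samples, so it should be treated as a teaching sketch rather than a production estimator.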


