Forecasting labels under distribution-shift for machine-guided sequence design

11/18/2022
by Lauren Berk Wheelock, et al.

The ability to design and optimize biological sequences with specific functionalities would unlock enormous value in technology and healthcare. In recent years, machine learning-guided sequence design has advanced this goal significantly, but validating designed sequences in the lab or clinic takes many months and substantial labor. It is therefore valuable to assess the likelihood that a designed set contains sequences of the desired quality (which often lies outside the label distribution in our training data) before committing resources to an experiment. Forecasting, a prominent concept in many domains where feedback can be delayed (e.g., elections), has not been used or studied in the context of sequence design. Here we propose a method to guide decision-making that forecasts the performance of high-throughput libraries (e.g., containing 10^5 unique variants) from model-provided estimates, yielding a posterior for the distribution of labels in the library. We show that our method outperforms baselines that naively use model scores to estimate library performance, which are the only tools available today for this purpose.
