Bootstrapped Training of Score-Conditioned Generator for Offline Design of Biological Sequences

06/05/2023
by Minsu Kim, et al.

We study the problem of optimizing biological sequences, e.g., proteins, DNA, and RNA, to maximize a black-box score function that is evaluated only on an offline dataset. We propose a novel solution, the bootstrapped training of score-conditioned generator (BootGen) algorithm. Our algorithm repeats a two-stage process. In the first stage, it trains the biological sequence generator with rank-based weights to improve the accuracy of sequence generation conditioned on high scores. The second stage is bootstrapping, which augments the training dataset with self-generated data labeled by a proxy score function. Our key idea is to align the score-based generation with a proxy score function, which distills the knowledge of the proxy score function into the generator. After training, we aggregate samples from multiple bootstrapped generators and proxies to produce a diverse design. Extensive experiments show that our method outperforms competitive baselines on biological sequence design tasks. We provide reproducible source code: https://github.com/kaist-silab/bootgen.
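The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The weighting form w_i ∝ 1 / (k·N + rank_i), the parameter `k`, and the names `rank_based_weights`, `bootstrap_round`, `toy_generator`, and `toy_proxy` are all assumptions made for illustration. The toy generator and proxy stand in for trained models.

```python
import random


def rank_based_weights(scores, k=0.01):
    """Return normalized training weights that favor high-scoring samples.

    Assumed form: w_i proportional to 1 / (k * N + rank_i), where rank 1 is
    the highest score. This is one common rank-based scheme, used here only
    as an illustration.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    raw = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        raw[idx] = 1.0 / (k * n + rank)
    total = sum(raw)
    return [w / total for w in raw]


def bootstrap_round(dataset, generate, proxy, target_score, n_samples=8, top_k=2):
    """Augment the dataset with self-generated, proxy-labeled sequences.

    Samples candidates conditioned on target_score, labels each with the
    proxy, and appends the top_k candidates by proxy score.
    """
    candidates = [generate(target_score) for _ in range(n_samples)]
    labeled = [(seq, proxy(seq)) for seq in candidates]
    labeled.sort(key=lambda pair: pair[1], reverse=True)
    return dataset + labeled[:top_k]


# Hypothetical stand-ins for a trained proxy and score-conditioned generator.
_rng = random.Random(0)


def toy_proxy(seq):
    """Toy score: fraction of 'A' characters in the sequence."""
    return seq.count("A") / len(seq)


def toy_generator(target_score, length=8):
    """Toy generator: emits 'A' with probability equal to the target score."""
    return "".join("A" if _rng.random() < target_score else "C" for _ in range(length))


if __name__ == "__main__":
    data = [("ACCCCCCC", 0.125), ("AACCCCCC", 0.25), ("AAAACCCC", 0.5)]
    weights = rank_based_weights([score for _, score in data])
    print("weights:", weights)
    data = bootstrap_round(data, toy_generator, toy_proxy, target_score=0.9)
    print("dataset size after one bootstrap round:", len(data))
```

In a full loop, the generator would be retrained on the augmented dataset with fresh rank-based weights before the next bootstrap round. That retraining step is omitted here.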


research
01/07/2023

Bidirectional Learning for Offline Model-based Biological Sequence Design

Offline model-based optimization aims to maximize a black-box objective ...
research
09/15/2022

Bidirectional Learning for Offline Infinite-width Model-based Optimization

In offline model-based optimization, we strive to maximize a black-box o...
research
05/08/2023

MGR: Multi-generator based Rationalization

Rationalization is to employ a generator and a predictor to construct a ...
research
06/28/2022

Joint Generator-Ranker Learning for Natural Language Generation

Due to exposure bias, most existing natural language generation (NLG) mo...
research
09/04/2019

Mixture Content Selection for Diverse Sequence Generation

Generating diverse sequences is important in many NLP applications such ...
research
11/27/2020

TaylorGAN: Neighbor-Augmented Policy Update for Sample-Efficient Natural Language Generation

Score function-based natural language generation (NLG) approaches such a...
research
09/20/2023

Parallel-mentoring for Offline Model-based Optimization

We study offline model-based optimization to maximize a black-box object...
