Translation between Molecules and Natural Language

04/25/2022
by Carl Edwards, et al.

Joint representations between images and text have been deeply investigated in the literature. In computer vision, the benefits of incorporating natural language have become clear for enabling semantic-level control of images. In this work, we present MolT5, a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. MolT5 allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Furthermore, since MolT5 pretrains models on single-modal data, it helps overcome the data-scarcity shortcoming of the chemistry domain. Additionally, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. By interfacing molecules with natural language, we enable a higher semantic level of control over molecule discovery and understanding, a critical capability for scientific domains such as drug discovery and materials design. Our results show that MolT5-based models are able to generate outputs, both molecules and text, that in many cases are high quality and match the input modality. On molecule generation, our best model achieves 30% exact-match accuracy (i.e., it generates the correct structure for about one-third of the captions in our held-out test set).
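
As a concrete illustration of the molecule-captioning task, the sketch below runs a pretrained MolT5 checkpoint as an ordinary T5 sequence-to-sequence model: a SMILES string goes in, a natural-language description comes out. The checkpoint name is an assumption based on the authors' public HuggingFace release; any T5-compatible MolT5 checkpoint would work the same way.

```python
# Minimal sketch of molecule captioning with a MolT5-style seq2seq model.
# Assumes the HuggingFace `transformers` library; the checkpoint name below
# is an assumption based on the authors' public release.
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "laituan245/molt5-base-smiles2caption"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, written as a SMILES string
inputs = tokenizer(smiles, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_length=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The reverse task, text-based de novo molecule generation, is the same call in the other direction: feed a caption to a caption-to-SMILES checkpoint and decode the output as a molecule string.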
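The 30% figure above is an exact-match rate: a generated molecule counts as correct only if it denotes the same structure as the reference, regardless of how the SMILES string is spelled. Below is a minimal sketch of that check, assuming RDKit for SMILES canonicalization; the paper's own evaluation scripts may differ in detail.

```python
# Exact-match evaluation over canonical SMILES, as implied by the abstract.
# Uses RDKit; this framing is an assumption, not the paper's exact code.
from rdkit import Chem

def canonical(smiles: str):
    """Return the canonical SMILES, or None if the string does not parse."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def exact_match(preds, refs):
    """Fraction of predictions whose canonical form matches the reference's."""
    hits = sum(
        1 for p, r in zip(preds, refs)
        if canonical(p) is not None and canonical(p) == canonical(r)
    )
    return hits / len(refs)

# Toy usage: "OCC" and "CCO" are the same molecule (ethanol), so the first
# prediction matches; benzene vs. phenol does not. Prints 0.5.
print(exact_match(["OCC", "C1=CC=CC=C1"], ["CCO", "c1ccccc1O"]))
```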

