Dataset Bias in the Natural Sciences: A Case Study in Chemical Reaction Prediction and Synthesis Design

05/06/2021
by Ryan-Rhys Griffiths, et al.

Datasets in the Natural Sciences are often curated with the goal of aiding scientific understanding and hence may not always be in a form that facilitates the application of machine learning. In this paper, we identify three trends within the fields of chemical reaction prediction and synthesis design that require a change in direction. First, the manner in which reaction datasets are split into reactants and reagents encourages testing models in an unrealistically generous manner. Second, we highlight the prevalence of mislabelled data and suggest that the focus should be on outlier removal rather than on data fitting alone. Lastly, we discuss the problem of reagent prediction, which must be addressed alongside reactant prediction in order to solve the full synthesis design problem, and highlight the mismatch between what machine learning solves and what a lab chemist would need. Our critiques are also relevant to the burgeoning field of using machine learning to accelerate progress in the experimental Natural Sciences, where datasets are often split in a biased way and are highly noisy, and where contextual variables that are not evident from the data strongly influence the outcome of experiments.


