MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction

by Tanishq Gupta, et al.

An overwhelmingly large amount of knowledge in the materials domain is generated and stored as text published in peer-reviewed scientific literature. Recent developments in natural language processing, such as bidirectional encoder representations from transformers (BERT) models, provide promising tools to extract information from these texts. However, direct application of these models in the materials domain may yield suboptimal results, as the models themselves may not be trained on notations and jargon that are specific to the domain. Here, we present a materials-aware language model, namely, MatSciBERT, which is trained on a large corpus of scientific literature published in the materials domain. We further evaluate the performance of MatSciBERT on three downstream tasks, namely, abstract classification, named entity recognition, and relation extraction, on different materials datasets. We show that MatSciBERT outperforms SciBERT, a language model trained on a scientific corpus, on all three tasks. Further, we discuss some of the applications of MatSciBERT in the materials domain for extracting information, which can, in turn, contribute to materials discovery or optimization. Finally, to make the work accessible to the larger materials community, we make the pretrained and finetuned weights and the models of MatSciBERT freely accessible.
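To illustrate why a model that has not seen domain jargon may handle it poorly, here is a minimal sketch of the greedy longest-match subword splitting that BERT-style WordPiece tokenizers use. The two vocabularies are hypothetical toy sets invented for this example, not the actual vocabularies of BERT, SciBERT, or MatSciBERT, and the function is a simplified stand-in rather than any library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one lowercase word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece covers this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabularies (assumptions for illustration only).
GENERAL_VOCAB = {"per", "##ov", "##sk", "##ite", "glass"}
DOMAIN_VOCAB = GENERAL_VOCAB | {"perovskite"}

print(wordpiece_tokenize("perovskite", GENERAL_VOCAB))
print(wordpiece_tokenize("perovskite", DOMAIN_VOCAB))
```

With the toy general vocabulary the term is fragmented into several pieces, while the toy domain vocabulary keeps it as a single token. This only illustrates the vocabulary side of the problem; the abstract's claim concerns training on a domain corpus more broadly.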




Structured information extraction from complex scientific text with fine-tuned large language models

Intelligently extracting and linking complex scientific information from...

Inorganic Materials Synthesis Planning with Literature-Trained Neural Networks

Leveraging new data sources is a key step in accelerating the pace of ma...

Large Language Models as Master Key: Unlocking the Secrets of Materials Science with GPT

The amount of data has growing significance in exploring cutting-edge ma...

BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-inspired Materials

The study of biological materials and bio-inspired materials science is ...

Accelerated materials language processing enabled by GPT

Materials language processing (MLP) is one of the key facilitators of ma...

Looking Through Glass: Knowledge Discovery from Materials Science Literature using Natural Language Processing

Most of the knowledge in materials science literature is in the form of ...

Galactica: A Large Language Model for Science

Information overload is a major obstacle to scientific progress. The exp...
