MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction

by Tanishq Gupta, et al.

An overwhelmingly large amount of knowledge in the materials domain is generated and stored as text published in peer-reviewed scientific literature. Recent developments in natural language processing, such as bidirectional encoder representations from transformers (BERT) models, provide promising tools to extract information from these texts. However, direct application of these models in the materials domain may yield suboptimal results, as the models themselves may not be trained on the notations and jargon specific to the domain. Here, we present a materials-aware language model, MatSciBERT, which is trained on a large corpus of scientific literature published in the materials domain. We further evaluate the performance of MatSciBERT on three downstream tasks, namely abstract classification, named entity recognition, and relation extraction, on different materials datasets. We show that MatSciBERT outperforms SciBERT, a language model trained on a scientific corpus, on all of these tasks. Further, we discuss some applications of MatSciBERT in the materials domain for extracting information, which can, in turn, contribute to materials discovery or optimization. Finally, to make the work accessible to the larger materials community, we make the pretrained and finetuned weights and the models of MatSciBERT freely accessible.





TourBERT: A pretrained language model for the tourism industry

The Bidirectional Encoder Representations from Transformers (BERT) is cu...

Inorganic Materials Synthesis Planning with Literature-Trained Neural Networks

Leveraging new data sources is a key step in accelerating the pace of ma...

Text to Insight: Accelerating Organic Materials Knowledge Extraction via Deep Learning

Scientific literature is one of the most significant resources for shari...

The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain

This paper presents a new challenging information extraction task in the...

Looking Through Glass: Knowledge Discovery from Materials Science Literature using Natural Language Processing

Most of the knowledge in materials science literature is in the form of ...

Analyzing Research Trends in Inorganic Materials Literature Using NLP

In the field of inorganic materials science, there is a growing demand t...

EXSCLAIM! – An automated pipeline for the construction of labeled materials imaging datasets from literature

Due to recent improvements in image resolution and acquisition speed, ma...