Semantic maps and metrics for science using deep transformer encoders

04/13/2021
by Brendan Chambers, et al.

The growing deluge of scientific publications demands text-analysis tools that can help scientists and policy-makers navigate, forecast, and beneficially guide scientific research. Recent advances in natural language understanding driven by deep transformer networks offer new possibilities for mapping science. Because the same surface text can take on multiple, and sometimes contradictory, specialized senses across distinct research communities, sensitivity to context is critical for informetric applications. Transformer embedding models such as BERT capture shades of association and connotation that vary across the linguistic contexts of any particular word or span of text. Here we report a procedure for encoding scientific documents with these tools and measure their improvement over static word embeddings on a nearest-neighbor retrieval task. We find that the discriminability of contextual representations depends strongly on the pooling strategy used to summarize the high-dimensional network activations. Notably, fundamentals such as domain-matched training data matter more than state-of-the-art NLP architecture, although state-of-the-art models still offered significant gains: the best approach we investigated combined domain-matched pretraining, sound pooling, and a state-of-the-art deep transformer encoder. Finally, leveraging contextual representations from deep encoders, we present a range of measurements for understanding and forecasting research communities in science.
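To make the pooling comparison concrete, here is a minimal sketch of the kind of pipeline the abstract describes: encode documents with a transformer, summarize token activations with either CLS or mean pooling, and rank neighbors by cosine similarity. The model name, toy corpus, and helper functions are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch (not the paper's exact setup): contextual document
# encoding with two pooling strategies, then nearest-neighbor retrieval
# by cosine similarity. Model name and toy corpus are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "allenai/scibert_scivocab_uncased"  # a domain-matched encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def encode(texts, pooling="mean"):
    """Embed a list of documents; pooling is 'mean' or 'cls'."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state        # (batch, tokens, dim)
    if pooling == "cls":
        return hidden[:, 0]                          # first-token ([CLS]) activation
    # Mean-pool over real tokens only, using the attention mask to skip padding.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def nearest_neighbors(query, corpus_vecs, k=3):
    """Rank corpus vectors by cosine similarity to a single query vector."""
    sims = F.cosine_similarity(query, corpus_vecs)   # (corpus_size,)
    return sims.topk(min(k, corpus_vecs.size(0)))

corpus = [
    "Deep transformer encoders for mapping scientific literature.",
    "Static word embeddings trained with word2vec.",
    "Forecasting the growth of research communities.",
]
for pooling in ("cls", "mean"):
    vecs = encode(corpus, pooling=pooling)
    scores, idx = nearest_neighbors(encode(["semantic maps of science"], pooling), vecs)
    print(pooling, [(corpus[i][:40], round(s.item(), 3)) for s, i in zip(scores, idx)])
```

In practice, which pooling strategy discriminates documents best varies by encoder and task, which is exactly the sensitivity the abstract reports.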

Related research

09/23/2019 · Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings
Contextualized word embeddings (CWE) such as provided by ELMo (Peters et...

11/30/2021 · A Comparative Study of Transformers on Word Sense Disambiguation
Recent years of research in Natural Language Processing (NLP) have witne...

01/01/2023 · Leveraging Semantic Representations Combined with Contextual Word Representations for Recognizing Textual Entailment in Vietnamese
RTE is a significant problem and is a reasonably active research communi...

08/17/2022 · Transformer Encoder for Social Science
High-quality text data has become an important data source for social sc...

03/19/2019 · Cloze-driven Pretraining of Self-attention Networks
We present a new approach for pretraining a bi-directional transformer m...

09/08/2023 · Encoding Multi-Domain Scientific Papers by Ensembling Multiple CLS Tokens
Many useful tasks on scientific documents, such as topic classification ...

10/13/2020 · Improving Text Generation Evaluation with Batch Centering and Tempered Word Mover Distance
Recent advances in automatic evaluation metrics for text have shown that...
