Encoding Multi-Domain Scientific Papers by Ensembling Multiple CLS Tokens

09/08/2023
by   Ronald Seoh, et al.

Many useful tasks on scientific documents, such as topic classification and citation prediction, involve corpora that span multiple scientific domains. Typically, such tasks are accomplished by representing the text with a vector embedding obtained from a Transformer's single CLS token. In this paper, we argue that using multiple CLS tokens could help a Transformer specialize better to multiple scientific domains. We present Multi2SPE, which encourages each of multiple CLS tokens to learn diverse ways of aggregating token embeddings, then sums them to create a single vector representation. We also introduce a new multi-domain benchmark, Multi-SciDocs, to test scientific paper vector encoders in multi-domain settings. We show that Multi2SPE reduces error by up to 25 percent in multi-domain citation prediction, while requiring only a negligible amount of computation beyond one BERT forward pass.
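The aggregation idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: a single scaled dot-product attention step stands in for the Transformer, the function names (`attention_pool`, `encode_with_multiple_cls`) and the random inputs are hypothetical, and the training objective that encourages the CLS tokens to differ is omitted because the abstract does not describe it. The sketch only shows several CLS-like query vectors each pooling the same token embeddings, with the pooled vectors summed into one paper vector.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(query, token_embeddings):
    """Pool token embeddings with one CLS-like query (scaled dot-product attention)."""
    dim = len(query)
    scores = [
        sum(q * t for q, t in zip(query, token)) / math.sqrt(dim)
        for token in token_embeddings
    ]
    weights = softmax(scores)
    return [
        sum(w * token[d] for w, token in zip(weights, token_embeddings))
        for d in range(dim)
    ]


def encode_with_multiple_cls(cls_queries, token_embeddings):
    """Sum the vectors pooled by each CLS-like query into one representation."""
    pooled = [attention_pool(q, token_embeddings) for q in cls_queries]
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in pooled) for d in range(dim)]


# Hypothetical toy inputs: 6 token embeddings of width 4 and 3 CLS-like queries.
rng = random.Random(0)
dim, num_tokens, num_cls = 4, 6, 3
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
cls_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_cls)]

paper_vector = encode_with_multiple_cls(cls_queries, tokens)
print(len(paper_vector))  # one vector with the same width as a token embedding
```

Because the pooled vectors are summed rather than concatenated, the output keeps the width of a single-CLS embedding, so in this sketch it could be used wherever a single-CLS vector was used before.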


research
04/15/2020

SPECTER: Document-level Representation Learning using Citation-informed Transformers

Representation learning is a critical ingredient for natural language pr...
research
09/12/2022

Large-scale Evaluation of Transformer-based Article Encoders on the Task of Citation Recommendation

Recently introduced transformer-based article encoders (TAEs) designed t...
research
02/08/2022

Counterfactual Multi-Token Fairness in Text Classification

The counterfactual token generation has been limited to perturbing only ...
research
11/19/2022

TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer

In this paper, we introduce a set of effective TOken REduction (TORE) st...
research
09/19/2023

Interactive Distillation of Large Single-Topic Corpora of Scientific Papers

Highly specific datasets of scientific literature are important for both...
research
04/13/2021

Semantic maps and metrics for science using deep transformer encoders

The growing deluge of scientific publications demands text analysis tool...
