Unsupervised Term Extraction for Highly Technical Domains

10/24/2022
by Francesco Fusco, et al.

Term extraction is an information extraction task at the root of knowledge discovery platforms. Developing term extractors that generalize across very diverse and potentially highly technical domains is challenging, as annotations for domains requiring in-depth expertise are scarce and expensive to obtain. In this paper, we describe the term extraction subsystem of a commercial knowledge discovery platform that targets highly technical fields such as pharma, medicine, and materials science. To generalize across domains, we introduce a fully unsupervised annotator (UA). It extracts terms by combining novel morphological signals from sub-word tokenization with term-to-topic and intra-term similarity metrics, computed using general-domain pre-trained sentence encoders. The annotator is used to implement a weakly supervised setup, where transformer models are fine-tuned (or pre-trained) on the training data generated by running the UA over large unlabeled corpora. Our experiments demonstrate that this setup can improve predictive performance while decreasing inference latency on both CPUs and GPUs. Our annotators provide a very competitive baseline for all cases where annotations are not available.
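
As a rough illustration of how two of the signals named above might be computed, the following sketch scores a candidate term by (a) how heavily its words fragment under sub-word tokenization and (b) how similar it is to a topic description. It is not the paper's implementation: the toy vocabulary, the greedy WordPiece-style splitter, the character-trigram vectors standing in for a pre-trained sentence encoder, and the names `wordpiece`, `trigram_vector`, and `score` are all assumptions made for the example. The intra-term similarity signal is omitted.

```python
from collections import Counter
import math

# Hypothetical toy vocabulary; a real setup would load a trained sub-word tokenizer.
VOCAB = {"the", "drug", "inhibits", "kinase", "activity", "in", "cells",
         "ty", "##ros", "##ine", "phospho", "##ryl", "##ation"}


def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first split of a word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            # Continuation pieces carry the conventional "##" prefix.
            cand = word[start:end] if start == 0 else "##" + word[start:end]
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches at this position
        pieces.append(piece)
        start = end
    return pieces


def trigram_vector(text):
    """Stand-in for a sentence encoder: bag of character trigrams (assumption)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def score(candidate, topic, vocab=VOCAB):
    """Return (mean sub-word pieces per word, term-to-topic similarity)."""
    words = candidate.lower().split()
    fragmentation = sum(len(wordpiece(w, vocab)) for w in words) / len(words)
    topicality = cosine(trigram_vector(candidate), trigram_vector(topic))
    return fragmentation, topicality


if __name__ == "__main__":
    print(wordpiece("tyrosine"))  # a rare word splits into several pieces
    print(score("tyrosine kinase", "kinase signaling"))
```

The intuition behind the first signal is that words common in general text tend to survive tokenization as a single piece, while rare technical words break into several. How the signals are thresholded or combined is left open here, since the abstract does not specify it.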


