Are Abstracts Enough for Hypothesis Generation?

04/13/2018
by Justin Sybrandt, et al.

The potential for automatic hypothesis generation (HG) systems to improve research productivity keeps pace with the growing set of publicly available scientific information. But as data becomes easier to acquire, we must understand the effect that different textual data sources have on the resulting hypotheses. Are abstracts enough for HG, or does a system need full-text papers? How many papers does an HG system need to make valuable predictions? What effect do corpus size and document length have on HG results? To answer these questions, we train multiple versions of the knowledge network-based HG system Moliere on varying corpora in order to compare the challenges and tradeoffs in terms of result quality and computational requirements. Moliere generalizes the main principles of similar knowledge network-based HG systems and reinforces them with topic modeling components. The corpora include the abstract and full-text versions of PubMed Central, as well as iterative halves of MEDLINE, which allow us to compare the effects that document length and document count have on the results. We find that corpora with a higher median document length yield higher-quality results, yet take substantially longer to process. Additionally, we find that the effect of document length is greater than that of document count, even if both sets contain only paper abstracts. Reproducibility: Our code can be found at github.com/JSybrandt/moliere, and our data is hosted at bit.ly/2GxghpM.
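To make the corpus design concrete, the following is a minimal sketch of how "iterative halves" of a corpus and a median document length could be computed. It is not taken from the Moliere code base: the function names `iterative_halves` and `median_length`, the use of uniform random sampling to pick each half, and the whitespace-token definition of document length are all assumptions made for illustration.

```python
import random
from statistics import median


def iterative_halves(docs, rounds, seed=0):
    """Return [full, half, quarter, ...] for a list of documents.

    Assumption: each smaller corpus is a uniform random half of the
    previous one. The abstract does not state how the halves are chosen.
    """
    rng = random.Random(seed)
    corpora = [list(docs)]
    for _ in range(rounds):
        previous = corpora[-1]
        # Sample without replacement so each corpus nests inside the last.
        corpora.append(rng.sample(previous, len(previous) // 2))
    return corpora


def median_length(docs):
    """Median document length, measured in whitespace-separated tokens."""
    return median(len(doc.split()) for doc in docs)


if __name__ == "__main__":
    # Toy stand-in for a set of abstracts; real corpora are far larger.
    toy_corpus = [("word " * n).strip() for n in range(1, 17)]
    for corpus in iterative_halves(toy_corpus, rounds=3):
        print(len(corpus), median_length(corpus))
```

Holding document length roughly fixed while halving the document count, and separately comparing abstract versus full-text versions of the same papers, is what lets the two effects be examined independently.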


