Large Language Models for Automated Open-domain Scientific Hypotheses Discovery

by Zonglin Yang, et al.

Hypothetical induction is recognized as the main reasoning type when scientists make observations about the world and try to propose hypotheses to explain those observations. Past research on hypothetical induction is limited in two ways: (1) the observation annotations in the datasets are not raw web corpora but manually selected sentences (resulting in a closed-domain setting); and (2) the ground-truth hypothesis annotations are mostly commonsense knowledge, making the task less challenging. In this work, we propose the first NLP dataset for social science academic hypothesis discovery, consisting of 50 recent papers published in top social science journals. The dataset also includes the raw web corpora needed to develop the hypotheses in the published papers, with the final goal of creating a system that automatically generates valid, novel, and helpful (to human researchers) hypotheses given only a pile of raw web corpora. The new dataset addresses the previous limitations because it requires a system to (1) use raw web corpora as observations; and (2) propose hypotheses that are new even to humanity. We develop a multi-module framework for the task, along with three feedback mechanisms that empirically show performance gains over the base framework. Finally, our framework exhibits high performance under both GPT-4-based evaluation and social science expert evaluation.


